Hi everybody, I'm working on a project where I have to read a large set of compressed files (gz). I'm using Python with Hadoop Streaming to achieve my goals. However, I have a problem: some of the compressed files are corrupt, and they are killing my map/reduce jobs. My environment is the following: Hadoop-0.18.3 (CDH1).
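For context, this is roughly the read loop I've been experimenting with; catching the decompression errors inside the generator is the kind of thing I'm after, but I'm not sure I have the right exception list (the function name and the logging are just placeholders from my side):

    import gzip
    import sys
    import zlib

    def read_lines(path):
        # Yield decompressed lines from one .gz file. A truncated or
        # corrupt archive raises EOFError, zlib.error, or IOError
        # partway through the stream; catching those here lets the
        # mapper keep going instead of dying.
        f = gzip.open(path, 'rb')
        try:
            try:
                for line in f:
                    yield line
            except (IOError, EOFError, zlib.error) as err:
                # Written to stderr so it shows up in the task logs.
                sys.stderr.write('corrupt gzip %s: %s\n' % (path, err))
        finally:
            f.close()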
Do you have any recommendations for handling this case? How can I catch that exception in Python so that my jobs don't fail? And how can I identify these files with Python and move them to a corrupt-file folder (I've put a rough sketch of what I mean at the end of this message)? I really appreciate any recommendations. Xavier
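For the second part, this is the direction I've been sketching as a pre-flight pass to run before submitting the job: pull each file down with hadoop fs -get, try to decompress it end to end, and move the HDFS original into a quarantine folder when it fails. CORRUPT_DIR and the scratch path are placeholders for my setup, and I don't know if this is the right approach:

    import gzip
    import os
    import subprocess
    import zlib

    CORRUPT_DIR = '/data/corrupt'  # placeholder quarantine dir in HDFS

    def is_valid_gzip(local_path):
        # True if the whole file decompresses cleanly, False otherwise.
        try:
            f = gzip.open(local_path, 'rb')
            try:
                while f.read(1024 * 1024):  # read to EOF in 1 MB chunks
                    pass
            finally:
                f.close()
            return True
        except (IOError, EOFError, zlib.error):
            return False

    def check_and_quarantine(hdfs_path, scratch='/tmp/gzcheck.gz'):
        # Copy the file out of HDFS, validate the local copy, and
        # move the HDFS original aside if it turns out to be corrupt.
        subprocess.check_call(['hadoop', 'fs', '-get', hdfs_path, scratch])
        ok = is_valid_gzip(scratch)
        os.remove(scratch)
        if not ok:
            subprocess.check_call(['hadoop', 'fs', '-mv', hdfs_path, CORRUPT_DIR])
        return ok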