How to process only input files containing 100% valid rows

2013-04-18 Thread Matthias Scherer
Hi all, In my mapreduce job, I would like to process only whole input files containing only valid rows. If one map task processing an input split of a file detects an invalid row, the whole file should be "marked" as invalid and not processed at all. This input file will then be cleansed by ano

Re: How to process only input files containing 100% valid rows

2013-04-18 Thread Steve Lewis
With files that small it is much better to write a custom input format which checks the entire file and only passes records from good files. If you need Hadoop you are probably processing a large number of these files, and an input format could easily read the entire file and handle it as a s
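The pre-scan Steve describes would normally live inside a custom InputFormat/RecordReader; stripped of the Hadoop API, the core "emit records only from fully clean files" check is a minimal sketch like this (the three-tab-field validity rule is a hypothetical stand-in for a real parser):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Collections;
import java.util.List;

public class WholeFileValidator {

    // Hypothetical validity rule: a row needs exactly three tab-separated fields.
    static boolean isValidRow(String row) {
        return row.split("\t", -1).length == 3;
    }

    // Returns the file's rows only when every row is valid; otherwise returns
    // nothing, mimicking an InputFormat that emits no records for a dirty file.
    static List<String> recordsIfClean(Path file) throws IOException {
        List<String> rows = Files.readAllLines(file);
        for (String row : rows) {
            if (!isValidRow(row)) {
                return Collections.emptyList();
            }
        }
        return rows;
    }
}
```

Note the cost: every file is read twice (once to validate, once to emit), which is why this only pays off when the files are small.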

Re: How to process only input files containing 100% valid rows

2013-04-19 Thread Niels Basjes
How about a different approach: If you use the multiple output option you can process the valid lines in a normal way and put the invalid lines in a special separate output file. On Apr 18, 2013 9:36 PM, "Matthias Scherer" wrote: > Hi all, > > In my mapreduce job, I would like to pr
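In Hadoop this routing is what the MultipleOutputs API provides; with the framework removed, the logic is just a partition of each record into a "valid" or "invalid" named output (the two-comma-field validity rule below is a hypothetical example):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SideOutputRouter {

    // Hypothetical validity rule: a line needs exactly two comma-separated fields.
    static boolean isValid(String line) {
        return !line.trim().isEmpty() && line.split(",", -1).length == 2;
    }

    // Routes each line to "valid" or "invalid", mirroring the two named
    // outputs a mapper would write to via MultipleOutputs.
    static Map<String, List<String>> route(List<String> lines) {
        Map<String, List<String>> out = new HashMap<>();
        out.put("valid", new ArrayList<>());
        out.put("invalid", new ArrayList<>());
        for (String line : lines) {
            out.get(isValid(line) ? "valid" : "invalid").add(line);
        }
        return out;
    }
}
```

The appeal of this approach is a single pass over the data: the job always completes, and the invalid-lines output doubles as an error report.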

AW: How to process only input files containing 100% valid rows

2013-04-19 Thread Matthias Scherer
I have to add that we have 1-2 billion events per day, split across a few thousand files. So pre-reading each file in the InputFormat should be avoided. And yes, we could use MultipleOutputs and write bad files to process each input file. But we (our Operations team) think that there is more

Re: How to process only input files containing 100% valid rows

2013-04-19 Thread Wellington Chevreuil
How about using a combiner to mark as dirty all rows from a dirty file, for instance, putting a "dirty" flag as part of the key; then in the reducer you can simply ignore these rows and/or output the bad file name. It still will have to pass through the whole file, but at least avoids the case where yo
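The map/combine/reduce flow Wellington sketches can be condensed into plain Java: group rows by source file, flag any file that produced an invalid row (the "dirty" key flag), then keep only rows from entirely clean files. The `BAD` substring rule is a hypothetical validity check:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class DirtyFileFilter {

    // Hypothetical validity rule: any row containing "BAD" is invalid.
    static boolean isValid(String row) {
        return !row.contains("BAD");
    }

    // Map/combine analogue: record which source files produced an invalid row.
    // Reduce analogue: drop every row belonging to a flagged ("dirty") file.
    static List<String> cleanRows(Map<String, List<String>> rowsByFile) {
        Set<String> dirtyFiles = new HashSet<>();
        for (Map.Entry<String, List<String>> e : rowsByFile.entrySet()) {
            for (String row : e.getValue()) {
                if (!isValid(row)) {
                    dirtyFiles.add(e.getKey());
                    break;
                }
            }
        }
        List<String> kept = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : rowsByFile.entrySet()) {
            if (!dirtyFiles.contains(e.getKey())) {
                kept.addAll(e.getValue());
            }
        }
        return kept;
    }
}
```

In a real job the file name would come from the input split, and the shuffle (not an in-memory map) would bring all rows of one file to the same reducer.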

Re: AW: How to process only input files containing 100% valid rows

2013-04-19 Thread Nitin Pawar
Reject the entire file even if a single record is invalid? There has to be a really serious reason to take this approach. If not: in any case, to check that a file has all valid lines you are opening the files and parsing them. Why not then parse + separate incorrect lines as suggested in previous mails T

Re: AW: How to process only input files containing 100% valid rows

2013-04-19 Thread MARCOS MEDRADO RUBINELLI
Matthias, As far as I know, there are no guarantees on when counters will be updated during the job. One thing you can do is to write a metadata file along with your parsed events listing what files have errors and should be ignored in the next step of your ETL workflow. If you really don't wa