Can you be more specific? I didn't quite understand what you need. I would say, though, that a custom Pig UDF should do the job.
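For example, here is a rough sketch of a FilterFunc that reads its checks from a JSON file, so the ~200 checks live in configuration instead of in code. The class name, the config path, and the JSON layout below are just placeholders, and in a real job you would ship the config file to every task (e.g. via the distributed cache):

// Hypothetical package, class, and JSON layout -- adjust to your data.
package com.example.dq;

import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.pig.FilterFunc;
import org.apache.pig.data.Tuple;

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

/**
 * Drops records that fail any check declared in a JSON file, e.g.:
 *   [ {"field": 0, "type": "int",  "nullable": false},
 *     {"field": 3, "minLen": 1,   "maxLen": 40},
 *     {"field": 7, "oneOf": ["A", "B", "C"]} ]
 */
public class ValidateRecord extends FilterFunc {

    private final List<JsonNode> checks = new ArrayList<>();

    // Pig passes constructor args from the script:
    //   DEFINE Validate com.example.dq.ValidateRecord('checks.json');
    // For this sketch the path must be readable from every task.
    public ValidateRecord(String configPath) throws IOException {
        JsonNode root = new ObjectMapper().readTree(new File(configPath));
        root.forEach(checks::add);
    }

    @Override
    public Boolean exec(Tuple input) throws IOException {
        for (JsonNode check : checks) {
            Object value = input.get(check.get("field").asInt());
            if (!passes(check, value)) {
                return false;   // reject the record on the first failure
            }
        }
        return true;
    }

    private boolean passes(JsonNode check, Object value) {
        if (value == null) {
            // absent "nullable" key defaults to allowing nulls
            return check.path("nullable").asBoolean(true);
        }
        String s = value.toString();
        if (check.has("minLen") && s.length() < check.get("minLen").asInt()) return false;
        if (check.has("maxLen") && s.length() > check.get("maxLen").asInt()) return false;
        if (check.has("oneOf")) {
            for (JsonNode allowed : check.get("oneOf")) {
                if (allowed.asText().equals(s)) return true;
            }
            return false;
        }
        // "type of field" checks would be handled the same way here.
        return true;
    }
}

In the script you would then use it like:

  DEFINE Validate com.example.dq.ValidateRecord('checks.json');
  clean = FILTER data BY Validate(*);

Note that a FilterFunc only covers the accept/reject checks; the "value substitution" step would be a separate EvalFunc that rewrites fields the same config-driven way.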
With more info, I can try to give you a better idea of what I mean.

Rodrigo

2014-08-27 4:34 GMT+02:00 Amit Mittal <[email protected]>:

> Hi All,
>
> I have a data set in text CSV files, compressed with gzip. Each record
> has around 100 fields. I need to filter the data by applying various
> checks, such as:
>
> 1. type of field
> 2. nullable?
> 3. min & max length
> 4. value belongs to a predefined list
> 5. value substitution
>
> In total there are around 200 checks in one data set, and there are 5
> data sets like this.
>
> If there were only a few checks, I could have used a simple Pig script
> with a filter/UDF, or a MapReduce program. However, hard-coding all
> these checks in a script/UDF/MR program is not a good approach.
>
> One way I can think of is to use JSON or a Java class to encapsulate
> all these checks, then invoke them dynamically via the reflection API
> to filter each record in a UDF. However, this may cause performance
> issues and does not seem like an optimal solution.
>
> Since this looks like a common use case, I would appreciate your
> opinion on how to accomplish it. I can use MR/Pig/Hive for this.
>
> Thanks
> Amit
