Hi All,

I have a data set stored as text CSV files compressed with gzip. Each record has around 100 fields. I need to filter the data by applying various checks, such as:

1. type of field
2. nullable?
3. min & max length
4. value belongs to a predefined list
5. value substitution

In total there are around 200 such checks in one data set, and there are 5 data sets like this.
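Just to make this concrete (the field name and values below are made up), the full set of checks for a single field could be captured declaratively like this:

    {
      "field": "customer_type",
      "type": "chararray",
      "nullable": false,
      "minLength": 1,
      "maxLength": 10,
      "allowedValues": ["RETAIL", "CORPORATE"],
      "substitutions": {"CORP": "CORPORATE"}
    }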
If there were only a few checks, I could have used a simple Pig script with a FILTER/UDF or a MapReduce program. However, hard-coding all these checks in a script/UDF/MR program does not seem like a good approach. One way I can think of is to encapsulate all the checks in a JSON file (like the example above) or a Java class, and then invoke them dynamically using the reflection API to filter each record in a UDF. However, this may lead to performance issues and does not seem like an optimized solution. Since this looks like a common use case, I would appreciate your opinions on the best way to accomplish it. I can use MR/Pig/Hive for this.
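The UDF skeleton I have in mind would look roughly like the sketch below (RuleFilter and Check are invented names, and the JSON parsing is only indicated in a comment). Parsing the rule file once in the UDF constructor and compiling each rule into a small check object, instead of going through the reflection API for every record, might reduce the performance concern:

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.function.Predicate;

    import org.apache.pig.FilterFunc;
    import org.apache.pig.data.Tuple;

    // Sketch only: RuleFilter and Check are illustrative names, not an existing library.
    public class RuleFilter extends FilterFunc {

        // One compiled check: the field position it applies to plus a pass/fail test.
        private static final class Check {
            final int fieldIndex;
            final Predicate<Object> test;
            Check(int fieldIndex, Predicate<Object> test) {
                this.fieldIndex = fieldIndex;
                this.test = test;
            }
        }

        private final List<Check> checks = new ArrayList<>();

        // Pig passes the argument from the script:
        //   DEFINE Validate RuleFilter('rules.json');
        public RuleFilter(String rulesJsonPath) {
            // A real implementation would parse the JSON rule file here (e.g.
            // with Jackson) and turn each rule into a Check. This runs once per
            // UDF instance, so no reflection is needed while records stream by.
            // Two hard-coded checks stand in for the parsed rules:
            checks.add(new Check(0, v -> v != null));          // field 0: not nullable
            checks.add(new Check(3, v -> v != null
                    && v.toString().length() <= 10));          // field 3: max length 10
        }

        @Override
        public Boolean exec(Tuple input) throws IOException {
            // Drop the record on the first failing check.
            for (Check c : checks) {
                if (c.fieldIndex >= input.size()
                        || !c.test.test(input.get(c.fieldIndex))) {
                    return false;
                }
            }
            return true;
        }
    }

In the Pig script it would be used as something like:

    DEFINE Validate RuleFilter('rules.json');
    clean = FILTER raw BY Validate(*);

(Value substitution would not fit a FilterFunc, since it rewrites the record rather than accepting or rejecting it; it would need a separate EvalFunc driven by the same rule file.)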
Thanks,
Divyashree