Hi All,

I have a data set stored as text CSV files that are compressed with gzip.
Each record has around 100 fields. I need to filter the data by applying
various checks such as: 1. type of field, 2. nullable?, 3. min & max
length, 4. value belongs to a predefined list, 5. value substitution. In
total there are around 200 checks for one data set, and there are 5 data
sets like this.

If there were only a few checks, I could have used a simple Pig script
with a filter/UDF or a MapReduce program. However, hard-coding all of
these checks in a script/UDF/MR program does not seem like a good
approach.

One way I can think of is to encapsulate all these checks in a JSON file
or a Java class, and then invoke them dynamically using the reflection API
to filter the records in a UDF. However, this may lead to performance
issues and does not seem like an optimal solution.
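
For illustration, below is a rough, untested sketch of the kind of
rule-driven UDF I have in mind. The class names (RecordRuleFilter,
FieldRule) and the hard-coded RULES list are just placeholders for rules
that would really be loaded from the JSON/Java configuration; the checks
are applied in a plain loop inside a Pig FilterFunc, so no reflection
would be needed:

import java.io.IOException;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

import org.apache.pig.FilterFunc;
import org.apache.pig.data.Tuple;

public class RecordRuleFilter extends FilterFunc {

    // One rule per field position. A real version would also carry the
    // expected type and a substitution map, and would load the rules once
    // from configuration instead of hard-coding them here.
    private static class FieldRule {
        final boolean nullable;
        final int minLen;
        final int maxLen;
        final Set<String> allowedValues;   // null means "no list check"

        FieldRule(boolean nullable, int minLen, int maxLen,
                  Set<String> allowedValues) {
            this.nullable = nullable;
            this.minLen = minLen;
            this.maxLen = maxLen;
            this.allowedValues = allowedValues;
        }
    }

    // Placeholder for the ~200 per-field rules of one data set.
    private static final List<FieldRule> RULES = Arrays.asList(
            new FieldRule(false, 1, 10, null),
            new FieldRule(true, 0, 3,
                    new HashSet<>(Arrays.asList("A", "B", "C"))));

    @Override
    public Boolean exec(Tuple input) throws IOException {
        for (int i = 0; i < RULES.size() && i < input.size(); i++) {
            FieldRule rule = RULES.get(i);
            Object raw = input.get(i);
            if (raw == null) {
                if (!rule.nullable) {
                    return false;          // null in a non-nullable field
                }
                continue;
            }
            String value = raw.toString();
            if (value.length() < rule.minLen
                    || value.length() > rule.maxLen) {
                return false;              // length out of range
            }
            if (rule.allowedValues != null
                    && !rule.allowedValues.contains(value)) {
                return false;              // value not in the predefined list
            }
        }
        return true;                       // record passes all checks
    }
}

Value substitution is a transformation rather than a filter, so that part
would go into a separate EvalFunc driven by the same rule table.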

Since this looks like a common use case, I would appreciate your opinion
on how best to accomplish this. I can use MR/Pig/Hive to do it.

Thanks
Divyashree





