Can you be more specific? I didn't understand exactly what you need. I
would say, though, that a customized Pig UDF should do the job.

With more info, I can try to give you a better idea of what I mean.
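
In the meantime, here is a rough sketch of the kind of UDF I have in
mind. Treat it as a starting point, not a finished implementation: the
class name, the rule-file argument, and the Check interface are all
placeholders I made up, and the rule parsing is left out since it
depends on how you encode your checks. The idea is that the checks are
parsed once in the constructor (Pig passes constructor arguments from
the DEFINE statement), so exec() only walks a precompiled list per
record and you avoid per-record reflection.

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.pig.FilterFunc;
    import org.apache.pig.data.Tuple;

    // Hypothetical configurable filter UDF; names are placeholders.
    public class RuleFilter extends FilterFunc {

        // One compiled check: which field it applies to, how to test it.
        private interface Check {
            int fieldIndex();
            boolean accepts(Object value);
        }

        private final List<Check> checks = new ArrayList<>();

        // Pig passes 'ruleFile' from the DEFINE statement, so each data
        // set can point at its own rule file. Parse it here, once per
        // task, into Check objects (parsing omitted; it depends on the
        // format you choose for the ~200 checks).
        public RuleFilter(String ruleFile) throws IOException {
        }

        @Override
        public Boolean exec(Tuple input) throws IOException {
            for (Check c : checks) {
                // Drop the record on the first failed check.
                if (!c.accepts(input.get(c.fieldIndex()))) {
                    return false;
                }
            }
            return true;
        }
    }

In the script you would then wire it up with something like

    DEFINE RuleFilter com.example.RuleFilter('dataset1_rules.json');
    clean = FILTER records BY RuleFilter(*);

so the same jar could serve all 5 data sets with different rule files.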

Rodrigo.


2014-08-27 4:34 GMT+02:00 Amit Mittal <[email protected]>:

> Hi All,
>
> I have a data set in text CSV files, compressed with gzip. Each record
> has around 100 fields. I need to filter the data by applying various
> checks such as: 1. type of field, 2. nullable?, 3. min & max length,
> 4. value belongs to a predefined list, 5. value substitution. In total
> there are around 200 checks per data set, and there are 5 such data
> sets.
>
> If there were only a few checks, I could have used a simple Pig script
> with a filter/UDF, or a MapReduce program. However, hard-coding all
> these checks in a script/UDF/MR program is not a good approach.
>
> One way I can think of is to use a JSON file or a Java class to
> encapsulate all these checks, then invoke them dynamically via the
> reflection API to filter records in a UDF. However, that may cause
> performance issues and does not seem like an optimal solution.
>
> Since this looks like a common use case, I would appreciate your
> opinion on how best to accomplish it. I can use MR/Pig/Hive for this.
>
> Thanks
> Amit
>
