I am looking for a pointer to where I should place the following
functionality. I have a Web archive on a remote server, which provides
large data record sets fragmented into 2GB, gzipped files. Say I
have a FragReader that knows how to unzip and extract records from a
fragment, and a FragRetriever that knows how to fetch a new fragment
when a FragReader is done. The whole machinery is modeled as a Java
iterator that provides one continuous stream of tuples.
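For concreteness, here is a minimal sketch of that machinery. The class and method names, the modeling of the FragRetriever as an Iterator<InputStream>, and the assumption of newline-delimited records are all illustrative, not the actual implementation:

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Sketch of the fragment-chaining machinery described above.
// The FragRetriever role is played by an Iterator<InputStream> that
// yields one gzipped fragment at a time; records are assumed to be
// newline-delimited strings for illustration.
public class RecordIterator implements Iterator<String> {

    private final Iterator<InputStream> fragRetriever; // supplies new fragments
    private BufferedReader fragReader;                 // reads the current fragment
    private String nextRecord;                         // look-ahead record

    public RecordIterator(Iterator<InputStream> fragRetriever) {
        this.fragRetriever = fragRetriever;
        advance();
    }

    // Pull the next record, transparently moving on to the next
    // fragment when the current one is exhausted.
    private void advance() {
        try {
            while (true) {
                if (fragReader != null) {
                    nextRecord = fragReader.readLine();
                    if (nextRecord != null) return;
                    fragReader.close();        // fragment exhausted
                    fragReader = null;
                }
                if (!fragRetriever.hasNext()) {
                    nextRecord = null;         // no fragments left
                    return;
                }
                fragReader = new BufferedReader(new InputStreamReader(
                        new GZIPInputStream(fragRetriever.next()),
                        StandardCharsets.UTF_8));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    @Override
    public boolean hasNext() {
        return nextRecord != null;
    }

    @Override
    public String next() {
        if (nextRecord == null) throw new NoSuchElementException();
        String record = nextRecord;
        advance();
        return record;
    }

    // Helper for demonstration only: gzip a string into a fragment.
    public static InputStream gzipFragment(String s) {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
                gz.write(s.getBytes(StandardCharsets.UTF_8));
            }
            return new ByteArrayInputStream(bos.toByteArray());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```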

Where do I place this functionality in the Pig workflow? All I need are
pointers to the appropriate classes to extend or interfaces to
implement. Or maybe new documentation I missed.

Based on the UDF manual, I started out building a Slicer, but then
realized that this notion is no longer part of the model.

Thanks!

Andreas
