I am looking for a pointer to where I should place the following functionality. I have a web archive on a remote server that provides large record sets fragmented into 2 GB gzipped files. Say I have a FragReader that knows how to unzip a fragment and extract records from it, and a FragRetriever that knows how to fetch a new fragment once a FragReader is done. The whole machinery is modeled as a Java Iterator that provides one continuous stream of tuples.
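To make the setup concrete, here is a minimal sketch of the chaining described above. The FragReader and FragRetriever names come from my description; their method signatures (`nextFragment()` returning null at end of archive, records as plain Strings) are just placeholders for illustration:

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

// Hypothetical interfaces matching the description; signatures are assumptions.
interface FragReader extends Iterator<String> {}       // yields records from one fragment
interface FragRetriever {
    FragReader nextFragment();                         // null once the archive is exhausted
}

// Chains fragment readers into one continuous record stream.
class RecordStream implements Iterator<String> {
    private final FragRetriever retriever;
    private FragReader current;

    RecordStream(FragRetriever retriever) {
        this.retriever = retriever;
        this.current = retriever.nextFragment();
    }

    public boolean hasNext() {
        // When the current reader is drained, fetch the next fragment (skipping empty ones).
        while (current != null && !current.hasNext()) {
            current = retriever.nextFragment();
        }
        return current != null;
    }

    public String next() {
        if (!hasNext()) throw new NoSuchElementException();
        return current.next();
    }
}
```

In the real code the records are Pig tuples rather than Strings, but the control flow is the same.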
Where do I place this functionality in the Pig workflow? All I need are pointers to the appropriate classes to extend or interfaces to implement, or perhaps newer documentation I have missed. Based on the UDF manual I started out building a Slicer, but then realized that this notion is no longer part of the model. Thanks! Andreas
