Andreas,

The slicers were a thin abstraction around Hadoop's InputFormats (which, given a path, figure out how to break it up into chunks and what class to use to read records out of those chunks) and RecordReaders (which do the record reading, unsurprisingly).
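In the new model you extend LoadFunc and hand Pig an InputFormat instead. Something like this untested sketch (FragLoader is just a placeholder name, and TextInputFormat stands in for a custom InputFormat that would wrap your FragRetriever/FragReader -- since gzip isn't splittable, each 2GB fragment comes through as one split anyway):

import java.io.IOException;

import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.pig.LoadFunc;
import org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigSplit;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

public class FragLoader extends LoadFunc {
    private RecordReader reader;
    private final TupleFactory tupleFactory = TupleFactory.getInstance();

    @Override
    public void setLocation(String location, Job job) throws IOException {
        // Point Hadoop at the location of the gzipped fragments.
        FileInputFormat.setInputPaths(job, location);
    }

    @Override
    public InputFormat getInputFormat() throws IOException {
        // This decides how the input is sliced into chunks and which
        // RecordReader reads records out of each chunk. A real loader
        // would return a custom InputFormat built around FragRetriever.
        return new TextInputFormat();
    }

    @Override
    public void prepareToRead(RecordReader reader, PigSplit split) {
        this.reader = reader;
    }

    @Override
    public Tuple getNext() throws IOException {
        // Pull the next record and wrap it in a single-field tuple.
        try {
            if (!reader.nextKeyValue()) {
                return null; // end of this split
            }
            return tupleFactory.newTuple(reader.getCurrentValue().toString());
        } catch (InterruptedException e) {
            throw new IOException(e);
        }
    }
}

Registering that and doing LOAD ... USING FragLoader() in a script then gives you the continuous tuple stream you described.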
Start looking there.

D

On Sun, Jan 16, 2011 at 5:06 PM, Andreas Paepcke <[email protected]> wrote:

> I am looking for a pointer to where I should place the following
> functionality. I have a Web archive on a remote server, which provides
> large data record sets fragmented into 2GB, gzipped files. Say I
> have a FragReader that knows how to unzip and extract records from a
> fragment. And I have a FragRetriever that knows to get a new fragment
> when a FragReader is done. The whole machinery is modeled as a Java
> iterator that provides one continuous stream of tuples.
>
> Where do I place this functionality in the Pig workflow? All I need are
> pointers to the appropriate classes to extend or interfaces to
> implement. Or maybe new documentation I missed.
>
> Based on the UDF manual I started out building a slicer, but then
> realized that this notion is no longer part of the model.
>
> Thanks!
>
> Andreas
>
