Thanks, Dmitriy, I ran across InputFormat, and I'll dig deeper.

Andreas
On Sun, Jan 16, 2011 at 5:19 PM, Dmitriy Ryaboy <[email protected]> wrote:
> Andreas,
> The slicers were a thin abstraction around Hadoop's InputFormats (which,
> given a path, figure out how to break it up into chunks and what class to
> use to read records out of those chunks) and RecordReaders (which do the
> record reading, unsurprisingly).
>
> Start looking there.
>
> D
>
> On Sun, Jan 16, 2011 at 5:06 PM, Andreas Paepcke <[email protected]> wrote:
>
> > I am looking for a pointer to where I should place the following
> > functionality. I have a Web archive on a remote server, which provides
> > large data record sets fragmented into 2GB, gzipped files. Say I
> > have a FragReader that knows how to unzip and extract records from a
> > fragment. And I have a FragRetriever that knows how to get a new fragment
> > when a FragReader is done. The whole machinery is modeled as a Java
> > iterator that provides one continuous stream of tuples.
> >
> > Where do I place this functionality in the Pig workflow? All I need are
> > pointers to the appropriate classes to extend or interfaces to
> > implement. Or maybe new documentation I missed.
> >
> > Based on the UDF manual I started out building a slicer, but then
> > realized that this notion is no longer part of the model.
> >
> > Thanks!
> >
> > Andreas
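
For anyone who finds this thread later, here is a rough, untested sketch of
the shape Dmitriy describes, against the Pig 0.7+ LoadFunc API: Pig asks the
LoadFunc for an InputFormat, the InputFormat turns each path into splits, and
a RecordReader pulls records out of each split. FragReader is Andreas's own
class; the constructor and method names called on it below are invented for
illustration only.

    import java.io.IOException;

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.InputFormat;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.JobContext;
    import org.apache.hadoop.mapreduce.RecordReader;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;
    import org.apache.pig.LoadFunc;
    import org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigSplit;
    import org.apache.pig.data.Tuple;
    import org.apache.pig.data.TupleFactory;

    public class ArchiveLoader extends LoadFunc {
        private RecordReader reader;
        private final TupleFactory tuples = TupleFactory.getInstance();

        @Override
        public void setLocation(String location, Job job) throws IOException {
            FileInputFormat.setInputPaths(job, location);
        }

        // This is the hook that replaced the Slicer: Pig just asks the
        // LoadFunc which InputFormat to use.
        @Override
        public InputFormat getInputFormat() {
            return new FragmentInputFormat();
        }

        @Override
        public void prepareToRead(RecordReader reader, PigSplit split) {
            this.reader = reader;
        }

        @Override
        public Tuple getNext() throws IOException {
            try {
                if (!reader.nextKeyValue()) {
                    return null;                 // this split is exhausted
                }
                // One-field tuple per record; a real loader would parse it.
                return tuples.newTuple(reader.getCurrentValue().toString());
            } catch (InterruptedException e) {
                throw new IOException(e);
            }
        }

        // The InputFormat decides how a path breaks into chunks. Gzip
        // streams cannot be split mid-file, so each 2GB fragment becomes
        // exactly one split (and one map task).
        public static class FragmentInputFormat
                extends FileInputFormat<Text, Text> {
            @Override
            protected boolean isSplitable(JobContext context, Path file) {
                return false;
            }

            @Override
            public RecordReader<Text, Text> createRecordReader(
                    InputSplit split, TaskAttemptContext context) {
                return new FragmentRecordReader();
            }
        }

        // The RecordReader does the record reading; here it just delegates
        // to FragReader. Everything called on FragReader is an assumed API.
        public static class FragmentRecordReader
                extends RecordReader<Text, Text> {
            private FragReader frags;            // Andreas's class, assumed API
            private final Text key = new Text();
            private final Text value = new Text();

            @Override
            public void initialize(InputSplit split, TaskAttemptContext context)
                    throws IOException {
                Path path = ((FileSplit) split).getPath();
                frags = new FragReader(path, context.getConfiguration());
            }

            @Override
            public boolean nextKeyValue() throws IOException {
                if (!frags.hasNext()) {
                    return false;
                }
                value.set(frags.next());         // next unzipped record
                return true;
            }

            @Override public Text getCurrentKey() { return key; }
            @Override public Text getCurrentValue() { return value; }
            @Override public float getProgress() { return frags.progress(); }
            @Override public void close() throws IOException { frags.close(); }
        }
    }

The FragRetriever piece maps onto split enumeration rather than record
reading: if the fragments live behind a remote server instead of HDFS,
overriding the InputFormat's getSplits() is where you would list them, so
that each map task fetches and reads exactly one fragment.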
