Thanks, Dmitriy. I ran across InputFormat, and
I'll dig deeper.
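
To make sure I understood the pointer, here is the rough shape I now have
in mind, built on the Pig 0.7+ LoadFunc API. FragLoader is a placeholder
name for my own class, FragInputFormat is sketched in the P.S. at the
bottom, and none of this has been compiled, so treat it as a sketch only:

import java.io.IOException;

import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.pig.LoadFunc;
import org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigSplit;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

// Untested sketch; FragLoader/FragInputFormat are placeholder names.
public class FragLoader extends LoadFunc {

    private RecordReader reader;
    private final TupleFactory tupleFactory = TupleFactory.getInstance();

    @Override
    public void setLocation(String location, Job job) throws IOException {
        // The location string from the LOAD statement arrives here;
        // stash it where the InputFormat can find it.
        job.getConfiguration().set("frag.archive.url", location);
    }

    @Override
    public InputFormat getInputFormat() throws IOException {
        // Pig asks the loader which InputFormat to use; this is where
        // the fragment-enumeration logic plugs in (see the P.S. below).
        return new FragInputFormat();
    }

    @Override
    public void prepareToRead(RecordReader reader, PigSplit split)
            throws IOException {
        // Pig hands back the RecordReader that the InputFormat created.
        this.reader = reader;
    }

    @Override
    public Tuple getNext() throws IOException {
        try {
            if (!reader.nextKeyValue()) {
                return null;  // end of this fragment's records
            }
            // Each record from the RecordReader becomes one Pig tuple.
            return tupleFactory.newTuple(reader.getCurrentValue().toString());
        } catch (InterruptedException e) {
            throw new IOException(e);
        }
    }
}

A script would then read the archive with something like

    records = LOAD 'http://archive.example.org/frags' USING FragLoader();

(made-up URL, of course).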

Andreas


On Sun, Jan 16, 2011 at 5:19 PM, Dmitriy Ryaboy <[email protected]> wrote:

> Andreas,
> The slicers were a thin abstraction around Hadoop's InputFormats (which,
> given a path, figure out how to break it up into chunks and what class to
> use to read records out of those chunks) and RecordReaders (which do the
> record reading, unsurprisingly).
>
> Start looking there.
>
> D
>
> On Sun, Jan 16, 2011 at 5:06 PM, Andreas Paepcke <[email protected]
> >wrote:
>
> > I am looking for a pointer to where I should place the following
> > functionality. I have a Web archive on a remote server, which provides
> > large data record sets fragmented into 2GB, gzipped files.  Say I
> > have a FragReader that knows how to unzip and extract records from a
> > fragment, and a FragRetriever that knows how to get a new fragment
> > when a FragReader is done. The whole machinery is modeled as a Java
> > iterator that provides one continuous stream of tuples.
> >
> > Where do I place this functionality in the Pig workflow? All I need are
> > pointers to the appropriate classes to extend or interfaces to
> > implement. Or maybe new documentation I missed.
> >
> > Based on the UDF manual, I started out building a slicer, but then
> > realized that this notion is no longer part of the model.
> >
> > Thanks!
> >
> > Andreas
> >
>
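
P.S. Sketching the other half for the archives (again, all Frag* names are
placeholders and this is untested): the fragment handling moves into a
custom InputFormat/RecordReader pair. One consequence I had not appreciated
before: since Hadoop processes splits in parallel, FragRetriever's
sequential "fetch the next fragment when the reader is done" role turns
into "enumerate every fragment up front" inside getSplits().

import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

public class FragInputFormat extends InputFormat<NullWritable, Text> {

    @Override
    public List<InputSplit> getSplits(JobContext context) throws IOException {
        List<InputSplit> splits = new ArrayList<InputSplit>();
        // One split per remote 2GB fragment. FragSplit would be a small
        // InputSplit/Writable subclass carrying the fragment's URL, and
        // listFragments() the piece of FragRetriever that knows which
        // fragments exist, e.g.:
        //
        // for (String url : listFragments(context.getConfiguration()))
        //     splits.add(new FragSplit(url));
        return splits;
    }

    @Override
    public RecordReader<NullWritable, Text> createRecordReader(
            InputSplit split, TaskAttemptContext context) {
        return new FragRecordReader();
    }

    public static class FragRecordReader
            extends RecordReader<NullWritable, Text> {

        private Iterator<String> records;   // FragReader's record iterator
        private final Text current = new Text();

        @Override
        public void initialize(InputSplit split, TaskAttemptContext context)
                throws IOException {
            // Open the fragment named by the split with FragReader here,
            // e.g. records = new FragReader(((FragSplit) split).getUrl());
        }

        @Override
        public boolean nextKeyValue() {
            if (records == null || !records.hasNext()) {
                return false;               // this fragment is exhausted
            }
            current.set(records.next());
            return true;
        }

        @Override
        public NullWritable getCurrentKey() { return NullWritable.get(); }

        @Override
        public Text getCurrentValue() { return current; }

        @Override
        public float getProgress() { return 0.0f; } // unknown for a gzip stream

        @Override
        public void close() { }
    }
}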
