Yes, that should work. I will use InputFormat.getNext from the SampleLoader to skip the records.

Thanks,
Thejas

On 11/3/09 6:39 PM, "Alan Gates" <ga...@yahoo-inc.com> wrote:

> We definitely want to avoid parsing every tuple when sampling. But do
> we need to implement a special function for it? Pig will have access
> to the InputFormat instance, correct? Can it not call
> InputFormat.getNext the desired number of times (which will not parse
> the tuple) and then call LoadFunc.getNext to get the next parsed tuple?
>
> Alan.
>
> On Nov 3, 2009, at 4:28 PM, Thejas Nair wrote:
>
>> In the new implementation of SampleLoader subclasses (used by order-by,
>> skew-join ..) as part of the loader redesign, we are not only reading all
>> the records input but also parsing them as pig tuples.
>>
>> This is because the SampleLoaders are wrappers around the actual input
>> loaders specified in the query. We can make things much faster by having a
>> skipNext() function (or skipNext(int numSkip) ) which will avoid parsing the
>> record into a pig tuple.
>> LoadFunc could optionally implement this (easy to implement) function (which
>> will be part of an interface) for improving speed of queries such as
>> order-by.
>>
>> -Thejas
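
For illustration, the skip-then-parse pattern discussed in this thread could be sketched roughly as below. This is only a self-contained sketch, not Pig or Hadoop code: RawSource, LineSource, skipRaw, nextParsed and sample are hypothetical names made up for the example. skipRaw stands in for the raw "InputFormat.getNext" call mentioned above (advance one record, no tuple parsing) and nextParsed stands in for LoadFunc.getNext.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class SkipSampleSketch {

    /** Hypothetical stand-in for the underlying input; not a real Pig/Hadoop interface. */
    interface RawSource {
        /** Advances past one record without parsing it; false at end of input. */
        boolean skipRaw();

        /** Reads one record and parses it into fields; null at end of input. */
        List<String> nextParsed();
    }

    /** Toy source over comma-separated lines that counts how many records it parsed. */
    static class LineSource implements RawSource {
        private final Iterator<String> lines;
        int parsedCount = 0;

        LineSource(List<String> lines) {
            this.lines = lines.iterator();
        }

        @Override
        public boolean skipRaw() {
            if (!lines.hasNext()) {
                return false;
            }
            lines.next(); // consume the raw record, no parsing
            return true;
        }

        @Override
        public List<String> nextParsed() {
            if (!lines.hasNext()) {
                return null;
            }
            parsedCount++;
            return Arrays.asList(lines.next().split(","));
        }
    }

    /** Skips numSkip raw records, then parses one, repeating until input ends. */
    static List<List<String>> sample(RawSource src, int numSkip) {
        List<List<String>> samples = new ArrayList<>();
        while (true) {
            for (int i = 0; i < numSkip; i++) {
                if (!src.skipRaw()) {
                    return samples;
                }
            }
            List<String> tuple = src.nextParsed();
            if (tuple == null) {
                return samples;
            }
            samples.add(tuple);
        }
    }

    public static void main(String[] args) {
        LineSource src = new LineSource(
                Arrays.asList("a,1", "b,2", "c,3", "d,4", "e,5", "f,6"));
        List<List<String>> samples = sample(src, 2);
        System.out.println(samples + " parsed=" + src.parsedCount);
    }
}
```

With six input records and numSkip = 2, only the third and sixth records are parsed. The other four are consumed without being turned into tuples, which is the saving the thread is after.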