[ https://issues.apache.org/jira/browse/PIG-820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ashutosh Chauhan reopened PIG-820:
----------------------------------

The Samplable interface introduced as part of this patch enforces the contract of implementing getPosition() and next() on the loaders implementing it. An additional requirement for a loader to be a sampler is that it must correctly handle getNext() without knowing its position in the file. The current patch does not include this contract as part of the interface, and it should. Reopening the jira because of this issue.

> PERFORMANCE: The RandomSampleLoader should be changed to allow it subsume another loader
> -----------------------------------------------------------------------------------------
>
>                 Key: PIG-820
>                 URL: https://issues.apache.org/jira/browse/PIG-820
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>    Affects Versions: 0.3.0, 0.4.0
>            Reporter: Alan Gates
>            Assignee: Ashutosh Chauhan
>             Fix For: 0.4.0
>
>         Attachments: pig-820.patch, pig-820_v2.patch, pig-820_v3.patch, pig-820_v4.patch, pig-820_v5.patch, pig-820_v6.patch
>
>
> Currently a sampling job requires that data already be stored in
> BinaryStorage format, since RandomSampleLoader extends BinaryStorage. For
> order by this has mostly been acceptable, because users tend to use order by
> at the end of their script, where other MR jobs have already operated on the
> data and thus it is already stored in BinaryStorage. For pig scripts that
> just did an order by, an entire MR job is required to read the data and
> write it out in BinaryStorage format.
> As we begin work on join algorithms that will require sampling, this
> requirement to read the entire input and write it back out will not be
> acceptable.
> Join is often the first operation of a script, and thus is much more likely
> to trigger this useless up-front translation job.
> Instead RandomSampleLoader can be changed to subsume an existing loader,
> using the user-specified loader to read the tuples while handling the
> skipping between tuples itself. This will require the subsumed loader to
> implement a Samplable interface that will look something like:
> {code}
> public interface SamplableLoader extends LoadFunc {
>
>     /**
>      * Skip ahead in the input stream.
>      * @param n number of bytes to skip
>      * @return number of bytes actually skipped. The return semantics are
>      * exactly the same as {@link java.io.InputStream#skip(long)}
>      */
>     public long skip(long n) throws IOException;
>
>     /**
>      * Get the current position in the stream.
>      * @return position in the stream.
>      */
>     public long getPosition() throws IOException;
> }
> {code}
> The MRCompiler would then check if the loader being used to load data
> implements the SamplableLoader interface. If so, rather than create an
> initial MR job to do the translation, it would create the sampling job,
> having RandomSampleLoader use the user-specified loader.
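
The skip-then-read pattern described in the issue, and the extra requirement raised in the reopening comment (getNext() must work from an arbitrary byte offset), can be illustrated with a small standalone sketch. This is not the code from any attached patch. The interface below is a simplified stand-in that does not extend Pig's LoadFunc and returns a String instead of a Tuple, and LineLoader, sample() and the fixed skip distance are hypothetical names and choices made only for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class SamplingSketch {

    /** Simplified stand-in for the proposed interface (assumption: no LoadFunc, String records). */
    interface SamplableLoader {
        String getNext() throws IOException;    // returns null at end of input
        long skip(long n) throws IOException;   // same return semantics as InputStream.skip
        long getPosition() throws IOException;
    }

    /** Hypothetical loader for newline-delimited records. */
    static class LineLoader implements SamplableLoader {
        private final InputStream in;
        private long pos = 0;
        private boolean atRecordStart = true;

        LineLoader(byte[] data) { this.in = new ByteArrayInputStream(data); }

        private int read() throws IOException {
            int b = in.read();
            if (b >= 0) pos++;
            return b;
        }

        public long skip(long n) throws IOException {
            long skipped = in.skip(n);
            if (skipped > 0) {
                pos += skipped;
                atRecordStart = false;  // we may now be in the middle of a record
            }
            return skipped;
        }

        public long getPosition() { return pos; }

        public String getNext() throws IOException {
            int b;
            if (!atRecordStart) {
                // Resynchronise: discard bytes up to the next record boundary.
                while ((b = read()) >= 0 && b != '\n') { /* drop partial record */ }
                atRecordStart = true;
            }
            StringBuilder sb = new StringBuilder();
            while ((b = read()) >= 0 && b != '\n') sb.append((char) b);
            if (b < 0 && sb.length() == 0) return null;  // end of input
            return sb.toString();
        }
    }

    /** Wrapper in the spirit of RandomSampleLoader: read one record, skip ahead, repeat. */
    static List<String> sample(SamplableLoader loader, long skipBytes) throws IOException {
        List<String> out = new ArrayList<>();
        String rec;
        while ((rec = loader.getNext()) != null) {
            out.add(rec);
            loader.skip(skipBytes);
        }
        return out;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "aaaa\nbbbb\ncccc\ndddd\neeee\n".getBytes(StandardCharsets.US_ASCII);
        System.out.println(sample(new LineLoader(data), 7));  // prints [aaaa, dddd]
    }
}
```

In this sketch the resynchronisation step in getNext() is what lets the wrapper skip by raw byte counts: the loader never needs to be told where a record starts. The cost is that the record the skip lands in is always discarded, which is acceptable for sampling but is a design choice of this example rather than something the issue specifies.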