Sending to the pig developers group.

On Wed, Oct 14, 2015 at 2:17 PM, Siddhi Mehta <[email protected]> wrote:
> Hello Everyone,
>
> Just wanted to follow up on my earlier post and see if there are any
> thoughts around it. I was planning to take a stab at implementing it.
>
> The approach I was planning to use is:
> 1. Make the storer that wants error-handling capability implement an
>    interface (ErrorHandlingStoreFunc).
> 2. Through this interface the storer can define the threshold for errors.
>    Each store func can determine what its threshold should be; for
>    example, HBaseStorage can have a different threshold from
>    ParquetStorage.
> 3. Whenever the storer gets created in
>    org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POStore.getStoreFunc(),
>    we intercept the call and hand back a wrapped StoreFunc.
> 4. Every putNext() call then gets delegated to the actual storer via the
>    delegate, so we can listen for errors on putNext() and either allow
>    the error if it is within the threshold or rethrow it from there.
> 5. The client can get the number of dropped records from the counters to
>    know if any data was dropped.
>
> (A rough sketch of this wrapper approach is included after the quoted
> thread below.)
>
> Thoughts?
>
> Thanks,
> Siddhi
>
>
> On Mon, Oct 12, 2015 at 1:49 PM, Siddhi Mehta <[email protected]>
> wrote:
>
>> Hey Guys,
>>
>> Currently a Pig job fails when one record out of billions fails on
>> STORE. This is not always desirable behavior when you are dealing with
>> millions of records and only a few fail. In certain use cases it's
>> desirable to know how many such errors occurred and to have an
>> accounting of them.
>> Is there a configurable limit we can set for Pig to allow a threshold
>> for bad records on STORE, along the lines of the JIRA for LOAD,
>> PIG-3059 <https://issues.apache.org/jira/browse/PIG-3059>?
>>
>> Thanks,
>> Siddhi
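Below is a minimal sketch of the wrapper described in steps 1-4 above. The ErrorHandlingStoreFunc interface, the wrapper class name, and the counter group/name are hypothetical illustrations of the proposal; the StoreFunc methods being overridden and PigStatusReporter are the existing Pig APIs. A real implementation would also need to delegate the remaining StoreFunc methods (checkSchema, setStoreFuncUDFContextSignature, cleanupOnFailure, and so on).

    import java.io.IOException;

    import org.apache.hadoop.mapreduce.Counter;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.OutputFormat;
    import org.apache.hadoop.mapreduce.RecordWriter;
    import org.apache.pig.StoreFunc;
    import org.apache.pig.data.Tuple;
    import org.apache.pig.tools.pigstats.PigStatusReporter;

    // Step 1: a storer opts in to error handling by implementing this
    // (hypothetical) interface and choosing its own threshold.
    interface ErrorHandlingStoreFunc {
        /** Max number of putNext() failures to tolerate before failing. */
        long getErrorThreshold();
    }

    // Steps 3-4: POStore.getStoreFunc() would hand back this wrapper
    // instead of the raw storer; putNext() is delegated and failures
    // are counted against the storer's own threshold.
    public class ErrorTolerantStoreFuncWrapper extends StoreFunc {

        private final StoreFunc delegate;
        private final long threshold;
        private long errorCount = 0;

        public ErrorTolerantStoreFuncWrapper(StoreFunc delegate) {
            this.delegate = delegate;
            this.threshold =
                ((ErrorHandlingStoreFunc) delegate).getErrorThreshold();
        }

        @Override
        @SuppressWarnings("rawtypes")
        public OutputFormat getOutputFormat() throws IOException {
            return delegate.getOutputFormat();
        }

        @Override
        public void setStoreLocation(String location, Job job)
                throws IOException {
            delegate.setStoreLocation(location, job);
        }

        @Override
        @SuppressWarnings("rawtypes")
        public void prepareToWrite(RecordWriter writer) throws IOException {
            delegate.prepareToWrite(writer);
        }

        @Override
        public void putNext(Tuple t) throws IOException {
            try {
                delegate.putNext(t);
            } catch (IOException e) {
                errorCount++;
                // Step 5: expose the drop count to the client through a
                // counter (group/name here are made up). The reporter can
                // return null outside a task context, hence the check.
                Counter dropped = PigStatusReporter.getInstance()
                        .getCounter("ErrorHandlingStore", "RECORDS_DROPPED");
                if (dropped != null) {
                    dropped.increment(1);
                }
                if (errorCount > threshold) {
                    throw e; // over the threshold: fail the job as today
                }
            }
        }
    }

One design point worth noting: errorCount above is local to a task attempt, so the threshold is enforced per writer rather than job-wide. A job-wide limit would need the counters to be aggregated and checked after the job completes, e.g. by the client reading the counters as described in step 5.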
