Hello everyone,

Just wanted to follow up on my earlier post and see if there are any thoughts on it. I was planning to take a stab at implementing it.
The approach I was planning to use is:

1. Make a storer that wants error handling capability implement an interface (ErrorHandlingStoreFunc).
2. Through this interface the storer can define its threshold for errors. Each store func can determine what its threshold should be; for example, HbaseStorage can have a different threshold from ParquetStorage.
3. Whenever the storer gets created in org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POStore.getStoreFunc(), we intercept the call and hand back a wrapped StoreFunc.
4. Every putNext() call now gets delegated to the actual storer via the wrapper, so we can listen for errors on putNext() and either allow the error if it is within the threshold or rethrow it from there.
5. The client can get information from the counters about how many records were dropped, so it knows whether any data was lost.

A rough sketch of the wrapper is appended below the quoted message.

Thoughts?

Thanks,
Siddhi

On Mon, Oct 12, 2015 at 1:49 PM, Siddhi Mehta <[email protected]> wrote:
> Hey Guys,
>
> Currently a Pig job fails when one record out of billions of records
> fails on STORE.
> This is not always desirable behavior when you are dealing with millions
> of records and only a few fail.
> In certain use cases it is desirable to know how many such errors occurred
> and have an accounting of them.
> Is there a configurable limit we can set in Pig to allow a threshold for
> bad records on STORE, along the lines of the JIRA for LOAD, PIG-3059
> <https://issues.apache.org/jira/browse/PIG-3059>?
>
> Thanks,
> Siddhi
>
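
To make steps 1-4 a bit more concrete, here is a rough, untested sketch of what I have in mind. ErrorHandlingStoreFunc, getErrorThreshold(), ErrorTolerantStoreFunc, and the counter group/name ("PigStoreErrors" / "RECORDS_DROPPED") are just names I am assuming for this proposal, not existing Pig API; only StoreFunc, POStore, and PigStatusReporter exist today.

import java.io.IOException;

import org.apache.hadoop.mapreduce.Counter;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.OutputFormat;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.pig.StoreFunc;
import org.apache.pig.data.Tuple;
import org.apache.pig.tools.pigstats.PigStatusReporter;

// Proposed (not existing) interface a storer implements to opt in to error handling.
interface ErrorHandlingStoreFunc {
    // Maximum number of bad records this storer is willing to tolerate.
    long getErrorThreshold();
}

// Wrapper that POStore.getStoreFunc() would hand back in place of the real storer.
// It delegates every call to the wrapped StoreFunc and swallows putNext()
// failures until the storer's own threshold is exceeded.
public class ErrorTolerantStoreFunc extends StoreFunc {

    private final StoreFunc delegate;
    private final long threshold;
    private long errorCount = 0;

    public ErrorTolerantStoreFunc(StoreFunc delegate) {
        this.delegate = delegate;
        // Each storer decides its own tolerance; 0 keeps today's fail-fast behavior.
        this.threshold = (delegate instanceof ErrorHandlingStoreFunc)
                ? ((ErrorHandlingStoreFunc) delegate).getErrorThreshold()
                : 0;
    }

    @Override
    public OutputFormat getOutputFormat() throws IOException {
        return delegate.getOutputFormat();
    }

    @Override
    public void setStoreLocation(String location, Job job) throws IOException {
        delegate.setStoreLocation(location, job);
    }

    @Override
    public void prepareToWrite(RecordWriter writer) throws IOException {
        delegate.prepareToWrite(writer);
    }

    @Override
    public void putNext(Tuple t) throws IOException {
        try {
            delegate.putNext(t);
        } catch (Exception e) {
            errorCount++;
            // Surface the number of dropped records to the client via a counter
            // (group/name here are illustrative).
            PigStatusReporter reporter = PigStatusReporter.getInstance();
            Counter dropped = (reporter != null)
                    ? reporter.getCounter("PigStoreErrors", "RECORDS_DROPPED")
                    : null;
            if (dropped != null) {
                dropped.increment(1);
            }
            if (errorCount > threshold) {
                // Too many bad records: fail the task as it does today.
                throw new IOException("Store error threshold (" + threshold + ") exceeded", e);
            }
            // Otherwise drop the record and keep going.
        }
    }
}

One thing I like about this shape is that storers that never implement the interface keep the current fail-fast behavior, since the threshold defaults to 0 and the first error is rethrown.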
