[ https://issues.apache.org/jira/browse/PIG-4704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14979276#comment-14979276 ]
Daniel Dai commented on PIG-4704:
---------------------------------

The patch looks quite good. Some comments, however:
1. I am thinking about using the same interface for LoadFunc/EvalFunc. How about changing OutputErrorHandler to ErrorHandler and CounterBasedOutputErrorHandler to CounterBasedErrorHandler, and changing the method signatures accordingly?
2. I don't see a reason we need to disable error handling, since a StoreFunc without an ErrorHandler will not do error handling anyway.
3. The config entries STORER_MIN_ERRORS_CONF_KEY/STORER_ERROR_THRESHOLD_CONF_KEY should be exposed in pig.properties/pig-default.properties. Should we rename them to something like "pig.errors.min.records" and "pig.error.threshold.percent"?
4. I'd like to add a max error count threshold in the config.
5. Pig uses spaces instead of tabs for formatting.

> Customizable Error Handling for Storers in Pig
> -----------------------------------------------
>
>                 Key: PIG-4704
>                 URL: https://issues.apache.org/jira/browse/PIG-4704
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Siddhi Mehta
>            Assignee: Siddhi Mehta
>         Attachments: PIG-4704.patch, PIG-4704.patch, PIG-4704_3.patch
>
>
> On Thu, Oct 15, 2015 at 4:06 AM, Saggi Neumann <sa...@xplenty.com> wrote:
> You may also check these for ideas. It would be good to have them
> implemented:
> https://wiki.apache.org/pig/PigErrorHandlingInScripts
> https://issues.apache.org/jira/browse/PIG-2620
> --
> Saggi Neumann
> Co-founder and CTO, Xplenty
> M: +972-544-546102
> On Thu, Oct 15, 2015 at 12:17 AM, Siddhi Mehta <smehtau...@gmail.com> wrote:
> > Hello Everyone,
> >
> > Just wanted to follow up on my earlier post and see if there are any
> > thoughts around the same.
> > I was planning to take a stab at implementing it.
> >
> > The approach I was planning to use is:
> > 1. Make the storer that wants error handling capability implement an
> > interface (ErrorHandlingStoreFunc).
> > 2. Using this interface, the storer can define the thresholds for
> > errors. Each store func can determine what the threshold should be. For
> > example, HbaseStorage can have a different threshold from ParquetStorage.
> > 3. Whenever the storer gets created in
> >
> > org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POStore.getStoreFunc()
> > we intercept the call and give it a wrapped StoreFunc.
> > 4. Every putNext() call now gets delegated to the actual storer via the
> > delegate, and we can listen for errors on putNext() and either allow the
> > error if it is within the threshold or rethrow it from there.
> > 5. The client can get information about the threshold value from the
> > counters to know if there was any data dropped.
> >
> > Thoughts?
> >
> > Thanks,
> > Siddhi
> >
> >
> > On Mon, Oct 12, 2015 at 1:49 PM, Siddhi Mehta <smehtau...@gmail.com>
> > wrote:
> >
> > > Hey Guys,
> > >
> > > Currently a Pig job fails when one record out of billions of records
> > > fails on STORE.
> > > This is not always desirable behavior when you are dealing with millions
> > > of records and only a few fail.
> > > In certain use cases it's desirable to know how many such errors occurred
> > > and have an accounting for them.
> > > Is there a configurable limit that we can set for Pig so that we can
> > > allow a threshold for bad records on STORE, along the lines of the
> > > JIRA for LOAD, PIG-3059 <https://issues.apache.org/jira/browse/PIG-3059>?
> > >
> > > Thanks,
> > > Siddhi
> > >
>

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
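
The wrapping approach proposed in the quoted thread (steps 1 to 5) could be sketched roughly as below. This is only an illustration under stated assumptions, not the code in the attached patch. The `Storer` interface is a hypothetical stand-in for Pig's StoreFunc, records are plain strings, and the rule for combining the minimum error count with the error rate is an assumption. The name CounterBasedErrorHandler is borrowed from the rename suggested in the comment, but its methods here are made up for the example.

```java
import java.io.IOException;

public class StoreErrorThresholdSketch {

    /** Hypothetical stand-in for a storer; this is NOT Pig's StoreFunc. */
    interface Storer {
        void putNext(String record) throws IOException;
    }

    /**
     * Hypothetical handler: tolerates failures until at least minErrors have
     * occurred AND the error rate exceeds thresholdFraction (assumed rule).
     */
    static class CounterBasedErrorHandler {
        private final long minErrors;
        private final double thresholdFraction; // e.g. 0.5 means 50% of records
        private long records;
        private long errors;

        CounterBasedErrorHandler(long minErrors, double thresholdFraction) {
            this.minErrors = minErrors;
            this.thresholdFraction = thresholdFraction;
        }

        void onSuccess() {
            records++;
        }

        void onError(IOException cause) throws IOException {
            records++;
            errors++;
            double rate = (double) errors / records;
            if (errors >= minErrors && rate > thresholdFraction) {
                throw new IOException(
                    "Error threshold exceeded: " + errors + " of " + records + " records failed",
                    cause);
            }
        }

        long getErrors() { return errors; }

        long getRecords() { return records; }
    }

    /** Wrapper that delegates putNext() and routes failures to the handler. */
    static class ErrorHandlingStorer implements Storer {
        private final Storer delegate;
        private final CounterBasedErrorHandler handler;

        ErrorHandlingStorer(Storer delegate, CounterBasedErrorHandler handler) {
            this.delegate = delegate;
            this.handler = handler;
        }

        @Override
        public void putNext(String record) throws IOException {
            try {
                delegate.putNext(record);
                handler.onSuccess();
            } catch (IOException e) {
                // Swallowed while within the threshold, rethrown once it is exceeded.
                handler.onError(e);
            }
        }
    }

    public static void main(String[] args) {
        // A storer that rejects any record starting with "bad".
        Storer failing = record -> {
            if (record.startsWith("bad")) {
                throw new IOException("cannot store " + record);
            }
        };
        CounterBasedErrorHandler handler = new CounterBasedErrorHandler(2, 0.5);
        Storer wrapped = new ErrorHandlingStorer(failing, handler);
        try {
            for (String r : new String[] {"ok", "bad-1", "bad-2"}) {
                wrapped.putNext(r);
            }
        } catch (IOException e) {
            System.out.println("Job would fail here: " + e.getMessage());
        }
        System.out.println(handler.getErrors() + " errors in " + handler.getRecords() + " records");
    }
}
```

In an actual job the two counts would presumably be reported through Hadoop counters, as step 5 suggests, so that the client can see afterwards whether any data was dropped.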