Created a jira for it. https://issues.apache.org/jira/browse/PIG-4704
On Thu, Oct 15, 2015 at 12:24 PM, Siddhi Mehta <[email protected]> wrote: > Thanks Saggi and Prashant for the suggestion. > > I would love to but I don't have time to work on a larger feature as that. > > I will start with handling it for stores and then expand it soon after. > > On Thu, Oct 15, 2015 at 4:06 AM, Saggi Neumann <[email protected]> wrote: > >> You may also check these for ideas. It would be good to have them >> implemented: >> >> https://wiki.apache.org/pig/PigErrorHandlingInScripts >> https://issues.apache.org/jira/browse/PIG-2620 >> >> -- >> >> Saggi Neumann >> >> Co-founder and CTO, Xplenty >> >> M: +972-544-546102 >> >> On Thu, Oct 15, 2015 at 12:17 AM, Siddhi Mehta <[email protected]> >> wrote: >> >> > Hello Everyone, >> > >> > Just wanted to follow up on the my earlier post and see if there are any >> > thoughts around the same. >> > I was planning to take a stab to implement the same. >> > >> > The approach I was planning to use for the same is >> > 1. Make the storer that wants error handling capability implement an >> > interface(ErrorHandlingStoreFunc). >> > 2. Using this interface the storer can define if the thresholds for >> > error.Each store func can determine what the threshold should be.For >> > example HbaseStorage can have a different threshold from ParquetStorage. >> > 3. Whenever the storer gets created in >> > >> > >> org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POStore.getStoreFunc() >> > we intercept the called and give it a wrappedStoreFunc >> > 4. Every put next calls now gets delegated to the actual storer via the >> > delegate and we can listen in for error on putNext() and take care of >> the >> > allowing the error if within threshold or re throwing from there. >> > 5. The client can get information about the threshold value from the >> > counters to know if there was any data dropped. >> > >> > Thougts? >> > >> > Thanks, >> > Siddhi >> > >> > >> > On Mon, Oct 12, 2015 at 1:49 PM, Siddhi Mehta <[email protected]> >> > wrote: >> > >> > > Hey Guys, >> > > >> > > Currently a Pig job fails when one record out of the billions records >> > > fails on STORE. >> > > This is not always desirable behavior when you are dealing with >> millions >> > > of records and only few fail. >> > > In certain use-cases its desirable to know how many such errors and >> have >> > > an accounting for the same. >> > > Is there a configurable limits that we can set for pig so that we can >> > > allow a threshold for bad records on STORE similar to the lines of the >> > JIRA >> > > for LOAD PIG-3059 <https://issues.apache.org/jira/browse/PIG-3059> >> > > >> > > Thanks, >> > > Siddhi >> > > >> > >> > >
