Created a jira for it.
https://issues.apache.org/jira/browse/PIG-4704


On Thu, Oct 15, 2015 at 12:24 PM, Siddhi Mehta <[email protected]> wrote:

> Thanks Saggi and Prashant for the suggestion.
>
> I would love to, but I don't have time to work on a feature that large.
>
> I will start with handling it for stores and then expand it soon after.
>
> On Thu, Oct 15, 2015 at 4:06 AM, Saggi Neumann <[email protected]> wrote:
>
>> You may also check these for ideas. It would be good to have them
>> implemented:
>>
>> https://wiki.apache.org/pig/PigErrorHandlingInScripts
>> https://issues.apache.org/jira/browse/PIG-2620
>>
>> --
>>
>> Saggi Neumann
>>
>> Co-founder and CTO, Xplenty
>>
>> M: +972-544-546102
>>
>> On Thu, Oct 15, 2015 at 12:17 AM, Siddhi Mehta <[email protected]>
>> wrote:
>>
>> > Hello Everyone,
>> >
>> > Just wanted to follow up on my earlier post and see if there are any
>> > thoughts on it.
>> > I am planning to take a stab at implementing it.
>> >
>> > The approach I am planning to use is:
>> > 1. Make any storer that wants error-handling capability implement an
>> > interface (ErrorHandlingStoreFunc).
>> > 2. Using this interface, the storer can define its error thresholds.
>> > Each store func can determine what its threshold should be. For
>> > example, HbaseStorage can have a different threshold from ParquetStorage.
>> > 3. Whenever the storer gets created in
>> > org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POStore.getStoreFunc()
>> > we intercept the call and give it a wrappedStoreFunc.
>> > 4. Every putNext() call now gets delegated to the actual storer via the
>> > delegate, so we can listen for errors on putNext() and either allow the
>> > error if it is within the threshold or rethrow it from there.
>> > 5. The client can get information about the threshold value from the
>> > counters to know whether any data was dropped.
>> >
>> > Thoughts?
>> >
>> > Thanks,
>> > Siddhi
>> >
>> >
>> > On Mon, Oct 12, 2015 at 1:49 PM, Siddhi Mehta <[email protected]>
>> > wrote:
>> >
>> > > Hey Guys,
>> > >
>> > > Currently a Pig job fails when even one record out of billions
>> > > fails on STORE.
>> > > This is not always desirable behavior when you are dealing with
>> > > millions of records and only a few fail.
>> > > In certain use cases it is desirable to know how many such errors
>> > > occurred and to have an accounting of them.
>> > > Is there a configurable limit that we can set for Pig so that we can
>> > > allow a threshold for bad records on STORE, along the lines of the
>> > > JIRA for LOAD, PIG-3059 <https://issues.apache.org/jira/browse/PIG-3059>?
>> > >
>> > > Thanks,
>> > > Siddhi
>> > >
>> >
>>
>
>
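
For illustration, the wrapper approach proposed in the quoted thread (steps 1
to 5) might look roughly like the sketch below. It is not the PIG-4704
implementation. RecordStorer is a hypothetical stand-in for the putNext(Tuple)
method of Pig's StoreFunc, so that the example compiles without Pig on the
classpath. The shape of ErrorHandlingStorer, the maxAllowedErrors() method, and
the use of an absolute error count rather than a ratio are all assumptions. The
thread names only the interface ErrorHandlingStoreFunc and does not define its
methods.

```java
import java.io.IOException;

// Hypothetical stand-in for the putNext(Tuple) method of Pig's StoreFunc.
interface RecordStorer<T> {
    void putNext(T record) throws IOException;
}

// Assumed shape for the proposed error-handling interface (steps 1 and 2):
// each storer reports how many failed records it is willing to tolerate.
interface ErrorHandlingStorer<T> extends RecordStorer<T> {
    long maxAllowedErrors();
}

// Wrapper that would be handed out in place of the real storer (step 3).
final class ErrorTolerantStorer<T> implements RecordStorer<T> {
    private final ErrorHandlingStorer<T> delegate;
    private long records; // total putNext() calls seen
    private long errors;  // putNext() calls that failed in the delegate

    ErrorTolerantStorer(ErrorHandlingStorer<T> delegate) {
        this.delegate = delegate;
    }

    @Override
    public void putNext(T record) throws IOException {
        records++;
        try {
            // Step 4: delegate to the actual storer and watch for errors.
            delegate.putNext(record);
        } catch (IOException e) {
            errors++;
            // Swallow the error while within the threshold, otherwise rethrow.
            if (errors > delegate.maxAllowedErrors()) {
                throw e;
            }
        }
    }

    // Step 5: in a real job these values would be published as counters.
    long recordCount() { return records; }
    long errorCount() { return errors; }
}
```

With maxAllowedErrors() returning 1, the first failing record is counted and
dropped, and the second failure is rethrown to fail the job as it does today.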
