Hello Everyone,

Just wanted to follow up on my earlier post and see if there are any
thoughts on it.
I was planning to take a stab at implementing this.

The approach I was planning to use is:
1. Have a storer that wants error handling capability implement an
interface (ErrorHandlingStoreFunc).
2. Through this interface the storer defines its error threshold. Each
store func can determine what the threshold should be; for example,
HbaseStorage can have a different threshold from ParquetStorage.
3. Whenever the storer gets created in
org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POStore.getStoreFunc()
we intercept the call and hand back a wrapped StoreFunc (sketched below).
4. Every putNext() call now gets delegated to the actual storer via the
wrapper, so we can listen for errors on putNext() and either allow the
error if it is within the threshold or rethrow it from there.
5. The client can use the counters to find out whether any data was
dropped and how many records were affected.
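
To make this more concrete, below is a rough sketch of what the interface
and the wrapping delegate could look like. The names here
(ErrorHandlingStoreFunc, getErrorThreshold(), ErrorHandlingStoreFuncWrapper)
are just placeholders and not existing Pig APIs; the real version would
still need to be wired into POStore.getStoreFunc() and do the counter
bookkeeping.

// ErrorHandlingStoreFunc.java
// Storers that want error tolerance implement this to declare their threshold.
public interface ErrorHandlingStoreFunc {
    // Maximum number of putNext() failures to tolerate before failing the job.
    long getErrorThreshold();
}

// ErrorHandlingStoreFuncWrapper.java
import java.io.IOException;

import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.OutputFormat;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.pig.StoreFunc;
import org.apache.pig.data.Tuple;

// Wrapper handed back from POStore.getStoreFunc(); every call is delegated
// to the real storer, and putNext() failures are counted against the threshold.
public class ErrorHandlingStoreFuncWrapper extends StoreFunc {
    private final StoreFunc delegate;
    private final long threshold;
    private long errorCount = 0;

    // POStore would only wrap storers that implement ErrorHandlingStoreFunc.
    public ErrorHandlingStoreFuncWrapper(StoreFunc delegate) {
        this.delegate = delegate;
        this.threshold = ((ErrorHandlingStoreFunc) delegate).getErrorThreshold();
    }

    @Override
    public OutputFormat getOutputFormat() throws IOException {
        return delegate.getOutputFormat();
    }

    @Override
    public void setStoreLocation(String location, Job job) throws IOException {
        delegate.setStoreLocation(location, job);
    }

    @Override
    public void prepareToWrite(RecordWriter writer) throws IOException {
        delegate.prepareToWrite(writer);
    }

    @Override
    public void putNext(Tuple t) throws IOException {
        try {
            delegate.putNext(t);
        } catch (IOException e) {
            errorCount++;
            if (errorCount > threshold) {
                throw e; // over the threshold: fail the task as it does today
            }
            // Under the threshold: swallow the error and increment a counter
            // (e.g. via PigStatusReporter) so the client can see how many
            // records were dropped.
        }
    }
}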

Thoughts?

Thanks,
Siddhi


On Mon, Oct 12, 2015 at 1:49 PM, Siddhi Mehta <[email protected]> wrote:

> Hey Guys,
>
> Currently a Pig job fails when one record out of billions of records
> fails on STORE.
> This is not always desirable behavior when you are dealing with millions
> of records and only a few fail.
> In certain use-cases it's desirable to know how many such errors occurred
> and have an accounting for them.
> Is there a configurable limit that we can set for Pig so that we can
> allow a threshold for bad records on STORE, along the lines of the JIRA
> for LOAD, PIG-3059 <https://issues.apache.org/jira/browse/PIG-3059>?
>
> Thanks,
> Siddhi
>
