[ 
https://issues.apache.org/jira/browse/PHOENIX-2367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14989597#comment-14989597
 ] 

Josh Mahonin commented on PHOENIX-2367:
---------------------------------------

I like the idea of having this change be configurable.

It sounds like there may not be any performance overhead from this new change, 
but if there is a slight overhead, it would be nice to be able to bypass it if 
possible. While this is certainly a good idea for CSV loading, from the 
perspective of a Spark/Pig/MapReduce user the written data is frequently 
required to be well formed by that stage, having already gone through several 
steps of an execution pipeline.

> Change PhoenixRecordWriter to use execute instead of executeBatch
> -----------------------------------------------------------------
>
>                 Key: PHOENIX-2367
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-2367
>             Project: Phoenix
>          Issue Type: Improvement
>            Reporter: Siddhi Mehta
>            Assignee: Siddhi Mehta
>
> Hey All,
> I wanted to add a notion of skipping invalid rows for PhoenixHbaseStorage, 
> similar to how the CSVBulkLoad tool has an option to ignore bad rows. I 
> did some work on the Apache Pig code that allows storers to have a notion of 
> customizable/configurable error handling (PIG-4704).
> I wanted to plug this behavior into PhoenixHbaseStorage and propose certain 
> changes for the same.
> Current Behavior/Problem:
> PhoenixRecordWriter uses executeBatch() to process rows once the batch 
> size is reached. If there are any client-side validation/syntactic errors, 
> such as data not fitting the column size, executeBatch() throws an exception and 
> there is no way to retrieve the valid rows from the batch and retry them. We 
> discard the whole batch or fail the job without error handling.
> With auto-commit set to false, execute() also serves the purpose of not 
> making any RPC calls, but it performs client-side validation and adds the 
> row to the client-side cache of mutations.
> On conn.commit() we make an RPC call.
> Proposed Change:
> To be able to use configurable error handling and skip only the failed 
> records instead of discarding the whole batch, I want to propose changing the 
> behavior in PhoenixRecordWriter from executeBatch() to execute(), or having a 
> configuration to toggle between the two behaviors.
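
The proposed per-record path could be sketched roughly as follows. This is a 
hypothetical illustration only: `RecordUpserter` and `writeSkippingInvalid` are 
made-up names standing in for PreparedStatement.execute() calls on a Phoenix 
connection with auto-commit disabled, not actual Phoenix classes.

```java
import java.util.ArrayList;
import java.util.List;

public class RecordSkipSketch {

    // Stand-in for the per-record work; in PhoenixRecordWriter this would be
    // PreparedStatement.execute() on a connection with auto-commit disabled,
    // which validates client-side and caches the mutation without an RPC.
    interface RecordUpserter {
        void upsert(String record) throws Exception;
    }

    // Execute records one at a time, collecting failures instead of letting a
    // single invalid row fail the whole batch (as executeBatch() does today).
    static List<String> writeSkippingInvalid(List<String> records,
                                             RecordUpserter upserter) {
        List<String> skipped = new ArrayList<>();
        for (String record : records) {
            try {
                upserter.upsert(record);  // client-side validation happens here
            } catch (Exception e) {
                skipped.add(record);      // route the bad row to error handling, keep going
            }
        }
        // A real writer would now call connection.commit(), flushing the
        // accumulated mutations to the server in one RPC.
        return skipped;
    }

    public static void main(String[] args) {
        // Toy validator: reject values longer than 10 characters, loosely
        // mimicking a "data does not fit the column size" error.
        List<String> skipped = writeSkippingInvalid(
                List.of("ok1", "way_too_long_value", "ok2"),
                record -> {
                    if (record.length() > 10) {
                        throw new Exception("value too large for column");
                    }
                });
        System.out.println("skipped = " + skipped); // only the oversized row
    }
}
```

With executeBatch(), the same oversized row would fail the entire batch with no 
way to recover the two valid rows; here only the bad row is diverted.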



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
