[jira] Commented: (HBASE-2898) MultiPut makes proper error handling impossible and leads to corrupted data

ryan rawson (JIRA) Tue, 03 Aug 2010 15:10:40 -0700

    [ 
https://issues.apache.org/jira/browse/HBASE-2898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12895057#action_12895057
 ]


ryan rawson commented on HBASE-2898:
------------------------------------

At step 16, the return code of HRS#put(...) is the offset of which Put failed.  
In ye olde days, this worked well because the first Put that failed in any way 
aborted and the return code would let the client know where to continue.

The way it used to work is the first bad put we came across, we returned the 
index at which we failed.  If it was a WrongRegionException or a bad column 
family, either way we'd just stop and the caller would know which put failed 
(but not the reason why).  

If all you get is WRE, then the new code behaves much like the old code.  The 
first Put that runs into WRE will in step #16 cause a return code (ie: list 
offset) to return to the client.

But when you run into bad column families.... what seems to be happening is the 
those puts are marked as 'bad', then we process the rest of the puts.  In the 
HRS wrapper (ie: step 16) we return a failure even though perhaps all but 1 
succeeded.

Previously there was no way to know which put failed and why, which is why the 
implementation of multiput made sense, but now with the detailed error codes we 
can do something more.

> MultiPut makes proper error handling impossible and leads to corrupted data
> ---------------------------------------------------------------------------
>
>                 Key: HBASE-2898
>                 URL: https://issues.apache.org/jira/browse/HBASE-2898
>             Project: HBase
>          Issue Type: Bug
>          Components: client, regionserver
>    Affects Versions: 0.89.20100621
>            Reporter: Benoit Sigoure
>            Priority: Blocker
>             Fix For: 0.90.0
>
>
> The {{MultiPut}} RPC needs to be completely rewritten.  Let's see why step by 
> step.
> # An HBase user calls any of the {{put}} methods on an {{HTable}} instance.
> # Eventually, {{HTable#flushCommits}} is invoked to actually send the edits 
> to the RegionServer(s).
> # This takes us to {{HConnectionManager#processBatchOfPuts}} where all edits 
> are sorted into one or more {{MultiPut}}.  Each {{MultiPut}} is aggregating 
> all the edits that are going to a particular RegionServer.
> # A thread pool is used to send all the {{MultiPut}} in parallel to their 
> respective RegionServer.  Let's follow what happens for a single {{MultiPut}}.
> # The {{MultiPut}} travels through the IPC code on the client and then 
> through the network and then through the IPC code on the RegionServer.
> # We're now in {{HRegionServer#multiPut}} where a new {{MultiPutResponse}} is 
> created.
> # Still in {{HRegionServer#multiPut}}.  Since a {{MultiPut}} is essentially a 
> map from region name to a list of {{Put}} for that region, there's a {{for}} 
> loop that executes each list of {{Put}} for each region sequentially.  Let's 
> follow what happens for a single list of {{Put}} for a particular region.
> # We're now in {{HRegionServer#put(byte[], List<Put>)}}.  Each {{Put}} is 
> associated with the row lock that was specified by the client (if any).  Then 
> the pairs of {{(Put, lock id)}} are handed to the right {{HRegion}}.
> # Now we're in {{HRegion#put(Pair<Put, Integer>[])}}, which immediately takes 
> us to {{HRegion#doMiniBatchPut}}.
> # At this point, let's assume that we're doing just 2 edits.  So the 
> {{BatchOperationInProgress}} that {{doMiniBatchPut}} has contains just 2 
> {{Put}}.
> # The {{while}} loop in {{doMiniBatchPut}} that's going to execute each 
> {{Put}} starts.
> # The first {{Put}} succeeds normally, so at the end of the first iteration, 
> its {{batchOp.retCodes}} is marked as {{OperationStatusCode.SUCCESS}}.
> # The second {{Put}} fails because an exception is thrown when appending the 
> edit to the {{WAL}}.  Its {{batchOp.retCodes}} is marked as 
> {{OperationStatusCode.FAILURE}}.
> # {{doMiniBatchPut}} is done and returns to {{HRegion#put(Pair<Put, 
> Integer>[])}}, which returns to {{HRegionServer#put(byte[], List<Put>)}}.
> # At this point, {{HRegionServer#put(byte[], List<Put>)}} does a {{for}} loop 
> and extracts the *first* failure out of the {{OperationStatusCode[]}}.  In 
> our case, it'll take the {{OperationStatusCode.FAILURE}} that came from the 
> second {{Put}}.
> # This error code is returned to {{HRegionServer#multiPut}}, which records in 
> the {{MultiPutResponse}} that there was a failure - but *we don't know for 
> which {{Put}}*
> So the client has no way of knowing which {{Put}} failed.  All it knows is 
> that at least one of them failed for a particular region.  Its best bet is to 
> retry all the {{Put}} for this region.  But this has an unintended 
> consequence.  The {{Put}} that were successful during the first run will be 
> *re-applied*.  This will unexpectedly create extra versions.  Now I realize 
> most people don't really care about versions, so they won't notice.  But 
> whoever relies on the versions for whatever reason will rightfully consider 
> this to be data corruption.
> {{MultiPut}} makes proper error handling impossible.  Since this RPC cannot 
> guarantee any atomicity other than at the individual {{Put}} level, it *must* 
> return to the client specific information about which {{Put}} failed in case 
> of a failure, so that the client can do proper error handling.
> This requires us to change the {{MultiPutResponse}} so that it can indicate 
> which {{Put}} specifically failed.  We could do this for instance by giving 
> the index of the {{Put}} along with its {{OperationStatusCode}}.  So in the 
> scenario above, the {{MultiPutResponse}} would essentially return something 
> like: "for that particular region, put #0 succeeded, put #1 failed".  If we 
> want to save a bit of space, we may want to omit the successes from the 
> response and only mention the failures - so a response that doesn't mention 
> any failure means that everything was successful.  Not sure whether that's a 
> good idea though.
> Since doing this require an incompatible RPC change, I propose that we take 
> the opportunity to rewrite the {{MultiPut}} RPC too.  Right now it's very 
> inefficient, it's just a hack on top of {{Put}}.  When {{MultiPut}} is 
> written to the wire, a lot of unnecessary duplicate data is sent out.  The 
> timestamp, the row key and the family are sent out to the wire N+1 times, 
> where N is the number of edits for a particular row, instead of just once (!).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HBASE-2898) MultiPut makes proper error handling impossible and leads to corrupted data

Reply via email to