[jira] [Commented] (HBASE-6295) Possible performance improvement in client batch operations: presplit and send in background

Sergey Shelukhin (JIRA) Wed, 15 May 2013 14:39:18 -0700

    [ https://issues.apache.org/jira/browse/HBASE-6295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13658857#comment-13658857 ]


Sergey Shelukhin commented on HBASE-6295:
-----------------------------------------

Can you post it on rb?

Also, there is still a large amount (hundreds of lines) of copy-pasted code 
duplicated between AsyncProcess and Process. If we don't get rid of Process 
fast (and I suspect realistically we won't), it can become a problem. Can at 
least some of the duplicated code be factored out and shared?

Also, the patch needs a little rebasing.

{code}
  private R result;
{code}
Can you please update the main comment in this file to explain why this is 
necessary?

{code}
          Row row = it.next();
          if (row != null) {
{code}
Is a null row a legal condition here?
{code} // to move to trace, {code}
Move to trace? :)

{code}
if (LOG.isTraceEnabled() && numAttempt > 0) {
{code}
Is numAttempt the number of tries, or of retries? The "> 1" check above would 
seem to indicate the former, but this one checks "> 0".

{code}
if (nextLog == 0){
            nextLog = EnvironmentEdgeManager.currentTimeMillis() + 3000;
          }
{code}
This can just be set before the start of the loop.

{code}
} else {
            if (EnvironmentEdgeManager.currentTimeMillis() > nextLog) {
...
            }
            nextLog = EnvironmentEdgeManager.currentTimeMillis() + 5000;
{code}
This will update nextLog in every iteration of the loop (after the first), so 
{code}if (EnvironmentEdgeManager.currentTimeMillis() > nextLog)
{code}
will never (well, almost never, technically) become true.
nextLog only needs to be updated when a message is actually logged.
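
To illustrate the two points above, here is a minimal sketch of the throttling 
I have in mind. It is not the patch code: the class name, the method, the 
injected clock and the delay parameters are all hypothetical stand-ins (the 
clock plays the role of EnvironmentEdgeManager.currentTimeMillis(), and the 
list plays the role of LOG). nextLog is set once before the loop and is only 
pushed forward when a message is actually emitted.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.LongSupplier;

/** Hypothetical illustration of time-throttled logging inside a wait loop. */
final class ThrottledWaitLog {

  /**
   * Runs a fixed number of loop iterations and returns the clock values at
   * which a message would have been logged.
   */
  static List<Long> waitWithThrottledLog(LongSupplier clock, int iterations,
      long firstDelayMs, long intervalMs) {
    List<Long> logged = new ArrayList<>();
    // Set once, before the loop starts.
    long nextLog = clock.getAsLong() + firstDelayMs;
    for (int i = 0; i < iterations; i++) {
      long now = clock.getAsLong();
      if (now > nextLog) {
        logged.add(now);            // stand-in for LOG.info(...)
        nextLog = now + intervalMs; // only moved forward when we log
      }
      // ... the real loop would wait or poll for completed tasks here ...
    }
    return logged;
  }
}
```

With a clock that advances one second per call and delays of 3000 ms and 
5000 ms, this logs periodically instead of never.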

{code}
          retriedErrors = new BatchErrors<Row>();
          RetriesExhaustedWithDetailsException exception = errors.makeException();
          errors  = new BatchErrors<Row>();
          retriedErrors = new BatchErrors<Row>();
{code}
Why does this reset retriedErrors? And twice, too.


{code}

    /**
     * Methods and attributes to manage a batch process are grouped into this single class.
     * This allows, by creating a Process<R> per batch process to ensure multithread safety.
     *
     * This code should be move to HTable once processBatchCallback is not supported anymore in
     * the HConnection interface.
{code}
The Javadoc for the new class is also copy-pasted. Can you please write 
Javadoc that explains what the class does?

Code in HTable looks very non-thread-safe; I am assuming that is OK.

{code}
  private HConnectionManager.HConnectionImplementation.AsyncProcess<Object> ap;
{code}
Why is there just one AsyncProcess per table? I thought it was supposed to be 
per batch request?

{code}
      ap.submit(writeAsyncBuffer);
      while (previousSize == writeAsyncBuffer.size()) {
        try {
          Thread.sleep(1000);
        } catch (InterruptedException e) {
          throw new InterruptedIOException("Still not sent: " + writeAsyncBuffer.size() + " rows.");
        }
        ap.submit(writeAsyncBuffer);
      }
{code}
Why does it keep submitting the same buffer again and again?

{code}
 if (!clearBufferOnFail){
        if (ap.hasError()){
          ap.waitUntilDone();
          writeAsyncBuffer.addAll(ap.getFailedOperation());
        }
      }
      throw
{code}
What is this for? If put calls doPut, doPut calls backgroundFlushCommits, and 
this branch runs and adds the failed operations to writeAsyncBuffer, the 
exception is still thrown out of put. What will the data in the buffer be 
used for, given that getWriteBuffer is removed and there is no way to get at 
this buffer?

Nit: Batch.java has some whitespace added at the end.

ZKUtil has some changes in deleteNodeFailSilent that look unrelated.

                
> Possible performance improvement in client batch operations: presplit and 
> send in background
> --------------------------------------------------------------------------------------------
>
>                 Key: HBASE-6295
>                 URL: https://issues.apache.org/jira/browse/HBASE-6295
>             Project: HBase
>          Issue Type: Improvement
>          Components: Client, Performance
>    Affects Versions: 0.95.2
>            Reporter: Nicolas Liochon
>            Assignee: Nicolas Liochon
>              Labels: noob
>         Attachments: 6295.v1.patch, 6295.v2.patch, 6295.v3.patch, 
> 6295.v4.patch, 6295.v5.patch, 6295.v6.patch
>
>
> today batch algo is:
> {noformat}
> for Operation o: List<Op>{
>   add o to todolist
>   if todolist > maxsize or o last in list
>     split todolist per location
>     send split lists to region servers
>     clear todolist
>     wait
> }
> {noformat}
> We could:
> - create immediately the final object instead of an intermediate array
> - split per location immediately
> - instead of sending when the list as a whole is full, send it when there is 
> enough data for a single location
> It would be:
> {noformat}
> for Operation o: List<Op>{
>   get location
>   add o to todo location.todolist
>   if (location.todolist > maxLocationSize)
>     send location.todolist to region server 
>     clear location.todolist
>     // don't wait, continue the loop
> }
> send remaining
> wait
> {noformat}
> It's not trivial to write if you add error management: retried list must be 
> shared with the operations added in the todolist. But it's doable.
> It's interesting mainly for 'big' writes
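
For reference, the per-location algorithm proposed in the description above 
can be sketched roughly as follows. This is only an illustration under stated 
assumptions, not the AsyncProcess implementation: PerLocationBatcher, the 
locator function (operation to server name) and the sender callback are 
hypothetical, the list is sent once it reaches maxLocationSize, and the error 
management / retry list that the description calls non-trivial is left out 
entirely.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiConsumer;
import java.util.function.Function;

/** Hypothetical sketch: group operations per location, send each group when full. */
final class PerLocationBatcher<Op> {
  private final Function<Op, String> locator;        // assumed: op -> server name
  private final BiConsumer<String, List<Op>> sender; // assumed: non-blocking send
  private final int maxLocationSize;
  private final Map<String, List<Op>> pending = new HashMap<>();

  PerLocationBatcher(Function<Op, String> locator,
      BiConsumer<String, List<Op>> sender, int maxLocationSize) {
    this.locator = locator;
    this.sender = sender;
    this.maxLocationSize = maxLocationSize;
  }

  /** Adds one operation; sends that location's list as soon as it is full. */
  void add(Op op) {
    String location = locator.apply(op);
    List<Op> todo = pending.computeIfAbsent(location, k -> new ArrayList<>());
    todo.add(op);
    if (todo.size() >= maxLocationSize) {
      pending.remove(location);      // "clear location.todolist"
      sender.accept(location, todo); // don't wait, the caller continues its loop
    }
  }

  /** "send remaining": flushes every partially filled list. */
  void flush() {
    for (Map.Entry<String, List<Op>> e : pending.entrySet()) {
      sender.accept(e.getKey(), e.getValue());
    }
    pending.clear();
  }
}
```

A caller would invoke add for each operation, then flush, then wait for the 
outstanding sends; the waiting step depends on how the sender reports 
completion and is not shown.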

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
