[
https://issues.apache.org/jira/browse/HADOOP-1468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12515884
]
Hadoop QA commented on HADOOP-1468:
-----------------------------------
+1
http://issues.apache.org/jira/secure/attachment/12362641/patch.txt applied and
successfully tested against trunk revision r559886.
Test results:
http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/476/testReport/
Console output:
http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/476/console
> Add HBase batch update to reduce RPC overhead
> ---------------------------------------------
>
> Key: HADOOP-1468
> URL: https://issues.apache.org/jira/browse/HADOOP-1468
> Project: Hadoop
> Issue Type: New Feature
> Components: contrib/hbase
> Affects Versions: 0.15.0
> Reporter: Jim Kellerman
> Assignee: Jim Kellerman
> Fix For: 0.15.0
>
> Attachments: patch.txt, patch.txt
>
>
> On Wed, 2007-06-06 at 10:05 -0700, James Kennedy wrote:
> Hi,
> >
> > I'm noticing that since the HClient/HRegionServer interface only allows
> > for a per-column put(), there is a lot of RPC and some lease management
> > overhead when writing large amounts of data. For example:
> >
> > for (int i = 0; i < 10000; i++) {
> > Text rowKey = new Text(i+"");
> > long lock = client.startUpdate(rowKey);
> > client.put(lock, COL1, rowKey.getBytes());
> > client.put(lock, COL2, someValue.getBytes());
> > client.commit(lock);
> > }
> >
> > This code takes my machine (using a single HMaster/HRegionServer on
> > local filesystem) approximately 13 seconds to execute. When i measure
> > the execution time within HRegionServer.put() I get total time spent in
> > put() < 2 seconds. So it looks like there's definately overhead in the
> > RPC communication and serialization/deserialization between client and
> > server.
> >
> > To write 10000 rows, 10000 x (startUpdate=1 + #cols=2 + commit=1) =
> > 40000 RPC operations.
> >
> > What I'm thinking, and please tell me if i'm wrong or if this is already
> > in the works, is that if I create a row-level put() method that submits
> > a map of column values at once, I would reduce the 2 + (#cols) RPC
> > operations to one single atomic row-write RPC as well as eliminate the
> > small but noticeable overhead in lease creation, renewal, and cancellation.
> >
> > It's not clear exactly what the performance improvement would be. The
> > same amount of serialization/deserilalization must occur, but YourKit
> > profiling tells me that the serialization overhead is negligible.
> >
> > Any thoughts?
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.