Add HBase batch update to reduce RPC overhead
---------------------------------------------
Key: HADOOP-1468
URL: https://issues.apache.org/jira/browse/HADOOP-1468
Project: Hadoop
Issue Type: New Feature
Components: contrib/hbase
Affects Versions: 0.14.0
Reporter: Jim Kellerman
Fix For: 0.14.0
On Wed, 2007-06-06 at 10:05 -0700, James Kennedy wrote:
Hi,
>
> I'm noticing that since the HClient/HRegionServer interface only allows
> for a per-column put(), there is a lot of RPC and some lease management
> overhead when writing large amounts of data. For example:
>
> for (int i = 0; i < 10000; i++) {
> Text rowKey = new Text(i+"");
> long lock = client.startUpdate(rowKey);
> client.put(lock, COL1, rowKey.getBytes());
> client.put(lock, COL2, someValue.getBytes());
> client.commit(lock);
> }
>
> This code takes my machine (using a single HMaster/HRegionServer on
> local filesystem) approximately 13 seconds to execute. When i measure
> the execution time within HRegionServer.put() I get total time spent in
> put() < 2 seconds. So it looks like there's definately overhead in the
> RPC communication and serialization/deserialization between client and
> server.
>
> To write 10000 rows, 10000 x (startUpdate=1 + #cols=2 + commit=1) =
> 40000 RPC operations.
>
> What I'm thinking, and please tell me if i'm wrong or if this is already
> in the works, is that if I create a row-level put() method that submits
> a map of column values at once, I would reduce the 2 + (#cols) RPC
> operations to one single atomic row-write RPC as well as eliminate the
> small but noticeable overhead in lease creation, renewal, and cancellation.
>
> It's not clear exactly what the performance improvement would be. The
> same amount of serialization/deserilalization must occur, but YourKit
> profiling tells me that the serialization overhead is negligible.
>
> Any thoughts?
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.