Hi folks,

Something we've been using that's been working out pretty well with client-loaded data (i.e., not "bulk loading" via creating your own StoreFiles) is to pre-bucket chunks of data by region.
When the client calls flushCommits(), it internally does the same thing and buckets Puts by RegionServer. But when loading a lot of data, it is more efficient to make one RPC call that delivers 100 Puts to a RegionServer than ten RPC calls that deliver 10 Puts apiece, for example. This per-RS iteration is not a "bug" in the client; it makes it easy to communicate with any RS in the cluster, but the client doesn't know you are trying to batch-load. The gist is to bucket every 50k-100k+ Puts or so and reduce the number of RS RPC calls. Your mileage may vary, so work out the optimal bucket interval with your data.

I'll add this to the HBase book and look into adding a utility method for this.

public void bucketAndPut(HTable htable, List<Put> puts) throws IOException {
  // could also use a Guava Multimap
  Map<String, List<Put>> putMap = new HashMap<String, List<Put>>();
  for (Put put : puts) {
    HRegionLocation rl = htable.getRegionLocation(put.getRow());
    String hostname = rl.getServerAddress().getHostname();
    add(putMap, hostname, put);
  }
  for (List<Put> hostPuts : putMap.values()) {
    // adjust the writeBuffer as necessary, or use the batch method
    htable.put(hostPuts);
  }
  htable.flushCommits();
}

private void add(Map<String, List<Put>> putMap, String hostname, Put put) {
  List<Put> recs = putMap.get(hostname);
  if (recs == null) {
    recs = new ArrayList<Put>();
    putMap.put(hostname, recs);
  }
  recs.add(put);
}

Doug Meil
Chief Software Architect, Explorys
doug.m...@explorys.com
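P.S. For anyone who wants to see the bucketing pattern on its own, here is a minimal, HBase-free sketch of the same group-by-key idea, using plain strings for row keys and hostnames (the class name, method names, and sample data below are made up for illustration; a real implementation would look up the hostname via the table's region locations as in the code above):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class BucketSketch {

    // Group rows into per-host buckets, mirroring the putMap logic above.
    // computeIfAbsent replaces the explicit null-check-and-create helper.
    public static Map<String, List<String>> bucketByHost(
            List<String> rows, Map<String, String> rowToHost) {
        Map<String, List<String>> buckets = new HashMap<>();
        for (String row : rows) {
            String host = rowToHost.get(row);
            buckets.computeIfAbsent(host, h -> new ArrayList<>()).add(row);
        }
        return buckets;
    }

    public static void main(String[] args) {
        // Pretend region-location lookup: row -> hosting RegionServer.
        Map<String, String> rowToHost = new HashMap<>();
        rowToHost.put("row1", "rs1");
        rowToHost.put("row2", "rs2");
        rowToHost.put("row3", "rs1");

        Map<String, List<String>> buckets =
                bucketByHost(Arrays.asList("row1", "row2", "row3"), rowToHost);

        // One list per RegionServer; each would become one batched call.
        System.out.println(buckets.get("rs1")); // [row1, row3]
        System.out.println(buckets.get("rs2")); // [row2]
    }
}
```

The point is simply that N Puts collapse into one collection per server, so the number of RPC calls scales with the number of servers touched, not the number of Puts.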