Interesting project, Alex. Since there are bucketsCount scanners compared to the single scanner originally, have you performed load testing to see the impact?
Thanks

On Tue, Apr 19, 2011 at 10:25 AM, Alex Baranau <alex.barano...@gmail.com> wrote:

> Hello guys,
>
> I'd like to introduce a new small Java project/lib around HBase: HBaseWD. It
> is aimed to help with distributing the load (across region servers) when
> writing records with sequential row keys. It implements the solution which
> was discussed several times on this mailing list (e.g. here:
> http://search-hadoop.com/m/gNRA82No5Wk).
>
> Please find the sources at https://github.com/sematext/HBaseWD (there's also
> a jar of the current version for convenience). It is very easy to make use
> of it: e.g. I added it to one existing project with 1+2 lines of code (one
> where I write to HBase and 2 for configuring the MapReduce job).
>
> Any feedback is highly appreciated!
>
> Please find below a short intro to the lib [1].
>
> Alex Baranau
> ----
> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch - Hadoop - HBase
>
> [1]
>
> Description:
> ------------
> HBaseWD stands for Distributing (sequential) Writes. It was inspired by
> discussions on the HBase mailing lists around the problem of choosing
> between:
> * writing records with sequential row keys (e.g. time-series data with a
>   row key built from a timestamp)
> * using random unique IDs for records
>
> The first approach makes it possible to perform fast range scans by setting
> start/stop keys on the Scanner, but creates a single-region-server
> hot-spotting problem when writing data (as row keys go in sequence, all
> records end up written into a single region at a time).
>
> The second approach aims for the fastest write performance by distributing
> new records over random regions, but makes fast range scans over the
> written data impossible.
>
> The suggested approach sits between the two above and has proven to perform
> well: records are distributed over the cluster during writes while range
> scans over the data remain possible. HBaseWD provides a very simple API,
> which makes it easy to add to existing code.
>
> Please refer to the unit tests for lib usage info, as they are meant to
> serve as examples.
>
> Brief Usage Info (Examples):
> ----------------------------
>
> Distributing records with sequential keys into up to Byte.MAX_VALUE
> buckets:
>
>   byte bucketsCount = (byte) 32; // distributing into 32 buckets
>   RowKeyDistributor keyDistributor =
>       new RowKeyDistributorByOneBytePrefix(bucketsCount);
>   for (int i = 0; i < 100; i++) {
>     Put put = new Put(keyDistributor.getDistributedKey(originalKey));
>     ... // add values
>     hTable.put(put);
>   }
>
> Performing a range scan over the written data (internally <bucketsCount>
> scanners are executed):
>
>   Scan scan = new Scan(startKey, stopKey);
>   ResultScanner rs = DistributedScanner.create(hTable, scan, keyDistributor);
>   for (Result current : rs) {
>     ...
>   }
>
> Performing a MapReduce job over a chunk of written data specified by a
> Scan:
>
>   Configuration conf = HBaseConfiguration.create();
>   Job job = new Job(conf, "testMapreduceJob");
>
>   Scan scan = new Scan(startKey, stopKey);
>
>   TableMapReduceUtil.initTableMapperJob("table", scan,
>       RowCounterMapper.class, ImmutableBytesWritable.class, Result.class,
>       job);
>
>   // Substituting the standard TableInputFormat which was set in
>   // TableMapReduceUtil.initTableMapperJob(...)
>   job.setInputFormatClass(WdTableInputFormat.class);
>   keyDistributor.addInfo(job.getConfiguration());
>
> Extending Row Key Distributing Patterns:
> ----------------------------------------
>
> HBaseWD is designed to be flexible and to support custom row key
> distribution approaches. To define custom row key distribution logic, just
> extend the AbstractRowKeyDistributor abstract class, which is really very
> simple:
>
>   public abstract class AbstractRowKeyDistributor implements Parametrizable {
>     public abstract byte[] getDistributedKey(byte[] originalKey);
>     public abstract byte[] getOriginalKey(byte[] adjustedKey);
>     public abstract byte[][] getAllDistributedKeys(byte[] originalKey);
>     ... // some utility methods
>   }
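
For anyone wanting to see that extension point in action: below is a minimal,
hypothetical sketch of a custom distributor that prefixes each key with one
byte derived from a hash of the original key. The class name
HashedPrefixDistributor is made up for illustration, and the Parametrizable
methods (getParamsToStore()/init(String), used to carry the distributor's
config through a MapReduce job) are assumed signatures; treat this as a
sketch against the abstract class quoted above, not code shipped with
HBaseWD.

  import java.util.Arrays;

  public class HashedPrefixDistributor extends AbstractRowKeyDistributor {
    private byte bucketsCount;

    public HashedPrefixDistributor(byte bucketsCount) {
      this.bucketsCount = bucketsCount;
    }

    @Override
    public byte[] getDistributedKey(byte[] originalKey) {
      // Mask keeps the hash non-negative before taking the modulo, so the
      // same original key always maps to the same bucket
      byte bucket = (byte) ((Arrays.hashCode(originalKey) & 0x7fffffff)
          % bucketsCount);
      return withPrefix(bucket, originalKey);
    }

    @Override
    public byte[] getOriginalKey(byte[] adjustedKey) {
      // Strip the one-byte bucket prefix added above
      return Arrays.copyOfRange(adjustedKey, 1, adjustedKey.length);
    }

    @Override
    public byte[][] getAllDistributedKeys(byte[] originalKey) {
      // Lets DistributedScanner/WdTableInputFormat fan a scan out over
      // every possible bucket
      byte[][] keys = new byte[bucketsCount][];
      for (byte i = 0; i < bucketsCount; i++) {
        keys[i] = withPrefix(i, originalKey);
      }
      return keys;
    }

    private static byte[] withPrefix(byte prefix, byte[] key) {
      byte[] result = new byte[key.length + 1];
      result[0] = prefix;
      System.arraycopy(key, 0, result, 1, key.length);
      return result;
    }

    // Assumed Parametrizable methods: serialize/restore the bucket count
    // so a MapReduce job can rebuild the distributor from the job config
    public String getParamsToStore() {
      return String.valueOf(bucketsCount);
    }

    public void init(String params) {
      this.bucketsCount = Byte.parseByte(params);
    }
  }

Scans would then work exactly as in the DistributedScanner example above,
just with new HashedPrefixDistributor(bucketsCount) passed as the
keyDistributor.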