On Thu, Jul 26, 2012 at 7:16 AM, Alex Baranau <alex.barano...@gmail.com> wrote:
> Looks like you have only one region in your table. Right?
>
> If you want your writes to be distributed from the start (without waiting
> for HBase to fill the table enough to split it into many regions), you
> should pre-split your table. In your case you can pre-split the table into
> 10 regions (just an example, you can define more), with start keys "", "1",
> "2", ..., "9" [1].
>

Thanks a lot! Is there any specific best practice on how many regions one
should split a table into?

> Btw, since you are salting your keys to achieve distribution, you might
> also find this small lib helpful, which implements most of the stuff for
> you [2].
>

I'll take a look.

> Hope this helps.
>
> Alex Baranau
> ------
> Sematext :: http://blog.sematext.com/ :: Hadoop - HBase - ElasticSearch - Solr
>
> [1]
>
> // the first region, starting with the empty key, is created automatically,
> // so only the nine split points "1" through "9" are passed to createTable()
> byte[][] splitKeys = new byte[9][];
> for (int i = 0; i < splitKeys.length; i++) {
>     splitKeys[i] = Bytes.toBytes(String.valueOf(i + 1));
> }
>
> HBaseAdmin admin = new HBaseAdmin(conf);
> admin.createTable(tableDescriptor, splitKeys);
>
> [2]
> https://github.com/sematext/HBaseWD
> http://blog.sematext.com/2012/04/09/hbasewd-avoid-regionserver-hotspotting-despite-writing-records-with-sequential-keys/
>
> On Wed, Jul 25, 2012 at 7:54 PM, Mohit Anchlia <mohitanch...@gmail.com> wrote:
>
> > On Wed, Jul 25, 2012 at 6:53 AM, Alex Baranau <alex.barano...@gmail.com> wrote:
> >
> > > Hi Mohit,
> > >
> > > 1. When talking about a particular table:
> > >
> > > For viewing row distribution you can check how the regions are
> > > distributed. Each region is defined by its start/stop key, so depending
> > > on your key format you can see which records go into each region. You
> > > can see the region distribution in the web UI, as Adrien mentioned. It
> > > may also be handy to query the .META. table [1], which holds the region info.
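As an aside on the salting Alex mentions: a deterministic, hash-based prefix (rather than a purely random one) maps the same original key to the same bucket every time, so reads can reconstruct the salted key. Below is a minimal, self-contained sketch of that idea in plain Java; the class and method names are illustrative only (HBaseWD provides a ready-made implementation of this pattern).

```java
// Sketch of key salting: prefix each row key with a bucket derived from a
// hash of the original key, so sequential timestamps spread across the
// pre-split regions "", "1", ..., "9". Names here are illustrative.
public class SaltedKey {
    static final int BUCKETS = 10;

    // Deterministic salt: the same original key always maps to the same
    // bucket, unlike a random prefix, which makes point reads impossible.
    static String salt(String originalKey) {
        int bucket = Math.floorMod(originalKey.hashCode(), BUCKETS);
        return bucket + ":" + originalKey;
    }

    public static void main(String[] args) {
        // A timestamp-style key like the ones discussed in this thread.
        System.out.println(salt("1343074465420"));
    }
}
```

With this scheme a scan over one logical key range becomes BUCKETS scans (one per prefix), which is the trade-off the HBaseWD post linked above discusses.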
> > >
> > > In cases when you use random keys, or when you are just not sure how
> > > the data is distributed across key buckets (which are regions), you may
> > > also want to look at the HBase data on HDFS [2]. Since data is stored
> > > separately for each region, you can see the size each one occupies on HDFS.
> >
> > I did a scan and the data looks as pasted below. It appears all my
> > writes are going to just one server. My keys are of the form
> > [0-9]:[current timestamp], where the number between 0 and 9 is generated
> > randomly. I thought that by having this random prefix I would be able to
> > place my keys on multiple nodes. How should I approach this so that I am
> > able to use the other nodes as well?
> >
> > SESSION_TIMELINE1,,1343074465420.5831bbac53e591c609918c0e2d7da7bf.
> >   column=info:regioninfo, timestamp=1343170773523, value=REGION => {NAME =>
> >   'SESSION_TIMELINE1,,1343074465420.5831bbac53e591c609918c0e2d7da7bf.',
> >   STARTKEY => '', ENDKEY => '', ENCODED => 5831bbac53e591c609918c0e2d7da7bf,
> >   TABLE => {{NAME => 'SESSION_TIMELINE1', FAMILIES => [{NAME => 'S_T_MTX',
> >   BLOOMFILTER => 'NONE', REPLICATION_SCOPE => '0', COMPRESSION => 'GZ',
> >   VERSIONS => '1', TTL => '2147483647', BLOCKSIZE => '65536',
> >   IN_MEMORY => 'false', BLOCKCACHE => 'true'}]}}
> > SESSION_TIMELINE1,,1343074465420.5831bbac53e591c609918c0e2d7da7bf.
> >   column=info:server, timestamp=1343178912655, value=dsdb3.:60020
> >
> > > 2. When talking about the whole cluster, it makes sense to use a
> > > cluster monitoring tool [3] to find out more about overall load
> > > distribution, region distribution across tables, request counts, and
> > > many more such things.
> > >
> > > And of course, you can use the HBase Java API to fetch some data about
> > > the cluster state as well. I guess you should start looking at it from
> > > the HBaseAdmin class.
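To make the HDFS size check above concrete: each line of `hadoop fs -du /hbase/<table>` gives a region directory and its size in bytes, so a quick way to spot the hotspotting Mohit describes is to compute what fraction of the table's bytes the biggest region holds. A small self-contained sketch (plain Java, no HBase dependency; the sample lines are taken from the `-du` output quoted later in this thread):

```java
import java.util.Arrays;
import java.util.List;

// Parse "hadoop fs -du" style output (size, then path, per line) and report
// the largest region's share of the table's total bytes. A value near 1.0
// means nearly all data sits in one region, i.e. a write hotspot.
public class RegionSkew {
    static double largestRegionShare(List<String> duLines) {
        long total = 0, max = 0;
        for (String line : duLines) {
            long size = Long.parseLong(line.trim().split("\\s+")[0]);
            total += size;
            max = Math.max(max, size);
        }
        return total == 0 ? 0.0 : (double) max / total;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "3397        hdfs://hbase.master/hbase/mytable/02925d3c335bff7e273f392324f16dca",
            "2682163424  hdfs://hbase.master/hbase/mytable/03231b8ae2b73317c4858b1a85c09ad2",
            "1038862956  hdfs://hbase.master/hbase/mytable/04f911571593e931a9a3d9e2a6616236");
        System.out.println(largestRegionShare(lines));
    }
}
```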
> > >
> > > Alex Baranau
> > > ------
> > > Sematext :: http://blog.sematext.com/ :: Hadoop - HBase - ElasticSearch - Solr
> > >
> > > [1]
> > >
> > > hbase(main):001:0> scan '.META.', {LIMIT=>1, STARTROW=>"mytable,,"}
> > > ROW                                COLUMN+CELL
> > >
> > > mytable,,1341279432683.8fd61cd7ef426d2f233a4cd7e8b73845.
> > >   column=info:regioninfo, timestamp=1341279432625, value=REGION => {NAME =>
> > >   'mytable,,1341279432683.8fd61cd7ef426d2f233a4cd7e8b73845.', STARTKEY =>
> > >   'chicago', ENDKEY => 'new_york', ENCODED => 8fd61cd7ef426d2f233a4cd7e8b73845,
> > >   TABLE => {{NAME => 'mytable', FAMILIES => [{NAME => 'job', BLOOMFILTER =>
> > >   'NONE', REPLICATION_SCOPE => '0', COMPRESSION => 'NONE', VERSIONS => '1',
> > >   TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false',
> > >   BLOCKCACHE => 'true'}]}}
> > >
> > > mytable,,1341279432683.8fd61cd7ef426d2f233a4cd7e8b73845.
> > >   column=info:server, timestamp=1341279432673, value=myserver:60020
> > >
> > > mytable,,1341279432683.8fd61cd7ef426d2f233a4cd7e8b73845.
> > >   column=info:serverstartcode, timestamp=1341279432673, value=1341267474257
> > >
> > > 1 row(s) in 0.1980 seconds
> > >
> > > [2]
> > >
> > > ubuntu@ip-10-80-47-73:~$ sudo -u hdfs hadoop fs -du /hbase/mytable
> > > Found 130 items
> > > 3397        hdfs://hbase.master/hbase/mytable/02925d3c335bff7e273f392324f16dca
> > > 2682163424  hdfs://hbase.master/hbase/mytable/03231b8ae2b73317c4858b1a85c09ad2
> > > 1038862956  hdfs://hbase.master/hbase/mytable/04f911571593e931a9a3d9e2a6616236
> > > 1039181555  hdfs://hbase.master/hbase/mytable/0a177633196cae7b158836181d69dc0f
> > > 1076888812  hdfs://hbase.master/hbase/mytable/0d52fc477c41a9a236803234d44c7c06
> > >
> > > [3]
> > > You can get data from JMX directly using any tool you like, or use:
> > > * Ganglia
> > > * SPM monitoring (http://sematext.com/spm/hbase-performance-monitoring/index.html)
> > > * others
> > >
> > > On Wed, Jul 25, 2012 at 1:59 AM, Adrien Mogenet <adrien.moge...@gmail.com> wrote:
> > >
> > > > From the web interface you can get such statistics when viewing the
> > > > details of a table.
> > > > You can also develop your own "balance viewer" through the HBase API
> > > > (list of region servers, regions, store files, their sizes, etc.).
> > > >
> > > > On Wed, Jul 25, 2012 at 7:32 AM, Mohit Anchlia <mohitanch...@gmail.com> wrote:
> > > >
> > > > > Is there an easy way to tell how my nodes are balanced and how the
> > > > > rows are distributed in the cluster?
> > > >
> > > > --
> > > > Adrien Mogenet
> > > > 06.59.16.64.22
> > > > http://www.mogenet.me
> > >
> > > --
> > > Alex Baranau
> > > ------
> > > Sematext :: http://blog.sematext.com/ :: Hadoop - HBase - ElasticSearch - Solr
>
> --
> Alex Baranau
> ------
> Sematext :: http://blog.sematext.com/ :: Hadoop - HBase - ElasticSearch - Solr
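The "balance viewer" idea Adrien suggests in the quoted message can be sketched even without the HBase API: the `info:server` column in `.META.` names the region server hosting each region, so counting rows per `value=` shows the balance. A toy self-contained version that parses scan-output lines like those pasted above (illustrative only; a real viewer would use HBaseAdmin/HTable against the live cluster):

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Count regions per region server from .META. scan lines containing
// "column=info:server, ... value=<host>:<port>". A heavily skewed count
// (or one server holding every region) confirms hotspotting.
public class BalanceViewer {
    static Map<String, Integer> regionsPerServer(List<String> metaLines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : metaLines) {
            if (!line.contains("column=info:server,")) continue;
            String server = line.substring(line.indexOf("value=") + "value=".length()).trim();
            counts.merge(server, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "r1. column=info:server, timestamp=1, value=dsdb3:60020",
            "r2. column=info:server, timestamp=2, value=dsdb3:60020",
            "r3. column=info:server, timestamp=3, value=dsdb1:60020");
        System.out.println(regionsPerServer(lines)); // prints {dsdb1:60020=1, dsdb3:60020=2}
    }
}
```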