On Thu, Jul 26, 2012 at 1:29 PM, Alex Baranau <alex.barano...@gmail.com>wrote:
> > But who decides on how many regionservers to > > have in the cluster? > > RegionServer is a process started on each slave in your cluster. So the > number of RS is the same as the number of slaves. You might want to take a > look at one of Intro to HBase presentations (which have pictures!) [1] > > > How different is this mechanism as compared to regionsplitter that uses > > default string md5 split. Just trying to understand the difference in how > > different the key range is. > > You can use any of the splitter algorithm, but note that it probably will > not take into account the row keys you are going to use. E.g.: > * if your row keys have format <country><state><company><...> and > * you know that you will have most of the data about US companies (if e.g. > this is your target audience) then > * based on the example I gave, you can create regions defined by these > start keys: > "" > "US" > "US_FL" > "US_KN" > "US_MS" > "US_NC" > "US_VM" > "V" > so that data is more or less evenly distributed (note: there's no need to > split other countries in regions as they they will have small amount of > data). > Thanks for great explanation!! > > No standard splitter will know what your data is (at the time of creation > of the table). > > Alex Baranau > ------ > Sematext :: http://blog.sematext.com/ :: Hadoop - HBase - ElasticSearch - > Solr > > [1] > http://blog.sematext.com/2012/07/09/introduction-to-hbase/ > > http://blog.sematext.com/2012/07/09/intro-to-hbase-internals-and-schema-desig/ > or any other "intro to hbase" presentations over the web. > > On Thu, Jul 26, 2012 at 3:50 PM, Mohit Anchlia <mohitanch...@gmail.com > >wrote: > > > On Thu, Jul 26, 2012 at 10:34 AM, Alex Baranau <alex.barano...@gmail.com > > >wrote: > > > > > > Is there any specific best practice on how many regions one > > > > should split a table into? > > > > > > As always, "it depends". Usually you don't want your RegionServers to > > serve > > > more than 50 regions or so. The fewer the better. But at the same time > > you > > > usually want your regions to be distributed over the whole cluster (so > > that > > > you use all power). So, it might make sense to start with one region > per > > RS > > > (if your writes are more or less evenly distributed across pre-splitted > > > regions) if you don't know about you data size. If you know that you'll > > > need to have more regions because of how big is your data, then you > might > > > create more regions at the start (with pre-splitting), so that you > avoid > > > region splits operations (you really want to avoid them if you can). > > > Of course, you need to take into account other tables in your cluster > as > > > well. I.e. "usually not more than 50 regions" total per regionserver. > > > > > > > > > > > > > Thanks for the detailed explanation. I understand the regions per > > > regionserver, which is essentially range of rows distributed accross > the > > > cluster for a given table. But who decides on how many regionservers to > > > have in the cluster? > > > > > > > > > > > Just one more question, in the split keys that you described below, > is > > it > > > > based on the first byte value of the Key? > > > > > > yes. And the first byte contains readable char, because of > > > Bytes.ToBytes(String.valueOf(i)). If you want to prefix with (byte) 0, > > ..., > > > (byte) 9 (i.e. with 0x00, 0x01, ..., 0x09) then no need to convert to > > > String. > > > > > > > > How different is this mechanism as compared to regionsplitter that uses > > default string md5 split. Just trying to understand the difference in how > > different the key range is. > > > > > Alex Baranau > > > ------ > > > Sematext :: http://blog.sematext.com/ :: Hadoop - HBase - > ElasticSearch > > - > > > Solr > > > > > > On Thu, Jul 26, 2012 at 11:43 AM, Mohit Anchlia < > mohitanch...@gmail.com > > > >wrote: > > > > > > > On Thu, Jul 26, 2012 at 7:16 AM, Alex Baranau < > > alex.barano...@gmail.com > > > > >wrote: > > > > > > > > > Looks like you have only one region in your table. Right? > > > > > > > > > > If you want your writes to be distributed from the start (without > > > waiting > > > > > for HBase to fill table enough to split it in many regions), you > > should > > > > > pre-split your table. In your case you can pre-split table with 10 > > > > regions > > > > > (just an example, you can define more), with start keys: "", "1", > > "2", > > > > ..., > > > > > "9" [1]. > > > > > > > > > > Just one more question, in the split keys that you described below, > > is > > > it > > > > based on the first byte value of the Key? > > > > > > > > > > > > > Btw, since you are salting your keys to achieve distribution, you > > might > > > > > also find this small lib helpful which implements most of the stuff > > for > > > > you > > > > > [2]. > > > > > > > > > > Hope this helps. > > > > > > > > > > Alex Baranau > > > > > ------ > > > > > Sematext :: http://blog.sematext.com/ :: Hadoop - HBase - > > > ElasticSearch > > > > - > > > > > Solr > > > > > > > > > > [1] > > > > > > > > > > byte[][] splitKeys = new byte[9][]; > > > > > // the first region starting with empty key will be created > > > > > automatically > > > > > for (int i = 1; i < splitKeys.length; i++) { > > > > > splitKeys[i] = Bytes.toBytes(String.valueOf(i)); > > > > > } > > > > > > > > > > HBaseAdmin admin = new HBaseAdmin(conf); > > > > > admin.createTable(tableDescriptor, splitKeys); > > > > > > > > > > [2] > > > > > https://github.com/sematext/HBaseWD > > > > > > > > > > > > > > > > > > > > http://blog.sematext.com/2012/04/09/hbasewd-avoid-regionserver-hotspotting-despite-writing-records-with-sequential-keys/ > > > > > > > > > > On Wed, Jul 25, 2012 at 7:54 PM, Mohit Anchlia < > > mohitanch...@gmail.com > > > > > >wrote: > > > > > > > > > > > On Wed, Jul 25, 2012 at 6:53 AM, Alex Baranau < > > > > alex.barano...@gmail.com > > > > > > >wrote: > > > > > > > > > > > > > Hi Mohit, > > > > > > > > > > > > > > 1. When talking about particular table: > > > > > > > > > > > > > > For viewing rows distribution you can check out how regions are > > > > > > > distributed. And each region defined by the start/stop key, so > > > > > depending > > > > > > on > > > > > > > your key format, etc. you can see which records go into each > > > region. > > > > > You > > > > > > > can see the regions distribution in web ui as Adrien mentioned. > > It > > > > may > > > > > > also > > > > > > > be handy for you to query .META. table [1] which holds regions > > > info. > > > > > > > > > > > > > > In cases when you use random keys or when you just not sure how > > > data > > > > is > > > > > > > distributed in key buckets (which are regions), you may also > want > > > to > > > > > look > > > > > > > at HBase data on HDFS [2]. Since data is stored for each region > > > > > > separately, > > > > > > > you can see the size on the HDFS each one occupies. > > > > > > > > > > > > > > I did a scan and the data looks like as pasted below. It > appears > > > all > > > > my > > > > > > writes are going to just one server. My keys are of this type > > > > > > [0-9]:[current timestamp]. Number between 0-9 is generated > > randomly. > > > I > > > > > > thought by having this random number I'll be able to place my > keys > > on > > > > > > multiple nodes. How should I approach this such that I am able to > > use > > > > > other > > > > > > nodes as well? > > > > > > > > > > > > > > > > > > > > > > > > SESSION_TIMELINE1,,1343074465420.5831bbac53e59 > > > column=info:regioninfo, > > > > > > timestamp=1343170773523, value=REGION => {NAME => > > > > > > 'SESSION_TIMELINE1,,1343074465420.5831bbac53e591c609918c0e2d7da7 > > > > > > 1c609918c0e2d7da7bf. bf.', STARTKEY => > > '', > > > > > > ENDKEY => '', ENCODED => 5831bbac53e591c609918c0e2d7da7bf, TABLE > => > > > > > {{NAME > > > > > > => 'SESSION_TIMELINE1', FAMILIES => [{NAM > > > > > > E => 'S_T_MTX', > > > > > BLOOMFILTER > > > > > > => 'NONE', REPLICATION_SCOPE => '0', COMPRESSION => 'GZ', > VERSIONS > > => > > > > > '1', > > > > > > TTL => '2147483647', BLOCKSIZE => ' > > > > > > 65536', IN_MEMORY > > => > > > > > > 'false', BLOCKCACHE => 'true'}]}} > > > > > > SESSION_TIMELINE1,,1343074465420.5831bbac53e59 > column=info:server, > > > > > > timestamp=1343178912655, value=dsdb3.:60020 > > > > > > 1c609918c0e2d7da7bf. > > > > > > > > > > > > > 2. When talking about whole cluster, it makes sense to use > > cluster > > > > > > > monitoring tool [3], to find out more about overall load > > > > distribution, > > > > > > > regions of multiple tables distribution, requests amount, and > > many > > > > more > > > > > > > such things. > > > > > > > > > > > > > > And of course, you can use HBase Java API to fetch some data of > > the > > > > > > cluster > > > > > > > state as well. I guess you should start looking at it from > > > HBaseAdmin > > > > > > > class. > > > > > > > > > > > > > > Alex Baranau > > > > > > > ------ > > > > > > > Sematext :: http://blog.sematext.com/ :: Hadoop - HBase - > > > > > ElasticSearch > > > > > > - > > > > > > > Solr > > > > > > > > > > > > > > [1] > > > > > > > > > > > > > > hbase(main):001:0> scan '.META.', {LIMIT=>1, > > STARTROW=>"mytable,,"} > > > > > > > ROW > > > > > > > COLUMN+CELL > > > > > > > > > > > > > > > > > > > > > mytable,,1341279432683.8fd61cd7ef426d2f233a4cd7e8b73845. > > > > > > > column=info:regioninfo, timestamp=1341279432625, value=REGION > => > > > > {NAME > > > > > > => > > > > > > > 'mytable,,1341279432683.8fd61cd7ef426d2f233a4cd7e8b73845.', > > > STARTKEY > > > > => > > > > > > > 'chicago', ENDKEY => 'new_york', ENCODED => > > > > > > > fd61cd7ef426d2f233a4cd7e8b73845, TABLE => {{NAME => 'mytable', > > > > FAMILIES > > > > > > => > > > > > > > [{NAME => 'job', BLOOMFILTER => 'NONE', REPLICATION_SCOPE => > '0', > > > > > > > COMPRESSION => 'NONE', VERSIONS => '1', TTL => '2147483647', > > > > BLOCKSIZE > > > > > => > > > > > > > '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}]}} > > > > > > > > > > > > > > > > > > > > > > > > > > > > mytable,,1341279432683.8fd61cd7ef426d2f233a4cd7e8b73845. > > > > > > > column=info:server, timestamp=1341279432673, > > value=myserver:60020 > > > > > > > > > > > > > > > > > > > > > mytable,,1341279432683.8fd61cd7ef426d2f233a4cd7e8b73845. > > > > > > > column=info:serverstartcode, timestamp=1341279432673, > > > > > > value=1341267474257 > > > > > > > > > > > > > > > > > > > > > 1 row(s) in 0.1980 seconds > > > > > > > > > > > > > > [2] > > > > > > > > > > > > > > ubuntu@ip-10-80-47-73:~$ sudo -u hdfs hadoop fs -du > > /hbase/mytable > > > > > > > Found 130 items > > > > > > > 3397 hdfs://hbase.master/hbase/mytable > > > > > > > /02925d3c335bff7e273f392324f16dca > > > > > > > 2682163424 hdfs://hbase.master/hbase/mytable > > > > > > > /03231b8ae2b73317c4858b1a85c09ad2 > > > > > > > 1038862956 hdfs://hbase.master/hbase/mytable > > > > > > > /04f911571593e931a9a3d9e2a6616236 > > > > > > > 1039181555 hdfs://hbase.master/hbase/mytable > > > > > > > /0a177633196cae7b158836181d69dc0f > > > > > > > 1076888812 hdfs://hbase.master/hbase/mytable > > > > > > > /0d52fc477c41a9a236803234d44c7c06 > > > > > > > > > > > > > > [3] > > > > > > > You can get data from JMX directly using any tool you like or > > use: > > > > > > > * Ganglia > > > > > > > * SPM monitoring ( > > > > > > > > http://sematext.com/spm/hbase-performance-monitoring/index.html) > > > > > > > * others > > > > > > > > > > > > > > > > > > > > > On Wed, Jul 25, 2012 at 1:59 AM, Adrien Mogenet < > > > > > > adrien.moge...@gmail.com > > > > > > > >wrote: > > > > > > > > > > > > > > > From the web-interface, you can have such statistics when > > viewing > > > > the > > > > > > > > details of a table. > > > > > > > > You can also develop your own "balance viewer" through the > > HBase > > > > API > > > > > > > (list > > > > > > > > of RS, regions, storeFiles, their size, etc.) > > > > > > > > > > > > > > > > On Wed, Jul 25, 2012 at 7:32 AM, Mohit Anchlia < > > > > > mohitanch...@gmail.com > > > > > > > > >wrote: > > > > > > > > > > > > > > > > > Is there an easy way to tell how my nodes are balanced and > > how > > > > the > > > > > > rows > > > > > > > > are > > > > > > > > > distributed in the cluster? > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > > > > > Adrien Mogenet > > > > > > > > 06.59.16.64.22 > > > > > > > > http://www.mogenet.me > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > > > > Alex Baranau > > > > > > > ------ > > > > > > > Sematext :: http://blog.sematext.com/ :: Hadoop - HBase - > > > > > ElasticSearch > > > > > > - > > > > > > > Solr > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > > Alex Baranau > > > > > ------ > > > > > Sematext :: http://blog.sematext.com/ :: Hadoop - HBase - > > > ElasticSearch > > > > - > > > > > Solr > > > > > > > > > > > > > > > > > > > > > -- > > > Alex Baranau > > > ------ > > > Sematext :: http://blog.sematext.com/ :: Hadoop - HBase - > ElasticSearch > > - > > > Solr > > > > > > > > > -- > Alex Baranau > ------ > Sematext :: http://blog.sematext.com/ :: Hadoop - HBase - ElasticSearch - > Solr >