One thought to ponder: if you are going to be splitting continuously and at a quick pace, do you have a strategy/plan to merge old regions? Otherwise, you can end up with a cluster with a proliferation of regions.
Regards,
Shahab

On Tue, Aug 18, 2015 at 3:55 PM, Shushant Arora <shushantaror...@gmail.com> wrote:

> For an hbase key containing time as a prefix, say (yyyy-mm-dd#other fields
> of guid base), I am using bulk load to avoid hot spotting of a regionserver
> (avoiding writes to the WAL).
>
> What should the initial region splits be? Say I have 30 regionservers.
>
> Will it work to use the initial 30 days as the initial splits and let auto
> split take care of splitting regions as they grow further?
> Or, since the key has the date as a prefix, when a region is split in two
> midway and new data comes only for increasing dates, will that lead to one
> region always being half filled and the other half never being filled?
>
> On Tue, Aug 18, 2015 at 9:41 PM, anil gupta <anilgupt...@gmail.com> wrote:
>
> > In my experience, Phoenix is far superior to Hive-HBase integration
> > for sql-like querying on HBase. That is because Phoenix is built on top of
> > HBase, unlike Hive.
> >
> > On Tue, Aug 18, 2015 at 9:09 AM, Ted Yu <yuzhih...@gmail.com> wrote:
> >
> > > To my knowledge, Phoenix provides better integration with hbase.
> > >
> > > A third possibility is Spark on HBase.
> > >
> > > If you want to explore these alternatives, I suggest asking on the
> > > respective mailing lists where you can get expert opinions.
> > >
> > > Cheers
> > >
> > > On Tue, Aug 18, 2015 at 9:03 AM, Shushant Arora <shushantaror...@gmail.com> wrote:
> > >
> > > > Thanks!
> > > >
> > > > Which one is better for sql-like queries over hbase (queries involve
> > > > filter, key range scan), aggregates by column values?
> > > >
> > > > 1. Hive storage handlers
> > > > 2. or Phoenix
> > > >
> > > > On Tue, Aug 18, 2015 at 9:14 PM, Ted Yu <yuzhih...@gmail.com> wrote:
> > > >
> > > > > For #1, if you want to count distinct values for F1, you can write a
> > > > > coprocessor which aggregates the count on region server and returns the
> > > > > result to client which does the final aggregation.
> > > > >
> > > > > Take a look at
> > > > > hbase-server/src/main/java/org/apache/hadoop/hbase/coprocessor/AggregateImplementation.java
> > > > > and related classes for example.
> > > > >
> > > > > On Mon, Aug 17, 2015 at 10:08 PM, Shushant Arora <shushantaror...@gmail.com> wrote:
> > > > >
> > > > > > Thanks!
> > > > > > A few more doubts:
> > > > > >
> > > > > > 1. Say the requirement is to count distinct values of F1.
> > > > > >
> > > > > > If the field is part of the key, can't hbase just scan the keys, skip
> > > > > > value deserialisation, and return the result to the client, which will
> > > > > > calculate the distinct count? In the second approach, hbase will
> > > > > > deserialise the value of the returned column containing F1 and send it
> > > > > > to the client, which will calculate the distinct count.
> > > > > >
> > > > > > 2. For bulk load, when LoadIncrementalHFiles runs and the regionserver
> > > > > > moves the hfiles from hdfs to the region directory, does the regionserver
> > > > > > localise the hfile by downloading it to local and then uploading it again
> > > > > > into the region directory? Or does it just move it to the region directory
> > > > > > and wait for the next compaction to get it localised, as in the
> > > > > > regionserver failure case?
> > > > > >
> > > > > > On Mon, Aug 17, 2015 at 11:00 PM, Ted Yu <yuzhih...@gmail.com> wrote:
> > > > > >
> > > > > > > For both scenarios you mentioned, the field is not the leading part of
> > > > > > > the row key. You would need to specify timerange or start row / stop row
> > > > > > > to narrow the key range being scanned.
> > > > > > >
> > > > > > > I am leaning toward using the second approach.
> > > > > > >
> > > > > > > Cheers
> > > > > > >
> > > > > > > On Mon, Aug 17, 2015 at 9:41 AM, Shushant Arora <shushantaror...@gmail.com> wrote:
> > > > > > >
> > > > > > > > ~8-10 fields of size (5 of 20 bytes each) and 3 fields of size 200 bytes
> > > > > > > > each.
> > > > > > > >
> > > > > > > > On Mon, Aug 17, 2015 at 9:55 PM, Ted Yu <yuzhih...@gmail.com> wrote:
> > > > > > > >
> > > > > > > > > How many fields such as F1 are you considering for embedding in the
> > > > > > > > > row key?
> > > > > > > > >
> > > > > > > > > Suggested reading:
> > > > > > > > > http://hbase.apache.org/book.html#rowkey.design
> > > > > > > > > http://hbase.apache.org/book.html#client.filter.kvm (see
> > > > > > > > > ColumnPrefixFilter)
> > > > > > > > >
> > > > > > > > > Cheers
> > > > > > > > >
> > > > > > > > > On Mon, Aug 17, 2015 at 8:13 AM, Shushant Arora <shushantaror...@gmail.com> wrote:
> > > > > > > > >
> > > > > > > > > > 1. So the size limit is per cell's identifier + value?
> > > > > > > > > >
> > > > > > > > > > Which is more optimal: to have the field in the key or in a column
> > > > > > > > > > family's column, if the pattern is that every row has that field?
> > > > > > > > > >
> > > > > > > > > > Say I have a field F1 in all rows, so:
> > > > > > > > > > Situation-1
> > > > > > > > > > key1#F1 (as composite key) - and rest fields in column
> > > > > > > > > >
> > > > > > > > > > Situation-2
> > > > > > > > > > key1 as key and F1 part of column family.
> > > > > > > > > >
> > > > > > > > > > This is the main reason I asked about the key size limit.
> > > > > > > > > > If I ask for the number of rows where F1 = 'someval', will it be
> > > > > > > > > > faster in situation-1 than in situation-2?
> > > > > > > > > > Since in 1 it can return the result just by traversing keys, with
> > > > > > > > > > no need to read columns?
> > > > > > > > > >
> > > > > > > > > > On Mon, Aug 17, 2015 at 8:27 PM, Ted Yu <yuzhih...@gmail.com> wrote:
> > > > > > > > > >
> > > > > > > > > > > For #1, it is the limit on a single keyvalue, not row, not key.
> > > > > > > > > > >
> > > > > > > > > > > For #2, please see the following:
> > > > > > > > > > >
> > > > > > > > > > > http://hbase.apache.org/book.html#store.memstore
> > > > > > > > > > >
> > > > > > > > > > > http://hbase.apache.org/book.html#regionserver_splitting_implementation
> > > > > > > > > > >
> > > > > > > > > > > Cheers
> > > > > > > > > > >
> > > > > > > > > > > On Mon, Aug 17, 2015 at 7:36 AM, Shushant Arora <shushantaror...@gmail.com> wrote:
> > > > > > > > > > >
> > > > > > > > > > > > 1. Is hbase.client.keyvalue.maxsize the max size of the row or of
> > > > > > > > > > > > the key only? Is there any limit on key size only?
> > > > > > > > > > > > 2. Access pattern is mostly key based only. Are memstores and
> > > > > > > > > > > > regions on a regionserver on a per-table basis? If I have multiple
> > > > > > > > > > > > tables, will it have multiple memstores instead of the few it would
> > > > > > > > > > > > have had with one large table?
> > > > > > > > > > > >
> > > > > > > > > > > > On Mon, Aug 17, 2015 at 7:29 PM, Ted Yu <yuzhih...@gmail.com> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > For #1, take a look at the following in hbase-default.xml:
> > > > > > > > > > > > >
> > > > > > > > > > > > > <name>hbase.client.keyvalue.maxsize</name>
> > > > > > > > > > > > > <value>10485760</value>
> > > > > > > > > > > > >
> > > > > > > > > > > > > For #2, it would be easier to answer if you can outline access
> > > > > > > > > > > > > patterns in your app.
> > > > > > > > > > > > >
> > > > > > > > > > > > > For #3, adjustment according to current region boundaries is done
> > > > > > > > > > > > > client side. Take a look at the javadoc for LoadQueueItem
> > > > > > > > > > > > > in LoadIncrementalHFiles.java
> > > > > > > > > > > > >
> > > > > > > > > > > > > Cheers
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Mon, Aug 17, 2015 at 6:45 AM, Shushant Arora <shushantaror...@gmail.com> wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > > 1. Is there any max limit on the key size of an hbase table?
> > > > > > > > > > > > > > 2. Multiple small tables vs one large table: which one is
> > > > > > > > > > > > > > preferred?
> > > > > > > > > > > > > > 3. For bulk load, when LoadIncrementalHFiles is run, it again
> > > > > > > > > > > > > > recalculates the region splits based on region boundaries. Does
> > > > > > > > > > > > > > this division happen on the client side or on the server side,
> > > > > > > > > > > > > > at the region server or hbase master, which then assigns the
> > > > > > > > > > > > > > splits that cross the target region boundary to the desired
> > > > > > > > > > > > > > regionserver?
> >
> > --
> > Thanks & Regards,
> > Anil Gupta
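
On the initial-splits question near the top of this thread (30 regionservers, keys prefixed with yyyy-mm-dd), here is a minimal sketch of deriving one split point per day. It is plain Python and does not touch HBase. The function name daily_split_keys and the start date are assumptions made for illustration, and how the keys get passed to table creation depends on your tooling. Note that N split points give N+1 regions, so 30 daily regions need 29 boundaries.

```python
from __future__ import annotations

from datetime import date, timedelta


def daily_split_keys(first_day: date, days: int) -> list[bytes]:
    """Return one split key per day boundary after first_day.

    Keys use the yyyy-mm-dd prefix described in the thread. The first day
    needs no split key of its own: it falls in the region that precedes
    the first boundary.
    """
    return [
        (first_day + timedelta(days=offset)).isoformat().encode("ascii")
        for offset in range(1, days)
    ]


# Hypothetical start date; 30 days -> 29 boundaries -> 30 regions.
splits = daily_split_keys(date(2015, 8, 1), 30)
print(len(splits), splits[0], splits[-1])
```

This only produces the boundary keys. It does not remove the concern raised in the question: with a date prefix that only increases, new data still lands in the newest key range, and regions for past dates stop growing, which is where a plan for merging old regions comes in.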