One thought to ponder:

If you are going to be splitting continuously and at a quicker pace, do you
have a strategy/plan to merge old regions? Otherwise, you can end up with a
cluster that has a proliferation of regions.
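
To make the merge idea concrete, here is a rough sketch of how candidate pairs could be picked. It is only an illustration under assumptions: the `Region` tuple, the `merge_candidates` helper, the cutoff date and the size limit are all hypothetical, and the region list would have to come from your own cluster metadata. It pairs adjacent regions whose keys all fall before a cutoff date and whose combined size stays small.

```python
from typing import List, NamedTuple, Tuple


class Region(NamedTuple):
    name: str        # hypothetical region identifier
    start_key: str   # row keys look like "yyyy-mm-dd#..."
    end_key: str     # exclusive upper bound; "" means unbounded
    size_mb: int


def merge_candidates(regions: List[Region], cutoff_date: str,
                     max_merged_mb: int) -> List[Tuple[str, str]]:
    """Pair up adjacent regions that only hold rows older than cutoff_date."""
    ordered = sorted(regions, key=lambda r: r.start_key)
    pairs = []
    i = 0
    while i + 1 < len(ordered):
        left, right = ordered[i], ordered[i + 1]
        adjacent = left.end_key == right.start_key
        # The end key is exclusive, so every row in `right` sorts below it.
        old = right.end_key != "" and right.end_key <= cutoff_date
        small = left.size_mb + right.size_mb <= max_merged_mb
        if adjacent and old and small:
            pairs.append((left.name, right.name))
            i += 2  # both regions are consumed by this merge
        else:
            i += 1
    return pairs
```

Each returned pair could then be handed to whatever merge mechanism your HBase version offers (for example the shell's merge_region command), ideally during a quiet period.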

Regards,
Shahab

On Tue, Aug 18, 2015 at 3:55 PM, Shushant Arora <shushantaror...@gmail.com>
wrote:

> For an HBase row key with time as its prefix, say (yyyy-mm-dd#other fields of
> guid base), I am using bulk load to avoid hotspotting a regionserver (and to
> avoid writing to the WAL).
>
> What should the initial region splits be? Say I have 30 regionservers.
>
> Will it work to use the first 30 days as the initial splits and then let auto
> split take care of splitting regions as they grow further?
> Or, since the key has the date as its prefix, when a region is split in two at
> the midpoint and new data arrives only for increasing dates, will that leave
> one region always half filled, with the other half never filled?
>
> On Tue, Aug 18, 2015 at 9:41 PM, anil gupta <anilgupt...@gmail.com> wrote:
>
> > In my experience, Phoenix is far superior to the Hive-HBase integration
> > for SQL-like querying on HBase, because Phoenix is built on top of
> > HBase, unlike Hive.
> >
> > On Tue, Aug 18, 2015 at 9:09 AM, Ted Yu <yuzhih...@gmail.com> wrote:
> >
> > > To my knowledge, Phoenix provides better integration with hbase.
> > >
> > > A third possibility is Spark on HBase.
> > >
> > > If you want to explore these alternatives, I suggest asking on the
> > > respective mailing lists, where you can get expert opinions.
> > >
> > > Cheers
> > >
> > > On Tue, Aug 18, 2015 at 9:03 AM, Shushant Arora <shushantaror...@gmail.com> wrote:
> > >
> > > > Thanks!
> > > >
> > > > Which one is better for SQL-like queries over HBase (queries that
> > > > involve filters, key range scans, and aggregates by column values)?
> > > >
> > > > 1. Hive storage handlers
> > > > 2. Phoenix
> > > > On Tue, Aug 18, 2015 at 9:14 PM, Ted Yu <yuzhih...@gmail.com> wrote:
> > > >
> > > > > > For #1, if you want to count distinct values for F1, you can write a
> > > > > > coprocessor which aggregates the count on the region server and returns
> > > > > > the result to the client, which does the final aggregation.
> > > > > >
> > > > > > Take a look at
> > > > > > hbase-server/src/main/java/org/apache/hadoop/hbase/coprocessor/AggregateImplementation.java
> > > > > > and related classes for an example.
> > > > >
> > > > > > On Mon, Aug 17, 2015 at 10:08 PM, Shushant Arora <shushantaror...@gmail.com> wrote:
> > > > >
> > > > > > > Thanks!
> > > > > > > A few more doubts:
> > > > > > >
> > > > > > > 1. Say the requirement is to count the distinct values of F1.
> > > > > > >
> > > > > > > If the field is part of the key, can't HBase just scan the keys, skip
> > > > > > > value deserialisation, and return the result to the client, which will
> > > > > > > calculate the distinct values? In the second approach, HBase will
> > > > > > > deserialise the value of the column containing F1 and return it to the
> > > > > > > client, which will calculate the distinct values.
> > > > > > >
> > > > > > > 2. For bulk load, when LoadIncrementalHFiles runs and the regionserver
> > > > > > > moves the HFiles from HDFS to the region directory, does the regionserver
> > > > > > > localise the HFile by downloading it locally and then uploading it again
> > > > > > > to the region directory? Or does it just move it to the region directory
> > > > > > > and wait for the next compaction to localise it, as in the regionserver
> > > > > > > failure case?
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > > On Mon, Aug 17, 2015 at 11:00 PM, Ted Yu <yuzhih...@gmail.com> wrote:
> > > > > >
> > > > > > > > For both scenarios you mentioned, the field is not the leading part of
> > > > > > > > the row key. You would need to specify a time range or start row / stop
> > > > > > > > row to narrow the key range being scanned.
> > > > > > >
> > > > > > > I am leaning toward using second approach.
> > > > > > >
> > > > > > > Cheers
> > > > > > >
> > > > > > > > On Mon, Aug 17, 2015 at 9:41 AM, Shushant Arora <shushantaror...@gmail.com> wrote:
> > > > > > >
> > > > > > > > > ~8-10 fields (5 of 20 bytes each and 3 of 200 bytes each).
> > > > > > > >
> > > > > > > > > On Mon, Aug 17, 2015 at 9:55 PM, Ted Yu <yuzhih...@gmail.com> wrote:
> > > > > > > >
> > > > > > > > > > How many fields such as F1 are you considering for embedding in the
> > > > > > > > > > row key?
> > > > > > > > >
> > > > > > > > > Suggested reading:
> > > > > > > > > http://hbase.apache.org/book.html#rowkey.design
> > > > > > > > > http://hbase.apache.org/book.html#client.filter.kvm (see
> > > > > > > > > ColumnPrefixFilter)
> > > > > > > > >
> > > > > > > > > Cheers
> > > > > > > > >
> > > > > > > > > > On Mon, Aug 17, 2015 at 8:13 AM, Shushant Arora <shushantaror...@gmail.com> wrote:
> > > > > > > > >
> > > > > > > > > > > 1. So the size limit is per cell's identifier + value?
> > > > > > > > > > >
> > > > > > > > > > > Which is more optimal: to have the field in the key or in a column of
> > > > > > > > > > > a column family, if the pattern is that every row has that field?
> > > > > > > > > > >
> > > > > > > > > > > Say I have a field F1 in all rows, so:
> > > > > > > > > > > Situation-1
> > > > > > > > > > > key1#F1 (as composite key), and the rest of the fields in columns
> > > > > > > > > > >
> > > > > > > > > > > Situation-2
> > > > > > > > > > > key1 as the key and F1 as part of a column family.
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > This is the main reason I asked about the key size limit.
> > > > > > > > > > > If I ask for the number of rows where F1 = 'someval', will it be
> > > > > > > > > > > faster in situation-1 than in situation-2, since in 1 it can return
> > > > > > > > > > > the result just by traversing keys, with no need to read columns?
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > On Mon, Aug 17, 2015 at 8:27 PM, Ted Yu <yuzhih...@gmail.com> wrote:
> > > > > > > > > >
> > > > > > > > > > > > For #1, it is the limit on a single keyvalue, not row, not key.
> > > > > > > > > > > >
> > > > > > > > > > > > For #2, please see the following:
> > > > > > > > > > > >
> > > > > > > > > > > > http://hbase.apache.org/book.html#store.memstore
> > > > > > > > > > > > http://hbase.apache.org/book.html#regionserver_splitting_implementation
> > > > > > > > > > >
> > > > > > > > > > > Cheers
> > > > > > > > > > >
> > > > > > > > > > > > On Mon, Aug 17, 2015 at 7:36 AM, Shushant Arora <shushantaror...@gmail.com> wrote:
> > > > > > > > > > >
> > > > > > > > > > > > > 1. Is hbase.client.keyvalue.maxsize the max size of a row or of the
> > > > > > > > > > > > > key only? Is there any limit on key size alone?
> > > > > > > > > > > > > 2. The access pattern is mostly key based. Are memstores and regions
> > > > > > > > > > > > > on a regionserver per table? If I have multiple tables, will it have
> > > > > > > > > > > > > multiple memstores instead of the few it would have had with one
> > > > > > > > > > > > > large table?
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > > On Mon, Aug 17, 2015 at 7:29 PM, Ted Yu <yuzhih...@gmail.com> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > > For #1, take a look at the following in hbase-default.xml:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >     <name>hbase.client.keyvalue.maxsize</name>
> > > > > > > > > > > > > >     <value>10485760</value>
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > For #2, it would be easier to answer if you can outline the access
> > > > > > > > > > > > > > patterns in your app.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > For #3, adjustment according to current region boundaries is done
> > > > > > > > > > > > > > client side. Take a look at the javadoc for LoadQueueItem
> > > > > > > > > > > > > > in LoadIncrementalHFiles.java
> > > > > > > > > > > > >
> > > > > > > > > > > > > Cheers
> > > > > > > > > > > > >
> > > > > > > > > > > > > > On Mon, Aug 17, 2015 at 6:45 AM, Shushant Arora <shushantaror...@gmail.com> wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > > > 1. Is there any max limit on the key size of an HBase table?
> > > > > > > > > > > > > > > 2. Multiple small tables vs. one large table: which one is preferred?
> > > > > > > > > > > > > > > 3. For bulk load, when LoadIncrementalHFiles is run, it again
> > > > > > > > > > > > > > > recalculates the region splits based on region boundaries. Does this
> > > > > > > > > > > > > > > division happen on the client side, or on the server side at the
> > > > > > > > > > > > > > > region server or HBase master, which then assigns the splits that
> > > > > > > > > > > > > > > cross a target region boundary to the desired regionserver?
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> >
> >
> > --
> > Thanks & Regards,
> > Anil Gupta
> >
>

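On the initial-split question quoted at the top of the thread, one simple option is a split point at each day boundary. Below is a small sketch of generating those split points, assuming row keys that start with yyyy-mm-dd as described; the `daily_split_keys` name is made up for illustration, and the resulting strings would still need to be passed to whatever table-creation call you use for pre-splitting.

```python
from datetime import date, timedelta
from typing import List


def daily_split_keys(first_day: date, days: int) -> List[str]:
    """Return the split points that carve `days` one-day regions.

    N regions need N - 1 split points: the first region is open at the
    bottom and the last one is open at the top.
    """
    if days < 1:
        raise ValueError("days must be at least 1")
    # A key such as "2015-08-19#abc" sorts after the bare prefix "2015-08-19",
    # so each day's rows land in the region that starts at that day's prefix.
    return [(first_day + timedelta(days=n)).isoformat() for n in range(1, days)]
```

For example, `daily_split_keys(date(2015, 8, 18), 30)` yields 29 split points for 30 one-day regions. Note that this only spreads the data by day; all rows for a given day still fall into a single region.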