hbase uniformsplit for non hex keys

2016-05-31 Thread Shushant Arora
1. Can I use UniformSplit for non-hex keys?
2. If yes, how do I specify the key range for the split?
3. If no, what is the difference between HexStringSplit and UniformSplit?

Thanks!
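
A minimal sketch of pre-splitting a table with UniformSplit from the Java
client, assuming the 1.x-era admin API (table and family names are
illustrative). Unlike HexStringSplit, UniformSplit partitions the raw byte
keyspace, so it does not require hex-encoded keys:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.util.RegionSplitter;

Configuration conf = HBaseConfiguration.create();
HBaseAdmin admin = new HBaseAdmin(conf);
HTableDescriptor desc = new HTableDescriptor(TableName.valueOf("mytable"));
desc.addFamily(new HColumnDescriptor("f1"));
// UniformSplit divides the full byte keyspace (0x00... to 0xFF...) into
// equal-sized ranges, so it works for arbitrary binary keys.
byte[][] splits = new RegionSplitter.UniformSplit().split(10);
admin.createTable(desc, splits);
admin.close();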


Re: hbase get and mvcc

2016-05-16 Thread Shushant Arora
thanks!

Are puts that fall inside the readpoint of an ongoing scan/get preserved in
the HFile as well, or only in the memstore? And does that block the memstore
flush until all ongoing scans are completed?



On Tue, May 17, 2016 at 5:31 AM, Stack <st...@duboce.net> wrote:

> On Mon, May 16, 2016 at 4:55 PM, Shushant Arora <shushantaror...@gmail.com
> >
> wrote:
>
> > Hi
> >
> > HBase uses MVCC to achieve consistent results for Get operations.
> > To achieve MVCC it has to maintain multiple versions of the same
> > row/cells. What is the maximum number of versions of a row/cell that
> > HBase keeps at any time to support MVCC?
> >
> > Say multiple gets started one after the other and have not completed
> > yet, and multiple puts are also occurring in between. Does it then
> > maintain all versions whose read point is still in use?
> >
> >
> Yes.
>
> All ongoing Gets/Scans are registered on startup with their current
> readpoint (see HRegion; see constructor for HRegionScannerImpl). Any Put
> that falls inside the readpoint of currently ongoing Gets/Scans will be
> preserved while the Get/Scan is ongoing.
>
> St.Ack
>
>
>
> > Thanks!
> >
>
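
To make the readpoint idea concrete, here is a toy sketch of the principle
(deliberately simplified; this is not HBase's actual
MultiVersionConcurrencyControl): writes are stamped with increasing sequence
ids, a reader pins the highest id it is allowed to see when it opens, and
versions newer than that readpoint stay invisible to it. Versions older than
the oldest registered readpoint are the ones that become safe to drop.

import java.util.Map;
import java.util.concurrent.ConcurrentSkipListMap;
import java.util.concurrent.atomic.AtomicLong;

class MiniMvccStore {
  private final AtomicLong seq = new AtomicLong(0);
  // key -> (sequence id -> value); newer versions have higher sequence ids
  private final ConcurrentSkipListMap<String, ConcurrentSkipListMap<Long, String>> store =
      new ConcurrentSkipListMap<>();

  void put(String key, String value) {
    long id = seq.incrementAndGet();
    store.computeIfAbsent(key, k -> new ConcurrentSkipListMap<>()).put(id, value);
  }

  long openReadPoint() { // captured when a Get/Scan starts
    return seq.get();
  }

  String get(String key, long readPoint) {
    ConcurrentSkipListMap<Long, String> versions = store.get(key);
    if (versions == null) return null;
    // Highest version at or below the caller's readpoint; later puts are invisible.
    Map.Entry<Long, String> e = versions.floorEntry(readPoint);
    return e == null ? null : e.getValue();
  }
}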


hbase get and mvcc

2016-05-16 Thread Shushant Arora
Hi

HBase uses MVCC to achieve consistent results for Get operations.
To achieve MVCC it has to maintain multiple versions of the same row/cells.
What is the maximum number of versions of a row/cell that HBase keeps at any
time to support MVCC?

Say multiple gets started one after the other and have not completed yet,
and multiple puts are also occurring in between. Does it then maintain all
versions whose read point is still in use?

Thanks!


hfile v2 and bloomfilter

2016-05-15 Thread Shushant Arora
In HFile v2, block-level bloom filters are stored in the scanned section
along with the data blocks and the leaf index.

The load-on-open section contains bloom filter data. What is this bloom
filter data?
1. Does it contain an index of the bloom chunks stored in the scanned section?
2. What do the meta blocks of the non-scanned section contain?
3. Does the leaf-level index contain row keys only? Will having a tall table
vs. a wide table affect the size of the leaf index?

Thanks!
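
Not an answer to the layout internals, but for reference: the bloom data in
question is controlled per column family from the client side. A minimal
sketch, assuming the 1.x API (names illustrative):

import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.regionserver.BloomType;

HColumnDescriptor cf = new HColumnDescriptor("f1");
// ROW blooms on the row key only; ROWCOL blooms on row + column qualifier
// and therefore produces larger bloom data in each HFile.
cf.setBloomFilterType(BloomType.ROW);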


hbase block and columnfamily of a row

2016-05-14 Thread Shushant Arora
Can an HBase table with a single column family have a row spanning multiple
blocks in the same HFile?

Suppose there is only one HFile. In that case, is it possible that a column
family with 5-6 columns spans multiple blocks? Or is a block always closed
at the max (64 KB by default) or when all columns of a column family for a
single row fit in that block?

Thanks!
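
For reference, the block size being discussed is a per-column-family
setting; a minimal sketch, assuming the 1.x API (family name illustrative):

import org.apache.hadoop.hbase.HColumnDescriptor;

HColumnDescriptor cf = new HColumnDescriptor("f1");
// A block is closed once it exceeds this size (at a cell boundary), not at
// row boundaries, so a large row's cells can still span multiple blocks.
cf.setBlocksize(64 * 1024); // the 64 KB default, made explicit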


hbase zookeeper lag

2016-05-14 Thread Shushant Arora
Hi

HBase uses ZooKeeper for various purposes, e.g. for region splits.

The regionserver creates a znode in ZooKeeper with the splitting state, and
the master gets a notification for this znode. Since ZooKeeper is not fully
consistent, there may be a lag between the actual znode creation and the
notification; in the meantime the regionserver will already have started
splitting.
1. Will this lag create an issue: the region is already split in two, but
the master does not even know about it until the ZooKeeper lag is cleared?


Also, when a regionserver goes down the master is notified, but there can be
lag there too. It can happen that a ZooKeeper node lags far behind, say
~2 minutes, so the master is only notified after 2 minutes.
2. Won't this lag create an issue: the client will get region-not-reachable
errors and retry with backoff, but the actual recovery of the regionserver
will only start after 2 minutes?

Thanks!


Re: hbase architecture doubts

2016-05-09 Thread Shushant Arora
4. Can the same row be in 2 blocks in an HFile: one cell in block 1 and
another in block 2?

On Mon, May 9, 2016 at 4:57 PM, Shushant Arora <shushantaror...@gmail.com>
wrote:

> Thanks!
>
> 1. Will a write take a lock on all the column families or just the column
> family being affected by the write?
>
> 2. How is eviction in LruBlockCache implemented for the in-memory and
> multi-access priorities? Say all elements of the in-memory priority area
> (25%) are more recently used than the single- and multi-access areas. Now
> if a new in-memory row comes in, will it evict from the in-memory or the
> single-access area?
>
> 3. Why is there a single block cache per regionserver? Why not one per
> region?
>
>
> On Sun, May 8, 2016 at 11:43 PM, Stack <st...@duboce.net> wrote:
>
>> On Sun, May 8, 2016 at 6:12 AM, Shushant Arora <shushantaror...@gmail.com
>> >
>> wrote:
>>
>> > Thanks !
>> >
>> > One doubt regarding locking in the memstore:
>> >
>> > HBase uses an implicit row lock while applying a put operation on a row:
>> >
>> > put(byte[] rowkey)
>> >
>> > When htable.put(p) is fired, the regionserver will lock the row, but get
>> > operations will not lock the row and will return the row state as it was
>> > before the put took the lock.
>> >
>> > The memstore is implemented as a CSLM, so how does it return the row
>> > state prior to the put lock when a get is fired before the put finishes?
>> >
>> >
>> Multiversion Concurrency Control. This is the core class:
>>
>> http://hbase.apache.org/xref/org/apache/hadoop/hbase/regionserver/MultiVersionConcurrencyControl.html
>> See how it is used in the codebase.
>>
>> Ask more questions if not clear.
>> St.Ack
>>
>>
>>
>> > On Tue, May 3, 2016 at 7:41 AM, Stack <st...@duboce.net> wrote:
>> >
>> > > On Mon, May 2, 2016 at 5:34 PM, Shushant Arora <
>> > shushantaror...@gmail.com>
>> > > wrote:
>> > >
>> > > > Thanks Stack.
>> > > >
>> > > > 1.So is it at any time there will be two reference 1.active memstore
>> > > > 2.snapshot memstore
>> > > > snapshot will be initialised at time of flush using active memstore
>> > with
>> > > a
>> > > > momentaily lock and then active will be discarded and read will be
>> > served
>> > > > usinmg snapshot and write will go to new active memstore.
>> > > >
>> > > >
>> > > Yes
>> > >
>> > >
>> > > > 2key of CSLS is keyvalue . Which part of keyValue is used while
>> sorting
>> > > the
>> > > > set. Is it whole keyvalue or just row key. Does Hfile has separate
>> > entry
>> > > > for each key value and keyvalues of same row key are always stored
>> > > > contiguosly in HFile and may not be in same block?
>> > > >
>> > > >
>> > > Just the row key. Value is not considered in the sort.
>> > >
>> > > Yes, HFile has separate entry for each KeyValue (or 'Cell' in
>> > hbase-speak).
>> > >
>> > > Cells in HFile are sorted. Those of the same or near 'Cell'
>> coordinates
>> > > will be sorted together and may therefore appear inside the same
>> block.
>> > >
>> > > St.Ack
>> > >
>> > >
>> > >
>> > > > On Tue, May 3, 2016 at 12:05 AM, Stack <st...@duboce.net> wrote:
>> > > >
>> > > > > On Mon, May 2, 2016 at 10:06 AM, Shushant Arora <
>> > > > shushantaror...@gmail.com
>> > > > > >
>> > > > > wrote:
>> > > > >
>> > > > > > Thanks Stack
>> > > > > >
>> > > > > > for point 2 :
>> > > > > > I am concerned with downtime of Hbase for read and write.
>> > > > > > If write lock is just for the time while we move aside the
>> current
>> > > > > > MemStore.
>> > > > > > Then when a write happens to key will it update the memstore
>> only
>> > but
>> > > > > > snapshot does not have that update and when snapshot is dunmped
>> to
>> > > > Hfile
>> > > > > > won't we loose the update?
>> > > > > >
>> > > > > >
>> > > > > >
> > > > > > No. The update is in the new currently active MemStore. The update
> > > > > > will be included in the next flush added to a new hfile.

Re: hbase architecture doubts

2016-05-09 Thread Shushant Arora
Thanks!

1. Will a write take a lock on all the column families or just the column
family being affected by the write?

2. How is eviction in LruBlockCache implemented for the in-memory and
multi-access priorities? Say all elements of the in-memory priority area
(25%) are more recently used than the single- and multi-access areas. Now if
a new in-memory row comes in, will it evict from the in-memory or the
single-access area?

3. Why is there a single block cache per regionserver? Why not one per
region?


On Sun, May 8, 2016 at 11:43 PM, Stack <st...@duboce.net> wrote:

> On Sun, May 8, 2016 at 6:12 AM, Shushant Arora <shushantaror...@gmail.com>
> wrote:
>
> > Thanks !
> >
> > One doubt regarding locking in the memstore:
> >
> > HBase uses an implicit row lock while applying a put operation on a row:
> >
> > put(byte[] rowkey)
> >
> > When htable.put(p) is fired, the regionserver will lock the row, but get
> > operations will not lock the row and will return the row state as it was
> > before the put took the lock.
> >
> > The memstore is implemented as a CSLM, so how does it return the row
> > state prior to the put lock when a get is fired before the put finishes?
> >
> >
> Multiversion Concurrency Control. This is the core class:
>
> http://hbase.apache.org/xref/org/apache/hadoop/hbase/regionserver/MultiVersionConcurrencyControl.html
> See how it is used in the codebase.
>
> Ask more questions if not clear.
> St.Ack
>
>
>
> > On Tue, May 3, 2016 at 7:41 AM, Stack <st...@duboce.net> wrote:
> >
> > > On Mon, May 2, 2016 at 5:34 PM, Shushant Arora <
> > shushantaror...@gmail.com>
> > > wrote:
> > >
> > > > Thanks Stack.
> > > >
> > > > 1.So is it at any time there will be two reference 1.active memstore
> > > > 2.snapshot memstore
> > > > snapshot will be initialised at time of flush using active memstore
> > with
> > > a
> > > > momentaily lock and then active will be discarded and read will be
> > served
> > > > usinmg snapshot and write will go to new active memstore.
> > > >
> > > >
> > > Yes
> > >
> > >
> > > > 2key of CSLS is keyvalue . Which part of keyValue is used while
> sorting
> > > the
> > > > set. Is it whole keyvalue or just row key. Does Hfile has separate
> > entry
> > > > for each key value and keyvalues of same row key are always stored
> > > > contiguosly in HFile and may not be in same block?
> > > >
> > > >
> > > Just the row key. Value is not considered in the sort.
> > >
> > > Yes, HFile has separate entry for each KeyValue (or 'Cell' in
> > hbase-speak).
> > >
> > > Cells in HFile are sorted. Those of the same or near 'Cell' coordinates
> > > will be sorted together and may therefore appear inside the same block.
> > >
> > > St.Ack
> > >
> > >
> > >
> > > > On Tue, May 3, 2016 at 12:05 AM, Stack <st...@duboce.net> wrote:
> > > >
> > > > > On Mon, May 2, 2016 at 10:06 AM, Shushant Arora <
> > > > shushantaror...@gmail.com
> > > > > >
> > > > > wrote:
> > > > >
> > > > > > Thanks Stack
> > > > > >
> > > > > > for point 2 :
> > > > > > I am concerned with downtime of Hbase for read and write.
> > > > > > If write lock is just for the time while we move aside the
> current
> > > > > > MemStore.
> > > > > > Then when a write happens to key will it update the memstore only
> > but
> > > > > > snapshot does not have that update and when snapshot is dunmped
> to
> > > > Hfile
> > > > > > won't we loose the update?
> > > > > >
> > > > > >
> > > > > >
> > > > > No. The update is in the new currently active MemStore. The update
> > will
> > > > be
> > > > > included in the next flush added to a new hfile.
> > > > >
> > > > > St.Ack
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > > On Mon, May 2, 2016 at 9:06 PM, Stack <st...@duboce.net> wrote:
> > > > > >
> > > > > > > On Mon, May 2, 2016 at 1:25 AM, Shushant Arora <
> > > > > > shushantaror...@gmail.com>
> > > > > > > wrote:
> > > > >

Re: hbase architecture doubts

2016-05-08 Thread Shushant Arora
Thanks !

One doubt regarding locking in the memstore:

HBase uses an implicit row lock while applying a put operation on a row:

put(byte[] rowkey)

When htable.put(p) is fired, the regionserver will lock the row, but get
operations will not lock the row and will return the row state as it was
before the put took the lock.

The memstore is implemented as a CSLM, so how does it return the row state
prior to the put lock when a get is fired before the put finishes?

On Tue, May 3, 2016 at 7:41 AM, Stack <st...@duboce.net> wrote:

> On Mon, May 2, 2016 at 5:34 PM, Shushant Arora <shushantaror...@gmail.com>
> wrote:
>
> > Thanks Stack.
> >
> > 1. So at any time there will be two references: 1. the active memstore,
> > 2. the snapshot memstore. The snapshot will be initialized at flush time
> > from the active memstore with a momentary lock; then the active memstore
> > will be discarded, reads will be served from the snapshot, and writes
> > will go to a new active memstore.
> >
> >
> Yes
>
>
> > 2. The key of the CSLS is a KeyValue. Which part of the KeyValue is used
> > while sorting the set: the whole KeyValue or just the row key? Does an
> > HFile have a separate entry for each KeyValue, and are KeyValues of the
> > same row key always stored contiguously in the HFile, though possibly
> > not in the same block?
> >
> >
> Just the row key. Value is not considered in the sort.
>
> Yes, HFile has separate entry for each KeyValue (or 'Cell' in hbase-speak).
>
> Cells in HFile are sorted. Those of the same or near 'Cell' coordinates
> will be sorted together and may therefore appear inside the same block.
>
> St.Ack
>
>
>
> > On Tue, May 3, 2016 at 12:05 AM, Stack <st...@duboce.net> wrote:
> >
> > > On Mon, May 2, 2016 at 10:06 AM, Shushant Arora <
> > shushantaror...@gmail.com
> > > >
> > > wrote:
> > >
> > > > Thanks Stack
> > > >
> > > > for point 2 :
> > > > I am concerned with downtime of Hbase for read and write.
> > > > If write lock is just for the time while we move aside the current
> > > > MemStore.
> > > > Then when a write happens to key will it update the memstore only but
> > > > snapshot does not have that update and when snapshot is dunmped to
> > Hfile
> > > > won't we loose the update?
> > > >
> > > >
> > > >
> > > No. The update is in the new currently active MemStore. The update will
> > be
> > > included in the next flush added to a new hfile.
> > >
> > > St.Ack
> > >
> > >
> > >
> > >
> > >
> > > > On Mon, May 2, 2016 at 9:06 PM, Stack <st...@duboce.net> wrote:
> > > >
> > > > > On Mon, May 2, 2016 at 1:25 AM, Shushant Arora <
> > > > shushantaror...@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > Thanks!
> > > > > >
> > > > > > Few doubts;
> > > > > >
> > > > > > 1.LSM tree comprises two tree-like
> > > > > > <https://en.wikipedia.org/wiki/Tree_(data_structure)>
> structures,
> > > > called
> > > > > > C0 and
> > > > > > C1 and If the insertion causes the C0 component to exceed a
> certain
> > > > size
> > > > > > threshold, a contiguous segment of entries is removed from C0 and
> > > > merged
> > > > > > into C1 on disk
> > > > > >
> > > > > > But in Hbase when C0 which is memstore I guess? is exceeded the
> > > > threshold
> > > > > > size its dumped on to HDFS as HFIle(c1 I guess?) - and does
> > > compaction
> > > > is
> > > > > > the process which here means as merging of C0 and C1 ?
> > > > > >
> > > > > >
> > > > > The 'merge' in the quoted high-level description may just mean that
> > the
> > > > > dumped hfile is 'merged' with the others at read time. Or it may be
> > as
> > > > > stated, that the 'merge' happens at flush time. Some LSM tree
> > > > > implementations do it this way -- Bigtable, and it calls the merge
> of
> > > > > memstore and a file-on-disk a form of compaction -- but this is not
> > > what
> > > > > HBase does; it just dumps the memstore as a flushed hfile. Later,
> > we'll
> > > > run
> > > > > a compaction process to merge hfiles in back

hbase doubts

2016-05-05 Thread Shushant Arora
1. Why is it better for read performance to have a single file per region
rather than multiple files? Why can't multiple threads read multiple files
and give better performance?

2. Does an HBase regionserver have a single thread for compactions and
splits across all the regions it holds? Why wouldn't one thread per region
work better than sequential compactions/splits for all regions in a
regionserver?

3. Why does HBase flush and compact all memstores of all the families of a
table at the same time, irrespective of their size, when even one memstore
reaches the threshold?

Thanks
Shushant


Re: hbase architecture doubts

2016-05-02 Thread Shushant Arora
Thanks Stack.

1. So at any time there will be two references: 1. the active memstore,
2. the snapshot memstore. The snapshot will be initialized at flush time
from the active memstore with a momentary lock; then the active memstore
will be discarded, reads will be served from the snapshot, and writes will
go to a new active memstore.

2. The key of the CSLS is a KeyValue. Which part of the KeyValue is used
while sorting the set: the whole KeyValue or just the row key? Does an HFile
have a separate entry for each KeyValue, and are KeyValues of the same row
key always stored contiguously in the HFile, though possibly not in the same
block?

On Tue, May 3, 2016 at 12:05 AM, Stack <st...@duboce.net> wrote:

> On Mon, May 2, 2016 at 10:06 AM, Shushant Arora <shushantaror...@gmail.com
> >
> wrote:
>
> > Thanks Stack
> >
> > For point 2:
> > I am concerned with the downtime of HBase for reads and writes.
> > If the write lock is held just for the time while we move aside the
> > current MemStore, then when a write happens to a key it will update the
> > memstore only, but the snapshot does not have that update; when the
> > snapshot is dumped to an HFile, won't we lose the update?
> >
> >
> >
> No. The update is in the new currently active MemStore. The update will be
> included in the next flush added to a new hfile.
>
> St.Ack
>
>
>
>
>
> > On Mon, May 2, 2016 at 9:06 PM, Stack <st...@duboce.net> wrote:
> >
> > > On Mon, May 2, 2016 at 1:25 AM, Shushant Arora <
> > shushantaror...@gmail.com>
> > > wrote:
> > >
> > > > Thanks!
> > > >
> > > > Few doubts;
> > > >
> > > > 1.LSM tree comprises two tree-like
> > > > <https://en.wikipedia.org/wiki/Tree_(data_structure)> structures,
> > called
> > > > C0 and
> > > > C1 and If the insertion causes the C0 component to exceed a certain
> > size
> > > > threshold, a contiguous segment of entries is removed from C0 and
> > merged
> > > > into C1 on disk
> > > >
> > > > But in Hbase when C0 which is memstore I guess? is exceeded the
> > threshold
> > > > size its dumped on to HDFS as HFIle(c1 I guess?) - and does
> compaction
> > is
> > > > the process which here means as merging of C0 and C1 ?
> > > >
> > > >
> > > The 'merge' in the quoted high-level description may just mean that the
> > > dumped hfile is 'merged' with the others at read time. Or it may be as
> > > stated, that the 'merge' happens at flush time. Some LSM tree
> > > implementations do it this way -- Bigtable, and it calls the merge of
> > > memstore and a file-on-disk a form of compaction -- but this is not
> what
> > > HBase does; it just dumps the memstore as a flushed hfile. Later, we'll
> > run
> > > a compaction process to merge hfiles in background.
> > >
> > >
> > >
> > > > 2.Moves current, active Map aside as a snapshot (while a write lock
> is
> > > held
> > > > for a short period of time), and then creates a new CSLS instances.
> > > >
> > > > In background, the snapshot is then dumped to disk. We get an
> Iterator
> > on
> > > > CSLS. We write a block at a time. When we exceed configured block
> size,
> > > we
> > > > start a new one.
> > > >
> > > > -- Does write lock is held till the time complete CSLS is dumpled on
> > > > disk.
> > >
> > >
> > >
> > > No. Just while we move aside the current MemStore.
> > >
> > > What is your concern/objective? Are you studying LSM trees generally or
> > are
> > > you worried that HBase is offline for periods of time for read and
> write?
> > >
> > > Thanks,
> > > St.Ack
> > >
> > >
> > >
> > > > And read is allowed using snapshot.
> > > >
> > > >
> > >
> > >
> > >
> > > > Thanks!
> > > >
> > > >
> > > >
> > > > On Mon, May 2, 2016 at 11:39 AM, Stack <st...@duboce.net> wrote:
> > > >
> > > > > On Sun, May 1, 2016 at 3:36 AM, Shushant Arora <
> > > > shushantaror...@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > 1.Does Hbase uses ConcurrentskipListMap(CSLM) to store data in
> > > > memstore?
> > > > > >
> > > > > > Yes (We use a CSLS but this is implemented over a CSLM).
> > > > >

Re: hbase architecture doubts

2016-05-02 Thread Shushant Arora
Thanks Stack

For point 2:
I am concerned with the downtime of HBase for reads and writes.
If the write lock is held just for the time while we move aside the current
MemStore, then when a write happens to a key it will update the memstore
only, but the snapshot does not have that update; when the snapshot is
dumped to an HFile, won't we lose the update?


On Mon, May 2, 2016 at 9:06 PM, Stack <st...@duboce.net> wrote:

> On Mon, May 2, 2016 at 1:25 AM, Shushant Arora <shushantaror...@gmail.com>
> wrote:
>
> > Thanks!
> >
> > A few doubts:
> >
> > 1. An LSM tree comprises two tree-like
> > <https://en.wikipedia.org/wiki/Tree_(data_structure)> structures, called
> > C0 and C1, and if an insertion causes the C0 component to exceed a
> > certain size threshold, a contiguous segment of entries is removed from
> > C0 and merged into C1 on disk.
> >
> > But in HBase, when C0 (the memstore, I guess?) exceeds the threshold
> > size, it is dumped onto HDFS as an HFile (C1, I guess?). So is compaction
> > the process that corresponds here to merging C0 and C1?
> >
> >
> The 'merge' in the quoted high-level description may just mean that the
> dumped hfile is 'merged' with the others at read time. Or it may be as
> stated, that the 'merge' happens at flush time. Some LSM tree
> implementations do it this way -- Bigtable, and it calls the merge of
> memstore and a file-on-disk a form of compaction -- but this is not what
> HBase does; it just dumps the memstore as a flushed hfile. Later, we'll run
> a compaction process to merge hfiles in background.
>
>
>
> > 2. "Moves the current, active Map aside as a snapshot (while a write lock
> > is held for a short period of time), and then creates a new CSLS
> > instance. In background, the snapshot is then dumped to disk. We get an
> > Iterator on the CSLS. We write a block at a time. When we exceed the
> > configured block size, we start a new one."
> >
> > -- Is the write lock held until the complete CSLS has been dumped to
> > disk?
>
>
>
> No. Just while we move aside the current MemStore.
>
> What is your concern/objective? Are you studying LSM trees generally or are
> you worried that HBase is offline for periods of time for read and write?
>
> Thanks,
> St.Ack
>
>
>
> > And are reads allowed using the snapshot?
> >
> >
>
>
>
> > Thanks!
> >
> >
> >
> > On Mon, May 2, 2016 at 11:39 AM, Stack <st...@duboce.net> wrote:
> >
> > > On Sun, May 1, 2016 at 3:36 AM, Shushant Arora <
> > shushantaror...@gmail.com>
> > > wrote:
> > >
> > > > 1.Does Hbase uses ConcurrentskipListMap(CSLM) to store data in
> > memstore?
> > > >
> > > > Yes (We use a CSLS but this is implemented over a CSLM).
> > >
> > >
> > > > 2.When mwmstore is flushed to HDFS- does it dump the memstore
> > > > Concurrentskiplist as Hfile2? Then How does it calculates blocks out
> of
> > > > CSLM and dmp them in HDFS.
> > > >
> > > >
> > > Moves current, active Map aside as a snapshot (while a write lock is
> held
> > > for a short period of time), and then creates a new CSLS instances.
> > >
> > > In background, the snapshot is then dumped to disk. We get an Iterator
> on
> > > CSLS. We write a block at a time. When we exceed configured block size,
> > we
> > > start a new one.
> > >
> > >
> > > > 3.After dumping the inmemory CSLM of memstore to HFILe does memstore
> > > > content is discarded
> > >
> > >
> > > Yes
> > >
> > >
> > >
> > > > and if while dumping memstore any read request comes
> > > > will it be responded by copy of memstore or discard of memstore will
> be
> > > > blocked until read request is completed?
> > > >
> > > > We will respond using the snapshot until it has been successfully
> > dumped.
> > > Once dumped, we'll respond using the hfile.
> > >
> > > No blocking (other than for the short period during which the snapshot
> is
> > > made and the file is swapped into the read path).
> > >
> > >
> > >
> > > > 4.When a read request comes does it look in inmemory CSLM and then in
> > > > HFile?
> > >
> > >
> > > Generally, yes.
> > >
> > >
> > >
> > > > And what is LogStructuredMerge tree and its usage in Hbase.
> > > >
> > > >
> > > Suggest you read up on LSM Trees (
> > > https://en.wikipedia.org/wiki/Log-structured_merge-tree) and if you
> > still
> > > can't see the LSM tree in the HBase forest, ask specific questions and
> > > we'll help you out.
> > >
> > > St.Ack
> > >
> > >
> > >
> > >
> > > > Thanks!
> > > >
> > >
> >
>


Re: hbase architecture doubts

2016-05-02 Thread Shushant Arora
Thanks!

A few doubts:

1. An LSM tree comprises two tree-like
<https://en.wikipedia.org/wiki/Tree_(data_structure)> structures, called C0
and C1, and if an insertion causes the C0 component to exceed a certain size
threshold, a contiguous segment of entries is removed from C0 and merged
into C1 on disk.

But in HBase, when C0 (the memstore, I guess?) exceeds the threshold size,
it is dumped onto HDFS as an HFile (C1, I guess?). So is compaction the
process that corresponds here to merging C0 and C1?

2. "Moves the current, active Map aside as a snapshot (while a write lock is
held for a short period of time), and then creates a new CSLS instance. In
background, the snapshot is then dumped to disk. We get an Iterator on the
CSLS. We write a block at a time. When we exceed the configured block size,
we start a new one."

-- Is the write lock held until the complete CSLS has been dumped to disk,
and are reads allowed using the snapshot in the meantime?

Thanks!



On Mon, May 2, 2016 at 11:39 AM, Stack <st...@duboce.net> wrote:

> On Sun, May 1, 2016 at 3:36 AM, Shushant Arora <shushantaror...@gmail.com>
> wrote:
>
> > 1. Does HBase use a ConcurrentSkipListMap (CSLM) to store data in the
> > memstore?
> >
> > Yes (We use a CSLS but this is implemented over a CSLM).
>
>
> > 2. When the memstore is flushed to HDFS, does it dump the memstore's
> > ConcurrentSkipList as an HFile v2? Then how does it calculate blocks out
> > of the CSLM and dump them to HDFS?
> >
> >
> Moves the current, active Map aside as a snapshot (while a write lock is
> held for a short period of time), and then creates a new CSLS instance.
>
> In background, the snapshot is then dumped to disk. We get an Iterator on
> CSLS. We write a block at a time. When we exceed configured block size, we
> start a new one.
>
>
> > 3. After dumping the in-memory CSLM of the memstore to an HFile, is the
> > memstore content discarded?
>
>
> Yes
>
>
>
> > And if a read request comes while the memstore is being dumped, will it
> > be answered from a copy of the memstore, or will the discard of the
> > memstore be blocked until the read request is completed?
> >
> > We will respond using the snapshot until it has been successfully dumped.
> Once dumped, we'll respond using the hfile.
>
> No blocking (other than for the short period during which the snapshot is
> made and the file is swapped into the read path).
>
>
>
> > 4. When a read request comes, does it look in the in-memory CSLM and
> > then in the HFile?
>
>
> Generally, yes.
>
>
>
> > And what is a log-structured merge tree and what is its usage in HBase?
> >
> >
> Suggest you read up on LSM Trees (
> https://en.wikipedia.org/wiki/Log-structured_merge-tree) and if you still
> can't see the LSM tree in the HBase forest, ask specific questions and
> we'll help you out.
>
> St.Ack
>
>
>
>
> > Thanks!
> >
>
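
A toy sketch of the flush dance described in this thread (deliberately
simplified; these are not HBase's real classes, and the real code
coordinates the swap with region-level locks): the active map is swapped out
under a brief lock, reads consult the new active map and then the immutable
snapshot, and the snapshot is written out in the background.

import java.util.Map;
import java.util.concurrent.ConcurrentSkipListMap;

class MiniMemstore {
  private volatile ConcurrentSkipListMap<String, String> active = new ConcurrentSkipListMap<>();
  private volatile ConcurrentSkipListMap<String, String> snapshot = new ConcurrentSkipListMap<>();

  void put(String key, String value) {
    active.put(key, value); // new writes always land in the active map
  }

  String get(String key) {
    String v = active.get(key);               // newest data first,
    return v != null ? v : snapshot.get(key); // then the flushing snapshot
  }

  synchronized void snapshotForFlush() { // the brief "write lock" moment
    snapshot = active;
    active = new ConcurrentSkipListMap<>();
  }

  void flushToDisk() { // runs in the background
    for (Map.Entry<String, String> e : snapshot.entrySet()) {
      // write e into the current output block; start a new block once the
      // configured block size is exceeded (file I/O omitted here)
    }
    snapshot = new ConcurrentSkipListMap<>(); // discard once safely dumped
  }
}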


hbase architecture doubts

2016-05-01 Thread Shushant Arora
1. Does HBase use a ConcurrentSkipListMap (CSLM) to store data in the
memstore?

2. When the memstore is flushed to HDFS, does it dump the memstore's
ConcurrentSkipList as an HFile v2? Then how does it calculate blocks out of
the CSLM and dump them to HDFS?

3. After dumping the in-memory CSLM of the memstore to an HFile, is the
memstore content discarded? And if a read request comes while the memstore
is being dumped, will it be answered from a copy of the memstore, or will
the discard of the memstore be blocked until the read request is completed?

4. When a read request comes, does it look in the in-memory CSLM and then in
the HFile? And what is a log-structured merge tree and what is its usage in
HBase?

Thanks!


Re: hbase custom scan

2016-04-04 Thread Shushant Arora
The table will have ~100 regions.

I didn't get the advantage of the top rows coming from the same vs.
different regions; they will come from different regions either way.

On Tue, Apr 5, 2016 at 9:10 AM, Ted Yu <yuzhih...@gmail.com> wrote:

> How many regions does your table have ?
>
> After sorting, is there a chance that the top N rows come from distinct
> regions ?
>
> On Mon, Apr 4, 2016 at 8:27 PM, Shushant Arora <shushantaror...@gmail.com>
> wrote:
>
> > Hi
> >
> > I have a requirement to scan an HBase table based on insertion timestamp.
> > I need to fetch the keys sorted by insertion timestamp, not by key.
> >
> > I can't make the timestamp a prefix of the key, to avoid hotspotting.
> > Is there any efficient way to meet this requirement?
> >
> > Thanks!
> >
>


hbase custom scan

2016-04-04 Thread Shushant Arora
Hi

I have a requirement to scan an HBase table based on insertion timestamp.
I need to fetch the keys sorted by insertion timestamp, not by key.

I can't make the timestamp a prefix of the key, to avoid hotspotting.
Is there any efficient way to meet this requirement?

Thanks!


does hbase scan doubts

2016-03-13 Thread Shushant Arora
Are HBase scans and gets single-threaded?
Say I have an HBase table spread over 100 regionservers.

When I scan a key range, say a-z (distributed over all regionservers), will
the client make calls to the regionservers in parallel all at once, or one
by one: first getting all keys from one regionserver, then making the next
call to another regionserver in lexicographic order of keys?

If it makes the calls in parallel, how does it ensure the result is always
sorted by key?

Thanks!


Re: use of hbase client in application server

2016-03-13 Thread Shushant Arora
2. Do I need to check whether the HConnection is still active before using
it to create an HTable instance?

By "still valid" I meant: say I created the HConnection object, and after
3-4 minutes a request comes in for a CRUD operation on some table; before I
get the HTable from the HConnection, the HConnection object or the TCP/IP
connection to the cluster drops. If I now create an HTable using the
HConnection, will it recreate the TCP connection to the cluster
automatically?

1. And will increasing the number of HConnections improve performance?

Thanks!

On Sun, Mar 13, 2016 at 7:47 PM, Ted Yu <yuzhih...@gmail.com> wrote:

> For #1, single Hconnection should work.
>
> For #2, can you clarify ? As long as the hbase-site.xml used to create
> the Hconnection
> is still valid, you can continue using the connection.
>
> For #3, they're handled by the connection automatically.
>
> For #4, the HTable ctor you cited doesn't exist in master branch.
> You can control the following parameters for the ThreadPoolExecutor - see
> HTable#getDefaultExecutor():
>
> int maxThreads = conf.getInt("hbase.htable.threads.max", Integer.
> MAX_VALUE);
>
> if (maxThreads == 0) {
>
>   maxThreads = 1; // is there a better default?
>
> }
>
> int corePoolSize = conf.getInt("hbase.htable.threads.coresize", 1);
>
> long keepAliveTime = conf.getLong("hbase.htable.threads.keepalivetime",
> 60);
>
> On Sun, Mar 13, 2016 at 3:12 AM, Shushant Arora <shushantaror...@gmail.com
> >
> wrote:
>
> > I have a requirement to use a long-running HBase client in an
> > application server.
> >
> > 1. Do I need to create multiple HConnections, or will a single
> > HConnection work?
> > 2. Do I need to check whether the HConnection is still active before
> > using it to create an HTable instance?
> > 3. Do I need to handle region splits and regionserver changes while
> > using the HConnection, or are they handled automatically?
> > 4. What is the use of the thread pool in the HTable instance?
> > ExecutorService threadPool;
> > HTable h = new HTable(conf, Bytes.toBytes("tablename"), threadPool);
> >
> >
> > Thanks!
> >
>
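
As a usage note, the pattern suggested by the answers above -- one
long-lived connection for the whole application server, cheap per-request
table handles, and a shared pool for fanning batched operations out to the
regionservers -- looks roughly like this with the client API of that era
(pool size and table name are illustrative):

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HConnection;
import org.apache.hadoop.hbase.client.HConnectionManager;
import org.apache.hadoop.hbase.client.HTableInterface;

Configuration conf = HBaseConfiguration.create();
ExecutorService pool = Executors.newFixedThreadPool(16);
HConnection connection = HConnectionManager.createConnection(conf, pool);
try {
  HTableInterface table = connection.getTable("mytable");
  try {
    // CRUD operations go here; retries and region relocation after splits
    // or server moves are handled inside the connection
  } finally {
    table.close();
  }
} finally {
  connection.close();
}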


use of hbase client in application server

2016-03-13 Thread Shushant Arora
I have a requirement to use a long-running HBase client in an application
server.

1. Do I need to create multiple HConnections, or will a single HConnection
work?
2. Do I need to check whether the HConnection is still active before using
it to create an HTable instance?
3. Do I need to handle region splits and regionserver changes while using
the HConnection, or are they handled automatically?
4. What is the use of the thread pool in the HTable instance?
ExecutorService threadPool;
HTable h = new HTable(conf, Bytes.toBytes("tablename"), threadPool);


Thanks!


Re: disable major compaction per table

2016-02-18 Thread Shushant Arora
Thanks!

Does HBase compress repeated values in keys and columns, e.g. a column
"location" with the value "ASIA"? Will that value be repeated with each key,
or will HBase's Snappy compression handle it?

Does the same apply to repeated values of a column?

Thanks!

On Wed, Feb 17, 2016 at 7:14 AM, Ted Yu <yuzhih...@gmail.com> wrote:

> bq. hbase.hregion.majorcompaction = 0 per table/column family
>
> I searched code base but didn't find relevant test case for the above.
> Mind giving me some pointer ?
>
> Thanks
>
> On Tue, Feb 16, 2016 at 5:38 PM, Vladimir Rodionov <vladrodio...@gmail.com
> >
> wrote:
>
> > 1. Does major compaction in HBase run on a per-table basis?
> >
> > Per region.
> >
> > 2. By default every 24 hours?
> >
> > In older versions, yes. Current (1.x+): 7 days.
> >
> > 3. Can I disable automatic major compaction for a few tables while
> > keeping it enabled for the rest of the tables?
> >
> > Yes, you can. You can set
> >
> > hbase.hregion.majorcompaction = 0 per table/column family
> >
> > 4. Are HBase put, get, and delete blocked during major compaction, and
> > do they work during minor compaction?
> >
> > No, they are not.
> >
> > -Vlad
> >
> > On Tue, Feb 16, 2016 at 4:51 PM, Ted Yu <yuzhih...@gmail.com> wrote:
> >
> > > For #2, see http://hbase.apache.org/book.html#managed.compactions
> > >
> > > For #3, I don't think so.
> > >
> > > On Tue, Feb 16, 2016 at 4:46 PM, Shushant Arora <
> > shushantaror...@gmail.com
> > > >
> > > wrote:
> > >
> > > > Hi
> > > >
> > > > 1.does major compaction in hbase runs per table basis.
> > > > 2.By default every 24 hours?
> > > > 3.Can I disable automatic major compaction for few tables while keep
> it
> > > > enable for rest of tables?
> > > >
> > > > 4.Does hbase put ,get and delete are blocked while major compaction
> and
> > > are
> > > > working in minor compaction?
> > > >
> > > > Thanks
> > > >
> > >
> >
>
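
Putting the two halves of this thread together, a minimal sketch of setting
hbase.hregion.majorcompaction = 0 for one table only, plus prefix encoding
for the repeated key/column bytes asked about above, assuming the 1.x API
(table and family names illustrative). Snappy compresses whole blocks on
disk, so repeated values shrink there; data block encoding additionally
prefix-encodes the repeated parts of adjacent keys:

import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.io.compress.Compression;
import org.apache.hadoop.hbase.io.encoding.DataBlockEncoding;

HTableDescriptor desc = new HTableDescriptor(TableName.valueOf("mytable"));
// Disable periodic major compactions for this table only; other tables
// keep the cluster-wide default.
desc.setConfiguration("hbase.hregion.majorcompaction", "0");

HColumnDescriptor cf = new HColumnDescriptor("f1");
cf.setCompressionType(Compression.Algorithm.SNAPPY);
cf.setDataBlockEncoding(DataBlockEncoding.PREFIX);
desc.addFamily(cf);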


disable major compaction per table

2016-02-16 Thread Shushant Arora
Hi

1. Does major compaction in HBase run on a per-table basis?
2. By default every 24 hours?
3. Can I disable automatic major compaction for a few tables while keeping
it enabled for the rest of the tables?

4. Are HBase put, get, and delete blocked during major compaction, and do
they work during minor compaction?

Thanks


timestamp/ttl of a cell

2015-11-25 Thread Shushant Arora
Hi

Can the TTL be set/updated for individual rows instead of for the complete
column family?
Or can the timestamp version of a cell be decreased? The aim is to delete
some rows by setting their timestamp to old values so that they fall past
the TTL of the column family, if a TTL per row/cell cannot be specified.


Re: timestamp/ttl of a cell

2015-11-25 Thread Shushant Arora
Thanks!
What is the syntax to set it in the shell and in Java?

On Wed, Nov 25, 2015 at 6:05 PM, Jean-Marc Spaggiari <
jean-m...@spaggiari.org> wrote:

> This? HBASE-10560
>
> 2015-11-25 6:45 GMT-05:00 Shushant Arora <shushantaror...@gmail.com>:
>
> > Hi
> >
> > Can the TTL be set/updated for individual rows instead of for the
> > complete column family?
> > Or can the timestamp version of a cell be decreased? The aim is to
> > delete some rows by setting their timestamp to old values so that they
> > fall past the TTL of the column family, if a TTL per row/cell cannot be
> > specified.
> >
>


hbase timerange scan

2015-11-04 Thread Shushant Arora
Is an HBase timerange scan a full table scan when no start and stop key are
given? Or does it make use of the HFile metadata about the min and max
timerange of each HFile? And how is this metadata maintained after
compaction of multiple files?
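
For reference, a timerange scan is expressed like this; a minimal sketch
(the timestamp bounds are placeholders):

import org.apache.hadoop.hbase.client.Scan;

Scan scan = new Scan();
// Without a start/stop row every region is still consulted, but store
// files whose HFile metadata shows a [min,max] timestamp range that cannot
// overlap this window can be skipped entirely.
scan.setTimeRange(1446595200000L, 1446681600000L);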


Re: hbase doubts

2015-08-18 Thread Shushant Arora
And will using KeyPrefixRegionSplitPolicy instead of the default
IncreasingToUpperBoundRegionSplitPolicy help here?
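
For reference, a split policy is set per table; a minimal sketch, assuming
the 1.x API (the prefix-length property key is, to the best of my knowledge,
"KeyPrefixRegionSplitPolicy.prefix_length", but verify it against your
version; names illustrative):

import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.regionserver.KeyPrefixRegionSplitPolicy;

HTableDescriptor desc = new HTableDescriptor(TableName.valueOf("mytable"));
// Split only at prefix boundaries so that all rows sharing the first N key
// bytes (here a 10-byte "yyyy-mm-dd" date prefix) stay in one region.
desc.setValue(HTableDescriptor.SPLIT_POLICY,
    KeyPrefixRegionSplitPolicy.class.getName());
desc.setValue("KeyPrefixRegionSplitPolicy.prefix_length", "10");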

On Wed, Aug 19, 2015 at 10:23 AM, Shushant Arora shushantaror...@gmail.com
wrote:

 When the last region gets new data and splits in two, what is the split
 point? Say the last region had 10 files and the split algorithm decided to
 split this region:

 Will the two child regions have 5 files each, or will the key space of the
 original (parent) region, say range (2015-08-01#guid to 2015-08-06#guid),
 be divided into 2 equal parts, child1 getting (2015-08-01#guid to
 2015-08-03#guids) and child2 getting (2015-08-04#guid to 2015-08-06#guid),
 with all data rewritten into the child regions to match these key ranges?
 And then, since it is time-series based, new data will come in increasing
 dates, for dates > 2015-08-06 only, so it will go to child2, and child1
 will always be half filled; only child2 will lead to new splits when it
 reaches the split size threshold?






 On Wed, Aug 19, 2015 at 4:16 AM, Ted Yu yuzhih...@gmail.com wrote:

 Since year and month are part of the row key in this scenario (instead of
 just the day of month), the last region would get new data and be split.

 Is this effect desirable for your app ?

 Cheers

 On Tue, Aug 18, 2015 at 12:55 PM, Shushant Arora 
 shushantaror...@gmail.com
 wrote:

  For an HBase key containing time as a prefix, say (yyyy-mm-dd#other
  fields of guid base), I am using bulk load to avoid hotspotting a
  regionserver (avoiding writes to the WAL).
 
  What should the initial splits of the regions be? Say I have 30
  regionservers.
 
  Shall I use the initial 30 days as initial splits, and then auto-split
  takes care of splitting regions if they grow further?
  Or, since the key has the date as a prefix, when a region is split in 2
  from the midway point and new data comes for increasing dates only, will
  that lead to one region always being half filled and the other half never
  filled?
 
  On Tue, Aug 18, 2015 at 9:41 PM, anil gupta anilgupt...@gmail.com
 wrote:
 
   As per my experience, Phoenix is way superior than Hive-HBase
 integration
   for sql-like querying on HBase. It's because, Phoenix is built on top
 of
   HBase unlike Hive.
  
   On Tue, Aug 18, 2015 at 9:09 AM, Ted Yu yuzhih...@gmail.com wrote:
  
To my knowledge, Phoenix provides better integration with hbase.
   
A third possibility is Spark on HBase.
   
If you want to explore these alternatives, I suggest asking on
  respective
mailing lists where you can get expert opinions.
   
Cheers
   
On Tue, Aug 18, 2015 at 9:03 AM, Shushant Arora 
   shushantaror...@gmail.com

wrote:
   
 Thanks!

 Which one is better for sqlkind of queries over hbase (queries
  involve
 filter , key range scan), aggregates by column values.
 .
 1.Hive storage handlers
 2.or Phoenix

 On Tue, Aug 18, 2015 at 9:14 PM, Ted Yu yuzhih...@gmail.com
 wrote:

  For #1, if you want to count distinct values for F1, you can
 write
  a
  coprocessor which aggregates the count on region server and
 returns
   the
  result to client which does the final aggregation.
 
  Take a look
  at
 

   
  
 
 hbase-server/src/main/java/org/apache/hadoop/hbase/coprocessor/AggregateImplementation.java
  and related classes for example.
 
  On Mon, Aug 17, 2015 at 10:08 PM, Shushant Arora 
  shushantaror...@gmail.com
  wrote:
 
   Thanks !
   few more doubts :
  
   1.Say if requirement is to count distinct value of F1-
  
   If field is part of key- is hbase can't just scan key and skip
   value
   deserialsation and return result to client which will
 calculate
 distinct
   and in second approcah Hbase will desrialise the value of
 return
column
   containing F1 to cleint which will calculate the distinct.
  
   2.For bulk load when LoadIncrementalHFiles runs and
 regionserver
moves
  the
   hfiles from hdfs to region directory - does regionserver
 localise
   the
  hfile
   by downloading it to local and then uploading again in region
 directory?
  Or
   it just moves to to region directory and wait for next
 compaction
   to
 get
  it
   localise  as in regionserver failure case?
  
  
  
  
   On Mon, Aug 17, 2015 at 11:00 PM, Ted Yu yuzhih...@gmail.com
 
wrote:
  
For both scenarios you mentioned, field is not leading part
 of
   row
 key.
You would need to specify timerange or start row / stop row
 to
narrow
  the
key range being scanned.
   
I am leaning toward using second approach.
   
Cheers
   
On Mon, Aug 17, 2015 at 9:41 AM, Shushant Arora 
   shushantaror...@gmail.com

wrote:
   
 ~8-10 fields of size (5 of  20 bytes each )and 3 fields of
  size
200
   bytes
 each.

 On Mon, Aug 17

Re: hbase doubts

2015-08-18 Thread Shushant Arora
When the last region gets new data and splits in two, what is the split
point? Say the last region had 10 files and the split algorithm decided to
split this region:

Will the two child regions have 5 files each, or will the key space of the
original (parent) region, say range (2015-08-01#guid to 2015-08-06#guid),
be divided into 2 equal parts, child1 getting (2015-08-01#guid to
2015-08-03#guids) and child2 getting (2015-08-04#guid to 2015-08-06#guid),
with all data rewritten into the child regions to match these key ranges?
And then, since it is time-series based, new data will come in increasing
dates, for dates > 2015-08-06 only, so it will go to child2, and child1
will always be half filled; only child2 will lead to new splits when it
reaches the split size threshold?






On Wed, Aug 19, 2015 at 4:16 AM, Ted Yu yuzhih...@gmail.com wrote:

 Since year and month are part of the row key in this scenario (instead of
 just the day of month), the last region would get new data and be split.

 Is this effect desirable for your app ?

 Cheers

 On Tue, Aug 18, 2015 at 12:55 PM, Shushant Arora 
 shushantaror...@gmail.com
 wrote:

  For an HBase key containing time as a prefix, say (yyyy-mm-dd#other
  fields of guid base), I am using bulk load to avoid hotspotting a
  regionserver (avoiding writes to the WAL).
 
  What should the initial splits of the regions be? Say I have 30
  regionservers.
 
  Shall I use the initial 30 days as initial splits, and then auto-split
  takes care of splitting regions if they grow further?
  Or, since the key has the date as a prefix, when a region is split in 2
  from the midway point and new data comes for increasing dates only, will
  that lead to one region always being half filled and the other half never
  filled?
 
  On Tue, Aug 18, 2015 at 9:41 PM, anil gupta anilgupt...@gmail.com
 wrote:
 
   As per my experience, Phoenix is way superior than Hive-HBase
 integration
   for sql-like querying on HBase. It's because, Phoenix is built on top
 of
   HBase unlike Hive.
  
   On Tue, Aug 18, 2015 at 9:09 AM, Ted Yu yuzhih...@gmail.com wrote:
  
To my knowledge, Phoenix provides better integration with hbase.
   
A third possibility is Spark on HBase.
   
If you want to explore these alternatives, I suggest asking on
  respective
mailing lists where you can get expert opinions.
   
Cheers
   
On Tue, Aug 18, 2015 at 9:03 AM, Shushant Arora 
   shushantaror...@gmail.com

wrote:
   
 Thanks!

 Which one is better for sqlkind of queries over hbase (queries
  involve
 filter , key range scan), aggregates by column values.
 .
 1.Hive storage handlers
 2.or Phoenix

 On Tue, Aug 18, 2015 at 9:14 PM, Ted Yu yuzhih...@gmail.com
 wrote:

  For #1, if you want to count distinct values for F1, you can
 write
  a
  coprocessor which aggregates the count on region server and
 returns
   the
  result to client which does the final aggregation.
 
  Take a look
  at
 

   
  
 
 hbase-server/src/main/java/org/apache/hadoop/hbase/coprocessor/AggregateImplementation.java
  and related classes for example.
 
  On Mon, Aug 17, 2015 at 10:08 PM, Shushant Arora 
  shushantaror...@gmail.com
  wrote:
 
   Thanks !
   few more doubts :
  
   1.Say if requirement is to count distinct value of F1-
  
   If field is part of key- is hbase can't just scan key and skip
   value
   deserialsation and return result to client which will calculate
 distinct
   and in second approcah Hbase will desrialise the value of
 return
column
   containing F1 to cleint which will calculate the distinct.
  
   2.For bulk load when LoadIncrementalHFiles runs and
 regionserver
moves
  the
   hfiles from hdfs to region directory - does regionserver
 localise
   the
  hfile
   by downloading it to local and then uploading again in region
 directory?
  Or
   it just moves to to region directory and wait for next
 compaction
   to
 get
  it
   localise  as in regionserver failure case?
  
  
  
  
   On Mon, Aug 17, 2015 at 11:00 PM, Ted Yu yuzhih...@gmail.com
wrote:
  
For both scenarios you mentioned, field is not leading part
 of
   row
 key.
You would need to specify timerange or start row / stop row
 to
narrow
  the
key range being scanned.
   
I am leaning toward using second approach.
   
Cheers
   
On Mon, Aug 17, 2015 at 9:41 AM, Shushant Arora 
   shushantaror...@gmail.com

wrote:
   
 ~8-10 fields of size (5 of  20 bytes each )and 3 fields of
  size
200
   bytes
 each.

 On Mon, Aug 17, 2015 at 9:55 PM, Ted Yu 
 yuzhih...@gmail.com
  
 wrote:

  How many fields such as F1 are you considering for
  embedding
   in
 row
key ?
 
  Suggested reading

Re: hbase doubts

2015-08-18 Thread Shushant Arora
For an HBase key containing time as a prefix, say (yyyy-mm-dd#other fields
of guid base), I am using bulk load to avoid hotspotting a regionserver
(avoiding writes to the WAL).

What should the initial splits of the regions be? Say I have 30
regionservers.

Shall I use the initial 30 days as initial splits, and then auto-split takes
care of splitting regions if they grow further?
Or, since the key has the date as a prefix, when a region is split in 2 from
the midway point and new data comes for increasing dates only, will that
lead to one region always being half filled and the other half never filled?
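
A minimal sketch of the "30 initial splits" idea, one region per day of the
month, assuming the 1.x admin API (the table/family names and the conf
handle are illustrative):

import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.util.Bytes;

HTableDescriptor desc = new HTableDescriptor(TableName.valueOf("events"));
desc.addFamily(new HColumnDescriptor("f1"));
// 29 split points give 30 initial regions for date-prefixed keys, roughly
// one region per regionserver to start with.
byte[][] splits = new byte[29][];
for (int day = 2; day <= 30; day++) {
  splits[day - 2] = Bytes.toBytes(String.format("2015-08-%02d", day));
}
new HBaseAdmin(conf).createTable(desc, splits);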

On Tue, Aug 18, 2015 at 9:41 PM, anil gupta anilgupt...@gmail.com wrote:

 As per my experience, Phoenix is way superior to Hive-HBase integration
 for SQL-like querying on HBase. That's because Phoenix is built on top of
 HBase, unlike Hive.

 On Tue, Aug 18, 2015 at 9:09 AM, Ted Yu yuzhih...@gmail.com wrote:

  To my knowledge, Phoenix provides better integration with hbase.
 
  A third possibility is Spark on HBase.
 
  If you want to explore these alternatives, I suggest asking on respective
  mailing lists where you can get expert opinions.
 
  Cheers
 
  On Tue, Aug 18, 2015 at 9:03 AM, Shushant Arora 
 shushantaror...@gmail.com
  
  wrote:
 
   Thanks!
  
   Which one is better for SQL-like queries over HBase (queries involving
   filters, key-range scans, and aggregates by column values):
   1. Hive storage handlers
   2. or Phoenix?
  
   On Tue, Aug 18, 2015 at 9:14 PM, Ted Yu yuzhih...@gmail.com wrote:
  
For #1, if you want to count distinct values for F1, you can write a
coprocessor which aggregates the count on region server and returns
 the
result to client which does the final aggregation.
   
Take a look
at
   
  
 
 hbase-server/src/main/java/org/apache/hadoop/hbase/coprocessor/AggregateImplementation.java
and related classes for example.
   
On Mon, Aug 17, 2015 at 10:08 PM, Shushant Arora 
shushantaror...@gmail.com
wrote:
   
 Thanks !
 few more doubts :

 1.Say if requirement is to count distinct value of F1-

 If field is part of key- is hbase can't just scan key and skip
 value
 deserialsation and return result to client which will calculate
   distinct
 and in second approcah Hbase will desrialise the value of return
  column
 containing F1 to cleint which will calculate the distinct.

 2.For bulk load when LoadIncrementalHFiles runs and regionserver
  moves
the
 hfiles from hdfs to region directory - does regionserver localise
 the
hfile
 by downloading it to local and then uploading again in region
   directory?
Or
 it just moves to to region directory and wait for next compaction
 to
   get
it
 localise  as in regionserver failure case?




 On Mon, Aug 17, 2015 at 11:00 PM, Ted Yu yuzhih...@gmail.com
  wrote:

  For both scenarios you mentioned, field is not leading part of
 row
   key.
  You would need to specify timerange or start row / stop row to
  narrow
the
  key range being scanned.
 
  I am leaning toward using second approach.
 
  Cheers
 
  On Mon, Aug 17, 2015 at 9:41 AM, Shushant Arora 
 shushantaror...@gmail.com
  
  wrote:
 
   ~8-10 fields of size (5 of  20 bytes each )and 3 fields of size
  200
 bytes
   each.
  
   On Mon, Aug 17, 2015 at 9:55 PM, Ted Yu yuzhih...@gmail.com
   wrote:
  
How many fields such as F1 are you considering for embedding
 in
   row
  key ?
   
Suggested reading:
http://hbase.apache.org/book.html#rowkey.design
http://hbase.apache.org/book.html#client.filter.kvm (see
ColumnPrefixFilter)
   
Cheers
   
On Mon, Aug 17, 2015 at 8:13 AM, Shushant Arora 
   shushantaror...@gmail.com

wrote:
   
 1.so size limit is per cell's identifier + value ?

 What is more optimise - to have field in key or in column
family's
column ?
 If pattern is like every row has that field.

 Say I have a field F1 in all rows so
 Situtatio -1
 key1#F1(as composite key)  - and rest fields in column

 Situation-2
 key1 as key and F1 part of column family.


 This is the main reason I  asked the key size limit.
 If I asked for no of rows where F1 is = 'someval' will it
 be
faster
  in
 situation-1 than in situation-2. Since in 1 it can return
 the
 result
   just
 by traversing keys no need to read columns?


 On Mon, Aug 17, 2015 at 8:27 PM, Ted Yu 
 yuzhih...@gmail.com
  
 wrote:

  For #1, it is the limit on a single keyvalue, not row,
 not
   key.
 
  For #2, please see the following:
 
  http://hbase.apache.org/book.html#store.memstore
 
   
 
   
  http

Re: hbase doubts

2015-08-18 Thread Shushant Arora
Thanks!

Which one is better for SQL-like queries over HBase (queries involving
filters, key-range scans, and aggregates by column values):
1. Hive storage handlers
2. or Phoenix?

On Tue, Aug 18, 2015 at 9:14 PM, Ted Yu yuzhih...@gmail.com wrote:

 For #1, if you want to count distinct values for F1, you can write a
 coprocessor which aggregates the count on region server and returns the
 result to client which does the final aggregation.

 Take a look
 at
 hbase-server/src/main/java/org/apache/hadoop/hbase/coprocessor/AggregateImplementation.java
 and related classes for example.

 On Mon, Aug 17, 2015 at 10:08 PM, Shushant Arora 
 shushantaror...@gmail.com
 wrote:

  Thanks !
  few more doubts :
 
  1. Say the requirement is to count the distinct values of F1.
 
  If the field is part of the key, can't HBase just scan the keys, skip
  value deserialization, and return the result to the client, which will
  calculate the distinct count; whereas in the second approach HBase will
  deserialize and return the column containing F1 to the client, which will
  calculate the distinct count?
 
  2. For bulk load, when LoadIncrementalHFiles runs and the regionserver
  moves the HFiles from HDFS into the region directory: does the
  regionserver localise each HFile by downloading it locally and uploading
  it again into the region directory? Or does it just move it into the
  region directory and wait for the next compaction to get it localised, as
  in the regionserver-failure case?
 
 
 
 
  On Mon, Aug 17, 2015 at 11:00 PM, Ted Yu yuzhih...@gmail.com wrote:
 
   For both scenarios you mentioned, field is not leading part of row key.
   You would need to specify timerange or start row / stop row to narrow
 the
   key range being scanned.
  
   I am leaning toward using second approach.
  
   Cheers
  
   On Mon, Aug 17, 2015 at 9:41 AM, Shushant Arora 
  shushantaror...@gmail.com
   
   wrote:
  
~8-10 fields of size (5 of  20 bytes each )and 3 fields of size 200
  bytes
each.
   
On Mon, Aug 17, 2015 at 9:55 PM, Ted Yu yuzhih...@gmail.com wrote:
   
 How many fields such as F1 are you considering for embedding in row
   key ?

 Suggested reading:
 http://hbase.apache.org/book.html#rowkey.design
 http://hbase.apache.org/book.html#client.filter.kvm (see
 ColumnPrefixFilter)

 Cheers

 On Mon, Aug 17, 2015 at 8:13 AM, Shushant Arora 
shushantaror...@gmail.com
 
 wrote:

  1.so size limit is per cell's identifier + value ?
 
  What is more optimise - to have field in key or in column
 family's
 column ?
  If pattern is like every row has that field.
 
  Say I have a field F1 in all rows so
  Situtatio -1
  key1#F1(as composite key)  - and rest fields in column
 
  Situation-2
  key1 as key and F1 part of column family.
 
 
  This is the main reason I  asked the key size limit.
  If I asked for no of rows where F1 is = 'someval' will it be
 faster
   in
  situation-1 than in situation-2. Since in 1 it can return the
  result
just
  by traversing keys no need to read columns?
 
 
  On Mon, Aug 17, 2015 at 8:27 PM, Ted Yu yuzhih...@gmail.com
  wrote:
 
   For #1, it is the limit on a single keyvalue, not row, not key.
  
   For #2, please see the following:
  
   http://hbase.apache.org/book.html#store.memstore
  

  
 http://hbase.apache.org/book.html#regionserver_splitting_implementation
  
   Cheers
  
   On Mon, Aug 17, 2015 at 7:36 AM, Shushant Arora 
  shushantaror...@gmail.com
   
   wrote:
  
1.Is hbase.client.keyvalue.maxsize  is max size of row or key
   only
?
 Is
there any limit on key size only ?
2.Access pattern is mostly on key based only- Is memstores
 and
 regions
   on a
regionserver are per table basis? Is it if I have multiple
  tables
it
  will
have multiple memstores instead of few if it would have been
  one
 large
table ?
   
   
On Mon, Aug 17, 2015 at 7:29 PM, Ted Yu yuzhih...@gmail.com
 
wrote:
   
 For #1, take a look at the following in hbase-default.xml :

 <name>hbase.client.keyvalue.maxsize</name>
 <value>10485760</value>

 For #2, it would be easier to answer if you can outline
  access
  patterns
in
 your app.

 For #3, adjustment according to current region boundaries
 is
   done
   client
 side. Take a look at the javadoc for LoadQueueItem
 in LoadIncrementalHFiles.java

 Cheers

 On Mon, Aug 17, 2015 at 6:45 AM, Shushant Arora 
shushantaror...@gmail.com
 
 wrote:

  1.Is there any max limit on key size of hbase table.
  2.Is multiple small tables vs one large table which one
 is
  preferred.
  3.for bulk load -when  LoadIncremantalHfile is run it
 again

hbase doubts

2015-08-17 Thread Shushant Arora
1. Is there any max limit on the key size of an HBase table?
2. Multiple small tables vs. one large table: which one is preferred?
3. For bulk load, when LoadIncrementalHFiles is run it recalculates the
region splits based on the region boundaries. Does this division happen on
the client side, or on the server side again (at the regionserver or HBase
master), which then assigns the splits that cross a target region boundary
to the desired regionserver?
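
For reference on #3, a minimal sketch of driving the bulk load from the
client, assuming the 1.x API (the output path and table name are
illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;

Configuration conf = HBaseConfiguration.create();
// The tool inspects each HFile's key range against the current region
// boundaries, splits any file that straddles a boundary, and then asks the
// owning regionservers to adopt the files in place.
LoadIncrementalHFiles loader = new LoadIncrementalHFiles(conf);
loader.doBulkLoad(new Path("/user/me/bulkload-output"),
    new HTable(conf, "mytable"));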


Re: hbase doubts

2015-08-17 Thread Shushant Arora
1. Is hbase.client.keyvalue.maxsize the max size of the row, or of the key
only? Is there any limit on the key size alone?
2. The access pattern is mostly key-based. Are memstores and regions on a
regionserver per-table? That is, if I have multiple tables, will there be
more memstores than if everything were in one large table?


On Mon, Aug 17, 2015 at 7:29 PM, Ted Yu yuzhih...@gmail.com wrote:

 For #1, take a look at the following in hbase-default.xml :

 <name>hbase.client.keyvalue.maxsize</name>
 <value>10485760</value>

 For #2, it would be easier to answer if you can outline access patterns in
 your app.

 For #3, adjustment according to current region boundaries is done client
 side. Take a look at the javadoc for LoadQueueItem
 in LoadIncrementalHFiles.java

 Cheers

 On Mon, Aug 17, 2015 at 6:45 AM, Shushant Arora shushantaror...@gmail.com
 
 wrote:

  1. Is there any max limit on the key size of an HBase table?
  2. Multiple small tables vs. one large table: which one is preferred?
  3. For bulk load, when LoadIncrementalHFiles is run it recalculates the
  region splits based on the region boundaries. Does this division happen
  on the client side, or on the server side again (at the regionserver or
  HBase master), which then assigns the splits that cross a target region
  boundary to the desired regionserver?
 



Re: hbase doubts

2015-08-17 Thread Shushant Arora
Thanks !
few more doubts :

1. Say the requirement is to count the distinct values of F1.

If the field is part of the key, can't HBase just scan the keys, skip value
deserialization, and return the result to the client, which will calculate
the distinct count; whereas in the second approach HBase will deserialize
and return the column containing F1 to the client, which will calculate the
distinct count?

2. For bulk load, when LoadIncrementalHFiles runs and the regionserver moves
the HFiles from HDFS into the region directory: does the regionserver
localise each HFile by downloading it locally and uploading it again into
the region directory? Or does it just move it into the region directory and
wait for the next compaction to get it localised, as in the
regionserver-failure case?
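
For reference on #1, the stock AggregateImplementation endpoint has a
client-side helper; a distinct-count would need a custom endpoint, but the
wiring looks like the bundled row-count aggregation. A minimal sketch,
assuming the coprocessor is loaded on the table and a 1.x client (table,
family, and qualifier names are illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.coprocessor.AggregationClient;
import org.apache.hadoop.hbase.client.coprocessor.LongColumnInterpreter;
import org.apache.hadoop.hbase.util.Bytes;

Configuration conf = HBaseConfiguration.create();
AggregationClient aggregationClient = new AggregationClient(conf);
Scan scan = new Scan();
scan.addColumn(Bytes.toBytes("f1"), Bytes.toBytes("q1"));
// Each regionserver counts its own rows; the client only sums the
// per-region partial results.
long rowCount = aggregationClient.rowCount(TableName.valueOf("mytable"),
    new LongColumnInterpreter(), scan);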




On Mon, Aug 17, 2015 at 11:00 PM, Ted Yu yuzhih...@gmail.com wrote:

 For both scenarios you mentioned, field is not leading part of row key.
 You would need to specify timerange or start row / stop row to narrow the
 key range being scanned.

 I am leaning toward using second approach.

 Cheers

 On Mon, Aug 17, 2015 at 9:41 AM, Shushant Arora shushantaror...@gmail.com
 
 wrote:

  ~8-10 fields of size (5 of  20 bytes each )and 3 fields of size 200 bytes
  each.
 
  On Mon, Aug 17, 2015 at 9:55 PM, Ted Yu yuzhih...@gmail.com wrote:
 
   How many fields such as F1 are you considering for embedding in row
 key ?
  
   Suggested reading:
   http://hbase.apache.org/book.html#rowkey.design
   http://hbase.apache.org/book.html#client.filter.kvm (see
   ColumnPrefixFilter)
  
   Cheers
  
   On Mon, Aug 17, 2015 at 8:13 AM, Shushant Arora 
  shushantaror...@gmail.com
   
   wrote:
  
1.so size limit is per cell's identifier + value ?
   
What is more optimise - to have field in key or in column family's
   column ?
If pattern is like every row has that field.
   
Say I have a field F1 in all rows so
Situtatio -1
key1#F1(as composite key)  - and rest fields in column
   
Situation-2
key1 as key and F1 part of column family.
   
   
This is the main reason I  asked the key size limit.
If I asked for no of rows where F1 is = 'someval' will it be faster
 in
situation-1 than in situation-2. Since in 1 it can return the result
  just
by traversing keys no need to read columns?
   
   
On Mon, Aug 17, 2015 at 8:27 PM, Ted Yu yuzhih...@gmail.com wrote:
   
 For #1, it is the limit on a single keyvalue, not row, not key.

 For #2, please see the following:

 http://hbase.apache.org/book.html#store.memstore

  
 http://hbase.apache.org/book.html#regionserver_splitting_implementation

 Cheers

 On Mon, Aug 17, 2015 at 7:36 AM, Shushant Arora 
shushantaror...@gmail.com
 
 wrote:

  1.Is hbase.client.keyvalue.maxsize  is max size of row or key
 only
  ?
   Is
  there any limit on key size only ?
  2.Access pattern is mostly on key based only- Is memstores and
   regions
 on a
  regionserver are per table basis? Is it if I have multiple tables
  it
will
  have multiple memstores instead of few if it would have been one
   large
  table ?
 
 
  On Mon, Aug 17, 2015 at 7:29 PM, Ted Yu yuzhih...@gmail.com
  wrote:
 
   For #1, take a look at the following in hbase-default.xml :
  
   <name>hbase.client.keyvalue.maxsize</name>
   <value>10485760</value>
  
   For #2, it would be easier to answer if you can outline access
patterns
  in
   your app.
  
   For #3, adjustment according to current region boundaries is
 done
 client
   side. Take a look at the javadoc for LoadQueueItem
   in LoadIncrementalHFiles.java
  
   Cheers
  
   On Mon, Aug 17, 2015 at 6:45 AM, Shushant Arora 
  shushantaror...@gmail.com
   
   wrote:
  
1.Is there any max limit on key size of hbase table.
2.Is multiple small tables vs one large table which one is
preferred.
3.for bulk load -when  LoadIncremantalHfile is run it again
  recalculates
the region splits based on region boundary - is this division
happens
  on
client side or server side again at region server or hbase
  master
and
   then
it assigns the splits which cross target region boundary to
   desired
regionserver.
   
  
 

   
  
 


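On question 1 above, the key-only scan is expressible with server-side
filters, so only row keys cross the wire and the client computes the
distinct count. A minimal sketch, assuming the HBase 1.x client API and a
hypothetical table 'mytable' whose row keys have the layout key1#F1:

import java.util.HashSet;
import java.util.Set;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.filter.FilterList;
import org.apache.hadoop.hbase.filter.FirstKeyOnlyFilter;
import org.apache.hadoop.hbase.filter.KeyOnlyFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class DistinctF1FromKeys {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table table = conn.getTable(TableName.valueOf("mytable"))) {
      Scan scan = new Scan();
      // KeyOnlyFilter strips cell values server side; FirstKeyOnlyFilter
      // returns just one cell per row, so each row costs one tiny KeyValue.
      scan.setFilter(new FilterList(new FirstKeyOnlyFilter(), new KeyOnlyFilter()));
      Set<String> distinct = new HashSet<>();
      try (ResultScanner rs = table.getScanner(scan)) {
        for (Result r : rs) {
          String row = Bytes.toString(r.getRow());
          // Assumed key layout key1#F1: take the part after '#'.
          distinct.add(row.substring(row.indexOf('#') + 1));
        }
      }
      System.out.println("distinct F1 values: " + distinct.size());
    }
  }
}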

Re: hbase doubts

2015-08-17 Thread Shushant Arora
~8-10 fields (5 of ~20 bytes each) and 3 fields of 200 bytes each.

On Mon, Aug 17, 2015 at 9:55 PM, Ted Yu yuzhih...@gmail.com wrote:

 How many fields such as F1 are you considering for embedding in row key ?

 Suggested reading:
 http://hbase.apache.org/book.html#rowkey.design
 http://hbase.apache.org/book.html#client.filter.kvm (see
 ColumnPrefixFilter)

 Cheers

 On Mon, Aug 17, 2015 at 8:13 AM, Shushant Arora shushantaror...@gmail.com
 
 wrote:

  1.so size limit is per cell's identifier + value ?
 
  What is more optimise - to have field in key or in column family's
 column ?
  If pattern is like every row has that field.
 
  Say I have a field F1 in all rows so
  Situtatio -1
  key1#F1(as composite key)  - and rest fields in column
 
  Situation-2
  key1 as key and F1 part of column family.
 
 
  This is the main reason I  asked the key size limit.
  If I asked for no of rows where F1 is = 'someval' will it be faster in
  situation-1 than in situation-2. Since in 1 it can return the result just
  by traversing keys no need to read columns?
 
 
  On Mon, Aug 17, 2015 at 8:27 PM, Ted Yu yuzhih...@gmail.com wrote:
 
   For #1, it is the limit on a single keyvalue, not row, not key.
  
   For #2, please see the following:
  
   http://hbase.apache.org/book.html#store.memstore
  
 http://hbase.apache.org/book.html#regionserver_splitting_implementation
  
   Cheers
  
   On Mon, Aug 17, 2015 at 7:36 AM, Shushant Arora 
  shushantaror...@gmail.com
   
   wrote:
  
1.Is hbase.client.keyvalue.maxsize  is max size of row or key only ?
 Is
there any limit on key size only ?
2.Access pattern is mostly on key based only- Is memstores and
 regions
   on a
regionserver are per table basis? Is it if I have multiple tables it
  will
have multiple memstores instead of few if it would have been one
 large
table ?
   
   
On Mon, Aug 17, 2015 at 7:29 PM, Ted Yu yuzhih...@gmail.com wrote:
   
 For #1, take a look at the following in hbase-default.xml :

 <name>hbase.client.keyvalue.maxsize</name>
 <value>10485760</value>

 For #2, it would be easier to answer if you can outline access
  patterns
in
 your app.

 For #3, adjustment according to current region boundaries is done
   client
 side. Take a look at the javadoc for LoadQueueItem
 in LoadIncrementalHFiles.java

 Cheers

 On Mon, Aug 17, 2015 at 6:45 AM, Shushant Arora 
shushantaror...@gmail.com
 
 wrote:

  1.Is there any max limit on key size of hbase table.
  2.Is multiple small tables vs one large table which one is
  preferred.
  3.for bulk load -when  LoadIncremantalHfile is run it again
recalculates
  the region splits based on region boundary - is this division
  happens
on
  client side or server side again at region server or hbase master
  and
 then
  it assigns the splits which cross target region boundary to
 desired
  regionserver.
 

   
  
 



Re: hbase doubts

2015-08-17 Thread Shushant Arora
1.So the size limit is per cell - identifier + value?

Which is more optimal - having the field in the key, or in a column
family's column? The pattern is that every row has that field.

Say I have a field F1 in all rows, so:
Situation-1
key1#F1 (as composite key) - and the rest of the fields as columns

Situation-2
key1 as the key and F1 as a column in the column family.


This is the main reason I asked about the key size limit.
If I ask for the number of rows where F1 = 'someval', will it be faster in
situation-1 than in situation-2? Since in 1 it can return the result just
by traversing keys, with no need to read columns.


On Mon, Aug 17, 2015 at 8:27 PM, Ted Yu yuzhih...@gmail.com wrote:

 For #1, it is the limit on a single keyvalue, not row, not key.

 For #2, please see the following:

 http://hbase.apache.org/book.html#store.memstore
 http://hbase.apache.org/book.html#regionserver_splitting_implementation

 Cheers

 On Mon, Aug 17, 2015 at 7:36 AM, Shushant Arora shushantaror...@gmail.com
 
 wrote:

  1.Is hbase.client.keyvalue.maxsize  is max size of row or key only ? Is
  there any limit on key size only ?
  2.Access pattern is mostly on key based only- Is memstores and regions
 on a
  regionserver are per table basis? Is it if I have multiple tables it will
  have multiple memstores instead of few if it would have been one large
  table ?
 
 
  On Mon, Aug 17, 2015 at 7:29 PM, Ted Yu yuzhih...@gmail.com wrote:
 
   For #1, take a look at the following in hbase-default.xml :
  
   <name>hbase.client.keyvalue.maxsize</name>
   <value>10485760</value>
  
   For #2, it would be easier to answer if you can outline access patterns
  in
   your app.
  
   For #3, adjustment according to current region boundaries is done
 client
   side. Take a look at the javadoc for LoadQueueItem
   in LoadIncrementalHFiles.java
  
   Cheers
  
   On Mon, Aug 17, 2015 at 6:45 AM, Shushant Arora 
  shushantaror...@gmail.com
   
   wrote:
  
1.Is there any max limit on key size of hbase table.
2.Is multiple small tables vs one large table which one is preferred.
3.for bulk load -when  LoadIncremantalHfile is run it again
  recalculates
the region splits based on region boundary - is this division happens
  on
client side or server side again at region server or hbase master and
   then
it assigns the splits which cross target region boundary to desired
regionserver.
   
  
 


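For contrast with situation-1 above, situation-2 (F1 stored as a column)
forces a server-side value filter, so every row's F1 cell is read and
compared before the row is returned. A minimal sketch, assuming the HBase
1.x client API and hypothetical names 'mytable', 'cf', 'F1':

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.filter.CompareFilter.CompareOp;
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class CountRowsByF1Value {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table table = conn.getTable(TableName.valueOf("mytable"))) {
      Scan scan = new Scan();
      // The F1 cell of every row must be deserialised and compared
      // server side before the row qualifies.
      SingleColumnValueFilter f = new SingleColumnValueFilter(
          Bytes.toBytes("cf"), Bytes.toBytes("F1"),
          CompareOp.EQUAL, Bytes.toBytes("someval"));
      f.setFilterIfMissing(true); // skip rows that lack F1 entirely
      scan.setFilter(f);
      long count = 0;
      try (ResultScanner rs = table.getScanner(scan)) {
        for (Result ignored : rs) count++;
      }
      System.out.println("rows with F1=someval: " + count);
    }
  }
}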

bulk load doubts

2015-07-21 Thread Shushant Arora
1.Do bulk-loaded HFiles not get replicated? Does that mean that if a
regionserver goes down, all HFiles which were bulk loaded to that server
are lost, irrespective of HDFS replication being set to 3? If yes - why are
bulk-loaded HFiles not replicated?

2.Is there any issue with a timestamp prefix as the key of a table that is
written via bulk load?

3.In a bulk-load MR job using HFileOutputFormat2 as the output format, will
a single HFile be created per region, or can there be multiple HFiles per
region? If multiple, does LoadIncrementalHFiles merge these HFiles into one
while loading them into the same region, or just do a simple copy?

4.Is there any performance issue if I run a bulk load every 5 sec,
containing ~20MB of data? Does it create frequent compactions, and does
that lead to performance issues?

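A minimal sketch of the load step being discussed, assuming an HBase 1.1+
client API and a hypothetical HFile directory produced by an
HFileOutputFormat2 job:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.RegionLocator;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;

public class RunBulkLoad {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    TableName tn = TableName.valueOf("mytable");
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table table = conn.getTable(tn);
         RegionLocator locator = conn.getRegionLocator(tn);
         Admin admin = conn.getAdmin()) {
      LoadIncrementalHFiles loader = new LoadIncrementalHFiles(conf);
      // HFiles are moved (renamed) into region directories, not copied;
      // a file straddling a region boundary is split client side first.
      loader.doBulkLoad(new Path("/tmp/hfiles"), admin, table, locator);
    }
  }
}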

hbase doubts

2015-07-16 Thread Shushant Arora
Is bulk put supported in hbase?

And in an MR job, when we put into a table using TableOutputFormat, how is
it more efficient than normal puts by individual reducers? Does
TableOutputFormat not do puts one by one?

And in a bulk-load hadoop job, when we specify HFileOutputFormat, does the
job create HFiles partitioned by the regionserver where they will finally
land, or just in sorted order, with the hbase utility LoadIncrementalHFiles
then working out which regionserver the keys of each HFile should go to by
parsing the HFile, instead of just dumping the HFiles?

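On the first question, the client API does take puts in batches; a minimal
sketch, assuming the HBase 1.x client API and a hypothetical table
'mytable' with family 'cf'. TableOutputFormat similarly buffers puts client
side and flushes them in batches rather than issuing one RPC per put:

import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class BatchedPuts {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table table = conn.getTable(TableName.valueOf("mytable"))) {
      List<Put> batch = new ArrayList<>();
      for (int i = 0; i < 1000; i++) {
        Put p = new Put(Bytes.toBytes(String.format("row%06d", i)));
        p.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("q"), Bytes.toBytes("v" + i));
        batch.add(p);
      }
      // Puts are grouped by region server, so the batch costs a few
      // multi RPCs instead of one RPC per Put.
      table.put(batch);
    }
  }
}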

Hbase master selection doubt

2015-06-27 Thread Shushant Arora
How does Hbase use Zookeeper for master selection and region server failure
detection when Zookeeper is not strictly consistent?

Say, in the Hbase master selection process, how is a node 100% sure that a
master has been created? Does it have to create the /master node itself, so
that if the node already exists it gets a node-exists exception? Because by
only reading (ls /) it may get stale data and conclude the node does not
exist when /master was actually present.

Is there any issue with the non-strict consistency of Zookeeper for Hbase?

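A minimal sketch of the create-or-fail pattern asked about, using the plain
ZooKeeper Java client; the path /master is illustrative (HBase's own master
znode sits under its zookeeper.znode.parent directory):

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class MasterElection {
  public static void main(String[] args) throws Exception {
    ZooKeeper zk = new ZooKeeper("zkhost:2181", 30000, event -> {});
    try {
      // Writes are linearized through the ZooKeeper leader, so exactly
      // one contender succeeds no matter how stale any follower's reads
      // are; every other contender gets NodeExistsException.
      zk.create("/master", "server1".getBytes(),
          ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
      System.out.println("won the election");
    } catch (KeeperException.NodeExistsException e) {
      // Lost: watch /master and retry when the ephemeral node vanishes.
      System.out.println("master already exists");
    }
    zk.close();
  }
}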

Re: Hbase master selection doubt

2015-06-27 Thread Shushant Arora
Zookeeper gives Sequential Consistency:
updates from a client will be applied in the order that they were sent.

On Sat, Jun 27, 2015 at 8:18 PM, Ted Yu yuzhih...@gmail.com wrote:

 bq. non strictly consistency of Zookeeper

 Can you elaborate on what the above means ?

 please read this:

 http://zookeeper.apache.org/doc/trunk/zookeeperProgrammers.html#ch_zkGuarantees

 Cheers

 On Sat, Jun 27, 2015 at 7:20 AM, Shushant Arora shushantaror...@gmail.com
 
 wrote:

  How Hbase uses Zookeeper for Master selection and region server failure
  detection when Zookeeper is not strictly consistent.
 
  Say In Hbase Master selection process, how does a node is 100 % sure
 that a
  master is created ? Does it has to create the /master node and that node
  already exists will throw node exists exception .  Since only by reading
 (ls
  /) . It may get stale data and gets node does not exists.but in actual
  /master was present.
 
  Does there any issue with non strictly consistency of Zookeeper for
 Hbase?
 



Re: Hbase master selection doubt

2015-06-27 Thread Shushant Arora
By strictly consistent I mean that all clients should see the same data at
any time, across different sessions.

Say a client C1 is connected to follower F1, and F1 is a few seconds behind
the leader, while client C2 connects to F2, which is in sync with the
leader. Now C1 and C2 will see different data under the root dir - say
/master is visible to C2 but not to C1 - until F1 catches up with the
leader.

On Sat, Jun 27, 2015 at 8:23 PM, Shushant Arora shushantaror...@gmail.com
wrote:

 Zookeeper is Sequential Consistency
 Updates from a client will be applied in the order that they were sent.

 On Sat, Jun 27, 2015 at 8:18 PM, Ted Yu yuzhih...@gmail.com wrote:

 bq. non strictly consistency of Zookeeper

 Can you elaborate on what the above means ?

 please read this:

 http://zookeeper.apache.org/doc/trunk/zookeeperProgrammers.html#ch_zkGuarantees

 Cheers

 On Sat, Jun 27, 2015 at 7:20 AM, Shushant Arora 
 shushantaror...@gmail.com
 wrote:

  How Hbase uses Zookeeper for Master selection and region server failure
  detection when Zookeeper is not strictly consistent.
 
  Say In Hbase Master selection process, how does a node is 100 % sure
 that a
  master is created ? Does it has to create the /master node and that node
  already exists will throw node exists exception .  Since only by reading
 (ls
  /) . It may get stale data and gets node does not exists.but in actual
  /master was present.
 
  Does there any issue with non strictly consistency of Zookeeper for
 Hbase?
 



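For the staleness scenario above, ZooKeeper's remedy is sync(): a client
that needs a fresh view forces the follower serving its session to catch up
with the leader before reading. A minimal sketch with the plain ZooKeeper
Java client; the path is illustrative:

import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.ZooKeeper;

public class FreshRead {
  public static void main(String[] args) throws Exception {
    ZooKeeper zk = new ZooKeeper("zkhost:2181", 30000, event -> {});
    CountDownLatch synced = new CountDownLatch(1);
    // The callback fires only after the follower has applied all
    // updates the leader had committed at the time of the sync.
    zk.sync("/master", (rc, path, ctx) -> synced.countDown(), null);
    synced.await();
    // Reads issued now see at least that fresh a view.
    System.out.println("master znode stat: " + zk.exists("/master", false));
    zk.close();
  }
}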


Re: avoiding hot spot for timestamp prefix key

2015-05-22 Thread Shushant Arora
the guid changes with every key; the pattern is
2015-05-22 00:02:01#AB12EC945
2015-05-22 00:02:02#CD9870001234AB457

When we specify a custom split algorithm, can it happen that keys of the
same sort-order range, say (1-7), lie in region R1 as well as in region R2?
Then how will the .META. table resolve lookups at read time - say I search
for key 3, will it search in both regions R1 and R2?

On Fri, May 22, 2015 at 10:48 AM, Ted Yu yuzhih...@gmail.com wrote:

 Does guid change with every key ?

 bq. use second part of key

 I don't think so. Suppose first row in the parent region is
 '1432104178817#321'. After split, the first row in first daughter region
 would still be '1432104178817#321'. Right ?

 Cheers

 On Thu, May 21, 2015 at 9:57 PM, Shushant Arora shushantaror...@gmail.com
 
 wrote:

  Can I avoid hotspot of region with custom region split policy in hbase
  0.96 .
 
  Key is of the form timestamp#guid.
  So can I have custom region split policy and use second part of key (i.e)
  guid as region split criteria and avoid hot spot??
 



Re: avoiding hot spot for timestamp prefix key

2015-05-22 Thread Shushant Arora
Since the custom split policy would be based on the second part, i.e. the
guid, in which region would a key whose first part is 2015-05-22 00:01:02
land, and how would that be identified?


On Fri, May 22, 2015 at 1:12 PM, Ted Yu yuzhih...@gmail.com wrote:

 The custom split policy needs to respect the fact that timestamp is the
 leading part of the rowkey.

 This would avoid the overlap you mentioned.

 Cheers



  On May 21, 2015, at 11:55 PM, Shushant Arora shushantaror...@gmail.com
 wrote:
 
  guid change with every key, patterns is
  2015-05-22 00:02:01#AB12EC945
  2015-05-22 00:02:02#CD9870001234AB457
 
  When we specify custom split algorithm , it may happen that keys of same
  sorting order range say (1-7) lies in region R1 as well as in region R2?
  Then how .META. table will make further lookups at read time,  say I
 search
  for key 3, then will it search in both the regions R1 and R2 ?
 
  On Fri, May 22, 2015 at 10:48 AM, Ted Yu yuzhih...@gmail.com wrote:
 
  Does guid change with every key ?
 
  bq. use second part of key
 
  I don't think so. Suppose first row in the parent region is
  '1432104178817#321'. After split, the first row in first daughter region
  would still be '1432104178817#321'. Right ?
 
  Cheers
 
  On Thu, May 21, 2015 at 9:57 PM, Shushant Arora 
 shushantaror...@gmail.com
  wrote:
 
  Can I avoid hotspot of region with custom region split policy in hbase
  0.96 .
 
  Key is of the form timestamp#guid.
  So can I have custom region split policy and use second part of key
 (i.e)
  guid as region split criteria and avoid hot spot??
 



avoiding hot spot for timestamp prefix key

2015-05-21 Thread Shushant Arora
Can I avoid region hotspotting with a custom region split policy in hbase
0.96?

The key is of the form timestamp#guid.
So can I have a custom region split policy that uses the second part of the
key (i.e. the guid) as the region split criterion, and thereby avoid the
hot spot?

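Since a region must hold one contiguous, sorted key range, no split policy
can split on the second (guid) part alone; the usual workaround for a
timestamp-leading key is salting rather than a custom split policy. A
sketch of the idea (the bucket count and key layout are assumptions, not an
HBase API):

import org.apache.hadoop.hbase.util.Bytes;

public class SaltedKey {
  // One or a few buckets per regionserver is a common starting point;
  // pre-split the table on the bucket prefix.
  static final int BUCKETS = 16;

  // timestamp#guid -> NN#timestamp#guid, NN derived from the guid so
  // concurrent writes spread across BUCKETS key ranges.
  static byte[] salt(String timestamp, String guid) {
    int bucket = (guid.hashCode() & 0x7fffffff) % BUCKETS;
    return Bytes.toBytes(String.format("%02d#%s#%s", bucket, timestamp, guid));
  }

  public static void main(String[] args) {
    System.out.println(Bytes.toString(salt("2015-05-22 00:02:01", "AB12EC945")));
    // Trade-off: a time-range read must now scan all BUCKETS ranges
    // [NN#start, NN#stop) and merge the results client side.
  }
}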

default no of reducers

2015-04-28 Thread Shushant Arora
In a normal MR job, can I configure a cluster-wide default number of
reducers that applies when I don't specify any reducers in my job?

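Yes - a cluster-wide default can be set in mapred-site.xml; a snippet
assuming Hadoop 2 property names (Hadoop 1 used mapred.reduce.tasks), with
8 as an example value:

<property>
  <name>mapreduce.job.reduces</name>
  <value>8</value>
  <description>Used when a job does not call job.setNumReduceTasks();
  the shipped default is 1.</description>
</property>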

Re: pre split region server

2014-07-16 Thread Shushant Arora
Thanks!
Few more doubts

1.When I don't supply SPLITS at table creation, all put operations will go
to one region only.
 But when the region grows beyond hbase.hregion.max.filesize and 2 regions
are created, will both hold half the data, or will one be empty initially?
2.If both have 50-50% data and the row key is monotonically increasing,
will 1 region stay half filled forever and never be filled again?
3.When pre-splitting a table, is the only way to specify row boundaries and
key prefixes? Say I don't know the key ranges - in my case the key is a
GUID, a 32-character hexadecimal string - what should the region split
boundaries be? And how many splits should be created - is it equal to the
number of regionservers aka datanodes?
4.For keys of the type ACTIVITYTYPE-DATE (where activity type has 2 values,
1.login and 2.logout), what should the split strategy be?



On Tue, Jul 15, 2014 at 7:03 PM, Ted Yu yuzhih...@gmail.com wrote:

 Shushant:
 For #2, if table has only one region, the hosting region server would
 receive all writes.
 For #4, yes - presplitting goes with fixed number of regions.

 Cheers


 On Tue, Jul 15, 2014 at 6:23 AM, sudhakara st sudhakara...@gmail.com
 wrote:

  You can find info here
  http://hbase.apache.org/book/rowkey.design.html#rowkey.regionsplits
  http://hortonworks.com/blog/apache-hbase-region-splitting-and-merging/
 
 
  On Tue, Jul 15, 2014 at 6:40 PM, Shushant Arora 
 shushantaror...@gmail.com
  
  wrote:
 
   1.How to split region servers at table definition time?
  
   2.Will hbase write onto only one region server when no splits are
 defined
   even if key is not monotonically increasing?
  
   3. When does a region split occurs.
  
   4. Will no of regions be fixed when hbase table is presplitted at table
   creation time.
  
 
 
 
  --
 
  Regards,
  ...sudhakara
 



Re: pre split region server

2014-07-16 Thread Shushant Arora
Thanks Ted.

Can you give the shell syntax for #3 at table creation time?


On Wed, Jul 16, 2014 at 1:52 PM, Ted Yu yuzhih...@gmail.com wrote:

 For #1, the two regions would contain roughly half the data.

 For #2, 1 region would not receive new data. As you see, such schema
 design is suboptimal.

 For #3, you can split the key space evenly. Using number of region servers
 as number of splits is Okay.

 Cheers

 On Jul 16, 2014, at 12:25 AM, Shushant Arora shushantaror...@gmail.com
 wrote:

  Thanks!
  Few more doubts
 
  1.When I don't supply SPLITS at table creation , all put operation will
 go
  to one region only.
  But when region grows more than hbase.hregion.max.filesize , then 2
  regions will be created both have half-half data or another will be empty
  initially?
  2.If both have 50-50% data and row key is monotonically increasing then 1
  region will be half filled always and will never be filled again ?
  3.While prespliting table only way is to specify row boundaries and key
  prefixes  ?Say if i don't know key ranges , as in my case its GUID
  hexadecimal 32 character string , what should be region split boundary ?
  and How many splits should be created - is it equal to no of regionserver
  aka datanodes ?
  4.For keys of type ACTIVITYTYPE-DATE (where activity type has 2 values
  1.login 2.logout) what should be split strategy ?
 
 
 
  On Tue, Jul 15, 2014 at 7:03 PM, Ted Yu yuzhih...@gmail.com wrote:
 
  Shushant:
  For #2, if table has only one region, the hosting region server would
  receive all writes.
  For #4, yes - presplitting goes with fixed number of regions.
 
  Cheers
 
 
  On Tue, Jul 15, 2014 at 6:23 AM, sudhakara st sudhakara...@gmail.com
  wrote:
 
  You can find info here
  http://hbase.apache.org/book/rowkey.design.html#rowkey.regionsplits
  http://hortonworks.com/blog/apache-hbase-region-splitting-and-merging/
 
 
  On Tue, Jul 15, 2014 at 6:40 PM, Shushant Arora 
  shushantaror...@gmail.com
  wrote:
 
  1.How to split region servers at table definition time?
 
  2.Will hbase write onto only one region server when no splits are
  defined
  even if key is not monotonically increasing?
 
  3. When does a region split occurs.
 
  4. Will no of regions be fixed when hbase table is presplitted at
 table
  creation time.
 
 
 
  --
 
  Regards,
  ...sudhakara
 


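For the shell syntax asked about above, a sketch assuming hypothetical
table and family names; the built-in HexStringSplit algorithm matches the
hexadecimal GUID keys discussed in the thread:

hbase> create 'mytable', 'cf', {NUMREGIONS => 16, SPLITALGO => 'HexStringSplit'}

hbase> # or with explicit lexicographic split points:
hbase> create 'mytable', 'cf', SPLITS => ['4', '8', 'c']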

pre split region server

2014-07-15 Thread Shushant Arora
1.How do I split a table across region servers at table definition time?

2.Will hbase write to only one region server when no splits are defined,
even if the key is not monotonically increasing?

3. When does a region split occur?

4. Will the number of regions be fixed when an hbase table is pre-split at
table creation time?


overriding slaves for particular job

2014-06-21 Thread Shushant Arora
Hi

Can I override the slave nodes for one of my jobs only?
Let's say I want the current job to be executed on node1 and node2 only.
If both are busy, let the job wait.

Thanks
Shushant


Re: hbase key design to efficient query on base of 2 or more column

2014-05-19 Thread Shushant Arora
I cannot apply a server-side filter.
The 2nd requirement is not just to get users with the supreme category, but
rather the distribution of users category-wise:

1.How many supreme, how many normal, and how many medium, till date.


On Mon, May 19, 2014 at 12:58 PM, Michael Segel
michael_se...@hotmail.com wrote:

 Whoa!

 BAD BOY. This isn’t a good idea for secondary index.

 You have a row key (primary index) which is time.
 The secondary is a filter… with 3 choices.

 HINT: Do you really want a secondary index based on a field that only has
 3 choices for a value?

 What are they teaching in school these days?

 How about applying a server side filter?  ;-)



 On May 18, 2014, at 12:33 PM, John Hancock jhancock1...@gmail.com wrote:

  Shushant,
 
  Here's one idea, there might be better ways.
 
  Take a look at phoenix it supports secondary indexing:
  http://phoenix.incubator.apache.org/secondary_indexing.html
 
  -John
 
 
  On Sat, May 17, 2014 at 8:34 AM, Shushant Arora
  shushantaror...@gmail.com wrote:
 
  Hi
 
  I have a requirement to query my data base on date and user category.
  User category can be Supreme,Normal,Medium.
 
  I want to query how many new users are there in my table from date range
  (2014-01-01) to (2014-05-16) category wise.
 
  Another requirement is to query how many users of Supreme category are
  there in my table Broken down wise month in which they came.
 
  What should be my key
  1.If i take key as combination of date#category. I cannot query based on
  category?
  2.If I take key as category#date I cannot query based on date.
 
 
  Thanks
  Shushant.
 




Re: hbase key design to efficient query on base of 2 or more column

2014-05-19 Thread Shushant Arora
Ok, but what if I have 2 multi-value dimensions on which I have to analyse
the number of users? Say Category can have 50 values and another dimension
is the user's country (say 100+ values). I need a weekly count by category
and country, plus an overall distinct user count by category and country.

How do I achieve this in Hbase?


On Mon, May 19, 2014 at 3:11 PM, Michael Segel michael_se...@hotmail.com wrote:

 The point is that choosing a field that has a small finite set of values
 is not a good candidate for indexing using an inverted table or b-tree etc …

 I’d say that you’re actually going to be better off using a scan with a
 start and stop row, then doing the counts on the client side.

 So as you get back your result set… you process the data. (Either in a M/R
 job or single client thread.)

 HTH

 On May 19, 2014, at 8:48 AM, Shushant Arora shushantaror...@gmail.com
 wrote:

  I cannot apply server side filter.
  2nd requirement is not just get users with supreme category rather
  distribution of users category wise.
 
  1.How many of supreme , how many of normal and how many of medium till
 date.
 
 
  On Mon, May 19, 2014 at 12:58 PM, Michael Segel
  michael_se...@hotmail.comwrote:
 
  Whoa!
 
  BAD BOY. This isn’t a good idea for secondary index.
 
  You have a row key (primary index) which is time.
  The secondary is a filter… with 3 choices.
 
  HINT: Do you really want a secondary index based on a field that only
 has
  3 choices for a value?
 
  What are they teaching in school these days?
 
  How about applying a server side filter?  ;-)
 
 
 
  On May 18, 2014, at 12:33 PM, John Hancock jhancock1...@gmail.com
 wrote:
 
  Shushant,
 
  Here's one idea, there might be better ways.
 
  Take a look at phoenix it supports secondary indexing:
  http://phoenix.incubator.apache.org/secondary_indexing.html
 
  -John
 
 
  On Sat, May 17, 2014 at 8:34 AM, Shushant Arora
  shushantaror...@gmail.com wrote:
 
  Hi
 
  I have a requirement to query my data base on date and user category.
  User category can be Supreme,Normal,Medium.
 
  I want to query how many new users are there in my table from date
 range
  (2014-01-01) to (2014-05-16) category wise.
 
  Another requirement is to query how many users of Supreme category are
  there in my table Broken down wise month in which they came.
 
  What should be my key
  1.If i take key as combination of date#category. I cannot query based
 on
  category?
  2.If I take key as category#date I cannot query based on date.
 
 
  Thanks
  Shushant.
 
 
 



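A minimal sketch of the scan-with-start/stop-row-and-count-client-side
approach suggested above, assuming the HBase 1.x client API, a hypothetical
date-leading row key of the form yyyy-MM-dd#userId, and category/country
stored as columns in family 'cf':

import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class WeeklyCategoryCountryCounts {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table table = conn.getTable(TableName.valueOf("users"))) {
      Scan scan = new Scan();
      // The date-leading key lets start/stop rows bound one week of data.
      scan.setStartRow(Bytes.toBytes("2014-05-12"));
      scan.setStopRow(Bytes.toBytes("2014-05-19"));
      scan.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("category"));
      scan.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("country"));
      Map<String, Long> counts = new HashMap<>();
      try (ResultScanner rs = table.getScanner(scan)) {
        for (Result r : rs) {
          String cat = Bytes.toString(r.getValue(Bytes.toBytes("cf"), Bytes.toBytes("category")));
          String ctry = Bytes.toString(r.getValue(Bytes.toBytes("cf"), Bytes.toBytes("country")));
          counts.merge(cat + "|" + ctry, 1L, Long::sum); // tally client side
        }
      }
      counts.forEach((k, v) -> System.out.println(k + " = " + v));
    }
  }
}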

Re: hbase key design to efficient query on base of 2 or more column

2014-05-19 Thread Shushant Arora
By server-side filter, do you mean partitioning the data across multiple
hbase tables, one per category, or something else?


On Mon, May 19, 2014 at 11:05 PM, Vladimir Rodionov vrodio...@carrieriq.com
 wrote:

  I cannot apply server side filter.

 Why is that? Are you using stock HBase or some other, API - compatible
 product?


 Best regards,
 Vladimir Rodionov
 Principal Platform Engineer
 Carrier IQ, www.carrieriq.com
 e-mail: vrodio...@carrieriq.com
 
 From: Shushant Arora [shushantaror...@gmail.com]
 Sent: Monday, May 19, 2014 12:48 AM
 To: user@hbase.apache.org
 Subject: Re: hbase key design to efficient query on base of 2 or more
 column

 I cannot apply server side filter.
 2nd requirement is not just get users with supreme category rather
 distribution of users category wise.

 1.How many of supreme , how many of normal and how many of medium till
 date.






hbase key design to efficient query on base of 2 or more column

2014-05-17 Thread Shushant Arora
Hi

I have a requirement to query my data based on date and user category.
User category can be Supreme, Normal, or Medium.

I want to query how many new users there are in my table in the date range
(2014-01-01) to (2014-05-16), category-wise.

Another requirement is to query how many users of the Supreme category
there are in my table, broken down by the month in which they came.

What should my key be?
1.If I take the key as the combination date#category, I cannot query based
on category?
2.If I take the key as category#date, I cannot query based on date.


Thanks
Shushant.


when to use hive vs hbase

2014-04-30 Thread Shushant Arora
I have a requirement to process huge weblogs on a daily basis.

1. Data will arrive incrementally in the datastore on a daily basis, and I
need cumulative and daily
distinct user counts from the logs; after that, the aggregated data will be
loaded into an RDBMS like mysql.

2. Data will be loaded into an hdfs data warehouse on a daily basis, and
the same will be fetched from the hdfs warehouse, after some filtering,
into an RDBMS like mysql and processed there.

Which data warehouse is suitable for approach 1 and 2, and why?

Thanks
Shushant


Re: when to use hive vs hbase

2014-04-30 Thread Shushant Arora
Hi Jean

Thanks for the explanation.

I still have one doubt:
why is HBase not good for bulk loads and aggregations
(full table scans)? Hive will also read each row for aggregation, just as
HBase does.
Can you explain more?


On Wed, Apr 30, 2014 at 5:15 PM, Jean-Marc Spaggiari 
jean-m...@spaggiari.org wrote:

 Hi Shushant,

 Hive and HBase are 2 different things. You can not really use one vs
 another one.

 Hive is a query engine against HDFS data. Data can be stored with different
 format like flat text, sequence files, Parquet file, or even HBase table.
 HBase is both a query engine (Get and scans) and a storage engine on top of
 HDFS which allow you to store data for random read and random write.

 Then you can also add tools like Phoenix and Impala in the picture which
 will allow you to query the data from HDFS or HBase too.

 A good way to know if HBase is a good fit or not is to ask yourself how you
 are going to write into HBase or to read from HBase. HBase is good for
 Random Reads and Random Writes. If you only do bulk loads and aggregations
 (Full table scan), HBase is not a good fit. If you do random access (Client
 information, events details, etc.) HBase is a good fit.

 It's a bit over simplified, but that should give you some starting points.


 2014-04-30 4:34 GMT-04:00 Shushant Arora shushantaror...@gmail.com:

  I have a requirement of processing huge weblogs on daily basis.
 
  1. data will come incremental to datastore on daily basis and I  need
  cumulative and daily
  distinct user count from logs and after that aggregated data will be
 loaded
  in RDBMS like mydql.
 
  2.data will be loaded in hdfs datawarehouse on daily basis and same will
 be
  fetched from Hdfs warehouse after some filtering in RDMS like mysql and
  will be processed there.
 
  Which datawarehouse is suitable for approach 1 and 2 and why?.
 
  Thanks
  Shushant
 



Re: when to use hive vs hbase

2014-04-30 Thread Shushant Arora
Thanks Jean!

A few more questions:
what are good practices for key column design in HBase?
Say my web logs contain a timestamp and a request id which uniquely
identify each row.

1.Shall I make YYYY-MM-DD-HH-MM-SS_REQ_ID the row key, in a scenario where
this data will be fetched from HBase on a daily basis and loaded into a
MySQL DB?
Daily my ETL runs and fetches records with keycol >= lastdate and
keycol <= today. Will this key design overload one region server, or will
the load be divided equally among region servers?






On Wed, Apr 30, 2014 at 5:55 PM, Jean-Marc Spaggiari 
jean-m...@spaggiari.org wrote:

 With HBase you have some overhead. The Region Server will do a lot for you.
 Manage all the column families, the columns, the delete markers, the
 compactions, etc. If you read a file directly from HDFS it will be faster
 for sure because you will not have all those validations and all this extra
 memory usage.

 HBase is absolutely perfect and is excellent to what it's build for. But if
 you are doing only full table scans, it's not it's primary usecase. It can
 still do it if you want, but if you do only that, it's not yet the most
 efficient option.

 If your usecase is a mix of full scans and random read/random writes, then
 yes, go with it!

 Last, some full table scan can be good fits with HBase if you use some of
 it's specific features like TTL on certain columns families when using more
 than 1, etc.

 HTH


 2014-04-30 8:13 GMT-04:00 Shushant Arora shushantaror...@gmail.com:

  Hi Jean
 
  Thanks for explanation .
 
  I still  have one doubt
  Why HBase is not good for bulk loads and aggregations
  (Full table scan) ? Hive will also read each row for aggregation as well
 as
  HBase .
  Can you explain more ?
 
 
  On Wed, Apr 30, 2014 at 5:15 PM, Jean-Marc Spaggiari 
  jean-m...@spaggiari.org wrote:
 
   Hi Shushant,
  
   Hive and HBase are 2 different things. You can not really use one vs
   another one.
  
   Hive is a query engine against HDFS data. Data can be stored with
  different
   format like flat text, sequence files, Parquet file, or even HBase
 table.
   HBase is both a query engine (Get and scans) and a storage engine on
 top
  of
   HDFS which allow you to store data for random read and random write.
  
   Then you can also add tools like Phoenix and Impala in the picture
 which
   will allow you to query the data from HDFS or HBase too.
  
   A good way to know if HBase is a good fit or not is to ask yourself how
  you
   are going to write into HBase or to read from HBase. HBase is good for
   Random Reads and Random Writes. If you only do bulk loads and
  aggregations
   (Full table scan), HBase is not a good fit. If you do random access
  (Client
   information, events details, etc.) HBase is a good fit.
  
   It's a bit over simplified, but that should give you some starting
  points.
  
  
   2014-04-30 4:34 GMT-04:00 Shushant Arora shushantaror...@gmail.com:
  
I have a requirement of processing huge weblogs on daily basis.
   
1. data will come incremental to datastore on daily basis and I  need
cumulative and daily
distinct user count from logs and after that aggregated data will be
   loaded
in RDBMS like mydql.
   
2.data will be loaded in hdfs datawarehouse on daily basis and same
  will
   be
fetched from Hdfs warehouse after some filtering in RDMS like mysql
 and
will be processed there.
   
Which datawarehouse is suitable for approach 1 and 2 and why?.
   
Thanks
Shushant
   
  
 



hive hbase integration

2014-04-17 Thread Shushant Arora
I want to know why hive hbase integration is required.
Is it because hbase cannot provide all the functionality of SQL, and if
yes, why?
What is a storage handler, and what are the best practices for hive hbase
integration?
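
On the storage handler question: it is the glue class that lets Hive read
and write an HBase table in place, mapping Hive columns onto HBase
column-family:qualifier pairs. A sketch of the canonical mapping, with
hypothetical table and column names:

CREATE TABLE hbase_users(key string, category string, country string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf:category,cf:country")
TBLPROPERTIES ("hbase.table.name" = "users");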