hbase uniformsplit for non hex keys
1. Can I use UniformSplit for non-hex keys? 2. If yes, how do I specify the key range for the split? 3. If no, what's the difference between HexStringSplit and UniformSplit? Thanks!
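For reference, UniformSplit treats row keys as raw bytes, so it works for arbitrary binary (non-hex) keys, while HexStringSplit assumes keys are hexadecimal strings. The core idea can be sketched in a few lines (hypothetical code, not the actual RegionSplitter implementation): divide a fixed-width byte-key range evenly using big-integer arithmetic.

```java
import java.math.BigInteger;

// Hypothetical sketch of uniform byte-range splitting: evenly divide the
// space of fixed-width byte keys between `low` and `high` into numRegions
// parts, the way UniformSplit divides the full 0x00..0xFF byte space.
public class UniformSplitSketch {
    static byte[][] split(byte[] low, byte[] high, int numRegions) {
        BigInteger lo = new BigInteger(1, low);      // treat keys as unsigned
        BigInteger hi = new BigInteger(1, high);
        BigInteger range = hi.subtract(lo);
        int width = low.length;
        byte[][] points = new byte[numRegions - 1][];
        for (int i = 1; i < numRegions; i++) {
            BigInteger p = lo.add(range.multiply(BigInteger.valueOf(i))
                                       .divide(BigInteger.valueOf(numRegions)));
            byte[] raw = p.toByteArray();
            byte[] key = new byte[width];
            // keep the low-order bytes, left-padded to the fixed key width
            int copy = Math.min(raw.length, width);
            System.arraycopy(raw, raw.length - copy, key, width - copy, copy);
            points[i - 1] = key;
        }
        return points;
    }
}
```

Splitting a one-byte key space 0x00..0xFF into 4 regions, for example, yields the boundary keys 0x3F, 0x7F, 0xBF.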
Re: hbase get and mvcc
Thanks! Are puts that fall inside the readpoint of an ongoing scan/get preserved in the HFile as well, or only in the memstore? And does this block the memstore flush until all ongoing scans are completed? On Tue, May 17, 2016 at 5:31 AM, Stack <st...@duboce.net> wrote: > On Mon, May 16, 2016 at 4:55 PM, Shushant Arora <shushantaror...@gmail.com > > > wrote: > > > Hi > > > > Hbase uses MVCC for achieving consistent result for Get operations . > > To achieve MVCC it has to maintain multiple versions of same row/cells . > > How many max version of a row/cell does Hbase keeps at any time to > support > > MVCC. > > > > Since say multiple gets started one after the other and has not completed > > yet and multiple puts are also occuring in between . Thus it maintains > all > > versions whose read point is still in use ? > > > > > Yes. > > All ongoing Gets/Scans are registered on startup with their current > readpoint (see HRegion; see constructor for HRegionScannerImpl). Any Put > that falls inside the readpoint of currently ongoing Gets/Scans will be > preserved while the Get/Scan is ongoing. > > St.Ack
hbase get and mvcc
Hi, HBase uses MVCC to achieve consistent results for Get operations. To achieve MVCC it has to maintain multiple versions of the same row/cells. How many versions of a row/cell does HBase keep at any time, at most, to support MVCC? Since, say, multiple gets started one after the other and have not completed yet, and multiple puts are also occurring in between, does it maintain all versions whose read point is still in use? Thanks!
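The readpoint mechanism asked about here can be modeled in a few lines: each write gets a monotonically increasing sequence id, and a reader fixes its readpoint when it opens, seeing only cells written at or before that point. A toy model (not HBase code, and not thread-safe; HBase's real implementation lives in MultiVersionConcurrencyControl):

```java
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;
import java.util.concurrent.atomic.AtomicLong;

// Toy model of MVCC readpoints: every cell version carries the write's
// sequence id; a reader records the current sequence id at open time and
// only ever sees versions at or below that readpoint.
public class MvccSketch {
    static final AtomicLong seq = new AtomicLong();
    // row -> (sequence id -> value); newest versions have the highest id
    static final Map<String, NavigableMap<Long, String>> store = new TreeMap<>();

    static void put(String row, String value) {
        store.computeIfAbsent(row, r -> new TreeMap<>())
             .put(seq.incrementAndGet(), value);
    }

    static long openScanner() {          // fix the reader's readpoint
        return seq.get();
    }

    static String get(String row, long readPoint) {
        NavigableMap<Long, String> versions = store.get(row);
        if (versions == null) return null;
        // newest version visible at this readpoint
        Map.Entry<Long, String> e = versions.headMap(readPoint, true).lastEntry();
        return e == null ? null : e.getValue();
    }
}
```

A put that lands after a scanner's readpoint is invisible to that scanner but visible to any scanner opened later.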
hfile v2 and bloomfilter
In HFile v2, block-level Bloom filters are stored in the scanned section along with the data blocks and the leaf index. The load-on-open section contains "Bloom filter data". What is this Bloom filter data? 1. Does it contain an index of the Bloom chunks stored in the scanned section? 2. What do the meta blocks of the non-scanned section contain? 3. Does the leaf-level index contain row keys only? Will having a tall table vs. a wide table affect the size of the leaf index? Thanks!
hbase block and columnfamily of a row
Can an HBase table with a single column family have a row spanning multiple blocks in the same HFile? Suppose there is only one HFile; in that case, is it possible that a column family with 5-6 columns spans multiple blocks? Or is a block always closed at the larger of the 64 KB default and the point at which all columns of the column family for a single row fit in that block? Thanks!
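On the block-size question: the HFile block size is a soft bound. A block is closed at a cell boundary once the configured size (64 KB by default) has been reached, so one wide row's cells can spill across several consecutive blocks, but an individual cell is never split in two. An illustrative sketch (hypothetical code, not the HFile writer):

```java
// Illustrative sketch of soft block boundaries: append cells in order and
// close the current block only after the cell that pushes it past the
// configured block size, so no cell straddles two blocks.
public class BlockWriterSketch {
    // returns how many cells land in each block
    static int[] blockCellCounts(int[] cellSizes, int blockSize) {
        int[] counts = new int[cellSizes.length]; // upper bound: one block per cell
        int blocks = 0, inBlock = 0, size = 0;
        for (int cell : cellSizes) {
            inBlock++;
            size += cell;
            if (size >= blockSize) {      // close only after a whole cell
                counts[blocks++] = inBlock;
                inBlock = 0;
                size = 0;
            }
        }
        if (inBlock > 0) counts[blocks++] = inBlock;
        int[] result = new int[blocks];
        System.arraycopy(counts, 0, result, 0, blocks);
        return result;
    }
}
```

With five 30-byte cells and a 64-byte "block size", the first block closes after the third cell (90 bytes) and the remaining two cells form a second block.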
hbase zookeeper lag
Hi, HBase uses ZooKeeper for various purposes, e.g. for region splits. The regionserver creates a znode in ZooKeeper in the splitting state and the master gets a notification for this znode. Since ZooKeeper is not fully consistent, there may be a lag between the actual znode creation and the notification, and the regionserver will have started splitting in the meantime. 1. Will this lag create an issue: the region is already split in two, but the master does not even know about it until the ZooKeeper lag clears? Also, when a regionserver goes down the master is notified, but there can be lag there too. So it can happen that a ZooKeeper node is lagging far behind, say ~2 minutes, so the master will be notified only after 2 minutes. 2. Won't this lag create an issue: the client will get "region not reachable" and retry with backoff, but the actual recovery of the regionserver will only start after 2 minutes? Thanks!
Re: hbase architecture doubts
4. Can the same row be in two blocks in an HFile, one cell in block 1 and another in block 2?
Re: hbase architecture doubts
Thanks!

1. Will a write take a lock on all the column families or just the column family being affected by the write?

2. How is eviction in LruBlockCache implemented for the in-memory and multi-access priorities? Say all elements of the in-memory priority area (25%) are more recently used than those in the single- and multi-access areas. If a new in-memory row comes in, will it evict from the in-memory area or the single-access area?

3. Why is there a single block cache per regionserver? Why not one per region?

On Sun, May 8, 2016 at 11:43 PM, Stack <st...@duboce.net> wrote:
> On Sun, May 8, 2016 at 6:12 AM, Shushant Arora <shushantaror...@gmail.com> wrote:
> > One doubt regarding locking in the memstore:
> > HBase uses an implicit row lock while applying a put operation on a row: put(byte[] rowkey).
> > When htable.put(p) is fired, the regionserver will lock the row, but get operations will not lock the row and will return the row state as it was before the put took the lock.
> > The memstore is implemented as a CSLM, so how does it return the row state previous to the put lock when a get is fired before the put has finished?
>
> Multiversion Concurrency Control. This is the core class:
> http://hbase.apache.org/xref/org/apache/hadoop/hbase/regionserver/MultiVersionConcurrencyControl.html
> See how it is used in the codebase.
>
> Ask more questions if not clear.
> St.Ack
Re: hbase architecture doubts
Thanks!

One doubt regarding locking in the memstore:

HBase uses an implicit row lock while applying a put operation on a row: put(byte[] rowkey).

When htable.put(p) is fired, the regionserver will lock the row, but get operations will not lock the row and will return the row state as it was before the put took the lock.

The memstore is implemented as a CSLM, so how does it return the row state previous to the put lock when a get is fired before the put has finished?

On Tue, May 3, 2016 at 7:41 AM, Stack <st...@duboce.net> wrote:
> On Mon, May 2, 2016 at 5:34 PM, Shushant Arora <shushantaror...@gmail.com> wrote:
> > 1. So at any time there will be two references: 1. the active memstore, 2. the snapshot memstore. The snapshot is initialized at flush time from the active memstore under a momentary lock; then reads are served using the snapshot and writes go to the new active memstore.
>
> Yes
>
> > 2. The key of the CSLS is a KeyValue. Which part of the KeyValue is used while sorting the set, the whole KeyValue or just the row key? Does the HFile have a separate entry for each KeyValue, and are KeyValues of the same row key always stored contiguously in the HFile, though possibly not in the same block?
>
> Just the row key. Value is not considered in the sort.
>
> Yes, HFile has a separate entry for each KeyValue (or 'Cell' in hbase-speak).
>
> Cells in an HFile are sorted. Those of the same or near 'Cell' coordinates will be sorted together and may therefore appear inside the same block.
>
> St.Ack
hbase doubts
1. Why is it better for read performance to have a single file per region rather than multiple files? Why can't multiple threads read multiple files and give better performance? 2. Does an HBase regionserver have a single thread for compactions and splits across all the regions it holds? Wouldn't a thread per region work better than sequential compactions/splits for all regions on a regionserver? 3. Why does HBase flush and compact all memstores of all the families of a table at the same time, irrespective of their size, when even one memstore reaches the threshold? Thanks, Shushant
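On question 1: each store file is a separate sorted run, and any of them may hold the newest version of a cell, so a read has to consult every file regardless of threading; compaction reduces that per-read fan-out by merging the runs into one. A toy k-way merge illustrating what a compaction rewrite does (hypothetical code, not the HBase compactor):

```java
// Toy compaction: merge several sorted "store files" (arrays of row keys)
// into a single sorted run, so later reads consult one file instead of many.
public class CompactionSketch {
    static String[] compact(String[][] files) {
        int total = 0;
        for (String[] f : files) total += f.length;
        int[] pos = new int[files.length];  // read cursor per input file
        String[] merged = new String[total];
        for (int out = 0; out < total; out++) {
            int best = -1;
            for (int f = 0; f < files.length; f++)  // pick the smallest head key
                if (pos[f] < files[f].length
                        && (best < 0 || files[f][pos[f]].compareTo(files[best][pos[best]]) < 0))
                    best = f;
            merged[out] = files[best][pos[best]++];
        }
        return merged;
    }
}
```

A production merge would use a heap rather than a linear scan over file heads, but the shape is the same.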
Re: hbase architecture doubts
Thanks Stack.

1. So at any time there will be two references: 1. the active memstore, 2. the snapshot memstore. The snapshot is initialized at flush time from the active memstore under a momentary lock; then reads are served using the snapshot and writes go to the new active memstore.

2. The key of the CSLS is a KeyValue. Which part of the KeyValue is used while sorting the set, the whole KeyValue or just the row key? Does the HFile have a separate entry for each KeyValue, and are KeyValues of the same row key always stored contiguously in the HFile, though possibly not in the same block?

On Tue, May 3, 2016 at 12:05 AM, Stack <st...@duboce.net> wrote:
> On Mon, May 2, 2016 at 10:06 AM, Shushant Arora <shushantaror...@gmail.com> wrote:
> > for point 2:
> > I am concerned with downtime of HBase for reads and writes.
> > If the write lock is held just for the time while we move aside the current MemStore, then when a write to a key happens it will update the memstore only, but the snapshot does not have that update; when the snapshot is dumped to an HFile, won't we lose the update?
>
> No. The update is in the new currently active MemStore. The update will be included in the next flush added to a new hfile.
>
> St.Ack
Re: hbase architecture doubts
Thanks Stack.

For point 2: I am concerned with downtime of HBase for reads and writes. If the write lock is held just for the time while we move aside the current MemStore, then when a write to a key happens it will update the memstore only, but the snapshot does not have that update; when the snapshot is dumped to an HFile, won't we lose the update?

On Mon, May 2, 2016 at 9:06 PM, Stack <st...@duboce.net> wrote:
> On Mon, May 2, 2016 at 1:25 AM, Shushant Arora <shushantaror...@gmail.com> wrote:
> > Few doubts:
> >
> > 1. An LSM tree comprises two tree-like structures <https://en.wikipedia.org/wiki/Tree_(data_structure)>, called C0 and C1, and if an insertion causes the C0 component to exceed a certain size threshold, a contiguous segment of entries is removed from C0 and merged into C1 on disk.
> >
> > But in HBase, when C0 (the memstore, I guess?) exceeds the threshold size it is dumped onto HDFS as an HFile (C1, I guess?). Is compaction the process that corresponds to merging C0 and C1 here?
>
> The 'merge' in the quoted high-level description may just mean that the dumped hfile is 'merged' with the others at read time. Or it may be as stated, that the 'merge' happens at flush time. Some LSM tree implementations do it this way -- Bigtable, and it calls the merge of memstore and a file-on-disk a form of compaction -- but this is not what HBase does; it just dumps the memstore as a flushed hfile. Later, we'll run a compaction process to merge hfiles in background.
>
> > 2. "Moves current, active Map aside as a snapshot (while a write lock is held for a short period of time), and then creates a new CSLS instance. In background, the snapshot is then dumped to disk. We get an Iterator on the CSLS. We write a block at a time. When we exceed the configured block size, we start a new one."
> >
> > Is the write lock held until the complete CSLS is dumped to disk, with reads allowed using the snapshot?
>
> No. Just while we move aside the current MemStore.
>
> What is your concern/objective? Are you studying LSM trees generally or are you worried that HBase is offline for periods of time for read and write?
>
> Thanks,
> St.Ack
Re: hbase architecture doubts
Thanks!

Few doubts:

1. An LSM tree comprises two tree-like structures <https://en.wikipedia.org/wiki/Tree_(data_structure)>, called C0 and C1, and if an insertion causes the C0 component to exceed a certain size threshold, a contiguous segment of entries is removed from C0 and merged into C1 on disk.

But in HBase, when C0 (the memstore, I guess?) exceeds the threshold size it is dumped onto HDFS as an HFile (C1, I guess?). Is compaction the process that corresponds to merging C0 and C1 here?

2. "Moves current, active Map aside as a snapshot (while a write lock is held for a short period of time), and then creates a new CSLS instance. In background, the snapshot is then dumped to disk. We get an Iterator on the CSLS. We write a block at a time. When we exceed the configured block size, we start a new one."

Is the write lock held until the complete CSLS is dumped to disk, with reads allowed using the snapshot in the meantime?

Thanks!

On Mon, May 2, 2016 at 11:39 AM, Stack <st...@duboce.net> wrote:
> On Sun, May 1, 2016 at 3:36 AM, Shushant Arora <shushantaror...@gmail.com> wrote:
> > 1. Does HBase use ConcurrentSkipListMap (CSLM) to store data in the memstore?
>
> Yes (We use a CSLS but this is implemented over a CSLM).
>
> > 2. When the memstore is flushed to HDFS, does it dump the memstore ConcurrentSkipList as an HFile v2? How does it compute blocks out of the CSLM and dump them to HDFS?
>
> Moves current, active Map aside as a snapshot (while a write lock is held for a short period of time), and then creates a new CSLS instance.
>
> In background, the snapshot is then dumped to disk. We get an Iterator on CSLS. We write a block at a time. When we exceed configured block size, we start a new one.
>
> > 3. After dumping the in-memory CSLM of the memstore to an HFile, is the memstore content discarded? And if a read request comes while the memstore is being dumped, will it be served by a copy of the memstore, or will discarding the memstore be blocked until the read request completes?
>
> Yes.
>
> We will respond using the snapshot until it has been successfully dumped. Once dumped, we'll respond using the hfile.
>
> No blocking (other than for the short period during which the snapshot is made and the file is swapped into the read path).
>
> > 4. When a read request comes, does it look in the in-memory CSLM first and then in the HFile? And what is a log-structured merge tree and how is it used in HBase?
>
> Generally, yes.
>
> Suggest you read up on LSM Trees (https://en.wikipedia.org/wiki/Log-structured_merge-tree) and if you still can't see the LSM tree in the HBase forest, ask specific questions and we'll help you out.
>
> St.Ack
hbase architecture doubts
1. Does HBase use ConcurrentSkipListMap (CSLM) to store data in the memstore? 2. When the memstore is flushed to HDFS, does it dump the memstore's ConcurrentSkipList as an HFile v2? How does it compute blocks out of the CSLM and dump them to HDFS? 3. After dumping the in-memory CSLM of the memstore to an HFile, is the memstore content discarded? And if a read request comes while the memstore is being dumped, will it be served by a copy of the memstore, or will discarding the memstore be blocked until the read request completes? 4. When a read request comes, does it look in the in-memory CSLM first and then in the HFile? And what is a log-structured merge tree and how is it used in HBase? Thanks!
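The active/snapshot flush dance described in the replies earlier in this thread can be modeled roughly like this (a toy sketch, not HBase code; the real memstore also tracks sequence ids, sizes, and the write-ahead log):

```java
import java.util.concurrent.ConcurrentSkipListMap;

// Rough model of the memstore flush: writes go to the active
// ConcurrentSkipListMap; a flush moves it aside as a snapshot under a
// brief lock and starts a fresh active map; reads consult both until the
// snapshot has been written out as an hfile and dropped.
public class MemstoreSketch {
    volatile ConcurrentSkipListMap<String, String> active = new ConcurrentSkipListMap<>();
    volatile ConcurrentSkipListMap<String, String> snapshot = new ConcurrentSkipListMap<>();

    void put(String row, String value) {
        active.put(row, value);               // never blocked by a flush in progress
    }

    synchronized ConcurrentSkipListMap<String, String> snapshotForFlush() {
        snapshot = active;                    // move the active map aside
        active = new ConcurrentSkipListMap<>(); // new writes land here
        return snapshot;                      // background thread dumps this to disk
    }

    String get(String row) {                  // newest data wins: check active first
        String v = active.get(row);
        return v != null ? v : snapshot.get(row);
    }
}
```

Note the only exclusion is the brief synchronized swap; writes arriving during the flush land in the new active map and are picked up by the next flush.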
Re: hbase custom scan
The table will have ~100 regions. I didn't get the advantage of the top rows coming from the same vs. different regions? They will come from different regions.

On Tue, Apr 5, 2016 at 9:10 AM, Ted Yu <yuzhih...@gmail.com> wrote:
> How many regions does your table have ?
>
> After sorting, is there a chance that the top N rows come from distinct regions ?
hbase custom scan
Hi, I have a requirement to scan an HBase table based on insertion timestamp. I need to fetch the keys sorted by insertion timestamp, not by key. I can't make the timestamp a prefix of the key, to avoid hotspotting. Is there any efficient way to meet this requirement? Thanks!
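One common workaround (a suggestion of mine, not something from this thread) is to salt a timestamp-prefixed key with a small bucket prefix, so writes spread across N regions; reading in timestamp order then means issuing one range scan per bucket and merging the results back together. A sketch of such a key layout, with hypothetical field widths:

```java
// Hypothetical salted-key layout: a 2-digit bucket prefix derived from the
// guid spreads writes over `buckets` key ranges, while fixed-width fields
// keep lexicographic order equal to timestamp order within each bucket.
public class SaltedKeySketch {
    static String saltedKey(long timestamp, String guid, int buckets) {
        int bucket = Math.floorMod(guid.hashCode(), buckets);
        // %02d bucket, %019d zero-padded epoch millis, then the guid
        return String.format("%02d-%019d-%s", bucket, timestamp, guid);
    }
}
```

The cost is read-side fan-out: a timestamp-ordered read becomes N per-bucket scans merged by timestamp, which is the usual trade against write hotspotting.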
hbase scan doubts
Is an HBase scan or get single-threaded? Say I have an HBase table on 100 regionservers. When I scan a key range, say a-z (distributed across all regionservers), will the client make calls to the regionservers in parallel all at once, or one by one: first getting all keys from one regionserver, then making the next call to another regionserver in lexicographic order of keys? If it makes calls in parallel, how does it ensure the result is always sorted by key? Thanks!
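For what it's worth, a plain scan proceeds region by region in start-key order, which is exactly how the result stays sorted without any merge step. A toy model (not the HBase client) of why concatenating per-region results preserves global order:

```java
// Toy model of a sequential scan: regions partition the key space in key
// order, so visiting them one at a time (one "RPC" per region) and
// concatenating the already-sorted per-region rows yields a globally
// sorted result with no parallel merge needed.
public class SequentialScanSketch {
    static String[] scan(String[][] regionsInKeyOrder, String startRow, String stopRow) {
        StringBuilder joined = new StringBuilder();
        for (String[] region : regionsInKeyOrder)          // one call per region, in order
            for (String row : region)
                if (row.compareTo(startRow) >= 0 && row.compareTo(stopRow) < 0) {
                    if (joined.length() > 0) joined.append(',');
                    joined.append(row);
                }
        return joined.length() == 0 ? new String[0] : joined.toString().split(",");
    }
}
```

Parallel per-region scans are possible (that is essentially what MapReduce over HBase does), but then the client must merge or forgo global key order.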
Re: use of hbase client in application server
2. Do I need to check whether the HConnection is still active before using it to create an HTable instance? By "still valid" I meant: say I created the HConnection object, and after 3-4 minutes a request comes for a CRUD operation on some table; before I get the HTable from the HConnection, say the HConnection object or the TCP/IP connection to the cluster dropped. If I now create an HTable using the HConnection, will it re-create the TCP connection to the cluster automatically?

1. And will increasing the number of HConnections improve performance?

Thanks!

On Sun, Mar 13, 2016 at 7:47 PM, Ted Yu <yuzhih...@gmail.com> wrote:
> For #1, single Hconnection should work.
>
> For #2, can you clarify ? As long as the hbase-site.xml used to create the Hconnection is still valid, you can continue using the connection.
>
> For #3, they're handled by the connection automatically.
>
> For #4, the HTable ctor you cited doesn't exist in master branch. You can control the following parameters for the ThreadPoolExecutor - see HTable#getDefaultExecutor():
>
> int maxThreads = conf.getInt("hbase.htable.threads.max", Integer.MAX_VALUE);
> if (maxThreads == 0) {
>   maxThreads = 1; // is there a better default?
> }
> int corePoolSize = conf.getInt("hbase.htable.threads.coresize", 1);
> long keepAliveTime = conf.getLong("hbase.htable.threads.keepalivetime", 60);
use of hbase client in application server
I have a requirement to use a long-running HBase client in an application server. 1. Do I need to create multiple HConnections, or will a single HConnection work? 2. Do I need to check whether the HConnection is still active before using it to create an HTable instance? 3. Do I need to handle region splits and regionserver changes while using an HConnection, or are they handled automatically? 4. What is the use of the thread pool in an HTable instance? ExecutorService threadPool; HTable h = new HTable(conf, Bytes.toBytes("tablename"), threadPool); Thanks!
Re: disable major compaction per table
Thanks! Does HBase compress repeated values in keys and columns? Say a column "location" has the value ASIA: will that value be repeated with each key, or will HBase's Snappy compression handle it? Does the same apply for repeated values of a column? Thanks!

On Wed, Feb 17, 2016 at 7:14 AM, Ted Yu <yuzhih...@gmail.com> wrote:
> bq. hbase.hregion.majorcompaction = 0 per table/column family
>
> I searched code base but didn't find relevant test case for the above. Mind giving me some pointer ?
>
> Thanks
>
> On Tue, Feb 16, 2016 at 5:38 PM, Vladimir Rodionov <vladrodio...@gmail.com> wrote:
> > 1.does major compaction in hbase runs per table basis.
> >
> > Per Region
> >
> > 2.By default every 24 hours?
> >
> > In older versions - yes. Current (1.x+) - 7 days
> >
> > 3.Can I disable automatic major compaction for few tables while keep it enable for rest of tables?
> >
> > yes, you can. You can set
> >
> > hbase.hregion.majorcompaction = 0 per table/column family
> >
> > 4.Does hbase put ,get and delete are blocked while major compaction and are working in minor compaction?
> >
> > No, they are not.
> >
> > -Vlad
> >
> > On Tue, Feb 16, 2016 at 4:51 PM, Ted Yu <yuzhih...@gmail.com> wrote:
> > > For #2, see http://hbase.apache.org/book.html#managed.compactions
> > >
> > > For #3, I don't think so.
disable major compaction per table
Hi 1. Does major compaction in HBase run on a per-table basis? 2. By default every 24 hours? 3. Can I disable automatic major compaction for a few tables while keeping it enabled for the rest? 4. Are HBase put, get and delete blocked during major compaction, and do they keep working during minor compaction? Thanks
timestamp/ttl of a cell
Hi Can the TTL of individual rows be set/updated instead of for the complete column family? Or can the timestamp of a cell be decreased? The aim is to delete some rows by setting their timestamps to old values so that they fall past the column family's TTL, in case a TTL cannot be specified per row/cell.
Re: timestamp/ttl of a cell
Thanks! What's the syntax to set it in the shell and in Java? On Wed, Nov 25, 2015 at 6:05 PM, Jean-Marc Spaggiari < jean-m...@spaggiari.org> wrote: > This? HBASE-10560 > > 2015-11-25 6:45 GMT-05:00 Shushant Arora <shushantaror...@gmail.com>: > > > Hi > > > > Can TTL of rows be set/updated instead of complete column family? > > or > > Can timestamp version of a cell be decreased ? Aim is to delete some rows > > whose timestamp > > is set to old values so that it matches TTL of column family if tTL of > > row/cell cannot be specified. > > >
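Following up on the HBASE-10560 pointer: in Java the per-cell TTL is set with Mutation#setTTL (milliseconds; needs 0.98+ and HFile v3). A sketch, with table/family/qualifier names as placeholders:

```java
// Sketch: requires hbase-client 0.98+. The TTL is in milliseconds and
// applies to all cells of this Put, relative to their timestamps.
Put put = new Put(Bytes.toBytes("row1"));
put.add(Bytes.toBytes("cf"), Bytes.toBytes("q"), Bytes.toBytes("v"));
put.setTTL(60 * 1000L);  // these cells expire after one minute
table.put(put);
```

In the shell, recent versions accept a TTL option on put, e.g. `put 't', 'r', 'cf:q', 'v', {TTL => 60000}`, though the exact syntax depends on the HBase version in use.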
hbase timerange scan
Is an HBase timerange scan a full table scan when no start and stop keys are given? Or does it take advantage of the HFile metadata about the min and max timestamps in each HFile? And how is this metadata maintained after compaction of multiple files?
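On the min/max metadata question: conceptually, a store file can be skipped whenever its recorded time range does not overlap the scan's [from, to) range; after a compaction, the merged file simply records the min/max over all cells it now contains. A toy illustration of that pruning logic in plain Java (not HBase's actual code):

```java
import java.util.ArrayList;
import java.util.List;

public class TimeRangePruning {
    // Per-file metadata: the min and max cell timestamps, as an HFile records them.
    static final class FileMeta {
        final String name;
        final long minTs, maxTs;
        FileMeta(String name, long minTs, long maxTs) {
            this.name = name; this.minTs = minTs; this.maxTs = maxTs;
        }
    }

    // A file must be read iff its [minTs, maxTs] overlaps the scan range [from, to).
    static boolean overlaps(FileMeta f, long from, long to) {
        return f.maxTs >= from && f.minTs < to;
    }

    // Without start/stop row keys, every region is still visited, but whole
    // files inside a region can be skipped using only their metadata.
    static List<String> filesToRead(List<FileMeta> files, long from, long to) {
        List<String> out = new ArrayList<>();
        for (FileMeta f : files) {
            if (overlaps(f, from, to)) {
                out.add(f.name);
            }
        }
        return out;
    }
}
```

So the scan is still "full" in terms of regions touched, but file-level pruning can cut the I/O substantially when data arrives roughly in time order.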
Re: hbase doubts
and will using keyprefixregionsplit policy instead of default Increasing to upperbound split policy help here? On Wed, Aug 19, 2015 at 10:23 AM, Shushant Arora shushantaror...@gmail.com wrote: When last region gets new data and split in two - what is the split point - say last reagion was having 10 files and split alogorithm decided to split this region- Will the two children regions have 5-5 files or the key space of original region(parent region) say have range (2015-08-01#guid to 2015-08-06#guid) will be divided to 2 equal parts child1 has (2015-08-01#guid to 2015-08-03#guids) and child2 has range (2015-08-04#guid to 2015-08-06#guid) and all data is rewritten in child regions to accomany this key range and then since its time series based so new data will come in increasing dates and for dates2015-08-06 only so will go to child2 and child1 wil always be half filled. And child2 only will lead to new splits when reached split size threshold. On Wed, Aug 19, 2015 at 4:16 AM, Ted Yu yuzhih...@gmail.com wrote: Since year and month are part of the row key in this scenario (instead of just the day of month), the last region would get new data and be split. Is this effect desirable for your app ? Cheers On Tue, Aug 18, 2015 at 12:55 PM, Shushant Arora shushantaror...@gmail.com wrote: for hbase key containing time as prefix say(-mm-dd#other fields of guid base) I am using bulk load to avoid hot spot of regionserver (avoiding write to WAL). What should be the initial splits of regions. Say I have 30 regionserves. shall intial 30 days as intial splits and then auto split takes care of splitting regions if it grows further will serve ? Or since if it has date as prefix and when region is split in 2 from midway - and new data will come for increasing date only will lead to one region to be half filled always and rest half never filled? 
On Tue, Aug 18, 2015 at 9:41 PM, anil gupta anilgupt...@gmail.com wrote: As per my experience, Phoenix is way superior than Hive-HBase integration for sql-like querying on HBase. It's because, Phoenix is built on top of HBase unlike Hive. On Tue, Aug 18, 2015 at 9:09 AM, Ted Yu yuzhih...@gmail.com wrote: To my knowledge, Phoenix provides better integration with hbase. A third possibility is Spark on HBase. If you want to explore these alternatives, I suggest asking on respective mailing lists where you can get expert opinions. Cheers On Tue, Aug 18, 2015 at 9:03 AM, Shushant Arora shushantaror...@gmail.com wrote: Thanks! Which one is better for sqlkind of queries over hbase (queries involve filter , key range scan), aggregates by column values. . 1.Hive storage handlers 2.or Phoenix On Tue, Aug 18, 2015 at 9:14 PM, Ted Yu yuzhih...@gmail.com wrote: For #1, if you want to count distinct values for F1, you can write a coprocessor which aggregates the count on region server and returns the result to client which does the final aggregation. Take a look at hbase-server/src/main/java/org/apache/hadoop/hbase/coprocessor/AggregateImplementation.java and related classes for example. On Mon, Aug 17, 2015 at 10:08 PM, Shushant Arora shushantaror...@gmail.com wrote: Thanks ! few more doubts : 1.Say if requirement is to count distinct value of F1- If field is part of key- is hbase can't just scan key and skip value deserialsation and return result to client which will calculate distinct and in second approcah Hbase will desrialise the value of return column containing F1 to cleint which will calculate the distinct. 2.For bulk load when LoadIncrementalHFiles runs and regionserver moves the hfiles from hdfs to region directory - does regionserver localise the hfile by downloading it to local and then uploading again in region directory? Or it just moves to to region directory and wait for next compaction to get it localise as in regionserver failure case? 
On Mon, Aug 17, 2015 at 11:00 PM, Ted Yu yuzhih...@gmail.com wrote: For both scenarios you mentioned, field is not leading part of row key. You would need to specify timerange or start row / stop row to narrow the key range being scanned. I am leaning toward using second approach. Cheers On Mon, Aug 17, 2015 at 9:41 AM, Shushant Arora shushantaror...@gmail.com wrote: ~8-10 fields of size (5 of 20 bytes each )and 3 fields of size 200 bytes each. On Mon, Aug 17
Re: hbase doubts
When last region gets new data and split in two - what is the split point - say last reagion was having 10 files and split alogorithm decided to split this region- Will the two children regions have 5-5 files or the key space of original region(parent region) say have range (2015-08-01#guid to 2015-08-06#guid) will be divided to 2 equal parts child1 has (2015-08-01#guid to 2015-08-03#guids) and child2 has range (2015-08-04#guid to 2015-08-06#guid) and all data is rewritten in child regions to accomany this key range and then since its time series based so new data will come in increasing dates and for dates2015-08-06 only so will go to child2 and child1 wil always be half filled. And child2 only will lead to new splits when reached split size threshold. On Wed, Aug 19, 2015 at 4:16 AM, Ted Yu yuzhih...@gmail.com wrote: Since year and month are part of the row key in this scenario (instead of just the day of month), the last region would get new data and be split. Is this effect desirable for your app ? Cheers On Tue, Aug 18, 2015 at 12:55 PM, Shushant Arora shushantaror...@gmail.com wrote: for hbase key containing time as prefix say(-mm-dd#other fields of guid base) I am using bulk load to avoid hot spot of regionserver (avoiding write to WAL). What should be the initial splits of regions. Say I have 30 regionserves. shall intial 30 days as intial splits and then auto split takes care of splitting regions if it grows further will serve ? Or since if it has date as prefix and when region is split in 2 from midway - and new data will come for increasing date only will lead to one region to be half filled always and rest half never filled? On Tue, Aug 18, 2015 at 9:41 PM, anil gupta anilgupt...@gmail.com wrote: As per my experience, Phoenix is way superior than Hive-HBase integration for sql-like querying on HBase. It's because, Phoenix is built on top of HBase unlike Hive. 
On Tue, Aug 18, 2015 at 9:09 AM, Ted Yu yuzhih...@gmail.com wrote: To my knowledge, Phoenix provides better integration with hbase. A third possibility is Spark on HBase. If you want to explore these alternatives, I suggest asking on respective mailing lists where you can get expert opinions. Cheers On Tue, Aug 18, 2015 at 9:03 AM, Shushant Arora shushantaror...@gmail.com wrote: Thanks! Which one is better for sqlkind of queries over hbase (queries involve filter , key range scan), aggregates by column values. . 1.Hive storage handlers 2.or Phoenix On Tue, Aug 18, 2015 at 9:14 PM, Ted Yu yuzhih...@gmail.com wrote: For #1, if you want to count distinct values for F1, you can write a coprocessor which aggregates the count on region server and returns the result to client which does the final aggregation. Take a look at hbase-server/src/main/java/org/apache/hadoop/hbase/coprocessor/AggregateImplementation.java and related classes for example. On Mon, Aug 17, 2015 at 10:08 PM, Shushant Arora shushantaror...@gmail.com wrote: Thanks ! few more doubts : 1.Say if requirement is to count distinct value of F1- If field is part of key- is hbase can't just scan key and skip value deserialsation and return result to client which will calculate distinct and in second approcah Hbase will desrialise the value of return column containing F1 to cleint which will calculate the distinct. 2.For bulk load when LoadIncrementalHFiles runs and regionserver moves the hfiles from hdfs to region directory - does regionserver localise the hfile by downloading it to local and then uploading again in region directory? Or it just moves to to region directory and wait for next compaction to get it localise as in regionserver failure case? On Mon, Aug 17, 2015 at 11:00 PM, Ted Yu yuzhih...@gmail.com wrote: For both scenarios you mentioned, field is not leading part of row key. You would need to specify timerange or start row / stop row to narrow the key range being scanned. 
I am leaning toward using second approach. Cheers On Mon, Aug 17, 2015 at 9:41 AM, Shushant Arora shushantaror...@gmail.com wrote: ~8-10 fields of size (5 of 20 bytes each )and 3 fields of size 200 bytes each. On Mon, Aug 17, 2015 at 9:55 PM, Ted Yu yuzhih...@gmail.com wrote: How many fields such as F1 are you considering for embedding in row key ? Suggested reading
Re: hbase doubts
for hbase key containing time as prefix say(-mm-dd#other fields of guid base) I am using bulk load to avoid hot spot of regionserver (avoiding write to WAL). What should be the initial splits of regions. Say I have 30 regionserves. shall intial 30 days as intial splits and then auto split takes care of splitting regions if it grows further will serve ? Or since if it has date as prefix and when region is split in 2 from midway - and new data will come for increasing date only will lead to one region to be half filled always and rest half never filled? On Tue, Aug 18, 2015 at 9:41 PM, anil gupta anilgupt...@gmail.com wrote: As per my experience, Phoenix is way superior than Hive-HBase integration for sql-like querying on HBase. It's because, Phoenix is built on top of HBase unlike Hive. On Tue, Aug 18, 2015 at 9:09 AM, Ted Yu yuzhih...@gmail.com wrote: To my knowledge, Phoenix provides better integration with hbase. A third possibility is Spark on HBase. If you want to explore these alternatives, I suggest asking on respective mailing lists where you can get expert opinions. Cheers On Tue, Aug 18, 2015 at 9:03 AM, Shushant Arora shushantaror...@gmail.com wrote: Thanks! Which one is better for sqlkind of queries over hbase (queries involve filter , key range scan), aggregates by column values. . 1.Hive storage handlers 2.or Phoenix On Tue, Aug 18, 2015 at 9:14 PM, Ted Yu yuzhih...@gmail.com wrote: For #1, if you want to count distinct values for F1, you can write a coprocessor which aggregates the count on region server and returns the result to client which does the final aggregation. Take a look at hbase-server/src/main/java/org/apache/hadoop/hbase/coprocessor/AggregateImplementation.java and related classes for example. On Mon, Aug 17, 2015 at 10:08 PM, Shushant Arora shushantaror...@gmail.com wrote: Thanks ! 
few more doubts : 1.Say if requirement is to count distinct value of F1- If field is part of key- is hbase can't just scan key and skip value deserialsation and return result to client which will calculate distinct and in second approcah Hbase will desrialise the value of return column containing F1 to cleint which will calculate the distinct. 2.For bulk load when LoadIncrementalHFiles runs and regionserver moves the hfiles from hdfs to region directory - does regionserver localise the hfile by downloading it to local and then uploading again in region directory? Or it just moves to to region directory and wait for next compaction to get it localise as in regionserver failure case? On Mon, Aug 17, 2015 at 11:00 PM, Ted Yu yuzhih...@gmail.com wrote: For both scenarios you mentioned, field is not leading part of row key. You would need to specify timerange or start row / stop row to narrow the key range being scanned. I am leaning toward using second approach. Cheers On Mon, Aug 17, 2015 at 9:41 AM, Shushant Arora shushantaror...@gmail.com wrote: ~8-10 fields of size (5 of 20 bytes each )and 3 fields of size 200 bytes each. On Mon, Aug 17, 2015 at 9:55 PM, Ted Yu yuzhih...@gmail.com wrote: How many fields such as F1 are you considering for embedding in row key ? Suggested reading: http://hbase.apache.org/book.html#rowkey.design http://hbase.apache.org/book.html#client.filter.kvm (see ColumnPrefixFilter) Cheers On Mon, Aug 17, 2015 at 8:13 AM, Shushant Arora shushantaror...@gmail.com wrote: 1.so size limit is per cell's identifier + value ? What is more optimise - to have field in key or in column family's column ? If pattern is like every row has that field. Say I have a field F1 in all rows so Situtatio -1 key1#F1(as composite key) - and rest fields in column Situation-2 key1 as key and F1 part of column family. This is the main reason I asked the key size limit. If I asked for no of rows where F1 is = 'someval' will it be faster in situation-1 than in situation-2. 
Since in 1 it can return the result just by traversing keys no need to read columns? On Mon, Aug 17, 2015 at 8:27 PM, Ted Yu yuzhih...@gmail.com wrote: For #1, it is the limit on a single keyvalue, not row, not key. For #2, please see the following: http://hbase.apache.org/book.html#store.memstore http
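Earlier in this thread, the suggestion was to pre-split into ~30 day-based regions at table creation. In the shell that looks like the following (table/family names and dates are placeholders; split points are just the literal key prefixes):

```
create 'events', 'cf', SPLITS => ['2015-08-02', '2015-08-03', '2015-08-04']

# Or read all 30 boundaries from a file, one split key per line:
create 'events', 'cf', SPLITS_FILE => 'splits.txt'
```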
Re: hbase doubts
Thanks! Which one is better for sqlkind of queries over hbase (queries involve filter , key range scan), aggregates by column values. . 1.Hive storage handlers 2.or Phoenix On Tue, Aug 18, 2015 at 9:14 PM, Ted Yu yuzhih...@gmail.com wrote: For #1, if you want to count distinct values for F1, you can write a coprocessor which aggregates the count on region server and returns the result to client which does the final aggregation. Take a look at hbase-server/src/main/java/org/apache/hadoop/hbase/coprocessor/AggregateImplementation.java and related classes for example. On Mon, Aug 17, 2015 at 10:08 PM, Shushant Arora shushantaror...@gmail.com wrote: Thanks ! few more doubts : 1.Say if requirement is to count distinct value of F1- If field is part of key- is hbase can't just scan key and skip value deserialsation and return result to client which will calculate distinct and in second approcah Hbase will desrialise the value of return column containing F1 to cleint which will calculate the distinct. 2.For bulk load when LoadIncrementalHFiles runs and regionserver moves the hfiles from hdfs to region directory - does regionserver localise the hfile by downloading it to local and then uploading again in region directory? Or it just moves to to region directory and wait for next compaction to get it localise as in regionserver failure case? On Mon, Aug 17, 2015 at 11:00 PM, Ted Yu yuzhih...@gmail.com wrote: For both scenarios you mentioned, field is not leading part of row key. You would need to specify timerange or start row / stop row to narrow the key range being scanned. I am leaning toward using second approach. Cheers On Mon, Aug 17, 2015 at 9:41 AM, Shushant Arora shushantaror...@gmail.com wrote: ~8-10 fields of size (5 of 20 bytes each )and 3 fields of size 200 bytes each. On Mon, Aug 17, 2015 at 9:55 PM, Ted Yu yuzhih...@gmail.com wrote: How many fields such as F1 are you considering for embedding in row key ? 
Suggested reading: http://hbase.apache.org/book.html#rowkey.design http://hbase.apache.org/book.html#client.filter.kvm (see ColumnPrefixFilter) Cheers On Mon, Aug 17, 2015 at 8:13 AM, Shushant Arora shushantaror...@gmail.com wrote: 1.so size limit is per cell's identifier + value ? What is more optimise - to have field in key or in column family's column ? If pattern is like every row has that field. Say I have a field F1 in all rows so Situtatio -1 key1#F1(as composite key) - and rest fields in column Situation-2 key1 as key and F1 part of column family. This is the main reason I asked the key size limit. If I asked for no of rows where F1 is = 'someval' will it be faster in situation-1 than in situation-2. Since in 1 it can return the result just by traversing keys no need to read columns? On Mon, Aug 17, 2015 at 8:27 PM, Ted Yu yuzhih...@gmail.com wrote: For #1, it is the limit on a single keyvalue, not row, not key. For #2, please see the following: http://hbase.apache.org/book.html#store.memstore http://hbase.apache.org/book.html#regionserver_splitting_implementation Cheers On Mon, Aug 17, 2015 at 7:36 AM, Shushant Arora shushantaror...@gmail.com wrote: 1.Is hbase.client.keyvalue.maxsize is max size of row or key only ? Is there any limit on key size only ? 2.Access pattern is mostly on key based only- Is memstores and regions on a regionserver are per table basis? Is it if I have multiple tables it will have multiple memstores instead of few if it would have been one large table ? On Mon, Aug 17, 2015 at 7:29 PM, Ted Yu yuzhih...@gmail.com wrote: For #1, take a look at the following in hbase-default.xml : namehbase.client.keyvalue.maxsize/name value10485760/value For #2, it would be easier to answer if you can outline access patterns in your app. For #3, adjustment according to current region boundaries is done client side. 
Take a look at the javadoc for LoadQueueItem in LoadIncrementalHFiles.java Cheers On Mon, Aug 17, 2015 at 6:45 AM, Shushant Arora shushantaror...@gmail.com wrote: 1.Is there any max limit on key size of hbase table. 2.Is multiple small tables vs one large table which one is preferred. 3.for bulk load -when LoadIncremantalHfile is run it again
hbase doubts
1. Is there any max limit on the key size of an HBase table? 2. Which is preferred: multiple small tables or one large table? 3. For bulk load, when LoadIncrementalHFiles is run it recalculates the region splits based on region boundaries - does this division happen on the client side, or on the server side (at the regionserver or the HBase master), which then assigns the splits that cross a target region boundary to the desired regionserver?
Re: hbase doubts
1. Is hbase.client.keyvalue.maxsize the max size of the whole row or of the key only? Is there any limit on the key size alone? 2. The access pattern is mostly key based. Are memstores and regions on a regionserver kept on a per-table basis? If I have multiple tables, will there be more memstores than if everything were in one large table? On Mon, Aug 17, 2015 at 7:29 PM, Ted Yu yuzhih...@gmail.com wrote: For #1, take a look at the following in hbase-default.xml : <name>hbase.client.keyvalue.maxsize</name> <value>10485760</value> For #2, it would be easier to answer if you can outline access patterns in your app. For #3, adjustment according to current region boundaries is done client side. Take a look at the javadoc for LoadQueueItem in LoadIncrementalHFiles.java Cheers On Mon, Aug 17, 2015 at 6:45 AM, Shushant Arora shushantaror...@gmail.com wrote: 1.Is there any max limit on key size of hbase table. 2.Is multiple small tables vs one large table which one is preferred. 3.for bulk load -when LoadIncremantalHfile is run it again recalculates the region splits based on region boundary - is this division happens on client side or server side again at region server or hbase master and then it assigns the splits which cross target region boundary to desired regionserver.
Re: hbase doubts
Thanks ! few more doubts : 1.Say if requirement is to count distinct value of F1- If field is part of key- is hbase can't just scan key and skip value deserialsation and return result to client which will calculate distinct and in second approcah Hbase will desrialise the value of return column containing F1 to cleint which will calculate the distinct. 2.For bulk load when LoadIncrementalHFiles runs and regionserver moves the hfiles from hdfs to region directory - does regionserver localise the hfile by downloading it to local and then uploading again in region directory? Or it just moves to to region directory and wait for next compaction to get it localise as in regionserver failure case? On Mon, Aug 17, 2015 at 11:00 PM, Ted Yu yuzhih...@gmail.com wrote: For both scenarios you mentioned, field is not leading part of row key. You would need to specify timerange or start row / stop row to narrow the key range being scanned. I am leaning toward using second approach. Cheers On Mon, Aug 17, 2015 at 9:41 AM, Shushant Arora shushantaror...@gmail.com wrote: ~8-10 fields of size (5 of 20 bytes each )and 3 fields of size 200 bytes each. On Mon, Aug 17, 2015 at 9:55 PM, Ted Yu yuzhih...@gmail.com wrote: How many fields such as F1 are you considering for embedding in row key ? Suggested reading: http://hbase.apache.org/book.html#rowkey.design http://hbase.apache.org/book.html#client.filter.kvm (see ColumnPrefixFilter) Cheers On Mon, Aug 17, 2015 at 8:13 AM, Shushant Arora shushantaror...@gmail.com wrote: 1.so size limit is per cell's identifier + value ? What is more optimise - to have field in key or in column family's column ? If pattern is like every row has that field. Say I have a field F1 in all rows so Situtatio -1 key1#F1(as composite key) - and rest fields in column Situation-2 key1 as key and F1 part of column family. This is the main reason I asked the key size limit. 
If I asked for no of rows where F1 is = 'someval' will it be faster in situation-1 than in situation-2. Since in 1 it can return the result just by traversing keys no need to read columns? On Mon, Aug 17, 2015 at 8:27 PM, Ted Yu yuzhih...@gmail.com wrote: For #1, it is the limit on a single keyvalue, not row, not key. For #2, please see the following: http://hbase.apache.org/book.html#store.memstore http://hbase.apache.org/book.html#regionserver_splitting_implementation Cheers On Mon, Aug 17, 2015 at 7:36 AM, Shushant Arora shushantaror...@gmail.com wrote: 1.Is hbase.client.keyvalue.maxsize is max size of row or key only ? Is there any limit on key size only ? 2.Access pattern is mostly on key based only- Is memstores and regions on a regionserver are per table basis? Is it if I have multiple tables it will have multiple memstores instead of few if it would have been one large table ? On Mon, Aug 17, 2015 at 7:29 PM, Ted Yu yuzhih...@gmail.com wrote: For #1, take a look at the following in hbase-default.xml : namehbase.client.keyvalue.maxsize/name value10485760/value For #2, it would be easier to answer if you can outline access patterns in your app. For #3, adjustment according to current region boundaries is done client side. Take a look at the javadoc for LoadQueueItem in LoadIncrementalHFiles.java Cheers On Mon, Aug 17, 2015 at 6:45 AM, Shushant Arora shushantaror...@gmail.com wrote: 1.Is there any max limit on key size of hbase table. 2.Is multiple small tables vs one large table which one is preferred. 3.for bulk load -when LoadIncremantalHfile is run it again recalculates the region splits based on region boundary - is this division happens on client side or server side again at region server or hbase master and then it assigns the splits which cross target region boundary to desired regionserver.
Re: hbase doubts
~8-10 fields of size (5 of 20 bytes each )and 3 fields of size 200 bytes each. On Mon, Aug 17, 2015 at 9:55 PM, Ted Yu yuzhih...@gmail.com wrote: How many fields such as F1 are you considering for embedding in row key ? Suggested reading: http://hbase.apache.org/book.html#rowkey.design http://hbase.apache.org/book.html#client.filter.kvm (see ColumnPrefixFilter) Cheers On Mon, Aug 17, 2015 at 8:13 AM, Shushant Arora shushantaror...@gmail.com wrote: 1.so size limit is per cell's identifier + value ? What is more optimise - to have field in key or in column family's column ? If pattern is like every row has that field. Say I have a field F1 in all rows so Situtatio -1 key1#F1(as composite key) - and rest fields in column Situation-2 key1 as key and F1 part of column family. This is the main reason I asked the key size limit. If I asked for no of rows where F1 is = 'someval' will it be faster in situation-1 than in situation-2. Since in 1 it can return the result just by traversing keys no need to read columns? On Mon, Aug 17, 2015 at 8:27 PM, Ted Yu yuzhih...@gmail.com wrote: For #1, it is the limit on a single keyvalue, not row, not key. For #2, please see the following: http://hbase.apache.org/book.html#store.memstore http://hbase.apache.org/book.html#regionserver_splitting_implementation Cheers On Mon, Aug 17, 2015 at 7:36 AM, Shushant Arora shushantaror...@gmail.com wrote: 1.Is hbase.client.keyvalue.maxsize is max size of row or key only ? Is there any limit on key size only ? 2.Access pattern is mostly on key based only- Is memstores and regions on a regionserver are per table basis? Is it if I have multiple tables it will have multiple memstores instead of few if it would have been one large table ? 
On Mon, Aug 17, 2015 at 7:29 PM, Ted Yu yuzhih...@gmail.com wrote: For #1, take a look at the following in hbase-default.xml : namehbase.client.keyvalue.maxsize/name value10485760/value For #2, it would be easier to answer if you can outline access patterns in your app. For #3, adjustment according to current region boundaries is done client side. Take a look at the javadoc for LoadQueueItem in LoadIncrementalHFiles.java Cheers On Mon, Aug 17, 2015 at 6:45 AM, Shushant Arora shushantaror...@gmail.com wrote: 1.Is there any max limit on key size of hbase table. 2.Is multiple small tables vs one large table which one is preferred. 3.for bulk load -when LoadIncremantalHfile is run it again recalculates the region splits based on region boundary - is this division happens on client side or server side again at region server or hbase master and then it assigns the splits which cross target region boundary to desired regionserver.
Re: hbase doubts
1.so size limit is per cell's identifier + value ? What is more optimise - to have field in key or in column family's column ? If pattern is like every row has that field. Say I have a field F1 in all rows so Situtatio -1 key1#F1(as composite key) - and rest fields in column Situation-2 key1 as key and F1 part of column family. This is the main reason I asked the key size limit. If I asked for no of rows where F1 is = 'someval' will it be faster in situation-1 than in situation-2. Since in 1 it can return the result just by traversing keys no need to read columns? On Mon, Aug 17, 2015 at 8:27 PM, Ted Yu yuzhih...@gmail.com wrote: For #1, it is the limit on a single keyvalue, not row, not key. For #2, please see the following: http://hbase.apache.org/book.html#store.memstore http://hbase.apache.org/book.html#regionserver_splitting_implementation Cheers On Mon, Aug 17, 2015 at 7:36 AM, Shushant Arora shushantaror...@gmail.com wrote: 1.Is hbase.client.keyvalue.maxsize is max size of row or key only ? Is there any limit on key size only ? 2.Access pattern is mostly on key based only- Is memstores and regions on a regionserver are per table basis? Is it if I have multiple tables it will have multiple memstores instead of few if it would have been one large table ? On Mon, Aug 17, 2015 at 7:29 PM, Ted Yu yuzhih...@gmail.com wrote: For #1, take a look at the following in hbase-default.xml : namehbase.client.keyvalue.maxsize/name value10485760/value For #2, it would be easier to answer if you can outline access patterns in your app. For #3, adjustment according to current region boundaries is done client side. Take a look at the javadoc for LoadQueueItem in LoadIncrementalHFiles.java Cheers On Mon, Aug 17, 2015 at 6:45 AM, Shushant Arora shushantaror...@gmail.com wrote: 1.Is there any max limit on key size of hbase table. 2.Is multiple small tables vs one large table which one is preferred. 
3.for bulk load -when LoadIncremantalHfile is run it again recalculates the region splits based on region boundary - is this division happens on client side or server side again at region server or hbase master and then it assigns the splits which cross target region boundary to desired regionserver.
bulk load doubts
1. Do bulk-loaded HFiles not get replicated? Does that mean that if a regionserver goes down, all HFiles that were bulk loaded to that server are lost, irrespective of HDFS replication being set to 3? If yes, why are bulk-loaded HFiles not replicated? 2. Is there any issue with a timestamp prefix as the table key when bulk load is used for writing? 3. In a bulk-load MR job using HFileOutputFormat2 as the output format, is a single HFile created per region, or can there be multiple HFiles per region? If multiple, does LoadIncrementalHFiles merge these HFiles into one while loading them into the same region, or does it just do a simple copy? 4. Is there any performance issue if I run a bulk load every 5 seconds containing ~20 MB of data? Does it cause frequent compactions that lead to performance issues?
hbase doubts
Are bulk puts supported in HBase? In an MR job, when we put into a table using TableOutputFormat, how is that more efficient than normal puts from individual reducers - does TableOutputFormat not do puts one by one? And in a bulk-load Hadoop job where we specify HFileOutputFormat, does the job create HFiles according to the regionserver where they will finally land, or just in sorted order, with the LoadIncrementalHFiles utility then determining which regionserver the keys of each HFile belong to by parsing the HFiles instead of just dumping them?
Hbase master selection doubt
How does HBase use ZooKeeper for master selection and regionserver failure detection when ZooKeeper is not strictly consistent? Say, in the HBase master selection process, how can a node be 100% sure that a master has been created? It has to create the /master node, and if that node already exists the create will throw a node-exists exception. By only reading (ls /) it may get stale data and conclude the node does not exist when /master was actually present. Is there any issue with the non-strict consistency of ZooKeeper for HBase?
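The key point in this question can be made concrete: election does not rely on reads (which may be stale) but on the atomicity of create, which either succeeds for exactly one caller or fails with NodeExists. A toy model in plain Java, using an atomic compare-and-set as a stand-in for ZooKeeper's create (this illustrates the protocol shape only; it is not ZooKeeper code):

```java
import java.util.concurrent.atomic.AtomicReference;

public class ElectionSketch {
    // Stand-in for the /master znode: null means "does not exist".
    private final AtomicReference<String> masterZnode = new AtomicReference<>(null);

    // Models zk.create("/master", ...): atomically succeeds for exactly one
    // caller; every other caller gets the equivalent of NodeExistsException.
    boolean tryBecomeMaster(String serverName) {
        return masterZnode.compareAndSet(null, serverName);
    }

    String currentMaster() {
        return masterZnode.get();
    }
}
```

Losers then watch the node and retry the atomic create when it disappears, so no reader ever needs a perfectly fresh view to know whether it won.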
Re: Hbase master selection doubt
ZooKeeper guarantees sequential consistency: updates from a client will be applied in the order that they were sent. On Sat, Jun 27, 2015 at 8:18 PM, Ted Yu yuzhih...@gmail.com wrote: bq. non strictly consistency of Zookeeper Can you elaborate on what the above means ? please read this: http://zookeeper.apache.org/doc/trunk/zookeeperProgrammers.html#ch_zkGuarantees Cheers On Sat, Jun 27, 2015 at 7:20 AM, Shushant Arora shushantaror...@gmail.com wrote: How Hbase uses Zookeeper for Master selection and region server failure detection when Zookeeper is not strictly consistent. Say In Hbase Master selection process, how does a node is 100 % sure that a master is created ? Does it has to create the /master node and that node already exists will thow node exists excpetion . Since only by reading (ls /) . It may get stale data and gets node does not exists.but in actual /master was present. Does there any issue with non strictly consistency of Zookeeper for Hbase?
Re: Hbase master selection doubt
By strictly consistent I mean that all clients should see the same data at any time across different sessions. Say client C1 is connected to follower F1, and F1 is a few seconds behind the leader, while client C2 connects to F2, which is in sync with the leader. Now C1 and C2 will see different data under the root dir: say /master is visible to C2 but not to C1, until F1 catches up with the leader. On Sat, Jun 27, 2015 at 8:23 PM, Shushant Arora shushantaror...@gmail.com wrote: Zookeeper is Sequential Consistency Updates from a client will be applied in the order that they were sent. On Sat, Jun 27, 2015 at 8:18 PM, Ted Yu yuzhih...@gmail.com wrote: bq. non strictly consistency of Zookeeper Can you elaborate on what the above means ? please read this: http://zookeeper.apache.org/doc/trunk/zookeeperProgrammers.html#ch_zkGuarantees Cheers On Sat, Jun 27, 2015 at 7:20 AM, Shushant Arora shushantaror...@gmail.com wrote: How Hbase uses Zookeeper for Master selection and region server failure detection when Zookeeper is not strictly consistent. Say In Hbase Master selection process, how does a node is 100 % sure that a master is created ? Does it has to create the /master node and that node already exists will thow node exists excpetion . Since only by reading (ls /) . It may get stale data and gets node does not exists.but in actual /master was present. Does there any issue with non strictly consistency of Zookeeper for Hbase?
Re: avoiding hot spot for timestamp prefix key
The guid changes with every key; the pattern is: 2015-05-22 00:02:01#AB12EC945, 2015-05-22 00:02:02#CD9870001234AB457. When we specify a custom split algorithm, can it happen that keys of the same sorting-order range, say (1-7), lie in region R1 as well as in region R2? Then how will the .META. table handle further lookups at read time? Say I search for key 3 - will it search in both regions R1 and R2?

On Fri, May 22, 2015 at 10:48 AM, Ted Yu yuzhih...@gmail.com wrote: Does guid change with every key? bq. use second part of key - I don't think so. Suppose the first row in the parent region is '1432104178817#321'. After the split, the first row in the first daughter region would still be '1432104178817#321'. Right? Cheers
Re: avoiding hot spot for timestamp prefix key
Since the custom split policy is based on the second part, i.e. the guid, in which region will a key with first part 2015-05-22 00:01:02 land, and how will that region be identified?

On Fri, May 22, 2015 at 1:12 PM, Ted Yu yuzhih...@gmail.com wrote: The custom split policy needs to respect the fact that the timestamp is the leading part of the rowkey. This would avoid the overlap you mentioned. Cheers
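Ted's answer rests on the fact that regions cover non-overlapping, contiguous key ranges, so locating a key via .META. is just a floor search over sorted region start keys - a key can only ever land in one region. A toy illustration (the region boundaries here are made up):

```python
import bisect

# Sorted region start keys; region i covers [starts[i], starts[i+1]).
# The empty string is the conventional start key of the first region.
starts = ["", "2015-05-22 00:01:00", "2015-05-22 00:02:00"]

def region_for(key):
    """Index of the single region whose key range contains `key`."""
    return bisect.bisect_right(starts, key) - 1

print(region_for("2015-05-22 00:01:30"))  # falls in exactly one region
```

Because the lookup is a floor search on start keys, an overlap between two regions is impossible by construction, which is why a split policy must keep the leading (timestamp) part of the key intact.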
avoiding hot spot for timestamp prefix key
Can I avoid hotspotting of a region with a custom region split policy in hbase 0.96? The key is of the form timestamp#guid. So can I have a custom region split policy that uses the second part of the key (i.e. the guid) as the region split criterion, and avoid the hot spot?
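A common mitigation for monotonically increasing timestamp prefixes (a standard pattern, not something suggested in this thread) is salting: prefix each key with a small deterministic bucket so writes spread across N regions, at the cost of readers having to scan all N buckets. A sketch, with the bucket count and hash choice as assumptions:

```python
BUCKETS = 8

def salted(key):
    """Prefix a timestamp#guid key with a deterministic bucket id,
    derived from the guid so the same row always gets the same bucket."""
    guid = key.split("#", 1)[1]
    bucket = sum(guid.encode()) % BUCKETS  # stand-in for a real hash
    return "%d#%s" % (bucket, key)

k = "2015-05-22 00:02:01#AB12EC945"
print(salted(k))  # same input always yields the same salted key
```

Because the salt is deterministic, point reads for a known key still work; only time-range scans pay the price of fanning out over the buckets.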
default no of reducers
In a normal MR job, can I configure (cluster-wide) a default number of reducers that applies when I don't specify any reducers in my job?
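Yes - assuming Hadoop 2.x property names, the cluster-wide default can be set in mapred-site.xml; jobs that call job.setNumReduceTasks() still override it (on Hadoop 1.x the older property name mapred.reduce.tasks plays the same role):

```xml
<property>
  <name>mapreduce.job.reduces</name>
  <value>8</value> <!-- default reducer count when the job sets none -->
</property>
```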
Re: pre split region server
Thanks! A few more doubts:
1. When I don't supply SPLITS at table creation, all put operations go to one region only. But when the region grows beyond hbase.hregion.max.filesize, two regions will be created - will both have half the data, or will one be empty initially?
2. If both have 50-50% of the data and the row key is monotonically increasing, will one region stay half filled and never be filled again?
3. While presplitting a table, is the only way to specify row boundaries and key prefixes? Say I don't know the key ranges - in my case the key is a GUID, a hexadecimal 32-character string - what should the region split boundaries be? And how many splits should be created - is it equal to the number of region servers, aka datanodes?
4. For keys of type ACTIVITYTYPE-DATE (where activity type has 2 values, login and logout), what should the split strategy be?

On Tue, Jul 15, 2014 at 7:03 PM, Ted Yu yuzhih...@gmail.com wrote: Shushant: For #2, if the table has only one region, the hosting region server would receive all writes. For #4, yes - presplitting goes with a fixed number of regions. Cheers

On Tue, Jul 15, 2014 at 6:23 AM, sudhakara st sudhakara...@gmail.com wrote: You can find info here: http://hbase.apache.org/book/rowkey.design.html#rowkey.regionsplits http://hortonworks.com/blog/apache-hbase-region-splitting-and-merging/ -- Regards, ...sudhakara
Re: pre split region server
Thanks Ted. Can you give the shell syntax for #3 at table creation time?

On Wed, Jul 16, 2014 at 1:52 PM, Ted Yu yuzhih...@gmail.com wrote: For #1, the two regions would contain roughly half the data. For #2, one region would not receive new data. As you can see, such a schema design is suboptimal. For #3, you can split the key space evenly. Using the number of region servers as the number of splits is okay. Cheers
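For hex GUID keys, "split the key space evenly" means placing boundaries at equal fractions of the full hex range - the same idea as the shell's UniformSplit algorithm. A minimal sketch in plain Python that computes the boundary row keys rather than calling HBase:

```python
def uniform_split_points(num_regions, key_len=32):
    """Boundary row keys dividing the hex keyspace of length key_len
    into num_regions equal ranges (returns num_regions - 1 keys)."""
    space = 16 ** key_len
    return [format(space * i // num_regions, "0%dx" % key_len)
            for i in range(1, num_regions)]

# For 4 regions over 32-char hex GUIDs: boundaries at 1/4, 2/4, 3/4
print(uniform_split_points(4))
```

In the HBase shell, the equivalent presplit would be along the lines of `create 't1', 'cf', {NUMREGIONS => 4, SPLITALGO => 'UniformSplit'}` - verify the exact syntax against your HBase version.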
pre split region server
1. How do I split regions at table definition time?
2. Will HBase write to only one region server when no splits are defined, even if the key is not monotonically increasing?
3. When does a region split occur?
4. Will the number of regions be fixed when an HBase table is presplit at table creation time?
overriding slaves for particular job
Hi, Can I override the slave nodes for just one of my jobs? Let's say I want the current job to be executed on node1 and node2 only; if both are busy, let the job wait. Thanks Shushant
Re: hbase key design to efficient query on base of 2 or more column
I cannot apply a server-side filter. The 2nd requirement is not just to get users with the supreme category, but rather the distribution of users category-wise: 1. How many supreme, how many normal, and how many medium, till date.

On Mon, May 19, 2014 at 12:58 PM, Michael Segel michael_se...@hotmail.com wrote: Whoa! BAD BOY. This isn't a good idea for a secondary index. You have a row key (primary index) which is time. The secondary is a filter... with 3 choices. HINT: Do you really want a secondary index based on a field that only has 3 choices for a value? What are they teaching in school these days? How about applying a server-side filter? ;-)

On May 18, 2014, at 12:33 PM, John Hancock jhancock1...@gmail.com wrote: Shushant, Here's one idea; there might be better ways. Take a look at Phoenix - it supports secondary indexing: http://phoenix.incubator.apache.org/secondary_indexing.html -John
Re: hbase key design to efficient query on base of 2 or more column
Ok, but what if I have 2 multivalue dimensions on which I have to analyse the number of users? Say Category can have 50 values and another dimension is the user's country (say 100+ values). I need a weekly count on category and country, plus an overall distinct user count on category and country. How do I achieve this in HBase?

On Mon, May 19, 2014 at 3:11 PM, Michael Segel michael_se...@hotmail.com wrote: The point is that choosing a field that has a small finite set of values is not a good candidate for indexing using an inverted table or b-tree etc... I'd say that you're actually going to be better off using a scan with a start and stop row, then doing the counts on the client side. So as you get back your result set... you process the data. (Either in an M/R job or a single client thread.) HTH
Re: hbase key design to efficient query on base of 2 or more column
By server-side filter, do you mean partitioning the data across multiple HBase tables, one for each category, or something else?

On Mon, May 19, 2014 at 11:05 PM, Vladimir Rodionov vrodio...@carrieriq.com wrote: "I cannot apply server side filter." Why is that? Are you using stock HBase or some other, API-compatible product? Best regards, Vladimir Rodionov, Principal Platform Engineer, Carrier IQ, www.carrieriq.com
hbase key design to efficient query on base of 2 or more column
Hi, I have a requirement to query my data based on date and user category. User category can be Supreme, Normal, or Medium. I want to query how many new users are in my table, category-wise, for the date range (2014-01-01) to (2014-05-16). Another requirement is to query how many users of the Supreme category are in my table, broken down by the month in which they came. What should my key be? 1. If I take the key as a combination of date#category, I cannot query based on category. 2. If I take the key as category#date, I cannot query based on date. Thanks Shushant.
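Michael Segel's suggestion in this thread - scan a key range and count on the client side - can be sketched with a sorted in-memory list standing in for an HBase table. The date#category#user key layout and the sample rows below are assumptions for illustration:

```python
import bisect
from collections import Counter

# Sorted row keys, as HBase would store them: date#category#user
rows = sorted([
    "2014-01-05#Normal#u1",
    "2014-02-10#Supreme#u2",
    "2014-03-01#Medium#u3",
    "2014-04-20#Supreme#u4",
    "2014-06-01#Normal#u5",   # outside the queried range
])

def scan(rows, start, stop):
    """Keys in [start, stop) - the start/stop-row semantics of an HBase scan."""
    return rows[bisect.bisect_left(rows, start):bisect.bisect_left(rows, stop)]

# Category-wise new-user counts for 2014-01-01 .. 2014-05-16 inclusive
# (stop row "2014-05-17" because the stop row is exclusive):
counts = Counter(k.split("#")[1] for k in scan(rows, "2014-01-01", "2014-05-17"))
print(counts)  # Supreme: 2, Normal: 1, Medium: 1
```

With date as the leading key part, the date-range query is a single contiguous scan, and the category breakdown falls out of client-side aggregation; the reverse (category#date) would instead need one scan per category.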
when to use hive vs hbase
I have a requirement to process huge weblogs on a daily basis. 1. Data will come incrementally to the datastore daily, and I need cumulative and daily distinct user counts from the logs; after that, the aggregated data will be loaded into an RDBMS like mysql. 2. Data will be loaded into an HDFS data warehouse daily, and the same will be fetched from the HDFS warehouse, after some filtering, into an RDBMS like mysql and processed there. Which data warehouse is suitable for approaches 1 and 2, and why? Thanks Shushant
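The two counts in requirement 1 - daily distinct users and a running cumulative distinct count - can be computed with plain sets. A minimal sketch (the per-day user lists are made-up sample data; a real pipeline would do this in MapReduce or Hive rather than in memory):

```python
# day -> user ids seen in that day's logs (sample data)
daily_logs = {
    "2014-04-28": ["u1", "u2", "u1"],
    "2014-04-29": ["u2", "u3"],
    "2014-04-30": ["u4"],
}

seen = set()  # all users observed so far, across days
for day in sorted(daily_logs):
    users = set(daily_logs[day])
    daily = len(users)         # distinct users this day
    seen |= users
    cumulative = len(seen)     # distinct users since the start
    print(day, daily, cumulative)
```

The cumulative count is the expensive part at scale, since it needs state (the set of all users ever seen) carried across daily loads - one reason the choice of warehouse matters here.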
Re: when to use hive vs hbase
Hi Jean, Thanks for the explanation. I still have one doubt: why is HBase not good for bulk loads and aggregations (full table scans)? Hive will also read each row for aggregation, just like HBase. Can you explain more?

On Wed, Apr 30, 2014 at 5:15 PM, Jean-Marc Spaggiari jean-m...@spaggiari.org wrote: Hi Shushant, Hive and HBase are 2 different things. You cannot really use one vs the other. Hive is a query engine against HDFS data. Data can be stored in different formats like flat text, sequence files, Parquet files, or even HBase tables. HBase is both a query engine (Gets and Scans) and a storage engine on top of HDFS which allows you to store data for random read and random write. You can also add tools like Phoenix and Impala to the picture, which allow you to query the data from HDFS or HBase too. A good way to know if HBase is a good fit or not is to ask yourself how you are going to write into HBase or read from HBase. HBase is good for random reads and random writes. If you only do bulk loads and aggregations (full table scans), HBase is not a good fit. If you do random access (client information, event details, etc.), HBase is a good fit. It's a bit oversimplified, but that should give you some starting points.
Re: when to use hive vs hbase
Thanks Jean! A few more questions: what are good practices for key column design in HBase? Say my web logs contain a timestamp and a request id which uniquely identify each row. 1. Shall I make YYYY-MM-DD-HH-MM-SS_REQ_ID the row key, in a scenario where this data will be fetched from HBase daily and loaded into a MySQL DB? Daily my ETL runs and fetches records with keycol between lastdate and today. Will this key design overload one region server, or will load be divided equally among region servers?

On Wed, Apr 30, 2014 at 5:55 PM, Jean-Marc Spaggiari jean-m...@spaggiari.org wrote: With HBase you have some overhead. The region server does a lot for you: managing all the column families, the columns, the delete markers, the compactions, etc. If you read a file directly from HDFS it will be faster for sure, because you will not have all those validations and all this extra memory usage. HBase is absolutely perfect and excellent at what it's built for. But if you are doing only full table scans, that's not its primary use case. It can still do it if you want, but if you do only that, it's not the most efficient option. If your use case is a mix of full scans and random reads/random writes, then yes, go with it! Last, some full table scans can be good fits with HBase if you use some of its specific features, like TTL on certain column families when using more than one, etc. HTH
hive hbase integration
I want to know why Hive-HBase integration is required. Is it because HBase cannot provide all SQL-like functionality, and if so, why? What is a storage handler, and what are best practices for Hive-HBase integration?
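As a concrete illustration of what a storage handler does: a Hive DDL along these lines (the table and column names are hypothetical; the handler class and the hbase.columns.mapping / hbase.table.name properties come from the standard Hive-HBase integration) maps a Hive table onto an existing HBase table so that HiveQL can query it:

```sql
CREATE EXTERNAL TABLE weblog_hive(key STRING, user_id STRING, category STRING)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf:user_id,cf:category")
TBLPROPERTIES ("hbase.table.name" = "weblog");
```

The storage handler translates Hive reads and writes into HBase scans and puts, which is why the integration exists: HBase itself has no SQL layer, and Hive supplies one over the HBase-stored data.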