Re: Change data capture tool for hbase

2013-06-04 Thread yavuz gokirmak
Hi Asaf, this CDC pattern will be used for directing changes to another system. Assume I have a table hbase_alarms in HBase with columns Severity, Source, and Time, and I am tracking changes with this CDC tool. Some external system is putting alarms, with their severity and source, into the hbase_alarms table.
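
A minimal sketch of what that external writer might look like with the 0.94-era Java client, assuming a single column family named 'cf' and an alarm-id row key (neither is given in the thread):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class AlarmWriter {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "hbase_alarms");
    try {
      // Row key choice is up to the writing system; an alarm id is assumed here.
      Put put = new Put(Bytes.toBytes("alarm-0001"));
      put.add(Bytes.toBytes("cf"), Bytes.toBytes("Severity"), Bytes.toBytes("CRITICAL"));
      put.add(Bytes.toBytes("cf"), Bytes.toBytes("Source"), Bytes.toBytes("router-7"));
      put.add(Bytes.toBytes("cf"), Bytes.toBytes("Time"), Bytes.toBytes(System.currentTimeMillis()));
      table.put(put); // each such Put is what the CDC tool would observe and forward
    } finally {
      table.close();
    }
  }
}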

RPC Replication Compression

2013-06-04 Thread Asaf Mesika
Hi, just wanted to make sure I read correctly on the internet: 0.96 will support HBase RPC compression, thus replication between master and slave will enjoy it as well (important since bandwidth between geographically distant data centers is scarce and more expensive).

Re: RPC Replication Compression

2013-06-04 Thread Anoop John
"0.96 will support HBase RPC compression" Yes. "Replication between master and slave will enjoy it as well (important since bandwidth between geographically distant data centers is scarce and more expensive)" But I cannot see it being utilized in replication. Maybe we can do improvements in
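
For reference, a minimal sketch of turning RPC compression on from the client side, assuming the hbase.client.rpc.compressor property added for 0.96; the property name and codec choice here are from memory, not from the thread:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class RpcCompressionConfig {
  public static Configuration compressedClientConf() {
    Configuration conf = HBaseConfiguration.create();
    // Ask the client RPC engine to compress request/response payloads.
    // GzipCodec is just one choice; any Hadoop CompressionCodec on the classpath should do.
    conf.set("hbase.client.rpc.compressor", "org.apache.hadoop.io.compress.GzipCodec");
    return conf; // use this conf when creating HTable/HConnection instances
  }
}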

Re: what's the typical scan latency?

2013-06-04 Thread Amit Mor
What's your blockCacheHitCachingRatio? It would tell you about the ratio of scans requested from cache (the default) to the scans actually served from the block cache. You can get that from the RS web UI. What you are seeing can map to almost anything, for example: is scanner caching (client side)
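
A minimal sketch of the client-side knob Amit mentions, using the 0.94 Java API; the values are illustrative, not recommendations:

import org.apache.hadoop.hbase.client.Scan;

public class ScanTuning {
  // Scanner caching controls how many rows each RPC to the region server returns;
  // the 0.94 default of 1 row per RPC makes scans look far slower than they are.
  public static Scan tunedScan() {
    Scan scan = new Scan();
    scan.setCaching(500);       // rows fetched per RPC round trip
    scan.setCacheBlocks(false); // keep large scans from evicting hot data in the block cache
    return scan;
  }
}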

Re: RPC Replication Compression

2013-06-04 Thread Asaf Mesika
If RPC has compression abilities, how come Replication, which also works over RPC, does not get it automatically? On Tue, Jun 4, 2013 at 12:34 PM, Anoop John anoop.hb...@gmail.com wrote: 0.96 will support HBase RPC compression Yes Replication between master and slave will enjoy it as well

Using thrift2 interface but getting : 400 Bad Request

2013-06-04 Thread Simon Majou
Hello, I am using the thrift and thrift2 interfaces (thrift for DDL, thrift2 for the rest). My requests work with thrift, but with thrift2 I get a 400 error. Here is my code (CoffeeScript): colValue = new types2.TColumnValue family: 'cf', qualifier: 'col', value: 'yoo' put = new
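
A rough equivalent of that put using the generated Java thrift2 bindings, assuming the thrift2 server ("hbase thrift2 start") is listening on localhost:9090 with its default unframed binary protocol; the table, row, and column names below are placeholders:

import java.nio.ByteBuffer;
import java.util.Arrays;

import org.apache.hadoop.hbase.thrift2.generated.TColumnValue;
import org.apache.hadoop.hbase.thrift2.generated.THBaseService;
import org.apache.hadoop.hbase.thrift2.generated.TPut;
import org.apache.thrift.protocol.TBinaryProtocol;
import org.apache.thrift.transport.TSocket;
import org.apache.thrift.transport.TTransport;

public class Thrift2PutExample {
  public static void main(String[] args) throws Exception {
    // Adjust transport/protocol if your server uses framed or compact variants.
    TTransport transport = new TSocket("localhost", 9090);
    transport.open();
    THBaseService.Client client = new THBaseService.Client(new TBinaryProtocol(transport));

    TColumnValue colValue = new TColumnValue(
        ByteBuffer.wrap("cf".getBytes("UTF-8")),
        ByteBuffer.wrap("col".getBytes("UTF-8")),
        ByteBuffer.wrap("yoo".getBytes("UTF-8")));
    TPut put = new TPut(ByteBuffer.wrap("row1".getBytes("UTF-8")), Arrays.asList(colValue));

    client.put(ByteBuffer.wrap("mytable".getBytes("UTF-8")), put);
    transport.close();
  }
}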

Re: Using thrift2 interface but getting : 400 Bad Request

2013-06-04 Thread Ted Yu
Can you check the region server log around that time? Thanks On Jun 4, 2013, at 8:37 AM, Simon Majou si...@majou.org wrote: Hello, I am using thrift thrift2 interfaces (thrift for DDL thrift2 for the rest), my requests work with thrift but with thrift2 I got a error 400. Here is my code

Re: Using thrift2 interface but getting : 400 Bad Request

2013-06-04 Thread Simon Majou
No logs there either (in fact no logs are written to any log file when I execute the request). Simon On Tue, Jun 4, 2013 at 5:42 PM, Ted Yu yuzhih...@gmail.com wrote: Can you check region server log around that time ? Thanks On Jun 4, 2013, at 8:37 AM, Simon Majou si...@majou.org wrote:

Regarding Indexing columns in HBASE

2013-06-04 Thread Ramasubramanian Narayanan
Hi, In an HBase table there are 200 columns, and the read patterns for different systems involve 70 columns... In the above case we cannot have 70 columns in the rowkey, which would not be a good design... Can you please suggest how to handle this problem? Also, can we do indexing in HBase apart

Re: RPC Replication Compression

2013-06-04 Thread Jean-Daniel Cryans
Replication doesn't need to know about compression at the RPC level, so it won't refer to it, and as far as I can tell you need to set compression only on the master cluster and the slave will figure it out. Looking at the code though, I'm not sure it works the same way it used to work before

Re: Regarding Indexing columns in HBASE

2013-06-04 Thread Shahab Yunus
Just a quick thought: why don't you create different tables and duplicate data, i.e. go for denormalization and data redundancy? Are all of your read access patterns that require the 70 columns incorporated into one application/client, or will it be a bunch of different clients/applications? If that

Re: Poor HBase map-reduce scan performance

2013-06-04 Thread Bryan Keller
Thanks Enis, I'll see if I can backport this patch - it is exactly what I was going to try. This should solve my scan performance problems if I can get it to work. On May 29, 2013, at 1:29 PM, Enis Söztutar e...@hortonworks.com wrote: Hi, Regarding running raw scans on top of Hfiles, you

Re: Regarding Indexing columns in HBASE

2013-06-04 Thread Ramasubramanian Narayanan
Hi, The read pattern differs for each application... Is the below approach fine? Create one HBase table with a unique rowkey and put all 200 columns into it... Create multiple small HBase tables that hold the read-access-pattern columns and the rowkey they map to in the master table...

Re: Regarding Indexing columns in HBASE

2013-06-04 Thread Michel Segel
Quick and dirty... Create an inverted table for each index. Then you can take the intersection of the result set(s) to get your list of rows for further filtering. There is obviously more to this, but that's the core idea... Sent from a remote device. Please excuse any typos... Mike Segel

Re: Regarding Indexing columns in HBASE

2013-06-04 Thread Ramasubramanian Narayanan
Hi Michel, If you don't mind, can you please help explain in detail... Also, can you please let me know whether we have secondary indexes in HBase? Regards, Rams On Tue, Jun 4, 2013 at 1:13 PM, Michel Segel michael_se...@hotmail.com wrote: Quick and dirty... Create an inverted table for each

Re: Regarding Indexing columns in HBASE

2013-06-04 Thread Ian Varley
Rams - you might enjoy this blog post from HBase committer Jesse Yates (from last summer): http://jyates.github.io/2012/07/09/consistent-enough-secondary-indexes.html Secondary Indexing doesn't exist in HBase core today, but there are various proposals and early implementations of it in

Re: Regarding Indexing columns in HBASE

2013-06-04 Thread Michael Segel
Ok... A little bit more detail... First, it's possible to store your data in multiple tables, each with a different key. Not a good idea for some very obvious reasons. You could however create a secondary table which is an inverted table, where the rowkey of the index is the value in the
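
A minimal sketch of that inverted-table idea, assuming a master_table and one index table per indexed column; the table names and the 'd' column family are placeholders, and note the two puts are not atomic, which is the consistency caveat the blog post linked above discusses:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class InvertedIndexWriter {
  // Index rowkey = indexed value + separator + main rowkey, so a prefix scan on
  // the value returns every main-table rowkey holding that value.
  public static void writeWithIndex(Configuration conf, String mainRowKey,
                                    String severity) throws Exception {
    HTable main = new HTable(conf, "master_table");
    HTable idx = new HTable(conf, "idx_severity");
    try {
      Put p = new Put(Bytes.toBytes(mainRowKey));
      p.add(Bytes.toBytes("d"), Bytes.toBytes("Severity"), Bytes.toBytes(severity));
      main.put(p);

      Put ip = new Put(Bytes.toBytes(severity + "\u0000" + mainRowKey));
      ip.add(Bytes.toBytes("d"), Bytes.toBytes("ref"), Bytes.toBytes(mainRowKey));
      idx.put(ip);
    } finally {
      main.close();
      idx.close();
    }
  }
}

Querying on several columns then means prefix-scanning each inverted table for the wanted value and intersecting the returned main-table rowkeys client side, which is the intersection step described above.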

Scan + Gets are disk bound

2013-06-04 Thread Rahul Ravindran
Hi, We are relatively new to HBase, and we are hitting a roadblock on our scan performance. I searched through the email archives and applied a bunch of the recommendations there, but they did not improve much. So I am hoping I am missing something which you could guide me towards. Thanks in

Re: Explosion in datasize using HBase as a MR sink

2013-06-04 Thread Rob Verkuylen
Finally fixed this; my code was at fault. Protobufs require a builder object, which was a (non-static) protected field in an abstract class that all parsers extend. The mapper calls a parser factory depending on the input record. Because we designed the parser instances as singletons, the builder
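
A minimal sketch of the failure mode being described, using a hypothetical generated protobuf message Record with a repeated string field values; the real message and parser classes are not shown in the thread:

// Hypothetical generated message "Record"; substitute your own compiled .proto class.
public abstract class BaseParser {
  // One builder shared by every record a singleton parser instance handles.
  protected final Record.Builder builder = Record.newBuilder();

  public byte[] parse(String line) {
    // Without this clear(), repeated fields accumulate across records, so each
    // serialized record carries everything parsed before it: the "explosion".
    builder.clear();
    builder.addValues(line); // real parsing logic elided
    return builder.build().toByteArray(); // bytes the mapper writes to the HBase sink
  }
}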

Replication is on columnfamily level or table level?

2013-06-04 Thread N Dm
Hi folks, HBase 0.94.3. From reading several documents, I always had the impression that *replication* works at the table/*column-family* level. However, when I set up a table with two column families and replicate them to two different slaves, the whole table is replicated. Is this a bug?

Re: Explosion in datasize using HBase as a MR sink

2013-06-04 Thread Stack
On Tue, Jun 4, 2013 at 9:58 PM, Rob Verkuylen r...@verkuylen.net wrote: Finally fixed this, my code was at fault. Protobufs require a builder object which was a (non static) protected object in an abstract class all parsers extend. The mapper calls a parser factory depending on the input

Re: RPC Replication Compression

2013-06-04 Thread Stack
On Tue, Jun 4, 2013 at 6:48 PM, Jean-Daniel Cryans jdcry...@apache.orgwrote: Replication doesn't need to know about compression at the RPC level so it won't refer to it and as far as I can tell you need to set compression only on the master cluster and the slave will figure it out. Looking

Re: Poor HBase map-reduce scan performance

2013-06-04 Thread Sandy Pratt
Haven't had a chance to write a JIRA yet, but I thought I'd pop in here with an update in the meantime. I tried a number of different approaches to eliminate latency and bubbles in the scan pipeline, and eventually arrived at adding a streaming scan API to the region server, along with

Questions about HBase

2013-06-04 Thread Pankaj Gupta
Hi, I have a few small questions regarding HBase. I've searched the forum but couldn't find clear answers, hence asking them here: 1. Does minor compaction remove HFiles in which all entries are out of TTL, or does only major compaction do that? I found this JIRA:

Re: Questions about HBase

2013-06-04 Thread ramkrishna vasudevan
"Does minor compaction remove HFiles in which all entries are out of TTL, or does only major compaction do that?" Yes, it applies to minor compactions as well. "Is there a way of configuring major compaction to compact only files older than a certain time or to compress all the files except the latest"

Re: Questions about HBase

2013-06-04 Thread Ted Yu
bq. I found this jira: https://issues.apache.org/jira/browse/HBASE-5199 but I don't know if the compaction being talked about there is minor or major. The optimization above applies to minor compaction selection. Cheers On Tue, Jun 4, 2013 at 7:15 PM, Pankaj Gupta pan...@brightroll.com

Re: Questions about HBase

2013-06-04 Thread Ted Yu
bq. But I am not very sure if we can control the files getting selected for compaction in the older versions. The same mechanism is available in 0.94. Take a look at src/main/java/org/apache/hadoop/hbase/coprocessor/BaseRegionObserver.java, where you would find the following methods (and more):
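
Since the method list itself is cut off in the digest, here is a minimal sketch of the compaction-related hooks a BaseRegionObserver subclass can override, assuming the 0.94-era signatures:

import java.util.List;

import org.apache.hadoop.hbase.coprocessor.BaseRegionObserver;
import org.apache.hadoop.hbase.coprocessor.ObserverContext;
import org.apache.hadoop.hbase.coprocessor.RegionCoprocessorEnvironment;
import org.apache.hadoop.hbase.regionserver.InternalScanner;
import org.apache.hadoop.hbase.regionserver.Store;
import org.apache.hadoop.hbase.regionserver.StoreFile;

public class CompactionHooks extends BaseRegionObserver {

  @Override
  public void preCompactSelection(ObserverContext<RegionCoprocessorEnvironment> c,
                                  Store store, List<StoreFile> candidates) {
    // 'candidates' is mutable: removing a StoreFile here excludes it from this
    // compaction, so a policy like "leave files newer than N hours alone" would
    // inspect each file and call candidates.remove(...) accordingly.
  }

  @Override
  public InternalScanner preCompact(ObserverContext<RegionCoprocessorEnvironment> e,
                                    Store store, InternalScanner scanner) {
    // Returning a wrapping scanner lets you filter or transform KeyValues as they
    // are rewritten during compaction; returning 'scanner' keeps default behaviour.
    return scanner;
  }
}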

Re: Scan + Gets are disk bound

2013-06-04 Thread anil gupta
On Tue, Jun 4, 2013 at 11:48 AM, Rahul Ravindran rahu...@yahoo.com wrote: Hi, We are relatively new to Hbase, and we are hitting a roadblock on our scan performance. I searched through the email archives and applied a bunch of the recommendations there, but they did not improve much. So, I

Re: Replication is on columnfamily level or table level?

2013-06-04 Thread Anoop John
Yes, replication scope can be specified at the CF level. You have used HCD#setScope(), right? S = '3', BLOCKSIZE = '65536'}, {*NAME = 'cf2', REPLICATION_SCOPE = '2'*, You set the scope as 2? You have to set one CF to be replicated to one cluster and another to another cluster. I don't think it is
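
For reference, a minimal sketch of setting the replication scope per column family with the 0.94 Java API; the table and family names are placeholders, and as far as I know scope values other than 0 and 1 are not a supported way to target a particular peer cluster, which is the point questioned above:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class ReplicationScopeSetup {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);

    HTableDescriptor desc = new HTableDescriptor("mytable");

    HColumnDescriptor cf1 = new HColumnDescriptor("cf1");
    cf1.setScope(1); // 1 = ship this family's edits to replication peers
    desc.addFamily(cf1);

    HColumnDescriptor cf2 = new HColumnDescriptor("cf2");
    cf2.setScope(0); // 0 (the default) = keep this family local
    desc.addFamily(cf2);

    admin.createTable(desc);
    admin.close();
  }
}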

Re: Scan + Gets are disk bound

2013-06-04 Thread Rahul Ravindran
Our row keys do not contain time. By time-based scans I mean an MR job over the HBase table where the scan object has no startRow or endRow but has a startTime and endTime. Our row key format is MD5 of UUID+UUID, so we expect good distribution. We have pre-split the table initially to prevent any initial
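
A minimal sketch of that kind of time-range MR scan with the 0.94 API, assuming a placeholder table name and taking the window bounds from the command line:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.mapreduce.Job;

public class TimeRangeScanJob {
  static class PassThroughMapper extends TableMapper<ImmutableBytesWritable, Result> {
    // map() omitted: process each Result whose cells fall inside the time range
  }

  public static void main(String[] args) throws Exception {
    long startTimeMs = Long.parseLong(args[0]);
    long endTimeMs = Long.parseLong(args[1]);

    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "time-range-scan");

    Scan scan = new Scan();                    // no startRow/endRow, as described above
    scan.setTimeRange(startTimeMs, endTimeMs); // only cells written in this window
    scan.setCaching(500);
    scan.setCacheBlocks(false);

    TableMapReduceUtil.initTableMapperJob("events", scan, PassThroughMapper.class,
        ImmutableBytesWritable.class, Result.class, job);
    job.setNumReduceTasks(0);
    job.waitForCompletion(true);
  }
}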

Re: Questions about HBase

2013-06-04 Thread Pankaj Gupta
Thanks for the replies. I'll take a look at src/main/java/org/apache/hadoop/hbase/coprocessor/BaseRegionObserver.java. @ramkrishna: I do want to have bloom filters and the block index all the time. For good read performance they're critical in my workflow. The worry is that when HBase is restarted it

Re: Questions about HBase

2013-06-04 Thread Anoop John
4. This one is related to what I read in the HBase Definitive Guide bloom filter section: "Given a random row key you are looking for, it is very likely that this key will fall in between two block start keys. The only way for HBase to figure out if the key actually exists is by loading

Re: Questions about HBase

2013-06-04 Thread Asaf Mesika
If you read the HFile v2 document on the HBase site you will understand completely how the search for a record works and why there is a linear search within the block but a binary search to get to the right block. Also bear in mind that the number of keys in a block is not big, since a block in an HFile by default

Re: Questions about HBase

2013-06-04 Thread ramkrishna vasudevan
For the question of whether you will be able to do a warm-up for the bloom filters and block cache, I don't think it is possible now. Regards Ram On Wed, Jun 5, 2013 at 10:57 AM, Asaf Mesika asaf.mes...@gmail.com wrote: If you will read HFile v2 document on HBase site you will understand completely how

Re: Scan + Gets are disk bound

2013-06-04 Thread Anoop John
When you set a time range on a Scan, some files can get skipped based on the max/min timestamp values in that file. Having said this, when you do a major compaction and then scan based on a time range, I don't think you will get that advantage. -Anoop- On Wed, Jun 5, 2013 at 10:11 AM, Rahul Ravindran rahu...@yahoo.com wrote:

Re: Scan + Gets are disk bound

2013-06-04 Thread Asaf Mesika
On Tuesday, June 4, 2013, Rahul Ravindran wrote: Hi, We are relatively new to Hbase, and we are hitting a roadblock on our scan performance. I searched through the email archives and applied a bunch of the recommendations there, but they did not improve much. So, I am hoping I am missing

Re: Questions about HBase

2013-06-04 Thread Asaf Mesika
When you do the first read of this region, wouldn't this load all bloom filters? On Wed, Jun 5, 2013 at 8:43 AM, ramkrishna vasudevan ramkrishna.s.vasude...@gmail.com wrote: for the question whether you will be able to do a warm up for the bloom and block cache i don't think it is possible

Re: Scan + Gets are disk bound

2013-06-04 Thread Rahul Ravindran
Thanks for that confirmation. This is what we hypothesized as well. So, if we are dependent on time-range scans, we need to completely avoid major compaction and depend only on minor compactions? Is there any downside? We do have a TTL set on all the rows in the table. ~Rahul.