Re: hbase-master-server slept

2013-02-14 Thread Michel Segel
First, in the world of Hadoop, "if it ain't broke, don't fix it" may not be the best advice. HBase is still evolving at a good pace, and you want to stay close to the latest releases. CDH4 is stable, so I agree that going to CDH4 would be best. Second, you are running this as a single

Re: Build ycsb failed

2013-02-14 Thread Andrew Purtell
Could not find artifact org.apache.hadoop:hadoop-core:jar:2.0.0-cdh4.1.2 This is because the YCSB POM does not include Cloudera repositories. To get anywhere with this, you'll need to open an issue on the YCSB GitHub page. I have tried writing: Hadoop: 2.0.0 HBase: 0.92.1 But I had

Using HBase for Deduping

2013-02-14 Thread Rahul Ravindran
Hi, We have events which are delivered into our HDFS cluster which may be duplicated. Each event has a UUID and we were hoping to leverage HBase to dedupe them. We run a MapReduce job which would perform a lookup for each UUID on HBase and then emit the event only if the UUID was absent and

RE: Using HBase for Deduping

2013-02-14 Thread Viral Bajaria
Are all these dupe events expected to be within the same hour, or can they happen over multiple hours? Viral From: Rahul Ravindran Sent: 2/14/2013 11:41 AM To: user@hbase.apache.org Subject: Using HBase for Deduping Hi, We have events which are delivered into our HDFS cluster which may be

Re: Using HBase for Deduping

2013-02-14 Thread Rahul Ravindran
Most will be in the same hour. Some will be across 3-6 hours. Sent from my phone. Excuse the terseness. On Feb 14, 2013, at 12:19 PM, Viral Bajaria viral.baja...@gmail.com wrote: Are all these dupe events expected to be within the same hour, or can they happen over multiple hours? Viral

Re: Using HBase for Deduping

2013-02-14 Thread Viral Bajaria
You could take a two-pronged approach here, i.e. some MR and some HBase lookups. I don't think this is the best solution either, given the number of events you will get. FWIW, the solution below again relies on the assumption that if an event is duped in the same hour it won't have a dupe outside of

Using Hbase for Dedupping

2013-02-14 Thread Rahul Ravindran
Hi, We have events which are delivered into our HDFS cluster which may be duplicated. Each event has a UUID and we were hoping to leverage HBase to dedupe them. We run a MapReduce job which would perform a lookup for each UUID on HBase and then emit the event only if the UUID was absent and

Re: Using HBase for Deduping

2013-02-14 Thread Rahul Ravindran
We can't rely on the assumption that events will not be duplicated outside an hour boundary. So your take is that doing a lookup per event within the MR job is going to be bad? From: Viral Bajaria viral.baja...@gmail.com To: Rahul Ravindran rahu...@yahoo.com Cc:

Re: Using HBase for Deduping

2013-02-14 Thread Viral Bajaria
Given the size of the data (1B rows) and the frequency of the job run (once per hour), I don't think your optimal solution is to look up HBase for every single event. You will benefit more by loading the HBase table directly in your MR job. In 1B rows, what's the cardinality? Is it 100M UUIDs

Re: Using HBase for Deduping

2013-02-14 Thread Michael Segel
What constitutes a duplicate? An oversimplification is to do an HTable.checkAndPut() where you do the put if the column doesn't exist. Then, if the row is inserted (return value TRUE), you push the event. That will do what you want. At least at first blush. On Feb 14, 2013, at 3:24 PM,

Re: Using HBase for Deduping

2013-02-14 Thread Rahul Ravindran
checkAndPut() does not work when the row does not exist, or am I missing something? Sent from my phone. Excuse the terseness. On Feb 14, 2013, at 5:33 PM, Michael Segel michael_se...@hotmail.com wrote: What constitutes a duplicate? An oversimplification is to do an HTable.checkAndPut()

Re: Using HBase for Deduping

2013-02-14 Thread Michael Segel
Well, maybe it's a lack of sleep, but this is what I found... checkAndPut public boolean checkAndPut(byte[] row, byte[] family, byte[] qualifier, byte[] value, Put put)
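
The put-if-absent idea discussed in this thread can be sketched without a cluster. The block below is an in-memory stand-in, not the HBase client: the class name, the putIfAbsent method, and the String-keyed map are all hypothetical and exist only to illustrate the semantics. As far as I recall, the HBase client documentation of that era describes passing null as the expected value to checkAndPut as a check that the column does not exist, which is the behavior modeled here; treat that as an assumption to verify against your HBase version.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// In-memory stand-in for the check-and-put dedupe idea; NOT the HBase client API.
class CheckAndPutDedupeSketch {
    // Maps UUID -> first payload seen; plays the role of the HBase table.
    private final ConcurrentMap<String, String> table = new ConcurrentHashMap<>();

    /**
     * Atomically stores the event only if no entry exists for the UUID.
     * Returns true when the event was stored (first occurrence, so emit it),
     * false when the UUID was already present (duplicate, so drop it).
     */
    boolean putIfAbsent(String uuid, String payload) {
        return table.putIfAbsent(uuid, payload) == null;
    }

    /** Returns the payload stored for the UUID, or null if none was stored. */
    String stored(String uuid) {
        return table.get(uuid);
    }

    public static void main(String[] args) {
        CheckAndPutDedupeSketch dedupe = new CheckAndPutDedupeSketch();
        String[][] events = {
            {"uuid-1", "first copy"},
            {"uuid-2", "other event"},
            {"uuid-1", "second copy"},
        };
        for (String[] event : events) {
            // Emit only when the atomic insert succeeded.
            if (dedupe.putIfAbsent(event[0], event[1])) {
                System.out.println("emit " + event[0] + ": " + event[1]);
            } else {
                System.out.println("drop duplicate " + event[0]);
            }
        }
    }
}
```

Note that in this model the first write wins and later copies are dropped, which is the opposite of relying on a single stored version where the last write wins.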

question about pre-splitting regions

2013-02-14 Thread Viral Bajaria
Hi, I am creating a new table and want to pre-split the regions and am seeing some weird behavior. My table is designed as a composite of multiple fixed-length byte arrays separated by a control character (for simplicity's sake we can say the separator is _underscore_). The prefix of this rowkey

RE: Using Hbase for Dedupping

2013-02-14 Thread Anoop Sam John
Hi Rahul, When you say that some events can come with a duplicate UUID, what is the probability of such duplicate events? Is it like most of the events won't be unique and only a few are duplicates? Also, do the same duplicated events come again and again (I mean same UUID for so

Re: question about pre-splitting regions

2013-02-14 Thread Viral Bajaria
I was able to figure it out. I had to use the createTable API that takes splitKeys instead of the one that takes startKey, endKey and numPartitions. If anyone comes across this issue and needs more feedback, feel free to ping me. Thanks, Viral On Thu, Feb 14, 2013 at 7:30 PM, Viral Bajaria
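
A minimal sketch of building explicit split keys of the kind the splitKeys overload expects. The thread does not show the actual rowkey layout, so this assumes a hypothetical one-byte leading prefix spread evenly over 0x00..0xFF; the class and method names are illustrative only. The resulting byte[][] is the shape that, to my knowledge, HBaseAdmin.createTable(HTableDescriptor, byte[][] splitKeys) accepts, but the call itself is left out so the block runs without HBase on the classpath.

```java
// Illustrative helper for computing explicit region split keys; not part of HBase.
class SplitKeySketch {
    /**
     * Returns numRegions - 1 single-byte split keys that divide the
     * unsigned byte range 0x00..0xFF into numRegions roughly equal parts.
     */
    static byte[][] evenSplitKeys(int numRegions) {
        if (numRegions < 2 || numRegions > 256) {
            throw new IllegalArgumentException("numRegions must be between 2 and 256");
        }
        byte[][] keys = new byte[numRegions - 1][];
        for (int i = 1; i < numRegions; i++) {
            // Region boundary as an unsigned value in 1..255.
            int boundary = i * 256 / numRegions;
            keys[i - 1] = new byte[] { (byte) boundary };
        }
        return keys;
    }

    public static void main(String[] args) {
        // Print each boundary as two hex digits.
        for (byte[] key : evenSplitKeys(4)) {
            System.out.println(String.format("%02x", key[0] & 0xff));
        }
    }
}
```

With a composite rowkey, the split keys would need to match the real prefix length and encoding rather than the single byte assumed here.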

RE: Using HBase for Deduping

2013-02-14 Thread Anoop Sam John
When max versions is set to 1 and a duplicate key is added, the last one added wins, removing the old one. Is this what you want, Rahul? I think from his explanation he needs the reverse. -Anoop- From: Asaf Mesika [asaf.mes...@gmail.com] Sent: Friday, February