First, in the world of Hadoop, "if it ain't broke, don't fix it" may not be the best advice.
HBase is still evolving at a good pace, and you want to stay close to the latest releases.
CDH4 is stable, so I would agree that moving to CDH4 would be best.
Second: you are running this as a single
Could not find artifact org.apache.hadoop:hadoop-core:jar:2.0.0-cdh4.1.2
This is because the YCSB POM does not include Cloudera repositories. To get
anywhere with this, you'll need to open an issue on the YCSB GitHub page.
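One common workaround (rather than waiting on a YCSB issue) is to add Cloudera's public Maven repository to the POM so the CDH artifacts can resolve. A sketch, assuming Cloudera's standard repository URL:

```xml
<!-- Added to the YCSB pom.xml; the URL is Cloudera's public Maven repository -->
<repositories>
  <repository>
    <id>cloudera</id>
    <url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
  </repository>
</repositories>
```

Note also that Maven coordinates are case-sensitive: the artifact is `org.apache.hadoop:hadoop-core`, not `Hadoop-core`.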
I have tried writing:
Hadoop: 2.0.0
HBase: 0.92.1
But I had
Hi,
We have events which are delivered into our HDFS cluster which may be
duplicated. Each event has a UUID and we were hoping to leverage HBase to
dedupe them. We run a MapReduce job which would perform a lookup for each UUID
on HBase and then emit the event only if the UUID was absent and
Are all these dupe events expected to be within the same hour, or can they happen over multiple hours?
Viral
From: Rahul Ravindran
Sent: 2/14/2013 11:41 AM
To: user@hbase.apache.org
Subject: Using HBase for Deduping
Most will be in the same hour. Some will be across 3-6 hours.
Sent from my phone. Excuse the terseness.
On Feb 14, 2013, at 12:19 PM, Viral Bajaria viral.baja...@gmail.com wrote:
Are all these dupe events expected to be within the same hour, or can they happen over multiple hours?
Viral
You could go with a two-pronged approach here, i.e. some MR and some HBase
lookups. I don't think this is the best solution either, given the # of
events you will get.
FWIW, the solution below again relies on the assumption that if an event is
duped in the same hour it won't have a dupe outside of that hour.
We can't rely on the assumption that event dupes will not occur outside an hour
boundary. So your take is that doing a lookup per event within the MR job is
going to be bad?
From: Viral Bajaria viral.baja...@gmail.com
To: Rahul Ravindran rahu...@yahoo.com
Cc:
Given the size of the data (1B rows) and the frequency of the job run (once
per hour), I don't think your most optimal solution is to look up HBase for
every single event. You will benefit more by loading the HBase table
directly in your MR job.
In 1B rows, what's the cardinality? Is it 100M UUIDs?
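Loading the table's keys into the job rather than doing a point lookup per event amounts to a join-and-dedupe by key: group events by UUID and keep one per group. As an illustration of that logic only (plain Java, no Hadoop or HBase dependencies; class and field names are made up for the sketch):

```java
import java.util.*;

public class UuidDedupe {
    // Simulates the reduce side of a dedupe job: events arrive keyed by
    // UUID, and only the first event per UUID is emitted.
    public static List<String> dedupe(List<String[]> events) {
        // events: pairs of {uuid, payload}
        Map<String, String> firstByUuid = new LinkedHashMap<>();
        for (String[] e : events) {
            firstByUuid.putIfAbsent(e[0], e[1]); // keep first occurrence only
        }
        return new ArrayList<>(firstByUuid.values());
    }

    public static void main(String[] args) {
        List<String[]> events = Arrays.asList(
            new String[]{"u1", "eventA"},
            new String[]{"u2", "eventB"},
            new String[]{"u1", "eventA-dup"});
        System.out.println(dedupe(events)); // [eventA, eventB]
    }
}
```

In a real MR job the grouping by UUID comes for free from the shuffle, so the reducer only ever sees one key at a time.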
What constitutes a duplicate?
An oversimplification is to do an HTable.checkAndPut() where you do the put if
the column doesn't exist.
Then, if the return value is TRUE (the row was inserted), you push the event.
That will do what you want.
At least at first blush.
On Feb 14, 2013, at 3:24 PM,
checkAndPut() does not work when the row does not exist, or am I missing
something?
Sent from my phone. Excuse the terseness.
On Feb 14, 2013, at 5:33 PM, Michael Segel michael_se...@hotmail.com wrote:
What constitutes a duplicate?
An oversimplification is to do an HTable.checkAndPut()
Well,
Maybe it's a lack of sleep, but this is what I found...

checkAndPut

public boolean checkAndPut(byte[] row,
                           byte[] family,
                           byte[] qualifier,
                           byte[] value,
                           Put put)
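On the question upthread of whether checkAndPut() works for an absent row: passing null as the expected value makes the check mean "the cell must not exist," so the put goes through exactly when there is no prior row, which is the put-if-absent behavior the dedupe needs. Without a live cluster to run against, here is a sketch of that contract using ConcurrentHashMap.putIfAbsent as a stand-in for the atomic check-then-put (the row keys are made up):

```java
import java.util.concurrent.ConcurrentHashMap;

public class CheckAndPutSketch {
    private final ConcurrentHashMap<String, String> table = new ConcurrentHashMap<>();

    // Mimics checkAndPut(row, family, qualifier, null, put): atomically
    // writes only if the cell is absent and reports whether the write won.
    public boolean checkAndPut(String rowKey, String value) {
        return table.putIfAbsent(rowKey, value) == null;
    }

    public static void main(String[] args) {
        CheckAndPutSketch t = new CheckAndPutSketch();
        System.out.println(t.checkAndPut("uuid-1", "seen")); // true: new row, emit the event
        System.out.println(t.checkAndPut("uuid-1", "seen")); // false: duplicate, drop it
    }
}
```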
Hi,
I am creating a new table and want to pre-split the regions and am seeing
some weird behavior.
My table is designed as a composite of multiple fixed-length byte arrays
separated by a control character (for simplicity's sake we can say the
separator is _underscore_). The prefix of this rowkey
Hi Rahul
When you say that some events can come with duplicate UUIDs, what
is the probability of such duplicate events? Is it that most of the events
won't be unique and only a few are duplicates? Also, do these same duplicated
events come again and again (I mean the same UUID for so
I was able to figure it out. I had to use the createTable API which takes
splitKeys instead of the startKey, endKey, and numPartitions.
If anyone comes across this issue and needs more feedback feel free to ping
me.
Thanks,
Viral
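That createTable overload takes the split boundaries explicitly as a byte[][]. For a rowkey that starts with a roughly uniform fixed-length byte prefix, one way to build those boundaries is to space them evenly over the first byte; a sketch of the key generation alone (plain Java, no HBase dependency; the region count of 16 is just an example):

```java
public class SplitKeys {
    // Builds numRegions-1 single-byte split points spaced evenly over the
    // unsigned byte range, suitable for rowkeys with a uniform byte prefix.
    public static byte[][] evenSplits(int numRegions) {
        byte[][] splits = new byte[numRegions - 1][];
        for (int i = 1; i < numRegions; i++) {
            splits[i - 1] = new byte[]{(byte) (i * 256 / numRegions)};
        }
        return splits;
    }

    public static void main(String[] args) {
        // With 16 regions: 15 split keys at 0x10, 0x20, ..., 0xF0.
        byte[][] splits = evenSplits(16);
        System.out.println(splits.length); // 15
        System.out.println(Integer.toHexString(splits[0][0] & 0xFF)); // 10
    }
}
```

The resulting array would then be handed to the splitKeys-taking createTable overload instead of a (startKey, endKey, numRegions) triple.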
On Thu, Feb 14, 2013 at 7:30 PM, Viral Bajaria
When max versions is set to 1 and a duplicate key is added, the last one added
wins, removing the old one. Is this what you want, Rahul? I think from his
explanation he needs it the other way around.
-Anoop-
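For reference, capping versions is a table-schema setting; in the HBase shell it looks like the following (table and family names are placeholders):

```
create 'events', {NAME => 'd', VERSIONS => 1}
```

As noted above, this only collapses duplicates inside the table; the writer never learns whether a given event was new, which is why it doesn't answer the dedupe question by itself.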
From: Asaf Mesika [asaf.mes...@gmail.com]
Sent: Friday, February