Re: [ hbase ] performance of Get from MR Job

2012-06-27 Thread Michael Segel
I'm not sure what you are attempting to do with your data. There are a couple of things to look at. Looking at the issue, you have a (K,V) pair: a Key and a Value. But the value isn't necessarily a single element. It could be a set of elements. You have to consider that rather than store…
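
A minimal sketch of that modeling idea, assuming the 0.9x-era Java client (the table, family, and qualifier names are invented): each element of the set gets its own qualifier in a single row, so one Get fetches the whole set.

    import java.io.IOException;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class SetPerRow {
      public static void main(String[] args) throws IOException {
        HTable table = new HTable(HBaseConfiguration.create(), "enrichment");
        Put put = new Put(Bytes.toBytes("user42"));   // one row per key
        // each element of the set gets its own qualifier in family "d"
        put.add(Bytes.toBytes("d"), Bytes.toBytes("elem1"), Bytes.toBytes(10L));
        put.add(Bytes.toBytes("d"), Bytes.toBytes("elem2"), Bytes.toBytes(20L));
        table.put(put);   // a single Get on "user42" now returns the whole set
        table.close();
      }
    }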

Re: performance of Get from MR Job

2012-06-26 Thread Jean-Daniel Cryans
The increase in data size will be due to your bigger row keys, which are stored alongside every value. It's best to keep them on the small side: http://hbase.apache.org/book.html#keysize Consider writing the numbers in a binary format instead of storing them textually, so that a long takes only 8 bytes…
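
A quick illustration of the size difference, using the client's Bytes helper:

    import org.apache.hadoop.hbase.util.Bytes;

    public class KeySize {
      public static void main(String[] args) {
        long id = 1340000000000L;
        byte[] textual = Bytes.toBytes(Long.toString(id));
        byte[] binary = Bytes.toBytes(id);
        System.out.println(textual.length); // 13 bytes here, and it grows with the number
        System.out.println(binary.length);  // always 8 bytes
      }
    }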

Re: [ hbase ] Re: performance of Get from MR Job

2012-06-25 Thread Marcin Cylke
On 21/06/12 14:33, Michael Segel wrote: > I think the version issue is the killer factor here. Usually performing a simple get() where you are getting the latest version of the data on the row/cell occurs in some constant time k. This is constant regardless of the size of the cluster and s…

Re: performance of Get from MR Job

2012-06-21 Thread Michael Segel
I think the version issue is the killer factor here. Usually performing a simple get() where you are getting the latest version of the data on the row/cell occurs in some constant time k. This is constant regardless of the size of the cluster and should scale in a near linear curve. As JD C…
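
For context, the constant-time case described here is the client's default: a Get returns only the newest version of each cell unless more are requested. A sketch making that explicit (table and column names invented):

    import java.io.IOException;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    public class LatestVersionGet {
      public static void main(String[] args) throws IOException {
        HTable table = new HTable(HBaseConfiguration.create(), "enrichment");
        Get get = new Get(Bytes.toBytes("user42"));
        get.setMaxVersions(1);   // only the newest version of each cell
        Result r = table.get(get);
        byte[] latest = r.getValue(Bytes.toBytes("d"), Bytes.toBytes("attr"));
        System.out.println(Bytes.toStringBinary(latest));
        table.close();
      }
    }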

Re: performance of Get from MR Job

2012-06-20 Thread Jean-Daniel Cryans
Yeah I've overlooked the versions issue. What I usually recommend is that if the timestamp is part of your data model, it should be in the row key, a qualifier or a value. Since you seem to rely on the timestamp for querying, it should definitely be part of the row key, but not at the beginning lik…
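
One common encoding of that advice, assuming epoch-millisecond timestamps and an invented id: lead with the entity id so point lookups stay cheap, and append a reversed timestamp so each entity's rows sort newest-first.

    import org.apache.hadoop.hbase.util.Bytes;

    public class CompositeKey {
      public static void main(String[] args) {
        String entityId = "user42";
        long ts = System.currentTimeMillis();
        // id leads, so lookups by id stay cheap; the reversed timestamp
        // trails, so versions sort newest-first within one entity
        byte[] rowKey = Bytes.add(Bytes.toBytes(entityId),
                                  Bytes.toBytes(Long.MAX_VALUE - ts));
        System.out.println(Bytes.toStringBinary(rowKey));
      }
    }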

Re: performance of Get from MR Job

2012-06-19 Thread Marcin Cylke
On 19/06/12 19:31, Jean-Daniel Cryans wrote: > This is a common but hard problem. I do not have a good answer. Thanks for your writeup. You've given a few suggestions that I will surely follow. But what is bothering me is my use of timestamps. As mentioned before, my column family has 214748364…
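
If the enormous version count is not intentional, the family can be declared with a small cap instead. A sketch with the era's admin API (table and family names invented):

    import java.io.IOException;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.client.HBaseAdmin;

    public class CapVersions {
      public static void main(String[] args) throws IOException {
        HBaseAdmin admin = new HBaseAdmin(HBaseConfiguration.create());
        HColumnDescriptor family = new HColumnDescriptor("d");
        family.setMaxVersions(1);   // keep only the newest cell per column
        HTableDescriptor desc = new HTableDescriptor("enrichment");
        desc.addFamily(family);
        admin.createTable(desc);
        admin.close();
      }
    }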

Re: performance of Get from MR Job

2012-06-19 Thread Jean-Daniel Cryans
This is a common but hard problem. I do not have a good answer. The issue with doing random reads for each line you are processing is that there's no way to batch them, so you're basically doing this:
- Open a socket to a region server
- Send the request over the network
- The region server seeks…
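
The snippet is cut off here, but one standard way to amortize those per-request costs is to buffer row keys yourself and issue them as a multi-get via HTable.get(List<Get>), available in 0.90+ clients. A sketch, with the surrounding class invented:

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    public class BufferedLookups {
      private static final int BATCH = 100;
      private final List<Get> buffer = new ArrayList<Get>(BATCH);
      private final HTable table;

      public BufferedLookups(HTable table) { this.table = table; }

      public void lookup(String key) throws IOException {
        buffer.add(new Get(Bytes.toBytes(key)));
        if (buffer.size() >= BATCH) flush();
      }

      public void flush() throws IOException {
        Result[] results = table.get(buffer); // lookups grouped into batched RPCs per region server
        // ... consume results here ...
        buffer.clear();
      }
    }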

Re: performance of Get from MR Job

2012-06-19 Thread Paul Mackles
One thing we observed with a similar setup was that if we added a reducer and then used something like HRegionPartitioner to partition the data, our GET performance improved dramatically. While you take a hit for adding the reducer, it was worth it in our case. We never quite figured out why that h…
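
A sketch of that wiring (job, reducer, and table names invented): HRegionPartitioner assigns map output keys to partitions by the region of the configured table, so each reducer's Gets concentrate on one region server.

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.HRegionPartitioner;
    import org.apache.hadoop.hbase.mapreduce.TableOutputFormat;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;

    public class PartitionedEnrichJob {
      public static void main(String[] args) throws Exception {
        Job job = new Job(HBaseConfiguration.create(), "enrich");
        job.setMapOutputKeyClass(ImmutableBytesWritable.class); // row key out of the mapper
        job.setMapOutputValueClass(Text.class);
        job.setPartitionerClass(HRegionPartitioner.class);
        // the partitioner reads the table name from this property
        job.getConfiguration().set(TableOutputFormat.OUTPUT_TABLE, "enrichment");
        // ... set mapper/reducer classes, input and output paths ...
        job.waitForCompletion(true);
      }
    }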

performance of Get from MR Job

2012-06-19 Thread Marcin Cylke
Hi, I've run into some performance issues with my Hadoop MapReduce job. Basically what I'm doing with it is:
- read data from an HDFS file
- the output also goes to HDFS files (multiple ones in my scenario)
- in my mapper I process each line and enrich it with some data read from an HBase table (I do Get…
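
For concreteness, the shape of the job being described, under assumed names (table "enrichment", family "d", qualifier "attr", tab-separated input keyed on the first field):

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class EnrichMapper extends Mapper<LongWritable, Text, NullWritable, Text> {
      private HTable table;

      @Override
      protected void setup(Context ctx) throws IOException {
        Configuration conf = HBaseConfiguration.create(ctx.getConfiguration());
        table = new HTable(conf, "enrichment");   // one table handle per task
      }

      @Override
      protected void map(LongWritable offset, Text line, Context ctx)
          throws IOException, InterruptedException {
        String[] fields = line.toString().split("\t");
        Result r = table.get(new Get(Bytes.toBytes(fields[0]))); // one RPC per input line
        byte[] extra = r.getValue(Bytes.toBytes("d"), Bytes.toBytes("attr"));
        ctx.write(NullWritable.get(),
            new Text(line + "\t" + (extra == null ? "" : Bytes.toString(extra))));
      }

      @Override
      protected void cleanup(Context ctx) throws IOException {
        table.close();
      }
    }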