I think we need to look at different situations.
1. One column gets frequently updated while the others do not. If we use the
row representation, we store the unchanged data values again for each new
tuple. This may cause large data redundancy, which I think can explain the
result of my test with multiple data versions. Performance should not
differ significantly if the row value size is small and we don't have too
many versions.
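The redundancy argument above can be made concrete with a little arithmetic; the sizes below are invented purely for illustration, not measured from HBase:

```python
# Sketch: bytes written when one column out of several is updated repeatedly.
# All sizes here are made-up illustrative numbers, not HBase measurements.

NUM_UPDATES = 1000      # updates to the frequently-changed column
NUM_COLUMNS = 10        # total columns in the row
VALUE_SIZE = 100        # assumed bytes per column value

# "Row" representation: every update rewrites the whole logical row,
# so the unchanged values are stored again with each new version.
row_repr_bytes = NUM_UPDATES * NUM_COLUMNS * VALUE_SIZE

# Per-cell representation: only the changed cell is written.
col_repr_bytes = NUM_UPDATES * VALUE_SIZE

# When only one column changes, the redundancy factor equals NUM_COLUMNS.
print(row_repr_bytes, col_repr_bytes, row_repr_bytes // col_repr_bytes)
```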
Assume that a pack of versions for a key is smaller than the recommended
HFile block size (8KB to 1MB,
https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/io/hfile/HFile.html),
which is the minimal "read
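Under that assumption, a quick back-of-the-envelope check is easy; the key and value sizes below are made up for the sake of the example:

```python
# Sketch: estimate whether all versions of a key fit in one HFile block.
# Key/value sizes are rough assumptions, not exact HBase KeyValue internals.

KEY_SIZE = 40            # rowkey + qualifier + timestamp, assumed
VALUE_SIZE = 200         # assumed value size in bytes
NUM_VERSIONS = 10

pack_size = NUM_VERSIONS * (KEY_SIZE + VALUE_SIZE)

HFILE_BLOCK_MIN = 8 * 1024   # 8KB, lower end of the recommended range
fits = pack_size <= HFILE_BLOCK_MIN
print(pack_size, fits)
```

If the whole pack fits in one block, reading every version of the key costs roughly the same I/O as reading one version, which supports the point above.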
Hi Yong,
If you want to compare the performance, you need to run much bigger and
longer tests. Don't run them in parallel. Run them at least 10 times each to
make sure you see a consistent trend. Is the difference between the two
significant? It should not be.
JM
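One way to follow this advice is to collect the repeated timings and compare the means against the spread of each sample. The numbers below are placeholders to be replaced with real measurements:

```python
import statistics

# Placeholder benchmark timings (seconds) from 10 repeated runs of each test;
# substitute your own measurements.
runs_versions = [12.1, 11.8, 12.4, 12.0, 12.2, 11.9, 12.3, 12.1, 12.0, 12.2]
runs_rows     = [12.5, 12.1, 12.6, 12.3, 12.4, 12.2, 12.7, 12.4, 12.3, 12.5]

def summarize(samples):
    return statistics.mean(samples), statistics.stdev(samples)

mean_v, sd_v = summarize(runs_versions)
mean_r, sd_r = summarize(runs_rows)

# Crude check: if the difference between means is smaller than the combined
# spread, it is probably noise rather than a real trend.
significant = abs(mean_v - mean_r) > (sd_v + sd_r)
print(f"versions: {mean_v:.2f}+-{sd_v:.2f}  rows: {mean_r:.2f}+-{sd_r:.2f}  "
      f"significant: {significant}")
```

A proper significance test (e.g. a t-test) would be stronger, but even this rough comparison guards against reading meaning into run-to-run noise.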
2015-01-19 15:17 GMT-05:00 yonghu :
Hi,
Thanks for your suggestion. I have already considered the first issue, that
one row is not allowed to be split between two regions.
However, I have run a small scan test with MapReduce. I first created a
table t1 with 1 million rows and allowed each column to store 10 data
versions. Then, I tr
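For reference, a table like the one described can be set up in the HBase shell; the commands below use standard shell syntax, with the row key and column names chosen only for illustration:

```
# One column family 'cf' keeping up to 10 versions per cell
create 't1', {NAME => 'cf', VERSIONS => 10}
put 't1', 'user1', 'cf:event', 'login'     # each put adds a new version
scan 't1', {VERSIONS => 10}                # read back all stored versions
```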
Hi Yong,
A row will not be split between two regions. If you plan on having
thousands of versions, then depending on the size of your data, you might
end up with a row bigger than your preferred region size.
If you plan to keep just a few versions of the history to look at, I would
say go with it. If you pl
Dear all,
I want to record user history data. I know there exist two options: one is
to store user events in a single row with multiple data versions, and the
other is to use multiple rows. I wonder which one is better for
performance?
Thanks!
Yong
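For readers weighing the same choice, the two options can be sketched as key layouts in plain Python; the names and timestamps are illustrative only, not HBase API code:

```python
# Sketch of the two schema options for user history.
# (rowkey, column, timestamp) -> value mimics HBase's cell coordinates.

events = [("user1", 1421700000, "login"),
          ("user1", 1421700060, "click"),
          ("user1", 1421700120, "logout")]

# Option 1: one row per user, history kept as cell versions
# distinguished by timestamp.
option1 = {("user1", "cf:event", ts): ev for _, ts, ev in events}

# Option 2: one row per event, with the timestamp folded into the row key
# so a prefix scan on "user1#" returns the history in key order.
option2 = {f"user1#{ts}": ev for _, ts, ev in events}

print(len(option1), len(option2))
```

Option 1 keeps the history physically clustered in one row (bounded by the configured VERSIONS limit), while Option 2 spreads it over rows that regions can split freely, which is the trade-off discussed in the replies above.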