Should performance differ significantly if the row value size is small and there are not too many versions? Assume that the pack of versions for a key is smaller than the recommended HFile block size (8 KB to 1 MB, see https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/io/hfile/HFile.html), which is the minimal "read unit". In that case, should we see any difference at all? Am I right to expect none?
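To put rough numbers on the question, here is a back-of-the-envelope sketch in Python. It assumes the classic KeyValue layout (20 bytes of fixed overhead per cell) and the default 64 KB HFile block size. The row key, family, qualifier, and value lengths are made-up illustrative values, not measurements, and compression, data block encoding, and cell tags are ignored.

```python
# Fixed per-cell overhead in the classic KeyValue layout (an assumption of
# this sketch): key length (4) + value length (4) + row length (2)
# + family length (1) + timestamp (8) + key type (1) = 20 bytes.
KEYVALUE_FIXED_OVERHEAD = 4 + 4 + 2 + 1 + 8 + 1

# Default HFile block size in HBase (64 KB); it is configurable per family.
DEFAULT_BLOCK_SIZE = 64 * 1024


def cell_size(row_len, family_len, qualifier_len, value_len):
    """Approximate uncompressed size in bytes of one stored cell."""
    return (KEYVALUE_FIXED_OVERHEAD + row_len + family_len
            + qualifier_len + value_len)


def pack_size(cells, row_len, family_len, qualifier_len, value_len):
    """Approximate size of `cells` cells that all share the same lengths."""
    return cells * cell_size(row_len, family_len, qualifier_len, value_len)


def fits_in_block(size_bytes, block_size=DEFAULT_BLOCK_SIZE):
    """True if the pack is no larger than one HFile block."""
    return size_bytes <= block_size


# Hypothetical event: 16-byte row key, 1-byte family, 4-byte qualifier,
# 100-byte value, 10 events kept per user.
as_versions = pack_size(10, 16, 1, 4, 100)

# Same events as separate rows: assume the row key grows by an 8-byte
# timestamp suffix so that every event gets its own row.
as_rows = pack_size(10, 16 + 8, 1, 4, 100)

print("versions layout:", as_versions, "bytes, one block:", fits_in_block(as_versions))
print("rows layout:    ", as_rows, "bytes, one block:", fits_in_block(as_rows))
```

Under these assumptions both layouts stay far below a single block, so the bytes read from disk for one user should be about the same either way. The picture changes once the number of versions grows into the thousands.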
2015-01-20 0:33 GMT+03:00 Jean-Marc Spaggiari <jean-m...@spaggiari.org>:

> Hi Yong,
>
> If you want to compare the performance, you need to run much bigger and
> longer tests. Don't run them in parallel. Run them at least 10 times each
> to make sure you have a good trend. Is the difference between the 2
> significant? It should not be.
>
> JM
>
> 2015-01-19 15:17 GMT-05:00 yonghu <yongyong...@gmail.com>:
>
> > Hi,
> >
> > Thanks for your suggestion. I have already considered the first issue,
> > that one row is not allowed to be split between 2 regions.
> >
> > However, I have made a small scan test with MapReduce. I first created
> > a table t1 with 1 million rows and allowed each column to store 10 data
> > versions. Then, I translated t1 into t2, in which the multiple data
> > versions in t1 were transformed into multiple rows in t2. I wrote two
> > MapReduce programs to scan t1 and t2 individually. What I got is that
> > the table scanning time of t1 is shorter than that of t2. So, I think
> > that for performance reasons, multiple data versions may be a better
> > option than multiple rows.
> >
> > But just as you said, which approach to use depends on how many
> > historical events you want to keep.
> >
> > regards!
> >
> > Yong
> >
> > On Mon, Jan 19, 2015 at 8:37 PM, Jean-Marc Spaggiari
> > <jean-m...@spaggiari.org> wrote:
> >
> > > Hi Yong,
> > >
> > > A row will not be split between 2 regions. If you plan on having
> > > thousands of versions, then depending on the size of your data, you
> > > might end up with a row bigger than your preferred region size.
> > >
> > > If you plan to keep just a few versions of the history to have a look
> > > at it, I would say go with versions. If you plan to have one million
> > > versions because you want to keep the whole event history, go with
> > > the row approach.
> > >
> > > You can also consider going with the Column Qualifier approach. This
> > > has the same constraint as the versions regarding the split between
> > > 2 regions, but it might be easier to manage and still gives you the
> > > consistency of being within a row.
> > >
> > > JM
> > >
> > > 2015-01-19 14:28 GMT-05:00 yonghu <yongyong...@gmail.com>:
> > >
> > > > Dear all,
> > > >
> > > > I want to record user history data. I know there exist two
> > > > options: one is to store user events in a single row with multiple
> > > > data versions, and the other is to use multiple rows. I wonder
> > > > which one is better for performance?
> > > >
> > > > Thanks!
> > > >
> > > > Yong