I agree with JMS that wide tables are ideally best avoided. Also, the versions feature still behaves inconsistently in some cases (see HBASE-21596, for example). Between the two options, I would favour "a" over "b", as it seems to give more flexibility in how you can access and delete these columns.
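To make the flexibility point concrete, here is a minimal sketch of the two layouts from Serkan's example. It is a toy in-memory model in plain Python, not the HBase client API: the helper names (qualifier_for_hour, put_wide, put_versioned) are made up for illustration, and the hour number stands in for the cell timestamp.

```python
# Toy in-memory model of the two layouts; NOT the HBase client API.
# All function names here are hypothetical and exist only for illustration.

HOURS_KEPT = 100  # retention window taken from the example in this thread


def qualifier_for_hour(hour):
    """Option a: reuse qualifiers cyclically, so hour 101 lands in urls_hour1."""
    return "urls_hour%d" % ((hour - 1) % HOURS_KEPT + 1)


def put_wide(row, hour, urls):
    """Option a: one qualifier per hour slot, a single live value in each."""
    row[qualifier_for_hour(hour)] = urls


def put_versioned(row, hour, urls, max_versions=HOURS_KEPT):
    """Option b: one qualifier, newest version first, capped at max_versions."""
    versions = row.setdefault("urls", [])
    versions.insert(0, (hour, urls))  # the hour number stands in for the timestamp
    del versions[max_versions:]       # stands in for versions beyond the limit being dropped


wide, deep = {}, {}
for hour in range(1, 102):  # write 101 hours of data
    put_wide(wide, hour, ["site%d.example" % hour])
    put_versioned(deep, hour, ["site%d.example" % hour])

# Option a: hour 50 is addressed (and deleted) directly by its qualifier.
del wide[qualifier_for_hour(50)]

# Option b: hour 50 has to be located among the versions by its timestamp.
deep["urls"] = [(ts, urls) for ts, urls in deep["urls"] if ts != 50]
```

In this model, option "a" touches one hour by name, while option "b" has to look through the version list for it. That is the extra flexibility I mean, under the assumptions above.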

On Sun, 31 Mar 2019 at 00:12, Jean-Marc Spaggiari <[email protected]> wrote:

> Hi Serkan,
>
> This is my personal opinion and some might not share it ;)
>
> I tried to go with the deep versions approach for one project and I found
> issues on some of the calls (pagination over versions as an example). So if
> for you both (the deep versions and the wide columns) are the same, I would
> say, better go with the wide columns.
>
> Also, why not go with a tall table instead of a wide one?
>
> JMS
>
> On Sat, 30 Mar 2019 at 01:14, Serkan Uzunbaz <[email protected]> wrote:
>
> > Hi all,
> >
> > I have a question regarding the difference between storing a set of data
> > as:
> > *a) n columns with 1 version each*
> > *b) 1 column with n versions*
> >
> > Since the storage unit in hbase is a cell (rowkey, column family, column
> > qualifier, timestamp), is there a difference between the above two storage
> > options in terms of read/write performance, compaction/GC time, etc?
> >
> > I know it is not recommended to use a high number of versions if you do not
> > really need them. However, if those n versions of data are really needed
> > for reading, then will it cause any problem to store the data in a single
> > column with n versions? Also, even if max versions is set to 1 for a column
> > (option a), new values are still stored as a new cell and the old cell is
> > deleted at compaction time. So, I also feel like compaction-wise the two
> > options are identical.
> >
> > I wonder if there is anything that makes one option superior to the other.
> >
> > *Example*: To clarify more, say the data to be stored is a set of urls
> > visited in certain time ranges and we want to keep the last 100 hours of
> > url sets:
> >
> > *a) store each hour as column name with one url set in it (column names
> > will be used in cyclic manner (data for hour 101 will be written into
> > column 1))*
> > column_qualifier: value
> > ---------------------------
> > urls_hour1: <abc.com, xyz.com, ...>
> > urls_hour2: <urls>
> > urls_hour3: <urls>
> > ...
> > urls_hour100: <urls>
> >
> > *b) store in a single column with 100 versions (one for each hour) (max
> > versions for column will be 100 and hbase will do the auto-compaction for
> > old versions)*
> > column_qualifier: value @ timestamp
> > ---------------------------
> > urls: <abc.com, xyz.com, ...> @ ts_hour1, <urls> @ ts_hour2, <urls> @
> > ts_hour3, .... , <urls> @ ts_hour100
> >
> > Thanks,
> > -Serkan
