@Stack: I counted both the compressed and uncompressed tables and the count is the same, so this really is a case where 100GB compresses down to 7GB :)
@Lars: I took a look at https://issues.apache.org/jira/browse/HBASE-4218 and it mentions that the encoding could make writing and scanning slower. Since I only write via bulk load I'm not worried about the write path, but how much slower will scanning be?
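In case it helps anyone following the thread, below is a minimal sketch of how enabling FAST_DIFF through the Java admin API might look. This is only my understanding of the 0.94-style API; "mytable" and "cf" are placeholder names, and the Compression class moved to the io.compress package in 0.96:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.client.HBaseAdmin;
    import org.apache.hadoop.hbase.io.encoding.DataBlockEncoding;
    import org.apache.hadoop.hbase.io.hfile.Compression; // 0.94; org.apache.hadoop.hbase.io.compress in 0.96+

    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);
    HColumnDescriptor cf = new HColumnDescriptor("cf");    // placeholder family name
    cf.setDataBlockEncoding(DataBlockEncoding.FAST_DIFF);  // block encoding on top of compression
    cf.setCompressionType(Compression.Algorithm.GZ);       // keep the existing GZ compression
    admin.disableTable("mytable");                         // offline schema change, the safe default in 0.94
    admin.modifyColumn("mytable", cf);
    admin.enableTable("mytable");
    admin.close();

As far as I know the encoding only applies to newly written HFiles, so existing data picks it up when the files are rewritten by the next major compaction.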
On Fri, Jan 17, 2014 at 8:20 PM, lars hofhansl <la...@apache.org> wrote:

> Somewhat unrelated, but you might benefit from block encoding in addition
> to compression in your case.
> Try to set DATA_BLOCK_ENCODING to FAST_DIFF in your column families.
>
> -- Lars
>
> ----- Original Message -----
> From: Amit Sela <am...@infolinks.com>
> To: user@hbase.apache.org
> Cc:
> Sent: Thursday, January 16, 2014 1:00 AM
> Subject: Re: KeyValue size in bytes compared to store files size
>
> I tried the bulk load and KeyValue size counts with an uncompressed table
> and it makes sense now: the count is equal to the store file size.
> I took a look at the (uncompressed) files and they seem to be OK.
>
> The entire bulk load is ~100GB; when using GZ it ends up as 7GB.
>
> Could such a compression ratio make sense in the case of many qualifiers
> per row in a table (the average is 16, but in practice there are some rows
> with much more, and even a small number of rows with hundreds of
> thousands...)? If each KeyValue contains the rowkey, and the rowkeys
> contain more bytes than the qualifiers / values, then the rowkeys repeat
> themselves in the HFile and actually make up most of the HFile, right?
>
> On Wed, Jan 15, 2014 at 9:52 PM, Stack <st...@duboce.net> wrote:
>
> > There can be a lot of duplication in what ends up in HFiles, but 500MB ->
> > 32MB does seem too good to be true.
> >
> > Could you try writing without GZIP, or mess with the hfile reader[1] to
> > see what your keys look like when at rest in an HFile (and maybe save the
> > decompressed hfile to compare sizes)?
> >
> > St.Ack
> > 1. http://hbase.apache.org/book.html#hfile
> >
> > On Wed, Jan 15, 2014 at 7:43 AM, Amit Sela <am...@infolinks.com> wrote:
> >
> > > I'm talking about the store files size and the ratio between the store
> > > file size and the byte count as counted in PutSortReducer.
> > >
> > > On Wed, Jan 15, 2014 at 5:35 PM, Ted Yu <yuzhih...@gmail.com> wrote:
> > >
> > > > See previous discussion: http://search-hadoop.com/m/85S3A1DgZHP1
> > > >
> > > > On Wed, Jan 15, 2014 at 5:44 AM, Amit Sela <am...@infolinks.com> wrote:
> > > >
> > > > > Hi all,
> > > > > I'm trying to measure the size (in bytes) of the data I'm about to
> > > > > load into HBase.
> > > > > I'm using bulk load with PutSortReducer.
> > > > > All bulk load data is loaded into new regions and not added to
> > > > > existing ones.
> > > > >
> > > > > In order to count the size of all KeyValues in the Put object I
> > > > > iterate over the Put's familyMap.values() and sum the KeyValue
> > > > > lengths.
> > > > > After loading the data, I check the region size by summing
> > > > > RegionLoad.getStorefileSizeMB().
> > > > > Counting the Put objects' size predicted ~500MB per region, but in
> > > > > practice I got ~32MB per region.
> > > > > The table uses GZ compression, but this cannot be the cause of such
> > > > > a difference.
> > > > >
> > > > > Is counting the Put's KeyValues the correct way to count a row size?
> > > > > Is it comparable to the store files size?
> > > > >
> > > > > Thanks,
> > > > > Amit.
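P.S. For anyone who lands on this thread later, here is a minimal sketch of the byte counting described above (summing KeyValue lengths for a Put, e.g. inside PutSortReducer). It assumes the 0.94-style Put/KeyValue API, where getFamilyMap() returns a List<KeyValue> per family; 0.96+ uses getFamilyCellMap() and Cell instead:

    import java.util.List;
    import org.apache.hadoop.hbase.KeyValue;
    import org.apache.hadoop.hbase.client.Put;

    // Sums the serialized length of every KeyValue in a Put. Each KeyValue carries
    // the full row key, family, qualifier, timestamp and value, which is why rows
    // with many qualifiers inflate this count compared to the compressed store files.
    public static long putSizeInBytes(Put put) {
        long size = 0;
        for (List<KeyValue> kvs : put.getFamilyMap().values()) {
            for (KeyValue kv : kvs) {
                size += kv.getLength(); // length of the KeyValue's backing byte region
            }
        }
        return size;
    }

Note that Put.heapSize() measures in-memory overhead rather than serialized bytes, so summing getLength() is the closer match to what ends up in the HFiles before compression and block encoding.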