I tried the bulk load and the KeyValue size count with an uncompressed
table, and it makes sense now: the count is equal to the store file size.
I also took a look at the (uncompressed) files and they seem to be OK.

The entire bulk load is ~100GB uncompressed; with GZ it ends up at 7GB.

Could such a compression ratio make sense when a table has many qualifiers
per row (the average is 16, but in practice some rows have many more, and a
small number of rows have hundreds of thousands)? If each KeyValue contains
the rowkey, and the rowkeys contain more bytes than the qualifiers / values,
then the rowkeys repeat themselves in the HFile and actually make up most of
the HFile, right?
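
To sanity check this reasoning, here is a rough Python sketch (not HBase
code) that serializes synthetic cells in the classic KeyValue layout as I
understand it (key length, value length, row length, row, family length,
family, qualifier, timestamp, type, value; no tags) and gzips the result.
The 64-byte rowkeys, the family "f", the 3-byte qualifiers and the 4-byte
values are all made-up assumptions, not my real data, so the printed ratio
only illustrates the effect and says nothing about the actual 100GB -> 7GB
numbers.

```python
import gzip
import random
import struct

PUT_TYPE = 4  # type code of a Put cell in the KeyValue layout


def encode_keyvalue(row, family, qualifier, timestamp, value):
    """Serialize one cell in the classic KeyValue layout (no tags)."""
    key = (
        struct.pack(">H", len(row)) + row            # 2-byte row length + row
        + struct.pack(">B", len(family)) + family    # 1-byte family length + family
        + qualifier                                  # qualifier has no length prefix
        + struct.pack(">Q", timestamp)               # 8-byte timestamp
        + struct.pack(">B", PUT_TYPE)                # 1-byte type
    )
    # 4-byte key length and 4-byte value length precede the key and value.
    return struct.pack(">II", len(key), len(value)) + key + value


def synthetic_hfile_bytes(num_rows=2000, qualifiers_per_row=16, row_len=64):
    """Concatenate sorted synthetic cells; every cell repeats its full rowkey."""
    rng = random.Random(0)  # fixed seed so runs are repeatable
    rows = sorted(
        bytes(rng.getrandbits(8) for _ in range(row_len)) for _ in range(num_rows)
    )
    chunks = []
    for row in rows:
        for q in range(qualifiers_per_row):
            qualifier = b"q%02d" % q
            value = bytes(rng.getrandbits(8) for _ in range(4))
            chunks.append(encode_keyvalue(row, b"f", qualifier, 1389800000000, value))
    return b"".join(chunks)


if __name__ == "__main__":
    raw = synthetic_hfile_bytes()
    packed = gzip.compress(raw)
    one_cell = len(encode_keyvalue(b"r" * 64, b"f", b"q00", 0, b"vvvv"))
    print("bytes per cell:", one_cell, "of which rowkey:", 64)
    print("raw bytes:", len(raw))
    print("gzip bytes:", len(packed))
    print("ratio: %.1f" % (len(raw) / len(packed)))
```

With these assumed sizes the rowkey accounts for 64 of the 92 bytes in each
cell, so most of the uncompressed bytes are repeats of a rowkey that gzip
has just seen. The more qualifiers per row, the more of the file is such
repetition.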

On Wed, Jan 15, 2014 at 9:52 PM, Stack <st...@duboce.net> wrote:

> There can be a lot of duplication in what ends up in HFiles but 500MB ->
> 32MB does seem too good to be true.
>
> Could you try writing without GZIP or mess with the hfile reader[1] to see
> what your keys look like when at rest in an HFile (and maybe save the
> decompressed hfile to compare sizes?)
>
> St.Ack
> 1. http://hbase.apache.org/book.html#hfile
>
>
> On Wed, Jan 15, 2014 at 7:43 AM, Amit Sela <am...@infolinks.com> wrote:
>
> > I'm talking about the store files size and the ratio between store file
> > size and the byte count as counted in PutSortReducer.
> >
> >
> > On Wed, Jan 15, 2014 at 5:35 PM, Ted Yu <yuzhih...@gmail.com> wrote:
> >
> > > See previous discussion: http://search-hadoop.com/m/85S3A1DgZHP1
> > >
> > >
> > > On Wed, Jan 15, 2014 at 5:44 AM, Amit Sela <am...@infolinks.com> wrote:
> > >
> > > > Hi all,
> > > > I'm trying to measure the size (in bytes) of the data I'm about to
> > > > load into HBase.
> > > > I'm using bulk load with PutSortReducer.
> > > > All bulk load data is loaded into new regions and not added to
> > > > existing ones.
> > > >
> > > > In order to count the size of all KeyValues in the Put object I
> > > > iterate over the Put's familyMap.values() and sum the KeyValue
> > > > lengths.
> > > > After loading the data, I check the region size by summing the
> > > > RegionLoad.getStorefileSizeMB().
> > > > Counting the Put objects size predicted ~500MB per region but in
> > > > practice I got ~32MB per region.
> > > > the table uses GZ compression but this cannot be the cause of such a
> > > > difference.
> > > >
> > > > Is counting the Put's KeyValues the correct way to count a row size ?
> > > > Is it comparable to the store files size ?
> > > >
> > > > Thanks,
> > > > Amit.
> > > >
> > >
> >
>
