On Tue, Jan 24, 2012 at 11:45 AM, Praveen Sripati
<praveensrip...@gmail.com> wrote:

> Thanks for the response. I am just getting started with HBase. And before
> getting into the code/api level details, I am trying to understand the
> problem area HBase is trying to address through its architecture/design.
>
> 1) So, what are the recommendations for having many columns with dense
> data? Is HBase not the right tool?
>

HBase's data model works great if your set of columns can be split into
separate column families whose columns are only accessed together. If you
often need to randomly access individual columns, then it might make sense to
put your column qualifiers inside your row key.
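
For illustration (not from the original reply), here's a rough Java sketch of
that key design against the 0.92-era client API: the qualifier is folded into
the row key, so a random read of one logical "column" becomes a point Get on a
narrow row. The table name "metrics", family "d", qualifier "v" and the
"user123|clicks" key are all invented for the example.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class QualifierInKey {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "metrics");

    // Instead of row = "user123", qualifier = "clicks", use
    // row = "user123|clicks" plus one fixed qualifier.
    byte[] row = Bytes.toBytes("user123" + "|" + "clicks");
    Put put = new Put(row);
    put.add(Bytes.toBytes("d"), Bytes.toBytes("v"), Bytes.toBytes(42L));
    table.put(put);

    // Random access to that single value is now a narrow point Get.
    Result r = table.get(new Get(row));
    long clicks = Bytes.toLong(r.getValue(Bytes.toBytes("d"), Bytes.toBytes("v")));
    System.out.println("clicks = " + clicks);
    table.close();
  }
}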

> 2) Also, if the data for a column is spread wide across blocks and maybe
> even across nodes how will HBase help in aggregation?
>

If a column family doesn't contain any of the columns your aggregation wants,
then HBase doesn't need to read that family's files at all. If you want to run
the aggregation on a subset of your key range, then HBase doesn't need to
touch the nodes that only hold data outside that range.
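
To make that pruning concrete, here's a hedged sketch (names and keys are
invented, not from this thread) of a client-side aggregation scan that touches
only one column family and one key range:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class RangeAggregation {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "orders");

    Scan scan = new Scan();
    scan.addFamily(Bytes.toBytes("amounts"));       // only this family's files are read
    scan.setStartRow(Bytes.toBytes("2012-01-01"));  // regions entirely outside this
    scan.setStopRow(Bytes.toBytes("2012-02-01"));   // range are never contacted

    long sum = 0;
    ResultScanner scanner = table.getScanner(scan);
    try {
      for (Result r : scanner) {
        byte[] v = r.getValue(Bytes.toBytes("amounts"), Bytes.toBytes("total"));
        if (v != null) {
          sum += Bytes.toLong(v);
        }
      }
    } finally {
      scanner.close();
    }
    System.out.println("sum = " + sum);
    table.close();
  }
}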

In addition, aggregation can often be done locally at each node using
endpoint coprocessors. For example, if I want to count all the rows in my
table, a coprocessor can count all the rows on each node in parallel, and
then those counts are the only thing sent back to the node running the query.
To get the total count, I just need to sum the per-node counts.

http://ofps.oreilly.com/titles/9781449396107/clientapisadv.html
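
As a rough sketch of that pattern with the AggregationClient that ships with
HBase 0.92+ (it assumes the AggregateImplementation coprocessor has been
enabled on the table, e.g. via hbase.coprocessor.region.classes; the table and
family names below are invented):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.coprocessor.AggregationClient;
import org.apache.hadoop.hbase.client.coprocessor.LongColumnInterpreter;
import org.apache.hadoop.hbase.util.Bytes;

public class CoprocessorRowCount {
  public static void main(String[] args) throws Throwable {
    Configuration conf = HBaseConfiguration.create();
    AggregationClient aggClient = new AggregationClient(conf);

    // rowCount() expects a scan with exactly one column family.
    Scan scan = new Scan();
    scan.addFamily(Bytes.toBytes("colfam1"));

    // Each region server counts its rows in parallel; only the per-region
    // counts come back to the client, which sums them into the total.
    long rows = aggClient.rowCount(Bytes.toBytes("testtable"),
        new LongColumnInterpreter(), scan);
    System.out.println("rows = " + rows);
  }
}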


> 3) Also, about storing data using an incremental row key, initially there
> will be a hot spot with the data getting to a single region. Even after a
> split of the region into two, the first one won't be getting any data (in
> incremental row key) and the second one will be hammered.
>

Can you split your incremental row key into a hash component and a range
component? Here's a DynamoDB post explaining a use case:

http://aws.typepad.com/aws/2012/01/amazon-dynamodb-internet-scale-data-storage-the-nosql-way.html

This does mean that a range scan is only efficient when it stays within a
single hash prefix, though.
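
A minimal sketch of the hash-prefix ("salting") idea, with the bucket count
and key format made up for illustration: prepend a small hash bucket to the
monotonically increasing id so that writes spread across regions. A range scan
then has to be issued once per bucket and merged on the client.

import org.apache.hadoop.hbase.util.Bytes;

public class SaltedKey {
  private static final int BUCKETS = 16;

  // e.g. an incremental id like "0000012345" becomes "<bucket>-0000012345"
  static byte[] saltedRow(String incrementalId) {
    int bucket = (incrementalId.hashCode() & 0x7fffffff) % BUCKETS;
    return Bytes.toBytes(String.format("%02d-%s", bucket, incrementalId));
  }

  public static void main(String[] args) {
    System.out.println(Bytes.toString(saltedRow("0000012345")));
  }
}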

> 4) Still not clear why I can't have 10 column families in HBase and why
> only 2 or 3 of them according to this link (1)?
>
> (1) - http://hbase.apache.org/book/number.of.cfs.html
>

See HBASE-3149, for starters. There are probably other JIRAs out there.

-Jason


> Praveen
>
> On Sun, Jan 22, 2012 at 12:02 PM, M. C. Srivas <mcsri...@gmail.com> wrote:
>
> > Praveen,
> >
> >  basically you are correct on all counts. If there are too many columns,
> >  HBase will have to issue more disk-seeks to extract only the particular
> > columns you need ... and since the data is laid out horizontally there are
> > fewer common substrings in a single HBase-block and compression quality
> > starts to degrade due to reduced redundancy.
> >
> >
> > On Sat, Jan 21, 2012 at 9:49 AM, Praveen Sripati
> > <praveensrip...@gmail.com> wrote:
> >
> > > Thanks for the response.
> > >
> > > > The contents of a row stay together like a regular row-oriented
> > database.
> > >
> > > > K: row-550/colfam1:50/1309813948188/Put/vlen=2 V: 50
> > > > K: row-550/colfam1:50/1309812287166/Put/vlen=2 V: 50
> > > > K: row-551/colfam1:51/1309813948222/Put/vlen=2 V: 51
> > > > K: row-551/colfam1:51/1309812287200/Put/vlen=2 V: 51
> > > > K: row-552/colfam1:52/1309813948256/Put/vlen=2 V: 52
> > >
> > > Is the above statement true for a HFile?
> > >
> > > Also from the above example, the data for a particular column family
> > > qualifier is not adjacent to take advantage of compression (
> > > http://en.wikipedia.org/wiki/Column-oriented_DBMS#Compression). Is this
> > > a proper statement?
>
> > >
> > > Regards,
> > > Praveen
> > >
> > > On Sat, Jan 21, 2012 at 9:03 PM, <yuzhih...@gmail.com> wrote:
> > >
> > > > Have you considered using AggregationProtocol to perform aggregation?
> > > >
> > > > Thanks
> > > >
> > > >
> > > >
> > > > On Jan 20, 2012, at 11:08 PM, Praveen Sripati <praveensrip...@gmail.com>
> > > > wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > 1) According to this url (1), HBase performs well for two or three
> > > > > column families. Why is it so?
> > > > >
> > > > > 2) A dump of an HFile looks like below. The contents of a row stay
> > > > > together like a regular row-oriented database. If the column family
> > > > > has 100 column family qualifiers and is dense, then the data for a
> > > > > particular column family qualifier is spread wide. If I want to do an
> > > > > aggregation on a particular column identifier, the disk seeks don't
> > > > > seem to be much better than in a regular row-oriented database.
> > > > >
> > > > > Please correct me if I am wrong.
> > > > >
> > > > > K: row-550/colfam1:50/1309813948188/Put/vlen=2 V: 50
> > > > > K: row-550/colfam1:50/1309812287166/Put/vlen=2 V: 50
> > > > > K: row-551/colfam1:51/1309813948222/Put/vlen=2 V: 51
> > > > > K: row-551/colfam1:51/1309812287200/Put/vlen=2 V: 51
> > > > > K: row-552/colfam1:52/1309813948256/Put/vlen=2 V: 52
> > > > >
> > > > > (1) - http://hbase.apache.org/book/number.of.cfs.html
> > > > >
> > > > > Thanks,
> > > > > Praveen
> > > >
> > >
> >
>
