Those rows are written out into HBase blocks on cell boundaries. Your
column family has a BLOCK_SIZE attribute, which you may or may not have
overridden from the default of 64KB. Cells are written into a block until it
is >= the target block size. So your single 500MB row will be broken down
into thousands of HFile blocks across some number of HFiles. Some of those
blocks may contain just a cell or two and be a couple MB in size, to hold
the largest of your cells. Those blocks will be loaded into the Block Cache
as they're accessed. If you're careful with your access patterns and only
request cells that you need to evaluate, you'll only ever load the blocks
containing those cells into the cache.
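
To illustrate, block size is a column family attribute set at table creation
(or alter) time. A rough sketch with the Java client API of that era; the
table/family names and the Admin handle are placeholders, not anything from
your setup:

    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.TableName;

    // Hypothetical table and family names; 64KB is the default, shown explicitly here.
    HTableDescriptor table = new HTableDescriptor(TableName.valueOf("mytable"));
    HColumnDescriptor family = new HColumnDescriptor("d");
    family.setBlocksize(64 * 1024);  // block size target, in bytes
    table.addFamily(family);
    admin.createTable(table);        // 'admin' is an org.apache.hadoop.hbase.client.Admin

Note the block size is a target, not a hard limit: a single multi-MB cell
still lands in a single, oversized block.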

> Will the entire row be loaded or only the qualifiers I ask for?

So then, the answer to your question is: it depends on how you're
interacting with the row from your coprocessor. The read path will only
load blocks that your scanner requests. If your coprocessor is producing a
scanner that seeks to specific qualifiers, you'll only load the blocks
containing those qualifiers.
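
For instance, from a plain client this is what requesting only a couple of
qualifiers looks like; only the blocks backing those cells should be pulled
into the block cache. A minimal sketch, with made-up row/family/qualifier
names:

    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    // Hypothetical names; ask for exactly the two qualifiers we need.
    Get get = new Get(Bytes.toBytes("big-row-key"));
    get.addColumn(Bytes.toBytes("d"), Bytes.toBytes("q00042"));
    get.addColumn(Bytes.toBytes("d"), Bytes.toBytes("q00777"));
    Result result = table.get(get);  // 'table' is an org.apache.hadoop.hbase.client.Table
    byte[] value = result.getValue(Bytes.toBytes("d"), Bytes.toBytes("q00042"));

The same idea applies inside a coprocessor: a scanner built over explicit
family/qualifier selections only touches the blocks it has to.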

Related question: Is there a reason you're using a coprocessor instead of a
regular filter, or a simple qualified get/scan to access data from these
rows? The "default stuff" is already tuned to load data sparsely, as would
be desirable for your schema.
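
As a sketch of what I mean by the "default stuff", here is a filter-based
read that narrows a wide row down to a qualifier range; the filter choice
and names are only examples:

    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.filter.ColumnRangeFilter;
    import org.apache.hadoop.hbase.util.Bytes;

    // Hypothetical names; selects qualifiers in [q00100, q00110) within the one row.
    Get get = new Get(Bytes.toBytes("big-row-key"));
    get.setFilter(new ColumnRangeFilter(
        Bytes.toBytes("q00100"), true,     // min qualifier, inclusive
        Bytes.toBytes("q00110"), false));  // max qualifier, exclusive
    Result result = table.get(get);        // 'table' is an org.apache.hadoop.hbase.client.Table

Filters like this are evaluated on the region server, so you keep the
sparse-read behavior without writing a coprocessor.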

-n

On Tue, Apr 7, 2015 at 2:22 PM, Kristoffer Sjögren <sto...@gmail.com> wrote:

> Sorry I should have explained my use case a bit more.
>
> Yes, it's a pretty big row and it's "close" to worst case. Normally there
> would be fewer qualifiers and the largest qualifiers would be smaller.
>
> The reason why these rows get big is that they store aggregated data
> in an indexed, compressed form. This format allows for extremely fast queries
> (on the local disk format) over billions of rows (not rows in HBase speak),
> when touching smaller areas of the data. If I stored the data as regular
> HBase rows, things would get very slow unless I had many, many region
> servers.
>
> The coprocessor is used for doing custom queries on the indexed data inside
> the region servers. These queries are not like a regular row scan, but very
> specific as to how the data is formatted within each column qualifier.
>
> Yes, this is not possible if HBase loads the whole 500MB each time I want
> to perform this custom query on a row. Hence my question :-)
>
>
>
>
> On Tue, Apr 7, 2015 at 11:03 PM, Michael Segel <michael_se...@hotmail.com>
> wrote:
>
> > Sorry, but your initial problem statement doesn’t seem to parse …
> >
> > Are you saying that you have a single row with approximately 100,000 elements
> > where each element is roughly 1-5KB in size and in addition there are ~5
> > elements which will be between one and five MB in size?
> >
> > And you then mention a coprocessor?
> >
> > Just looking at the numbers… 100K * 5KB means that each row would end up
> > being 500MB in size.
> >
> > That’s a pretty fat row.
> >
> > I would suggest rethinking your strategy.
> >
> > > On Apr 7, 2015, at 11:13 AM, Kristoffer Sjögren <sto...@gmail.com>
> > wrote:
> > >
> > > Hi
> > >
> > > I have a row with around 100,000 qualifiers, mostly small values around
> > > 1-5KB, and maybe 5 larger ones around 1-5MB. A coprocessor does random
> > > access of 1-10 qualifiers per row.
> > >
> > > I would like to understand how HBase loads the data into memory. Will the
> > > entire row be loaded or only the qualifiers I ask for (like pointer access
> > > into a direct ByteBuffer)?
> > >
> > > Cheers,
> > > -Kristoffer
> >
> > The opinions expressed here are mine, while they may reflect a cognitive
> > thought, that is purely accidental.
> > Use at your own risk.
> > Michael Segel
> > michael_segel (AT) hotmail.com
> >
>
