Just a short follow-up.

As mentioned, I will now use two column families (instead of an additional table) to 
achieve row-level atomicity.

Because CF1 has a much higher cardinality than CF2, flushes will most likely 
always be triggered by CF1's memstore reaching the configured flush size.
As a result, CF2 will be flushed as well, producing very small HFiles, because 
for every ~1000 rows with CF1 set there is only about one row with CF2 set.


Does anyone have experience with whether this becomes a performance problem 
when doing a scan restricted to CF2 (which means checking many small HFiles), 
assuming bloom filters are applied?
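For reference, a CF2-restricted scan would look roughly like this with the 0.94-era client API. This is only a sketch against a running cluster (table and family names are taken from this thread; the qualifier handling is hypothetical), and note that row bloom filters mainly help point Gets skip HFiles, while a full scan still has to open each CF2 HFile:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class Cf2Scan {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "table1");

        // Restrict the scan to CF2 so CF1's (much larger) HFiles are
        // never touched; only the many small CF2 HFiles are read.
        Scan scan = new Scan();
        scan.addFamily(Bytes.toBytes("CF2"));

        ResultScanner scanner = table.getScanner(scan);
        try {
            for (Result r : scanner) {
                // process the sparse CF2 rows here
            }
        } finally {
            scanner.close();
            table.close();
        }
    }
}
```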



regards,
Christian


----- Original Message -----
From: Christian Schäfer <syrious3...@yahoo.de>
To: "user@hbase.apache.org" <user@hbase.apache.org>
CC: 
Sent: Monday, 20 August 2012, 22:54
Subject: RE: Schema Design - Move second column family to new table

Thanks Pranav for the schema design resource... I'll check it soon.


&

Thanks Ian for your thoughts... you're right that the point about transactions 
is really important.

On the other hand, due to per-region compaction, big scans over CF2 (the CF 
with only a few rows set) would result in several disk seeks.

So I still have to find out whether big scans over CF2 are really as important 
as I currently expect.
My guess is that (in our use case) transactional safety matters more than the 
speed of analytics.


regards
Chris.



________________________________
From: Ian Varley <ivar...@salesforce.com>
To: "user@hbase.apache.org" <user@hbase.apache.org> 
CC: Christian Schäfer <syrious3...@yahoo.de> 
Sent: Monday, 20 August 2012, 16:37
Subject: Re: Schema Design - Move second column family to new table

Christian,

Column families are really more "within" rows, not the other way around 
(they're really just a way to physically partition sets of columns in a table). 
In your example, then, it's more correct to say that table1 has millions / 
billions of rows, but only hundreds of them have any columns in CF2. I'm not 
exactly sure how much of a penalty that 2nd column family imposes in this 
case--if you don't include it as a part of your scans / gets, then you won't 
pay any
penalty at read time; but if you're reading from both "just in case" the row 
has data there, you'll always take a hit. I think the same goes for writes. 
(Question for the list: does adding a column family that you *never* use impose 
any penalties?)

The downside to moving it to another table is that writes will no longer be 
transactionally protected (i.e. if you're trying to write to both, the 
operation could fail after one write and before the other). Conversely, if you 
put them as column families in the same row, writes to a single row are 
transactional. You may or may not care about that.
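To make that transactional point concrete: all cells of a single Put are applied to their row atomically, even when they span both column families, whereas two Puts against two separate tables are two independent operations. A rough sketch with the 0.94-era client API (the row key and qualifiers here are hypothetical):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class AtomicRowPut {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "table1");

        // One Put touching both column families of the same row:
        // HBase applies every cell of a single Put atomically, so the
        // CF1 and CF2 values are either both written or neither is.
        Put put = new Put(Bytes.toBytes("row-0001"));  // hypothetical key
        put.add(Bytes.toBytes("CF1"), Bytes.toBytes("q1"), Bytes.toBytes("v1"));
        put.add(Bytes.toBytes("CF2"), Bytes.toBytes("q2"), Bytes.toBytes("v2"));
        table.put(put);

        table.close();
    }
}
```

With two tables, the same data would need two Puts, and a client crash between them would leave the tables inconsistent.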

So, putting the lower-cardinality data in another table with the same row key 
might be a performance win, or it might not, depending on your read & write 
patterns. Try it both ways and compare, and let us know what you find.

Ian

On Aug 20, 2012, at 7:25 AM, Pranav Modi wrote:

This might be useful -
http://java.dzone.com/videos/hbase-schema-design-things-you

On Mon, Aug 20, 2012 at 5:17 PM, Christian Schäfer <syrious3...@yahoo.de>wrote:

Currently I'm about to design HBase tables.

In my case there is table1 with CF1 holding millions/billions of rows and
CF2 with hundreds of rows.
Read use cases include reading both CF data by key or reading only one CF.

Referring to http://hbase.apache.org/book/number.of.cfs.html

Due to the cardinality difference, I would change the schema design by
putting CF2 into an extra table (table2), right?
After that there would be table1 and table2, each with one CF and the same
row key.
Any objections to that?

Can anyone recommend resources on HBase schema design, explained for
different use cases, beyond "HBase: The Definitive Guide" and the HBase
online reference?

regards,
Christian
