Hi TaeYun, thanks for the explanation.
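
To make point 2 of the quoted message below concrete (the field name as a rowkey component, and ordering the components so that a range scan replaces filtering), here is a minimal sketch in plain Java. It does not use the HBase client. The layout <entityId> 0x00 <fieldName> 0x00 <timestamp>, the names entityId and timestamp, and the sample values are all assumptions made for illustration and are not taken from the thread. The TreeMap only stands in for a table sorted by rowkey; the one HBase fact relied on is that rowkeys sort as unsigned bytes in lexicographic order.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class RowKeySketch {
    // Assumption: ids and field names never contain the byte 0x00.
    private static final byte SEP = 0x00;

    // Hypothetical layout: <entityId> 0x00 <fieldName> 0x00 <8-byte big-endian timestamp>
    public static byte[] buildRowKey(String entityId, String fieldName, long timestamp) {
        byte[] prefix = buildPrefix(entityId, fieldName);
        return ByteBuffer.allocate(prefix.length + Long.BYTES)
                .put(prefix).putLong(timestamp).array();
    }

    // Leading components only; every rowkey for this entity and field starts with it.
    public static byte[] buildPrefix(String entityId, String fieldName) {
        byte[] id = entityId.getBytes(StandardCharsets.UTF_8);
        byte[] field = fieldName.getBytes(StandardCharsets.UTF_8);
        return ByteBuffer.allocate(id.length + 1 + field.length + 1)
                .put(id).put(SEP).put(field).put(SEP).array();
    }

    // Unsigned lexicographic byte comparison, the order in which rowkeys are sorted.
    public static int compareUnsigned(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int d = (a[i] & 0xff) - (b[i] & 0xff);
            if (d != 0) return d;
        }
        return a.length - b.length;
    }

    public static boolean startsWith(byte[] key, byte[] prefix) {
        if (key.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if (key[i] != prefix[i]) return false;
        }
        return true;
    }

    // Stand-in for a prefix scan: seek to the prefix, read until keys stop matching.
    public static List<String> scanPrefix(TreeMap<byte[], String> table, byte[] prefix) {
        List<String> values = new ArrayList<>();
        for (Map.Entry<byte[], String> e : table.tailMap(prefix, true).entrySet()) {
            if (!startsWith(e.getKey(), prefix)) break;
            values.add(e.getValue());
        }
        return values;
    }

    public static void main(String[] args) {
        TreeMap<byte[], String> table = new TreeMap<>(RowKeySketch::compareUnsigned);
        table.put(buildRowKey("dev1", "humidity", 1L), "h1");
        table.put(buildRowKey("dev1", "temp", 2L), "t2");
        table.put(buildRowKey("dev1", "temp", 1L), "t1");
        table.put(buildRowKey("dev2", "temp", 1L), "x1");
        // Only the rows for (dev1, temp) are touched; no per-row filter is evaluated.
        System.out.println(scanPrefix(table, buildPrefix("dev1", "temp")));
    }
}
```

Because the components a query fixes come first in the key, all matching rows are contiguous, so the read becomes a start/stop range rather than a filtered scan. The separator byte keeps a field named "temp" from also matching "temperature". Whether this particular ordering suits the real access pattern is something only the real data can tell.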

On Thu, Aug 7, 2014 at 12:50 PM, innowireless TaeYun Kim <
taeyun....@innowireless.co.kr> wrote:

> Hi Qiang,
> thank you for your help.
>
> 1. Regarding HBASE-5416, I think it's purpose is simple.
>
> "Avoid loading column families that is irrelevant to filtering while
> scanning."
> So, it can be applied to my 'dummy CF' case.
> That is, a dummy CF can act like an 'relevant' CF to filtering, provided
> that HBase can select it while applying a rowkey filter, since a dummy CF
> has the rowkey data in its 'dummy' KeyValue object.
>
> 2. About rowkey.
>
> What I meant is, I would include the field name as a component when the
> byte array for a rowkey is constructed.
>
> 3. About read-only-ness and the number of CF.
>
> Thank you for your suggestion.
> But since MemStore and BlockCache is separately managed on each column
> family, I'm a little concerned with the memory footprint.
>
> Thank you.
>
> -----Original Message-----
> From: Qiang Tian [mailto:tian...@gmail.com]
> Sent: Thursday, August 07, 2014 11:43 AM
> To: user@hbase.apache.org
> Subject: Re: Question on the number of column families
>
> Hi,
> the description of hbase-5416 stated why it was introduced, if you only
> have 1 CF, dummy CF does not help. it is helpful for multi-CF case, e.g.
> "putting them in one column family. And "Non frequently" ones in another. "
>
> bq. "Field name will be included in rowkey."
> Please read the chapter 9 "Advanced usage" in book "HBase Definitive Guide"
> about how hbase store data on disk and how to design rowkey based on
> specific scenario.(rowkey is the only index you can use, so take care)
>
> bq. "The table is read-only. It is bulk-loaded once. When a new data is
> ready, A new table is created and the old table is deleted."
> the scenario is quite different. as hbase is designed for random
> read/write. the limitation described at
> http://hbase.apache.org/book/number.of.cfs.html is to consider the write
> case(flush&compaction), perhaps you could try 140 CFs, as long as you can
> presplit your regions well? after that, since no write, there will be no
> flush/compaction...anyway, any idea better be tested with your real data.
>
>
> On Wed, Aug 6, 2014 at 7:00 PM, innowireless TaeYun Kim <
> taeyun....@innowireless.co.kr> wrote:
>
> > Hi Ted,
> >
> > Now I finished reading the filtering section and the source code of
> > TestJoinedScanners(0.94).
> >
> > Facts learned:
> >
> > - While scanning, an entire row will be read even for a rowkey filtering.
> > (Since a rowkey is not a physically separate entity and stored in
> > KeyValue object, it's natural. Am I right?)
> > - The key API for the essential column family support is
> > setLoadColumnFamiliesOnDemand().
> >
> > So, now I have questions:
> >
> > On rowkey filtering, which column family's KeyValue object is read?
> > If HBase just reads a KeyValue from a randomly selected (or just the
> > first) column family, how is setLoadColumnFamiliesOnDemand() affected?
> > Can HBase select a smaller column family intelligently?
> >
> > If setLoadColumnFamiliesOnDemand() can be applied to a rowkey
> > filtering, a 'dummy' column family can be used to minimize the scan cost.
> >
> > Thank you.
> >
> >
> > -----Original Message-----
> > From: innowireless TaeYun Kim [mailto:taeyun....@innowireless.co.kr]
> > Sent: Wednesday, August 06, 2014 1:48 PM
> > To: user@hbase.apache.org
> > Subject: RE: Question on the number of column families
> >
> > Thank you.
> >
> > The 'dummy' column will always hold the value '1' (or even an empty
> > string), that only signifies that this row exists. (And the real value
> > is in the other 'big' column family) The value is irrelevant since
> > with current schema the filtering will be done by rowkey components
> > alone. No column value is needed. (I will begin reading the filtering
> > section shortly - it is only 6 pages ahead. So sorry for my premature
> > thoughts)
> >
> >
> > -----Original Message-----
> > From: Ted Yu [mailto:yuzhih...@gmail.com]
> > Sent: Wednesday, August 06, 2014 1:38 PM
> > To: user@hbase.apache.org
> > Subject: Re: Question on the number of column families
> >
> > bq. add a 'dummy' column family and apply HBASE-5416 technique
> >
> > Adding dummy column family is not the way to utilize essential column
> > family support - what would this dummy column family hold ?
> >
> > bq. since I have not read the filtering section of the book I'm
> > reading yet
> >
> > Once you finish reading, you can look at the unit test
> > (TestJoinedScanners) from HBASE-5416. You would understand this
> > feature better.
> >
> > Cheers
> >
> >
> > On Tue, Aug 5, 2014 at 9:21 PM, innowireless TaeYun Kim <
> > taeyun....@innowireless.co.kr> wrote:
> >
> > > Thank you all.
> > >
> > > Facts learned:
> > >
> > > - Having 130 column families is too much. Don't do that.
> > > - While scanning, an entire row will be read for filtering, unless
> > > HBASE-5416 technique is applied which makes only relevant column
> > > family is loaded. (But it seems that still one can't load just a
> > > column needed while scanning)
> > > - Big row size is maybe not good.
> > >
> > > Currently it seems appropriate to follow the one-column solution
> > > that Alok Singh suggested, in part since currently there is no
> > > reasonable grouping of the fields.
> > >
> > > Here is my current thinking:
> > >
> > > - One column family, one column. Field name will be included in rowkey.
> > > - Eliminate filtering altogether (in most case) by properly ordering
> > > rowkey components.
> > > - If a filtering is absolutely needed, add a 'dummy' column family
> > > and apply HBASE-5416 technique to minimize disk read, since the
> > > field value can be large(~5MB). (This dummy column thing may not be
> > > right, I'm not sure, since I have not read the filtering section of
> > > the book I'm reading yet)
> > >
> > > Hope that I am not missing or misunderstanding something...
> > > (I'm a total newbie. I've started to read a HBase book since last
> > > week...)