I think you need to step back a bit from the problem and ask yourself 
when you would want the same row key used for disjoint data. That is, 
data that refers to the same object, yet the data in each column family is 
never or rarely used together with data from another column family. 

To give you a concrete example... one that I've used in a class... An order 
entry system. 

Think of the life cycle of your order. 

You enter the order, the company then generates pick slips for the 
warehouse(s), then the warehouse(s) issue shipping slips, and then, as the 
product ships, invoices are issued and the billing process occurs. 

In each part of the process, information that needs to be shared can be 
copied, so that when you run an inquiry on the order you see what was done 
and when; but within each individual process, like managing the pick slip, 
you don't need to bring up the entire order. 

Does that make sense? 

In that example, you have 4 column families. 

There are other examples, but that should help you put column families in 
perspective. 
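
If it helps to see that concretely, here's a rough sketch of what such a 
table could look like with the HBase Java client. (The table and family 
names are made up for the example.)

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.HBaseAdmin;

    public class CreateOrderTable {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HBaseAdmin admin = new HBaseAdmin(conf);

            // One table keyed by order id, one family per stage of the life cycle.
            HTableDescriptor orders = new HTableDescriptor(TableName.valueOf("orders"));
            orders.addFamily(new HColumnDescriptor("order"));    // order entry data
            orders.addFamily(new HColumnDescriptor("pick"));     // pick slips
            orders.addFamily(new HColumnDescriptor("ship"));     // shipping slips
            orders.addFamily(new HColumnDescriptor("invoice"));  // invoices / billing

            admin.createTable(orders);
            admin.close();
        }
    }

Each process reads and writes its own family under the same order id, so an 
inquiry can still pull up the whole row when it needs the full picture.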

HTH
-Mike

On Aug 5, 2014, at 11:52 AM, Ted Yu <yuzhih...@gmail.com> wrote:

> As Alok mentioned previously, once columns are grouped into several column
> families, you would be able to leverage the essential column family feature
> introduced by this JIRA:
> 
> HBASE-5416 Improve performance of scans with some kind of filters
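> 
> A rough sketch of what that looks like from the client side (the family and
> qualifier names here are made up):
> 
>     // imports from org.apache.hadoop.hbase.client.Scan,
>     // org.apache.hadoop.hbase.filter.*, org.apache.hadoop.hbase.util.Bytes
>     Scan scan = new Scan();
>     // The filter only references the small "meta" family, so it becomes the
>     // "essential" family; the bulkier families are loaded on demand, only
>     // for rows that pass the filter.
>     scan.setFilter(new SingleColumnValueFilter(
>         Bytes.toBytes("meta"),
>         Bytes.toBytes("status"),
>         CompareFilter.CompareOp.EQUAL,
>         Bytes.toBytes("SHIPPED")));
>     scan.setLoadColumnFamiliesOnDemand(true);  // the switch introduced by HBASE-5416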
> 
> Cheers
> 
> 
> On Tue, Aug 5, 2014 at 5:26 AM, Alok Kumar <alok...@gmail.com> wrote:
> 
>> You could narrow the number of rows to scan by using Filters. I don't
>> think you can optimize I/O down to the column level.
>> 
>> The block cache holds the actual data read from HDFS, per column family. If
>> your scan fetches random (or all) columns, then you are going to hit all of
>> the column-family blocks anyway and pull "irrelevant" data into the block cache!!
>> You can limit the columns you want to fetch by setting them on the Scan from
>> the client side; that will save network I/O.
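>> 
>> For example, something along these lines (the table/column names are just
>> placeholders; imports from org.apache.hadoop.hbase.* as usual):
>> 
>>     Configuration conf = HBaseConfiguration.create();
>>     HTable table = new HTable(conf, "grid");
>> 
>>     // Restrict the scan to the one column the client actually needs.
>>     Scan scan = new Scan(Bytes.toBytes("row_0100"), Bytes.toBytes("row_0200"));
>>     scan.addColumn(Bytes.toBytes("d"), Bytes.toBytes("field_007"));
>> 
>>     ResultScanner scanner = table.getScanner(scan);
>>     for (Result r : scanner) {
>>         byte[] value = r.getValue(Bytes.toBytes("d"), Bytes.toBytes("field_007"));
>>         // render the cell...
>>     }
>>     scanner.close();
>>     table.close();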
>> 
>> Do you have a row size of 130 * 5 MB = 650 MB?
>> 
>> Thanks
>> Alok
>> 
>> On Tue, Aug 5, 2014 at 5:17 PM, innowireless TaeYun Kim <
>> taeyun....@innowireless.co.kr> wrote:
>> 
>>> Plus,
>>> Since most of the time a client will display an area that does not fit in
>>> 500x500, Scan operations are required (Get is not enough).
>>> So, I'm worried that when scanning, a lot of irrelevant column data (data
>>> that has the same rowkey, which is the position on the grid) would be read
>>> into the block cache, unless the columns are separated into individual
>>> column families.
>>> 
>>> 
>>> -----Original Message-----
>>> From: innowireless TaeYun Kim [mailto:taeyun....@innowireless.co.kr]
>>> Sent: Tuesday, August 05, 2014 8:36 PM
>>> To: user@hbase.apache.org
>>> Subject: RE: Question on the number of column families
>>> 
>>> Thank you for your reply.
>>> 
>>> I can decrease the size of the column value if it's not good for HBase.
>>> BTW, the values are for points on grid cells on a map.
>>> 250000 is 500x500, and 500x500 is somewhat related to the size of the
>>> client screen area that displays the values on a map.
>>> Normally a client requests the values for the area that is displayed on
>>> the screen.
>>> 
>>> 
>>> -----Original Message-----
>>> From: Alok Kumar [mailto:alok...@gmail.com]
>>> Sent: Tuesday, August 05, 2014 8:24 PM
>>> To: user@hbase.apache.org
>>> Subject: Re: Question on the number of column families
>>> 
>>> Hi,
>>> 
>>> HBase creates an HFile per column family. Having 130 column families is
>>> really not recommended; it will increase the number of open file handles
>>> underneath.
>>> 
>>> If you are sure which columns are "frequently" accessed by users, you
>>> could consider putting them in one column family, and the "non-frequently"
>>> accessed ones in another.
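>>> 
>>> For example, something like this (the family names are placeholders):
>>> 
>>>     // assumes an existing Configuration 'conf'; imports as usual
>>>     HTableDescriptor desc = new HTableDescriptor(TableName.valueOf("grid"));
>>>     desc.addFamily(new HColumnDescriptor("hot"));   // the ~40 frequently read fields
>>>     desc.addFamily(new HColumnDescriptor("cold"));  // the rarely read fields
>>>     HBaseAdmin admin = new HBaseAdmin(conf);
>>>     admin.createTable(desc);
>>>     admin.close();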
>>> Btw, a ~5 MB column value is something to consider. We should wait
>>> for some expert advice here!!
>>> 
>>> 
>>> Thanks
>>> Alok
>>> 
>>> 
>>> On Tue, Aug 5, 2014 at 4:50 PM, innowireless TaeYun Kim <
>>> taeyun....@innowireless.co.kr> wrote:
>>> 
>>>> Plus,
>>>> the size of the value of each field can be ~5 MB, since up to 250000
>>>> lines of the source data will be merged into one record to match the
>>>> request pattern.
>>>> 
>>>> 
>>>> -----Original Message-----
>>>> From: innowireless TaeYun Kim [mailto:taeyun....@innowireless.co.kr]
>>>> Sent: Tuesday, August 05, 2014 8:11 PM
>>>> To: user@hbase.apache.org
>>>> Subject: Question on the number of column families
>>>> 
>>>> Hi,
>>>> 
>>>> 
>>>> 
>>>> According to http://hbase.apache.org/book/number.of.cfs.html, having
>>>> more than 2~3 column families is strongly discouraged.
>>>> 
>>>> 
>>>> 
>>>> BTW, in my case, the records in the table have the following characteristics:
>>>> 
>>>> 
>>>> 
>>>> - The table is read-only. It is bulk-loaded once. When new data is
>>>> ready, a new table is created and the old table is deleted.
>>>> 
>>>> - The size of the source data can be hundreds of gigabytes.
>>>> 
>>>> - A record has about 130 fields.
>>>> 
>>>> - The number of fields in a record is fixed.
>>>> 
>>>> - The names of the fields are also fixed. (It's like a table in an RDBMS.)
>>>> 
>>>> - About 40 (it varies) fields mostly have values, while the other fields
>>>> are mostly empty (null in an RDBMS).
>>>> 
>>>> - It is unknown which fields will be dense. It depends on the source
>>>> data.
>>>> 
>>>> - Fields are accessed independently. Normally a user requests just one
>>>> field, though a user can request several.
>>>> 
>>>> - The range of a range query is the same for all fields (no wider,
>>>> no narrower, regardless of the data density).
>>>> 
>>>> To me, it seems it would be more efficient to have one column family
>>>> for each field, since it would cost less disk I/O: only the needed
>>>> column data would be read.
>>>> 
>>>> 
>>>> 
>>>> Can the table have 130 column families in this case?
>>>> 
>>>> Or must all the columns be in one column family?
>>>> 
>>>> 
>>>> 
>>>> Thanks.
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>> 
>>> 
>>> --
>>> Alok Kumar
>>> Email : alok...@gmail.com
>>> http://sharepointorange.blogspot.in/
>>> http://www.linkedin.com/in/alokawi
>>> 
>>> 
>> 

The opinions expressed here are mine, while they may reflect a cognitive 
thought, that is purely accidental. 
Use at your own risk. 
Michael Segel
michael_segel (AT) hotmail.com




