Re: Question on the number of column families

Ted Yu Tue, 05 Aug 2014 09:53:03 -0700

As Alok mentioned previously, once columns are grouped into several column
families, you would be able to leverage the essential column family feature
introduced by this JIRA:


HBASE-5416 Improve performance of scans with some kind of filters
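
To illustrate the idea behind that JIRA, here is a toy Python model (not HBase
code, and not the HBase API). A scan whose filter only needs one "essential"
family reads that family for every row, and reads the other requested families
only for rows that pass the filter. The function name scan_on_demand and the
"meta"/"data" families are hypothetical, chosen only for this sketch.

```python
# Toy model of on-demand column family loading; not HBase code.
# Each "family" is a separate dict, standing in for a separate store on disk.

def scan_on_demand(families, essential, predicate, wanted):
    """Return (results, reads): matching rows and cells read per family."""
    reads = {name: 0 for name in families}
    results = {}
    for rowkey in sorted(families[essential]):
        # The filter only needs the essential family, so read it first.
        reads[essential] += 1
        if not predicate(families[essential][rowkey]):
            continue  # the other families are never touched for this row
        row = {essential: families[essential][rowkey]}
        for name in wanted:
            if name != essential and rowkey in families[name]:
                reads[name] += 1
                row[name] = families[name][rowkey]
        results[rowkey] = row
    return results, reads


# Hypothetical layout: a small "meta" family drives the filter,
# a large "data" family holds the bulky values.
families = {
    "meta": {"r1": "keep", "r2": "skip", "r3": "keep", "r4": "skip"},
    "data": {"r1": "A" * 10, "r2": "B" * 10, "r3": "C" * 10, "r4": "D" * 10},
}

results, reads = scan_on_demand(
    families,
    essential="meta",
    predicate=lambda value: value == "keep",
    wanted=["meta", "data"],
)
print(sorted(results))  # row keys that passed the filter
print(reads)            # cells read from each family
```

In this sketch the bulky family is read only for the two matching rows. The
corresponding per-scan switch in the Java client appears to be
Scan.setLoadColumnFamiliesOnDemand(true); verify it against the API docs for
your HBase version.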

Cheers


On Tue, Aug 5, 2014 at 5:26 AM, Alok Kumar <[email protected]> wrote:

> You could narrow the number of rows to scan by using Filters. I don't
> think you could optimize I/O down to the column level.
>
> The block cache holds the actual data read from HDFS, per column family. If
> your scan is fetching random (all) columns, then you are going to hit the
> blocks of every column family anyway, and "irrelevant" data will end up in
> the block cache!! You could limit the columns you want to fetch by setting
> them on the client side; that will save network I/O.
>
> Is your row size 130 * 5 MB = 650 MB?
>
> Thanks
> Alok
>
> On Tue, Aug 5, 2014 at 5:17 PM, innowireless TaeYun Kim <
> [email protected]> wrote:
>
> > Plus,
> > Since most of the time a client will display an area that does not fit in
> > 500x500, Scan operations are required. (Get is not enough.)
> > So, I'm worried that on scanning, a lot of irrelevant column data (columns
> > that have the same rowkey, which is the position on the grid) would be read
> > into the block cache, unless the columns are separated into individual
> > column families.
> >
> >
> > -----Original Message-----
> > From: innowireless TaeYun Kim [mailto:[email protected]]
> > Sent: Tuesday, August 05, 2014 8:36 PM
> > To: [email protected]
> > Subject: RE: Question on the number of column families
> >
> > Thank you for your reply.
> >
> > I can decrease the size of the column value if it's not good for HBase.
> > BTW, the values are for a point on a grid cell on a map.
> > 250000 is 500x500, and 500x500 is somewhat related to the size of the
> > client screen that displays the values on a map.
> > Normally a client requests the values for the area that is displayed on
> > the screen.
> >
> >
> > -----Original Message-----
> > From: Alok Kumar [mailto:[email protected]]
> > Sent: Tuesday, August 05, 2014 8:24 PM
> > To: [email protected]
> > Subject: Re: Question on the number of column families
> >
> > Hi,
> >
> > HBase creates HFiles per column family. Having 130 column families is
> > really not recommended.
> > It will increase the number of file pointers (open file count) underneath.
> >
> > If you are sure which columns are "frequently" accessed by users, you
> > could consider putting them in one column family, and the "non-frequently"
> > accessed ones in another.
> > BTW, a ~5MB column value size is something to consider. We should wait
> > for some expert advice here!!
> >
> >
> > Thanks
> > Alok
> >
> >
> > On Tue, Aug 5, 2014 at 4:50 PM, innowireless TaeYun Kim <
> > [email protected]> wrote:
> >
> > > Plus,
> > > the size of the value of each field can be ~5MB, since max 250000
> > > lines of the source data will be merged into one record, to match the
> > > request pattern.
> > >
> > >
> > > -----Original Message-----
> > > From: innowireless TaeYun Kim [mailto:[email protected]]
> > > Sent: Tuesday, August 05, 2014 8:11 PM
> > > To: [email protected]
> > > Subject: Question on the number of column families
> > >
> > > Hi,
> > >
> > >
> > >
> > > According to http://hbase.apache.org/book/number.of.cfs.html, having
> > > more than 2~3 column families is strongly discouraged.
> > >
> > >
> > >
> > > BTW, in my case, records on a table have the following characteristics:
> > >
> > >
> > >
> > > - The table is read-only. It is bulk-loaded once. When new data is
> > > ready, a new table is created and the old table is deleted.
> > >
> > > - The size of the source data can be hundreds of gigabytes.
> > >
> > > - A record has about 130 fields.
> > >
> > > - The number of fields in a record is fixed.
> > >
> > > - The names of the fields are also fixed. (it's like a table in RDBMS)
> > >
> > > - About 40 (it varies) fields mostly have a value, while the other fields
> > > are mostly empty (null in an RDBMS).
> > >
> > > - It is unknown which fields will be dense. It depends on the source data.
> > >
> > > - Fields are accessed independently. Normally a user requests just one
> > > field. A user can request several fields.
> > >
> > > - The range of a range query is the same for all fields. (No wider,
> > > no narrower, regardless of the data density.)
> > >
> > > It seems to me that it would be more efficient to have one column
> > > family for each field, since it would cost less disk I/O: only the
> > > needed column data would be read.
> > >
> > >
> > >
> > > Can the table have 130 column families for this case?
> > >
> > > Or must all the columns be in one column family?
> > >
> > >
> > >
> > > Thanks.
> > >
> > >
> > >
> > >
> > >
> >
> >
> > --
> > Alok Kumar
> > Email : [email protected]
> > http://sharepointorange.blogspot.in/
> > http://www.linkedin.com/in/alokawi
> >
> >
>
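
As a quick check of the sizes quoted in this thread (about 130 fields, values
of up to ~5MB, up to 250000 source lines merged into one record), here is a
short Python calculation. It is only a rough sketch: it assumes decimal
megabytes, takes the ~40 dense fields from the original question at face
value, and treats every populated value as being at its maximum size.

```python
# Back-of-the-envelope sizes from the figures quoted in the thread.
# Assumptions (not from HBase): decimal megabytes, every populated value
# at the ~5MB maximum, empty fields contributing nothing.

FIELDS = 130          # fields per record
DENSE_FIELDS = 40     # fields that mostly have a value ("it varies")
MAX_VALUE_MB = 5      # upper bound on one field value
POINTS = 500 * 500    # source lines merged into one record

worst_case_row_mb = FIELDS * MAX_VALUE_MB        # all 130 fields full
dense_row_mb = DENSE_FIELDS * MAX_VALUE_MB       # only the dense fields full
bytes_per_point = MAX_VALUE_MB * 1_000_000 / POINTS

print(f"points per record:     {POINTS}")
print(f"worst-case row size:   {worst_case_row_mb} MB")
print(f"dense-fields row size: {dense_row_mb} MB")
print(f"bytes per point:       {bytes_per_point}")
```

Under these assumptions a full row is 650 MB at worst, which is the figure
Alok asks about, and each merged source line contributes about 20 bytes to a
field value.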
