As Alok mentioned previously, once the columns are grouped into several column families, you would be able to leverage the essential column family feature introduced by this JIRA:
HBASE-5416 Improve performance of scans with some kind of filters

Cheers


On Tue, Aug 5, 2014 at 5:26 AM, Alok Kumar <[email protected]> wrote:

> You could narrow the number of rows to scan by using Filters. I don't
> think, you could reach/optimize to column level I/O.
>
> Block Cache is related to actual data read from HDFS per column family. If
> your scan is fetching random (all) columns, then you are any way going to
> hit all the column-family-blocks and "irrelevant" data in block cache!!
> You could limit or set columns you want to fetch on client side after scan,
> that will save network IO.
>
> Do you have 130 * 5 = 650MB of row size?
>
> Thanks
> Alok
>
> On Tue, Aug 5, 2014 at 5:17 PM, innowireless TaeYun Kim <
> [email protected]> wrote:
>
> > Plus,
> > Since most of the time a client will display the area that does not fit in
> > 500x500, Scan operations are required. (Get is not enough)
> > So, I'm worried that on scanning, many irrelevant column data (those have
> > the same rowkey, which is the position on the grid) would be read into the
> > block cache, unless the columns are separated by individual column family.
> >
> >
> > -----Original Message-----
> > From: innowireless TaeYun Kim [mailto:[email protected]]
> > Sent: Tuesday, August 05, 2014 8:36 PM
> > To: [email protected]
> > Subject: RE: Question on the number of column families
> >
> > Thank you for your reply.
> >
> > I can decrease the size of column value if it's not good for HBase.
> > BTW, The values are for a point on a grid cell on a map.
> > 250000 is 500x500, and 500x500 is somewhat related to the size of the
> > client screen that displays the values on a map.
> > Normally a client requests the values for the area that is displayed on
> > the screen.
> >
> >
> > -----Original Message-----
> > From: Alok Kumar [mailto:[email protected]]
> > Sent: Tuesday, August 05, 2014 8:24 PM
> > To: [email protected]
> > Subject: Re: Question on the number of column families
> >
> > Hi,
> >
> > Hbase creates HFile per column-family. Having 130 column-family is really
> > not recommended.
> > It will increase number of file pointer ( open file count) underneath.
> >
> > If you are sure which columns are "frequently" accessed by users, you
> > could consider putting them in one column family. And "Non frequently" ones
> > in another.
> > Btw, ~5MB size of column value is something to consider. We should wait
> > for some expert advise here!!
> >
> >
> > Thanks
> > Alok
> >
> >
> > On Tue, Aug 5, 2014 at 4:50 PM, innowireless TaeYun Kim <
> > [email protected]> wrote:
> >
> > > Plus,
> > > the size of the value of each field can be ~5MB, since max 250000
> > > lines of the source data will be merged into one record, to match the
> > > request pattern.
> > >
> > >
> > > -----Original Message-----
> > > From: innowireless TaeYun Kim [mailto:[email protected]]
> > > Sent: Tuesday, August 05, 2014 8:11 PM
> > > To: [email protected]
> > > Subject: Question on the number of column families
> > >
> > > Hi,
> > >
> > > According to http://hbase.apache.org/book/number.of.cfs.html, having
> > > more than 2~3 column families are strongly discouraged.
> > >
> > > BTW, in my case, records on a table have the following characteristics:
> > >
> > > - The table is read-only. It is bulk-loaded once. When a new data is
> > > ready, A new table is created and the old table is deleted.
> > >
> > > - The size of the source data can be hundreds of gigabytes.
> > >
> > > - A record has about 130 fields.
> > >
> > > - The number of fields in a record is fixed.
> > >
> > > - The names of the fields are also fixed. (it's like a table in RDBMS)
> > >
> > > - About 40(it varies) fields mostly have value, while other fields are
> > > mostly empty(null in RDBMS).
> > >
> > > - It is unknown which field will be dense. It depends on the source data.
> > >
> > > - Fields are accessed independently. Normally a user requests just one
> > > field. A user can request several fields.
> > >
> > > - The range on the range query is the same for all fields. (No wider,
> > > no narrower, regardless the data density)
> > >
> > > For me, it seems that it would be more efficient if there is one
> > > column family for each field, since it would cost less disk I/O, for
> > > only the needed column data will be read.
> > >
> > > Can the table have 130 column families for this case?
> > >
> > > Or the whole columns must be in one column family?
> > >
> > > Thanks.
> > >
> >
> >
> > --
> > Alok Kumar
> > Email : [email protected]
> > http://sharepointorange.blogspot.in/
> > http://www.linkedin.com/in/alokawi
> >
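
P.S. To illustrate the essential column family idea from HBASE-5416 referenced at the top of this reply, here is a rough sketch. It is a toy in-memory model in plain Java, not the HBase client API: the class `EssentialFamilySketch`, the family names `meta` and `data`, the value sizes, and the filter are all made up for illustration. The point it tries to show is that when the filter only inspects one (essential) family, the other families need to be read only for rows that pass the filter. In the real client this behavior is, as far as I know, requested per scan via `Scan.setLoadColumnFamiliesOnDemand(true)`; please check the javadoc for your HBase version.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

/** Toy in-memory model of on-demand column family loading. Not the HBase API. */
public class EssentialFamilySketch {

    /** One row: column family name -> value bytes stored under that family. */
    static final class Row {
        final String key;
        final Map<String, byte[]> families = new LinkedHashMap<>();

        Row(String key) {
            this.key = key;
        }

        Row put(String family, int sizeInBytes) {
            families.put(family, new byte[sizeInBytes]);
            return this;
        }
    }

    /** Keys of the rows that passed the filter, plus the value bytes touched. */
    static final class ScanResult {
        final List<String> keys = new ArrayList<>();
        long bytesRead = 0;
    }

    /** Total value size of every family in the row except the skipped one. */
    static long sizeOfOtherFamilies(Row row, String skippedFamily) {
        long total = 0;
        for (Map.Entry<String, byte[]> e : row.families.entrySet()) {
            if (!e.getKey().equals(skippedFamily)) {
                total += e.getValue().length;
            }
        }
        return total;
    }

    static ScanResult scan(List<Row> rows, String essentialFamily,
                           Predicate<byte[]> filter, boolean onDemand) {
        ScanResult result = new ScanResult();
        for (Row row : rows) {
            byte[] essential = row.families.get(essentialFamily);
            boolean matches = essential != null && filter.test(essential);
            // The family the filter inspects is always read.
            if (essential != null) {
                result.bytesRead += essential.length;
            }
            // Without on-demand loading, every other family is read for every row;
            // with it, the other families are read only for rows that passed.
            if (!onDemand || matches) {
                result.bytesRead += sizeOfOtherFamilies(row, essentialFamily);
            }
            if (matches) {
                result.keys.add(row.key);
            }
        }
        return result;
    }

    /** Hypothetical data: a small "meta" family and a large "data" family. */
    static List<Row> sampleRows() {
        List<Row> rows = new ArrayList<>();
        rows.add(new Row("r1").put("meta", 1).put("data", 1000));
        rows.add(new Row("r2").put("meta", 2).put("data", 1000));
        rows.add(new Row("r3").put("meta", 1).put("data", 1000));
        return rows;
    }

    public static void main(String[] args) {
        Predicate<byte[]> keep = v -> v.length == 2; // made-up filter on "meta"
        ScanResult eager = scan(sampleRows(), "meta", keep, false);
        ScanResult lazy = scan(sampleRows(), "meta", keep, true);
        System.out.println("eager bytes read: " + eager.bytesRead + ", rows: " + eager.keys);
        System.out.println("on-demand bytes read: " + lazy.bytesRead + ", rows: " + lazy.keys);
    }
}
```

Both scans return the same rows; only the amount of data touched differs. The saving depends on the filter being selective and on the filtered column living in a small family of its own, which is why grouping the columns into a few families comes first.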
