w.r.t. FuzzyRowFilter, there is a bug fix (HBASE-14269) which is not in any release yet.
Look for a future release (1.2.0, 1.1.3, 0.98.15) which would contain the fix.

FYI

On Thu, Sep 10, 2015 at 10:36 AM, Vladimir Rodionov <vladrodio...@gmail.com> wrote:

> It depends on your read pattern. If you mostly read a small subset of columns
> (you have a lot of them), both approaches are bad. You will need to scan all
> your columns and deserialize blobs to extract only a few of them (that is 5 MB
> at least). Consider adding more data (columns) to the rowkey and using
> FuzzyRowFilter; it should be faster.
>
> From the write perf point of view, blobs are better, of course.
>
> -Vlad
>
> On Thu, Sep 10, 2015 at 9:33 AM, Ted Yu <yuzhih...@gmail.com> wrote:
>
> > You may have seen this:
> > http://hbase.apache.org/book.html#schema.smackdown
> >
> > bq. are part of one column family
> >
> > Are the columns equally likely to be read ?
> > I ask this because you may be able to utilize the essential column family
> > feature by separating the columns which tend to be more frequently accessed
> > into their own column family.
> >
> > 0.94 is quite old.
> > Any chance of rerunning your benchmark on hbase 1.x ?
> >
> > Thanks
> >
> > On Thu, Sep 10, 2015 at 9:00 AM, Melvin Kanasseril <
> > melvin.kanasse...@sophos.com> wrote:
> >
> > > Hi,
> > >
> > > This has probably come up before, but I wanted to know if there is a
> > > recommendation around having tables with all attribute data as separate
> > > columns vs. an approach with most of the attribute data stored as a blob
> > > in a single column and the rest as separate columns (for column filter
> > > searches). I am aware of the limitations of lumping the data into a blob
> > > but was curious to see if there is an improvement in throughput/latency.
> > >
> > > I am leaning towards there not being much of a difference, or this being
> > > a micro-optimization not worth the tradeoff, but when we ran a set of
> > > benchmarks to test this (on ver 0.94), the hybrid approach with the blob
> > > data seemed to show a 10-12% improvement in write throughput for the same
> > > number of client threads with evenly distributed puts over a pre-split
> > > table on a 12-node cluster. I used Avro for serialization and all the
> > > columns (there are about 40 without the blob column and 10 with it) are
> > > part of one column family. The size of data for a row is around 5 MB
> > > before serialization. Any thoughts on whether this is worth pursuing?
> > >
> > > Thanks,
> > > Melvin
> > >
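
P.S. To illustrate the rowkey + FuzzyRowFilter approach Vlad describes, here is a
minimal sketch (my own, not code from this thread), assuming a hypothetical
composite rowkey of a 4-byte entity id followed by a 2-byte attribute id. In the
mask array, 0 marks a byte that must match exactly and 1 marks a byte that may
be anything:

import java.util.Arrays;

import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.FuzzyRowFilter;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.hbase.util.Pair;

public class FuzzyScanSketch {
    // Hypothetical rowkey layout: [4-byte entity id][2-byte attribute id].
    // Builds a Scan that returns rows for one attribute id across all entity ids.
    public static Scan scanForAttribute(short attributeId) {
        byte[] fuzzyKey = new byte[6];
        // Bytes 0-3 (entity id) stay zero; the mask marks them as fuzzy.
        System.arraycopy(Bytes.toBytes(attributeId), 0, fuzzyKey, 4, 2);

        // 0 = byte must match exactly, 1 = byte can be anything.
        byte[] fuzzyMask = new byte[] {1, 1, 1, 1, 0, 0};

        FuzzyRowFilter filter = new FuzzyRowFilter(
                Arrays.asList(new Pair<>(fuzzyKey, fuzzyMask)));

        Scan scan = new Scan();
        scan.setFilter(filter);
        return scan;
    }
}

Since FuzzyRowFilter can supply seek hints to the scanner rather than examining
every row, this tends to be cheaper than a full scan plus client-side
deserialization of the blobs, provided the rowkey carries the columns you filter on.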