Thanks Ted and Vladimir for the responses. The read queries almost always return the entire set of columns at the moment, although some of them needn't. If we go down the path of refining the return set, we will look at essential column families. Blobs are only okay if there is an improvement in write performance (we have a write-heavy pattern; reads can afford the additional time to deserialize the Avro) and we decide that nothing within the blob will be used in filtering.
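For reference, here is a rough sketch of the hybrid layout we have been weighing (against the 1.x client API), in case it makes the question more concrete: the handful of filterable attributes go into a small column family and the Avro-serialized blob into a second one, so a SingleColumnValueFilter on the small family can lean on the essential-column-family behaviour and avoid touching the blob family for rows that don't match. The table name, families, qualifiers and Avro schema below are made up for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DatumWriter;
import org.apache.avro.io.EncoderFactory;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.filter.CompareFilter;
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class HybridBlobSketch {

  // Hypothetical families: "a" holds the ~10 filterable attributes,
  // "b" holds the Avro blob with everything else.
  private static final byte[] ATTR_CF = Bytes.toBytes("a");
  private static final byte[] BLOB_CF = Bytes.toBytes("b");

  public static void main(String[] args) throws IOException {
    // Toy Avro schema standing in for the non-filterable attributes.
    Schema schema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"Payload\",\"fields\":["
        + "{\"name\":\"detail\",\"type\":\"string\"}]}");

    // Serialize the non-filterable attributes into a single Avro blob.
    GenericRecord payload = new GenericData.Record(schema);
    payload.put("detail", "large attribute data goes here");
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    DatumWriter<GenericRecord> writer = new GenericDatumWriter<>(schema);
    BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
    writer.write(payload, encoder);
    encoder.flush();

    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table table = conn.getTable(TableName.valueOf("events"))) {

      // Write path: one Put carries the filterable column(s) plus the blob.
      Put put = new Put(Bytes.toBytes("row-0001"));
      put.addColumn(ATTR_CF, Bytes.toBytes("status"), Bytes.toBytes("ACTIVE"));
      put.addColumn(BLOB_CF, Bytes.toBytes("payload"), out.toByteArray());
      table.put(put);

      // Read path: filter on the small family only; with on-demand
      // column-family loading, the blob family is fetched lazily for
      // rows that pass the filter (essential column family feature).
      SingleColumnValueFilter filter = new SingleColumnValueFilter(
          ATTR_CF, Bytes.toBytes("status"),
          CompareFilter.CompareOp.EQUAL, Bytes.toBytes("ACTIVE"));
      filter.setFilterIfMissing(true);
      Scan scan = new Scan();
      scan.setFilter(filter);
      scan.setLoadColumnFamiliesOnDemand(true);

      try (ResultScanner scanner = table.getScanner(scan)) {
        for (Result r : scanner) {
          byte[] blob = r.getValue(BLOB_CF, Bytes.toBytes("payload"));
          // Deserialize blob with a GenericDatumReader when the full record is needed.
        }
      }
    }
  }
}
```

The write path stays a single Put per row either way; the question is whether collapsing roughly 30 of the ~40 columns into the blob buys enough write throughput to justify losing per-column access to them.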
We looked at FuzzyRowFilter a while ago and may still use it in the future for some of our reads. In the current context, it would certainly help reduce the number of columns outside of the blob column if we decide to go down that path (a rough sketch of how we might use it is included after the quoted thread below). Benchmarking on 1.x is a bit difficult at the moment, but we hope to do it and will let you know when we get there.

On 9/10/15, 2:00 PM, "Ted Yu" <yuzhih...@gmail.com> wrote:

>w.r.t. FuzzyRowFilter, there is a bug fix (HBASE-14269) which is not in
>any release yet.
>
>Look for a future release (1.2.0, 1.1.3, 0.98.15) which would contain the
>fix.
>
>FYI
>
>On Thu, Sep 10, 2015 at 10:36 AM, Vladimir Rodionov
><vladrodio...@gmail.com> wrote:
>
>> It depends on your read pattern. If you mostly read a small subset of
>> columns (you have a lot of them), both approaches are bad. You will need
>> to scan all your columns and deserialize blobs to extract only a few of
>> them (that is 5MB at least). Consider adding more data (columns) to the
>> rowkey and using FuzzyRowFilter; it should be faster.
>>
>> From a write perf point of view, blobs are better, of course.
>>
>> -Vlad
>>
>> On Thu, Sep 10, 2015 at 9:33 AM, Ted Yu <yuzhih...@gmail.com> wrote:
>>
>> > You may have seen this:
>> > http://hbase.apache.org/book.html#schema.smackdown
>> >
>> > bq. are part of one column family
>> >
>> > Are the columns equally likely to be read?
>> > I ask this because you may be able to utilize the essential column
>> > family feature by separating columns which tend to be more frequently
>> > accessed into their own column family.
>> >
>> > 0.94 is quite old.
>> > Any chance of rerunning your benchmark on hbase 1.x?
>> >
>> > Thanks
>> >
>> > On Thu, Sep 10, 2015 at 9:00 AM, Melvin Kanasseril <
>> > melvin.kanasse...@sophos.com> wrote:
>> >
>> > > Hi,
>> > >
>> > > This probably has come up before, but I wanted to know if there is a
>> > > recommendation around having tables with all attribute data as
>> > > separate columns vs. an approach with most of the attribute data
>> > > stored as a blob in a single column and the rest as separate columns
>> > > (for column filter searches). I am aware of the limitations of
>> > > lumping the data into a blob but was curious to see if there is an
>> > > improvement in throughput/latency.
>> > >
>> > > I am leaning towards there not being much of a difference, or this
>> > > being a micro-optimization not worth the tradeoff, but when we ran a
>> > > set of benchmarks to test this (on ver 0.94), the hybrid approach
>> > > with the blob data seemed to show a 10-12% improvement in write
>> > > throughput for the same number of client threads with evenly
>> > > distributed puts over a pre-split table on a 12-node cluster. I used
>> > > Avro for serialization and all the columns (there are about 40
>> > > without the blob column and 10 with it) are part of one column
>> > > family. The size of data for a row is around 5 MB before
>> > > serialization. Any thoughts on whether this is worth pursuing?
>> > >
>> > > Thanks,
>> > > Melvin
>> > >
>> >
>>
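For the archive, this is roughly how I understand the FuzzyRowFilter suggestion, assuming we move the attributes we filter on into a fixed-width rowkey. The 16-byte customerId | timestamp | eventType layout and the table name are made up for illustration.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.filter.FuzzyRowFilter;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.hbase.util.Pair;

public class FuzzyRowScanSketch {

  public static void main(String[] args) throws IOException {
    // Hypothetical 16-byte rowkey: 4-byte customerId | 8-byte timestamp | 4-byte eventType.
    // Fix eventType = 7 and leave customerId and timestamp fuzzy.
    byte[] rowTemplate = new byte[16];
    System.arraycopy(Bytes.toBytes(7), 0, rowTemplate, 12, 4);

    // Mask semantics: 0 = byte must match the template, 1 = byte may be anything.
    byte[] mask = new byte[16];
    Arrays.fill(mask, 0, 12, (byte) 1);   // customerId + timestamp: fuzzy
    Arrays.fill(mask, 12, 16, (byte) 0);  // eventType: fixed

    List<Pair<byte[], byte[]>> fuzzyKeys =
        Arrays.asList(new Pair<byte[], byte[]>(rowTemplate, mask));

    Scan scan = new Scan();
    scan.setFilter(new FuzzyRowFilter(fuzzyKeys));

    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table table = conn.getTable(TableName.valueOf("events"));
         ResultScanner scanner = table.getScanner(scan)) {
      for (Result result : scanner) {
        System.out.println(Bytes.toStringBinary(result.getRow()));
      }
    }
  }
}
```

This of course presumes a fixed-length rowkey with the filterable attributes encoded at known offsets, which is part of what we would want to validate when we get to the 1.x benchmarks.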