Mike, CDH4.2 will be out shortly; it will be based on HBase 0.94 and will include both of the features Ted mentioned, and more.
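To make the Data Block Encoding point concrete: on 0.94 you can switch an existing family to FAST_DIFF, which tends to do well on long, repetitive qualifiers with empty values. A rough sketch against the 0.94 Java client follows (the table name "mytable" and family "d" are placeholders, and note the alter does take the table briefly offline):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.HColumnDescriptor;
  import org.apache.hadoop.hbase.HTableDescriptor;
  import org.apache.hadoop.hbase.client.HBaseAdmin;
  import org.apache.hadoop.hbase.io.encoding.DataBlockEncoding;
  import org.apache.hadoop.hbase.util.Bytes;

  public class EnableFastDiff {
      public static void main(String[] args) throws Exception {
          Configuration conf = HBaseConfiguration.create();
          HBaseAdmin admin = new HBaseAdmin(conf);
          String table = "mytable"; // placeholder table/family names

          // Fetch the existing family descriptor so its other settings
          // (TTL, versions, compression, ...) are preserved.
          HTableDescriptor desc = admin.getTableDescriptor(Bytes.toBytes(table));
          HColumnDescriptor cf = desc.getFamily(Bytes.toBytes("d"));

          // FAST_DIFF encodes each key relative to the previous one, so
          // wide rows of similar qualifiers shrink on disk and in cache.
          cf.setDataBlockEncoding(DataBlockEncoding.FAST_DIFF);

          admin.disableTable(table); // offline schema change; plan the window
          admin.modifyColumn(table, cf);
          admin.enableTable(table);
          admin.close();
      }
  }

The shell equivalent, if I remember it right, is an alter with DATA_BLOCK_ENCODING => 'FAST_DIFF' on the family.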
- Dave

On Thu, Feb 7, 2013 at 8:34 PM, Michael Ellery <mell...@opendns.com> wrote:
>
> thanks for reminding me of the HBase version in CDH4 - that's something
> we'll definitely take into consideration.
>
> -Mike
>
> On Feb 7, 2013, at 5:09 PM, Ted Yu wrote:
>
> > Thanks Michael for this information.
> >
> > FYI CDH4 (as of now) is based on HBase 0.92.x, which doesn't have the
> > two features I cited below.
> >
> > On Thu, Feb 7, 2013 at 5:02 PM, Michael Ellery <mell...@opendns.com> wrote:
> >
> >> There is only one CF in this schema.
> >>
> >> Yes, we are looking at upgrading to CDH4, but it is not trivial since
> >> we cannot have cluster downtime. Our current upgrade plan involves
> >> additional hardware with side-by-side clusters until everything is
> >> exported/imported.
> >>
> >> Thanks,
> >> Mike
> >>
> >> On Feb 7, 2013, at 4:34 PM, Ted Yu wrote:
> >>
> >>> How many column families are involved?
> >>>
> >>> Have you considered upgrading to 0.94.4, where you would be able to
> >>> benefit from lazy seek, Data Block Encoding, etc.?
> >>>
> >>> Thanks
> >>>
> >>> On Thu, Feb 7, 2013 at 3:47 PM, Michael Ellery <mell...@opendns.com> wrote:
> >>>
> >>>> I'm looking for some advice about per-row CQ (column qualifier)
> >>>> count guidelines. Our current schema design means we have a HIGHLY
> >>>> variable CQ count per row -- some rows have one or two CQs and some
> >>>> rows have upwards of 1 million. Each CQ is on the order of 100 bytes
> >>>> (for round numbers) and the cell values are null. We see highly
> >>>> variable and too often unacceptable read performance using this
> >>>> schema. I don't know for a fact that the CQ count variability is the
> >>>> source of our problems, but I am suspicious.
> >>>>
> >>>> I'm curious about others' experience with CQ counts per row -- are
> >>>> there some best practices/guidelines about how to optimally size the
> >>>> number of CQs per row? The other obvious solution will involve
> >>>> breaking this data into finer-grained rows, which means shifting
> >>>> from GETs to SCANs -- are there performance trade-offs in such a
> >>>> change?
> >>>>
> >>>> We are currently using CDH3u4, if that is relevant. All of our
> >>>> loading is done via HFile loading (bulk), so we have not had to tune
> >>>> write performance beyond using bulk loads. Any advice appreciated,
> >>>> including what metrics we should be looking at to further diagnose
> >>>> our read performance challenges.
> >>>>
> >>>> Thanks,
> >>>> Mike Ellery
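On Mike's GETs-vs-SCANs question: if each former (row, qualifier) pair becomes its own skinny row, the wide-row GET turns into a short prefix scan, which bounds the data read per request at the cost of scanner setup and per-next overhead. A minimal sketch, assuming a hypothetical tall table "items_tall" whose keys are "<original-row>\0<qualifier>" (again 0.94 client API):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.client.Result;
  import org.apache.hadoop.hbase.client.ResultScanner;
  import org.apache.hadoop.hbase.client.Scan;
  import org.apache.hadoop.hbase.filter.PrefixFilter;
  import org.apache.hadoop.hbase.util.Bytes;

  public class TallRowScan {
      public static void main(String[] args) throws Exception {
          Configuration conf = HBaseConfiguration.create();
          // Hypothetical tall table: each former (row, qualifier) pair is
          // its own row keyed "<original-row>\0<qualifier>", value empty.
          HTable table = new HTable(conf, "items_tall");

          byte[] prefix = Bytes.toBytes("original-row\0");
          Scan scan = new Scan(prefix);             // start at first matching row
          scan.setFilter(new PrefixFilter(prefix)); // ends the scan once past the prefix
          scan.setCaching(500);                     // rows per RPC; tune to row size

          ResultScanner scanner = table.getScanner(scan);
          try {
              for (Result r : scanner) {
                  byte[] rowKey = r.getRow(); // former CQ is the key suffix
                  // process one former cell per Result here
              }
          } finally {
              scanner.close();
              table.close();
          }
      }
  }

With scanner caching set, a scan over a few hundred of these skinny rows is typically a couple of RPCs, and no single request ever has to materialize a million-cell row. The trade-off is real, though: small lookups that used to be one GET now pay scanner overhead, so it is worth benchmarking both shapes on your own data before committing to a migration.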