Returning all cells to a client is the other extreme and I don't think that would be a great test either.
Personally, I think that to test big change sets well we need a range of workloads. The extreme cases (filter all, filter none) are useful data points, but not great if measured in isolation. I think YCSB is a reasonable option for that these days, now that it is maintained. It comes with 6 or so canned workloads. Not a bad start.

> On Jul 20, 2015, at 6:01 AM, lars hofhansl <[email protected]> wrote:
>
> Personally, I think that is a reasonable way to test the internal friction of
> the server. I've been doing a lot of tests like that and found a lot of
> inefficiencies in HBase that way. For cases where we return all Cells back to
> a (remote) client improving the server by 10 or 20% would mostly go unnoticed.
>
> Analytics (aggregates via Phoenix of direct coprocessors) will be more
> important going forward, so improving that part is important.
> I completely agree that end-to-end (by which I mean data shipped to the
> client) testing is important, it's just I'd expect us to work on different
> areas (put Protobufs on a diet, have a streaming protocol, etc).
> -- Lars
>
> From: Andrew Purtell <[email protected]>
> To: "[email protected]" <[email protected]>
> Sent: Saturday, July 18, 2015 11:24 AM
> Subject: Re: DISCUSSION: lets do a developer workshop on near-term work
>
> That's not a realistic or useful test scenario, unless the goal is to
> accelerate queries where all cells are filtered at the server.
>
>> On Jul 18, 2015, at 11:02 AM, Anoop John <[email protected]> wrote:
>>
>> No Andy. 11425 having doc attached to it. At the end of it, we have added
>> perf numbers in a cluster testing. This was done using PE get and scan
>> tests with filtering all cells at server (to not consider n/w bandwidth
>> constraints)
>>
>> -Anoop-
>>
>> On Sat, Jul 18, 2015 at 9:30 PM, Andrew Purtell <[email protected]>
>> wrote:
>>
>>> We have some microbenchmarks, not evidence of differences seen from a
>>> client application.
>>> I'm not saying that microbenchmarks are not totally
>>> necessary and a great start - they are - but that they don't measure an end
>>> goal. Furthermore unless I've missed one somewhere we don't have a JIRA or
>>> design doc that states a clear end goal metric like the strawman I threw
>>> together in my previous mail. A measurable system level goal and some data
>>> from full cluster testing would go a lot further toward letting all of us
>>> evaluate the potential and payoff of the work. In the meantime we should
>>> probably be assembling these changes on a branch instead of in trunk, for
>>> as long as the goal is not clearly defined and the payoff and potential for
>>> perf regressions is untested and unknown.
>>>
>>>> On Jul 18, 2015, at 8:05 AM, Anoop John <[email protected]> wrote:
>>>>
>>>> Thanks Andy and Lars. The parent jira has doc attached which contains some
>>>> perf gain numbers.. We will be doing more tests in next 2 weeks (before
>>>> end of this month) and will publish them. Yes it will be great if it is
>>>> more IST friendly time :-)
>>>>
>>>> -Anoop-
>>>>
>>>> On Fri, Jul 17, 2015 at 9:44 PM, Andrew Purtell <[email protected]>
>>>> wrote:
>>>>
>>>>>> I can represent your side Ram (and Anoop). I've been known always argue
>>>>>> both side of a discussion and to never take sides easily (drives some
>>>>>> folks crazy).
>>>>>
>>>>> I can vouch for this (smile)
>>>>>
>>>>> I also can offer support for off heaping there. At the same time we do
>>>>> have a gap where we can't point to a timeline of improvements (yet,
>>>>> anyway) with benchmarks showing gains where your goals need them. For
>>>>> example, stock HBase in one JVM can address max N GB for response time
>>>>> distribution D; dev version of HBase in off heap branch can address max
>>>>> N' GB for distribution D', where N' > N and D > D' (distribution D'
>>>>> statistically shows better/lower response times).
>>>>>
>>>>>> On Jul 17, 2015, at 6:56 AM, lars hofhansl <[email protected]> wrote:
>>>>>>
>>>>>> I'm in favor of anything that improves performance (and preferably
>>>>>> doesn't set us back into a world that's worse than C due to the lack of
>>>>>> pointers in Java). Never said "I don't like it", it's just that I'm
>>>>>> perhaps asking for more numbers and justification in weighing the pros
>>>>>> and cons.
>>>>>> I can represent your side Ram (and Anoop). I've been known always argue
>>>>>> both side of a discussion and to never take sides easily (drives some
>>>>>> folks crazy). And Stack's there too, he yell at me where needed :)
>>>>>>
>>>>>> Perhaps we can do it a bit later in the evening so there is a fighting
>>>>>> chance that folks on IST can participate. I know that some of our folks
>>>>>> on IST would love to participate in the backup discussion).
>>>>>>
>>>>>> Like Enis, I'm also happy to host. We're in Downtown SF. I'd just need
>>>>>> an approx. number of folks.
>>>>>>
>>>>>> -- Lars
>>>>>>
>>>>>> From: ramkrishna vasudevan <[email protected]>
>>>>>> To: "[email protected]" <[email protected]>; lars hofhansl <[email protected]>
>>>>>> Sent: Wednesday, July 15, 2015 10:10 AM
>>>>>> Subject: Re: DISCUSSION: lets do a developer workshop on near-term work
>>>>>>
>>>>>> Hi
>>>>>> What time will it be on August 26th?
>>>>>> @Lars
>>>>>> Ya. I know that you are not generally in favour of this offheaping
>>>>>> stuff. May be if we (from India) can attend this meeting remotely your
>>>>>> thoughts can be discussed and also the current state of this work.
>>>>>> Regards
>>>>>> Ram
>>>>>>
>>>>>> On Wed, Jul 15, 2015 at 9:28 PM, lars hofhansl <[email protected]> wrote:
>>>>>>
>>>>>> Works for me. I'll be back in the Bay Area the week of August 9th.
>>>>>> We have done a _lot_ of work on backups as well - ours are more
>>>>>> complicated as we wanted fast per-tenant restores, so data is "grouped"
>>>>>> by tenant.
>>>>>> Would like to sync up on that (hopefully some of the folks who
>>>>>> wrote most of the code will be in town, I'll check).
>>>>>>
>>>>>> Also interested in the "Time" and "offheap" parts (although you folks
>>>>>> usually do not like what I think about the offheap efforts :) ).
>>>>>> Would like to add the following topics:
>>>>>>
>>>>>> - "Timestamp Resolution". Or making space for more bits in the
>>>>>>   timestamps (happy to cover that, unless it's part of the "Time" topic)
>>>>>>
>>>>>> - "Replication". We found that replication cannot keep up with high
>>>>>>   write loads, due to the fact that replicated is strictly single threaded
>>>>>>   per regionserver (even though we have multiple region servers on the
>>>>>>   sink side)
>>>>>>
>>>>>> - "Spark integration" (Ted Malaska?)
>>>>>>
>>>>>> OK... Out now to make a "bullshit hat".
>>>>>>
>>>>>> -- Lars
>>>>>>
>>>>>> ________________________________
>>>>>> From: Sean Busbey <[email protected]>
>>>>>> To: dev <[email protected]>
>>>>>> Sent: Tuesday, July 14, 2015 7:11 PM
>>>>>> Subject: Re: DISCUSSION: lets do a developer workshop on near-term work
>>>>>>
>>>>>> I'm planning to be in the Bay area the week of the 24th of August.
>>>>>>
>>>>>> --
>>>>>> Sean
>>>>>>
>>>>>>> On Jul 14, 2015 7:53 PM, "Andrew Purtell" <[email protected]> wrote:
>>>>>>>
>>>>>>> I can be up in your area in August.
>>>>>>>
>>>>>>>>> On Tue, Jul 14, 2015 at 5:31 PM, Stack <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>> On Tue, Jul 14, 2015 at 3:39 PM, Enis Söztutar <[email protected]>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Sounds good. It has been a while we did the talk-aton.
>>>>>>>>>
>>>>>>>>> I'll be off starting 25 of July, so I prefer something next week if
>>>>>>>>> possible.
>>>>>>>>>
>>>>>>>>> You ever coming back? If so, when? I'm back on 10th of August (Mikhail on
>>>>>>>> the 20th).
>>>>>>>> St.Ack
>>>>>>>>
>>>>>>>>> Enis
>>>>>>>>>
>>>>>>>>>> On Tue, Jul 14, 2015 at 3:18 PM, Stack <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>> Matteo and I were thinking it time devs got together for a pow-wow.
>>>>>>>>>> There is a bunch of stuff in flight at the moment (see below list)
>>>>>>>>>> and it would be good to meet and whiteboard, surface goodo ideas
>>>>>>>>>> that have gone dormant in JIRA, or revisit designs/proposals out in
>>>>>>>>>> JIRA-attached google doc that need socializing.
>>>>>>>>>>
>>>>>>>>>> You can only come if you are wearing your bullshit hat.
>>>>>>>>>>
>>>>>>>>>> Topics we'd go over could include:
>>>>>>>>>>
>>>>>>>>>> + Our filesystem layout will not work if 1M regions (Matteo/Stack)
>>>>>>>>>> + Current state of the offheaping of read path and alternate KeyValue
>>>>>>>>>>   implementation (Anoop/Ram)
>>>>>>>>>> + Append rejigger (Elliott)
>>>>>>>>>> + A Pv2-based Assign (Matteo/Steven)
>>>>>>>>>> + Splitting meta/1M regions
>>>>>>>>>> + The revived Backup (Vladimir)
>>>>>>>>>> + Time (Enis)
>>>>>>>>>> + The overloaded SequenceId (Stack)
>>>>>>>>>> + Upstreaming IT testing (Dima/Sean)
>>>>>>>>>> + hbase-2.0.0
>>>>>>>>>>
>>>>>>>>>> I put names by folks I know could talk to the topic. If you want to
>>>>>>>>>> take over a topic or put your name by one, just say. Suggest that
>>>>>>>>>> discussion lead off with a 5-10minute on current state of
>>>>>>>>>> thought/design/implementation.
>>>>>>>>>>
>>>>>>>>>> What do others think?
>>>>>>>>>>
>>>>>>>>>> What date would suit folks?
>>>>>>>>>>
>>>>>>>>>> Anyone want to host?
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Matteo and St.Ack
>>>>>>>
>>>>>>> --
>>>>>>> Best regards,
>>>>>>>
>>>>>>> - Andy
>>>>>>>
>>>>>>> Problems worthy of attack prove their worth by hitting back. - Piet Hein
>>>>>>> (via Tom White)
