Okie:

10x # of mappers: https://issues.apache.org/jira/browse/HBASE-5401
wrong row count: https://issues.apache.org/jira/browse/HBASE-5402

Oliver

On 2012-02-15, at 11:50 AM, yuzhih...@gmail.com wrote:

> Oliver:
> Thanks for digging.
>
> Please file Jiras for these issues.
>
> On Feb 15, 2012, at 1:53 AM, "Oliver Meyn (GBIF)" <om...@gbif.org> wrote:
>
>> On 2012-02-15, at 9:09 AM, Oliver Meyn (GBIF) wrote:
>>
>>> On 2012-02-15, at 7:32 AM, Stack wrote:
>>>
>>>> On Tue, Feb 14, 2012 at 8:14 AM, Stack <st...@duboce.net> wrote:
>>>>>> 2) With that same randomWrite command line above, I would expect a
>>>>>> resulting table with 10 * (1024 * 1024) rows (so 10485760 = roughly 10M
>>>>>> rows). Instead what I'm seeing is that the randomWrite job reports
>>>>>> writing that many rows (exactly) but running rowcounter against the
>>>>>> table reveals only 6549899 rows. A second attempt to build the table
>>>>>> produces slightly different results (e.g. 6627689). I see a similar
>>>>>> discrepancy when using 50 instead of 10 clients (~35% smaller than
>>>>>> expected). Key collision could explain it, but it seems pretty unlikely
>>>>>> (given I only need e.g. 10M keys from a potential 2B).
>>>>>>
>>>>>
>>>>
>>>> I just tried it here and got a similar result. I wonder if it's the
>>>> randomWrite? What if you do sequentialWrite, do you get our 10M?
>>>
>>> Thanks for checking into this, Stack - when using sequentialWrite I get the
>>> expected 10485760 rows. I'll hack around a bit on the PE to count the
>>> number of collisions, and try to think of a reasonable solution.
>>
>> So hacking around reveals that key collision is indeed the problem. I
>> thought the modulo part of the getRandomRow method was suspect, but while
>> removing it improved the behaviour (I got ~8M rows instead of ~6.6M) it
>> didn't fix it completely. Since that's really what UUIDs are for I gave
>> that a shot (i.e. UUID.randomUUID()) and sure enough now I get the full 10M
>> rows. Those are 16-byte keys now though, instead of the 10-byte keys that
>> the integers produced. But because we're testing scan performance I think
>> using a sequentially written table would probably be cheating, and so I
>> will stick with randomWrite with slightly bigger keys. That means it's a
>> little harder to compare to the results that other people get, but at
>> least I know my internal tests are apples to apples.
>>
>> Oh, and I removed the outer 10x loop and that produced the desired number
>> of mappers (i.e. what I passed in on the command line) but made no
>> difference in the key generation/collision story.
>>
>> Should I file bugs for these 2 issues?
>>
>> Thanks,
>> Oliver
>>
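
For anyone wanting to sanity-check the collision explanation, here is a rough back-of-the-envelope sketch in Java. It is not code from PerformanceEvaluation; the class and method names are made up for illustration. It assumes keys are drawn uniformly and independently, and it compares two assumed key spaces: one equal to the total row count (which is what a modulo on the total rows would give) and one of roughly 2B values.

```java
public class CollisionEstimate {

    /**
     * Expected number of distinct keys after `draws` independent uniform
     * draws (with replacement) from a key space of `keySpace` values:
     *   keySpace * (1 - (1 - 1/keySpace)^draws)
     * log1p/expm1 are used to keep precision when 1/keySpace is tiny.
     */
    static double expectedDistinct(long draws, long keySpace) {
        double logMissProbability = draws * Math.log1p(-1.0 / keySpace);
        return -keySpace * Math.expm1(logMissProbability);
    }

    public static void main(String[] args) {
        // 10 clients * (1024 * 1024) rows each, as in the thread.
        long rows = 10L * 1024 * 1024;

        // Assumption: the modulo confines keys to [0, rows).
        System.out.printf("key space = total rows: %.0f distinct keys%n",
                expectedDistinct(rows, rows));

        // Assumption: keys spread over the full non-negative int range (~2B).
        System.out.printf("key space = 2^31 - 1:   %.0f distinct keys%n",
                expectedDistinct(rows, Integer.MAX_VALUE));
    }
}
```

Under the first assumption the expected fraction of distinct keys is about 1 - 1/e, i.e. roughly 63%, which is in the same ballpark as the ~6.5M to ~6.6M rows and the ~35% shortfall reported above. Under the second assumption almost every key would be distinct, so a uniform 2B key space alone would not explain the ~8M result seen after removing the modulo.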