Okie:

10x # of mappers: https://issues.apache.org/jira/browse/HBASE-5401
wrong row count: https://issues.apache.org/jira/browse/HBASE-5402

Oliver

On 2012-02-15, at 11:50 AM, yuzhih...@gmail.com wrote:

> Oliver:
> Thanks for digging. 
> 
> Please file Jiras for these issues. 
> 
> 
> 
> On Feb 15, 2012, at 1:53 AM, "Oliver Meyn (GBIF)" <om...@gbif.org> wrote:
> 
>> On 2012-02-15, at 9:09 AM, Oliver Meyn (GBIF) wrote:
>> 
>>> On 2012-02-15, at 7:32 AM, Stack wrote:
>>> 
>>>> On Tue, Feb 14, 2012 at 8:14 AM, Stack <st...@duboce.net> wrote:
>>>>>> 2) With that same randomWrite command line above, I would expect a 
>>>>>> resulting table with 10 * (1024 * 1024) rows (so 10485700 = roughly 10M 
>>>>>> rows).  Instead what I'm seeing is that the randomWrite job reports 
>>>>>> writing that many rows (exactly) but running rowcounter against the 
>>>>>> table reveals only 6549899 rows.  A second attempt to build the table 
>>>>>> produces slightly different results (e.g. 6627689).  I see a similar 
>>>>>> discrepancy when using 50 instead of 10 clients (~35% smaller than 
>>>>>> expected).  Key collision could explain it, but it seems pretty unlikely 
>>>>>> (given I only need e.g. 10M keys from a potential 2B).
>>>>>> 
>>>>> 
>>>> 
>>>> I just tried it here and got a similar result.  I wonder if it's the
>>>> randomWrite?  What if you do sequentialWrite, do you get our 10M?
>>> 
>>> Thanks for checking into this, Stack.  When using sequentialWrite I get the 
>>> expected 10485700 rows.  I'll hack around a bit on the PE to count the 
>>> number of collisions, and try to think of a reasonable solution.
>> 
>> So hacking around reveals that key collision is indeed the problem.  I 
>> thought the modulo part of the getRandomRow method was suspect but while 
>> removing it improved the behaviour (I got ~8M rows instead of ~6.6M) it 
>> didn't fix it completely.  Since that's really what UUIDs are for I gave 
>> that a shot (i.e. UUID.randomUUID()) and sure enough now I get the full 10M 
>> rows.  Those are 16-byte keys now though, instead of the 10-byte that the 
>> integers produced.  But because we're testing scan performance I think using 
>> a sequentially written table would probably be cheating and so will stick 
>> with randomWrite with slightly bigger keys.  That means it's a little harder 
>> to compare to the results that other people get, but at least I know my 
>> internal tests are apples to apples.
>> 
>> Oh and I removed the outer 10x loop and that produced the desired number of 
>> mappers (i.e. what I passed in on the command line) but made no difference in 
>> the key generation/collision story.
>> 
>> Should I file bugs for these 2 issues?
>> 
>> Thanks,
>> Oliver
>> 
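For reference, the ~6.6M row counts quoted above are close to what uniform sampling with replacement would predict. Below is a minimal Python sketch of that estimate. It assumes (this is not confirmed anywhere in the thread) that the modulo in getRandomRow confines keys to a range equal to the total row count, and that keys are otherwise drawn uniformly. The function name expected_distinct is made up for illustration.

```python
import math

def expected_distinct(draws: int, keyspace: int) -> float:
    """Expected number of distinct keys after `draws` uniform random
    picks (with replacement) from `keyspace` possible keys."""
    # A given key is missed by every draw with probability
    # (1 - 1/keyspace) ** draws, so it is hit with 1 minus that.
    # log1p/expm1 keep precision when keyspace is large.
    p_hit = -math.expm1(draws * math.log1p(-1.0 / keyspace))
    return keyspace * p_hit

total_rows = 10_485_700  # row count reported by the randomWrite job in this thread

# Assumption: the modulo restricts keys to [0, total_rows).
with_modulo = expected_distinct(total_rows, total_rows)

# Assumption: no modulo, keys uniform over the ~2B non-negative ints.
without_modulo = expected_distinct(total_rows, 2**31 - 1)

print(f"keyspace = total rows: {with_modulo:,.0f} ({with_modulo / total_rows:.1%})")
print(f"keyspace = 2^31 - 1:   {without_modulo:,.0f} ({without_modulo / total_rows:.1%})")
```

Under the first assumption about 63% of the rows survive (roughly 6.6M), which is in line with the rowcounter numbers above. Under the second, well under 1% would be lost, so uniform sampling alone would not account for the ~8M seen after removing the modulo; something else would have to be contributing.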

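The key size difference mentioned above (10-byte integer keys versus 16-byte UUID keys) can be seen in a short Python sketch. It assumes the integer keys are zero-padded 10-digit decimal strings, which matches the 10-byte figure quoted but is not confirmed in this thread, and it uses Python's uuid.uuid4() as a stand-in for Java's UUID.randomUUID(). The helper names are hypothetical.

```python
import uuid

def int_row_key(n: int) -> bytes:
    # Assumed format: zero-padded 10-digit decimal string, i.e. 10 bytes.
    return f"{n:010d}".encode("ascii")

def uuid_row_key() -> bytes:
    # Random (version 4) UUID as its raw 128-bit value, i.e. 16 bytes.
    return uuid.uuid4().bytes

print(len(int_row_key(12345)))  # length of an integer-based key
print(len(uuid_row_key()))      # length of a UUID-based key
```

The extra 6 bytes per key are the cost of making collisions practically impossible, which is why results from a UUID-keyed table are not directly comparable with runs that use the stock integer keys.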

