On 2012-02-15, at 9:09 AM, Oliver Meyn (GBIF) wrote:

> On 2012-02-15, at 7:32 AM, Stack wrote:
> 
>> On Tue, Feb 14, 2012 at 8:14 AM, Stack <st...@duboce.net> wrote:
>>>> 2) With that same randomWrite command line above, I would expect a 
>>>> resulting table with 10 * (1024 * 1024) rows (so 10485700 = roughly 10M 
>>>> rows).  Instead what I'm seeing is that the randomWrite job reports 
>>>> writing that many rows (exactly) but running rowcounter against the table 
>>>> reveals only 6549899 rows.  A second attempt to build the table produces 
>>>> slightly different results (e.g. 6627689).  I see a similar discrepancy 
>>>> when using 50 instead of 10 clients (~35% smaller than expected).  Key 
>>>> collision could explain it, but it seems pretty unlikely (given I only 
>>>> need e.g. 10M keys from a potential 2B).
>>>> 
>>> 
>> 
>> I just tried it here and got a similar result.  I wonder if it's the
>> randomWrite?  What if you do sequentialWrite, do you get our 10M?
> 
> Thanks for checking into this, Stack - when using sequentialWrite I get the 
> expected 10485700 rows.  I'll hack around a bit on the PE to count the number 
> of collisions, and try to think of a reasonable solution.

So hacking around reveals that key collision is indeed the problem.  I thought 
the modulo part of the getRandomRow method was suspect, but while removing it 
improved the behaviour (I got ~8M rows instead of ~6.6M), it didn't fix it 
completely.  Since avoiding collisions is really what UUIDs are for, I gave 
that a shot (i.e. UUID.randomUUID()) and sure enough, now I get the full 10M 
rows.  Those are 16-byte keys now though, instead of the 10-byte keys that the 
integers produced.  Because we're testing scan performance, I think using a 
sequentially written table would probably be cheating, so I'll stick with 
randomWrite and the slightly bigger keys.  That makes it a little harder to 
compare against the results that other people get, but at least I know my 
internal tests are apples to apples.
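
For reference, the ~6.6M figure is about what uniformly random keys would give 
if the modulo confines them to roughly as many possible values as there are 
rows (an assumption about what the modulo does, not something checked against 
the PE source): drawing n keys from n possibilities leaves about 1 - 1/e, or 
~63%, of them distinct.  A minimal sketch of that estimate follows; the class 
and method names are made up for illustration and are not part of 
PerformanceEvaluation.

```java
import java.util.BitSet;
import java.util.Random;

// Hypothetical helper, not HBase code: estimates how many distinct keys
// survive when keys are drawn uniformly at random from a fixed key space.
class KeyCollisionSketch {

    /** Expected number of distinct keys after `draws` uniform draws from `keySpace` keys. */
    public static double expectedDistinct(long draws, long keySpace) {
        // keySpace * (1 - (1 - 1/keySpace)^draws), written with log1p/expm1
        // so the tiny 1/keySpace term is not lost to rounding.
        return -keySpace * Math.expm1(draws * Math.log1p(-1.0 / keySpace));
    }

    /** Simulates the draws and counts how many distinct keys actually show up. */
    public static int simulateDistinct(int draws, int keySpace, long seed) {
        Random rand = new Random(seed);
        BitSet seen = new BitSet(keySpace);
        for (int i = 0; i < draws; i++) {
            seen.set(rand.nextInt(keySpace)); // a repeated key just sets the same bit again
        }
        return seen.cardinality();
    }

    public static void main(String[] args) {
        int rows = 10485700; // row count reported by the randomWrite job

        // Assumption: keys confined to [0, rows), i.e. as many possible keys as rows.
        System.out.printf("key space = rows: expected %.0f, simulated %d%n",
                expectedDistinct(rows, rows), simulateDistinct(rows, rows, 42L));

        // Keys drawn from the full ~2B positive int range instead.
        System.out.printf("key space = ~2B:  expected %.0f%n",
                expectedDistinct(rows, Integer.MAX_VALUE));
    }
}
```

With a key space of ~2B this model predicts only a few tens of thousands of 
collisions, so it does not by itself account for the ~8M rows seen after 
removing the modulo.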

Oh, and I removed the outer 10x loop, and that produced the desired number of 
mappers (i.e. what I passed in on the command line), but it made no difference 
to the key generation/collision story.

Should I file bugs for these 2 issues?

Thanks,
Oliver
