Hi all,

I've been trying to run a battery of tests to really understand our cluster's 
performance, and I'm employing PerformanceEvaluation to do that (picking up 
where Tim Robertson left off, elsewhere on the list).  I'm seeing two strange 
things that I hope someone can help with:

1) With a command line like 'hbase 
org.apache.hadoop.hbase.PerformanceEvaluation randomWrite 10' I see 100 mappers 
spawned, rather than the expected 10.  I expect 10 because that's what the 
usage text implies, and what the javadoc explicitly states - quoting from 
doMapReduce "Run as many maps as asked-for clients."  The culprit appears to be 
the outer loop in writeInputFile which sets up 10 splits for every "asked-for 
client" - at least, if I'm reading it right.  Is this somehow expected, or is 
that code leftover from some previous iteration/experiment?
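
For concreteness, here is a compile-and-run sketch of the structure I think 
I'm reading (my paraphrase, not the actual PerformanceEvaluation source - 
names like numClients and perClientRows are mine):

  public class SplitSketch {
    public static void main(String[] args) {
      int numClients = 10;                    // the "asked-for clients"
      int totalRows = numClients * 1024 * 1024;
      int perClientRows = totalRows / numClients;
      int lines = 0;
      for (int i = 0; i < 10; i++) {          // the suspect outer loop
        for (int j = 0; j < numClients; j++) {
          // each line written to the job's input file becomes one map task
          System.out.println("startRow=" + (j * perClientRows + i * (perClientRows / 10))
              + ", perClientRunRows=" + (perClientRows / 10)
              + ", totalRows=" + totalRows);
          lines++;
        }
      }
      System.out.println(lines + " input lines -> " + lines + " map tasks");
    }
  }

If that reading is right, 'randomWrite 10' produces 10 * 10 = 100 input lines 
and hence 100 mappers, each handling a tenth of one client's rows.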

2) With that same randomWrite command line above, I would expect a resulting 
table with 10 * (1024 * 1024) rows (so 10485760, roughly 10M rows).  Instead 
what I'm seeing is that the randomWrite job reports writing that many rows 
(exactly) but running rowcounter against the table reveals only 6549899 rows.  
A second attempt to build the table produces slightly different results (e.g. 
6627689).  I see a similar discrepancy when using 50 instead of 10 clients 
(~35% smaller than expected).  Key collision could explain it, but it seems 
pretty unlikely, given I only need ~10M keys from a potential 2B - see the 
back-of-the-envelope estimate below.
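
For what it's worth, here is the birthday-problem arithmetic behind that 
"pretty unlikely" (the ~2B keyspace is my assumption from skimming the key 
generation, so treat the numbers accordingly):

  public class CollisionEstimate {
    public static void main(String[] args) {
      double n = 10.0 * 1024 * 1024;   // 10485760 keys written by the job
      double d = 2147483648.0;         // assumed keyspace: ~2B (2^31) possible keys
      // expected duplicates = n - d * (1 - (1 - 1/d)^n), roughly n^2 / (2d) for n << d
      double dupes = n - d * (1.0 - Math.pow(1.0 - 1.0 / d, n));
      System.out.printf("expected duplicate keys: ~%.0f of %.0f%n", dupes, n);
      // prints ~25600 - nowhere near the ~3.9M rows that go missing
    }
  }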

Any and all input appreciated.

Thanks,
Oliver

--
Oliver Meyn
Software Developer
Global Biodiversity Information Facility (GBIF)
+45 35 32 15 12
http://www.gbif.org
