Hi,

I am using Solr 4.10.4, SolrCloud mode (single instance), with the indexes
residing in HDFS. I am currently testing performance and scalability of the
indexing process on my Hadoop cluster using the MapReduceIndexerTool.

Previously, I had been testing on a smaller cluster with 3 datanodes.
Running MRIT (without specifying any number of mappers or reducers) would
result in 16 mappers and 6 reducers. At the time, I didn't put much
thought into where those numbers were coming from; I just figured those
were the available slots my cluster was reporting.

Now, after growing the cluster from 3 to 5 datanodes, running the same job
on the same input results in 16 mappers and 10 reducers. It seems that the
number of reducers is scaling (2 per node), but the number of mappers is
not. Looking more closely at the output, I see a log message saying that
the cluster reports 50 mapper slots available, but only 16 get used...

So, I decided to take a look at the MRIT source code to see where these
numbers are coming from. What I found is that the number of mappers is
essentially capped at the number of input files being processed. So I took
a look at my input data directory in HDFS, and sure enough, there are 16
files in there. When I concatenated all of them into one single file and
ran the job again, that resulted in only 1 mapper and 1 reducer - and, as
you can imagine, performance dropped significantly.
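
In case it helps to show what I mean, here is a simplified paraphrase of
that capping behaviour as I read it (this is my own sketch, not the actual
MRIT code; the names and numbers are mine):

// Simplified paraphrase of the capping behaviour I think I'm seeing in
// the MRIT source (not the actual code; names and numbers are mine):
public class MapperCapSketch {
    public static void main(String[] args) {
        int availableMapperSlots = 50; // what my cluster reports
        int numInputFiles = 16;        // files found under my input dir

        // Each input file becomes one entry in an internal file list, and
        // a mapper processes whole entries, so the number of mappers can
        // never exceed the number of input files.
        int mappers = Math.min(availableMapperSlots, numInputFiles);
        System.out.println("mappers = " + mappers); // 16 here; 1 after concatenating
    }
}

If that reading is right, then concatenating everything into one file
necessarily forces a single mapper, which matches what I saw.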

Is there a way that I can tell MRIT to ignore the number of input files,
and instead operate on HDFS blocks? Or is the solution to pre-process my
input data such that it gets split up into as many (block-sized) pieces as
possible? I'm really hoping to get the indexing time to scale with the size
of the cluster.
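
If pre-splitting is the way to go, I'm picturing something along these
lines (a rough sketch of a hypothetical helper, not anything that ships
with MRIT): read the one big file and rewrite it as roughly block-sized
pieces, split on line boundaries, so the tool sees many input files again:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical pre-processing step: split one large text file in HDFS
// into roughly block-sized chunk files, breaking only on line boundaries.
public class SplitInput {
    public static void main(String[] args) throws Exception {
        Path src = new Path(args[0]);    // source file in HDFS
        Path outDir = new Path(args[1]); // directory for the chunk files
        FileSystem fs = FileSystem.get(new Configuration());
        long chunkSize = fs.getDefaultBlockSize(src); // aim for ~1 HDFS block per file

        int part = 0;
        long written = 0; // approximate byte count (chars) in the current chunk
        Writer out = null;
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(fs.open(src), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                // Start a new chunk file once the current one reaches block size.
                if (out == null || written >= chunkSize) {
                    if (out != null) out.close();
                    out = new OutputStreamWriter(
                            fs.create(new Path(outDir, String.format("part-%05d", part++))),
                            StandardCharsets.UTF_8);
                    written = 0;
                }
                out.write(line);
                out.write('\n');
                written += line.length() + 1;
            }
        } finally {
            if (out != null) out.close();
        }
    }
}

That would keep each piece close to one HDFS block, which (if my reading
above is right) should let the file count, and therefore the mapper count,
grow with the data - but it feels like working around the tool, so I'd
rather know if there's a supported way.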

Thanks,
Doug
