Hi, I am using Solr 4.10.4 in SolrCloud mode (single instance), with the indexes residing in HDFS. I am currently testing the performance and scalability of the indexing process on my Hadoop cluster using the MapReduceIndexerTool.
Previously, I had been testing on a smaller cluster with 3 datanodes. When running MRIT (without specifying any number of mappers or reducers), the job would use 16 mappers and 6 reducers. At the time, I didn't put much thought into where those numbers were coming from; I just figured those were the slots my cluster was reporting. Now, after growing the cluster from 3 to 5 datanodes, running the same job on the same input uses 16 mappers and 10 reducers. The number of reducers is scaling (2 per node), but the number of mappers is not. Looking closer at the output, I see a log message saying that the cluster reports 50 mapper slots available, but only 16 get used.

So I took a look at the MRIT source code to see where these numbers come from. What I found is that the number of mappers is essentially capped by the number of input files being processed. Sure enough, my input data directory in HDFS contains exactly 16 files. When I concatenated them all into one single file and reran the job, it used only 1 mapper and 1 reducer - and, as you'd expect, performance dropped significantly.

Is there a way to tell MRIT to ignore the number of input files and instead operate on HDFS blocks? Or is the solution to pre-process my input data so that it gets split into as many (block-sized) pieces as possible? I'm really hoping to get the indexing time to scale with the size of the cluster.

Thanks,
Doug
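P.S. For reference, here is a rough local sketch of the pre-splitting approach I'm considering, using `split -C` to cut a file into pieces of at most a given size without breaking lines. The file contents and sizes here are made up for illustration (128 KB pieces instead of a real ~128 MB HDFS block size); on the cluster I'd do the equivalent via `hdfs dfs -get`/`-put` or a small MR job:

```shell
set -e
workdir="$(mktemp -d)"
cd "$workdir"

# Fake "one big input file": ~50k newline-delimited records
yes 'sample,record,data' | head -n 50000 > big_input.txt

# Split into pieces of at most 128 KB each, keeping whole lines (-C),
# as a stand-in for splitting into block-sized pieces in HDFS
split -C 131072 big_input.txt part_

# Sanity check: the pieces re-concatenate to exactly the original
cat part_* > rejoined.txt
cmp rejoined.txt big_input.txt && echo "split OK: $(ls part_* | wc -l) pieces"
```

The idea being that, since the mapper count appears to be capped by the file count, feeding MRIT N roughly block-sized pieces instead of one big file should let it schedule up to N mappers.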