I have a very weird problem that I'm going to try to describe here to see if anyone has any "ah-ha" moments or clues. I haven't created a small reproducible project for this but I guess I will have to try in the future if I can't figure it out. (Or I'll need to bisect by running long Hadoop jobs...)
So, the facts:

* I have been successfully using Solr mapred to build very large Solr clusters for months.
* As of Solr 4.10, *some* job sizes repeatably hang in the MTree merge phase.
* Those same jobs (same input, output, and Hadoop cluster) succeed if I change only my Solr deps to 4.9.
* The job *does succeed* in 4.10 if I use the same data to create more, but smaller, shards (e.g. 12x as many shards, each 1/12th the size of those in the job that fails).
* When creating my "normal size" shards (the size I want, which works in 4.9), the job hangs in the MTree merge phase with 2 mappers running and 0 reducers.
* There are no errors or warnings in the syslog/stderr of the MTree mappers, and no errors are ever echoed back to the "interactive run" of the job (mapper says 100%, reduce says 0%, and it stays that way forever).
* No CPU is being used on the boxes running the merge, no GC is happening, the JVM is waiting on a futex, and all threads are blocked on various queues.
* There are no disk usage problems, and nothing else is obviously wrong with any box in the cluster.

I diffed 4.10 against 4.9 and see barely any changes in the mapred contrib, mostly some test stuff. I didn't see any transitive dependency changes in Solr/Lucene that look like they would affect me.
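In case it helps anyone trying to narrow down a hang like the one described above, here is a rough sketch of how thread states could be summarized from inside a JVM using the standard `java.lang.management` API. It is not part of Solr or Hadoop; the class name `HangSnapshot` and its methods are hypothetical names made up for illustration. It only inspects the JVM it runs in, so for an already-running mapper process the JDK's `jstack <pid>` is probably the more practical route to the same information.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.EnumMap;
import java.util.Map;

// Hypothetical helper (not from Solr/Hadoop): summarizes thread states
// in the current JVM and reports any deadlock cycle the JVM can detect.
public class HangSnapshot {

    /** Counts threads by state (RUNNABLE, WAITING, BLOCKED, ...). */
    public static Map<Thread.State, Integer> countByState(ThreadInfo[] infos) {
        Map<Thread.State, Integer> counts = new EnumMap<>(Thread.State.class);
        for (ThreadInfo info : infos) {
            if (info == null) {
                continue; // defensive: skip entries for threads that no longer exist
            }
            counts.merge(info.getThreadState(), 1, Integer::sum);
        }
        return counts;
    }

    /** Number of threads in a detected deadlock cycle, or 0 if none is found. */
    public static int deadlockedThreadCount(ThreadMXBean mx) {
        // findDeadlockedThreads() also covers java.util.concurrent locks,
        // but is only available when synchronizer monitoring is supported.
        long[] ids = mx.isSynchronizerUsageSupported()
                ? mx.findDeadlockedThreads()
                : mx.findMonitorDeadlockedThreads();
        return ids == null ? 0 : ids.length; // null means no deadlock detected
    }

    public static void main(String[] args) {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        ThreadInfo[] infos = mx.dumpAllThreads(
                mx.isObjectMonitorUsageSupported(),
                mx.isSynchronizerUsageSupported());

        System.out.println("Threads by state: " + countByState(infos));
        System.out.println("Deadlocked threads: " + deadlockedThreadCount(mx));

        // Print the (truncated) stack of every thread that is parked or blocked.
        for (ThreadInfo info : infos) {
            if (info == null) {
                continue;
            }
            Thread.State state = info.getThreadState();
            if (state == Thread.State.WAITING || state == Thread.State.BLOCKED) {
                System.out.print(info);
            }
        }
    }
}
```

A deadlock count of 0 does not rule out a hang: threads that are all parked waiting on queues that nothing will ever fill show up as WAITING, not as a deadlock cycle, so the per-thread stacks are usually the more informative part.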