I have a very weird problem that I'm going to try to describe here to see
if anyone has any "ah-ha" moments or clues. I haven't created a small
reproducible project for this but I guess I will have to try in the future
if I can't figure it out. (Or I'll need to bisect by running long Hadoop
jobs...)

So, the facts:

* Have been successfully using Solr mapred to build very large Solr
clusters for months
* As of Solr 4.10, *some* job sizes repeatably hang in the MTree merge
phase
* Those same jobs (same input, output, and Hadoop cluster itself) succeed
if I only change my Solr deps to 4.9
* The job *does succeed* in 4.10 if I use the same data to create more, but
smaller shards (e.g. 12x as many shards each 1/12th the size of the job
that fails)
* When creating my "normal size" shards (the size I want, that works in
4.9), the job hangs in the MTree merge phase with 2 mappers running and 0
reducers
* There are no errors or warnings in the syslog/stderr of the MTree mappers,
and no errors are ever echoed back to the "interactive run" of the job
(map says 100%, reduce says 0%, and it stays that way forever)
* No CPU being used on the boxes running the merge, no GC happening, JVM
waiting on a futex, all threads blocked on various queues
* No disk usage problems, nothing else obviously wrong with any box in the
cluster

I diff'ed around between 4.10 and 4.9 and barely see any changes in mapred
contrib, mostly some test stuff. I didn't see any transitive dependency
changes in Solr/Lucene that look like they would affect me.
