I believe these are related (they are new to me). Has anyone seen anything like this in Solr mapred?

Error: java.io.IOException: org.apache.solr.client.solrj.SolrServerException: org.apache.solr.client.solrj.SolrServerException: org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?) : expected=5fb8f6da actual=8b048ec4 (resource=BufferedChecksumIndexInput(_1e_Lucene41_0.tip))
    at org.apache.solr.hadoop.SolrRecordWriter.close(SolrRecordWriter.java:307)
    at org.apache.hadoop.mapred.ReduceTask$NewTrackingRecordWriter.close(ReduceTask.java:558)
    at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:637)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:390)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:167)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162)
Caused by: org.apache.solr.client.solrj.SolrServerException: org.apache.solr.client.solrj.SolrServerException: org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?) : expected=5fb8f6da actual=8b048ec4 (resource=BufferedChecksumIndexInput(_1e_Lucene41_0.tip))
    at org.apache.solr.client.solrj.embedded.EmbeddedSolrServer.request(EmbeddedSolrServer.java:223)
    at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:124)
    at org.apache.solr.client.solrj.SolrServer.commit(SolrServer.java:168)
    at org.apache.solr.hadoop.BatchWriter.close(BatchWriter.java:200)
    at org.apache.solr.hadoop.SolrRecordWriter.close(SolrRecordWriter.java:295)
    ... 8 more
Caused by: org.apache.solr.client.solrj.SolrServerException: org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?) : expected=5fb8f6da actual=8b048ec4 (resource=BufferedChecksumIndexInput(_1e_Lucene41_0.tip))
    at org.apache.solr.client.solrj.embedded.EmbeddedSolrServer.request(EmbeddedSolrServer.java:155)
    ... 12 more
Caused by: org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?) : expected=5fb8f6da actual=8b048ec4 (resource=BufferedChecksumIndexInput(_1e_Lucene41_0.tip))
    at org.apache.lucene.codecs.CodecUtil.checkFooter(CodecUtil.java:211)
    at org.apache.lucene.codecs.CodecUtil.checksumEntireFile(CodecUtil.java:268)
    at org.apache.lucene.codecs.blocktree.BlockTreeTermsReader.<init>(BlockTreeTermsReader.java:125)
    at org.apache.lucene.codecs.lucene41.Lucene41PostingsFormat.fieldsProducer(Lucene41PostingsFormat.java:441)
    at org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$FieldsReader.<init>(PerFieldPostingsFormat.java:197)
    at org.apache.lucene.codecs.perfield.PerFieldPostingsFormat.fieldsProducer(PerFieldPostingsFormat.java:254)
    at org.apache.lucene.index.SegmentCoreReaders.<init>(SegmentCoreReaders.java:120)
    at org.apache.lucene.index.SegmentReader.<init>(SegmentReader.java:108)
    at org.apache.lucene.index.ReadersAndUpdates.getReader(ReadersAndUpdates.java:143)
    at org.apache.lucene.index.BufferedUpdatesStream.applyDeletesAndUpdates(BufferedUpdatesStream.java:282)
    at org.apache.lucene.index.IndexWriter.applyAllDeletesAndUpdates(IndexWriter.java:3315)
    at org.apache.lucene.index.IndexWriter.maybeApplyDeletes(IndexWriter.java:3306)
    at org.apache.lucene.index.IndexWriter.prepareCommitInternal(IndexWriter.java:3020)
    at org.apache.lucene.index.IndexWriter.commitInternal(IndexWriter.java:3169)
    at org.apache.lucene.index.IndexWriter.commit(IndexWriter.java:3136)
    at org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:582)
    at org.apache.solr.update.processor.RunUpdateProcessor.processCommit(RunUpdateProcessorFactory.java:95)
    at org.apache.solr.update.processor.UpdateRequestProcessor.processCommit(UpdateRequestProcessor.java:64)
    at org.apache.solr.update.processor.DistributedUpdateProcessor.doLocalCommit(DistributedUpdateProcessor.java:1648)
    at org.apache.solr.update.processor.DistributedUpdateProcessor.processCommit(DistributedUpdateProcessor.java:1625)
    at org.apache.solr.update.processor.LogUpdateProcessor.processCommit(LogUpdateProcessorFactory.java:157)
    at org.apache.solr.handler.RequestHandlerUtils.handleCommit(RequestHandlerUtils.java:69)
    at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:68)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1967)
    at org.apache.solr.client.solrj.embedded.EmbeddedSolrServer.request(EmbeddedSolrServer.java:150)
    ... 12 more

[...snip...]

another similar failure:

14/09/23 17:52:55 INFO mapreduce.Job: Task Id : attempt_1411487144915_0006_r_000046_0, Status : FAILED
Error: java.io.IOException: org.apache.solr.common.SolrException: Error opening new searcher
    at org.apache.solr.hadoop.SolrRecordWriter.close(SolrRecordWriter.java:307)
    at org.apache.hadoop.mapred.ReduceTask$NewTrackingRecordWriter.close(ReduceTask.java:558)
    at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:637)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:390)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:167)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162)
Caused by: org.apache.solr.common.SolrException: Error opening new searcher
    at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1565)
    at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1677)
    at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1421)
    at org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:615)
    at org.apache.solr.update.processor.RunUpdateProcessor.processCommit(RunUpdateProcessorFactory.java:95)
    at org.apache.solr.update.processor.UpdateRequestProcessor.processCommit(UpdateRequestProcessor.java:64)
    at org.apache.solr.update.processor.DistributedUpdateProcessor.doLocalCommit(DistributedUpdateProcessor.java:1648)
    at org.apache.solr.update.processor.DistributedUpdateProcessor.processCommit(DistributedUpdateProcessor.java:1625)
    at org.apache.solr.update.processor.LogUpdateProcessor.processCommit(LogUpdateProcessorFactory.java:157)
    at org.apache.solr.handler.RequestHandlerUtils.handleCommit(RequestHandlerUtils.java:69)
    at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:68)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1967)
    at org.apache.solr.client.solrj.embedded.EmbeddedSolrServer.request(EmbeddedSolrServer.java:150)
    at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:124)
    at org.apache.solr.client.solrj.SolrServer.commit(SolrServer.java:168)
    at org.apache.solr.hadoop.BatchWriter.close(BatchWriter.java:200)
    at org.apache.solr.hadoop.SolrRecordWriter.close(SolrRecordWriter.java:295)
    ... 8 more
Caused by: org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?) : expected=d9019857 actual=632aa4e2 (resource=BufferedChecksumIndexInput(_1i_Lucene41_0.tip))
    at org.apache.lucene.codecs.CodecUtil.checkFooter(CodecUtil.java:211)
    at org.apache.lucene.codecs.CodecUtil.checksumEntireFile(CodecUtil.java:268)
    at org.apache.lucene.codecs.blocktree.BlockTreeTermsReader.<init>(BlockTreeTermsReader.java:125)
    at org.apache.lucene.codecs.lucene41.Lucene41PostingsFormat.fieldsProducer(Lucene41PostingsFormat.java:441)
    at org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$FieldsReader.<init>(PerFieldPostingsFormat.java:197)
    at org.apache.lucene.codecs.perfield.PerFieldPostingsFormat.fieldsProducer(PerFieldPostingsFormat.java:254)
    at org.apache.lucene.index.SegmentCoreReaders.<init>(SegmentCoreReaders.java:120)
    at org.apache.lucene.index.SegmentReader.<init>(SegmentReader.java:108)
    at org.apache.lucene.index.ReadersAndUpdates.getReader(ReadersAndUpdates.java:143)
    at org.apache.lucene.index.ReadersAndUpdates.getReadOnlyClone(ReadersAndUpdates.java:237)
    at org.apache.lucene.index.StandardDirectoryReader.open(StandardDirectoryReader.java:104)
    at org.apache.lucene.index.IndexWriter.getReader(IndexWriter.java:426)
    at org.apache.lucene.index.StandardDirectoryReader.doOpenFromWriter(StandardDirectoryReader.java:292)
    at org.apache.lucene.index.StandardDirectoryReader.doOpenIfChanged(StandardDirectoryReader.java:277)
    at org.apache.lucene.index.DirectoryReader.openIfChanged(DirectoryReader.java:251)
    at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1476)
    ... 25 more

On Tue, Sep 16, 2014 at 12:54 PM, Brett Hoerner <br...@bretthoerner.com> wrote:

> I have a very weird problem that I'm going to try to describe here to see
> if anyone has any "ah-ha" moments or clues. I haven't created a small
> reproducible project for this but I guess I will have to try in the future
> if I can't figure it out. (Or I'll need to bisect by running long Hadoop
> jobs...)
>
> So, the facts:
>
> * Have been successfully using Solr mapred to build very large Solr
> clusters for months
> * As of Solr 4.10 *some* job sizes repeatably hang in the MTree merge
> phase in 4.10
> * Those same jobs (same input, output, and Hadoop cluster itself) succeed
> if I only change my Solr deps to 4.9
> * The job *does succeed* in 4.10 if I use the same data to create more,
> but smaller shards (e.g. 12x as many shards each 1/12th the size of the job
> that fails)
> * Creating my "normal size" shards (the size I want, that works in 4.9)
> the job hangs with 2 mappers running, 0 reducers in the MTree merge phase
> * There are no errors or warning in the syslog/stderr of the MTree
> mappers, no errors ever echo'd back to the "interactive run" of the job
> (mapper says 100%, reduce says 0%, will stay forever)
> * No CPU being used on the boxes running the merge, no GC happening, JVM
> waiting on a futex, all threads blocked on various queues
> * No disk usage problems, nothing else obviously wrong with any box in the
> cluster
>
> I diff'ed around between 4.10 and 4.9 and barely see any changes in mapred
> contrib, mostly some test stuff. I didn't see any transitive dependency
> changes in Solr/Lucene that look like they would affect me.
>