I believe these are related (they're new to me). Has anyone seen anything
like this in Solr mapred?



Error: java.io.IOException: org.apache.solr.client.solrj.SolrServerException: org.apache.solr.client.solrj.SolrServerException: org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?) : expected=5fb8f6da actual=8b048ec4 (resource=BufferedChecksumIndexInput(_1e_Lucene41_0.tip))
        at org.apache.solr.hadoop.SolrRecordWriter.close(SolrRecordWriter.java:307)
        at org.apache.hadoop.mapred.ReduceTask$NewTrackingRecordWriter.close(ReduceTask.java:558)
        at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:637)
        at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:390)
        at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:167)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
        at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162)
Caused by: org.apache.solr.client.solrj.SolrServerException: org.apache.solr.client.solrj.SolrServerException: org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?) : expected=5fb8f6da actual=8b048ec4 (resource=BufferedChecksumIndexInput(_1e_Lucene41_0.tip))
        at org.apache.solr.client.solrj.embedded.EmbeddedSolrServer.request(EmbeddedSolrServer.java:223)
        at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:124)
        at org.apache.solr.client.solrj.SolrServer.commit(SolrServer.java:168)
        at org.apache.solr.hadoop.BatchWriter.close(BatchWriter.java:200)
        at org.apache.solr.hadoop.SolrRecordWriter.close(SolrRecordWriter.java:295)
        ... 8 more
Caused by: org.apache.solr.client.solrj.SolrServerException: org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?) : expected=5fb8f6da actual=8b048ec4 (resource=BufferedChecksumIndexInput(_1e_Lucene41_0.tip))
        at org.apache.solr.client.solrj.embedded.EmbeddedSolrServer.request(EmbeddedSolrServer.java:155)
        ... 12 more
Caused by: org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?) : expected=5fb8f6da actual=8b048ec4 (resource=BufferedChecksumIndexInput(_1e_Lucene41_0.tip))
        at org.apache.lucene.codecs.CodecUtil.checkFooter(CodecUtil.java:211)
        at org.apache.lucene.codecs.CodecUtil.checksumEntireFile(CodecUtil.java:268)
        at org.apache.lucene.codecs.blocktree.BlockTreeTermsReader.<init>(BlockTreeTermsReader.java:125)
        at org.apache.lucene.codecs.lucene41.Lucene41PostingsFormat.fieldsProducer(Lucene41PostingsFormat.java:441)
        at org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$FieldsReader.<init>(PerFieldPostingsFormat.java:197)
        at org.apache.lucene.codecs.perfield.PerFieldPostingsFormat.fieldsProducer(PerFieldPostingsFormat.java:254)
        at org.apache.lucene.index.SegmentCoreReaders.<init>(SegmentCoreReaders.java:120)
        at org.apache.lucene.index.SegmentReader.<init>(SegmentReader.java:108)
        at org.apache.lucene.index.ReadersAndUpdates.getReader(ReadersAndUpdates.java:143)
        at org.apache.lucene.index.BufferedUpdatesStream.applyDeletesAndUpdates(BufferedUpdatesStream.java:282)
        at org.apache.lucene.index.IndexWriter.applyAllDeletesAndUpdates(IndexWriter.java:3315)
        at org.apache.lucene.index.IndexWriter.maybeApplyDeletes(IndexWriter.java:3306)
        at org.apache.lucene.index.IndexWriter.prepareCommitInternal(IndexWriter.java:3020)
        at org.apache.lucene.index.IndexWriter.commitInternal(IndexWriter.java:3169)
        at org.apache.lucene.index.IndexWriter.commit(IndexWriter.java:3136)
        at org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:582)
        at org.apache.solr.update.processor.RunUpdateProcessor.processCommit(RunUpdateProcessorFactory.java:95)
        at org.apache.solr.update.processor.UpdateRequestProcessor.processCommit(UpdateRequestProcessor.java:64)
        at org.apache.solr.update.processor.DistributedUpdateProcessor.doLocalCommit(DistributedUpdateProcessor.java:1648)
        at org.apache.solr.update.processor.DistributedUpdateProcessor.processCommit(DistributedUpdateProcessor.java:1625)
        at org.apache.solr.update.processor.LogUpdateProcessor.processCommit(LogUpdateProcessorFactory.java:157)
        at org.apache.solr.handler.RequestHandlerUtils.handleCommit(RequestHandlerUtils.java:69)
        at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:68)
        at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
        at org.apache.solr.core.SolrCore.execute(SolrCore.java:1967)
        at org.apache.solr.client.solrj.embedded.EmbeddedSolrServer.request(EmbeddedSolrServer.java:150)
        ... 12 more
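For context, the innermost frame (CodecUtil.checkFooter) is Lucene recomputing a CRC32 over the segment file and comparing it against the value stored at the end of the file when it was written. The sketch below is only an illustration of that style of check, not Lucene's actual footer format (which also carries a magic number and algorithm ID); the class and method names here are made up:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.util.Arrays;
import java.util.zip.CRC32;

// Illustrative sketch: writer appends a CRC32 of the body to the file,
// reader recomputes it over the body and compares, the way
// CodecUtil.checksumEntireFile does for files like _1e_Lucene41_0.tip.
public class Main {
    // Hypothetical helper: body bytes followed by an 8-byte CRC32 "footer".
    static byte[] withFooter(byte[] body) {
        CRC32 crc = new CRC32();
        crc.update(body, 0, body.length);
        ByteBuffer buf = ByteBuffer.allocate(body.length + 8);
        buf.put(body).putLong(crc.getValue());
        return buf.array();
    }

    // Recompute the checksum over the body and compare to the stored value.
    static void check(byte[] file) throws IOException {
        byte[] body = Arrays.copyOfRange(file, 0, file.length - 8);
        long expected = ByteBuffer.wrap(file, file.length - 8, 8).getLong();
        CRC32 crc = new CRC32();
        crc.update(body, 0, body.length);
        if (crc.getValue() != expected) {
            throw new IOException(String.format(
                "checksum failed: expected=%x actual=%x",
                expected, crc.getValue()));
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] file = withFooter("segment terms data".getBytes());
        check(file);          // intact file verifies cleanly
        file[3] ^= 0x01;      // flip one bit to simulate on-disk corruption
        try {
            check(file);
        } catch (IOException e) {
            System.out.println("corrupt: " + e.getMessage());
        }
    }
}
```

A single flipped bit between write and read is enough to make the recomputed value diverge from the stored one, which is exactly the expected=…/actual=… pair shown in the trace.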




[...snip...] another similar failure:




14/09/23 17:52:55 INFO mapreduce.Job: Task Id : attempt_1411487144915_0006_r_000046_0, Status : FAILED
Error: java.io.IOException: org.apache.solr.common.SolrException: Error opening new searcher
        at org.apache.solr.hadoop.SolrRecordWriter.close(SolrRecordWriter.java:307)
        at org.apache.hadoop.mapred.ReduceTask$NewTrackingRecordWriter.close(ReduceTask.java:558)
        at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:637)
        at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:390)
        at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:167)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
        at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162)
Caused by: org.apache.solr.common.SolrException: Error opening new searcher
        at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1565)
        at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1677)
        at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1421)
        at org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:615)
        at org.apache.solr.update.processor.RunUpdateProcessor.processCommit(RunUpdateProcessorFactory.java:95)
        at org.apache.solr.update.processor.UpdateRequestProcessor.processCommit(UpdateRequestProcessor.java:64)
        at org.apache.solr.update.processor.DistributedUpdateProcessor.doLocalCommit(DistributedUpdateProcessor.java:1648)
        at org.apache.solr.update.processor.DistributedUpdateProcessor.processCommit(DistributedUpdateProcessor.java:1625)
        at org.apache.solr.update.processor.LogUpdateProcessor.processCommit(LogUpdateProcessorFactory.java:157)
        at org.apache.solr.handler.RequestHandlerUtils.handleCommit(RequestHandlerUtils.java:69)
        at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:68)
        at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
        at org.apache.solr.core.SolrCore.execute(SolrCore.java:1967)
        at org.apache.solr.client.solrj.embedded.EmbeddedSolrServer.request(EmbeddedSolrServer.java:150)
        at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:124)
        at org.apache.solr.client.solrj.SolrServer.commit(SolrServer.java:168)
        at org.apache.solr.hadoop.BatchWriter.close(BatchWriter.java:200)
        at org.apache.solr.hadoop.SolrRecordWriter.close(SolrRecordWriter.java:295)
        ... 8 more
Caused by: org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?) : expected=d9019857 actual=632aa4e2 (resource=BufferedChecksumIndexInput(_1i_Lucene41_0.tip))
        at org.apache.lucene.codecs.CodecUtil.checkFooter(CodecUtil.java:211)
        at org.apache.lucene.codecs.CodecUtil.checksumEntireFile(CodecUtil.java:268)
        at org.apache.lucene.codecs.blocktree.BlockTreeTermsReader.<init>(BlockTreeTermsReader.java:125)
        at org.apache.lucene.codecs.lucene41.Lucene41PostingsFormat.fieldsProducer(Lucene41PostingsFormat.java:441)
        at org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$FieldsReader.<init>(PerFieldPostingsFormat.java:197)
        at org.apache.lucene.codecs.perfield.PerFieldPostingsFormat.fieldsProducer(PerFieldPostingsFormat.java:254)
        at org.apache.lucene.index.SegmentCoreReaders.<init>(SegmentCoreReaders.java:120)
        at org.apache.lucene.index.SegmentReader.<init>(SegmentReader.java:108)
        at org.apache.lucene.index.ReadersAndUpdates.getReader(ReadersAndUpdates.java:143)
        at org.apache.lucene.index.ReadersAndUpdates.getReadOnlyClone(ReadersAndUpdates.java:237)
        at org.apache.lucene.index.StandardDirectoryReader.open(StandardDirectoryReader.java:104)
        at org.apache.lucene.index.IndexWriter.getReader(IndexWriter.java:426)
        at org.apache.lucene.index.StandardDirectoryReader.doOpenFromWriter(StandardDirectoryReader.java:292)
        at org.apache.lucene.index.StandardDirectoryReader.doOpenIfChanged(StandardDirectoryReader.java:277)
        at org.apache.lucene.index.DirectoryReader.openIfChanged(DirectoryReader.java:251)
        at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1476)
        ... 25 more


On Tue, Sep 16, 2014 at 12:54 PM, Brett Hoerner <br...@bretthoerner.com>
wrote:

> I have a very weird problem that I'm going to try to describe here to see
> if anyone has any "ah-ha" moments or clues. I haven't created a small
> reproducible project for this but I guess I will have to try in the future
> if I can't figure it out. (Or I'll need to bisect by running long Hadoop
> jobs...)
>
> So, the facts:
>
> * Have been successfully using Solr mapred to build very large Solr
> clusters for months
> * As of Solr 4.10 *some* job sizes repeatably hang in the MTree merge
> phase in 4.10
> * Those same jobs (same input, output, and Hadoop cluster itself) succeed
> if I only change my Solr deps to 4.9
> * The job *does succeed* in 4.10 if I use the same data to create more,
> but smaller shards (e.g. 12x as many shards each 1/12th the size of the job
> that fails)
> * Creating my "normal size" shards (the size I want, that works in 4.9)
> the job hangs with 2 mappers running, 0 reducers in the MTree merge phase
> * There are no errors or warnings in the syslog/stderr of the MTree
> mappers, and no errors are ever echoed back to the "interactive run" of the
> job (mapper says 100%, reduce says 0%, and it stays that way forever)
> * No CPU being used on the boxes running the merge, no GC happening, JVM
> waiting on a futex, all threads blocked on various queues
> * No disk usage problems, nothing else obviously wrong with any box in the
> cluster
>
> I diff'ed around between 4.10 and 4.9 and barely see any changes in mapred
> contrib, mostly some test stuff. I didn't see any transitive dependency
> changes in Solr/Lucene that look like they would affect me.
>
