corrupt solr index on ec2
Hi,

I've been running solr 1.3 on an ec2 instance for a couple of weeks and I've had some stability issues. It seems like I need to bounce the app once a day. That I could live with and ultimately maybe troubleshoot, but what's more disturbing is that three times in the last 2 weeks my index has been corrupted when FileNotFoundExceptions started to appear.

I'm running in jetty and had my index on the local file system until I lost the index the first time. Then I moved it to my mounted ebs volume so I could restore from a snapshot if needed. I'm wondering if perhaps there are issues with the locking mechanism on either the local directory (which is really a virtual instance) or the mounted xfs volume. Has anyone seen this, or have suggestions re the cause? I'm using the single lockType.

I'm running a single solr instance that gets frequent updates from multiple threads, and commits about every hour.

A few things I see in the logs:

- From time to time I see write lock timeouts:
  SEVERE: org.apache.lucene.store.LockObtainFailedException: Lock obtain timed out: SingleInstanceLock: write.lock

- I've seen OOM exceptions during warming.
  I've set maxWarmingSearchers=1, which I suspect will do the trick.

- Then finally, this is what I found in the logs today when the index got corrupt:

Oct 29, 2008 12:18:39 AM org.apache.solr.update.DirectUpdateHandler2 commit
INFO: start commit(optimize=false,waitFlush=true,waitSearcher=true)
Oct 29, 2008 12:18:41 AM org.apache.solr.common.SolrException log
SEVERE: java.lang.RuntimeException: java.io.FileNotFoundException: /var/local/solr/data/production/index/_2rv.fdt (No such file or directory)
    at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:960)
    at org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:368)
    at org.apache.solr.update.processor.RunUpdateProcessor.processCommit(RunUpdateProcessorFactory.java:77)
    at org.apache.solr.handler.XmlUpdateRequestHandler.processUpdate(XmlUpdateRequestHandler.java:226)
    at org.apache.solr.handler.XmlUpdateRequestHandler.handleRequestBody(XmlUpdateRequestHandler.java:123)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1204)
    at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:303)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:232)
    at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1089)
    at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:365)
    at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
    at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
    at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:712)
    at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:405)
    at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:211)
    at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
    at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:139)
    at org.mortbay.jetty.Server.handle(Server.java:285)
    at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:502)
    at org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:835)
    at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:641)
    at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:202)
    at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:378)
    at org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:226)
    at org.mortbay.thread.BoundedThreadPool$PoolThread.run(BoundedThreadPool.java:442)
Caused by: java.io.FileNotFoundException: /var/local/solr/data/production/index/_2rv.fdt (No such file or directory)
    at java.io.RandomAccessFile.open(Native Method)
    at java.io.RandomAccessFile.<init>(RandomAccessFile.java:212)
    at org.apache.lucene.store.FSDirectory$FSIndexInput$Descriptor.<init>(FSDirectory.java:552)
    at org.apache.lucene.store.FSDirectory$FSIndexInput.<init>(FSDirectory.java:582)
    at org.apache.lucene.store.FSDirectory.openInput(FSDirectory.java:488)
    at org.apache.lucene.index.FieldsReader.<init>(FieldsReader.java:77)
    at org.apache.lucene.index.SegmentReader.initialize(SegmentReader.java:355)
    at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:304)
    at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:226)
    at org.apache.lucene.index.MultiSegmentReader.<init>(MultiSegmentReader.java:56)
    at org.apache.lucene.index.ReadOnlyMultiSegmentReader.<init>(ReadOnlyMultiSegmentReader.java:27)
    at org.apache.lucene.index.D
Re: corrupt solr index on ec2
On Thu, Oct 30, 2008 at 2:06 AM, Bill Graham <[EMAIL PROTECTED]> wrote:
> I've been running solr 1.3 on an ec2 instance for a couple of weeks and I've
> had some stability issues. It seems like I need to bounce the app once a day.
> That I could live with and ultimately maybe troubleshoot, but what's more
> disturbing is that three times in the last 2 weeks my index has been
> corrupted when FileNotFoundExceptions started to appear.
>
> I'm running in jetty and had my index on the local file system until I lost
> the index the first time. Then I moved it to my mounted ebs volume so I could
> restore from a snapshot if needed. I'm wondering if perhaps there are issues
> with the locking mechanism on either the local directory (which is really a
> virtual instance) or the mounted xfs volume. Has anyone seen this, or have
> suggestions re the cause? I'm using the single lockType.
>
> I'm running a single solr instance that gets frequent updates from multiple
> threads, and commits about every hour.
>
> A few things I see in the logs:
>
> - From time to time I see write lock timeouts:
> SEVERE: org.apache.lucene.store.LockObtainFailedException: Lock obtain timed
> out: SingleInstanceLock: write.lock

This is really strange. It suggests that there is another in-process writer that is holding the lock. That should be impossible, unless it's caused by a previous exception trying to open an IndexWriter and the lock is simply stale. What seems to be the first exception that occurs?

Also, you might try changing the lock type from single to simple to make it visible cross-process. That would rule out another solr instance being started on the same index directory. Opening two writers on the same directory is one way to get missing files like you appear to have.

> - I've seen OOM exceptions during warming. I've changed
> maxWarmingSearchers=1, which I suspect will do the trick

OOM errors are really tricky - if they happen in the wrong place, it's hard to recover gracefully. Correctly cleaning up after an OOM error in the IndexWriter recently had some little fixes in lucene trunk - you might want to try the latest dev version of Lucene and see if it helps.

-Yonik
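
A note on the two lock types mentioned in this message: the reply describes single as an in-process lock and simple as one that is visible to other processes. The sketch below only illustrates that difference using the Java standard library. The class names (LockSketch, InProcessLock, MarkerFileLock) are hypothetical and are not Lucene's or Solr's lock factories; the assumption is that an in-process lock tracks held names in JVM memory, while a cross-process lock relies on a marker file created atomically in the index directory.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashSet;
import java.util.Set;

/** Illustrative only: hypothetical classes, not Lucene's lock implementations. */
class LockSketch {

    /** "single"-style lock: held names live in this JVM's memory only. */
    static final class InProcessLock {
        private static final Set<String> HELD = new HashSet<>();
        private final String name;

        InProcessLock(String name) { this.name = name; }

        /** Returns false if another holder in this JVM already has the name. */
        boolean obtain() {
            synchronized (HELD) { return HELD.add(name); }
        }

        /** If this is never called (e.g. after an exception), the lock stays stale. */
        void release() {
            synchronized (HELD) { HELD.remove(name); }
        }
    }

    /** "simple"-style lock: a marker file that any process on the host can see. */
    static final class MarkerFileLock {
        private final Path lockFile;

        MarkerFileLock(Path indexDir, String name) { this.lockFile = indexDir.resolve(name); }

        boolean obtain() {
            try {
                Files.createFile(lockFile); // atomic: fails if the file already exists
                return true;
            } catch (FileAlreadyExistsException e) {
                return false;               // some other writer holds the lock
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }

        void release() {
            try {
                Files.deleteIfExists(lockFile);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
    }

    /** Returns {first obtain, second obtain, obtain after release} for a marker-file lock. */
    static boolean[] demoMarkerFileLock() {
        try {
            Path dir = Files.createTempDirectory("index-lock-sketch");
            MarkerFileLock first = new MarkerFileLock(dir, "write.lock");
            MarkerFileLock second = new MarkerFileLock(dir, "write.lock");
            boolean a = first.obtain();
            boolean b = second.obtain();   // blocked while the marker file exists
            first.release();
            boolean c = second.obtain();   // succeeds once the file is gone
            second.release();
            Files.deleteIfExists(dir);
            return new boolean[] { a, b, c };
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

In this sketch a second JVM pointed at the same directory would never see an InProcessLock held by the first one, which is why an in-process lock cannot rule out a second instance writing to the same index, whereas the marker file can.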
Re: corrupt solr index on ec2
One small correction below:

Yonik Seeley wrote:
>> - I've seen OOM exceptions during warming. I've changed
>> maxWarmingSearchers=1, which I suspect will do the trick
>
> OOM errors are really tricky - if they happen in the wrong place, it's
> hard to recover gracefully. Correctly cleaning up after an OOM error in
> the IndexWriter recently had some little fixes in lucene trunk - you
> might want to try the latest dev version of Lucene and see if it helps.

This change (to not commit index changes after IndexWriter hits OOME) went in Feb 2008. Solr 1.3 should already have it. (I'm working now on adding javadocs to IW explaining this.)

Mike
Re: corrupt solr index on ec2
I'm in the process of moving solr to another host with more memory, since the box I'm on is pretty tight on memory.

thanks!
Bill
Re: corrupt solr index on ec2
Bill Graham wrote:
> Then it seemed to run well for about an hour and I saw this:
>
> Oct 28, 2008 10:38:51 PM org.apache.solr.update.DirectUpdateHandler2 commit
> INFO: start commit(optimize=false,waitFlush=true,waitSearcher=true)
> Oct 28, 2008 10:38:51 PM org.apache.solr.common.SolrException log
> SEVERE: java.lang.RuntimeException: after flush: fdx size mismatch: 1156 docs vs 0 length in bytes of _2rv.fdx
>     at org.apache.lucene.index.StoredFieldsWriter.closeDocStore(StoredFieldsWriter.java:94)
>     at org.apache.lucene.index.DocFieldConsumers.closeDocStore(DocFieldConsumers.java:83)
>     at org.apache.lucene.index.DocFieldProcessor.closeDocStore(DocFieldProcessor.java:47)
>     at org.apache.lucene.index.DocumentsWriter.closeDocStore(DocumentsWriter.java:367)
>     at org.apache.lucene.index.IndexWriter.flushDocStores(IndexWriter.java:1774)

This particular exception is very spooky -- it really looks like something is removing the index files (such as accidentally opening a 2nd writer on the index).

Mike
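
For readers wondering what the "fdx size mismatch" message is comparing: the sketch below works through the arithmetic for the numbers in the log. It assumes, from memory of the Lucene 2.4-era stored fields format and not from anything stated in this thread, that the .fdx file holds a 4-byte header followed by one 8-byte pointer per document. The class FdxSizeCheck and its methods are hypothetical helpers, not Lucene API.

```java
/** Hypothetical helper; the 4-byte header and 8 bytes per document are assumptions. */
class FdxSizeCheck {
    static final long HEADER_BYTES = 4;   // assumed format-version header
    static final long BYTES_PER_DOC = 8;  // assumed one long pointer per stored document

    /** Length the .fdx file would be expected to have for the given document count. */
    static long expectedFdxLength(long numDocs) {
        return HEADER_BYTES + BYTES_PER_DOC * numDocs;
    }

    /** Human-readable comparison of expected and actual file length. */
    static String describe(String fileName, long numDocs, long actualLength) {
        long expected = expectedFdxLength(numDocs);
        if (expected == actualLength) {
            return fileName + ": ok (" + actualLength + " bytes)";
        }
        return fileName + ": expected " + expected + " bytes for "
                + numDocs + " docs, found " + actualLength;
    }

    public static void main(String[] args) {
        // Values taken from the log line quoted above.
        System.out.println(describe("_2rv.fdx", 1156, 0));
    }
}
```

Under those assumptions the writer had buffered 1156 documents and expected roughly 9 KB on disk, but found an empty file, which fits the reading that the file was removed or replaced underneath the writer.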