corrupted index
Somehow today one of my indexes became corrupted. I get the following IOException when trying to open the index:

Exception in thread "main" java.io.IOException: read past EOF
    at org.en.lucene.store.InputStream.refill(InputStream.java:154)
    at org.en.lucene.store.InputStream.readByte(InputStream.java:43)
    at org.en.lucene.store.InputStream.readVInt(InputStream.java:83)
    at org.en.lucene.index.FieldInfos.read(FieldInfos.java:195)
    at org.en.lucene.index.FieldInfos.init(FieldInfos.java:55)
    at org.en.lucene.index.SegmentReader.initialize(SegmentReader.java:109)
    at org.en.lucene.index.SegmentReader.init(SegmentReader.java:94)
    at org.en.lucene.index.IndexWriter.mergeSegments(IndexWriter.java:480)
    at org.en.lucene.index.IndexWriter.maybeMergeSegments(IndexWriter.java:458)
    at org.en.lucene.index.IndexWriter.addDocument(IndexWriter.java:310)
    at org.en.lucene.index.IndexWriter.addDocument(IndexWriter.java:294)
    at org.en.global.indexer2.Minnow.main(Minnow.java:142)

Any ideas on what could cause this type of corruption, and what I can do to avoid it in the future? Also, any ideas on repairing the index if this happens? I removed the index directory and marked the rows to be reindexed from the database, but the data is unavailable to my users while the index rebuilds.

- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: corrupted index
Hello,

Nobody has contributed a tool that verifies index integrity yet. Is this the latest version of Lucene? Are you hitting the 2GB/file limit? Just some ideas.

Otis

--- H S [EMAIL PROTECTED] wrote:

Dear All,

We are experiencing a problem with index updates. We have a fairly large index (10 gigabytes). There are no problems searching it. But when we add a single file and then try to optimize, optimization fails with a null pointer exception in RandomAccessFile.seek. Has anybody come across this problem? Is there a way to tell whether an index is corrupted?

Thanks very much - Hinrich Schuetze
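The 2GB/file question above can be checked without touching Lucene at all, by listing the files in the index directory that have reached a given size. The following is a minimal sketch, not a Lucene tool: the class name LargeFileCheck and its method are hypothetical, and treating the limit as 2^31 - 1 bytes is an assumption, since the thread does not say whether the limit comes from the file system, the JVM, or Lucene itself.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of Lucene): reports index files at or above a size limit. */
final class LargeFileCheck {

    /** Assumed meaning of "2GB": the largest value of a signed 32-bit offset. */
    static final long TWO_GB_LIMIT = Integer.MAX_VALUE;

    /** Returns the regular files directly inside indexDir whose size is >= limitBytes. */
    static List<Path> filesAtOrAbove(Path indexDir, long limitBytes) throws IOException {
        List<Path> hits = new ArrayList<>();
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(indexDir)) {
            for (Path entry : entries) {
                if (Files.isRegularFile(entry) && Files.size(entry) >= limitBytes) {
                    hits.add(entry);
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) throws IOException {
        // Usage: java LargeFileCheck /path/to/index
        for (Path file : filesAtOrAbove(Paths.get(args[0]), TWO_GB_LIMIT)) {
            System.out.println(file + " has reached the assumed 2GB limit");
        }
    }
}
```

An empty result does not prove the index is healthy; it only rules out this one suspected cause.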
RE: corrupted index
Hinrich,

Can you please send a stack trace? As others have mentioned, there isn't an index integrity checker.

Doug

P.S. Hi! How are you?

-Original Message-
From: H S [mailto:[EMAIL PROTECTED]]
Sent: Monday, April 01, 2002 5:26 PM
To: [EMAIL PROTECTED]
Subject: corrupted index
Re: corrupted index
Otis,

You can remove the .lock file and try re-indexing or continuing indexing where you left off. I am not sure about the corrupt index. I have never seen it happen, and I believe I recall reading some messages from Doug Cutting saying that the index should never be left in an inconsistent state.

Obviously never should be, but if something's pulling the rug out from under his JRE, changes could be only partially written, right? Or is the writing format in some sense transactionally safe? I've never worked directly on something like this, but I worked at a database software company where they used transaction semantics and a journaling scheme to fake a bulletproof file system. Is this how the index-writing code is implemented?

In general, I can guess Doug's response - just torch the old index directory and rebuild it; Lucene's indexing is fast enough that you don't need to get clever. This seems to be Doug's stance in general (i.e. don't get fancy, I already put all the fanciness you'll need into extremely fast indexing and searching). So far, it seems to work :-).

Yes, but it's not too difficult to make it work even faster. Back up your indexes and give all your imports an option to work incrementally. Then, if something goes wrong, copy from the backup and restart your import in incremental mode.

I could be making this up, though, so I suggest you search through lucene-user and lucene-dev archives on www.mail-archive.com. A search for corrupt should do it. Once you figure things out maybe you can post a summary here.

I got a little curious, so I went and did the searches. There is exactly one message in each list archive (dev and users) with the keyword corrupt in it.

The lucene-users instance is irrelevant:
http://www.mail-archive.com/lucene-user@jakarta.apache.org/msg00557.html

The lucene-dev instance is more useful:
http://www.mail-archive.com/lucene-dev@jakarta.apache.org/msg00157.html

It's a post from Doug, dated Sept. 27, 2001, about adding not just thread-safety but process-safety:

It should be impossible to corrupt an index through the Lucene API. However if a Lucene process exits unexpectedly it can leave the index locked. The remedy is simply to, at a time when it is certain that no processes are accessing the index, remove all lock files.

Note that this assumes that your file system works as advertised in the java.io API. If there are occasional moments when it doesn't, you'll have to clean up the mess yourself.

So it sounds like it's worth trying just removing the lock files. Hm, is there a way to come up with a sanity check you can run on an index to make sure it's not corrupted? This might be an excellent thing to reassure yourself with: something went wrong? Run a sanity check; if it fails, just reindex.

One sanity check is to delete a document, add it and reoptimize. I have had document ordering/numbering exceptions from the optimize() call, so I concluded optimize() does at least some sanity checks when it performs actual work. This makes optimize() an even nicer preparation for backup.

Regards, Ype
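The backup-and-restore routine suggested above (back up the index, and copy it back if something goes wrong) could look roughly like the sketch below. It uses only java.nio.file and is not a Lucene API. The class IndexBackup and its methods are hypothetical names, and the sketch assumes that the index is a flat directory of regular files and that no process is writing to it while the copy runs.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of Lucene): flat-directory backup and restore. */
final class IndexBackup {

    /** Lists the regular files directly inside dir (subdirectories are ignored). */
    private static List<Path> regularFiles(Path dir) throws IOException {
        List<Path> result = new ArrayList<>();
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(dir)) {
            for (Path entry : entries) {
                if (Files.isRegularFile(entry)) {
                    result.add(entry);
                }
            }
        }
        return result;
    }

    /** Copies every regular file in source into target, creating target if needed. */
    static void copyFlat(Path source, Path target) throws IOException {
        Files.createDirectories(target);
        for (Path file : regularFiles(source)) {
            Path destination = target.resolve(file.getFileName().toString());
            Files.copy(file, destination, StandardCopyOption.REPLACE_EXISTING);
        }
    }

    /** Deletes the regular files in live, then copies the backup in their place. */
    static void restore(Path backup, Path live) throws IOException {
        Files.createDirectories(live);
        for (Path file : regularFiles(live)) {
            Files.delete(file);
        }
        copyFlat(backup, live);
    }
}
```

After a restore, the import would be restarted in incremental mode so that only documents changed since the backup are reindexed. How those documents are identified depends on the application and is outside this sketch.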
RE: corrupted index
Hey all,

Actually, using shutdown hooks might not be the best idea since Lucene is very often used in server-side Java environments. Many app-servers throw security errors when trying to add shutdown hooks, and I've seen WebLogic crash before when having them in a webapp. Has anyone else run into this?

This all brings up a key issue with Lucene, which is that there is little way to recover from errors gracefully. I'd love to see a number of checked exceptions added. For example:

IndexNotFoundException -- when trying to open an index that doesn't exist
IndexLockedException -- when a lock file prevents you from getting an index
IndexCorruptException -- maybe this would be thrown when an index appears to be broken?

At the moment, Lucene throws many undocumented IOExceptions and even NullPointerExceptions when an error case comes up. I catch these in my app, but there's really not an intelligent way to recover from them. Adding checked exceptions would be a change of the API, but it seems worth it. I'd be happy to make a more specific proposal if other people feel like this would be a worthwhile direction to go in.

Regards, Matt

Quoting Spencer, Dave [EMAIL PROTECTED]:

Runtime.addShutdownHook:
http://java.sun.com/j2se/1.3/docs/api/java/lang/Runtime.html#addShutdownHook(java.lang.Thread)

-Original Message-
From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]]
Sent: Sunday, March 17, 2002 12:06 AM
To: Lucene Users List
Subject: Re: corrupted index

Oh, I just thought of something (wine does body good). Perhaps one could use Runtime (the class) to catch the JVM shutdown and do whatever is needed to prevent index corruption. I believe there are some shutdown hook methods in there that may let you do that. I'm too lazy to look up the API docs now, but I remember reading about that once, and perhaps it was even mentioned on one of the 2 Lucene mailing lists.

On the other hand, it would be great to have a tool that can verify an existing index. I don't know enough about the actual file structure yet to write something like that, but maybe somebody else has done that already or would like to contribute.

Otis
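To make Matt's proposal concrete, here is a minimal sketch of what the three checked exceptions might look like in application code. None of this is the Lucene API described in this thread: the exception classes are declared here only for illustration, the IndexPrecheck class is hypothetical, and the lock file name and required file name are passed in by the caller because the thread does not specify them. Making the exceptions subclasses of IOException is a design assumption, chosen so that existing code that catches IOException would keep working.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Illustrative only: thrown when the index directory does not exist. */
class IndexNotFoundException extends IOException {
    IndexNotFoundException(String message) { super(message); }
}

/** Illustrative only: thrown when a lock file is present in the index directory. */
class IndexLockedException extends IOException {
    IndexLockedException(String message) { super(message); }
}

/** Illustrative only: thrown when the index appears to be broken. */
class IndexCorruptException extends IOException {
    IndexCorruptException(String message) { super(message); }
}

/** Hypothetical pre-open check that turns three failure cases into distinct exceptions. */
final class IndexPrecheck {

    /**
     * lockFileName and requiredFileName are supplied by the caller; this sketch
     * makes no claim about which file names a given Lucene version uses.
     */
    static void check(Path indexDir, String lockFileName, String requiredFileName)
            throws IOException {
        if (!Files.isDirectory(indexDir)) {
            throw new IndexNotFoundException("no index directory at " + indexDir);
        }
        if (Files.exists(indexDir.resolve(lockFileName))) {
            throw new IndexLockedException("lock file present: " + lockFileName);
        }
        // A missing file is only the crudest sign of damage; a real integrity
        // check would have to read and validate the index files themselves.
        if (!Files.exists(indexDir.resolve(requiredFileName))) {
            throw new IndexCorruptException("missing file: " + requiredFileName);
        }
    }
}
```

With distinct types, a caller can react differently to each case, for example by creating a new index, waiting for the lock to clear, or restoring from a backup, instead of handling one undifferentiated IOException.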
Re: corrupted index
Otis,

You can remove the .lock file and try re-indexing or continuing indexing where you left off. I am not sure about the corrupt index. I have never seen it happen, and I believe I recall reading some messages from Doug Cutting saying that the index should never be left in an inconsistent state.

Obviously never should be, but if something's pulling the rug out from under his JRE, changes could be only partially written, right? Or is the writing format in some sense transactionally safe? I've never worked directly on something like this, but I worked at a database software company where they used transaction semantics and a journaling scheme to fake a bulletproof file system. Is this how the index-writing code is implemented?

In general, I can guess Doug's response - just torch the old index directory and rebuild it; Lucene's indexing is fast enough that you don't need to get clever. This seems to be Doug's stance in general (i.e. don't get fancy, I already put all the fanciness you'll need into extremely fast indexing and searching). So far, it seems to work :-).

I could be making this up, though, so I suggest you search through lucene-user and lucene-dev archives on www.mail-archive.com. A search for corrupt should do it. Once you figure things out maybe you can post a summary here.

I got a little curious, so I went and did the searches. There is exactly one message in each list archive (dev and users) with the keyword corrupt in it.

The lucene-users instance is irrelevant:
http://www.mail-archive.com/lucene-user@jakarta.apache.org/msg00557.html

The lucene-dev instance is more useful:
http://www.mail-archive.com/lucene-dev@jakarta.apache.org/msg00157.html

It's a post from Doug, dated Sept. 27, 2001, about adding not just thread-safety but process-safety:

It should be impossible to corrupt an index through the Lucene API. However if a Lucene process exits unexpectedly it can leave the index locked. The remedy is simply to, at a time when it is certain that no processes are accessing the index, remove all lock files.

So it sounds like it's worth trying just removing the lock files. Hm, is there a way to come up with a sanity check you can run on an index to make sure it's not corrupted? This might be an excellent thing to reassure yourself with: something went wrong? Run a sanity check; if it fails, just reindex.

Steven J. Owens [EMAIL PROTECTED]
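The remedy Doug describes, removing all lock files once no process is accessing the index, can be sketched with plain file operations. This is not a Lucene utility: the class LockCleaner is hypothetical, and it assumes that lock files live in the index directory and have names ending in ".lock", which matches the ".lock file" wording in this thread but may not hold for every Lucene version. The sketch cannot verify that the index is idle; that remains the caller's responsibility.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of Lucene): removes stale lock files. */
final class LockCleaner {

    /**
     * Deletes every entry in indexDir whose name ends in ".lock" and returns
     * the paths that were removed. Call this only when it is certain that no
     * process is reading or writing the index.
     */
    static List<Path> removeLockFiles(Path indexDir) throws IOException {
        List<Path> locks = new ArrayList<>();
        // Collect first, delete afterwards, so the directory listing is
        // not modified while it is being iterated.
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(indexDir, "*.lock")) {
            for (Path entry : entries) {
                locks.add(entry);
            }
        }
        for (Path lock : locks) {
            Files.delete(lock);
        }
        return locks;
    }
}
```

Returning the removed paths lets the caller log what was cleaned up, which is useful evidence when investigating why a process exited without releasing its lock.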
Re: corrupted index
Oh, I just thought of something (wine does body good). Perhaps one could use Runtime (the class) to catch the JVM shutdown and do whatever is needed to prevent index corruption. I believe there are some shutdown hook methods in there that may let you do that. I'm too lazy to look up the API docs now, but I remember reading about that once, and perhaps it was even mentioned on one of the 2 Lucene mailing lists.

On the other hand, it would be great to have a tool that can verify an existing index. I don't know enough about the actual file structure yet to write something like that, but maybe somebody else has done that already or would like to contribute.

Otis
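The shutdown-hook idea can be sketched with Runtime.addShutdownHook, which has been part of the JDK since 1.3. The class CloseOnShutdown below is hypothetical, and it works with any java.io.Closeable as a stand-in for whatever object holds the index open; whether a given Lucene version's writer implements Closeable is not something this sketch assumes. A hook runs on normal exit and on interrupts such as Ctrl-C, but not when the JVM is killed outright or the machine loses power, so it reduces the window for trouble rather than closing it.

```java
import java.io.Closeable;
import java.io.IOException;

/** Hypothetical helper (not part of Lucene): closes a resource when the JVM shuts down. */
final class CloseOnShutdown {

    /** Builds, but does not register, a thread that closes the given resource. */
    static Thread hookFor(Closeable resource) {
        return new Thread(() -> {
            try {
                resource.close();
            } catch (IOException e) {
                // Nothing can be retried this late in shutdown; report and move on.
                System.err.println("close failed during shutdown: " + e);
            }
        }, "close-on-shutdown");
    }

    /** Registers the hook with the JVM and returns it so it can be removed later. */
    static Thread register(Closeable resource) {
        Thread hook = hookFor(resource);
        Runtime.getRuntime().addShutdownHook(hook);
        return hook;
    }
}
```

If the resource is closed normally before exit, the returned thread can be passed to Runtime.getRuntime().removeShutdownHook so the hook does not try to close it a second time.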