corrupted index

2004-12-04 Thread Justin Swanhart
Somehow today one of my indexes became corrupted.  

I get the following IO exception when trying to open the index:
Exception in thread "main" java.io.IOException: read past EOF
at org.en.lucene.store.InputStream.refill(InputStream.java:154)
at org.en.lucene.store.InputStream.readByte(InputStream.java:43)
at org.en.lucene.store.InputStream.readVInt(InputStream.java:83)
at org.en.lucene.index.FieldInfos.read(FieldInfos.java:195)
at org.en.lucene.index.FieldInfos.init(FieldInfos.java:55)
at org.en.lucene.index.SegmentReader.initialize(SegmentReader.java:109)
at org.en.lucene.index.SegmentReader.init(SegmentReader.java:94)
at org.en.lucene.index.IndexWriter.mergeSegments(IndexWriter.java:480)
at org.en.lucene.index.IndexWriter.maybeMergeSegments(IndexWriter.java:458)
at org.en.lucene.index.IndexWriter.addDocument(IndexWriter.java:310)
at org.en.lucene.index.IndexWriter.addDocument(IndexWriter.java:294)
at org.en.global.indexer2.Minnow.main(Minnow.java:142)

Any ideas on what could cause this type of corruption, and what I can
do to avoid it in the future?  Also, any ideas on repairing the index
if this happens?  I removed the index directory and marked the rows to
be reindexed from the database, but the data is unavailable to my
users while the index rebuilds.

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: corrupted index

2002-04-02 Thread Otis Gospodnetic

Hello,

Nobody has contributed a tool that verifies index integrity yet.
Is this the latest version of Lucene?
Are you hitting the 2GB/file limit?
Just some ideas.

Otis


--- H S [EMAIL PROTECTED] wrote:
 Dear All,
 
 We are experiencing a problem with
 index updates. We have a fairly
 large index (10 gigabytes). There
 are no problems searching it. But
 when we add a single file and then
 try to optimize, optimization fails
 with a null pointer exception in
 RandomAccessFile.seek.
 
 Has anybody come across this problem?
 Is there a way to tell whether an index
 is corrupted?
 
 Thanks very much -
 
 Hinrich Schuetze
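
The 2GB/file question above can be checked without touching Lucene at
all. The following is a minimal sketch using only java.io; the class
and method names are hypothetical, and whether a given JVM or file
system actually has such a limit is an assumption to verify for your
own platform.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Lucene: lists files in an index
// directory whose size is at or above a threshold.
final class LargeFileCheck {
    // 2^31 - 1 bytes, the largest offset a signed 32-bit value can hold.
    static final long TWO_GB_LIMIT = Integer.MAX_VALUE;

    static List<File> filesAtOrAbove(File indexDir, long thresholdBytes) {
        List<File> hits = new ArrayList<>();
        File[] files = indexDir.listFiles();
        if (files == null) {
            return hits; // not a directory, or not readable
        }
        for (File f : files) {
            if (f.isFile() && f.length() >= thresholdBytes) {
                hits.add(f);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Defaults to the current directory if no index path is given.
        File dir = new File(args.length > 0 ? args[0] : ".");
        for (File f : filesAtOrAbove(dir, TWO_GB_LIMIT)) {
            System.out.println(f.getName() + ": " + f.length() + " bytes");
        }
    }
}
```

An empty result only rules out this one cause; it says nothing about
whether the index is otherwise intact.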






RE: corrupted index

2002-04-02 Thread Doug Cutting

Hinrich,

Can you please send a stack trace?

As others have mentioned, there isn't an index integrity checker.

Doug

P.S.  Hi!  How are you?





Re: corrupted index

2002-03-17 Thread Ype Kingma


Otis,

 You can remove the .lock file and try re-indexing or continuing
 indexing where you left off.
 I am not sure about the corrupt index.  I have never seen it happen,
 and I believe I recall reading some messages from Doug Cutting saying
 that index should never be left in an inconsistent state. 

 Obviously never should be, but if something's pulling the rug
out from under his JRE, changes could be only partially written,
right? 

 Or is the writing format in some sense transactionally safe?
I've never worked directly on something like this, but I worked at a
database software company where they used transaction semantics and a
journaling scheme to fake a bulletproof file system.  Is this how
the index-writing code is implemented?

 In general, I can guess Doug's response - just torch the old
index directory and rebuild it; Lucene's indexing is fast enough that
you don't need to get clever.  This seems to be Doug's stance in
general (i.e. don't get fancy, I already put all the fanciness you'll
need into extremely fast indexing and searching).  So far, it seems
to work :-).

Yes, but it's not too difficult to make it work even faster.
Back up your indexes and give all your imports an option to
work incrementally. Then, if something goes wrong, copy from
the backup and restart your import in incremental mode.
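
The backup-and-restore step can be sketched with java.nio.file alone.
This assumes the index is a flat directory of files and that no
process is writing to it while the copy runs; the class and method
names are hypothetical, and restarting the import in incremental mode
is left to the application.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

// Hypothetical helper, not part of Lucene.
final class IndexBackup {
    // Copies every regular file directly inside `from` into `to`.
    // Returns the number of files copied.
    static int copyFlat(Path from, Path to) throws IOException {
        Files.createDirectories(to);
        int copied = 0;
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(from)) {
            for (Path src : entries) {
                if (Files.isRegularFile(src)) {
                    Files.copy(src, to.resolve(src.getFileName()),
                               StandardCopyOption.REPLACE_EXISTING);
                    copied++;
                }
            }
        }
        return copied;
    }

    // Discards the files in the live index and replaces them with the backup.
    static int restore(Path backup, Path live) throws IOException {
        Files.createDirectories(live);
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(live)) {
            for (Path p : entries) {
                if (Files.isRegularFile(p)) {
                    Files.delete(p);
                }
            }
        }
        return copyFlat(backup, live);
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: IndexBackup <indexDir> <backupDir>");
            return;
        }
        int n = copyFlat(Paths.get(args[0]), Paths.get(args[1]));
        System.out.println("backed up " + n + " files");
    }
}
```

Clearing the live directory before copying matters: leftover files
from a half-finished update would otherwise sit next to the restored
ones.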

  I could be making this up, though, so I suggest you search through
 lucene-user and lucene-dev archives on www.mail-archive.com.
 A search for corrupt should do it.
 Once you figure things out maybe you can post a summary here.

 I got a little curious, so I went and did the searches.  There is
exactly one message in each list archive (dev and users) with the
keyword corrupt in it.  The lucene-users instance is irrelevant:

http://www.mail-archive.com/lucene-user@jakarta.apache.org/msg00557.html

 The lucene-dev instance is more useful:

http://www.mail-archive.com/lucene-dev@jakarta.apache.org/msg00157.html

 It's a post from Doug, dated sept 27, 2001, about adding not just
thread-safety but process-safety:

  It should be impossible to corrupt an index through the Lucene API.
  However if a Lucene process exits unexpectedly it can leave the index
  locked.  The remedy is simply to, at a time when it is certain that no
  processes are accessing the index, remove all lock files.
 

Note that this assumes that your file system works as advertised
in the java.io API. If there are occasional moments when it doesn't,
you'll have to clean up the mess yourself.

 So it sounds like it's worth trying just removing the lock files.
Hm, is there a way to come up with a sanity check you can run on an
index to make sure it's not corrupted?  This might be an excellent
thing to reassure yourself with: something went wrong?  Run a sanity
check, if it fails just reindex.

One sanity check is to delete a document, add it and reoptimize.
I have had document ordering/numbering exceptions from the optimize() call,
so I concluded optimize() does at least some sanity checks
when it performs actual work.
This makes optimize() an even nicer preparation for backup.

Regards,
Ype





RE: corrupted index

2002-03-17 Thread Matt Tucker

Hey all,

Actually, using shutdown hooks might not be the best idea since Lucene is very 
often used in server-side Java environments. Many app-servers throw security 
errors when trying to add shutdown hooks, and I've seen Weblogic crash before 
when having them in a webapp. Has anyone else run into this?

This all brings up a key issue with Lucene, which is that there is little way 
to recover from errors gracefully. I'd love to see a number of checked 
exceptions added. For example:

 IndexNotFoundException -- when trying to open an index that doesn't exist
 IndexLockedException -- when a lock file prevents you from getting an index
 IndexCorruptException -- maybe this would be thrown when an index appears to be broken?

At the moment, Lucene throws many undocumented IOExceptions and even 
NullPointerExceptions when an error case comes up. I catch these in my app, but 
there's really not an intelligent way to recover from them. Adding checked 
exceptions would be a change of the API, but it seems worth it. I'd be happy to 
make a more specific proposal if other people feel like this would be a 
worthwhile direction to go in.
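
As a rough illustration of what such a proposal could look like, here
is a sketch of the three exceptions. The class names come from the list
above; everything else is an assumption, including the choice to extend
IOException (so that existing catch blocks keep working) and the
"write.lock" file name used by the hypothetical checkOpenable helper.

```java
import java.io.File;
import java.io.IOException;

// Illustrative only: a sketch of the proposal, not Lucene's API.
class IndexNotFoundException extends IOException {
    IndexNotFoundException(String message) { super(message); }
}

class IndexLockedException extends IOException {
    IndexLockedException(String message) { super(message); }
}

// Defined for completeness; deciding when an index "appears to be
// broken" is exactly the open question in this thread.
class IndexCorruptException extends IOException {
    IndexCorruptException(String message) { super(message); }
}

final class IndexErrors {
    // Hypothetical pre-flight check an application might run before
    // opening an index. The lock file name is an assumption.
    static void checkOpenable(File indexDir) throws IOException {
        if (!indexDir.isDirectory()) {
            throw new IndexNotFoundException("no index at " + indexDir);
        }
        if (new File(indexDir, "write.lock").exists()) {
            throw new IndexLockedException("index is locked: " + indexDir);
        }
    }

    public static void main(String[] args) {
        try {
            checkOpenable(new File(args.length > 0 ? args[0] : "index"));
            System.out.println("index can be opened");
        } catch (IndexNotFoundException e) {
            System.out.println("build a new index: " + e.getMessage());
        } catch (IndexLockedException e) {
            System.out.println("wait or clear the lock: " + e.getMessage());
        } catch (IOException e) {
            System.out.println("unexpected I/O error: " + e.getMessage());
        }
    }
}
```

With distinct types, a caller can choose a recovery path per case
instead of treating every IOException the same way.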

Regards,
Matt

Quoting Spencer, Dave [EMAIL PROTECTED]:

 Runtime.addShutdownHook:
 
 
 
 http://java.sun.com/j2se/1.3/docs/api/java/lang/Runtime.html#addShutdownHook(java.lang.Thread)
 

Re: corrupted index

2002-03-16 Thread Steven J. Owens

Otis,

 You can remove the .lock file and try re-indexing or continuing
 indexing where you left off.
 I am not sure about the corrupt index.  I have never seen it happen,
 and I believe I recall reading some messages from Doug Cutting saying
 that index should never be left in an inconsistent state.  

 Obviously never should be, but if something's pulling the rug
out from under his JRE, changes could be only partially written,
right?  

 Or is the writing format in some sense transactionally safe?
I've never worked directly on something like this, but I worked at a
database software company where they used transaction semantics and a
journaling scheme to fake a bulletproof file system.  Is this how
the index-writing code is implemented?

 In general, I can guess Doug's response - just torch the old
index directory and rebuild it; Lucene's indexing is fast enough that
you don't need to get clever.  This seems to be Doug's stance in
general (i.e. don't get fancy, I already put all the fanciness you'll
need into extremely fast indexing and searching).  So far, it seems
to work :-).

 I could be making this up, though, so I suggest you search through
 lucene-user and lucene-dev archives on www.mail-archive.com.
 A search for corrupt should do it.
 Once you figure things out maybe you can post a summary here.

 I got a little curious, so I went and did the searches.  There is
exactly one message in each list archive (dev and users) with the
keyword corrupt in it.  The lucene-users instance is irrelevant:

http://www.mail-archive.com/lucene-user@jakarta.apache.org/msg00557.html

 The lucene-dev instance is more useful:

http://www.mail-archive.com/lucene-dev@jakarta.apache.org/msg00157.html

 It's a post from Doug, dated sept 27, 2001, about adding not just
thread-safety but process-safety:

  It should be impossible to corrupt an index through the Lucene API.
  However if a Lucene process exits unexpectedly it can leave the index
  locked.  The remedy is simply to, at a time when it is certain that no
  processes are accessing the index, remove all lock files.
  
 So it sounds like it's worth trying just removing the lock files.
Hm, is there a way to come up with a sanity check you can run on an
index to make sure it's not corrupted?  This might be an excellent
thing to reassure yourself with: something went wrong?  Run a sanity
check, if it fails just reindex.
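
Removing the lock files, as described in Doug's post, could be
scripted roughly like this. It is a sketch under two assumptions:
that the lock files live directly in the index directory, and that
their names end in ".lock". Both may differ between Lucene versions,
so check what your installation actually creates. The class name is
hypothetical.

```java
import java.io.File;

// Hypothetical helper, not part of Lucene. Run it only when it is
// certain that no process is accessing the index.
final class LockCleaner {
    // Deletes files ending in ".lock" directly inside indexDir and
    // returns how many were removed.
    static int removeLockFiles(File indexDir) {
        File[] files = indexDir.listFiles();
        if (files == null) {
            return 0; // not a directory, or not readable
        }
        int removed = 0;
        for (File f : files) {
            if (f.isFile() && f.getName().endsWith(".lock") && f.delete()) {
                removed++;
            }
        }
        return removed;
    }

    public static void main(String[] args) {
        if (args.length != 1) {
            System.err.println("usage: LockCleaner <indexDir>");
            return;
        }
        System.out.println("removed " + removeLockFiles(new File(args[0]))
                           + " lock file(s)");
    }
}
```

This clears a stale lock only; it does not repair anything that was
partially written when the process died.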

Steven J. Owens
[EMAIL PROTECTED]





Re: corrupted index

2002-03-16 Thread Otis Gospodnetic

Oh, I just thought of something (wine does body good).
Perhaps one could use Runtime (the class) to catch the JVM shutdown and
do whatever is needed to prevent index corruption.  I believe there are
some shutdown hook methods in there that may let you do that.  I'm too
lazy to look up the API docs now, but I remember reading about that
once, and perhaps it was even mentioned on one of the 2 Lucene mailing
lists.
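
The method in question is Runtime.addShutdownHook(Thread). Here is a
minimal sketch of the idea; the Closeable below is a stand-in for
whatever object holds the index open, and the class and method names
are hypothetical.

```java
import java.io.Closeable;
import java.io.IOException;

// Hypothetical helper: closes a resource when the JVM shuts down.
final class CloseOnShutdown {
    // Builds (but does not register) a thread that closes the resource.
    static Thread hookFor(final Closeable resource) {
        return new Thread(() -> {
            try {
                resource.close();
            } catch (IOException e) {
                System.err.println("close failed: " + e.getMessage());
            }
        }, "close-index-on-shutdown");
    }

    public static void main(String[] args) {
        // Stand-in for an open index writer.
        Closeable writer = () -> System.out.println("index writer closed");
        Runtime.getRuntime().addShutdownHook(hookFor(writer));
        System.out.println("indexing...");
        // The hook runs after main returns, during normal JVM shutdown.
    }
}
```

A hook runs on normal exit and on interrupts such as Ctrl-C, but not
when the process is killed outright or the machine loses power, so it
narrows the window for a half-written index rather than closing it.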

On the other hand, it would be great to have a tool that can verify an
existing index.  I don't know enough about the actual file structure
yet to write something like that, but maybe somebody else has done that
already or would like to contribute.

Otis

