I'm trying to delete a large number of documents
(~15 million) from a large index (30+ million
documents). I've started with an optimized index, and
a list of docIds (our own unique identifier for a
document, not a Lucene doc number) to pass to the
IndexReader.delete(Term t) method. I've had a few
different problems.
The following code is inside the loop that iterates
through the document IDs:
    try {
        Term t = new Term("docID", String.valueOf(docID));
        deletedCount += indexReader.delete(t);
    } catch (Exception e) {
        System.out.println("Error while deleting docID#" + docID);
        e.printStackTrace();
    }
In order to commit the deletions, I also close and
reopen the IndexReader periodically.
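For reference, the overall shape of the loop (delete by term, then close and reopen every N deletions) can be sketched roughly as below. This is only an illustration, not the actual IndexEraser code: TermDeleter, InMemoryDeleter, Result, deleteAll and reopenInterval are hypothetical names invented for the sketch and are not part of Lucene. A real version would wrap IndexReader behind the interface.

```java
import java.util.HashSet;
import java.util.Set;

public class BatchedDeleteSketch {

    /** Hypothetical stand-in for the index reader; NOT a Lucene type. */
    interface TermDeleter {
        /** Deletes every document whose field equals value; returns the number deleted. */
        int delete(String field, String value);

        /** Closes and reopens the underlying reader so pending deletions are committed. */
        void reopen();
    }

    /** Counters collected over one run. */
    static final class Result {
        int deleted;
        int failed;
        int reopens;
    }

    /** Deletes each ID by term and commits (reopens) after every reopenInterval attempts. */
    static Result deleteAll(TermDeleter deleter, long[] docIds, int reopenInterval) {
        if (reopenInterval <= 0) {
            throw new IllegalArgumentException("reopenInterval must be positive");
        }
        Result result = new Result();
        int sinceReopen = 0;
        for (long docId : docIds) {
            try {
                result.deleted += deleter.delete("docID", String.valueOf(docId));
            } catch (Exception e) {
                // Record the failure and keep going, as in the original loop.
                result.failed++;
                System.out.println("Error while deleting docID#" + docId);
            }
            if (++sinceReopen == reopenInterval) {
                deleter.reopen();
                result.reopens++;
                sinceReopen = 0;
            }
        }
        // Commit whatever is left over from the last partial batch.
        if (sinceReopen > 0) {
            deleter.reopen();
            result.reopens++;
        }
        return result;
    }

    /** In-memory fake used only to exercise the loop; holds one document per ID. */
    static final class InMemoryDeleter implements TermDeleter {
        private final Set<String> live = new HashSet<>();

        InMemoryDeleter(long[] ids) {
            for (long id : ids) {
                live.add(String.valueOf(id));
            }
        }

        @Override
        public int delete(String field, String value) {
            return live.remove(value) ? 1 : 0;
        }

        @Override
        public void reopen() {
            // Nothing to commit in memory.
        }
    }

    public static void main(String[] args) {
        TermDeleter fake = new InMemoryDeleter(new long[] {1, 2, 3, 4, 5});
        Result r = deleteAll(fake, new long[] {1, 2, 3, 9}, 2);
        System.out.println("deleted=" + r.deleted + " failed=" + r.failed + " reopens=" + r.reopens);
    }
}
```

The final reopen after the loop matters: without it, deletions from the last partial batch would never be committed.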
At first I was reopening the IndexReader after every
500K documents deleted. The problem was that after
~60-75K deletions, the delete call began to throw a
NullPointerException:
Error while deleting docID#27136356
java.lang.NullPointerException
    at java.lang.String.compareTo(String.java:402)
    at org.apache.lucene.index.Term.compareTo(Term.java:76)
    at org.apache.lucene.index.TermInfosReader.scanEnum(TermInfosReader.java:143)
    at org.apache.lucene.index.TermInfosReader.get(TermInfosReader.java:132)
    at org.apache.lucene.index.SegmentTermDocs.seek(SegmentTermDocs.java:51)
    at org.apache.lucene.index.IndexReader.termDocs(IndexReader.java:364)
    at org.apache.lucene.index.IndexReader.delete(IndexReader.java:449)
    at IndexEraser.main(IndexEraser.java:32)
After a little fiddling around, I tried reducing the
interval between reopens to 5000, and most of the
NullPointerExceptions went away.
A test search of the resulting, unoptimized index
worked fine.
I then optimized the index to reduce its size. Now,
instead of getting data back, many of the search
results come back as a null value.
Any ideas? I'm really confused, and the only other
option I can think of is to reindex the documents I
need, which would take much longer than deleting the
ones I don't.
Thanks!
Greg Gershman