I'd have to see your indexing code to tell whether there are any obvious performance gotchas there. If you can run your indexer under a profiler (OptimizeIt, JProbe, or just the free one built into the JVM, enabled with java -Xprof), it will tell you in which methods most of your CPU time is spent. If you're using StandardAnalyzer, that may be the culprit -- StandardAnalyzer is a fairly advanced grammar-based parser, but it is pretty slow. If you don't need its functionality, then try using a simpler Analyzer (like WhitespaceAnalyzer or a subclass).

As for changing a document within an index -- there is no "update" operation for documents, there's just delete and add (and then optimize). Delete only marks docs as deleted (so they don't come back in search results); they aren't physically removed from the index files until you optimize.

Also, it isn't fatal that your current index doesn't have MD5 info in it. It's pretty fast to compute MD5 at search time for each document returned (much faster than the I/O-bound part, which is actually retrieving the docs from the Lucene index). So you could try just doing all your duplicate detection at search time. If this is too slow, you could consider caching the computed MD5 for your docs.

-chris

On 6/12/05, Dave Kor <[EMAIL PROTECTED]> wrote:
> Thanks for the quick reply, Chris.
>
> Yes, when I say "duplicate" sentences, they are exact copies of the same
> string.
>
> The MD5 hash is a good idea, I wish I had thought of it earlier as it would
> have saved me a lot of trouble. Right now it is not feasible to reindex again
> because indexing is a very slow and cpu intensive task for me. I'm adding
> part-of-speech, chunk, named entity and coreference information as I index,
> which means it takes 4 separate servers and 4-5 days of processing to create a
> new index. And as far as I know, you can't change the index once it's created.
> Am I correct?
>
> Any other ideas that don't require me to re-index the whole thing?
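
P.S. Here is a rough sketch of the search-time duplicate detection suggested in the reply above, using only the JDK so it stays independent of any particular Lucene version. The class name SearchTimeDedup, the dedupe method, and the idea of keying the cache by an integer document id are all assumptions made for illustration; the texts would be whatever stored field you pull out of each hit.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of Lucene: drops exact-duplicate hit texts.
class SearchTimeDedup {
    // Assumed cache: document id -> hex MD5 of that document's text.
    private final Map<Integer, String> md5Cache = new HashMap<>();

    // Returns the MD5 of the UTF-8 bytes of text as 32 lowercase hex digits.
    static String md5Hex(String text) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(text.getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                sb.append(String.format("%02x", b & 0xff));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    // docIds.get(i) identifies texts.get(i). Keeps the first occurrence of
    // each distinct text and preserves the original (ranked) order.
    List<String> dedupe(List<Integer> docIds, List<String> texts) {
        Set<String> seen = new HashSet<>();
        List<String> unique = new ArrayList<>();
        for (int i = 0; i < texts.size(); i++) {
            String text = texts.get(i);
            // Compute the hash only if this doc id has not been hashed before.
            String hash = md5Cache.computeIfAbsent(docIds.get(i), id -> md5Hex(text));
            if (seen.add(hash)) {
                unique.add(text);
            }
        }
        return unique;
    }

    public static void main(String[] args) {
        SearchTimeDedup dedup = new SearchTimeDedup();
        List<String> hits = Arrays.asList("a cat sat", "a dog ran", "a cat sat");
        System.out.println(dedup.dedupe(Arrays.asList(1, 2, 3), hits));
    }
}
```

One caveat on the cache: it is only trustworthy while document ids stay stable. Lucene document numbers can change when segments are merged (for example after an optimize), so a cache like this would need to be discarded whenever the index is rewritten.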