Thanks Paul!

Using your suggestion, I have changed the update-check code to use
only the IndexReader:

// Delete any documents already in the index that share a key with the
// incoming batch, using only the IndexReader.
// (keyIter iterates over the batch's keys and is set up earlier.)
IndexReader localReader = null;
try {
  localReader = IndexReader.open(path);

  while (keyIter.hasNext()) {
    String key = (String) keyIter.next();
    Term term = new Term("key", key);
    TermDocs tDocs = localReader.termDocs(term);
    if (tDocs != null) {
      try {
        while (tDocs.next()) {
          localReader.delete(tDocs.doc());
        }
      } finally {
        tDocs.close();
      }
    }
  }
} finally {
  if (localReader != null) {
    localReader.close();
  }
}
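
One question on the sorted-order part of your suggestion: I assume you
mean sorting the keys before iterating, along these lines? (Untested
sketch -- keySet is just a placeholder for whatever collection holds the
batch's keys.)

  // Sort the keys so termDocs() walks the term dictionary in order
  // (uses java.util.List, ArrayList, Collections, Iterator).
  List keys = new ArrayList(keySet);   // keySet is a placeholder
  Collections.sort(keys);
  Iterator keyIter = keys.iterator();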


Unfortunately, it didn't seem to make any dramatic difference.

I also see the CPU is only 30-50% busy, so I am guessing it's spending
a lot of time in IO. Is there any way of making the CPU work harder?
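
For the multi-threaded adding you mentioned, is something like this
what you had in mind? (Rough, untested sketch -- nextDocument() is just
a placeholder for pulling the next document of the batch; my
understanding is that IndexWriter.addDocument() can be called from
several threads at once.)

  final IndexWriter writer = new IndexWriter(path, new StandardAnalyzer(), false);
  Thread[] workers = new Thread[3];
  for (int i = 0; i < workers.length; i++) {
    workers[i] = new Thread() {
      public void run() {
        Document doc;
        while ((doc = nextDocument()) != null) {  // placeholder
          try {
            writer.addDocument(doc);
          } catch (IOException e) {
            e.printStackTrace();
          }
        }
      }
    };
    workers[i].start();
  }
  for (int i = 0; i < workers.length; i++) {
    try { workers[i].join(); } catch (InterruptedException e) { }
  }
  writer.close();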

Is a batch size of 500 too small for 1 million documents?
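
Also, for minMergeDocs / maxMergeDocs, is this the sort of tuning you
meant? (Values picked arbitrarily just to illustrate.)

  // Same writer as in the sketch above.
  writer.minMergeDocs = 1000;    // buffer more docs in RAM before flushing a segment (default 10)
  writer.mergeFactor  = 50;      // merge more segments at a time (default 10)
  writer.maxMergeDocs = 100000;  // cap the size of any merged segment (default Integer.MAX_VALUE)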

Currently I am seeing a linear speed degradation of 0.3 milliseconds
per document.

Thanks

-John


On Wed, 24 Nov 2004 09:05:39 +0100, Paul Elschot <[EMAIL PROTECTED]> wrote:
> On Wednesday 24 November 2004 00:37, John Wang wrote:
> 
> 
> > Hi:
> >
> >    I am trying to index 1M documents, with batches of 500 documents.
> >
> >    Each document has an unique text key, which is added as a
> > Field.KeyWord(name,value).
> >
> >    For each batch of 500, I need to make sure I am not adding a
> > document with a key that is already in the current index.
> >
> >   To do this, I am calling IndexSearcher.docFreq for each document and
> > delete the document currently in the index with the same key:
> >
> >        while (keyIter.hasNext()) {
> >             String objectID = (String) keyIter.next();
> >             term = new Term("key", objectID);
> >             int count = localSearcher.docFreq(term);
> 
> To speed this up a bit make sure that the iterator gives
> the terms in sorted order. I'd use an index reader instead
> of a searcher, but that will probably not make a difference.
> 
> Adding the documents can be done with multiple threads.
> Last time I checked that, there was a moderate speed up
> using three threads instead of one on a single CPU machine.
> Tuning the values of minMergeDocs and maxMergeDocs
> may also help to increase performance of adding documents.
> 
> Regards,
> Paul Elschot
> 
