Re: URGENT: Help indexing large document set

2004-11-27 Thread John Wang
> -Original Message-
> From: John Wang [mailto:[EMAIL PROTECTED]
> Sent: Monday, November 22, 2004 12:35 PM
> To: Lucene Users List
> Subject: Re: Index in RAM - is it realy worthy?
> 
> In my test, I have 12,900 documents. Each document is small: a few
> discrete fields (Keyword type) and one Text field containing only one
> sentence.
> 
> With both mergeFactor and maxMergeDocs set to 1000:
> 
> using RAMDirectory, the indexing job took about 9.2 seconds
> 
> not using RAMDirectory, the indexing job took about 122 seconds.
> 
> I am not calling optimize.
> 
> This is on Windows XP running Java 1.5.
> 
> Is there something very wrong or different in my setup to cause such a
> big difference?
> 
> Thanks
> 
> -John
> 
> On Mon, 22 Nov 2004 09:23:40 -0800 (PST), Otis Gospodnetic
> <[EMAIL PROTECTED]> wrote:
> > For the Lucene book I wrote some test cases that compare FSDirectory
> > and RAMDirectory.  What I found was that with certain settings
> > FSDirectory was almost as fast as RAMDirectory.  Personally, I would
> > push FSDirectory and hope that the OS and the Filesystem do their
> share
> > of work and caching for me before looking for ways to optimize my
> code.
> >
> > Otis
> >
> >
> >
> > --- [EMAIL PROTECTED] wrote:
> >
> > >
> > > I did the following test:
> > > I created the RAM folder on my Red Hat box and copied c. 1 GB of
> > > indexes there.
> > > I expected the queries to run much quicker.
> > > In reality it was sometimes even slower (sic!).
> > >
> > > Lucene has its own RAM disk functionality. If I implement it, would
> > > it bring any benefits?
> > >
> > > Thanks in advance
> > > J.
> >
> 
>  > -Original Message-
>  > From: John Wang [mailto:[EMAIL PROTECTED]
>  > Sent: Saturday, November 27, 2004 11:50 AM
>  > To: Chuck Williams
>  > Subject: Re: URGENT: Help indexing large document set
>  >
>  > I found the reason for the degradation. It is because I was writing
>  > to a RAMDirectory and then adding it to an FSDirectory-based
>  > IndexWriter. I guess it makes sense, since the addIndexes call slows
>  > down as the index grows.
>  >
>  > I guess it is not a good idea to use a RAMDirectory if there are
>  > many small batches. Are there any performance numbers that would
>  > tell me when to (and when not to) use a RAMDirectory?
>  >
>  > thanks
>  >
>  > -John
>  >
>  >
>  > On Wed, 24 Nov 2004 15:23:49 -0800, John Wang <[EMAIL PROTECTED]>
>  > wrote:
>  > > Hi Chuck:
>  > >
>  > >  The reason I am not using localReader.delete(term) is because I
>  > > have some logic to check whether to delete the term based on a
>  > > flag.
>  > >
>  > >  I am testing with the keys sorted.
>  > >
>  > >  I am not doing anything weird, just committing batches of 500
>  > > documents to the index, 2000 batches in all. I don't know why it
>  > > is having this linear slowdown...
>  > >
>  > >
>  > >
>  > > Thanks
>  > >
>  > > -John
>  > >

RE: URGENT: Help indexing large document set

2004-11-24 Thread Chuck Williams
Does keyIter return the keys in sorted order?  This should reduce seeks,
especially if the keys are dense.

Also, you should be able to call localReader.delete(term) instead of
iterating over the docs (of which I presume there is only one doc, since
keys are unique).  This won't improve performance, as
IndexReader.delete(Term) does exactly what your code does, but it will
be cleaner.
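
A minimal sketch of that delete-by-Term form, assuming the Lucene 1.4-era
API; the helper class, index path, and "key" field name are placeholders
rather than code from this thread:

import java.io.IOException;
import java.util.Iterator;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;

// Delete by Term instead of walking TermDocs. IndexReader.delete(Term)
// returns the number of documents deleted (at most one here, since keys
// are unique).
public class DeleteByKey {
  public static void deleteExisting(String indexPath, Iterator keys)
      throws IOException {
    IndexReader reader = IndexReader.open(indexPath);
    try {
      while (keys.hasNext()) {
        String key = (String) keys.next();
        reader.delete(new Term("key", key));
      }
    } finally {
      reader.close();
    }
  }
}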

A linear slowdown with the number of docs doesn't make sense, so
something else must be wrong.  I'm not sure what the default buffer size
is (it appears it used to be 128, but I think it is dynamic now).  You
might find the slowdown stops after a certain point, especially if you
increase your batch size.

Chuck






Re: URGENT: Help indexing large document set

2004-11-24 Thread John Wang
Thanks Paul!

Using your suggestion, I have changed the update check code to use
only the indexReader:

try {
  localReader = IndexReader.open(path);

  while (keyIter.hasNext()) {
    key = (String) keyIter.next();
    term = new Term("key", key);
    // Walk the postings for this key and delete every matching doc
    // (at most one, since keys are unique).
    TermDocs tDocs = localReader.termDocs(term);
    if (tDocs != null) {
      try {
        while (tDocs.next()) {
          localReader.delete(tDocs.doc());
        }
      } finally {
        tDocs.close();
      }
    }
  }
} finally {
  if (localReader != null) {
    localReader.close();
  }
}


Unfortunately it didn't seem to make any dramatic difference.

I also see that the CPU is only 30-50% busy, so I am guessing it's
spending a lot of time in I/O. Any way of making the CPU work harder?

Is a batch size of 500 too small for 1 million documents?

Currently I am seeing a linear speed degradation of 0.3 milliseconds
per document.

Thanks

-John






Re: URGENT: Help indexing large document set

2004-11-24 Thread Paul Elschot
On Wednesday 24 November 2004 00:37, John Wang wrote:
> Hi:
> 
>I am trying to index 1M documents, with batches of 500 documents.
> 
>Each document has an unique text key, which is added as a
> Field.KeyWord(name,value).
> 
>For each batch of 500, I need to make sure I am not adding a
> document with a key that is already in the current index.
> 
>   To do this, I am calling IndexSearcher.docFreq for each document and
> delete the document currently in the index with the same key:
>  
>while (keyIter.hasNext()) {
> String objectID = (String) keyIter.next();
> term = new Term("key", objectID);
> int count = localSearcher.docFreq(term);

To speed this up a bit, make sure that the iterator gives
the terms in sorted order. I'd use an index reader instead
of a searcher, but that will probably not make a difference.

Adding the documents can be done with multiple threads.
Last time I checked that, there was a moderate speedup
using three threads instead of one on a single-CPU machine.
Tuning the values of minMergeDocs and maxMergeDocs
may also help to increase the performance of adding documents.

Regards,
Paul Elschot
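
A minimal sketch of the tuning Paul mentions, assuming the Lucene 1.4-era
IndexWriter, which exposed mergeFactor, minMergeDocs, and maxMergeDocs as
public fields; the path, analyzer, and values are illustrative
placeholders, not settings recommended in this thread:

import java.io.IOException;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;

public class TunedWriter {
  public static IndexWriter open(String indexPath, boolean create)
      throws IOException {
    IndexWriter writer = new IndexWriter(indexPath, new StandardAnalyzer(), create);
    // Buffer more documents in RAM before a new segment is flushed to disk.
    writer.minMergeDocs = 1000;
    // Let more segments accumulate before each merge (fewer, larger merges).
    writer.mergeFactor = 50;
    // No upper bound on the number of documents merged into one segment.
    writer.maxMergeDocs = Integer.MAX_VALUE;
    return writer;
  }
}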





Re: URGENT: Help indexing large document set

2004-11-23 Thread John Wang
Thanks Chuck! I missed the call: getIndexOffset.
I am profiling it again to pinpoint where the performance problem is.

-John
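
A crude timing harness of the sort such a profiling pass might use,
assuming the Lucene 1.4-era IndexSearcher.docFreq(Term); the class, field
name, and iterator are placeholders, not code from this thread:

import java.io.IOException;
import java.util.Iterator;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;

public class DocFreqTimer {
  // Returns the milliseconds spent in key lookups for one batch, to
  // compare against the time spent actually adding the documents.
  public static long timeLookups(IndexSearcher searcher, Iterator keys)
      throws IOException {
    long start = System.currentTimeMillis();
    while (keys.hasNext()) {
      searcher.docFreq(new Term("key", (String) keys.next()));
    }
    return System.currentTimeMillis() - start;
  }
}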





RE: URGENT: Help indexing large document set

2004-11-23 Thread Chuck Williams
Are you sure you have a performance problem with
TermInfosReader.get(Term)?  It looks to me like it scans sequentially
only within a small buffer window (of size
SegmentTermEnum.indexInterval) and that it uses binary search otherwise.
See TermInfosReader.getIndexOffset(Term).

Chuck






URGENT: Help indexing large document set

2004-11-23 Thread John Wang
Hi:

   I am trying to index 1M documents, in batches of 500 documents.

   Each document has a unique text key, which is added as a
Field.Keyword(name, value).

   For each batch of 500, I need to make sure I am not adding a
document with a key that is already in the current index.

   To do this, I am calling IndexSearcher.docFreq for each document and
deleting the document currently in the index with the same key:

   while (keyIter.hasNext()) {
     String objectID = (String) keyIter.next();
     term = new Term("key", objectID);
     int count = localSearcher.docFreq(term);

     // The key already exists in the index, so delete the old document
     // before adding the new one.
     if (count != 0) {
       localReader.delete(term);
     }
   }

Then I proceed with adding the documents.

This turns out to be extremely expensive. I looked into the code, and I
see that TermInfosReader.get(Term term) does a linear lookup for each
term. So as the index grows, the above operation degrades at a linear
rate, and for each commit we are doing a docFreq for 500 documents.

I also tried creating a BooleanQuery composed of 500 TermQueries and
doing one search per batch, and the performance didn't get better. And
if the batch size increases to, say, 50,000, creating a BooleanQuery
composed of 50,000 TermQuery instances may introduce huge memory
costs.
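
For concreteness, a sketch of that batched BooleanQuery lookup against
the Lucene 1.4-era API; the helper class and field name are placeholders,
and note that BooleanQuery in this era enforces a default maximum clause
count (1,024, I believe), which a 50,000-clause query would exceed:

import java.io.IOException;
import java.util.Iterator;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;

public class BatchKeyLookup {
  // OR together one TermQuery per key and run a single search; the hits
  // are the documents whose keys are already in the index.
  public static Hits findExisting(IndexSearcher searcher, Iterator keys)
      throws IOException {
    BooleanQuery batch = new BooleanQuery();
    while (keys.hasNext()) {
      String key = (String) keys.next();
      // add(query, required, prohibited): an optional clause, i.e. a plain OR.
      batch.add(new TermQuery(new Term("key", key)), false, false);
    }
    return searcher.search(batch);
  }
}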

Is there a better way to do this?

Can TermInfosReader.get(Term term) be optimized to do a binary lookup
instead of a linear walk? Of course that depends on whether the terms
are stored in sorted order; are they?

This is very urgent, thanks in advance for all your help.

-John
