Hi Chuck:

Thanks for your help and the info.
By some experimentation, I found that calling FSWriter.addIndexes(ramDirectory) actually performs a merge with the existing index. So, doing 2000 batches of 500 documents, the index grows after each batch and the time to do each merge increases. I guess with this implementation, doing it this way is not optimal.
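For the archive, the shape of the fix as I understand it: keep a single IndexWriter open on the disk index and add documents to it directly, letting Lucene merge segments incrementally, instead of calling addIndexes once per batch. A rough sketch only; the path, analyzer choice, and mergeFactor value are illustrative:

    import java.io.IOException;
    import java.util.Iterator;
    import java.util.List;

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.IndexWriter;

    public class BatchIndexer {
        // One long-lived writer on the disk index; Lucene merges segments
        // incrementally (controlled by mergeFactor) instead of re-merging
        // the whole index after every batch.
        public static void indexBatches(String path, Iterator batches)
                throws IOException {
            IndexWriter writer = new IndexWriter(path, new StandardAnalyzer(), true);
            writer.mergeFactor = 50; // illustrative; a public field in Lucene 1.4
            try {
                while (batches.hasNext()) {
                    List batch = (List) batches.next(); // e.g. 500 Documents
                    for (Iterator i = batch.iterator(); i.hasNext();) {
                        writer.addDocument((Document) i.next());
                    }
                }
                writer.optimize(); // the one big merge, once, at the very end
            } finally {
                writer.close();
            }
        }
    }

This keeps the per-batch cost roughly constant; the only expensive merge is the final optimize().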
Thanks

-John

On Sat, 27 Nov 2004 13:14:31 -0800, Chuck Williams <[EMAIL PROTECTED]> wrote:
> Hi John,
>
> I don't use a RAMDirectory and so don't have the answer for you. There
> have been a number of messages about RAMDirectory performance on
> lucene-user, including some reported benchmarks. Some people have
> reported a significant benefit from RAMDirectory, but most others have
> seen little or no benefit. I'm not sure which factors determine the
> nature or magnitude of the impact. You sent the message below just to
> me -- you might want to post the question on lucene-user.
>
> I've included a couple of messages below on the subject that I saved.
>
> Chuck
>
> Included messages:
>
> -----Original Message-----
> From: Jonathan Hager [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, November 24, 2004 2:27 PM
> To: Lucene Users List
> Subject: Re: Index in RAM - is it realy worthy?
>
> When comparing RAMDirectory and FSDirectory it is important to mention
> which OS you are using. Linux caches the most recent disk accesses in
> memory. Here is a good article that describes its strategy:
> http://forums.gentoo.org/viewtopic.php?t=175419
>
> The 2% difference you are seeing is the memory copy. With other OSes
> you may see a speedup when using a RAMDirectory, because not all OSes
> keep a disk cache in memory, so they must go to disk to read the
> index.
>
> Another consideration is that there is currently a 2GB limitation on
> the size of a RAMDirectory: indexes over 2GB cause an overflow in the
> int used to create the buffer [see int len = (int) is.length(); in
> RAMDirectory].
>
> I ended up using a RAMDirectory for a very different reason. The index
> is 1 to 2MB and is rebuilt every few hours. It takes 3 to 4 minutes to
> query the database and rebuild the index, but the search should be
> available 100% of the time. Since the index is so small, I do the
> following:
>
> on server startup:
> - look for the semaphore; if it is there, delete the index
> - if there is no index, build it in an FSDirectory
> - load the index from the FSDirectory into a RAMDirectory
>
> on reindex:
> - create the semaphore
> - rebuild the index in the FSDirectory
> - delete the semaphore
> - load the index from the FSDirectory into a RAMDirectory
>
> to search:
> - search the RAMDirectory
>
> The RAMDirectory could be replaced by a regular FSDirectory, but it
> seemed silly to copy the index from disk to disk when it ultimately
> needs to be in memory.
>
> The FSDirectory could be replaced by a RAMDirectory, but that would
> mean the server takes 3 to 4 minutes longer to start up every time. By
> persisting the index, that time is only needed when indexing was
> interrupted.
>
> Jonathan
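Jonathan's "load into RAM" step maps onto the RAMDirectory constructor that copies an existing Directory. A minimal sketch of the startup path under that assumption; the semaphore filename and rebuildIndex() are placeholders for his database-driven rebuild:

    import java.io.File;
    import java.io.IOException;

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.store.RAMDirectory;

    public class RamSearcherLoader {
        // Startup: a leftover semaphore means a reindex was interrupted, so
        // the on-disk index is suspect and gets rebuilt; searches then run
        // against an in-memory copy of the healthy disk index.
        public static IndexSearcher openSearcher(String indexPath) throws IOException {
            File semaphore = new File(indexPath, "reindex.semaphore"); // hypothetical name
            if (semaphore.exists() || !IndexReader.indexExists(indexPath)) {
                rebuildIndex(indexPath); // placeholder: query the DB, write the FSDirectory
                semaphore.delete();
            }
            Directory disk = FSDirectory.getDirectory(indexPath, false);
            Directory ram = new RAMDirectory(disk); // copy the disk index into memory
            return new IndexSearcher(ram);
        }

        private static void rebuildIndex(String indexPath) {
            // omitted: delete the stale index and rebuild it from the database
        }
    }

Searches go through the returned searcher; after a reindex, a fresh RAMDirectory copy is loaded the same way once the semaphore is gone.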
> On Mon, 22 Nov 2004 12:39:07 -0800, Kevin A. Burton
> <[EMAIL PROTECTED]> wrote:
> > Otis Gospodnetic wrote:
> >
> > > For the Lucene book I wrote some test cases that compare FSDirectory
> > > and RAMDirectory. What I found was that with certain settings
> > > FSDirectory was almost as fast as RAMDirectory. Personally, I would
> > > push FSDirectory and hope that the OS and the filesystem do their
> > > share of work and caching for me before looking for ways to optimize
> > > my code.
> >
> > Yes... I performed the same benchmark, and in my situation RAMDirectory
> > was about 2% slower for searches.
> >
> > I'm willing to bet that it has to do with the fact that it's a
> > Hashtable and not a HashMap (which isn't synchronized).
> >
> > Also, adding a constructor for the term size could make loading a
> > RAMDirectory faster, since you could prevent rehashing.
> >
> > If you're on a modern machine, your filesystem cache will end up
> > buffering your disk anyway, which I'm sure was happening in my
> > situation.
> >
> > Kevin
> >
> > --
> > Kevin A. Burton, Location - San Francisco, CA
> > AIM/YIM - sfburtonator, Web - http://peerfear.org/
> > GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
>
> -----Original Message-----
> From: John Wang [mailto:[EMAIL PROTECTED]
> Sent: Monday, November 22, 2004 12:35 PM
> To: Lucene Users List
> Subject: Re: Index in RAM - is it realy worthy?
>
> In my test, I have 12,900 documents. Each document is small: a few
> discrete fields (Keyword type) and one Text field containing only one
> sentence.
>
> With both mergeFactor and maxMergeDocs set to 1000:
>
> using a RAMDirectory, the indexing job took about 9.2 seconds;
>
> not using a RAMDirectory, the indexing job took about 122 seconds.
>
> I am not calling optimize.
>
> This is on Windows XP running Java 1.5.
>
> Is there something very wrong or different in my setup to cause such a
> big difference?
>
> Thanks
>
> -John
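For concreteness, the RAMDirectory variant of a test like this presumably looks something like the sketch below: build the whole index in memory, then merge it into the disk index exactly once. The mergeFactor/maxMergeDocs values are the ones from the message above; the rest is illustrative:

    import java.io.IOException;
    import java.util.Iterator;

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.RAMDirectory;

    public class RamThenDisk {
        // Index everything into a RAMDirectory first, then do a single
        // addIndexes() into the disk index -- one merge in total, not one
        // merge per batch.
        public static void index(Iterator docs, String diskPath) throws IOException {
            RAMDirectory ram = new RAMDirectory();
            IndexWriter ramWriter = new IndexWriter(ram, new StandardAnalyzer(), true);
            ramWriter.mergeFactor = 1000;   // values from the test above
            ramWriter.maxMergeDocs = 1000;
            while (docs.hasNext()) {
                ramWriter.addDocument((Document) docs.next());
            }
            ramWriter.close();

            IndexWriter diskWriter = new IndexWriter(diskPath, new StandardAnalyzer(), true);
            diskWriter.addIndexes(new Directory[] { ram }); // the one merge
            diskWriter.close();
        }
    }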
> On Mon, 22 Nov 2004 09:23:40 -0800 (PST), Otis Gospodnetic
> <[EMAIL PROTECTED]> wrote:
> > For the Lucene book I wrote some test cases that compare FSDirectory
> > and RAMDirectory. What I found was that with certain settings
> > FSDirectory was almost as fast as RAMDirectory. Personally, I would
> > push FSDirectory and hope that the OS and the filesystem do their
> > share of work and caching for me before looking for ways to optimize
> > my code.
> >
> > Otis
> >
> > --- [EMAIL PROTECTED] wrote:
> > >
> > > I did the following test:
> > > I created a RAM folder on my Red Hat box and copied c. 1GB of
> > > indexes there.
> > > I expected the queries to run much quicker.
> > > In reality it was sometimes even slower (sic!).
> > >
> > > Lucene has its own RAM disk functionality. If I implement it, would
> > > it bring any benefits?
> > >
> > > Thanks in advance
> > > J.

> > -----Original Message-----
> > From: John Wang [mailto:[EMAIL PROTECTED]
> > Sent: Saturday, November 27, 2004 11:50 AM
> > To: Chuck Williams
> > Subject: Re: URGENT: Help indexing large document set
> >
> > I found the reason for the degradation. It is because I was writing
> > to a RAMDirectory and then adding to an FSWriter. I guess it makes
> > sense, since the addIndexes call would slow down as the index grows.
> >
> > I guess it is not a good idea to use a RAMDirectory if there are many
> > small batches. Are there some performance numbers that would tell me
> > when to/not to use a RAMDirectory?
> >
> > thanks
> >
> > -John
> >
> > On Wed, 24 Nov 2004 15:23:49 -0800, John Wang <[EMAIL PROTECTED]>
> > wrote:
> > > Hi Chuck:
> > >
> > > The reason I am not using localReader.delete(term) is that I have
> > > some logic to check whether to delete the term based on a flag.
> > >
> > > I am testing with the keys sorted.
> > >
> > > I am not doing anything weird, just committing batches of 500
> > > documents to the index, 2000 batches in all. I don't know why it is
> > > having this linear slowdown...
> > >
> > > Thanks
> > >
> > > -John
> > >
> > > On Wed, 24 Nov 2004 12:32:52 -0800, Chuck Williams <[EMAIL PROTECTED]>
> > > wrote:
> > > > Does keyIter return the keys in sorted order? This should reduce
> > > > seeks, especially if the keys are dense.
> > > >
> > > > Also, you should be able to call localReader.delete(term) instead
> > > > of iterating over the docs (of which I presume there is only one
> > > > doc, since keys are unique). This won't improve performance, as
> > > > IndexReader.delete(Term) does exactly what your code does, but it
> > > > will be cleaner.
> > > >
> > > > A linear slowdown with the number of docs doesn't make sense, so
> > > > something else must be wrong. I'm not sure what the default buffer
> > > > size is (it appears it used to be 128 but is dynamic now, I think).
> > > > You might find the slowdown stops after a certain point, especially
> > > > if you increase your batch size.
> > > >
> > > > Chuck
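Chuck's IndexReader.delete(Term) point, spelled out: it deletes every document containing the term, so the whole termDocs() loop in the message below collapses to one call per key. A sketch, with shouldDelete() standing in for the flag logic John mentions above:

    import java.io.IOException;
    import java.util.Iterator;

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;

    public class KeyDeleter {
        // Delete every document whose unique "key" field matches a key from
        // the iterator, subject to a per-key flag check.
        public static void deleteKeys(String path, Iterator keyIter) throws IOException {
            IndexReader reader = IndexReader.open(path);
            try {
                while (keyIter.hasNext()) {
                    String key = (String) keyIter.next();
                    if (shouldDelete(key)) { // stand-in for the flag-based check
                        reader.delete(new Term("key", key));
                    }
                }
            } finally {
                reader.close();
            }
        }

        private static boolean shouldDelete(String key) {
            return true; // placeholder
        }
    }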
> > > > > -----Original Message-----
> > > > > From: John Wang [mailto:[EMAIL PROTECTED]
> > > > > Sent: Wednesday, November 24, 2004 12:21 PM
> > > > > To: Lucene Users List
> > > > > Subject: Re: URGENT: Help indexing large document set
> > > > >
> > > > > Thanks Paul!
> > > > >
> > > > > Using your suggestion, I have changed the update-check code to
> > > > > use only the IndexReader:
> > > > >
> > > > > try {
> > > > >     localReader = IndexReader.open(path);
> > > > >
> > > > >     while (keyIter.hasNext()) {
> > > > >         key = (String) keyIter.next();
> > > > >         term = new Term("key", key);
> > > > >         // walk the postings for this key, deleting each match
> > > > >         TermDocs tDocs = localReader.termDocs(term);
> > > > >         if (tDocs != null) {
> > > > >             try {
> > > > >                 while (tDocs.next()) {
> > > > >                     localReader.delete(tDocs.doc());
> > > > >                 }
> > > > >             } finally {
> > > > >                 tDocs.close();
> > > > >             }
> > > > >         }
> > > > >     }
> > > > > } finally {
> > > > >     if (localReader != null) {
> > > > >         localReader.close();
> > > > >     }
> > > > > }
> > > > >
> > > > > Unfortunately it didn't seem to make any dramatic difference.
> > > > >
> > > > > I also see the CPU is only 30-50% busy, so I am guessing it's
> > > > > spending a lot of time in IO. Any way of making the CPU work
> > > > > harder?
> > > > >
> > > > > Is a batch size of 500 too small for 1 million documents?
> > > > >
> > > > > Currently I am seeing a linear speed degradation of 0.3
> > > > > milliseconds per document.
> > > > >
> > > > > Thanks
> > > > >
> > > > > -John
> > > > >
> > > > > On Wed, 24 Nov 2004 09:05:39 +0100, Paul Elschot
> > > > > <[EMAIL PROTECTED]> wrote:
> > > > > > On Wednesday 24 November 2004 00:37, John Wang wrote:
> > > > > > >
> > > > > > > Hi:
> > > > > > >
> > > > > > > I am trying to index 1M documents, with batches of 500
> > > > > > > documents.
> > > > > > >
> > > > > > > Each document has a unique text key, which is added as a
> > > > > > > Field.Keyword(name, value).
> > > > > > >
> > > > > > > For each batch of 500, I need to make sure I am not adding
> > > > > > > a document with a key that is already in the current index.
> > > > > > >
> > > > > > > To do this, I am calling IndexSearcher.docFreq for each
> > > > > > > document and deleting the document currently in the index
> > > > > > > with the same key:
> > > > > > >
> > > > > > > while (keyIter.hasNext()) {
> > > > > > >     String objectID = (String) keyIter.next();
> > > > > > >     term = new Term("key", objectID);
> > > > > > >     int count = localSearcher.docFreq(term);
> > > > > >
> > > > > > To speed this up a bit, make sure that the iterator gives
> > > > > > the terms in sorted order. I'd use an index reader instead
> > > > > > of a searcher, but that will probably not make a difference.
> > > > > >
> > > > > > Adding the documents can be done with multiple threads.
> > > > > > Last time I checked that, there was a moderate speedup
> > > > > > using three threads instead of one on a single-CPU machine.
> > > > > > Tuning the values of minMergeDocs and maxMergeDocs
> > > > > > may also help to increase the performance of adding documents.
> > > > > >
> > > > > > Regards,
> > > > > > Paul Elschot
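On the multi-threading suggestion: a sketch of the three-thread setup Paul describes, assuming (as his note implies) that one IndexWriter can take addDocument() calls from several threads at once. The shared-iterator handoff and the thread count are illustrative:

    import java.io.IOException;
    import java.util.Iterator;

    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.IndexWriter;

    public class ThreadedAdder {
        // A few threads pull documents from a shared iterator and feed one
        // writer; Paul saw a moderate gain with three threads on one CPU.
        public static void addAll(final IndexWriter writer, final Iterator docs)
                throws InterruptedException {
            Thread[] workers = new Thread[3];
            for (int i = 0; i < workers.length; i++) {
                workers[i] = new Thread() {
                    public void run() {
                        while (true) {
                            Document doc;
                            synchronized (docs) { // hand out one document at a time
                                if (!docs.hasNext()) {
                                    return;
                                }
                                doc = (Document) docs.next();
                            }
                            try {
                                writer.addDocument(doc);
                            } catch (IOException e) {
                                throw new RuntimeException(e);
                            }
                        }
                    }
                };
                workers[i].start();
            }
            for (int i = 0; i < workers.length; i++) {
                workers[i].join();
            }
        }
    }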