[jira] Resolved: (LUCENE-702) Disk full during addIndexes(Directory[]) can corrupt index
[ http://issues.apache.org/jira/browse/LUCENE-702?page=all ] Michael McCandless resolved LUCENE-702. --- Fix Version/s: 2.1 Resolution: Fixed Disk full during addIndexes(Directory[]) can corrupt index -- Key: LUCENE-702 URL: http://issues.apache.org/jira/browse/LUCENE-702 Project: Lucene - Java Issue Type: Bug Components: Index Affects Versions: 2.1 Reporter: Michael McCandless Assigned To: Michael McCandless Fix For: 2.1 Attachments: LUCENE-702.patch, LUCENE-702.take2.patch, LUCENE-702.take3.patch This is a spinoff of LUCENE-555. If the disk fills up during this call then the committed segments file can reference segments that were not written. Then the whole index becomes unusable. Does anyone know of any other cases where disk full could corrupt the index? I think disk full should at worst lose the documents that were in flight at the time. It shouldn't corrupt the index. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: ThreadLocal leak (was Re: Leaking org.apache.lucene.index.* objects)
Otis, I ran into a similar problem when running a heavily loaded search application in a servlet container. The reason for using ThreadLocals was to get rid of synchronized method calls, e.g. in TermVectorsReader, which would drag down overall search performance. Currently I do not see an easy solution that fixes both the synchronization and the ThreadLocal problem. Bernhard Otis Gospodnetic wrote: Moving to java-dev, I think this belongs here. I've been looking at this problem some more today and reading about ThreadLocals. It's easy to misuse them and end up with memory leaks, apparently... and I think we may have this problem here. The problem here is that ThreadLocals are tied to Threads, and I think the assumption in TermInfosReader and SegmentReader is that (search) Threads are short-lived: they come in, scan the index, do the search, return and die. In this scenario, their ThreadLocals go to heaven with them, too, and memory is freed up. But when Threads are long-lived, as they are in thread pools (e.g. those in servlet containers), those ThreadLocals stay alive even after a single search request is done. Moreover, the Thread is reused, and the new TermInfosReader and SegmentReader put some new values in that ThreadLocal on top of the old values (I think) from the previous search request. Because the Thread still has references to ThreadLocals and the values in them, the values never get GCed. I tried making ThreadLocals in TIR and SR static, I tried wrapping values saved in TLs in WeakReference, I've tried using WeakHashMap like in Robert Engels' FixedThreadLocal class from LUCENE-436, but nothing helped. I thought about adding a public static method to TIR and SR, so one could call it at the end of a search request (think servlet filter) and clear the TL for the current thread, but that would require making TIR and SR public and I'm not 100% sure it would work, plus it exposes the implementation details too much. I don't have a solution yet.
But do we *really* need ThreadLocal in TIR and SR? The only thing that TL is doing there is acting as a per-thread storage of some cloned value (in TIR we clone SegmentTermEnum and in SR we clone TermVectorsReader). Why can't we just store those cloned values in instance variables? Isn't whoever is calling TIR and SR going to be calling the same instance of TIR and SR anyway, and thus get access to those cloned values? I'm really amazed that we haven't heard any reports about this before. I am not sure why my application started showing this leak only about 3 weeks ago. It is getting pounded on more than before, so maybe that made the leak more obvious. My guess is that more common Lucene usage is with a single index or a small number of them, and with short-lived threads, where this problem isn't easily visible. In my case I deal with a few tens of thousands of indices and several parallel search threads that live forever in the thread pool. Any thoughts about this or possible suggestions for a fix? Thanks, Otis - Original Message From: Otis Gospodnetic [EMAIL PROTECTED] To: java-user@lucene.apache.org Sent: Friday, December 15, 2006 12:28:29 PM Subject: Leaking org.apache.lucene.index.* objects Hi, About 2-3 weeks ago I emailed about a memory leak in my application. I then found some problems in my code (I wasn't closing IndexSearchers explicitly) and took care of those. Now I see my app is still leaking memory - jconsole clearly shows the Tenured Gen memory pool getting filled up until I hit the OOM, but I can't seem to pinpoint the source. I found that a bunch of o.a.l.index.* objects are not getting GCed, even though they should.
For example:

$ jmap -histo:live 7825 | grep apache.lucene.index | head -20 | sort -k2 -nr
num   #instances   #bytes     class name
--
 4:     1764840    98831040   org.apache.lucene.index.CompoundFileReader$CSIndexInput
 5:     2119215    67814880   org.apache.lucene.index.TermInfo
 7:     1112459    35598688   org.apache.lucene.index.SegmentReader$Norm
 9:     2132311    34116976   org.apache.lucene.index.Term
12:     1117897    26829528   org.apache.lucene.index.FieldInfo
13:      225340    18027200   org.apache.lucene.index.SegmentTermEnum
15:      589727    14153448   org.apache.lucene.index.TermBuffer
21:       86033     8718504   [Lorg.apache.lucene.index.TermInfo;
20:       86033     8718504   [Lorg.apache.lucene.index.Term;
23:       86120     7578560   org.apache.lucene.index.SegmentReader
26:       90501     5068056   org.apache.lucene.store.FSIndexInput
27:       86120     4822720   org.apache.lucene.index.TermInfosReader
33:       86130     3445200   org.apache.lucene.index.SegmentInfo
36:       87355     2795360   org.apache.lucene.store.FSIndexInput$Descriptor
38:       86120     2755840   org.apache.lucene.index.FieldsReader
39:       86050     2753600   org.apache.lucene.index.CompoundFileReader
42:       46903     2251344
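The leak pattern Otis describes can be reproduced with plain JDK code, independent of Lucene. This is a minimal sketch (the class and field names are invented for illustration, not Lucene's actual TIR/SR code): a value set through a ThreadLocal on a pooled thread survives the task that set it, and is only released when the thread dies or remove() is called.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ThreadLocalLeakDemo {
    // One slot per thread; on a pooled (long-lived) thread the value
    // outlives the task that set it, just as a search request's clone would.
    static final ThreadLocal<byte[]> PER_THREAD_CACHE = new ThreadLocal<byte[]>();

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(1); // one reused thread

        // "Request 1" parks a large value in the thread's ThreadLocal map.
        pool.submit(() -> PER_THREAD_CACHE.set(new byte[1024 * 1024])).get();

        // "Request 2", a later unrelated task on the same thread, still sees it:
        boolean survived = pool.submit(() -> PER_THREAD_CACHE.get() != null).get();
        System.out.println("survived across tasks: " + survived); // true

        // The per-request fix (e.g. from a servlet filter): clear the slot.
        pool.submit(() -> PER_THREAD_CACHE.remove()).get();
        boolean cleared = pool.submit(() -> PER_THREAD_CACHE.get() == null).get();
        System.out.println("cleared after remove(): " + cleared); // true

        pool.shutdown();
    }
}
```

This is why short-lived threads never show the problem (the ThreadLocal map dies with the thread) while servlet-container pools accumulate one stale value per pooled thread per reader.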
[jira] Commented: (LUCENE-748) Exception during IndexWriter.close() prevents release of the write.lock
[ http://issues.apache.org/jira/browse/LUCENE-748?page=comments#action_12459419 ] Michael McCandless commented on LUCENE-748: --- I think this (not releasing write lock on hitting an exception) is actually by design. It's because the writer still has pending changes to commit to disk. And, with the fix for LUCENE-702 (just committed), if we hit an exception during IndexWriter.close(), the IndexWriter is left in a consistent state (this is not quite the case pre-2.1). Meaning, if you caught that exception, fixed the root cause (say freed up disk space), and called close again (successfully), you would not have lost any documents, and the write lock will be released. I can also see that if we did release the write lock on exception, this could dangerously / easily mask the fact that there was an exception. Ie, if the IOException is caught and ignored (or writes a message but nobody sees it), and the write lock was released, then you could go for quite a while before discovering eg that new docs weren't visible in the index. Whereas, keeping the write lock held on exception will cause much faster discovery of the problem (eg when the next writer tries to instantiate). I think this is the right exception semantics to aim for? Ie if the close did not succeed we should not release the write lock (because we still have pending changes). 
Then, if you want to force releasing of the write lock, you can still do something like this:

try {
    writer.close();
} finally {
    if (IndexReader.isLocked(directory)) {
        IndexReader.unlock(directory);
    }
}

Exception during IndexWriter.close() prevents release of the write.lock --- Key: LUCENE-748 URL: http://issues.apache.org/jira/browse/LUCENE-748 Project: Lucene - Java Issue Type: Bug Affects Versions: 1.9 Environment: Lucene 1.4 through 2.1 HEAD (as of 2006-12-14) Reporter: Jed Wesley-Smith After encountering a case of index corruption - see http://issues.apache.org/jira/browse/LUCENE-140 - when the close() method encounters an exception in the flushRamSegments() method, the index write.lock is not released (i.e. it is not really closed). The write lock is only released when the IndexWriter is GC'd and finalize() is called.
[jira] Closed: (LUCENE-658) upload major releases to ibiblio
[ http://issues.apache.org/jira/browse/LUCENE-658?page=all ] Michael McCandless closed LUCENE-658. - Resolution: Duplicate Dup of LUCENE-551 upload major releases to ibiblio Key: LUCENE-658 URL: http://issues.apache.org/jira/browse/LUCENE-658 Project: Lucene - Java Issue Type: Task Components: Other Affects Versions: 1.9, 2.0.0 Reporter: Ryan Sonnek I'm a current user of Maven, and the latest 1.9 and 2.0 releases are not available on ibiblio. http://www.ibiblio.org/maven2/lucene/lucene/ Could someone upload the latest versions so that us Maven heads can access the new features?
[jira] Resolved: (LUCENE-734) Upload Lucene 2.0 artifacts in the Maven 1 repository
[ http://issues.apache.org/jira/browse/LUCENE-734?page=all ] Michael McCandless resolved LUCENE-734. --- Resolution: Fixed From the last comment, it looks like the 2.0 Lucene core JAR is in Maven 1. Upload Lucene 2.0 artifacts in the Maven 1 repository - Key: LUCENE-734 URL: http://issues.apache.org/jira/browse/LUCENE-734 Project: Lucene - Java Issue Type: Task Components: Other Reporter: Jukka Zitting Priority: Minor The Lucene 2.0 artifacts can be found in the Maven 2 repository, but not in the Maven 1 repository. There are still projects using Maven 1 that might be interested in upgrading to Lucene 2, so having the artifacts also in the Maven 1 repository would be very helpful.
Re: ThreadLocal leak (was Re: Leaking org.apache.lucene.index.* objects)
There is no inherent problem with ThreadLocal. It is a viable solution to synchronization issues in most cases. On Dec 18, 2006, at 11:25 AM, Bernhard Messer wrote: [quoted text trimmed]
Re: potential indexing performance improvement for compound index - cut IO - have more files though
A word of caution here... Using a shared FileChannel.pread actually performs a synchronization under Windows. See JDK bug http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6265734 I submitted this, and it was verified using the supplied test case. On Dec 17, 2006, at 1:31 PM, Doug Cutting wrote: Doron Cohen wrote: Also, if nio proves to be faster in this scenario, it might make sense to keep current FSDirectory, and just add FSDirectoryNio implementation. If nio isn't considerably slower for single-threaded applications, I'd vote to simply switch FSDirectory to use nio, simplifying the public API by reducing choices. But if classic io is faster for single-threaded apps, and nio faster for multi-threaded, that would suggest adding a new, public, nio-based Directory implementation. Doug
Re: potential indexing performance improvement for compound index - cut IO - have more files though
robert engels wrote: Using a shared FileChannel.pread actually performs a synchronization under Windows. Sigh. Still, it'd be no worse than current FSDirectory on Windows. Doug
Re: potential indexing performance improvement for compound index - cut IO - have more files though
I think the important issues are index size, stability and number of concurrent readers. We achieved the best performance by using a pool of file descriptors to a segment so we could avoid the synchronization block, but this only worked for large, relatively unchanging segments. On Dec 18, 2006, at 2:51 PM, Doug Cutting wrote: robert engels wrote: Using a shared FileChannel.pread actually performs a synchronization under Windows. Sigh. Still, it'd be no worse than current FSDirectory on Windows. Doug
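For readers following along: the JDK has no method literally named pread; the thread is referring to FileChannel's positional read, read(ByteBuffer, long), the NIO analogue of POSIX pread. A minimal sketch of the pattern under discussion (the temp file and its contents are throwaway illustration, not Lucene code); the Windows caveat Robert raises in bug 6265734 is that this call is internally synchronized on that platform:

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

public class PositionalReadDemo {
    public static void main(String[] args) throws Exception {
        File tmp = File.createTempFile("pread-demo", ".bin");
        try (FileOutputStream out = new FileOutputStream(tmp)) {
            out.write("hello positional reads".getBytes("US-ASCII"));
        }

        try (RandomAccessFile raf = new RandomAccessFile(tmp, "r");
             FileChannel ch = raf.getChannel()) {
            // read(ByteBuffer, long) reads at an absolute offset and does NOT
            // move the channel's position, so multiple threads can share one
            // descriptor without racing on a seek pointer -- which is the whole
            // appeal over plain seek+read in FSDirectory. On Windows, however,
            // the JDK synchronizes and seeks internally (bug 6265734), so the
            // contention comes back.
            ByteBuffer buf = ByteBuffer.allocate(10);
            int n = ch.read(buf, 6); // absolute offset 6
            System.out.println(new String(buf.array(), 0, n, "US-ASCII")); // "positional"
        } finally {
            tmp.delete();
        }
    }
}
```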
[jira] Closed: (LUCENE-603) index optimize problem
[ http://issues.apache.org/jira/browse/LUCENE-603?page=all ] Michael McCandless closed LUCENE-603. - Resolution: Duplicate This looks like a dup of LUCENE-140 index optimize problem -- Key: LUCENE-603 URL: http://issues.apache.org/jira/browse/LUCENE-603 Project: Lucene - Java Issue Type: Bug Components: Index Affects Versions: 1.9 Environment: CentOS 4.0, Lucene 1.9, Eclipse 3.1 Reporter: Dedian Guo I have a function that loops to index batches of documents; after each batch, IndexWriter.optimize() is applied. After several iterations (not sure how many, but it should be many), the following exception was thrown:

Exception in thread Thread-0 java.lang.IllegalStateException: docs out of order
    at org.apache.lucene.index.SegmentMerger.appendPostings(SegmentMerger.java:335)
    at org.apache.lucene.index.SegmentMerger.mergeTermInfo(SegmentMerger.java:298)
    at org.apache.lucene.index.SegmentMerger.mergeTermInfos(SegmentMerger.java:272)
    at org.apache.lucene.index.SegmentMerger.mergeTerms(SegmentMerger.java:236)
    at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:89)
    at org.apache.lucene.index.IndexWriter.mergeSegments(IndexWriter.java:681)
    at org.apache.lucene.index.IndexWriter.mergeSegments(IndexWriter.java:658)
    at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:517)
[jira] Commented: (LUCENE-140) docs out of order
[ http://issues.apache.org/jira/browse/LUCENE-140?page=comments#action_12459457 ] Michael McCandless commented on LUCENE-140: --- I just resolved LUCENE-603 as a dup of this issue. It would be awesome if we could get a test case that shows this happening. Enough people seem to hit it that it seems likely something is lurking out there, so I'd like to get it fixed!! docs out of order - Key: LUCENE-140 URL: http://issues.apache.org/jira/browse/LUCENE-140 Project: Lucene - Java Issue Type: Bug Components: Index Affects Versions: unspecified Environment: Operating System: Linux Platform: PC Reporter: legez Assigned To: Lucene Developers Attachments: bug23650.txt, corrupted.part1.rar, corrupted.part2.rar Hello, I cannot find out why (and what) is happening all the time. I got an exception:

java.lang.IllegalStateException: docs out of order
    at org.apache.lucene.index.SegmentMerger.appendPostings(SegmentMerger.java:219)
    at org.apache.lucene.index.SegmentMerger.mergeTermInfo(SegmentMerger.java:191)
    at org.apache.lucene.index.SegmentMerger.mergeTermInfos(SegmentMerger.java:172)
    at org.apache.lucene.index.SegmentMerger.mergeTerms(SegmentMerger.java:135)
    at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:88)
    at org.apache.lucene.index.IndexWriter.mergeSegments(IndexWriter.java:341)
    at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:250)
    at Optimize.main(Optimize.java:29)

It happens in both 1.2 and 1.3rc1 (anyway, what happened to that release? I cannot find it in the downloads or the version list in this form). Everything seems OK: I can search through the index, but I cannot optimize it. Even worse, after this exception, every time I add new documents and close the IndexWriter a new segment is created! I think it has all the documents added before, because of its size. My index is quite big: 500,000 docs, about 5 GB of index directory. It is _repeatable_. I drop the index and reindex everything. Afterwards I add a few docs, try to optimize and receive the above exception.
My documents' structure is:

static Document indexIt(String id_strony, Reader reader, String data_wydania, String id_wydania, String id_gazety, String data_wstawienia) {
    Document doc = new Document();
    doc.add(Field.Keyword("id", id_strony));
    doc.add(Field.Keyword("data_wydania", data_wydania));
    doc.add(Field.Keyword("id_wydania", id_wydania));
    doc.add(Field.Text("id_gazety", id_gazety));
    doc.add(Field.Keyword("data_wstawienia", data_wstawienia));
    doc.add(Field.Text("tresc", reader));
    return doc;
}

Sincerely, legez
[jira] Assigned: (LUCENE-129) Finalizers are non-canonical
[ http://issues.apache.org/jira/browse/LUCENE-129?page=all ] Michael McCandless reassigned LUCENE-129: - Assignee: Michael McCandless (was: Lucene Developers) Finalizers are non-canonical Key: LUCENE-129 URL: http://issues.apache.org/jira/browse/LUCENE-129 Project: Lucene - Java Issue Type: Bug Components: Other Affects Versions: unspecified Environment: Operating System: other Platform: All Reporter: Esmond Pitt Assigned To: Michael McCandless Priority: Minor The canonical form of a Java finalizer is:

protected void finalize() throws Throwable {
    try {
        // ... local code to finalize this class
    } catch (Throwable t) {
    }
    super.finalize(); // finalize base class.
}

The finalizers in IndexReader, IndexWriter, and FSDirectory don't conform. This is probably minor or null in effect, but the principle is important. As a matter of fact FSDirectory.finalize() is entirely redundant and could be removed, as it doesn't do anything beyond what RandomAccessFile.finalize would do automatically.
[jira] Commented: (LUCENE-301) Index Writer constructor flags unclear - and annoying in certain cases
[ http://issues.apache.org/jira/browse/LUCENE-301?page=comments#action_12459479 ] Michael McCandless commented on LUCENE-301: --- I like Doug's solution: add a new constructor, IndexWriter(Directory, Analyzer), which, if no such index exists, creates one -- I think this makes sense. I will commit this. Index Writer constructor flags unclear - and annoying in certain cases -- Key: LUCENE-301 URL: http://issues.apache.org/jira/browse/LUCENE-301 Project: Lucene - Java Issue Type: Improvement Components: Index Affects Versions: 1.4 Environment: Operating System: other Platform: Other Reporter: Dan Armbrust Assigned To: Lucene Developers Priority: Minor Wouldn't it make more sense if the constructor for the IndexWriter always created an index if it doesn't exist - and the boolean parameter were named clear (instead of create)? So instead of this (from the javadoc): IndexWriter public IndexWriter(Directory d, Analyzer a, boolean create) throws IOException Constructs an IndexWriter for the index in d. Text will be analyzed with a. If create is true, then a new, empty index will be created in d, replacing the index already there, if any. Parameters: d - the index directory a - the analyzer to use create - true to create the index or overwrite the existing one; false to append to the existing index Throws: IOException - if the directory cannot be read/written to, or if it does not exist, and create is false We would have this: IndexWriter public IndexWriter(Directory d, Analyzer a, boolean clear) throws IOException Constructs an IndexWriter for the index in d. Text will be analyzed with a. If clear is true, and an index exists at location d, then it will be erased, and a new, empty index will be created in d. Parameters: d - the index directory a - the analyzer to use clear - true to overwrite the existing one; false to append to the existing index Throws: IOException - if the directory cannot be read/written to, or if it does not exist.
Its current behavior is kind of annoying, because I have an app that should never clear an existing index; it should always append. So I want create set to false. But when I am starting a brand new index, I have to change the create flag to keep it from throwing an exception... I guess for now I will have to write code to check if an index actually has content yet, and if it doesn't, change the flag on the fly.
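Dan's interim workaround (check whether an index exists before choosing the create flag) can be sketched with plain JDK code. The helper below is hypothetical, not Lucene API: it assumes the pre-2.1 on-disk convention that a Lucene index directory contains a "segments" file; real code could use IndexReader.indexExists(directory) instead of peeking at files directly.

```java
import java.io.File;

public class CreateFlagWorkaround {
    // Hypothetical helper: only pass create=true to IndexWriter when no index
    // is present, so an existing index is never wiped. Assumes the pre-2.1
    // convention that an index directory holds a "segments" file.
    static boolean chooseCreateFlag(File indexDir) {
        return !new File(indexDir, "segments").exists();
    }

    public static void main(String[] args) throws Exception {
        // A fresh, empty directory stands in for a brand-new index location.
        File dir = File.createTempFile("idx-demo", "");
        dir.delete();
        dir.mkdir();

        System.out.println("empty dir -> create=" + chooseCreateFlag(dir)); // true

        new File(dir, "segments").createNewFile(); // simulate an existing index
        System.out.println("existing index -> create=" + chooseCreateFlag(dir)); // false

        // Usage sketch: new IndexWriter(dir, analyzer, chooseCreateFlag(dir));
    }
}
```

The new IndexWriter(Directory, Analyzer) constructor Michael commits for LUCENE-301 makes this dance unnecessary by doing the existence check internally.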
[jira] Commented: (LUCENE-301) Index Writer constructor flags unclear - and annoying in certain cases
[ http://issues.apache.org/jira/browse/LUCENE-301?page=comments#action_12459479 ] Michael McCandless commented on LUCENE-301: --- I like Doug's solution add a new constructor, IndexWriter(Directory, Analyzer) which, if no such index exists, creates one -- I think this makses sense. I will commit this. Index Writer constructor flags unclear - and annoying in certain cases -- Key: LUCENE-301 URL: http://issues.apache.org/jira/browse/LUCENE-301 Project: Lucene - Java Issue Type: Improvement Components: Index Affects Versions: 1.4 Environment: Operating System: other Platform: Other Reporter: Dan Armbrust Assigned To: Lucene Developers Priority: Minor Wouldn't it make more sense if the constructor for the IndexWriter always created an index if it doesn't exist - and the boolean parameter should be clear (instead of create) So instead of this (from javadoc): IndexWriter public IndexWriter(Directory d, Analyzer a, boolean create) throws IOException Constructs an IndexWriter for the index in d. Text will be analyzed with a. If create is true, then a new, empty index will be created in d, replacing the index already there, if any. Parameters: d - the index directory a - the analyzer to use create - true to create the index or overwrite the existing one; false to append to the existing index Throws: IOException - if the directory cannot be read/written to, or if it does not exist, and create is false We would have this: IndexWriter public IndexWriter(Directory d, Analyzer a, boolean clear) throws IOException Constructs an IndexWriter for the index in d. Text will be analyzed with a. If clear is true, and a index exists at location d, then it will be erased, and a new, empty index will be created in d. Parameters: d - the index directory a - the analyzer to use clear - true to overwrite the existing one; false to append to the existing index Throws: IOException - if the directory cannot be read/written to, or if it does not exist. 
Its current behavior is kind of annoying, because I have an app that should never clear an existing index; it should always append. So I want create set to false. But when I am starting a brand new index, I then have to change the create flag to keep it from throwing an exception... I guess for now I will have to write code to check whether an index actually has content yet, and if it doesn't, change the flag on the fly. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
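The on-the-fly check Dan describes can be sketched as follows. This is an illustrative model rather than Lucene's own API: the class and method names are hypothetical, and the segments-file check stands in for whatever index-existence test the application uses (Lucene 1.4's file format does mark an index directory with a "segments" file).

```java
import java.io.File;

// Hypothetical sketch of the workaround described above: decide the value
// of the 'create' flag at runtime so the app never clears an existing index.
public class CreateFlagWorkaround {

    // An index directory is assumed to contain a "segments" file
    // (true for Lucene 1.4's on-disk format; adjust for other layouts).
    static boolean indexExists(File dir) {
        return new File(dir, "segments").exists();
    }

    // Pass the result as the third IndexWriter constructor argument:
    // create only when no index is present yet, append otherwise.
    static boolean createFlagFor(File dir) {
        return !indexExists(dir);
    }
}
```

With the IndexWriter(Directory, Analyzer) constructor proposed in the comment, this check would move into Lucene itself.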
[jira] Assigned: (LUCENE-301) Index Writer constructor flags unclear - and annoying in certain cases
[ http://issues.apache.org/jira/browse/LUCENE-301?page=all ] Michael McCandless reassigned LUCENE-301: Assignee: Michael McCandless (was: Lucene Developers)
[jira] Commented: (LUCENE-565) Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)
[ http://issues.apache.org/jira/browse/LUCENE-565?page=comments#action_12459482 ] Paul Elschot commented on LUCENE-565: - I'd like to give this a try over the upcoming holidays. Would it be possible to post a single patch? A single patch can be made by locally svn add'ing all new files and then doing an svn diff on all files involved from the top directory. Regards, Paul Elschot

Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)
Key: LUCENE-565
URL: http://issues.apache.org/jira/browse/LUCENE-565
Project: Lucene - Java
Issue Type: Bug
Components: Index
Reporter: Ning Li
Attachments: IndexWriter.java, IndexWriter.July09.patch, IndexWriter.patch, KeepDocCount0Segment.Sept15.patch, NewIndexModifier.July09.patch, NewIndexModifier.Sept21.patch, NewIndexWriter.Aug23.patch, NewIndexWriter.July18.patch, newMergePolicy.Sept08.patch, perf-test-res.JPG, perf-test-res2.JPG, perfres.log, TestBufferedDeletesPerf.java, TestWriterDelete.java

Today, applications have to open/close an IndexWriter and open/close an IndexReader directly or indirectly (via IndexModifier) in order to handle a mix of inserts and deletes. This performs well when inserts and deletes come in fairly large batches. However, the performance can degrade dramatically when inserts and deletes are interleaved in small batches. This is because the ramDirectory is flushed to disk whenever an IndexWriter is closed, causing a lot of small segments to be created on disk, which eventually need to be merged. We would like to propose a small API change to eliminate this problem. We are aware that this kind of change has come up in discussions before. See http://www.gossamer-threads.com/lists/lucene/java-dev/23049?search_string=indexwriter%20delete;#23049 . The difference this time is that we have implemented the change and tested its performance, as described below.

API Changes
We propose adding a deleteDocuments(Term term) method to IndexWriter.
Using this method, inserts and deletes can be interleaved using the same IndexWriter. Note that with this change, it would be very easy to add another method to IndexWriter for updating documents, allowing applications to avoid a separate delete and insert to update a document. Also note that this change can co-exist with the existing APIs for deleting documents using an IndexReader. But if our proposal is accepted, we think those APIs should probably be deprecated.

Coding Changes
Coding changes are localized to IndexWriter. Internally, the new deleteDocuments() method works by buffering the terms to be deleted. Deletes are deferred until the ramDirectory is flushed to disk, either because it becomes full or because the IndexWriter is closed. Using Java synchronization, care is taken to ensure that an interleaved sequence of inserts and deletes for the same document is properly serialized. We have attached a modified version of IndexWriter in Release 1.9.1 with these changes. Only a few hundred lines of coding changes are needed. All changes are marked with a CHANGE comment. We have also attached a modified version of an example from Chapter 2.2 of Lucene in Action.

Performance Results
To test the performance of our proposed changes, we ran some experiments using the TREC WT 10G dataset. The experiments were run on a dual 2.4 GHz Intel Xeon server running Linux. The disk storage was configured as a RAID0 array with 5 drives. Before indexes were built, the input documents were parsed to remove the HTML from them (i.e., only the text was indexed). This was done to minimize the impact of parsing on performance. A simple WhitespaceAnalyzer was used during index build. We experimented with three workloads:
- Insert only. 1.6M documents were inserted and the final index size was 2.3GB.
- Insert/delete (big batches). The same documents were inserted, but 25% were deleted. 1000 documents were deleted for every 4000 inserted.
- Insert/delete (small batches).
In this case, 5 documents were deleted for every 20 inserted.

Workload                        current IndexWriter   current IndexModifier   new IndexWriter
Insert only                     116 min               119 min                 116 min
Insert/delete (big batches)     --                    135 min                 125 min
Insert/delete (small batches)   --                    338 min                 134 min

As the experiments show, with the proposed changes the performance improved by 60% when inserts and deletes were interleaved in small batches.

Regards, Ning

Ning Li
Search Technologies
IBM Almaden Research Center
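The buffering scheme described above can be modeled with a short, self-contained sketch. This is not Lucene's actual code (the in-memory "index" here is just a list of terms, and all names are hypothetical); it only illustrates the invariant the proposal's synchronization guarantees: a buffered delete affects only documents added before the deleteDocuments() call, so an interleaved insert of a matching document survives the flush.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model of buffered deletes: each buffered delete remembers how many
// documents existed when it was issued, and is applied at flush time.
public class BufferedDeletesModel {
    private final List<String> docs = new ArrayList<>();          // doc id -> term
    private final Map<String, Integer> bufferedDeletes = new HashMap<>();

    public synchronized void addDocument(String term) {
        docs.add(term);
    }

    // Buffer the delete; it must only affect documents added before this call.
    public synchronized void deleteDocuments(String term) {
        bufferedDeletes.put(term, docs.size());   // doc ids < size are affected
    }

    // Apply buffered deletes, as the proposal does when the RAM buffer
    // is flushed or the writer is closed; returns the surviving docs.
    public synchronized List<String> flush() {
        List<String> surviving = new ArrayList<>();
        for (int id = 0; id < docs.size(); id++) {
            Integer limit = bufferedDeletes.get(docs.get(id));
            if (limit == null || id >= limit) surviving.add(docs.get(id));
        }
        bufferedDeletes.clear();
        return surviving;
    }
}
```

For example, after addDocument("a"), addDocument("b"), deleteDocuments("a"), addDocument("a"), a flush keeps "b" and the second "a": the delete was issued before the second insert, so it does not touch it.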
[jira] Commented: (LUCENE-748) Exception during IndexWriter.close() prevents release of the write.lock
[ http://issues.apache.org/jira/browse/LUCENE-748?page=comments#action_12459483 ] Hoss Man commented on LUCENE-748: - Given the changes made in LUCENE-702, I concur with your assessment Michael: keeping the lock open so that the caller can attempt to deal with the problem and then retry makes sense. Even if we decided that the consistent state of the IndexWriter isn't an invariant that the user can rely on, asking users to forcibly unlock in the event of an exception on close seems like a more reasonable expectation than forcibly unlocking for them automatically.

Exception during IndexWriter.close() prevents release of the write.lock
Key: LUCENE-748
URL: http://issues.apache.org/jira/browse/LUCENE-748
Project: Lucene - Java
Issue Type: Bug
Affects Versions: 1.9
Environment: Lucene 1.4 through 2.1 HEAD (as of 2006-12-14)
Reporter: Jed Wesley-Smith

After encountering a case of index corruption - see http://issues.apache.org/jira/browse/LUCENE-140 - when the close() method encounters an exception in the flushRamSegments() method, the index write.lock is not released (i.e. it is not really closed). The write lock is only released when the IndexWriter is GC'd and finalize() is called.
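The caller-side handling this comment argues for can be sketched as follows. The Writer interface and retry loop are illustrative, not Lucene's API: because a failed close() leaves the write lock held, the application can address the underlying problem (e.g. free disk space) and simply call close() again, rather than having the index forcibly unlocked for it.

```java
import java.io.IOException;

public class RetryClose {

    // Stand-in for the part of IndexWriter's contract under discussion:
    // close() may throw and leave the write lock held.
    interface Writer {
        void close() throws IOException;
    }

    // Retry close() up to maxAttempts times; rethrow the last failure so
    // the caller can decide whether to forcibly unlock the index itself.
    static void closeWithRetry(Writer w, int maxAttempts) throws IOException {
        IOException last = null;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            try {
                w.close();
                return;            // lock released by a successful close
            } catch (IOException e) {
                last = e;          // e.g. disk full: free space, then retry
            }
        }
        throw last;
    }
}
```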
[jira] Updated: (LUCENE-565) Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)
[ http://issues.apache.org/jira/browse/LUCENE-565?page=all ] Ning Li updated LUCENE-565: Attachment: (was: IndexWriter.java)
[jira] Updated: (LUCENE-565) Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)
[ http://issues.apache.org/jira/browse/LUCENE-565?page=all ] Ning Li updated LUCENE-565: Attachment: (was: IndexWriter.July09.patch)
[jira] Updated: (LUCENE-565) Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)
[ http://issues.apache.org/jira/browse/LUCENE-565?page=all ] Ning Li updated LUCENE-565: Attachment: (was: IndexWriter.patch)
[jira] Updated: (LUCENE-565) Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)
[ http://issues.apache.org/jira/browse/LUCENE-565?page=all ] Ning Li updated LUCENE-565: Attachment: (was: NewIndexModifier.July09.patch)
[jira] Updated: (LUCENE-565) Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)
[ http://issues.apache.org/jira/browse/LUCENE-565?page=all ] Ning Li updated LUCENE-565:
---
Attachment: (was: NewIndexWriter.Aug23.patch)

Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)
---------------------------------------------------------------------------------
Key: LUCENE-565
URL: http://issues.apache.org/jira/browse/LUCENE-565
Project: Lucene - Java
Issue Type: Bug
Components: Index
Reporter: Ning Li
Attachments: KeepDocCount0Segment.Sept15.patch, NewIndexModifier.Sept21.patch, newMergePolicy.Sept08.patch, perf-test-res.JPG, perf-test-res2.JPG, perfres.log, TestBufferedDeletesPerf.java

Today, applications have to open/close an IndexWriter and open/close an IndexReader directly or indirectly (via IndexModifier) in order to handle a mix of inserts and deletes. This performs well when inserts and deletes come in fairly large batches. However, performance can degrade dramatically when inserts and deletes are interleaved in small batches. This is because the ramDirectory is flushed to disk whenever an IndexWriter is closed, causing a lot of small segments to be created on disk, which eventually need to be merged.

We would like to propose a small API change to eliminate this problem. We are aware that this kind of change has come up in discussions before; see http://www.gossamer-threads.com/lists/lucene/java-dev/23049?search_string=indexwriter%20delete;#23049 . The difference this time is that we have implemented the change and tested its performance, as described below.

API Changes
-----------
We propose adding a deleteDocuments(Term term) method to IndexWriter. Using this method, inserts and deletes can be interleaved using the same IndexWriter. Note that, with this change, it would be very easy to add another method to IndexWriter for updating documents, allowing applications to avoid a separate delete and insert to update a document. Also note that this change can co-exist with the existing APIs for deleting documents using an IndexReader. But if our proposal is accepted, we think those APIs should probably be deprecated.

Coding Changes
--------------
Coding changes are localized to IndexWriter. Internally, the new deleteDocuments() method works by buffering the terms to be deleted. Deletes are deferred until the ramDirectory is flushed to disk, either because it becomes full or because the IndexWriter is closed. Using Java synchronization, care is taken to ensure that an interleaved sequence of inserts and deletes for the same document is properly serialized. We have attached a modified version of IndexWriter in Release 1.9.1 with these changes. Only a few hundred lines of coding changes are needed; all changes are commented with CHANGE. We have also attached a modified version of an example from Chapter 2.2 of Lucene in Action.

Performance Results
-------------------
To test the performance of our proposed changes, we ran some experiments using the TREC WT 10G dataset. The experiments were run on a dual 2.4 GHz Intel Xeon server running Linux. The disk storage was configured as a RAID0 array with 5 drives. Before indexes were built, the input documents were parsed to remove the HTML from them (i.e., only the text was indexed). This was done to minimize the impact of parsing on performance. A simple WhitespaceAnalyzer was used during index build.

We experimented with three workloads:
- Insert only. 1.6M documents were inserted and the final index size was 2.3GB.
- Insert/delete (big batches). The same documents were inserted, but 25% were deleted. 1000 documents were deleted for every 4000 inserted.
- Insert/delete (small batches). In this case, 5 documents were deleted for every 20 inserted.

                               current      current        new
Workload                       IndexWriter  IndexModifier  IndexWriter
----------------------------------------------------------------------
Insert only                    116 min      119 min        116 min
Insert/delete (big batches)    --           135 min        125 min
Insert/delete (small batches)  --           338 min        134 min

As the experiments show, with the proposed changes, the performance improved by 60% when inserts and deletes were interleaved in small batches.

Regards,
Ning

Ning Li
Search Technologies
IBM Almaden Research Center
650 Harry Road
San Jose, CA 95120

--
This message is automatically generated by JIRA.
- If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
- For more information on JIRA, see: http://www.atlassian.com/software/jira
- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
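The buffering scheme described under "Coding Changes" can be illustrated with a toy in-memory model: adds and delete-by-term calls are queued in an ordered log and only applied at flush time, so an interleaved sequence of inserts and deletes for the same term stays properly serialized. This is NOT Lucene code; every class and method name here is a hypothetical stand-in for illustration only.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Toy model of buffered deletes in a writer (hypothetical, not Lucene).
public class BufferedDeleteSketch {
    private final List<String[]> log = new ArrayList<>();        // buffered {op, idTerm} pairs
    private final Set<String> committed = new LinkedHashSet<>(); // stands in for on-disk docs

    // Analogous to IndexWriter.addDocument: buffered, not yet on "disk".
    public synchronized void addDocument(String idTerm) {
        log.add(new String[] {"add", idTerm});
    }

    // Analogous to the proposed IndexWriter.deleteDocuments(Term).
    public synchronized void deleteDocuments(String idTerm) {
        log.add(new String[] {"del", idTerm});
    }

    // Replay the log in order: a delete removes earlier adds, but an add
    // issued after a delete survives.
    public synchronized void flush() {
        for (String[] entry : log) {
            if (entry[0].equals("add")) committed.add(entry[1]);
            else committed.remove(entry[1]);
        }
        log.clear();
    }

    public synchronized Set<String> docs() {
        return new LinkedHashSet<>(committed);
    }
}
```

Because deletes are only reconciled at flush time, small interleaved batches never force intermediate segment writes, which is the source of the speedup reported in the table above.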
[jira] Updated: (LUCENE-565) Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)
[ http://issues.apache.org/jira/browse/LUCENE-565?page=all ] Ning Li updated LUCENE-565:
---
Attachment: (was: NewIndexWriter.July18.patch)
[jira] Updated: (LUCENE-565) Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)
[ http://issues.apache.org/jira/browse/LUCENE-565?page=all ] Ning Li updated LUCENE-565:
---
Attachment: (was: TestWriterDelete.java)
[jira] Updated: (LUCENE-565) Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)
[ http://issues.apache.org/jira/browse/LUCENE-565?page=all ] Ning Li updated LUCENE-565:
---
Attachment: (was: KeepDocCount0Segment.Sept15.patch)
[jira] Updated: (LUCENE-565) Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)
[ http://issues.apache.org/jira/browse/LUCENE-565?page=all ] Ning Li updated LUCENE-565:
---
Attachment: (was: newMergePolicy.Sept08.patch)
[jira] Commented: (LUCENE-748) Exception during IndexWriter.close() prevents release of the write.lock
[ http://issues.apache.org/jira/browse/LUCENE-748?page=comments#action_12459489 ] Jed Wesley-Smith commented on LUCENE-748:
---

I guess, particularly in light of LUCENE-702, that this behavior is OK, and IndexReader.unlock(dir) is a good suggestion. My real problem was that the finalize() method does eventually remove the write lock. My suggestion, then, would be to document the exceptional behavior of the close() method (i.e. it means that changes haven't been written and the write lock is still held) and link to the IndexReader.unlock(Directory) method.

Exception during IndexWriter.close() prevents release of the write.lock
-----------------------------------------------------------------------
Key: LUCENE-748
URL: http://issues.apache.org/jira/browse/LUCENE-748
Project: Lucene - Java
Issue Type: Bug
Affects Versions: 1.9
Environment: Lucene 1.4 through 2.1 HEAD (as of 2006-12-14)
Reporter: Jed Wesley-Smith

After encountering a case of index corruption - see http://issues.apache.org/jira/browse/LUCENE-140 - when the close() method encounters an exception in the flushRamSegments() method, the index write.lock is not released (i.e. it is not really closed). The write lock is only released when the IndexWriter is GC'd and finalize() is called.
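The handling the comment suggests can be sketched as follows: treat an exception from close() as "changes were not committed and the write lock is still held", and release the lock explicitly in the catch path (Lucene's IndexReader.unlock plays that role). FileLockSim and FlakyWriter below are hypothetical stand-ins, not Lucene classes.

```java
// Sketch of releasing a write lock when close() throws (hypothetical classes).
public class LockReleaseSketch {
    static class FileLockSim {
        private boolean held;
        void acquire() { held = true; }
        void release() { held = false; }
        boolean isHeld() { return held; }
    }

    static class FlakyWriter {
        final FileLockSim lock;
        FlakyWriter(FileLockSim lock) {
            this.lock = lock;
            lock.acquire(); // the writer takes the write lock on open
        }
        // Mimics IndexWriter.close() failing mid-flush: it throws without
        // releasing the lock.
        void close() { throw new RuntimeException("disk full during flush"); }
    }

    // Returns true when the lock ends up released even though close() threw.
    public static boolean closeAndUnlock(FileLockSim lock) {
        FlakyWriter writer = new FlakyWriter(lock);
        try {
            writer.close();
        } catch (RuntimeException e) {
            lock.release(); // analogous to calling IndexReader.unlock(directory)
        }
        return !lock.isHeld();
    }
}
```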
[jira] Commented: (LUCENE-565) Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)
[ http://issues.apache.org/jira/browse/LUCENE-565?page=comments#action_12459490 ] Ning Li commented on LUCENE-565:
---

Many versions of the patch were submitted as new code was committed to IndexWriter.java. For each version, all changes made were included in a single patch file. I removed all but the latest version of the patch. Even this one is outdated by the commit of LUCENE-701 (lock-less commits). I was waiting for the commit of LUCENE-702 before submitting another patch. LUCENE-702 was committed this morning, so I'll submit an up-to-date patch over the holidays.

On 12/18/06, Paul Elschot (JIRA) [EMAIL PROTECTED] wrote:
> I'd like to give this a try over the upcoming holidays.

That's great! We can discuss/compare the designs then. Or we can discuss/compare the designs before submitting new patches.
[jira] Commented: (LUCENE-748) Exception during IndexWriter.close() prevents release of the write.lock
[ http://issues.apache.org/jira/browse/LUCENE-748?page=comments#action_12459491 ] Michael McCandless commented on LUCENE-748:
---

OK, I will update the javadoc for IndexWriter.close to make this clear. Thanks!
[jira] Assigned: (LUCENE-748) Exception during IndexWriter.close() prevents release of the write.lock
[ http://issues.apache.org/jira/browse/LUCENE-748?page=all ] Michael McCandless reassigned LUCENE-748:
---
Assignee: Michael McCandless
[jira] Commented: (LUCENE-748) Exception during IndexWriter.close() prevents release of the write.lock
[ http://issues.apache.org/jira/browse/LUCENE-748?page=comments#action_12459502 ] Jed Wesley-Smith commented on LUCENE-748:
---

Awesome, thanks!
[jira] Commented: (LUCENE-565) Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)
[ http://issues.apache.org/jira/browse/LUCENE-565?page=comments#action_12459506 ] Ning Li commented on LUCENE-565:
---

Here is the design overview. Minor changes were made because of lock-less commits.

In the current IndexWriter, newly added documents are buffered in ram in the form of one-doc segments. When a flush is triggered, all ram documents are merged into a single segment and written to disk. Further merges of disk segments may be triggered.

NewIndexModifier extends IndexWriter and supports document deletion in addition to document addition. NewIndexModifier not only buffers newly added documents in ram, but also buffers deletes in ram. The following describes what happens when a flush is triggered:

1. Merge ram documents into one segment and write it to disk. Do not commit: segmentInfos is updated in memory, but not written to disk.
2. For each disk segment to which a delete may apply: open a reader, delete docs* and write a new .delN file, then close the reader. Do not commit: segmentInfos is updated in memory, but not written to disk.
   (* Care is taken to ensure that an interleaved sequence of inserts and deletes for the same document is properly serialized.)
3. Commit: write a new segments_N to disk.

Further merges of disk segments work the same as before.

As an option, we can cache readers to minimize the number of reader opens/closes; in other words, we can trade memory for better performance. The design would then be modified as follows:

1. Same as above.
2. For each disk segment to which a delete may apply: open a reader and cache it if not already opened/cached, then delete docs* and write a new .delN file.
3. Commit: write a new segments_N to disk.

The logic for disk-segment merges changes accordingly: open a reader if not already opened/cached; after a merge completes, close the readers for the segments that were merged.
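The three flush steps above can be walked through with a toy model: (1) write the ram docs as one new uncommitted segment, (2) mark matching docs deleted per affected segment (standing in for the .delN files), (3) commit by publishing the new segment list. This is not Lucene code, all names are hypothetical, and it deliberately ignores the per-document interleaving bookkeeping the comment mentions.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy model of the three-step flush described above (hypothetical, not Lucene).
public class FlushSketch {
    static class Segment {
        final List<String> docs;
        final Set<Integer> deleted = new HashSet<>(); // stands in for a .delN file
        Segment(List<String> docs) { this.docs = docs; }
    }

    List<Segment> committedSegments = new ArrayList<>(); // the "segments_N" view
    List<String> ramDocs = new ArrayList<>();
    List<String> bufferedDeleteTerms = new ArrayList<>();

    void flush() {
        // Step 1: merge ram docs into one segment; the segment list is
        // updated in memory only, nothing committed yet.
        List<Segment> pending = new ArrayList<>(committedSegments);
        if (!ramDocs.isEmpty()) pending.add(new Segment(new ArrayList<>(ramDocs)));
        ramDocs.clear();

        // Step 2: for each segment a buffered delete may apply to, mark the
        // matching docs deleted (write a new .delN file); still no commit.
        for (Segment seg : pending)
            for (int i = 0; i < seg.docs.size(); i++)
                if (bufferedDeleteTerms.contains(seg.docs.get(i)))
                    seg.deleted.add(i);
        bufferedDeleteTerms.clear();

        // Step 3: commit by publishing the new segment list (write segments_N).
        committedSegments = pending;
    }

    List<String> liveDocs() {
        List<String> out = new ArrayList<>();
        for (Segment seg : committedSegments)
            for (int i = 0; i < seg.docs.size(); i++)
                if (!seg.deleted.contains(i)) out.add(seg.docs.get(i));
        return out;
    }
}
```

Because the segment list is only published in step 3, a crash between steps leaves the previous segments_N intact, which is the crash-safety property lock-less commits rely on.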
Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)
---
Key: LUCENE-565
URL: http://issues.apache.org/jira/browse/LUCENE-565
Project: Lucene - Java
Issue Type: Bug
Components: Index
Reporter: Ning Li
Attachments: NewIndexModifier.Sept21.patch, perf-test-res.JPG, perf-test-res2.JPG, perfres.log, TestBufferedDeletesPerf.java

Today, applications have to open/close an IndexWriter and open/close an IndexReader directly or indirectly (via IndexModifier) in order to handle a mix of inserts and deletes. This performs well when inserts and deletes come in fairly large batches. However, the performance can degrade dramatically when inserts and deletes are interleaved in small batches. This is because the ramDirectory is flushed to disk whenever an IndexWriter is closed, causing a lot of small segments to be created on disk, which eventually need to be merged.

We would like to propose a small API change to eliminate this problem. We are aware that this kind of change has come up in discussions before. See http://www.gossamer-threads.com/lists/lucene/java-dev/23049?search_string=indexwriter%20delete;#23049 . The difference this time is that we have implemented the change and tested its performance, as described below.

API Changes
---
We propose adding a deleteDocuments(Term term) method to IndexWriter. Using this method, inserts and deletes can be interleaved using the same IndexWriter. Note that, with this change, it would be very easy to add another method to IndexWriter for updating documents, allowing applications to avoid a separate delete and insert when updating a document. Also note that this change can co-exist with the existing APIs for deleting documents using an IndexReader. But if our proposal is accepted, we think those APIs should probably be deprecated.

Coding Changes
---
Coding changes are localized to IndexWriter. Internally, the new deleteDocuments() method works by buffering the terms to be deleted. Deletes are deferred until the ramDirectory is flushed to disk, either because it becomes full or because the IndexWriter is closed. Using Java synchronization, care is taken to ensure that an interleaved sequence of inserts and deletes for the same document is properly serialized. We have attached a modified version of IndexWriter from Release 1.9.1 with these changes. Only a few hundred lines of coding changes are needed. All changes are commented with CHANGE. We have also attached a modified version of an example from Chapter 2.2 of Lucene in Action.

Performance Results
---
To test the performance of our proposed changes, we ran some experiments using the TREC WT 10G dataset. The experiments were run on a dual 2.4 GHz Intel Xeon server running Linux. The disk storage was configured as a RAID0 array with 5 drives. Before
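The "buffer the delete terms, replay on flush" mechanism described in the coding changes can be sketched as follows. This is a toy model, not the attached patch: terms are plain strings matching doc ids directly, the "index" is a set, and synchronized methods stand in for the Java synchronization the proposal uses to serialize interleaved inserts and deletes on the same document.

```java
import java.util.*;

// Sketch of buffered deletes: deleteDocuments(term) only records the
// term; buffered operations are replayed in arrival order when the RAM
// buffer is flushed, so an add-delete-add sequence for the same
// document resolves correctly. Names are illustrative, not Lucene's API.
public class BufferedDeleteWriter {
    // each buffered operation is {"add"|"del", docId}, kept in arrival
    // order so interleaved inserts and deletes stay serialized
    private final List<String[]> ops = new ArrayList<>();
    private final Set<String> index = new LinkedHashSet<>();  // stands in for "on disk"

    public synchronized void addDocument(String docId) {
        ops.add(new String[] { "add", docId });
    }

    public synchronized void deleteDocuments(String term) {
        ops.add(new String[] { "del", term });
    }

    /** Replays buffered operations in order, as a flush would. */
    public synchronized void flush() {
        for (String[] op : ops) {
            if (op[0].equals("add")) {
                index.add(op[1]);
            } else {
                index.remove(op[1]);
            }
        }
        ops.clear();
    }

    public synchronized Set<String> docs() {
        return new LinkedHashSet<>(index);
    }
}
```

Because deletes are deferred to the same flush as the buffered adds, small interleaved batches no longer force an IndexWriter close and a cascade of tiny on-disk segments - which is the performance problem the proposal targets.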
access policy for Java Open Review Project
Hi all,

I've been busy creating JOR accounts this weekend, and it was cool to see so many names from Lucene. Lucene, Solr, and Nutch have the lowest defect rates among the projects we've looked at, and I'm beginning to see why.

One of the things JOR is doing is inviting people to come and help review issues we find with static analysis. We've had a fair number of signups since the project was on Slashdot. My question is: would you like to allow outsiders to go through the results and help sort the real bugs from the chaff? The upside is that volunteers may perform useful work and that it may be another avenue to get people involved with the code. The downside is that things like XSS in admin pages may lead them to make more of a ruckus than is really appropriate. The situation may change if we can establish a mechanism for efficiently moving issues into Jira, but for now I could imagine a number of different policies, including:

- Allow anyone access who asks for it.
- Allow access on a case-by-case basis.
- Don't allow access to outsiders.

Here are the outsiders who've requested access so far, along with a few words summarizing what they've told me about themselves.

Lucene
--
Varun Nair [EMAIL PROTECTED]: budding code auditor at TCS
Martin Englund [EMAIL PROTECTED]: experienced auditor at Sun
[EMAIL PROTECTED]: looks like he's just testing the waters

Lucene, Nutch, Solr
--
Thierry De Leeuw [EMAIL PROTECTED]: experienced vulnerability hunter
Michael Bunzel [EMAIL PROTECTED]: experienced auditor, but new to auditing Java

Thoughts?

Brian