[ 
http://issues.apache.org/jira/browse/LUCENE-565?page=comments#action_12428035 ] 
            
Doron Cohen commented on LUCENE-565:
------------------------------------

I tried out this patch (July18), and have a few comments...

First, it is nice to be able to add/remove documents with no need to care for 
switching between readers and writers, and without worrying for performance 
issues as result of that switching. I did not test for performance yet.

This post is quite long, so here is an outline...
(1) Compile error in test code
(2) Failing tests - is this patch actually fixing a bug in current 
flushRamSegments()?
(3) Additional tests I ran
(4) Javadocs remarks
(5) deleteDocument(int doc) not implemented
(6) flush() not implemented
(7) Method name - batchDeleteDocuments(Term[]) 
(8) Class name and placement + What's Next for this patch

------------ (1) Compile error in test code 
The new TestWriterDelete does not reflect recent name change from IndexWriter 
to NewIndexModifier. Easily fixed by renaming accordingly in that file.

------------ (2) Failing tests - does this patch also fix a bug in current 
flushRamSegments()?
"ant test" has one failure: TestIndexModifier.testIndex().
This is the same issue that Ning described above. However I think it exposes a 
bug in current flushRamSegments(): when an index with 1 segment on disk that 
has 2 documents, one of which is deleted, and 1 segment in memory, is closed, 
this method decides to merge - prematurely - the two segments into one. This 
wrong behavior (if I understand things correctly) is - by "mistake" - causing 
TestIndexModifier.testIndex() to pass in the current implementation of 
flushRamSegments(). But this comes with the cost of too many merges. If one is 
interleaving adds and deletes this bug would become costly. I will post a 
separate question on this to the dev forum, to discuss if this is indeed a bug.

------------ (3) Additional tests I ran 
I wanted to verify that all existing functions (well, at least tested ones..) 
are working with the new class (NewIndexModifier). So I temporarily renamed the 
existing IndexWriter to IndexWriter0, and renamed NewIndexModifier to 
IndexWriter (now extending IndexWriter0). For compiling, now, also had to 
temporarily modify args from IndexWriter to IndexWriter0 in 3 classes - 
DocumentWriter, SegmentMerger, and also from NewIndexModifier to IndexWriter in 
the new TestWriteDelete. (Note again: these modifications are temporary, just 
for the sake of testing this new class as if it was the new IndexWriter, which 
it is not.) Now all the tests were using the new class instead of the original 
IndexWriter. 

All tests passed, except for TestIndexModifier.testIndex() - this is the same 
failure as above - so, no problem detected in new class.

------------ (4) Javadocs Remarks
Current Javadocs for the new class focus on changes to the implementation. I 
think this description of implementation changes should be made regular Java 
comments (for developers), and instead should add a shorter javadoc that 
describes the API for users, and the implications on behavior as result of 
buffering deletes.

------------ (5) deleteDocument(int doc) not implemented
Original IndexModifier has a delete(int docs), the new class doesn't. At first 
this seems ok, since internal doc IDs are not accessible through index writer 
(unlike index reader). But IndexModifier also does not provide access to 
doc-ids. So why was delete-by-id enabled in IndexModifier? Perhaps there's a 
good reason for it, that I fail to see - if so, it should probably be added to 
the new class as well. Adding this is required if the new class would 
eventually replace the implementation of current index modifier.

------------ (6) flush() not implemented
Original IndexModifier has a flush(int docs) method, allowing to commit any 
pending changes. I think it would be nice to have this feature here as well, 
for forcing any pending changes (without caller having to modify the 
max-bufferred value). This would allow more control when using this class. 
Again, adding this is required if the new class would eventually replace the 
implementation of current index modifier.

------------ (7) Method name - batchDeleteDocuments(Term[]) 
I would prefer it to be called deleteDocuments(Term[]), and let Java decide 
which method to call. Main reason is developers would expect that methods with 
similar semantics are named similarly, especially when using IDEs like Eclipse, 
where users type "i.del" and the IDE lets them select from all the methods that 
start with "del".

------------ (8) Class name and placement + What's Next for this patch
Performance test should be added for this new class. Also, I did not code 
review the actual code changes to IndexWriter and the code of NewIndexModifier 
itself.

It seems to me that this class would be very useful for users, either as a new 
class or if it replaces the current implementation of IndexModifier. Latter 
would be possible only if the 2 missing methods mentioned above are added. In 
this case, the "immediate delete" behavior of current IndexModifier should be 
possible to achieve by users, by setting maxBefferedDeleteTerms to 1.

One disadvantage of this class vs. current IndexModifier is the ability to add 
access to further methods of IndexReader. With current IndexModifier this is 
very simple (though not always efficient) - just follow code template in 
existing methods, i.e. close writer/reader and open reader/writer as required. 
With the new class, exposing further methods of IndexReader would be more of a 
challenge. Perhaps having a multiReader on all segment readers can do. I am not 
sure to what extent this should be a consideration, so just bringing it up. 

So, If this class replaces IndexModifier - fine. If not, how about calling it 
BufferredIndexWriter?

- Doron

> Supporting deleteDocuments in IndexWriter (Code and Performance Results 
> Provided)
> ---------------------------------------------------------------------------------
>
>                 Key: LUCENE-565
>                 URL: http://issues.apache.org/jira/browse/LUCENE-565
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Index
>            Reporter: Ning Li
>         Attachments: IndexWriter.java, IndexWriter.July09.patch, 
> IndexWriter.patch, NewIndexModifier.July09.patch, 
> NewIndexWriter.July18.patch, TestWriterDelete.java
>
>
> Today, applications have to open/close an IndexWriter and open/close an
> IndexReader directly or indirectly (via IndexModifier) in order to handle a
> mix of inserts and deletes. This performs well when inserts and deletes
> come in fairly large batches. However, the performance can degrade
> dramatically when inserts and deletes are interleaved in small batches.
> This is because the ramDirectory is flushed to disk whenever an IndexWriter
> is closed, causing a lot of small segments to be created on disk, which
> eventually need to be merged.
> We would like to propose a small API change to eliminate this problem. We
> are aware that this kind change has come up in discusions before. See
> http://www.gossamer-threads.com/lists/lucene/java-dev/23049?search_string=indexwriter%20delete;#23049
> . The difference this time is that we have implemented the change and
> tested its performance, as described below.
> API Changes
> -----------
> We propose adding a "deleteDocuments(Term term)" method to IndexWriter.
> Using this method, inserts and deletes can be interleaved using the same
> IndexWriter.
> Note that, with this change it would be very easy to add another method to
> IndexWriter for updating documents, allowing applications to avoid a
> separate delete and insert to update a document.
> Also note that this change can co-exist with the existing APIs for deleting
> documents using an IndexReader. But if our proposal is accepted, we think
> those APIs should probably be deprecated.
> Coding Changes
> --------------
> Coding changes are localized to IndexWriter. Internally, the new
> deleteDocuments() method works by buffering the terms to be deleted.
> Deletes are deferred until the ramDirectory is flushed to disk, either
> because it becomes full or because the IndexWriter is closed. Using Java
> synchronization, care is taken to ensure that an interleaved sequence of
> inserts and deletes for the same document are properly serialized.
> We have attached a modified version of IndexWriter in Release 1.9.1 with
> these changes. Only a few hundred lines of coding changes are needed. All
> changes are commented by "CHANGE". We have also attached a modified version
> of an example from Chapter 2.2 of Lucene in Action.
> Performance Results
> -------------------
> To test the performance our proposed changes, we ran some experiments using
> the TREC WT 10G dataset. The experiments were run on a dual 2.4 Ghz Intel
> Xeon server running Linux. The disk storage was configured as RAID0 array
> with 5 drives. Before indexes were built, the input documents were parsed
> to remove the HTML from them (i.e., only the text was indexed). This was
> done to minimize the impact of parsing on performance. A simple
> WhitespaceAnalyzer was used during index build.
> We experimented with three workloads:
>   - Insert only. 1.6M documents were inserted and the final
>     index size was 2.3GB.
>   - Insert/delete (big batches). The same documents were
>     inserted, but 25% were deleted. 1000 documents were
>     deleted for every 4000 inserted.
>   - Insert/delete (small batches). In this case, 5 documents
>     were deleted for every 20 inserted.
>                                 current       current          new
> Workload                      IndexWriter  IndexModifier   IndexWriter
> -----------------------------------------------------------------------
> Insert only                     116 min       119 min        116 min
> Insert/delete (big batches)       --          135 min        125 min
> Insert/delete (small batches)     --          338 min        134 min
> As the experiments show, with the proposed changes, the performance
> improved by 60% when inserts and deletes were interleaved in small batches.
> Regards,
> Ning
> Ning Li
> Search Technologies
> IBM Almaden Research Center
> 650 Harry Road
> San Jose, CA 95120

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Reply via email to