[jira] Commented: (LUCENE-1526) For near real-time search, use paged copy-on-write BitVector impl

John Wang (JIRA) Tue, 10 Nov 2009 13:29:59 -0800

    [ 
https://issues.apache.org/jira/browse/LUCENE-1526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12776085#action_12776085
 ]


John Wang commented on LUCENE-1526:
-----------------------------------

bq.  Zoie will take 64 msec longer than Lucene, due to the extra check.

That is not true. If you look at the report closely, the difference is 20 ms; 
64 ms is the total. (After I turned on -server, the difference is about 10 ms.) 
This is running on my laptop, hardly a production server.

This also assumes the entire corpus is returned, whereas we should really 
take the average result-set size from the query log.

However, to save this "overhead", using BitVector wastes a lot of memory, 
which is expensive to clone, allocate, and garbage-collect. In a running 
system, much of that cost is hard to measure. This is simply a question of 
trade-offs.
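
To make the cloning cost concrete, here is a minimal Java sketch of a paged 
copy-on-write bit set, the approach named in the issue title. The class name 
PagedBits, the page size, and the method names are all hypothetical; this is 
not Lucene's BitVector and not the attached patch. A flat bit vector copies 
about numDocs/8 bytes on every clone, whereas this sketch copies only the array 
of page references up front and then one 1 KB page per page actually written.

```java
/** Hypothetical paged copy-on-write bit set; illustrative only, not Lucene's BitVector. */
final class PagedBits {
    static final int PAGE_BITS = 1 << 13;      // 8192 bits (1 KB) per page; assumed size
    private final long[][] pages;              // page table; pages may be shared between clones
    private final boolean[] owned;             // true if this instance may write the page in place

    PagedBits(int numBits) {
        int numPages = (numBits + PAGE_BITS - 1) / PAGE_BITS;
        pages = new long[numPages][PAGE_BITS / 64];
        owned = new boolean[numPages];
        java.util.Arrays.fill(owned, true);
    }

    private PagedBits(long[][] shared) {
        pages = shared.clone();                // copies page references only, not page contents
        owned = new boolean[shared.length];    // every page starts out shared
    }

    /** Cheap clone: both sides share all pages until one of them writes. */
    PagedBits cowClone() {
        java.util.Arrays.fill(owned, false);   // the original must now copy before writing too
        return new PagedBits(pages);
    }

    boolean get(int bit) {
        long word = pages[bit / PAGE_BITS][(bit % PAGE_BITS) >>> 6];
        return (word & (1L << (bit & 63))) != 0;
    }

    void set(int bit) {
        int p = bit / PAGE_BITS;
        if (!owned[p]) {                       // copy-on-write: duplicate just this page
            pages[p] = pages[p].clone();
            owned[p] = true;
        }
        pages[p][(bit % PAGE_BITS) >>> 6] |= 1L << (bit & 63);
    }
}
```

The trade-off is the one discussed above: clones become cheap, but every get() 
pays for one extra array indirection.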

Again, I would suggest running the tests yourself and making your own 
decisions; after all, it is open source :) That way, we can get a better 
understanding from concrete numbers and scenarios.

BTW, is there a performance benchmark/setup for Lucene NRT?

bq. The tests so far are really testing Zoie's reopen time vs Lucene's

That is not true either. This test is simply testing searching with indexing 
turned on; it is not specific to reopen. I don't think the statement that the 
performance difference is solely due to reopen is substantiated. I am seeing 
the following with NRT:

e.g. 
1) File handle leak: our prod-quality machine fell over after 1 hr of running 
NRT because of leaked file handles.
2) CPU and memory starvation: monitoring CPU and memory usage, the machine 
seems very starved, and I think that leads to performance differences more than 
the extra array lookup.
3) Correctness issues: for example, deletes don't get applied correctly. I am 
not sure enough about the unit test coverage for NRT to comment specifically.

Again, this can all be specific to my usage of NRT or the test setup. That is 
why I urge you guys to run our tests yourselves and correct us if you see 
anything we are missing for a fair comparison.




> For near real-time search, use paged copy-on-write BitVector impl
> -----------------------------------------------------------------
>
>                 Key: LUCENE-1526
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1526
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.4
>            Reporter: Jason Rutherglen
>            Priority: Minor
>         Attachments: LUCENE-1526.patch
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> SegmentReader currently uses a BitVector to represent deleted docs.
> When performing rapid clone (see LUCENE-1314) and delete operations,
> performing a copy on write of the BitVector can become costly because
> the entire underlying byte array must be created and copied. A way to
> make this clone delete process faster is to implement tombstones, a
> term coined by Marvin Humphrey. Tombstones represent new deletions
> plus the incremental deletions from previously reopened readers in
> the current reader. 
> The proposed implementation of tombstones is to accumulate deletions
> into an int array represented as a DocIdSet. With LUCENE-1476,
> SegmentTermDocs iterates over deleted docs using a DocIdSet rather
> than accessing the BitVector by calling get. This allows a BitVector
> and a set of tombstones to be ANDed together as the current reader's
> delete docs. 
> A tombstone merge policy needs to be defined to determine when to
> merge tombstone DocIdSets into a new deleted docs BitVector as too
> many tombstones would eventually be detrimental to performance. A
> probable implementation will merge tombstones based on the number of
> tombstones and the total number of documents in the tombstones. The
> merge policy may be set in the clone/reopen methods or on the
> IndexReader. 
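
The tombstone idea in the issue description quoted above can be sketched in 
Java as follows. This is only an illustration under stated assumptions: the 
class TombstoneDeletes and its methods are hypothetical, java.util.BitSet 
stands in for Lucene's BitVector, a document counts as deleted if either 
structure marks it, and the merge policy is simplified to a plain count 
threshold.

```java
/** Hypothetical tombstone overlay; names are illustrative, not Lucene API. */
final class TombstoneDeletes {
    private final java.util.BitSet base;   // merged deletions; shared and never mutated here
    private final int[] tombstones;        // sorted doc IDs deleted since the last merge
    private final int mergeThreshold;      // assumed policy: merge once this many accumulate

    /** The caller must not modify {@code base} afterwards (an assumption of this sketch). */
    TombstoneDeletes(java.util.BitSet base, int mergeThreshold) {
        this(base, new int[0], mergeThreshold);
    }

    private TombstoneDeletes(java.util.BitSet base, int[] tombstones, int mergeThreshold) {
        this.base = base;
        this.tombstones = tombstones;
        this.mergeThreshold = mergeThreshold;
    }

    /** Deleted if marked in the base bit set or present among the tombstones. */
    boolean isDeleted(int doc) {
        return base.get(doc) || java.util.Arrays.binarySearch(tombstones, doc) >= 0;
    }

    /** Returns a new view with one more deletion; copies only the small int array. */
    TombstoneDeletes withDelete(int doc) {
        if (isDeleted(doc)) return this;
        int pos = -(java.util.Arrays.binarySearch(tombstones, doc) + 1); // insertion point
        int[] next = new int[tombstones.length + 1];
        System.arraycopy(tombstones, 0, next, 0, pos);
        next[pos] = doc;
        System.arraycopy(tombstones, pos, next, pos + 1, tombstones.length - pos);
        if (next.length < mergeThreshold) {
            return new TombstoneDeletes(base, next, mergeThreshold);     // base stays shared
        }
        // Too many tombstones: pay for one full copy and fold them into a new bit set.
        java.util.BitSet merged = (java.util.BitSet) base.clone();
        for (int d : next) merged.set(d);
        return new TombstoneDeletes(merged, new int[0], mergeThreshold);
    }

    int pendingTombstones() {
        return tombstones.length;
    }
}
```

Each withDelete call leaves earlier views untouched, which mirrors a reopened 
reader seeing new deletions while older readers keep their own snapshot.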

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


