[ https://issues.apache.org/jira/browse/LUCENE-1526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12775319#action_12775319 ]

John Wang commented on LUCENE-1526:
-----------------------------------

bq. I'd love to see how the worst-case queries (matching millions of hits) 
perform with each of these three options.

I wrote a small program on my laptop: with 100 docs in the set, it iterates 
through 5M numbers and calls contains().
I see 44 ms with BitVector and 64 ms with IntAccelerator backed by 
IntOpenHashSet (from fastutil).

This is, however, an extreme case. For test 2 I chose 5000 docs from the set, 
e.g. mod 1000 to be a candidate for the check, and both sets performed equally, 
around 45 ms.
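For reference, here is a minimal sketch of that kind of micro-benchmark. It is 
not the original program: java.util.BitSet stands in for Lucene's BitVector, 
fastutil's IntOpenHashSet stands in for the set behind the accelerator, and the 
set size and probe count are the ones quoted above. Timings will of course vary 
by machine; test 2 presumably only changes which docs get probed, so the same 
loop shape applies.

{code:java}
import it.unimi.dsi.fastutil.ints.IntOpenHashSet;

import java.util.BitSet;
import java.util.Random;

public class ContainsBenchmark {

    private static final int NUM_PROBES = 5000000; // numbers iterated, as in the test
    private static final int SET_SIZE = 100;       // docs in the set, as in test 1

    public static void main(String[] args) {
        Random rnd = new Random(42);
        BitSet bits = new BitSet(NUM_PROBES);
        IntOpenHashSet hash = new IntOpenHashSet();
        for (int i = 0; i < SET_SIZE; i++) {
            int doc = rnd.nextInt(NUM_PROBES);
            bits.set(doc);
            hash.add(doc);
        }

        // Probe the bit set for every candidate doc.
        long t0 = System.nanoTime();
        int hits = 0;
        for (int doc = 0; doc < NUM_PROBES; doc++) {
            if (bits.get(doc)) hits++;
        }
        long bitMs = (System.nanoTime() - t0) / 1000000;

        // Probe the hash set for the same candidates.
        t0 = System.nanoTime();
        for (int doc = 0; doc < NUM_PROBES; doc++) {
            if (hash.contains(doc)) hits++;
        }
        long hashMs = (System.nanoTime() - t0) / 1000000;

        System.out.println("BitSet: " + bitMs + " ms, IntOpenHashSet: " + hashMs
            + " ms (hits=" + hits + ")");
    }
}
{code}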

So given the memory cost, and the allocations and clones of the BitVector, I 
think that for us, at least, using the IntSetAccelerator works well.

bq. why does each thread make a full clone of the AcceleratedBitSet?

These are for updates: e.g. you update doc x, the new version goes to the 
ramdir, but the old version is still in the disk dir. So at query time, you 
need this set for dup removal.
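To illustrate the idea (a sketch only; the class and method names below are 
hypothetical, not Zoie's actual API): each query thread takes its own snapshot 
of the updated-doc set, and a hit coming from the disk dir is dropped when its 
UID is in that set, because the fresher copy lives in the ramdir.

{code:java}
import it.unimi.dsi.fastutil.ints.IntOpenHashSet;

/**
 * Minimal sketch of dup removal between a RAM dir and a disk dir.
 * The snapshot mechanics and UID handling here are illustrative only.
 */
public class DupRemovalSketch {

    /** UIDs of docs re-indexed into the RAM dir that still exist in the disk dir. */
    private final IntOpenHashSet updatedUids = new IntOpenHashSet();

    /** Record that a doc was updated into the RAM dir. */
    public synchronized void markUpdated(int uid) {
        updatedUids.add(uid);
    }

    /** Each query thread takes its own copy so the search sees a stable snapshot. */
    public synchronized IntOpenHashSet snapshotForQuery() {
        return new IntOpenHashSet(updatedUids);
    }

    /** A disk hit is a duplicate if a newer copy of the doc lives in the RAM dir. */
    public static boolean isDuplicate(int diskDocUid, IntOpenHashSet snapshot) {
        return snapshot.contains(diskDocUid);
    }
}
{code}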

bq. I'd love to see this too.

Some more details on the test we ran:

NRT - indexing only
***********************************************************
SUMMARY:
***********************************************************
TOTAL TRANSACTIONS: 622201
TOTAL EXECUTIONS: 622201
TOTAL SUCCESSFUL EXECUTIONS: 622201
TOTAL FAILED EXECUTIONS: 0
TOTAL RUNTIME IN MINS: 30.07
INTERVAL FOR AVERAGE TIME CAPTURE IN MINS: 1
***********************************************************

Zoie - indexing only
***********************************************************
SUMMARY:
***********************************************************
TOTAL TRANSACTIONS: 6265384
TOTAL EXECUTIONS: 6265384
TOTAL SUCCESSFUL EXECUTIONS: 6265384
TOTAL FAILED EXECUTIONS: 0
TOTAL RUNTIME IN MINS: 30.07
INTERVAL FOR AVERAGE TIME CAPTURE IN MINS: 1
***********************************************************

Zoie - update
***********************************************************
SUMMARY:
***********************************************************
TOTAL TRANSACTIONS: 1923592
TOTAL EXECUTIONS: 1923592
TOTAL SUCCESSFUL EXECUTIONS: 1923592
TOTAL FAILED EXECUTIONS: 0
TOTAL RUNTIME IN MINS: 30.07
INTERVAL FOR AVERAGE TIME CAPTURE IN MINS: 1
***********************************************************

NRT - update
***********************************************************
SUMMARY:
***********************************************************
TOTAL TRANSACTIONS: 399893
TOTAL EXECUTIONS: 399893
TOTAL SUCCESSFUL EXECUTIONS: 399893
TOTAL FAILED EXECUTIONS: 0
TOTAL RUNTIME IN MINS: 30.07
INTERVAL FOR AVERAGE TIME CAPTURE IN MINS: 1
***********************************************************

Latencies:

Zoie - insert test: linear growth from 1 ms to 5 ms as the index grew over the 
duration of the test from 0 to 660k docs.
Zoie - update test: averaged 9 ms, with continuous updates keeping the index at 
1M docs.
NRT - insert test: fluctuated between 17 ms and 50 ms as the index grew over 
the duration of the test from 0 to 220 docs.
NRT - update test: a big peak when querying started; latency spiked up to 550 
ms, then dropped and stayed steady at 50 ms, with continuous updates keeping 
the index at 1M docs.

One observation from the NRT update test: I am seeing some delete issues, e.g. 
realtime deletes do not seem to be reflected, and indexing speed dropped 
sharply.

It's quite possible that I am not using NRT in the most optimal way in my 
setup. Feel free to run the tests yourself; I'd be happy to help with the 
setup.
One thing about Zoie is that it is a full streaming indexing system with a 
pluggable realtime engine, so you can actually use Zoie for perf testing of 
NRT.

One thing to stress about the test: we are testing realtime updates, so 
buffering indexing events and flushing once in a while is not realtime. Katta 
has already achieved good results with batch indexing with just minutes of 
delay, without making any internal changes to Lucene.


> For near real-time search, use paged copy-on-write BitVector impl
> -----------------------------------------------------------------
>
>                 Key: LUCENE-1526
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1526
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.4
>            Reporter: Jason Rutherglen
>            Priority: Minor
>         Attachments: LUCENE-1526.patch
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> SegmentReader currently uses a BitVector to represent deleted docs.
> When performing rapid clone (see LUCENE-1314) and delete operations,
> performing a copy on write of the BitVector can become costly because
> the entire underlying byte array must be created and copied. A way to
> make this clone delete process faster is to implement tombstones, a
> term coined by Marvin Humphrey. Tombstones represent new deletions
> plus the incremental deletions from previously reopened readers in
> the current reader. 
> The proposed implementation of tombstones is to accumulate deletions
> into an int array represented as a DocIdSet. With LUCENE-1476,
> SegmentTermDocs iterates over deleted docs using a DocIdSet rather
> than accessing the BitVector by calling get. This allows a BitVector
> and a set of tombstones to be ANDed together as the current reader's
> delete docs. 
> A tombstone merge policy needs to be defined to determine when to
> merge tombstone DocIdSets into a new deleted docs BitVector as too
> many tombstones would eventually be detrimental to performance. A
> probable implementation will merge tombstones based on the number of
> tombstones and the total number of documents in the tombstones. The
> merge policy may be set in the clone/reopen methods or on the
> IndexReader. 
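
As a rough illustration of the proposal above (a sketch only, not the attached 
patch: it uses java.util.BitSet in place of Lucene's BitVector, keeps the 
tombstones in a sorted int array, and assumes the reader's effective deleted 
docs are the union of the base deletions and the tombstoned docs):

{code:java}
import java.util.Arrays;
import java.util.BitSet;

/**
 * Sketch of copy-on-write deletes via tombstones: the BitSet shared with the
 * original reader is never mutated, and new deletions accumulate in a small
 * sorted int array until a merge policy folds them into a fresh BitSet.
 */
public class TombstoneDeletes {

    private final BitSet baseDeletes;   // shared, untouched after clone/reopen
    private final int[] tombstones;     // sorted doc ids deleted since the clone

    public TombstoneDeletes(BitSet baseDeletes, int[] tombstones) {
        this.baseDeletes = baseDeletes;
        this.tombstones = tombstones;
    }

    /** A doc is deleted if it was deleted in the base reader or tombstoned since. */
    public boolean isDeleted(int doc) {
        return baseDeletes.get(doc) || Arrays.binarySearch(tombstones, doc) >= 0;
    }

    /** Example merge trigger: too many tombstones relative to the doc count. */
    public boolean shouldMerge(int maxDoc) {
        return tombstones.length > maxDoc / 100;  // threshold is illustrative
    }

    /** Fold the tombstones into a new BitSet, leaving the shared base untouched. */
    public BitSet merge() {
        BitSet merged = (BitSet) baseDeletes.clone();
        for (int doc : tombstones) {
            merged.set(doc);
        }
        return merged;
    }
}
{code}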
