[ https://issues.apache.org/jira/browse/LUCENE-1526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12775319#action_12775319 ]
John Wang commented on LUCENE-1526:
-----------------------------------

bq. I'd love to see how the worst-case queries (matching millions of hits) perform with each of these three options.

I wrote a small program on my laptop: with 100 docs in the set, it iterates through 5M numbers and calls contains(). I see 44 ms with BitVector and 64 ms with IntAccelerator backed by IntOpenHashSet (from fastutil).

This is, however, an extreme case, so for test 2 I chose 5000 docs from the set (e.g. mod 1000 to be a candidate for the check), and both sets performed equally, at around 45 ms. So given the memory cost, and the allocations and clones of the BitVector, I think that for us at least, using the IntSetAccelerator works well.

bq. why does each thread make a full clone of the AcceleratedBitSet?

These are for updates: e.g. you updated doc x, it is updated in the ramdir, but it is already in the disk dir. So at query time you need this set for dup removal.

bq. I'd love to see this too.

Some more details on the test we ran:

NRT - indexing only
***********************************************************
SUMMARY:
***********************************************************
TOTAL TRANSACTIONS: 622201
TOTAL EXECUTIONS: 622201
TOTAL SUCCESSFUL EXECUTIONS: 622201
TOTAL FAILED EXECUTIONS: 0
TOTAL RUNTIME IN MINS: 30.07
INTERVAL FOR AVERAGE TIME CAPTURE IN MINS: 1
***********************************************************

zoie - indexing only
SUMMARY:
***********************************************************
TOTAL TRANSACTIONS: 6265384
TOTAL EXECUTIONS: 6265384
TOTAL SUCCESSFUL EXECUTIONS: 6265384
TOTAL FAILED EXECUTIONS: 0
TOTAL RUNTIME IN MINS: 30.07
INTERVAL FOR AVERAGE TIME CAPTURE IN MINS: 1
***********************************************************

zoie - update
SUMMARY:
***********************************************************
TOTAL TRANSACTIONS: 1923592
TOTAL EXECUTIONS: 1923592
TOTAL SUCCESSFUL EXECUTIONS: 1923592
TOTAL FAILED EXECUTIONS: 0
TOTAL RUNTIME IN MINS: 30.07
INTERVAL FOR AVERAGE TIME CAPTURE IN MINS: 1
***********************************************************

nrt - update
SUMMARY:
***********************************************************
TOTAL TRANSACTIONS: 399893
TOTAL EXECUTIONS: 399893
TOTAL SUCCESSFUL EXECUTIONS: 399893
TOTAL FAILED EXECUTIONS: 0
TOTAL RUNTIME IN MINS: 30.07
INTERVAL FOR AVERAGE TIME CAPTURE IN MINS: 1
***********************************************************

Latencies:

Zoie - insert test: linear growth from 1 ms to 5 ms as the index grows from 0 docs to 660k docs over the duration of the test.
Zoie - update test: averaged 9 ms, with continuous updates keeping the index at 1M docs.
NRT - insert test: fluctuated between 17 ms and 50 ms as the index grows from 0 docs to 220 docs over the duration of the test.
NRT - update test: a big peak when querying started; latency spiked up to 550 ms, then dropped and stayed steadily at 50 ms, with continuous updates keeping the index at 1M docs.

Some observations on the NRT update test: I am seeing some delete issues, e.g. realtime deletes do not seem to be reflected, and indexing speed dropped sharply. It's quite possible that I am not using NRT in the most optimal way in my setup. Feel free to run the tests yourself; I'd be happy to help with the setup. One thing about Zoie is that it is a full streaming indexing system with a pluggable realtime engine, so you can actually use zoie for perf testing of NRT.
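As an illustration only (not the actual test code), a minimal harness for the kind of membership-check comparison described above might look like the sketch below: iterate over 5M candidate docIDs and call contains()/get() against a set holding 100 deleted docs. It uses java.util.BitSet as a stand-in for Lucene's BitVector and assumes fastutil's IntOpenHashSet is on the classpath as a stand-in for the IntSetAccelerator; the class name and constants are made up for the example.

{code:java}
import java.util.BitSet;
import java.util.Random;
import it.unimi.dsi.fastutil.ints.IntOpenHashSet;

// Hypothetical micro-benchmark sketch, not the program actually run for the numbers above.
public class DeletedDocsMembershipBench {
  public static void main(String[] args) {
    final int maxDoc = 5000000;   // candidate docIDs iterated, as in the test
    final int numDeleted = 100;   // docs in the deleted set (test 1)

    Random rnd = new Random(42);
    BitSet bits = new BitSet(maxDoc);             // stand-in for BitVector
    IntOpenHashSet intSet = new IntOpenHashSet(); // stand-in for the int-set accelerator
    for (int i = 0; i < numDeleted; i++) {
      int doc = rnd.nextInt(maxDoc);
      bits.set(doc);
      intSet.add(doc);
    }

    // Time the bit-set membership check over all candidate docIDs.
    long hits = 0;
    long t0 = System.nanoTime();
    for (int doc = 0; doc < maxDoc; doc++) {
      if (bits.get(doc)) hits++;
    }
    long bitMs = (System.nanoTime() - t0) / 1000000L;

    // Time the hash-set membership check over the same docIDs.
    long t1 = System.nanoTime();
    for (int doc = 0; doc < maxDoc; doc++) {
      if (intSet.contains(doc)) hits++;
    }
    long setMs = (System.nanoTime() - t1) / 1000000L;

    System.out.println("bitset: " + bitMs + " ms, intset: " + setMs + " ms, hits=" + hits);
  }
}
{code}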
One thing to stress about the test: we are testing realtime updates, so buffering indexing events and flushing once in a while is not realtime, and katta has already achieved good results with batch indexing with just minutes of delay, without making any internal changes to lucene.

> For near real-time search, use paged copy-on-write BitVector impl
> -----------------------------------------------------------------
>
>                 Key: LUCENE-1526
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1526
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.4
>            Reporter: Jason Rutherglen
>            Priority: Minor
>         Attachments: LUCENE-1526.patch
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> SegmentReader currently uses a BitVector to represent deleted docs.
> When performing rapid clone (see LUCENE-1314) and delete operations,
> performing a copy-on-write of the BitVector can become costly because
> the entire underlying byte array must be created and copied. A way to
> make this clone/delete process faster is to implement tombstones, a
> term coined by Marvin Humphrey. Tombstones represent new deletions
> plus the incremental deletions from previously reopened readers in
> the current reader.
> The proposed implementation of tombstones is to accumulate deletions
> into an int array represented as a DocIdSet. With LUCENE-1476,
> SegmentTermDocs iterates over deleted docs using a DocIdSet rather
> than accessing the BitVector by calling get(). This allows a BitVector
> and a set of tombstones to be ANDed together as the current reader's
> deleted docs.
> A tombstone merge policy needs to be defined to determine when to
> merge tombstone DocIdSets into a new deleted-docs BitVector, as too
> many tombstones would eventually be detrimental to performance. A
> probable implementation will merge tombstones based on the number of
> tombstones and the total number of documents in the tombstones. The
> merge policy may be set in the clone/reopen methods or on the
> IndexReader.
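To make the tombstone idea in the issue description concrete, here is a rough sketch (not the attached patch) of one way to read that combination: a doc counts as deleted if it is set in a shared, read-only base bit set or present in a small per-reader tombstone set, and the tombstones get folded into a fresh bit set once they grow past a threshold. The class and method names (TombstonedDeletedDocs, mergeIfNeeded) are illustrative only, java.util.BitSet stands in for Lucene's BitVector, and fastutil's IntOpenHashSet stands in for the tombstone DocIdSet.

{code:java}
import java.util.BitSet;
import it.unimi.dsi.fastutil.ints.IntOpenHashSet;

// Illustrative only: deleted docs = an immutable base bit set shared across reader
// clones, plus a small per-reader set of "tombstones" for deletions made since the
// base was built. Cloning a reader then only needs to copy the tombstone set, not
// the full underlying byte array.
class TombstonedDeletedDocs {
  private final BitSet base;               // shared, never mutated after creation
  private final IntOpenHashSet tombstones; // new deletions for this reader
  private final int maxDoc;

  TombstonedDeletedDocs(BitSet base, int maxDoc) {
    this.base = base;
    this.maxDoc = maxDoc;
    this.tombstones = new IntOpenHashSet();
  }

  void delete(int doc) {
    tombstones.add(doc);   // O(1), no copy-on-write of the base bit set
  }

  boolean isDeleted(int doc) {
    return base.get(doc) || tombstones.contains(doc);
  }

  // Example "merge policy": once tombstones exceed some fraction of maxDoc,
  // fold them into a new bit set so lookups go back to a single get().
  BitSet mergeIfNeeded(double maxTombstoneRatio) {
    if (tombstones.size() <= maxDoc * maxTombstoneRatio) {
      return base;
    }
    BitSet merged = (BitSet) base.clone();
    for (int doc : tombstones) {
      merged.set(doc);
    }
    tombstones.clear();
    return merged;
  }
}
{code}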