[
https://issues.apache.org/jira/browse/LUCENE-1526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12777068#action_12777068
]
Jake Mannix commented on LUCENE-1526:
-------------------------------------
bq. OK. It's clear Zoie's design is optimized for insanely fast reopen.
That, and maxing out QPS and indexing rate while keeping query latency
degradation to a minimum. From trying to turn off the extra deleted check, the
latency overhead on a 5M doc index is the difference between queries taking
12-13ms with the extra check turned on and 10ms without it, and you only really
start to notice it at the extreme edges (queries hitting all 5 million docs by
way of an actual query, not MatchAllDocs), where performance goes from maybe
100ms to 140-150ms.
bq. EG what I'd love to see is, as a function of reopen rate, the "curve" of
QPS vs docs per sec. Ie, if you reopen 1X per second, that consumes some of
your machine's resources. What's left can be spent indexing or searching or
both, so, it's a curve/line. So we should set up fixed rate indexing, and then
redline the QPS to see what's possible, and do this for multiple indexing
rates, and for multiple reopen rates.
Yes, that curve would be a very useful benchmark. Now that I think of it, it
wouldn't be too hard to sneak some reader caching into the ZoieSystem with a
tunable parameter for how long you hang onto a reader, so that we could see how
much that can help. One of the nice things this kind of index-latency backoff
would let Zoie do is lose the extra check at query time: because we have an
in-memory two-way mapping of zoie-specific UID to docId, if we actually have
time (in the background, since we're caching these readers now) we can zip
through and update the real delete BitVectors on the segments, and only keep
the extra check when the index-latency time is set below some threshold
(determined by how long it takes the system to do this resolution - mapping
docId to UID is an array lookup, the reverse is a little slower).
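To make that resolution step concrete, here is a minimal sketch (class and
method names are made up for illustration - this is not Zoie's actual
implementation) of the two-way UID <-> docId mapping and the background step
that folds accumulated UID deletes into a segment's delete bits:
{code}
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for Zoie's per-segment UID <-> docId mapping.
public class UidToDocIdMapping {
  private final long[] uidByDocId;             // docId -> UID: plain array lookup
  private final Map<Long, Integer> docIdByUid; // UID -> docId: the slower direction

  public UidToDocIdMapping(long[] uidByDocId) {
    this.uidByDocId = uidByDocId;
    this.docIdByUid = new HashMap<Long, Integer>(uidByDocId.length);
    for (int docId = 0; docId < uidByDocId.length; docId++) {
      docIdByUid.put(uidByDocId[docId], docId);
    }
  }

  public long getUid(int docId) {
    return uidByDocId[docId];
  }

  /** Returns -1 if the UID does not live in this segment. */
  public int getDocId(long uid) {
    Integer docId = docIdByUid.get(uid);
    return docId == null ? -1 : docId.intValue();
  }

  /**
   * Background resolution: translate the UIDs deleted since the reader was
   * opened into docIds and mark them in the segment's delete bits (a BitSet
   * stands in for the real delete BitVector here), so cached readers can skip
   * the extra per-query deleted-UID check.
   */
  public void applyDeletes(long[] deletedUids, BitSet segmentDeletedDocs) {
    for (long uid : deletedUids) {
      int docId = getDocId(uid);
      if (docId >= 0) {
        segmentDeletedDocs.set(docId);
      }
    }
  }
}
{code}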
bq. Right, Zoie is making determined tradeoffs. I would expect that most apps
are fine with controlled reopen frequency, ie, they would choose to not lose
indexing and searching performance if it means they can "only" reopen, eg, 2X
per second.
In theory Zoie is making tradeoffs - in practice, at least against what is on
trunk, Zoie is just going way faster in both indexing and querying in the
redline perf test. I agree that in principle, once LUCENE-1313 and other
improvements land and the bugs have been worked out of NRT, query performance
should be faster, and if Zoie's default BalancedMergePolicy (née
ZoieMergePolicy) is in use for NRT, indexing performance should be faster too -
it's just not quite there yet.
bq. I agree - having such well defined API semantics ("once updateDoc returns,
searches can see it") is wonderful. But I think they can be cleanly built on
top of Lucene NRT as it is today, with a pre-determined (reopen) latency.
Of course! These API semantics are already built on top of plain-old Lucene -
even without NRT - so I can't imagine how NRT would *remove* this ability! :)
bq. I think the "large merge just finished" case is the most costly for such
apps (which the "merged segment warmer" on IW should take care of)? (Because
otherwise the segments are tiny, assuming everything is cutover to "per
segment").
Definitely. One thing that Zoie benefited from, from an API standpoint, and
which might be nice in Lucene now that Java 1.5 is in place, is that the
IndexReaderWarmer could replace the raw SegmentReader with a warmed,
user-specified subclass of SegmentReader:
{code}
public abstract class IndexReaderWarmer<R extends IndexReader> {
  public abstract R warm(IndexReader rawReader);
}
{code}
This could replace the reader in the readerPool with the possibly
user-overridden, now-warmed subclass of SegmentReader (now that SegmentReader
is as public as IndexReader itself is). For users who like to decorate their
readers to keep additional state, instead of using them as keys into separately
kept WeakHashMaps, this could be extremely useful (I know that the people I
talked to at Apple's iTunes store do this, as do bobo and zoie, to name a few
examples off the top of my head).
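For illustration, here is how such a warmer might be used. This is only a
sketch: the decorating reader extends FilterIndexReader rather than
SegmentReader (whose constructors aren't meant to be called from user code),
and the class names are hypothetical, not existing Lucene or Zoie API:
{code}
import org.apache.lucene.index.FilterIndexReader;
import org.apache.lucene.index.IndexReader;

// Reader decorator that carries extra app-specific, per-segment state.
class StatefulReader extends FilterIndexReader {
  final long warmedAtMillis;

  StatefulReader(IndexReader in, long warmedAtMillis) {
    super(in);
    this.warmedAtMillis = warmedAtMillis;
  }
}

// Warmer that pre-touches the new segment and wraps it in the decorator.
class StatefulWarmer extends IndexReaderWarmer<StatefulReader> {
  public StatefulReader warm(IndexReader rawReader) {
    // Cheap warming pass: walk the deleted-docs check for every doc so the
    // first real query against this segment doesn't pay that cost.
    int maxDoc = rawReader.maxDoc();
    for (int docId = 0; docId < maxDoc; docId++) {
      rawReader.isDeleted(docId);
    }
    return new StatefulReader(rawReader, System.currentTimeMillis());
  }
}
{code}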
bq. I think Lucene could handle this well, if we made an IndexReader impl that
directly searches DocumentWriter's RAM buffer. But that's somewhat challenging
Jason mentioned this approach in his talk at ApacheCon, but I'm not at all
convinced it's necessary: if a single box can handle indexing a couple hundred
smallish documents a second into a RAMDirectory, and could be sped up by using
multiple IndexWriters (writing into multiple RAMDirectories in parallel, if you
were willing to give up some CPU cores to indexing), and you can search against
them without having to do any deduplication / bloom-filter check against the
disk, then I'd be surprised if searching the pre-indexed RAM buffer would
really be much of a speedup compared to just doing it the simple way. But I
could be wrong, as I'm not sure how much faster such a search could be.
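For comparison, a bare-bones sketch of that "simple way" - multiple
IndexWriters, each on its own RAMDirectory, searched together through a
MultiReader. The code targets the Lucene 2.4-era API and is illustrative only,
not Zoie's actual implementation:
{code}
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.MultiReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.RAMDirectory;

public class RamPartitionSearchSketch {
  public static void main(String[] args) throws Exception {
    int partitions = 2;  // e.g. one writer per CPU core given up to indexing
    RAMDirectory[] dirs = new RAMDirectory[partitions];
    IndexWriter[] writers = new IndexWriter[partitions];
    for (int i = 0; i < partitions; i++) {
      dirs[i] = new RAMDirectory();
      writers[i] = new IndexWriter(dirs[i], new StandardAnalyzer(), true,
                                   IndexWriter.MaxFieldLength.UNLIMITED);
    }

    // Round-robin smallish documents across the writers (in a real system each
    // writer would be fed by its own indexing thread).
    for (int uid = 0; uid < 200; uid++) {
      Document doc = new Document();
      doc.add(new Field("uid", Integer.toString(uid),
                        Field.Store.YES, Field.Index.NOT_ANALYZED));
      doc.add(new Field("body", "hello world " + uid,
                        Field.Store.NO, Field.Index.ANALYZED));
      writers[uid % partitions].addDocument(doc);
    }
    for (IndexWriter writer : writers) {
      writer.commit();
    }

    // Search all RAM partitions at once; no deduplication is needed because
    // each document lives in exactly one partition.
    IndexReader[] subReaders = new IndexReader[partitions];
    for (int i = 0; i < partitions; i++) {
      subReaders[i] = IndexReader.open(dirs[i]);
    }
    IndexSearcher searcher = new IndexSearcher(new MultiReader(subReaders));
    TopDocs hits = searcher.search(new TermQuery(new Term("body", "hello")), 10);
    System.out.println("hits: " + hits.totalHits);
    searcher.close();
  }
}
{code}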
> For near real-time search, use paged copy-on-write BitVector impl
> -----------------------------------------------------------------
>
> Key: LUCENE-1526
> URL: https://issues.apache.org/jira/browse/LUCENE-1526
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Index
> Affects Versions: 2.4
> Reporter: Jason Rutherglen
> Priority: Minor
> Attachments: LUCENE-1526.patch
>
> Original Estimate: 168h
> Remaining Estimate: 168h
>
> SegmentReader currently uses a BitVector to represent deleted docs.
> When performing rapid clone (see LUCENE-1314) and delete operations,
> performing a copy on write of the BitVector can become costly because
> the entire underlying byte array must be created and copied. A way to
> make this clone delete process faster is to implement tombstones, a
> term coined by Marvin Humphrey. Tombstones represent new deletions
> plus the incremental deletions from previously reopened readers in
> the current reader.
> The proposed implementation of tombstones is to accumulate deletions
> into an int array represented as a DocIdSet. With LUCENE-1476,
> SegmentTermDocs iterates over deleted docs using a DocIdSet rather
> than accessing the BitVector by calling get. This allows a BitVector
> and a set of tombstones to be ANDed together as the current reader's
> delete docs.
> A tombstone merge policy needs to be defined to determine when to
> merge tombstone DocIdSets into a new deleted docs BitVector as too
> many tombstones would eventually be detrimental to performance. A
> probable implementation will merge tombstones based on the number of
> tombstones and the total number of documents in the tombstones. The
> merge policy may be set in the clone/reopen methods or on the
> IndexReader.
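To illustrate the tombstone idea described above (a sketch only - this is not
the attached patch, and the class is made up): the reader's deleted-docs check
combines the segment's existing delete bits with a small sorted int[] of
tombstones, and a merge step folds the tombstones into a fresh bit set once
they grow too large:
{code}
import java.util.Arrays;
import java.util.BitSet;

// Illustrative tombstone-backed deleted-docs view; a BitSet stands in for the
// segment's delete BitVector.
public class TombstoneDeletedDocs {
  private final BitSet baseDeletes;  // deletes already folded into the segment
  private final int[] tombstones;    // sorted docIds deleted since the clone

  public TombstoneDeletedDocs(BitSet baseDeletes, int[] tombstones) {
    this.baseDeletes = baseDeletes;
    this.tombstones = tombstones;
  }

  /** True if the doc is deleted in either the base bits or the tombstones. */
  public boolean isDeleted(int docId) {
    return baseDeletes.get(docId)
        || Arrays.binarySearch(tombstones, docId) >= 0;
  }

  /**
   * What a tombstone merge policy might trigger once the tombstone set grows
   * too large relative to the segment: copy the base bits once and fold the
   * tombstones in, yielding a plain bit-per-doc deleted-docs structure again.
   */
  public BitSet mergeTombstones() {
    BitSet merged = (BitSet) baseDeletes.clone();
    for (int docId : tombstones) {
      merged.set(docId);
    }
    return merged;
  }
}
{code}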