[ https://issues.apache.org/jira/browse/LUCENE-1526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12777025#action_12777025 ]

Michael McCandless commented on LUCENE-1526:
--------------------------------------------

bq. Due to the bloomfilter living on top of the hashSet, at least at the scales 
we're dealing with, we didn't see any change in cost due to the number of 
deletions (zoie by default keeps no more than 10k modifications in memory 
before flushing to disk, so the biggest the delSet is going to be is that, and 
we don't see the more-than-constant scaling yet at that size).

Bloom filters are nice :)
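
A minimal sketch of the kind of structure being described -- a Bloom filter front-ending an exact set of buffered deletes, so the common "not deleted" case never touches the HashSet -- not Zoie's actual code; the class, the int UIDs, and the hash functions are all made up:

{code:java}
import java.util.BitSet;
import java.util.HashSet;
import java.util.Set;

// Hypothetical bloom-filter-fronted delete set: membership is first tested
// against a small bit array; only probable hits fall through to the exact
// HashSet, so lookups stay cheap regardless of how many deletions are
// buffered (up to the ~10k flush threshold mentioned above).
class BloomFilteredDelSet {
  private final BitSet bloom;
  private final int bloomBits;
  private final Set<Integer> deleted = new HashSet<Integer>();

  BloomFilteredDelSet(int bloomBits) {
    this.bloomBits = bloomBits;
    this.bloom = new BitSet(bloomBits);
  }

  void delete(int uid) {
    deleted.add(uid);
    bloom.set(hash1(uid) % bloomBits);
    bloom.set(hash2(uid) % bloomBits);
  }

  boolean isDeleted(int uid) {
    // Cheap negative checks first: a clear bit means definitely not deleted.
    if (!bloom.get(hash1(uid) % bloomBits)) return false;
    if (!bloom.get(hash2(uid) % bloomBits)) return false;
    return deleted.contains(uid);   // confirm or reject a possible false positive
  }

  private static int hash1(int uid) { return uid & Integer.MAX_VALUE; }
  private static int hash2(int uid) { return (uid * 0x9E3779B1) & Integer.MAX_VALUE; }
}
{code}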

{quote}
bq. But your test is missing a dimension: frequency of reopen. If you reopen 
once per second, how do Zoie/Lucene compare? Twice per second? Once every 5 
seconds? Etc.

Yep, this is true. It's a little more invasive to put this into Zoie, because 
the reopen time is so fast that there's no pooling, so it would need to be 
kinda hacked in, or tacked on to the outside. Not rocket science, but not just 
the change of a parameter.
{quote}

OK.  It's clear Zoie's design is optimized for insanely fast reopen.

LUCENE-2050 should make it easy to test this for pure Lucene NRT.

bq. LinkedIn doesn't have any hard requirements of having to reopen hundreds of 
times per second, we're just stressing the system, to see what's going on.

Redline tests are very important to understand how the system will
behave at extremes.

But I think it'd be useful to control which dimension to redline.

EG what I'd love to see is, as a function of reopen rate, the "curve"
of QPS vs docs per sec.  Ie, if you reopen 1X per second, that
consumes some of your machine's resources.  What's left can be spent
indexing or searching or both, so, it's a curve/line.  So we should
set up fixed rate indexing, and then redline the QPS to see what's
possible, and do this for multiple indexing rates, and for multiple
reopen rates.

Then this all becomes a capacity question for apps.
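
A rough sketch of what such a probe could look like on the 2.9-era NRT API (writer.getReader()): hold the indexing rate and the reopen rate fixed, redline the search threads, and record the sustained QPS for that cell of the grid. The rates, the field names, and the fact that old readers are simply leaked rather than pooled/closed are all simplifications:

{code:java}
import java.util.concurrent.*;
import java.util.concurrent.atomic.*;

import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.document.*;
import org.apache.lucene.index.*;
import org.apache.lucene.search.*;
import org.apache.lucene.store.RAMDirectory;

public class NrtCapacityProbe {
  static final int DOCS_PER_SEC = 100, REOPENS_PER_SEC = 2, SEARCH_THREADS = 4;
  static final int RUN_SECONDS = 30;

  public static void main(String[] args) throws Exception {
    final IndexWriter writer = new IndexWriter(
        new RAMDirectory(), new WhitespaceAnalyzer(), IndexWriter.MaxFieldLength.UNLIMITED);
    final AtomicReference<IndexReader> reader =
        new AtomicReference<IndexReader>(writer.getReader());
    final AtomicLong queries = new AtomicLong();
    final AtomicBoolean running = new AtomicBoolean(true);

    ScheduledExecutorService sched = Executors.newScheduledThreadPool(2);

    // Fixed-rate indexing.
    sched.scheduleAtFixedRate(new Runnable() {
      public void run() {
        try {
          Document doc = new Document();
          doc.add(new Field("body", "hello world", Field.Store.NO, Field.Index.ANALYZED));
          writer.addDocument(doc);
        } catch (Exception e) { throw new RuntimeException(e); }
      }
    }, 0, 1000 / DOCS_PER_SEC, TimeUnit.MILLISECONDS);

    // Fixed-rate reopen (a real harness would close/pool the old readers).
    sched.scheduleAtFixedRate(new Runnable() {
      public void run() {
        try { reader.set(writer.getReader()); }
        catch (Exception e) { throw new RuntimeException(e); }
      }
    }, 0, 1000 / REOPENS_PER_SEC, TimeUnit.MILLISECONDS);

    // Redlined searching: run as many queries as possible and count them.
    for (int i = 0; i < SEARCH_THREADS; i++) {
      new Thread(new Runnable() {
        public void run() {
          try {
            while (running.get()) {
              new IndexSearcher(reader.get())
                  .search(new TermQuery(new Term("body", "hello")), 10);
              queries.incrementAndGet();
            }
          } catch (Exception e) { throw new RuntimeException(e); }
        }
      }).start();
    }

    Thread.sleep(RUN_SECONDS * 1000L);
    running.set(false);
    sched.shutdownNow();
    System.out.println("QPS at " + DOCS_PER_SEC + " docs/sec, " + REOPENS_PER_SEC
        + " reopens/sec: " + (queries.get() / (double) RUN_SECONDS));
  }
}
{code}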

bq. As you can see, nobody's filing a bug here that Lucene NRT is "broken" 
because it can't handle zero-latency updates.

Right, Zoie is making determined tradeoffs.  I would expect that most
apps are fine with controlled reopen frequency, ie, they would choose
to not lose indexing and searching performance if it means they can
"only" reopen, eg, 2X per second.

(Of course we will need to test, with LUCENE-2050, at what reopen
frequency you really eat into your indexing/searching performance,
given fixed hardware).

{quote}
What we did try to make sure was in the system was determinism: not knowing 
whether an update will be seen because there is some background process doing 
addIndexes from another thread which hasn't completed, or not knowing how fresh 
the pooled reader is, that kind of thing.

This kind of determinism can certainly be gotten with NRT, by locking down the 
IndexWriter wrapped up in another class to keep it from being monkeyed with by 
other threads, and then tuning exactly how often the reader is reopened, and 
then dictate to clients that the freshness is exactly at or better than this 
freshness timeout, sure. This kind of user-friendliness is one of Zoie's main 
points - it provides an indexing system which manages all this, and certainly 
for some clients, we should add in the ability to pool the readers for less 
real-timeness, that's a good idea.
{quote}

I agree -- having such well defined API semantics ("once updateDoc
returns, searches can see it") is wonderful.  But I think they can be
cleanly built on top of Lucene NRT as it is today, with a
pre-determined (reopen) latency.
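
A hypothetical sketch of how that guarantee could be layered on a fixed reopen interval: each update records a generation, and updateDocument() only returns once a scheduled reopen has covered it. All class and field names here are invented, and a real implementation would keep a handle on the executor and close superseded readers:

{code:java}
import java.io.IOException;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;

// "Once updateDoc returns, searches can see it", built on a fixed reopen cadence.
class VisibleWhenReturnedWriter {
  private final IndexWriter writer;
  private volatile IndexReader current;
  private long visibleGen = 0;   // highest generation covered by 'current'
  private long pendingGen = 0;   // generation of the most recent update

  VisibleWhenReturnedWriter(IndexWriter writer, final long reopenIntervalMs) throws IOException {
    this.writer = writer;
    this.current = writer.getReader();
    Executors.newSingleThreadScheduledExecutor().scheduleAtFixedRate(new Runnable() {
      public void run() { reopen(); }
    }, reopenIntervalMs, reopenIntervalMs, TimeUnit.MILLISECONDS);
  }

  /** Blocks until the update is visible through getReader(). */
  public synchronized void updateDocument(Term id, Document doc)
      throws IOException, InterruptedException {
    writer.updateDocument(id, doc);
    long myGen = ++pendingGen;
    while (visibleGen < myGen) {
      wait();                         // released by reopen() below
    }
  }

  public IndexReader getReader() { return current; }

  private synchronized void reopen() {
    try {
      long covered = pendingGen;      // everything indexed so far
      current = writer.getReader();   // NRT reopen makes it searchable
      visibleGen = covered;
      notifyAll();
    } catch (IOException e) { throw new RuntimeException(e); }
  }
}
{code}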

{quote}
Of course, if your reopen() time is pretty heavy (lots of FieldCache data / 
other custom faceting data needs to be loaded for a bunch of fields), then at 
least for us, even not needing zero-latency updates means that the more 
realistic 5-10% degradation in query performance for normal queries is 
negligible, and we get deterministic zero-latency updates as a consequence.
{quote}

I think the "large merge just finished" case is the most costly for
such apps (which the "merged segment warmer" on IW should take care
of)?  (Because otherwise the segments are tiny, assuming everything is
cutover to "per segment").
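
For example, a warmer along these lines (the 2.9-era IndexWriter.IndexReaderWarmer hook) could load FieldCache entries for the newly merged segment before it is ever exposed to searches; here 'writer' is assumed to be an existing IndexWriter and "price" is just an example field:

{code:java}
// Pre-warm per-segment data as part of the merge, off the query path.
// Uses org.apache.lucene.index.IndexWriter/IndexReader and
// org.apache.lucene.search.FieldCache.
writer.setMergedSegmentWarmer(new IndexWriter.IndexReaderWarmer() {
  public void warm(IndexReader merged) throws IOException {
    // Populate FieldCache (or custom facet data) for the merged segment.
    FieldCache.DEFAULT.getInts(merged, "price");
  }
});
{code}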

{quote}
This whole discussion reminded me that there's another realtime update case, 
which neither Zoie nor NRT is properly optimized for: the absolutely zero 
deletes case with very fast indexing load and the desire for minimal latency of 
updates (imagine that you're indexing twitter - no changes, just adds), and you 
want to be able to provide a totally stream-oriented view on things as they're 
being added (matching some query, naturally) with sub-second turnaround. A 
subclass of SegmentReader which is constructed which doesn't even have a 
deletedSet could be instantiated, and the deleted check could be removed 
entirely, speeding things up even further.
{quote}

I think Lucene could handle this well, if we made an IndexReader impl
that directly searches DocumentsWriter's RAM buffer.  But that's
somewhat challenging ;)
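
The cheaper, application-level half of that zero-deletes idea is available already: a scan can skip the per-doc deleted check entirely whenever the reader reports no deletions. A tiny sketch (visit() is just a placeholder for per-hit work):

{code:java}
import org.apache.lucene.index.IndexReader;

class PureAppendScan {
  static void scanAll(IndexReader reader) {
    int maxDoc = reader.maxDoc();
    if (!reader.hasDeletions()) {
      for (int doc = 0; doc < maxDoc; doc++) {
        visit(doc);                   // no isDeleted() on the hot path
      }
    } else {
      for (int doc = 0; doc < maxDoc; doc++) {
        if (!reader.isDeleted(doc)) {
          visit(doc);
        }
      }
    }
  }

  static void visit(int doc) { /* placeholder for per-hit work */ }
}
{code}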


> For near real-time search, use paged copy-on-write BitVector impl
> -----------------------------------------------------------------
>
>                 Key: LUCENE-1526
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1526
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.4
>            Reporter: Jason Rutherglen
>            Priority: Minor
>         Attachments: LUCENE-1526.patch
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> SegmentReader currently uses a BitVector to represent deleted docs.
> When performing rapid clone (see LUCENE-1314) and delete operations,
> performing a copy on write of the BitVector can become costly because
> the entire underlying byte array must be created and copied. A way to
> make this clone delete process faster is to implement tombstones, a
> term coined by Marvin Humphrey. Tombstones represent new deletions
> plus the incremental deletions from previously reopened readers in
> the current reader. 
> The proposed implementation of tombstones is to accumulate deletions
> into an int array represented as a DocIdSet. With LUCENE-1476,
> SegmentTermDocs iterates over deleted docs using a DocIdSet rather
> than accessing the BitVector by calling get. This allows a BitVector
> and a set of tombstones to be ANDed together as the current reader's
> delete docs. 
> A tombstone merge policy needs to be defined to determine when to
> merge tombstone DocIdSets into a new deleted docs BitVector as too
> many tombstones would eventually be detrimental to performance. A
> probable implementation will merge tombstones based on the number of
> tombstones and the total number of documents in the tombstones. The
> merge policy may be set in the clone/reopen methods or on the
> IndexReader. 
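
A small illustrative sketch of the tombstone scheme described above: new deletions accumulate in a sorted int[] while the previous reader's large bit set is shared rather than copied, and the tombstones are folded into a fresh bit set once they pass a threshold (the "merge policy"). java.util.BitSet stands in for Lucene's BitVector, and all names are invented:

{code:java}
import java.util.Arrays;
import java.util.BitSet;

class TombstoneDeletedDocs {
  private final BitSet base;              // shared with the prior reader, never mutated
  private int[] tombstones = new int[0];  // sorted docIDs deleted since the last fold
  private final int mergeThreshold;

  TombstoneDeletedDocs(BitSet base, int mergeThreshold) {
    this.base = base;
    this.mergeThreshold = mergeThreshold;
  }

  /** Copy-on-write delete: copies only the small tombstone array, not the bit set. */
  TombstoneDeletedDocs delete(int docID) {
    int[] next = Arrays.copyOf(tombstones, tombstones.length + 1);
    next[next.length - 1] = docID;
    Arrays.sort(next);
    TombstoneDeletedDocs copy = new TombstoneDeletedDocs(base, mergeThreshold);
    copy.tombstones = next;
    return copy.tombstones.length >= mergeThreshold ? copy.folded() : copy;
  }

  /** A doc is deleted if the prior reader deleted it or a tombstone marks it. */
  boolean isDeleted(int docID) {
    return base.get(docID) || Arrays.binarySearch(tombstones, docID) >= 0;
  }

  /** "Merge policy": too many tombstones, so build a fresh combined bit set. */
  private TombstoneDeletedDocs folded() {
    BitSet all = (BitSet) base.clone();
    for (int docID : tombstones) {
      all.set(docID);
    }
    return new TombstoneDeletedDocs(all, mergeThreshold);
  }
}
{code}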
