[jira] Commented: (LUCENE-2312) Search on IndexWriter's RAM Buffer

Michael Busch (JIRA) Mon, 10 Jan 2011 08:47:21 -0800

    [ 
https://issues.apache.org/jira/browse/LUCENE-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12979633#action_12979633
 ]


Michael Busch commented on LUCENE-2312:
---------------------------------------

bq. I believe the goal for RT readers is still "point in time" reader semantics.

True.  At Twitter our RT solution also guarantees point-in-time readers (with 
one exception; see below).  We have to provide at least a fixed maxDoc 
per query to guarantee consistency across terms (posting lists).  E.g., imagine 
your query is 'a AND NOT b'.  Say a occurs in doc 100, and you don't find a 
posting in b's posting list for doc 100.  Did doc 100 not have term b, or is 
doc 100 still being processed and that particular posting hasn't been written 
yet?  If the reader's maxDoc is set to 99 (the last completely indexed 
document), however, you can't get into this situation.

Before every query we reopen the readers, which effectively just updates the 
maxDoc.
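
A minimal sketch of the fixed-maxDoc idea, for illustration only.  This is 
not Lucene or Twitter code: the class PointInTimeSketch and its methods 
(addPosting, markComplete, reopen, andNot) are hypothetical names, the sketch 
is single-threaded, and maxDoc is used as in the comment above, i.e. as the 
ID of the last completely indexed document.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical, single-threaded model of an in-memory segment. */
final class PointInTimeSketch {
    // term -> ascending doc IDs; stands in for the in-memory posting lists
    private final Map<String, List<Integer>> postings = new HashMap<>();
    // ID of the last document whose postings are all written; -1 if none
    private int lastCompleteDoc = -1;

    /** Writer side: append one posting for a document being indexed. */
    void addPosting(String term, int docId) {
        postings.computeIfAbsent(term, t -> new ArrayList<>()).add(docId);
    }

    /** Writer side: declare that every posting of docId has been written. */
    void markComplete(int docId) {
        lastCompleteDoc = docId;
    }

    /** Reader side: "reopening" only captures the current maxDoc. */
    int reopen() {
        return lastCompleteDoc;
    }

    /** Evaluates 'include AND NOT exclude', ignoring docs above maxDoc. */
    List<Integer> andNot(String include, String exclude, int maxDoc) {
        List<Integer> excluded =
            postings.getOrDefault(exclude, Collections.emptyList());
        List<Integer> hits = new ArrayList<>();
        for (int doc : postings.getOrDefault(include, Collections.emptyList())) {
            if (doc > maxDoc) {
                break; // postings are ascending, so nothing later qualifies
            }
            if (!excluded.contains(doc)) {
                hits.add(doc);
            }
        }
        return hits;
    }
}
```

With doc 100 half-written (its posting for a present, its posting for b not 
yet), a reader that captured maxDoc = 99 never looks at doc 100, so the query 
cannot return it by mistake.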

The one exception to point-in-time-ness is the df values in the dictionary, 
which for obvious reasons are tricky.  I think a straightforward way to solve 
this problem is to count the df by iterating the corresponding posting list 
when requested.  We could add a special counting method that just uses the skip 
lists to perform this task.  Here the term buffer becomes even more important, 
as does documenting that docFreq() can be expensive in RT mode, i.e. not O(1) 
as in non-RT mode, but rather O(log indexSize), provided we can get multi-level 
skip lists working in RT.
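
A rough sketch of such a counting method, under stated assumptions.  The 
class SkipCountingSketch, its SKIP_INTERVAL constant and its array layout are 
hypothetical and not part of Lucene.  It uses a single skip level for 
brevity, so it only shows how skip entries let the count jump over whole 
blocks; the O(log indexSize) behaviour mentioned above would need the 
multi-level variant.

```java
/** Hypothetical posting list with one skip level, for counting df only. */
final class SkipCountingSketch {
    // Assumed block size: one skip entry per SKIP_INTERVAL postings.
    private static final int SKIP_INTERVAL = 4;

    private final int[] docs;     // ascending doc IDs of one term
    private final int[] skipDocs; // last doc ID of each full block

    SkipCountingSketch(int[] ascendingDocIds) {
        docs = ascendingDocIds.clone();
        skipDocs = new int[docs.length / SKIP_INTERVAL];
        for (int i = 0; i < skipDocs.length; i++) {
            skipDocs[i] = docs[(i + 1) * SKIP_INTERVAL - 1];
        }
    }

    /** Counts the postings whose doc ID is <= maxDoc (inclusive). */
    int docFreq(int maxDoc) {
        int count = 0;
        int block = 0;
        // A skip entry <= maxDoc proves its whole block is visible.
        while (block < skipDocs.length && skipDocs[block] <= maxDoc) {
            count += SKIP_INTERVAL;
            block++;
        }
        // Scan the first block that was not fully counted, or the tail.
        for (int i = count; i < docs.length && docs[i] <= maxDoc; i++) {
            count++;
        }
        return count;
    }
}
```

Because the count is bounded by the reader's maxDoc, two readers opened at 
different times can see different df values for the same term while each 
stays consistent with its own view of the postings.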

> Search on IndexWriter's RAM Buffer
> ----------------------------------
>
>                 Key: LUCENE-2312
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2312
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Search
>    Affects Versions: Realtime Branch
>            Reporter: Jason Rutherglen
>            Assignee: Michael Busch
>             Fix For: Realtime Branch
>
>         Attachments: LUCENE-2312-FC.patch, LUCENE-2312.patch
>
>
> In order to offer users near-realtime search without incurring
> an indexing performance penalty, we can implement search on
> IndexWriter's RAM buffer. This is the buffer that is filled in
> RAM as documents are indexed. Currently the RAM buffer is
> flushed to the underlying directory (usually disk) before being
> made searchable. 
> Today's Lucene-based NRT systems must incur the cost of merging
> segments, which can slow indexing. 
> Michael Busch has good suggestions regarding how to handle deletes using max 
> doc ids.  
> https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12841923&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12841923
> The area that isn't fully fleshed out is the terms dictionary,
> which needs to be sorted prior to queries executing. Currently
> IW implements a specialized hash table. Michael B has a
> suggestion here: 
> https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12841915&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12841915

