[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments

Michael McCandless (JIRA) Sat, 22 Jan 2011 03:12:13 -0800

    [ 
https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12985089#action_12985089
 ]


Michael McCandless commented on LUCENE-2324:
--------------------------------------------

bq. Right, it'll eat into the RAM buffer but it's not extreme (or is it?!).

I think this could be acceptable (2 bytes per doc) as long as it's only for the 
docIDs in IW's RAM buffer, and not for docs in flushed segments?

bq. I did propose that a while back, and I'm not sure why, but I don't think 
you were a big fan: LUCENE-1574

Ugh, you're right!  Back then I wasn't a fan.... but, back then I didn't 
realize we could also reuse the contents of the bit vector (not just the 
allocated RAM), using a replay log.

bq. Would this also be used for DW's deletes?

It's tempting -- but let's first see how it works out for the flushed segments.

bq. The paged approach I think'll have issues in a low reader latency enviro, 
ie, create overhead from all the changes. Whereas an array is fast to change, 
and fast to copy.

You mean a paged BV, right?  I think that, and more generally any transactional 
data structure (eg Zoie's wrapped bloom filter / HashSet approach), adds too 
much cost to searching.  Using RT/NRT shouldn't slow down searching, ie 
I'd rather the cost be front-loaded into the reopen than back-loaded into every 
search.

bq. Couldn't we simply use System.arraycopy and be done?

Well... System.arraycopy, while fast, is still O(N) in the size of the bit 
vector.  Yes, it has a small constant in front, but for a large index that cost 
will start to dominate.  By contrast, the cost of replaying the log, assuming 
the log is "smallish", is linear in the number of deletes since this BV's last 
reader.  Still, I expect we'll need a hybrid approach -- if there are too many 
deletes in the log then we fall back to System.arraycopy.
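
To make the hybrid idea concrete, here is a rough sketch of how it might look.  
This is not Lucene code: the class DeleteBitsSketch, its methods, and the 
maxLogSize threshold are all hypothetical names made up for illustration.  It 
assumes the caller hands back the array returned by the most recent snapshot() 
once no reader is using it, so its contents can be reused and only the logged 
deletes need to be replayed.

```java
/** Hypothetical sketch of "replay log, else arraycopy"; not Lucene's API. */
final class DeleteBitsSketch {
    private final long[] current;   // writer's live deleted bits (1 = deleted)
    private final int[] log;        // docIDs deleted since the last snapshot
    private int logSize;
    private boolean overflow;       // true once the log could not hold a delete

    DeleteBitsSketch(int maxDoc, int maxLogSize) {
        current = new long[(maxDoc + 63) >>> 6];
        log = new int[maxLogSize];
    }

    /** Marks docID deleted and records it in the replay log if there is room. */
    void delete(int docID) {
        int word = docID >>> 6;
        long mask = 1L << (docID & 63);
        if ((current[word] & mask) != 0) {
            return; // already deleted, nothing to log
        }
        current[word] |= mask;
        if (logSize < log.length) {
            log[logSize++] = docID;
        } else {
            overflow = true; // too many deletes: the next snapshot must copy
        }
    }

    /**
     * Returns the bits for a new reader.  Assumption: previous is either null
     * or the array returned by the most recent snapshot(), no longer in use.
     */
    long[] snapshot(long[] previous) {
        long[] result;
        if (previous != null && !overflow) {
            // Cheap path: cost is linear in the number of logged deletes.
            result = previous;
            for (int i = 0; i < logSize; i++) {
                result[log[i] >>> 6] |= 1L << (log[i] & 63);
            }
        } else {
            // Fallback: O(N) copy of the whole bit vector.
            result = (previous != null) ? previous : new long[current.length];
            System.arraycopy(current, 0, result, 0, current.length);
        }
        logSize = 0;
        overflow = false;
        return result;
    }

    static boolean isDeleted(long[] bits, int docID) {
        return (bits[docID >>> 6] & (1L << (docID & 63))) != 0;
    }
}
```

The threshold here is just the fixed capacity of the log; a real implementation 
would probably want to pick it relative to the size of the bit vector, and 
would need to track which reader each recycled array came from.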


> Per thread DocumentsWriters that write their own private segments
> -----------------------------------------------------------------
>
>                 Key: LUCENE-2324
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2324
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Michael Busch
>            Assignee: Michael Busch
>            Priority: Minor
>             Fix For: Realtime Branch
>
>         Attachments: LUCENE-2324-SMALL.patch, LUCENE-2324-SMALL.patch, 
> LUCENE-2324-SMALL.patch, LUCENE-2324-SMALL.patch, LUCENE-2324-SMALL.patch, 
> LUCENE-2324.patch, LUCENE-2324.patch, LUCENE-2324.patch, lucene-2324.patch, 
> lucene-2324.patch, LUCENE-2324.patch, test.out, test.out, test.out, test.out
>
>
> See LUCENE-2293 for motivation and more details.
> I'm copying here Mike's summary he posted on 2293:
> Change the approach for how we buffer in RAM to a more isolated
> approach, whereby IW has N fully independent RAM segments
> in-process and when a doc needs to be indexed it's added to one of
> them. Each segment would also write its own doc stores and
> "normal" segment merging (not the inefficient merge we now do on
> flush) would merge them. This should be a good simplification in
> the chain (eg maybe we can remove the *PerThread classes). The
> segments can flush independently, letting us make much better
> concurrent use of IO & CPU.
