[ https://issues.apache.org/jira/browse/LUCENE-845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12492814 ]

Steven Parkes commented on LUCENE-845:
--------------------------------------

Following up on this: the idea is basically that segments ought to be 
created/merged either by-segment-size or by-doc-count, but not by a 
mixture of the two? That wouldn't be surprising ...

It does impact the APIs, though. With factored merge policies it's easy 
enough to imagine both by-doc-count and by-segment-size policies. But the 
initial segment creation is going to be handled by IndexWriter, so you have 
to manually make sure the flush algorithm and the merge policy aren't set in 
conflict. Not great, but I don't have any better ideas. We could put in an 
API handshake, but I'm not sure it's worth the mess?
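To illustrate the handshake idea, here's a minimal sketch (all names here are hypothetical, not actual Lucene API): a factored merge policy could declare which unit it reasons in, so IndexWriter could detect when its flush trigger and the policy's merge criterion disagree.

```java
// Hypothetical sketch -- not real Lucene classes. The point is only that
// a policy could advertise its unit so a conflicting combination can be
// rejected up front instead of silently over-merging.
enum SegmentUnit { DOC_COUNT, BYTE_SIZE }

interface FactoredMergePolicy {
    SegmentUnit unit();
}

class ByDocCountPolicy implements FactoredMergePolicy {
    public SegmentUnit unit() { return SegmentUnit.DOC_COUNT; }
}

class BySegmentSizePolicy implements FactoredMergePolicy {
    public SegmentUnit unit() { return SegmentUnit.BYTE_SIZE; }
}

class WriterSketch {
    private final SegmentUnit flushUnit;  // how this writer creates level-0 segments

    WriterSketch(SegmentUnit flushUnit) { this.flushUnit = flushUnit; }

    // The "handshake": refuse a policy whose unit disagrees with how
    // segments are being created.
    void setMergePolicy(FactoredMergePolicy p) {
        if (p.unit() != flushUnit) {
            throw new IllegalArgumentException(
                "flush trigger " + flushUnit
                + " conflicts with merge policy unit " + p.unit());
        }
    }
}
```

The cost is exactly the "mess" mentioned above: every policy implementation has to participate in the handshake even when the combination is harmless.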

Also, it sounds like, so far, there's no good way of managing parallel-reader 
setups w/by-segment-size algorithms, since the algorithm for creating/merging 
segments has to be globally consistent, not just per index, right?

If that is right, what does it say about making by-segment-size the default? 
It's going to break (as in produce bad results for) people who rely on the 
current behavior and don't change their code. Is there a community consensus 
on this? It's not really an API change that would cause a compile/class-load 
failure, but in some ways, it's worse ...

> If you "flush by RAM usage" then IndexWriter may over-merge
> -----------------------------------------------------------
>
>                 Key: LUCENE-845
>                 URL: https://issues.apache.org/jira/browse/LUCENE-845
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Index
>    Affects Versions: 2.1
>            Reporter: Michael McCandless
>         Assigned To: Michael McCandless
>            Priority: Minor
>
> I think a good way to maximize performance of Lucene's indexing for a
> given amount of RAM is to flush (writer.flush()) the added documents
> whenever the RAM usage (writer.ramSizeInBytes()) has crossed the max
> RAM you can afford.
> But, this can confuse the merge policy and cause over-merging, unless
> you set maxBufferedDocs properly.
> This is because the merge policy looks at the current maxBufferedDocs
> to figure out which segments are level 0 (first flushed) or level 1
> (merged from <mergeFactor> level 0 segments).
> I'm not sure how to fix this.  Maybe we can look at net size (bytes)
> of a segment and "infer" level from this?  Still we would have to be
> resilient to the application suddenly increasing the RAM allowed.
> The good news is to workaround this bug I think you just need to
> ensure that your maxBufferedDocs is less than mergeFactor *
> typical-number-of-docs-flushed.
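The "infer level from size" idea from the issue description can be sketched as follows (an assumption about how it might work, not the actual fix): each merge of <mergeFactor> segments roughly multiplies segment size by mergeFactor, so a segment's level can be estimated from its byte size relative to a base flush size. Variable-size level-0 segments produced by flushing on RAM usage would then still land on sensible levels. The class and parameter names are invented for illustration.

```java
// Hypothetical sketch of inferring a segment's merge level from its
// byte size instead of from maxBufferedDocs.
class LevelFromSize {
    /**
     * @param segmentBytes   net size of the segment on disk
     * @param baseFlushBytes typical size of a freshly flushed (level-0) segment
     * @param mergeFactor    how many segments are merged per level
     */
    static int inferLevel(long segmentBytes, long baseFlushBytes, int mergeFactor) {
        int level = 0;
        long levelSize = baseFlushBytes;
        // Walk up the levels: a level-(n+1) segment is roughly
        // mergeFactor times the size of a level-n segment.
        while (segmentBytes >= levelSize * mergeFactor) {
            levelSize *= mergeFactor;
            level++;
        }
        return level;
    }
}
```

Note this still has the resilience problem the description mentions: if the application suddenly raises the allowed RAM, baseFlushBytes changes and the inferred levels shift.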

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
