[ https://issues.apache.org/jira/browse/LUCENE-845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12520268 ]

Steven Parkes commented on LUCENE-845:
--------------------------------------

I understand the merge problem, but I'm still concerned about the increased 
number of file descriptors. Is that a concern for anyone else?

It seems like there are ways of approaching this that might fix both 
problems.

For example, right now (pre-fix), suppose you have maxBufferedDocs set to 1000, 
mergeFactor set to 10, an existing segment of 1000 docs, and then (for the sake 
of an obvious example) you add 10 single-doc segments. The policy will merge 
everything into one segment of size 1010, which is not great.
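
For concreteness, here's a rough sketch of that scenario against the IndexWriter 
calls this issue already references (setMaxBufferedDocs, setMergeFactor, flush); 
the index path and field name are made up:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class OverMergeRepro {
  public static void main(String[] args) throws Exception {
    IndexWriter writer = new IndexWriter("/tmp/overmerge-idx", new StandardAnalyzer(), true);
    writer.setMaxBufferedDocs(1000);   // the merge policy infers segment levels from this
    writer.setMergeFactor(10);

    // Build the existing 1000-doc segment.
    for (int i = 0; i < 1000; i++) {
      writer.addDocument(makeDoc(i));
    }
    writer.flush();

    // Ten tiny flushes, each leaving a one-doc segment behind.
    for (int i = 0; i < 10; i++) {
      writer.addDocument(makeDoc(1000 + i));
      writer.flush();
    }
    // The pre-fix policy can now cascade all eleven segments into a
    // single segment of ~1010 docs, recopying the big one.
    writer.close();
  }

  private static Document makeDoc(int id) {
    Document doc = new Document();
    doc.add(new Field("id", Integer.toString(id), Field.Store.YES, Field.Index.UN_TOKENIZED));
    return doc;
  }
}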

One option in cases like this would be to merge the small segments into one but 
leave the big segments alone, so you end up with [1000, 10], where the last 
segment keeps growing until it reaches 1000. This does more copying than the 
current behavior, but always on small segments, with the advantage of keeping 
the bound on the number of file descriptors lower.
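
Roughly what I mean, as a hypothetical selection step over segment doc counts 
(this isn't an existing Lucene class or method, just a sketch of the idea):

import java.util.ArrayList;
import java.util.List;

public class SmallSegmentSelector {
  // Hypothetical: pick the trailing run of "small" segments to merge into one,
  // never recopying a segment that has already reached maxBufferedDocs.
  // segmentDocCounts is ordered oldest-to-newest, e.g. [1000, 1, 1, ..., 1].
  static List<Integer> selectSmallSegments(int[] segmentDocCounts, int maxBufferedDocs) {
    List<Integer> candidates = new ArrayList<Integer>();
    for (int i = 0; i < segmentDocCounts.length; i++) {
      if (segmentDocCounts[i] < maxBufferedDocs) {
        candidates.add(i);      // small: eligible for the next merge
      } else {
        candidates.clear();     // big: leave it alone, start over after it
      }
    }
    return candidates;          // merge these into one trailing segment
  }
}

With [1000, 1, ..., 1] this returns only the ten small segments, so the 
1000-doc segment is never rewritten.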

Of course, if no one is worried about this "moderate" (not exactly large, not 
exactly small) change in file descriptor usage, then it's not a big deal. It 
doesn't impact my work, but I'm not sure about the greater community.

> If you "flush by RAM usage" then IndexWriter may over-merge
> -----------------------------------------------------------
>
>                 Key: LUCENE-845
>                 URL: https://issues.apache.org/jira/browse/LUCENE-845
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Index
>    Affects Versions: 2.1
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>         Attachments: LUCENE-845.patch
>
>
> I think a good way to maximize performance of Lucene's indexing for a
> given amount of RAM is to flush (writer.flush()) the added documents
> whenever the RAM usage (writer.ramSizeInBytes()) has crossed the max
> RAM you can afford.
> But this can confuse the merge policy and cause over-merging, unless
> you set maxBufferedDocs properly.
> This is because the merge policy looks at the current maxBufferedDocs
> to figure out which segments are level 0 (first flushed) or level 1
> (merged from <mergeFactor> level 0 segments).
> I'm not sure how to fix this.  Maybe we can look at net size (bytes)
> of a segment and "infer" level from this?  Still we would have to be
> resilient to the application suddenly increasing the RAM allowed.
> The good news is that to work around this bug, I think you just need to
> ensure your maxBufferedDocs is less than mergeFactor *
> typical-number-of-docs-flushed.
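
A minimal sketch of the flush-by-RAM pattern plus the workaround described 
above, assuming only the IndexWriter methods the issue names (ramSizeInBytes(), 
flush()); the RAM budget and document counts are made-up numbers:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class FlushByRam {
  public static void main(String[] args) throws Exception {
    long maxRam = 32 * 1024 * 1024;     // flush once buffered RAM crosses ~32 MB (illustrative)
    int typicalDocsPerFlush = 5000;     // measure this for your own documents (illustrative)
    int mergeFactor = 10;

    IndexWriter writer = new IndexWriter("/tmp/ram-idx", new StandardAnalyzer(), true);
    writer.setMergeFactor(mergeFactor);
    // Workaround: keep maxBufferedDocs below mergeFactor * typical-docs-flushed
    // (here 10000 < 10 * 5000) so the merge policy infers levels sanely.
    writer.setMaxBufferedDocs(10000);

    for (int i = 0; i < 100000; i++) {
      Document doc = new Document();
      doc.add(new Field("id", Integer.toString(i), Field.Store.YES, Field.Index.UN_TOKENIZED));
      writer.addDocument(doc);
      if (writer.ramSizeInBytes() > maxRam) {
        writer.flush();                 // flush by RAM usage, as described in the issue
      }
    }
    writer.close();
  }
}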

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


