[ https://issues.apache.org/jira/browse/LUCENE-845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12520343 ]
Michael McCandless commented on LUCENE-845:
-------------------------------------------

> You may avoid the cost of a bunch of small merges, but then you pay
> the price in searching performance. I'm not sure that's the right
> tradeoff because if someone wanted to optimize for indexing
> performance, they would do more in batches.

Agreed. It's as if we would want to run a "partial optimize" (i.e., merge the tail of "small" segments) on demand, only when a reader is about to refresh.

Or, here's another random idea: maybe IndexReaders should load the tail of "small segments" into a RAMDirectory, one per segment. I.e., an IndexReader is given a RAM buffer "budget" and it spends that budget on however many small segments are in the index...?

> How does this work when flushing by MB? If you set
> setRamBufferSizeMB(32), are you guaranteed that you never have more
> than 10 segments less than 32MB (ignoring LEVEL_LOG_SPAN for now) if
> mergeFactor is 10?

No, we have the same challenge of avoiding O(N^2) merge cost. When merging by "byte size" of the segments, I don't look at the current RAM buffer size of the writer. I feel there should be a strong separation of "flush params" from "merge params".

> Almost seems like we need a minSegmentSize parameter too - using
> setRamBufferSizeMB confuses two different but related issues.

Exactly! I'm thinking I'll add a "minSegmentSize" to LogMergePolicy, separate from "maxBufferedDocs" and "ramBufferSizeMB", and default it to a value that is an acceptable tradeoff between the cost of O(N^2) merging (based on tests I will run) and the cost of slowing down readers.

I'll run some perf tests; O(N^2) merging should be acceptable up to a certain segment size.
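To illustrate the "minSegmentSize" idea discussed above, here is a small self-contained sketch (my own illustrative code, not Lucene's actual implementation; the method name `inferLevel` and the exact thresholds are assumptions). The point is that if the merge policy infers a segment's level from its byte size, with a minSegmentSize floor, then all tiny RAM-triggered flushes collapse into level 0 and get merged together regardless of how many docs each flush happened to contain:

```java
// Illustrative sketch only: infer a merge "level" from a segment's byte
// size. Any segment at or below minSegmentBytes counts as level 0
// (freshly flushed); each level above that holds segments up to
// mergeFactor times larger than the level below.
public class SizeLevelSketch {
    static int inferLevel(long segmentBytes, long minSegmentBytes, int mergeFactor) {
        if (segmentBytes <= minSegmentBytes) {
            return 0; // below the floor: treat as a freshly flushed segment
        }
        int level = 0;
        long threshold = minSegmentBytes;
        while (segmentBytes > threshold && threshold <= Long.MAX_VALUE / mergeFactor) {
            threshold *= mergeFactor; // each level is mergeFactor times bigger
            level++;
        }
        return level;
    }

    public static void main(String[] args) {
        long minSeg = 1L << 20; // assume a 1 MB floor for this sketch
        int mergeFactor = 10;
        // Segments flushed at varying RAM trigger points all land in level 0:
        System.out.println(inferLevel(100_000, minSeg, mergeFactor));  // 0
        System.out.println(inferLevel(900_000, minSeg, mergeFactor));  // 0
        // A segment produced by merging ten ~1 MB segments is level 1:
        System.out.println(inferLevel(8L << 20, minSeg, mergeFactor)); // 1
    }
}
```

Under this scheme the level no longer depends on maxBufferedDocs at all, which is the "strong separation of flush params from merge params" mentioned above.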
> If you "flush by RAM usage" then IndexWriter may over-merge
> -----------------------------------------------------------
>
>                 Key: LUCENE-845
>                 URL: https://issues.apache.org/jira/browse/LUCENE-845
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Index
>    Affects Versions: 2.1
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>         Attachments: LUCENE-845.patch
>
>
> I think a good way to maximize performance of Lucene's indexing for a
> given amount of RAM is to flush (writer.flush()) the added documents
> whenever the RAM usage (writer.ramSizeInBytes()) has crossed the max
> RAM you can afford.
>
> But, this can confuse the merge policy and cause over-merging, unless
> you set maxBufferedDocs properly.
>
> This is because the merge policy looks at the current maxBufferedDocs
> to figure out which segments are level 0 (first flushed) or level 1
> (merged from <mergeFactor> level 0 segments).
>
> I'm not sure how to fix this. Maybe we can look at net size (bytes)
> of a segment and "infer" level from this? Still we would have to be
> resilient to the application suddenly increasing the RAM allowed.
>
> The good news is that to work around this bug I think you just need to
> ensure that your maxBufferedDocs is less than mergeFactor *
> typical-number-of-docs-flushed.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
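The workaround in the issue description boils down to a simple inequality. Here is a minimal sketch of that check (my own helper, not a Lucene API; the parameter names are assumptions): keep maxBufferedDocs below mergeFactor times the typical number of docs per RAM-triggered flush, so the merge policy's level inference stays consistent.

```java
// Illustrative sketch of the workaround's arithmetic: with flush-by-RAM,
// maxBufferedDocs must stay below mergeFactor * typicalDocsPerFlush,
// otherwise the merge policy can misclassify segment levels and over-merge.
public class WorkaroundCheck {
    static boolean maxBufferedDocsIsSafe(int maxBufferedDocs, int mergeFactor,
                                         int typicalDocsPerFlush) {
        return maxBufferedDocs < mergeFactor * typicalDocsPerFlush;
    }

    public static void main(String[] args) {
        // Suppose roughly 1000 docs fit in the RAM buffer per flush,
        // and mergeFactor is 10:
        System.out.println(maxBufferedDocsIsSafe(5_000, 10, 1_000));  // true
        System.out.println(maxBufferedDocsIsSafe(50_000, 10, 1_000)); // false
    }
}
```

In the unsafe case, a single maxBufferedDocs-triggered flush produces a segment larger than a whole level-1 merge of RAM-triggered flushes, which is exactly the level confusion the issue describes.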