[ https://issues.apache.org/jira/browse/LUCENE-845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12526403 ]
Michael McCandless commented on LUCENE-845:
-------------------------------------------
In the latest patch on LUCENE-847 I've added methods to
LogDocMergePolicy (setMinMergeDocs) and LogByteSizeMergePolicy
(setMinMergeMB) to set a floor on the segment levels such that all
segments below this min size are aggressively merged as if they were in
one level. This effectively "truncates" what would otherwise be a
long tail of segment sizes, when you are flushing many tiny segments
into your index.
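As a rough illustration, configuring such a floor might look like the
sketch below. The setMinMergeMB/setMinMergeDocs setters are from the
LUCENE-847 patch, and the writer.setMergePolicy() wiring is an
assumption about how that patch exposes merge policies:

  import org.apache.lucene.index.IndexWriter;
  import org.apache.lucene.index.LogByteSizeMergePolicy;
  import org.apache.lucene.index.LogDocMergePolicy;

  public class MinMergeFloorSketch {
    // Sketch only: setter and wiring names follow the LUCENE-847 patch and may change.
    static void configureFloor(IndexWriter writer, boolean bySize) {
      if (bySize) {
        // Floor by size: segments smaller than 1.6 MB are merged as if they were one level.
        LogByteSizeMergePolicy policy = new LogByteSizeMergePolicy();
        policy.setMinMergeMB(1.6);
        writer.setMergePolicy(policy);
      } else {
        // Floor by doc count: segments with fewer than 1000 docs share one level.
        LogDocMergePolicy policy = new LogDocMergePolicy();
        policy.setMinMergeDocs(1000);
        writer.setMergePolicy(policy);
      }
    }
  }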
In order to pick reasonable defaults for the min segment size, I ran
some benchmarks to measure the indexing cost of truncating the tail.
I processed Wiki content into ~4 KB plain text documents and then
indexed the first 10,000 docs using this alg:
analyzer=org.apache.lucene.analysis.SimpleAnalyzer
doc.maker=org.apache.lucene.benchmark.byTask.feeds.LineDocMaker
directory=FSDirectory
docs.file=/lucene/wiki4K.txt
max.buffered = 500
ResetSystemErase
CreateIndex
{AddDoc >: 10000
CloseIndex
RepSumByName
I'm using the SerialMergeScheduler.
I modified contrib/benchmark to always flush a new segment after each
added document: this simulates the "worst case" of tiny segments, i.e.,
lowest-latency indexing where every added doc must then be visible to
searchers.
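Outside of contrib/benchmark, that worst case can be approximated
directly against IndexWriter. This is only a sketch, assuming the
setMergeScheduler()/flush() calls on current trunk and a hypothetical
index path:

  import org.apache.lucene.analysis.SimpleAnalyzer;
  import org.apache.lucene.document.Document;
  import org.apache.lucene.index.IndexWriter;
  import org.apache.lucene.index.SerialMergeScheduler;
  import org.apache.lucene.store.FSDirectory;

  public class TinySegmentStressSketch {
    // Sketch of the "worst case": every added doc is flushed as its own tiny segment.
    static void indexWithFlushPerDoc(Iterable<Document> docs) throws Exception {
      IndexWriter writer = new IndexWriter(
          FSDirectory.getDirectory("/lucene/tiny-segment-index"),  // hypothetical path
          new SimpleAnalyzer(), true);
      writer.setMergeScheduler(new SerialMergeScheduler());  // merges run inline, as in the benchmark
      for (Document doc : docs) {
        writer.addDocument(doc);
        writer.flush();  // flush a tiny segment so a newly opened searcher can see this doc
      }
      writer.close();
    }
  }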
Each time is the best of 2 runs. The test ran on a Linux (2.6.22.1)
Core 2 Duo 2.4 GHz machine with 4 GB RAM and a RAID 5 IO system,
using Java 1.5 -server.
maxBufferedDocs   seconds   slowdown
             10        40        1.0
            100        50        1.3
            200        59        1.5
            300        64        1.6
            400        72        1.8
            500        80        2.0
            750        97        2.4
           1000       114        2.9
           1500       138        3.5
           2000       169        4.2
           3000       205        5.1
           4000       264        6.6
           5000       320        8.0
           7500       404       10.1
          10000       645       16.1
Here's my thinking:
* If you are flushing zillions of such tiny segments I think it's OK
  to accept a sizable net slowdown of your overall indexing speed.
  I'll use a 4X slowdown "tolerance" to pick the default values,
  which corresponds roughly to the "2000" line above. However,
  because I tested on a fairly fast CPU & IO system, I'll multiply
  the numbers by 0.5.
* Given this, I propose we default the minMergeMB
(LogByteSizeMergePolicy) to 1.6 MB (avg size of real segments at
the 2000 point above was 3.2 MB) and default minMergeDocs
(LogDocMergePolicy) to 1000.
* Note that when you are flushing large segments (larger than these
  min size settings) there is no slowdown at all, because the
  flushed segments are already above the minimum size.
These are just defaults, so a given application can always change its
"min merge size" as needed.
> If you "flush by RAM usage" then IndexWriter may over-merge
> -----------------------------------------------------------
>
> Key: LUCENE-845
> URL: https://issues.apache.org/jira/browse/LUCENE-845
> Project: Lucene - Java
> Issue Type: Bug
> Components: Index
> Affects Versions: 2.1
> Reporter: Michael McCandless
> Assignee: Michael McCandless
> Priority: Minor
> Fix For: 2.3
>
> Attachments: LUCENE-845.patch
>
>
> I think a good way to maximize performance of Lucene's indexing for a
> given amount of RAM is to flush (writer.flush()) the added documents
> whenever the RAM usage (writer.ramSizeInBytes()) has crossed the max
> RAM you can afford.
> But, this can confuse the merge policy and cause over-merging, unless
> you set maxBufferedDocs properly.
> This is because the merge policy looks at the current maxBufferedDocs
> to figure out which segments are level 0 (first flushed) or level 1
> (merged from <mergeFactor> level 0 segments).
> I'm not sure how to fix this. Maybe we can look at net size (bytes)
> of a segment and "infer" level from this? Still we would have to be
> resilient to the application suddenly increasing the RAM allowed.
> The good news is that to work around this bug I think you just need
> to ensure that your maxBufferedDocs is less than mergeFactor *
> typical-number-of-docs-flushed.
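For reference, the "flush by RAM usage" pattern described above
amounts to something like this sketch, using the
writer.ramSizeInBytes() and writer.flush() calls named in the
description (the 32 MB budget and the document source are
illustrative assumptions):

  import java.io.IOException;
  import org.apache.lucene.document.Document;
  import org.apache.lucene.index.IndexWriter;

  public class FlushByRamSketch {
    static final long MAX_RAM_BYTES = 32L * 1024 * 1024;  // illustrative RAM budget

    // Sketch of "flush by RAM usage": flush whenever buffered RAM crosses the budget.
    static void indexFlushingByRam(IndexWriter writer, Iterable<Document> docs) throws IOException {
      for (Document doc : docs) {
        writer.addDocument(doc);
        if (writer.ramSizeInBytes() > MAX_RAM_BYTES) {
          // Without a min-merge floor, these flushes can confuse the merge
          // policy unless maxBufferedDocs < mergeFactor * typical docs flushed.
          writer.flush();
        }
      }
    }
  }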