Hi Mike,

Your timing is ironic. I was just running some benchmarks for ApacheCon (using contrib/benchmarker) and noticed what I think is similar behavior, so maybe you can validate my assumptions. I'm not sure whether I'm hitting RAM limits or not.

Below is the algorithm file for use with the benchmarker. To run it, save the file, cd into contrib/benchmarker (make sure you have the latest commits) and run:
ant run-task -Dtask.mem=XXXXm -Dtask.alg=<path to file>

The basic idea: there are ~21580 docs in the Reuters collection, so I wanted to run some experiments over them with different merge factors and max.buffered values. Granted, some of the values are ridiculous, but I wanted to look at them a bit because you see people on the user list from time to time asking about setting really high numbers for mergeFactor and maxBufferedDocs.

The sweet spot on my machine seems to be mergeFactor == 100, maxBD == 1000. I ran with -Dtask.mem=1024M on a machine with 2 GB of RAM. If I'm understanding the numbers correctly, and what you are arguing, this sweet spot happens to coincide roughly with the amount of memory I gave the process. I could probably play with the options a bit more to find the exact inflection point. So, to some extent, I think your approach of modeling this on RAM usage is worth pursuing.
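For reference, outside the benchmarker these two knobs are just plain IndexWriter settings. Here is a minimal stand-alone sketch (Lucene 2.1-era API; the index path and field contents are placeholders) of what each round effectively does:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;

    public class MergeSettingsSketch {
        public static void main(String[] args) throws Exception {
            // Placeholder path; the benchmarker manages its own index directory.
            IndexWriter writer = new IndexWriter("/tmp/bench-index",
                    new StandardAnalyzer(), true);

            // The two parameters each round varies.
            writer.setMergeFactor(100);      // e.g. the "sweet spot" round
            writer.setMaxBufferedDocs(1000);

            for (int i = 0; i < 22000; i++) {
                Document doc = new Document();
                doc.add(new Field("body", "doc " + i,
                        Field.Store.YES, Field.Index.TOKENIZED));
                writer.addDocument(doc);
            }
            writer.optimize();   // same as the Optimize task in the .alg file
            writer.close();
        }
    }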

Mostly this is just food for thought. I think what I am doing is correct, but am open to suggestions.

Here are my results:
 [java] ------------> Report Sum By (any) Name (6 about 66 out of 66)
 [java] Operation       round  merge  max.buffered  runCnt  recsPerRun  rec/s  elapsedSec   avgUsedMem    avgTotalMem
 [java] Rounds_13           0     10            10       1      286039  183.0    1,563.30  956,043,840  1,065,484,288
 [java] Populate-Opt        -      -             -      13       22003  184.6    1,549.36  347,786,464    461,652,288
 [java] CreateIndex         -      -             -      13           1   43.9        0.30  103,676,920    380,309,824
 [java] MAddDocs_22000      -      -             -      13       22000  195.9    1,459.75  358,755,040    461,652,288
 [java] Optimize            -      -             -      13           1    0.1       89.29  365,944,832    461,652,288
 [java] CloseIndex          -      -             -      13           1  866.7        0.01  347,786,464    461,652,288


 [java] ------------> Report sum by Prefix (MAddDocs) and Round (13 about 13 out of 66)
 [java] Operation       round  merge  max.buffered  runCnt  recsPerRun  rec/s  elapsedSec   avgUsedMem    avgTotalMem
 [java] MAddDocs_22000      0     10            10       1       22000  142.3      154.59    6,969,024     12,271,616
 [java] MAddDocs_22000      1     50            10       1       22000  159.7      137.75    7,517,728     12,861,440
 [java] MAddDocs_22000      2    100            10       1       22000  156.7      140.38    9,460,648     13,668,352
 [java] MAddDocs_22000      3   1000            10       1       22000  145.4      151.33   29,072,880     36,892,672
 [java] MAddDocs_22000      4   2000            10       1       22000  112.0      196.47   38,067,048     51,974,144
 [java] MAddDocs_22000      5     10           100       1       22000  161.9      135.89   40,896,336     51,974,144
 [java] MAddDocs_22000      6     10          1000       1       22000  266.9       82.44   53,033,616     71,766,016
 [java] MAddDocs_22000      7     10         10000       1       22000  288.9       76.14  392,512,032    422,649,856
 [java] MAddDocs_22000      8     10         21580       1       22000  272.0       80.89  708,970,944  1,065,484,288
 [java] MAddDocs_22000      9    100         21580       1       22000  271.9       80.91  767,851,072  1,065,484,288
 [java] MAddDocs_22000     10   1000         21580       1       22000  275.4       79.89  767,510,464  1,065,484,288
#Sweet Spot for this test
 [java] MAddDocs_22000     11    100          1000       1       22000  316.5       69.52  924,356,864  1,065,484,288
 [java] MAddDocs_22000     12    100         10000       1       22000  299.1       73.56  917,596,992  1,065,484,288


 [java] ------------> Report sum by Prefix (Populate-Opt) and Round (13 about 13 out of 66)
 [java] Operation       round  merge  max.buffered  runCnt  recsPerRun  rec/s  elapsedSec   avgUsedMem    avgTotalMem
 [java] Populate-Opt        0     10            10       1       22003  136.0      161.75    7,331,992     12,271,616
 [java] Populate-Opt        1     50            10       1       22003  151.8      144.99    8,065,640     12,861,440
 [java] Populate-Opt        2    100            10       1       22003  149.6      147.06    9,927,872     13,668,352
 [java] Populate-Opt        3   1000            10       1       22003  138.9      158.38   32,094,624     36,892,672
 [java] Populate-Opt        4   2000            10       1       22003  105.8      207.91   41,058,208     51,974,144
 [java] Populate-Opt        5     10           100       1       22003  156.0      141.03   41,375,032     51,974,144
 [java] Populate-Opt        6     10          1000       1       22003  249.5       88.20   53,494,472     71,766,016
 [java] Populate-Opt        7     10         10000       1       22003  259.5       84.78  226,485,280    422,649,856
 [java] Populate-Opt        8     10         21580       1       22003  254.6       86.44  675,577,344  1,065,484,288
 [java] Populate-Opt        9    100         21580       1       22003  253.5       86.78  791,214,016  1,065,484,288
 [java] Populate-Opt       10   1000         21580       1       22003  258.7       85.06  790,837,440  1,065,484,288
 [java] Populate-Opt       11    100          1000       1       22003  289.9       75.89  887,718,272  1,065,484,288
 [java] Populate-Opt       12    100         10000       1       22003  271.3       81.09  956,043,840  1,065,484,288



#last value is more than all the docs in reuters
merge.factor=merge:10:100:1000:5000:10:10:10:10:100:1000
max.buffered=max.buffered:10:10:10:10:100:1000:10000:21580:21580:21580
compound=true

analyzer=org.apache.lucene.analysis.standard.StandardAnalyzer
directory=FSDirectory
#directory=RamDirectory

doc.stored=true
doc.tokenized=true
doc.term.vector=false
doc.add.log.step=1000

docs.dir=reuters-out
#docs.dir=reuters-111

#doc.maker=org.apache.lucene.benchmark.byTask.feeds.SimpleDocMaker
doc.maker=org.apache.lucene.benchmark.byTask.feeds.ReutersDocMaker

#query.maker=org.apache.lucene.benchmark.byTask.feeds.SimpleQueryMaker
query.maker=org.apache.lucene.benchmark.byTask.feeds.ReutersQueryMaker

# tasks at this depth or less will print when they start
task.max.depth.log=2

log.queries=true
# -------------------------------------------------------------------------------------

{ "Rounds"

    ResetSystemErase

    { "Populate-Opt"
        CreateIndex
        { "MAddDocs" AddDoc > : 22000
        Optimize
        CloseIndex
    }

    NewRound

} : 10

RepSumByName
RepSumByPrefRound MAddDocs
RepSumByPrefRound Populate-Opt


On Mar 23, 2007, at 11:27 AM, Michael McCandless (JIRA) wrote:


[ https://issues.apache.org/jira/browse/LUCENE-845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12483631 ]

Michael McCandless commented on LUCENE-845:
-------------------------------------------

This bug is actually rather serious.

If you set maxBufferedDocs to a very large number (on the expectation
that it's not used since you will manually flush by RAM usage) then
the merge policy will always merge the index down to 1 segment as soon
as it hits mergeFactor segments.

This will be an O(N^2) slowdown.  EG if based on RAM you are
flushing every 100 docs, then at 1000 docs you will merge to 1
segment.  Then at 1900 docs, you merge to 1 segment again.  At 2800,
3700, 4600, ... (every 900 docs) you keep merging to 1 segment.  Your
indexing process will get very slow because every 900 documents the
entire index is effectively being optimized.
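
To make the quadratic cost concrete, here is a small self-contained simulation (plain Java, not Lucene code; the mergeFactor, flush size, and doc count are just the numbers from the example above) that counts how many documents get rewritten by merges when the index collapses to one segment every mergeFactor flushes:

    import java.util.ArrayList;
    import java.util.List;

    public class OverMergeSimulation {
        public static void main(String[] args) {
            final int mergeFactor = 10;    // segments that trigger a merge
            final int docsPerFlush = 100;  // "flushing every 100 docs" from the example
            final int totalDocs = 100000;

            List<Integer> segments = new ArrayList<Integer>(); // doc count per segment
            long docsCopiedByMerges = 0;

            for (int docs = 0; docs < totalDocs; docs += docsPerFlush) {
                segments.add(docsPerFlush);           // one freshly flushed segment
                if (segments.size() == mergeFactor) {
                    // Buggy behavior: everything collapses into a single segment,
                    // so every doc indexed so far is copied again.
                    int merged = 0;
                    for (int size : segments) merged += size;
                    docsCopiedByMerges += merged;
                    segments.clear();
                    segments.add(merged);
                }
            }
            System.out.println("docs indexed:          " + totalDocs);
            System.out.println("docs copied by merges: " + docsCopiedByMerges);
        }
    }

With these numbers it reports roughly 5.6 million documents copied by merges for only 100,000 documents indexed, i.e. the whole index is effectively rewritten over and over.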

With LUCENE-843 I'm thinking we should deprecate maxBufferedDocs
entirely and switch to flushing by RAM usage instead (you can always
manually flush every N documents in your app if for some reason you
need that).  But obviously we need to resolve this bug first.


If you "flush by RAM usage" then IndexWriter may over-merge
-----------------------------------------------------------

                Key: LUCENE-845
                URL: https://issues.apache.org/jira/browse/LUCENE-845
            Project: Lucene - Java
         Issue Type: Bug
         Components: Index
   Affects Versions: 2.1
           Reporter: Michael McCandless
        Assigned To: Michael McCandless
           Priority: Minor

I think a good way to maximize performance of Lucene's indexing for a
given amount of RAM is to flush (writer.flush()) the added documents
whenever the RAM usage (writer.ramSizeInBytes()) has crossed the max
RAM you can afford.
But, this can confuse the merge policy and cause over-merging, unless
you set maxBufferedDocs properly.
This is because the merge policy looks at the current maxBufferedDocs
to figure out which segments are level 0 (first flushed) or level 1
(merged from <mergeFactor> level 0 segments).
I'm not sure how to fix this.  Maybe we can look at net size (bytes)
of a segment and "infer" level from this?  Still we would have to be
resilient to the application suddenly increasing the RAM allowed.
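
A purely illustrative sketch of that idea (not existing merge-policy code; flushSizeBytes and mergeFactor are assumed inputs):

    /**
     * Illustrative only: infer a segment's merge "level" from its size in bytes,
     * given an estimate of how large a freshly flushed (level 0) segment is.
     * A level-1 segment is roughly mergeFactor times a level-0 segment, and so on.
     */
    static int inferLevel(long segmentSizeBytes, long flushSizeBytes, int mergeFactor) {
        int level = 0;
        long upperBound = flushSizeBytes;
        while (segmentSizeBytes > upperBound) {
            upperBound *= mergeFactor;
            level++;
        }
        return level;
    }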
The good news is to workaround this bug I think you just need to
ensure that your maxBufferedDocs is less than mergeFactor *
typical-number-of-docs-flushed.
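
A rough sketch of that workaround in application code (illustrative only; the 32 MB budget, the 5000 value for maxBufferedDocs, and the field contents are placeholders, assuming a typical RAM flush holds on the order of 1000 docs):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;

    public class FlushByRamSketch {
        public static void main(String[] args) throws Exception {
            final long ramBudget = 32 * 1024 * 1024; // placeholder budget: 32 MB

            IndexWriter writer = new IndexWriter("/tmp/ram-flush-index",
                    new StandardAnalyzer(), true);
            writer.setMergeFactor(10);
            // Workaround: keep maxBufferedDocs above a typical RAM flush (so it
            // doesn't trigger early) but below mergeFactor * typical-docs-per-flush.
            // 5000 is only a placeholder; tune it to your actual flush sizes.
            writer.setMaxBufferedDocs(5000);

            for (int i = 0; i < 100000; i++) {
                Document doc = new Document();
                doc.add(new Field("body", "document number " + i,
                        Field.Store.NO, Field.Index.TOKENIZED));
                writer.addDocument(doc);

                // Flush by RAM usage as described above.
                if (writer.ramSizeInBytes() > ramBudget) {
                    writer.flush();
                }
            }
            writer.close();
        }
    }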

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


--------------------------
Grant Ingersoll
Center for Natural Language Processing
http://www.cnlp.org

Read the Lucene Java FAQ at http://wiki.apache.org/jakarta-lucene/LuceneFAQ


