[GitHub] [lucene] MichelLiu commented on pull request #92: Expunge big segment with oversize deletePct caused by continuously updating a batch of data
MichelLiu commented on pull request #92:
URL: https://github.com/apache/lucene/pull/92#issuecomment-825347763

I ran into a problem with the tiered merge policy. After continuously updating the same batch of data over a long period, I ended up with many segments of about 4.9 GB whose segDelPct was already greater than deletesPctAllowed, yet the tiered merge policy would not merge them. I then found the code below and figured out the reason:

    if (segSizeDocs.sizeInBytes > maxMergedSegmentBytes / 2
        && (totalDelPct <= deletesPctAllowed || segDelPct <= deletesPctAllowed)) {
      iter.remove();
      tooBigCount++; // Just for reporting purposes.
      totIndexBytes -= segSizeDocs.sizeInBytes;
      allowedDelCount -= segSizeDocs.delCount;
    }

Here are the segments I saw:

    1613741580098 0 p 10.10.112.123 _2h       89 1224440 569330   4.9gb 4905832 true true 8.4.0 false
    1613741580098 0 p 10.10.112.123 _4v      175 2383463 425919   4.9gb 5636245 true true 8.4.0 false
    1613741580098 0 p 10.10.112.123 _6n      239 2891298 380212   4.9gb 5617940 true true 8.4.0 false
    1613741580098 0 p 10.10.112.123 _1lwc  75036  468350 364104   4.3gb 3718611 true true 8.4.0 false
    1613741580098 0 p 10.10.112.123 _1xh2  90038  678187 252779   3.6gb 3453739 true true 8.4.0 false
    1613741580098 0 p 10.10.112.123 _25u8 100880  482795 237275   4.1gb 3370799 true true 8.4.0 false
    1613741580098 0 p 10.10.112.123 _2fld 113521  721503 225160   4.1gb 3776954 true true 8.4.0 false
    1613741580098 0 p 10.10.112.123 _2m9h 122165  831574 127572   4.2gb 3812013 true true 8.4.0 false
    1613741580098 0 p 10.10.112.123 _2n01 123121   34000  27437 345.3mb  543426 true true 8.4.0 true
    1613741580098 0 p 10.10.112.123 _2nq6 124062   36985  19838 319.2mb  515882 true true 8.4.0 true
    1613741580098 0 p 10.10.112.123 _2o7d 124681   52725  40581 556.3mb  632128 true true 8.4.0 true
    1613741580098 0 p 10.10.112.123 _2ouj 125515   11158   6330   114mb  235396 true true 8.4.0 true

One index started at 564 GB and, after a month of bulk updates, grew to 1400 GB. That wasted a significant amount of disk space and pushed search latency up to 450 ms, so we now have to reindex every month. My solution is to merge the large segments, but as infrequently as possible.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
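To make the quoted condition easier to reason about, here is a minimal Java sketch that evaluates the same boolean shape for one row of the listing. It is not Lucene code: the `Segment` class and the `skippedAsTooBig` helper are hypothetical, the 5 GB maximum merged segment size and the 20% `deletesPctAllowed` are assumed example settings, and the row for `_2h` is read on the assumption that the columns after the segment name are generation, live docs, deleted docs, and size.

```java
final class BigSegmentCheck {
  /** Hypothetical stand-in for one row of the segment listing; not a Lucene class. */
  static final class Segment {
    final String name;
    final long liveDocs;
    final long deletedDocs;
    final double sizeGb;

    Segment(String name, long liveDocs, long deletedDocs, double sizeGb) {
      this.name = name;
      this.liveDocs = liveDocs;
      this.deletedDocs = deletedDocs;
      this.sizeGb = sizeGb;
    }

    /** Deleted docs as a percentage of all docs (live + deleted) in the segment. */
    double delPct() {
      return 100.0 * deletedDocs / (liveDocs + deletedDocs);
    }
  }

  /**
   * Same boolean shape as the quoted condition: returns true when the segment
   * would be removed from the merge candidates as "too big".
   */
  static boolean skippedAsTooBig(
      Segment s, double maxMergedSegmentGb, double totalDelPct, double deletesPctAllowed) {
    return s.sizeGb > maxMergedSegmentGb / 2
        && (totalDelPct <= deletesPctAllowed || s.delPct() <= deletesPctAllowed);
  }

  public static void main(String[] args) {
    // Row "_2h" from the listing, under the column assumption stated above.
    Segment s = new Segment("_2h", 1224440L, 569330L, 4.9);
    double maxMergedSegmentGb = 5.0; // assumed setting
    double deletesPctAllowed = 20.0; // assumed setting

    System.out.printf("%s delPct=%.2f%n", s.name, s.delPct());
    // The segment is over the limit on its own, but is still skipped while the
    // index-wide percentage stays within the limit, because the checks are OR-ed.
    System.out.println(skippedAsTooBig(s, maxMergedSegmentGb, 15.0, deletesPctAllowed));
    System.out.println(skippedAsTooBig(s, maxMergedSegmentGb, 25.0, deletesPctAllowed));
  }
}
```

Under these assumed settings, the segment only becomes a merge candidate once both the index-wide and the per-segment delete percentages exceed the limit, which is consistent with the behaviour described in the comment.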
[GitHub] [lucene-solr] dsmiley commented on a change in pull request #2482: SOLR-15356: Deprecate UninvertDocValuesMergePolicyFactory
dsmiley commented on a change in pull request #2482:
URL: https://github.com/apache/lucene-solr/pull/2482#discussion_r618630451

## File path: solr/core/src/java/org/apache/solr/update/SolrIndexConfig.java ##
@@ -303,6 +304,10 @@ private MergePolicy buildMergePolicy(SolrResourceLoader resourceLoader, IndexSch
         new Class[] { SolrResourceLoader.class, MergePolicyFactoryArgs.class, IndexSchema.class },
         new Object[] {resourceLoader, mpfArgs, schema });
+    if (mpf instanceof UninvertDocValuesMergePolicyFactory) {

Review comment:
When Solr loads a deprecated class (see SolrResourceLoader line 537), it detects this and logs a warning. I think that's fine.
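As an illustration of the kind of check the review comment describes, the sketch below detects `@Deprecated` on a loaded class through reflection and produces a warning message. This is not Solr's `SolrResourceLoader` code; `OldFactory`, `CurrentFactory`, and `deprecationWarning` are hypothetical names, and the only library call relied on is `Class.isAnnotationPresent`, which works here because `@Deprecated` is retained at runtime.

```java
final class DeprecatedClassCheck {
  /** Hypothetical plugin class standing in for a deprecated factory. */
  @Deprecated
  static class OldFactory {}

  /** Hypothetical plugin class that is not deprecated. */
  static class CurrentFactory {}

  /** Returns a warning message if the class carries @Deprecated, otherwise null. */
  static String deprecationWarning(Class<?> clazz) {
    if (clazz.isAnnotationPresent(Deprecated.class)) {
      return "Class " + clazz.getName() + " is deprecated and may be removed in a future release";
    }
    return null;
  }

  public static void main(String[] args) {
    for (Class<?> c : new Class<?>[] {OldFactory.class, CurrentFactory.class}) {
      String warning = deprecationWarning(c);
      if (warning != null) {
        // A real loader would go through its logging framework instead.
        System.err.println("WARN " + warning);
      }
    }
  }
}
```

With a generic check like this in the loader, an extra `instanceof` branch for one specific deprecated factory would be redundant, which seems to be the point of the review comment.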
[GitHub] [lucene] jpountz commented on pull request #91: LUCENE-9932: Performance improvement for BKD index building
jpountz commented on pull request #91:
URL: https://github.com/apache/lucene/pull/91#issuecomment-825004979

Oh, this is disappointing. Maybe the arrays are too small for TimSorter to actually perform better than InPlaceMergeSorter.

I'd be keen to proceed with the change that always performs a stable sort with InPlaceMergeSorter. Some cases do get slower, but only by a few percent, which is unlikely to be noticed across the full indexing chain. On the other hand, some cases get several times faster, which I'm sure is going to be noticeable. We could still iterate later, but for now this sounds to me like a good performance-simplicity trade-off. What do you think?
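One way to see why a stable sort is attractive here, on the assumption that points arrive in increasing doc ID order: sorting by value alone with a stable algorithm leaves ties in doc ID order, so no explicit doc ID tie-breaker is needed. The sketch below shows this with `java.util.Arrays.sort`, which is documented as stable for object arrays. It does not use Lucene's InPlaceMergeSorter or TimSorter, and the `Point` class and helper names are hypothetical.

```java
import java.util.Arrays;
import java.util.Comparator;

final class StableSortSketch {
  /** Hypothetical (value, docId) pair; not a Lucene type. */
  static final class Point {
    final int value;
    final int docId;

    Point(int value, int docId) {
      this.value = value;
      this.docId = docId;
    }
  }

  /** Sorts a copy by value only, relying on the stability of Arrays.sort for objects. */
  static Point[] sortByValueOnly(Point[] in) {
    Point[] out = in.clone();
    Arrays.sort(out, Comparator.comparingInt((Point p) -> p.value));
    return out;
  }

  /** Sorts a copy by value, then by docId as an explicit tie-breaker. */
  static Point[] sortByValueThenDocId(Point[] in) {
    Point[] out = in.clone();
    Arrays.sort(
        out, Comparator.comparingInt((Point p) -> p.value).thenComparingInt(p -> p.docId));
    return out;
  }

  /** Returns true if both arrays list the same doc IDs in the same order. */
  static boolean sameOrder(Point[] a, Point[] b) {
    if (a.length != b.length) {
      return false;
    }
    for (int i = 0; i < a.length; i++) {
      if (a[i].docId != b[i].docId) {
        return false;
      }
    }
    return true;
  }

  public static void main(String[] args) {
    int[] values = {3, 1, 3, 1, 2, 3};
    Point[] points = new Point[values.length];
    for (int docId = 0; docId < values.length; docId++) {
      points[docId] = new Point(values[docId], docId); // doc IDs are incremental
    }
    System.out.println(sameOrder(sortByValueOnly(points), sortByValueThenDocId(points)));
  }
}
```

If the doc IDs were not already increasing, the two orderings could differ, and the tie-breaker comparison would be needed again.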
[GitHub] [lucene-solr] cpoerschke opened a new pull request #2482: SOLR-15356: Deprecate UninvertDocValuesMergePolicyFactory
cpoerschke opened a new pull request #2482:
URL: https://github.com/apache/lucene-solr/pull/2482

I don't think a `solr/CHANGES.txt` entry for Solr 8.x is warranted, given the specialised use of the class, but it would be nice to give anyone using it some indication that the class will be removed in Solr 9 via the https://github.com/apache/solr/pull/83 change.

https://issues.apache.org/jira/browse/SOLR-15356
[GitHub] [lucene] neoremind commented on pull request #91: LUCENE-9932: Performance improvement for BKD index building
neoremind commented on pull request #91:
URL: https://github.com/apache/lucene/pull/91#issuecomment-824891023

I used `TimSort` instead of `InPlaceMergeSorter`, expecting it to be faster, but it turned out to be slower. @jpountz could you check my latest commit to see whether I implemented TimSort correctly?

Below is the latest benchmark of `MSBRadixSort` with stable reorder (isDocIdIncremental = N) and `StableMSBRadixSort` (isDocIdIncremental = Y):

```
 -------------------------------------------------
| bytesPerDim | isDocIdIncremental | avg time(us) |
 -------------------------------------------------
|      1      |         N          |     995541.5 |
|      1      |         Y          |      60399.2 |
|      2      |         N          |     951085.9 |
|      2      |         Y          |     322054.3 |
|      3      |         N          |    1333992.5 |
|      3      |         Y          |     756951.4 |
|      4      |         N          |    1340422.4 |
|      4      |         Y          |    1528955.5 |
|      8      |         N          |    1323878.8 |
|      8      |         Y          |    1494004.5 |
|     16      |         N          |    1305548.1 |
|     16      |         Y          |    1480329.4 |
|     32      |         N          |    1326447.5 |
|     32      |         Y          |    1589089.8 |
 -------------------------------------------------
```
[jira] [Commented] (LUCENE-9928) speed up analysis/icu regeneration
[ https://issues.apache.org/jira/browse/LUCENE-9928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17327961#comment-17327961 ]

ASF subversion and git services commented on LUCENE-9928:
---------------------------------------------------------

Commit 044d152d954f1e22aac5a53792011da54c680617 in lucene's branch refs/heads/main from Robert Muir
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=044d152 ]

LUCENE-9928: speed up analysis/icu regeneration (#82)

The compilation of the library is slow, disable optimization as it doesn't speed up our usage of the gennorm2 tool.

Use better heuristic for make parallelism (tests.jvms rather than just hardcoded value of four).

> speed up analysis/icu regeneration
> ----------------------------------
>
>                 Key: LUCENE-9928
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9928
>             Project: Lucene - Core
>          Issue Type: Task
>            Reporter: Robert Muir
>            Priority: Major
>          Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> This is relatively slow, on linux/mac we have to compile the icu4c library
> and then use the built tools to do the regeneration. Especially the
> compilation of the large library is currently slow.
> Let's make it a little less painful, e.g. use {{-O0}} as optimization isn't
> helpful and slows it down (it's a throwaway method to get correctly versioned
> tools and run them once).
> Before:
> {noformat}
> > Task :lucene:analysis:icu:regenerate
> Aggregate task times (possibly running in parallel!):
>  160.78 sec.  compileIcuLinux
>   15.09 sec.  compileJava
>    1.51 sec.  genUtr30DataFiles
>    1.49 sec.  jar
>    0.79 sec.  genRbbi
>    0.57 sec.  gitStatus
>    0.25 sec.  compileToolsJava
>    0.16 sec.  processResources
>    0.04 sec.  genRbbiChecksumLoad
>    0.02 sec.  genRbbiChecksumSave
>    0.01 sec.  genUtr30DataFilesChecksumLoad
>    0.01 sec.  genUtr30DataFilesChecksumSave
>    0.00 sec.  genUtr30DataFilesIfChanged
>    0.00 sec.  genRbbiIfChanged
>    0.00 sec.  errorProneSkipped
> {noformat}
> After:
> {noformat}
> > Task :lucene:analysis:icu:regenerate
> Aggregate task times (possibly running in parallel!):
>  126.86 sec.  compileIcuLinux
>   15.78 sec.  compileJava
>    1.57 sec.  jar
>    1.35 sec.  genUtr30DataFiles
>    0.81 sec.  genRbbi
>    0.60 sec.  gitStatus
>    0.24 sec.  compileToolsJava
>    0.15 sec.  processResources
>    0.04 sec.  genRbbiChecksumLoad
>    0.02 sec.  genRbbiChecksumSave
>    0.01 sec.  genUtr30DataFilesChecksumLoad
>    0.00 sec.  genUtr30DataFilesChecksumSave
>    0.00 sec.  genRbbiIfChanged
>    0.00 sec.  genUtr30DataFilesIfChanged
>    0.00 sec.  errorProneSkipped
> {noformat}

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] rmuir merged pull request #82: LUCENE-9928: speed up analysis/icu regeneration
rmuir merged pull request #82:
URL: https://github.com/apache/lucene/pull/82
[jira] [Resolved] (LUCENE-9928) speed up analysis/icu regeneration
[ https://issues.apache.org/jira/browse/LUCENE-9928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir resolved LUCENE-9928.
---------------------------------
    Fix Version/s: main (9.0)
       Resolution: Fixed

> speed up analysis/icu regeneration
> ----------------------------------
>
>                 Key: LUCENE-9928
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9928
>             Project: Lucene - Core
>          Issue Type: Task
>            Reporter: Robert Muir
>            Priority: Major
>             Fix For: main (9.0)
>
>          Time Spent: 1h 40m
>  Remaining Estimate: 0h
[GitHub] [lucene-site] zacharymorn commented on pull request #56: Add Zach Chen to committer list
zacharymorn commented on pull request #56:
URL: https://github.com/apache/lucene-site/pull/56#issuecomment-824588404

The change looks good at staging: https://lucene.staged.apache.org/whoweare.html

Hi @janhoy, I have a quick question. I see you had a few recent PRs to merge from `main` to `production` to deploy site changes. Are these changes typically batched up and deployed on a regular basis, or is it more on demand? I'm asking because there are a few commits from April before mine that are not deployed yet, so I wasn't sure whether I should go ahead and create a PR to merge all those commits into `production`, including this one.
[jira] [Commented] (LUCENE-9335) Add a bulk scorer for disjunctions that does dynamic pruning
[ https://issues.apache.org/jira/browse/LUCENE-9335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17327129#comment-17327129 ]

Zach Chen commented on LUCENE-9335:
-----------------------------------

Makes sense. I guess the general strategy then would be to implement BMM in the BulkScorer, and to do the maxScore initialization and the essential / non-essential list partition once per window, so that both are valid only within that 2048-document boundary. I'll give that a try!

> Add a bulk scorer for disjunctions that does dynamic pruning
> ------------------------------------------------------------
>
>                 Key: LUCENE-9335
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9335
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Adrien Grand
>            Priority: Minor
>          Time Spent: 2h
>  Remaining Estimate: 0h
>
> Lucene often gets benchmarked against other engines, e.g. against Tantivy and
> PISA at [https://tantivy-search.github.io/bench/] or against research
> prototypes in Table 1 of
> [https://cs.uwaterloo.ca/~jimmylin/publications/Grand_etal_ECIR2020_preprint.pdf].
> Given that top-level disjunctions of term queries are commonly used for
> benchmarking, it would be nice to optimize this case a bit more. I suspect
> that we could make fewer per-document decisions by implementing a BulkScorer
> instead of a Scorer.
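The essential / non-essential partition mentioned in the comment can be sketched as follows. This is a textbook-style MaxScore partition, not Lucene's implementation: the class and method names are hypothetical, the per-clause maximum scores are assumed to have been computed already for the current window, and a document is assumed to be competitive when its score is at least the minimum competitive score. The 2048 window size is taken from the comment.

```java
import java.util.Arrays;

final class MaxScorePartitionSketch {
  /** Window size taken from the comment; the partition is only valid inside one window. */
  static final int WINDOW_SIZE = 2048;

  /**
   * Given each clause's maximum score within the current window, returns how many
   * clauses (in ascending max-score order) are non-essential: even if a document
   * matched all of them, the sum of their max scores would stay below
   * minCompetitiveScore. The remaining clauses are the essential ones.
   */
  static int countNonEssential(double[] windowMaxScores, double minCompetitiveScore) {
    double[] sorted = windowMaxScores.clone();
    Arrays.sort(sorted); // ascending by max score
    double sum = 0;
    int count = 0;
    while (count < sorted.length && sum + sorted[count] < minCompetitiveScore) {
      sum += sorted[count];
      count++;
    }
    return count;
  }

  public static void main(String[] args) {
    double[] maxScores = {0.5, 2.0, 1.0, 3.0}; // made-up per-window max scores
    int windowMin = 0;
    int windowMax = windowMin + WINDOW_SIZE;
    int nonEssential = countNonEssential(maxScores, 2.0);
    System.out.println(
        "docs [" + windowMin + ", " + windowMax + "): " + nonEssential + " non-essential clauses");
  }
}
```

Only the essential clauses need to drive iteration inside the window; the non-essential ones are consulted just for documents that an essential clause has already matched. Recomputing the partition once per window, rather than per document, is what would reduce the number of per-document decisions.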