[GitHub] [lucene] MichelLiu commented on pull request #92: Expunge big segment with oversize deletePct caused by continuously updating a batch of data

2021-04-22 Thread GitBox


MichelLiu commented on pull request #92:
URL: https://github.com/apache/lucene/pull/92#issuecomment-825347763


   I ran into a problem with the tiered merge policy. After continuously 
updating a batch of data over a long period, I ended up with many 4.9G segments 
whose segDelPct already exceeds deletesPctAllowed, yet the tiered merge policy 
never merges them.
   Then I found the code below and figured out the reason:
   ```
   if (segSizeDocs.sizeInBytes > maxMergedSegmentBytes / 2
       && (totalDelPct <= deletesPctAllowed || segDelPct <= deletesPctAllowed)) {
     iter.remove();
     tooBigCount++; // Just for reporting purposes.
     totIndexBytes -= segSizeDocs.sizeInBytes;
     allowedDelCount -= segSizeDocs.delCount;
   }
   ```
   
   Here are the segments I ran into:
   
   1613741580098 0 p 10.10.112.123 _2h       89 1224440 569330   4.9gb 4905832 true true 8.4.0 false
   1613741580098 0 p 10.10.112.123 _4v      175 2383463 425919   4.9gb 5636245 true true 8.4.0 false
   1613741580098 0 p 10.10.112.123 _6n      239 2891298 380212   4.9gb 5617940 true true 8.4.0 false
   1613741580098 0 p 10.10.112.123 _1lwc  75036  468350 364104   4.3gb 3718611 true true 8.4.0 false
   1613741580098 0 p 10.10.112.123 _1xh2  90038  678187 252779   3.6gb 3453739 true true 8.4.0 false
   1613741580098 0 p 10.10.112.123 _25u8 100880  482795 237275   4.1gb 3370799 true true 8.4.0 false
   1613741580098 0 p 10.10.112.123 _2fld 113521  721503 225160   4.1gb 3776954 true true 8.4.0 false
   1613741580098 0 p 10.10.112.123 _2m9h 122165  831574 127572   4.2gb 3812013 true true 8.4.0 false
   1613741580098 0 p 10.10.112.123 _2n01 123121   34000  27437 345.3mb  543426 true true 8.4.0 true
   1613741580098 0 p 10.10.112.123 _2nq6 124062   36985  19838 319.2mb  515882 true true 8.4.0 true
   1613741580098 0 p 10.10.112.123 _2o7d 124681   52725  40581 556.3mb  632128 true true 8.4.0 true
   1613741580098 0 p 10.10.112.123 _2ouj 125515   11158   6330   114mb  235396 true true 8.4.0 true
   
   
   I also had an index of 564G that grew to 1400G after a month of bulk 
updates. That wasted a significant amount of disk and raised search latency to 
450ms, so we now have to reindex every month.
   
   My solution is to merge the large segments as well, but as infrequently as 
possible.
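   
   To make the condition concrete, here is a minimal, self-contained sketch 
that replays it on plain numbers. It is not Lucene code: the class and method 
names are made up for illustration, and the 5 GB `maxMergedSegmentBytes` and 
33% `deletesPctAllowed` are assumed values rather than settings taken from this 
index. It also assumes the two document counts in the `_1lwc` row of the 
listing are live and deleted docs (468350 and 364104).
   
   ```java
   // Hypothetical helper, not Lucene code: replays the quoted condition on plain numbers.
   final class TooBigCheck {

     /** Percentage of deleted docs, relative to live + deleted docs. */
     static double delPct(long liveDocs, long deletedDocs) {
       long maxDoc = liveDocs + deletedDocs;
       return maxDoc == 0 ? 0.0 : 100.0 * deletedDocs / maxDoc;
     }

     /** True when the quoted condition would drop the segment from the merge candidates. */
     static boolean excludedAsTooBig(long sizeInBytes, long maxMergedSegmentBytes,
         double segDelPct, double totalDelPct, double deletesPctAllowed) {
       return sizeInBytes > maxMergedSegmentBytes / 2
           && (totalDelPct <= deletesPctAllowed || segDelPct <= deletesPctAllowed);
     }

     public static void main(String[] args) {
       long gb = 1024L * 1024 * 1024;
       long maxMergedSegmentBytes = 5 * gb;       // assumed cap
       double deletesPctAllowed = 33.0;           // assumed threshold
       long size = (long) (4.3 * gb);             // size of segment _1lwc
       double segDelPct = delPct(468350, 364104); // about 43.7

       // Index-wide delete ratio below the threshold: the segment is skipped
       // even though its own delete ratio is above the threshold.
       System.out.println(
           excludedAsTooBig(size, maxMergedSegmentBytes, segDelPct, 20.0, deletesPctAllowed));
       // Index-wide delete ratio above the threshold: the segment stays eligible.
       System.out.println(
           excludedAsTooBig(size, maxMergedSegmentBytes, segDelPct, 40.0, deletesPctAllowed));
     }
   }
   ```
   
   With these assumed inputs the first call yields true and the second false, 
which would explain how an oversized segment can stay above the threshold for 
as long as the index-wide ratio stays below it.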


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] dsmiley commented on a change in pull request #2482: SOLR-15356: Deprecate UninvertDocValuesMergePolicyFactory

2021-04-22 Thread GitBox


dsmiley commented on a change in pull request #2482:
URL: https://github.com/apache/lucene-solr/pull/2482#discussion_r618630451



##
File path: solr/core/src/java/org/apache/solr/update/SolrIndexConfig.java
##
@@ -303,6 +304,10 @@ private MergePolicy buildMergePolicy(SolrResourceLoader resourceLoader, IndexSch
 new Class[] { SolrResourceLoader.class, MergePolicyFactoryArgs.class, IndexSchema.class },
 new Object[] {resourceLoader, mpfArgs, schema });
 
+if (mpf instanceof UninvertDocValuesMergePolicyFactory) {

Review comment:
   When Solr loads a deprecated class (see SolrResourceLoader line 537), it 
detects this and logs a warning. I think that's fine.
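   
   For readers unfamiliar with how such a check can work, here is a rough 
sketch using plain reflection. It is not the SolrResourceLoader 
implementation: `DeprecationCheck`, `OldFactory`, `CurrentFactory` and the 
message text are all hypothetical, and it returns the warning text instead of 
logging it.
   
   ```java
   // Hypothetical sketch, not the SolrResourceLoader implementation.
   final class DeprecationCheck {

     @Deprecated
     static final class OldFactory {}

     static final class CurrentFactory {}

     /** Returns a warning message for deprecated classes, or null otherwise. */
     static String deprecationWarning(Class<?> clazz) {
       // @Deprecated has runtime retention, so it is visible through reflection.
       if (clazz.isAnnotationPresent(Deprecated.class)) {
         return "Class " + clazz.getName() + " is deprecated and may be removed in a future release";
       }
       return null;
     }

     public static void main(String[] args) {
       System.out.println(deprecationWarning(OldFactory.class));
       System.out.println(deprecationWarning(CurrentFactory.class));
     }
   }
   ```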







[GitHub] [lucene] jpountz commented on pull request #91: LUCENE-9932: Performance improvement for BKD index building

2021-04-22 Thread GitBox


jpountz commented on pull request #91:
URL: https://github.com/apache/lucene/pull/91#issuecomment-825004979


   Oh, this is disappointing. Maybe the arrays are too small for TimSorter to 
actually perform better than InPlaceMergeSorter.
   
   I'd be keen to proceed with the change that always performs a stable sort 
with InPlaceMergeSorter. Some cases do get slower, but only by a few percent, 
and that is unlikely to be noticed through the full indexing chain. On the 
other hand, some cases are getting several times faster, which I'm sure is 
going to be noticeable. We could still iterate later, but for now this sounds 
to me like a good performance-simplicity trade-off. What do you think?
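   
   As a small illustration of why a stable sort is attractive here, the sketch 
below shows that when doc IDs already arrive in increasing order, a stable sort 
on the value alone produces the same order as sorting on (value, docID). This 
is my reading of the idea rather than something stated in this comment, and it 
uses `java.util.Arrays.sort` (stable for object arrays) instead of Lucene's 
sorters. The class and method names are made up.
   
   ```java
   import java.util.Arrays;
   import java.util.Comparator;

   // Hypothetical sketch: each point is a {value, docId} pair.
   final class StableSortSketch {

     /** Sorts by value only; Arrays.sort on object arrays is guaranteed to be stable. */
     static int[][] sortByValueOnly(int[][] points) {
       int[][] copy = points.clone();
       Arrays.sort(copy, Comparator.comparingInt((int[] p) -> p[0]));
       return copy;
     }

     /** Sorts by value, breaking ties explicitly on docId. */
     static int[][] sortByValueThenDoc(int[][] points) {
       int[][] copy = points.clone();
       Arrays.sort(copy, Comparator.comparingInt((int[] p) -> p[0]).thenComparingInt(p -> p[1]));
       return copy;
     }

     /** True when the stable value-only sort needs no docId tie-break. */
     static boolean stableSortMatchesTieBreak(int[][] points) {
       return Arrays.deepEquals(sortByValueOnly(points), sortByValueThenDoc(points));
     }

     public static void main(String[] args) {
       int[][] incremental = {{5, 0}, {3, 1}, {5, 2}, {3, 3}, {1, 4}};
       int[][] shuffled = {{5, 2}, {5, 0}};
       System.out.println(stableSortMatchesTieBreak(incremental)); // doc IDs increasing
       System.out.println(stableSortMatchesTieBreak(shuffled));    // doc IDs not increasing
     }
   }
   ```
   
   The first call yields true and the second false: the shortcut only holds 
when the input doc IDs are already ordered.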





[GitHub] [lucene-solr] cpoerschke opened a new pull request #2482: SOLR-15356: Deprecate UninvertDocValuesMergePolicyFactory

2021-04-22 Thread GitBox


cpoerschke opened a new pull request #2482:
URL: https://github.com/apache/lucene-solr/pull/2482


   I think a `solr/CHANGES.txt` entry for Solr 8.x is not warranted given the 
specialised use of the class, but it would seem nice to give anyone using it 
some indication that the class is being removed in the future Solr 9 via the 
https://github.com/apache/solr/pull/83 change?
   
   https://issues.apache.org/jira/browse/SOLR-15356





[GitHub] [lucene] neoremind commented on pull request #91: LUCENE-9932: Performance improvement for BKD index building

2021-04-22 Thread GitBox


neoremind commented on pull request #91:
URL: https://github.com/apache/lucene/pull/91#issuecomment-824891023


   I used `TimSort` instead of `InPlaceMergeSorter`, expecting it to be faster, 
but it turned out to be slower. @jpountz, could you check my latest commit to 
see whether I implemented Tim Sort correctly? 
   
   Below is the latest benchmark of `MSBRadixSort` with stable reorder 
(isDocIdIncremental = N) and `StableMSBRadixSort` (isDocIdIncremental = Y):
   ```
   ---------------------------------------------------
   | bytesPerDim | isDocIdIncremental | avg time(us) |
   ---------------------------------------------------
   |      1      |         N          |    995541.5  |
   |      1      |         Y          |     60399.2  |
   |      2      |         N          |    951085.9  |
   |      2      |         Y          |    322054.3  |
   |      3      |         N          |   1333992.5  |
   |      3      |         Y          |    756951.4  |
   |      4      |         N          |   1340422.4  |
   |      4      |         Y          |   1528955.5  |
   |      8      |         N          |   1323878.8  |
   |      8      |         Y          |   1494004.5  |
   |     16      |         N          |   1305548.1  |
   |     16      |         Y          |   1480329.4  |
   |     32      |         N          |   1326447.5  |
   |     32      |         Y          |   1589089.8  |
   ---------------------------------------------------
   ```





[jira] [Commented] (LUCENE-9928) speed up analysis/icu regeneration

2021-04-22 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17327961#comment-17327961
 ] 

ASF subversion and git services commented on LUCENE-9928:
-

Commit 044d152d954f1e22aac5a53792011da54c680617 in lucene's branch 
refs/heads/main from Robert Muir
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=044d152 ]

LUCENE-9928: speed up analysis/icu regeneration (#82)

The compilation of the library is slow, disable optimization as it doesn't 
speed up our usage of the gennorm2 tool.
Use better heuristic for make parallelism (tests.jvms rather than just 
hardcoded value of four).

> speed up analysis/icu regeneration
> --
>
>                 Key: LUCENE-9928
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9928
>             Project: Lucene - Core
>          Issue Type: Task
>            Reporter: Robert Muir
>            Priority: Major
>          Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> This is relatively slow, on linux/mac we have to compile the icu4c library 
> and then use the built tools to do the regeneration. Especially the 
> compilation of the large library is currently slow.
> Let's make it a little less painful, e.g. use {{-O0}} as optimization isn't 
> helpful and slows it down (it's a throwaway method to get correctly versioned 
> tools and run them once).
> Before:
> {noformat}
> > Task :lucene:analysis:icu:regenerate
> Aggregate task times (possibly running in parallel!):
>  160.78 sec.  compileIcuLinux
>   15.09 sec.  compileJava
>    1.51 sec.  genUtr30DataFiles
>    1.49 sec.  jar
>    0.79 sec.  genRbbi
>    0.57 sec.  gitStatus
>    0.25 sec.  compileToolsJava
>    0.16 sec.  processResources
>    0.04 sec.  genRbbiChecksumLoad
>    0.02 sec.  genRbbiChecksumSave
>    0.01 sec.  genUtr30DataFilesChecksumLoad
>    0.01 sec.  genUtr30DataFilesChecksumSave
>    0.00 sec.  genUtr30DataFilesIfChanged
>    0.00 sec.  genRbbiIfChanged
>    0.00 sec.  errorProneSkipped
> {noformat}
> After:
> {noformat}
> > Task :lucene:analysis:icu:regenerate
> Aggregate task times (possibly running in parallel!):
>  126.86 sec.  compileIcuLinux
>   15.78 sec.  compileJava
>    1.57 sec.  jar
>    1.35 sec.  genUtr30DataFiles
>    0.81 sec.  genRbbi
>    0.60 sec.  gitStatus
>    0.24 sec.  compileToolsJava
>    0.15 sec.  processResources
>    0.04 sec.  genRbbiChecksumLoad
>    0.02 sec.  genRbbiChecksumSave
>    0.01 sec.  genUtr30DataFilesChecksumLoad
>    0.00 sec.  genUtr30DataFilesChecksumSave
>    0.00 sec.  genRbbiIfChanged
>    0.00 sec.  genUtr30DataFilesIfChanged
>    0.00 sec.  errorProneSkipped
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)




[GitHub] [lucene] rmuir merged pull request #82: LUCENE-9928: speed up analysis/icu regeneration

2021-04-22 Thread GitBox


rmuir merged pull request #82:
URL: https://github.com/apache/lucene/pull/82


   





[jira] [Resolved] (LUCENE-9928) speed up analysis/icu regeneration

2021-04-22 Thread Robert Muir (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir resolved LUCENE-9928.
-
Fix Version/s: main (9.0)
   Resolution: Fixed

> speed up analysis/icu regeneration
> --
>
>                 Key: LUCENE-9928
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9928
>             Project: Lucene - Core
>          Issue Type: Task
>            Reporter: Robert Muir
>            Priority: Major
>             Fix For: main (9.0)
>
>          Time Spent: 1h 40m
>  Remaining Estimate: 0h
>






[GitHub] [lucene-site] zacharymorn commented on pull request #56: Add Zach Chen to committer list

2021-04-22 Thread GitBox


zacharymorn commented on pull request #56:
URL: https://github.com/apache/lucene-site/pull/56#issuecomment-824588404


   The change looks good on staging: 
https://lucene.staged.apache.org/whoweare.html . 
   
   Hi @janhoy, I have a quick question. I see you had a few recent PRs to merge 
from `main` to `production` to deploy site changes. Are these changes typically 
batched up and deployed on a regular basis, or is it more on demand? I'm asking 
since I see there are a few commits in April before mine that are not deployed 
yet, so I wasn't sure if I should go ahead and create a PR to merge all those 
commits into `production`, including this one.





[jira] [Commented] (LUCENE-9335) Add a bulk scorer for disjunctions that does dynamic pruning

2021-04-22 Thread Zach Chen (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17327129#comment-17327129
 ] 

Zach Chen commented on LUCENE-9335:
---

Makes sense. I guess the general strategy then would be to implement BMM in the 
BulkScorer, and to do the maxScore initialization and the essential / 
non-essential list partitioning once per window, so that they are valid only 
within that 2048-document boundary. I'll give that a try!
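
For reference, the essential / non-essential split can be sketched as follows. This is a minimal illustration of the usual MaxScore partitioning, not code from Lucene or from any patch on this issue: the class and method names are made up, and it assumes the per-clause maximum scores passed in are the ones valid for the current window.

```java
import java.util.Arrays;

// Hypothetical sketch of MaxScore-style partitioning for one scoring window.
final class MaxScorePartition {

  /**
   * Returns how many clauses are non-essential: the longest prefix of clauses, taken in
   * ascending max-score order, whose max scores sum to less than minCompetitiveScore.
   * A document matching only those clauses cannot reach the competitive score.
   */
  static int countNonEssential(float[] maxScores, float minCompetitiveScore) {
    float[] sorted = maxScores.clone();
    Arrays.sort(sorted);
    double sum = 0;
    int nonEssential = 0;
    for (float maxScore : sorted) {
      if (sum + maxScore >= minCompetitiveScore) {
        break; // adding this clause could produce a competitive hit
      }
      sum += maxScore;
      nonEssential++;
    }
    return nonEssential;
  }

  public static void main(String[] args) {
    float[] windowMaxScores = {2.0f, 0.5f, 1.0f, 3.0f};
    System.out.println(countNonEssential(windowMaxScores, 2.0f));
  }
}
```

With the sample values, the clauses with max scores 0.5 and 1.0 are non-essential (their sum 1.5 is below 2.0), so the call returns 2. Only the remaining clauses would need to drive iteration within the window.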

> Add a bulk scorer for disjunctions that does dynamic pruning
> 
>
>                 Key: LUCENE-9335
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9335
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Adrien Grand
>            Priority: Minor
>          Time Spent: 2h
>  Remaining Estimate: 0h
>
> Lucene often gets benchmarked against other engines, e.g. against Tantivy and 
> PISA at [https://tantivy-search.github.io/bench/] or against research 
> prototypes in Table 1 of 
> [https://cs.uwaterloo.ca/~jimmylin/publications/Grand_etal_ECIR2020_preprint.pdf].
>  Given that top-level disjunctions of term queries are commonly used for 
> benchmarking, it would be nice to optimize this case a bit more, I suspect 
> that we could make fewer per-document decisions by implementing a BulkScorer 
> instead of a Scorer.


