[GitHub] [lucene-solr] atris commented on issue #823: LUCENE-8939: Introduce Shared Count Early Termination In Parallel Search
atris commented on issue #823: LUCENE-8939: Introduce Shared Count Early Termination In Parallel Search URL: https://github.com/apache/lucene-solr/pull/823#issuecomment-528262274 Thank you for merging and reviewing, @jpountz ! and thank you @jimczi and @mikemccand for reviewing! This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[GitHub] [lucene-solr] atris commented on issue #823: LUCENE-8939: Introduce Shared Count Early Termination In Parallel Search
atris commented on issue #823: LUCENE-8939: Introduce Shared Count Early Termination In Parallel Search URL: https://github.com/apache/lucene-solr/pull/823#issuecomment-528237811 Thanks for approving, @mikemccand ! @jpountz , @jimczi Does this look ok to merge now? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[GitHub] [lucene-solr] atris commented on issue #823: LUCENE-8939: Introduce Shared Count Early Termination In Parallel Search
atris commented on issue #823: LUCENE-8939: Introduce Shared Count Early Termination In Parallel Search URL: https://github.com/apache/lucene-solr/pull/823#issuecomment-527765667 Folks, post the discussion above, I am assuming this is ready to merge? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[GitHub] [lucene-solr] atris commented on issue #823: LUCENE-8939: Introduce Shared Count Early Termination In Parallel Search
atris commented on issue #823: LUCENE-8939: Introduce Shared Count Early Termination In Parallel Search URL: https://github.com/apache/lucene-solr/pull/823#issuecomment-527515974 > In the unsorted index case, where we skip by impacts once we collect more than the 1000 by default, we are also still correct because we continue collecting in that slice, just skipping by impact based on that thread's private PQ bottom. We can make further improvements e.g. to share the global PQ bottom across all searcher threads, but that should come later. Yes, that is on the line. Next up will be a PR with shared global PQ :) > > So net/net I think the change is correct, and should be a big performance gain for concurrent searching. Sorry for the confusion ;) No sweat, thank you for reviewing! This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[GitHub] [lucene-solr] atris commented on issue #823: LUCENE-8939: Introduce Shared Count Early Termination In Parallel Search
atris commented on issue #823: LUCENE-8939: Introduce Shared Count Early Termination In Parallel Search URL: https://github.com/apache/lucene-solr/pull/823#issuecomment-527464137 @jpountz I updated the PR per your comments -- please take a look and let me know if it seems fine. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[GitHub] [lucene-solr] atris commented on issue #823: LUCENE-8939: Introduce Shared Count Early Termination In Parallel Search
atris commented on issue #823: LUCENE-8939: Introduce Shared Count Early Termination In Parallel Search URL: https://github.com/apache/lucene-solr/pull/823#issuecomment-527285002 > > https://issues.apache.org/jira/browse/LUCENE-8681 > > I am not sure I see how that solves the problem? > > The core change targeted in this PR is allowing `TOTAL_HITS_THRESHOLD` to be accurately counted across all slices. Today, we collect `TOTAL_HITS_THRESHOLD` per slice, which is not what the API definition is. Post this PR, we will collect `TOTAL_HITS_THRESHOLD` in aggregate. `TOTAL_HITS_THRESHOLD`' s definition does not guarantee any order of collection of hits in the concurrent case -- we inadvertently define one today by collection the threshold number of hits per slice. > > RE: Proration, I believe that is a custom logic that can be added on top of this change. In any case, the proration logic also works on a bunch of static values + fudge factors, so it can go wrong and we might end up collecting lesser hits from a more valuable segment. To help prevent this scenario, I believe proration might also do well to build upon this PR and use the shared counter. But, I am unable to see why proration and accurate counting across slices are mutually exclusive. > > In any case, unlike proration, this PR does not propose any algorithmic changes to the way collection is done -- it simply reduces extra work done across slices that we do not even advertise today, so might be something that the user is unaware of. To summarize my monologue, this PR is aimed at accurate counting of hits across all slices -- whereas proration targets a different use case of trying to "distributing" hits across slices based on some parameters. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[GitHub] [lucene-solr] atris commented on issue #823: LUCENE-8939: Introduce Shared Count Early Termination In Parallel Search
atris commented on issue #823: LUCENE-8939: Introduce Shared Count Early Termination In Parallel Search URL: https://github.com/apache/lucene-solr/pull/823#issuecomment-527284505 > https://issues.apache.org/jira/browse/LUCENE-8681 I am not sure I see how that solves the problem? The core change targeted in this PR is allowing `TOTAL_HITS_THRESHOLD` to be accurately counted across all slices. Today, we collect `TOTAL_HITS_THRESHOLD` per slice, which is not what the API definition is. Post this PR, we will collect `TOTAL_HITS_THRESHOLD` in aggregate. `TOTAL_HITS_THRESHOLD`' s definition does not guarantee any order of collection of hits in the concurrent case -- we inadvertently define one today by collection the threshold number of hits per slice. RE: Proration, I believe that is a custom logic that can be added on top of this change. In any case, the proration logic also works on a bunch of static values + fudge factors, so it can go wrong and we might end up collecting lesser hits from a more valuable segment. To help prevent this scenario, I believe proration might also do well to build upon this PR and use the shared counter. But, I am unable to see why proration and accurate counting across slices are mutually exclusive. In any case, unlike proration, this PR does not propose any algorithmic changes to the way collection is done -- it simply reduces extra work done across slices that we do not even advertise today, so might be something that the user is unaware of. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[GitHub] [lucene-solr] atris commented on issue #823: LUCENE-8939: Introduce Shared Count Early Termination In Parallel Search
atris commented on issue #823: LUCENE-8939: Introduce Shared Count Early Termination In Parallel Search URL: https://github.com/apache/lucene-solr/pull/823#issuecomment-527109501 > > I think the output I pasted above does mention the tasks run? > > Hmm the first few `luceneutil` results you posted didn't seem to state which task (`wikimedium10k`)? Or maybe I missed it, sorry ... in general when posting benchmarks it's vital to give enough details and use/share public benchmark tools/data (like `luceneutil` and `wikimedia`'s corpus) so that someone else could go and replicate your results. +1, thanks for the input -- will ensure that henceforth. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[GitHub] [lucene-solr] atris commented on issue #823: LUCENE-8939: Introduce Shared Count Early Termination In Parallel Search
atris commented on issue #823: LUCENE-8939: Introduce Shared Count Early Termination In Parallel Search URL: https://github.com/apache/lucene-solr/pull/823#issuecomment-526845639 > I'm trying to understand the behavior change Lucene users will see with this, when using concurrent searching for one query (passing `ExecutorService` to `IndexSearcher`): > > It looks like with the change such users will see their search precisely when the total collected hits exceeds the limit (1000 by default?), versus today where we will try to collect 1000 per segment and then reduce that to the top 1000 overall? So this means the results will change depending on thread execution/timing? Looking at the documentation around `TOTAL_HITS_THRESHOLD`, I see that it intends to restrict the number of documents scored in total before the query is early terminated. If we do a single threaded search today, that is the behavior we get. However, for concurrent search, we actually look at N * `TOTAL_HITS_THRESHOLD`, where N is the number of slices. So, I believe that we are not doing the advertised behavior for concurrent searches in the status quo. This change should fix that. However, you are correct that thread timing will come into play here -- different slices may have different contributions to the overall number of hits. However, since we are anyways not scoring all documents, I do not believe we offer any guarantees on the documents that we return -- even today, the best documents might be the ones which just came in and hence are on the last segments to be traversed, so never even get looked. WDYT? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[GitHub] [lucene-solr] atris commented on issue #823: LUCENE-8939: Introduce Shared Count Early Termination In Parallel Search
atris commented on issue #823: LUCENE-8939: Introduce Shared Count Early Termination In Parallel Search URL: https://github.com/apache/lucene-solr/pull/823#issuecomment-526845091 > Please don't run "real" benchmarks with `wikimedium10k` -- that may be fast to execute but the results are nonsense since dominant time is query setup, versus visiting postings lists that you'd see with `wikimediumall`. For real results, use `wikimediumall` or `wikibigall`. Thanks for highlighting that. > Also, always report which task you ran when you post results :) I think the output I pasted above does mention the tasks run? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[GitHub] [lucene-solr] atris commented on issue #823: LUCENE-8939: Introduce Shared Count Early Termination In Parallel Search
atris commented on issue #823: LUCENE-8939: Introduce Shared Count Early Termination In Parallel Search URL: https://github.com/apache/lucene-solr/pull/823#issuecomment-526540019 Thanks @jimczi ! This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[GitHub] [lucene-solr] atris commented on issue #823: LUCENE-8939: Introduce Shared Count Early Termination In Parallel Search
atris commented on issue #823: LUCENE-8939: Introduce Shared Count Early Termination In Parallel Search URL: https://github.com/apache/lucene-solr/pull/823#issuecomment-526179514 @jimczi Updated per our discussion. Although, a user can potentially implement `HitsThresholdChecker` and give his custom limits, but that would be an overkill for just a different value for a limit, so agree with your proposal. Please let me know your thoughts. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[GitHub] [lucene-solr] atris commented on issue #823: LUCENE-8939: Introduce Shared Count Early Termination In Parallel Search
atris commented on issue #823: LUCENE-8939: Introduce Shared Count Early Termination In Parallel Search URL: https://github.com/apache/lucene-solr/pull/823#issuecomment-526164337 > Package protected might not be a good idea after all, sorry, users that wants to set the `totalHitsThreshold` different than the default would need to access the `GlobalHitsThresholdChecker`. Maybe we can have a single class and expose two static functions that creates anonymous impl: `create(int)` and `createShared(int)` ? This way users can still modify totalHitsThreshold and share it if they use the managed search ? If I understood correctly, the class would be public and return instances of local checker or global checker with the threshold set as the parameter passed in to the static methods? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[GitHub] [lucene-solr] atris commented on issue #823: LUCENE-8939: Introduce Shared Count Early Termination In Parallel Search
atris commented on issue #823: LUCENE-8939: Introduce Shared Count Early Termination In Parallel Search URL: https://github.com/apache/lucene-solr/pull/823#issuecomment-526157715 @jimczi Updated This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[GitHub] [lucene-solr] atris commented on issue #823: LUCENE-8939: Introduce Shared Count Early Termination In Parallel Search
atris commented on issue #823: LUCENE-8939: Introduce Shared Count Early Termination In Parallel Search URL: https://github.com/apache/lucene-solr/pull/823#issuecomment-526152703 Ran precommit on the latest iteration -- came in clean. @jimczi Looks good to be committed? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[GitHub] [lucene-solr] atris commented on issue #823: LUCENE-8939: Introduce Shared Count Early Termination In Parallel Search
atris commented on issue #823: LUCENE-8939: Introduce Shared Count Early Termination In Parallel Search URL: https://github.com/apache/lucene-solr/pull/823#issuecomment-526122213 > > Also, I believe having the abstraction will let future implementations customize the threshold logic without needing to make any core changes or introduce a new collector, hence we should let the threshold checker implementations be public. WDYT? > > What kind of customization you have in mind ? Since this is now a `BooleanSupplier`, collectors can choose when to early terminate (a hacky idea might be to just terminate after some time has elapsed? Not saying it is a good idea though). > IMO having all this classes and functions that require HitsThresholdChecker as package protected would be enough and would not add to the overall complexity of running a simple search using an executor or not ? Package protected is a good idea -- updated. Does it look fine now? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[GitHub] [lucene-solr] atris commented on issue #823: LUCENE-8939: Introduce Shared Count Early Termination In Parallel Search
atris commented on issue #823: LUCENE-8939: Introduce Shared Count Early Termination In Parallel Search URL: https://github.com/apache/lucene-solr/pull/823#issuecomment-526113533 > > So, essentially, we make implicit CollectorManager implementations owned by IndexSearcher instead of separate implementation classes? > > Yes I think it would be easier and we don't really need to expose `HitsThresholdChecker` in this case, it's just an impl detail that is used by the parallel and the normal top docs collector ? Not really, it would be shared across both `TopFieldCollector and `TopScoreDocCollector`. Also, I believe having the abstraction will let future implementations customize the threshold logic without needing to make any core changes or introduce a new collector, hence we should let the threshold checker implementations be public. WDYT? Updated per comments -- let me know if it looks fine This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[GitHub] [lucene-solr] atris commented on issue #823: LUCENE-8939: Introduce Shared Count Early Termination In Parallel Search
atris commented on issue #823: LUCENE-8939: Introduce Shared Count Early Termination In Parallel Search URL: https://github.com/apache/lucene-solr/pull/823#issuecomment-526102145 > > That would require a CollectorManager implementation > > Not necessarily, you could create a `GlobalSharedHitsThresholdChecker` and use it in the manager impl of `IndexSearcher#searchAfter` if the executor is not null and the number of slices is > 1 ? > In fact I wonder if the collector manager impls (`SharedHitCountFieldCollectorManager`, ...) are required, we could use the `GlobalSharedHitsThresholdChecker` internally in `IndexSearcher` without adding a dedicated manager impl ? So, essentially, we make implicit `CollectorManager` implementations owned by `IndexSearcher` instead of separate implementation classes? I am fine with this -- I do not really have a strong preference either way. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[GitHub] [lucene-solr] atris commented on issue #823: LUCENE-8939: Introduce Shared Count Early Termination In Parallel Search
atris commented on issue #823: LUCENE-8939: Introduce Shared Count Early Termination In Parallel Search URL: https://github.com/apache/lucene-solr/pull/823#issuecomment-526075360 > I wonder if the shared count should be used automatically in `IndexSearcher#searchAfter` if the executor is not null and the number of slices > 1 ? That would require a `CollectorManager` implementation that passes in `GlobalSharedHitsThresholdChecker` and then merges the documents across slices. I have implemented that -- should be able to raise a follow up PR today This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[GitHub] [lucene-solr] atris commented on issue #823: LUCENE-8939: Introduce Shared Count Early Termination In Parallel Search
atris commented on issue #823: LUCENE-8939: Introduce Shared Count Early Termination In Parallel Search URL: https://github.com/apache/lucene-solr/pull/823#issuecomment-525754830 @jimczi Thanks, I updated per your comments. Please let me know if it seems fine. RE: benchmarks, the numbers I posted earlier were on wikimedium10k. I did a run of wikimedium2m and saw the following: TaskQPS baseline StdDevQPS my_modified_version StdDevPct diff HighTermMonthSort 256.49 (14.6%) 240.49 (13.0%) -6.2% ( -29% - 24%) HighTermDayOfYearSort 140.97 (8.4%) 137.26 (8.6%) -2.6% ( -18% - 15%) Fuzzy2 67.54 (20.4%) 65.90 (19.3%) -2.4% ( -34% - 46%) MedTerm 1430.97 (4.8%) 1404.31 (4.2%) -1.9% ( -10% -7%) AndHighLow 859.98 (5.2%) 847.48 (4.1%) -1.5% ( -10% -8%) Respell 111.18 (2.4%) 110.17 (2.5%) -0.9% ( -5% -4%) Prefix3 278.56 (3.3%) 276.16 (2.5%) -0.9% ( -6% -5%) LowTerm 1635.42 (5.3%) 1628.34 (3.2%) -0.4% ( -8% -8%) Fuzzy1 65.18 (11.5%) 64.94 (12.6%) -0.4% ( -21% - 26%) Wildcard 184.51 (3.0%) 183.97 (3.3%) -0.3% ( -6% -6%) HighPhrase 150.25 (2.8%) 149.83 (2.9%) -0.3% ( -5% -5%) PKLookup 124.37 (3.6%) 124.04 (3.4%) -0.3% ( -7% -7%) HighTerm 991.92 (4.7%) 990.52 (4.7%) -0.1% ( -9% -9%) BrowseDayOfYearTaxoFacets 9293.85 (4.9%) 9291.07 (5.4%) -0.0% ( -9% - 10%) LowSloppyPhrase 143.41 (1.9%) 143.38 (2.1%) -0.0% ( -3% -4%) BrowseMonthSSDVFacets 43.43 (0.9%) 43.42 (1.0%) -0.0% ( -1% -1%) MedPhrase 93.36 (1.5%) 93.41 (2.2%) 0.0% ( -3% -3%) HighSloppyPhrase 43.97 (2.4%) 43.99 (2.4%) 0.1% ( -4% -5%) MedSloppyPhrase 108.07 (2.1%) 108.15 (1.8%) 0.1% ( -3% -4%) BrowseDayOfYearSSDVFacets 38.85 (0.6%) 38.88 (0.5%) 0.1% ( -1% -1%) LowPhrase 181.55 (1.8%) 181.82 (1.6%) 0.1% ( -3% -3%) MedSpanNear 66.17 (4.4%) 66.28 (3.0%) 0.2% ( -7% -8%) HighSpanNear 110.26 (3.3%) 110.57 (2.6%) 0.3% ( -5% -6%) LowSpanNear 84.69 (1.9%) 84.94 (2.0%) 0.3% ( -3% -4%) HighIntervalsOrdered 47.21 (0.9%) 47.37 (1.5%) 0.3% ( -1% -2%) AndHighHigh 118.46 (4.3%) 119.01 (3.9%) 0.5% ( -7% -8%) AndHighMed 187.97 (3.8%) 188.85 (3.4%) 0.5% ( -6% -8%) OrHighHigh 79.07 (3.0%) 79.45 (2.5%) 0.5% ( -4% -6%) IntNRQ 238.67 (4.3%) 239.97 (3.5%) 0.5% ( -6% -8%) OrHighLow 429.26 (4.4%) 432.55 (3.3%) 0.8% ( -6% -8%) OrHighMed 170.16 (3.0%) 171.65 (2.8%) 0.9% ( -4% -6%) BrowseDateTaxoFacets 11.28 (2.6%) 11.39 (3.7%) 0.9% ( -5% -7%) BrowseMonthTaxoFacets 9470.41 (6.2%) 9584.95 (4.7%) 1.2% ( -9% - 12%) This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[GitHub] [lucene-solr] atris commented on issue #823: LUCENE-8939: Introduce Shared Count Early Termination In Parallel Search
atris commented on issue #823: LUCENE-8939: Introduce Shared Count Early Termination In Parallel Search URL: https://github.com/apache/lucene-solr/pull/823#issuecomment-525720523 @jimczi Updated the PR with implementation for `TopScoreDocsCollector'. Haven't added the corresponding `CollectorManager` yet though, I presume that can be done separately? Luceneutil numbers on the latest version: TaskQPS baseline StdDevQPS my_modified_version StdDevPct diff MedSloppyPhrase 498.64 (6.9%) 484.94 (16.3%) -2.7% ( -24% - 21%) HighIntervalsOrdered 303.56 (10.5%) 296.03 (11.8%) -2.5% ( -22% - 22%) BrowseDateTaxoFacets 3366.36 (5.2%) 3328.48 (5.4%) -1.1% ( -11% -9%) HighSloppyPhrase 500.45 (8.4%) 496.32 (9.1%) -0.8% ( -16% - 18%) IntNRQ 1108.56 (3.3%) 1104.31 (4.6%) -0.4% ( -7% -7%) Fuzzy1 321.49 (4.8%) 320.92 (4.1%) -0.2% ( -8% -9%) HighPhrase 571.71 (9.5%) 571.13 (8.8%) -0.1% ( -16% - 20%) Respell 386.71 (8.4%) 386.35 (8.1%) -0.1% ( -15% - 17%) AndHighHigh 807.05 (2.5%) 806.66 (4.2%) -0.0% ( -6% -6%) HighTermDayOfYearSort 733.99 (4.4%) 733.77 (5.6%) -0.0% ( -9% - 10%) OrHighLow 1158.52 (1.7%) 1159.52 (3.6%) 0.1% ( -5% -5%) BrowseDayOfYearTaxoFacets 8364.69 (2.9%) 8375.25 (2.8%) 0.1% ( -5% -5%) BrowseMonthSSDVFacets 1999.10 (2.9%) 2005.14 (3.8%) 0.3% ( -6% -7%) BrowseMonthTaxoFacets 8403.68 (2.5%) 8430.34 (3.5%) 0.3% ( -5% -6%) MedTerm 3481.42 (3.0%) 3495.13 (3.0%) 0.4% ( -5% -6%) MedPhrase 813.12 (4.0%) 816.40 (4.0%) 0.4% ( -7% -8%) BrowseDayOfYearSSDVFacets 1762.24 (2.9%) 1770.51 (4.5%) 0.5% ( -6% -8%) OrHighMed 408.00 (3.8%) 409.94 (4.7%) 0.5% ( -7% -9%) PKLookup 148.60 (3.5%) 149.33 (6.3%) 0.5% ( -9% - 10%) HighTermMonthSort 1869.60 (12.7%) 1879.81 (13.1%) 0.5% ( -22% - 30%) Wildcard 898.93 (3.8%) 903.87 (3.7%) 0.5% ( -6% -8%) HighTerm 2075.22 (3.2%) 2088.85 (2.9%) 0.7% ( -5% -6%) LowPhrase 610.73 (4.6%) 617.56 (3.8%) 1.1% ( -6% -9%) LowSloppyPhrase 825.44 (3.2%) 835.72 (3.3%) 1.2% ( -5% -7%) LowSpanNear 1388.73 (2.2%) 1409.14 (2.8%) 1.5% ( -3% -6%) AndHighLow 3047.72 (5.5%) 3096.11 (5.9%) 1.6% ( -9% - 13%) MedSpanNear 750.19 (2.9%) 763.02 (2.0%) 1.7% ( -3% -6%) LowTerm 3670.47 (3.6%) 3736.25 (4.9%) 1.8% ( -6% - 10%) HighSpanNear 401.49 (5.4%) 408.94 (4.5%) 1.9% ( -7% - 12%) Prefix3 390.52 (9.0%) 397.98 (7.5%) 1.9% ( -13% - 20%) OrHighHigh 319.58 (15.3%) 326.27 (9.9%) 2.1% ( -20% - 32%) AndHighMed 1011.67 (3.5%) 1035.03 (3.8%) 2.3% ( -4% -9%) Fuzzy2 37.75 (25.1%) 39.47 (24.4%) 4.6% ( -35% - 72%) This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[GitHub] [lucene-solr] atris commented on issue #823: LUCENE-8939: Introduce Shared Count Early Termination In Parallel Search
atris commented on issue #823: LUCENE-8939: Introduce Shared Count Early Termination In Parallel Search URL: https://github.com/apache/lucene-solr/pull/823#issuecomment-525695991 > > Looks like the performance is consistent and we see no degradation. WDYT? > > I don't think that these tasks use a `TopFieldDocCollector` with early termination ? This is why I said that it would be easier to test with the `TopScoreDocCollector`. I thought some of the tasks do, but apparently not. Should we commit this then, and then follow up on the other JIRA that I opened? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[GitHub] [lucene-solr] atris commented on issue #823: LUCENE-8939: Introduce Shared Count Early Termination In Parallel Search
atris commented on issue #823: LUCENE-8939: Introduce Shared Count Early Termination In Parallel Search URL: https://github.com/apache/lucene-solr/pull/823#issuecomment-525634541 I opened a JIRA for the follow up : https://issues.apache.org/jira/browse/LUCENE-8958 Will post a patch once we merge this PR. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[GitHub] [lucene-solr] atris commented on issue #823: LUCENE-8939: Introduce Shared Count Early Termination In Parallel Search
atris commented on issue #823: LUCENE-8939: Introduce Shared Count Early Termination In Parallel Search URL: https://github.com/apache/lucene-solr/pull/823#issuecomment-525287168 @jimczi Thanks for reviewing the PR. I will follow up with a JIRA and a PR for `TopScoreDocCollector` that does the same as this one. I ran Luceneutil a bunch of times and took an average of performances, the numbers look like: TaskQPS baseline StdDevQPS my_modified_version StdDevPct diff BrowseDateTaxoFacets 3445.32 (5.0%) 3293.96 (5.5%) -4.4% ( -14% -6%) Prefix3 218.41 (23.7%) 212.36 (17.4%) -2.8% ( -35% - 50%) Respell 287.17 (7.8%) 280.86 (8.6%) -2.2% ( -17% - 15%) MedTerm 3415.85 (4.4%) 3349.44 (3.6%) -1.9% ( -9% -6%) Wildcard 236.80 (12.5%) 232.30 (12.4%) -1.9% ( -23% - 26%) AndHighHigh 498.85 (10.9%) 489.48 (10.1%) -1.9% ( -20% - 21%) HighTermDayOfYearSort 600.10 (3.7%) 588.84 (2.2%) -1.9% ( -7% -4%) OrHighMed 590.54 (2.9%) 581.16 (4.3%) -1.6% ( -8% -5%) LowTerm 3943.78 (2.8%) 3882.24 (3.4%) -1.6% ( -7% -4%) LowSloppyPhrase 1689.48 (2.5%) 1669.56 (2.2%) -1.2% ( -5% -3%) BrowseDayOfYearTaxoFacets 8421.12 (1.7%) 8334.95 (2.5%) -1.0% ( -5% -3%) PKLookup 146.25 (4.7%) 144.86 (5.4%) -0.9% ( -10% -9%) OrHighLow 1043.32 (2.1%) 1033.71 (2.7%) -0.9% ( -5% -3%) IntNRQ 1419.85 (2.2%) 1407.79 (1.7%) -0.8% ( -4% -3%) AndHighLow 2100.43 (2.7%) 2083.03 (2.9%) -0.8% ( -6% -4%) HighSpanNear 258.87 (18.4%) 256.75 (22.3%) -0.8% ( -35% - 48%) AndHighMed 936.63 (2.5%) 929.39 (2.0%) -0.8% ( -5% -3%) HighTermMonthSort 1668.46 (2.2%) 1656.43 (3.0%) -0.7% ( -5% -4%) MedSpanNear 1157.98 (2.5%) 1150.21 (2.5%) -0.7% ( -5% -4%) MedPhrase 515.46 (6.3%) 512.41 (6.0%) -0.6% ( -12% - 12%) LowPhrase 761.37 (2.4%) 757.55 (2.1%) -0.5% ( -4% -4%) BrowseDayOfYearSSDVFacets 1802.01 (2.7%) 1794.09 (1.8%) -0.4% ( -4% -4%) LowSpanNear 1460.48 (3.2%) 1454.29 (3.2%) -0.4% ( -6% -6%) BrowseMonthTaxoFacets 8490.66 (1.8%) 8459.18 (2.2%) -0.4% ( -4% -3%) MedSloppyPhrase 672.44 (9.3%) 670.70 (4.8%) -0.3% ( -13% - 15%) HighTerm 1356.56 (2.5%) 1354.39 (2.0%) -0.2% ( -4% -4%) OrHighHigh 580.03 (11.5%) 579.81 (11.0%) -0.0% ( -20% - 25%) Fuzzy1 268.95 (6.4%) 269.28 (5.8%) 0.1% ( -11% - 13%) BrowseMonthSSDVFacets 2025.43 (1.7%) 2029.44 (2.5%) 0.2% ( -3% -4%) HighPhrase 466.42 (14.7%) 469.01 (13.4%) 0.6% ( -23% - 33%) HighIntervalsOrdered 389.06 (13.4%) 391.67 (13.8%) 0.7% ( -23% - 32%) HighSloppyPhrase 771.58 (11.8%) 777.44 (8.9%) 0.8% ( -17% - 24%) Fuzzy2 88.96 (18.0%) 90.19 (12.7%) 1.4% ( -24% - 39%) This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[GitHub] [lucene-solr] atris commented on issue #823: LUCENE-8939: Introduce Shared Count Early Termination In Parallel Search
atris commented on issue #823: LUCENE-8939: Introduce Shared Count Early Termination In Parallel Search URL: https://github.com/apache/lucene-solr/pull/823#issuecomment-524725795 @jimczi Looks like I partially misunderstood your earlier comments -- sorry about that. I have raised a version with the approach we discussed earlier -- does this look like what you had in mind? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[GitHub] [lucene-solr] atris commented on issue #823: LUCENE-8939: Introduce Shared Count Early Termination In Parallel Search
atris commented on issue #823: LUCENE-8939: Introduce Shared Count Early Termination In Parallel Search URL: https://github.com/apache/lucene-solr/pull/823#issuecomment-524385885 > The reason for that is since `AtomicInteger.get()` is not guaranteed to be thread safe, so that was a hack way to read the value safely. If you notice, we do increment the counter when a hit is collected. However, the specific location you are mentioning is just for reading without a lock -- do an `incrementAndGet`, then decrement the phantom increment that we just did Hmm, I am thinking if we should just use `getAcquire` instead of this ugly dance, since that would ensure that loads and stores are not reordered and thus serialized? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[GitHub] [lucene-solr] atris commented on issue #823: LUCENE-8939: Introduce Shared Count Early Termination In Parallel Search
atris commented on issue #823: LUCENE-8939: Introduce Shared Count Early Termination In Parallel Search URL: https://github.com/apache/lucene-solr/pull/823#issuecomment-524382441 > I don't see what the modification is. The early termination logic is the same, the only diff is that the counter to check totalHitsThreshold is global in one case and local in the current impl. What I propose is to decorrelate the local totalHits that is needed to count the number of documents visited by the collector and the counter to check the `totalHitsThreshold`. A simple `BooleanSupplier` that checks if updateMinScore can be called won't change any existing functionality and you can hide this implementation detail by keeping the TopFieldCollector#create as they are today. Did I miss something ? No, you did not. As I mentioned earlier (sorry if I was not being clear), the reason I did not do it the way you proposed was an excessive paranoid to not touch a widely used family of collectors when adding this functionality, until I demonstrate it to the community and get feedback. Based on our discussion above, I implicitly understand that you are fine with the approached proposed in the PR and believe it is safe enough to be added to the core `TopFieldCollector`? If that is the case, I will raise another iteration, integrating the same into the parent class. > Yes, why do you need to decrement the counter, any visited document should increment ? The reason for that is since `AtomicInteger.get()` is not guaranteed to be thread safe, so that was a hack way to read the value safely. If you notice, we do increment the counter when a hit is collected. However, the specific location you are mentioning is just for reading without a lock -- do an `incrementAndGet`, then decrement the phantom increment that we just did This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[GitHub] [lucene-solr] atris commented on issue #823: LUCENE-8939: Introduce Shared Count Early Termination In Parallel Search
atris commented on issue #823: LUCENE-8939: Introduce Shared Count Early Termination In Parallel Search URL: https://github.com/apache/lucene-solr/pull/823#issuecomment-524335774 > Sorry I don't follow. The logic for early termination should be the same in `TopFieldCollector` than the one you put in `SharedHitCountFieldCollector`. Yes, that is correct, but adding it to `TopFieldCollector` would mean that it is a part of `SimpleFieldCollector` and `PagingFieldCollector` code paths as well, and I was conservatively not modifying existing functionality (not atleast until we discussed it in the community). > I also don't understand why you decrement the global total hits counter ? Shouldn't it be equivalent to a shared totalHits in the current impl ? Are you referring to the case when the global hits counter is decremented right when we cross the threshold? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[GitHub] [lucene-solr] atris commented on issue #823: LUCENE-8939: Introduce Shared Count Early Termination In Parallel Search
atris commented on issue #823: LUCENE-8939: Introduce Shared Count Early Termination In Parallel Search URL: https://github.com/apache/lucene-solr/pull/823#issuecomment-524332090 @jimczi I did consider doing that, but went down this route to explicitly decouple the early termination code path from the standard implementations of TopFieldCollector. I was unsure if we should directly enable existing (and future) field collectors to work with early termination. WDYT? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[GitHub] [lucene-solr] atris commented on issue #823: LUCENE-8939: Introduce Shared Count Early Termination In Parallel Search
atris commented on issue #823: LUCENE-8939: Introduce Shared Count Early Termination In Parallel Search URL: https://github.com/apache/lucene-solr/pull/823#issuecomment-524157568 Any thoughts on this one? This seems like a reasonable way to avoid collecting extra hits with minimal overhead This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[GitHub] [lucene-solr] atris commented on issue #823: LUCENE-8939: Introduce Shared Count Early Termination In Parallel Search
atris commented on issue #823: LUCENE-8939: Introduce Shared Count Early Termination In Parallel Search URL: https://github.com/apache/lucene-solr/pull/823#issuecomment-518522813 cc @jpountz This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org