[GitHub] [lucene-solr] mikemccand commented on issue #823: LUCENE-8939: Introduce Shared Count Early Termination In Parallel Search

2019-09-03 Thread GitBox
mikemccand commented on issue #823: LUCENE-8939: Introduce Shared Count Early 
Termination In Parallel Search
URL: https://github.com/apache/lucene-solr/pull/823#issuecomment-527503405
 
 
   > > > https://issues.apache.org/jira/browse/LUCENE-8681
   > > 
   > > 
   > > I am not sure I see how that solves the problem?
   > > The core change targeted in this PR is allowing `TOTAL_HITS_THRESHOLD` 
to be accurately counted across all slices. Today, we collect 
`TOTAL_HITS_THRESHOLD` per slice, which is not what the API definition is. Post 
this PR, we will collect `TOTAL_HITS_THRESHOLD` in aggregate. 
`TOTAL_HITS_THRESHOLD`' s definition does not guarantee any order of collection 
of hits in the concurrent case -- we inadvertently define one today by 
collection the threshold number of hits per slice.
   > > RE: Proration, I believe that is a custom logic that can be added on top 
of this change. In any case, the proration logic also works on a bunch of 
static values + fudge factors, so it can go wrong and we might end up 
collecting lesser hits from a more valuable segment. To help prevent this 
scenario, I believe proration might also do well to build upon this PR and use 
the shared counter. But, I am unable to see why proration and accurate counting 
across slices are mutually exclusive.
   > > In any case, unlike proration, this PR does not propose any algorithmic 
changes to the way collection is done -- it simply reduces extra work done 
across slices that we do not even advertise today, so might be something that 
the user is unaware of.
   > 
   > To summarize my monologue, this PR is aimed at accurate counting of hits 
across all slices -- whereas proration targets a different use case of trying 
to "distributing" hits across slices based on some parameters.
   
   Thanks @atris, I was worried this change would alter the correctness of the 
top hits in the concurrent case based on thread scheduling, but I was wrong:
   
   In the sorted index case, we only early terminate a searcher thread (slice) 
once its thread-private PQ is full and the global (1000 default) hit count has 
been collected, so that way we know the competitive top hits from that segment 
will be merged/reduced in the end.
   
   In the unsorted index case, where we skip by impacts once we collect more 
than the 1000 by default, we are also still correct because we continue 
collecting in that slice, just skipping by impact based on that thread's 
private PQ bottom.  We can make further improvements e.g. to share the global 
PQ bottom across all searcher threads, but that should come later.
   
   So net/net I think the change is correct, and should be a big performance 
gain for concurrent searching.  Sorry for the confusion ;)


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[GitHub] [lucene-solr] mikemccand commented on issue #823: LUCENE-8939: Introduce Shared Count Early Termination In Parallel Search

2019-09-02 Thread GitBox
mikemccand commented on issue #823: LUCENE-8939: Introduce Shared Count Early 
Termination In Parallel Search
URL: https://github.com/apache/lucene-solr/pull/823#issuecomment-527262380
 
 
   > > > I'm trying to understand the behavior change Lucene users will see 
with this, when using concurrent searching for one query (passing 
`ExecutorService` to `IndexSearcher`):
   > > > It looks like with the change such users will see their search 
precisely when the total collected hits exceeds the limit (1000 by default?), 
versus today where we will try to collect 1000 per segment and then reduce that 
to the top 1000 overall? So this means the results will change depending on 
thread execution/timing?
   > > 
   > > 
   > > Looking at the documentation around `TOTAL_HITS_THRESHOLD`, I see that 
it intends to restrict the number of documents scored in total before the query 
is early terminated. If we do a single threaded search today, that is the 
behavior we get. However, for concurrent search, we actually look at N * 
`TOTAL_HITS_THRESHOLD`, where N is the number of slices. So, I believe that we 
are not doing the advertised behavior for concurrent searches in the status 
quo. This change should fix that.
   > > However, you are correct that thread timing will come into play here -- 
different slices may have different contributions to the overall number of 
hits. However, since we are anyways not scoring all documents, I do not believe 
we offer any guarantees on the documents that we return -- even today, the best 
documents might be the ones which just came in and hence are on the last 
segments to be traversed, so never even get looked. WDYT?
   > 
   > OK that makes sense @atris -- it seems that which specific top hits you'll 
get back is intentionally not defined in the API and so we have the freedom to 
make improvements like this.
   
   I'm still confused about this change -- wouldn't it be better to e.g. 
pro-rate the topN per segment for the concurrent case 
(https://issues.apache.org/jira/browse/LUCENE-8681) rather than rely on the 
JVM's thread scheduling to determine which `TOTAL_HITS_THRESHOLD` hits are 
collected?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[GitHub] [lucene-solr] mikemccand commented on issue #823: LUCENE-8939: Introduce Shared Count Early Termination In Parallel Search

2019-09-02 Thread GitBox
mikemccand commented on issue #823: LUCENE-8939: Introduce Shared Count Early 
Termination In Parallel Search
URL: https://github.com/apache/lucene-solr/pull/823#issuecomment-527203682
 
 
   > > I'm trying to understand the behavior change Lucene users will see with 
this, when using concurrent searching for one query (passing `ExecutorService` 
to `IndexSearcher`):
   > > It looks like with the change such users will see their search precisely 
when the total collected hits exceeds the limit (1000 by default?), versus 
today where we will try to collect 1000 per segment and then reduce that to the 
top 1000 overall? So this means the results will change depending on thread 
execution/timing?
   > 
   > Looking at the documentation around `TOTAL_HITS_THRESHOLD`, I see that it 
intends to restrict the number of documents scored in total before the query is 
early terminated. If we do a single threaded search today, that is the behavior 
we get. However, for concurrent search, we actually look at N * 
`TOTAL_HITS_THRESHOLD`, where N is the number of slices. So, I believe that we 
are not doing the advertised behavior for concurrent searches in the status 
quo. This change should fix that.
   > 
   > However, you are correct that thread timing will come into play here -- 
different slices may have different contributions to the overall number of 
hits. However, since we are anyways not scoring all documents, I do not believe 
we offer any guarantees on the documents that we return -- even today, the best 
documents might be the ones which just came in and hence are on the last 
segments to be traversed, so never even get looked. WDYT?
   
   OK that makes sense @atris -- it seems that which specific top hits you'll 
get back is intentionally not defined in the API and so we have the freedom to 
make improvements like this.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[GitHub] [lucene-solr] mikemccand commented on issue #823: LUCENE-8939: Introduce Shared Count Early Termination In Parallel Search

2019-09-02 Thread GitBox
mikemccand commented on issue #823: LUCENE-8939: Introduce Shared Count Early 
Termination In Parallel Search
URL: https://github.com/apache/lucene-solr/pull/823#issuecomment-527103813
 
 
   > I think the output I pasted above does mention the tasks run?
   
   Hmm the first few `luceneutil` results you posted didn't seem to state which 
task (`wikimedium10k`)?  Or maybe I missed it, sorry ... in general when 
posting benchmarks it's vital to give enough details and use/share public 
benchmark tools/data (like `luceneutil` and `wikimedia`'s corpus) so that 
someone else could go and replicate your results.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[GitHub] [lucene-solr] mikemccand commented on issue #823: LUCENE-8939: Introduce Shared Count Early Termination In Parallel Search

2019-08-31 Thread GitBox
mikemccand commented on issue #823: LUCENE-8939: Introduce Shared Count Early 
Termination In Parallel Search
URL: https://github.com/apache/lucene-solr/pull/823#issuecomment-526824189
 
 
   Please don't run "real" benchmarks with `wikimedium10k` -- that may be fast 
to execute but the results are nonsense since dominant time is query setup, 
versus visiting postings lists that you'd see with `wikimediumall`.  For real 
results, use `wikimediumall` or `wikibigall`.  Also, always report which task 
you ran when you post results :)


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org