[jira] [Commented] (LUCENE-9976) WANDScorer assertion error in ensureConsistent

2021-06-01 Thread Zach Chen (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17355476#comment-17355476
 ] 

Zach Chen commented on LUCENE-9976:
---

Hmm this test actually passed for me:
{code:java}
xichen@Xis-MacBook-Pro lucene % ./gradlew test --tests 
TestExpressionSorts.testQueries -Dtests.seed=FF571CE915A0955 
-Dtests.multiplier=2 -Dtests.nightly=true -Dtests.slow=true 
-Dtests.asserts=true -p lucene/expressions/
Starting a Gradle Daemon, 7 busy and 18 incompatible Daemons could not be 
reused, use --status for details


> Task :randomizationInfo
Running tests with randomization seed: tests.seed=FF571CE915A0955


BUILD SUCCESSFUL in 37s
{code}
I'm on macOS, testing against the main branch head commit a6cf46dad.

> WANDScorer assertion error in ensureConsistent
> --
>
> Key: LUCENE-9976
> URL: https://issues.apache.org/jira/browse/LUCENE-9976
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Dawid Weiss
>Priority: Major
>
> Build fails and is reproducible:
> https://ci-builds.apache.org/job/Lucene/job/Lucene-NightlyTests-main/283/console
> {code}
> ./gradlew test --tests TestExpressionSorts.testQueries 
> -Dtests.seed=FF571CE915A0955 -Dtests.multiplier=2 -Dtests.nightly=true 
> -Dtests.slow=true -Dtests.asserts=true -p lucene/expressions/
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] jtibshirani opened a new pull request #164: LUCENE-9905: Allow Lucene90Codec to be configured with a per-field vector format

2021-06-01 Thread GitBox


jtibshirani opened a new pull request #164:
URL: https://github.com/apache/lucene/pull/164


   Previously only AssertingCodec could handle a per-field vector format. This PR
   also strengthens the checks in TestPerFieldVectorFormat.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org






[jira] [Commented] (LUCENE-9950) Support both single- and multi-value string fields in facet counting (non-taxonomy based approaches)

2021-06-01 Thread Greg Miller (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17355413#comment-17355413
 ] 

Greg Miller commented on LUCENE-9950:
-

Thanks for adding your additional insights and use-case details [~sqshq]! As 
for specific use cases and performance comparison to SSDVFC, I'll describe what 
I was thinking (but unfortunately don't have any benchmark data for 
performance). In cases where a large majority of all possible dims need to be 
counted, it should be more efficient to pack them into a single field, allowing 
the matching docs to be iterated a single time (counting all dims/values along 
the way and probably getting some locality benefits as you mention). On the 
other hand, if only a small percentage of all available dims need to be 
counted, a lot of wasteful counting/computation takes place during the single 
counting iteration. In these cases, using a separate field-per-dimension means 
iterating the hits multiple times but not doing any unnecessary dim/value 
counting. Seems like there could be use-cases for both approaches.

 

To be honest, I approached this more from the standpoint of being able to count 
doc value fields with any string data in them (unlike SSDVFC which assumes a 
specific format involving a dim). This is a nice functional complement to 
something like {{LongValueFacetCounts}}. It would be really interesting to do 
some performance benchmarking though!
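To make the tradeoff concrete, here is a minimal, Lucene-free Java sketch (names and the packed "dim:value" data layout are purely illustrative, not the SSDVFC implementation): counting every dim in one pass over the hits versus re-iterating the hits once per requested dim.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class FacetCountingSketch {

  // Single pass over the hits, counting every dim/value encountered.
  // Efficient when most dims are needed; wasteful when only a few are.
  static Map<String, Integer> countAllDims(List<List<String>> hits) {
    Map<String, Integer> counts = new HashMap<>();
    for (List<String> doc : hits) {
      for (String dimValue : doc) {
        counts.merge(dimValue, 1, Integer::sum);
      }
    }
    return counts;
  }

  // One pass per requested dim: iterates the hits again for each dim,
  // but never counts values the caller did not ask for.
  static Map<String, Integer> countOneDim(List<List<String>> hits, String dim) {
    Map<String, Integer> counts = new HashMap<>();
    String prefix = dim + ":";
    for (List<String> doc : hits) {
      for (String dimValue : doc) {
        if (dimValue.startsWith(prefix)) {
          counts.merge(dimValue, 1, Integer::sum);
        }
      }
    }
    return counts;
  }

  public static void main(String[] args) {
    List<List<String>> hits =
        List.of(List.of("color:red", "size:xl"), List.of("color:red"));
    System.out.println(countAllDims(hits));        // every dim, one iteration
    System.out.println(countOneDim(hits, "color")); // one dim, one iteration
  }
}
```

With many requested dims the single pass wins (one iteration, better locality); with one dim out of hundreds, the per-dim pass avoids counting values nobody asked for.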

> Support both single- and multi-value string fields in facet counting 
> (non-taxonomy based approaches)
> 
>
> Key: LUCENE-9950
> URL: https://issues.apache.org/jira/browse/LUCENE-9950
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/facet
>Affects Versions: main (9.0)
>Reporter: Greg Miller
>Priority: Minor
> Fix For: main (9.0), 8.9
>
>  Time Spent: 3h
>  Remaining Estimate: 0h
>
> Users wanting to facet count string-based fields using a non-taxonomy-based 
> approach can use {{SortedSetDocValuesFacetCounts}}, which accumulates facet 
> counts based on a {{SortedSetDocValues}} field. This requires the stored doc 
> values to be multi-valued (i.e., {{SORTED_SET}}), and doesn't work on 
> single-valued fields (i.e., {{SORTED}}). In contrast, if a user wants to facet 
> count on a stored numeric field, they can use {{LongValueFacetCounts}}, which 
> supports both single- and multi-valued fields (and in LUCENE-9948, we now 
> auto-detect instead of asking the user to specify).
> Let's update {{SortedSetDocValuesFacetCounts}} to also support, and 
> automatically detect, single- and multi-valued fields. Note that this is a 
> spin-off issue from LUCENE-9946, where [~rcmuir] points out that this can 
> essentially be a one-line change, but we may want to do some class renaming 
> at the same time. Also note that we should do this in 
> {{ConcurrentSortedSetDocValuesFacetCounts}} while we're at it.
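The auto-detect idea (mirroring what LUCENE-9948 did for {{LongValueFacetCounts}}) can be sketched generically; this is not Lucene's doc-values API, just the detect-then-dispatch pattern, with hypothetical names:

```java
import java.util.List;

public class ValueArityDetect {

  // Detect whether every document carries at most one value.
  static boolean isSingleValued(List<List<Long>> perDocValues) {
    for (List<Long> values : perDocValues) {
      if (values.size() > 1) {
        return false;
      }
    }
    return true;
  }

  // Callers no longer declare arity; the counting code picks the
  // right iteration strategy itself.
  static String chooseCountingPath(List<List<Long>> perDocValues) {
    return isSingleValued(perDocValues) ? "single-valued path" : "multi-valued path";
  }

  public static void main(String[] args) {
    System.out.println(chooseCountingPath(List.of(List.of(1L), List.of(2L))));
    System.out.println(chooseCountingPath(List.of(List.of(1L, 2L))));
  }
}
```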






[jira] [Updated] (LUCENE-9970) provide apps details about why TooManyClauses was thrown

2021-06-01 Thread Chris M. Hostetter (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris M. Hostetter updated LUCENE-9970:
---
Attachment: LUCENE-9970.patch
  Assignee: Chris M. Hostetter
Status: Open  (was: Open)

attaching patch that...

* Adds a string arg constructor to {{TooManyClauses}}
** also adds a {{getMaxClauseCount()}} method to access what the limit is/was, 
so it's available programmatically even if the thrower doesn't include it in the 
message
* Adds a {{TooManyNestedClauses}} subclass
** updates {{getNumClausesCheckVisitor()}} to throw this new 
{{TooManyNestedClauses}} if its check is exceeded
* Updates all existing {{TestMaxClauseLimit}} tests to explicitly check for 
{{TooManyNestedClauses}} when expected
** and explicitly confirms that the base exception {{TooManyClauses}} was thrown 
when the nested check shouldn't have been needed.

Patch also includes the following small related tweaks...
* renames {{TestBooleanQuery.testException}} to 
{{TestMaxClauseLimit.testIllegalArgumentExceptionOnZero}}
* improves the comment in {{BooleanQuery}} to clarify that the (original) check 
isn't just an early optimization - it's important in its own right
* adds a {{TestBooleanQuery.testTooManyClauses}} test confirming that adding 
"too many clauses" to a BQ fails fast (w/o needing 
{{IndexSearcher.rewrite(Query)}} to catch it with the recursive check)
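A rough sketch of the exception shape the patch notes describe; the class bodies, field names, and limit value here are illustrative stand-ins, not the actual patch:

```java
public class ClauseLimitSketch {

  static final int MAX_CLAUSE_COUNT = 1024; // illustrative limit

  // Base exception carries the limit so callers can read it
  // programmatically even when the message omits it.
  static class TooManyClauses extends RuntimeException {
    private final int maxClauseCount;

    TooManyClauses(String message, int maxClauseCount) {
      super(message);
      this.maxClauseCount = maxClauseCount;
    }

    public int getMaxClauseCount() {
      return maxClauseCount;
    }
  }

  // Subclass thrown when the *cumulative* nested-clause check trips,
  // so callers can distinguish the two failure modes by type alone.
  static class TooManyNestedClauses extends TooManyClauses {
    TooManyNestedClauses(int maxClauseCount) {
      super("Query contains too many nested clauses; maxClauseCount is set to "
          + maxClauseCount, maxClauseCount);
    }
  }

  static void checkNestedCount(int cumulativeClauses) {
    if (cumulativeClauses > MAX_CLAUSE_COUNT) {
      throw new TooManyNestedClauses(MAX_CLAUSE_COUNT);
    }
  }

  public static void main(String[] args) {
    try {
      checkNestedCount(5000);
    } catch (TooManyNestedClauses e) {
      // Distinguishable without inspecting stack frames:
      System.out.println(e.getMaxClauseCount()); // prints 1024
    }
  }
}
```

Callers that only care about the original direct-children check can catch {{TooManyClauses}} and ignore the subclass; callers that want to react differently to the cumulative check catch {{TooManyNestedClauses}} first.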

> provide apps details about why TooManyClauses was thrown
> 
>
> Key: LUCENE-9970
> URL: https://issues.apache.org/jira/browse/LUCENE-9970
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Chris M. Hostetter
>Assignee: Chris M. Hostetter
>Priority: Major
> Attachments: LUCENE-9970.patch
>
>
> Historically, if {{TooManyClauses}} was thrown it meant exactly one thing: 
> That a QueryX Builder (typically BooleanQuery, but there are a few others in 
> sandbox) was not going to allow its caller to add a clause because that 
> QueryX object already had the maxClauseCount in _direct_ children.
> LUCENE-8811 added an additional "reason" why {{TooManyClauses}} may be thrown 
> starting in 9.0: IndexSearcher may now throw this exception if the 
> (rewritten) Query being executed has a _cumulative_ number of clauses – 
> across the entire structure of _nested_ Query objects – that exceeds the 
> maxClauseCount.
> 
> I think it would be helpful to users if it was possible to tell from the 
> {{TooManyClauses}} exception how the maxClauseCount was exceeded (because of 
> the total number of direct children during rewrite, or cumulatively across 
> the entire nested structure) w/o needing to inspect the stack frames to see 
> whether the thrower was a rewrite method or a QueryVisitor method.
>  






[jira] [Commented] (LUCENE-9962) DrillSideways users should be able to opt-out of "drill down" facet collecting

2021-06-01 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17355308#comment-17355308
 ] 

ASF subversion and git services commented on LUCENE-9962:
-

Commit 3c7a76a148b28cd2b2dc3a60f32e0b2816b75e21 in lucene's branch 
refs/heads/main from Greg Miller
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=3c7a76a ]

LUCENE-9962: Allow DrillSideways sub-classes to provide their own "drill down" 
facet counting implementation (or null). (#143)



> DrillSideways users should be able to opt-out of "drill down" facet collecting
> --
>
> Key: LUCENE-9962
> URL: https://issues.apache.org/jira/browse/LUCENE-9962
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/facet
>Affects Versions: main (9.0)
>Reporter: Greg Miller
>Priority: Minor
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> The {{DrillSideways}} search methods will _always_ populate a 
> {{FacetsCollector}} for the "drill down" dimensions in addition to the "drill 
> sideways" dimensions. For most cases, this makes sense, but it would be nice 
> if users had a way to opt-out of this collection. It's possible a user may 
> not care to do any faceting on "drill down" dims, or may have custom needs 
> for facet collecting on the "drill downs." For the latter case, the user 
> might want to provide a {{Collector}}/{{CollectorManager}} that does facet 
> collecting with some custom logic (e.g., behind a 
> {{MultiCollector}}/{{MultiCollectorManager}}), in which case the population 
> of an additional {{FacetsCollector}} in {{DrillSideways}} is wasteful.
> The {{DrillSidewaysScorer}} already supports a {{null}} 
> {{drillDownCollector}} gracefully, so this change should mostly just involve 
> creating a {{protected}} method in {{DrillSideways}} for the purpose of 
> creating a "drill down" {{FacetsCollector}} that users can override by 
> providing {{null}}.
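The proposed hook might look roughly like this; class and method names are hypothetical stand-ins, not the actual {{DrillSideways}} API:

```java
// Stand-in for org.apache.lucene.facet.FacetsCollector
class FakeFacetsCollector {}

class DrillSidewaysSketch {
  // In the proposal, DrillSideways would expose a protected factory like this.
  protected FakeFacetsCollector createDrillDownFacetsCollector() {
    return new FakeFacetsCollector(); // default: collect drill-down facets
  }
}

// A subclass opts out of drill-down facet collecting by returning null;
// per the issue, DrillSidewaysScorer already tolerates a null collector.
class NoDrillDownFaceting extends DrillSidewaysSketch {
  @Override
  protected FakeFacetsCollector createDrillDownFacetsCollector() {
    return null;
  }
}

public class DrillSidewaysOptOutDemo {
  public static void main(String[] args) {
    System.out.println(new DrillSidewaysSketch().createDrillDownFacetsCollector() != null);
    System.out.println(new NoDrillDownFaceting().createDrillDownFacetsCollector() == null);
  }
}
```

The default behavior is unchanged; only subclasses that deliberately return null skip the extra {{FacetsCollector}} (e.g. because they already do custom facet collecting behind a {{MultiCollector}}).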






[GitHub] [lucene] gsmiller merged pull request #143: LUCENE-9962: Allow DrillSideways sub-classes to provide their own "drill down" facet counting implementation (or null).

2021-06-01 Thread GitBox


gsmiller merged pull request #143:
URL: https://github.com/apache/lucene/pull/143


   





[jira] [Commented] (LUCENE-9981) CompiledAutomaton.getCommonSuffix can be extraordinarily slow, even with default maxDeterminizedStates limit

2021-06-01 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17355275#comment-17355275
 ] 

ASF subversion and git services commented on LUCENE-9981:
-

Commit c4cf7aa3e199c53d9cf9401363b0c65df5c36111 in lucene's branch 
refs/heads/main from Michael McCandless
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=c4cf7aa ]

LUCENE-9981: more efficient getCommonSuffix/Prefix, and more accurate 'effort 
limit', instead of precise output state limit, during determinize, for throwing 
TooComplexToDeterminizeException


> CompiledAutomaton.getCommonSuffix can be extraordinarily slow, even with 
> default maxDeterminizedStates limit
> 
>
> Key: LUCENE-9981
> URL: https://issues.apache.org/jira/browse/LUCENE-9981
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Robert Muir
>Priority: Major
> Attachments: LUCENE-9981.patch, LUCENE-9981.patch, LUCENE-9981.patch, 
> LUCENE-9981.patch, LUCENE-9981.patch, LUCENE-9981_nfaprefix.patch, 
> LUCENE-9981_test.patch, three-repeats-reverse-det.png, three-repeats.png
>
>
> We have a {{maxDeterminizedStates = 10,000}} limit designed to keep 
> regexp-type queries from blowing up.
> But we have an adversary that will run for 268s on my laptop before hitting 
> the exception, first reported here: 
> https://github.com/opensearch-project/OpenSearch/issues/687
> When I run the test and jstack the threads, this is what I see:
> {noformat}
> "TEST-TestOpensearch687.testInteresting-seed#[4B9C20A027A9850C]" #15 prio=5 
> os_prio=0 cpu=56960.04ms elapsed=57.49s tid=0x7fff7006ca80 nid=0x231c8 
> runnable  [0x7fff8b7f]
>java.lang.Thread.State: RUNNABLE
>   at 
> org.apache.lucene.util.automaton.SortedIntSet.decr(SortedIntSet.java:106)
>   at 
> org.apache.lucene.util.automaton.Operations.determinize(Operations.java:769)
>   at 
> org.apache.lucene.util.automaton.Operations.getCommonSuffixBytesRef(Operations.java:1155)
>   at 
> org.apache.lucene.util.automaton.CompiledAutomaton.<init>(CompiledAutomaton.java:247)
>   at 
> org.apache.lucene.search.AutomatonQuery.<init>(AutomatonQuery.java:104)
>   at 
> org.apache.lucene.search.AutomatonQuery.<init>(AutomatonQuery.java:82)
>   at org.apache.lucene.search.RegexpQuery.<init>(RegexpQuery.java:138)
>   at org.apache.lucene.search.RegexpQuery.<init>(RegexpQuery.java:114)
>   at org.apache.lucene.search.RegexpQuery.<init>(RegexpQuery.java:72)
>   at org.apache.lucene.search.RegexpQuery.<init>(RegexpQuery.java:62)
>   at 
> org.apache.lucene.TestOpensearch687.testInteresting(TestOpensearch687.java:42)
> {noformat}
> This is really sad, as {{getCommonSuffixBytesRef()}} is only supposed to be 
> an "up-front" optimization to make the actual subsequent terms-intensive part 
> of the query faster. But it makes the whole query run for nearly 5 minutes 
> before it does anything.
> So I definitely think we should improve {{getCommonSuffixBytesRef}} to be 
> more "best-effort". For example, it can reduce the lower bound to {{1000}} 
> and catch the exception like such:
> {code}
> try {
>// this is slow, and just an opto anyway, so don't burn cycles on it for 
> some crazy worst-case.
>// if we don't set this common suffix, the query will just run a bit 
> slower, that's all.
>int limit = Math.min(1000, maxDeterminizedStates);
>BytesRef suffix = Operations.getCommonSuffixBytesRef(binary, limit);
>... (setting commonSuffixRef)
> } catch (TooComplexToDeterminizeException notWorthIt) {
>   commonSuffixRef = null;
> }
> {code}
> Another, maybe simpler option, is to just check that input state/transition 
> counts don't exceed some low limit N.
> Basically this opto is geared at stuff like leading wildcard query of "*foo". 
> By computing that the common suffix is "foo" we can spend less CPU in the 
> terms dictionary because we can first do a memcmp before having to run data 
> thru any finite state machine. It's really a microopt and we shouldn't be 
> spending whole seconds of cpu on it, ever.
> But I still don't quite understand how the current limits are giving the 
> behavior today, maybe there is a bigger issue and I don't want to shove 
> something under the rug.
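For intuition, a Lucene-free sketch of why this micro-optimization pays off: a cheap suffix comparison rejects most candidate terms before any finite-state machine has to run (names illustrative; the "automaton" is simulated with {{String.matches}}):

```java
public class SuffixPrefilterSketch {

  // Pretend "automaton" step: an expensive full regexp match.
  static boolean slowAutomatonMatch(String term, String regex) {
    return term.matches(regex);
  }

  // Cheap prefilter: for a query like "*foo", every match must end in "foo",
  // so a memcmp-style suffix check can bail out before running the FSM.
  static boolean matchesWithPrefilter(String term, String suffix, String regex) {
    if (!term.endsWith(suffix)) {
      return false; // rejected cheaply, no FSM work done
    }
    return slowAutomatonMatch(term, regex);
  }

  public static void main(String[] args) {
    System.out.println(matchesWithPrefilter("seafoo", "foo", ".*foo")); // true
    System.out.println(matchesWithPrefilter("seabar", "foo", ".*foo")); // false
  }
}
```

The point of the issue is that computing {{suffix}} itself must stay cheap: it is only a best-effort filter, so spending seconds (let alone minutes) determinizing an adversarial automaton to find it defeats the purpose.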






[jira] [Commented] (LUCENE-9981) CompiledAutomaton.getCommonSuffix can be extraordinarily slow, even with default maxDeterminizedStates limit

2021-06-01 Thread Michael McCandless (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17355273#comment-17355273
 ] 

Michael McCandless commented on LUCENE-9981:


Thanks everyone, I'll push to {{main}} shortly and watch for exciting Jenkins 
failures!







[jira] [Commented] (LUCENE-9981) CompiledAutomaton.getCommonSuffix can be extraordinarily slow, even with default maxDeterminizedStates limit

2021-06-01 Thread Nick Knize (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17355256#comment-17355256
 ] 

Nick Knize commented on LUCENE-9981:


bq. Even with no API break, I don't want these changes rushed in to meet some 
arbitrary 8.9 deadline, I'm really concerned about that.

+1 for letting this bake in main. Consumers of the API should handle the lack 
of auth separately from the fix. Many thanks to [~rmuir] and [~mikemccand] for 
the hard work on this.  







[jira] [Resolved] (LUCENE-9956) Make getBaseQuery API from DrillDownQuery public

2021-06-01 Thread Greg Miller (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Greg Miller resolved LUCENE-9956.
-
Fix Version/s: main (9.0)
   Resolution: Fixed

Thanks [~gworah]!

> Make getBaseQuery API from DrillDownQuery public 
> -
>
> Key: LUCENE-9956
> URL: https://issues.apache.org/jira/browse/LUCENE-9956
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/facet
>Affects Versions: main (9.0)
>Reporter: Gautam Worah
>Priority: Trivial
> Fix For: main (9.0)
>
>  Time Spent: 2.5h
>  Remaining Estimate: 0h
>
> It would be great if users could access the baseQuery of a DrillDownQuery. I 
> think this can be useful for folks who want to access/test the clauses of a 
> BooleanQuery (for example) after they've already wrapped it into a 
> DrillDownQuery.
>  
>  Currently the {{Query getBaseQuery()}} method is package private by default.
> If this proposed change does not make sense, or if this change breaks the 
> semantic of the class, I am happy to explore other ways of doing this!
>  
>  






[jira] [Commented] (LUCENE-9956) Make getBaseQuery API from DrillDownQuery public

2021-06-01 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17355227#comment-17355227
 ] 

ASF subversion and git services commented on LUCENE-9956:
-

Commit 27b009c5d04c4a2169929ef19be5de7f7fa3ec87 in lucene's branch 
refs/heads/main from Gautam Worah
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=27b009c ]

LUCENE-9956: Make getBaseQuery, getDrillDownQueries API from DrillDownQuery 
public  (#138)

Co-authored-by: Gautam Worah 

> Make getBaseQuery API from DrillDownQuery public 
> -
>
> Key: LUCENE-9956
> URL: https://issues.apache.org/jira/browse/LUCENE-9956
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/facet
>Affects Versions: main (9.0)
>Reporter: Gautam Worah
>Priority: Trivial
>  Time Spent: 2h 20m
>  Remaining Estimate: 0h
>






[GitHub] [lucene] gsmiller merged pull request #138: LUCENE-9956: Make getBaseQuery, getDrillDownQueries API from DrillDownQuery public

2021-06-01 Thread GitBox


gsmiller merged pull request #138:
URL: https://github.com/apache/lucene/pull/138


   





[GitHub] [lucene] gsmiller commented on pull request #138: LUCENE-9956: Make getBaseQuery, getDrillDownQueries API from DrillDownQuery public

2021-06-01 Thread GitBox


gsmiller commented on pull request #138:
URL: https://github.com/apache/lucene/pull/138#issuecomment-852283912


   Looks great. Thanks @gautamworah96! 





[GitHub] [lucene] gautamworah96 commented on a change in pull request #138: LUCENE-9956: Make getBaseQuery, getDrillDownQueries API from DrillDownQuery public

2021-06-01 Thread GitBox


gautamworah96 commented on a change in pull request #138:
URL: https://github.com/apache/lucene/pull/138#discussion_r642711762



##
File path: lucene/facet/src/java/org/apache/lucene/facet/DrillDownQuery.java
##
@@ -170,16 +185,36 @@ private BooleanQuery getBooleanQuery() {
 return bq.build();
   }
 
-  Query getBaseQuery() {
+  /**
+   * Returns the internal baseQuery of the DrillDownQuery
+   *
+   * @return The baseQuery used on initialization of DrillDownQuery
+   */
+  public Query getBaseQuery() {
 return baseQuery;
   }
 
-  Query[] getDrillDownQueries() {
-Query[] dimQueries = new Query[this.dimQueries.size()];
-for (int i = 0; i < dimQueries.length; ++i) {
-  dimQueries[i] = this.dimQueries.get(i).build();
+  /**
+   * Returns the dimension queries added either via {@link #add(String, Query)}
+   * or {@link #add(String, String...)}
+   *
+   * @return The array of dimQueries
+   */
+  public Query[] getDrillDownQueries() {
+if (dirtyDimQueryIndex.isEmpty()) {
+  // returns previously built dimQueries
+  Query[] builtDimQueriesCopy = new Query[builtDimQueries.size()];
+  return builtDimQueries.toArray(builtDimQueriesCopy);
+}
+for (int i = 0; i < this.dimQueries.size(); ++i) {
+  if (dirtyDimQueryIndex.contains(i)) {
+builtDimQueries.set(i, this.dimQueries.get(i).build());
+dirtyDimQueryIndex.remove(i);
+  }
 }
-return dimQueries;
+assert dirtyDimQueryIndex.isEmpty();
+// by this time builtDimQueries has all the built queries and dirtyDimQueryIndex is empty
+return getDrillDownQueries();

Review comment:
   Added. Looks cleaner
   







[GitHub] [lucene] gautamworah96 commented on a change in pull request #138: LUCENE-9956: Make getBaseQuery, getDrillDownQueries API from DrillDownQuery public

2021-06-01 Thread GitBox


gautamworah96 commented on a change in pull request #138:
URL: https://github.com/apache/lucene/pull/138#discussion_r642711788



##
File path: lucene/facet/src/java/org/apache/lucene/facet/DrillDownQuery.java
##
@@ -170,16 +185,36 @@ private BooleanQuery getBooleanQuery() {
 return bq.build();
   }
 
-  Query getBaseQuery() {
+  /**
+   * Returns the internal baseQuery of the DrillDownQuery
+   *
+   * @return The baseQuery used on initialization of DrillDownQuery
+   */
+  public Query getBaseQuery() {
 return baseQuery;
   }
 
-  Query[] getDrillDownQueries() {
-Query[] dimQueries = new Query[this.dimQueries.size()];
-for (int i = 0; i < dimQueries.length; ++i) {
-  dimQueries[i] = this.dimQueries.get(i).build();
+  /**
+   * Returns the dimension queries added either via {@link #add(String, 
Query)} or {@link
+   * #add(String, String...)}
+   *
+   * @return The array of dimQueries
+   */
+  public Query[] getDrillDownQueries() {

Review comment:
   Added. Another use of this nice optimization!







[GitHub] [lucene-solr] madrob merged pull request #2500: SOLR-14978 OOM Killer in Foreground

2021-06-01 Thread GitBox


madrob merged pull request #2500:
URL: https://github.com/apache/lucene-solr/pull/2500


   





[jira] [Commented] (LUCENE-9937) ann-benchmarks results for HNSW search

2021-06-01 Thread Michael Sokolov (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17355161#comment-17355161
 ] 

Michael Sokolov commented on LUCENE-9937:
-

I ran some further tests using hnswlib and KnnGraphTester, trying to look more 
closely at how they differ. The test setup was: index 1M vectors (I used the 
enwiki GloVe-100 vectors from luceneutil) using M=16, 
efConstruction/beam-width=500, and then run 10K queries retrieving top K=10 for 
each. For the latency comparison table at the end, I set fanout for the Lucene 
KNN = 5 since this yielded the same 0.778 recall as I calculated for hnswlib 
with these settings. First I looked at some deeper per-search internal metrics: 
"hop count" and "number of updates", as well as total number of nodes visited 
(a proxy for the number of distance calculations). "hop count" says how many graph
arcs we traverse during the search; we compute distances for all the neighbors
of each visited node, but many of those neighbors get discarded, so we never
"hop" to them. "update count" says how many times we found a competitive node
and updated the priority queue.

 

I hacked hnswlib so it would emit these metrics in a "debug" mode, by printing 
them to stdout, and then calculated some summary statistics. These tables were 
computed with beam-width=200, not 500, and fanout=0, and I did not compute 
recall against true-NN, but used the score distribution as a proxy for that:
h3. hnswlib
{noformat}
          max.score   mean.score   visited   hops   updates
 Min.      0.6565      0.6364        43.0    10.0     17.00
 1st Qu.   0.9680      0.9585       120.0    11.0     31.00
 Median    0.9871      0.9841       173.0    12.0     38.00
 Mean      0.9734      0.9667       175.7    12.7     42.35
 3rd Qu.   0.9954      0.9928       217.0    14.0     47.00
 Max.      0.9997      0.9996       463.0    33.0    195.00
{noformat}
h3. lucene-knn
{noformat}
          top.score   mean.score   visited   hops    updates
 Min.      0.6995      0.6843        82.0    12.00     41.0
 1st Qu.   0.9681      0.9582       181.0    18.00     83.0
 Median    0.9870      0.9839       205.0    20.00     98.0
 Mean      0.9731      0.9664       210.7    20.55    100.3
 3rd Qu.   0.9953      0.9928       237.0    23.00    113.0
 Max.      0.9997      0.9996       391.0    39.00    232.0
{noformat}
I think this shows the two implementations are performing similarly, in the 
sense that they produce similar-scoring results for similar parameter settings 
(see max score and mean score). Lucene KNN visits more nodes, but not that many 
more. It has quite a few more hops and updates. I think this is accounted for 
by the fact that we don't have hierarchical nodes. Also - in the hnswlib case I 
did not include the traversal of the upper layers of the hierarchical graph - 
this would add a few more visits, hops, and updates, although not as many as we 
need to have since we start at random places in the graph, and it takes a few 
more hops to converge on the nodes near to the target.

Removing the debug code, I ran a latency comparison, with these results:
||impl||recall||latency (ms)||warm count||
|hnswlib|0.778|0.032|n/a|
|lucene|0.778|0.54|cold|
|lucene|0.778|0.11|1000|
|lucene|0.778|0.08|1|

Clearly there is a strong impact of warming; in KnnGraphTester warming is 
implemented by pre-computing matches from the test set with the idea of paging 
in the index (and the test vectors) as well as giving hotspot compiler a chance 
to work its magic. This warming is kind of cheating since it is computing the 
same results we compute in the test, but it does show the difficulty of 
head-to-head latency comparisons of these different systems, and the 
sensitivity of the Java/Lucene implementation to such effects. So I think we 
need to take the ann-benchmarks results with a grain of salt; the test 
framework there isn't really set up for such warming. Maybe we can tweak it to 
do this somehow?

Another takeaway is that implementing the "H" in HNSW might actually be worth 
it; it clearly saves some work in terms of updating the priority queue, and 
doing traversals, and to some extent on the number of visited nodes as well.

 

> ann-benchmarks results for HNSW search
> --
>
> Key: LUCENE-9937
> URL: https://issues.apache.org/jira/browse/LUCENE-9937
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Julie Tibshirani
>Priority: Minor
>
> I hooked up our HNSW implementation to 
> 

[GitHub] [lucene] bruno-roustant commented on a change in pull request #163: LUCENE-9983: Stop sorting determinize powersets unnecessarily

2021-06-01 Thread GitBox


bruno-roustant commented on a change in pull request #163:
URL: https://github.com/apache/lucene/pull/163#discussion_r643113793



##
File path: lucene/core/build.gradle
##
@@ -20,6 +20,8 @@ apply plugin: 'java-library'
 description = 'Lucene core library'
 
 dependencies {
+  implementation 'com.carrotsearch:hppc'

Review comment:
   According to the stats you shared @zhaih in the Jira issue, there are 3 
times more gets (incr/decr) & removals than entry additions. If there are fewer 
than 2M states then IntIntWormMap should be more efficient.
   I suggest that you benchmark with IntIntHashMap and IntIntWormMap (with your 
existing tests) and take the most appropriate.
   Edit: actually maybe a simple array would work? (see the question in the 
Operations comment)







[GitHub] [lucene] bruno-roustant commented on a change in pull request #163: LUCENE-9983: Stop sorting determinize powersets unnecessarily

2021-06-01 Thread GitBox


bruno-roustant commented on a change in pull request #163:
URL: https://github.com/apache/lucene/pull/163#discussion_r643116749



##
File path: lucene/core/src/java/org/apache/lucene/util/automaton/StateSet.java
##
@@ -0,0 +1,107 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.util.automaton;
+
+import com.carrotsearch.hppc.BitMixer;
+import com.carrotsearch.hppc.IntIntHashMap;
+import com.carrotsearch.hppc.cursors.IntCursor;
+import java.util.Arrays;
+
+/** A thin wrapper of {@link com.carrotsearch.hppc.IntIntHashMap} */
+final class StateSet extends IntSet {
+
+  private final IntIntHashMap inner;
+  private int hashCode;
+  private boolean changed;
+  private int[] arrayCache = new int[0];
+
+  StateSet(int capacity) {
+inner = new IntIntHashMap(capacity);
+  }
+
+  // Adds this state to the set
+  void incr(int num) {
+if (inner.addTo(num, 1) == 1) {
+  changed = true;

Review comment:
   Could we rename "changed" to "keysChanged" as this only tracks keys and 
not values?

##
File path: lucene/core/src/java/org/apache/lucene/util/automaton/StateSet.java
##
@@ -0,0 +1,107 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.util.automaton;
+
+import com.carrotsearch.hppc.BitMixer;
+import com.carrotsearch.hppc.IntIntHashMap;
+import com.carrotsearch.hppc.cursors.IntCursor;
+import java.util.Arrays;
+
+/** A thin wrapper of {@link com.carrotsearch.hppc.IntIntHashMap} */
+final class StateSet extends IntSet {
+
+  private final IntIntHashMap inner;
+  private int hashCode;
+  private boolean changed;
+  private int[] arrayCache = new int[0];
+
+  StateSet(int capacity) {
+inner = new IntIntHashMap(capacity);
+  }
+
+  // Adds this state to the set
+  void incr(int num) {
+if (inner.addTo(num, 1) == 1) {
+  changed = true;
+}
+  }
+
+  // Removes this state from the set, if count decrs to 0
+  void decr(int num) {
+assert inner.containsKey(num);
+int keyIndex = inner.indexOf(num);
+int count = inner.indexGet(keyIndex) - 1;
+if (count == 0) {
+  inner.remove(num);
+  changed = true;
+} else {
+  inner.indexReplace(keyIndex, count);
+}
+  }
+
+  void computeHash() {

Review comment:
   I find it strange that this method is called externally by Operations. 
Since the hash is cached and we track when it needs to be recomputed with 
"changed", could we make it private and change the signature to "private int 
getOrComputeHash()"?

##
File path: lucene/core/src/java/org/apache/lucene/util/automaton/Operations.java
##
@@ -676,7 +677,7 @@ public static Automaton determinize(Automaton a, int 
maxDeterminizedStates) {
 // a.writeDot("/l/la/lucene/core/detin.dot");
 
 // Same initial values and state will always have the same hashCode
-FrozenIntSet initialset = new 

[jira] [Commented] (LUCENE-9983) Stop sorting determinize powersets unnecessarily

2021-06-01 Thread Dawid Weiss (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17354856#comment-17354856
 ] 

Dawid Weiss commented on LUCENE-9983:
-

bq. from 6 min to 16 sec on my local machine

Ouch. That's what I call a nice improvement... I guess you know where to focus 
the attention now then. 

> Stop sorting determinize powersets unnecessarily
> 
>
> Key: LUCENE-9983
> URL: https://issues.apache.org/jira/browse/LUCENE-9983
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Michael McCandless
>Priority: Major
>  Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> Spinoff from LUCENE-9981.
> Today, our {{Operations.determinize}} implementation builds powersets of all 
> subsets of NFA states that "belong" in the same determinized state, using 
> [this algorithm|https://en.wikipedia.org/wiki/Powerset_construction].
> To hold each powerset, we use a malleable {{SortedIntSet}} and periodically 
> freeze it to a {{FrozenIntSet}}, also sorted.  We pay a high price to keep 
> these growing maps of int key, int value sorted by key, e.g. upgrading to a 
> {{TreeMap}} once the map is large enough (> 30 entries).
> But I think sorting is entirely unnecessary here!  Really all we need is the 
> ability to add/delete keys from the map, and hashCode / equals (by key only – 
> ignoring value!), and to freeze the map (a small optimization that we could 
> skip initially).  We only use these maps to lookup in the (growing) 
> determinized automaton whether this powerset has already been seen.
> Maybe we could simply poach the {{IntIntScatterMap}} implementation from 
> [HPPC|https://github.com/carrotsearch/hppc]?  And then change its 
> {{hashCode}}/{{equals}} to only use keys (not values).
> This change should be a big speedup for the kinds of (admittedly adversarial) 
> regexps we saw on LUCENE-9981.
>  
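As a rough illustration of the key-only hashCode/equals idea described above, here is a minimal sketch built on java.util.HashMap rather than HPPC; the class and method names are made up for this example, and the real patch wraps HPPC's primitive IntIntHashMap instead:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the counted state set described above: incr/decr
// maintain per-state reference counts, while hashCode/equals look at keys only.
public class CountedStateSet {
  private final Map<Integer, Integer> counts = new HashMap<>();

  // Add one reference to this state
  void incr(int state) {
    counts.merge(state, 1, Integer::sum);
  }

  // Drop one reference; the state leaves the set when its count reaches 0
  void decr(int state) {
    int c = counts.get(state) - 1;
    if (c == 0) {
      counts.remove(state);
    } else {
      counts.put(state, c);
    }
  }

  // Key-only hashCode/equals: two sets holding the same states compare equal
  // even if their reference counts differ.
  @Override
  public int hashCode() {
    return counts.keySet().hashCode();
  }

  @Override
  public boolean equals(Object o) {
    return o instanceof CountedStateSet
        && counts.keySet().equals(((CountedStateSet) o).counts.keySet());
  }

  public static void main(String[] args) {
    CountedStateSet a = new CountedStateSet();
    a.incr(1); a.incr(1); a.incr(2); // state 1 referenced twice
    CountedStateSet b = new CountedStateSet();
    b.incr(1); b.incr(2);
    System.out.println(a.equals(b)); // true: counts are ignored by equals
    a.decr(2);                       // state 2's count drops to 0, it is removed
    System.out.println(a.equals(b)); // false: key sets now differ
  }
}
```

This only demonstrates the equality contract; the actual PR additionally caches a sorted key array so the set can be frozen cheaply.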






[GitHub] [lucene] dweiss commented on a change in pull request #163: LUCENE-9983: Stop sorting determinize powersets unnecessarily

2021-06-01 Thread GitBox


dweiss commented on a change in pull request #163:
URL: https://github.com/apache/lucene/pull/163#discussion_r642820266



##
File path: lucene/core/src/java/org/apache/lucene/util/automaton/StateSet.java
##
@@ -0,0 +1,107 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.util.automaton;
+
+import com.carrotsearch.hppc.BitMixer;
+import com.carrotsearch.hppc.IntIntHashMap;
+import com.carrotsearch.hppc.cursors.IntCursor;
+import java.util.Arrays;
+
+/** A thin wrapper of {@link com.carrotsearch.hppc.IntIntHashMap} */
+final class StateSet extends IntSet {
+
+  private final IntIntHashMap inner;
+  private int hashCode;
+  private boolean changed;
+  private int[] arrayCache = new int[0];
+
+  StateSet(int capacity) {
+inner = new IntIntHashMap(capacity);
+  }
+
+  // Adds this state to the set
+  void incr(int num) {
+if (inner.addTo(num, 1) == 1) {
+  changed = true;
+}
+  }
+
+  // Removes this state from the set, if count decrs to 0
+  void decr(int num) {
+assert inner.containsKey(num);
+int keyIndex = inner.indexOf(num);
+int count = inner.indexGet(keyIndex) - 1;
+if (count == 0) {
+  inner.remove(num);
+  changed = true;
+} else {
+  inner.indexReplace(keyIndex, count);
+}
+  }
+
+  void computeHash() {
+if (changed == false) {
+  return;
+}
+hashCode = inner.size();
+for (IntCursor cursor : inner.keys()) {
+  hashCode += BitMixer.mix(cursor.value);

Review comment:
   +1. An alternative: 37 * hashCode + cursor.value with a final mix step.







[GitHub] [lucene] dweiss commented on a change in pull request #163: LUCENE-9983: Stop sorting determinize powersets unnecessarily

2021-06-01 Thread GitBox


dweiss commented on a change in pull request #163:
URL: https://github.com/apache/lucene/pull/163#discussion_r642819034



##
File path: lucene/core/build.gradle
##
@@ -20,6 +20,8 @@ apply plugin: 'java-library'
 description = 'Lucene core library'
 
 dependencies {
+  implementation 'com.carrotsearch:hppc'

Review comment:
   Feel free to cut and trim to your will. This is exactly why it's 
licensed the way it is. @bruno-roustant came up with some clever new hashing 
improvements recently - these are not published as a public revision but you 
can get them from the repository and compile it locally. See this for details:
   
   https://issues.carrot2.org/browse/HPPC-176







[GitHub] [lucene] dweiss commented on a change in pull request #163: LUCENE-9983: Stop sorting determinize powersets unnecessarily

2021-06-01 Thread GitBox


dweiss commented on a change in pull request #163:
URL: https://github.com/apache/lucene/pull/163#discussion_r642817638



##
File path: lucene/core/src/java/org/apache/lucene/util/automaton/StateSet.java
##
@@ -0,0 +1,107 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.util.automaton;
+
+import com.carrotsearch.hppc.BitMixer;
+import com.carrotsearch.hppc.IntIntHashMap;
+import com.carrotsearch.hppc.cursors.IntCursor;
+import java.util.Arrays;
+
+/** A thin wrapper of {@link com.carrotsearch.hppc.IntIntHashMap} */
+final class StateSet extends IntSet {
+
+  private final IntIntHashMap inner;
+  private int hashCode;

Review comment:
   Not just moving to a long but also using a better hash function. The 
typical accumulative default  (prime * current + elementValue) is fine but if 
the number of hashed elements is small (or if their values are small) this 
leads to poor distributions. Throw a murmur mix function in between (or at the 
end at least) and things typically look much better.
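For illustration only (this is not code from the patch), the contrast between the two styles might look like the sketch below; the mix function is the murmur3 fmix32 finalizer, used here as a stand-alone stand-in for hppc's BitMixer.mix:

```java
// Hypothetical sketch contrasting the two hashing styles discussed above.
public class HashSketch {
  // murmur3 fmix32 finalizer: scrambles the bits of the accumulated hash
  static int mix(int h) {
    h ^= h >>> 16;
    h *= 0x85ebca6b;
    h ^= h >>> 13;
    h *= 0xc2b2ae35;
    h ^= h >>> 16;
    return h;
  }

  // Order-independent: sum of mixed keys, suitable for hashing a set
  static int orderIndependentHash(int[] keys) {
    int h = keys.length;
    for (int k : keys) {
      h += mix(k);
    }
    return h;
  }

  // Accumulative (prime * current + element) with a final mix step;
  // order-sensitive, so keys would need to be sorted first
  static int accumulativeHash(int[] keys) {
    int h = 1;
    for (int k : keys) {
      h = 37 * h + k;
    }
    return mix(h);
  }

  public static void main(String[] args) {
    int[] a = {1, 2, 3};
    int[] b = {3, 2, 1};
    System.out.println(orderIndependentHash(a) == orderIndependentHash(b)); // true
    System.out.println(accumulativeHash(a) == accumulativeHash(b));         // false
  }
}
```

The final mix step matters because small or clustered key values otherwise land in a narrow band of hash codes, which hurts bucket distribution.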







[GitHub] [lucene] dweiss commented on a change in pull request #163: LUCENE-9983: Stop sorting determinize powersets unnecessarily

2021-06-01 Thread GitBox


dweiss commented on a change in pull request #163:
URL: https://github.com/apache/lucene/pull/163#discussion_r642816694



##
File path: lucene/core/src/java/org/apache/lucene/util/automaton/StateSet.java
##
@@ -0,0 +1,107 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.util.automaton;
+
+import com.carrotsearch.hppc.BitMixer;
+import com.carrotsearch.hppc.IntIntHashMap;
+import com.carrotsearch.hppc.cursors.IntCursor;
+import java.util.Arrays;
+
+/** A thin wrapper of {@link com.carrotsearch.hppc.IntIntHashMap} */
+final class StateSet extends IntSet {
+
+  private final IntIntHashMap inner;
+  private int hashCode;
+  private boolean changed;
+  private int[] arrayCache = new int[0];
+
+  StateSet(int capacity) {
+inner = new IntIntHashMap(capacity);
+  }
+
+  // Adds this state to the set
+  void incr(int num) {
+if (inner.addTo(num, 1) == 1) {
+  changed = true;
+}
+  }
+
+  // Removes this state from the set, if count decrs to 0
+  void decr(int num) {
+assert inner.containsKey(num);
+int keyIndex = inner.indexOf(num);
+int count = inner.indexGet(keyIndex) - 1;
+if (count == 0) {
+  inner.remove(num);
+  changed = true;
+} else {
+  inner.indexReplace(keyIndex, count);
+}
+  }
+
+  void computeHash() {
+if (changed == false) {
+  return;
+}
+hashCode = inner.size();
+for (IntCursor cursor : inner.keys()) {
+  hashCode += BitMixer.mix(cursor.value);
+}
+  }
+
+  /**
+   * Create a snapshot of this int set associated with a given state. The 
snapshot will not retain
+   * any frequency information about the elements of this set, only existence.
+   *
+   * It is the caller's responsibility to ensure that the hashCode and data 
are up to date via
+   * the {@link #computeHash()} method before calling this method.
+   *
+   * @param state the state to associate with the frozen set.
+   * @return A new FrozenIntSet with the same values as this set.
+   */
+  FrozenIntSet freeze(int state) {
+if (changed == false) {
+  assert arrayCache != null;
+  return new FrozenIntSet(arrayCache, hashCode, state);
+}
+return new FrozenIntSet(getArray(), hashCode, state);
+  }
+
+  @Override
+  int[] getArray() {
+if (changed == false) {
+  assert arrayCache != null;
+  return arrayCache;
+}
+changed = false;
+arrayCache = inner.keys().toArray();
+// we need to sort this array since "equals" method depend on this
+Arrays.sort(arrayCache);

Review comment:
   Yeah... I think a good empirical breakdown of what actually is happening 
may be much better than theoretical complexity analysis. This doesn't mean it's 
not worth experimenting with int-int hash maps, but I have a gut feeling you can 
get pretty darn close, with less memory, using the existing sorted arrays.



