[jira] [Created] (LUCENE-10371) Make IndexRearranger able to arrange segment into a determined order

2022-01-10 Thread Haoyu Zhai (Jira)
Haoyu Zhai created LUCENE-10371:
---

 Summary: Make IndexRearranger able to arrange segment into a 
determined order
 Key: LUCENE-10371
 URL: https://issues.apache.org/jira/browse/LUCENE-10371
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Haoyu Zhai


Previously, when I tried to change luceneutil to use {{IndexRearranger}} for 
faster deterministic index construction, I found that even when each segment 
contains the same set of documents, the order of the segments impacts the 
estimated hit count (using BMW): 
[https://markmail.org/message/zl6zsqvbg7nwfq6w]

At the time, the discussion leaned toward tolerating the small hit count 
difference to resolve the issue. After discussing it again with [~mikemccand], 
we thought it would also be a good idea to add the ability to rearrange the 
segment order to {{IndexRearranger}}, so that the rearranged index is truly 
identical every time.
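
A minimal sketch of the idea (the comparator and the {{smallestSelectedDocId}} 
helper below are only illustrative, not the existing {{IndexRearranger}} API):

{code:java}
// Hypothetical sketch: after rearranging documents into selector-defined
// segments, sort the resulting segments by a stable key so two rearranged
// copies of the same index always end up with the same segment order.
List<SegmentCommitInfo> segments = new ArrayList<>(segmentInfos.asList());
segments.sort(Comparator.comparingInt(sci -> smallestSelectedDocId(sci)));
// IndexRearranger could then rewrite SegmentInfos in this deterministic order
// before committing, so BMW hit-count estimates no longer depend on the
// accidental order in which segments happen to be written.
{code}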






[jira] [Resolved] (LUCENE-10316) fix TestLRUQueryCache.testCachingAccountableQuery failure

2021-12-21 Thread Haoyu Zhai (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haoyu Zhai resolved LUCENE-10316.
-
Resolution: Fixed

> fix TestLRUQueryCache.testCachingAccountableQuery failure
> -
>
> Key: LUCENE-10316
> URL: https://issues.apache.org/jira/browse/LUCENE-10316
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/search
>Reporter: Haoyu Zhai
>Priority: Minor
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> I saw this build failure: 
> [https://jenkins.thetaphi.de/job/Lucene-9.x-Linux/348/]
> with the following stack trace
> {code:java}
> java.lang.AssertionError: expected:<130.0> but was:<1544976.0>
>   at 
> __randomizedtesting.SeedInfo.seed([F7826B1EB37D545A:995B6ED46A95D1A0]:0)
>   at org.junit.Assert.fail(Assert.java:89)
>   at org.junit.Assert.failNotEquals(Assert.java:835)
>   at org.junit.Assert.assertEquals(Assert.java:577)
>   at org.junit.Assert.assertEquals(Assert.java:701)
>   at 
> org.apache.lucene.search.TestLRUQueryCache.testCachingAccountableQuery(TestLRUQueryCache.java:570)
>   at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.base/java.lang.reflect.Method.invoke(Method.java:566)
>   at 
> com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1754)
>   at 
> com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:942)
>   at 
> com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:978)
>   at 
> com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:992)
> ...
> NOTE: reproduce with: gradlew test --tests 
> TestLRUQueryCache.testCachingAccountableQuery -Dtests.seed=F7826B1EB37D545A 
> -Dtests.multiplier=3 -Dtests.slow=true -Dtests.locale=ckb-IR 
> -Dtests.timezone=Africa/Dakar -Dtests.asserts=true 
> -Dtests.file.encoding=UTF-8 {code}
> It does not reproduce on my laptop on the current main branch, but since the 
> test compares an estimate with a 10% slack, it is bound to fail sometimes.






[jira] [Updated] (LUCENE-10316) fix TestLRUQueryCache.testCachingAccountableQuery failure

2021-12-14 Thread Haoyu Zhai (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haoyu Zhai updated LUCENE-10316:

Description: 
I saw this build failure: 
[https://jenkins.thetaphi.de/job/Lucene-9.x-Linux/348/]
with the following stack trace
{code:java}
java.lang.AssertionError: expected:<130.0> but was:<1544976.0>
at 
__randomizedtesting.SeedInfo.seed([F7826B1EB37D545A:995B6ED46A95D1A0]:0)
at org.junit.Assert.fail(Assert.java:89)
at org.junit.Assert.failNotEquals(Assert.java:835)
at org.junit.Assert.assertEquals(Assert.java:577)
at org.junit.Assert.assertEquals(Assert.java:701)
at 
org.apache.lucene.search.TestLRUQueryCache.testCachingAccountableQuery(TestLRUQueryCache.java:570)
at 
java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:566)
at 
com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1754)
at 
com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:942)
at 
com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:978)
at 
com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:992)
...
NOTE: reproduce with: gradlew test --tests 
TestLRUQueryCache.testCachingAccountableQuery -Dtests.seed=F7826B1EB37D545A 
-Dtests.multiplier=3 -Dtests.slow=true -Dtests.locale=ckb-IR 
-Dtests.timezone=Africa/Dakar -Dtests.asserts=true -Dtests.file.encoding=UTF-8 
{code}
It does not reproduce on my laptop on the current main branch, but since the 
test compares an estimate with a 10% slack, it is bound to fail sometimes.

  was:
I saw this build failure: https://jenkins.thetaphi.de/job/Lucene-9.x-Linux/348/
with the following stack trace
{code:java}
java.lang.AssertionError: expected:<130.0> but was:<1544976.0>
at 
__randomizedtesting.SeedInfo.seed([F7826B1EB37D545A:995B6ED46A95D1A0]:0)
at org.junit.Assert.fail(Assert.java:89)
at org.junit.Assert.failNotEquals(Assert.java:835)
at org.junit.Assert.assertEquals(Assert.java:577)
at org.junit.Assert.assertEquals(Assert.java:701)
at 
org.apache.lucene.search.TestLRUQueryCache.testCachingAccountableQuery(TestLRUQueryCache.java:570)
at 
java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:566)
at 
com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1754)
at 
com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:942)
at 
com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:978)
at 
com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:992)
...
{code}
It does not reproduce on my laptop on the current main branch, but since the 
test compares an estimate with a 10% slack, it is bound to fail sometimes.


> fix TestLRUQueryCache.testCachingAccountableQuery failure
> -
>
> Key: LUCENE-10316
> URL: https://issues.apache.org/jira/browse/LUCENE-10316
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/search
>Reporter: Haoyu Zhai
>Priority: Minor
>
> I saw this build failure: 
> [https://jenkins.thetaphi.de/job/Lucene-9.x-Linux/348/]
> with the following stack trace
> {code:java}
> java.lang.AssertionError: expected:<130.0> but was:<1544976.0>
>   at 
> __randomizedtesting.SeedInfo.seed([F7826B1EB37D545A:995B6ED46A95D1A0]:0)
>   at org.junit.Assert.fail(Assert.java:89)
>   at org.junit.Assert.failNotEquals(Assert.java:835)
>   at org.junit.Assert.assertEquals(Assert.java:577)
>   at org.junit.Assert.assertEquals(Assert.java:701)
>   at 
> org.apache.lucene.search.TestLRUQueryCache.testCachingAccountableQuery(TestLRUQueryCache.java:570)
>   at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.base/java.lang.refle

[jira] [Commented] (LUCENE-10316) fix TestLRUQueryCache.testCachingAccountableQuery failure

2021-12-14 Thread Haoyu Zhai (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17459420#comment-17459420
 ] 

Haoyu Zhai commented on LUCENE-10316:
-

So basically the test is about making sure the query cache produces the right 
memory estimate when the cached query implements the {{Accountable}} interface.

When I originally wrote it, I estimated the query cache size as {{(query_size 
+ linked_hash_map_entry_size) * query_num}} with a 10% slack to allow for 
estimation error. But apparently that is not always enough (probably a larger 
number of cache entries wastes more?).

Given that the aim of the test is to make sure the query cache correctly 
reflects known big queries when they are cached, I think we could change the 
check to {{assert(query_cache_size > sum_of_all_queries_cached)}}. Then we 
wouldn't depend on a slack factor to assert correctness.
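
A minimal sketch of what that relaxed check could look like in the test (the 
{{cachedAccountableQueries}} collection and the commented-out slack assertion 
are illustrative, not the exact existing test code):

{code:java}
// Today (roughly): compare the cache size against an estimate with 10% slack.
// assertEquals(expectedRamBytes, queryCache.ramBytesUsed(), expectedRamBytes * 0.1);

// Proposed sketch: only assert a lower bound, i.e. the cache must account for
// at least as much memory as the Accountable queries report for themselves.
long sumOfCachedQueries = 0;
for (Accountable query : cachedAccountableQueries) {
  sumOfCachedQueries += query.ramBytesUsed();
}
assertTrue(queryCache.ramBytesUsed() > sumOfCachedQueries);
{code}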

> fix TestLRUQueryCache.testCachingAccountableQuery failure
> -
>
> Key: LUCENE-10316
> URL: https://issues.apache.org/jira/browse/LUCENE-10316
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/search
>Reporter: Haoyu Zhai
>Priority: Minor
>
> I saw this build failure: 
> https://jenkins.thetaphi.de/job/Lucene-9.x-Linux/348/
> with the following stack trace
> {code:java}
> java.lang.AssertionError: expected:<130.0> but was:<1544976.0>
>   at 
> __randomizedtesting.SeedInfo.seed([F7826B1EB37D545A:995B6ED46A95D1A0]:0)
>   at org.junit.Assert.fail(Assert.java:89)
>   at org.junit.Assert.failNotEquals(Assert.java:835)
>   at org.junit.Assert.assertEquals(Assert.java:577)
>   at org.junit.Assert.assertEquals(Assert.java:701)
>   at 
> org.apache.lucene.search.TestLRUQueryCache.testCachingAccountableQuery(TestLRUQueryCache.java:570)
>   at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.base/java.lang.reflect.Method.invoke(Method.java:566)
>   at 
> com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1754)
>   at 
> com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:942)
>   at 
> com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:978)
>   at 
> com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:992)
> ...
> {code}
> It does not reproduce on my laptop on the current main branch, but since the 
> test compares an estimate with a 10% slack, it is bound to fail sometimes.






[jira] [Created] (LUCENE-10316) fix TestLRUQueryCache.testCachingAccountableQuery failure

2021-12-14 Thread Haoyu Zhai (Jira)
Haoyu Zhai created LUCENE-10316:
---

 Summary: fix TestLRUQueryCache.testCachingAccountableQuery failure
 Key: LUCENE-10316
 URL: https://issues.apache.org/jira/browse/LUCENE-10316
 Project: Lucene - Core
  Issue Type: Bug
  Components: core/search
Reporter: Haoyu Zhai


I saw this build failure: https://jenkins.thetaphi.de/job/Lucene-9.x-Linux/348/
with the following stack trace
{code:java}
java.lang.AssertionError: expected:<130.0> but was:<1544976.0>
at 
__randomizedtesting.SeedInfo.seed([F7826B1EB37D545A:995B6ED46A95D1A0]:0)
at org.junit.Assert.fail(Assert.java:89)
at org.junit.Assert.failNotEquals(Assert.java:835)
at org.junit.Assert.assertEquals(Assert.java:577)
at org.junit.Assert.assertEquals(Assert.java:701)
at 
org.apache.lucene.search.TestLRUQueryCache.testCachingAccountableQuery(TestLRUQueryCache.java:570)
at 
java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:566)
at 
com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1754)
at 
com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:942)
at 
com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:978)
at 
com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:992)
...
{code}
It does not reproduce on my laptop on the current main branch, but since the 
test compares an estimate with a 10% slack, it is bound to fail sometimes.






[jira] [Commented] (LUCENE-10229) Match offsets should be consistent for fields with positions and fields with offsets

2021-12-06 Thread Haoyu Zhai (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17454334#comment-17454334
 ] 

Haoyu Zhai commented on LUCENE-10229:
-

Here's the PR: https://github.com/apache/lucene/pull/521



> Match offsets should be consistent for fields with positions and fields with 
> offsets
> 
>
> Key: LUCENE-10229
> URL: https://issues.apache.org/jira/browse/LUCENE-10229
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Dawid Weiss
>Priority: Major
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> This is a follow-up of LUCENE-10223 in which it was discovered that fields 
> with
> offsets don't highlight some more complex interval queries properly.  Alan 
> says:
> {quote}
> It's because it returns the position of the inner match, but the offsets of 
> the outer.  And so if you're re-analyzing and retrieving offsets by looking 
> at the positions, you get the 'right' thing.  It's not obvious to me what the 
> correct response is here, but thinking about it the current behaviour is kind 
> of the worst of both worlds, and perhaps we should change it so that you get 
> offsets of the inner match as standard, and then the outer match is returned 
> as part of the sub matches.
> {quote}
> Intervals are nicely separated into "basic intervals" and "filters" which 
> restrict some other source of intervals, here is the original documentation:
> https://github.com/apache/lucene/blob/main/lucene/queries/src/java/org/apache/lucene/queries/intervals/package-info.java#L29-L50
> My experience from an extended period of using interval queries in a frontend 
> where they're highlighted is that filters are restrictions that should not be 
> highlighted - it's the source intervals that people care about. Filters are 
> what you remove or where you give proper context to source intervals.
> The test code contributed in LUCENE-10223 contains numerous query-highlight 
> examples (on fields with positions) where this intuition is demonstrated on 
> all kinds of interval functions:
> https://github.com/apache/lucene/blob/main/lucene/highlighter/src/test/org/apache/lucene/search/matchhighlight/TestMatchHighlighter.java#L335-L542
> This issue is about making the internals work consistently for fields with 
> positions and fields with offsets.






[jira] [Commented] (LUCENE-10229) Match offsets should be consistent for fields with positions and fields with offsets

2021-12-05 Thread Haoyu Zhai (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17453652#comment-17453652
 ] 

Haoyu Zhai commented on LUCENE-10229:
-

Sure I can work on a PR :)

> Match offsets should be consistent for fields with positions and fields with 
> offsets
> 
>
> Key: LUCENE-10229
> URL: https://issues.apache.org/jira/browse/LUCENE-10229
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Dawid Weiss
>Priority: Major
>
> This is a follow-up of LUCENE-10223 in which it was discovered that fields 
> with
> offsets don't highlight some more complex interval queries properly.  Alan 
> says:
> {quote}
> It's because it returns the position of the inner match, but the offsets of 
> the outer.  And so if you're re-analyzing and retrieving offsets by looking 
> at the positions, you get the 'right' thing.  It's not obvious to me what the 
> correct response is here, but thinking about it the current behaviour is kind 
> of the worst of both worlds, and perhaps we should change it so that you get 
> offsets of the inner match as standard, and then the outer match is returned 
> as part of the sub matches.
> {quote}
> Intervals are nicely separated into "basic intervals" and "filters" which 
> restrict some other source of intervals, here is the original documentation:
> https://github.com/apache/lucene/blob/main/lucene/queries/src/java/org/apache/lucene/queries/intervals/package-info.java#L29-L50
> My experience from an extended period of using interval queries in a frontend 
> where they're highlighted is that filters are restrictions that should not be 
> highlighted - it's the source intervals that people care about. Filters are 
> what you remove or where you give proper context to source intervals.
> The test code contributed in LUCENE-10223 contains numerous query-highlight 
> examples (on fields with positions) where this intuition is demonstrated on 
> all kinds of interval functions:
> https://github.com/apache/lucene/blob/main/lucene/highlighter/src/test/org/apache/lucene/search/matchhighlight/TestMatchHighlighter.java#L335-L542
> This issue is about making the internals work consistently for fields with 
> positions and fields with offsets.






[jira] [Commented] (LUCENE-10229) Match offsets should be consistent for fields with positions and fields with offsets

2021-12-04 Thread Haoyu Zhai (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17453526#comment-17453526
 ] 

Haoyu Zhai commented on LUCENE-10229:
-

It seems that for {{containedBy}} this inconsistency is introduced 
[here|https://github.com/apache/lucene/blob/main/lucene/queries/src/java/org/apache/lucene/queries/intervals/ConjunctionMatchesIterator.java#L60,L75]; 
perhaps we could subclass {{ConjunctionMatchesIterator}} into a 
{{FilterMatchesIterator}} so that the offset methods return only the offsets of 
the "source"?

> Match offsets should be consistent for fields with positions and fields with 
> offsets
> 
>
> Key: LUCENE-10229
> URL: https://issues.apache.org/jira/browse/LUCENE-10229
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Dawid Weiss
>Priority: Major
>
> This is a follow-up of LUCENE-10223 in which it was discovered that fields 
> with
> offsets don't highlight some more complex interval queries properly.  Alan 
> says:
> {quote}
> It's because it returns the position of the inner match, but the offsets of 
> the outer.  And so if you're re-analyzing and retrieving offsets by looking 
> at the positions, you get the 'right' thing.  It's not obvious to me what the 
> correct response is here, but thinking about it the current behaviour is kind 
> of the worst of both worlds, and perhaps we should change it so that you get 
> offsets of the inner match as standard, and then the outer match is returned 
> as part of the sub matches.
> {quote}
> Intervals are nicely separated into "basic intervals" and "filters" which 
> restrict some other source of intervals, here is the original documentation:
> https://github.com/apache/lucene/blob/main/lucene/queries/src/java/org/apache/lucene/queries/intervals/package-info.java#L29-L50
> My experience from an extended period of using interval queries in a frontend 
> where they're highlighted is that filters are restrictions that should not be 
> highlighted - it's the source intervals that people care about. Filters are 
> what you remove or where you give proper context to source intervals.
> The test code contributed in LUCENE-10223 contains numerous query-highlight 
> examples (on fields with positions) where this intuition is demonstrated on 
> all kinds of interval functions:
> https://github.com/apache/lucene/blob/main/lucene/highlighter/src/test/org/apache/lucene/search/matchhighlight/TestMatchHighlighter.java#L335-L542
> This issue is about making the internals work consistently for fields with 
> positions and fields with offsets.






[jira] [Commented] (LUCENE-10122) Explore using NumericDocValue to store taxonomy parent array

2021-11-18 Thread Haoyu Zhai (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17446058#comment-17446058
 ] 

Haoyu Zhai commented on LUCENE-10122:
-

Ah, thanks [~jpountz] for the reminder, I forgot about that. Here we go: 
https://github.com/apache/lucene/pull/454

> Explore using NumericDocValue to store taxonomy parent array
> 
>
> Key: LUCENE-10122
> URL: https://issues.apache.org/jira/browse/LUCENE-10122
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/facet
>Affects Versions: 9.0
>Reporter: Haoyu Zhai
>Priority: Minor
>  Time Spent: 5.5h
>  Remaining Estimate: 0h
>
> We currently use the term position of a hardcoded term in a hardcoded field to 
> represent the parent ordinal of each taxonomy label. That is an old approach 
> and probably dates back to the time when doc values didn't exist.
> We would probably want to use NumericDocValues instead, given that we have 
> spent quite a lot of effort optimizing them.






[jira] [Commented] (LUCENE-10122) Explore using NumericDocValue to store taxonomy parent array

2021-11-15 Thread Haoyu Zhai (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17444098#comment-17444098
 ] 

Haoyu Zhai commented on LUCENE-10122:
-

OK, here's the new PR (with backward compatibility): 
[https://github.com/apache/lucene/pull/442]

[~jpountz] I set that PR to target the 9.0 branch based on the previous email 
thread, but since we're already in the process of releasing, please let me know 
if you want it to target the main branch instead.

> Explore using NumericDocValue to store taxonomy parent array
> 
>
> Key: LUCENE-10122
> URL: https://issues.apache.org/jira/browse/LUCENE-10122
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/facet
>Affects Versions: main (10.0)
>Reporter: Haoyu Zhai
>Priority: Minor
>  Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> We currently use the term position of a hardcoded term in a hardcoded field to 
> represent the parent ordinal of each taxonomy label. That is an old approach 
> and probably dates back to the time when doc values didn't exist.
> We would probably want to use NumericDocValues instead, given that we have 
> spent quite a lot of effort optimizing them.






[jira] [Comment Edited] (LUCENE-10122) Explore using NumericDocValue to store taxonomy parent array

2021-11-05 Thread Haoyu Zhai (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17438317#comment-17438317
 ] 

Haoyu Zhai edited comment on LUCENE-10122 at 11/5/21, 6:55 PM:
---

The luceneutil benchmark shows a mostly neutral result
{code:java}
TaskQPS base  StdDevQPS cand  StdDev
Pct diff p-value
  Fuzzy2   58.39  (5.6%)   57.70  (6.1%)   
-1.2% ( -12% -   11%) 0.518
BrowseDateTaxoFacets2.40  (6.6%)2.38  (5.8%)   
-0.7% ( -12% -   12%) 0.709
BrowseDayOfYearTaxoFacets2.40  (6.5%)2.38  (5.8%)   
-0.7% ( -12% -   12%) 0.721
   BrowseMonthTaxoFacets2.49  (6.8%)2.47  (6.1%)   
-0.7% ( -12% -   13%) 0.738
   BrowseMonthSSDVFacets   16.44 (36.1%)   16.38 (35.1%)   
-0.4% ( -52% -  110%) 0.974
 LowIntervalsOrdered   30.70  (2.8%)   30.61  (3.0%)   
-0.3% (  -5% -5%) 0.763
   LowPhrase  516.96  (1.7%)  515.67  (1.6%)   
-0.3% (  -3% -3%) 0.626
   OrNotHighHigh  580.07  (2.1%)  578.61  (2.8%)   
-0.3% (  -5% -4%) 0.747
BrowseDayOfYearSSDVFacets   15.22 (24.2%)   15.19 (24.2%)   
-0.2% ( -39% -   63%) 0.976
   HighTermDayOfYearSort  766.98  (1.7%)  765.20  (1.7%)   
-0.2% (  -3% -3%) 0.665
HighIntervalsOrdered2.46  (2.0%)2.45  (2.3%)   
-0.2% (  -4% -4%) 0.795
 MedIntervalsOrdered   27.55  (2.8%)   27.51  (2.8%)   
-0.1% (  -5% -5%) 0.878
  IntNRQ   28.96  (0.3%)   28.92  (0.6%)   
-0.1% (   0% -0%) 0.358
  OrHighHigh   36.05  (2.2%)   36.02  (1.7%)   
-0.1% (  -3% -3%) 0.870
   MedPhrase  119.18  (1.7%)  119.08  (2.0%)   
-0.1% (  -3% -3%) 0.884
 MedSpanNear   99.96  (1.1%)   99.88  (1.2%)   
-0.1% (  -2% -2%) 0.818
 MedTerm 1211.34  (2.4%) 1210.46  (2.2%)   
-0.1% (  -4% -4%) 0.919
 Respell   42.08  (1.9%)   42.06  (2.3%)   
-0.1% (  -4% -4%) 0.931
OrNotHighLow  608.56  (2.1%)  608.41  (2.4%)   
-0.0% (  -4% -4%) 0.971
HighSpanNear   38.01  (2.2%)   38.01  (2.9%)   
-0.0% (  -5% -5%) 0.994
 LowSpanNear   94.41  (1.5%)   94.42  (2.1%)
0.0% (  -3% -3%) 0.975
   OrHighLow  228.92  (2.4%)  228.98  (1.6%)
0.0% (  -3% -4%) 0.971
   OrHighMed   76.23  (2.3%)   76.26  (2.2%)
0.0% (  -4% -4%) 0.951
HighTermTitleBDVSort   19.07  (2.6%)   19.08  (2.5%)
0.0% (  -4% -5%) 0.952
  TermDTSort  312.90  (2.0%)  313.18  (2.5%)
0.1% (  -4% -4%) 0.901
PKLookup  153.21  (2.6%)  153.35  (2.5%)
0.1% (  -4% -5%) 0.910
OrHighNotMed  798.03  (2.0%)  798.83  (2.3%)
0.1% (  -4% -4%) 0.883
   HighTermMonthSort  103.99  (9.9%)  104.10  (9.7%)
0.1% ( -17% -   21%) 0.971
Wildcard  107.61  (2.1%)  107.74  (2.4%)
0.1% (  -4% -4%) 0.859
 Prefix3   82.74 (12.0%)   82.84 (12.1%)
0.1% ( -21% -   27%) 0.973
  HighPhrase   67.96  (2.0%)   68.07  (2.0%)
0.2% (  -3% -4%) 0.792
HighTerm 1058.76  (1.8%) 1060.59  (2.7%)
0.2% (  -4% -4%) 0.812
   OrHighNotHigh  528.01  (1.8%)  529.17  (2.5%)
0.2% (  -4% -4%) 0.751
  Fuzzy1   42.70  (3.0%)   42.80  (3.3%)
0.2% (  -5% -6%) 0.814
OrNotHighMed  613.17  (2.6%)  614.97  (2.6%)
0.3% (  -4% -5%) 0.722
 MedSloppyPhrase   15.29  (1.8%)   15.34  (2.2%)
0.3% (  -3% -4%) 0.601
OrHighNotLow  590.46  (2.5%)  592.57  (2.9%)
0.4% (  -4% -5%) 0.677
  AndHighLow  518.23  (2.5%)  520.65  (2.9%)
0.5% (  -4% -6%) 0.585
 LowTerm 1137.40  (2.9%) 1143.47  (2.8%)
0.5% (  -5% -6%) 0.556
HighSloppyPhrase   10.76  (3.2%)   10.82  (3.6%)
0.6% (  -6% -7%) 0.602
 LowSloppyPhrase  152.21  (2.1%)  153.24  (2.4%)
0.7% (  -3% -5%) 0.350
  AndHighMed  170.44  (2.5%)  171.76  (3.6%)
0.8% (  -5% -7%) 0.426
 AndHighHigh   64.45  (3.2%)   65.07  (4.4%)
1.0% (  -6% -8%) 0.424
{code}
And the size of the taxonomy index does not change.

[jira] [Commented] (LUCENE-10122) Explore using NumericDocValue to store taxonomy parent array

2021-11-03 Thread Haoyu Zhai (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17438317#comment-17438317
 ] 

Haoyu Zhai commented on LUCENE-10122:
-

The luceneutil benchmark shows a mostly neutral result
{code:java}
TaskQPS base  StdDevQPS cand  StdDev
Pct diff p-value
  Fuzzy2   58.39  (5.6%)   57.70  (6.1%)   
-1.2% ( -12% -   11%) 0.518
BrowseDateTaxoFacets2.40  (6.6%)2.38  (5.8%)   
-0.7% ( -12% -   12%) 0.709
BrowseDayOfYearTaxoFacets2.40  (6.5%)2.38  (5.8%)   
-0.7% ( -12% -   12%) 0.721
   BrowseMonthTaxoFacets2.49  (6.8%)2.47  (6.1%)   
-0.7% ( -12% -   13%) 0.738
   BrowseMonthSSDVFacets   16.44 (36.1%)   16.38 (35.1%)   
-0.4% ( -52% -  110%) 0.974
 LowIntervalsOrdered   30.70  (2.8%)   30.61  (3.0%)   
-0.3% (  -5% -5%) 0.763
   LowPhrase  516.96  (1.7%)  515.67  (1.6%)   
-0.3% (  -3% -3%) 0.626
   OrNotHighHigh  580.07  (2.1%)  578.61  (2.8%)   
-0.3% (  -5% -4%) 0.747
BrowseDayOfYearSSDVFacets   15.22 (24.2%)   15.19 (24.2%)   
-0.2% ( -39% -   63%) 0.976
   HighTermDayOfYearSort  766.98  (1.7%)  765.20  (1.7%)   
-0.2% (  -3% -3%) 0.665
HighIntervalsOrdered2.46  (2.0%)2.45  (2.3%)   
-0.2% (  -4% -4%) 0.795
 MedIntervalsOrdered   27.55  (2.8%)   27.51  (2.8%)   
-0.1% (  -5% -5%) 0.878
  IntNRQ   28.96  (0.3%)   28.92  (0.6%)   
-0.1% (   0% -0%) 0.358
  OrHighHigh   36.05  (2.2%)   36.02  (1.7%)   
-0.1% (  -3% -3%) 0.870
   MedPhrase  119.18  (1.7%)  119.08  (2.0%)   
-0.1% (  -3% -3%) 0.884
 MedSpanNear   99.96  (1.1%)   99.88  (1.2%)   
-0.1% (  -2% -2%) 0.818
 MedTerm 1211.34  (2.4%) 1210.46  (2.2%)   
-0.1% (  -4% -4%) 0.919
 Respell   42.08  (1.9%)   42.06  (2.3%)   
-0.1% (  -4% -4%) 0.931
OrNotHighLow  608.56  (2.1%)  608.41  (2.4%)   
-0.0% (  -4% -4%) 0.971
HighSpanNear   38.01  (2.2%)   38.01  (2.9%)   
-0.0% (  -5% -5%) 0.994
 LowSpanNear   94.41  (1.5%)   94.42  (2.1%)
0.0% (  -3% -3%) 0.975
   OrHighLow  228.92  (2.4%)  228.98  (1.6%)
0.0% (  -3% -4%) 0.971
   OrHighMed   76.23  (2.3%)   76.26  (2.2%)
0.0% (  -4% -4%) 0.951
HighTermTitleBDVSort   19.07  (2.6%)   19.08  (2.5%)
0.0% (  -4% -5%) 0.952
  TermDTSort  312.90  (2.0%)  313.18  (2.5%)
0.1% (  -4% -4%) 0.901
PKLookup  153.21  (2.6%)  153.35  (2.5%)
0.1% (  -4% -5%) 0.910
OrHighNotMed  798.03  (2.0%)  798.83  (2.3%)
0.1% (  -4% -4%) 0.883
   HighTermMonthSort  103.99  (9.9%)  104.10  (9.7%)
0.1% ( -17% -   21%) 0.971
Wildcard  107.61  (2.1%)  107.74  (2.4%)
0.1% (  -4% -4%) 0.859
 Prefix3   82.74 (12.0%)   82.84 (12.1%)
0.1% ( -21% -   27%) 0.973
  HighPhrase   67.96  (2.0%)   68.07  (2.0%)
0.2% (  -3% -4%) 0.792
HighTerm 1058.76  (1.8%) 1060.59  (2.7%)
0.2% (  -4% -4%) 0.812
   OrHighNotHigh  528.01  (1.8%)  529.17  (2.5%)
0.2% (  -4% -4%) 0.751
  Fuzzy1   42.70  (3.0%)   42.80  (3.3%)
0.2% (  -5% -6%) 0.814
OrNotHighMed  613.17  (2.6%)  614.97  (2.6%)
0.3% (  -4% -5%) 0.722
 MedSloppyPhrase   15.29  (1.8%)   15.34  (2.2%)
0.3% (  -3% -4%) 0.601
OrHighNotLow  590.46  (2.5%)  592.57  (2.9%)
0.4% (  -4% -5%) 0.677
  AndHighLow  518.23  (2.5%)  520.65  (2.9%)
0.5% (  -4% -6%) 0.585
 LowTerm 1137.40  (2.9%) 1143.47  (2.8%)
0.5% (  -5% -6%) 0.556
HighSloppyPhrase   10.76  (3.2%)   10.82  (3.6%)
0.6% (  -6% -7%) 0.602
 LowSloppyPhrase  152.21  (2.1%)  153.24  (2.4%)
0.7% (  -3% -5%) 0.350
  AndHighMed  170.44  (2.5%)  171.76  (3.6%)
0.8% (  -5% -7%) 0.426
 AndHighHigh   64.45  (3.2%)   65.07  (4.4%)
1.0% (  -6% -8%) 0.424
{code}
And the size of the taxonomy index does not change.

I've also run the internal benchmark we use

[jira] [Commented] (LUCENE-9839) TestIndexFileDeleter.testExcInDecRef test failure

2021-10-13 Thread Haoyu Zhai (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9839?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17428549#comment-17428549
 ] 

Haoyu Zhai commented on LUCENE-9839:


The same error appeared in my PR's automated checks as well: 
[https://github.com/apache/lucene/runs/3873024350?check_suite_focus=true]

It went away after a retry...

> TestIndexFileDeleter.testExcInDecRef test failure
> -
>
> Key: LUCENE-9839
> URL: https://issues.apache.org/jira/browse/LUCENE-9839
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Robert Muir
>Priority: Major
>
> It isn't reproducible for me (at least trying again a single time). I'm 
> guessing a concurrency issue?
> {noformat}
> > Task :lucene:core:test
> org.apache.lucene.index.TestIndexFileDeleter > testExcInDecRef FAILED
> org.apache.lucene.store.AlreadyClosedException: ReaderPool is already 
> closed
> at 
> __randomizedtesting.SeedInfo.seed([9142DCE874F11926:78DFABDA0238FEDB]:0)
> at org.apache.lucene.index.ReaderPool.get(ReaderPool.java:400)
> at 
> org.apache.lucene.index.IndexWriter.writeReaderPool(IndexWriter.java:3760)
> at org.apache.lucene.index.IndexWriter.getReader(IndexWriter.java:590)
> at 
> org.apache.lucene.index.RandomIndexWriter.getReader(RandomIndexWriter.java:474)
> at 
> org.apache.lucene.index.RandomIndexWriter.getReader(RandomIndexWriter.java:406)
> at 
> org.apache.lucene.index.TestIndexFileDeleter.testExcInDecRef(TestIndexFileDeleter.java:484)
> at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:64)
> at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.base/java.lang.reflect.Method.invoke(Method.java:564)
> at 
> com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1754)
> at 
> com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:942)
> at 
> com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:978)
> at 
> com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:992)
> at 
> org.apache.lucene.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:44)
> at 
> org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
> at 
> org.apache.lucene.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:45)
> at 
> org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
> at 
> org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
> at org.junit.rules.RunRules.evaluate(RunRules.java:20)
> at 
> com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
> at 
> com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:370)
> at 
> com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:819)
> at 
> com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:470)
> at 
> com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:951)
> at 
> com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:836)
> at 
> com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:887)
> at 
> com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:898)
> at 
> org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
> at 
> com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
> at 
> org.apache.lucene.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:38)
> at 
> com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
> at 
> com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
> at 
> com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
> at 
> com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
> at 
> org.apache.lucene.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:51)
> at 
> org.apache.lucen

[jira] [Resolved] (LUCENE-10103) QueryCache not estimating query size properly

2021-10-13 Thread Haoyu Zhai (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haoyu Zhai resolved LUCENE-10103.
-
Resolution: Fixed

> QueryCache not estimating query size properly
> -
>
> Key: LUCENE-10103
> URL: https://issues.apache.org/jira/browse/LUCENE-10103
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/search
>Reporter: Haoyu Zhai
>Priority: Minor
> Attachments: query_cache_error_demo.patch
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> QueryCache seems to estimate the cached query size using a 
> [constant|https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/LRUQueryCache.java#L302], 
> which can cause OOM errors in some extreme cases where the cached queries use 
> far more memory than assumed. (The default QueryCache tries to use [only 5% of 
> heap|https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/IndexSearcher.java#L89].)
> One example of such a memory-eating query is AutomatonQuery: each one carries a 
> [RunAutomaton|https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/util/automaton/RunAutomaton.java#L42], 
> which consumes a good amount of memory in exchange for speed.
> On the other hand, we actually have a good implementation of the 
> {{Accountable}} interface for AutomatonQuery (though it becomes a bit more 
> complicated later, since this query is eventually rewritten to something 
> else), so maybe QueryCache could use that estimate directly (via an 
> {{instanceof}} check)? Or, going further, we could make all {{Query}} 
> implementations {{Accountable}}, with a default implementation that just 
> returns the current constant, and only override the method for the 
> potentially troublesome queries?






[jira] [Commented] (LUCENE-10103) QueryCache not estimating query size properly

2021-10-13 Thread Haoyu Zhai (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17428455#comment-17428455
 ] 

Haoyu Zhai commented on LUCENE-10103:
-

Thank you [~mikemccand]. I think we should backport it since it's more of a 
bug fix, and even if someone were impacted by this change and wanted to return 
to the previous behavior, they would only need to adjust the max size of the 
QueryCache.

> QueryCache not estimating query size properly
> -
>
> Key: LUCENE-10103
> URL: https://issues.apache.org/jira/browse/LUCENE-10103
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/search
>Reporter: Haoyu Zhai
>Priority: Minor
> Attachments: query_cache_error_demo.patch
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> QueryCache seems to estimate the cached query size using a 
> [constant|https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/LRUQueryCache.java#L302], 
> which can cause OOM errors in some extreme cases where the cached queries use 
> far more memory than assumed. (The default QueryCache tries to use [only 5% of 
> heap|https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/IndexSearcher.java#L89].)
> One example of such a memory-eating query is AutomatonQuery: each one carries a 
> [RunAutomaton|https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/util/automaton/RunAutomaton.java#L42], 
> which consumes a good amount of memory in exchange for speed.
> On the other hand, we actually have a good implementation of the 
> {{Accountable}} interface for AutomatonQuery (though it becomes a bit more 
> complicated later, since this query is eventually rewritten to something 
> else), so maybe QueryCache could use that estimate directly (via an 
> {{instanceof}} check)? Or, going further, we could make all {{Query}} 
> implementations {{Accountable}}, with a default implementation that just 
> returns the current constant, and only override the method for the 
> potentially troublesome queries?






[jira] [Commented] (LUCENE-9983) Stop sorting determinize powersets unnecessarily

2021-09-30 Thread Haoyu Zhai (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17422937#comment-17422937
 ] 

Haoyu Zhai commented on LUCENE-9983:


[~mikemccand] yes, we can close it. But it seems I can't close it myself; could 
you close it? Thank you!

> Stop sorting determinize powersets unnecessarily
> 
>
> Key: LUCENE-9983
> URL: https://issues.apache.org/jira/browse/LUCENE-9983
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Michael McCandless
>Priority: Major
>  Time Spent: 5h 40m
>  Remaining Estimate: 0h
>
> Spinoff from LUCENE-9981.
> Today, our {{Operations.determinize}} implementation builds powersets of all 
> subsets of NFA states that "belong" in the same determinized state, using 
> [this algorithm|https://en.wikipedia.org/wiki/Powerset_construction].
> To hold each powerset, we use a malleable {{SortedIntSet}} and periodically 
> freeze it to a {{FrozenIntSet}}, also sorted.  We pay a high price to keep 
> these growing maps of int key, int value sorted by key, e.g. upgrading to a 
> {{TreeMap}} once the map is large enough (> 30 entries).
> But I think sorting is entirely unnecessary here!  Really all we need is the 
> ability to add/delete keys from the map, and hashCode / equals (by key only – 
> ignoring value!), and to freeze the map (a small optimization that we could 
> skip initially).  We only use these maps to lookup in the (growing) 
> determinized automaton whether this powerset has already been seen.
> Maybe we could simply poach the {{IntIntScatterMap}} implementation from 
> [HPPC|https://github.com/carrotsearch/hppc]?  And then change its 
> {{hashCode}}/{{equals}} to only use keys (not values).
> This change should be a big speedup for the kinds of (admittedly adversarial) 
> regexps we saw on LUCENE-9981.
>  
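
An illustrative fragment of the keys-only hashing idea described above (not the 
actual {{SortedIntSet}} code); equality would likewise compare only the key 
sets and ignore the values:

{code:java}
// Order-independent hash over the keys (NFA state numbers) only, ignoring the
// per-key values (counts), so the powerset never needs to be kept sorted just
// to be looked up in the "already seen" table of determinized states.
static int keysOnlyHash(int[] keys, int size) {
  int hash = 0;
  for (int i = 0; i < size; i++) {
    hash += keys[i] * 683573867; // commutative mix: insertion order doesn't matter
  }
  return hash;
}
{code}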






[jira] [Commented] (LUCENE-10122) Explore using NumericDocValue to store taxonomy parent array

2021-09-24 Thread Haoyu Zhai (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17420016#comment-17420016
 ] 

Haoyu Zhai commented on LUCENE-10122:
-

Oh, my bad, I meant NumericDocValues but typed BinaryDocValues in the title; I 
just changed it.

> Explore using NumericDocValue to store taxonomy parent array
> 
>
> Key: LUCENE-10122
> URL: https://issues.apache.org/jira/browse/LUCENE-10122
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/facet
>Affects Versions: main (9.0)
>Reporter: Haoyu Zhai
>Priority: Minor
>
> We currently use the term position of a hardcoded term in a hardcoded field to 
> represent the parent ordinal of each taxonomy label. That is an old approach 
> and probably dates back to the time when doc values didn't exist.
> We would probably want to use NumericDocValues instead, given that we have 
> spent quite a lot of effort optimizing them.






[jira] [Updated] (LUCENE-10122) Explore using NumericDocValue to store taxonomy parent array

2021-09-24 Thread Haoyu Zhai (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haoyu Zhai updated LUCENE-10122:

Summary: Explore using NumericDocValue to store taxonomy parent array  
(was: Explore using BinaryDocValue to store taxonomy parent array)

> Explore using NumericDocValue to store taxonomy parent array
> 
>
> Key: LUCENE-10122
> URL: https://issues.apache.org/jira/browse/LUCENE-10122
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/facet
>Affects Versions: main (9.0)
>Reporter: Haoyu Zhai
>Priority: Minor
>
> We currently use the term position of a hardcoded term in a hardcoded field to 
> represent the parent ordinal of each taxonomy label. That is an old approach 
> and probably dates back to the time when doc values didn't exist.
> We would probably want to use NumericDocValues instead, given that we have 
> spent quite a lot of effort optimizing them.






[jira] [Commented] (LUCENE-9969) DirectoryTaxonomyReader.taxoArray uses a large amount of memory, causing system OOM crashes

2021-09-24 Thread Haoyu Zhai (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17420012#comment-17420012
 ] 

Haoyu Zhai commented on LUCENE-9969:


+1, I created https://issues.apache.org/jira/browse/LUCENE-10122

> DirectoryTaxonomyReader.taxoArray uses a large amount of memory, causing system OOM crashes
> 
>
> Key: LUCENE-9969
> URL: https://issues.apache.org/jira/browse/LUCENE-9969
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/facet
>Affects Versions: 6.6.2
>Reporter: FengFeng Cheng
>Priority: Trivial
> Attachments: image-2021-05-24-13-43-43-289.png
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> First of all, the data volume is large: the JVM heap is 90 GB, but 
> TaxonomyIndexArrays takes up almost half of it.
> !image-2021-05-24-13-43-43-289.png!
> Is there a better way to use TaxonomyReader, or some other optimization?






[jira] [Created] (LUCENE-10122) Explore using BinaryDocValue to store taxonomy parent array

2021-09-24 Thread Haoyu Zhai (Jira)
Haoyu Zhai created LUCENE-10122:
---

 Summary: Explore using BinaryDocValue to store taxonomy parent 
array
 Key: LUCENE-10122
 URL: https://issues.apache.org/jira/browse/LUCENE-10122
 Project: Lucene - Core
  Issue Type: Improvement
  Components: modules/facet
Affects Versions: main (9.0)
Reporter: Haoyu Zhai


We currently use the term position of a hardcoded term in a hardcoded field to 
represent the parent ordinal of each taxonomy label. That is an old approach 
and probably dates back to the time when doc values didn't exist.

We would probably want to use NumericDocValues instead, given that we have 
spent quite a lot of effort optimizing them.
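
A minimal sketch of that direction (the {{"$parent"}} field name and the 
surrounding variables are illustrative; today the parent ordinal is encoded via 
the term position of a hardcoded term instead):

{code:java}
// Writing: store each category's parent ordinal as a numeric doc value.
Document doc = new Document();
doc.add(new NumericDocValuesField("$parent", parentOrdinal)); // illustrative field name
taxoIndexWriter.addDocument(doc);

// Reading: rebuild the in-memory parents array from the doc values.
NumericDocValues parents = MultiDocValues.getNumericValues(taxoIndexReader, "$parent");
int[] parentArray = new int[taxoIndexReader.maxDoc()];
for (int ord = 0; ord < parentArray.length; ord++) {
  if (parents != null && parents.advanceExact(ord)) {
    parentArray[ord] = (int) parents.longValue();
  }
}
{code}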






[jira] [Commented] (LUCENE-10103) QueryCache not estimating query size properly

2021-09-16 Thread Haoyu Zhai (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17416248#comment-17416248
 ] 

Haoyu Zhai commented on LUCENE-10103:
-

I've attached a unit test showing the problem; on my laptop it prints:

941634874
358: 187280

So the ram bytes usage estimated by the AutomatonQuery is 941634874 bytes, 
while the QueryCache "thinks" the cached query uses only 187280 bytes.

> QueryCache not estimating query size properly
> -
>
> Key: LUCENE-10103
> URL: https://issues.apache.org/jira/browse/LUCENE-10103
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/search
>Reporter: Haoyu Zhai
>Priority: Minor
> Attachments: query_cache_error_demo.patch
>
>
> QueryCache seems to estimate the cached query size using a 
> [constant|https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/LRUQueryCache.java#L302], 
> which can cause OOM errors in some extreme cases where the cached queries use 
> far more memory than assumed. (The default QueryCache tries to use [only 5% of 
> heap|https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/IndexSearcher.java#L89].)
> One example of such a memory-eating query is AutomatonQuery: each one carries a 
> [RunAutomaton|https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/util/automaton/RunAutomaton.java#L42], 
> which consumes a good amount of memory in exchange for speed.
> On the other hand, we actually have a good implementation of the 
> {{Accountable}} interface for AutomatonQuery (though it becomes a bit more 
> complicated later, since this query is eventually rewritten to something 
> else), so maybe QueryCache could use that estimate directly (via an 
> {{instanceof}} check)? Or, going further, we could make all {{Query}} 
> implementations {{Accountable}}, with a default implementation that just 
> returns the current constant, and only override the method for the 
> potentially troublesome queries?






[jira] [Updated] (LUCENE-10103) QueryCache not estimating query size properly

2021-09-16 Thread Haoyu Zhai (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haoyu Zhai updated LUCENE-10103:

Attachment: query_cache_error_demo.patch

> QueryCache not estimating query size properly
> -
>
> Key: LUCENE-10103
> URL: https://issues.apache.org/jira/browse/LUCENE-10103
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/search
>Reporter: Haoyu Zhai
>Priority: Minor
> Attachments: query_cache_error_demo.patch
>
>
> QueryCache seems to estimate the cached query size using a 
> [constant|https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/LRUQueryCache.java#L302], 
> which can cause OOM errors in some extreme cases where the cached queries use 
> far more memory than assumed. (The default QueryCache tries to use [only 5% of 
> heap|https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/IndexSearcher.java#L89].)
> One example of such a memory-eating query is AutomatonQuery: each one carries a 
> [RunAutomaton|https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/util/automaton/RunAutomaton.java#L42], 
> which consumes a good amount of memory in exchange for speed.
> On the other hand, we actually have a good implementation of the 
> {{Accountable}} interface for AutomatonQuery (though it becomes a bit more 
> complicated later, since this query is eventually rewritten to something 
> else), so maybe QueryCache could use that estimate directly (via an 
> {{instanceof}} check)? Or, going further, we could make all {{Query}} 
> implementations {{Accountable}}, with a default implementation that just 
> returns the current constant, and only override the method for the 
> potentially troublesome queries?






[jira] [Created] (LUCENE-10103) QueryCache not estimating query size properly

2021-09-14 Thread Haoyu Zhai (Jira)
Haoyu Zhai created LUCENE-10103:
---

 Summary: QueryCache not estimating query size properly
 Key: LUCENE-10103
 URL: https://issues.apache.org/jira/browse/LUCENE-10103
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/search
Reporter: Haoyu Zhai


QueryCache seems to estimate the cached query size using a 
[constant|https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/LRUQueryCache.java#L302], 
which can cause OOM errors in some extreme cases where the cached queries use 
far more memory than assumed. (The default QueryCache tries to use [only 5% of 
heap|https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/IndexSearcher.java#L89].) 
One example of such a memory-eating query is AutomatonQuery: each one carries a 
[RunAutomaton|https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/util/automaton/RunAutomaton.java#L42], 
which consumes a good amount of memory in exchange for speed.

On the other hand, we actually have a good implementation of the 
{{Accountable}} interface for AutomatonQuery (though it becomes a bit more 
complicated later, since this query is eventually rewritten to something else), 
so maybe QueryCache could use that estimate directly (via an {{instanceof}} 
check)? Or, going further, we could make all {{Query}} implementations 
{{Accountable}}, with a default implementation that just returns the current 
constant, and only override the method for the potentially troublesome queries?
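
A hedged sketch of the {{instanceof}} idea (the method and the default-constant 
parameter stand in for whatever the cache uses today; this is not the current 
LRUQueryCache code):

{code:java}
// Fall back to the existing flat constant unless the query can report its own size.
static long estimateRamBytesUsed(Query query, long defaultQueryRamBytes) {
  if (query instanceof Accountable) {
    // e.g. AutomatonQuery, whose RunAutomaton can be orders of magnitude
    // larger than any flat per-query constant.
    return ((Accountable) query).ramBytesUsed();
  }
  return defaultQueryRamBytes; // the constant the cache assumes today
}
{code}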






[jira] [Commented] (LUCENE-9969) DirectoryTaxonomyReader.taxoArray uses a large amount of memory, causing system OOM crashes

2021-09-13 Thread Haoyu Zhai (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17414308#comment-17414308
 ] 

Haoyu Zhai commented on LUCENE-9969:


[~gsmiller] thanks for mentioning that; it's surprising that we're currently 
using term positions to store the parent ordinals. Do you know whether there is 
a specific reason for doing this, or is it just because NumericDocValues didn't 
exist when the code was created? I think we should create a separate issue for 
switching to NDV if there's no specific reason against it, since it should be 
faster and (probably) better compressed.

> DirectoryTaxonomyReader.taxoArray uses a large amount of memory, causing system OOM crashes
> 
>
> Key: LUCENE-9969
> URL: https://issues.apache.org/jira/browse/LUCENE-9969
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/facet
>Affects Versions: 6.6.2
>Reporter: FengFeng Cheng
>Priority: Trivial
> Attachments: image-2021-05-24-13-43-43-289.png
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> First of all, the data volume is large: the JVM heap is 90 GB, but 
> TaxonomyIndexArrays takes up almost half of it.
> !image-2021-05-24-13-43-43-289.png!
> Is there a better way to use TaxonomyReader, or some other optimization?






[jira] [Commented] (LUCENE-10010) Should we have a NFA Query?

2021-09-09 Thread Haoyu Zhai (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17412728#comment-17412728
 ] 

Haoyu Zhai commented on LUCENE-10010:
-

And I just came across [this issue|https://github.com/mikemccand/luceneutil/issues/139] 
with the benchmark: essentially it does not count the query construction time 
toward the QPS numbers, which is kind of unfair for the NFA vs. DFA query 
comparison, since the DFA query does the determinize work when the query is 
constructed while the NFA query does it while the query is executing.

> Should we have a NFA Query?
> ---
>
> Key: LUCENE-10010
> URL: https://issues.apache.org/jira/browse/LUCENE-10010
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: core/search
>Affects Versions: main (9.0)
>Reporter: Haoyu Zhai
>Priority: Major
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> Today when a {{RegexpQuery}} is created, it will be translated to NFA, 
> determinized to DFA and eventually become an {{AutomatonQuery}}, which is 
> very fast. However, not every NFA could be determinized to DFA easily, the 
> example given in LUCENE-9981 showed how easy could a short regexp break the 
> determinize process.
> Maybe, instead of marking those kind of queries as adversarial cases, we 
> could make a new kind of NFA query, which execute directly on NFA and thus no 
> need to worry about determinize process or determinized DFA size. It should 
> be slower, but also makes those adversarial cases doable.
> [This article|https://swtch.com/~rsc/regexp/regexp1.html] has provided a 
> simple but efficient way of searching over NFA, essentially it is a partial 
> determinize process that only determinize the necessary part of DFA. Maybe we 
> could give it a try?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10010) Should we have a NFA Query?

2021-09-07 Thread Haoyu Zhai (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17411388#comment-17411388
 ] 

Haoyu Zhai commented on LUCENE-10010:
-

I've done a very basic benchmark using the current "Wildcard" task, based on the current PR revision.

On the wiki10k index, I see a ~6% qps improvement (295 vs 313).

On the wikiall index, I see a ~50% qps degradation (40 vs 20).

Also on the wikiall index, JFR helped me identify {{getCharClass}} as the biggest hotspot (we have optimized that in the DFA case by using a 256-length array mapping each char to its char class; in the NFA case I haven't added that optimization yet, so we do a binary search for every incoming char). I'll try to optimize the current PR and see what numbers we can get in the end. I'll also create a task using more complex regexps; the current Wildcard task is too simple.
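
For reference, a minimal sketch of that array optimization (an illustrative standalone class, not the actual RunAutomaton code): precompute the char class for small code points once, and fall back to binary search only for the rest.

{code:java}
import java.util.Arrays;

// Sketch only: classBoundaries holds the sorted upper bound of each char class
// and is assumed to cover the full code point range.
final class CharClassCache {

  private final int[] asciiClass = new int[128];
  private final int[] classBoundaries;

  CharClassCache(int[] classBoundaries) {
    this.classBoundaries = classBoundaries;
    for (int c = 0; c < asciiClass.length; c++) {
      asciiClass[c] = binarySearchClass(c); // pay the binary search once, up front
    }
  }

  int getCharClass(int codePoint) {
    if (codePoint < asciiClass.length) {
      return asciiClass[codePoint]; // O(1) for the common ASCII case
    }
    return binarySearchClass(codePoint); // rare path for larger code points
  }

  private int binarySearchClass(int codePoint) {
    int idx = Arrays.binarySearch(classBoundaries, codePoint);
    return idx >= 0 ? idx : -idx - 1; // index of the first boundary >= codePoint
  }
}
{code}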

> Should we have a NFA Query?
> ---
>
> Key: LUCENE-10010
> URL: https://issues.apache.org/jira/browse/LUCENE-10010
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: core/search
>Affects Versions: main (9.0)
>Reporter: Haoyu Zhai
>Priority: Major
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> Today when a {{RegexpQuery}} is created, it will be translated to NFA, 
> determinized to DFA and eventually become an {{AutomatonQuery}}, which is 
> very fast. However, not every NFA could be determinized to DFA easily, the 
> example given in LUCENE-9981 showed how easy could a short regexp break the 
> determinize process.
> Maybe, instead of marking those kind of queries as adversarial cases, we 
> could make a new kind of NFA query, which execute directly on NFA and thus no 
> need to worry about determinize process or determinized DFA size. It should 
> be slower, but also makes those adversarial cases doable.
> [This article|https://swtch.com/~rsc/regexp/regexp1.html] has provided a 
> simple but efficient way of searching over NFA, essentially it is a partial 
> determinize process that only determinize the necessary part of DFA. Maybe we 
> could give it a try?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10010) Should we have a NFA Query?

2021-07-26 Thread Haoyu Zhai (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17387626#comment-17387626
 ] 

Haoyu Zhai commented on LUCENE-10010:
-

Here's a WIP PR: https://github.com/apache/lucene/pull/225

> Should we have a NFA Query?
> ---
>
> Key: LUCENE-10010
> URL: https://issues.apache.org/jira/browse/LUCENE-10010
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: core/search
>Affects Versions: main (9.0)
>Reporter: Haoyu Zhai
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Today when a {{RegexpQuery}} is created, it will be translated to NFA, 
> determinized to DFA and eventually become an {{AutomatonQuery}}, which is 
> very fast. However, not every NFA could be determinized to DFA easily, the 
> example given in LUCENE-9981 showed how easy could a short regexp break the 
> determinize process.
> Maybe, instead of marking those kind of queries as adversarial cases, we 
> could make a new kind of NFA query, which execute directly on NFA and thus no 
> need to worry about determinize process or determinized DFA size. It should 
> be slower, but also makes those adversarial cases doable.
> [This article|https://swtch.com/~rsc/regexp/regexp1.html] has provided a 
> simple but efficient way of searching over NFA, essentially it is a partial 
> determinize process that only determinize the necessary part of DFA. Maybe we 
> could give it a try?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10010) Should we have a NFA Query?

2021-07-14 Thread Haoyu Zhai (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17381057#comment-17381057
 ] 

Haoyu Zhai commented on LUCENE-10010:
-

??With an NFA, we'd be forced to match every term in the term dictionary? Or what I missing something here??
As far as I understand (which might be wrong), the way we currently use the DFA to intersect with the term dictionary is to provide an initial term (which might be null) and then, based on that term, find the next acceptable term in lexicographic order. I think this can still be done using an NFA: what I have in mind is a partial determinize process that always takes the smallest unvisited transition until an accept state is reached.

I think there are mainly two benefits we could get from this new kind of query:
 # Possibly better performance when queries are not reusable: today we determinize upfront and use the DFA at search time, so if the determinized query can be reused, the determinize cost is amortized to nearly zero. If it cannot, we pay the whole determinization cost every time. With an NFA query we ideally don't need to determinize the whole NFA every time, so it could be faster than the DFA query. An extreme case: on an empty index, the DFA query still needs to determinize, while the NFA query doesn't need to at all.
 # As [~mikemccand] mentioned above, we could be more resilient against ReDoS attacks.

I can try to get some code working and benchmark it to see whether point 1 holds.

> Should we have a NFA Query?
> ---
>
> Key: LUCENE-10010
> URL: https://issues.apache.org/jira/browse/LUCENE-10010
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: core/search
>Affects Versions: main (9.0)
>Reporter: Haoyu Zhai
>Priority: Major
>
> Today when a {{RegexpQuery}} is created, it will be translated to NFA, 
> determinized to DFA and eventually become an {{AutomatonQuery}}, which is 
> very fast. However, not every NFA could be determinized to DFA easily, the 
> example given in LUCENE-9981 showed how easy could a short regexp break the 
> determinize process.
> Maybe, instead of marking those kind of queries as adversarial cases, we 
> could make a new kind of NFA query, which execute directly on NFA and thus no 
> need to worry about determinize process or determinized DFA size. It should 
> be slower, but also makes those adversarial cases doable.
> [This article|https://swtch.com/~rsc/regexp/regexp1.html] has provided a 
> simple but efficient way of searching over NFA, essentially it is a partial 
> determinize process that only determinize the necessary part of DFA. Maybe we 
> could give it a try?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Created] (LUCENE-10021) Upgrade HPPC to 0.9.0

2021-07-04 Thread Haoyu Zhai (Jira)
Haoyu Zhai created LUCENE-10021:
---

 Summary: Upgrade HPPC to 0.9.0
 Key: LUCENE-10021
 URL: https://issues.apache.org/jira/browse/LUCENE-10021
 Project: Lucene - Core
  Issue Type: Improvement
  Components: modules/facet
Reporter: Haoyu Zhai


HPPC 0.9.0 is out and we should probably upgrade.

The {{...ScatterMap}} classes were deprecated in 0.9.0 and I think we're still using them in a few places, so we should probably measure the performance impact, if there is any. (According to the [release notes|https://github.com/carrotsearch/hppc/releases] there shouldn't be.)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Created] (LUCENE-10010) Should we have a NFA Query?

2021-06-21 Thread Haoyu Zhai (Jira)
Haoyu Zhai created LUCENE-10010:
---

 Summary: Should we have a NFA Query?
 Key: LUCENE-10010
 URL: https://issues.apache.org/jira/browse/LUCENE-10010
 Project: Lucene - Core
  Issue Type: New Feature
  Components: core/search
Affects Versions: main (9.0)
Reporter: Haoyu Zhai


Today when a {{RegexpQuery}} is created, it is translated to an NFA, determinized to a DFA, and eventually becomes an {{AutomatonQuery}}, which is very fast. However, not every NFA can be determinized to a DFA easily; the example given in LUCENE-9981 showed how easily a short regexp can break the determinize process.

Maybe, instead of marking those kinds of queries as adversarial cases, we could make a new kind of NFA query that executes directly on the NFA, so there is no need to worry about the determinize process or the determinized DFA size. It should be slower, but it also makes those adversarial cases doable.

[This article|https://swtch.com/~rsc/regexp/regexp1.html] provides a simple but efficient way of searching over an NFA; essentially it is a partial determinize process that only determinizes the necessary part of the DFA. Maybe we could give it a try?
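
To make the idea from the article concrete, a tiny generic sketch of on-the-fly subset simulation (deliberately unrelated to Lucene's Automaton classes; epsilon transitions are omitted for brevity):

{code:java}
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Sketch only: track the set of live NFA states while consuming input,
// which is effectively a partial determinization done lazily.
final class NfaSimulation {

  // transitions.get(state).get(symbol) -> possible next states
  static boolean accepts(
      Map<Integer, Map<Character, List<Integer>>> transitions,
      Set<Integer> acceptStates,
      int startState,
      String input) {
    Set<Integer> current = new HashSet<>();
    current.add(startState);
    for (char c : input.toCharArray()) {
      Set<Integer> next = new HashSet<>();
      for (int state : current) {
        List<Integer> targets = transitions.getOrDefault(state, Map.of()).get(c);
        if (targets != null) {
          next.addAll(targets);
        }
      }
      if (next.isEmpty()) {
        return false; // dead end: no state can consume this character
      }
      current = next;
    }
    for (int state : current) {
      if (acceptStates.contains(state)) {
        return true;
      }
    }
    return false;
  }
}
{code}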



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9983) Stop sorting determinize powersets unnecessarily

2021-06-14 Thread Haoyu Zhai (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17363284#comment-17363284
 ] 

Haoyu Zhai commented on LUCENE-9983:


{quote}Could you maybe open PR to add that initial set of synthetic regexps 
into {{luceneutil}}?
{quote}
 OK, opened one: https://github.com/mikemccand/luceneutil/pull/130

 

 

> Stop sorting determinize powersets unnecessarily
> 
>
> Key: LUCENE-9983
> URL: https://issues.apache.org/jira/browse/LUCENE-9983
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Michael McCandless
>Priority: Major
>  Time Spent: 5h 10m
>  Remaining Estimate: 0h
>
> Spinoff from LUCENE-9981.
> Today, our {{Operations.determinize}} implementation builds powersets of all 
> subsets of NFA states that "belong" in the same determinized state, using 
> [this algorithm|https://en.wikipedia.org/wiki/Powerset_construction].
> To hold each powerset, we use a malleable {{SortedIntSet}} and periodically 
> freeze it to a {{FrozenIntSet}}, also sorted.  We pay a high price to keep 
> these growing maps of int key, int value sorted by key, e.g. upgrading to a 
> {{TreeMap}} once the map is large enough (> 30 entries).
> But I think sorting is entirely unnecessary here!  Really all we need is the 
> ability to add/delete keys from the map, and hashCode / equals (by key only – 
> ignoring value!), and to freeze the map (a small optimization that we could 
> skip initially).  We only use these maps to lookup in the (growing) 
> determinized automaton whether this powerset has already been seen.
> Maybe we could simply poach the {{IntIntScatterMap}} implementation from 
> [HPPC|https://github.com/carrotsearch/hppc]?  And then change its 
> {{hashCode}}/{{equals }}to only use keys (not values).
> This change should be a big speedup for the kinds of (admittedly adversarial) 
> regexps we saw on LUCENE-9981.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9983) Stop sorting determinize powersets unnecessarily

2021-06-09 Thread Haoyu Zhai (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17360495#comment-17360495
 ] 

Haoyu Zhai commented on LUCENE-9983:


I constructed a file with 235k words, each with some part randomly replaced by regexp syntax (like "apple" to "a[pl]*e").

Then I warm up for 10 rounds and run 20 rounds, measuring the average time to construct a {{RegexpQuery}} for each of those words. Here are the results I got:
|| ||Baseline ||IntIntHashMap||IntIntWormMap||int[128] + IntIntHashMap||
|Time|23.55|23.61|23.78|23.69|

So in the normal case the original code and {{IntIntHashMap}} have very similar performance, while the other choices all seem to show some performance loss.
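
For reference, a rough sketch of this kind of measurement loop (not the actual harness; the input file path and the "body" field name are placeholders):

{code:java}
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.RegexpQuery;

// Sketch only: times RegexpQuery construction, which is where determinize runs today.
public class RegexpConstructionBench {

  public static void main(String[] args) throws Exception {
    List<String> regexps = Files.readAllLines(Path.of(args[0])); // one regexp per line

    for (int i = 0; i < 10; i++) {
      constructAll(regexps); // warmup rounds
    }

    int rounds = 20;
    long start = System.nanoTime();
    for (int i = 0; i < rounds; i++) {
      constructAll(regexps);
    }
    System.out.println("avg seconds/round: " + (System.nanoTime() - start) / 1e9 / rounds);
  }

  private static void constructAll(List<String> regexps) {
    for (String regexp : regexps) {
      new RegexpQuery(new Term("body", regexp)); // construction triggers determinization
    }
  }
}
{code}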

> Stop sorting determinize powersets unnecessarily
> 
>
> Key: LUCENE-9983
> URL: https://issues.apache.org/jira/browse/LUCENE-9983
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Michael McCandless
>Priority: Major
>  Time Spent: 4h 20m
>  Remaining Estimate: 0h
>
> Spinoff from LUCENE-9981.
> Today, our {{Operations.determinize}} implementation builds powersets of all 
> subsets of NFA states that "belong" in the same determinized state, using 
> [this algorithm|https://en.wikipedia.org/wiki/Powerset_construction].
> To hold each powerset, we use a malleable {{SortedIntSet}} and periodically 
> freeze it to a {{FrozenIntSet}}, also sorted.  We pay a high price to keep 
> these growing maps of int key, int value sorted by key, e.g. upgrading to a 
> {{TreeMap}} once the map is large enough (> 30 entries).
> But I think sorting is entirely unnecessary here!  Really all we need is the 
> ability to add/delete keys from the map, and hashCode / equals (by key only – 
> ignoring value!), and to freeze the map (a small optimization that we could 
> skip initially).  We only use these maps to lookup in the (growing) 
> determinized automaton whether this powerset has already been seen.
> Maybe we could simply poach the {{IntIntScatterMap}} implementation from 
> [HPPC|https://github.com/carrotsearch/hppc]?  And then change its 
> {{hashCode}}/{{equals }}to only use keys (not values).
> This change should be a big speedup for the kinds of (admittedly adversarial) 
> regexps we saw on LUCENE-9981.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9983) Stop sorting determinize powersets unnecessarily

2021-06-07 Thread Haoyu Zhai (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17358789#comment-17358789
 ] 

Haoyu Zhai commented on LUCENE-9983:


[~broustant] in the adversarial test case, I added 3 static counters to measure the average and max set size we see, and the result is we're seeing 1800+ states on average and 24000 states at most. I record the set size each time we call {{size()}} (basically each iteration) to compute the average, so it might not be very accurate.

[~mikemccand] ah thanks, my bad, I didn't realize {{determinize}} is called at construction time. I'll benchmark that.

> Stop sorting determinize powersets unnecessarily
> 
>
> Key: LUCENE-9983
> URL: https://issues.apache.org/jira/browse/LUCENE-9983
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Michael McCandless
>Priority: Major
>  Time Spent: 4h 20m
>  Remaining Estimate: 0h
>
> Spinoff from LUCENE-9981.
> Today, our {{Operations.determinize}} implementation builds powersets of all 
> subsets of NFA states that "belong" in the same determinized state, using 
> [this algorithm|https://en.wikipedia.org/wiki/Powerset_construction].
> To hold each powerset, we use a malleable {{SortedIntSet}} and periodically 
> freeze it to a {{FrozenIntSet}}, also sorted.  We pay a high price to keep 
> these growing maps of int key, int value sorted by key, e.g. upgrading to a 
> {{TreeMap}} once the map is large enough (> 30 entries).
> But I think sorting is entirely unnecessary here!  Really all we need is the 
> ability to add/delete keys from the map, and hashCode / equals (by key only – 
> ignoring value!), and to freeze the map (a small optimization that we could 
> skip initially).  We only use these maps to lookup in the (growing) 
> determinized automaton whether this powerset has already been seen.
> Maybe we could simply poach the {{IntIntScatterMap}} implementation from 
> [HPPC|https://github.com/carrotsearch/hppc]?  And then change its 
> {{hashCode}}/{{equals }}to only use keys (not values).
> This change should be a big speedup for the kinds of (admittedly adversarial) 
> regexps we saw on LUCENE-9981.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9983) Stop sorting determinize powersets unnecessarily

2021-06-04 Thread Haoyu Zhai (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17357749#comment-17357749
 ] 

Haoyu Zhai commented on LUCENE-9983:


I just realized that we already have several tasks comparing the performance of regexp queries, such as 
[here|https://github.com/mikemccand/luceneutil/blob/master/tasks/wikimedium.10M.nostopwords.tasks#L5238].

So I've done some benchmarking comparing the PR as well as another commit based on the PR but with an additional 128-size int array that tries to make access to the counts of the first 128 states faster. The results showed that neither candidate shows much qps difference (within 1%) compared to the baseline on the "Wildcard" and "Prefix3" tasks.

If the benchmark results are reliable (meaning I didn't mess up the configuration etc.), I think the new PR won't affect the normal case much, and the additional optimization doesn't seem to have a visible benefit. So I think it might be better to start with just using {{IntIntHashMap}} to keep things simpler? I'll update the PR accordingly.

> Stop sorting determinize powersets unnecessarily
> 
>
> Key: LUCENE-9983
> URL: https://issues.apache.org/jira/browse/LUCENE-9983
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Michael McCandless
>Priority: Major
>  Time Spent: 4h 20m
>  Remaining Estimate: 0h
>
> Spinoff from LUCENE-9981.
> Today, our {{Operations.determinize}} implementation builds powersets of all 
> subsets of NFA states that "belong" in the same determinized state, using 
> [this algorithm|https://en.wikipedia.org/wiki/Powerset_construction].
> To hold each powerset, we use a malleable {{SortedIntSet}} and periodically 
> freeze it to a {{FrozenIntSet}}, also sorted.  We pay a high price to keep 
> these growing maps of int key, int value sorted by key, e.g. upgrading to a 
> {{TreeMap}} once the map is large enough (> 30 entries).
> But I think sorting is entirely unnecessary here!  Really all we need is the 
> ability to add/delete keys from the map, and hashCode / equals (by key only – 
> ignoring value!), and to freeze the map (a small optimization that we could 
> skip initially).  We only use these maps to lookup in the (growing) 
> determinized automaton whether this powerset has already been seen.
> Maybe we could simply poach the {{IntIntScatterMap}} implementation from 
> [HPPC|https://github.com/carrotsearch/hppc]?  And then change its 
> {{hashCode}}/{{equals }}to only use keys (not values).
> This change should be a big speedup for the kinds of (admittedly adversarial) 
> regexps we saw on LUCENE-9981.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9983) Stop sorting determinize powersets unnecessarily

2021-06-02 Thread Haoyu Zhai (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17355852#comment-17355852
 ] 

Haoyu Zhai commented on LUCENE-9983:


+1 to having a set of regexps so that we can benchmark them; I'm also a little worried the PR might make the normal cases worse.

[~broustant] That is a good idea. I tried using a 128-size array as a map for the first 128 states and it doesn't help the adversarial cases (I also pulled out some stats and found that in the adversarial cases the number of states is actually much larger than that). But I think we might see some benefit in the normal cases once we have the benchmark set up.

> Stop sorting determinize powersets unnecessarily
> 
>
> Key: LUCENE-9983
> URL: https://issues.apache.org/jira/browse/LUCENE-9983
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Michael McCandless
>Priority: Major
>  Time Spent: 2h 50m
>  Remaining Estimate: 0h
>
> Spinoff from LUCENE-9981.
> Today, our {{Operations.determinize}} implementation builds powersets of all 
> subsets of NFA states that "belong" in the same determinized state, using 
> [this algorithm|https://en.wikipedia.org/wiki/Powerset_construction].
> To hold each powerset, we use a malleable {{SortedIntSet}} and periodically 
> freeze it to a {{FrozenIntSet}}, also sorted.  We pay a high price to keep 
> these growing maps of int key, int value sorted by key, e.g. upgrading to a 
> {{TreeMap}} once the map is large enough (> 30 entries).
> But I think sorting is entirely unnecessary here!  Really all we need is the 
> ability to add/delete keys from the map, and hashCode / equals (by key only – 
> ignoring value!), and to freeze the map (a small optimization that we could 
> skip initially).  We only use these maps to lookup in the (growing) 
> determinized automaton whether this powerset has already been seen.
> Maybe we could simply poach the {{IntIntScatterMap}} implementation from 
> [HPPC|https://github.com/carrotsearch/hppc]?  And then change its 
> {{hashCode}}/{{equals }}to only use keys (not values).
> This change should be a big speedup for the kinds of (admittedly adversarial) 
> regexps we saw on LUCENE-9981.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9983) Stop sorting determinize powersets unnecessarily

2021-05-31 Thread Haoyu Zhai (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17354660#comment-17354660
 ] 

Haoyu Zhai commented on LUCENE-9983:


I've added simple static counters just for the adversarial test, and here are the stats:
 * {{incr}} called: 106073079
 * entry added to set: 100076079
 * {{decr}} called: 106069079
 * entry removed from set: 100072079
 * {{computeHash}} called: 40057
 * {{freeze}} called: 14056

So it seems my guess above holds: we're doing way more put/remove entry operations than anything else.

> Stop sorting determinize powersets unnecessarily
> 
>
> Key: LUCENE-9983
> URL: https://issues.apache.org/jira/browse/LUCENE-9983
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Michael McCandless
>Priority: Major
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> Spinoff from LUCENE-9981.
> Today, our {{Operations.determinize}} implementation builds powersets of all 
> subsets of NFA states that "belong" in the same determinized state, using 
> [this algorithm|https://en.wikipedia.org/wiki/Powerset_construction].
> To hold each powerset, we use a malleable {{SortedIntSet}} and periodically 
> freeze it to a {{FrozenIntSet}}, also sorted.  We pay a high price to keep 
> these growing maps of int key, int value sorted by key, e.g. upgrading to a 
> {{TreeMap}} once the map is large enough (> 30 entries).
> But I think sorting is entirely unnecessary here!  Really all we need is the 
> ability to add/delete keys from the map, and hashCode / equals (by key only – 
> ignoring value!), and to freeze the map (a small optimization that we could 
> skip initially).  We only use these maps to lookup in the (growing) 
> determinized automaton whether this powerset has already been seen.
> Maybe we could simply poach the {{IntIntScatterMap}} implementation from 
> [HPPC|https://github.com/carrotsearch/hppc]?  And then change its 
> {{hashCode}}/{{equals }}to only use keys (not values).
> This change should be a big speedup for the kinds of (admittedly adversarial) 
> regexps we saw on LUCENE-9981.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Comment Edited] (LUCENE-9983) Stop sorting determinize powersets unnecessarily

2021-05-31 Thread Haoyu Zhai (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17354659#comment-17354659
 ] 

Haoyu Zhai edited comment on LUCENE-9983 at 5/31/21, 9:07 PM:
--

Thanks [~mikemccand] and [~dweiss]. I've opened a PR based on 
{{IntIntHashMap}}: [https://github.com/apache/lucene/pull/163]

I've applied the test attached in LUCENE-9981 to verify this PR helps. It successfully reduces the time needed before throwing the exception from 6 min to 16 sec on my local machine (they both stopped at the same point as well).

I still keep the state array sorted when getting it, so we'll be slower when actually getting the array but way faster on putting/removing keys. I'm not quite sure why the speedup is this big, but my guess is we're doing way more operations, and spending way more time, on increasing/decreasing state counts and putting/removing states from the set than on introducing new states?


was (Author: zhai7631):
Thanks [~mikemccand] and [~dweiss]. I've opened a PR based on 
{{IntIntHashMap}}: [https://github.com/apache/lucene/pull/162]

I've applied the test attached in LUCENE-9981 to verify this PR helps. It successfully reduces the time needed before throwing the exception from 6 min to 16 sec on my local machine (they both stopped at the same point as well).

I still keep the state array sorted when getting it, so we'll be slower when actually getting the array but way faster on putting/removing keys. I'm not quite sure why the speedup is this big, but my guess is we're doing way more operations, and spending way more time, on increasing/decreasing state counts and putting/removing states from the set than on introducing new states?

> Stop sorting determinize powersets unnecessarily
> 
>
> Key: LUCENE-9983
> URL: https://issues.apache.org/jira/browse/LUCENE-9983
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Michael McCandless
>Priority: Major
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> Spinoff from LUCENE-9981.
> Today, our {{Operations.determinize}} implementation builds powersets of all 
> subsets of NFA states that "belong" in the same determinized state, using 
> [this algorithm|https://en.wikipedia.org/wiki/Powerset_construction].
> To hold each powerset, we use a malleable {{SortedIntSet}} and periodically 
> freeze it to a {{FrozenIntSet}}, also sorted.  We pay a high price to keep 
> these growing maps of int key, int value sorted by key, e.g. upgrading to a 
> {{TreeMap}} once the map is large enough (> 30 entries).
> But I think sorting is entirely unnecessary here!  Really all we need is the 
> ability to add/delete keys from the map, and hashCode / equals (by key only – 
> ignoring value!), and to freeze the map (a small optimization that we could 
> skip initially).  We only use these maps to lookup in the (growing) 
> determinized automaton whether this powerset has already been seen.
> Maybe we could simply poach the {{IntIntScatterMap}} implementation from 
> [HPPC|https://github.com/carrotsearch/hppc]?  And then change its 
> {{hashCode}}/{{equals }}to only use keys (not values).
> This change should be a big speedup for the kinds of (admittedly adversarial) 
> regexps we saw on LUCENE-9981.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9983) Stop sorting determinize powersets unnecessarily

2021-05-31 Thread Haoyu Zhai (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17354659#comment-17354659
 ] 

Haoyu Zhai commented on LUCENE-9983:


Thanks [~mikemccand] and [~dweiss]. I've opened a PR based on 
{{IntIntHashMap}}: [https://github.com/apache/lucene/pull/162]

I've applied the test attached in LUCENE-9981 to verify this PR helps. It successfully reduces the time needed before throwing the exception from 6 min to 16 sec on my local machine (they both stopped at the same point as well).

I still keep the state array sorted when getting it, so we'll be slower when actually getting the array but way faster on putting/removing keys. I'm not quite sure why the speedup is this big, but my guess is we're doing way more operations, and spending way more time, on increasing/decreasing state counts and putting/removing states from the set than on introducing new states?
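
A minimal sketch of that idea (an illustrative class, not the actual code in the PR): keep per-state reference counts in an unsorted HPPC {{IntIntHashMap}} and only materialize (and sort) the key array when it is actually needed.

{code:java}
import com.carrotsearch.hppc.IntIntHashMap;
import java.util.Arrays;

// Sketch only: cheap put/remove while building the powerset, sort deferred to freeze time.
final class PowersetSketch {

  private final IntIntHashMap counts = new IntIntHashMap();

  void incr(int state) {
    counts.addTo(state, 1); // inserts the state with count 1 if absent
  }

  void decr(int state) {
    if (counts.addTo(state, -1) == 0) {
      counts.remove(state); // the state is no longer live in this powerset
    }
  }

  int[] freeze() {
    int[] states = counts.keys().toArray();
    Arrays.sort(states); // only pay the sort cost when the array is actually requested
    return states;
  }
}
{code}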

> Stop sorting determinize powersets unnecessarily
> 
>
> Key: LUCENE-9983
> URL: https://issues.apache.org/jira/browse/LUCENE-9983
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Michael McCandless
>Priority: Major
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> Spinoff from LUCENE-9981.
> Today, our {{Operations.determinize}} implementation builds powersets of all 
> subsets of NFA states that "belong" in the same determinized state, using 
> [this algorithm|https://en.wikipedia.org/wiki/Powerset_construction].
> To hold each powerset, we use a malleable {{SortedIntSet}} and periodically 
> freeze it to a {{FrozenIntSet}}, also sorted.  We pay a high price to keep 
> these growing maps of int key, int value sorted by key, e.g. upgrading to a 
> {{TreeMap}} once the map is large enough (> 30 entries).
> But I think sorting is entirely unnecessary here!  Really all we need is the 
> ability to add/delete keys from the map, and hashCode / equals (by key only – 
> ignoring value!), and to freeze the map (a small optimization that we could 
> skip initially).  We only use these maps to lookup in the (growing) 
> determinized automaton whether this powerset has already been seen.
> Maybe we could simply poach the {{IntIntScatterMap}} implementation from 
> [HPPC|https://github.com/carrotsearch/hppc]?  And then change its 
> {{hashCode}}/{{equals }}to only use keys (not values).
> This change should be a big speedup for the kinds of (admittedly adversarial) 
> regexps we saw on LUCENE-9981.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9983) Stop sorting determinize powersets unnecessarily

2021-05-30 Thread Haoyu Zhai (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17354237#comment-17354237
 ] 

Haoyu Zhai commented on LUCENE-9983:


Oh, I realized we're still going to iterate over those frozen sets 
[here|https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/util/automaton/Operations.java#L705], so maybe a bitset is not a good choice? What about just iterating over the keys and creating a {{FrozenIntSet}} from them? Since we're going to copy those keys anyway, this should only add a little overhead compared to the current implementation, while getting the benefit of a lightweight, sort-free data structure.

> Stop sorting determinize powersets unnecessarily
> 
>
> Key: LUCENE-9983
> URL: https://issues.apache.org/jira/browse/LUCENE-9983
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Michael McCandless
>Priority: Major
>
> Spinoff from LUCENE-9981.
> Today, our {{Operations.determinize}} implementation builds powersets of all 
> subsets of NFA states that "belong" in the same determinized state, using 
> [this algorithm|https://en.wikipedia.org/wiki/Powerset_construction].
> To hold each powerset, we use a malleable {{SortedIntSet}} and periodically 
> freeze it to a {{FrozenIntSet}}, also sorted.  We pay a high price to keep 
> these growing maps of int key, int value sorted by key, e.g. upgrading to a 
> {{TreeMap}} once the map is large enough (> 30 entries).
> But I think sorting is entirely unnecessary here!  Really all we need is the 
> ability to add/delete keys from the map, and hashCode / equals (by key only – 
> ignoring value!), and to freeze the map (a small optimization that we could 
> skip initially).  We only use these maps to lookup in the (growing) 
> determinized automaton whether this powerset has already been seen.
> Maybe we could simply poach the {{IntIntScatterMap}} implementation from 
> [HPPC|https://github.com/carrotsearch/hppc]?  And then change its 
> {{hashCode}}/{{equals }}to only use keys (not values).
> This change should be a big speedup for the kinds of (admittedly adversarial) 
> regexps we saw on LUCENE-9981.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9983) Stop sorting determinize powersets unnecessarily

2021-05-30 Thread Haoyu Zhai (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17354225#comment-17354225
 ] 

Haoyu Zhai commented on LUCENE-9983:


Hi Mike,

So if I understand correctly, what we really need is a map from key (which is a state) to its count, removing the state when its count goes to 0 while iterating the intervals? And freeze seems to be necessary since we want to take a snapshot of the key set to use as a hash key?

I'm thinking about using an {{IntIntHashMap}} along with a {{FixedBitSet}}, so that we keep the counts in the map and use a snapshot of the bitset as the hash key. What do you think?

> Stop sorting determinize powersets unnecessarily
> 
>
> Key: LUCENE-9983
> URL: https://issues.apache.org/jira/browse/LUCENE-9983
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Michael McCandless
>Priority: Major
>
> Spinoff from LUCENE-9981.
> Today, our {{Operations.determinize}} implementation builds powersets of all 
> subsets of NFA states that "belong" in the same determinized state, using 
> [this algorithm|https://en.wikipedia.org/wiki/Powerset_construction].
> To hold each powerset, we use a malleable {{SortedIntSet}} and periodically 
> freeze it to a {{FrozenIntSet}}, also sorted.  We pay a high price to keep 
> these growing maps of int key, int value sorted by key, e.g. upgrading to a 
> {{TreeMap}} once the map is large enough (> 30 entries).
> But I think sorting is entirely unnecessary here!  Really all we need is the 
> ability to add/delete keys from the map, and hashCode / equals (by key only – 
> ignoring value!), and to freeze the map (a small optimization that we could 
> skip initially).  We only use these maps to lookup in the (growing) 
> determinized automaton whether this powerset has already been seen.
> Maybe we could simply poach the {{IntIntScatterMap}} implementation from 
> [HPPC|https://github.com/carrotsearch/hppc]?  And then change its 
> {{hashCode}}/{{equals }}to only use keys (not values).
> This change should be a big speedup for the kinds of (admittedly adversarial) 
> regexps we saw on LUCENE-9981.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9694) New tool for creating a deterministic index

2021-01-26 Thread Haoyu Zhai (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17272291#comment-17272291
 ] 

Haoyu Zhai commented on LUCENE-9694:


Oh, I didn't include that since I wanted to keep it as generic as possible. But I guess I could add an example {{DocumentSelector}} as suggested by Mike in the PR.

> New tool for creating a deterministic index
> ---
>
> Key: LUCENE-9694
> URL: https://issues.apache.org/jira/browse/LUCENE-9694
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: general/tools
>Reporter: Haoyu Zhai
>Priority: Minor
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> Lucene's index is segmented, and sometimes number of segments and documents 
> arrangement greatly impact performance.
> Given a stable index sort, our team create a tool that records document 
> arrangement (called index map) of an index and rearrange another index 
> (consists of same documents) into the same structure (segment num, and 
> documents included in each segment).
> This tool could be also used in lucene benchmarks for a faster deterministic 
> index construction (if I understand correctly lucene benchmark is using a 
> single thread manner to achieve this).
>  
> We've already had some discussion in email
> [https://markmail.org/message/lbtdntclpnocmfuf]
> And I've implemented the first method, using {{IndexWriter.addIndexes}} and a 
> customized {{FilteredCodecReader}} to achieve the goal. The index 
> construction time is about 25min and time executing this tool is about 10min.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9694) New tool for creating a deterministic index

2021-01-25 Thread Haoyu Zhai (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17271790#comment-17271790
 ] 

Haoyu Zhai commented on LUCENE-9694:


I've opened a PR for this:

https://github.com/apache/lucene-solr/pull/2246

> New tool for creating a deterministic index
> ---
>
> Key: LUCENE-9694
> URL: https://issues.apache.org/jira/browse/LUCENE-9694
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: general/tools
>Reporter: Haoyu Zhai
>Priority: Minor
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Lucene's index is segmented, and sometimes number of segments and documents 
> arrangement greatly impact performance.
> Given a stable index sort, our team create a tool that records document 
> arrangement (called index map) of an index and rearrange another index 
> (consists of same documents) into the same structure (segment num, and 
> documents included in each segment).
> This tool could be also used in lucene benchmarks for a faster deterministic 
> index construction (if I understand correctly lucene benchmark is using a 
> single thread manner to achieve this).
>  
> We've already had some discussion in email
> [https://markmail.org/message/lbtdntclpnocmfuf]
> And I've implemented the first method, using {{IndexWriter.addIndexes}} and a 
> customized {{FilteredCodecReader}} to achieve the goal. The index 
> construction time is about 25min and time executing this tool is about 10min.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Created] (LUCENE-9694) New tool for creating a deterministic index

2021-01-22 Thread Haoyu Zhai (Jira)
Haoyu Zhai created LUCENE-9694:
--

 Summary: New tool for creating a deterministic index
 Key: LUCENE-9694
 URL: https://issues.apache.org/jira/browse/LUCENE-9694
 Project: Lucene - Core
  Issue Type: New Feature
  Components: general/tools
Reporter: Haoyu Zhai


Lucene's index is segmented, and sometimes the number of segments and the arrangement of documents greatly impact performance.

Given a stable index sort, our team created a tool that records the document arrangement (called an index map) of an index and rearranges another index (consisting of the same documents) into the same structure (segment count, and the documents included in each segment).

This tool could also be used in the lucene benchmarks for faster deterministic index construction (if I understand correctly, the lucene benchmark currently uses a single-threaded approach to achieve this).

We've already had some discussion over email:

[https://markmail.org/message/lbtdntclpnocmfuf]

And I've implemented the first method, using {{IndexWriter.addIndexes}} and a customized {{FilteredCodecReader}} to achieve the goal. The index construction time is about 25 min and the time to execute this tool is about 10 min.
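
A minimal sketch of that approach (not the actual tool): wrap each source segment in a filtered {{CodecReader}} so that only the selected documents stay live, then feed the wrapped readers to {{IndexWriter.addIndexes}}. The {{DocumentSelector}} interface and class names here are assumptions for illustration.

{code:java}
import java.io.IOException;
import org.apache.lucene.index.CodecReader;
import org.apache.lucene.index.FilterCodecReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.util.Bits;
import org.apache.lucene.util.FixedBitSet;

// Sketch only: decides which documents of a source segment belong in the target segment.
interface DocumentSelector {
  boolean keep(int docId) throws IOException;
}

// Sketch only: exposes a subset of the wrapped segment's documents as "live".
final class SelectedDocsReader extends FilterCodecReader {

  private final FixedBitSet liveDocs;
  private final int numDocs;

  SelectedDocsReader(CodecReader in, DocumentSelector selector) throws IOException {
    super(in);
    liveDocs = new FixedBitSet(in.maxDoc());
    for (int doc = 0; doc < in.maxDoc(); doc++) {
      if (selector.keep(doc)) {
        liveDocs.set(doc);
      }
    }
    numDocs = liveDocs.cardinality();
  }

  @Override
  public Bits getLiveDocs() {
    return liveDocs; // documents not selected look deleted to addIndexes
  }

  @Override
  public int numDocs() {
    return numDocs;
  }

  @Override
  public CacheHelper getCoreCacheHelper() {
    return null; // no caching for this throwaway wrapper
  }

  @Override
  public CacheHelper getReaderCacheHelper() {
    return null;
  }

  static void addSegment(IndexWriter writer, CodecReader segment, DocumentSelector selector)
      throws IOException {
    writer.addIndexes(new SelectedDocsReader(segment, selector));
  }
}
{code}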



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Updated] (LUCENE-9618) Improve IntervalIterator.nextInterval's behavior/documentation/test

2020-11-19 Thread Haoyu Zhai (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haoyu Zhai updated LUCENE-9618:
---
Description: 
I'm trying to play around with my own {{IntervalSource}} and found that the {{nextInterval}} method of IntervalIterator is sometimes called even after the {{nextDoc}}/{{docID}}/{{advance}} methods return NO_MORE_DOCS.
  
 After digging a bit more, I found that {{FilteringIntervalIterator.reset}} calls an inner iterator's {{nextInterval}} regardless of the result of {{nextDoc}}, and also that most (if not all) existing {{IntervalIterator}} implementations do consider the case where {{nextInterval}} is called after {{nextDoc}} returns NO_MORE_DOCS.
  
 We should probably update the javadoc and tests if this behavior is necessary, or change the current implementation to avoid it.
 original email discussion thread:

https://markmail.org/thread/aytal77bgzl2zafm

  was:
I'm trying to play around with my own {{IntervalSource}} and found out that 
{{nextInterval}} method of IntervalIterator will be called sometimes even after 
{{nextDoc}}/ {{docID}}/ {{advance}} method returns NO_MORE_DOCS.
 
After I dug a bit more I found that {{FilteringIntervalIterator.reset}} is 
calling an inner iterator's {{nextInterval}} regardless of what the result of 
{{nextDoc}}, and also most (if not all) existing {{IntervalIterator}}'s 
implementation do considered the case where {{nextInterval}} is called after 
{{nextDoc}} returns NO_MORE_DOCS.
 
We should probably update the javadoc and test if the behavior is necessary. Or 
we should change the current implementation to avoid this behavior
original email discussion thread:

https://markmail.org/message/7itbwk6ts3bo3gdh


> Improve IntervalIterator.nextInterval's behavior/documentation/test
> ---
>
> Key: LUCENE-9618
> URL: https://issues.apache.org/jira/browse/LUCENE-9618
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/query
>Reporter: Haoyu Zhai
>Priority: Minor
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> I'm trying to play around with my own {{IntervalSource}} and found out that 
> {{nextInterval}} method of IntervalIterator will be called sometimes even 
> after {{nextDoc}}/ {{docID}}/ {{advance}} method returns NO_MORE_DOCS.
>   
>  After I dug a bit more I found that {{FilteringIntervalIterator.reset}} is 
> calling an inner iterator's {{nextInterval}} regardless of what the result of 
> {{nextDoc}}, and also most (if not all) existing {{IntervalIterator}}'s 
> implementation do considered the case where {{nextInterval}} is called after 
> {{nextDoc}} returns NO_MORE_DOCS.
>   
>  We should probably update the javadoc and test if the behavior is necessary. 
> Or we should change the current implementation to avoid this behavior
>  original email discussion thread:
> https://markmail.org/thread/aytal77bgzl2zafm



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9618) Improve IntervalIterator.nextInterval's behavior/documentation/test

2020-11-19 Thread Haoyu Zhai (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17235652#comment-17235652
 ] 

Haoyu Zhai commented on LUCENE-9618:


I created a [PR|https://github.com/apache/lucene-solr/pull/2090] with a simple 
test case to demonstrate the issue mentioned.

> Improve IntervalIterator.nextInterval's behavior/documentation/test
> ---
>
> Key: LUCENE-9618
> URL: https://issues.apache.org/jira/browse/LUCENE-9618
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/query
>Reporter: Haoyu Zhai
>Priority: Minor
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> I'm trying to play around with my own {{IntervalSource}} and found out that 
> {{nextInterval}} method of IntervalIterator will be called sometimes even 
> after {{nextDoc}}/ {{docID}}/ {{advance}} method returns NO_MORE_DOCS.
>  
> After I dug a bit more I found that {{FilteringIntervalIterator.reset}} is 
> calling an inner iterator's {{nextInterval}} regardless of what the result of 
> {{nextDoc}}, and also most (if not all) existing {{IntervalIterator}}'s 
> implementation do considered the case where {{nextInterval}} is called after 
> {{nextDoc}} returns NO_MORE_DOCS.
>  
> We should probably update the javadoc and test if the behavior is necessary. 
> Or we should change the current implementation to avoid this behavior
> original email discussion thread:
> https://markmail.org/message/7itbwk6ts3bo3gdh



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Created] (LUCENE-9618) Improve IntervalIterator.nextInterval's behavior/documentation/test

2020-11-19 Thread Haoyu Zhai (Jira)
Haoyu Zhai created LUCENE-9618:
--

 Summary: Improve IntervalIterator.nextInterval's 
behavior/documentation/test
 Key: LUCENE-9618
 URL: https://issues.apache.org/jira/browse/LUCENE-9618
 Project: Lucene - Core
  Issue Type: Improvement
  Components: modules/query
Reporter: Haoyu Zhai


I'm trying to play around with my own {{IntervalSource}} and found that the {{nextInterval}} method of IntervalIterator is sometimes called even after the {{nextDoc}}/{{docID}}/{{advance}} methods return NO_MORE_DOCS.
 
After digging a bit more, I found that {{FilteringIntervalIterator.reset}} calls an inner iterator's {{nextInterval}} regardless of the result of {{nextDoc}}, and also that most (if not all) existing {{IntervalIterator}} implementations do consider the case where {{nextInterval}} is called after {{nextDoc}} returns NO_MORE_DOCS.
 
We should probably update the javadoc and tests if this behavior is necessary, or change the current implementation to avoid it.
original email discussion thread:

https://markmail.org/message/7itbwk6ts3bo3gdh



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Resolved] (LUCENE-9560) Position aware TermQuery

2020-10-09 Thread Haoyu Zhai (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haoyu Zhai resolved LUCENE-9560.

Resolution: Not A Problem

> Position aware TermQuery
> 
>
> Key: LUCENE-9560
> URL: https://issues.apache.org/jira/browse/LUCENE-9560
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: core/search
>Reporter: Haoyu Zhai
>Priority: Major
>
> In our work, we index most of our fields into an "all" field (like 
> elasticsearch) to make our search faster. But at the same time we still want 
> to support some of the field specific search (like {{title}}), so currently 
> our solution is to double index them so that we could do both "all" search as 
> well as specific field search.
> I want to propose a new term query that accept a range in a specific field to 
> search so that we could search on "all" field but act like a field specific 
> search. Then we need not to doubly index those field.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9560) Position aware TermQuery

2020-10-09 Thread Haoyu Zhai (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17211544#comment-17211544
 ] 

Haoyu Zhai commented on LUCENE-9560:


Thanks guys. I tried it and found that {{IntervalQuery}} provides the functionality I want, so I'll close this issue.

> Position aware TermQuery
> 
>
> Key: LUCENE-9560
> URL: https://issues.apache.org/jira/browse/LUCENE-9560
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: core/search
>Reporter: Haoyu Zhai
>Priority: Major
>
> In our work, we index most of our fields into an "all" field (like 
> elasticsearch) to make our search faster. But at the same time we still want 
> to support some of the field specific search (like {{title}}), so currently 
> our solution is to double index them so that we could do both "all" search as 
> well as specific field search.
> I want to propose a new term query that accept a range in a specific field to 
> search so that we could search on "all" field but act like a field specific 
> search. Then we need not to doubly index those field.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Created] (LUCENE-9560) Position aware TermQuery

2020-10-03 Thread Haoyu Zhai (Jira)
Haoyu Zhai created LUCENE-9560:
--

 Summary: Position aware TermQuery
 Key: LUCENE-9560
 URL: https://issues.apache.org/jira/browse/LUCENE-9560
 Project: Lucene - Core
  Issue Type: New Feature
  Components: core/search
Reporter: Haoyu Zhai


In our work, we index most of our fields into an "all" field (like elasticsearch) to make our search faster. But at the same time we still want to support some field-specific searches (like {{title}}), so currently our solution is to index those fields twice so that we can do both "all" search and field-specific search.

I want to propose a new term query that accepts a position range within a specific field, so that we could search on the "all" field but behave like a field-specific search. Then we would not need to index those fields twice.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-7882) Maybe expression compiler should cache recently compiled expressions?

2020-09-16 Thread Haoyu Zhai (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-7882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17197237#comment-17197237
 ] 

Haoyu Zhai commented on LUCENE-7882:


Hi Uwe,

Thank you for making this fantastic PR!

The issue on our side turned out to be a bug in our service that would incorrectly recompile many expressions; it is now fixed, and we only see ~500 expressions compiled per benchmark run.

We've tested this PR by compiling with JDK 11 and running with JDK 15 (for various reasons it's not easy to compile our service with JDK 15 directly). But because of the fix mentioned above, it seems we no longer compile enough expressions to observe a difference with or without the PR.
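
For reference, a minimal sketch of the "small cache in front of the expressions compiler" idea from this issue (not the approach taken in the PR; the cache size is an arbitrary illustration):

{code:java}
import java.text.ParseException;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import org.apache.lucene.expressions.Expression;
import org.apache.lucene.expressions.js.JavascriptCompiler;

// Sketch only: memoize compiled expressions by source text so repeated
// compiles of the same expression skip the expensive bytecode generation.
final class CachingExpressionCompiler {

  private static final int MAX_ENTRIES = 512; // illustrative size

  private final Map<String, Expression> cache =
      Collections.synchronizedMap(
          new LinkedHashMap<String, Expression>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Expression> eldest) {
              return size() > MAX_ENTRIES; // simple LRU eviction
            }
          });

  Expression compile(String source) throws ParseException {
    Expression cached = cache.get(source);
    if (cached == null) {
      // A racing thread may compile the same source twice; that is harmless here.
      cached = JavascriptCompiler.compile(source);
      cache.put(source, cached);
    }
    return cached;
  }
}
{code}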

> Maybe expression compiler should cache recently compiled expressions?
> -
>
> Key: LUCENE-7882
> URL: https://issues.apache.org/jira/browse/LUCENE-7882
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/expressions
>Reporter: Michael McCandless
>Assignee: Uwe Schindler
>Priority: Major
> Attachments: demo.patch
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> I've been running search performance tests using a simple expression 
> ({{_score + ln(1000+unit_sales)}}) for sorting and hit this odd bottleneck:
> {noformat}
> "pool-1-thread-30" #70 prio=5 os_prio=0 tid=0x7eea7000a000 nid=0x1ea8a 
> waiting for monitor entry [0x7eea867dd000]
>java.lang.Thread.State: BLOCKED (on object monitor)
>   at 
> org.apache.lucene.expressions.js.JavascriptCompiler$CompiledExpression.evaluate(_score
>  + ln(1000+unit_sales))
>   at 
> org.apache.lucene.expressions.ExpressionFunctionValues.doubleValue(ExpressionFunctionValues.java:49)
>   at 
> com.amazon.lucene.OrderedVELeafCollector.collectInternal(OrderedVELeafCollector.java:123)
>   at 
> com.amazon.lucene.OrderedVELeafCollector.collect(OrderedVELeafCollector.java:108)
>   at 
> org.apache.lucene.search.MultiCollectorManager$Collectors$LeafCollectors.collect(MultiCollectorManager.java:102)
>   at 
> org.apache.lucene.search.Weight$DefaultBulkScorer.scoreAll(Weight.java:241)
>   at 
> org.apache.lucene.search.Weight$DefaultBulkScorer.score(Weight.java:184)
>   at org.apache.lucene.search.BulkScorer.score(BulkScorer.java:39)
>   at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:658)
>   at org.apache.lucene.search.IndexSearcher$5.call(IndexSearcher.java:600)
>   at org.apache.lucene.search.IndexSearcher$5.call(IndexSearcher.java:597)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {noformat}
> I couldn't see any {{synchronized}} in the sources here, so I'm not sure 
> which object monitor it's blocked on.
> I was accidentally compiling a new expression for every query, and that 
> bottleneck would cause overall QPS to slow down drastically (~4X slower after 
> ~1 hour of redline tests), as if the JVM is getting slower and slower to 
> evaluate each expression the more expressions I had compiled.
> I tested JDK 9-ea and it also kept slowing down over time as the performance 
> test ran.
> Maybe we should put a small cache in front of the expressions compiler to 
> make it less trappy?  Or maybe we can get to the root cause of why the JVM 
> slows down more and more, the more expressions you compile?
> I won't have time to work on this in the near future so if anyone else feels 
> the itch, please scratch it!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9424) Have a warning comment for AttributeSource.captureState

2020-07-29 Thread Haoyu Zhai (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17167454#comment-17167454
 ] 

Haoyu Zhai commented on LUCENE-9424:


[~mikemccand] Thank you!

> Have a warning comment for AttributeSource.captureState
> ---
>
> Key: LUCENE-9424
> URL: https://issues.apache.org/jira/browse/LUCENE-9424
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: general/javadocs
>Reporter: Haoyu Zhai
>Priority: Trivial
> Fix For: master (9.0), 8.7
>
> Attachments: LUCENE-9424.patch
>
>
> {{AttributeSource.captureState}} is a powerful method that can be used to 
> store and (later on) restore the current state, but it comes with the cost of 
> copying all attributes in this source, which can be a big cost if it is 
> called multiple times.
> We could probably add a warning to indicate this cost, as this method is 
> encapsulated quite well and people who use it are sometimes not aware of the 
> cost.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Updated] (LUCENE-9424) Have a warning comment for AttributeSource.captureState

2020-07-24 Thread Haoyu Zhai (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haoyu Zhai updated LUCENE-9424:
---
Attachment: LUCENE-9424.patch

> Have a warning comment for AttributeSource.captureState
> ---
>
> Key: LUCENE-9424
> URL: https://issues.apache.org/jira/browse/LUCENE-9424
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: general/javadocs
>Reporter: Haoyu Zhai
>Priority: Trivial
> Attachments: LUCENE-9424.patch
>
>
> {{AttributeSource.captureState}} is a powerful method that can be used to 
> store and (later on) restore the current state, but it comes with the cost of 
> copying all attributes in this source, which can be a big cost if it is 
> called multiple times.
> We could probably add a warning to indicate this cost, as this method is 
> encapsulated quite well and people who use it are sometimes not aware of the 
> cost.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Created] (LUCENE-9424) Have a warning comment for AttributeSource.captureState

2020-07-09 Thread Haoyu Zhai (Jira)
Haoyu Zhai created LUCENE-9424:
--

 Summary: Have a warning comment for AttributeSource.captureState
 Key: LUCENE-9424
 URL: https://issues.apache.org/jira/browse/LUCENE-9424
 Project: Lucene - Core
  Issue Type: Improvement
  Components: general/javadocs
Reporter: Haoyu Zhai


{{AttributeSource.captureState}} is a powerful method that can be used to store 
and (later on) restore the current state, but it comes with the cost of copying 
all attributes in this source, which can be a big cost if it is called 
multiple times.

We could probably add a warning to indicate this cost, as this method is 
encapsulated quite well and people who use it are sometimes not aware of the 
cost.
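
For illustration only, a minimal sketch of the pattern whose cost such a warning 
would describe; the filter name and structure here are made up, not part of any 
proposed change:
{code:java}
import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.util.AttributeSource;

// Hypothetical sketch only: captureState() clones every attribute in the source,
// so calling it more often than necessary (e.g. several times per token)
// multiplies that copying cost.
public final class StateCapturingFilter extends TokenFilter {
  private AttributeSource.State savedState; // a full copy of all attributes

  public StateCapturingFilter(TokenStream input) {
    super(input);
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (!input.incrementToken()) {
      return false;
    }
    savedState = captureState(); // one full attribute copy per token
    return true;
  }
}
{code}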



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8574) ExpressionFunctionValues should cache per-hit value

2020-07-01 Thread Haoyu Zhai (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17149623#comment-17149623
 ] 

Haoyu Zhai commented on LUCENE-8574:


[~mikemccand] Yes, it fixes the test case. Before the fix the test triggers an 
{{OutOfMemoryError}}, and after the fix it finishes in a reasonable time 
(< 1s when I run it on its own).

Thanks for reviewing that, I've addressed those comments!

> ExpressionFunctionValues should cache per-hit value
> ---
>
> Key: LUCENE-8574
> URL: https://issues.apache.org/jira/browse/LUCENE-8574
> Project: Lucene - Core
>  Issue Type: Bug
>Affects Versions: 7.5, 8.0
>Reporter: Michael McCandless
>Assignee: Robert Muir
>Priority: Major
> Attachments: LUCENE-8574.patch, unit_test.patch
>
>  Time Spent: 2h
>  Remaining Estimate: 0h
>
> The original version of {{ExpressionFunctionValues}} had a simple per-hit 
> cache, so that nested expressions that reference the same common variable 
> would compute the value for that variable the first time it was referenced 
> and then use that cached value for all subsequent invocations, within one 
> hit.  I think it was accidentally removed in LUCENE-7609?
> This is quite important if you have non-trivial expressions that reference 
> the same variable multiple times.
> E.g. if I have these expressions:
> {noformat}
> x = c + d
> c = b + 2 
> d = b * 2{noformat}
> Then evaluating x should only cause b's value to be computed once (for a 
> given hit), but today it's computed twice.  The problem is combinatoric if b 
> then references another variable multiple times, etc.
> I think to fix this we just need to restore the per-hit cache?
>  
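
For illustration, a rough sketch (not the actual patch) of the per-hit caching 
idea: wrap the DoubleValues for a shared variable so its value is computed at 
most once per document. The class name is hypothetical.
{code:java}
import java.io.IOException;
import org.apache.lucene.search.DoubleValues;

// Hypothetical sketch only: caches the wrapped value per document, using a boolean
// flag rather than a NaN sentinel so that a legitimate NaN value is still cached.
final class PerHitCachingDoubleValues extends DoubleValues {
  private final DoubleValues in;
  private boolean computed;
  private double value;

  PerHitCachingDoubleValues(DoubleValues in) {
    this.in = in;
  }

  @Override
  public boolean advanceExact(int doc) throws IOException {
    computed = false; // invalidate the cache for every new hit
    return in.advanceExact(doc);
  }

  @Override
  public double doubleValue() throws IOException {
    if (!computed) {
      value = in.doubleValue(); // compute once per hit, however many times it is referenced
      computed = true;
    }
    return value;
  }
}
{code}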



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8574) ExpressionFunctionValues should cache per-hit value

2020-06-25 Thread Haoyu Zhai (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17145732#comment-17145732
 ] 

Haoyu Zhai commented on LUCENE-8574:


I made a PR ([https://github.com/apache/lucene-solr/pull/1613]) for this issue.

Basically it adds a new DoubleValuesSource class that passes a `valueCache` into a 
custom getValues so that only one DoubleValues is created per name over the whole 
generation process.
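
Roughly, the idea looks like this; the helper and parameter names below are made 
up for illustration and are not the actual code in the PR:
{code:java}
import java.io.IOException;
import java.util.Map;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.search.DoubleValues;
import org.apache.lucene.search.DoubleValuesSource;

// Hypothetical sketch only: reuse one DoubleValues per variable name within a leaf,
// so nested expressions that reference the same variable share a single instance.
final class ValueCacheHelper {
  static DoubleValues getOrCreate(
      Map<String, DoubleValues> valueCache,
      String variableName,
      DoubleValuesSource source,
      LeafReaderContext ctx,
      DoubleValues scores)
      throws IOException {
    DoubleValues values = valueCache.get(variableName);
    if (values == null) {
      values = source.getValues(ctx, scores); // created only the first time the name is seen
      valueCache.put(variableName, values);
    }
    return values;
  }
}
{code}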

> ExpressionFunctionValues should cache per-hit value
> ---
>
> Key: LUCENE-8574
> URL: https://issues.apache.org/jira/browse/LUCENE-8574
> Project: Lucene - Core
>  Issue Type: Bug
>Affects Versions: 7.5, 8.0
>Reporter: Michael McCandless
>Assignee: Robert Muir
>Priority: Major
> Attachments: LUCENE-8574.patch, unit_test.patch
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> The original version of {{ExpressionFunctionValues}} had a simple per-hit 
> cache, so that nested expressions that reference the same common variable 
> would compute the value for that variable the first time it was referenced 
> and then use that cached value for all subsequent invocations, within one 
> hit.  I think it was accidentally removed in LUCENE-7609?
> This is quite important if you have non-trivial expressions that reference 
> the same variable multiple times.
> E.g. if I have these expressions:
> {noformat}
> x = c + d
> c = b + 2 
> d = b * 2{noformat}
> Then evaluating x should only cause b's value to be computed once (for a 
> given hit), but today it's computed twice.  The problem is combinatoric if b 
> then references another variable multiple times, etc.
> I think to fix this we just need to restore the per-hit cache?
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8574) ExpressionFunctionValues should cache per-hit value

2020-06-17 Thread Haoyu Zhai (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17138710#comment-17138710
 ] 

Haoyu Zhai commented on LUCENE-8574:


Ah, yes, I will use a boolean instead of NaN. I was just verifying whether the 
patch works, so I quickly inserted a few lines of code without much thought.

But how should we fix this issue correctly? The easy-fix patch does not seem to 
solve the problem.
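
Just to spell out the boolean-vs-NaN point, a tiny made-up illustration (the class 
and method names are hypothetical):
{code:java}
import java.util.function.DoubleSupplier;

// Hypothetical illustration: NaN is a poor "not computed yet" marker because the
// real value can itself be NaN; an explicit boolean flag is unambiguous.
final class LazyDouble {
  private double cached = Double.NaN;
  private boolean computed;

  double get(DoubleSupplier supplier) {
    // With "if (Double.isNaN(cached))" the supplier would be re-invoked on every call
    // whenever the real value happens to be NaN.
    if (!computed) {
      cached = supplier.getAsDouble();
      computed = true;
    }
    return cached;
  }
}
{code}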

> ExpressionFunctionValues should cache per-hit value
> ---
>
> Key: LUCENE-8574
> URL: https://issues.apache.org/jira/browse/LUCENE-8574
> Project: Lucene - Core
>  Issue Type: Bug
>Affects Versions: 7.5, 8.0
>Reporter: Michael McCandless
>Assignee: Robert Muir
>Priority: Major
> Attachments: LUCENE-8574.patch, unit_test.patch
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> The original version of {{ExpressionFunctionValues}} had a simple per-hit 
> cache, so that nested expressions that reference the same common variable 
> would compute the value for that variable the first time it was referenced 
> and then use that cached value for all subsequent invocations, within one 
> hit.  I think it was accidentally removed in LUCENE-7609?
> This is quite important if you have non-trivial expressions that reference 
> the same variable multiple times.
> E.g. if I have these expressions:
> {noformat}
> x = c + d
> c = b + 2 
> d = b * 2{noformat}
> Then evaluating x should only cause b's value to be computed once (for a 
> given hit), but today it's computed twice.  The problem is combinatoric if b 
> then references another variable multiple times, etc.
> I think to fix this we just need to restore the per-hit cache?
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8574) ExpressionFunctionValues should cache per-hit value

2020-06-16 Thread Haoyu Zhai (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17137920#comment-17137920
 ] 

Haoyu Zhai commented on LUCENE-8574:


I've attached a unit test showing a case that the current code cannot handle. 
It seems the patch attached to this issue cannot handle it either: since the 
DoubleValues generated for the same LeafReaderContext are not the same instance, 
we still get tons of DoubleValues created.

> ExpressionFunctionValues should cache per-hit value
> ---
>
> Key: LUCENE-8574
> URL: https://issues.apache.org/jira/browse/LUCENE-8574
> Project: Lucene - Core
>  Issue Type: Bug
>Affects Versions: 7.5, 8.0
>Reporter: Michael McCandless
>Assignee: Robert Muir
>Priority: Major
> Attachments: LUCENE-8574.patch, unit_test.patch
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> The original version of {{ExpressionFunctionValues}} had a simple per-hit 
> cache, so that nested expressions that reference the same common variable 
> would compute the value for that variable the first time it was referenced 
> and then use that cached value for all subsequent invocations, within one 
> hit.  I think it was accidentally removed in LUCENE-7609?
> This is quite important if you have non-trivial expressions that reference 
> the same variable multiple times.
> E.g. if I have these expressions:
> {noformat}
> x = c + d
> c = b + 2 
> d = b * 2{noformat}
> Then evaluating x should only cause b's value to be computed once (for a 
> given hit), but today it's computed twice.  The problem is combinatoric if b 
> then references another variable multiple times, etc.
> I think to fix this we just need to restore the per-hit cache?
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Updated] (LUCENE-8574) ExpressionFunctionValues should cache per-hit value

2020-06-16 Thread Haoyu Zhai (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haoyu Zhai updated LUCENE-8574:
---
Attachment: unit_test.patch

> ExpressionFunctionValues should cache per-hit value
> ---
>
> Key: LUCENE-8574
> URL: https://issues.apache.org/jira/browse/LUCENE-8574
> Project: Lucene - Core
>  Issue Type: Bug
>Affects Versions: 7.5, 8.0
>Reporter: Michael McCandless
>Assignee: Robert Muir
>Priority: Major
> Attachments: LUCENE-8574.patch, unit_test.patch
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> The original version of {{ExpressionFunctionValues}} had a simple per-hit 
> cache, so that nested expressions that reference the same common variable 
> would compute the value for that variable the first time it was referenced 
> and then use that cached value for all subsequent invocations, within one 
> hit.  I think it was accidentally removed in LUCENE-7609?
> This is quite important if you have non-trivial expressions that reference 
> the same variable multiple times.
> E.g. if I have these expressions:
> {noformat}
> x = c + d
> c = b + 2 
> d = b * 2{noformat}
> Then evaluating x should only cause b's value to be computed once (for a 
> given hit), but today it's computed twice.  The problem is combinatoric if b 
> then references another variable multiple times, etc.
> I think to fix this we just need to restore the per-hit cache?
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8574) ExpressionFunctionValues should cache per-hit value

2020-06-15 Thread Haoyu Zhai (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17136136#comment-17136136
 ] 

Haoyu Zhai commented on LUCENE-8574:


I've checked the current release and could not see this patch merged. I don't 
think any other change introduces similar functionality (though I'm not certain). 
Should we merge this?

> ExpressionFunctionValues should cache per-hit value
> ---
>
> Key: LUCENE-8574
> URL: https://issues.apache.org/jira/browse/LUCENE-8574
> Project: Lucene - Core
>  Issue Type: Bug
>Affects Versions: 7.5, 8.0
>Reporter: Michael McCandless
>Assignee: Robert Muir
>Priority: Major
> Attachments: LUCENE-8574.patch
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> The original version of {{ExpressionFunctionValues}} had a simple per-hit 
> cache, so that nested expressions that reference the same common variable 
> would compute the value for that variable the first time it was referenced 
> and then use that cached value for all subsequent invocations, within one 
> hit.  I think it was accidentally removed in LUCENE-7609?
> This is quite important if you have non-trivial expressions that reference 
> the same variable multiple times.
> E.g. if I have these expressions:
> {noformat}
> x = c + d
> c = b + 2 
> d = b * 2{noformat}
> Then evaluating x should only cause b's value to be computed once (for a 
> given hit), but today it's computed twice.  The problem is combinatoric if b 
> then references another variable multiple times, etc.
> I think to fix this we just need to restore the per-hit cache?
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8574) ExpressionFunctionValues should cache per-hit value

2020-06-12 Thread Haoyu Zhai (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17134323#comment-17134323
 ] 

Haoyu Zhai commented on LUCENE-8574:


Sorry, the wrong commit message pointed here; the correct issue should be 
LUCENE-9391.

By the way, was this patch ever merged?

> ExpressionFunctionValues should cache per-hit value
> ---
>
> Key: LUCENE-8574
> URL: https://issues.apache.org/jira/browse/LUCENE-8574
> Project: Lucene - Core
>  Issue Type: Bug
>Affects Versions: 7.5, 8.0
>Reporter: Michael McCandless
>Assignee: Robert Muir
>Priority: Major
> Attachments: LUCENE-8574.patch
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> The original version of {{ExpressionFunctionValues}} had a simple per-hit 
> cache, so that nested expressions that reference the same common variable 
> would compute the value for that variable the first time it was referenced 
> and then use that cached value for all subsequent invocations, within one 
> hit.  I think it was accidentally removed in LUCENE-7609?
> This is quite important if you have non-trivial expressions that reference 
> the same variable multiple times.
> E.g. if I have these expressions:
> {noformat}
> x = c + d
> c = b + 2 
> d = b * 2{noformat}
> Then evaluating x should only cause b's value to be computed once (for a 
> given hit), but today it's computed twice.  The problem is combinatoric if b 
> then references another variable multiple times, etc.
> I think to fix this we just need to restore the per-hit cache?
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Created] (LUCENE-9391) Upgrade to HPPC 0.8.2

2020-06-03 Thread Haoyu Zhai (Jira)
Haoyu Zhai created LUCENE-9391:
--

 Summary: Upgrade to HPPC 0.8.2
 Key: LUCENE-9391
 URL: https://issues.apache.org/jira/browse/LUCENE-9391
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Haoyu Zhai


HPPC 0.8.2 is out and exposes an Accountable-like interface that can be used to 
estimate memory usage.

[https://issues.carrot2.org/secure/ReleaseNote.jspa?projectId=10070&version=13522&styleName=Text]

We should upgrade if any of the components using HPPC need better memory 
estimation.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org