[jira] [Created] (LUCENE-10371) Make IndexRearranger able to arrange segment into a determined order
Haoyu Zhai created LUCENE-10371: --- Summary: Make IndexRearranger able to arrange segment into a determined order Key: LUCENE-10371 URL: https://issues.apache.org/jira/browse/LUCENE-10371 Project: Lucene - Core Issue Type: Improvement Reporter: Haoyu Zhai Previously, when I tried to change luceneutil to use {{IndexRearranger}} for faster deterministic index construction, I found that even when each segment contains the same set of documents, the order of the segments impacts the estimated hit count (using BMW): [https://markmail.org/message/zl6zsqvbg7nwfq6w] At that time the discussion leaned toward tolerating the small hit count difference to resolve the issue. After discussing this again with [~mikemccand], we thought it might also be a good idea to add the ability to rearrange segment order to {{IndexRearranger}}, so that we can ensure the rearranged index is truly the same every time. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Resolved] (LUCENE-10316) fix TestLRUQueryCache.testCachingAccountableQuery failure
[ https://issues.apache.org/jira/browse/LUCENE-10316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haoyu Zhai resolved LUCENE-10316. - Resolution: Fixed > fix TestLRUQueryCache.testCachingAccountableQuery failure > - > > Key: LUCENE-10316 > URL: https://issues.apache.org/jira/browse/LUCENE-10316 > Project: Lucene - Core > Issue Type: Bug > Components: core/search >Reporter: Haoyu Zhai >Priority: Minor > Time Spent: 0.5h > Remaining Estimate: 0h > > I saw this build failure: > [https://jenkins.thetaphi.de/job/Lucene-9.x-Linux/348/] > with following stack trace > {code:java} > java.lang.AssertionError: expected:<130.0> but was:<1544976.0> > at > __randomizedtesting.SeedInfo.seed([F7826B1EB37D545A:995B6ED46A95D1A0]:0) > at org.junit.Assert.fail(Assert.java:89) > at org.junit.Assert.failNotEquals(Assert.java:835) > at org.junit.Assert.assertEquals(Assert.java:577) > at org.junit.Assert.assertEquals(Assert.java:701) > at > org.apache.lucene.search.TestLRUQueryCache.testCachingAccountableQuery(TestLRUQueryCache.java:570) > at > java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.base/java.lang.reflect.Method.invoke(Method.java:566) > at > com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1754) > at > com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:942) > at > com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:978) > at > com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:992) > ... > NOTE: reproduce with: gradlew test --tests > TestLRUQueryCache.testCachingAccountableQuery -Dtests.seed=F7826B1EB37D545A > -Dtests.multiplier=3 -Dtests.slow=true -Dtests.locale=ckb-IR > -Dtests.timezone=Africa/Dakar -Dtests.asserts=true > -Dtests.file.encoding=UTF-8 {code} > It does not reproduce on my laptop on current main branch, but since the test > is comparing an estimation with a 10% slack, it can fail for sure sometime. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Updated] (LUCENE-10316) fix TestLRUQueryCache.testCachingAccountableQuery failure
[ https://issues.apache.org/jira/browse/LUCENE-10316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haoyu Zhai updated LUCENE-10316: Description: I saw this build failure: [https://jenkins.thetaphi.de/job/Lucene-9.x-Linux/348/] with following stack trace {code:java} java.lang.AssertionError: expected:<130.0> but was:<1544976.0> at __randomizedtesting.SeedInfo.seed([F7826B1EB37D545A:995B6ED46A95D1A0]:0) at org.junit.Assert.fail(Assert.java:89) at org.junit.Assert.failNotEquals(Assert.java:835) at org.junit.Assert.assertEquals(Assert.java:577) at org.junit.Assert.assertEquals(Assert.java:701) at org.apache.lucene.search.TestLRUQueryCache.testCachingAccountableQuery(TestLRUQueryCache.java:570) at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.base/java.lang.reflect.Method.invoke(Method.java:566) at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1754) at com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:942) at com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:978) at com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:992) ... NOTE: reproduce with: gradlew test --tests TestLRUQueryCache.testCachingAccountableQuery -Dtests.seed=F7826B1EB37D545A -Dtests.multiplier=3 -Dtests.slow=true -Dtests.locale=ckb-IR -Dtests.timezone=Africa/Dakar -Dtests.asserts=true -Dtests.file.encoding=UTF-8 {code} It does not reproduce on my laptop on current main branch, but since the test is comparing an estimation with a 10% slack, it can fail for sure sometime. was: I saw this build failure: https://jenkins.thetaphi.de/job/Lucene-9.x-Linux/348/ with following stack trace {code:java} java.lang.AssertionError: expected:<130.0> but was:<1544976.0> at __randomizedtesting.SeedInfo.seed([F7826B1EB37D545A:995B6ED46A95D1A0]:0) at org.junit.Assert.fail(Assert.java:89) at org.junit.Assert.failNotEquals(Assert.java:835) at org.junit.Assert.assertEquals(Assert.java:577) at org.junit.Assert.assertEquals(Assert.java:701) at org.apache.lucene.search.TestLRUQueryCache.testCachingAccountableQuery(TestLRUQueryCache.java:570) at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.base/java.lang.reflect.Method.invoke(Method.java:566) at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1754) at com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:942) at com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:978) at com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:992) ... {code} It does not reproduce on my laptop on current main branch, but since the test is comparing an estimation with a 10% slack, it can fail for sure sometime. 
> fix TestLRUQueryCache.testCachingAccountableQuery failure > - > > Key: LUCENE-10316 > URL: https://issues.apache.org/jira/browse/LUCENE-10316 > Project: Lucene - Core > Issue Type: Bug > Components: core/search >Reporter: Haoyu Zhai >Priority: Minor > > I saw this build failure: > [https://jenkins.thetaphi.de/job/Lucene-9.x-Linux/348/] > with following stack trace > {code:java} > java.lang.AssertionError: expected:<130.0> but was:<1544976.0> > at > __randomizedtesting.SeedInfo.seed([F7826B1EB37D545A:995B6ED46A95D1A0]:0) > at org.junit.Assert.fail(Assert.java:89) > at org.junit.Assert.failNotEquals(Assert.java:835) > at org.junit.Assert.assertEquals(Assert.java:577) > at org.junit.Assert.assertEquals(Assert.java:701) > at > org.apache.lucene.search.TestLRUQueryCache.testCachingAccountableQuery(TestLRUQueryCache.java:570) > at > java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.base/java.lang.refle
[jira] [Commented] (LUCENE-10316) fix TestLRUQueryCache.testCachingAccountableQuery failure
[ https://issues.apache.org/jira/browse/LUCENE-10316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17459420#comment-17459420 ] Haoyu Zhai commented on LUCENE-10316: - So basically the test is about making sure the query cache has the right estimation when the query implements the {{Accountable}} interface. When I originally wrote it, I estimated the query cache size as {{(query_size + linked_hash_map_entry_size) * query_num}} with 10% slack to allow for estimation error. But apparently that is not enough sometimes (probably a larger number of cache entries wastes more?). Given that the aim of the test is to make sure the query cache correctly reflects known big queries being cached, I think we could change it to {{assert(query_cache_size > sum_of_all_queries_cached)}}. Then we won't depend on a slack to assert correctness. > fix TestLRUQueryCache.testCachingAccountableQuery failure > - > > Key: LUCENE-10316 > URL: https://issues.apache.org/jira/browse/LUCENE-10316 > Project: Lucene - Core > Issue Type: Bug > Components: core/search >Reporter: Haoyu Zhai >Priority: Minor > > I saw this build failure: > https://jenkins.thetaphi.de/job/Lucene-9.x-Linux/348/ > with following stack trace > {code:java} > java.lang.AssertionError: expected:<130.0> but was:<1544976.0> > at > __randomizedtesting.SeedInfo.seed([F7826B1EB37D545A:995B6ED46A95D1A0]:0) > at org.junit.Assert.fail(Assert.java:89) > at org.junit.Assert.failNotEquals(Assert.java:835) > at org.junit.Assert.assertEquals(Assert.java:577) > at org.junit.Assert.assertEquals(Assert.java:701) > at > org.apache.lucene.search.TestLRUQueryCache.testCachingAccountableQuery(TestLRUQueryCache.java:570) > at > java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.base/java.lang.reflect.Method.invoke(Method.java:566) > at > com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1754) > at > com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:942) > at > com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:978) > at > com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:992) > ... > {code} > It does not reproduce on my laptop on current main branch, but since the test > is comparing an estimation with a 10% slack, it can fail for sure sometime. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
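For illustration, a rough sketch of the proposed lower-bound assertion. The helper method and the {{cachedQueries}} collection are hypothetical and would live inside {{TestLRUQueryCache}}; the real test builds its queries and cache differently.
{code:java}
import org.apache.lucene.search.LRUQueryCache;
import org.apache.lucene.search.Query;
import org.apache.lucene.util.Accountable;
import static org.junit.Assert.assertTrue;

// Hypothetical helper for the proposed assertion: rather than expecting the cache's
// reported size to match an estimate within 10% slack, only require that it is at
// least as large as the memory held by the cached queries themselves.
static void assertCacheAccountsForQueries(LRUQueryCache queryCache, Iterable<Query> cachedQueries) {
  long totalQueryRamBytes = 0;
  for (Query query : cachedQueries) {
    if (query instanceof Accountable) {
      totalQueryRamBytes += ((Accountable) query).ramBytesUsed();
    }
  }
  assertTrue(queryCache.ramBytesUsed() > totalQueryRamBytes);
}
{code}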
[jira] [Created] (LUCENE-10316) fix TestLRUQueryCache.testCachingAccountableQuery failure
Haoyu Zhai created LUCENE-10316: --- Summary: fix TestLRUQueryCache.testCachingAccountableQuery failure Key: LUCENE-10316 URL: https://issues.apache.org/jira/browse/LUCENE-10316 Project: Lucene - Core Issue Type: Bug Components: core/search Reporter: Haoyu Zhai I saw this build failure: https://jenkins.thetaphi.de/job/Lucene-9.x-Linux/348/ with following stack trace {code:java} java.lang.AssertionError: expected:<130.0> but was:<1544976.0> at __randomizedtesting.SeedInfo.seed([F7826B1EB37D545A:995B6ED46A95D1A0]:0) at org.junit.Assert.fail(Assert.java:89) at org.junit.Assert.failNotEquals(Assert.java:835) at org.junit.Assert.assertEquals(Assert.java:577) at org.junit.Assert.assertEquals(Assert.java:701) at org.apache.lucene.search.TestLRUQueryCache.testCachingAccountableQuery(TestLRUQueryCache.java:570) at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.base/java.lang.reflect.Method.invoke(Method.java:566) at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1754) at com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:942) at com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:978) at com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:992) ... {code} It does not reproduce on my laptop on current main branch, but since the test is comparing an estimation with a 10% slack, it can fail for sure sometime. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10229) Match offsets should be consistent for fields with positions and fields with offsets
[ https://issues.apache.org/jira/browse/LUCENE-10229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17454334#comment-17454334 ] Haoyu Zhai commented on LUCENE-10229: - Here's the PR: https://github.com/apache/lucene/pull/521 > Match offsets should be consistent for fields with positions and fields with > offsets > > > Key: LUCENE-10229 > URL: https://issues.apache.org/jira/browse/LUCENE-10229 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Dawid Weiss >Priority: Major > Time Spent: 0.5h > Remaining Estimate: 0h > > This is a follow-up of LUCENE-10223 in which it was discovered that fields > with > offsets don't highlight some more complex interval queries properly. Alan > says: > {quote} > It's because it returns the position of the inner match, but the offsets of > the outer. And so if you're re-analyzing and retrieving offsets by looking > at the positions, you get the 'right' thing. It's not obvious to me what the > correct response is here, but thinking about it the current behaviour is kind > of the worst of both worlds, and perhaps we should change it so that you get > offsets of the inner match as standard, and then the outer match is returned > as part of the sub matches. > {quote} > Intervals are nicely separated into "basic intervals" and "filters" which > restrict some other source of intervals, here is the original documentation: > https://github.com/apache/lucene/blob/main/lucene/queries/src/java/org/apache/lucene/queries/intervals/package-info.java#L29-L50 > My experience from an extended period of using interval queries in a frontend > where they're highlighted is that filters are restrictions that should not be > highlighted - it's the source intervals that people care about. Filters are > what you remove or where you give proper context to source intervals. > The test code contributed in LUCENE-10223 contains numerous query-highlight > examples (on fields with positions) where this intuition is demonstrated on > all kinds of interval functions: > https://github.com/apache/lucene/blob/main/lucene/highlighter/src/test/org/apache/lucene/search/matchhighlight/TestMatchHighlighter.java#L335-L542 > This issue is about making the internals work consistently for fields with > positions and fields with offsets. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10229) Match offsets should be consistent for fields with positions and fields with offsets
[ https://issues.apache.org/jira/browse/LUCENE-10229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17453652#comment-17453652 ] Haoyu Zhai commented on LUCENE-10229: - Sure I can work on a PR :) > Match offsets should be consistent for fields with positions and fields with > offsets > > > Key: LUCENE-10229 > URL: https://issues.apache.org/jira/browse/LUCENE-10229 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Dawid Weiss >Priority: Major > > This is a follow-up of LUCENE-10223 in which it was discovered that fields > with > offsets don't highlight some more complex interval queries properly. Alan > says: > {quote} > It's because it returns the position of the inner match, but the offsets of > the outer. And so if you're re-analyzing and retrieving offsets by looking > at the positions, you get the 'right' thing. It's not obvious to me what the > correct response is here, but thinking about it the current behaviour is kind > of the worst of both worlds, and perhaps we should change it so that you get > offsets of the inner match as standard, and then the outer match is returned > as part of the sub matches. > {quote} > Intervals are nicely separated into "basic intervals" and "filters" which > restrict some other source of intervals, here is the original documentation: > https://github.com/apache/lucene/blob/main/lucene/queries/src/java/org/apache/lucene/queries/intervals/package-info.java#L29-L50 > My experience from an extended period of using interval queries in a frontend > where they're highlighted is that filters are restrictions that should not be > highlighted - it's the source intervals that people care about. Filters are > what you remove or where you give proper context to source intervals. > The test code contributed in LUCENE-10223 contains numerous query-highlight > examples (on fields with positions) where this intuition is demonstrated on > all kinds of interval functions: > https://github.com/apache/lucene/blob/main/lucene/highlighter/src/test/org/apache/lucene/search/matchhighlight/TestMatchHighlighter.java#L335-L542 > This issue is about making the internals work consistently for fields with > positions and fields with offsets. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10229) Match offsets should be consistent for fields with positions and fields with offsets
[ https://issues.apache.org/jira/browse/LUCENE-10229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17453526#comment-17453526 ] Haoyu Zhai commented on LUCENE-10229: - Seems for {{containedBy}} this inconsistency is introduced [here|https://github.com/apache/lucene/blob/main/lucene/queries/src/java/org/apache/lucene/queries/intervals/ConjunctionMatchesIterator.java#L60,L75], perhaps we could further subclass the {{ConjunctionMatchesIterator}} to a {{FilterMatchesIterator}} to let the offset methods return only offset of "source"? > Match offsets should be consistent for fields with positions and fields with > offsets > > > Key: LUCENE-10229 > URL: https://issues.apache.org/jira/browse/LUCENE-10229 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Dawid Weiss >Priority: Major > > This is a follow-up of LUCENE-10223 in which it was discovered that fields > with > offsets don't highlight some more complex interval queries properly. Alan > says: > {quote} > It's because it returns the position of the inner match, but the offsets of > the outer. And so if you're re-analyzing and retrieving offsets by looking > at the positions, you get the 'right' thing. It's not obvious to me what the > correct response is here, but thinking about it the current behaviour is kind > of the worst of both worlds, and perhaps we should change it so that you get > offsets of the inner match as standard, and then the outer match is returned > as part of the sub matches. > {quote} > Intervals are nicely separated into "basic intervals" and "filters" which > restrict some other source of intervals, here is the original documentation: > https://github.com/apache/lucene/blob/main/lucene/queries/src/java/org/apache/lucene/queries/intervals/package-info.java#L29-L50 > My experience from an extended period of using interval queries in a frontend > where they're highlighted is that filters are restrictions that should not be > highlighted - it's the source intervals that people care about. Filters are > what you remove or where you give proper context to source intervals. > The test code contributed in LUCENE-10223 contains numerous query-highlight > examples (on fields with positions) where this intuition is demonstrated on > all kinds of interval functions: > https://github.com/apache/lucene/blob/main/lucene/highlighter/src/test/org/apache/lucene/search/matchhighlight/TestMatchHighlighter.java#L335-L542 > This issue is about making the internals work consistently for fields with > positions and fields with offsets. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
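To make the idea concrete, here is a rough, hypothetical sketch of such a {{FilterMatchesIterator}}: it simply delegates positions and offsets to the "source" iterator, so fields with offsets would report the inner match the same way fields with positions do. It deliberately omits the conjunction logic that {{ConjunctionMatchesIterator}} performs (advancing the filter iterators in lockstep), which the real change would still need.
{code:java}
import java.io.IOException;
import org.apache.lucene.search.MatchesIterator;
import org.apache.lucene.search.Query;

// Hypothetical sketch: report only the "source" intervals' positions/offsets.
class FilterMatchesIterator implements MatchesIterator {
  private final MatchesIterator source; // the intervals users actually care about

  FilterMatchesIterator(MatchesIterator source) {
    this.source = source;
  }

  @Override
  public boolean next() throws IOException {
    return source.next();
  }

  @Override
  public int startPosition() {
    return source.startPosition();
  }

  @Override
  public int endPosition() {
    return source.endPosition();
  }

  @Override
  public int startOffset() throws IOException {
    return source.startOffset(); // offsets of the inner (source) match, not the outer one
  }

  @Override
  public int endOffset() throws IOException {
    return source.endOffset();
  }

  @Override
  public MatchesIterator getSubMatches() throws IOException {
    return source.getSubMatches();
  }

  @Override
  public Query getQuery() {
    return source.getQuery();
  }
}
{code}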
[jira] [Commented] (LUCENE-10122) Explore using NumericDocValue to store taxonomy parent array
[ https://issues.apache.org/jira/browse/LUCENE-10122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17446058#comment-17446058 ] Haoyu Zhai commented on LUCENE-10122: - Ah Thanks [~jpountz] for reminding, I forgot that, here we go: https://github.com/apache/lucene/pull/454 > Explore using NumericDocValue to store taxonomy parent array > > > Key: LUCENE-10122 > URL: https://issues.apache.org/jira/browse/LUCENE-10122 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/facet >Affects Versions: 9.0 >Reporter: Haoyu Zhai >Priority: Minor > Time Spent: 5.5h > Remaining Estimate: 0h > > We currently use term position of a hardcoded term in a hardcoded field to > represent the parent ordinal of each taxonomy label. That is an old way and > perhaps could be dated back to the time where doc values didn't exist. > We probably would want to use NumericDocValues instead given we have spent > quite a lot of effort optimizing them. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10122) Explore using NumericDocValue to store taxonomy parent array
[ https://issues.apache.org/jira/browse/LUCENE-10122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17444098#comment-17444098 ] Haoyu Zhai commented on LUCENE-10122: - OK, here's the new PR (with the back-compatibility): [https://github.com/apache/lucene/pull/442] [~jpountz] I set that PR to target on 9.0 branch based on the previous email thread, but since we've already in process of releasing please let me know if you want this to be targeting main branch instead. > Explore using NumericDocValue to store taxonomy parent array > > > Key: LUCENE-10122 > URL: https://issues.apache.org/jira/browse/LUCENE-10122 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/facet >Affects Versions: main (10.0) >Reporter: Haoyu Zhai >Priority: Minor > Time Spent: 2h 10m > Remaining Estimate: 0h > > We currently use term position of a hardcoded term in a hardcoded field to > represent the parent ordinal of each taxonomy label. That is an old way and > perhaps could be dated back to the time where doc values didn't exist. > We probably would want to use NumericDocValues instead given we have spent > quite a lot of effort optimizing them. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Comment Edited] (LUCENE-10122) Explore using NumericDocValue to store taxonomy parent array
[ https://issues.apache.org/jira/browse/LUCENE-10122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17438317#comment-17438317 ] Haoyu Zhai edited comment on LUCENE-10122 at 11/5/21, 6:55 PM: --- The luceneutil benchmark shows a mostly neutral result {code:java} TaskQPS base StdDevQPS cand StdDev Pct diff p-value Fuzzy2 58.39 (5.6%) 57.70 (6.1%) -1.2% ( -12% - 11%) 0.518 BrowseDateTaxoFacets2.40 (6.6%)2.38 (5.8%) -0.7% ( -12% - 12%) 0.709 BrowseDayOfYearTaxoFacets2.40 (6.5%)2.38 (5.8%) -0.7% ( -12% - 12%) 0.721 BrowseMonthTaxoFacets2.49 (6.8%)2.47 (6.1%) -0.7% ( -12% - 13%) 0.738 BrowseMonthSSDVFacets 16.44 (36.1%) 16.38 (35.1%) -0.4% ( -52% - 110%) 0.974 LowIntervalsOrdered 30.70 (2.8%) 30.61 (3.0%) -0.3% ( -5% -5%) 0.763 LowPhrase 516.96 (1.7%) 515.67 (1.6%) -0.3% ( -3% -3%) 0.626 OrNotHighHigh 580.07 (2.1%) 578.61 (2.8%) -0.3% ( -5% -4%) 0.747 BrowseDayOfYearSSDVFacets 15.22 (24.2%) 15.19 (24.2%) -0.2% ( -39% - 63%) 0.976 HighTermDayOfYearSort 766.98 (1.7%) 765.20 (1.7%) -0.2% ( -3% -3%) 0.665 HighIntervalsOrdered2.46 (2.0%)2.45 (2.3%) -0.2% ( -4% -4%) 0.795 MedIntervalsOrdered 27.55 (2.8%) 27.51 (2.8%) -0.1% ( -5% -5%) 0.878 IntNRQ 28.96 (0.3%) 28.92 (0.6%) -0.1% ( 0% -0%) 0.358 OrHighHigh 36.05 (2.2%) 36.02 (1.7%) -0.1% ( -3% -3%) 0.870 MedPhrase 119.18 (1.7%) 119.08 (2.0%) -0.1% ( -3% -3%) 0.884 MedSpanNear 99.96 (1.1%) 99.88 (1.2%) -0.1% ( -2% -2%) 0.818 MedTerm 1211.34 (2.4%) 1210.46 (2.2%) -0.1% ( -4% -4%) 0.919 Respell 42.08 (1.9%) 42.06 (2.3%) -0.1% ( -4% -4%) 0.931 OrNotHighLow 608.56 (2.1%) 608.41 (2.4%) -0.0% ( -4% -4%) 0.971 HighSpanNear 38.01 (2.2%) 38.01 (2.9%) -0.0% ( -5% -5%) 0.994 LowSpanNear 94.41 (1.5%) 94.42 (2.1%) 0.0% ( -3% -3%) 0.975 OrHighLow 228.92 (2.4%) 228.98 (1.6%) 0.0% ( -3% -4%) 0.971 OrHighMed 76.23 (2.3%) 76.26 (2.2%) 0.0% ( -4% -4%) 0.951 HighTermTitleBDVSort 19.07 (2.6%) 19.08 (2.5%) 0.0% ( -4% -5%) 0.952 TermDTSort 312.90 (2.0%) 313.18 (2.5%) 0.1% ( -4% -4%) 0.901 PKLookup 153.21 (2.6%) 153.35 (2.5%) 0.1% ( -4% -5%) 0.910 OrHighNotMed 798.03 (2.0%) 798.83 (2.3%) 0.1% ( -4% -4%) 0.883 HighTermMonthSort 103.99 (9.9%) 104.10 (9.7%) 0.1% ( -17% - 21%) 0.971 Wildcard 107.61 (2.1%) 107.74 (2.4%) 0.1% ( -4% -4%) 0.859 Prefix3 82.74 (12.0%) 82.84 (12.1%) 0.1% ( -21% - 27%) 0.973 HighPhrase 67.96 (2.0%) 68.07 (2.0%) 0.2% ( -3% -4%) 0.792 HighTerm 1058.76 (1.8%) 1060.59 (2.7%) 0.2% ( -4% -4%) 0.812 OrHighNotHigh 528.01 (1.8%) 529.17 (2.5%) 0.2% ( -4% -4%) 0.751 Fuzzy1 42.70 (3.0%) 42.80 (3.3%) 0.2% ( -5% -6%) 0.814 OrNotHighMed 613.17 (2.6%) 614.97 (2.6%) 0.3% ( -4% -5%) 0.722 MedSloppyPhrase 15.29 (1.8%) 15.34 (2.2%) 0.3% ( -3% -4%) 0.601 OrHighNotLow 590.46 (2.5%) 592.57 (2.9%) 0.4% ( -4% -5%) 0.677 AndHighLow 518.23 (2.5%) 520.65 (2.9%) 0.5% ( -4% -6%) 0.585 LowTerm 1137.40 (2.9%) 1143.47 (2.8%) 0.5% ( -5% -6%) 0.556 HighSloppyPhrase 10.76 (3.2%) 10.82 (3.6%) 0.6% ( -6% -7%) 0.602 LowSloppyPhrase 152.21 (2.1%) 153.24 (2.4%) 0.7% ( -3% -5%) 0.350 AndHighMed 170.44 (2.5%) 171.76 (3.6%) 0.8% ( -5% -7%) 0.426 AndHighHigh 64.45 (3.2%) 65.07 (4.4%) 1.0% ( -6% -8%) 0.424 {code} And size of taxonomy index does not ch
[jira] [Commented] (LUCENE-10122) Explore using NumericDocValue to store taxonomy parent array
[ https://issues.apache.org/jira/browse/LUCENE-10122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17438317#comment-17438317 ] Haoyu Zhai commented on LUCENE-10122: - The luceneutil benchmark shows a mostly neutral result {code:java} TaskQPS base StdDevQPS cand StdDev Pct diff p-value Fuzzy2 58.39 (5.6%) 57.70 (6.1%) -1.2% ( -12% - 11%) 0.518 BrowseDateTaxoFacets2.40 (6.6%)2.38 (5.8%) -0.7% ( -12% - 12%) 0.709 BrowseDayOfYearTaxoFacets2.40 (6.5%)2.38 (5.8%) -0.7% ( -12% - 12%) 0.721 BrowseMonthTaxoFacets2.49 (6.8%)2.47 (6.1%) -0.7% ( -12% - 13%) 0.738 BrowseMonthSSDVFacets 16.44 (36.1%) 16.38 (35.1%) -0.4% ( -52% - 110%) 0.974 LowIntervalsOrdered 30.70 (2.8%) 30.61 (3.0%) -0.3% ( -5% -5%) 0.763 LowPhrase 516.96 (1.7%) 515.67 (1.6%) -0.3% ( -3% -3%) 0.626 OrNotHighHigh 580.07 (2.1%) 578.61 (2.8%) -0.3% ( -5% -4%) 0.747 BrowseDayOfYearSSDVFacets 15.22 (24.2%) 15.19 (24.2%) -0.2% ( -39% - 63%) 0.976 HighTermDayOfYearSort 766.98 (1.7%) 765.20 (1.7%) -0.2% ( -3% -3%) 0.665 HighIntervalsOrdered2.46 (2.0%)2.45 (2.3%) -0.2% ( -4% -4%) 0.795 MedIntervalsOrdered 27.55 (2.8%) 27.51 (2.8%) -0.1% ( -5% -5%) 0.878 IntNRQ 28.96 (0.3%) 28.92 (0.6%) -0.1% ( 0% -0%) 0.358 OrHighHigh 36.05 (2.2%) 36.02 (1.7%) -0.1% ( -3% -3%) 0.870 MedPhrase 119.18 (1.7%) 119.08 (2.0%) -0.1% ( -3% -3%) 0.884 MedSpanNear 99.96 (1.1%) 99.88 (1.2%) -0.1% ( -2% -2%) 0.818 MedTerm 1211.34 (2.4%) 1210.46 (2.2%) -0.1% ( -4% -4%) 0.919 Respell 42.08 (1.9%) 42.06 (2.3%) -0.1% ( -4% -4%) 0.931 OrNotHighLow 608.56 (2.1%) 608.41 (2.4%) -0.0% ( -4% -4%) 0.971 HighSpanNear 38.01 (2.2%) 38.01 (2.9%) -0.0% ( -5% -5%) 0.994 LowSpanNear 94.41 (1.5%) 94.42 (2.1%) 0.0% ( -3% -3%) 0.975 OrHighLow 228.92 (2.4%) 228.98 (1.6%) 0.0% ( -3% -4%) 0.971 OrHighMed 76.23 (2.3%) 76.26 (2.2%) 0.0% ( -4% -4%) 0.951 HighTermTitleBDVSort 19.07 (2.6%) 19.08 (2.5%) 0.0% ( -4% -5%) 0.952 TermDTSort 312.90 (2.0%) 313.18 (2.5%) 0.1% ( -4% -4%) 0.901 PKLookup 153.21 (2.6%) 153.35 (2.5%) 0.1% ( -4% -5%) 0.910 OrHighNotMed 798.03 (2.0%) 798.83 (2.3%) 0.1% ( -4% -4%) 0.883 HighTermMonthSort 103.99 (9.9%) 104.10 (9.7%) 0.1% ( -17% - 21%) 0.971 Wildcard 107.61 (2.1%) 107.74 (2.4%) 0.1% ( -4% -4%) 0.859 Prefix3 82.74 (12.0%) 82.84 (12.1%) 0.1% ( -21% - 27%) 0.973 HighPhrase 67.96 (2.0%) 68.07 (2.0%) 0.2% ( -3% -4%) 0.792 HighTerm 1058.76 (1.8%) 1060.59 (2.7%) 0.2% ( -4% -4%) 0.812 OrHighNotHigh 528.01 (1.8%) 529.17 (2.5%) 0.2% ( -4% -4%) 0.751 Fuzzy1 42.70 (3.0%) 42.80 (3.3%) 0.2% ( -5% -6%) 0.814 OrNotHighMed 613.17 (2.6%) 614.97 (2.6%) 0.3% ( -4% -5%) 0.722 MedSloppyPhrase 15.29 (1.8%) 15.34 (2.2%) 0.3% ( -3% -4%) 0.601 OrHighNotLow 590.46 (2.5%) 592.57 (2.9%) 0.4% ( -4% -5%) 0.677 AndHighLow 518.23 (2.5%) 520.65 (2.9%) 0.5% ( -4% -6%) 0.585 LowTerm 1137.40 (2.9%) 1143.47 (2.8%) 0.5% ( -5% -6%) 0.556 HighSloppyPhrase 10.76 (3.2%) 10.82 (3.6%) 0.6% ( -6% -7%) 0.602 LowSloppyPhrase 152.21 (2.1%) 153.24 (2.4%) 0.7% ( -3% -5%) 0.350 AndHighMed 170.44 (2.5%) 171.76 (3.6%) 0.8% ( -5% -7%) 0.426 AndHighHigh 64.45 (3.2%) 65.07 (4.4%) 1.0% ( -6% -8%) 0.424 {code} And size of taxonomy index does not change. I've also ran the internal benchmark we use
[jira] [Commented] (LUCENE-9839) TestIndexFileDeleter.testExcInDecRef test failure
[ https://issues.apache.org/jira/browse/LUCENE-9839?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17428549#comment-17428549 ] Haoyu Zhai commented on LUCENE-9839: This same error appeared in my PR's auto check as well: [https://github.com/apache/lucene/runs/3873024350?check_suite_focus=true] And went away after retry... > TestIndexFileDeleter.testExcInDecRef test failure > - > > Key: LUCENE-9839 > URL: https://issues.apache.org/jira/browse/LUCENE-9839 > Project: Lucene - Core > Issue Type: Bug >Reporter: Robert Muir >Priority: Major > > It isn't reproducible for me (at least trying again a single time). I'm > guessing a concurrency issue? > {noformat} > > Task :lucene:core:test > org.apache.lucene.index.TestIndexFileDeleter > testExcInDecRef FAILED > org.apache.lucene.store.AlreadyClosedException: ReaderPool is already > closed > at > __randomizedtesting.SeedInfo.seed([9142DCE874F11926:78DFABDA0238FEDB]:0) > at org.apache.lucene.index.ReaderPool.get(ReaderPool.java:400) > at > org.apache.lucene.index.IndexWriter.writeReaderPool(IndexWriter.java:3760) > at org.apache.lucene.index.IndexWriter.getReader(IndexWriter.java:590) > at > org.apache.lucene.index.RandomIndexWriter.getReader(RandomIndexWriter.java:474) > at > org.apache.lucene.index.RandomIndexWriter.getReader(RandomIndexWriter.java:406) > at > org.apache.lucene.index.TestIndexFileDeleter.testExcInDecRef(TestIndexFileDeleter.java:484) > at > java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:64) > at > java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.base/java.lang.reflect.Method.invoke(Method.java:564) > at > com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1754) > at > com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:942) > at > com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:978) > at > com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:992) > at > org.apache.lucene.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:44) > at > org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43) > at > org.apache.lucene.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:45) > at > org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60) > at > org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44) > at org.junit.rules.RunRules.evaluate(RunRules.java:20) > at > com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36) > at > com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:370) > at > com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:819) > at > com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:470) > at > com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:951) > at > com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:836) > at > com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:887) > at > 
com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:898) > at > org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43) > at > com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36) > at > org.apache.lucene.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:38) > at > com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40) > at > com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40) > at > com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36) > at > com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36) > at > org.apache.lucene.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:51) > at > org.apache.lucen
[jira] [Resolved] (LUCENE-10103) QueryCache not estimating query size properly
[ https://issues.apache.org/jira/browse/LUCENE-10103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haoyu Zhai resolved LUCENE-10103. - Resolution: Fixed > QueryCache not estimating query size properly > - > > Key: LUCENE-10103 > URL: https://issues.apache.org/jira/browse/LUCENE-10103 > Project: Lucene - Core > Issue Type: Improvement > Components: core/search >Reporter: Haoyu Zhai >Priority: Minor > Attachments: query_cache_error_demo.patch > > Time Spent: 50m > Remaining Estimate: 0h > > QueryCache seems estimating the cached query size using a > [constant|https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/LRUQueryCache.java#L302], > it will cause OOM error in some extreme cases where queries cached will use > far more memories than assumed. (The default QueryCache tries to use [only 5% > of > heap|https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/IndexSearcher.java#L89]) > One example of such memory-eating query is AutomatonQuery, it will each > carry a > [RunAutomaton|https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/util/automaton/RunAutomaton.java#L42] > , which consumes a good amount of memory in exchange for the speed. > On the other hand, we actually have a good implementation of {{Accountable}} > interface for AutomatonQuery (though it will become a bit more complicated > later since this query will eventually be rewritten to something else), so > maybe QueryCache could use those estimation directly (using an {{instanceof}} > check)? Or moreover we could make all {{Query}} implement {Accountable}}, and > maybe the default implementation could just be returning the current constant > we're using, and only override the method of the potential troublesome > queries? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10103) QueryCache not estimating query size properly
[ https://issues.apache.org/jira/browse/LUCENE-10103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17428455#comment-17428455 ] Haoyu Zhai commented on LUCENE-10103: - Thank you [~mikemccand], I think we should backport since it's more like a bug-fix, and even if someone was impacted by this change and want to return to previous behavior they would only need to adjust the max size of QueryCache. > QueryCache not estimating query size properly > - > > Key: LUCENE-10103 > URL: https://issues.apache.org/jira/browse/LUCENE-10103 > Project: Lucene - Core > Issue Type: Improvement > Components: core/search >Reporter: Haoyu Zhai >Priority: Minor > Attachments: query_cache_error_demo.patch > > Time Spent: 40m > Remaining Estimate: 0h > > QueryCache seems estimating the cached query size using a > [constant|https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/LRUQueryCache.java#L302], > it will cause OOM error in some extreme cases where queries cached will use > far more memories than assumed. (The default QueryCache tries to use [only 5% > of > heap|https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/IndexSearcher.java#L89]) > One example of such memory-eating query is AutomatonQuery, it will each > carry a > [RunAutomaton|https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/util/automaton/RunAutomaton.java#L42] > , which consumes a good amount of memory in exchange for the speed. > On the other hand, we actually have a good implementation of {{Accountable}} > interface for AutomatonQuery (though it will become a bit more complicated > later since this query will eventually be rewritten to something else), so > maybe QueryCache could use those estimation directly (using an {{instanceof}} > check)? Or moreover we could make all {{Query}} implement {Accountable}}, and > maybe the default implementation could just be returning the current constant > we're using, and only override the method of the potential troublesome > queries? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-9983) Stop sorting determinize powersets unnecessarily
[ https://issues.apache.org/jira/browse/LUCENE-9983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17422937#comment-17422937 ] Haoyu Zhai commented on LUCENE-9983: [~mikemccand] yes we can close it. But it seems I can't close it, could you close it? thank you! > Stop sorting determinize powersets unnecessarily > > > Key: LUCENE-9983 > URL: https://issues.apache.org/jira/browse/LUCENE-9983 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Michael McCandless >Priority: Major > Time Spent: 5h 40m > Remaining Estimate: 0h > > Spinoff from LUCENE-9981. > Today, our {{Operations.determinize}} implementation builds powersets of all > subsets of NFA states that "belong" in the same determinized state, using > [this algorithm|https://en.wikipedia.org/wiki/Powerset_construction]. > To hold each powerset, we use a malleable {{SortedIntSet}} and periodically > freeze it to a {{FrozenIntSet}}, also sorted. We pay a high price to keep > these growing maps of int key, int value sorted by key, e.g. upgrading to a > {{TreeMap}} once the map is large enough (> 30 entries). > But I think sorting is entirely unnecessary here! Really all we need is the > ability to add/delete keys from the map, and hashCode / equals (by key only – > ignoring value!), and to freeze the map (a small optimization that we could > skip initially). We only use these maps to lookup in the (growing) > determinized automaton whether this powerset has already been seen. > Maybe we could simply poach the {{IntIntScatterMap}} implementation from > [HPPC|https://github.com/carrotsearch/hppc]? And then change its > {{hashCode}}/{{equals }}to only use keys (not values). > This change should be a big speedup for the kinds of (admittedly adversarial) > regexps we saw on LUCENE-9981. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10122) Explore using NumericDocValue to store taxonomy parent array
[ https://issues.apache.org/jira/browse/LUCENE-10122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17420016#comment-17420016 ] Haoyu Zhai commented on LUCENE-10122: - Oh my bad, I wanted to say NumericDocValues but typed BinaryDocValues in title, just changed. > Explore using NumericDocValue to store taxonomy parent array > > > Key: LUCENE-10122 > URL: https://issues.apache.org/jira/browse/LUCENE-10122 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/facet >Affects Versions: main (9.0) >Reporter: Haoyu Zhai >Priority: Minor > > We currently use term position of a hardcoded term in a hardcoded field to > represent the parent ordinal of each taxonomy label. That is an old way and > perhaps could be dated back to the time where doc values didn't exist. > We probably would want to use NumericDocValues instead given we have spent > quite a lot of effort optimizing them. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Updated] (LUCENE-10122) Explore using NumericDocValue to store taxonomy parent array
[ https://issues.apache.org/jira/browse/LUCENE-10122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haoyu Zhai updated LUCENE-10122: Summary: Explore using NumericDocValue to store taxonomy parent array (was: Explore using BinaryDocValue to store taxonomy parent array) > Explore using NumericDocValue to store taxonomy parent array > > > Key: LUCENE-10122 > URL: https://issues.apache.org/jira/browse/LUCENE-10122 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/facet >Affects Versions: main (9.0) >Reporter: Haoyu Zhai >Priority: Minor > > We currently use term position of a hardcoded term in a hardcoded field to > represent the parent ordinal of each taxonomy label. That is an old way and > perhaps could be dated back to the time where doc values didn't exist. > We probably would want to use NumericDocValues instead given we have spent > quite a lot of effort optimizing them. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-9969) DirectoryTaxonomyReader.taxoArray uses a large amount of memory, causing the system to go down with OOM
[ https://issues.apache.org/jira/browse/LUCENE-9969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17420012#comment-17420012 ] Haoyu Zhai commented on LUCENE-9969: +1, I created https://issues.apache.org/jira/browse/LUCENE-10122 > DirectoryTaxonomyReader.taxoArray uses a large amount of memory, causing the system to go down with OOM > > > Key: LUCENE-9969 > URL: https://issues.apache.org/jira/browse/LUCENE-9969 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/facet >Affects Versions: 6.6.2 >Reporter: FengFeng Cheng >Priority: Trivial > Attachments: image-2021-05-24-13-43-43-289.png > > Time Spent: 1h 10m > Remaining Estimate: 0h > > First of all, the data volume is very large; the JVM memory is 90G, but TaxonomyIndexArrays takes up almost half of it. > !image-2021-05-24-13-43-43-289.png! > Is there a better way to use TaxonomyReader, or any other optimization? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Created] (LUCENE-10122) Explore using BinaryDocValue to store taxonomy parent array
Haoyu Zhai created LUCENE-10122: --- Summary: Explore using BinaryDocValue to store taxonomy parent array Key: LUCENE-10122 URL: https://issues.apache.org/jira/browse/LUCENE-10122 Project: Lucene - Core Issue Type: Improvement Components: modules/facet Affects Versions: main (9.0) Reporter: Haoyu Zhai We currently use term position of a hardcoded term in a hardcoded field to represent the parent ordinal of each taxonomy label. That is an old way and perhaps could be dated back to the time where doc values didn't exist. We probably would want to use NumericDocValues instead given we have spent quite a lot of effort optimizing them. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
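As a rough illustration of the proposal (the {{"__parent__"}} field name and the helper class are made up for this sketch, not the actual taxonomy writer/reader code), the parent ordinal could be written and read back through doc values roughly like this:
{code:java}
import java.io.IOException;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.NumericDocValuesField;
import org.apache.lucene.index.LeafReader;
import org.apache.lucene.index.NumericDocValues;
import org.apache.lucene.search.DocIdSetIterator;

// Sketch only: the real change would also need back-compat handling for indexes
// that still encode parent ordinals via term positions.
class ParentArraySketch {
  // Indexing side: attach the parent ordinal to the taxonomy label's document.
  static void addParentOrdinal(Document labelDoc, int parentOrdinal) {
    labelDoc.add(new NumericDocValuesField("__parent__", parentOrdinal));
  }

  // Reading side: fill the in-memory parent array from one leaf's doc values.
  static void readParents(LeafReader leafReader, int leafDocBase, int[] parentArray) throws IOException {
    NumericDocValues parents = leafReader.getNumericDocValues("__parent__");
    if (parents == null) {
      return; // field absent in this leaf
    }
    for (int doc = parents.nextDoc(); doc != DocIdSetIterator.NO_MORE_DOCS; doc = parents.nextDoc()) {
      parentArray[leafDocBase + doc] = (int) parents.longValue();
    }
  }
}
{code}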
[jira] [Commented] (LUCENE-10103) QueryCache not estimating query size properly
[ https://issues.apache.org/jira/browse/LUCENE-10103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17416248#comment-17416248 ] Haoyu Zhai commented on LUCENE-10103: - I've attached a unit test showing the problem, in my laptop it will print out: 941634874 358: 187280 So the ramBytesUsage estimated by the AutomatonQuery is 941634874 bytes while the QueryCache "think" the cached query uses only 187280 bytes > QueryCache not estimating query size properly > - > > Key: LUCENE-10103 > URL: https://issues.apache.org/jira/browse/LUCENE-10103 > Project: Lucene - Core > Issue Type: Improvement > Components: core/search >Reporter: Haoyu Zhai >Priority: Minor > Attachments: query_cache_error_demo.patch > > > QueryCache seems estimating the cached query size using a > [constant|https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/LRUQueryCache.java#L302], > it will cause OOM error in some extreme cases where queries cached will use > far more memories than assumed. (The default QueryCache tries to use [only 5% > of > heap|https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/IndexSearcher.java#L89]) > One example of such memory-eating query is AutomatonQuery, it will each > carry a > [RunAutomaton|https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/util/automaton/RunAutomaton.java#L42] > , which consumes a good amount of memory in exchange for the speed. > On the other hand, we actually have a good implementation of {{Accountable}} > interface for AutomatonQuery (though it will become a bit more complicated > later since this query will eventually be rewritten to something else), so > maybe QueryCache could use those estimation directly (using an {{instanceof}} > check)? Or moreover we could make all {{Query}} implement {Accountable}}, and > maybe the default implementation could just be returning the current constant > we're using, and only override the method of the potential troublesome > queries? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Updated] (LUCENE-10103) QueryCache not estimating query size properly
[ https://issues.apache.org/jira/browse/LUCENE-10103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haoyu Zhai updated LUCENE-10103: Attachment: query_cache_error_demo.patch > QueryCache not estimating query size properly > - > > Key: LUCENE-10103 > URL: https://issues.apache.org/jira/browse/LUCENE-10103 > Project: Lucene - Core > Issue Type: Improvement > Components: core/search >Reporter: Haoyu Zhai >Priority: Minor > Attachments: query_cache_error_demo.patch > > > QueryCache seems estimating the cached query size using a > [constant|https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/LRUQueryCache.java#L302], > it will cause OOM error in some extreme cases where queries cached will use > far more memories than assumed. (The default QueryCache tries to use [only 5% > of > heap|https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/IndexSearcher.java#L89]) > One example of such memory-eating query is AutomatonQuery, it will each > carry a > [RunAutomaton|https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/util/automaton/RunAutomaton.java#L42] > , which consumes a good amount of memory in exchange for the speed. > On the other hand, we actually have a good implementation of {{Accountable}} > interface for AutomatonQuery (though it will become a bit more complicated > later since this query will eventually be rewritten to something else), so > maybe QueryCache could use those estimation directly (using an {{instanceof}} > check)? Or moreover we could make all {{Query}} implement {Accountable}}, and > maybe the default implementation could just be returning the current constant > we're using, and only override the method of the potential troublesome > queries? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Created] (LUCENE-10103) QueryCache not estimating query size properly
Haoyu Zhai created LUCENE-10103: --- Summary: QueryCache not estimating query size properly Key: LUCENE-10103 URL: https://issues.apache.org/jira/browse/LUCENE-10103 Project: Lucene - Core Issue Type: Improvement Components: core/search Reporter: Haoyu Zhai QueryCache seems estimating the cached query size using a [constant|https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/LRUQueryCache.java#L302], it will cause OOM error in some extreme cases where queries cached will use far more memories than assumed. (The default QueryCache tries to use [only 5% of heap|https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/IndexSearcher.java#L89]) One example of such memory-eating query is AutomatonQuery, it will each carry a [RunAutomaton|https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/util/automaton/RunAutomaton.java#L42] , which consumes a good amount of memory in exchange for the speed. On the other hand, we actually have a good implementation of {{Accountable}} interface for AutomatonQuery (though it will become a bit more complicated later since this query will eventually be rewritten to something else), so maybe QueryCache could use those estimation directly (using an {{instanceof}} check)? Or moreover we could make all {{Query}} implement {Accountable}}, and maybe the default implementation could just be returning the current constant we're using, and only override the method of the potential troublesome queries? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
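To make the {{instanceof}} idea concrete, a hedged sketch of what the sizing logic could look like (an approximation, not the exact {{LRUQueryCache}} code; {{RamUsageEstimator.QUERY_DEFAULT_RAM_BYTES_USED}} stands in for the flat per-query constant mentioned above):
{code:java}
import org.apache.lucene.search.Query;
import org.apache.lucene.util.Accountable;
import org.apache.lucene.util.RamUsageEstimator;

// Sketch of the proposal: keep the flat per-query constant as the default,
// but trust the query's own ramBytesUsed() when it implements Accountable.
final class QueryRamEstimate {
  static long ramBytesUsed(Query query) {
    if (query instanceof Accountable) {
      // e.g. AutomatonQuery would account for its (potentially large) RunAutomaton
      return ((Accountable) query).ramBytesUsed();
    }
    return RamUsageEstimator.QUERY_DEFAULT_RAM_BYTES_USED;
  }
}
{code}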
[jira] [Commented] (LUCENE-9969) DirectoryTaxonomyReader.taxoArray uses a large amount of memory, causing the system to go down with OOM
[ https://issues.apache.org/jira/browse/LUCENE-9969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17414308#comment-17414308 ] Haoyu Zhai commented on LUCENE-9969: [~gsmiller] thanks for mentioning that, it's surprising that we're currently using term positions to store the parent ordinals. Do you know whether we have a specific reason for doing this? Or is it just because NumericDocValues didn't exist when the code was created? I think we should create a separate issue for switching to NDV if there's no specific reason against that, since it should be faster and (probably) better compressed. > DirectoryTaxonomyReader.taxoArray uses a large amount of memory, causing the system to go down with OOM > > > Key: LUCENE-9969 > URL: https://issues.apache.org/jira/browse/LUCENE-9969 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/facet >Affects Versions: 6.6.2 >Reporter: FengFeng Cheng >Priority: Trivial > Attachments: image-2021-05-24-13-43-43-289.png > > Time Spent: 1h 10m > Remaining Estimate: 0h > > First of all, the data volume is very large; the JVM memory is 90G, but TaxonomyIndexArrays takes up almost half of it. > !image-2021-05-24-13-43-43-289.png! > Is there a better way to use TaxonomyReader, or any other optimization? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10010) Should we have a NFA Query?
[ https://issues.apache.org/jira/browse/LUCENE-10010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17412728#comment-17412728 ] Haoyu Zhai commented on LUCENE-10010: - And I just came across [this issue|https://github.com/mikemccand/luceneutil/issues/139] with the benchmark, essentially it is not counting the query construction time towards the qps numbers, and that's kind of unfair for the NFA vs DFA query comparison since DFA does the determinize work when query is constructed while NFA query does it when query is executing. > Should we have a NFA Query? > --- > > Key: LUCENE-10010 > URL: https://issues.apache.org/jira/browse/LUCENE-10010 > Project: Lucene - Core > Issue Type: New Feature > Components: core/search >Affects Versions: main (9.0) >Reporter: Haoyu Zhai >Priority: Major > Time Spent: 1.5h > Remaining Estimate: 0h > > Today when a {{RegexpQuery}} is created, it will be translated to NFA, > determinized to DFA and eventually become an {{AutomatonQuery}}, which is > very fast. However, not every NFA could be determinized to DFA easily, the > example given in LUCENE-9981 showed how easy could a short regexp break the > determinize process. > Maybe, instead of marking those kind of queries as adversarial cases, we > could make a new kind of NFA query, which execute directly on NFA and thus no > need to worry about determinize process or determinized DFA size. It should > be slower, but also makes those adversarial cases doable. > [This article|https://swtch.com/~rsc/regexp/regexp1.html] has provided a > simple but efficient way of searching over NFA, essentially it is a partial > determinize process that only determinize the necessary part of DFA. Maybe we > could give it a try? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10010) Should we have a NFA Query?
[ https://issues.apache.org/jira/browse/LUCENE-10010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17411388#comment-17411388 ] Haoyu Zhai commented on LUCENE-10010: - I've done a very basic benchmark using the current "Wildcard" task based on the current PR revision. On the wiki10k index, I see a ~6% QPS improvement (295 vs 313). On the wikiall index, I see a ~50% QPS degradation (40 vs 20). Also on the wikiall index, JFR helped me identify that {{getCharClass}} is the biggest hotspot (we have optimized that in the DFA case by using a 256-length array to map each char to its char class; in the NFA case I haven't added that optimization yet, so we do a binary search for every incoming char). I'll try to optimize the current PR and see what numbers we can get in the end. I'll also create a task using more complex regexes; the current Wildcard task is too simple. > Should we have a NFA Query? > --- > > Key: LUCENE-10010 > URL: https://issues.apache.org/jira/browse/LUCENE-10010 > Project: Lucene - Core > Issue Type: New Feature > Components: core/search >Affects Versions: main (9.0) >Reporter: Haoyu Zhai >Priority: Major > Time Spent: 1.5h > Remaining Estimate: 0h > > Today when a {{RegexpQuery}} is created, it will be translated to NFA, > determinized to DFA and eventually become an {{AutomatonQuery}}, which is > very fast. However, not every NFA could be determinized to DFA easily, the > example given in LUCENE-9981 showed how easy could a short regexp break the > determinize process. > Maybe, instead of marking those kind of queries as adversarial cases, we > could make a new kind of NFA query, which execute directly on NFA and thus no > need to worry about determinize process or determinized DFA size. It should > be slower, but also makes those adversarial cases doable. > [This article|https://swtch.com/~rsc/regexp/regexp1.html] has provided a > simple but efficient way of searching over NFA, essentially it is a partial > determinize process that only determinize the necessary part of DFA. Maybe we > could give it a try? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
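For context, the lookup-table optimization mentioned above could look roughly like this (a generic sketch of the technique with a hypothetical {{classBoundaries}} array, not the actual {{RunAutomaton}} or NFA code): pay the binary-search cost once per table entry at construction instead of once per character in the hot loop.
{code:java}
// Generic sketch: precompute char -> character-class for the first 256 code points.
// 'classBoundaries' is a hypothetical sorted array holding the first char of each class.
final class CharClassTable {
  private final int[] table = new int[256];
  private final int[] classBoundaries;

  CharClassTable(int[] classBoundaries) {
    this.classBoundaries = classBoundaries;
    for (int c = 0; c < 256; c++) {
      table[c] = binarySearchClass(classBoundaries, c); // one-time cost at construction
    }
  }

  int getCharClass(int c) {
    // Hot path: array lookup for the common range, binary search only for larger code points.
    return c < 256 ? table[c] : binarySearchClass(classBoundaries, c);
  }

  private static int binarySearchClass(int[] boundaries, int c) {
    // Largest index i such that boundaries[i] <= c (assumes boundaries[0] == 0).
    int lo = 0, hi = boundaries.length - 1;
    while (lo < hi) {
      int mid = (lo + hi + 1) >>> 1;
      if (boundaries[mid] <= c) {
        lo = mid;
      } else {
        hi = mid - 1;
      }
    }
    return lo;
  }
}
{code}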
[jira] [Commented] (LUCENE-10010) Should we have a NFA Query?
[ https://issues.apache.org/jira/browse/LUCENE-10010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17387626#comment-17387626 ] Haoyu Zhai commented on LUCENE-10010: - Here's a WIP PR: https://github.com/apache/lucene/pull/225 > Should we have a NFA Query? > --- > > Key: LUCENE-10010 > URL: https://issues.apache.org/jira/browse/LUCENE-10010 > Project: Lucene - Core > Issue Type: New Feature > Components: core/search >Affects Versions: main (9.0) >Reporter: Haoyu Zhai >Priority: Major > Time Spent: 10m > Remaining Estimate: 0h > > Today when a {{RegexpQuery}} is created, it will be translated to NFA, > determinized to DFA and eventually become an {{AutomatonQuery}}, which is > very fast. However, not every NFA could be determinized to DFA easily, the > example given in LUCENE-9981 showed how easy could a short regexp break the > determinize process. > Maybe, instead of marking those kind of queries as adversarial cases, we > could make a new kind of NFA query, which execute directly on NFA and thus no > need to worry about determinize process or determinized DFA size. It should > be slower, but also makes those adversarial cases doable. > [This article|https://swtch.com/~rsc/regexp/regexp1.html] has provided a > simple but efficient way of searching over NFA, essentially it is a partial > determinize process that only determinize the necessary part of DFA. Maybe we > could give it a try? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10010) Should we have a NFA Query?
[ https://issues.apache.org/jira/browse/LUCENE-10010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17381057#comment-17381057 ] Haoyu Zhai commented on LUCENE-10010: - ??With an NFA, we'd be forced to match every term in the term dictionary? Or what I missing something here?? As far as I understand (and I might be wrong), the way we currently use the DFA to intersect with the term dictionary is to provide an initial term (which might be null) and then, based on that term, find the next acceptable term in lexicographic order. I think this can still be done using an NFA? What I have in mind is doing a partial determinize process by always taking the smallest unvisited transition until an accept state is reached. I think there are mainly two benefits we can get from this new kind of query: # Possibly better performance when queries are not reusable: what we do today is determinize upfront and use the DFA at search time, so if the determinized query can be reused, the determinize cost is amortized to nearly zero. But if it cannot, we have to pay the whole determinization cost every time. With the NFA query, ideally we don't need to determinize the whole NFA every time, so it could be faster than the DFA query. An extreme case is an empty index: the DFA query still needs to determinize, while the NFA query doesn't need to do it at all. # As [~mikemccand] mentioned above, we could be more resilient against ReDoS attacks. I can try to get some code working and benchmark to see whether point 1 holds. > Should we have a NFA Query? > --- > > Key: LUCENE-10010 > URL: https://issues.apache.org/jira/browse/LUCENE-10010 > Project: Lucene - Core > Issue Type: New Feature > Components: core/search >Affects Versions: main (9.0) >Reporter: Haoyu Zhai >Priority: Major > > Today when a {{RegexpQuery}} is created, it will be translated to NFA, > determinized to DFA and eventually become an {{AutomatonQuery}}, which is > very fast. However, not every NFA could be determinized to DFA easily, the > example given in LUCENE-9981 showed how easy could a short regexp break the > determinize process. > Maybe, instead of marking those kind of queries as adversarial cases, we > could make a new kind of NFA query, which execute directly on NFA and thus no > need to worry about determinize process or determinized DFA size. It should > be slower, but also makes those adversarial cases doable. > [This article|https://swtch.com/~rsc/regexp/regexp1.html] has provided a > simple but efficient way of searching over NFA, essentially it is a partial > determinize process that only determinize the necessary part of DFA. Maybe we > could give it a try? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
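To make the partial-determinization intuition concrete, here is a toy sketch of the subset simulation from the article linked in the issue description: the NFA is executed by tracking the set of live states per input character, so only the state subsets actually reached ever get built. The NFA representation is invented for this sketch and is not Lucene's {{Automaton}} API; the term-dictionary seek ("find the next acceptable term") would need more machinery on top of this.
{code:java}
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.Deque;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Toy NFA: state -> (label -> target states); label -1 is used as epsilon in this sketch. */
class NfaSimulator {
  private final Map<Integer, Map<Integer, Set<Integer>>> transitions;
  private final Set<Integer> acceptStates;
  private final int initialState;

  NfaSimulator(Map<Integer, Map<Integer, Set<Integer>>> transitions,
               Set<Integer> acceptStates, int initialState) {
    this.transitions = transitions;
    this.acceptStates = acceptStates;
    this.initialState = initialState;
  }

  /** True if the NFA accepts the term; only state subsets actually visited are ever built. */
  boolean accepts(String term) {
    Set<Integer> current = closure(Collections.singleton(initialState));
    for (int i = 0; i < term.length() && !current.isEmpty(); i++) {
      int label = term.charAt(i);
      Set<Integer> next = new HashSet<>();
      for (int state : current) {
        next.addAll(targets(state, label));
      }
      current = closure(next);
    }
    for (int state : current) {
      if (acceptStates.contains(state)) {
        return true;
      }
    }
    return false;
  }

  private Set<Integer> targets(int state, int label) {
    return transitions.getOrDefault(state, Collections.emptyMap())
        .getOrDefault(label, Collections.emptySet());
  }

  private Set<Integer> closure(Set<Integer> states) {
    Set<Integer> closed = new HashSet<>(states);
    Deque<Integer> stack = new ArrayDeque<>(states);
    while (!stack.isEmpty()) {
      int state = stack.pop();
      for (int target : targets(state, -1)) {  // follow epsilon transitions
        if (closed.add(target)) {
          stack.push(target);
        }
      }
    }
    return closed;
  }
}
{code}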
[jira] [Created] (LUCENE-10021) Upgrade HPPC to 0.9.0
Haoyu Zhai created LUCENE-10021: --- Summary: Upgrade HPPC to 0.9.0 Key: LUCENE-10021 URL: https://issues.apache.org/jira/browse/LUCENE-10021 Project: Lucene - Core Issue Type: Improvement Components: modules/facet Reporter: Haoyu Zhai HPPC 0.9.0 is out and we should probably upgrade. The {{...ScatterMap}} classes were deprecated in 0.9.0 and I think we're still using them in a few places, so we should probably measure the performance impact, if there is any. (According to the [release notes|https://github.com/carrotsearch/hppc/releases] there shouldn't be any.) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Created] (LUCENE-10010) Should we have a NFA Query?
Haoyu Zhai created LUCENE-10010: --- Summary: Should we have a NFA Query? Key: LUCENE-10010 URL: https://issues.apache.org/jira/browse/LUCENE-10010 Project: Lucene - Core Issue Type: New Feature Components: core/search Affects Versions: main (9.0) Reporter: Haoyu Zhai Today when a {{RegexpQuery}} is created, it is translated to an NFA, determinized to a DFA and eventually becomes an {{AutomatonQuery}}, which is very fast. However, not every NFA can be determinized to a DFA easily; the example given in LUCENE-9981 showed how easily a short regexp can break the determinize process. Maybe, instead of marking those kinds of queries as adversarial cases, we could make a new kind of NFA query, which executes directly on the NFA and thus doesn't need to worry about the determinize process or the determinized DFA size. It should be slower, but it also makes those adversarial cases doable. [This article|https://swtch.com/~rsc/regexp/regexp1.html] describes a simple but efficient way of searching over an NFA; essentially it is a partial determinize process that only determinizes the necessary part of the DFA. Maybe we could give it a try? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
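For context, a small snippet showing the up-front determinization that happens today when such a query is built. The work-limit constant is passed explicitly here and its name and default differ across Lucene versions, so treat this as a sketch rather than exact current API usage.
{code:java}
import org.apache.lucene.index.Term;
import org.apache.lucene.search.RegexpQuery;
import org.apache.lucene.util.automaton.Automaton;
import org.apache.lucene.util.automaton.Operations;
import org.apache.lucene.util.automaton.RegExp;

public class RegexpToDfa {
  public static void main(String[] args) {
    // RegexpQuery does roughly this internally: parse, build an automaton, determinize.
    Automaton nfa = new RegExp("a[pl]*e").toAutomaton();
    Automaton dfa = Operations.determinize(nfa, 10_000); // can blow up on adversarial regexps
    System.out.println("deterministic: " + dfa.isDeterministic());

    // The query constructor pays the same determinize cost up front.
    RegexpQuery q = new RegexpQuery(new Term("body", "a[pl]*e"));
    System.out.println(q);
  }
}
{code}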
[jira] [Commented] (LUCENE-9983) Stop sorting determinize powersets unnecessarily
[ https://issues.apache.org/jira/browse/LUCENE-9983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17363284#comment-17363284 ] Haoyu Zhai commented on LUCENE-9983: {quote}Could you maybe open PR to add that initial set of synthetic regexps into {{luceneutil}}? {quote} OK, opened one: https://github.com/mikemccand/luceneutil/pull/130 > Stop sorting determinize powersets unnecessarily > > > Key: LUCENE-9983 > URL: https://issues.apache.org/jira/browse/LUCENE-9983 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Michael McCandless >Priority: Major > Time Spent: 5h 10m > Remaining Estimate: 0h > > Spinoff from LUCENE-9981. > Today, our {{Operations.determinize}} implementation builds powersets of all > subsets of NFA states that "belong" in the same determinized state, using > [this algorithm|https://en.wikipedia.org/wiki/Powerset_construction]. > To hold each powerset, we use a malleable {{SortedIntSet}} and periodically > freeze it to a {{FrozenIntSet}}, also sorted. We pay a high price to keep > these growing maps of int key, int value sorted by key, e.g. upgrading to a > {{TreeMap}} once the map is large enough (> 30 entries). > But I think sorting is entirely unnecessary here! Really all we need is the > ability to add/delete keys from the map, and hashCode / equals (by key only – > ignoring value!), and to freeze the map (a small optimization that we could > skip initially). We only use these maps to lookup in the (growing) > determinized automaton whether this powerset has already been seen. > Maybe we could simply poach the {{IntIntScatterMap}} implementation from > [HPPC|https://github.com/carrotsearch/hppc]? And then change its > {{hashCode}}/{{equals }}to only use keys (not values). > This change should be a big speedup for the kinds of (admittedly adversarial) > regexps we saw on LUCENE-9981. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-9983) Stop sorting determinize powersets unnecessarily
[ https://issues.apache.org/jira/browse/LUCENE-9983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17360495#comment-17360495 ] Haoyu Zhai commented on LUCENE-9983: I constructed a file with 235k words, where some part of each word is randomly replaced by a regex construct (like "apple" to "a[pl]*e"). Then I warm up for 10 rounds and run 20 rounds to measure the average time of constructing a {{RegexpQuery}} for each of those words. Here are the results I got: || ||Baseline ||IntIntHashMap||IntIntWormMap||int[128] + IntIntHashMap|| |Time|23.55|23.61|23.78|23.69| So in the normal case the original code and {{IntIntHashMap}} have very similar performance, while the other choices all seem to show some performance loss. > Stop sorting determinize powersets unnecessarily > > > Key: LUCENE-9983 > URL: https://issues.apache.org/jira/browse/LUCENE-9983 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Michael McCandless >Priority: Major > Time Spent: 4h 20m > Remaining Estimate: 0h > > Spinoff from LUCENE-9981. > Today, our {{Operations.determinize}} implementation builds powersets of all > subsets of NFA states that "belong" in the same determinized state, using > [this algorithm|https://en.wikipedia.org/wiki/Powerset_construction]. > To hold each powerset, we use a malleable {{SortedIntSet}} and periodically > freeze it to a {{FrozenIntSet}}, also sorted. We pay a high price to keep > these growing maps of int key, int value sorted by key, e.g. upgrading to a > {{TreeMap}} once the map is large enough (> 30 entries). > But I think sorting is entirely unnecessary here! Really all we need is the > ability to add/delete keys from the map, and hashCode / equals (by key only – > ignoring value!), and to freeze the map (a small optimization that we could > skip initially). We only use these maps to lookup in the (growing) > determinized automaton whether this powerset has already been seen. > Maybe we could simply poach the {{IntIntScatterMap}} implementation from > [HPPC|https://github.com/carrotsearch/hppc]? And then change its > {{hashCode}}/{{equals }}to only use keys (not values). > This change should be a big speedup for the kinds of (admittedly adversarial) > regexps we saw on LUCENE-9981. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
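For anyone who wants to reproduce this, the measurement loop is roughly the following; the file name, field name and the way results are accumulated are assumptions, and only the warmup/measure round counts come from the comment above.
{code:java}
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.RegexpQuery;

public class RegexpConstructionBench {
  static long sink; // keep the JIT from dropping the constructed queries

  public static void main(String[] args) throws Exception {
    List<String> regexps = Files.readAllLines(Paths.get("wiki-regexps.txt")); // hypothetical file
    for (int i = 0; i < 10; i++) {
      buildAll(regexps);                                   // warmup rounds
    }
    long totalNanos = 0;
    for (int i = 0; i < 20; i++) {
      long start = System.nanoTime();
      buildAll(regexps);                                   // measured rounds
      totalNanos += System.nanoTime() - start;
    }
    System.out.println("avg sec/round: " + totalNanos / 20 / 1_000_000_000.0);
  }

  private static void buildAll(List<String> regexps) {
    for (String r : regexps) {
      // Construction is where determinize runs, so that is the cost being measured.
      sink += new RegexpQuery(new Term("body", r)).hashCode();
    }
  }
}
{code}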
[jira] [Commented] (LUCENE-9983) Stop sorting determinize powersets unnecessarily
[ https://issues.apache.org/jira/browse/LUCENE-9983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17358789#comment-17358789 ] Haoyu Zhai commented on LUCENE-9983: [~broustant] in the adversarial test case, I added 3 static counters to measure the average and max set size we see, and the result is that we're seeing 1800+ states on average and 24000 states at most. I record the set size each time we call {{size()}} (basically each iteration) to calculate the average, so it might not be very accurate. [~mikemccand] ah thanks, my bad, I didn't realize {{determinize}} is called at construction time. I'll benchmark that. > Stop sorting determinize powersets unnecessarily > > > Key: LUCENE-9983 > URL: https://issues.apache.org/jira/browse/LUCENE-9983 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Michael McCandless >Priority: Major > Time Spent: 4h 20m > Remaining Estimate: 0h > > Spinoff from LUCENE-9981. > Today, our {{Operations.determinize}} implementation builds powersets of all > subsets of NFA states that "belong" in the same determinized state, using > [this algorithm|https://en.wikipedia.org/wiki/Powerset_construction]. > To hold each powerset, we use a malleable {{SortedIntSet}} and periodically > freeze it to a {{FrozenIntSet}}, also sorted. We pay a high price to keep > these growing maps of int key, int value sorted by key, e.g. upgrading to a > {{TreeMap}} once the map is large enough (> 30 entries). > But I think sorting is entirely unnecessary here! Really all we need is the > ability to add/delete keys from the map, and hashCode / equals (by key only – > ignoring value!), and to freeze the map (a small optimization that we could > skip initially). We only use these maps to lookup in the (growing) > determinized automaton whether this powerset has already been seen. > Maybe we could simply poach the {{IntIntScatterMap}} implementation from > [HPPC|https://github.com/carrotsearch/hppc]? And then change its > {{hashCode}}/{{equals }}to only use keys (not values). > This change should be a big speedup for the kinds of (admittedly adversarial) > regexps we saw on LUCENE-9981. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-9983) Stop sorting determinize powersets unnecessarily
[ https://issues.apache.org/jira/browse/LUCENE-9983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17357749#comment-17357749 ] Haoyu Zhai commented on LUCENE-9983: I just realized that we already have several tasks that compare the performance of regexp queries, such as [here|https://github.com/mikemccand/luceneutil/blob/master/tasks/wikimedium.10M.nostopwords.tasks#L5238]. So I've done some benchmarking comparing the PR as well as another commit that is based on the PR but adds a 128-size int array to make access to the counts of the first 128 states faster. The results showed that neither candidate has much qps difference (within 1%) compared to the baseline on the "Wildcard" and "Prefix3" tasks. If the benchmark results are reliable (meaning I didn't mess up the configuration etc.) I think the new PR won't affect the normal case much, and the additional optimization doesn't seem to have a visible benefit. So I think it might be better to start with just using {{IntIntHashMap}} to keep things simpler? I'll update the PR accordingly. > Stop sorting determinize powersets unnecessarily > > > Key: LUCENE-9983 > URL: https://issues.apache.org/jira/browse/LUCENE-9983 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Michael McCandless >Priority: Major > Time Spent: 4h 20m > Remaining Estimate: 0h > > Spinoff from LUCENE-9981. > Today, our {{Operations.determinize}} implementation builds powersets of all > subsets of NFA states that "belong" in the same determinized state, using > [this algorithm|https://en.wikipedia.org/wiki/Powerset_construction]. > To hold each powerset, we use a malleable {{SortedIntSet}} and periodically > freeze it to a {{FrozenIntSet}}, also sorted. We pay a high price to keep > these growing maps of int key, int value sorted by key, e.g. upgrading to a > {{TreeMap}} once the map is large enough (> 30 entries). > But I think sorting is entirely unnecessary here! Really all we need is the > ability to add/delete keys from the map, and hashCode / equals (by key only – > ignoring value!), and to freeze the map (a small optimization that we could > skip initially). We only use these maps to lookup in the (growing) > determinized automaton whether this powerset has already been seen. > Maybe we could simply poach the {{IntIntScatterMap}} implementation from > [HPPC|https://github.com/carrotsearch/hppc]? And then change its > {{hashCode}}/{{equals }}to only use keys (not values). > This change should be a big speedup for the kinds of (admittedly adversarial) > regexps we saw on LUCENE-9981. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-9983) Stop sorting determinize powersets unnecessarily
[ https://issues.apache.org/jira/browse/LUCENE-9983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17355852#comment-17355852 ] Haoyu Zhai commented on LUCENE-9983: +1 to have a set of regexps so that we can benchmark them, I'm also a little worried the PR might make the normal cases worse too. [~broustant] That is a good idea, I've tried to use a 128 size array as a map for first 128 states and it doesn't help the adversarial cases (I also pulled out some stats and found in adversarial cases states are actually much more than that number). But I think we might see some benefits from the normal cases once we have benchmark set up. > Stop sorting determinize powersets unnecessarily > > > Key: LUCENE-9983 > URL: https://issues.apache.org/jira/browse/LUCENE-9983 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Michael McCandless >Priority: Major > Time Spent: 2h 50m > Remaining Estimate: 0h > > Spinoff from LUCENE-9981. > Today, our {{Operations.determinize}} implementation builds powersets of all > subsets of NFA states that "belong" in the same determinized state, using > [this algorithm|https://en.wikipedia.org/wiki/Powerset_construction]. > To hold each powerset, we use a malleable {{SortedIntSet}} and periodically > freeze it to a {{FrozenIntSet}}, also sorted. We pay a high price to keep > these growing maps of int key, int value sorted by key, e.g. upgrading to a > {{TreeMap}} once the map is large enough (> 30 entries). > But I think sorting is entirely unnecessary here! Really all we need is the > ability to add/delete keys from the map, and hashCode / equals (by key only – > ignoring value!), and to freeze the map (a small optimization that we could > skip initially). We only use these maps to lookup in the (growing) > determinized automaton whether this powerset has already been seen. > Maybe we could simply poach the {{IntIntScatterMap}} implementation from > [HPPC|https://github.com/carrotsearch/hppc]? And then change its > {{hashCode}}/{{equals }}to only use keys (not values). > This change should be a big speedup for the kinds of (admittedly adversarial) > regexps we saw on LUCENE-9981. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-9983) Stop sorting determinize powersets unnecessarily
[ https://issues.apache.org/jira/browse/LUCENE-9983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17354660#comment-17354660 ] Haoyu Zhai commented on LUCENE-9983: I've added a simple static counter just for the adversarial test, and here's the stats: * {{incr}} called: 106073079 * entry added to set: 100076079 * {{decr}} called: 106069079 * entry removed from set: 100072079 * {{computeHash}} called: 40057 * {{freeze}} called: 14056 So seems to me my guess above holds, we're doing way more put/remove entry operations than others > Stop sorting determinize powersets unnecessarily > > > Key: LUCENE-9983 > URL: https://issues.apache.org/jira/browse/LUCENE-9983 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Michael McCandless >Priority: Major > Time Spent: 40m > Remaining Estimate: 0h > > Spinoff from LUCENE-9981. > Today, our {{Operations.determinize}} implementation builds powersets of all > subsets of NFA states that "belong" in the same determinized state, using > [this algorithm|https://en.wikipedia.org/wiki/Powerset_construction]. > To hold each powerset, we use a malleable {{SortedIntSet}} and periodically > freeze it to a {{FrozenIntSet}}, also sorted. We pay a high price to keep > these growing maps of int key, int value sorted by key, e.g. upgrading to a > {{TreeMap}} once the map is large enough (> 30 entries). > But I think sorting is entirely unnecessary here! Really all we need is the > ability to add/delete keys from the map, and hashCode / equals (by key only – > ignoring value!), and to freeze the map (a small optimization that we could > skip initially). We only use these maps to lookup in the (growing) > determinized automaton whether this powerset has already been seen. > Maybe we could simply poach the {{IntIntScatterMap}} implementation from > [HPPC|https://github.com/carrotsearch/hppc]? And then change its > {{hashCode}}/{{equals }}to only use keys (not values). > This change should be a big speedup for the kinds of (admittedly adversarial) > regexps we saw on LUCENE-9981. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Comment Edited] (LUCENE-9983) Stop sorting determinize powersets unnecessarily
[ https://issues.apache.org/jira/browse/LUCENE-9983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17354659#comment-17354659 ] Haoyu Zhai edited comment on LUCENE-9983 at 5/31/21, 9:07 PM: -- Thanks [~mikemccand] and [~dweiss]. I've opened a PR based on {{IntIntHashMap}}: [https://github.com/apache/lucene/pull/163] I've applied the test attached in LUCENE-9981 to verify this PR helps. Seems it successfully reduce the time it need before throwing the exception from 6 min to 16 sec on my local machine (they both stoped at the same point as well). I still kept the state array to be sorted when get it, so we'll be slower when actually getting array but way faster on putting/removing keys. I'm not quite sure why the speed up is this much, but my guess is we're doing way more operations and spending way more times on increasing/decreasing state count and putting/removing states from the set than introducing new states? was (Author: zhai7631): Thanks [~mikemccand] and [~dweiss]. I've opened a PR based on {{IntIntHashMap}}: [https://github.com/apache/lucene/pull/162] I've applied the test attached in LUCENE-9981 to verify this PR helps. Seems it successfully reduce the time it need before throwing the exception from 6 min to 16 sec on my local machine (they both stoped at the same point as well). I still kept the state array to be sorted when get it, so we'll be slower when actually getting array but way faster on putting/removing keys. I'm not quite sure why the speed up is this much, but my guess is we're doing way more operations and spending way more times on increasing/decreasing state count and putting/removing states from the set than introducing new states? > Stop sorting determinize powersets unnecessarily > > > Key: LUCENE-9983 > URL: https://issues.apache.org/jira/browse/LUCENE-9983 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Michael McCandless >Priority: Major > Time Spent: 40m > Remaining Estimate: 0h > > Spinoff from LUCENE-9981. > Today, our {{Operations.determinize}} implementation builds powersets of all > subsets of NFA states that "belong" in the same determinized state, using > [this algorithm|https://en.wikipedia.org/wiki/Powerset_construction]. > To hold each powerset, we use a malleable {{SortedIntSet}} and periodically > freeze it to a {{FrozenIntSet}}, also sorted. We pay a high price to keep > these growing maps of int key, int value sorted by key, e.g. upgrading to a > {{TreeMap}} once the map is large enough (> 30 entries). > But I think sorting is entirely unnecessary here! Really all we need is the > ability to add/delete keys from the map, and hashCode / equals (by key only – > ignoring value!), and to freeze the map (a small optimization that we could > skip initially). We only use these maps to lookup in the (growing) > determinized automaton whether this powerset has already been seen. > Maybe we could simply poach the {{IntIntScatterMap}} implementation from > [HPPC|https://github.com/carrotsearch/hppc]? And then change its > {{hashCode}}/{{equals }}to only use keys (not values). > This change should be a big speedup for the kinds of (admittedly adversarial) > regexps we saw on LUCENE-9981. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-9983) Stop sorting determinize powersets unnecessarily
[ https://issues.apache.org/jira/browse/LUCENE-9983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17354659#comment-17354659 ] Haoyu Zhai commented on LUCENE-9983: Thanks [~mikemccand] and [~dweiss]. I've opened a PR based on {{IntIntHashMap}}: [https://github.com/apache/lucene/pull/162] I've applied the test attached in LUCENE-9981 to verify this PR helps. Seems it successfully reduce the time it need before throwing the exception from 6 min to 16 sec on my local machine (they both stoped at the same point as well). I still kept the state array to be sorted when get it, so we'll be slower when actually getting array but way faster on putting/removing keys. I'm not quite sure why the speed up is this much, but my guess is we're doing way more operations and spending way more times on increasing/decreasing state count and putting/removing states from the set than introducing new states? > Stop sorting determinize powersets unnecessarily > > > Key: LUCENE-9983 > URL: https://issues.apache.org/jira/browse/LUCENE-9983 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Michael McCandless >Priority: Major > Time Spent: 40m > Remaining Estimate: 0h > > Spinoff from LUCENE-9981. > Today, our {{Operations.determinize}} implementation builds powersets of all > subsets of NFA states that "belong" in the same determinized state, using > [this algorithm|https://en.wikipedia.org/wiki/Powerset_construction]. > To hold each powerset, we use a malleable {{SortedIntSet}} and periodically > freeze it to a {{FrozenIntSet}}, also sorted. We pay a high price to keep > these growing maps of int key, int value sorted by key, e.g. upgrading to a > {{TreeMap}} once the map is large enough (> 30 entries). > But I think sorting is entirely unnecessary here! Really all we need is the > ability to add/delete keys from the map, and hashCode / equals (by key only – > ignoring value!), and to freeze the map (a small optimization that we could > skip initially). We only use these maps to lookup in the (growing) > determinized automaton whether this powerset has already been seen. > Maybe we could simply poach the {{IntIntScatterMap}} implementation from > [HPPC|https://github.com/carrotsearch/hppc]? And then change its > {{hashCode}}/{{equals }}to only use keys (not values). > This change should be a big speedup for the kinds of (admittedly adversarial) > regexps we saw on LUCENE-9981. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
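The shape of the change described in this comment is roughly the following (a simplified sketch, not the actual patch): per-powerset state counts are kept in an unsorted {{IntIntHashMap}}, and sorting only happens when a powerset is frozen into a lookup key.
{code:java}
import java.util.Arrays;
import com.carrotsearch.hppc.IntIntHashMap;

class StateSetSketch {
  private final IntIntHashMap counts = new IntIntHashMap();

  void incr(int state) {
    counts.addTo(state, 1);          // hot path: no ordering maintained
  }

  void decr(int state) {
    int count = counts.get(state);
    if (count <= 1) {
      counts.remove(state);          // state left the powerset
    } else {
      counts.put(state, count - 1);
    }
  }

  /** Pay the sort cost only once per distinct powerset, when it becomes a hash key. */
  int[] freeze() {
    int[] states = counts.keys().toArray();
    Arrays.sort(states);
    return states;
  }
}
{code}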
[jira] [Commented] (LUCENE-9983) Stop sorting determinize powersets unnecessarily
[ https://issues.apache.org/jira/browse/LUCENE-9983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17354237#comment-17354237 ] Haoyu Zhai commented on LUCENE-9983: Oh, I realized we're still going to iterate over those frozen sets [here|https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/util/automaton/Operations.java#L705], so maybe a bitset is not a good choice? What about just iterating over the keys and creating a {{FrozenIntSet}} based on that? Since we're going to copy those keys anyway, it should only add a little more overhead compared to the current implementation, while getting the benefit of using a lightweight, sort-free data structure? > Stop sorting determinize powersets unnecessarily > > > Key: LUCENE-9983 > URL: https://issues.apache.org/jira/browse/LUCENE-9983 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Michael McCandless >Priority: Major > > Spinoff from LUCENE-9981. > Today, our {{Operations.determinize}} implementation builds powersets of all > subsets of NFA states that "belong" in the same determinized state, using > [this algorithm|https://en.wikipedia.org/wiki/Powerset_construction]. > To hold each powerset, we use a malleable {{SortedIntSet}} and periodically > freeze it to a {{FrozenIntSet}}, also sorted. We pay a high price to keep > these growing maps of int key, int value sorted by key, e.g. upgrading to a > {{TreeMap}} once the map is large enough (> 30 entries). > But I think sorting is entirely unnecessary here! Really all we need is the > ability to add/delete keys from the map, and hashCode / equals (by key only – > ignoring value!), and to freeze the map (a small optimization that we could > skip initially). We only use these maps to lookup in the (growing) > determinized automaton whether this powerset has already been seen. > Maybe we could simply poach the {{IntIntScatterMap}} implementation from > [HPPC|https://github.com/carrotsearch/hppc]? And then change its > {{hashCode}}/{{equals }}to only use keys (not values). > This change should be a big speedup for the kinds of (admittedly adversarial) > regexps we saw on LUCENE-9981. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-9983) Stop sorting determinize powersets unnecessarily
[ https://issues.apache.org/jira/browse/LUCENE-9983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17354225#comment-17354225 ] Haoyu Zhai commented on LUCENE-9983: Hi Mike, So if I understand correctly what we really need is a map that could maps key (which is state) to its count, and remove the state when count goes to 0 while iterating the intervals? And freeze seems to be necessary since we want to make a snapshot of the key set to use it as a hash key? I'm thinking about using an {{IntIntHashMap}} along with a {{FixedBitSet}}, so that we keep the count using the map and use the snapshot of the bitset as hash key. What do you think? > Stop sorting determinize powersets unnecessarily > > > Key: LUCENE-9983 > URL: https://issues.apache.org/jira/browse/LUCENE-9983 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Michael McCandless >Priority: Major > > Spinoff from LUCENE-9981. > Today, our {{Operations.determinize}} implementation builds powersets of all > subsets of NFA states that "belong" in the same determinized state, using > [this algorithm|https://en.wikipedia.org/wiki/Powerset_construction]. > To hold each powerset, we use a malleable {{SortedIntSet}} and periodically > freeze it to a {{FrozenIntSet}}, also sorted. We pay a high price to keep > these growing maps of int key, int value sorted by key, e.g. upgrading to a > {{TreeMap}} once the map is large enough (> 30 entries). > But I think sorting is entirely unnecessary here! Really all we need is the > ability to add/delete keys from the map, and hashCode / equals (by key only – > ignoring value!), and to freeze the map (a small optimization that we could > skip initially). We only use these maps to lookup in the (growing) > determinized automaton whether this powerset has already been seen. > Maybe we could simply poach the {{IntIntScatterMap}} implementation from > [HPPC|https://github.com/carrotsearch/hppc]? And then change its > {{hashCode}}/{{equals }}to only use keys (not values). > This change should be a big speedup for the kinds of (admittedly adversarial) > regexps we saw on LUCENE-9981. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-9694) New tool for creating a deterministic index
[ https://issues.apache.org/jira/browse/LUCENE-9694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17272291#comment-17272291 ] Haoyu Zhai commented on LUCENE-9694: Oh I didn't include that since I want to keep it as generic as possible. But I guess I could add an example {{DocumentSelector}} as suggested by Mike in PR. > New tool for creating a deterministic index > --- > > Key: LUCENE-9694 > URL: https://issues.apache.org/jira/browse/LUCENE-9694 > Project: Lucene - Core > Issue Type: New Feature > Components: general/tools >Reporter: Haoyu Zhai >Priority: Minor > Time Spent: 0.5h > Remaining Estimate: 0h > > Lucene's index is segmented, and sometimes number of segments and documents > arrangement greatly impact performance. > Given a stable index sort, our team create a tool that records document > arrangement (called index map) of an index and rearrange another index > (consists of same documents) into the same structure (segment num, and > documents included in each segment). > This tool could be also used in lucene benchmarks for a faster deterministic > index construction (if I understand correctly lucene benchmark is using a > single thread manner to achieve this). > > We've already had some discussion in email > [https://markmail.org/message/lbtdntclpnocmfuf] > And I've implemented the first method, using {{IndexWriter.addIndexes}} and a > customized {{FilteredCodecReader}} to achieve the goal. The index > construction time is about 25min and time executing this tool is about 10min. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-9694) New tool for creating a deterministic index
[ https://issues.apache.org/jira/browse/LUCENE-9694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17271790#comment-17271790 ] Haoyu Zhai commented on LUCENE-9694: I've opened a PR for this: https://github.com/apache/lucene-solr/pull/2246 > New tool for creating a deterministic index > --- > > Key: LUCENE-9694 > URL: https://issues.apache.org/jira/browse/LUCENE-9694 > Project: Lucene - Core > Issue Type: New Feature > Components: general/tools >Reporter: Haoyu Zhai >Priority: Minor > Time Spent: 10m > Remaining Estimate: 0h > > Lucene's index is segmented, and sometimes number of segments and documents > arrangement greatly impact performance. > Given a stable index sort, our team create a tool that records document > arrangement (called index map) of an index and rearrange another index > (consists of same documents) into the same structure (segment num, and > documents included in each segment). > This tool could be also used in lucene benchmarks for a faster deterministic > index construction (if I understand correctly lucene benchmark is using a > single thread manner to achieve this). > > We've already had some discussion in email > [https://markmail.org/message/lbtdntclpnocmfuf] > And I've implemented the first method, using {{IndexWriter.addIndexes}} and a > customized {{FilteredCodecReader}} to achieve the goal. The index > construction time is about 25min and time executing this tool is about 10min. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Created] (LUCENE-9694) New tool for creating a deterministic index
Haoyu Zhai created LUCENE-9694: -- Summary: New tool for creating a deterministic index Key: LUCENE-9694 URL: https://issues.apache.org/jira/browse/LUCENE-9694 Project: Lucene - Core Issue Type: New Feature Components: general/tools Reporter: Haoyu Zhai Lucene's index is segmented, and sometimes the number of segments and the arrangement of documents greatly impact performance. Given a stable index sort, our team created a tool that records the document arrangement (called an index map) of an index and rearranges another index (consisting of the same documents) into the same structure (segment count, and the documents included in each segment). This tool could also be used in the lucene benchmarks for faster deterministic index construction (if I understand correctly, the lucene benchmark currently uses a single-threaded approach to achieve this). We've already had some discussion by email: [https://markmail.org/message/lbtdntclpnocmfuf] And I've implemented the first method, using {{IndexWriter.addIndexes}} and a customized {{FilteredCodecReader}}, to achieve the goal. The index construction time is about 25min and the time to run this tool is about 10min. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
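A rough sketch of that approach, assuming the per-segment document selection has already been computed elsewhere; the class name, selection representation and handling of pre-existing deletions are all simplified here and are not the actual tool.
{code:java}
import org.apache.lucene.index.CodecReader;
import org.apache.lucene.index.FilterCodecReader;
import org.apache.lucene.util.Bits;
import org.apache.lucene.util.FixedBitSet;

class SelectedDocsCodecReader extends FilterCodecReader {
  private final FixedBitSet selected; // docs of this source segment that go into the target segment

  SelectedDocsCodecReader(CodecReader in, FixedBitSet selected) {
    super(in);
    this.selected = selected;
  }

  @Override
  public Bits getLiveDocs() {
    return selected;                  // addIndexes only copies docs that are "live" here
  }

  @Override
  public int numDocs() {
    return selected.cardinality();
  }

  @Override
  public CacheHelper getCoreCacheHelper() {
    return null;
  }

  @Override
  public CacheHelper getReaderCacheHelper() {
    return null;
  }
}

// Usage, once per target segment (writer is an IndexWriter on the destination directory):
//   writer.addIndexes(new SelectedDocsCodecReader(sourceSegmentReader, selectedDocs));
{code}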
[jira] [Updated] (LUCENE-9618) Improve IntervalIterator.nextInterval's behavior/documentation/test
[ https://issues.apache.org/jira/browse/LUCENE-9618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haoyu Zhai updated LUCENE-9618: --- Description: I'm trying to play around with my own {{IntervalSource}} and found out that {{nextInterval}} method of IntervalIterator will be called sometimes even after {{nextDoc}}/ {{docID}}/ {{advance}} method returns NO_MORE_DOCS. After I dug a bit more I found that {{FilteringIntervalIterator.reset}} is calling an inner iterator's {{nextInterval}} regardless of what the result of {{nextDoc}}, and also most (if not all) existing {{IntervalIterator}}'s implementation do considered the case where {{nextInterval}} is called after {{nextDoc}} returns NO_MORE_DOCS. We should probably update the javadoc and test if the behavior is necessary. Or we should change the current implementation to avoid this behavior original email discussion thread: https://markmail.org/thread/aytal77bgzl2zafm was: I'm trying to play around with my own {{IntervalSource}} and found out that {{nextInterval}} method of IntervalIterator will be called sometimes even after {{nextDoc}}/ {{docID}}/ {{advance}} method returns NO_MORE_DOCS. After I dug a bit more I found that {{FilteringIntervalIterator.reset}} is calling an inner iterator's {{nextInterval}} regardless of what the result of {{nextDoc}}, and also most (if not all) existing {{IntervalIterator}}'s implementation do considered the case where {{nextInterval}} is called after {{nextDoc}} returns NO_MORE_DOCS. We should probably update the javadoc and test if the behavior is necessary. Or we should change the current implementation to avoid this behavior original email discussion thread: https://markmail.org/message/7itbwk6ts3bo3gdh > Improve IntervalIterator.nextInterval's behavior/documentation/test > --- > > Key: LUCENE-9618 > URL: https://issues.apache.org/jira/browse/LUCENE-9618 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/query >Reporter: Haoyu Zhai >Priority: Minor > Time Spent: 10m > Remaining Estimate: 0h > > I'm trying to play around with my own {{IntervalSource}} and found out that > {{nextInterval}} method of IntervalIterator will be called sometimes even > after {{nextDoc}}/ {{docID}}/ {{advance}} method returns NO_MORE_DOCS. > > After I dug a bit more I found that {{FilteringIntervalIterator.reset}} is > calling an inner iterator's {{nextInterval}} regardless of what the result of > {{nextDoc}}, and also most (if not all) existing {{IntervalIterator}}'s > implementation do considered the case where {{nextInterval}} is called after > {{nextDoc}} returns NO_MORE_DOCS. > > We should probably update the javadoc and test if the behavior is necessary. > Or we should change the current implementation to avoid this behavior > original email discussion thread: > https://markmail.org/thread/aytal77bgzl2zafm -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-9618) Improve IntervalIterator.nextInterval's behavior/documentation/test
[ https://issues.apache.org/jira/browse/LUCENE-9618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17235652#comment-17235652 ] Haoyu Zhai commented on LUCENE-9618: I created a [PR|https://github.com/apache/lucene-solr/pull/2090] with a simple test case to demonstrate the issue mentioned. > Improve IntervalIterator.nextInterval's behavior/documentation/test > --- > > Key: LUCENE-9618 > URL: https://issues.apache.org/jira/browse/LUCENE-9618 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/query >Reporter: Haoyu Zhai >Priority: Minor > Time Spent: 10m > Remaining Estimate: 0h > > I'm trying to play around with my own {{IntervalSource}} and found out that > {{nextInterval}} method of IntervalIterator will be called sometimes even > after {{nextDoc}}/ {{docID}}/ {{advance}} method returns NO_MORE_DOCS. > > After I dug a bit more I found that {{FilteringIntervalIterator.reset}} is > calling an inner iterator's {{nextInterval}} regardless of what the result of > {{nextDoc}}, and also most (if not all) existing {{IntervalIterator}}'s > implementation do considered the case where {{nextInterval}} is called after > {{nextDoc}} returns NO_MORE_DOCS. > > We should probably update the javadoc and test if the behavior is necessary. > Or we should change the current implementation to avoid this behavior > original email discussion thread: > https://markmail.org/message/7itbwk6ts3bo3gdh -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Created] (LUCENE-9618) Improve IntervalIterator.nextInterval's behavior/documentation/test
Haoyu Zhai created LUCENE-9618: -- Summary: Improve IntervalIterator.nextInterval's behavior/documentation/test Key: LUCENE-9618 URL: https://issues.apache.org/jira/browse/LUCENE-9618 Project: Lucene - Core Issue Type: Improvement Components: modules/query Reporter: Haoyu Zhai While playing around with my own {{IntervalSource}} I found out that the {{nextInterval}} method of IntervalIterator is sometimes called even after the {{nextDoc}}/ {{docID}}/ {{advance}} methods return NO_MORE_DOCS. After digging a bit more I found that {{FilteringIntervalIterator.reset}} calls an inner iterator's {{nextInterval}} regardless of the result of {{nextDoc}}, and also that most (if not all) existing {{IntervalIterator}} implementations do consider the case where {{nextInterval}} is called after {{nextDoc}} returns NO_MORE_DOCS. We should probably update the javadoc and tests if this behavior is necessary, or change the current implementation to avoid it. Original email discussion thread: https://markmail.org/message/7itbwk6ts3bo3gdh -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
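For reference, the calling pattern this issue assumes should be sufficient, with {{nextInterval}} only called while the iterator is positioned on a real document, looks roughly like this (the package location of the intervals classes has moved between Lucene versions, so adjust imports accordingly):
{code:java}
import java.io.IOException;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.queries.intervals.IntervalIterator;
import org.apache.lucene.queries.intervals.IntervalsSource;
import org.apache.lucene.search.DocIdSetIterator;

class IntervalConsumer {
  static void consume(IntervalsSource source, LeafReaderContext ctx) throws IOException {
    IntervalIterator it = source.intervals("body", ctx);
    if (it == null) {
      return; // nothing indexed for this field in this segment
    }
    for (int doc = it.nextDoc(); doc != DocIdSetIterator.NO_MORE_DOCS; doc = it.nextDoc()) {
      // nextInterval is only called for a doc we are actually positioned on
      while (it.nextInterval() != IntervalIterator.NO_MORE_INTERVALS) {
        int start = it.start();
        int end = it.end();
        // use [start, end] for this hit
      }
    }
  }
}
{code}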
[jira] [Resolved] (LUCENE-9560) Position aware TermQuery
[ https://issues.apache.org/jira/browse/LUCENE-9560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haoyu Zhai resolved LUCENE-9560. Resolution: Not A Problem > Position aware TermQuery > > > Key: LUCENE-9560 > URL: https://issues.apache.org/jira/browse/LUCENE-9560 > Project: Lucene - Core > Issue Type: New Feature > Components: core/search >Reporter: Haoyu Zhai >Priority: Major > > In our work, we index most of our fields into an "all" field (like > elasticsearch) to make our search faster. But at the same time we still want > to support some of the field specific search (like {{title}}), so currently > our solution is to double index them so that we could do both "all" search as > well as specific field search. > I want to propose a new term query that accept a range in a specific field to > search so that we could search on "all" field but act like a field specific > search. Then we need not to doubly index those field. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-9560) Position aware TermQuery
[ https://issues.apache.org/jira/browse/LUCENE-9560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17211544#comment-17211544 ] Haoyu Zhai commented on LUCENE-9560: - Thanks guys. I tried it and found that {{IntervalQuery}} provides the functionality I was after, so I'll close this issue. > Position aware TermQuery > > > Key: LUCENE-9560 > URL: https://issues.apache.org/jira/browse/LUCENE-9560 > Project: Lucene - Core > Issue Type: New Feature > Components: core/search >Reporter: Haoyu Zhai >Priority: Major > > In our work, we index most of our fields into an "all" field (like > elasticsearch) to make our search faster. But at the same time we still want > to support some of the field specific search (like {{title}}), so currently > our solution is to double index them so that we could do both "all" search as > well as specific field search. > I want to propose a new term query that accept a range in a specific field to > search so that we could search on "all" field but act like a field specific > search. Then we need not to doubly index those field. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
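For readers landing here later, the kind of thing {{IntervalQuery}} makes possible looks roughly like the sketch below: restrict a match in the catch-all field to the position range belonging to one logical field. The boundary marker tokens ({{_title_start_}} / {{_title_end_}}) are hypothetical and would have to be injected at index time; this is just one way to express it, not necessarily the scheme used here.
{code:java}
import org.apache.lucene.queries.intervals.IntervalQuery;
import org.apache.lucene.queries.intervals.Intervals;
import org.apache.lucene.queries.intervals.IntervalsSource;
import org.apache.lucene.search.Query;

class TitleScopedSearch {
  static Query titleTerm(String term) {
    // Positions between the (hypothetical) boundary tokens mark the "title" region of "all".
    IntervalsSource titleRegion =
        Intervals.ordered(Intervals.term("_title_start_"), Intervals.term("_title_end_"));
    IntervalsSource hit = Intervals.containedBy(Intervals.term(term), titleRegion);
    return new IntervalQuery("all", hit); // behaves like a title-only search over "all"
  }
}
{code}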
[jira] [Created] (LUCENE-9560) Position aware TermQuery
Haoyu Zhai created LUCENE-9560: -- Summary: Position aware TermQuery Key: LUCENE-9560 URL: https://issues.apache.org/jira/browse/LUCENE-9560 Project: Lucene - Core Issue Type: New Feature Components: core/search Reporter: Haoyu Zhai In our work, we index most of our fields into an "all" field (like elasticsearch) to make our search faster. But at the same time we still want to support some field-specific searches (like {{title}}), so currently our solution is to index those fields twice so that we can do both "all" search and field-specific search. I want to propose a new term query that accepts a range in a specific field to search, so that we could search on the "all" field but act like a field-specific search. Then we would not need to index those fields twice. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-7882) Maybe expression compiler should cache recently compiled expressions?
[ https://issues.apache.org/jira/browse/LUCENE-7882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17197237#comment-17197237 ] Haoyu Zhai commented on LUCENE-7882: Hi Uwe Thank you for making this fantastic PR! This issue was a bug in our service that would incorrectly recompile many expressions and it is fixed, now we only see ~500 expression compiled per benchmark run. We've tested this PR by compiling with JDK11 and running with JDK15 (because of some reason it's not easy to compile our service with JDK15 directly). But because of the reason I mentioned above, it seems that we don't have enough expressions compilation now to observe the difference with or without the PR. > Maybe expression compiler should cache recently compiled expressions? > - > > Key: LUCENE-7882 > URL: https://issues.apache.org/jira/browse/LUCENE-7882 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/expressions >Reporter: Michael McCandless >Assignee: Uwe Schindler >Priority: Major > Attachments: demo.patch > > Time Spent: 1.5h > Remaining Estimate: 0h > > I've been running search performance tests using a simple expression > ({{_score + ln(1000+unit_sales)}}) for sorting and hit this odd bottleneck: > {noformat} > "pool-1-thread-30" #70 prio=5 os_prio=0 tid=0x7eea7000a000 nid=0x1ea8a > waiting for monitor entry [0x7eea867dd000] >java.lang.Thread.State: BLOCKED (on object monitor) > at > org.apache.lucene.expressions.js.JavascriptCompiler$CompiledExpression.evaluate(_score > + ln(1000+unit_sales)) > at > org.apache.lucene.expressions.ExpressionFunctionValues.doubleValue(ExpressionFunctionValues.java:49) > at > com.amazon.lucene.OrderedVELeafCollector.collectInternal(OrderedVELeafCollector.java:123) > at > com.amazon.lucene.OrderedVELeafCollector.collect(OrderedVELeafCollector.java:108) > at > org.apache.lucene.search.MultiCollectorManager$Collectors$LeafCollectors.collect(MultiCollectorManager.java:102) > at > org.apache.lucene.search.Weight$DefaultBulkScorer.scoreAll(Weight.java:241) > at > org.apache.lucene.search.Weight$DefaultBulkScorer.score(Weight.java:184) > at org.apache.lucene.search.BulkScorer.score(BulkScorer.java:39) > at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:658) > at org.apache.lucene.search.IndexSearcher$5.call(IndexSearcher.java:600) > at org.apache.lucene.search.IndexSearcher$5.call(IndexSearcher.java:597) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {noformat} > I couldn't see any {{synchronized}} in the sources here, so I'm not sure > which object monitor it's blocked on. > I was accidentally compiling a new expression for every query, and that > bottleneck would cause overall QPS to slow down drastically (~4X slower after > ~1 hour of redline tests), as if the JVM is getting slower and slower to > evaluate each expression the more expressions I had compiled. > I tested JDK 9-ea and it also kept slowing down over time as the performance > test ran. > Maybe we should put a small cache in front of the expressions compiler to > make it less trappy? Or maybe we can get to the root cause of why the JVM > slows down more and more, the more expressions you compile? > I won't have time to work on this in the near future so if anyone else feels > the itch, please scratch it! 
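Even with the service-side bug fixed, the "small cache in front of the compiler" idea from this issue can be sketched in a few lines; the cache bound and keying by source text below are assumptions, not the final API.
{code:java}
import java.text.ParseException;
import java.util.LinkedHashMap;
import java.util.Map;
import org.apache.lucene.expressions.Expression;
import org.apache.lucene.expressions.js.JavascriptCompiler;

class CachingExpressionCompiler {
  private static final int MAX_ENTRIES = 512;

  // Access-ordered LinkedHashMap used as a small LRU cache, keyed by expression source text.
  private final Map<String, Expression> cache =
      new LinkedHashMap<>(16, 0.75f, true) {
        @Override
        protected boolean removeEldestEntry(Map.Entry<String, Expression> eldest) {
          return size() > MAX_ENTRIES;
        }
      };

  synchronized Expression compile(String source) throws ParseException {
    Expression expression = cache.get(source);
    if (expression == null) {
      expression = JavascriptCompiler.compile(source); // the expensive part: parse + bytecode
      cache.put(source, expression);
    }
    return expression;
  }
}
{code}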
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-9424) Have a warning comment for AttributeSource.captureState
[ https://issues.apache.org/jira/browse/LUCENE-9424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17167454#comment-17167454 ] Haoyu Zhai commented on LUCENE-9424: [~mikemccand] Thank you! > Have a warning comment for AttributeSource.captureState > --- > > Key: LUCENE-9424 > URL: https://issues.apache.org/jira/browse/LUCENE-9424 > Project: Lucene - Core > Issue Type: Improvement > Components: general/javadocs >Reporter: Haoyu Zhai >Priority: Trivial > Fix For: master (9.0), 8.7 > > Attachments: LUCENE-9424.patch > > > {{AttributeSource.captureState}} is a powerful method that can be used to > store and (later on) restore the current state, but it comes with a cost of > copying all attributes in this source and sometimes can be a big cost if > called multiple times. > We could probably add a warning to indicate this cost, as this method is > encapsulated quite well and sometimes people who use it won't be aware of the > cost. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Updated] (LUCENE-9424) Have a warning comment for AttributeSource.captureState
[ https://issues.apache.org/jira/browse/LUCENE-9424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haoyu Zhai updated LUCENE-9424: --- Attachment: LUCENE-9424.patch > Have a warning comment for AttributeSource.captureState > --- > > Key: LUCENE-9424 > URL: https://issues.apache.org/jira/browse/LUCENE-9424 > Project: Lucene - Core > Issue Type: Improvement > Components: general/javadocs >Reporter: Haoyu Zhai >Priority: Trivial > Attachments: LUCENE-9424.patch > > > {{AttributeSource.captureState}} is a powerful method that can be used to > store and (later on) restore the current state, but it comes with a cost of > copying all attributes in this source and sometimes can be a big cost if > called multiple times. > We could probably add a warning to indicate this cost, as this method is > encapsulated quite well and sometimes people who use it won't be aware of the > cost. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Created] (LUCENE-9424) Have a warning comment for AttributeSource.captureState
Haoyu Zhai created LUCENE-9424: -- Summary: Have a warning comment for AttributeSource.captureState Key: LUCENE-9424 URL: https://issues.apache.org/jira/browse/LUCENE-9424 Project: Lucene - Core Issue Type: Improvement Components: general/javadocs Reporter: Haoyu Zhai {{AttributeSource.captureState}} is a powerful method that can be used to store and (later on) restore the current state, but it comes with a cost of copying all attributes in this source and sometimes can be a big cost if called multiple times. We could probably add a warning to indicate this cost, as this method is encapsulated quite well and sometimes people who use it won't be aware of the cost. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
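To illustrate the cost being documented here, a typical pattern where it is easy to underestimate: a {{TokenFilter}} that snapshots the previous token's attributes on every call. The filter itself is a made-up example, not code from Lucene.
{code:java}
import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.util.AttributeSource;

final class RememberLastTokenFilter extends TokenFilter {
  private AttributeSource.State lastToken; // full snapshot of the previous token's attributes

  RememberLastTokenFilter(TokenStream in) {
    super(in);
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (!input.incrementToken()) {
      return false;
    }
    // Clones every attribute impl in the chain; cheap once, but adds up on every token.
    lastToken = captureState();
    return true;
  }

  @Override
  public void reset() throws IOException {
    super.reset();
    lastToken = null;
  }
}
{code}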
[jira] [Commented] (LUCENE-8574) ExpressionFunctionValues should cache per-hit value
[ https://issues.apache.org/jira/browse/LUCENE-8574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17149623#comment-17149623 ] Haoyu Zhai commented on LUCENE-8574: [~mikemccand] Yes it fixes the test case. Before the fix the test will cause {{OutOfMemoryException}} and after the fix it finishes in a reasonable time (like < 1s when I run it solely) Thanks for reviewing that, I've addressed those comments! > ExpressionFunctionValues should cache per-hit value > --- > > Key: LUCENE-8574 > URL: https://issues.apache.org/jira/browse/LUCENE-8574 > Project: Lucene - Core > Issue Type: Bug >Affects Versions: 7.5, 8.0 >Reporter: Michael McCandless >Assignee: Robert Muir >Priority: Major > Attachments: LUCENE-8574.patch, unit_test.patch > > Time Spent: 2h > Remaining Estimate: 0h > > The original version of {{ExpressionFunctionValues}} had a simple per-hit > cache, so that nested expressions that reference the same common variable > would compute the value for that variable the first time it was referenced > and then use that cached value for all subsequent invocations, within one > hit. I think it was accidentally removed in LUCENE-7609? > This is quite important if you have non-trivial expressions that reference > the same variable multiple times. > E.g. if I have these expressions: > {noformat} > x = c + d > c = b + 2 > d = b * 2{noformat} > Then evaluating x should only cause b's value to be computed once (for a > given hit), but today it's computed twice. The problem is combinatoric if b > then references another variable multiple times, etc. > I think to fix this we just need to restore the per-hit cache? > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8574) ExpressionFunctionValues should cache per-hit value
[ https://issues.apache.org/jira/browse/LUCENE-8574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17145732#comment-17145732 ] Haoyu Zhai commented on LUCENE-8574: - I've made a PR ([https://github.com/apache/lucene-solr/pull/1613]) for this issue. Basically it makes a new DoubleValuesSource class which passes a `valueCache` into a custom getValues, to enforce that only one DoubleValues is created per name along the whole generation process. > ExpressionFunctionValues should cache per-hit value > --- > > Key: LUCENE-8574 > URL: https://issues.apache.org/jira/browse/LUCENE-8574 > Project: Lucene - Core > Issue Type: Bug >Affects Versions: 7.5, 8.0 >Reporter: Michael McCandless >Assignee: Robert Muir >Priority: Major > Attachments: LUCENE-8574.patch, unit_test.patch > > Time Spent: 1h 10m > Remaining Estimate: 0h > > The original version of {{ExpressionFunctionValues}} had a simple per-hit > cache, so that nested expressions that reference the same common variable > would compute the value for that variable the first time it was referenced > and then use that cached value for all subsequent invocations, within one > hit. I think it was accidentally removed in LUCENE-7609? > This is quite important if you have non-trivial expressions that reference > the same variable multiple times. > E.g. if I have these expressions: > {noformat} > x = c + d > c = b + 2 > d = b * 2{noformat} > Then evaluating x should only cause b's value to be computed once (for a > given hit), but today it's computed twice. The problem is combinatoric if b > then references another variable multiple times, etc. > I think to fix this we just need to restore the per-hit cache? > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
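The per-hit caching described here boils down to a wrapper along these lines (a simplified sketch, not the PR itself): each named variable's {{DoubleValues}} is wrapped once, so nested expressions referencing the same variable compute it only once per hit.
{code:java}
import java.io.IOException;
import org.apache.lucene.search.DoubleValues;

final class PerHitCachingDoubleValues extends DoubleValues {
  private final DoubleValues in;
  private int currentDoc = -1;
  private boolean computed;
  private double value;

  PerHitCachingDoubleValues(DoubleValues in) {
    this.in = in;
  }

  @Override
  public boolean advanceExact(int doc) throws IOException {
    if (doc != currentDoc) {
      currentDoc = doc;
      computed = false;            // new hit: invalidate the cached value
    }
    return in.advanceExact(doc);
  }

  @Override
  public double doubleValue() throws IOException {
    if (!computed) {
      value = in.doubleValue();    // evaluate the underlying variable only once per hit
      computed = true;
    }
    return value;
  }
}
{code}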
[jira] [Commented] (LUCENE-8574) ExpressionFunctionValues should cache per-hit value
[ https://issues.apache.org/jira/browse/LUCENE-8574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17138710#comment-17138710 ] Haoyu Zhai commented on LUCENE-8574: Ah, yes, I'll use a boolean instead of NaN. I was just verifying whether the patch works, so I quickly inserted a few lines of code without much thought. But how should we fix this issue properly? The easy-fix patch does not seem to solve the problem.
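For context, the NaN sentinel mentioned above is fragile because an expression can legitimately evaluate to NaN; the purely illustrative snippet below shows the collision, which is why a separate boolean "computed" flag is the safer choice.

{code:java}
// Purely illustrative: a cached NaN is indistinguishable from "nothing cached
// yet", so the value would be recomputed on every reference.
public class NaNSentinelPitfall {
  public static void main(String[] args) {
    double legitimateResult = Math.sqrt(-1);   // a perfectly legal expression value
    double sentinel = Double.NaN;              // the "not computed yet" marker
    // Both are NaN, so the cache cannot tell them apart:
    System.out.println(Double.isNaN(legitimateResult) && Double.isNaN(sentinel)); // prints true
  }
}
{code}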
[jira] [Commented] (LUCENE-8574) ExpressionFunctionValues should cache per-hit value
[ https://issues.apache.org/jira/browse/LUCENE-8574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17137920#comment-17137920 ] Haoyu Zhai commented on LUCENE-8574: I've attached a unit test showing a case that the current code cannot handle. The patch attached to this issue does not handle it either: since the DoubleValues generated for the same LeafReaderContext are not shared, we still end up with an enormous number of DoubleValues being created.
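The failure mode described above can be reproduced with bindings in which each variable references the previous one twice, so the number of DoubleValues doubles at every nesting level. The sketch below is a rough illustration of that shape, not the attached unit_test.patch itself; the variable names and depth are made up.

{code:java}
// Rough illustration only: each level references the previous variable twice,
// so without per-name sharing of DoubleValues the instance count doubles per level.
import org.apache.lucene.expressions.Expression;
import org.apache.lucene.expressions.SimpleBindings;
import org.apache.lucene.expressions.js.JavascriptCompiler;
import org.apache.lucene.search.DoubleValuesSource;

public class NestedExpressionBlowup {
  public static void main(String[] args) throws Exception {
    SimpleBindings bindings = new SimpleBindings();
    bindings.add("x0", DoubleValuesSource.constant(1.0));
    int depth = 30;   // roughly 2^30 DoubleValues at evaluation time without caching
    for (int i = 1; i <= depth; i++) {
      Expression expr = JavascriptCompiler.compile("x" + (i - 1) + " + x" + (i - 1));
      bindings.add("x" + i, expr.getDoubleValuesSource(bindings));
    }
    // Calling getValues on the top source for a segment would recursively
    // instantiate an exponential number of DoubleValues unless each variable's
    // values are cached and shared per segment.
    DoubleValuesSource top = bindings.getDoubleValuesSource("x" + depth);
    System.out.println(top);
  }
}
{code}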
[jira] [Updated] (LUCENE-8574) ExpressionFunctionValues should cache per-hit value
[ https://issues.apache.org/jira/browse/LUCENE-8574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haoyu Zhai updated LUCENE-8574: --- Attachment: unit_test.patch
[jira] [Commented] (LUCENE-8574) ExpressionFunctionValues should cache per-hit value
[ https://issues.apache.org/jira/browse/LUCENE-8574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17136136#comment-17136136 ] Haoyu Zhai commented on LUCENE-8574: I've checked the current release and couldn't see this patch merged, and I don't think any other change introduces similar functionality (though I'm not completely sure). Should we go ahead and merge this?
[jira] [Commented] (LUCENE-8574) ExpressionFunctionValues should cache per-hit value
[ https://issues.apache.org/jira/browse/LUCENE-8574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17134323#comment-17134323 ] Haoyu Zhai commented on LUCENE-8574: Sorry, the wrong commit message pointed to this issue; the correct one is LUCENE-9391. By the way, was this patch ever merged?
[jira] [Created] (LUCENE-9391) Upgrade to HPPC 0.8.2
Haoyu Zhai created LUCENE-9391: -- Summary: Upgrade to HPPC 0.8.2 Key: LUCENE-9391 URL: https://issues.apache.org/jira/browse/LUCENE-9391 Project: Lucene - Core Issue Type: Improvement Reporter: Haoyu Zhai HPPC 0.8.2 is out and exposes an Accountable-like interface that can be used to estimate memory usage. [https://issues.carrot2.org/secure/ReleaseNote.jspa?projectId=10070&version=13522&styleName=Text] We should upgrade to it if any of the components that use HPPC need better memory estimation. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
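As a rough illustration of why the accounting interface is useful: a Lucene component holding an HPPC map could implement Lucene's own Accountable by delegating to the container's memory estimate. The method name ramBytesAllocated() below is taken from the linked release notes and should be treated as an assumption, as are the class and field names.

{code:java}
// Hedged sketch; ramBytesAllocated() is an assumed HPPC 0.8.2 method name.
import com.carrotsearch.hppc.IntIntHashMap;

import org.apache.lucene.util.Accountable;
import org.apache.lucene.util.RamUsageEstimator;

final class OrdinalCounts implements Accountable {
  private final IntIntHashMap counts = new IntIntHashMap();

  void increment(int ordinal) {
    counts.addTo(ordinal, 1);
  }

  @Override
  public long ramBytesUsed() {
    // Before 0.8.2 this had to be a hand-rolled estimate of the map's internal
    // key/value arrays; with the new accounting interface we can just ask the map.
    return RamUsageEstimator.shallowSizeOf(this) + counts.ramBytesAllocated();
  }
}
{code}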