[jira] [Created] (LUCENE-10079) DocValues new iterator API is missing in migration guide

2021-08-31 Thread Bernhard Scholz (Jira)
Bernhard Scholz created LUCENE-10079:


 Summary: DocValues new iterator API is missing in migration guide
 Key: LUCENE-10079
 URL: https://issues.apache.org/jira/browse/LUCENE-10079
 Project: Lucene - Core
  Issue Type: Bug
Affects Versions: 7.0
Reporter: Bernhard Scholz


LUCENE-7407 introduced a breaking change in the DocValues API. This change 
should be mentioned in the [Migration 
Guide|https://lucene.apache.org/core/7_7_3/MIGRATE.html], not only in the 
[Change 
Log|https://lucene.apache.org/core/7_0_0/changes/Changes.html#v7.0.0.api_changes].



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] jpountz commented on a change in pull request #242: LUCENE-9620 Add Weight#count(LeafReaderContext)

2021-08-31 Thread GitBox


jpountz commented on a change in pull request #242:
URL: https://github.com/apache/lucene/pull/242#discussion_r699205142



##########
File path: lucene/core/src/java/org/apache/lucene/search/FilterWeight.java
##########
@@ -67,4 +67,9 @@ public Scorer scorer(LeafReaderContext context) throws IOException {
   public Matches matches(LeafReaderContext context, int doc) throws IOException {
     return in.matches(context, doc);
   }
+
+  @Override
+  public int count(LeafReaderContext context) throws IOException {
+    return in.count(context);
+  }

Review comment:
   In general we only forward calls to the wrapped instance in `FilterXXX`
classes when the method is abstract. This is why `FilterWeight` doesn't
implement `bulkScorer`, for instance.
   
   We do this because it can be a bit trappy otherwise, e.g.
someone extends `FilterWeight` and overrides the `scorer()` method but forgets to
override `count`.
   
   We should move this to the concrete instances where it's correct to forward
to the wrapped instance, such as the Weight of `ConstantScoreQuery`.

##########
File path: lucene/core/src/java/org/apache/lucene/search/TermQuery.java
##########
@@ -179,6 +179,22 @@ public Explanation explain(LeafReaderContext context, int doc) throws IOException {
       }
       return Explanation.noMatch("no matching term");
     }
+
+    @Override
+    public int count(LeafReaderContext context) throws IOException {
+      if (context.reader().hasDeletions() == false) {
+        TermsEnum termsEnum = getTermsEnum(context);
+        // termsEnum is not null if term state is available
+        if (termsEnum != null) {
+          return termsEnum.docFreq();
+        } else {
+          // no term state found so rely on the default reader.docFreq call
+          return context.reader().docFreq(term);

Review comment:
   You could return `0` directly here: if `getTermsEnum` returns `null`,
it means that the term cannot be found in the dictionary, so the count for the
query is 0.
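The suggested shape of the fix can be modeled with a plain map standing in for the terms dictionary (`TermCount` and its method names are illustrative, not Lucene's API):

```java
import java.util.HashMap;
import java.util.Map;

// Model of the suggested fix: with no deletions, docFreq is the exact count,
// and a term that is absent from the dictionary has a count of exactly 0.
public class TermCount {
    static final Map<String, Integer> DOC_FREQ = new HashMap<>();
    static { DOC_FREQ.put("lucene", 42); }

    // stands in for getTermsEnum(context): null when the term is absent
    static Integer docFreqOrNull(String term) { return DOC_FREQ.get(term); }

    static int count(String term, boolean hasDeletions) {
        if (hasDeletions) {
            return -1;            // deleted docs may hide matches: count unknown
        }
        Integer df = docFreqOrNull(term);
        if (df != null) {
            return df;            // docFreq == exact count when nothing is deleted
        }
        return 0;                 // term not in the dictionary: the count is 0
    }
}
```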




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org






[GitHub] [lucene] jpountz commented on pull request #242: LUCENE-9620 Add Weight#count(LeafReaderContext)

2021-08-31 Thread GitBox


jpountz commented on pull request #242:
URL: https://github.com/apache/lucene/pull/242#issuecomment-909123525


   > Does it make sense to have a count API return -1 as the result if the 
number of matches are greater than a threshold?
   > Also, in an unoptimized query with > TOTAL_HITS_THRESHOLD hits, we will 
count the results twice because we first count it in the count API (with the 
weight.count call) and then again with the leafCollector from the new 
totalHitCountCollector we create in the class?
   
   This is why I suggested always returning -1 if the count cannot be returned 
in constant-time. This way we ensure that we would never linearly scan all 
matches twice for the purpose of counting hits.
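A caller-side sketch of that contract (illustrative names; this is not the actual IndexSearcher code): a non-negative return value is a constant-time answer that can be trusted outright, while `-1` means the caller falls back to collecting hits, so matches are scanned at most once.

```java
// Caller-side sketch of the proposed contract: Weight#count returns a
// non-negative value only when it can be computed in constant time;
// otherwise it returns -1 and the caller collects hits itself.
public class CountContract {
    static int countHits(int weightCount, int[] matchingDocs) {
        if (weightCount >= 0) {
            return weightCount;       // constant-time answer: no scan at all
        }
        int total = 0;                // single linear scan; never a second one
        for (int doc : matchingDocs) {
            total++;
        }
        return total;
    }
}
```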





[jira] [Created] (LUCENE-10080) Use a bit set to count long-tail of singleton FacetLabels?

2021-08-31 Thread Michael McCandless (Jira)
Michael McCandless created LUCENE-10080:
---

 Summary: Use a bit set to count long-tail of singleton FacetLabels?
 Key: LUCENE-10080
 URL: https://issues.apache.org/jira/browse/LUCENE-10080
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Michael McCandless


I was talking about this with [~rcmuir] about LUCENE-9969, and he had a neat
idea for more efficient facet counting.

Today we accumulate counts directly in an HPPC native int/int map, or a
non-sparse {{int[]}} (if enough hits match the query).

But it is likely that many of these facet counts are singletons (they occur only
once in each query). To be more space efficient, we could wrap a bit set around
the map or {{int[]}}.  The first time we see an ordinal, we set its bit.  The
second and subsequent times, we increment the count as we do today.

If we use a non-sparse bitset (e.g. {{FixedBitSet}}) that will add some 
non-sparse heap cost O(maxDoc) for each segment, but if there are enough 
ordinals to count, that can be a win over just the HPPC native int map for some 
cases?

Maybe this could be an intermediate implementation, since we already cover the 
"very low hit count" (use HPPC int/int map) and "very high hit count" (using 
{{int[]}}) today?

Also, this bit set would be able to quickly iterate over the sorted ordinals, 
which might be helpful if we move the three big {{int[]}} into numeric doc 
values?
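The two-tier idea could look something like this sketch, using java.util stand-ins for {{FixedBitSet}} and the HPPC int/int map (an illustration of the structure, not a proposed patch):

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

// Two-tier counter: the first occurrence of an ordinal costs one bit; a real
// count entry is only allocated from the second occurrence onwards.
public class SingletonAwareCounter {
    private final BitSet seenOnce = new BitSet();                // stand-in for FixedBitSet
    private final Map<Integer, Integer> extra = new HashMap<>(); // stand-in for the HPPC int/int map

    void increment(int ordinal) {
        if (!seenOnce.get(ordinal)) {
            seenOnce.set(ordinal);                 // singleton so far: just one bit
        } else {
            extra.merge(ordinal, 1, Integer::sum); // second and later occurrences
        }
    }

    int count(int ordinal) {
        if (!seenOnce.get(ordinal)) {
            return 0;
        }
        return 1 + extra.getOrDefault(ordinal, 0); // the bit accounts for the first hit
    }
}
```

If most ordinals really are singletons, the map stays tiny and the bit set dominates the cost, which is the win described above.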






[jira] [Commented] (LUCENE-9969) DirectoryTaxonomyReader.taxoArray uses a lot of memory, causing a system OOM crash

2021-08-31 Thread Michael McCandless (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17407363#comment-17407363
 ] 

Michael McCandless commented on LUCENE-9969:


Imagine we had a {{NUMERIC}} doc values field, holding the parent ordinal of 
each ordinal in the taxonomy index.  I think we can easily create that while 
indexing, since we already ensure a parent is assigned an ordinal before its 
children.

Then, at search time, instead of using the big non-sparse hard-allocated 
{{int[] parents}} array, we could pull a {{NumericDocValues}} iterator, sort 
the ordinals we had just counted (the bitset idea from LUCENE-10080 might help 
with that?), and make a single iteration through the DV iterator to find all 
parent ordinals, to then know how to collate the ordinals into each dimension?

Except for the added sort (N * log(N) worst case), performance should be good – 
doc values are already designed for this forward-only iteration.  And then we 
wouldn't need {{int[] parents}} for "normal" non-hierarchical facet counting.  
For truly hierarchical facet counting I'm not sure what to do yet :)
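A sketch of that search-time pass, with a plain array standing in for the proposed NUMERIC doc-values field (names are illustrative):

```java
import java.util.Arrays;

// Stand-in for the proposed NUMERIC doc-values field: parentDocValues[ord]
// holds the parent ordinal of ord. After sorting the counted ordinals, a
// single forward-only pass resolves every parent, so no big non-sparse
// int[] parents array has to be kept on the heap.
public class ParentLookup {
    static int[] resolveParents(int[] parentDocValues, int[] countedOrds) {
        int[] sorted = countedOrds.clone();
        Arrays.sort(sorted);                      // the added O(N log N) sort
        int[] parents = new int[sorted.length];
        int cursor = -1;                          // forward-only iterator position
        for (int i = 0; i < sorted.length; i++) {
            assert sorted[i] > cursor;            // only ever advances, like a DV iterator
            cursor = sorted[i];
            parents[i] = parentDocValues[cursor]; // one lookup per sorted ordinal
        }
        return parents;
    }
}
```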

> DirectoryTaxonomyReader.taxoArray uses a lot of memory, causing a system OOM crash
> 
>
> Key: LUCENE-9969
> URL: https://issues.apache.org/jira/browse/LUCENE-9969
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/facet
>Affects Versions: 6.6.2
>Reporter: FengFeng Cheng
>Priority: Trivial
> Attachments: image-2021-05-24-13-43-43-289.png
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> First, the data volume is very large: the JVM heap is 90 GB, but TaxonomyIndexArrays takes up almost half of it.
> !image-2021-05-24-13-43-43-289.png!
> Is there a better way to use TaxonomyReader, or some other optimization?






[GitHub] [lucene-site] janhoy commented on pull request #60: Remove Google Analytics from Lucene site

2021-08-31 Thread GitBox


janhoy commented on pull request #60:
URL: https://github.com/apache/lucene-site/pull/60#issuecomment-909272342


   @msokolov please review if you want Lucene to be GA free :-) 





[GitHub] [lucene-site] uschindler commented on pull request #60: Remove Google Analytics from Lucene site

2021-08-31 Thread GitBox


uschindler commented on pull request #60:
URL: https://github.com/apache/lucene-site/pull/60#issuecomment-909291912


   (sorry for late reply)







[GitHub] [lucene] jpountz commented on pull request #262: LUCENE-10063: implement SimpleTextKnnvectorsReader.search

2021-08-31 Thread GitBox


jpountz commented on pull request #262:
URL: https://github.com/apache/lucene/pull/262#issuecomment-908331029


   @msokolov Let's merge this PR to stop test failures?





[GitHub] [lucene] gautamworah96 commented on pull request #242: LUCENE-9620 Add Weight#count(LeafReaderContext)

2021-08-31 Thread GitBox


gautamworah96 commented on pull request #242:
URL: https://github.com/apache/lucene/pull/242#issuecomment-908676085


   Hmmm. So this is indeed multi-threaded, but I am still confused. Does it make 
sense to have a count API return `-1` as the result if the number of matches 
is greater than a threshold? Also, in an unoptimized query with > 
TOTAL_HITS_THRESHOLD hits, will we count the results twice, because we first 
count them in the count API (with the `weight.count` call) and then again with 
the `leafCollector` from the new `totalHitCountCollector` we create in the 
class?
   
   I am a bit new to this part of the code, so it may be possible that I am 
misunderstanding things!





[GitHub] [lucene] wuda0112 commented on pull request #224: LUCENE-10035: Simple text codec add multi level skip list data

2021-08-31 Thread GitBox


wuda0112 commented on pull request #224:
URL: https://github.com/apache/lucene/pull/224#issuecomment-908356737


   @jpountz Thank you, you helped me a lot,  and thanks for your patience to 
review !





[jira] [Commented] (LUCENE-10079) DocValues new iterator API is missing in migration guide

2021-08-31 Thread Michael McCandless (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17407411#comment-17407411
 ] 

Michael McCandless commented on LUCENE-10079:
-

Hrmph, you are right!  We clearly should have written a {{MIGRATE}} entry for 
this big change – I can't believe we failed to.  Sorry :(  We won't be 
releasing another 7.x release, so I don't think we will fix this.  My [blog 
post about Lucene 
7.0|https://blog.mikemccandless.com/2017/03/apache-lucene-70-is-coming-soon.html]
 shared some details, or maybe look at how tests were updated in that change to 
see how to use the new iterator form?  It is similar to the postings API, so 
should hopefully be familiar.
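For readers hitting this migration, here is a tiny model of the iterator-style access that 7.0 introduced. This is a stand-in class, not Lucene's NumericDocValues, though the advanceExact/longValue shape mirrors the real API:

```java
// Stand-in model of the 7.0 iterator API: instead of random access
// ("give me the value for doc N"), you advance a forward-only iterator,
// just like a postings enum.
public class DocValuesIteratorModel {
    static class NumericDocValues {
        private final long[] values;   // -1 marks "no value for this doc"
        private int doc = -1;
        NumericDocValues(long[] values) { this.values = values; }

        // advance to target doc; true if that doc has a value
        boolean advanceExact(int target) {
            assert target > doc : "iterator is forward-only, like a postings enum";
            doc = target;
            return values[target] != -1;
        }

        long longValue() { return values[doc]; }
    }

    public static void main(String[] args) {
        NumericDocValues dv = new NumericDocValues(new long[] {7, -1, 9});
        // Pre-7.0 random access (dv.get(docID)) is gone; iterate forward instead:
        for (int docId = 0; docId < 3; docId++) {
            if (dv.advanceExact(docId)) {
                System.out.println(docId + " -> " + dv.longValue());
            }
        }
    }
}
```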

> DocValues new iterator API is missing in migration guide
> 
>
> Key: LUCENE-10079
> URL: https://issues.apache.org/jira/browse/LUCENE-10079
> Project: Lucene - Core
>  Issue Type: Bug
>Affects Versions: 7.0
>Reporter: Bernhard Scholz
>Priority: Minor
>
> LUCENE-7407 introduced a breaking change in the DocValues API. This change 
> should be mentioned in the [Migration 
> Guide|https://lucene.apache.org/core/7_7_3/MIGRATE.html], not only in the 
> [Change 
> Log|https://lucene.apache.org/core/7_0_0/changes/Changes.html#v7.0.0.api_changes].








[GitHub] [lucene] jpountz merged pull request #224: LUCENE-10035: Simple text codec add multi level skip list data

2021-08-31 Thread GitBox


jpountz merged pull request #224:
URL: https://github.com/apache/lucene/pull/224


   







[GitHub] [lucene] jpountz commented on pull request #224: LUCENE-10035: Simple text codec add multi level skip list data

2021-08-31 Thread GitBox


jpountz commented on pull request #224:
URL: https://github.com/apache/lucene/pull/224#issuecomment-908326788


   > In the meantime that failing test should add an assume that the current 
codec is not SimpleText.
   
   Or let's just merge https://github.com/apache/lucene/pull/262? :)





[GitHub] [lucene-site] janhoy merged pull request #61: Remove GA from Lucene site (prod)

2021-08-31 Thread GitBox


janhoy merged pull request #61:
URL: https://github.com/apache/lucene-site/pull/61


   





[GitHub] [lucene-site] janhoy merged pull request #60: Remove Google Analytics from Lucene site

2021-08-31 Thread GitBox


janhoy merged pull request #60:
URL: https://github.com/apache/lucene-site/pull/60


   





[GitHub] [lucene-solr] madrob closed pull request #937: SOLR-13209 fixed by adding a null check that throws a SolrException

2021-08-31 Thread GitBox


madrob closed pull request #937:
URL: https://github.com/apache/lucene-solr/pull/937


   





[GitHub] [lucene-solr] HoustonPutman merged pull request #2563: SOLR-15599: Upgrade AWS SDK from v1 to v2

2021-08-31 Thread GitBox


HoustonPutman merged pull request #2563:
URL: https://github.com/apache/lucene-solr/pull/2563


   





[GitHub] [lucene] msokolov merged pull request #262: LUCENE-10063: implement SimpleTextKnnvectorsReader.search

2021-08-31 Thread GitBox


msokolov merged pull request #262:
URL: https://github.com/apache/lucene/pull/262


   





[GitHub] [lucene] msokolov commented on pull request #262: LUCENE-10063: implement SimpleTextKnnvectorsReader.search

2021-08-31 Thread GitBox


msokolov commented on pull request #262:
URL: https://github.com/apache/lucene/pull/262#issuecomment-909460252


   Thanks for the reminder! I had lost track...





[jira] [Commented] (LUCENE-10063) SimpleTextKnnVectorsReader.search needs an implementation

2021-08-31 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17407558#comment-17407558
 ] 

ASF subversion and git services commented on LUCENE-10063:
--

Commit 9c7f0d45eefacdd139f1619defed28707dae705b in lucene's branch 
refs/heads/main from Michael Sokolov
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=9c7f0d4 ]

LUCENE-10063: implement SimpleTextKnnvectorsReader.search



> SimpleTextKnnVectorsReader.search needs an implementation
> -
>
> Key: LUCENE-10063
> URL: https://issues.apache.org/jira/browse/LUCENE-10063
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Adrien Grand
>Priority: Blocker
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> SimpleText doesn't implement vector search today; it throws an 
> UnsupportedOperationException. Until now we have worked around this by disabling 
> SimpleText on tests that use vectors, but this isn't a good solution: 
> SimpleText should implement APIs correctly and only be disabled on tests that 
> expect a binary format or that are too slow with SimpleText.
> Let's implement this method via linear scan for now?
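The linear-scan approach proposed here can be sketched as follows (illustrative code, not the SimpleTextKnnVectorsReader implementation; a dot product stands in for the configured similarity function):

```java
import java.util.Arrays;

// Brute-force nearest-neighbor search: score every stored vector against the
// query and return the indices of the k best, in decreasing order of score.
public class BruteForceKnn {
    static float dot(float[] a, float[] b) {
        float s = 0;
        for (int i = 0; i < a.length; i++) {
            s += a[i] * b[i];
        }
        return s;
    }

    static int[] search(float[][] vectors, float[] query, int k) {
        Integer[] order = new Integer[vectors.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        // full scan plus sort by descending score; correct but O(n log n),
        // which is fine for a test-only codec like SimpleText
        Arrays.sort(order, (a, b) -> Float.compare(dot(vectors[b], query), dot(vectors[a], query)));
        int[] top = new int[Math.min(k, order.length)];
        for (int i = 0; i < top.length; i++) {
            top[i] = order[i];
        }
        return top;
    }
}
```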






[jira] [Commented] (LUCENE-8723) Bad interaction between WordDelimiterGraphFilter, StopFilter and FlattenGraphFilter

2021-08-31 Thread Michael Sokolov (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17407572#comment-17407572
 ] 

Michael Sokolov commented on LUCENE-8723:
-

I wonder if WDGF and SynonymGraphFilter can also be used together now? If we 
have managed to get all our filters able to consume graphs, then we could 
actually remove the (currently deprecated) non-graph versions (SynonymFilter, 
WordDelimiterFilter).

> Bad interaction between WordDelimiterGraphFilter, StopFilter and 
> FlattenGraphFilter
> ---
>
> Key: LUCENE-8723
> URL: https://issues.apache.org/jira/browse/LUCENE-8723
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/analysis
>Affects Versions: 7.7.1, 8.0, 8.3
>Reporter: Nicolás Lichtmaier
>Priority: Major
> Fix For: main (9.0), 8.10
>
>
> I was debugging an issue (missing tokens after analysis) and when I enabled 
> Java assertions I uncovered a bug when using WordDelimiterGraphFilter + 
> StopFilter + FlattenGraphFilter.
> I could reproduce the issue in a small piece of code. This code gives an 
> assertion failure when assertions are enabled (-ea java option):
> {code:java}
>     Builder builder = CustomAnalyzer.builder();
>     builder.withTokenizer(StandardTokenizerFactory.class);
>     builder.addTokenFilter(WordDelimiterGraphFilterFactory.class, 
> "preserveOriginal", "1");
>     builder.addTokenFilter(StopFilterFactory.class);
>     builder.addTokenFilter(FlattenGraphFilterFactory.class);
>     Analyzer analyzer = builder.build();
>      
>     TokenStream ts = analyzer.tokenStream("*", new StringReader("x7in"));
>     ts.reset();
>     while(ts.incrementToken())
>         ;
> {code}
> This gives:
> {code}
> Exception in thread "main" java.lang.AssertionError: 2
>      at 
> org.apache.lucene.analysis.core.FlattenGraphFilter.releaseBufferedToken(FlattenGraphFilter.java:195)
>      at 
> org.apache.lucene.analysis.core.FlattenGraphFilter.incrementToken(FlattenGraphFilter.java:258)
>      at com.wolfram.textsearch.AnalyzerError.main(AnalyzerError.java:32)
> {code}
> Maybe removing stop words after WordDelimiterGraphFilter is wrong, I don't 
> know; however, it is the only way to process stop-words generated by that filter. 
> In any case, it should not eat tokens or trigger assertion failures.






[GitHub] [lucene-site] msokolov commented on pull request #60: Remove Google Analytics from Lucene site

2021-08-31 Thread GitBox


msokolov commented on pull request #60:
URL: https://github.com/apache/lucene-site/pull/60#issuecomment-909479543


   Thanks, Jan and Uwe!





[GitHub] [lucene] mikemccand commented on a change in pull request #179: LUCENE-9476: Add getBulkPath API to DirectoryTaxonomyReader

2021-08-31 Thread GitBox


mikemccand commented on a change in pull request #179:
URL: https://github.com/apache/lucene/pull/179#discussion_r699665391



##########
File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyReader.java
##########
@@ -351,12 +348,140 @@ public FacetLabel getPath(int ordinal) throws IOException {
     }
 
     synchronized (categoryCache) {
-      categoryCache.put(catIDInteger, ret);
+      categoryCache.put(ordinal, ret);
     }
 
     return ret;
   }
 
+  private FacetLabel[] getPathFromCache(int... ordinals) {
+    FacetLabel[] facetLabels = new FacetLabel[ordinals.length];
+    // TODO LUCENE-10068: can we use an int-based hash impl, such as IntToObjectMap,
+    // wrapped as LRU?
+    synchronized (categoryCache) {
+      for (int i = 0; i < ordinals.length; i++) {
+        facetLabels[i] = categoryCache.get(ordinals[i]);
+      }
+    }
+    return facetLabels;
+  }
+
+  /**
+   * Checks if the ordinals in the array are >=0 and < {@code
+   * DirectoryTaxonomyReader#indexReader.maxDoc()}
+   *
+   * @param ordinals Integer array of ordinals
+   * @throws IllegalArgumentException Throw an IllegalArgumentException if one of the ordinals is
+   *     out of bounds
+   */
+  private void checkOrdinalBounds(int... ordinals) throws IllegalArgumentException {
+    for (int ordinal : ordinals) {
+      if (ordinal < 0 || ordinal >= indexReader.maxDoc()) {
+        throw new IllegalArgumentException(
+            "ordinal "
+                + ordinal
+                + " is out of the range of the indexReader "
+                + indexReader.toString()
+                + ". The maximum possible ordinal number is "
+                + (indexReader.maxDoc() - 1));
+      }
+    }
+  }
+
+  /**
+   * Returns an array of FacetLabels for a given array of ordinals.
+   *
+   * This API is generally faster than iteratively calling {@link #getPath(int)} over an array of
+   * ordinals. It uses the {@link #getPath(int)} method iteratively when it detects that the index
+   * was created using StoredFields (with no performance gains) and uses DocValues based iteration
+   * when the index is based on BinaryDocValues. Lucene switched to BinaryDocValues in version 9.0.
+   *
+   * @param ordinals Array of ordinals that are assigned to categories inserted into the taxonomy
+   *     index
+   */
+  @Override
+  public FacetLabel[] getBulkPath(int... ordinals) throws IOException {
+    ensureOpen();
+    checkOrdinalBounds(ordinals);
+
+    int ordinalsLength = ordinals.length;
+    FacetLabel[] bulkPath = new FacetLabel[ordinalsLength];
+    // remember the original positions of ordinals before they are sorted
+    int[] originalPosition = new int[ordinalsLength];
+    Arrays.setAll(originalPosition, IntUnaryOperator.identity());
+
+    getPathFromCache(ordinals);

Review comment:
   And maybe if we wind up removing this cache entirely, we don't need to 
do this issue!




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org






[jira] [Commented] (LUCENE-9460) getPath in DirectoryTaxonomyReader should throw an exception

2021-08-31 Thread Gautam Worah (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17407660#comment-17407660
 ] 

Gautam Worah commented on LUCENE-9460:
--

We are addressing this in this LUCENE-9476 
[PR|https://github.com/apache/lucene/pull/179]

> getPath in DirectoryTaxonomyReader should throw an exception
> 
>
> Key: LUCENE-9460
> URL: https://issues.apache.org/jira/browse/LUCENE-9460
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/facet
>Affects Versions: 8.5.2
>Reporter: Gautam Worah
>Priority: Trivial
>
> This issue is a spillover from [LUCENE-9450 
> PR|https://github.com/apache/lucene-solr/pull/1733] and was suggested by 
> [~mikemccand]
> If the {{ordinal}} is out of bound it indicates that the user called their 
> main {{IndexReader}} and the {{TaxonomyReader}} in the wrong order. 
> In this case, we should throw an {{IllegalArgumentException}} to warn the 
> user instead of returning {{null}}
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)




[jira] [Commented] (LUCENE-10077) Closing the DirTaxonomyReader while another thread access the cache can throw NPE

2021-08-31 Thread Marc D'Mello (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17407667#comment-17407667
 ] 

Marc D'Mello commented on LUCENE-10077:
---

Hi, I would like to work on this issue.

> Closing the DirTaxonomyReader while another thread access the cache can throw 
> NPE
> -
>
> Key: LUCENE-10077
> URL: https://issues.apache.org/jira/browse/LUCENE-10077
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/facet
>Affects Versions: main (9.0)
>Reporter: Gautam Worah
>Priority: Minor
>
> When we close a {{DirectoryTaxonomyReader}} in {{doClose}}, we set the 
> {{categoryCache}} to null. But if a thread is next after this {{doClose}} 
> call, it will still try to acquire a lock and {{synchronize}} on it. This 
> will result in an NPE.
> This works well today, because we operate on the assumption that the user 
> will always call {{doClose}} after all threads have completed. 
>  One suggestion by [~mikemccand] in this 
> [PR|https://github.com/apache/lucene/pull/179#discussion_r697880516] was to 
> make categoryCache final and throw an AlreadyClosedException.
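The suggestion above can be sketched roughly as follows. `TaxonomyCacheHolder` and its members are hypothetical stand-ins, not the real `DirectoryTaxonomyReader`, and `IllegalStateException` stands in for Lucene's `AlreadyClosedException`:

```java
import java.util.HashMap;
import java.util.Map;

// Hedged sketch of the proposed fix: keep the cache field final and fail fast
// after close, instead of nulling the field in doClose() (which races with
// readers). All names here are illustrative, not the real Lucene API.
class TaxonomyCacheHolder {
  private final Map<Integer, String> categoryCache = new HashMap<>(); // never nulled

  private volatile boolean closed = false;

  void put(int ordinal, String label) {
    ensureOpen();
    synchronized (categoryCache) {
      categoryCache.put(ordinal, label);
    }
  }

  String get(int ordinal) {
    ensureOpen();
    synchronized (categoryCache) {
      return categoryCache.get(ordinal);
    }
  }

  private void ensureOpen() {
    if (closed) {
      // Lucene would throw AlreadyClosedException here
      throw new IllegalStateException("this reader is already closed");
    }
  }

  void close() {
    // the final field stays non-null, so a racing reader gets a clear
    // exception from ensureOpen() rather than a NullPointerException
    closed = true;
  }
}
```

A thread that has already passed `ensureOpen()` can still complete its lookup, which is harmless; the point is only that no thread can ever observe a null cache reference.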






[jira] [Created] (LUCENE-10081) KoreanTokenizer should check the max backtrace gap on whitespaces

2021-08-31 Thread Jim Ferenczi (Jira)
Jim Ferenczi created LUCENE-10081:
-

 Summary: KoreanTokenizer should check the max backtrace gap on 
whitespaces
 Key: LUCENE-10081
 URL: https://issues.apache.org/jira/browse/LUCENE-10081
 Project: Lucene - Core
  Issue Type: Bug
Reporter: Jim Ferenczi


Today the KoreanTokenizer keeps track of the whitespaces that appear before a 
known term in order to apply a space penalty factor. These whitespaces are 
considered part of the next term so the backtrace gap limit is not applied. 
As a result, the position buffer can grow up to the maximum number of 
consecutive whitespaces in the input. This is problematic since the buffer is 
reused on reset() so we should ensure that the max backtrace gap limit is 
applied on consecutive whitespaces consistently.






[jira] [Commented] (LUCENE-10063) SimpleTextKnnVectorsReader.search needs an implementation

2021-08-31 Thread Julie Tibshirani (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17407693#comment-17407693
 ] 

Julie Tibshirani commented on LUCENE-10063:
---

I noticed a few test failures pop up. Here are reproduction lines:

{code}
./gradlew test --tests 
TestSimpleTextKnnVectorsFormat.testRandomWithUpdatesAndGraph 
-Dtests.seed=8FEDEC85BA7F05D7
{code}

{code}
./gradlew test --tests TestKnnVectorQuery -Dtests.codec=SimpleText
{code}

I also realized we maybe forgot to account for {{Bits acceptDocs}} in the 
search method?

> SimpleTextKnnVectorsReader.search needs an implementation
> -
>
> Key: LUCENE-10063
> URL: https://issues.apache.org/jira/browse/LUCENE-10063
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Adrien Grand
>Priority: Blocker
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> SimpleText doesn't implement vector search today by throwing an 
> UnsupportedOperationException. We worked around this by disabling SimpleText 
> on tests that use vectors until now, but this isn't a good solution: 
> SimpleText should implement APIs correctly and only be disabled on tests that 
> expect a binary format or that are too slow with SimpleText.
> Let's implement this method via linear scan for now?
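A brute-force linear scan of the kind suggested above could look roughly like this. The dot-product similarity and the `float[][]` input are illustrative assumptions; the real reader would iterate the segment's `VectorValues` and honor `acceptDocs`:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Hedged sketch of brute-force kNN: score every vector against the query and
// keep the k best hits. LinearScanKnn and Hit are illustrative names.
class LinearScanKnn {
  record Hit(int doc, float score) {}

  static List<Hit> topK(float[][] vectors, float[] query, int k) {
    // min-heap on score so the current worst hit is evicted first
    PriorityQueue<Hit> heap = new PriorityQueue<>(Comparator.comparingDouble(Hit::score));
    for (int doc = 0; doc < vectors.length; doc++) {
      float score = 0;
      for (int i = 0; i < query.length; i++) {
        score += query[i] * vectors[doc][i]; // dot-product similarity (assumed)
      }
      heap.offer(new Hit(doc, score));
      if (heap.size() > k) {
        heap.poll(); // keep only the k best so far
      }
    }
    List<Hit> hits = new ArrayList<>(heap);
    hits.sort(Comparator.comparingDouble(Hit::score).reversed()); // best first
    return hits;
  }
}
```

This is O(n·d) per query, which is exactly why it is acceptable for SimpleText (a debugging codec) but not for a production format.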






[GitHub] [lucene] jimczi opened a new pull request #272: LUCENE-10081: KoreanTokenizer should check the max backtrace gap on whitespaces

2021-08-31 Thread GitBox


jimczi opened a new pull request #272:
URL: https://github.com/apache/lucene/pull/272


   This change ensures that we don't skip consecutive whitespaces without 
checking the maximum backtrace gap.





[GitHub] [lucene] jtibshirani commented on a change in pull request #267: LUCENE-10054 Handle hierarchy in graph construction and search

2021-08-31 Thread GitBox


jtibshirani commented on a change in pull request #267:
URL: https://github.com/apache/lucene/pull/267#discussion_r699760822



##
File path: lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraph.java
##
@@ -154,7 +210,7 @@ public static NeighborQueue search(
 visited.set(friendOrd);
 
 float score = similarityFunction.compare(query, 
vectors.vectorValue(friendOrd));
-if (results.size() < numSeed || bound.check(score) == false) {
+if (results.size() < topK || bound.check(score) == false) {
   candidates.add(friendOrd, score);
   if (acceptOrds == null || acceptOrds.get(friendOrd)) {

Review comment:
   Do we also need to check if `level > 0` here? Maybe it's more solid to  
pass in `null` for `acceptOrds` when searching upper levels, so we don't need 
to remember to always check this.

##
File path: lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraph.java
##
@@ -206,11 +275,65 @@ public void seek(int level, int targetNode) {
 upto = -1;
   }
 
+  /**
+   * Positions the graph on the given level. Must be used before iterating 
over nodes on this level
+   * with the method {@code nextNodeOnLevel()}.
+   *
+   * Package private access to use only for tests
+   */
+  void seekLevel(int level) {

Review comment:
   It feels a little confusing that we have both `seekLevel(level)`, which 
positions the level, and `seek(level, targetNode)` which doesn't. Maybe we can 
refine this API once as part of the on-disk PR, when we'll have a better idea 
of the final graph layout.







[jira] [Commented] (LUCENE-10063) SimpleTextKnnVectorsReader.search needs an implementation

2021-08-31 Thread Michael Sokolov (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17407740#comment-17407740
 ] 

Michael Sokolov commented on LUCENE-10063:
--

Ooh, thank you for pointing that out. Sorry for the sloppy work here. I can 
only say I got distracted. Will follow up with a PR soon

> SimpleTextKnnVectorsReader.search needs an implementation
> -
>
> Key: LUCENE-10063
> URL: https://issues.apache.org/jira/browse/LUCENE-10063
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Adrien Grand
>Priority: Blocker
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> SimpleText doesn't implement vector search today by throwing an 
> UnsupportedOperationException. We worked around this by disabling SimpleText 
> on tests that use vectors until now, but this isn't a good solution: 
> SimpleText should implement APIs correctly and only be disabled on tests that 
> expect a binary format or that are too slow with SimpleText.
> Let's implement this method via linear scan for now?






[jira] [Updated] (LUCENE-10068) Switch to a "double barrel" HPPC cache for the taxonomy LRU cache

2021-08-31 Thread Gautam Worah (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gautam Worah updated LUCENE-10068:
--
Attachment: disable_taxo_category_cache_benchmark

> Switch to a "double barrel" HPPC cache for the taxonomy LRU cache
> -
>
> Key: LUCENE-10068
> URL: https://issues.apache.org/jira/browse/LUCENE-10068
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/facet
>Affects Versions: 8.8.1
>Reporter: Gautam Worah
>Priority: Minor
> Attachments: disable_taxo_category_cache_benchmark
>
>
> While working on an unrelated getBulkPath API 
> [PR|https://github.com/apache/lucene/pull/179], [~mikemccand] and I came 
> across a nice optimization that could be made to the taxonomy cache.
> The taxonomy cache today caches frequently used ordinals and their 
> corresponding FacetLabels. It uses the existing LRUHashMap (backed by a 
> LinkedList) class for its implementation.
> This implementation performs sub optimally when it has a large number of 
> threads accessing it, and consumes a large amount of RAM.
> [~mikemccand] suggested the idea of a two array backed HPPC int->FacetLabel 
> cache. The basic idea behind the cache being:
>  # We use two hashmaps primary and secondary.
>  # In case of a cache miss in the primary and a cache hit in the secondary, 
> we add the key to the primary map as well.
>  # In case of a cache miss in both the maps, we add it to the primary map.
>  # When we reach (make this check each time we insert?) a large number of 
> elements in say the primary cache, (say larger than the existing 
> {color:#871094}DEFAULT_CACHE_VALUE{color}=4000), we dump the secondary map 
> and copy all the values of the primary map into it.
> The idea was originally explained in 
> [this|https://github.com/apache/lucene/pull/179#discussion_r692907559] 
> comment.
>  
>  
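The four steps above can be sketched as a small generic class. `DoubleBarrelCache` and the `maxSize` parameter are illustrative names (4000 mirrors the `DEFAULT_CACHE_VALUE` mentioned in the description); this is a sketch of the idea, not Lucene code:

```java
import java.util.HashMap;
import java.util.Map;

// Hedged sketch of the two-map ("double barrel") cache described above.
class DoubleBarrelCache<K, V> {
  private final int maxSize;
  private Map<K, V> primary = new HashMap<>();
  private Map<K, V> secondary = new HashMap<>();

  DoubleBarrelCache(int maxSize) { // e.g. new DoubleBarrelCache<>(4000)
    this.maxSize = maxSize;
  }

  synchronized V get(K key) {
    V value = primary.get(key);
    if (value == null) {
      value = secondary.get(key);
      if (value != null) {
        primary.put(key, value); // step 2: promote a secondary hit into primary
      }
    }
    return value; // null on a miss in both maps
  }

  synchronized void put(K key, V value) {
    primary.put(key, value); // step 3: misses go into the primary map
    if (primary.size() >= maxSize) {
      // step 4: dump the old secondary and demote the full primary, so
      // recently used entries survive exactly one more "generation"
      secondary = primary;
      primary = new HashMap<>();
    }
  }
}
```

Eviction is amortized: instead of a linked list tracking recency per access (as in `LRUHashMap`), stale entries are discarded a whole barrel at a time.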






[jira] [Comment Edited] (LUCENE-10068) Switch to a "double barrel" HPPC cache for the taxonomy LRU cache

2021-08-31 Thread Gautam Worah (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17407758#comment-17407758
 ] 

Gautam Worah edited comment on LUCENE-10068 at 9/1/21, 2:11 AM:


I was initially trying to benchmark the hit rate of the category cache, but 
[~mikemccand] suggested that I simply disable it and see if it affects 
benchmarks. 
[Here|https://github.com/gautamworah96/lucene/tree/testtaxocachehitrate] is a 
branch that does that.

Full results are attached to this JIRA issue as a file.

TL;DR: we don't see any regression. We could either increase the cache size 
(maybe to 10k) and experiment again, or just remove the cache entirely 
(preferred).

Makes me wonder whether the other taxonomy cache, the ordinal cache, is needed 
at all :/









was (Author: gworah):
I was initially trying to benchmark the hit rate of the category cache, but 
[~mikemccand] suggested that I simply disable it and see if it affects 
benchmarks. 
[Here|https://github.com/gautamworah96/lucene/tree/testtaxocachehitrate] is a 
branch that does that.

Full results are attached to this JIRA issue as a file.

TL;DR: we don't see any regression. We could either increase the cache size 
(maybe to 10k) and experiment again, or just remove the cache entirely 
(preferred).

Makes me wonder whether the ordinal cache is needed at all :/








> Switch to a "double barrel" HPPC cache for the taxonomy LRU cache
> -
>
> Key: LUCENE-10068
> URL: https://issues.apache.org/jira/browse/LUCENE-10068
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/facet
>Affects Versions: 8.8.1
>Reporter: Gautam Worah
>Priority: Minor
> Attachments: disable_taxo_category_cache_benchmark
>
>
> While working on an unrelated getBulkPath API 
> [PR|https://github.com/apache/lucene/pull/179], [~mikemccand] and I came 
> across a nice optimization that could be made to the taxonomy cache.
> The taxonomy cache today caches frequently used ordinals and their 
> corresponding FacetLabels. It uses the existing LRUHashMap (backed by a 
> LinkedList) class for its implementation.
> This implementation performs sub optimally when it has a large number of 
> threads accessing it, and consumes a large amount of RAM.
> [~mikemccand] suggested the idea of a two array backed HPPC int->FacetLabel 
> cache. The basic idea behind the cache being:
>  # We use two hashmaps primary and secondary.
>  # In case of a cache miss in the primary and a cache hit in the secondary, 
> we add the key to the primary map as well.
>  # In case of a cache miss in both the maps, we add it to the primary map.
>  # When we reach (make this check each time we insert?) a large number of 
> elements in say the primary cache, (say larger than the existing 
> {color:#871094}DEFAULT_CACHE_VALUE{color}=4000), we dump the secondary map 
> and copy all the values of the primary map into it.
> The idea was originally explained in 
> [this|https://github.com/apache/lucene/pull/179#discussion_r692907559] 
> comment.
>  
>  






[jira] [Commented] (LUCENE-10068) Switch to a "double barrel" HPPC cache for the taxonomy LRU cache

2021-08-31 Thread Gautam Worah (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17407758#comment-17407758
 ] 

Gautam Worah commented on LUCENE-10068:
---

I was initially trying to benchmark the hit rate of the category cache, but 
[~mikemccand] suggested that I simply disable it and see if it affects 
benchmarks. 
[Here|https://github.com/gautamworah96/lucene/tree/testtaxocachehitrate] is a 
branch that does that.

Full results are attached to this JIRA issue as a file.

TL;DR: we don't see any regression. We could either increase the cache size 
(maybe to 10k) and experiment again, or just remove the cache entirely 
(preferred).

Makes me wonder whether the ordinal cache is needed at all :/








> Switch to a "double barrel" HPPC cache for the taxonomy LRU cache
> -
>
> Key: LUCENE-10068
> URL: https://issues.apache.org/jira/browse/LUCENE-10068
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/facet
>Affects Versions: 8.8.1
>Reporter: Gautam Worah
>Priority: Minor
> Attachments: disable_taxo_category_cache_benchmark
>
>
> While working on an unrelated getBulkPath API 
> [PR|https://github.com/apache/lucene/pull/179], [~mikemccand] and I came 
> across a nice optimization that could be made to the taxonomy cache.
> The taxonomy cache today caches frequently used ordinals and their 
> corresponding FacetLabels. It uses the existing LRUHashMap (backed by a 
> LinkedList) class for its implementation.
> This implementation performs sub optimally when it has a large number of 
> threads accessing it, and consumes a large amount of RAM.
> [~mikemccand] suggested the idea of a two array backed HPPC int->FacetLabel 
> cache. The basic idea behind the cache being:
>  # We use two hashmaps primary and secondary.
>  # In case of a cache miss in the primary and a cache hit in the secondary, 
> we add the key to the primary map as well.
>  # In case of a cache miss in both the maps, we add it to the primary map.
>  # When we reach (make this check each time we insert?) a large number of 
> elements in say the primary cache, (say larger than the existing 
> {color:#871094}DEFAULT_CACHE_VALUE{color}=4000), we dump the secondary map 
> and copy all the values of the primary map into it.
> The idea was originally explained in 
> [this|https://github.com/apache/lucene/pull/179#discussion_r692907559] 
> comment.
>  
>  






[GitHub] [lucene] gautamworah96 commented on a change in pull request #179: LUCENE-9476: Add getBulkPath API to DirectoryTaxonomyReader

2021-08-31 Thread GitBox


gautamworah96 commented on a change in pull request #179:
URL: https://github.com/apache/lucene/pull/179#discussion_r699792786



##
File path: 
lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyReader.java
##
@@ -351,12 +348,140 @@ public FacetLabel getPath(int ordinal) throws 
IOException {
 }
 
 synchronized (categoryCache) {
-  categoryCache.put(catIDInteger, ret);
+  categoryCache.put(ordinal, ret);
 }
 
 return ret;
   }
 
+  private FacetLabel[] getPathFromCache(int... ordinals) {
+FacetLabel[] facetLabels = new FacetLabel[ordinals.length];
+// TODO LUCENE-10068: can we use an int-based hash impl, such as 
IntToObjectMap,
+// wrapped as LRU?
+synchronized (categoryCache) {
+  for (int i = 0; i < ordinals.length; i++) {
+facetLabels[i] = categoryCache.get(ordinals[i]);
+  }
+}
+return facetLabels;
+  }
+
+  /**
+   * Checks if the ordinals in the array are >=0 and < {@code
+   * DirectoryTaxonomyReader#indexReader.maxDoc()}
+   *
+   * @param ordinals Integer array of ordinals
+   * @throws IllegalArgumentException if one of the ordinals is out of bounds
+   */
+  private void checkOrdinalBounds(int... ordinals) throws 
IllegalArgumentException {
+for (int ordinal : ordinals) {
+  if (ordinal < 0 || ordinal >= indexReader.maxDoc()) {
+throw new IllegalArgumentException(
+"ordinal "
++ ordinal
++ " is out of the range of the indexReader "
++ indexReader.toString()
++ ". The maximum possible ordinal number is "
++ (indexReader.maxDoc() - 1));
+  }
+}
+  }
+
+  /**
+   * Returns an array of FacetLabels for a given array of ordinals.
+   *
+   * This API is generally faster than iteratively calling {@link 
#getPath(int)} over an array of
+   * ordinals. It uses the {@link #getPath(int)} method iteratively when it 
detects that the index
+   * was created using StoredFields (with no performance gains) and uses 
DocValues based iteration
+   * when the index is based on BinaryDocValues. Lucene switched to 
BinaryDocValues in version 9.0
+   *
+   * @param ordinals Array of ordinals that are assigned to categories 
inserted into the taxonomy
+   * index
+   */
+  @Override
+  public FacetLabel[] getBulkPath(int... ordinals) throws IOException {
+ensureOpen();
+checkOrdinalBounds(ordinals);
+
+int ordinalsLength = ordinals.length;
+FacetLabel[] bulkPath = new FacetLabel[ordinalsLength];
+// remember the original positions of ordinals before they are sorted
+int[] originalPosition = new int[ordinalsLength];
+Arrays.setAll(originalPosition, IntUnaryOperator.identity());
+
+getPathFromCache(ordinals);

Review comment:
   Benchmarks show that the category cache is not very effective. Looks 
like we might indeed have to remove it







[GitHub] [lucene] msokolov commented on pull request #273: LUCENE-10063: test fixes relating to SimpleTextKnnVectorsReader

2021-08-31 Thread GitBox


msokolov commented on pull request #273:
URL: https://github.com/apache/lucene/pull/273#issuecomment-909813740


   I ran all tests with `-Dtests.codec=SimpleText` and I ran 
`TestSimpleTextKnnVectorsFormat` 100 times





[GitHub] [lucene] zacharymorn merged pull request #128: LUCENE-9662: CheckIndex should be concurrent - parallelizing index check across segments

2021-08-31 Thread GitBox


zacharymorn merged pull request #128:
URL: https://github.com/apache/lucene/pull/128


   





[jira] [Commented] (LUCENE-9662) CheckIndex should be concurrent

2021-08-31 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17407763#comment-17407763
 ] 

ASF subversion and git services commented on LUCENE-9662:
-

Commit 424192e1704664dc0ebc55109feaad5990b945cb in lucene's branch 
refs/heads/main from zacharymorn
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=424192e ]

LUCENE-9662: CheckIndex should be concurrent  - parallelizing index check 
across segments (#128)



> CheckIndex should be concurrent
> ---
>
> Key: LUCENE-9662
> URL: https://issues.apache.org/jira/browse/LUCENE-9662
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Michael McCandless
>Priority: Major
>  Time Spent: 16h 40m
>  Remaining Estimate: 0h
>
> I am watching a nightly benchmark run slowly run its {{CheckIndex}} step, 
> using a single core out of the 128 cores the box has.
> It seems like this is an embarrassingly parallel problem, if the index has 
> multiple segments, and would finish much more quickly on concurrent hardware 
> if we did "thread per segment".
> If wanted to get even further concurrency, each part of the Lucene index that 
> is checked is also independent, so it could be "thread per segment per part".
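The "thread per segment" idea above is embarrassingly parallel because segments are independent. A minimal sketch with an executor follows; `ParallelCheck`, `checkSegment`, and the `String` segment names are illustrative placeholders, not the real `CheckIndex` API:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hedged sketch of thread-per-segment index checking.
class ParallelCheck {
  static String checkSegment(String segmentName) {
    // stand-in for the per-segment validation work (postings, doc values, ...)
    return segmentName + ": OK";
  }

  static List<String> checkAll(List<String> segments, int threads) throws Exception {
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    try {
      List<Future<String>> futures = new ArrayList<>();
      for (String segment : segments) {
        futures.add(pool.submit(() -> checkSegment(segment)));
      }
      List<String> results = new ArrayList<>();
      for (Future<String> future : futures) {
        results.add(future.get()); // propagates any per-segment failure
      }
      return results; // one result per segment, in the original order
    } finally {
      pool.shutdown();
    }
  }
}
```

Collecting the futures in submission order keeps the report deterministic even though the segments are checked concurrently.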






[GitHub] [lucene] msokolov commented on a change in pull request #267: LUCENE-10054 Handle hierarchy in graph construction and search

2021-08-31 Thread GitBox


msokolov commented on a change in pull request #267:
URL: https://github.com/apache/lucene/pull/267#discussion_r699806490



##
File path: lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraph.java
##
@@ -107,32 +113,82 @@ public static NeighborQueue search(
   Random random)
   throws IOException {
 int size = graphValues.size();
+int boundedNumSeed = Math.min(numSeed, 2 * size);
+NeighborQueue results;
+
+if (graphValues.maxLevel() == 0) {
+  // search in SNW; generate a number of entry points randomly
+  final int[] eps = new int[boundedNumSeed];
+  for (int i = 0; i < boundedNumSeed; i++) {
+eps[i] = random.nextInt(size);
+  }
+  return searchLevel(query, topK, 0, eps, vectors, similarityFunction, 
graphValues, acceptOrds);
+} else {
+  // search in hierarchical SNW

Review comment:
   I notice you use `SNW` throughout, but elsewhere `HNSW` -- should we 
refer to `NSW` (navigable small-world) graphs?

##
File path: lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraph.java
##
@@ -107,32 +113,82 @@ public static NeighborQueue search(
   Random random)
   throws IOException {
 int size = graphValues.size();
+int boundedNumSeed = Math.min(numSeed, 2 * size);
+NeighborQueue results;
+
+if (graphValues.maxLevel() == 0) {
+  // search in SNW; generate a number of entry points randomly
+  final int[] eps = new int[boundedNumSeed];
+  for (int i = 0; i < boundedNumSeed; i++) {
+eps[i] = random.nextInt(size);
+  }
+  return searchLevel(query, topK, 0, eps, vectors, similarityFunction, 
graphValues, acceptOrds);
+} else {
+  // search in hierarchical SNW
+  int[] eps = new int[] {graphValues.entryNode()};
+  for (int level = graphValues.maxLevel(); level >= 1; level--) {
+results =
+HnswGraph.searchLevel(
+query, 1, level, eps, vectors, similarityFunction, 
graphValues, acceptOrds);
+eps = new int[] {results.pop()};
+  }
+  results =
+  HnswGraph.searchLevel(
+  query, boundedNumSeed, 0, eps, vectors, similarityFunction, 
graphValues, acceptOrds);
+  while (results.size() > topK) {
+results.pop();
+  }
+  return results;
+}
+  }
 
+  /**
+   * Searches for the nearest neighbors of a query vector in a given level
+   *
+   * @param query search query vector
+   * @param topK the number of nearest to query results to return

Review comment:
   Currently topK is always ==eps.length; I wonder if we need a topK 
parameter to searchLevel?

##
File path: lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraph.java
##
@@ -107,32 +113,82 @@ public static NeighborQueue search(
   Random random)
   throws IOException {
 int size = graphValues.size();
+int boundedNumSeed = Math.min(numSeed, 2 * size);
+NeighborQueue results;
+
+if (graphValues.maxLevel() == 0) {
+  // search in SNW; generate a number of entry points randomly
+  final int[] eps = new int[boundedNumSeed];
+  for (int i = 0; i < boundedNumSeed; i++) {
+eps[i] = random.nextInt(size);

Review comment:
   we don't want repeats here, I think? At least, we don't allow them in 
the current NSW impl.

##
File path: 
lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphBuilder.java
##
@@ -146,20 +168,72 @@ void addGraphNode(int node, float[] value) throws 
IOException {
  * nearest neighbors that are closer to the new node than they are to the 
previously-selected
  * neighbors
  */
-addDiverseNeighbors(node, candidates);
+addDiverseNeighbors(0, node, candidates);
+  }
+
+  // build hierarchical navigable small world graph (multi-layered)
+  void buildHNSW(RandomAccessVectorValues vectors) throws IOException {
+long start = System.nanoTime(), t = start;
+// start at node 1! node 0 is added implicitly, in the constructor
+for (int node = 1; node < vectors.size(); node++) {
+  addGraphNodeHNSW(node, vectors.vectorValue(node));
+  if (node % 1 == 0) {

Review comment:
   can we refactor and share with the other place we do this?

##
File path: lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraph.java
##
@@ -188,15 +244,28 @@ public int size() {
   }
 
   // TODO: optimize RAM usage so not to store references for all nodes for 
levels > 0
+  // TODO: add extra levels if level >= numLevels
   public void addNode(int level, int node) {
 if (level > 0) {
+  // if the new node introduces a new level, make this node the graph's 
new entry point
+  if (level > curMaxLevel) {
+curMaxLevel = level;
+entryNode = node;
+// add more levels if needed
+if (level >= graph.size()) {

Review comment:
   Wait - what does `graph.size()` mean here? Is it the number of nodes in 
level 0? Or the number of levels? Oh I r

[GitHub] [lucene] zacharymorn commented on pull request #128: LUCENE-9662: CheckIndex should be concurrent - parallelizing index check across segments

2021-08-31 Thread GitBox


zacharymorn commented on pull request #128:
URL: https://github.com/apache/lucene/pull/128#issuecomment-909872219


   Hi @mikemccand, I've merged this PR and will wait for an update on the 
[nightly check index time 
page](https://home.apache.org/~mikemccand/lucenebench/checkIndexTime.html). 
Once the result there looks good, I believe we should backport this change to 
8x as well?





[GitHub] [lucene] zacharymorn commented on a change in pull request #240: LUCENE-10002: Deprecate IndexSearch#search(Query, Collector) in favor of IndexSearcher#search(Query, CollectorManager)

2021-08-31 Thread GitBox


zacharymorn commented on a change in pull request #240:
URL: https://github.com/apache/lucene/pull/240#discussion_r699834141



##
File path: lucene/core/src/java/org/apache/lucene/search/TopFieldCollector.java
##
@@ -407,97 +410,14 @@ public static TopFieldCollector create(Sort sort, int 
numHits, int totalHitsThre
* field is indexed both with doc values and points. In this case, there 
is an assumption that
* the same data is stored in these points and doc values.
* @return a {@link TopFieldCollector} instance which will sort the results 
by the sort criteria.
+   * @deprecated This method is being deprecated in favor of using the 
constructor of {@link
+   * TopFieldCollectorManager} due to its support for concurrency in 
IndexSearcher
*/
+  @Deprecated
   public static TopFieldCollector create(
   Sort sort, int numHits, FieldDoc after, int totalHitsThreshold) {
-if (totalHitsThreshold < 0) {
-  throw new IllegalArgumentException(
-  "totalHitsThreshold must be >= 0, got " + totalHitsThreshold);
-}
-
-return create(
-sort,
-numHits,
-after,
-HitsThresholdChecker.create(Math.max(totalHitsThreshold, numHits)),
-null /* bottomValueChecker */);
-  }
-
-  /**
-   * Same as above with additional parameters to allow passing in the 
threshold checker and the max
-   * score accumulator.
-   */
-  static TopFieldCollector create(

Review comment:
   Sorry, I just realized I missed a few comments earlier (they were collapsed 
in the GitHub UI). I think this one and the one from `TopScoreDocCollector.java` 
should be safe to remove directly even when backporting to 8.x, as they are 
package-private and so no users should be relying on them (assuming we don't 
support hacky access to these methods such as reflection or bytecode 
manipulation)?
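The migration mechanics being discussed follow the usual deprecate-and-delegate pattern: the old static factory keeps compiling for one major version but forwards to the replacement, so there is a single code path and identical behavior. A generic sketch under assumed names (`Legacy` and `Manager` are hypothetical, not the Lucene classes):

```java
// Sketch of deprecate-and-delegate: the old entry point survives one major
// version, forwarding to the replacement so behavior stays identical.
class Manager {
  final int numHits;

  Manager(int numHits) {
    this.numHits = numHits;
  }

  int newCollector() {
    return numHits; // stand-in for creating the real collector
  }
}

class Legacy {
  /** @deprecated use {@link Manager} instead */
  @Deprecated
  static int create(int numHits) {
    return new Manager(numHits).newCollector(); // single source of truth
  }
}
```

Callers of `Legacy.create` keep working unchanged until the deprecated method is removed in the next major release.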

##
File path: 
lucene/core/src/java/org/apache/lucene/search/TopScoreDocCollector.java
##
@@ -192,9 +191,13 @@ public void collect(int doc) throws IOException {
*
* NOTE: The instances returned by this method pre-allocate a full 
array of length
* numHits, and fill the array with sentinel objects.
+   *
+   * @deprecated This method is being deprecated in favor of using the 
constructor of {@link
+   * TopScoreDocCollectorManager} due to its support for concurrency in 
IndexSearcher
*/
+  @Deprecated

Review comment:
   Resolved from the discussion in 
https://github.com/apache/lucene/pull/240#discussion_r692101093. 







[GitHub] [lucene] zacharymorn commented on a change in pull request #240: LUCENE-10002: Deprecate IndexSearch#search(Query, Collector) in favor of IndexSearcher#search(Query, CollectorManager)

2021-08-31 Thread GitBox


zacharymorn commented on a change in pull request #240:
URL: https://github.com/apache/lucene/pull/240#discussion_r699834253



##
File path: 
lucene/core/src/java/org/apache/lucene/search/TopScoreDocCollector.java
##
@@ -209,61 +212,14 @@ public static TopScoreDocCollector create(int numHits, 
int totalHitsThreshold) {
*
* NOTE: The instances returned by this method pre-allocate a full 
array of length
* numHits, and fill the array with sentinel objects.
+   *
+   * @deprecated This method is being deprecated in favor of using the 
constructor of {@link
+   * TopScoreDocCollectorManager} due to its support for concurrency in 
IndexSearcher
*/
+  @Deprecated
   public static TopScoreDocCollector create(int numHits, ScoreDoc after, int 
totalHitsThreshold) {
-return create(
-numHits, after, 
HitsThresholdChecker.create(Math.max(totalHitsThreshold, numHits)), null);
-  }
-
-  static TopScoreDocCollector create(

Review comment:
   Please see reply in 
https://github.com/apache/lucene/pull/240#discussion_r699834141.

##
File path: 
lucene/core/src/java/org/apache/lucene/search/TopScoreDocCollector.java
##
@@ -209,61 +212,14 @@ public static TopScoreDocCollector create(int numHits, 
int totalHitsThreshold) {
*
* NOTE: The instances returned by this method pre-allocate a full 
array of length
* numHits, and fill the array with sentinel objects.
+   *
+   * @deprecated This method is being deprecated in favor of using the 
constructor of {@link
+   * TopScoreDocCollectorManager} due to its support for concurrency in 
IndexSearcher
*/
+  @Deprecated
   public static TopScoreDocCollector create(int numHits, ScoreDoc after, int 
totalHitsThreshold) {
-return create(
-numHits, after, 
HitsThresholdChecker.create(Math.max(totalHitsThreshold, numHits)), null);
-  }
-
-  static TopScoreDocCollector create(

Review comment:
   Please see reply in  
https://github.com/apache/lucene/pull/240#discussion_r699834141.







[jira] [Commented] (LUCENE-8723) Bad interaction bewteen WordDelimiterGraphFilter, StopFilter and FlattenGraphFilter

2021-08-31 Thread Geoffrey Lawson (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17407830#comment-17407830
 ] 

Geoffrey Lawson commented on LUCENE-8723:
-

I know SynonymGraphFilter still has issues consuming graphs, so we can't 
deprecate the non-graph version yet. The change made for this issue was mostly 
about how `FlattenGraphFilter` operates on graphs after consuming them. It would 
be interesting if the graph-reading component of `FlattenGraphFilter` could be 
extracted for other filters to reuse when reading graphs. Correctly handling 
holes across use cases may be complicated, though.
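The "holes" mentioned here arise because deleting a token (as a stop filter does) must not shift positions; instead, the deleted token's position increment is carried onto the next surviving token. A simplified standalone sketch of that bookkeeping (plain Java; `Token` and `StopSketch` are illustrative stand-ins, not Lucene's attribute-based TokenStream API):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the "hole" bookkeeping a stop filter performs: removing a token
// leaves a gap in the position sequence, recorded as an extra position
// increment on the token that follows.
class StopSketch {
  record Token(String term, int posInc) {}

  static List<Token> removeStopWords(List<Token> in, List<String> stops) {
    List<Token> out = new ArrayList<>();
    int pendingInc = 0; // increments carried over holes
    for (Token t : in) {
      if (stops.contains(t.term())) {
        pendingInc += t.posInc(); // leave a hole at this position
      } else {
        out.add(new Token(t.term(), t.posInc() + pendingInc));
        pendingInc = 0;
      }
    }
    return out;
  }
}
```

In this model, filtering "quick the fox" yields "quick" (posInc 1) followed by "fox" (posInc 2); downstream filters like a flattener must keep such gapped increments consistent, which is what makes holes tricky.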

> Bad interaction bewteen WordDelimiterGraphFilter, StopFilter and 
> FlattenGraphFilter
> ---
>
> Key: LUCENE-8723
> URL: https://issues.apache.org/jira/browse/LUCENE-8723
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/analysis
>Affects Versions: 7.7.1, 8.0, 8.3
>Reporter: Nicolás Lichtmaier
>Priority: Major
> Fix For: main (9.0), 8.10
>
>
> I was debugging an issue (missing tokens after analysis) and when I enabled 
> Java assertions I uncovered a bug when using WordDelimiterGraphFilter + 
> StopFilter + FlattenGraphFilter.
> I could reproduce the issue in a small piece of code. This code gives an 
> assertion failure when assertions are enabled (-ea java option):
> {code:java}
>     Builder builder = CustomAnalyzer.builder();
>     builder.withTokenizer(StandardTokenizerFactory.class);
>     builder.addTokenFilter(WordDelimiterGraphFilterFactory.class, 
> "preserveOriginal", "1");
>     builder.addTokenFilter(StopFilterFactory.class);
>     builder.addTokenFilter(FlattenGraphFilterFactory.class);
>     Analyzer analyzer = builder.build();
>      
>     TokenStream ts = analyzer.tokenStream("*", new StringReader("x7in"));
>     ts.reset();
>     while(ts.incrementToken())
>         ;
> {code}
> This gives:
> {code}
> Exception in thread "main" java.lang.AssertionError: 2
>      at 
> org.apache.lucene.analysis.core.FlattenGraphFilter.releaseBufferedToken(FlattenGraphFilter.java:195)
>      at 
> org.apache.lucene.analysis.core.FlattenGraphFilter.incrementToken(FlattenGraphFilter.java:258)
>      at com.wolfram.textsearch.AnalyzerError.main(AnalyzerError.java:32)
> {code}
> Maybe removing stop words after WordDelimiterGraphFilter is wrong, I don't 
> know. However, it is the only way to process stop words generated by that 
> filter. In any case, it should not eat tokens or produce assertions. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8740) AssertionError FlattenGraphFilter

2021-08-31 Thread Geoffrey Lawson (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17407845#comment-17407845
 ] 

Geoffrey Lawson commented on LUCENE-8740:
-

Hello [~markus17], I'm checking whether the recent change to 
`FlattenGraphFilter` has resolved this issue. I can confirm that without the fix 
the attached test fails with an empty assertion error from 
`FlattenGraphFilter.releaseBufferedToken`; with the fix I no longer get this 
error. The test still fails because the output of MinHashFilter is a hash while 
the test expects text, but I believe the underlying issue is resolved.

> AssertionError FlattenGraphFilter
> -
>
> Key: LUCENE-8740
> URL: https://issues.apache.org/jira/browse/LUCENE-8740
> Project: Lucene - Core
>  Issue Type: Bug
>Affects Versions: 7.5, 8.0
>Reporter: Markus Jelsma
>Priority: Major
> Fix For: 8.1, main (9.0)
>
> Attachments: LUCENE-8740.patch
>
>
> Our unit tests picked up an unusual AssertionError in FlattenGraphFilter 
> which manifests itself only in very specific circumstances involving 
> WordDelimiterGraph, StopFilter, FlattenGraphFilter and MinhashFilter.


