[GitHub] [lucene] wuda0112 commented on pull request #224: LUCENE-10035: Simple text codec add multi level skip list data

2021-08-26 Thread GitBox


wuda0112 commented on pull request #224:
URL: https://github.com/apache/lucene/pull/224#issuecomment-906934607


   Now, no binary content gets written to the file!





[GitHub] [lucene] zacharymorn commented on a change in pull request #240: LUCENE-10002: Deprecate IndexSearch#search(Query, Collector) in favor of IndexSearcher#search(Query, CollectorManager)

2021-08-26 Thread GitBox


zacharymorn commented on a change in pull request #240:
URL: https://github.com/apache/lucene/pull/240#discussion_r697155622



##
File path: lucene/benchmark/src/java/org/apache/lucene/benchmark/byTask/tasks/ReadTask.java
##
@@ -180,6 +185,7 @@ protected int withTopDocs(IndexSearcher searcher, Query q, TopDocs hits) throws
 return res;
   }
 
+  @Deprecated

Review comment:
   I see where you are coming from. However, upon looking at this method more closely, I'm afraid it is effectively not useful, since the result of using this collector was commented out :D :
   
   https://github.com/apache/lucene/blob/3b3f9600c2ea6023f5400a364c0921ba29667584/lucene/benchmark/src/java/org/apache/lucene/benchmark/byTask/tasks/ReadTask.java#L119-L121
   
   So I'm now leaning more towards just removing this method altogether, if no users have ever noticed / complained about this. What do you think @gsmiller @jpountz ?







[GitHub] [lucene] zacharymorn commented on a change in pull request #240: LUCENE-10002: Deprecate IndexSearch#search(Query, Collector) in favor of IndexSearcher#search(Query, CollectorManager)

2021-08-26 Thread GitBox


zacharymorn commented on a change in pull request #240:
URL: https://github.com/apache/lucene/pull/240#discussion_r697150848



##
File path: lucene/core/src/java/org/apache/lucene/search/TopScoreDocCollectorManager.java
##
@@ -0,0 +1,142 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.search;
+
+import java.io.IOException;
+import java.util.Collection;
+
+/**
+ * Create a TopScoreDocCollectorManager which uses a shared hit counter to maintain number of hits
+ * and a shared {@link MaxScoreAccumulator} to propagate the minimum score across segments
+ *
+ * Note that a new collectorManager should be created for each search due to its internal states.
+ */
+public class TopScoreDocCollectorManager
+    implements CollectorManager<TopScoreDocCollector, TopDocs> {
+  private final int numHits;
+  private final ScoreDoc after;
+  private final HitsThresholdChecker hitsThresholdChecker;
+  private final MaxScoreAccumulator minScoreAcc;
+
+  /**
+   * Creates a new {@link TopScoreDocCollectorManager} given the number of hits to collect and the
+   * number of hits to count accurately.
+   *
+   * NOTE: If the total hit count of the top docs is less than or exactly {@code
+   * totalHitsThreshold} then this value is accurate. On the other hand, if the {@link
+   * TopDocs#totalHits} value is greater than {@code totalHitsThreshold} then its value is a lower
+   * bound of the hit count. A value of {@link Integer#MAX_VALUE} will make the hit count accurate
+   * but will also likely make query processing slower.
+   *
+   * NOTE: The instances returned by this method pre-allocate a full array of length
+   * numHits, and fill the array with sentinel objects.
+   *
+   * @param numHits the number of results to collect.
+   * @param after the previous doc after which matching docs will be collected.
+   * @param totalHitsThreshold the number of docs to count accurately. If the query matches more
+   *     than {@code totalHitsThreshold} hits then its hit count will be a lower bound. On the other
+   *     hand if the query matches less than or exactly {@code totalHitsThreshold} hits then the hit
+   *     count of the result will be accurate. {@link Integer#MAX_VALUE} may be used to make the hit
+   *     count accurate, but this will also make query processing slower.
+   * @param supportsConcurrency to use thread-safe and slower internal states for count tracking.
+   */
+  public TopScoreDocCollectorManager(
+      int numHits, ScoreDoc after, int totalHitsThreshold, boolean supportsConcurrency) {

Review comment:
   I think for the purpose of ensuring correct / thread-safe usage of the Lucene-provided `TopXXXCollectorManager` classes and the `IndexSearcher#search` API, this would probably be the simplest and yet most effective solution, without any special change needed from users.
   
   The other scenarios you are considering, such as creating multiple collectors to run sequentially, will probably only come up when users need to do something special with these classes and APIs. For that, users can still achieve it via various means, such as providing their own custom implementation of `CollectorManager`, overriding the existing method from one of the `TopXXXCollectorManager` classes to remove the check, or simply using `TopXXXCollector` instead of `CollectorManager` and skipping the `reduce` step. So users should still have the freedom to customize these APIs to tailor them to their special needs?
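   
   To make that escape hatch concrete, here is a minimal sketch of a user-side `CollectorManager` that deliberately reuses a single collector across sequential searches (the class name is illustrative; this is not code from this PR):
   
   ```java
   import java.io.IOException;
   import java.util.Collection;
   import org.apache.lucene.search.CollectorManager;
   import org.apache.lucene.search.TopDocs;
   import org.apache.lucene.search.TopScoreDocCollector;
   
   // Hypothetical user-side manager: hands back the same collector every time,
   // bypassing any "new collector per search" expectation. Single-threaded only.
   public class SingleCollectorManager
       implements CollectorManager<TopScoreDocCollector, TopDocs> {
     private final TopScoreDocCollector collector;
   
     public SingleCollectorManager(int numHits) {
       // one collector, created up front and reused for every sequential search
       this.collector = TopScoreDocCollector.create(numHits, Integer.MAX_VALUE);
     }
   
     @Override
     public TopScoreDocCollector newCollector() {
       return collector; // reused on purpose; only safe without concurrency
     }
   
     @Override
     public TopDocs reduce(Collection<TopScoreDocCollector> collectors) throws IOException {
       return collector.topDocs();
     }
   }
   ```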







[GitHub] [lucene] zacharymorn commented on a change in pull request #128: LUCENE-9662: CheckIndex should be concurrent - parallelizing index check across segments

2021-08-26 Thread GitBox


zacharymorn commented on a change in pull request #128:
URL: https://github.com/apache/lucene/pull/128#discussion_r697141222



##
File path: lucene/test-framework/src/java/org/apache/lucene/store/MockDirectoryWrapper.java
##
@@ -895,7 +895,11 @@ public synchronized void close() throws IOException {
  System.out.println("\nNOTE: MockDirectoryWrapper: now run CheckIndex");
   }
 
-  TestUtil.checkIndex(this, getCrossCheckTermVectorsOnClose(), true, null);
+  // Methods in MockDirectoryWrapper hold locks on this, which will cause a deadlock when
+  // TestUtil#checkIndex checks segments concurrently using another thread while making
+  // calls back into synchronized methods such as MockDirectoryWrapper#fileLength.
+  // Hence we pass concurrent = false to this method to turn off concurrent checks.
+  TestUtil.checkIndex(this, getCrossCheckTermVectorsOnClose(), true, false, null);

Review comment:
   > Maybe open a follow-on issue to fix this sync situation so that we could, randomly, sometimes use concurrency in CheckIndex from tests? 
   
   Sounds good. I've created https://issues.apache.org/jira/browse/LUCENE-10071 for following up on this.
   
   > Maybe we could start by making some of the TestUtil.checkIndex calls use concurrency, just not the one that MDW invokes?
   
   Yes, this is already done in https://github.com/apache/lucene/pull/128/commits/138b72e9f2512df257c9acf01516bd071c9fb1d4, where `CheckIndex#checkIndex` got a new parameter `concurrent`, and most of the invocations except the one in MDW pass in `true` to enable concurrency.
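   
   For reference, the deadlock shape being avoided looks roughly like this (a generic sketch, not the actual MDW/CheckIndex code):
   
   ```java
   import java.util.concurrent.ExecutorService;
   import java.util.concurrent.Executors;
   
   // Illustrative only: close() holds the object monitor and waits on a worker,
   // while the worker blocks trying to enter another synchronized method.
   class MonitorDeadlock {
     synchronized long fileLength() {
       return 42; // worker needs the monitor that close() already holds
     }
   
     synchronized void close() throws Exception {
       ExecutorService pool = Executors.newSingleThreadExecutor();
       try {
         // blocks forever: the submitted task can never acquire the monitor
         pool.submit(this::fileLength).get();
       } finally {
         pool.shutdownNow();
       }
     }
   }
   ```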







[GitHub] [lucene] zacharymorn commented on a change in pull request #128: LUCENE-9662: CheckIndex should be concurrent - parallelizing index check across segments

2021-08-26 Thread GitBox


zacharymorn commented on a change in pull request #128:
URL: https://github.com/apache/lucene/pull/128#discussion_r697140979



##
File path: lucene/core/src/java/org/apache/lucene/index/CheckIndex.java
##
@@ -605,209 +681,115 @@ public Status checkIndex(List<String> onlySegments) throws IOException {
 result.newSegments.clear();
 result.maxSegmentName = -1;
 
-for (int i = 0; i < numSegments; i++) {
-  final SegmentCommitInfo info = sis.info(i);
-  long segmentName = Long.parseLong(info.info.name.substring(1), Character.MAX_RADIX);
-  if (segmentName > result.maxSegmentName) {
-    result.maxSegmentName = segmentName;
-  }
-  if (onlySegments != null && !onlySegments.contains(info.info.name)) {
-    continue;
-  }
-  Status.SegmentInfoStatus segInfoStat = new Status.SegmentInfoStatus();
-  result.segmentInfos.add(segInfoStat);
-  msg(
-      infoStream,
-      "  "
-          + (1 + i)
-          + " of "
-          + numSegments
-          + ": name="
-          + info.info.name
-          + " maxDoc="
-          + info.info.maxDoc());
-  segInfoStat.name = info.info.name;
-  segInfoStat.maxDoc = info.info.maxDoc();
-
-  final Version version = info.info.getVersion();
-  if (info.info.maxDoc() <= 0) {
-    throw new RuntimeException("illegal number of documents: maxDoc=" + info.info.maxDoc());
-  }
-
-  int toLoseDocCount = info.info.maxDoc();
-
-  SegmentReader reader = null;
-
-  try {
-    msg(infoStream, "version=" + (version == null ? "3.0" : version));
-    msg(infoStream, "id=" + StringHelper.idToString(info.info.getId()));
-    final Codec codec = info.info.getCodec();
-    msg(infoStream, "codec=" + codec);
-    segInfoStat.codec = codec;
-    msg(infoStream, "compound=" + info.info.getUseCompoundFile());
-    segInfoStat.compound = info.info.getUseCompoundFile();
-    msg(infoStream, "numFiles=" + info.files().size());
-    Sort indexSort = info.info.getIndexSort();
-    if (indexSort != null) {
-      msg(infoStream, "sort=" + indexSort);
-    }
-    segInfoStat.numFiles = info.files().size();
-    segInfoStat.sizeMB = info.sizeInBytes() / (1024. * 1024.);
-    msg(infoStream, "size (MB)=" + nf.format(segInfoStat.sizeMB));
-    Map<String, String> diagnostics = info.info.getDiagnostics();
-    segInfoStat.diagnostics = diagnostics;
-    if (diagnostics.size() > 0) {
-      msg(infoStream, "diagnostics = " + diagnostics);
-    }
-
-    if (!info.hasDeletions()) {
-      msg(infoStream, "no deletions");
-      segInfoStat.hasDeletions = false;
-    } else {
-      msg(infoStream, "has deletions [delGen=" + info.getDelGen() + "]");
-      segInfoStat.hasDeletions = true;
-      segInfoStat.deletionsGen = info.getDelGen();
+// checks segments sequentially
+if (executorService == null) {
+  for (int i = 0; i < numSegments; i++) {
+    final SegmentCommitInfo info = sis.info(i);
+    updateMaxSegmentName(result, info);
+    if (onlySegments != null && !onlySegments.contains(info.info.name)) {
+      continue;
     }
 
-    long startOpenReaderNS = System.nanoTime();
-    if (infoStream != null) infoStream.print("test: open reader.");
-    reader = new SegmentReader(info, sis.getIndexCreatedVersionMajor(), IOContext.DEFAULT);
     msg(
         infoStream,
-        String.format(
-            Locale.ROOT, "OK [took %.3f sec]", nsToSec(System.nanoTime() - startOpenReaderNS)));
+        (1 + i)
+            + " of "
+            + numSegments
+            + ": name="
+            + info.info.name
+            + " maxDoc="
+            + info.info.maxDoc());
+    Status.SegmentInfoStatus segmentInfoStatus = testSegment(sis, info, infoStream);
+
+    processSegmentInfoStatusResult(result, info, segmentInfoStatus);
+  }
+} else {
+  ByteArrayOutputStream[] outputs = new ByteArrayOutputStream[numSegments];
+  @SuppressWarnings({"unchecked", "rawtypes"})
+  CompletableFuture[] futures = new CompletableFuture[numSegments];
+
+  // checks segments concurrently
+  List<SegmentCommitInfo> segmentCommitInfos = new ArrayList<>();
+  for (SegmentCommitInfo sci : sis) {
+    segmentCommitInfos.add(sci);
+  }
 
-    segInfoStat.openReaderPassed = true;
+  // sort segmentCommitInfos by segment size, as smaller segment tends to finish faster, and

Review comment:
   Sounds good!





[GitHub] [lucene] zacharymorn commented on a change in pull request #128: LUCENE-9662: CheckIndex should be concurrent - parallelizing index check across segments

2021-08-26 Thread GitBox


zacharymorn commented on a change in pull request #128:
URL: https://github.com/apache/lucene/pull/128#discussion_r697140951



##
File path: lucene/core/src/java/org/apache/lucene/index/CheckIndex.java
##
@@ -3622,6 +3860,7 @@ public static void main(String[] args) throws IOException, InterruptedException
 boolean doSlowChecks = false;

Review comment:
   I've created a spin-off issue, available here: https://issues.apache.org/jira/browse/LUCENE-10074.







[jira] [Created] (LUCENE-10074) Remove unneeded default value assignment

2021-08-26 Thread Zach Chen (Jira)
Zach Chen created LUCENE-10074:
--

 Summary: Remove unneeded default value assignment
 Key: LUCENE-10074
 URL: https://issues.apache.org/jira/browse/LUCENE-10074
 Project: Lucene - Core
  Issue Type: Task
Reporter: Zach Chen


This is a spin-off issue from the discussion here 
[https://github.com/apache/lucene/pull/128#discussion_r695669643], where we would like to see if there's any automatic checking mechanism (ECJ?) that can be enabled to detect and warn about unneeded default value assignments, both in future changes and in the existing code.






[jira] [Commented] (LUCENE-10060) Ensure DrillSidewaysQuery instances don't get cached

2021-08-26 Thread Greg Miller (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17405515#comment-17405515
 ] 

Greg Miller commented on LUCENE-10060:
--

[~jpountz] agreed that it would be nice to rethink DrillSideways. Thanks for 
providing more context on {{SegmentReader#core}} and caching. Very helpful!

> Ensure DrillSidewaysQuery instances don't get cached
> 
>
> Key: LUCENE-10060
> URL: https://issues.apache.org/jira/browse/LUCENE-10060
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/facet
>Reporter: Greg Miller
>Priority: Minor
> Fix For: main (9.0), 8.10
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> We need to make sure DSQ instances don't end up in the query cache. -It's 
> important that the {{DrillSidewaysScorer}} (bulk scorer implementation) 
> actually runs during query evaluation in order to populate the "sideways" 
> {{FacetsCollector}} instances with "near miss" docs. If it gets cached, this 
> won't happen.-
> There may also be an implication around {{acceptDocs}} getting honored as 
> well. [~zacharymorn] may be able to provide more details.
> UPDATE: The original issue I detailed above isn't actually an issue since 
> {{DrillDownQuery}} doesn't implement {{equals}}, so the cache always misses 
> and it always executes the {{BulkScorer}} ( {{DrillSidewaysScorer}} ). 
> Tricky! There is a separate issue found by Zach (as mentioned above) related 
> to "acceptDocs" though. See below conversation and link off to the separate 
> [PR 
> conversation|https://github.com/apache/lucene/pull/240#discussion_r692154001] 
> for more details.






[GitHub] [lucene] gsmiller commented on a change in pull request #240: LUCENE-10002: Deprecate IndexSearch#search(Query, Collector) in favor of IndexSearcher#search(Query, CollectorManager)

2021-08-26 Thread GitBox


gsmiller commented on a change in pull request #240:
URL: https://github.com/apache/lucene/pull/240#discussion_r697049547



##
File path: lucene/core/src/java/org/apache/lucene/search/TopScoreDocCollectorManager.java
##
@@ -0,0 +1,142 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.search;
+
+import java.io.IOException;
+import java.util.Collection;
+
+/**
+ * Create a TopScoreDocCollectorManager which uses a shared hit counter to maintain number of hits
+ * and a shared {@link MaxScoreAccumulator} to propagate the minimum score across segments
+ *
+ * Note that a new collectorManager should be created for each search due to its internal states.
+ */
+public class TopScoreDocCollectorManager
+    implements CollectorManager<TopScoreDocCollector, TopDocs> {
+  private final int numHits;
+  private final ScoreDoc after;
+  private final HitsThresholdChecker hitsThresholdChecker;
+  private final MaxScoreAccumulator minScoreAcc;
+
+  /**
+   * Creates a new {@link TopScoreDocCollectorManager} given the number of hits to collect and the
+   * number of hits to count accurately.
+   *
+   * NOTE: If the total hit count of the top docs is less than or exactly {@code
+   * totalHitsThreshold} then this value is accurate. On the other hand, if the {@link
+   * TopDocs#totalHits} value is greater than {@code totalHitsThreshold} then its value is a lower
+   * bound of the hit count. A value of {@link Integer#MAX_VALUE} will make the hit count accurate
+   * but will also likely make query processing slower.
+   *
+   * NOTE: The instances returned by this method pre-allocate a full array of length
+   * numHits, and fill the array with sentinel objects.
+   *
+   * @param numHits the number of results to collect.
+   * @param after the previous doc after which matching docs will be collected.
+   * @param totalHitsThreshold the number of docs to count accurately. If the query matches more
+   *     than {@code totalHitsThreshold} hits then its hit count will be a lower bound. On the other
+   *     hand if the query matches less than or exactly {@code totalHitsThreshold} hits then the hit
+   *     count of the result will be accurate. {@link Integer#MAX_VALUE} may be used to make the hit
+   *     count accurate, but this will also make query processing slower.
+   * @param supportsConcurrency to use thread-safe and slower internal states for count tracking.
+   */
+  public TopScoreDocCollectorManager(
+      int numHits, ScoreDoc after, int totalHitsThreshold, boolean supportsConcurrency) {

Review comment:
   I need to think about it a little more, but I'm not sure putting the responsibility of guarding against multiple `Collector` creation in the `CollectorManager` feels quite right to me. Technically, there would be nothing wrong with creating multiple collectors to run sequentially if you wanted to (although I'm not sure why you would), so it feels a little over-restrictive maybe? But I really can't think of any reason why someone would do that, and it's simpler than exposing the information to `IndexSearcher` and less error-prone than relying on `IndexSearcher` to check/enforce it. So I think this is probably the best way to go; it just feels a little funny to me for some reason.
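   
   For what it's worth, the guard could stay quite small. Here is a sketch under the assumption that "one batch of collectors per manager" is the invariant being enforced (class and flag names are made up; this is not the PR's actual code):
   
   ```java
   import java.io.IOException;
   import java.util.Collection;
   import org.apache.lucene.search.CollectorManager;
   import org.apache.lucene.search.TopDocs;
   import org.apache.lucene.search.TopScoreDocCollector;
   
   // Sketch of a guarded manager: newCollector() refuses to run once reduce()
   // has been called, which catches accidental reuse across searches.
   public class GuardedTopDocsManager
       implements CollectorManager<TopScoreDocCollector, TopDocs> {
     private final int numHits;
     private final int totalHitsThreshold;
     private boolean reduced; // flipped once reduce() runs
   
     public GuardedTopDocsManager(int numHits, int totalHitsThreshold) {
       this.numHits = numHits;
       this.totalHitsThreshold = totalHitsThreshold;
     }
   
     @Override
     public synchronized TopScoreDocCollector newCollector() {
       if (reduced) {
         throw new IllegalStateException("manager already reduced; create a new one per search");
       }
       return TopScoreDocCollector.create(numHits, totalHitsThreshold);
     }
   
     @Override
     public synchronized TopDocs reduce(Collection<TopScoreDocCollector> collectors)
         throws IOException {
       reduced = true;
       TopDocs[] shardHits = new TopDocs[collectors.size()];
       int i = 0;
       for (TopScoreDocCollector c : collectors) {
         shardHits[i++] = c.topDocs();
       }
       return TopDocs.merge(numHits, shardHits);
     }
   }
   ```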







[GitHub] [lucene] gsmiller commented on a change in pull request #240: LUCENE-10002: Deprecate IndexSearch#search(Query, Collector) in favor of IndexSearcher#search(Query, CollectorManager)

2021-08-26 Thread GitBox


gsmiller commented on a change in pull request #240:
URL: https://github.com/apache/lucene/pull/240#discussion_r697047554



##
File path: lucene/core/src/java/org/apache/lucene/search/IndexSearcher.java
##
@@ -527,7 +503,10 @@ public TopDocs search(Query query, int n) throws IOException {
 *
 * @throws TooManyClauses If a query would exceed {@link IndexSearcher#getMaxClauseCount()}
 * clauses.
+   * @deprecated This method is being deprecated in favor of {@link IndexSearcher#search(Query,

Review comment:
   +1 makes sense to keep it in 9.0. As for my comment about 
`@lucene.deprecated`, I was getting this mixed up with something else. 
Apologies for any confusion!







[GitHub] [lucene] gsmiller commented on a change in pull request #240: LUCENE-10002: Deprecate IndexSearch#search(Query, Collector) in favor of IndexSearcher#search(Query, CollectorManager)

2021-08-26 Thread GitBox


gsmiller commented on a change in pull request #240:
URL: https://github.com/apache/lucene/pull/240#discussion_r697046771



##
File path: lucene/benchmark/src/java/org/apache/lucene/benchmark/byTask/tasks/ReadTask.java
##
@@ -180,6 +185,7 @@ protected int withTopDocs(IndexSearcher searcher, Query q, TopDocs hits) throws
 return res;
   }
 
+  @Deprecated

Review comment:
   I was more thinking along the lines of keeping the change you had but 
providing a protected method that would allow sub-classes to provide their own 
`CollectorManager` if they wanted to do so (which would give users a migration 
path if they were previously providing their own `Collector`). So in 9.0, I 
think you'd move completely to the `CollectorManager` approach like you had. 
What do you think of that approach?
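   
   As a rough sketch of that idea (the hook name `createCollectorManager` is made up for illustration; only the `TopScoreDocCollectorManager` constructor shape comes from this PR):
   
   ```java
   // Hypothetical extension point in ReadTask: sub-classes override this to
   // supply their own CollectorManager, mirroring the old Collector override path.
   protected CollectorManager<TopScoreDocCollector, TopDocs> createCollectorManager(int numHits) {
     // default: standard top-docs collection, so existing benchmarks keep working
     return new TopScoreDocCollectorManager(numHits, null, Integer.MAX_VALUE, true);
   }
   ```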







[GitHub] [lucene] sonatype-lift[bot] commented on a change in pull request #179: LUCENE-9476: Add getBulkPath API to DirectoryTaxonomyReader

2021-08-26 Thread GitBox


sonatype-lift[bot] commented on a change in pull request #179:
URL: https://github.com/apache/lucene/pull/179#discussion_r697042596



##
File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyReader.java
##
@@ -351,12 +348,140 @@ public FacetLabel getPath(int ordinal) throws IOException {
 }
 
 synchronized (categoryCache) {
-  categoryCache.put(catIDInteger, ret);
+  categoryCache.put(ordinal, ret);
 }
 
 return ret;
   }
 
+  private FacetLabel[] getPathFromCache(int... ordinals) {
+    FacetLabel[] facetLabels = new FacetLabel[ordinals.length];
+    // TODO LUCENE-10068: can we use an int-based hash impl, such as IntToObjectMap,
+    // wrapped as LRU?
+    synchronized (categoryCache) {
+      for (int i = 0; i < ordinals.length; i++) {
+        facetLabels[i] = categoryCache.get(ordinals[i]);
+      }
+    }
+    return facetLabels;
+  }
+
+  /**
+   * Checks if the ordinals in the array are >=0 and < {@code
+   * DirectoryTaxonomyReader#indexReader.maxDoc()}
+   *
+   * @param ordinals Integer array of ordinals
+   * @throws IllegalArgumentException Throw an IllegalArgumentException if one of the ordinals is
+   *     out of bounds
+   */
+  private void checkOrdinalBounds(int... ordinals) throws IllegalArgumentException {
+    for (int ordinal : ordinals) {
+      if (ordinal < 0 || ordinal >= indexReader.maxDoc()) {
+        throw new IllegalArgumentException(
+            "ordinal "
+                + ordinal
+                + " is out of the range of the indexReader "
+                + indexReader.toString()
+                + ". The maximum possible ordinal number is "
+                + (indexReader.maxDoc() - 1));
+      }
+    }
+  }
+
+  /**
+   * Returns an array of FacetLabels for a given array of ordinals.
+   *
+   * This API is generally faster than iteratively calling {@link #getPath(int)} over an array of
+   * ordinals. It uses the {@link #getPath(int)} method iteratively when it detects that the index
+   * was created using StoredFields (with no performance gains) and uses DocValues based iteration
+   * when the index is based on BinaryDocValues. Lucene switched to BinaryDocValues in version 9.0
+   *
+   * @param ordinals Array of ordinals that are assigned to categories inserted into the taxonomy
+   *     index
+   */
+  @Override
+  public FacetLabel[] getBulkPath(int... ordinals) throws IOException {
+    ensureOpen();
+    checkOrdinalBounds(ordinals);
+
+    int ordinalsLength = ordinals.length;
+    FacetLabel[] bulkPath = new FacetLabel[ordinalsLength];
+    // remember the original positions of ordinals before they are sorted
+    int[] originalPosition = new int[ordinalsLength];
+    Arrays.setAll(originalPosition, IntUnaryOperator.identity());
+
+    getPathFromCache(ordinals);

Review comment:
   *THREAD_SAFETY_VIOLATION:*  Read/Write race. Non-private method 
`DirectoryTaxonomyReader.getBulkPath(...)` indirectly reads without 
synchronization from `this.categoryCache`. Potentially races with write in 
method `DirectoryTaxonomyReader.doClose()`.
Reporting because this access may occur on a background thread.
   (at-me [in a reply](https://help.sonatype.com/lift) with `help` or `ignore`)

##
File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyReader.java
##
@@ -318,23 +322,16 @@ public FacetLabel getPath(int ordinal) throws IOException {
 // doOpenIfChanged, we need to ensure that the ordinal is one that this DTR
 // instance recognizes. Therefore we do this check up front, before we hit
 // the cache.
-if (ordinal < 0 || ordinal >= indexReader.maxDoc()) {
-  return null;
-}
+checkOrdinalBounds(ordinal);
 
-// TODO: can we use an int-based hash impl, such as IntToObjectMap,
-// wrapped as LRU?
-Integer catIDInteger = Integer.valueOf(ordinal);
-synchronized (categoryCache) {
-  FacetLabel res = categoryCache.get(catIDInteger);
-  if (res != null) {
-    return res;
-  }
+FacetLabel[] ordinalPath = getPathFromCache(ordinal);

Review comment:
   *THREAD_SAFETY_VIOLATION:*  Read/Write race. Non-private method 
`DirectoryTaxonomyReader.getPath(...)` indirectly reads without synchronization 
from `this.categoryCache`. Potentially races with write in method 
`DirectoryTaxonomyReader.doClose()`.
Reporting because this access may occur on a background thread.
   (at-me [in a reply](https://help.sonatype.com/lift) with `help` or `ignore`)





[jira] [Commented] (LUCENE-10059) Assertion error in JapaneseTokenizer backtrace

2021-08-26 Thread Anh Dung Bui (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17405507#comment-17405507
 ] 

Anh Dung Bui commented on LUCENE-10059:
---

Since both of them depend on analysis/common, I think we could put the shared code there; the unit test might be a bit difficult since it depends on the dictionary.

I found the original Jira for the Nori tokenizer (LUCENE-8231); it seems it was created from the Kuromoji with some tweaks here and there, and it was agreed at the time that sharing code between the two wasn't initially needed.

I can put up a PR to fix the issue in Nori and then another (Jira) to refactor, or I can also try to put both in one PR. But the Kuromoji tokenizer is quite complicated and there could be a lot of code to refactor out, so I think the former is better. Does anyone have a preference on this?

> Assertion error in JapaneseTokenizer backtrace
> --
>
> Key: LUCENE-10059
> URL: https://issues.apache.org/jira/browse/LUCENE-10059
> Project: Lucene - Core
>  Issue Type: Bug
>Affects Versions: 8.8
>Reporter: Anh Dung Bui
>Priority: Major
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> There is a rare case which causes an AssertionError in the backtrace step of 
> JapaneseTokenizer that we (Amazon Product Search) found in our tests.
> If there is a text span of length 1024 (determined by 
> [MAX_BACKTRACE_GAP|https://github.com/apache/lucene/blob/main/lucene/analysis/kuromoji/src/java/org/apache/lucene/analysis/ja/JapaneseTokenizer.java#L116])
>  where the regular backtrace is not called, a [forced 
> backtrace|https://github.com/apache/lucene/blob/main/lucene/analysis/kuromoji/src/java/org/apache/lucene/analysis/ja/JapaneseTokenizer.java#L781]
>  will be applied. If the partially best path at this point happens to end at 
> the last pos, and since there is always a [final 
> backtrace|https://github.com/apache/lucene/blob/main/lucene/analysis/kuromoji/src/java/org/apache/lucene/analysis/ja/JapaneseTokenizer.java#L1044]
>  applied at the end, the final backtrace will try to backtrace from and to 
> the same position, causing an AssertionError in RollingCharBuffer.get() when 
> it tries to generate an empty buffer.
> We are fixing it by returning prematurely in the backtrace() method when the 
> from and to pos are the same:
> {code:java}
> if (endPos == lastBackTracePos) {
>   return;
> }
> {code}
> The backtrace() method is essentially no-op when this condition happens, thus 
> when _-ea_ is not enabled, it can still output the correct tokens.
> We will open a PR for this issue.
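For illustration, the guard would sit at the top of the backtrace method, roughly as follows (the signature is sketched from JapaneseTokenizer and details may differ in the actual PR):
{code:java}
private void backtrace(final Position endPosData, final int fromIDX) throws IOException {
  final int endPos = endPosData.pos;
  if (endPos == lastBackTracePos) {
    // the forced backtrace already ended here: nothing to emit, and continuing
    // would ask RollingCharBuffer for an empty slice (the AssertionError)
    return;
  }
  // ... existing backtrace logic ...
}
{code}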






[GitHub] [lucene] gautamworah96 commented on a change in pull request #179: LUCENE-9476: Add getBulkPath API to DirectoryTaxonomyReader

2021-08-26 Thread GitBox


gautamworah96 commented on a change in pull request #179:
URL: https://github.com/apache/lucene/pull/179#discussion_r696973661



##
File path: lucene/facet/src/test/org/apache/lucene/facet/taxonomy/directory/TestDirectoryTaxonomyReader.java
##
@@ -567,4 +567,39 @@ public void testAccountable() throws Exception {
 taxoReader.close();
 dir.close();
   }
+
+  public void testCallingBulkPathReturnsCorrectResult() throws Exception {
+    float PROBABILITY_OF_COMMIT = 0.5f;
+    Directory src = newDirectory();
+    DirectoryTaxonomyWriter w = new DirectoryTaxonomyWriter(src);
+    String randomArray[] = new String[random().nextInt(1000)];
+    // adding a smaller bound on ints ensures that we will have some duplicate ordinals in random
+    // test cases
+    Arrays.setAll(randomArray, i -> Integer.toString(random().nextInt(500)));
+
+    FacetLabel allPaths[] = new FacetLabel[randomArray.length];
+    int allOrdinals[] = new int[randomArray.length];
+
+    for (int i = 0; i < randomArray.length; i++) {
+      allPaths[i] = new FacetLabel(randomArray[i]);
+      w.addCategory(allPaths[i]);
+      // add random commits to create multiple segments in the index
+      if (random().nextFloat() < PROBABILITY_OF_COMMIT) {
+        w.commit();
+      }
+    }
+    w.commit();
+    w.close();
+
+    DirectoryTaxonomyReader r1 = new DirectoryTaxonomyReader(src);
+
+    for (int i = 0; i < allPaths.length; i++) {

Review comment:
   Sure. I modified the test so that it uses a random number of threads, each thread runs a random number of iterations, and each iteration tests the paths of a random number of ordinals.
   
   Thanks for suggesting this idea. 
   
   It is especially helpful in cases like ours, where the cache storage/retrieval operations are complex.
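   
   In case it helps review, the shape of that multi-threaded check is roughly the following sketch (thread counts and bounds are illustrative; `r1`, `allOrdinals`, and the enclosing `throws Exception` test method are assumed from the test above):
   
   ```java
   // Several threads, each running a few iterations of random-sized getBulkPath
   // calls, checking the bulk result against the single-ordinal getPath API.
   Thread[] threads = new Thread[2 + random().nextInt(4)];
   for (int t = 0; t < threads.length; t++) {
     final long seed = random().nextLong(); // per-thread Random; the test framework's random() is not thread-safe
     threads[t] =
         new Thread(
             () -> {
               Random rnd = new Random(seed);
               try {
                 for (int iter = 0; iter < 1 + rnd.nextInt(10); iter++) {
                   int[] ordinals = new int[1 + rnd.nextInt(allOrdinals.length)];
                   for (int j = 0; j < ordinals.length; j++) {
                     ordinals[j] = allOrdinals[rnd.nextInt(allOrdinals.length)];
                   }
                   // getBulkPath may reorder its input internally, so pass a copy
                   FacetLabel[] bulk = r1.getBulkPath(ordinals.clone());
                   for (int j = 0; j < ordinals.length; j++) {
                     assertEquals(r1.getPath(ordinals[j]), bulk[j]);
                   }
                 }
               } catch (IOException e) {
                 throw new RuntimeException(e);
               }
             });
   }
   for (Thread th : threads) th.start();
   for (Thread th : threads) th.join();
   ```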
   







[GitHub] [lucene] gautamworah96 commented on a change in pull request #179: LUCENE-9476: Add getBulkPath API to DirectoryTaxonomyReader

2021-08-26 Thread GitBox


gautamworah96 commented on a change in pull request #179:
URL: https://github.com/apache/lucene/pull/179#discussion_r697014864



##
File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyReader.java
##
@@ -351,12 +348,140 @@ public FacetLabel getPath(int ordinal) throws IOException {
 }
 
 synchronized (categoryCache) {
-  categoryCache.put(catIDInteger, ret);
+  categoryCache.put(ordinal, ret);
 }
 
 return ret;
   }
 
+  private FacetLabel getPathFromCache(int ordinal) {
+    // TODO: can we use an int-based hash impl, such as IntToObjectMap,
+    // wrapped as LRU?
+    synchronized (categoryCache) {
+      return categoryCache.get(ordinal);
+    }
+  }
+
+  private void checkOrdinalBounds(int ordinal) throws IllegalArgumentException {
+    if (ordinal < 0 || ordinal >= indexReader.maxDoc()) {
+      throw new IllegalArgumentException(
+          "ordinal "
+              + ordinal
+              + " is out of the range of the indexReader "
+              + indexReader.toString()
+              + ". The maximum possible ordinal number is "
+              + (indexReader.maxDoc() - 1));
+    }
+  }
+
+  /**
+   * Returns an array of FacetLabels for a given array of ordinals.
+   *
+   * This API is generally faster than iteratively calling {@link #getPath(int)} over an array of
+   * ordinals. It uses the {@link #getPath(int)} method iteratively when it detects that the index
+   * was created using StoredFields (with no performance gains) and uses DocValues based iteration
+   * when the index is based on BinaryDocValues. Lucene switched to BinaryDocValues in version 9.0
+   *
+   * @param ordinals Array of ordinals that are assigned to categories inserted into the taxonomy
+   *     index
+   */
+  public FacetLabel[] getBulkPath(int... ordinals) throws IOException {
+    ensureOpen();
+
+    int ordinalsLength = ordinals.length;
+    FacetLabel[] bulkPath = new FacetLabel[ordinalsLength];
+    // remember the original positions of ordinals before they are sorted
+    int[] originalPosition = new int[ordinalsLength];
+    Arrays.setAll(originalPosition, IntUnaryOperator.identity());
+
+    for (int i = 0; i < ordinalsLength; i++) {
+      // check whether the ordinal is valid before accessing the cache
+      checkOrdinalBounds(ordinals[i]);
+      // check the cache before trying to find it in the index
+      FacetLabel ordinalPath = getPathFromCache(ordinals[i]);
+      if (ordinalPath != null) {
+        bulkPath[i] = ordinalPath;
+      }
+    }
+
+    /* parallel sort the ordinals and originalPosition array based on the values in the ordinals array */
+    new InPlaceMergeSorter() {
+      @Override
+      protected void swap(int i, int j) {
+        int x = ordinals[i];
+        ordinals[i] = ordinals[j];
+        ordinals[j] = x;
+
+        x = originalPosition[i];
+        originalPosition[i] = originalPosition[j];
+        originalPosition[j] = x;
+      }
+
+      @Override
+      public int compare(int i, int j) {
+        return Integer.compare(ordinals[i], ordinals[j]);
+      }
+    }.sort(0, ordinalsLength);
+
+    int readerIndex;
+    int leafReaderMaxDoc = 0;
+    int leafReaderDocBase = 0;
+    LeafReader leafReader;
+    LeafReaderContext leafReaderContext;
+    BinaryDocValues values = null;
+
+    for (int i = 0; i < ordinalsLength; i++) {
+      if (bulkPath[originalPosition[i]] == null) {
+        /*
+        If ordinals[i] >= leafReaderDocBase + leafReaderMaxDoc then we find the next leaf that contains our ordinal.
+        Remember: ordinals[i] operates in the global ordinal space and hence we add leafReaderDocBase to the leafReaderMaxDoc
+        (which is the maxDoc of the specific leaf)
+         */
+        if (values == null || ordinals[i] >= leafReaderDocBase + leafReaderMaxDoc) {

Review comment:
   So this was a performance bug and not a correctness bug. 
   
   **The performance bug was that for leaf 2 and all leaves after it, we were recalculating the leaf for each ordinal.** Instead, what the corrected code does now is check whether the current leaf is sufficient for calculating the values; only if it is not does it **then** try to find the next leaf.
   
   It is a bit hard to write a test that checks whether we are entering that if condition or not. Do you have any ideas on testing these kinds of conditionals?
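   
   For readers following along, the corrected advance-only-when-needed loop looks roughly like this (variable names come from the diff above; the `ReaderUtil.subIndex` leaf lookup is an assumption on my part):
   
   ```java
   for (int i = 0; i < ordinalsLength; i++) {
     if (bulkPath[originalPosition[i]] == null) {
       // only re-resolve the leaf when the (sorted) ordinal walks past the
       // current leaf, instead of recalculating the leaf for every ordinal
       if (values == null || ordinals[i] >= leafReaderDocBase + leafReaderMaxDoc) {
         readerIndex = ReaderUtil.subIndex(ordinals[i], indexReader.leaves());
         leafReaderContext = indexReader.leaves().get(readerIndex);
         leafReader = leafReaderContext.reader();
         leafReaderMaxDoc = leafReader.maxDoc();
         leafReaderDocBase = leafReaderContext.docBase;
         values = leafReader.getBinaryDocValues(Consts.FULL);
       }
       // ... advance values to ordinals[i] - leafReaderDocBase and decode the label ...
     }
   }
   ```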







[jira] [Commented] (LUCENE-10060) Ensure DrillSidewaysQuery instances don't get cached

2021-08-26 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17405483#comment-17405483
 ] 

ASF subversion and git services commented on LUCENE-10060:
--

Commit b45164059fa2c2964cf3d89f28f480582f05e62b in lucene-solr's branch 
refs/heads/branch_8x from Greg Miller
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=b451640 ]

Fix a DrillSideways unit test I broke when adding more tests in LUCENE-10060 
(#2562)



> Ensure DrillSidewaysQuery instances don't get cached
> 
>
> Key: LUCENE-10060
> URL: https://issues.apache.org/jira/browse/LUCENE-10060
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/facet
>Reporter: Greg Miller
>Priority: Minor
> Fix For: main (9.0), 8.10
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> We need to make sure DSQ instances don't end up in the query cache. -It's 
> important that the {{DrillSidewaysScorer}} (bulk scorer implementation) 
> actually runs during query evaluation in order to populate the "sideways" 
> {{FacetsCollector}} instances with "near miss" docs. If it gets cached, this 
> won't happen.-
> There may also be an implication around {{acceptDocs}} getting honored as 
> well. [~zacharymorn] may be able to provide more details.
> UPDATE: The original issue I detailed above isn't actually an issue since 
> {{DrillDownQuery}} doesn't implement {{equals}}, so the cache always misses 
> and it always executes the {{BulkScorer}} ( {{DrillSidewaysScorer}} ). 
> Tricky! There is a separate issue found by Zach (as mentioned above) related 
> to "acceptDocs" though. See below conversation and link off to the separate 
> [PR 
> conversation|https://github.com/apache/lucene/pull/240#discussion_r692154001] 
> for more details.






[GitHub] [lucene-solr] gsmiller merged pull request #2562: Fix a DrillSideways unit test I broke when adding more tests in LUCENE-10060

2021-08-26 Thread GitBox


gsmiller merged pull request #2562:
URL: https://github.com/apache/lucene-solr/pull/2562


   





[GitHub] [lucene] gsmiller merged pull request #268: Fix a DrillSideways unit test I broke when adding more tests in LUCENE-10060

2021-08-26 Thread GitBox


gsmiller merged pull request #268:
URL: https://github.com/apache/lucene/pull/268


   





[jira] [Commented] (LUCENE-10060) Ensure DrillSidewaysQuery instances don't get cached

2021-08-26 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17405481#comment-17405481
 ] 

ASF subversion and git services commented on LUCENE-10060:
--

Commit 3b3f9600c2ea6023f5400a364c0921ba29667584 in lucene's branch 
refs/heads/main from Greg Miller
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=3b3f960 ]

Fix a DrillSideways unit test I broke when adding more tests in LUCENE-10060 
(#268)



> Ensure DrillSidewaysQuery instances don't get cached
> 
>
> Key: LUCENE-10060
> URL: https://issues.apache.org/jira/browse/LUCENE-10060
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/facet
>Reporter: Greg Miller
>Priority: Minor
> Fix For: main (9.0), 8.10
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> We need to make sure DSQ instances don't end up in the query cache. -It's 
> important that the {{DrillSidewaysScorer}} (bulk scorer implementation) 
> actually runs during query evaluation in order to populate the "sideways" 
> {{FacetsCollector}} instances with "near miss" docs. If it gets cached, this 
> won't happen.-
> There may also be an implication around {{acceptDocs}} getting honored as 
> well. [~zacharymorn] may be able to provide more details.
> UPDATE: The original issue I detailed above isn't actually an issue since 
> {{DrillDownQuery}} doesn't implement {{equals}}, so the cache always misses 
> and it always executes the {{BulkScorer}} ( {{DrillSidewaysScorer}} ). 
> Tricky! There is a separate issue found by Zach (as mentioned above) related 
> to "acceptDocs" though. See below conversation and link off to the separate 
> [PR 
> conversation|https://github.com/apache/lucene/pull/240#discussion_r692154001] 
> for more details.






[GitHub] [lucene-solr] damianparus closed pull request #175: Skipping merger

2021-08-26 Thread GitBox


damianparus closed pull request #175:
URL: https://github.com/apache/lucene-solr/pull/175


   





[jira] [Commented] (LUCENE-10067) investigate 6/23/2021 -> 6/24/2021 drop in facets perf

2021-08-26 Thread Robert Muir (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17405480#comment-17405480
 ] 

Robert Muir commented on LUCENE-10067:
--

Thank you [~jpountz], now we are better off than we started!

> investigate 6/23/2021 -> 6/24/2021 drop in facets perf
> --
>
> Key: LUCENE-10067
> URL: https://issues.apache.org/jira/browse/LUCENE-10067
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Robert Muir
>Priority: Major
> Fix For: main (9.0)
>
>
> Just looking at perf graph (and recent gains from LUCENE-5309), it is unclear 
> why performance dropped on that date, and if we paved new, better 
> performance, or instead just regained some lost-ground only in the single 
> value case?
> Example:
> https://home.apache.org/~mikemccand/lucenebench/BrowseDayOfYearSSDVFacets.html
> after some debugging, it looks like LUCENE-9613 change may be responsible. It 
> was the only relevant change between the 6/23 and 6/24 benchmarks. 






[jira] [Commented] (LUCENE-9969) DirectoryTaxonomyReader.taxoArray occupies a large amount of memory, causing system OOM crashes

2021-08-26 Thread Greg Miller (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17405470#comment-17405470
 ] 

Greg Miller commented on LUCENE-9969:
-

[~chengfengfeng] glad you were able to find a solution for your application! 
I'm going to resolve this issue, but if you're interested in proposing an 
improvement to Lucene itself, please reopen this and attach a pull request. I'd 
be happy to have a look!

> DirectoryTaxonomyReader.taxoArray occupies a large amount of memory, causing system OOM crashes
> 
>
> Key: LUCENE-9969
> URL: https://issues.apache.org/jira/browse/LUCENE-9969
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/facet
>Affects Versions: 6.6.2
>Reporter: FengFeng Cheng
>Priority: Trivial
> Attachments: image-2021-05-24-13-43-43-289.png
>
>
> First of all, the data volume is very large; the JVM memory is 90G, but TaxonomyIndexArrays takes up almost half of it.
> !image-2021-05-24-13-43-43-289.png!
> Is there a better way to use TaxonomyReader, or some other optimization for this?






[jira] [Resolved] (LUCENE-9969) DirectoryTaxonomyReader.taxoArray occupies a large amount of memory, causing system OOM crashes

2021-08-26 Thread Greg Miller (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Greg Miller resolved LUCENE-9969.
-
Resolution: Information Provided

> DirectoryTaxonomyReader.taxoArray occupies a large amount of memory, causing system OOM crashes
> 
>
> Key: LUCENE-9969
> URL: https://issues.apache.org/jira/browse/LUCENE-9969
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/facet
>Affects Versions: 6.6.2
>Reporter: FengFeng Cheng
>Priority: Trivial
> Attachments: image-2021-05-24-13-43-43-289.png
>
>
> First of all, the data volume is very large; the JVM memory is 90G, but TaxonomyIndexArrays takes up almost half of it.
> !image-2021-05-24-13-43-43-289.png!
> Is there a better way to use TaxonomyReader, or some other optimization for this?






[jira] [Commented] (LUCENE-10033) Encode doc values in smaller blocks of values, like postings

2021-08-26 Thread Greg Miller (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17405467#comment-17405467
 ] 

Greg Miller commented on LUCENE-10033:
--

[~weizijun] thanks for running this! The "10k" source is really meant as a test that things are set up properly, but it doesn't really give accurate QPS results. Can you please try running wikimedium10m or wikimediumall?

> Encode doc values in smaller blocks of values, like postings
> 
>
> Key: LUCENE-10033
> URL: https://issues.apache.org/jira/browse/LUCENE-10033
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
> Attachments: benchmark
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> This is a follow-up to the discussion on this thread: 
> https://lists.apache.org/thread.html/r7b757074d5f02874ce3a295b0007dff486bc10d08fb0b5e5a4ba72c5%40%3Cdev.lucene.apache.org%3E.
> Our current approach for doc values uses large blocks of 16k values where 
> values can be decompressed independently, using DirectWriter/DirectReader. 
> This is a bit inefficient in some cases, e.g. a single outlier can grow the 
> number of bits per value for the entire block, we can't easily use run-length 
> compression, etc. Plus, it encourages using a different sub-class for every 
> compression technique, which puts pressure on the JVM.
> We'd like to move to an approach that would be more similar to postings with 
> smaller blocks (e.g. 128 values) whose values get all decompressed at once 
> (using SIMD instructions), with skip data within blocks in order to 
> efficiently skip to arbitrary doc IDs (or maybe still use jump tables as 
> today's doc values, and as discussed here for postings: 
> https://lists.apache.org/thread.html/r7c3cb7ab143fd4ecbc05c04064d10ef9fb50c5b4d6479b0f35732677%40%3Cdev.lucene.apache.org%3E).






[jira] [Comment Edited] (LUCENE-10033) Encode doc values in smaller blocks of values, like postings

2021-08-26 Thread Greg Miller (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17405467#comment-17405467
 ] 

Greg Miller edited comment on LUCENE-10033 at 8/26/21, 9:07 PM:


[~weizijun] thanks for running this! The "10k" source is really meant as a test that things are set up properly, but it doesn't really give accurate QPS results. Can you please try running wikimedium10m or wikimediumall?

Even in that 10k result you posted though, the performance of the SSDV faceting tasks is showing a heavy regression. Let's see if that holds with a 10m or all run.


was (Author: gsmiller):
[~weizijun] thanks for running this! The "10k" source is really meant as a test 
that things are setup properly, but it doesn't really give accurate QPS 
results. Can you please try running wikimedium10m or wikimediumall?

> Encode doc values in smaller blocks of values, like postings
> 
>
> Key: LUCENE-10033
> URL: https://issues.apache.org/jira/browse/LUCENE-10033
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
> Attachments: benchmark
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> This is a follow-up to the discussion on this thread: 
> https://lists.apache.org/thread.html/r7b757074d5f02874ce3a295b0007dff486bc10d08fb0b5e5a4ba72c5%40%3Cdev.lucene.apache.org%3E.
> Our current approach for doc values uses large blocks of 16k values where 
> values can be decompressed independently, using DirectWriter/DirectReader. 
> This is a bit inefficient in some cases, e.g. a single outlier can grow the 
> number of bits per value for the entire block, we can't easily use run-length 
> compression, etc. Plus, it encourages using a different sub-class for every 
> compression technique, which puts pressure on the JVM.
> We'd like to move to an approach that would be more similar to postings with 
> smaller blocks (e.g. 128 values) whose values get all decompressed at once 
> (using SIMD instructions), with skip data within blocks in order to 
> efficiently skip to arbitrary doc IDs (or maybe still use jump tables as 
> today's doc values, and as discussed here for postings: 
> https://lists.apache.org/thread.html/r7c3cb7ab143fd4ecbc05c04064d10ef9fb50c5b4d6479b0f35732677%40%3Cdev.lucene.apache.org%3E).






[GitHub] [lucene] gautamworah96 commented on a change in pull request #179: LUCENE-9476: Add getBulkPath API to DirectoryTaxonomyReader

2021-08-26 Thread GitBox


gautamworah96 commented on a change in pull request #179:
URL: https://github.com/apache/lucene/pull/179#discussion_r696973661



##
File path: lucene/facet/src/test/org/apache/lucene/facet/taxonomy/directory/TestDirectoryTaxonomyReader.java
##
@@ -567,4 +567,39 @@ public void testAccountable() throws Exception {
 taxoReader.close();
 dir.close();
   }
+
+  public void testCallingBulkPathReturnsCorrectResult() throws Exception {
+    float PROBABILITY_OF_COMMIT = 0.5f;
+    Directory src = newDirectory();
+    DirectoryTaxonomyWriter w = new DirectoryTaxonomyWriter(src);
+    String randomArray[] = new String[random().nextInt(1000)];
+    // adding a smaller bound on ints ensures that we will have some duplicate ordinals in random
+    // test cases
+    Arrays.setAll(randomArray, i -> Integer.toString(random().nextInt(500)));
+
+    FacetLabel allPaths[] = new FacetLabel[randomArray.length];
+    int allOrdinals[] = new int[randomArray.length];
+
+    for (int i = 0; i < randomArray.length; i++) {
+      allPaths[i] = new FacetLabel(randomArray[i]);
+      w.addCategory(allPaths[i]);
+      // add random commits to create multiple segments in the index
+      if (random().nextFloat() < PROBABILITY_OF_COMMIT) {
+        w.commit();
+      }
+    }
+    w.commit();
+    w.close();
+
+    DirectoryTaxonomyReader r1 = new DirectoryTaxonomyReader(src);
+
+    for (int i = 0; i < allPaths.length; i++) {

Review comment:
   Sure. I modified the test so that it uses a random number of threads, each thread runs a random number of iterations, and each iteration tests the paths of a random number of ordinals. Debugging this test will be a task if it fails :|







[GitHub] [lucene] gautamworah96 commented on a change in pull request #179: LUCENE-9476: Add getBulkPath API to DirectoryTaxonomyReader

2021-08-26 Thread GitBox


gautamworah96 commented on a change in pull request #179:
URL: https://github.com/apache/lucene/pull/179#discussion_r696969300



##
File path: 
lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyReader.java
##
@@ -351,12 +349,139 @@ public FacetLabel getPath(int ordinal) throws 
IOException {
 }
 
 synchronized (categoryCache) {
-  categoryCache.put(catIDInteger, ret);
+  categoryCache.put(ordinal, ret);
 }
 
 return ret;
   }
 
+  private FacetLabel getPathFromCache(int ordinal) {
+// TODO: can we use an int-based hash impl, such as IntToObjectMap,
+// wrapped as LRU?
+synchronized (categoryCache) {
+  return categoryCache.get(ordinal);
+}
+  }
+
+  private void checkOrdinalBounds(int ordinal, int indexReaderMaxDoc)
+  throws IllegalArgumentException {
+if (ordinal < 0 || ordinal >= indexReaderMaxDoc) {
+  throw new IllegalArgumentException(
+  "ordinal "
+  + ordinal
+  + " is out of the range of the indexReader "
+  + indexReader.toString());
+}
+  }
+
+  /**
+   * Returns an array of FacetLabels for a given array of ordinals.
+   *
+   * This API is generally faster than iteratively calling {@link 
#getPath(int)} over an array of
+   * ordinals. It uses the {@link #getPath(int)} method iteratively when it 
detects that the index
+   * was created using StoredFields (with no performance gains) and uses 
DocValues based iteration
+   * when the index is based on DocValues.
+   *
+   * @param ordinals Array of ordinals that are assigned to categories 
inserted into the taxonomy
+   * index
+   */
+  public FacetLabel[] getBulkPath(int... ordinals) throws IOException {
+ensureOpen();
+
+int ordinalsLength = ordinals.length;
+FacetLabel[] bulkPath = new FacetLabel[ordinalsLength];
+// remember the original positions of ordinals before they are sorted
+int[] originalPosition = new int[ordinalsLength];
+Arrays.setAll(originalPosition, IntUnaryOperator.identity());
+int indexReaderMaxDoc = indexReader.maxDoc();
+
+for (int i = 0; i < ordinalsLength; i++) {
+  // check whether the ordinal is valid before accessing the cache
+  checkOrdinalBounds(ordinals[i], indexReaderMaxDoc);
+  // check the cache before trying to find it in the index
+  FacetLabel ordinalPath = getPathFromCache(ordinals[i]);
+  if (ordinalPath != null) {
+bulkPath[i] = ordinalPath;
+  }
+}
+
+/* parallel sort the ordinals and originalPosition array based on the 
values in the ordinals array */
+new InPlaceMergeSorter() {
+  @Override
+  protected void swap(int i, int j) {
+int x = ordinals[i];
+ordinals[i] = ordinals[j];
+ordinals[j] = x;
+
+x = originalPosition[i];
+originalPosition[i] = originalPosition[j];
+originalPosition[j] = x;
+  }
+  ;
+
+  @Override
+  public int compare(int i, int j) {
+return Integer.compare(ordinals[i], ordinals[j]);
+  }
+}.sort(0, ordinalsLength);
+
+int readerIndex;
+int leafReaderMaxDoc = 0;
+int leafReaderDocBase = 0;
+LeafReader leafReader;
+LeafReaderContext leafReaderContext;
+BinaryDocValues values = null;
+
+for (int i = 0; i < ordinalsLength; i++) {
+  if (bulkPath[originalPosition[i]] == null) {
+/*
+If ordinals[i] >= leafReaderMaxDoc then we find the next leaf that 
contains our ordinal
+ */
+if (values == null || ordinals[i] >= leafReaderMaxDoc) {
+
+  readerIndex = ReaderUtil.subIndex(ordinals[i], indexReader.leaves());
+  leafReaderContext = indexReader.leaves().get(readerIndex);
+  leafReader = leafReaderContext.reader();
+  leafReaderMaxDoc = leafReader.maxDoc();
+  leafReaderDocBase = leafReaderContext.docBase;
+  values = leafReader.getBinaryDocValues(Consts.FULL);
+
+  /*
+  If the index is constructed with the older StoredFields it will not 
have any BinaryDocValues field and will return null
+   */
+  if (values == null) {
+return getBulkPathForOlderIndexes(ordinals);
+  }
+}
+boolean success = values.advanceExact(ordinals[i] - leafReaderDocBase);
+assert success;
+bulkPath[originalPosition[i]] =
+new 
FacetLabel(FacetsConfig.stringToPath(values.binaryValue().utf8ToString()));
+
+// add the value to the categoryCache after computation
+synchronized (categoryCache) {

Review comment:
   Sure. I've implemented this in the next iteration, where the 
`uncachedOrdinalPositions` array stores the positions of ordinals that were not 
cached.
   
   Later on we use these positions in one go to bulk-insert paths into the 
cache.
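
   A minimal sketch of that idea, using the `ordinals`/`originalPosition`/
`bulkPath` arrays from the diff above and a hypothetical 
`uncachedOrdinalPositions` list of sorted-array indexes whose paths were just 
computed:

   ```java
   // Sketch: take the categoryCache lock once and insert all newly
   // computed paths together, instead of locking per ordinal.
   synchronized (categoryCache) {
     for (int i : uncachedOrdinalPositions) {
       categoryCache.put(ordinals[i], bulkPath[originalPosition[i]]);
     }
   }
   ```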




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

[jira] [Commented] (LUCENE-10060) Ensure DrillSidewaysQuery instances don't get cached

2021-08-26 Thread Greg Miller (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17405449#comment-17405449
 ] 

Greg Miller commented on LUCENE-10060:
--

OK, I think I actually broke the tests, not the implementation, so it should 
hopefully be an easier fix. Will have a PR up shortly.

> Ensure DrillSidewaysQuery instances don't get cached
> 
>
> Key: LUCENE-10060
> URL: https://issues.apache.org/jira/browse/LUCENE-10060
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/facet
>Reporter: Greg Miller
>Priority: Minor
> Fix For: main (9.0), 8.10
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> We need to make sure DSQ instances don't end up in the query cache. -It's 
> important that the {{DrillSidewaysScorer}} (bulk scorer implementation) 
> actually runs during query evaluation in order to populate the "sideways" 
> {{FacetsCollector}} instances with "near miss" docs. If it gets cached, this 
> won't happen.-
> There may also be an implication around {{acceptDocs}} getting honored as 
> well. [~zacharymorn] may be able to provide more details.
> UPDATE: The original issue I detailed above isn't actually an issue since 
> {{DrillDownQuery}} doesn't implement {{equals}}, so the cache always misses 
> and it always executes the {{BulkScorer}} ( {{DrillSidewaysScorer}} ). 
> Tricky! There is a separate issue found by Zach (as mentioned above) related 
> to "acceptDocs" though. See below conversation and link off to the separate 
> [PR 
> conversation|https://github.com/apache/lucene/pull/240#discussion_r692154001] 
> for more details.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10060) Ensure DrillSidewaysQuery instances don't get cached

2021-08-26 Thread Greg Miller (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17405441#comment-17405441
 ] 

Greg Miller commented on LUCENE-10060:
--

[~jpountz] thanks for the heads up. I'll dig into it.

> Ensure DrillSidewaysQuery instances don't get cached
> 
>
> Key: LUCENE-10060
> URL: https://issues.apache.org/jira/browse/LUCENE-10060
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/facet
>Reporter: Greg Miller
>Priority: Minor
> Fix For: main (9.0), 8.10
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> We need to make sure DSQ instances don't end up in the query cache. -It's 
> important that the {{DrillSidewaysScorer}} (bulk scorer implementation) 
> actually runs during query evaluation in order to populate the "sideways" 
> {{FacetsCollector}} instances with "near miss" docs. If it gets cached, this 
> won't happen.-
> There may also be an implication around {{acceptDocs}} getting honored as 
> well. [~zacharymorn] may be able to provide more details.
> UPDATE: The original issue I detailed above isn't actually an issue since 
> {{DrillDownQuery}} doesn't implement {{equals}}, so the cache always misses 
> and it always executes the {{BulkScorer}} ( {{DrillSidewaysScorer}} ). 
> Tricky! There is a separate issue found by Zach (as mentioned above) related 
> to "acceptDocs" though. See below conversation and link off to the separate 
> [PR 
> conversation|https://github.com/apache/lucene/pull/240#discussion_r692154001] 
> for more details.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Resolved] (LUCENE-10067) investigate 6/23/2021 -> 6/24/2021 drop in facets perf

2021-08-26 Thread Adrien Grand (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand resolved LUCENE-10067.
---
Fix Version/s: main (9.0)
   Resolution: Fixed

New benchmark results look good:
https://home.apache.org/~mikemccand/lucenebench/BrowseMonthSSDVFacets.html
https://home.apache.org/~mikemccand/lucenebench/BrowseDayOfYearSSDVFacets.html

> investigate 6/23/2021 -> 6/24/2021 drop in facets perf
> --
>
> Key: LUCENE-10067
> URL: https://issues.apache.org/jira/browse/LUCENE-10067
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Robert Muir
>Priority: Major
> Fix For: main (9.0)
>
>
> Just looking at perf graph (and recent gains from LUCENE-5309), it is unclear 
> why performance dropped on that date, and if we paved new, better 
> performance, or instead just regained some lost-ground only in the single 
> value case?
> Example:
> https://home.apache.org/~mikemccand/lucenebench/BrowseDayOfYearSSDVFacets.html
> after some debugging, it looks like LUCENE-9613 change may be responsible. It 
> was the only relevant change between the 6/23 and 6/24 benchmarks. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] jtibshirani commented on a change in pull request #262: LUCENE-10063: implement SimpleTextKnnvectorsReader.search

2021-08-26 Thread GitBox


jtibshirani commented on a change in pull request #262:
URL: https://github.com/apache/lucene/pull/262#discussion_r696842465



##
File path: 
lucene/codecs/src/java/org/apache/lucene/codecs/simpletext/SimpleTextKnnVectorsReader.java
##
@@ -140,7 +147,38 @@ public VectorValues getVectorValues(String field) throws 
IOException {
 
   @Override
   public TopDocs search(String field, float[] target, int k, Bits acceptDocs) 
throws IOException {
-throw new UnsupportedOperationException();
+VectorValues values = getVectorValues(field);
+if (values == null) {
+  return null;
+}
+if (target.length != values.dimension()) {
+  throw new IllegalArgumentException(
+  "incorrect dimension for field "
+  + field
+  + "; expected "
+  + values.dimension()
+  + " but target has "
+  + target.length);
+}
+FieldInfo info = readState.fieldInfos.fieldInfo(field);
+VectorSimilarityFunction vectorSimilarity = 
info.getVectorSimilarityFunction();
+HitQueue topK = new HitQueue(k, false);
+int doc;
+while ((doc = values.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
+  float[] vector = values.vectorValue();
+  float score = vectorSimilarity.compare(vector, target);
+  if (vectorSimilarity.reversed) {
+score = 1 / (score + 1);
+  }
+  topK.insertWithOverflow(new ScoreDoc(doc, score));
+}
+ScoreDoc[] topScoreDocs = new ScoreDoc[topK.size()];
+int i = 0;
+for (ScoreDoc scoreDoc : topK) {
+  topScoreDocs[i++] = scoreDoc;
+}
+Arrays.sort(topScoreDocs, Comparator.comparingInt(x -> x.doc));

Review comment:
   I was apparently confused too :) It'd also be great to document what the 
TotalHits part of the result means; this is a bit subtle. I think it's the 
number of docs that were visited during the search.
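
   For illustration, a hedged sketch of what recording that could look like in 
this method (treating the visited count as the TotalHits value is an assumption 
about the intended contract, not settled API semantics):

   ```java
   // Sketch: count every vector visited and report it as the TotalHits
   // value with an EQUAL_TO relation, since the scan is exhaustive here.
   long numVisited = 0;
   int doc;
   while ((doc = values.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
     numVisited++;
     // ... score the vector and collect into topK as in the diff above ...
   }
   TotalHits totalHits = new TotalHits(numVisited, TotalHits.Relation.EQUAL_TO);
   return new TopDocs(totalHits, topScoreDocs);
   ```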




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] jtibshirani commented on a change in pull request #262: LUCENE-10063: implement SimpleTextKnnvectorsReader.search

2021-08-26 Thread GitBox


jtibshirani commented on a change in pull request #262:
URL: https://github.com/apache/lucene/pull/262#discussion_r696839771



##
File path: 
lucene/codecs/src/java/org/apache/lucene/codecs/simpletext/SimpleTextKnnVectorsReader.java
##
@@ -140,7 +147,38 @@ public VectorValues getVectorValues(String field) throws 
IOException {
 
   @Override
   public TopDocs search(String field, float[] target, int k, Bits acceptDocs) 
throws IOException {
-throw new UnsupportedOperationException();
+VectorValues values = getVectorValues(field);

Review comment:
   Thanks, that's good to know. I guess we could simplify the handling in 
some other places then too.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] gsmiller commented on a change in pull request #264: LUCENE-10062: Switch to numeric doc values for encoding taxonomy ordinals (instead of custom binary format)

2021-08-26 Thread GitBox


gsmiller commented on a change in pull request #264:
URL: https://github.com/apache/lucene/pull/264#discussion_r696810061



##
File path: 
lucene/facet/src/java/org/apache/lucene/facet/taxonomy/DocValuesOrdinalsReader.java
##
@@ -41,12 +40,7 @@ public DocValuesOrdinalsReader(String field) {
 
   @Override
   public OrdinalsSegmentReader getReader(LeafReaderContext context) throws 
IOException {

Review comment:
   Yeah, we can definitely simplify this. I took the path of least 
resistance as a first cut for testing impact, but I'll work on simplifying and 
cleaning this up. Thanks for the suggestion!




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] gsmiller commented on a change in pull request #264: LUCENE-10062: Switch to numeric doc values for encoding taxonomy ordinals (instead of custom binary format)

2021-08-26 Thread GitBox


gsmiller commented on a change in pull request #264:
URL: https://github.com/apache/lucene/pull/264#discussion_r696809143



##
File path: lucene/facet/src/java/org/apache/lucene/facet/FacetsConfig.java
##
@@ -410,7 +411,16 @@ private void processFacetFields(
 
   // Facet counts:
   // DocValues are considered stored fields:
-  doc.add(new BinaryDocValuesField(indexFieldName, 
dedupAndEncode(ordinals.get(;
+  IntsRef o = ordinals.get();
+  Arrays.sort(o.ints, o.offset, o.length);

Review comment:
   Thanks for the catch! I lifted this from the old code, which looks like 
it was also doing this incorrectly (apologies for the oversight!). I'll for 
sure correct it. Looks like we were never burned by this since 
`IntsRefBuilder#get` asserts that the offset is always 0... :)




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] gsmiller commented on a change in pull request #264: LUCENE-10062: Switch to numeric doc values for encoding taxonomy ordinals (instead of custom binary format)

2021-08-26 Thread GitBox


gsmiller commented on a change in pull request #264:
URL: https://github.com/apache/lucene/pull/264#discussion_r696806470



##
File path: lucene/facet/src/java/org/apache/lucene/facet/FacetsConfig.java
##
@@ -410,7 +411,16 @@ private void processFacetFields(
 
   // Facet counts:
   // DocValues are considered stored fields:

Review comment:
   :) will do




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] epugh merged pull request #2560: SOLR-15410: Always use -Xverbosegclog for OpenJ9

2021-08-26 Thread GitBox


epugh merged pull request #2560:
URL: https://github.com/apache/lucene-solr/pull/2560


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] mayya-sharipova opened a new pull request #267: Handle hierarchy in graph construction and search

2021-08-26 Thread GitBox


mayya-sharipova opened a new pull request #267:
URL: https://github.com/apache/lucene/pull/267


   This patch handles hierarchy in graph construction and search, but only in 
memory.
   
   Work left for the future: handle hierarchy on disk.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] rmuir commented on a change in pull request #264: LUCENE-10062: Switch to numeric doc values for encoding taxonomy ordinals (instead of custom binary format)

2021-08-26 Thread GitBox


rmuir commented on a change in pull request #264:
URL: https://github.com/apache/lucene/pull/264#discussion_r696757277



##
File path: 
lucene/facet/src/java/org/apache/lucene/facet/taxonomy/DocValuesOrdinalsReader.java
##
@@ -59,16 +53,21 @@ public void get(int docID, IntsRef ordinals) throws 
IOException {
   "docs out of order: lastDocID=" + lastDocID + " vs docID=" + 
docID);
 }
 lastDocID = docID;
-if (docID > values.docID()) {
-  values.advance(docID);
-}
-final BytesRef bytes;
-if (values.docID() == docID) {
-  bytes = values.binaryValue();
-} else {
-  bytes = new BytesRef(BytesRef.EMPTY_BYTES);
+
+ordinals.offset = 0;
+ordinals.length = 0;
+
+if (dv.advanceExact(docID)) {
+  int count = dv.docValueCount();
+  if (ordinals.ints.length < count) {
+ordinals.ints = ArrayUtil.grow(ordinals.ints, count);
+  }
+
+  for (int i = 0; i < count; i++) {
+ordinals.ints[ordinals.length] = (int) dv.nextValue();

Review comment:
   it is just a numeric field: lucene doesn't assign the ordinals. facets 
is the one adding numeric values. Could we do your suggested check at 
write-time instead?
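
   A hedged sketch of what such a write-time check could look like where the 
facet module adds the ordinals (the surrounding names follow the FacetsConfig 
snippet discussed elsewhere in this PR and are assumptions here):

   ```java
   // Sketch: reject out-of-range ordinals at index time, so the
   // (int) cast on the read side can never silently truncate.
   for (int i = 0; i < ordinals.length; i++) {
     int ordinal = ordinals.ints[ordinals.offset + i];
     if (ordinal < 0) {
       throw new IllegalArgumentException("invalid facet ordinal: " + ordinal);
     }
     doc.add(new SortedNumericDocValuesField(indexFieldName, ordinal));
   }
   ```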




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] rmuir commented on a change in pull request #264: LUCENE-10062: Switch to numeric doc values for encoding taxonomy ordinals (instead of custom binary format)

2021-08-26 Thread GitBox


rmuir commented on a change in pull request #264:
URL: https://github.com/apache/lucene/pull/264#discussion_r696756609



##
File path: 
lucene/facet/src/java/org/apache/lucene/facet/taxonomy/DocValuesOrdinalsReader.java
##
@@ -59,16 +53,21 @@ public void get(int docID, IntsRef ordinals) throws 
IOException {
   "docs out of order: lastDocID=" + lastDocID + " vs docID=" + 
docID);
 }
 lastDocID = docID;
-if (docID > values.docID()) {
-  values.advance(docID);
-}
-final BytesRef bytes;
-if (values.docID() == docID) {
-  bytes = values.binaryValue();
-} else {
-  bytes = new BytesRef(BytesRef.EMPTY_BYTES);
+
+ordinals.offset = 0;
+ordinals.length = 0;
+
+if (dv.advanceExact(docID)) {
+  int count = dv.docValueCount();
+  if (ordinals.ints.length < count) {
+ordinals.ints = ArrayUtil.grow(ordinals.ints, count);
+  }
+
+  for (int i = 0; i < count; i++) {
+ordinals.ints[ordinals.length] = (int) dv.nextValue();

Review comment:
   not sure it is the best tradeoff in this hot loop?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] jpountz commented on a change in pull request #262: LUCENE-10063: implement SimpleTextKnnvectorsReader.search

2021-08-26 Thread GitBox


jpountz commented on a change in pull request #262:
URL: https://github.com/apache/lucene/pull/262#discussion_r696747416



##
File path: 
lucene/codecs/src/java/org/apache/lucene/codecs/simpletext/SimpleTextKnnVectorsReader.java
##
@@ -140,7 +145,36 @@ public VectorValues getVectorValues(String field) throws 
IOException {
 
   @Override
   public TopDocs search(String field, float[] target, int k, Bits acceptDocs) 
throws IOException {
-throw new UnsupportedOperationException();
+VectorValues values = getVectorValues(field);
+if (values == null) {
+  return null;
+}

Review comment:
   Let's remove this null check? It shouldn't be necessary since 
`getVectorValues()` is expected to return a non-null value when the number of 
dimensions is greater than 0 on the `FieldInfo`.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] msokolov commented on a change in pull request #262: LUCENE-10063: implement SimpleTextKnnvectorsReader.search

2021-08-26 Thread GitBox


msokolov commented on a change in pull request #262:
URL: https://github.com/apache/lucene/pull/262#discussion_r696706087



##
File path: 
lucene/codecs/src/java/org/apache/lucene/codecs/simpletext/SimpleTextKnnVectorsReader.java
##
@@ -140,7 +147,38 @@ public VectorValues getVectorValues(String field) throws 
IOException {
 
   @Override
   public TopDocs search(String field, float[] target, int k, Bits acceptDocs) 
throws IOException {
-throw new UnsupportedOperationException();
+VectorValues values = getVectorValues(field);

Review comment:
   Thanks, let's follow the convention of relying on callers to do such 
checking then.

##
File path: 
lucene/codecs/src/java/org/apache/lucene/codecs/simpletext/SimpleTextKnnVectorsReader.java
##
@@ -140,7 +147,38 @@ public VectorValues getVectorValues(String field) throws 
IOException {
 
   @Override
   public TopDocs search(String field, float[] target, int k, Bits acceptDocs) 
throws IOException {
-throw new UnsupportedOperationException();
+VectorValues values = getVectorValues(field);
+if (values == null) {
+  return null;
+}
+if (target.length != values.dimension()) {
+  throw new IllegalArgumentException(
+  "incorrect dimension for field "
+  + field
+  + "; expected "
+  + values.dimension()
+  + " but target has "
+  + target.length);
+}
+FieldInfo info = readState.fieldInfos.fieldInfo(field);
+VectorSimilarityFunction vectorSimilarity = 
info.getVectorSimilarityFunction();
+HitQueue topK = new HitQueue(k, false);
+int doc;
+while ((doc = values.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
+  float[] vector = values.vectorValue();
+  float score = vectorSimilarity.compare(vector, target);
+  if (vectorSimilarity.reversed) {
+score = 1 / (score + 1);
+  }
+  topK.insertWithOverflow(new ScoreDoc(doc, score));
+}
+ScoreDoc[] topScoreDocs = new ScoreDoc[topK.size()];
+int i = 0;
+for (ScoreDoc scoreDoc : topK) {
+  topScoreDocs[i++] = scoreDoc;
+}
+Arrays.sort(topScoreDocs, Comparator.comparingInt(x -> x.doc));

Review comment:
   No, that's exactly right - should be sorted by score here, not by docid 
- I was confused having just written the Query implementation. I do see the 
`KnnVectorsReader` javadoc doesn't explicitly state what the contract is 
supposed to be; let's rectify that in a separate issue.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10060) Ensure DrillSidewaysQuery instances don't get cached

2021-08-26 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17405278#comment-17405278
 ] 

Adrien Grand commented on LUCENE-10060:
---

[~gsmiller] Your recent push seems to be triggering test failures on main and 
branch_8x, e.g.

{noformat}
gradlew :lucene:facet:test --tests 
"org.apache.lucene.facet.TestDrillSideways.testRandom" -Ptests.jvms=4 
-Ptests.jvmargs=-XX:TieredStopAtLevel=1 -Ptests.seed=4E66AA89940F75A0 
-Ptests.file.encoding=UTF-8

16:37:02 org.apache.lucene.facet.TestDrillSideways > testRandom FAILED
16:37:02 java.lang.AssertionError: expected:<1.4262214> but was:<0.6111962>
16:37:02 at 
__randomizedtesting.SeedInfo.seed([4E66AA89940F75A0:3C2A8F86256FC3D3]:0)
16:37:02 at org.junit.Assert.fail(Assert.java:89)
16:37:02 at org.junit.Assert.failNotEquals(Assert.java:835)
16:37:02 at org.junit.Assert.assertEquals(Assert.java:577)
16:37:02 at org.junit.Assert.assertEquals(Assert.java:701)
16:37:02 at 
org.apache.lucene.facet.TestDrillSideways.verifyEquals(TestDrillSideways.java:1634)
16:37:02 at 
org.apache.lucene.facet.TestDrillSideways.testRandom(TestDrillSideways.java:1304)
16:37:02 at 
java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
16:37:02 at 
java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
16:37:02 at 
java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
16:37:02 at java.base/java.lang.reflect.Method.invoke(Method.java:566)
16:37:02 at 
com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1754)
16:37:02 at 
com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:942)
16:37:02 at 
com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:978)
16:37:02 at 
com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:992)
16:37:02 at 
org.apache.lucene.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:44)
16:37:02 at 
org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
16:37:02 at 
org.apache.lucene.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:45)
16:37:02 at 
org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
16:37:02 at 
org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
16:37:02 at org.junit.rules.RunRules.evaluate(RunRules.java:20)
16:37:02 at 
com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
16:37:02 at 
com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:370)
16:37:02 at 
com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:819)
16:37:02 at 
com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:470)
16:37:02 at 
com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:951)
16:37:02 at 
com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:836)
16:37:02 at 
com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:887)
16:37:02 at 
com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:898)
16:37:02 at 
org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
16:37:02 at 
com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
16:37:02 at 
org.apache.lucene.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:38)
16:37:02 at 
com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
16:37:02 at 
com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
16:37:02 at 
com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
16:37:02 at 
com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
16:37:02 at 
org.apache.lucene.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
16:37:02 at 
org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
16:37:02 at 
org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
16:37:02 at 
org.apache.lucene.util.TestRuleIgnoreAfterMaxFailure

[GitHub] [lucene-solr] mikemccand commented on pull request #2559: LUCENE-10051 Fix lucene branch_8x run ant run-task error

2021-08-26 Thread GitBox


mikemccand commented on pull request #2559:
URL: https://github.com/apache/lucene-solr/pull/2559#issuecomment-906475832


   Awesome, thanks!  I'll review and push.
   
   Mike McCandless
   
   http://blog.mikemccandless.com
   
   
   On Thu, Aug 26, 2021 at 10:28 AM xiaoshi ***@***.***> wrote:
   
   > Hi @mikemccand  As you suggested, I have
   > also submitted a PR for lucene_8x: LUCENE-10051.
   >
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10060) Ensure DrillSidewaysQuery instances don't get cached

2021-08-26 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17405269#comment-17405269
 ] 

Adrien Grand commented on LUCENE-10060:
---

bq.  I assume what's going on here is that the caching ignores deleted docs and 
expects they'll be handled later

This is correct. The query cache caches based on the so-called "segment core" 
(see SegmentReader#core), i.e. everything but live docs and doc-value updates. 
This allows cache entries to be reused across reopens as long as a segment 
hasn't been merged away. But this requires cache entries to be computed on the 
entire segment, and follow-up searches to apply deletes on top of cache entries.

DrillSidewaysQuery is abusing the API a bit; it shouldn't be a Query. It would 
be nice to refactor it into something other than a Query at some point.
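
To make the "deletes applied on top" point concrete, a rough sketch (not the
actual LRUQueryCache code; cachedDocIdSet, context, and collector are
placeholders) of consuming a whole-segment cache entry while honoring current
live docs:

{noformat}
// Sketch: the cache entry covers every doc in the segment core, so
// live docs must be re-checked at search time to honor fresh deletes.
Bits liveDocs = context.reader().getLiveDocs();
DocIdSetIterator it = cachedDocIdSet.iterator();
for (int doc = it.nextDoc(); doc != DocIdSetIterator.NO_MORE_DOCS; doc = it.nextDoc()) {
  if (liveDocs == null || liveDocs.get(doc)) {
    collector.collect(doc);
  }
}
{noformat}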

> Ensure DrillSidewaysQuery instances don't get cached
> 
>
> Key: LUCENE-10060
> URL: https://issues.apache.org/jira/browse/LUCENE-10060
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/facet
>Reporter: Greg Miller
>Priority: Minor
> Fix For: main (9.0), 8.10
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> We need to make sure DSQ instances don't end up in the query cache. -It's 
> important that the {{DrillSidewaysScorer}} (bulk scorer implementation) 
> actually runs during query evaluation in order to populate the "sideways" 
> {{FacetsCollector}} instances with "near miss" docs. If it gets cached, this 
> won't happen.-
> There may also be an implication around {{acceptDocs}} getting honored as 
> well. [~zacharymorn] may be able to provide more details.
> UPDATE: The original issue I detailed above isn't actually an issue since 
> {{DrillDownQuery}} doesn't implement {{equals}}, so the cache always misses 
> and it always executes the {{BulkScorer}} ( {{DrillSidewaysScorer}} ). 
> Tricky! There is a separate issue found by Zach (as mentioned above) related 
> to "acceptDocs" though. See below conversation and link off to the separate 
> [PR 
> conversation|https://github.com/apache/lucene/pull/240#discussion_r692154001] 
> for more details.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] xiaoshi2013 commented on pull request #2559: LUCENE-10051 Fix lucene branch_8x run ant run-task error

2021-08-26 Thread GitBox


xiaoshi2013 commented on pull request #2559:
URL: https://github.com/apache/lucene-solr/pull/2559#issuecomment-906463308


   Hi @mikemccand As you suggested, I have also submitted a PR for lucene_8x: 
LUCENE-10051.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] xiaoshi2013 removed a comment on pull request #253: LUCENE-10058: fix gradle lucene:benchmark:run error

2021-08-26 Thread GitBox


xiaoshi2013 removed a comment on pull request #253:
URL: https://github.com/apache/lucene/pull/253#issuecomment-905706189


   > Thanks @xiaoshi2013! Could you also open a backport PR against 
[`branch_8x` in `lucene-solr` github 
repository](https://github.com/apache/lucene-solr/tree/branch_8x)? Thanks!
   
   Hi @mikemccand As you suggested, I have also submitted a PR for lucene_8x: 
LUCENE-10051, the PR is https://github.com/apache/lucene-solr/pull/2559


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] wuda0112 commented on pull request #224: LUCENE-10035: Simple text codec add multi level skip list data

2021-08-26 Thread GitBox


wuda0112 commented on pull request #224:
URL: https://github.com/apache/lucene/pull/224#issuecomment-906451844


   OK, got it. I will keep working on it. Thanks for your patience.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9917) Reduce block size for BEST_SPEED

2021-08-26 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17405249#comment-17405249
 ] 

Adrien Grand commented on LUCENE-9917:
--

Thank you [~rcmuir]!

> Reduce block size for BEST_SPEED
> 
>
> Key: LUCENE-9917
> URL: https://issues.apache.org/jira/browse/LUCENE-9917
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
> Fix For: 8.10
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> As benchmarks suggested major savings and minor slowdowns with larger block 
> sizes, I had increased them on LUCENE-9486. However it looks like this 
> slowdown is still problematic for some users, so I plan to go back to a 
> smaller block size, something like 10*16kB to get closer to the amount of 
> data we had to decompress per document when we had 16kB blocks without shared 
> dictionaries.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Resolved] (LUCENE-9917) Reduce block size for BEST_SPEED

2021-08-26 Thread Adrien Grand (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9917?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand resolved LUCENE-9917.
--
Fix Version/s: 8.10
   Resolution: Fixed

> Reduce block size for BEST_SPEED
> 
>
> Key: LUCENE-9917
> URL: https://issues.apache.org/jira/browse/LUCENE-9917
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
> Fix For: 8.10
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> As benchmarks suggested major savings and minor slowdowns with larger block 
> sizes, I had increased them on LUCENE-9486. However it looks like this 
> slowdown is still problematic for some users, so I plan to go back to a 
> smaller block size, something like 10*16kB to get closer to the amount of 
> data we had to decompress per document when we had 16kB blocks without shared 
> dictionaries.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] jpountz commented on pull request #224: LUCENE-10035: Simple text codec add multi level skip list data

2021-08-26 Thread GitBox


jpountz commented on pull request #224:
URL: https://github.com/apache/lucene/pull/224#issuecomment-906442903


   Ah, I see what you mean. Let's ignore my suggestion then and go with your 
initial suggestion.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9917) Reduce block size for BEST_SPEED

2021-08-26 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17405244#comment-17405244
 ] 

ASF subversion and git services commented on LUCENE-9917:
-

Commit 37368d3e405c6de3119c035d7cb7dec0f28c8a67 in lucene-solr's branch 
refs/heads/branch_8x from Adrien Grand
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=37368d3 ]

LUCENE-9917: Smaller block sizes for BEST_SPEED. (#257)

This reduces the block size for BEST_SPEED in order to trade some compression
ratio in exchange for better retrieval speed.


> Reduce block size for BEST_SPEED
> 
>
> Key: LUCENE-9917
> URL: https://issues.apache.org/jira/browse/LUCENE-9917
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> As benchmarks suggested major savings and minor slowdowns with larger block 
> sizes, I had increased them on LUCENE-9486. However it looks like this 
> slowdown is still problematic for some users, so I plan to go back to a 
> smaller block size, something like 10*16kB to get closer to the amount of 
> data we had to decompress per document when we had 16kB blocks without shared 
> dictionaries.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Resolved] (LUCENE-10060) Ensure DrillSidewaysQuery instances don't get cached

2021-08-26 Thread Greg Miller (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Greg Miller resolved LUCENE-10060.
--
Fix Version/s: 8.10
   main (9.0)
   Resolution: Fixed

> Ensure DrillSidewaysQuery instances don't get cached
> 
>
> Key: LUCENE-10060
> URL: https://issues.apache.org/jira/browse/LUCENE-10060
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/facet
>Reporter: Greg Miller
>Priority: Minor
> Fix For: main (9.0), 8.10
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> We need to make sure DSQ instances don't end up in the query cache. -It's 
> important that the {{DrillSidewaysScorer}} (bulk scorer implementation) 
> actually runs during query evaluation in order to populate the "sideways" 
> {{FacetsCollector}} instances with "near miss" docs. If it gets cached, this 
> won't happen.-
> There may also be an implication around {{acceptDocs}} getting honored as 
> well. [~zacharymorn] may be able to provide more details.
> UPDATE: The original issue I detailed above isn't actually an issue since 
> {{DrillDownQuery}} doesn't implement {{equals}}, so the cache always misses 
> and it always executes the {{BulkScorer}} ( {{DrillSidewaysScorer}} ). 
> Tricky! There is a separate issue found by Zach (as mentioned above) related 
> to "acceptDocs" though. See below conversation and link off to the separate 
> [PR 
> conversation|https://github.com/apache/lucene/pull/240#discussion_r692154001] 
> for more details.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10060) Ensure DrillSidewaysQuery instances don't get cached

2021-08-26 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17405242#comment-17405242
 ] 

ASF subversion and git services commented on LUCENE-10060:
--

Commit 1ae34dd129d791f61d58202e2d993ab2055ad1d5 in lucene-solr's branch 
refs/heads/branch_8x from Greg Miller
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=1ae34dd ]

LUCENE-10060: Ensure DrillSidewaysQuery instances never get cached (#2561)



> Ensure DrillSidewaysQuery instances don't get cached
> 
>
> Key: LUCENE-10060
> URL: https://issues.apache.org/jira/browse/LUCENE-10060
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/facet
>Reporter: Greg Miller
>Priority: Minor
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> We need to make sure DSQ instances don't end up in the query cache. -It's 
> important that the {{DrillSidewaysScorer}} (bulk scorer implementation) 
> actually runs during query evaluation in order to populate the "sideways" 
> {{FacetsCollector}} instances with "near miss" docs. If it gets cached, this 
> won't happen.-
> There may also be an implication around {{acceptDocs}} getting honored as 
> well. [~zacharymorn] may be able to provide more details.
> UPDATE: The original issue I detailed above isn't actually an issue since 
> {{DrillDownQuery}} doesn't implement {{equals}}, so the cache always misses 
> and it always executes the {{BulkScorer}} ( {{DrillSidewaysScorer}} ). 
> Tricky! There is a separate issue found by Zach (as mentioned above) related 
> to "acceptDocs" though. See below conversation and link off to the separate 
> [PR 
> conversation|https://github.com/apache/lucene/pull/240#discussion_r692154001] 
> for more details.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] gsmiller merged pull request #2561: LUCENE-10060: Ensure DrillSidewaysQuery instances never get cached

2021-08-26 Thread GitBox


gsmiller merged pull request #2561:
URL: https://github.com/apache/lucene-solr/pull/2561


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] wuda0112 commented on pull request #224: LUCENE-10035: Simple text codec add multi level skip list data

2021-08-26 Thread GitBox


wuda0112 commented on pull request #224:
URL: https://github.com/apache/lucene/pull/224#issuecomment-906430050


   > This looks good to me. Since we don't need to encode these numbers 
differently maybe we could have a single abstract method, e.g.
   > 
   > MultiLevelSkipListWriter
   > 
   > ```java
   >  /**
   >   * Write a small positive long on a variable number of bytes.
   >   *
   >   * @param l the long to write
   >   * @param out the output to write to
   >   */
   >  protected void writeVLong(long l, DataOutput out) throws IOException{
   >out.writeVLong(l);
   >  }
   > ```
   > 
   > MultiLevelSkipListReader
   > 
   > ```java
   >   /**
   >* Read a long written via {@link MultiLevelSkipListWriter#writeVLong}.
   >*
   >* @param in the IndexInput to read from
   >*/
   >   protected long readVLong(IndexInput in) throws IOException {
   > return in.readVLong();
   >   }
   > ```
   > 
   > This is a small enough change to an internal class that I don't feel the 
need to open a separate issue, we could do it on this PR.
   
   There is one thing I want to confirm: since the subclass does not know what 
the number being written stands for (e.g. a level length or a child pointer), 
it cannot add a modifier and hierarchy before the number, so the file may look 
like case 1:
   
   Case 1: no hierarchy and modifier
   ```
   field title
term lucene
doc 499
  freq 1
  pos 3
skipList 
   1024
  level 2
skipDoc 218
skipDocFP 15469
impacts 
  impact 
freq 1
norm 1
impactsEnd 
   64
   ```
   
   Case 2: we may expect
   ```
   field title
term lucene
doc 499
  freq 1
  pos 3
skipList 
  1024
  level 2
skipDoc 218
skipDocFP 15469
impacts 
  impact 
freq 1
norm 1
impactsEnd 
   childPointer 64
   ```
   Of course, **there is no need to change the API into an unsuitable state 
just for this PR**, so for me both case 1 and case 2 are OK! What is your 
opinion? Is case 1 OK?
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] gsmiller opened a new pull request #2561: LUCENE-10060: Ensure DrillSidewaysQuery instances never get cached

2021-08-26 Thread GitBox


gsmiller opened a new pull request #2561:
URL: https://github.com/apache/lucene-solr/pull/2561


   Backport of a bug fix from `main`


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10062) Explore using SORTED_NUMERIC doc values to encode taxonomy ordinals for faceting

2021-08-26 Thread Greg Miller (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17405218#comment-17405218
 ] 

Greg Miller commented on LUCENE-10062:
--

Hmm, so I ran an internal benchmarking tool against our Lucene application 
(Amazon Product Search) and the results were not nearly as compelling. It looks 
like there wasn't much impact on red-line QPS or latency (in particular, of 
our facet-counting step). It also looks like the index got bigger with this 
change by ~4%. I suspect there's a significant difference between the two tests 
with respect to how many facet categories each doc is storing on average, 
probably highlighting the gap between these solutions where one is doing delta 
encoding and one isn't.

I'm certainly not saying this should be a show-stopper for moving forward with 
this change, but it would be really good to understand whether our internal 
use-case is an outlier here or the {{luceneutil}} testing is the outlier. I'd 
obviously want to avoid a situation where our benchmarks think this is a great 
improvement but most common Lucene users see a regression! If anyone else has 
an application they're able to benchmark the change with, that could provide 
some more interesting data points. I'll also see if I can dig in more on our 
internal application and look to see if things can be sped up.
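
For context on the delta-encoding gap mentioned above, a simplified sketch of
the sorted-delta VInt packing the old custom binary format relies on (not the
exact dedupAndEncode code):

{noformat}
import java.io.IOException;
import org.apache.lucene.store.DataOutput;

class OrdinalDeltaEncoder {
  // Sorted, deduplicated ordinals encoded as VInt deltas: docs with
  // many categories produce many small gaps that fit in one byte each,
  // the case where the old format beats a general-purpose encoding.
  static void encodeDeltas(int[] sortedOrdinals, DataOutput out) throws IOException {
    int previous = 0;
    for (int ordinal : sortedOrdinals) {
      out.writeVInt(ordinal - previous);
      previous = ordinal;
    }
  }
}
{noformat}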

> Explore using SORTED_NUMERIC doc values to encode taxonomy ordinals for 
> faceting
> 
>
> Key: LUCENE-10062
> URL: https://issues.apache.org/jira/browse/LUCENE-10062
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/facet
>Reporter: Greg Miller
>Assignee: Greg Miller
>Priority: Minor
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> We currently encode taxonomy ordinals using varint style packing in a binary 
> doc values field. I suspect there have been a number of improvements to 
> SortedNumericDocValues since taxonomy faceting was first introduced, and I 
> plan to explore replacing the custom binary format we have today with a 
> SORTED_NUMERIC type dv field instead.
> I'll report benchmark results and index size impact here.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] colvinco commented on pull request #2560: SOLR-15410: Always use -Xverbosegclog for OpenJ9

2021-08-26 Thread GitBox


colvinco commented on pull request #2560:
URL: https://github.com/apache/lucene-solr/pull/2560#issuecomment-906404688


   @epugh fyi


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10060) Ensure DrillSidewaysQuery instances don't get cached

2021-08-26 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17405215#comment-17405215
 ] 

ASF subversion and git services commented on LUCENE-10060:
--

Commit dbf7e1865f9901cece78584a145aebe778af0b20 in lucene's branch 
refs/heads/main from Greg Miller
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=dbf7e18 ]

LUCENE-10060: Ensure DrillSidewaysQuery instances never get cached (#261)



> Ensure DrillSidewaysQuery instances don't get cached
> 
>
> Key: LUCENE-10060
> URL: https://issues.apache.org/jira/browse/LUCENE-10060
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/facet
>Reporter: Greg Miller
>Priority: Minor
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> We need to make sure DSQ instances don't end up in the query cache. -It's 
> important that the {{DrillSidewaysScorer}} (bulk scorer implementation) 
> actually runs during query evaluation in order to populate the "sideways" 
> {{FacetsCollector}} instances with "near miss" docs. If it gets cached, this 
> won't happen.-
> There may also be an implication around {{acceptDocs}} getting honored as 
> well. [~zacharymorn] may be able to provide more details.
> UPDATE: The original issue I detailed above isn't actually an issue since 
> {{DrillDownQuery}} doesn't implement {{equals}}, so the cache always misses 
> and it always executes the {{BulkScorer}} ( {{DrillSidewaysScorer}} ). 
> Tricky! There is a separate issue found by Zach (as mentioned above) related 
> to "acceptDocs" though. See below conversation and link off to the separate 
> [PR 
> conversation|https://github.com/apache/lucene/pull/240#discussion_r692154001] 
> for more details.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] gsmiller commented on pull request #261: LUCENE-10060: Ensure DrillSidewaysQuery instances never get cached

2021-08-26 Thread GitBox


gsmiller commented on pull request #261:
URL: https://github.com/apache/lucene/pull/261#issuecomment-906397479


   Thanks @zacharymorn for the quick review and double-checking the fix against 
your PR!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] gsmiller merged pull request #261: LUCENE-10060: Ensure DrillSidewaysQuery instances never get cached

2021-08-26 Thread GitBox


gsmiller merged pull request #261:
URL: https://github.com/apache/lucene/pull/261


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] jpountz merged pull request #257: LUCENE-9917: Smaller block sizes for BEST_SPEED.

2021-08-26 Thread GitBox


jpountz merged pull request #257:
URL: https://github.com/apache/lucene/pull/257


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9917) Reduce block size for BEST_SPEED

2021-08-26 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17405211#comment-17405211
 ] 

ASF subversion and git services commented on LUCENE-9917:
-

Commit f1fdd2465c0edd6c75cbbf404b7d5a381b84e6c8 in lucene's branch 
refs/heads/main from Adrien Grand
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=f1fdd24 ]

LUCENE-9917: Smaller block sizes for BEST_SPEED. (#257)

This reduces the block size for BEST_SPEED in order to trade some compression
ratio in exchange for better retrieval speed.

> Reduce block size for BEST_SPEED
> 
>
> Key: LUCENE-9917
> URL: https://issues.apache.org/jira/browse/LUCENE-9917
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> As benchmarks suggested major savings and minor slowdowns with larger block 
> sizes, I had increased them on LUCENE-9486. However it looks like this 
> slowdown is still problematic for some users, so I plan to go back to a 
> smaller block size, something like 10*16kB to get closer to the amount of 
> data we had to decompress per document when we had 16kB blocks without shared 
> dictionaries.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Created] (LUCENE-10073) Allow very small merges to merge more than segmentsPerTier segments?

2021-08-26 Thread Adrien Grand (Jira)
Adrien Grand created LUCENE-10073:
-

 Summary: Allow very small merges to merge more than 
segmentsPerTier segments?
 Key: LUCENE-10073
 URL: https://issues.apache.org/jira/browse/LUCENE-10073
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Adrien Grand


If you are doing lots of concurrent indexing, NRT search regularly publishes 
many tiny segments, which in turn puts a lot of pressure on merging, which 
needs to constantly merge these tiny segments so that the total number of 
segments of the index remains under the budget.

In parallel, TieredMergePolicy's behavior is to merge aggressively segments 
that are below the floor size. The budget of the number of segments allowed in 
the index is computed as if all segments were larger than the floor size, and 
merges that only contain segments below the floor size get a perfect skew, which 
guarantees them a better score than any merge that contains one or more 
segments above the floor size.

I'm considering reducing the merging overhead in the NRT case by raising 
maxMergeAtOnce and allowing merges to merge more than mergeFactor segments as 
long as the number of merged segments is below maxMergeAtOnce and the merged 
segment size is below the floor segment size.

In other words, "normal" merges would be allowed to merge up to mergeFactor 
segments like today, while small merges (size of the merged segment < floor 
segment bytes) could go up to maxMergeAtOnce.
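
To make the proposal concrete, here is a minimal, hypothetical sketch of the 
admission rule being described; the method and parameter names are illustrative 
(mergeFactor stands in for segmentsPerTier), not TieredMergePolicy's actual code:

```java
// Hypothetical sketch of the proposed rule, not TieredMergePolicy's actual code.
public class MergeSizingSketch {

  static boolean allowMerge(
      long mergedBytes,
      int segmentCount,
      long floorSegmentBytes,
      int mergeFactor,
      int maxMergeAtOnce) {
    if (mergedBytes < floorSegmentBytes) {
      // "small" merge: may combine up to maxMergeAtOnce tiny segments
      return segmentCount <= maxMergeAtOnce;
    }
    // "normal" merge: keeps today's mergeFactor limit
    return segmentCount <= mergeFactor;
  }

  public static void main(String[] args) {
    // 30 tiny NRT segments whose merged size (1 MB) stays under a 2 MB floor:
    // allowed under the proposal even though 30 > mergeFactor.
    System.out.println(allowMerge(1L << 20, 30, 2L << 20, 10, 30)); // true
    // the same 30 segments totaling 64 MB exceed the floor: rejected.
    System.out.println(allowMerge(64L << 20, 30, 2L << 20, 10, 30)); // false
  }
}
```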



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] jpountz commented on a change in pull request #264: LUCENE-10062: Switch to numeric doc values for encoding taxonomy ordinals (instead of custom binary format)

2021-08-26 Thread GitBox


jpountz commented on a change in pull request #264:
URL: https://github.com/apache/lucene/pull/264#discussion_r696569831



##
File path: lucene/facet/src/java/org/apache/lucene/facet/FacetsConfig.java
##
@@ -410,7 +411,16 @@ private void processFacetFields(
 
   // Facet counts:
   // DocValues are considered stored fields:
-  doc.add(new BinaryDocValuesField(indexFieldName, dedupAndEncode(ordinals.get())));
+  IntsRef o = ordinals.get();
+  Arrays.sort(o.ints, o.offset, o.length);

Review comment:
   The last parameter of Arrays#sort is the end index, not the length.
   
   ```suggestion
 Arrays.sort(o.ints, o.offset, o.offset+o.length);
   ```
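
A standalone demonstration of the off-by-one this catches (illustrative, not 
code from the PR): with a non-zero offset, passing the length as the third 
argument sorts the wrong window.

```java
import java.util.Arrays;

public class SortRangeDemo {
  public static void main(String[] args) {
    int[] ints = {9, 5, 7, 3};
    int offset = 1, length = 3; // intent: sort the 3 values starting at index 1

    int[] wrong = ints.clone();
    Arrays.sort(wrong, offset, length); // sorts indices [1, 3): only {5, 7}
    System.out.println(Arrays.toString(wrong)); // [9, 5, 7, 3] -- the 3 never moves

    int[] right = ints.clone();
    Arrays.sort(right, offset, offset + length); // sorts indices [1, 4): {5, 7, 3}
    System.out.println(Arrays.toString(right)); // [9, 3, 5, 7]
  }
}
```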

##
File path: 
lucene/facet/src/java/org/apache/lucene/facet/taxonomy/DocValuesOrdinalsReader.java
##
@@ -59,16 +53,21 @@ public void get(int docID, IntsRef ordinals) throws 
IOException {
   "docs out of order: lastDocID=" + lastDocID + " vs docID=" + 
docID);
 }
 lastDocID = docID;
-if (docID > values.docID()) {
-  values.advance(docID);
-}
-final BytesRef bytes;
-if (values.docID() == docID) {
-  bytes = values.binaryValue();
-} else {
-  bytes = new BytesRef(BytesRef.EMPTY_BYTES);
+
+ordinals.offset = 0;
+ordinals.length = 0;
+
+if (dv.advanceExact(docID)) {

Review comment:
   this change probably contributes to the speedup; we had seen 
non-negligible speedups when moving from advance to advanceExact.
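
For readers comparing the two access patterns, a rough side-by-side against the 
standard DocValuesIterator API (illustrative helper, not code from this PR):

```java
import java.io.IOException;
import org.apache.lucene.index.BinaryDocValues;

// Rough side-by-side of the two per-document access patterns discussed above.
final class DocValuesAccessPatterns {

  // advance(): the caller must position the iterator and then re-check its docID.
  static boolean hasValueViaAdvance(BinaryDocValues values, int docID) throws IOException {
    if (docID > values.docID()) {
      values.advance(docID);
    }
    return values.docID() == docID;
  }

  // advanceExact(): a single call answers "does this doc have a value?" directly.
  static boolean hasValueViaAdvanceExact(BinaryDocValues values, int docID) throws IOException {
    return values.advanceExact(docID);
  }
}
```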

##
File path: 
lucene/facet/src/java/org/apache/lucene/facet/taxonomy/DocValuesOrdinalsReader.java
##
@@ -41,12 +40,7 @@ public DocValuesOrdinalsReader(String field) {
 
   @Override
   public OrdinalsSegmentReader getReader(LeafReaderContext context) throws 
IOException {

Review comment:
   Do we actually need the OrdinalsSegmentReader or could we use 
`SortedNumericDocValues` instead? Both seem to be about fetching some sorted 
integers for a given doc ID. This might help get some more performance as we 
would no longer have to copy these ints into an IntsRef.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] mikemccand commented on pull request #264: LUCENE-10062: Switch to numeric doc values for encoding taxonomy ordinals (instead of custom binary format)

2021-08-26 Thread GitBox


mikemccand commented on pull request #264:
URL: https://github.com/apache/lucene/pull/264#issuecomment-906343623


   I think a simple approach for the back-compat is to switch on 
`SegmentInfos.getIndexCreatedVersionMajor()`.  If the index was created 
pre-8.10, then we use the legacy encoding (`BDV`).  Else, we use the new 
encoding (`SSDV`).   This avoids having the same field name trying to use two 
different doc-values types, mixing old and new segments in the index, etc.  The 
code/switching is simpler.  But one downside is that users must fully create an 
entirely new index to get the benefits of the new encoding -- they cannot just 
upgrade Lucene and `.forceMerge(1)` and get the new tech.  I think that's a 
fine tradeoff.
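
A minimal sketch of what such a switch could look like; the helper name and 
cut-over constant below are assumptions for illustration, not the actual change:

```java
import org.apache.lucene.index.SegmentInfos;

// Illustrative sketch only; the helper name and cut-over constant are assumptions.
final class OrdinalsFormatSwitch {
  private static final int CUT_OVER_MAJOR_VERSION = 9; // whichever release ships the change

  static boolean useLegacyBinaryOrdinals(SegmentInfos infos) {
    // Indexes created before the cut-over keep the legacy BDV encoding;
    // indexes created on or after it use the new SSDV encoding.
    return infos.getIndexCreatedVersionMajor() < CUT_OVER_MAJOR_VERSION;
  }
}
```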


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] mikemccand commented on a change in pull request #264: LUCENE-10062: Switch to numeric doc values for encoding taxonomy ordinals (instead of custom binary format)

2021-08-26 Thread GitBox


mikemccand commented on a change in pull request #264:
URL: https://github.com/apache/lucene/pull/264#discussion_r696530876



##
File path: lucene/facet/src/java/org/apache/lucene/facet/FacetsConfig.java
##
@@ -410,7 +411,16 @@ private void processFacetFields(
 
   // Facet counts:
   // DocValues are considered stored fields:

Review comment:
   Hmm maybe remove this old and misleading comment?

##
File path: lucene/facet/src/java/org/apache/lucene/facet/FacetsConfig.java
##
@@ -410,7 +411,16 @@ private void processFacetFields(
 
   // Facet counts:
   // DocValues are considered stored fields:
-  doc.add(new BinaryDocValuesField(indexFieldName, dedupAndEncode(ordinals.get())));
+  IntsRef o = ordinals.get();
+  Arrays.sort(o.ints, o.offset, o.length);
+  int prev = -1;
+  for (int i = 0; i < o.length; i++) {
+int ord = o.ints[o.offset + i];
+if (ord > prev) {
+  doc.add(new SortedNumericDocValuesField(indexFieldName, ord));

Review comment:
   Lucene also does this same sorting during indexing, so it is redundant 
here.  But we do indeed need to dedup.  Are we sure nothing above this has 
already dedup'd the added SSDV facet labels?

##
File path: 
lucene/facet/src/java/org/apache/lucene/facet/taxonomy/DocValuesOrdinalsReader.java
##
@@ -59,16 +53,21 @@ public void get(int docID, IntsRef ordinals) throws 
IOException {
   "docs out of order: lastDocID=" + lastDocID + " vs docID=" + 
docID);
 }
 lastDocID = docID;
-if (docID > values.docID()) {
-  values.advance(docID);
-}
-final BytesRef bytes;
-if (values.docID() == docID) {
-  bytes = values.binaryValue();
-} else {
-  bytes = new BytesRef(BytesRef.EMPTY_BYTES);
+
+ordinals.offset = 0;
+ordinals.length = 0;
+
+if (dv.advanceExact(docID)) {
+  int count = dv.docValueCount();
+  if (ordinals.ints.length < count) {
+ordinals.ints = ArrayUtil.grow(ordinals.ints, count);
+  }
+
+  for (int i = 0; i < count; i++) {
+ordinals.ints[ordinals.length] = (int) dv.nextValue();

Review comment:
   Maybe use `Math.toIntExact` instead of `(int)` for better safety (in 
case somehow a too-big `long` shows up)?
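
A quick standalone illustration of the difference (not PR code): Math.toIntExact 
throws on overflow where a plain cast silently wraps.

```java
public class NarrowingDemo {
  public static void main(String[] args) {
    long tooBig = Integer.MAX_VALUE + 1L;        // 2147483648 does not fit in an int
    System.out.println((int) tooBig);            // -2147483648: silent wrap-around
    System.out.println(Math.toIntExact(tooBig)); // throws ArithmeticException
  }
}
```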




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] mikemccand commented on pull request #264: LUCENE-10062: Switch to numeric doc values for encoding taxonomy ordinals (instead of custom binary format)

2021-08-26 Thread GitBox


mikemccand commented on pull request #264:
URL: https://github.com/apache/lucene/pull/264#issuecomment-906312429


   > In benchmarks, using numeric doc values to store taxonomy facet ordinals 
shows almost a 400% qps improvement in browse-related taxonomy-based tasks 
(instead of custom delta-encoding into a binary doc values field).
   
   HOLY SMOKES!!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] gautamworah96 commented on a change in pull request #179: LUCENE-9476: Add getBulkPath API to DirectoryTaxonomyReader

2021-08-26 Thread GitBox


gautamworah96 commented on a change in pull request #179:
URL: https://github.com/apache/lucene/pull/179#discussion_r696495553



##
File path: 
lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyReader.java
##
@@ -351,12 +348,140 @@ public FacetLabel getPath(int ordinal) throws 
IOException {
 }
 
 synchronized (categoryCache) {
-  categoryCache.put(catIDInteger, ret);
+  categoryCache.put(ordinal, ret);
 }
 
 return ret;
   }
 
+  private FacetLabel getPathFromCache(int ordinal) {
+// TODO: can we use an int-based hash impl, such as IntToObjectMap,
+// wrapped as LRU?
+synchronized (categoryCache) {
+  return categoryCache.get(ordinal);
+}
+  }
+
+  private void checkOrdinalBounds(int ordinal) throws IllegalArgumentException 
{
+if (ordinal < 0 || ordinal >= indexReader.maxDoc()) {
+  throw new IllegalArgumentException(
+  "ordinal "
+  + ordinal
+  + " is out of the range of the indexReader "
+  + indexReader.toString()
+  + ". The maximum possible ordinal number is "
+  + (indexReader.maxDoc() - 1));
+}
+  }
+
+  /**
+   * Returns an array of FacetLabels for a given array of ordinals.
+   *
+   * This API is generally faster than iteratively calling {@link 
#getPath(int)} over an array of
+   * ordinals. It uses the {@link #getPath(int)} method iteratively when it 
detects that the index
+   * was created using StoredFields (with no performance gains) and uses 
DocValues based iteration
+   * when the index is based on BinaryDocValues. Lucene switched to 
BinaryDocValues in version 9.0
+   *
+   * @param ordinals Array of ordinals that are assigned to categories 
inserted into the taxonomy
+   * index
+   */
+  public FacetLabel[] getBulkPath(int... ordinals) throws IOException {
+ensureOpen();
+
+int ordinalsLength = ordinals.length;
+FacetLabel[] bulkPath = new FacetLabel[ordinalsLength];
+// remember the original positions of ordinals before they are sorted
+int[] originalPosition = new int[ordinalsLength];
+Arrays.setAll(originalPosition, IntUnaryOperator.identity());
+
+for (int i = 0; i < ordinalsLength; i++) {
+  // check whether the ordinal is valid before accessing the cache
+  checkOrdinalBounds(ordinals[i]);
+  // check the cache before trying to find it in the index
+  FacetLabel ordinalPath = getPathFromCache(ordinals[i]);

Review comment:
   Yes. This should obviously be better. Fixed in the next commit




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Comment Edited] (LUCENE-10068) Switch to a "double barrel" HPPC cache for the taxonomy LRU cache

2021-08-26 Thread Gautam Worah (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17405108#comment-17405108
 ] 

Gautam Worah edited comment on LUCENE-10068 at 8/26/21, 9:48 AM:
-

Some history here: We previously had a [double barrel 
LRU|https://lucene.apache.org/core/3_6_2/api/core/org/apache/lucene/util/DoubleBarrelLRUCache.html]
 cache in Lucene 3.6 but we ended up removing it I think. Jira Search throws 
some ancient Jira 
[tickets|http://jirasearch.mikemccandless.com/search.py?chg=new&text=DoubleBarrelLRUCache&a1=&a2=&page=0&searcher=42159&sort=recentlyUpdated&format=list&id=1xy07b2ketsi&newText=DoubleBarrelLRUCache]


was (Author: gworah):
Some history here: We previously had a[ double barrel 
LRU|https://lucene.apache.org/core/3_6_2/api/core/org/apache/lucene/util/DoubleBarrelLRUCache.html]
 cache in Lucene 3.6 but we ended up removing it I think. Jira Search throws 
some ancient Jira 
[tickets|http://jirasearch.mikemccandless.com/search.py?chg=new&text=DoubleBarrelLRUCache&a1=&a2=&page=0&searcher=42159&sort=recentlyUpdated&format=list&id=1xy07b2ketsi&newText=DoubleBarrelLRUCache]

> Switch to a "double barrel" HPPC cache for the taxonomy LRU cache
> -
>
> Key: LUCENE-10068
> URL: https://issues.apache.org/jira/browse/LUCENE-10068
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/facet
>Affects Versions: 8.8.1
>Reporter: Gautam Worah
>Priority: Minor
>
> While working on an unrelated getBulkPath API 
> [PR|https://github.com/apache/lucene/pull/179], [~mikemccand] and I came 
> across a nice optimization that could be made to the taxonomy cache.
> The taxonomy cache today caches frequently used ordinals and their 
> corresponding FacetLabels. It uses the existing LRUHashMap (backed by a 
> LinkedList) class for its implementation.
> This implementation performs suboptimally when it has a large number of 
> threads accessing it, and consumes a large amount of RAM.
> [~mikemccand] suggested the idea of a two array backed HPPC int->FacetLabel 
> cache. The basic idea behind the cache being:
>  # We use two hashmaps primary and secondary.
>  # In case of a cache miss in the primary and a cache hit in the secondary, 
> we add the key to the primary map as well.
>  # In case of a cache miss in both the maps, we add it to the primary map.
>  # When we reach (make this check each time we insert?) a large number of 
> elements in, say, the primary cache (larger than the existing 
> DEFAULT_CACHE_VALUE=4000), we dump the secondary map 
> and copy all the values of the primary map into it.
> The idea was originally explained in 
> [this|https://github.com/apache/lucene/pull/179#discussion_r692907559] 
> comment.
>  
>  
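
For readers unfamiliar with the pattern, here is a rough sketch of the idea 
described above, using plain HashMaps in place of the HPPC int->FacetLabel maps; 
the 4000 threshold comes from the description, everything else is an assumption:

```java
import java.util.HashMap;
import java.util.Map;

// Rough sketch of the "double barrel" idea; not the proposed Lucene implementation.
final class DoubleBarrelCacheSketch<K, V> {
  private static final int MAX_PRIMARY_SIZE = 4000; // DEFAULT_CACHE_VALUE in the description

  private Map<K, V> primary = new HashMap<>();
  private Map<K, V> secondary = new HashMap<>();

  synchronized V get(K key) {
    V value = primary.get(key);
    if (value == null) {
      value = secondary.get(key);
      if (value != null) {
        primary.put(key, value); // promote secondary hits back into primary
        maybeSwap();
      }
    }
    return value;
  }

  synchronized void put(K key, V value) {
    primary.put(key, value);
    maybeSwap();
  }

  private void maybeSwap() {
    if (primary.size() > MAX_PRIMARY_SIZE) {
      // Dump the old secondary and let the full primary become the new secondary;
      // this approximates LRU eviction without per-entry bookkeeping.
      secondary = primary;
      primary = new HashMap<>();
    }
  }
}
```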



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10068) Switch to a "double barrel" HPPC cache for the taxonomy LRU cache

2021-08-26 Thread Gautam Worah (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17405108#comment-17405108
 ] 

Gautam Worah commented on LUCENE-10068:
---

Some history here: We previously had a [double barrel 
LRU|https://lucene.apache.org/core/3_6_2/api/core/org/apache/lucene/util/DoubleBarrelLRUCache.html]
 cache in Lucene 3.6 but we ended up removing it I think. Jira Search throws 
some ancient Jira 
[tickets|http://jirasearch.mikemccandless.com/search.py?chg=new&text=DoubleBarrelLRUCache&a1=&a2=&page=0&searcher=42159&sort=recentlyUpdated&format=list&id=1xy07b2ketsi&newText=DoubleBarrelLRUCache]

> Switch to a "double barrel" HPPC cache for the taxonomy LRU cache
> -
>
> Key: LUCENE-10068
> URL: https://issues.apache.org/jira/browse/LUCENE-10068
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/facet
>Affects Versions: 8.8.1
>Reporter: Gautam Worah
>Priority: Minor
>
> While working on an unrelated getBulkPath API 
> [PR|https://github.com/apache/lucene/pull/179], [~mikemccand] and I came 
> across a nice optimization that could be made to the taxonomy cache.
> The taxonomy cache today caches frequently used ordinals and their 
> corresponding FacetLabels. It uses the existing LRUHashMap (backed by a 
> LinkedList) class for its implementation.
> This implementation performs suboptimally when it has a large number of 
> threads accessing it, and consumes a large amount of RAM.
> [~mikemccand] suggested the idea of a two array backed HPPC int->FacetLabel 
> cache. The basic idea behind the cache being:
>  # We use two hashmaps primary and secondary.
>  # In case of a cache miss in the primary and a cache hit in the secondary, 
> we add the key to the primary map as well.
>  # In case of a cache miss in both the maps, we add it to the primary map.
>  # When we reach (make this check each time we insert?) a large number of 
> elements in, say, the primary cache (larger than the existing 
> DEFAULT_CACHE_VALUE=4000), we dump the secondary map 
> and copy all the values of the primary map into it.
> The idea was originally explained in 
> [this|https://github.com/apache/lucene/pull/179#discussion_r692907559] 
> comment.
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Resolved] (LUCENE-10072) Regenerate FST dictionaries (they're out of sync)

2021-08-26 Thread Dawid Weiss (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dawid Weiss resolved LUCENE-10072.
--
Fix Version/s: main (9.0)
   Resolution: Fixed

> Regenerate FST dictionaries (they're out of sync)
> -
>
> Key: LUCENE-10072
> URL: https://issues.apache.org/jira/browse/LUCENE-10072
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Dawid Weiss
>Assignee: Dawid Weiss
>Priority: Trivial
> Fix For: main (9.0)
>
>
> Running regenerate leaves modified resources (FST dictionaries).
> - Identify which commit changed the binary output.
> - Regenerate the static resources so that regenerate is a no-op again.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9047) Directory APIs should be little endian

2021-08-26 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17405096#comment-17405096
 ] 

ASF subversion and git services commented on LUCENE-9047:
-

Commit f6e3b08ae95376b1c574759522974a8d74f850c5 in lucene's branch 
refs/heads/main from Dawid Weiss
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=f6e3b08 ]

LUCENE-10072: Regenerate FST dictionaries after LUCENE-9047. (#265)



> Directory APIs should be little endian
> --
>
> Key: LUCENE-9047
> URL: https://issues.apache.org/jira/browse/LUCENE-9047
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Assignee: Ignacio Vera
>Priority: Blocker
> Fix For: main (9.0)
>
>  Time Spent: 12.5h
>  Remaining Estimate: 0h
>
> We started discussing this on LUCENE-9027. It's a shame that we need to keep 
> reversing the order of bytes all the time because our APIs are big endian 
> while the vast majority of architectures are little endian.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10072) Regenerate FST dictionaries (they're out of sync)

2021-08-26 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17405095#comment-17405095
 ] 

ASF subversion and git services commented on LUCENE-10072:
--

Commit f6e3b08ae95376b1c574759522974a8d74f850c5 in lucene's branch 
refs/heads/main from Dawid Weiss
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=f6e3b08 ]

LUCENE-10072: Regenerate FST dictionaries after LUCENE-9047. (#265)



> Regenerate FST dictionaries (they're out of sync)
> -
>
> Key: LUCENE-10072
> URL: https://issues.apache.org/jira/browse/LUCENE-10072
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Dawid Weiss
>Assignee: Dawid Weiss
>Priority: Trivial
>
> Running regenerate leaves modified resources (FST dictionaries).
> - Identify which commit changed the binary output.
> - Regenerate the static resources so that regenerate is a no-op again.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] dweiss merged pull request #265: LUCENE-10072: Regenerate FST dictionaries after LUCENE-9047.

2021-08-26 Thread GitBox


dweiss merged pull request #265:
URL: https://github.com/apache/lucene/pull/265


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10072) Regenerate FST dictionaries (they're out of sync)

2021-08-26 Thread Dawid Weiss (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17405063#comment-17405063
 ] 

Dawid Weiss commented on LUCENE-10072:
--

The change is due to LUCENE-9047 (Directory API is now little endian). I'll 
regenerate and commit.

> Regenerate FST dictionaries (they're out of sync)
> -
>
> Key: LUCENE-10072
> URL: https://issues.apache.org/jira/browse/LUCENE-10072
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Dawid Weiss
>Assignee: Dawid Weiss
>Priority: Trivial
>
> Running regenerate leaves modified resources (FST dictionaries).
> - Identify which commit changed the binary output.
> - Regenerate the static resources so that regenerate is a no-op again.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Created] (LUCENE-10072) Regenerate FST dictionaries (they're out of sync)

2021-08-26 Thread Dawid Weiss (Jira)
Dawid Weiss created LUCENE-10072:


 Summary: Regenerate FST dictionaries (they're out of sync)
 Key: LUCENE-10072
 URL: https://issues.apache.org/jira/browse/LUCENE-10072
 Project: Lucene - Core
  Issue Type: Task
Reporter: Dawid Weiss
Assignee: Dawid Weiss


Running regenerate leaves modified resources (FST dictionaries).
- Identify which commit changed the binary output.
- Regenerate the static resources so that regenerate is a no-op again.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10066) Build does not work with JDK16 as gradle's runtime

2021-08-26 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17405045#comment-17405045
 ] 

ASF subversion and git services commented on LUCENE-10066:
--

Commit 39a2fc62d40b3f82000ab4f35ed09c3e10cd6b5e in lucene's branch 
refs/heads/main from Dawid Weiss
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=39a2fc6 ]

LUCENE-10066: Build does not work with JDK16 as gradle's runtime (#259)



> Build does not work with JDK16 as gradle's runtime
> --
>
> Key: LUCENE-10066
> URL: https://issues.apache.org/jira/browse/LUCENE-10066
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Dawid Weiss
>Priority: Minor
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> This is because of spotless trying to access internal APIs - can be worked 
> around by opening them [1].
> There is also an odd warning/error about a deprecated constructor (singleton 
> new Long(0)). I didn't look at it.
> [1] https://github.com/diffplug/spotless/issues/834#issuecomment-819118761
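
For reference, the workaround in the linked spotless issue boils down to a 
gradle.properties fragment along these lines; the exact list of exported 
packages is an assumption taken from that issue, not the committed Lucene change:

```
# Hypothetical fragment modeled on the linked spotless workaround;
# the actual packages Lucene's build opens may differ.
org.gradle.jvmargs=--add-exports jdk.compiler/com.sun.tools.javac.api=ALL-UNNAMED \
  --add-exports jdk.compiler/com.sun.tools.javac.tree=ALL-UNNAMED \
  --add-exports jdk.compiler/com.sun.tools.javac.util=ALL-UNNAMED
```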



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Resolved] (LUCENE-10066) Build does not work with JDK16 as gradle's runtime

2021-08-26 Thread Dawid Weiss (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dawid Weiss resolved LUCENE-10066.
--
Fix Version/s: main (9.0)
 Assignee: Dawid Weiss
   Resolution: Fixed

> Build does not work with JDK16 as gradle's runtime
> --
>
> Key: LUCENE-10066
> URL: https://issues.apache.org/jira/browse/LUCENE-10066
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Dawid Weiss
>Assignee: Dawid Weiss
>Priority: Minor
> Fix For: main (9.0)
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> This is because of spotless trying to access internal APIs - can be worked 
> around by opening them [1].
> There is also an odd warning/error about a deprecated constructor (singleton 
> new Long(0)). I didn't look at it.
> [1] https://github.com/diffplug/spotless/issues/834#issuecomment-819118761



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] dweiss merged pull request #259: LUCENE-10066: Build does not work with JDK16 as gradle's runtime

2021-08-26 Thread GitBox


dweiss merged pull request #259:
URL: https://github.com/apache/lucene/pull/259


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] dweiss commented on pull request #259: LUCENE-10066: Build does not work with JDK16 as gradle's runtime

2021-08-26 Thread GitBox


dweiss commented on pull request #259:
URL: https://github.com/apache/lucene/pull/259#issuecomment-906189898


   Things are always easy, once solved. :)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9990) Tracking issue for Gradle upgrade to 7.2

2021-08-26 Thread Dawid Weiss (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17405043#comment-17405043
 ] 

Dawid Weiss commented on LUCENE-9990:
-

Everything works up to JDK16 -- I'll commit LUCENE-10066 shortly. If you 
regenerate your local gradle.properties, it adds module opens for spotless; that 
was the only thing that wasn't compatible.

> Tracking issue for Gradle upgrade to 7.2
> 
>
> Key: LUCENE-9990
> URL: https://issues.apache.org/jira/browse/LUCENE-9990
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Gautam Worah
>Assignee: Dawid Weiss
>Priority: Minor
> Fix For: main (9.0)
>
> Attachments: wip_gradle7_upgrade_patch
>
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> Gradle 7 has added support for running builds with arbitrary JVMs.
> Today, Gradle 6 only supports running tests with Java 16 and so on.
> I tried to upgrade our Gradle version to 7 and made some progress.
> 1. Removed the JavaInstallationRegistry plugin because it is deprecated in 
> Gradle 7 (a simple build scan reveals this). This is replaced by the 
> toolchain support added in Gradle 7 and works great. JavaInstallationRegistry 
> is the only deprecated plugin
>  2. Building Lucene with Java 16 gives some weird error when trying to access 
> internal JVM classes. Fixing it with {{--add-opens}} does the trick. Related 
> Github issue: [https://github.com/gradle/gradle/issues/15538]
> What does not work?
> As noted by [~dweiss] 
> [here|https://github.com/palantir/gradle-consistent-versions/issues/700], the 
> gradle-consistent-versions plugin does not support Gradle 7. There was a 
> related [PR|https://github.com/palantir/gradle-consistent-versions/pull/721] 
> but it is still a WIP.
>  
> Attached is a WIP patch that breaks due to the gradle-consistent-versions 
> plugin



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9990) Tracking issue for Gradle upgrade to 7.2

2021-08-26 Thread Uwe Schindler (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17405039#comment-17405039
 ] 

Uwe Schindler commented on LUCENE-9990:
---

Hi,
Jenkins seems happy on Windows, Linux, and Mac. Passing RUNTIME_JAVA_HOME also 
works on Policeman Jenkins. The build always runs with JDK 11 there, so I see no 
reason to improve compatibility yet. This should be solved by Gradle.

> Tracking issue for Gradle upgrade to 7.2
> 
>
> Key: LUCENE-9990
> URL: https://issues.apache.org/jira/browse/LUCENE-9990
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Gautam Worah
>Assignee: Dawid Weiss
>Priority: Minor
> Fix For: main (9.0)
>
> Attachments: wip_gradle7_upgrade_patch
>
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> Gradle 7 has added support for running builds with arbitrary JVMs.
> Today, Gradle 6 only supports running tests with Java 16 and so on.
> I tried to upgrade our Gradle version to 7 and made some progress.
> 1. Removed the JavaInstallationRegistry plugin because it is deprecated in 
> Gradle 7 (a simple build scan reveals this). This is replaced by the 
> toolchain support added in Gradle 7 and works great. JavaInstallationRegistry 
> is the only deprecated plugin
>  2. Building Lucene with Java 16 gives some weird error when trying to access 
> internal JVM classes. Fixing it with {{--add-opens}} does the trick. Related 
> Github issue: [https://github.com/gradle/gradle/issues/15538]
> What does not work?
> As noted by [~dweiss] 
> [here|https://github.com/palantir/gradle-consistent-versions/issues/700], the 
> gradle-consistent-versions plugin does not support Gradle 7. There was a 
> related [PR|https://github.com/palantir/gradle-consistent-versions/pull/721] 
> but it is still a WIP.
>  
> Attached is a WIP patch that breaks due to the gradle-consistent-versions 
> plugin



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9990) Tracking issue for Gradle upgrade to 7.2

2021-08-26 Thread Dawid Weiss (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17405034#comment-17405034
 ] 

Dawid Weiss commented on LUCENE-9990:
-

I actually ran those already. One of the generated files changes (fst.dat) - I 
think it's because of FST implementation changes; this could be regenerated and 
committed to make sure it's in sync with the codebase.

> Tracking issue for Gradle upgrade to 7.2
> 
>
> Key: LUCENE-9990
> URL: https://issues.apache.org/jira/browse/LUCENE-9990
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Gautam Worah
>Assignee: Dawid Weiss
>Priority: Minor
> Fix For: main (9.0)
>
> Attachments: wip_gradle7_upgrade_patch
>
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> Gradle 7 has added support for running builds with arbitrary JVMs.
> Today, Gradle 6 only supports running tests with Java 16 and so on.
> I tried to upgrade our Gradle version to 7 and made some progress.
> 1. Removed the JavaInstallationRegistry plugin because it is deprecated in 
> Gradle 7 (a simple build scan reveals this). This is replaced by the 
> toolchain support added in Gradle 7 and works great. JavaInstallationRegistry 
> is the only deprecated plugin
>  2. Building Lucene with Java 16 gives some weird error when trying to access 
> internal JVM classes. Fixing it with {{--add-opens}} does the trick. Related 
> Github issue: [https://github.com/gradle/gradle/issues/15538]
> What does not work?
> As noted by [~dweiss] 
> [here|https://github.com/palantir/gradle-consistent-versions/issues/700], the 
> gradle-consistent-versions plugin does not support Gradle 7. There was a 
> related [PR|https://github.com/palantir/gradle-consistent-versions/pull/721] 
> but it is still a WIP.
>  
> Attached is a WIP patch that breaks due to the gradle-consistent-versions 
> plugin



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10067) investigate 6/23/2021 -> 6/24/2021 drop in facets perf

2021-08-26 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17405029#comment-17405029
 ] 

ASF subversion and git services commented on LUCENE-10067:
--

Commit 2d7590a3555c5afb205bc781cd227d0c9e3d47a3 in lucene's branch 
refs/heads/main from Adrien Grand
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=2d7590a ]

LUCENE-9613, LUCENE-10067: Further specialize ordinals. (#260)



> investigate 6/23/2021 -> 6/24/2021 drop in facets perf
> --
>
> Key: LUCENE-10067
> URL: https://issues.apache.org/jira/browse/LUCENE-10067
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Robert Muir
>Priority: Major
>
> Just looking at the perf graph (and recent gains from LUCENE-5309), it is 
> unclear why performance dropped on that date, and whether we paved new, better 
> performance or instead just regained some lost ground only in the single 
> value case.
> Example:
> https://home.apache.org/~mikemccand/lucenebench/BrowseDayOfYearSSDVFacets.html
> after some debugging, it looks like LUCENE-9613 change may be responsible. It 
> was the only relevant change between the 6/23 and 6/24 benchmarks. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9613) Create blocks for ords when it helps in Lucene80DocValuesFormat

2021-08-26 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17405028#comment-17405028
 ] 

ASF subversion and git services commented on LUCENE-9613:
-

Commit 2d7590a3555c5afb205bc781cd227d0c9e3d47a3 in lucene's branch 
refs/heads/main from Adrien Grand
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=2d7590a ]

LUCENE-9613, LUCENE-10067: Further specialize ordinals. (#260)



> Create blocks for ords when it helps in Lucene80DocValuesFormat
> ---
>
> Key: LUCENE-9613
> URL: https://issues.apache.org/jira/browse/LUCENE-9613
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
> Fix For: main (9.0)
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> Currently for sorted(-set) values, we always write ords using 
> log2(valueCount) bits per entry. However in several cases like when the field 
> is used in the index sort, or if one value is _very_ common, splitting into 
> blocks like we do for numerics would help.
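
Back-of-the-envelope arithmetic for the fixed-width cost described above 
(illustrative only): a sorted field with a million distinct values pays 20 bits 
per document, even when nearly every document shares one value.

```java
public class OrdBitsDemo {
  public static void main(String[] args) {
    long valueCount = 1_000_000;
    // ceil(log2(valueCount)): the fixed per-ordinal cost of the current encoding
    int bitsPerOrd = 64 - Long.numberOfLeadingZeros(valueCount - 1);
    System.out.println(bitsPerOrd + " bits per ord"); // prints: 20 bits per ord
  }
}
```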



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] jpountz merged pull request #260: LUCENE-9613, LUCENE-10067: Further specialize ordinals.

2021-08-26 Thread GitBox


jpountz merged pull request #260:
URL: https://github.com/apache/lucene/pull/260


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] jpountz commented on a change in pull request #262: LUCENE-10063: implement SimpleTextKnnvectorsReader.search

2021-08-26 Thread GitBox


jpountz commented on a change in pull request #262:
URL: https://github.com/apache/lucene/pull/262#discussion_r696363685



##
File path: 
lucene/codecs/src/java/org/apache/lucene/codecs/simpletext/SimpleTextKnnVectorsReader.java
##
@@ -140,7 +147,38 @@ public VectorValues getVectorValues(String field) throws 
IOException {
 
   @Override
   public TopDocs search(String field, float[] target, int k, Bits acceptDocs) 
throws IOException {
-throw new UnsupportedOperationException();
+VectorValues values = getVectorValues(field);
+if (values == null) {
+  return null;
+}
+if (target.length != values.dimension()) {
+  throw new IllegalArgumentException(
+  "incorrect dimension for field "
+  + field
+  + "; expected "
+  + values.dimension()
+  + " but target has "
+  + target.length);
+}
+FieldInfo info = readState.fieldInfos.fieldInfo(field);
+VectorSimilarityFunction vectorSimilarity = 
info.getVectorSimilarityFunction();
+HitQueue topK = new HitQueue(k, false);
+int doc;
+while ((doc = values.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
+  float[] vector = values.vectorValue();
+  float score = vectorSimilarity.compare(vector, target);
+  if (vectorSimilarity.reversed) {
+score = 1 / (score + 1);
+  }
+  topK.insertWithOverflow(new ScoreDoc(doc, score));
+}
+ScoreDoc[] topScoreDocs = new ScoreDoc[topK.size()];
+int i = 0;
+for (ScoreDoc scoreDoc : topK) {
+  topScoreDocs[i++] = scoreDoc;
+}
+Arrays.sort(topScoreDocs, Comparator.comparingInt(x -> x.doc));

Review comment:
   I would have expected this method to return vectors by descending score. 
This makes me curious about what the exact contract of this method is regarding 
the order of the hits. If it is unspecified, maybe we should make it clearer in 
javadocs?

##
File path: 
lucene/codecs/src/java/org/apache/lucene/codecs/simpletext/SimpleTextKnnVectorsReader.java
##
@@ -140,7 +147,38 @@ public VectorValues getVectorValues(String field) throws 
IOException {
 
   @Override
   public TopDocs search(String field, float[] target, int k, Bits acceptDocs) 
throws IOException {
-throw new UnsupportedOperationException();
+VectorValues values = getVectorValues(field);

Review comment:
   Our general approach for these problems is that the 
`XXXReader`/`XXXProducer` classes can assume that their methods are only called 
on fields that have the feature enabled according to `FieldInfos`, and it's the 
responsibility of `CodecReader` to check `FieldInfos` before forwarding calls 
to `XXXReader`/`XXXProducer` classes. This seems to already be done correctly 
in `CodecReader#searchNearestVectors`.
   
   So I think we are good, and we should even remove the `if (values == null)` 
check below, which is not necessary and might hide bugs.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org