[jira] [Comment Edited] (LUCENE-9855) Reconsider names for ANN related format and APIs

2021-06-07 Thread Julie Tibshirani (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17359044#comment-17359044
 ] 

Julie Tibshirani edited comment on LUCENE-9855 at 6/8/21, 5:32 AM:
---

I noticed recently we still had this blocker left (I had missed it because I 
filtered issues only on 'Open' not 'Reopened'!)

It seems [~jpountz] and I have a different intuition about {{NeighborsFormat}}! 
My only real suggestion is that 'vectors' appears somewhere in the name -- this 
feels like it accurately describes the data and will match users' expectations. 
Any of these names seem like they'd work to me: {{VectorNeighborsFormat}}, 
{{NumericVectorsFormat}}, {{DenseVectorsFormat}}. If naming proves too 
difficult, I guess we could also fall back to {{VectorsFormat}} for now, making 
sure it's at least plural?


was (Author: julietibs):
I noticed recently we still had this blocker left (I had missed it because I 
filtered issues only on 'Opened'!)

It seems [~jpountz] and I have a different intuition about {{NeighborsFormat}}! 
My only real suggestion is that 'vectors' appears somewhere in the name -- this 
feels like it accurately describes the data and will match users' expectations. 
Any of these names seem like they'd work to me: {{VectorNeighborsFormat}}, 
{{NumericVectorsFormat}}, {{DenseVectorsFormat}}. If naming proves too 
difficult, I guess we could also fall back to {{VectorsFormat}} for now, making 
sure it's at least plural?

> Reconsider names for ANN related format and APIs
> 
>
> Key: LUCENE-9855
> URL: https://issues.apache.org/jira/browse/LUCENE-9855
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs
>Affects Versions: main (9.0)
>Reporter: Tomoko Uchida
>Priority: Blocker
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> There is some discussion about the codec name for ANN search.
> https://lists.apache.org/thread.html/r3a6fa29810a1e85779de72562169e72d927d5a5dd2f9ea97705b8b2e%40%3Cdev.lucene.apache.org%3E
> Main points here are: 1) use the plural form for consistency, and 2) use a 
> more specific name for ANN search (the second point could be optional).
> A few alternatives were proposed:
> - VectorsFormat
> - VectorValuesFormat
> - NeighborsFormat
> - DenseVectorsFormat



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9855) Reconsider names for ANN related format and APIs

2021-06-07 Thread Julie Tibshirani (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17359044#comment-17359044
 ] 

Julie Tibshirani commented on LUCENE-9855:
--

I noticed recently we still had this blocker left (I had missed it because I 
filtered issues only on 'Opened'!)

It seems [~jpountz] and I have a different intuition about {{NeighborsFormat}}! 
My only real suggestion is that 'vectors' appears somewhere in the name -- this 
feels like it accurately describes the data and will match users' expectations. 
Any of these names seem like they'd work to me: {{VectorNeighborsFormat}}, 
{{NumericVectorsFormat}}, {{DenseVectorsFormat}}. If naming proves too 
difficult, I guess we could also fall back to {{VectorsFormat}} for now, making 
sure it's at least plural?

> Reconsider names for ANN related format and APIs
> 
>
> Key: LUCENE-9855
> URL: https://issues.apache.org/jira/browse/LUCENE-9855
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs
>Affects Versions: main (9.0)
>Reporter: Tomoko Uchida
>Priority: Blocker
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> There is some discussion about the codec name for ANN search.
> https://lists.apache.org/thread.html/r3a6fa29810a1e85779de72562169e72d927d5a5dd2f9ea97705b8b2e%40%3Cdev.lucene.apache.org%3E
> Main points here are: 1) use the plural form for consistency, and 2) use a 
> more specific name for ANN search (the second point could be optional).
> A few alternatives were proposed:
> - VectorsFormat
> - VectorValuesFormat
> - NeighborsFormat
> - DenseVectorsFormat






[GitHub] [lucene] jtibshirani commented on a change in pull request #176: LUCENE-9905: Make sure to use configured vector format when merging

2021-06-07 Thread GitBox


jtibshirani commented on a change in pull request #176:
URL: https://github.com/apache/lucene/pull/176#discussion_r647045642



##
File path: 
lucene/core/src/java/org/apache/lucene/codecs/perfield/PerFieldVectorFormat.java
##
@@ -114,14 +114,7 @@ public void close() throws IOException {
 }
 
 private VectorWriter getInstance(FieldInfo field) throws IOException {
-  VectorFormat format = null;

Review comment:
   I think the original logic was copied from `PerFieldDocValuesFormat`. My 
understanding is that it only applied when writing to a segment with doc value 
updates (and not when merging). `PerFieldPostingsFormat` never actually 
consults the per-field format when writing. I'd appreciate it if someone could 
double-check this.
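The per-field dispatch being discussed can be illustrated with a stdlib-only sketch. This is not Lucene's actual `PerFieldVectorFormat` code; the class, the `String`-keyed "formats", and the `StringBuilder` stand-in for a writer are all hypothetical, only the lookup-then-cache shape mirrors the pattern under review:

```java
import java.util.HashMap;
import java.util.Map;

// Simplified sketch of the per-field dispatch pattern used by classes like
// PerFieldDocValuesFormat: each field name maps to the format configured for
// it, and one writer per format is created lazily and cached. The bug in the
// PR under review was that the merge path skipped this lookup and always used
// the default format.
class PerFieldDispatch {
  private final Map<String, String> formatPerField = new HashMap<>();
  private final Map<String, StringBuilder> writerPerFormat = new HashMap<>();

  void configureField(String field, String formatName) {
    formatPerField.put(field, formatName);
  }

  // Returns the cached writer for the format configured on this field,
  // falling back to a default format when none was configured.
  StringBuilder getInstance(String field) {
    String format = formatPerField.getOrDefault(field, "default");
    return writerPerFormat.computeIfAbsent(format, StringBuilder::new);
  }

  public static void main(String[] args) {
    PerFieldDispatch dispatch = new PerFieldDispatch();
    dispatch.configureField("vector_field", "hnsw");
    System.out.println(dispatch.getInstance("vector_field")); // hnsw writer
    System.out.println(dispatch.getInstance("other_field"));  // default writer
  }
}
```

The key property, also what the PR fix restores for merging, is that every code path creating a writer goes through the same per-field lookup.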




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] zacharymorn commented on pull request #128: LUCENE-9662: CheckIndex should be concurrent - parallelizing index check across segments

2021-06-07 Thread GitBox


zacharymorn commented on pull request #128:
URL: https://github.com/apache/lucene/pull/128#issuecomment-856394004


   Hi @mikemccand, I have addressed the comment above with some additional 
changes and posted some updated results. Could you please let me know if they 
look ready?





[GitHub] [lucene] zacharymorn commented on a change in pull request #128: LUCENE-9662: CheckIndex should be concurrent - parallelizing index check across segments

2021-06-07 Thread GitBox


zacharymorn commented on a change in pull request #128:
URL: https://github.com/apache/lucene/pull/128#discussion_r647068377



##
File path: lucene/core/src/java/org/apache/lucene/index/CheckIndex.java
##
@@ -843,6 +812,258 @@ public Status checkIndex(List onlySegments) 
throws IOException {
 return result;
   }
 
+  private void updateMaxSegmentName(Status result, SegmentCommitInfo info) {
+long segmentName = Long.parseLong(info.info.name.substring(1), 
Character.MAX_RADIX);
+if (segmentName > result.maxSegmentName) {
+  result.maxSegmentName = segmentName;
+}
+  }
+
+  private void processSegmentInfoStatusResult(
+  Status result, SegmentCommitInfo info, Status.SegmentInfoStatus 
segmentInfoStatus) {
+result.segmentInfos.add(segmentInfoStatus);
+if (segmentInfoStatus.error != null) {
+  result.totLoseDocCount += segmentInfoStatus.toLoseDocCount;
+  result.numBadSegments++;
+} else {
+  // Keeper
+  result.newSegments.add(info.clone());
+}
+  }
+
+  private  CompletableFuture runAsyncSegmentCheck(
+  Callable asyncCallable, ExecutorService executorService) {
+return CompletableFuture.supplyAsync(callableToSupplier(asyncCallable), 
executorService);
+  }
+
+  private  Supplier callableToSupplier(Callable callable) {
+return () -> {
+  try {
+return callable.call();
+  } catch (RuntimeException | Error e) {
+throw e;
+  } catch (Throwable e) {
+throw new CompletionException(e);
+  }
+};
+  }
+
+  private Status.SegmentInfoStatus testSegment(
+  SegmentInfos sis, SegmentCommitInfo info, PrintStream infoStream) throws 
IOException {
+Status.SegmentInfoStatus segInfoStat = new Status.SegmentInfoStatus();
+segInfoStat.name = info.info.name;
+segInfoStat.maxDoc = info.info.maxDoc();
+
+final Version version = info.info.getVersion();
+if (info.info.maxDoc() <= 0) {
+  throw new CheckIndexException(" illegal number of documents: maxDoc=" + 
info.info.maxDoc());
+}
+
+int toLoseDocCount = info.info.maxDoc();
+
+SegmentReader reader = null;
+
+try {
+  msg(infoStream, "version=" + (version == null ? "3.0" : version));
+  msg(infoStream, "id=" + StringHelper.idToString(info.info.getId()));
+  final Codec codec = info.info.getCodec();
+  msg(infoStream, "codec=" + codec);
+  segInfoStat.codec = codec;
+  msg(infoStream, "compound=" + info.info.getUseCompoundFile());
+  segInfoStat.compound = info.info.getUseCompoundFile();
+  msg(infoStream, "numFiles=" + info.files().size());
+  Sort indexSort = info.info.getIndexSort();
+  if (indexSort != null) {
+msg(infoStream, "sort=" + indexSort);
+  }
+  segInfoStat.numFiles = info.files().size();
+  segInfoStat.sizeMB = info.sizeInBytes() / (1024. * 1024.);
+  // nf#format is not thread-safe, and would generate random non valid 
results in concurrent
+  // setting
+  synchronized (nf) {
+msg(infoStream, "size (MB)=" + nf.format(segInfoStat.sizeMB));
+  }
+  Map diagnostics = info.info.getDiagnostics();
+  segInfoStat.diagnostics = diagnostics;
+  if (diagnostics.size() > 0) {
+msg(infoStream, "diagnostics = " + diagnostics);
+  }
+
+  if (!info.hasDeletions()) {
+msg(infoStream, "no deletions");
+segInfoStat.hasDeletions = false;
+  } else {
+msg(infoStream, "has deletions [delGen=" + info.getDelGen() + "]");
+segInfoStat.hasDeletions = true;
+segInfoStat.deletionsGen = info.getDelGen();
+  }
+
+  long startOpenReaderNS = System.nanoTime();
+  if (infoStream != null) infoStream.print("test: open 
reader.");
+  reader = new SegmentReader(info, sis.getIndexCreatedVersionMajor(), 
IOContext.DEFAULT);
+  msg(
+  infoStream,
+  String.format(
+  Locale.ROOT, "OK [took %.3f sec]", nsToSec(System.nanoTime() - 
startOpenReaderNS)));
+
+  segInfoStat.openReaderPassed = true;
+
+  long startIntegrityNS = System.nanoTime();
+  if (infoStream != null) infoStream.print("test: check 
integrity.");
+  reader.checkIntegrity();
+  msg(
+  infoStream,
+  String.format(
+  Locale.ROOT, "OK [took %.3f sec]", nsToSec(System.nanoTime() - 
startIntegrityNS)));
+
+  if (reader.maxDoc() != info.info.maxDoc()) {
+throw new CheckIndexException(
+"SegmentReader.maxDoc() "
++ reader.maxDoc()
++ " != SegmentInfo.maxDoc "
++ info.info.maxDoc());
+  }
+
+  final int numDocs = reader.numDocs();
+  toLoseDocCount = numDocs;

Review comment:
   Same as above 
https://github.com/apache/lucene/pull/128#discussion_r642722424, this might be 
intended behavior?
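The `callableToSupplier` adapter in the diff above is a general `java.util.concurrent` pattern: `CompletableFuture.supplyAsync` takes a `Supplier`, which cannot throw checked exceptions, while segment checks are naturally `Callable`s that can. A stdlib-only sketch (class and task names are illustrative, not Lucene's):

```java
import java.util.concurrent.Callable;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Supplier;

// Adapts checked-exception-throwing Callables so they can be scheduled with
// CompletableFuture.supplyAsync, mirroring the helper in the CheckIndex diff.
class AsyncChecks {
  static <R> Supplier<R> callableToSupplier(Callable<R> callable) {
    return () -> {
      try {
        return callable.call();
      } catch (RuntimeException | Error e) {
        throw e; // unchecked: propagate as-is
      } catch (Throwable e) {
        throw new CompletionException(e); // checked: wrap in an unchecked type
      }
    };
  }

  static <R> CompletableFuture<R> runAsync(Callable<R> task, ExecutorService pool) {
    return CompletableFuture.supplyAsync(callableToSupplier(task), pool);
  }

  public static void main(String[] args) {
    ExecutorService pool = Executors.newFixedThreadPool(2);
    // Two hypothetical per-segment checks running concurrently.
    CompletableFuture<String> a = runAsync(() -> "segment _0: OK", pool);
    CompletableFuture<String> b = runAsync(() -> "segment _1: OK", pool);
    CompletableFuture.allOf(a, b).join();
    System.out.println(a.join());
    System.out.println(b.join());
    pool.shutdown();
  }
}
```

Wrapping in `CompletionException` keeps the original checked exception available via `getCause()` when the future is joined.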

##
File path: lucene/core/src/java/org/apache/lucene/index/CheckIndex.java

[jira] [Commented] (LUCENE-9976) WANDScorer assertion error in ensureConsistent

2021-06-07 Thread Zach Chen (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17359001#comment-17359001
 ] 

Zach Chen commented on LUCENE-9976:
---

No worries [~jpountz], and I hope you had a great vacation! I'm looking forward 
to mine coming up in a few weeks! :D
{quote}It's a bit worrying that this bug only got caught by 
TestExpressionSorts, I wonder why the test cases we have in TestWANDScorer 
didn't catch it.
{quote}
That's a great call. I played around with the tests there a bit and came up 
with one new test that would fail around 80% of the time (not sure if there's 
clause ordering or other randomness kicking in) without the fix. From that, I 
think the _ConstantScoreQuery_ used heavily in those tests might have masked 
the issue a bit?

> WANDScorer assertion error in ensureConsistent
> --
>
> Key: LUCENE-9976
> URL: https://issues.apache.org/jira/browse/LUCENE-9976
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Dawid Weiss
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Build fails and is reproducible:
> https://ci-builds.apache.org/job/Lucene/job/Lucene-NightlyTests-main/283/console
> {code}
> ./gradlew test --tests TestExpressionSorts.testQueries 
> -Dtests.seed=FF571CE915A0955 -Dtests.multiplier=2 -Dtests.nightly=true 
> -Dtests.slow=true -Dtests.asserts=true -p lucene/expressions/
> {code}






[GitHub] [lucene] gautamworah96 commented on a change in pull request #175: LUCENE-9990: gradle7 support

2021-06-07 Thread GitBox


gautamworah96 commented on a change in pull request #175:
URL: https://github.com/apache/lucene/pull/175#discussion_r647053364



##
File path: gradle/testing/alternative-jdk-support.gradle
##
@@ -18,60 +18,60 @@
 // This adds support for compiling and testing against a different Java 
runtime.
 // This is the only way to build against JVMs not yet supported by Gradle 
itself.
 
-JavaInstallationRegistry registry = 
extensions.getByType(JavaInstallationRegistry)
-
-JavaInstallation currentJvm = 
registry.installationForCurrentVirtualMachine.get()
-
-JavaInstallation altJvm = {
-  def runtimeJavaHome = propertyOrDefault("runtime.java.home", 
System.getenv('RUNTIME_JAVA_HOME'))
-  if (!runtimeJavaHome) {
-return currentJvm
-  } else {
-return registry.installationForDirectory(
-layout.projectDirectory.dir(runtimeJavaHome)).get()
-  }
-}()
-
-// Set up root project's property.
-rootProject.ext.runtimeJava = altJvm
-rootProject.ext.runtimeJavaVersion = altJvm.javaVersion
-
-if (!currentJvm.javaExecutable.equals(altJvm.javaExecutable)) {
-  // Set up java toolchain tasks to use the alternative Java.
-  // This is a related Gradle issue for the future:
-  // https://github.com/gradle/gradle/issues/1652
-
-  configure(rootProject) {
-task altJvmWarning() {
-  doFirst {
-logger.warn("""NOTE: Alternative java toolchain will be used for 
compilation and tests:
-  Project will use Java ${altJvm.javaVersion} from: 
${altJvm.installationDirectory}
-  Gradle runs with Java ${currentJvm.javaVersion} from: 
${currentJvm.installationDirectory}
-""")
-  }
-}
-  }
-
-  // Set up toolchain-dependent tasks to use the alternative JVM.
-  allprojects {
-// Any tests
-tasks.withType(Test) {
-  dependsOn ":altJvmWarning"
-  executable = altJvm.javaExecutable
-}
-
-// Any javac compilation tasks
-tasks.withType(JavaCompile) {
-  dependsOn ":altJvmWarning"
-  options.fork = true
-  options.forkOptions.javaHome = altJvm.installationDirectory.asFile
-}
-
-// Javadoc compilation.
-def javadocExecutable = altJvm.jdk.get().javadocExecutable.asFile
-tasks.matching { it.name == "renderJavadoc" || it.name == 
"renderSiteJavadoc" }.all {
-  dependsOn ":altJvmWarning"
-  executable = javadocExecutable
-}
-  }
-}
+//JavaInstallationRegistry registry = 
extensions.getByType(JavaInstallationRegistry)

Review comment:
   Sure.
   Gradle 7 has something similar to what we have already set up today, i.e., 
using an env variable to pass in the location of the JVM. I can get it running 
and then add instructions in lucene/help/jvms.txt, and clean up the comments in 
this file as well!







[GitHub] [lucene] gautamworah96 commented on a change in pull request #175: LUCENE-9990: gradle7 support

2021-06-07 Thread GitBox


gautamworah96 commented on a change in pull request #175:
URL: https://github.com/apache/lucene/pull/175#discussion_r647053364



##
File path: gradle/testing/alternative-jdk-support.gradle
##
@@ -18,60 +18,60 @@
 // This adds support for compiling and testing against a different Java 
runtime.
 // This is the only way to build against JVMs not yet supported by Gradle 
itself.
 
-JavaInstallationRegistry registry = 
extensions.getByType(JavaInstallationRegistry)
-
-JavaInstallation currentJvm = 
registry.installationForCurrentVirtualMachine.get()
-
-JavaInstallation altJvm = {
-  def runtimeJavaHome = propertyOrDefault("runtime.java.home", 
System.getenv('RUNTIME_JAVA_HOME'))
-  if (!runtimeJavaHome) {
-return currentJvm
-  } else {
-return registry.installationForDirectory(
-layout.projectDirectory.dir(runtimeJavaHome)).get()
-  }
-}()
-
-// Set up root project's property.
-rootProject.ext.runtimeJava = altJvm
-rootProject.ext.runtimeJavaVersion = altJvm.javaVersion
-
-if (!currentJvm.javaExecutable.equals(altJvm.javaExecutable)) {
-  // Set up java toolchain tasks to use the alternative Java.
-  // This is a related Gradle issue for the future:
-  // https://github.com/gradle/gradle/issues/1652
-
-  configure(rootProject) {
-task altJvmWarning() {
-  doFirst {
-logger.warn("""NOTE: Alternative java toolchain will be used for 
compilation and tests:
-  Project will use Java ${altJvm.javaVersion} from: 
${altJvm.installationDirectory}
-  Gradle runs with Java ${currentJvm.javaVersion} from: 
${currentJvm.installationDirectory}
-""")
-  }
-}
-  }
-
-  // Set up toolchain-dependent tasks to use the alternative JVM.
-  allprojects {
-// Any tests
-tasks.withType(Test) {
-  dependsOn ":altJvmWarning"
-  executable = altJvm.javaExecutable
-}
-
-// Any javac compilation tasks
-tasks.withType(JavaCompile) {
-  dependsOn ":altJvmWarning"
-  options.fork = true
-  options.forkOptions.javaHome = altJvm.installationDirectory.asFile
-}
-
-// Javadoc compilation.
-def javadocExecutable = altJvm.jdk.get().javadocExecutable.asFile
-tasks.matching { it.name == "renderJavadoc" || it.name == 
"renderSiteJavadoc" }.all {
-  dependsOn ":altJvmWarning"
-  executable = javadocExecutable
-}
-  }
-}
+//JavaInstallationRegistry registry = 
extensions.getByType(JavaInstallationRegistry)

Review comment:
   Sure.
Gradle 7 has something similar to what we have already set up today, i.e., 
using an env variable to pass in the location of the JVM. I can get it running 
and then add instructions in lucene/help/jvms.txt 







[GitHub] [lucene] gautamworah96 commented on a change in pull request #175: LUCENE-9990: gradle7 support

2021-06-07 Thread GitBox


gautamworah96 commented on a change in pull request #175:
URL: https://github.com/apache/lucene/pull/175#discussion_r647040460



##
File path: gradle/java/javac.gradle
##
@@ -16,6 +16,15 @@
  */
 
 // Configure Java project defaults.
+java {
+  toolchain {
+languageVersion = JavaLanguageVersion.of(11)
+  }
+}
+
+tasks.withType( JavaCompile ).configureEach {

Review comment:
   Thanks for adding and reviewing the patch!
   
   Sure. gradle/hacks makes sense. I'll also follow up on the ticket to see if 
there is a better way to solve this.







[GitHub] [lucene] jtibshirani commented on a change in pull request #176: LUCENE-9905: Make sure to use configured vector format when merging

2021-06-07 Thread GitBox


jtibshirani commented on a change in pull request #176:
URL: https://github.com/apache/lucene/pull/176#discussion_r647045642



##
File path: 
lucene/core/src/java/org/apache/lucene/codecs/perfield/PerFieldVectorFormat.java
##
@@ -114,14 +114,7 @@ public void close() throws IOException {
 }
 
 private VectorWriter getInstance(FieldInfo field) throws IOException {
-  VectorFormat format = null;

Review comment:
   I think the original logic was copied from `PerFieldDocValuesFormat`. My 
understanding is that it is *not* meant to be applied during merging (and 
`PerFieldPostingsFormat` doesn't even include it). I'd appreciate it if someone 
could double-check this.







[GitHub] [lucene] jtibshirani opened a new pull request #176: LUCENE-9905: Make sure to use configured vector format when merging

2021-06-07 Thread GitBox


jtibshirani opened a new pull request #176:
URL: https://github.com/apache/lucene/pull/176


   Before when creating a VectorWriter for merging, we would always load the
   default implementation. So if the format was configured with parameters, they
   were ignored.
   
   This issue was caught by `TestKnnGraph#testMergeProducesSameGraph`.





[GitHub] [lucene] gautamworah96 commented on a change in pull request #175: LUCENE-9990: gradle7 support

2021-06-07 Thread GitBox


gautamworah96 commented on a change in pull request #175:
URL: https://github.com/apache/lucene/pull/175#discussion_r647040440



##
File path: gradle/wrapper/gradle-wrapper.jar.version
##
@@ -1 +1 @@
-6.8.3
+7.0.2

Review comment:
   I checked https://services.gradle.org/distributions/ and the SHA256 of 
the gradle-7.0.2-wrapper.jar.sha256 has remained the same. It was the same for 
the 6.6.1 to 6.8.3 transition as well

##
File path: gradle/java/javac.gradle
##
@@ -16,6 +16,15 @@
  */
 
 // Configure Java project defaults.
+java {
+  toolchain {
+languageVersion = JavaLanguageVersion.of(11)
+  }
+}
+
+tasks.withType( JavaCompile ).configureEach {

Review comment:
   Sure. gradle/hacks makes sense. I'll also follow up on the ticket to see 
if there is a better way to solve this.







[GitHub] [lucene] rmuir commented on pull request #172: LUCENE-9992: write empty vector fields when merging

2021-06-07 Thread GitBox


rmuir commented on pull request #172:
URL: https://github.com/apache/lucene/pull/172#issuecomment-856360112


   I hit the issue locally just running `./gradlew test`. The second time, the 
tests passed.
   
   I think it isn't great that simple base-cases like empty are only exposed in 
random test failures (although: better than nothing!). So I am curious if there 
are any ideas on improving testing.
   
   I think there are a few challenges:
   * filenames/extensions used should be codec-private/impl, so we try to avoid 
hardcoding them
   * most codec tests are in base shared generic classes in test-framework: a 
good thing
   * new abstractions to the index (*Formats) are added rarely, so all the 
implicit/unwritten "rules" are forgotten by everyone.
   
   So today, nobody tests simple stuff like: should a filename exist or not? 
And it leaves us with too many unwritten assumptions about how Formats should 
behave in the corner cases.
   
   Maybe it should be as simple as various BaseXYZTestCase accepting a list of 
used file extensions as an abstract method or parameter? Then e.g. 
BaseDocValuesFormatTestCase could have some simple unit tests making asserts 
about e.g. what happens wrt filenames when there's no docvalues, or stuff gets 
merged away, or whatever the cases are? Maybe we should do it for every part of 
the index?
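The BaseXYZTestCase idea above can be sketched in stdlib-only Java. Everything here is hypothetical (Lucene's real base test cases live in test-framework and extend LuceneTestCase); the point is just the shape: the format-specific subclass declares its file extensions, and shared corner-case tests assert against that declaration:

```java
import java.util.Set;

// Hedged sketch of a base format test case that takes the codec's used file
// extensions as an abstract method, so shared tests can assert which files
// should (or should not) exist in corner cases like an empty field or a
// merged-away segment. All names are illustrative.
abstract class BaseFormatFilesTestCase {
  /** Extensions this format is allowed to write, e.g. ".vec", ".vem". */
  protected abstract Set<String> usedExtensions();

  /** Shared assertion: no segment file outside the declared extensions. */
  void assertOnlyDeclaredExtensions(Set<String> segmentFiles) {
    for (String file : segmentFiles) {
      int dot = file.lastIndexOf('.');
      String ext = dot < 0 ? "" : file.substring(dot);
      if (!usedExtensions().contains(ext)) {
        throw new AssertionError("unexpected file for this format: " + file);
      }
    }
  }

  public static void main(String[] args) {
    // A concrete "vectors format" test case simply declares its extensions.
    BaseFormatFilesTestCase vectorsCase = new BaseFormatFilesTestCase() {
      @Override protected Set<String> usedExtensions() {
        return Set.of(".vec", ".vem");
      }
    };
    vectorsCase.assertOnlyDeclaredExtensions(Set.of("_0.vec", "_0.vem"));
    try {
      vectorsCase.assertOnlyDeclaredExtensions(Set.of("_0.doc"));
      System.out.println("unexpected: stray file was not caught");
    } catch (AssertionError expected) {
      System.out.println("caught stray file as expected");
    }
  }
}
```

This keeps filenames codec-private (only the test subclass knows them) while still letting the shared base class exercise the empty/merged-away corner cases.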
   





[GitHub] [lucene] rmuir commented on pull request #174: LUCENE-8739: Add ZSTD support.

2021-06-07 Thread GitBox


rmuir commented on pull request #174:
URL: https://github.com/apache/lucene/pull/174#issuecomment-856311545


   What is the improvement? Especially interested in performance versus a 
"good" zlib: e.g. your comment here: 
https://issues.apache.org/jira/browse/LUCENE-8739?focusedCommentId=17322015=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17322015
   
   I'm just curious because it definitely adds complexity by adding a 
third-party dependency/native code to lucene-codecs. JNA is complicated (e.g. 
.so hell with various "system versions" of it, doesn't work out of box with 
non-executable /tmp, doesn't support all architectures, etc)
   
   If the improvement can be done in an easier way (e.g. instructions to luser 
on how to swap in better zlib), that seems like a better way to go?





[GitHub] [lucene] gsmiller commented on a change in pull request #159: LUCENE-9945: extend DrillSideways to expose FacetCollector and drill-down dimensions

2021-06-07 Thread GitBox


gsmiller commented on a change in pull request #159:
URL: https://github.com/apache/lucene/pull/159#discussion_r646946460



##
File path: lucene/facet/src/java/org/apache/lucene/facet/DrillSideways.java
##
@@ -390,18 +398,44 @@ protected boolean scoreSubDocsAtOnce() {
 return false;
   }
 
-  /** Result of a drill sideways search, including the {@link Facets} and 
{@link TopDocs}. */
+  /**
+   * Result of a drill sideways search, including the {@link Facets} and 
{@link TopDocs} {@link

Review comment:
   I'm finding this documentation a little difficult to parse when I read 
it. Maybe something like:
   
   ```
 /**
  * Result of a drill sideways search, including the {@link Facets} and 
{@link TopDocs}.
  * The {@link FacetsCollector}s for the drill down and drill sideways 
dimensions are also exposed
  * for advanced use-cases that need access to them as an alternative to 
accessing the
  * {@code Facets}.
  */
   ```

##
File path: lucene/facet/src/java/org/apache/lucene/facet/DrillSideways.java
##
@@ -277,7 +278,10 @@ public ScoreMode scoreMode() {
 drillDownCollector,
 drillSidewaysCollectors,
 drillDownDims.keySet().toArray(new String[0])),
-null);
+null,
+drillDownCollector,
+drillSidewaysCollectors,
+drillDownDims.keySet().toArray(new String[0]));

Review comment:
   It would be nice to store the result of 
`drillDownDims.keySet().toArray(new String[0]))` locally and then reference it 
in these two places instead of calling this `toArray` method twice.

##
File path: lucene/facet/src/java/org/apache/lucene/facet/DrillSideways.java
##
@@ -390,18 +398,44 @@ protected boolean scoreSubDocsAtOnce() {
 return false;
   }
 
-  /** Result of a drill sideways search, including the {@link Facets} and 
{@link TopDocs}. */
+  /**
+   * Result of a drill sideways search, including the {@link Facets} and 
{@link TopDocs} {@link
+   * FacetsCollector} for drill down and drill sideways passed to {@link 
#buildFacetsResult} to
+   * compute {@link Facets} are also exposed here as an alternative to 
subclassing DrillSideways to
+   * obtain them
+   */
   public static class DrillSidewaysResult {
 /** Combined drill down and sideways results. */
 public final Facets facets;
 
 /** Hits. */
 public final TopDocs hits;
 
+/** FacetsCollector for drill down. */
+public final FacetsCollector drillDowns;
+
+/**
+ * Facets Collector for drill sideways Array of FacetsCollector maps 
directly to Array of
+ * String[] drillSidewaysDims i.e. ith drillSidewaysDims String maps to 
ith drillSideways
+ * FacetsCollector
+ */
+public final FacetsCollector[] drillSideways;

Review comment:
   I'd propose a more descriptive name, like `drillSidewaysFacetsCollectors`

##
File path: lucene/facet/src/java/org/apache/lucene/facet/DrillSideways.java
##
@@ -545,7 +584,10 @@ private DrillDownQuery getDrillDownQuery(
 drillSidewaysCollectors,
 drillDownDims.keySet().toArray(new String[0])),
 null,
-collectorResult);
+collectorResult,
+drillDownCollector,
+drillSidewaysCollectors,
+drillDownDims.keySet().toArray(new String[0]));

Review comment:
   As with above, there's an opportunity to store the result of the 
`toArray()` call locally instead of calling it twice.

##
File path: lucene/facet/src/java/org/apache/lucene/facet/DrillSideways.java
##
@@ -390,18 +398,44 @@ protected boolean scoreSubDocsAtOnce() {
 return false;
   }
 
-  /** Result of a drill sideways search, including the {@link Facets} and 
{@link TopDocs}. */
+  /**
+   * Result of a drill sideways search, including the {@link Facets} and 
{@link TopDocs} {@link
+   * FacetsCollector} for drill down and drill sideways passed to {@link 
#buildFacetsResult} to
+   * compute {@link Facets} are also exposed here as an alternative to 
subclassing DrillSideways to
+   * obtain them
+   */
   public static class DrillSidewaysResult {
 /** Combined drill down and sideways results. */
 public final Facets facets;
 
 /** Hits. */
 public final TopDocs hits;
 
+/** FacetsCollector for drill down. */
+public final FacetsCollector drillDowns;
+
+/**
+ * Facets Collector for drill sideways Array of FacetsCollector maps 
directly to Array of
+ * String[] drillSidewaysDims i.e. ith drillSidewaysDims String maps to 
ith drillSideways
+ * FacetsCollector
+ */

Review comment:
   I think some users might have difficulty understanding this 
documentation. Maybe something like:
   
   ```suggestion
   /**
* FacetsCollectors populated for each drill sideways dimension. Each 
collector exposes
* the hits that match on all DrillDownQuery dimensions, but treating 
their corresponding
* sideways 

[GitHub] [lucene] msokolov commented on pull request #172: LUCENE-9992: write empty vector fields when merging

2021-06-07 Thread GitBox


msokolov commented on pull request #172:
URL: https://github.com/apache/lucene/pull/172#issuecomment-856259875


   Hmm, sorry for the force-pushing; I rebased locally and ... git is hard. I'm 
not sure what I was thinking before; I actually had a change that did not 
special-case the empty case, and yes, it just seems to work. Why I was seeing 
subs.size() being 0, I do not know, but I can no longer reproduce it.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] gsmiller opened a new pull request #2507: LUCENE-9946: Support multi-value fields in range facet counting

2021-06-07 Thread GitBox


gsmiller opened a new pull request #2507:
URL: https://github.com/apache/lucene-solr/pull/2507


   This is a backport from `lucene/main`.





[jira] [Resolved] (LUCENE-9905) Revise approach to specifying NN algorithm

2021-06-07 Thread Julie Tibshirani (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julie Tibshirani resolved LUCENE-9905.
--
Resolution: Fixed

> Revise approach to specifying NN algorithm
> --
>
> Key: LUCENE-9905
> URL: https://issues.apache.org/jira/browse/LUCENE-9905
> Project: Lucene - Core
>  Issue Type: Improvement
>Affects Versions: main (9.0)
>Reporter: Julie Tibshirani
>Priority: Blocker
>  Time Spent: 4h 50m
>  Remaining Estimate: 0h
>
> In LUCENE-9322 we decided that the new vectors API shouldn’t assume a 
> particular nearest-neighbor search data structure and algorithm. This 
> flexibility is important since NN search is a developing area and we'd like 
> to be able to experiment and evolve the algorithm. Right now we only have one 
> algorithm (HNSW), but we want to maintain the ability to use another.
> Currently the algorithm to use is specified through {{SearchStrategy}}, for 
> example {{SearchStrategy.EUCLIDEAN_HNSW}}. So a single format implementation 
> is expected to handle multiple algorithms. Instead we could have one format 
> implementation per algorithm. Our current implementation would be 
> HNSW-specific like {{HnswVectorFormat}}, and to experiment with another 
> algorithm you could create a new implementation like {{ClusterVectorFormat}}. 
> This would be better aligned with the codec framework, and help avoid 
> exposing algorithm details in the API.
> A concrete proposal (note many of these names will change when LUCENE-9855 is 
> addressed):
> # Rename {{Lucene90VectorFormat}} to {{Lucene90HnswVectorFormat}}. Also add 
> HNSW to name of {{Lucene90VectorWriter}} and {{Lucene90VectorReader}}.
> # Remove references to HNSW in {{SearchStrategy}}, so there is just 
> {{SearchStrategy.EUCLIDEAN}}, etc. Rename {{SearchStrategy}} to something 
> like {{SimilarityFunction}}.
> # Remove {{FieldType}} attributes related to HNSW parameters (maxConn and 
> beamWidth). Instead make these arguments to {{Lucene90HnswVectorFormat}}.
> # Introduce {{PerFieldVectorFormat}} to allow a different NN approach or 
> parameters to be configured per-field \(?\)
> One note: the current HNSW-based format includes logic for storing a numeric 
> vector per document, as well as constructing + storing a HNSW graph. When 
> adding another implementation, it’d be nice to be able to reuse logic for 
> reading/ writing numeric vectors. I don’t think we need to design for this 
> right now, but we can keep it in mind for the future?
> This issue is based on a thread [~jpountz] started: 
> [https://mail-archives.apache.org/mod_mbox/lucene-dev/202103.mbox/%3CCAPsWd%2BOuQv5y2Vw39%3DXdOuqXGtDbM4qXx5-pmYiB1X4jPEdiFQ%40mail.gmail.com%3E]
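The per-field idea in point 4 of the proposal can be sketched as a simple dispatcher: a wrapper that picks a concrete format per field name, with a default fallback. This is a hedged, standalone illustration only; PerFieldDispatch and the format names used here are hypothetical stand-ins, not Lucene API.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of per-field format dispatch (not Lucene code):
// the wrapper looks up the concrete format by field name and falls back
// to a default, which is the essence of a PerFieldVectorFormat.
public class PerFieldDispatch {
  private final Map<String, String> perField = new HashMap<>();
  private final String defaultFormat;

  public PerFieldDispatch(String defaultFormat) {
    this.defaultFormat = defaultFormat;
  }

  // Register a specific format for one field.
  public void put(String field, String format) {
    perField.put(field, format);
  }

  // Return the format that would handle this field.
  public String formatFor(String field) {
    return perField.getOrDefault(field, defaultFormat);
  }
}
```

A real implementation would delegate reads and writes to the selected format instance rather than returning a name, but the lookup-with-default shape is the same.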



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9905) Revise approach to specifying NN algorithm

2021-06-07 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17358832#comment-17358832
 ] 

ASF subversion and git services commented on LUCENE-9905:
-

Commit 05ae738fc90bbacae333b4818d7dc2880f27f3fd in lucene's branch 
refs/heads/main from Julie Tibshirani
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=05ae738 ]

LUCENE-9905: Move HNSW build parameters to codec (#166)

Previously, the max connections and beam width parameters could be configured as
field type attributes. This PR moves them to be parameters on
Lucene90HnswVectorFormat, to avoid exposing details of the vector format
implementation in the API.
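The shape of the change in the commit message can be illustrated with a small, hypothetical stand-in (this is not the actual Lucene90HnswVectorFormat class): the build parameters become constructor arguments validated once, instead of free-form string attributes on the field type.

```java
// Hypothetical sketch, not Lucene code: parameters live on the format
// instance and are validated at construction time.
public class HnswFormatConfig {
  final int maxConn;   // max connections per graph node
  final int beamWidth; // beam width used during graph construction

  public HnswFormatConfig(int maxConn, int beamWidth) {
    if (maxConn <= 0 || beamWidth <= 0) {
      throw new IllegalArgumentException("maxConn and beamWidth must be positive");
    }
    this.maxConn = maxConn;
    this.beamWidth = beamWidth;
  }
}
```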

> Revise approach to specifying NN algorithm
> --
>
> Key: LUCENE-9905
> URL: https://issues.apache.org/jira/browse/LUCENE-9905
> Project: Lucene - Core
>  Issue Type: Improvement
>Affects Versions: main (9.0)
>Reporter: Julie Tibshirani
>Priority: Blocker
>  Time Spent: 4h 40m
>  Remaining Estimate: 0h
>
> In LUCENE-9322 we decided that the new vectors API shouldn’t assume a 
> particular nearest-neighbor search data structure and algorithm. This 
> flexibility is important since NN search is a developing area and we'd like 
> to be able to experiment and evolve the algorithm. Right now we only have one 
> algorithm (HNSW), but we want to maintain the ability to use another.
> Currently the algorithm to use is specified through {{SearchStrategy}}, for 
> example {{SearchStrategy.EUCLIDEAN_HNSW}}. So a single format implementation 
> is expected to handle multiple algorithms. Instead we could have one format 
> implementation per algorithm. Our current implementation would be 
> HNSW-specific like {{HnswVectorFormat}}, and to experiment with another 
> algorithm you could create a new implementation like {{ClusterVectorFormat}}. 
> This would be better aligned with the codec framework, and help avoid 
> exposing algorithm details in the API.
> A concrete proposal (note many of these names will change when LUCENE-9855 is 
> addressed):
> # Rename {{Lucene90VectorFormat}} to {{Lucene90HnswVectorFormat}}. Also add 
> HNSW to name of {{Lucene90VectorWriter}} and {{Lucene90VectorReader}}.
> # Remove references to HNSW in {{SearchStrategy}}, so there is just 
> {{SearchStrategy.EUCLIDEAN}}, etc. Rename {{SearchStrategy}} to something 
> like {{SimilarityFunction}}.
> # Remove {{FieldType}} attributes related to HNSW parameters (maxConn and 
> beamWidth). Instead make these arguments to {{Lucene90HnswVectorFormat}}.
> # Introduce {{PerFieldVectorFormat}} to allow a different NN approach or 
> parameters to be configured per-field \(?\)
> One note: the current HNSW-based format includes logic for storing a numeric 
> vector per document, as well as constructing + storing a HNSW graph. When 
> adding another implementation, it’d be nice to be able to reuse logic for 
> reading/ writing numeric vectors. I don’t think we need to design for this 
> right now, but we can keep it in mind for the future?
> This issue is based on a thread [~jpountz] started: 
> [https://mail-archives.apache.org/mod_mbox/lucene-dev/202103.mbox/%3CCAPsWd%2BOuQv5y2Vw39%3DXdOuqXGtDbM4qXx5-pmYiB1X4jPEdiFQ%40mail.gmail.com%3E]






[GitHub] [lucene] jtibshirani merged pull request #166: LUCENE-9905: Move HNSW build parameters to codec

2021-06-07 Thread GitBox


jtibshirani merged pull request #166:
URL: https://github.com/apache/lucene/pull/166


   





[GitHub] [lucene-solr] gsmiller opened a new pull request #2506: LUCENE-9962, LUCENE-9944, LUCENE-9988: DrillSideways improvement backports

2021-06-07 Thread GitBox


gsmiller opened a new pull request #2506:
URL: https://github.com/apache/lucene-solr/pull/2506


   * LUCENE-9962: Allow DrillSideways sub-classes to provide their own "drill 
down" facet counting implementation (or null)
   * LUCENE-9944: Allow DrillSideways users to pass a CollectorManager without 
requiring an ExecutorService (and concurrent DrillSideways approach)
   * LUCENE-9988: Fix DrillSideways bug discovered in randomized testing
   
   Grouped these three changes together since 9944 relies on 9962, and 9988 
just fixes a bug introduced in 9944. All three of these are backports from 
`lucene/main`.





[jira] [Commented] (LUCENE-9983) Stop sorting determinize powersets unnecessarily

2021-06-07 Thread Haoyu Zhai (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17358789#comment-17358789
 ] 

Haoyu Zhai commented on LUCENE-9983:


[~broustant] in the adversarial test case, I added 3 static counters for 
measuring the average and max sizes we've seen in the set, and the result is 
we're seeing 1800+ states on average and 24000 states at most. I record the set 
size each time we call {{size()}} (basically each iteration) to calculate the 
average, so it might not be very accurate.

[~mikemccand] ah thanks my bad, I didn't realize {{determinize}} is called at 
construction time. I'll benchmark that.

> Stop sorting determinize powersets unnecessarily
> 
>
> Key: LUCENE-9983
> URL: https://issues.apache.org/jira/browse/LUCENE-9983
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Michael McCandless
>Priority: Major
>  Time Spent: 4h 20m
>  Remaining Estimate: 0h
>
> Spinoff from LUCENE-9981.
> Today, our {{Operations.determinize}} implementation builds powersets of all 
> subsets of NFA states that "belong" in the same determinized state, using 
> [this algorithm|https://en.wikipedia.org/wiki/Powerset_construction].
> To hold each powerset, we use a malleable {{SortedIntSet}} and periodically 
> freeze it to a {{FrozenIntSet}}, also sorted.  We pay a high price to keep 
> these growing maps of int key, int value sorted by key, e.g. upgrading to a 
> {{TreeMap}} once the map is large enough (> 30 entries).
> But I think sorting is entirely unnecessary here!  Really all we need is the 
> ability to add/delete keys from the map, and hashCode / equals (by key only – 
> ignoring value!), and to freeze the map (a small optimization that we could 
> skip initially).  We only use these maps to lookup in the (growing) 
> determinized automaton whether this powerset has already been seen.
> Maybe we could simply poach the {{IntIntScatterMap}} implementation from 
> [HPPC|https://github.com/carrotsearch/hppc]?  And then change its 
> {{hashCode}}/{{equals }}to only use keys (not values).
> This change should be a big speedup for the kinds of (admittedly adversarial) 
> regexps we saw on LUCENE-9981.
>  
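The keys-only equality idea above can be sketched in plain Java (a toy stand-in, not the proposed HPPC-based implementation): hashCode sums per-key hashes, so no sorted order needs to be maintained, and equals compares keys while ignoring values.

```java
// Non-Lucene sketch of an int-key/int-value holder whose hashCode/equals
// depend only on the keys and require no sorting. Assumes keys are
// distinct, as NFA state ids in a powerset are.
public class UnsortedIntKeySet {
  final int[] keys;
  final int[] values; // intentionally ignored by hashCode/equals

  public UnsortedIntKeySet(int[] keys, int[] values) {
    this.keys = keys.clone();
    this.values = values.clone();
  }

  // Cheap bit mixer so nearby keys spread across the hash space.
  private static int mix(int k) {
    k = (k ^ (k >>> 16)) * 0x85ebca6b;
    return k ^ (k >>> 13);
  }

  @Override
  public int hashCode() {
    int h = 0;
    for (int k : keys) h += mix(k); // addition keeps the hash order-independent
    return h;
  }

  private boolean containsKey(int k) {
    for (int x : keys) if (x == k) return true;
    return false;
  }

  @Override
  public boolean equals(Object o) {
    if (!(o instanceof UnsortedIntKeySet)) return false;
    UnsortedIntKeySet that = (UnsortedIntKeySet) o;
    if (that.keys.length != keys.length) return false;
    for (int k : keys) if (!that.containsKey(k)) return false;
    return true; // values deliberately not compared
  }
}
```

A scatter-map implementation would replace the linear `containsKey` scan with hashing, but the key point, equality and hashing that never look at the values or at element order, is the same.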






[GitHub] [lucene] msokolov commented on a change in pull request #172: LUCENE-9992: write empty vector fields when merging

2021-06-07 Thread GitBox


msokolov commented on a change in pull request #172:
URL: https://github.com/apache/lucene/pull/172#discussion_r646834733



##
File path: lucene/core/src/java/org/apache/lucene/codecs/VectorWriter.java
##
@@ -104,9 +104,12 @@ private void mergeVectors(FieldInfo mergeFieldInfo, final 
MergeState mergeState)
 }
   }
 }
-// Create a new VectorValues by iterating over the sub vectors, mapping 
the resulting
-// docids using docMaps in the mergeState.
-if (subs.size() > 0) {
+if (subs.size() == 0) {
+  // all segments being merged have no vectors
+  writeField(mergeFieldInfo, VectorValues.EMPTY);
+} else {
+  // Create a new VectorValues by iterating over the sub vectors, mapping 
the resulting
+  // docids using docMaps in the mergeState.
   writeField(mergeFieldInfo, new VectorValuesMerger(subs, mergeState));
 }

Review comment:
   Hmm, I observed subs.size() being 0 in this test case - that's exactly 
the problem. I'm not sure what the root cause is there, but perhaps if we made 
this more like doc values in some way we could fix it further up.







[GitHub] [lucene] balmukundblr commented on a change in pull request #132: Parallel processing

2021-06-07 Thread GitBox


balmukundblr commented on a change in pull request #132:
URL: https://github.com/apache/lucene/pull/132#discussion_r646815233



##
File path: 
lucene/benchmark/src/java/org/apache/lucene/benchmark/byTask/feeds/ReutersContentSource.java
##
@@ -100,21 +100,24 @@ public void close() throws IOException {
 
   @Override
   public DocData getNextDocData(DocData docData) throws NoMoreDataException, 
IOException {
-Path f = null;
-String name = null;
-synchronized (this) {
-  if (nextFile >= inputFiles.size()) {
-// exhausted files, start a new round, unless forever set to false.
-if (!forever) {
-  throw new NoMoreDataException();
-}
-nextFile = 0;
-iteration++;
-  }
-  f = inputFiles.get(nextFile++);
-  name = f.toRealPath() + "_" + iteration;
+if (docCountArrCreated == false) {
+  docCountArrInit();
 }
 
+//Extract ThreadIndex from unique ThreadName (at position 12), which is 
set with '"IndexThread-"+index', in TaskSequence.java's doParallelTasks()
+int threadIndex = 
Integer.parseInt(Thread.currentThread().getName().substring(12));
+assert (threadIndex >= 0 && threadIndex < docCountArr.length):"Please 
check threadIndex or docCountArr length";
+int stride = threadIndex + docCountArr[threadIndex] * docCountArr.length;
+int inFileSize = inputFiles.size();
+
+//Modulo operator covers all three possible scenarios, i.e. 1. inputFiles.size() < Num Of Threads 2. inputFiles.size() == Num Of Threads 3. inputFiles.size() > Num Of Threads
+int fileIndex = stride % inFileSize;

Review comment:
   Sorry Mike, I forgot to mention that I've tested with inFileSize == 0 
and it throws the expected exception.
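As a standalone illustration of the scheme discussed in this thread (assuming the `"IndexThread-" + index` naming convention set in doParallelTasks; StrideDemo is a hypothetical helper, not part of the patch): each thread parses its index from its name, then `stride % inputFiles.size()` gives it its own interleaved walk over the files regardless of whether there are fewer, equal, or more files than threads.

```java
// Hypothetical demo of the thread-name/stride scheme, not patch code.
public class StrideDemo {
  // Matches the substring(12) in the patch: "IndexThread-".length() == 12.
  static final String PREFIX = "IndexThread-";

  static int threadIndex(String threadName) {
    return Integer.parseInt(threadName.substring(PREFIX.length()));
  }

  // docCount = how many docs this thread has produced so far.
  static int fileIndex(int threadIdx, int docCount, int numThreads, int numFiles) {
    int stride = threadIdx + docCount * numThreads;
    return stride % numFiles; // wraps for all three scenarios in the comment
  }
}
```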







[GitHub] [lucene] sejal-pawar commented on a change in pull request #159: LUCENE-9945: extend DrillSideways to expose FacetCollector and drill-down dimensions

2021-06-07 Thread GitBox


sejal-pawar commented on a change in pull request #159:
URL: https://github.com/apache/lucene/pull/159#discussion_r646809569



##
File path: lucene/facet/src/java/org/apache/lucene/facet/DrillSideways.java
##
@@ -335,10 +342,22 @@ protected boolean scoreSubDocsAtOnce() {
 /** Hits. */
 public final TopDocs hits;
 
+/** Collector to compute Facets */
+FacetsCollector drillDowns;
+
+/** drill-down dimensions */
+Map drillDownDims;

Review comment:
   Pushed a revision where I am exposing FacetsCollector `drillDowns, 
FacetsCollector[] drillSideways, String[] drillSidewaysDims` from within 
`DrillSidewaysResult`. Let me know what you think. Thank you!







[jira] [Commented] (LUCENE-9993) IndexWriter Initialisation dependent on .fnm files

2021-06-07 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17358726#comment-17358726
 ] 

Adrien Grand commented on LUCENE-9993:
--

IndexWriter wants field infos in order to create new segments whose field 
numbers are consistent with the field numbers of existing segments, which 
in turn enables merging optimizations for stored fields and term vectors.

I'd need to check, but I don't think it is a strict requirement; Lucene has 
logic to de-optimize merges in case of mismatching field numbers, which is 
something that can happen due to {{IndexWriter#addIndexes(Directory[])}} anyway.

> IndexWriter Initialisation dependent on .fnm files
> --
>
> Key: LUCENE-9993
> URL: https://issues.apache.org/jira/browse/LUCENE-9993
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Mohit Godwani
>Priority: Trivial
>  Labels: IndexWriter
>
> I am working on creating an abstraction over Lucene wherein I have 2 places 
> where data is stored: local disk and remote cloud storage. In case the host 
> on which index is present gets terminated due to some issue, I want to be 
> able to replicate the index on another host.
> While trying to recreate index on another host, I start by downloading the 
> metadata files associated with the index (segment_N, .si files) and once done 
> with this, I try to initialise an IndexWriter object on top of the local 
> directory to which this has been downloaded from remote storage. This helps 
> me begin indexing the data (I don't have any updateDocs call and only addDocs 
> operation is used) without the need to download data for older segments.
> While doing so, I am seeing an error during the initialization of IndexWriter 
> itself as it tries to get the [field number 
> mappings|https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/index/IndexWriter.java#L1116]
>  for the previous segments before it can create the IndexWriter object.
> With the compound file system enabled, this requires downloading .cfs files 
> from the remote storage, which in turn increases the time required to 
> initialize the IndexWriter, and thus the time before the new host can accept 
> incoming requests increases, resulting in the application rejecting a large 
> number of customer requests.
> * Why do we need the fnm files from previous segments while creating the 
> IndexWriter?
> * Could you help with a workaround for this to prevent downloading the extra 
> files apart from the commit metadata?
>  






[jira] [Created] (LUCENE-9994) Can IndexingChain better protect against large documents?

2021-06-07 Thread Adrien Grand (Jira)
Adrien Grand created LUCENE-9994:


 Summary: Can IndexingChain better protect against large documents?
 Key: LUCENE-9994
 URL: https://issues.apache.org/jira/browse/LUCENE-9994
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Adrien Grand


It's easy for a single document to use several times the amount of RAM that is 
configured on IndexWriter by having many fields or many terms on a single 
field. Could we improve IndexingChain to reject such documents before they may 
cause an out-of-memory error? We could make such documents born deleted in the 
new segment like we already do when consuming a TokenStream raises an exception.
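One possible shape for such a guard, offered as a hypothetical sketch rather than Lucene's actual IndexingChain: accumulate an estimated byte cost as a document's fields and terms are consumed, and reject the document before it can exhaust the writer's RAM budget.

```java
// Hypothetical per-document RAM guard; names and thresholds are
// illustrative, not Lucene API.
public class DocBudget {
  private final long maxBytesPerDoc;
  private long used;

  public DocBudget(long maxBytesPerDoc) {
    this.maxBytesPerDoc = maxBytesPerDoc;
  }

  // Called as each field/term is consumed; throws before memory blows up,
  // at which point the document could be marked "born deleted".
  public void add(long estimatedBytes) {
    used += estimatedBytes;
    if (used > maxBytesPerDoc) {
      throw new IllegalArgumentException(
          "document exceeds per-doc RAM budget: " + used + " > " + maxBytesPerDoc);
    }
  }

  public void resetForNextDoc() {
    used = 0;
  }
}
```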






[jira] [Created] (LUCENE-9993) IndexWriter Initialisation dependent on .fnm files

2021-06-07 Thread Mohit Godwani (Jira)
Mohit Godwani created LUCENE-9993:
-

 Summary: IndexWriter Initialisation dependent on .fnm files
 Key: LUCENE-9993
 URL: https://issues.apache.org/jira/browse/LUCENE-9993
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/index
Reporter: Mohit Godwani


I am working on creating an abstraction over Lucene wherein I have 2 places 
where data is stored: local disk and remote cloud storage. In case the host on 
which index is present gets terminated due to some issue, I want to be able to 
replicate the index on another host.

While trying to recreate index on another host, I start by downloading the 
metadata files associated with the index (segment_N, .si files) and once done 
with this, I try to initialise an IndexWriter object on top of the local 
directory to which this has been downloaded from remote storage. This helps me 
begin indexing the data (I don't have any updateDocs call and only addDocs 
operation is used) without the need to download data for older segments.

While doing so, I am seeing an error during the initialization of IndexWriter 
itself as it tries to get the [field number 
mappings|https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/index/IndexWriter.java#L1116]
 for the previous segments before it can create the IndexWriter object.

With the compound file system enabled, this requires downloading .cfs files 
from the remote storage, which in turn increases the time required to initialize 
the IndexWriter, and thus the time before the new host can accept incoming 
requests increases, resulting in the application rejecting a large number of 
customer requests.

* Why do we need the fnm files from previous segments while creating the 
IndexWriter?

* Could you help with a workaround for this to prevent downloading the extra 
files apart from the commit metadata?

 






[jira] [Commented] (LUCENE-9983) Stop sorting determinize powersets unnecessarily

2021-06-07 Thread Michael McCandless (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17358650#comment-17358650
 ] 

Michael McCandless commented on LUCENE-9983:


[~zhai7631], also realize that the actual execution of the {{RegexpQuery}} 
won't change from the optimizations we are trying here, i.e. the resulting 
determinized automaton is the same.  So I think we really need a newish 
benchmark that just measures cost of parsing/determinizing the regexp, e.g. 
simply calling {{new RegexpQuery("xxx").}}

> Stop sorting determinize powersets unnecessarily
> 
>
> Key: LUCENE-9983
> URL: https://issues.apache.org/jira/browse/LUCENE-9983
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Michael McCandless
>Priority: Major
>  Time Spent: 4h 20m
>  Remaining Estimate: 0h
>
> Spinoff from LUCENE-9981.
> Today, our {{Operations.determinize}} implementation builds powersets of all 
> subsets of NFA states that "belong" in the same determinized state, using 
> [this algorithm|https://en.wikipedia.org/wiki/Powerset_construction].
> To hold each powerset, we use a malleable {{SortedIntSet}} and periodically 
> freeze it to a {{FrozenIntSet}}, also sorted.  We pay a high price to keep 
> these growing maps of int key, int value sorted by key, e.g. upgrading to a 
> {{TreeMap}} once the map is large enough (> 30 entries).
> But I think sorting is entirely unnecessary here!  Really all we need is the 
> ability to add/delete keys from the map, and hashCode / equals (by key only – 
> ignoring value!), and to freeze the map (a small optimization that we could 
> skip initially).  We only use these maps to lookup in the (growing) 
> determinized automaton whether this powerset has already been seen.
> Maybe we could simply poach the {{IntIntScatterMap}} implementation from 
> [HPPC|https://github.com/carrotsearch/hppc]?  And then change its 
> {{hashCode}}/{{equals }}to only use keys (not values).
> This change should be a big speedup for the kinds of (admittedly adversarial) 
> regexps we saw on LUCENE-9981.
>  






[jira] [Resolved] (LUCENE-8143) Remove SpanBoostQuery

2021-06-07 Thread Alan Woodward (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Woodward resolved LUCENE-8143.
---
Fix Version/s: main (9.0)
   Resolution: Fixed

> Remove SpanBoostQuery
> -
>
> Key: LUCENE-8143
> URL: https://issues.apache.org/jira/browse/LUCENE-8143
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Adrien Grand
>Assignee: Alan Woodward
>Priority: Minor
> Fix For: main (9.0)
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> I initially added it so that span queries could still be boosted, but this 
> was actually a mistake: boosts are ignored on inner span queries, only the 
> boost of the top-level span query, the one that performs scoring, is not 
> ignored.






[jira] [Commented] (LUCENE-8143) Remove SpanBoostQuery

2021-06-07 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17358648#comment-17358648
 ] 

ASF subversion and git services commented on LUCENE-8143:
-

Commit dbb4c265d5e0032a2832b7e5b67bfa8b560cbdcc in lucene's branch 
refs/heads/main from Alan Woodward
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=dbb4c26 ]

LUCENE-8143: Remove no-op SpanBoostQuery (#155)

Boosts are ignored on inner span queries, and top-level boosts can
be applied by using a normal BoostQuery, so SpanBoostQuery
itself is redundant and trappy. This commit removes it entirely.

> Remove SpanBoostQuery
> -
>
> Key: LUCENE-8143
> URL: https://issues.apache.org/jira/browse/LUCENE-8143
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Adrien Grand
>Assignee: Alan Woodward
>Priority: Minor
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> I initially added it so that span queries could still be boosted, but this 
> was actually a mistake: boosts are ignored on inner span queries, only the 
> boost of the top-level span query, the one that performs scoring, is not 
> ignored.






[GitHub] [lucene] romseygeek merged pull request #155: LUCENE-8143: Remove no-op SpanBoostQuery

2021-06-07 Thread GitBox


romseygeek merged pull request #155:
URL: https://github.com/apache/lucene/pull/155


   





[GitHub] [lucene] balmukundblr commented on a change in pull request #132: Parallel processing

2021-06-07 Thread GitBox


balmukundblr commented on a change in pull request #132:
URL: https://github.com/apache/lucene/pull/132#discussion_r646613630



##
File path: 
lucene/benchmark/src/java/org/apache/lucene/benchmark/byTask/tasks/TaskSequence.java
##
@@ -340,12 +340,17 @@ private int doParallelTasks() throws Exception {
 
 initTasksArray();
 ParallelTask t[] = runningParallelTasks = new ParallelTask[repetitions * 
tasks.size()];
+//Get number of parallel threads from algo file and set it to use in ReutersContentSource.java's docCountArrInit()
+this.getRunData().getConfig().setNumThreads(t.length);
 // prepare threads
 int index = 0;
 for (int k = 0; k < repetitions; k++) {
   for (int i = 0; i < tasksArray.length; i++) {
 final PerfTask task = tasksArray[i].clone();
-t[index++] = new ParallelTask(task);
+t[index] = new ParallelTask(task);
+//Setting unique ThreadName with index value which is used in ReutersContentSource.java's getNextDocData()

Review comment:
   Incorporated the required changes by adding it in the Constants.java 
file and referencing it from both places. 







[GitHub] [lucene] balmukundblr commented on a change in pull request #132: Parallel processing

2021-06-07 Thread GitBox


balmukundblr commented on a change in pull request #132:
URL: https://github.com/apache/lucene/pull/132#discussion_r646611128



##
File path: 
lucene/benchmark/src/java/org/apache/lucene/benchmark/byTask/tasks/TaskSequence.java
##
@@ -340,12 +340,17 @@ private int doParallelTasks() throws Exception {
 
 initTasksArray();
 ParallelTask t[] = runningParallelTasks = new ParallelTask[repetitions * 
tasks.size()];
+//Get number of parallel threads from algo file and set it to use in ReutersContentSource.java's docCountArrInit()
+this.getRunData().getConfig().setNumThreads(t.length);
 // prepare threads
 int index = 0;
 for (int k = 0; k < repetitions; k++) {
   for (int i = 0; i < tasksArray.length; i++) {
 final PerfTask task = tasksArray[i].clone();
-t[index++] = new ParallelTask(task);
+t[index] = new ParallelTask(task);
+//Setting unique ThreadName with index value which is used in ReutersContentSource.java's getNextDocData()
+t[index].setName("IndexThread-" + index);

Review comment:
   Thank you, Mike. I made the required changes.







[GitHub] [lucene] balmukundblr commented on a change in pull request #132: Parallel processing

2021-06-07 Thread GitBox


balmukundblr commented on a change in pull request #132:
URL: https://github.com/apache/lucene/pull/132#discussion_r646609916



##
File path: 
lucene/benchmark/src/java/org/apache/lucene/benchmark/byTask/feeds/ReutersContentSource.java
##
@@ -100,21 +100,24 @@ public void close() throws IOException {
 
   @Override
   public DocData getNextDocData(DocData docData) throws NoMoreDataException, 
IOException {
-Path f = null;
-String name = null;
-synchronized (this) {
-  if (nextFile >= inputFiles.size()) {
-// exhausted files, start a new round, unless forever set to false.
-if (!forever) {
-  throw new NoMoreDataException();
-}
-nextFile = 0;
-iteration++;
-  }
-  f = inputFiles.get(nextFile++);
-  name = f.toRealPath() + "_" + iteration;
+if (docCountArrCreated == false) {
+  docCountArrInit();
 }
 
+//Extract ThreadIndex from unique ThreadName (at position 12), which is 
set with '"IndexThread-"+index', in TaskSequence.java's doParallelTasks()
+int threadIndex = 
Integer.parseInt(Thread.currentThread().getName().substring(12));
+assert (threadIndex >= 0 && threadIndex < docCountArr.length):"Please 
check threadIndex or docCountArr length";
+int stride = threadIndex + docCountArr[threadIndex] * docCountArr.length;
+int inFileSize = inputFiles.size();
+
+//Modulo operator covers all three possible scenarios, i.e. 1. 
inputFiles.size() < num of threads, 2. inputFiles.size() == num of threads, 
3. inputFiles.size() > num of threads
+int fileIndex = stride % inFileSize;

Review comment:
   Mike, it's already handled in ReutersContentSource.java's setConfig(). 
Please find the relevant code snippet below:
   if (inputFiles.size() == 0) {
 throw new RuntimeException("No txt files in dataDir: 
"+dataDir.toAbsolutePath());
   }
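
As a side note, the thread-striding arithmetic from the diff above can be 
sketched in isolation. The class and method names below are hypothetical 
stand-ins for illustration, not the actual benchmark code:

```java
// Illustrative sketch of the striding scheme: each thread walks the
// input-file list in steps of numThreads and wraps around with the modulo
// operator, which covers fewer, equal, and more files than threads alike.
public class StrideDemo {
  static int nextFileIndex(int threadIndex, int docsServed, int numThreads, int numFiles) {
    int stride = threadIndex + docsServed * numThreads;
    return stride % numFiles; // wraps regardless of the files-vs-threads ratio
  }

  public static void main(String[] args) {
    // 4 threads over 3 files: thread 0 visits files 0, 1, 2, 0, ...
    System.out.println(nextFileIndex(0, 0, 4, 3)); // 0
    System.out.println(nextFileIndex(0, 1, 4, 3)); // 1
    System.out.println(nextFileIndex(3, 0, 4, 3)); // 0 (3 % 3)
  }
}
```

Each thread thus picks its next file from pure arithmetic, without contending 
on a shared counter, which appears to be the goal of the patch.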







[GitHub] [lucene] dweiss commented on a change in pull request #175: LUCENE-9990: gradle7 support

2021-06-07 Thread GitBox


dweiss commented on a change in pull request #175:
URL: https://github.com/apache/lucene/pull/175#discussion_r646602463



##
File path: gradle/wrapper/gradle-wrapper.jar.version
##
@@ -1 +1 @@
-6.8.3
+7.0.2

Review comment:
   You'll also need to update the hash of the wrapper?

##
File path: gradle/testing/alternative-jdk-support.gradle
##
@@ -18,60 +18,60 @@
 // This adds support for compiling and testing against a different Java 
runtime.
 // This is the only way to build against JVMs not yet supported by Gradle 
itself.
 
-JavaInstallationRegistry registry = 
extensions.getByType(JavaInstallationRegistry)
-
-JavaInstallation currentJvm = 
registry.installationForCurrentVirtualMachine.get()
-
-JavaInstallation altJvm = {
-  def runtimeJavaHome = propertyOrDefault("runtime.java.home", 
System.getenv('RUNTIME_JAVA_HOME'))
-  if (!runtimeJavaHome) {
-return currentJvm
-  } else {
-return registry.installationForDirectory(
-layout.projectDirectory.dir(runtimeJavaHome)).get()
-  }
-}()
-
-// Set up root project's property.
-rootProject.ext.runtimeJava = altJvm
-rootProject.ext.runtimeJavaVersion = altJvm.javaVersion
-
-if (!currentJvm.javaExecutable.equals(altJvm.javaExecutable)) {
-  // Set up java toolchain tasks to use the alternative Java.
-  // This is a related Gradle issue for the future:
-  // https://github.com/gradle/gradle/issues/1652
-
-  configure(rootProject) {
-task altJvmWarning() {
-  doFirst {
-logger.warn("""NOTE: Alternative java toolchain will be used for 
compilation and tests:
-  Project will use Java ${altJvm.javaVersion} from: 
${altJvm.installationDirectory}
-  Gradle runs with Java ${currentJvm.javaVersion} from: 
${currentJvm.installationDirectory}
-""")
-  }
-}
-  }
-
-  // Set up toolchain-dependent tasks to use the alternative JVM.
-  allprojects {
-// Any tests
-tasks.withType(Test) {
-  dependsOn ":altJvmWarning"
-  executable = altJvm.javaExecutable
-}
-
-// Any javac compilation tasks
-tasks.withType(JavaCompile) {
-  dependsOn ":altJvmWarning"
-  options.fork = true
-  options.forkOptions.javaHome = altJvm.installationDirectory.asFile
-}
-
-// Javadoc compilation.
-def javadocExecutable = altJvm.jdk.get().javadocExecutable.asFile
-tasks.matching { it.name == "renderJavadoc" || it.name == 
"renderSiteJavadoc" }.all {
-  dependsOn ":altJvmWarning"
-  executable = javadocExecutable
-}
-  }
-}
+//JavaInstallationRegistry registry = 
extensions.getByType(JavaInstallationRegistry)

Review comment:
   I understand it's a PoC, but a plan is needed for how to re-enable what 
worked previously - pointing at an arbitrary JVM home and having it be the 
default for compilation, tests, etc. In its current form the patch removes 
things that are currently used to separate the gradle vs. lucene JVM 
toolchains - we do need this separation (for example, to rule out early-access 
JVM problems stemming from gradle itself vs. lucene).

##
File path: gradle/java/javac.gradle
##
@@ -16,6 +16,15 @@
  */
 
 // Configure Java project defaults.
+java {
+  toolchain {
+languageVersion = JavaLanguageVersion.of(11)
+  }
+}
+
+tasks.withType( JavaCompile ).configureEach {

Review comment:
   Add the issue reference you mentioned here. The entire workaround would 
best be moved into a separate file under gradle/hacks.







[GitHub] [lucene] dweiss opened a new pull request #175: LUCENE-9990: gradle7 support

2021-06-07 Thread GitBox


dweiss opened a new pull request #175:
URL: https://github.com/apache/lucene/pull/175


   





[GitHub] [lucene] uschindler commented on a change in pull request #173: Initial rewrite of MMapDirectory for JDK-16 preview (incubating) Panama APIs (>= JDK-16-ea-b32)

2021-06-07 Thread GitBox


uschindler commented on a change in pull request #173:
URL: https://github.com/apache/lucene/pull/173#discussion_r646596993



##
File path: 
lucene/test-framework/src/java/org/apache/lucene/mockfile/FilterPath.java
##
@@ -261,10 +267,7 @@ public boolean equals(Object obj) {
* @return innermost Path instance
*/
   public static Path unwrap(Path path) {

Review comment:
   That's how it is used: 
https://github.com/apache/lucene/pull/173/files#diff-efeb36d227b58a0ad42ddf35d0e260ca8010021f8dbc78627333085defa16a13R251
   
   You pass in a `java.nio.file.Path` and it gets unwrapped.







[GitHub] [lucene] dweiss commented on a change in pull request #173: Initial rewrite of MMapDirectory for JDK-16 preview (incubating) Panama APIs (>= JDK-16-ea-b32)

2021-06-07 Thread GitBox


dweiss commented on a change in pull request #173:
URL: https://github.com/apache/lucene/pull/173#discussion_r646570141



##
File path: 
lucene/test-framework/src/java/org/apache/lucene/mockfile/FilterPath.java
##
@@ -261,10 +267,7 @@ public boolean equals(Object obj) {
* @return innermost Path instance
*/
   public static Path unwrap(Path path) {

Review comment:
   I thought you might want to pass arbitrary objects there... Not much can 
be done then, sure. 








[GitHub] [lucene] jpountz commented on a change in pull request #172: LUCENE-9992: write empty vector fields when merging

2021-06-07 Thread GitBox


jpountz commented on a change in pull request #172:
URL: https://github.com/apache/lucene/pull/172#discussion_r646567016



##
File path: lucene/core/src/java/org/apache/lucene/codecs/VectorWriter.java
##
@@ -104,9 +104,12 @@ private void mergeVectors(FieldInfo mergeFieldInfo, final 
MergeState mergeState)
 }
   }
 }
-// Create a new VectorValues by iterating over the sub vectors, mapping 
the resulting
-// docids using docMaps in the mergeState.
-if (subs.size() > 0) {
+if (subs.size() == 0) {
+  // all segments being merged have no vectors
+  writeField(mergeFieldInfo, VectorValues.EMPTY);
+} else {
+  // Create a new VectorValues by iterating over the sub vectors, mapping 
the resulting
+  // docids using docMaps in the mergeState.
   writeField(mergeFieldInfo, new VectorValuesMerger(subs, mergeState));
 }

Review comment:
   After looking more at the doc values code, I think that unconditionally 
writing the metadata should fix the test failure we are seeing? This is because 
seeing that vectors are enabled in field infos would imply that at least one of 
the to-be-merged segments has vectors, so `subs.size()` would always be at least 
1?
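
   Reduced to a standalone sketch, the branch under discussion looks like the 
following; the types and names here are illustrative stand-ins, not Lucene's 
actual codec API:

```java
import java.util.List;

public class MergeDemo {
  // Stand-in for the real VectorValues type.
  interface Values { int size(); }
  static final Values EMPTY = () -> 0;

  // Mirrors the fix in the quoted diff: when no sub-reader has vectors we
  // still write the field (as empty) rather than skipping it, so per-field
  // metadata always exists when field infos say vectors are enabled.
  static Values valuesToWrite(List<Values> subs) {
    if (subs.isEmpty()) {
      return EMPTY; // all segments being merged have no vectors
    }
    // The real code returns a merging view over the subs; summing sizes
    // stands in for that here.
    return () -> subs.stream().mapToInt(Values::size).sum();
  }

  public static void main(String[] args) {
    Values a = () -> 3, b = () -> 2;
    System.out.println(valuesToWrite(List.of()).size());     // 0
    System.out.println(valuesToWrite(List.of(a, b)).size()); // 5
  }
}
```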







[GitHub] [lucene] uschindler commented on pull request #173: Initial rewrite of MMapDirectory for JDK-16 preview (incubating) Panama APIs (>= JDK-16-ea-b32)

2021-06-07 Thread GitBox


uschindler commented on pull request #173:
URL: https://github.com/apache/lucene/pull/173#issuecomment-855903431


   On the Windows build I have seen this error on Jenkins:
   ```
 1> 2.857s   1.5s:idle P0 [  set repls] top: set replicasIDs=[1] 
tcpPorts=[55991]
 1> 2.888s   1.5s:idle P0 [   indexing] start handling indexing 
socket=Socket[addr=/127.0.0.1,port=55995,localport=55988]
 1> 3.099s   1.7s:idle P0 [  flush] now flush; 1 replicas
 1> 3.100s   1.7s:idle P0 [  flush] top: now flushAndRefresh
 1> 3.127s   1.7s:idle P0 [  flush] unexpected exception handling 
client connection; now failing test:
 1> 3.130s   1.5s: parent [  pump0] java.lang.NoClassDefFoundError: 
jdk/incubator/foreign/MemorySegment
 1> 3.131s   1.5s: parent [  pump0] at 
org.apache.lucene.store.MMapDirectory.map(MMapDirectory.java:252)
 1> 3.132s   1.5s: parent [  pump0] at 
org.apache.lucene.store.MMapDirectory.openInput(MMapDirectory.java:232)
 1> 3.133s   1.5s: parent [  pump0] at 
org.apache.lucene.util.LuceneTestCase.slowFileExists(LuceneTestCase.java:3053)
 1> 3.134s   1.5s: parent [  pump0] at 
org.apache.lucene.store.MockDirectoryWrapper.openInput(MockDirectoryWrapper.java:793)
 1> 3.134s   1.5s: parent [  pump0] at 
org.apache.lucene.store.FilterDirectory.openInput(FilterDirectory.java:101)
 1> 3.134s   1.5s: parent [  pump0] at 
org.apache.lucene.store.FilterDirectory.openInput(FilterDirectory.java:101)
 1> 3.134s   1.5s: parent [  pump0] at 
org.apache.lucene.store.Directory.openChecksumInput(Directory.java:152)
 1> 3.134s   1.5s: parent [  pump0] at 
org.apache.lucene.codecs.lucene90.compressing.FieldsIndexWriter.finish(FieldsIndexWriter.java:124)
 1> 3.135s   1.5s: parent [  pump0] at 
org.apache.lucene.codecs.lucene90.compressing.Lucene90CompressingStoredFieldsWriter.finish(Lucene90CompressingStoredFieldsWriter.java:491)
 1> 3.135s   1.5s: parent [  pump0] at 
org.apache.lucene.index.StoredFieldsConsumer.flush(StoredFieldsConsumer.java:82)
 1> 3.135s   1.5s: parent [  pump0] at 
org.apache.lucene.index.IndexingChain.flush(IndexingChain.java:270)
 1> 3.137s   1.5s: parent [  pump0] at 
org.apache.lucene.index.DocumentsWriterPerThread.flush(DocumentsWriterPerThread.java:386)
 1> 3.137s   1.5s: parent [  pump0] at 
org.apache.lucene.index.DocumentsWriter.doFlush(DocumentsWriter.java:497)
 1> 3.137s   1.5s: parent [  pump0] at 
org.apache.lucene.index.DocumentsWriter.flushAllThreads(DocumentsWriter.java:676)
 1> 3.138s   1.5s: parent [  pump0] at 
org.apache.lucene.index.IndexWriter.getReader(IndexWriter.java:568)
 1> 3.138s   1.5s: parent [  pump0] at 
org.apache.lucene.index.StandardDirectoryReader.doOpenFromWriter(StandardDirectoryReader.java:380)
 1> 3.138s   1.5s: parent [  pump0] at 
org.apache.lucene.index.StandardDirectoryReader.doOpenIfChanged(StandardDirectoryReader.java:354)
 1> 3.138s   1.5s: parent [  pump0] at 
org.apache.lucene.index.StandardDirectoryReader.doOpenIfChanged(StandardDirectoryReader.java:344)
 1> 3.138s   1.5s: parent [  pump0] at 
org.apache.lucene.index.DirectoryReader.openIfChanged(DirectoryReader.java:163)
 1> 3.139s   1.5s: parent [  pump0] at 
org.apache.lucene.search.SearcherManager.refreshIfNeeded(SearcherManager.java:144)
 1> 3.139s   1.5s: parent [  pump0] at 
org.apache.lucene.search.SearcherManager.refreshIfNeeded(SearcherManager.java:52)
 1> 3.139s   1.5s: parent [  pump0] at 
org.apache.lucene.search.ReferenceManager.doMaybeRefresh(ReferenceManager.java:167)
 1> 3.139s   1.5s: parent [  pump0] at 
org.apache.lucene.search.ReferenceManager.maybeRefreshBlocking(ReferenceManager.java:240)
 1> 3.139s   1.5s: parent [  pump0] at 
org.apache.lucene.replicator.nrt.PrimaryNode.flushAndRefresh(PrimaryNode.java:164)
 1> 3.140s   1.5s: parent [  pump0] at 
org.apache.lucene.replicator.nrt.SimplePrimaryNode.handleFlush(SimplePrimaryNode.java:345)
 1> 3.140s   1.5s: parent [  pump0] at 
org.apache.lucene.replicator.nrt.SimplePrimaryNode.handleOneConnection(SimplePrimaryNode.java:688)
 1> 3.140s   1.5s: parent [  pump0] at 
org.apache.lucene.replicator.nrt.TestSimpleServer$ClientHandler.run(TestSimpleServer.java:98)
 1> 3.140s   1.5s: parent [  pump0] Caused by: 
java.lang.ClassNotFoundException: jdk.incubator.foreign.MemorySegment
 1> 3.140s   1.5s: parent [  pump0] at 
java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:636)
 1> 

[jira] [Commented] (LUCENE-8739) ZSTD Compressor support in Lucene

2021-06-07 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17358574#comment-17358574
 ] 

Adrien Grand commented on LUCENE-8739:
--

I opened a PR that uses the exact same approach and block sizes as the default 
codec with DEFLATE, but uses ZSTD instead. It calls ZSTD through JNA, so 
libzstd needs to be installed locally.

> ZSTD Compressor support in Lucene
> -
>
> Key: LUCENE-8739
> URL: https://issues.apache.org/jira/browse/LUCENE-8739
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: core/codecs
>Reporter: Sean Torres
>Priority: Minor
>  Labels: features
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> ZStandard has a great speed and compression ratio tradeoff. 
> ZStandard is open source compression from Facebook.
> More about ZSTD
> [https://github.com/facebook/zstd]
> [https://code.facebook.com/posts/1658392934479273/smaller-and-faster-data-compression-with-zstandard/]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)




[GitHub] [lucene] jpountz opened a new pull request #174: LUCENE-8739: Add ZSTD support.

2021-06-07 Thread GitBox


jpountz opened a new pull request #174:
URL: https://github.com/apache/lucene/pull/174


   This is a new stored fields format that leverages ZSTD via JNA.
   
   JNA doesn't have a NOTICE file in its repository, so I copied the bottom 
of the README, which seems to cover what a NOTICE file usually does.





[GitHub] [lucene] uschindler commented on pull request #173: Initial rewrite of MMapDirectory for JDK-16 preview (incubating) Panama APIs (>= JDK-16-ea-b32)

2021-06-07 Thread GitBox


uschindler commented on pull request #173:
URL: https://github.com/apache/lucene/pull/173#issuecomment-855892370


   I invite everybody to attend this talk: 
https://2021.berlinbuzzwords.de/session/future-lucenes-mmapdirectory-why-use-it-and-whats-coming-java-16-and-later





[GitHub] [lucene] uschindler commented on a change in pull request #173: Initial rewrite of MMapDirectory for JDK-16 preview (incubating) Panama APIs (>= JDK-16-ea-b32)

2021-06-07 Thread GitBox


uschindler commented on a change in pull request #173:
URL: https://github.com/apache/lucene/pull/173#discussion_r646550354



##
File path: 
lucene/test-framework/src/java/org/apache/lucene/mockfile/FilterPath.java
##
@@ -261,10 +267,7 @@ public boolean equals(Object obj) {
* @return innermost Path instance
*/
   public static Path unwrap(Path path) {

Review comment:
   ...but won't work: the method here is obsolete, I just kept it in so as 
not to touch any tests.







[GitHub] [lucene] uschindler commented on a change in pull request #173: Initial rewrite of MMapDirectory for JDK-16 preview (incubating) Panama APIs (>= JDK-16-ea-b32)

2021-06-07 Thread GitBox


uschindler commented on a change in pull request #173:
URL: https://github.com/apache/lucene/pull/173#discussion_r646547557



##
File path: 
lucene/test-framework/src/java/org/apache/lucene/mockfile/FilterPath.java
##
@@ -261,10 +267,7 @@ public boolean equals(Object obj) {
* @return innermost Path instance
*/
   public static Path unwrap(Path path) {

Review comment:
   That's a fair point. :-)







[GitHub] [lucene] uschindler commented on a change in pull request #173: Initial rewrite of MMapDirectory for JDK-16 preview (incubating) Panama APIs (>= JDK-16-ea-b32)

2021-06-07 Thread GitBox


uschindler commented on a change in pull request #173:
URL: https://github.com/apache/lucene/pull/173#discussion_r646547379



##
File path: lucene/core/src/java/org/apache/lucene/util/Unwrapable.java
##
@@ -0,0 +1,41 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.util;
+
+/**
+ * An object with this interface is a wrapper around another object
+ * (e.g., a filter with a delegate). The method {@link #unwrap()} can
+ * be called to get the wrapped object
+ *
+ * @lucene.internal
+ */
+public interface Unwrapable<T> {
+
+  /** Unwraps this instance */
+  T unwrap();
+  
+  /**
+   * Unwraps all {@code Unwrapable}s around the given object.
+   */
+  @SuppressWarnings("unchecked")
+  public static <T> T unwrapAll(T o) {

Review comment:
   Thanks, this was a funny hack I applied after going crazy trying to find a 
generic solution. What do you think: maybe we should make a separate issue out 
of Unwrapable?
   
   The problem here, and why I did not add generics to the static method: not 
all implementations you pass in necessarily implement Unwrapable. 
java.nio.file.Path does not implement Unwrapable, and the idea of the static 
method is that you can pass in any Path and it will be unwrapped, if possible. 
Since it is a static method, you can use it with any Path! If the path is not 
unwrappable, it does nothing.
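
   A minimal standalone sketch of that contract (hypothetical names, not the 
actual Lucene classes): the static helper accepts any object and only peels 
layers while the current instance actually implements the wrapper interface.

```java
public class UnwrapDemo {
  interface Unwrapable<T> {
    T unwrap();
  }

  // Accepts any object; non-Unwrapable inputs pass through untouched,
  // matching the "any Path" behavior described above.
  @SuppressWarnings("unchecked")
  static <T> T unwrapAll(T o) {
    while (o instanceof Unwrapable) {
      o = ((Unwrapable<T>) o).unwrap();
    }
    return o;
  }

  // A toy wrapper used to exercise the helper.
  static final class Box implements Unwrapable<Object> {
    private final Object inner;
    Box(Object inner) { this.inner = inner; }
    public Object unwrap() { return inner; }
  }

  public static void main(String[] args) {
    Object wrapped = new Box(new Box("payload"));
    System.out.println(unwrapAll(wrapped)); // payload
    System.out.println(unwrapAll("plain")); // plain (no-op, not Unwrapable)
  }
}
```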







[GitHub] [lucene] dweiss commented on a change in pull request #173: Initial rewrite of MMapDirectory for JDK-16 preview (incubating) Panama APIs (>= JDK-16-ea-b32)

2021-06-07 Thread GitBox


dweiss commented on a change in pull request #173:
URL: https://github.com/apache/lucene/pull/173#discussion_r646540797



##
File path: lucene/core/src/java/org/apache/lucene/util/Unwrapable.java
##
@@ -0,0 +1,41 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.util;
+
+/**
+ * An object with this interface is a wrapper around another object
+ * (e.g., a filter with a delegate). The method {@link #unwrap()} can
+ * be called to get the wrapped object
+ *
+ * @lucene.internal
+ */
+public interface Unwrapable<T> {
+
+  /** Unwraps this instance */
+  T unwrap();
+  
+  /**
+   * Unwraps all {@code Unwrapable}s around the given object.
+   */
+  @SuppressWarnings("unchecked")
+  public static <T> T unwrapAll(T o) {

Review comment:
   Make the generic definition `<T extends Unwrapable<T>>` so that no class 
checks are needed and the compiler can verify that a given type implements 
Unwrapable?

##
File path: 
lucene/test-framework/src/java/org/apache/lucene/mockfile/FilterPath.java
##
@@ -261,10 +267,7 @@ public boolean equals(Object obj) {
* @return innermost Path instance
*/
   public static Path unwrap(Path path) {

Review comment:
   Maybe it should just be a default impl. on the interface itself... Then 
no need for multiple declarations and you'd know the type right away (vide my 
previous comment).







[GitHub] [lucene] msokolov commented on a change in pull request #172: LUCENE-9992: write empty vector fields when merging

2021-06-07 Thread GitBox


msokolov commented on a change in pull request #172:
URL: https://github.com/apache/lucene/pull/172#discussion_r646532102



##
File path: lucene/core/src/java/org/apache/lucene/codecs/VectorWriter.java
##
@@ -104,9 +104,12 @@ private void mergeVectors(FieldInfo mergeFieldInfo, final 
MergeState mergeState)
 }
   }
 }
-// Create a new VectorValues by iterating over the sub vectors, mapping 
the resulting
-// docids using docMaps in the mergeState.
-if (subs.size() > 0) {
+if (subs.size() == 0) {
+  // all segments being merged have no vectors
+  writeField(mergeFieldInfo, VectorValues.EMPTY);
+} else {
+  // Create a new VectorValues by iterating over the sub vectors, mapping 
the resulting
+  // docids using docMaps in the mergeState.
   writeField(mergeFieldInfo, new VectorValuesMerger(subs, mergeState));
 }

Review comment:
   Oh about your first point, that was bothering me too. Perhaps it can 
never happen (hidden as you say by this earlier bug, and now by this fix), but 
I agree, we should write the metadata unconditionally there.







[GitHub] [lucene-solr] uschindler commented on pull request #2176: Initial rewrite of MMapDirectory for JDK-16 preview (incubating) Panama APIs (>= JDK-16-ea-b32)

2021-06-07 Thread GitBox


uschindler commented on pull request #2176:
URL: https://github.com/apache/lucene-solr/pull/2176#issuecomment-855874294


   I will close this and would like to move discussion over to 
https://github.com/apache/lucene/pull/173





[GitHub] [lucene-solr] uschindler closed pull request #2176: Initial rewrite of MMapDirectory for JDK-16 preview (incubating) Panama APIs (>= JDK-16-ea-b32)

2021-06-07 Thread GitBox


uschindler closed pull request #2176:
URL: https://github.com/apache/lucene-solr/pull/2176


   





[GitHub] [lucene] msokolov commented on a change in pull request #172: LUCENE-9992: write empty vector fields when merging

2021-06-07 Thread GitBox


msokolov commented on a change in pull request #172:
URL: https://github.com/apache/lucene/pull/172#discussion_r646529427



##
File path: lucene/core/src/java/org/apache/lucene/codecs/VectorWriter.java
##
@@ -104,9 +104,12 @@ private void mergeVectors(FieldInfo mergeFieldInfo, final 
MergeState mergeState)
 }
   }
 }
-// Create a new VectorValues by iterating over the sub vectors, mapping 
the resulting
-// docids using docMaps in the mergeState.
-if (subs.size() > 0) {
+if (subs.size() == 0) {
+  // all segments being merged have no vectors
+  writeField(mergeFieldInfo, VectorValues.EMPTY);
+} else {
+  // Create a new VectorValues by iterating over the sub vectors, mapping 
the resulting
+  // docids using docMaps in the mergeState.
   writeField(mergeFieldInfo, new VectorValuesMerger(subs, mergeState));
 }

Review comment:
   Hm, no, it throws an NPE in DocIDMerger.next(); if you create a 
DocIDMerger with an empty List and then call next(), you get an NPE.







[GitHub] [lucene] uschindler commented on pull request #173: Initial rewrite of MMapDirectory for JDK-16 preview (incubating) Panama APIs (>= JDK-16-ea-b32)

2021-06-07 Thread GitBox


uschindler commented on pull request #173:
URL: https://github.com/apache/lucene/pull/173#issuecomment-855872206


   I moved this old pull request from 
https://github.com/apache/lucene-solr/pull/2176 to the Lucene repository:
   - Removed the changes in Solr
   - Updated to Little Endian (see #107, LUCENE-9047) 
   
   All tests still pass, the new policeman Jenkins job is: 
https://jenkins.thetaphi.de/view/Lucene/job/Lucene-jdk16panama-Linux/ (Linux), 
https://jenkins.thetaphi.de/view/Lucene/job/Lucene-jdk16panama-Windows/ 
(Windows)





[GitHub] [lucene] uschindler opened a new pull request #173: Initial rewrite of MMapDirectory for JDK-16 preview (incubating) Panama APIs (>= JDK-16-ea-b32)

2021-06-07 Thread GitBox


uschindler opened a new pull request #173:
URL: https://github.com/apache/lucene/pull/173


   **INFO: This is a clone/update of 
https://github.com/apache/lucene-solr/pull/2176** _(for more detailed 
discussion see this old PR from the Lucene/Solr combined repository)_
   
   This is just a draft PR for a first insight on memory mapping improvements 
in JDK 16+.
   
   Some background information: Starting with JDK 14, there is a new incubating 
module "jdk.incubator.foreign" with a new, not yet stable API for accessing 
off-heap memory (later it will also support calling functions located in 
libraries such as .so or .dll files via classical MethodHandles). This 
incubator module has gone through several versions:
   - first version: https://openjdk.java.net/jeps/370 (slow, very buggy, and 
thread-confined, making it unusable with Lucene)
   - second version: https://openjdk.java.net/jeps/383 (still thread-confined, 
but now allows transfer of "ownership" to other threads; still impossible to 
use with Lucene)
   - third version, in JDK 16: https://openjdk.java.net/jeps/393 (adds "Support 
for shared segments"). This finally allows us to safely use the same mmapped 
memory from different threads and also to unmap it!
   
   This module overcomes several long-standing problems:
   - The ByteBuffer API is limited to 32-bit addressing (so MMapDirectory has 
to map files in 1 GiB chunks)
   - There is no official way to unmap ByteBuffers when the file is no longer 
used. There is a way to use `sun.misc.Unsafe` and forcefully unmap segments, 
but any IndexInput accessing the file from another thread will then crash the 
JVM with SIGSEGV or SIGBUS. We learned to live with that and happily apply the 
unsafe unmapping, but that's the main issue.
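   The 1 GiB chunking mentioned above can be sketched as follows; the constant 
names and the power-of-two decomposition are assumptions for illustration, not 
MMapDirectory's exact code. A 64-bit file position is split into a chunk index 
and a 32-bit offset within that chunk, because ByteBuffer positions are ints.

```java
// Hypothetical sketch of splitting a long file position into
// (chunk index, intra-chunk offset) for 1 GiB ByteBuffer chunks.
public class ChunkMath {
  static final int CHUNK_SIZE_POWER = 30; // 1 GiB = 2^30 bytes
  static final long CHUNK_MASK = (1L << CHUNK_SIZE_POWER) - 1;

  // which 1 GiB chunk the position falls into
  static int chunkIndex(long pos) { return (int) (pos >>> CHUNK_SIZE_POWER); }

  // offset inside that chunk; always fits in an int
  static int chunkOffset(long pos) { return (int) (pos & CHUNK_MASK); }

  public static void main(String[] args) {
    long pos = 3L * (1L << 30) + 17; // a position past the 2 GiB int limit
    System.out.println(chunkIndex(pos) + " " + chunkOffset(pos)); // 3 17
  }
}
```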
   
   @uschindler had many discussions with the team at OpenJDK, and finally, with 
the third incubator, we have an API that works with Lucene. These were very 
fruitful discussions (thanks to @mcimadamore !)
   
   With the third incubator we are now finally able to run some tests 
(especially performance tests). As this is an incubating module, this PR first 
changes the build system a bit:
   - disable `-Werror` for `:lucene:core`
   - add the incubating module to the compiler configuration of `:lucene:core` 
and enable it for all test builds. This is important, as you also have to pass 
`--add-modules jdk.incubator.foreign` at runtime!
   
   The code basically just modifies `MMapDirectory` to use a long instead of an 
int for the chunk size parameter. In addition it adds `MemorySegmentIndexInput`, 
a copy of our `ByteBufferIndexInput` (still there, but unused) that uses 
MemorySegment instead of ByteBuffer behind the scenes. It works in exactly the 
same way; only the try/catch blocks for supporting EOFException or moving to 
another segment were rewritten.
   
   The openInput code uses `MemorySegment.mapFile()` to get a memory mapping. 
This method is unfortunately a bit buggy in JDK-16-ea-b30, so I added some 
workarounds. See these JDK issues: https://bugs.openjdk.java.net/browse/JDK-8259027, 
https://bugs.openjdk.java.net/browse/JDK-8259028, 
https://bugs.openjdk.java.net/browse/JDK-8259032, and 
https://bugs.openjdk.java.net/browse/JDK-8259034. The bugs with alignment and 
zero-byte mmaps are fixed in b32, and this PR was adapted accordingly (hacks 
removed).
   
   It passes all tests, and it looks like you can use it to read indexes. The 
default chunk size is now 16 GiB (but you can raise or lower it as you like; 
the tests do this). Of course you can set it to Long.MAX_VALUE; in that case 
every index file is always mapped into one big memory mapping. My testing on 
Windows 10 has shown that this is *not a good idea!!!* Huge mappings fragment 
the address space over time, and as we can only use about 43 or 46 bits 
(depending on the OS), the fragmentation will at some point kill you. So 16 GiB 
looks like a good compromise: most files will be smaller than 6 GiB anyway 
(unless you optimize your index down to one huge segment). So for most Lucene 
installations the number of mapped segments will equal the number of open 
files, and heavy Elasticsearch users will be very happy. The sysctl 
max_map_count may no longer need to be touched.
   
   In addition, this implements `readLELongs` in a better way than @jpountz did 
(no caching of arbitrary objects). Nevertheless, as the new MemorySegment API 
relies on final, unmodifiable classes, copying memory from a MemorySegment to 
an on-heap Java array requires wrapping each such array in a MemorySegment 
every time (e.g. in `readBytes()` or `readLELongs`), so there may be some 
overhead due to short-lived object allocations (those are NOT reusable!!!). _In 
short: in the future we should stop copying/loading our data onto the heap, 
maybe throw away IndexInput completely, and base our code fully on random 
access. The new foreign-vector APIs will in the future also be written with 
MemorySegment as their focus. So you can allocate a vector 
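   The general idea behind `readLELongs` (bulk-decoding little-endian longs 
into an on-heap array in one call instead of a per-value loop) can be sketched 
with the stable ByteBuffer API; this is an illustration of the concept, not 
the PR's MemorySegment implementation.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Sketch: decode little-endian longs from raw bytes into a long[] in bulk.
public class ReadLELongs {
  static void readLELongs(byte[] src, long[] dst) {
    // one bulk transfer; the LongBuffer view applies the little-endian order
    ByteBuffer.wrap(src).order(ByteOrder.LITTLE_ENDIAN).asLongBuffer().get(dst);
  }

  public static void main(String[] args) {
    byte[] src = new byte[16];
    src[0] = 1; // little-endian encoding of 1L
    src[8] = 2; // little-endian encoding of 2L
    long[] dst = new long[2];
    readLELongs(src, dst);
    System.out.println(dst[0] + " " + dst[1]); // 1 2
  }
}
```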

[jira] [Commented] (LUCENE-9983) Stop sorting determinize powersets unnecessarily

2021-06-07 Thread Bruno Roustant (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17358418#comment-17358418
 ] 

Bruno Roustant commented on LUCENE-9983:


[~zhai7631] do you have some stats about the numbers of NFA / DFA states 
manipulated?

> Stop sorting determinize powersets unnecessarily
> 
>
> Key: LUCENE-9983
> URL: https://issues.apache.org/jira/browse/LUCENE-9983
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Michael McCandless
>Priority: Major
>  Time Spent: 4h 20m
>  Remaining Estimate: 0h
>
> Spinoff from LUCENE-9981.
> Today, our {{Operations.determinize}} implementation builds powersets of all 
> subsets of NFA states that "belong" in the same determinized state, using 
> [this algorithm|https://en.wikipedia.org/wiki/Powerset_construction].
> To hold each powerset, we use a malleable {{SortedIntSet}} and periodically 
> freeze it to a {{FrozenIntSet}}, also sorted.  We pay a high price to keep 
> these growing maps of int key, int value sorted by key, e.g. upgrading to a 
> {{TreeMap}} once the map is large enough (> 30 entries).
> But I think sorting is entirely unnecessary here!  Really all we need is the 
> ability to add/delete keys from the map, and hashCode / equals (by key only – 
> ignoring value!), and to freeze the map (a small optimization that we could 
> skip initially).  We only use these maps to lookup in the (growing) 
> determinized automaton whether this powerset has already been seen.
> Maybe we could simply poach the {{IntIntScatterMap}} implementation from 
> [HPPC|https://github.com/carrotsearch/hppc]?  And then change its 
> {{hashCode}}/{{equals}} to only use keys (not values).
> This change should be a big speedup for the kinds of (admittedly adversarial) 
> regexps we saw on LUCENE-9981.
>  
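The key-only hashCode/equals idea from the description could be sketched like 
this (a hypothetical illustration, not HPPC's or Lucene's actual code): two 
powersets holding the same NFA states compare equal regardless of per-state 
values or insertion order, so they dedupe to the same determinized state.

```java
import java.util.HashMap;

// Hypothetical int->int map whose identity is its key set only.
class KeyOnlyIntIntMap {
  private final HashMap<Integer, Integer> map = new HashMap<>();

  void put(int k, int v) { map.put(k, v); }

  @Override public int hashCode() {
    int h = 0;
    for (int k : map.keySet()) h += Integer.hashCode(k); // order-independent
    return h;
  }

  @Override public boolean equals(Object o) {
    return o instanceof KeyOnlyIntIntMap
        && ((KeyOnlyIntIntMap) o).map.keySet().equals(map.keySet());
  }
}

public class PowersetDemo {
  public static void main(String[] args) {
    KeyOnlyIntIntMap a = new KeyOnlyIntIntMap();
    a.put(3, 10);
    a.put(7, 20);
    KeyOnlyIntIntMap b = new KeyOnlyIntIntMap();
    b.put(7, 99); // same keys as a, different values and insertion order
    b.put(3, 0);
    System.out.println(a.equals(b) && a.hashCode() == b.hashCode()); // true
  }
}
```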



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9976) WANDScorer assertion error in ensureConsistent

2021-06-07 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17358414#comment-17358414
 ] 

Adrien Grand commented on LUCENE-9976:
--

You are right [~zacharymorn]! Sorry for not seeing this issue earlier, I was on 
vacation for the last two weeks.

It's a bit worrying that this bug only got caught by TestExpressionSorts; I 
wonder why the test cases we have in TestWANDScorer didn't catch it.

> WANDScorer assertion error in ensureConsistent
> --
>
> Key: LUCENE-9976
> URL: https://issues.apache.org/jira/browse/LUCENE-9976
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Dawid Weiss
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Build fails and is reproducible:
> https://ci-builds.apache.org/job/Lucene/job/Lucene-NightlyTests-main/283/console
> {code}
> ./gradlew test --tests TestExpressionSorts.testQueries 
> -Dtests.seed=FF571CE915A0955 -Dtests.multiplier=2 -Dtests.nightly=true 
> -Dtests.slow=true -Dtests.asserts=true -p lucene/expressions/
> {code}






[GitHub] [lucene] jpountz commented on a change in pull request #172: LUCENE-9992: write empty vector fields when merging

2021-06-07 Thread GitBox


jpountz commented on a change in pull request #172:
URL: https://github.com/apache/lucene/pull/172#discussion_r646330897



##
File path: lucene/core/src/java/org/apache/lucene/codecs/VectorWriter.java
##
@@ -104,9 +104,12 @@ private void mergeVectors(FieldInfo mergeFieldInfo, final 
MergeState mergeState)
 }
   }
 }
-// Create a new VectorValues by iterating over the sub vectors, mapping 
the resulting
-// docids using docMaps in the mergeState.
-if (subs.size() > 0) {
+if (subs.size() == 0) {
+  // all segments being merged have no vectors
+  writeField(mergeFieldInfo, VectorValues.EMPTY);
+} else {
+  // Create a new VectorValues by iterating over the sub vectors, mapping 
the resulting
+  // docids using docMaps in the mergeState.
   writeField(mergeFieldInfo, new VectorValuesMerger(subs, mergeState));
 }

Review comment:
   Would it work if we removed the condition and called 
`writeField(mergeFieldInfo, new VectorValuesMerger(subs, mergeState));` when 
subs is empty too?



