Re: [I] [Suggestion] Short circuit check for queued flushes in preUpdate() when checkPendingFlushOnUpdate is disabled [lucene]

2024-02-05 Thread via GitHub
CaptainDredge commented on issue #13079: URL: https://github.com/apache/lucene/issues/13079#issuecomment-1928949795 cc: @mgodwan, @backslasht -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the s

[I] [Suggestion] Short circuit check for queued flushes in preUpdate() when checkPendingFlushOnUpdate is disabled [lucene]

2024-02-05 Thread via GitHub
CaptainDredge opened a new issue, #13079: URL: https://github.com/apache/lucene/issues/13079 ### Description `numQueuedFlushes()` is a blocking function which gets called in `DocumentWriter#preUpdate()` to check if there are any queued flushes. If `checkPendingFlushOnUpdate` is disab
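The short circuit suggested in the issue title can be sketched roughly as follows. This is a minimal illustration, not Lucene's code: the class `ShortCircuitSketch`, its fields, and the method `shouldHelpFlush()` are hypothetical stand-ins, and only the names `numQueuedFlushes()` and `checkPendingFlushOnUpdate` come from the issue text. The idea is to test the cheap boolean flag first so that the blocking call is skipped when the flag is disabled.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for the writer state; not the real Lucene class.
final class ShortCircuitSketch {
  private final boolean checkPendingFlushOnUpdate;
  private final Object lock = new Object();
  private final int queuedFlushes;

  // Counts how often the (simulated) blocking call was actually made.
  final AtomicInteger blockingCalls = new AtomicInteger();

  ShortCircuitSketch(boolean checkPendingFlushOnUpdate, int queuedFlushes) {
    this.checkPendingFlushOnUpdate = checkPendingFlushOnUpdate;
    this.queuedFlushes = queuedFlushes;
  }

  // Simulates a blocking lookup by taking a lock before reading the count.
  int numQueuedFlushes() {
    blockingCalls.incrementAndGet();
    synchronized (lock) {
      return queuedFlushes;
    }
  }

  // Flag first: && short-circuits, so the lock is never taken when the flag is off.
  boolean shouldHelpFlush() {
    return checkPendingFlushOnUpdate && numQueuedFlushes() > 0;
  }
}
```

With the flag disabled, `shouldHelpFlush()` returns `false` without ever calling `numQueuedFlushes()`. Whether the actual check in Lucene has this shape is not visible in the truncated snippet above.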

Re: [I] No static method version() crash in android project [lucene]

2024-02-05 Thread via GitHub
bjhexn commented on issue #13078: URL: https://github.com/apache/lucene/issues/13078#issuecomment-1928929465 Android gradle compileOptions { sourceCompatibility = JavaVersion.VERSION_17 targetCompatibility = JavaVersion.VERSION_17 } kotlinOptions {

[I] No static method version() crash in android project [lucene]

2024-02-05 Thread via GitHub
bjhexn opened a new issue, #13078: URL: https://github.com/apache/lucene/issues/13078 ### Description private fun createIndex(): IndexWriter { val path = Path(getDir("lunece", 0).path) val fsDirectory = FSDirectory.open(path) val analyzer = Standar

Re: [PR] Add getter for SynonymQuery#field [lucene]

2024-02-05 Thread via GitHub
AndreyBozhko commented on PR #13077: URL: https://github.com/apache/lucene/pull/13077#issuecomment-1928773466 Thanks for the review @dungba88 - I added the javadoc as well (tried to match the style of other javadocs in the file).

Re: [PR] Make FSTCompiler.compile() to only return the FSTMetadata [lucene]

2024-02-05 Thread via GitHub
dungba88 commented on PR #12831: URL: https://github.com/apache/lucene/pull/12831#issuecomment-1928655191 Thank you for merging @mikemccand !

Re: [PR] Backport SOLR-14765 to branch_8_11 [lucene-solr]

2024-02-05 Thread via GitHub
risdenk commented on PR #2682: URL: https://github.com/apache/lucene-solr/pull/2682#issuecomment-1928651515 Sorry for delayed response @HoustonPutman no need to wait for me

[PR] Add getter for SynonymQuery#field [lucene]

2024-02-05 Thread via GitHub
AndreyBozhko opened a new pull request, #13077: URL: https://github.com/apache/lucene/pull/13077 ### Description Since all the query terms must have the same field, the value is exposed anyway via ```java synonymQuery.getTerms().get(0).field() ``` but it's cleaner if o

Re: [PR] Adding binary Hamming distance as similarity option for byte vectors [lucene]

2024-02-05 Thread via GitHub
uschindler commented on PR #13076: URL: https://github.com/apache/lucene/pull/13076#issuecomment-1928027541 This is my final code: ```java @Override public int binaryHammingDistance(byte[] a, byte[] b) { int distance = 0, i = 0; for (final int upperBound = a.length

Re: [PR] Adding binary Hamming distance as similarity option for byte vectors [lucene]

2024-02-05 Thread via GitHub
uschindler commented on PR #13076: URL: https://github.com/apache/lucene/pull/13076#issuecomment-1927995019 I figured that the `& 0x` is useless. You only need it when widening into int. Will update my branch and paste code here.
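The point about widening can be illustrated with a small sketch. It shows a general Java fact rather than the PR's code: XOR-ing two `byte` values promotes both to `int` with sign extension, so when exactly one operand is negative the upper 24 bits of the result are set and a mask such as `& 0xFF` is needed before counting bits. The class and method names below are hypothetical.

```java
// Hypothetical helper names; demonstrates sign extension when bytes widen to int.
final class ByteMaskSketch {
  // Counts bits of the widened XOR as-is; the sign-extension bits are included.
  static int bitsWithoutMask(byte a, byte b) {
    return Integer.bitCount(a ^ b);
  }

  // Masks the widened XOR back to its low 8 bits before counting.
  static int bitsWithMask(byte a, byte b) {
    return Integer.bitCount((a ^ b) & 0xFF);
  }
}
```

When both bytes have the same sign, the two methods agree, because the sign-extension bits cancel out in the XOR. They differ only when the signs differ.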

Re: [PR] Adding binary Hamming distance as similarity option for byte vectors [lucene]

2024-02-05 Thread via GitHub
rmuir commented on PR #13076: URL: https://github.com/apache/lucene/pull/13076#issuecomment-1927946636 Thanks @uschindler , this is the way to go: compiler does a good job. java already has all the necessary logic here to autovectorize and use e.g. `vpopcntdq` or AVX2 lookup-table counting

Re: [PR] LUCENE-10393: Unify binary dictionary and dictionary writer in kuromoji and nori [lucene]

2024-02-05 Thread via GitHub
uschindler commented on PR #740: URL: https://github.com/apache/lucene/pull/740#issuecomment-1927857893 Yes this was intentional. It breaks API.

Re: [PR] Adding binary Hamming distance as similarity option for byte vectors [lucene]

2024-02-05 Thread via GitHub
uschindler commented on PR #13076: URL: https://github.com/apache/lucene/pull/13076#issuecomment-1927824571 Here's my branch: https://github.com/apache/lucene/compare/main...uschindler:lucene:binary_hamming_distance I can merge this into this branch, but the code cleanup and removal o

Re: [PR] Adding binary Hamming distance as similarity option for byte vectors [lucene]

2024-02-05 Thread via GitHub
uschindler commented on PR #13076: URL: https://github.com/apache/lucene/pull/13076#issuecomment-1927790053 I removed the integer tail and have seen no difference (I also looked especially at the non-aligned sizes): ```java @Override public int binaryHammingDistance(byte[] a, b

Re: [PR] LUCENE-10393: Unify binary dictionary and dictionary writer in kuromoji and nori [lucene]

2024-02-05 Thread via GitHub
mikemccand commented on PR #740: URL: https://github.com/apache/lucene/pull/740#issuecomment-1927759464 @mocobeta just checking: it looks like this was never backported to 9.x (I hit unexpected merge conflicts while backporting an FST change) -- was that intentional? Were there API breaks

Re: [PR] Adding binary Hamming distance as similarity option for byte vectors [lucene]

2024-02-05 Thread via GitHub
rmuir commented on PR #13076: URL: https://github.com/apache/lucene/pull/13076#issuecomment-1927739974 Seems to autovectorize just fine, i took uwe's branch and dumped assembly on my AVX2 machine and see e.g. 256-bit xor and population count logic. I checked the logic in openjdk and it will

Re: [PR] Throw CorruptSegmentInfoException on encountering missing segment info (_N.si) file in CheckIndex [lucene]

2024-02-05 Thread via GitHub
mikemccand commented on PR #12872: URL: https://github.com/apache/lucene/pull/12872#issuecomment-1927677782 Thanks @gokaai -- I'll try to review soon! If possible please try not to force-push: it removes the history of the past commits and makes it harder to see what changed on this i

Re: [PR] Adding binary Hamming distance as similarity option for byte vectors [lucene]

2024-02-05 Thread via GitHub
uschindler commented on PR #13076: URL: https://github.com/apache/lucene/pull/13076#issuecomment-1927678421 I am not sure if we really need the Integer tail. Maybe only implement the Long variant and the tail.

Re: [PR] Throw CorruptSegmentInfoException on encountering missing segment info (_N.si) file in CheckIndex [lucene]

2024-02-05 Thread via GitHub
mikemccand commented on code in PR #12872: URL: https://github.com/apache/lucene/pull/12872#discussion_r1478678480 ## lucene/core/src/java/org/apache/lucene/index/SegmentInfos.java: ## @@ -389,13 +386,25 @@ private static void parseSegmentInfos( } long totalDocs = 0;

Re: [I] java.lang.AssertionError in backward compat tests ("failed to parse ... as date") [lucene]

2024-02-05 Thread via GitHub
dweiss closed issue #13073: java.lang.AssertionError in backward compat tests ("failed to parse ... as date") URL: https://github.com/apache/lucene/issues/13073

Re: [PR] Make date parsing more flexible for linedocsfile (europarl, enwiki) [lucene]

2024-02-05 Thread via GitHub
dweiss merged PR #13075: URL: https://github.com/apache/lucene/pull/13075

Re: [PR] Make date parsing more flexible for linedocsfile (europarl, enwiki) [lucene]

2024-02-05 Thread via GitHub
dweiss commented on PR #13075: URL: https://github.com/apache/lucene/pull/13075#issuecomment-1927650998 I'll merge this in so that we can avoid jenkins failures. If there has to be a follow-up, I'll open another issue.

Re: [PR] Adding binary Hamming distance as similarity option for byte vectors [lucene]

2024-02-05 Thread via GitHub
uschindler commented on PR #13076: URL: https://github.com/apache/lucene/pull/13076#issuecomment-1927626207 Hi, I modified the scalar variant like that: ```java @Override public int binaryHammingDistance(byte[] a, byte[] b) { int distance = 0, i = 0; for (; i < a

Re: [PR] Adding binary Hamming distance as similarity option for byte vectors [lucene]

2024-02-05 Thread via GitHub
uschindler commented on PR #13076: URL: https://github.com/apache/lucene/pull/13076#issuecomment-1927511826 The native order PR was merged.

Re: [PR] LUCENE-10572: Add support for varhandles in native byte order (still randomized during tests) [lucene]

2024-02-05 Thread via GitHub
uschindler merged PR #888: URL: https://github.com/apache/lucene/pull/888

Re: [PR] LUCENE-10641: IndexSearcher#setTimeout should also abort query rewrites, point ranges and vector searches [lucene]

2024-02-05 Thread via GitHub
mikemccand commented on code in PR #12345: URL: https://github.com/apache/lucene/pull/12345#discussion_r1478503121 ## lucene/core/src/java/org/apache/lucene/index/ExitableIndexReader.java: ## @@ -0,0 +1,539 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or

Re: [PR] Forbidden Thread.sleep API [lucene]

2024-02-05 Thread via GitHub
uschindler commented on PR #13001: URL: https://github.com/apache/lucene/pull/13001#issuecomment-1927410583 I moved the changes entry to Lucene 10.0, as to me it makes no sense to apply this to Lucene 9.x which is not used for active development. New code enters main first and I had trouble

Re: [PR] Forbidden Thread.sleep API [lucene]

2024-02-05 Thread via GitHub
uschindler merged PR #13001: URL: https://github.com/apache/lucene/pull/13001

Re: [I] Can we ban `Thread.sleep`? [lucene]

2024-02-05 Thread via GitHub
uschindler closed issue #12946: Can we ban `Thread.sleep`? URL: https://github.com/apache/lucene/issues/12946

Re: [PR] Forbidden Thread.sleep API [lucene]

2024-02-05 Thread via GitHub
shubhamvishu commented on PR #13001: URL: https://github.com/apache/lucene/pull/13001#issuecomment-1927343001 @uschindler I have removed the added file. Thanks!

Re: [PR] LUCENE-10572: Add support for varhandles in native byte order (still randomized during tests) [lucene]

2024-02-05 Thread via GitHub
uschindler commented on PR #888: URL: https://github.com/apache/lucene/pull/888#issuecomment-1927340132 I forgot about this PR, we should really apply it. #13076 is another candidate that could make use of this.

Re: [PR] Adding binary Hamming distance as similarity option for byte vectors [lucene]

2024-02-05 Thread via GitHub
uschindler commented on PR #13076: URL: https://github.com/apache/lucene/pull/13076#issuecomment-1927329824 Hi, I don't want to discuss the sense/nonsense of this distance, but the implementation could be made very simple and then we may not even need to have a Panama Vector variant

Re: [PR] Make FSTCompiler.compile() to only return the FSTMetadata [lucene]

2024-02-05 Thread via GitHub
mikemccand merged PR #12831: URL: https://github.com/apache/lucene/pull/12831

Re: [PR] Make FSTCompiler.compile() to only return the FSTMetadata [lucene]

2024-02-05 Thread via GitHub
mikemccand commented on PR #12831: URL: https://github.com/apache/lucene/pull/12831#issuecomment-1927303409 This is technically an API break, but `FSTCompiler` is an experimental API and effectively an internal Lucene datastructure, so I think we can safely backport to 9.x without deprecati

Re: [PR] Bump release to Java 21 [lucene]

2024-02-05 Thread via GitHub
rmuir commented on PR #12753: URL: https://github.com/apache/lucene/pull/12753#issuecomment-1927294356 @mikemccand I think it is heading in the right direction. There are a few more tasks to do here I think though, e.g. we still need to update `releaseWizard.py` and `smokeTestRelease.py`.

Re: [PR] Adding binary Hamming distance as similarity option for byte vectors [lucene]

2024-02-05 Thread via GitHub
rmuir commented on PR #13076: URL: https://github.com/apache/lucene/pull/13076#issuecomment-1927228480 even if it doesn't autovectorize, i suspect just gathering e.g. 4/8 bytes at a time with BitUtil varhandle and using single int/long xor + popcount would perform very well as a baseline.
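The baseline described in this comment (read 8 bytes at a time, XOR, then popcount) might look roughly like the sketch below. It is an illustration under stated assumptions, not the code from the PR: it uses the JDK's `MethodHandles.byteArrayViewVarHandle` directly instead of Lucene's `BitUtil`, and the class name `HammingSketch` is hypothetical. The byte order passed to the view does not affect the result, because the popcount of an XOR is the same for any order as long as both arrays are read the same way.

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.nio.ByteOrder;

// Hypothetical class name; a scalar baseline, not the PR's implementation.
final class HammingSketch {
  // Views a byte[] as longs so that 8 bytes are read per access.
  private static final VarHandle LONG_VIEW =
      MethodHandles.byteArrayViewVarHandle(long[].class, ByteOrder.nativeOrder());

  static int binaryHammingDistance(byte[] a, byte[] b) {
    if (a.length != b.length) {
      throw new IllegalArgumentException("vector dimensions differ: " + a.length + " != " + b.length);
    }
    int distance = 0;
    int i = 0;
    // Main loop: XOR 8 bytes at a time and count the differing bits.
    for (int upperBound = a.length & -Long.BYTES; i < upperBound; i += Long.BYTES) {
      long x = (long) LONG_VIEW.get(a, i) ^ (long) LONG_VIEW.get(b, i);
      distance += Long.bitCount(x);
    }
    // Tail: remaining 0..7 bytes; mask because the bytes are widened to int.
    for (; i < a.length; i++) {
      distance += Integer.bitCount((a[i] ^ b[i]) & 0xFF);
    }
    return distance;
  }
}
```

A loop of this shape leaves vectorization to the JIT compiler. Whether it is autovectorized on a given machine depends on the JVM and CPU, so the sketch makes no performance claim.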

Re: [PR] Adding binary Hamming distance as similarity option for byte vectors [lucene]

2024-02-05 Thread via GitHub
rmuir commented on PR #13076: URL: https://github.com/apache/lucene/pull/13076#issuecomment-1927213508 I'm confused about the use of lookup table. naively, i'd try to just xor + popcnt: https://docs.oracle.com/en/java/javase/21/docs/api/jdk.incubator.vector/jdk/incubator/vector/Vecto

Re: [PR] Compute multiple float aggregations in one go [lucene]

2024-02-05 Thread via GitHub
mikemccand commented on code in PR #12547: URL: https://github.com/apache/lucene/pull/12547#discussion_r1478386039 ## lucene/facet/src/java/org/apache/lucene/facet/taxonomy/FloatTaxonomyFacets.java: ## @@ -37,33 +37,43 @@ abstract class FloatTaxonomyFacets extends TaxonomyFacets

Re: [PR] Adding binary Hamming distance as similarity option for byte vectors [lucene]

2024-02-05 Thread via GitHub
rmuir commented on code in PR #13076: URL: https://github.com/apache/lucene/pull/13076#discussion_r1478382295 ## lucene/core/src/java20/org/apache/lucene/internal/vectorization/PanamaVectorUtilSupport.java: ## @@ -576,4 +578,114 @@ private int squareDistanceBody128(byte[] a, byt

Re: [PR] Bump release to Java 21 [lucene]

2024-02-05 Thread via GitHub
mikemccand commented on PR #12753: URL: https://github.com/apache/lucene/pull/12753#issuecomment-1927152977 Is this ready to go? Thank you for all the hard work here @ChrisHegarty and @rmuir!

Re: [PR] Adding binary Hamming distance as similarity option for byte vectors [lucene]

2024-02-05 Thread via GitHub
benwtrent commented on PR #13076: URL: https://github.com/apache/lucene/pull/13076#issuecomment-1927115056 @pmpailis could you also push a `CHANGES.txt` update? It would be under `New Features` for `Lucene 9.10.0`

Re: [PR] Adding binary Hamming distance as similarity option for byte vectors [lucene]

2024-02-05 Thread via GitHub
benwtrent commented on code in PR #13076: URL: https://github.com/apache/lucene/pull/13076#discussion_r1478274441 ## lucene/core/src/java/org/apache/lucene/util/VectorUtil.java: ## @@ -214,4 +214,11 @@ public static float[] checkFinite(float[] v) { } return v; } + +

Re: [PR] Enable parent field in sorted bwc tests [lucene]

2024-02-05 Thread via GitHub
jpountz commented on code in PR #13067: URL: https://github.com/apache/lucene/pull/13067#discussion_r1478307290 ## lucene/backward-codecs/src/test/org/apache/lucene/backward_index/TestIndexSortBackwardsCompatibility.java: ## @@ -82,14 +84,20 @@ public void testSortedIndexAddDocB

Re: [PR] Make date parsing more flexible for linedocsfile (europarl, enwiki) [lucene]

2024-02-05 Thread via GitHub
dweiss commented on PR #13075: URL: https://github.com/apache/lucene/pull/13075#issuecomment-1927005366 I've moved that utility function to LineFileDocs and added a basic test case to TestLineFileDocs (good idea). I feel tempted to normalize the date field's value in LineFileDocs but this m

Re: [PR] Speedup concurrent multi-segment HNWS graph search 2 [lucene]

2024-02-05 Thread via GitHub
benwtrent commented on code in PR #12962: URL: https://github.com/apache/lucene/pull/12962#discussion_r1478196742 ## lucene/join/src/java/org/apache/lucene/search/join/DiversifyingChildrenByteKnnVectorQuery.java: ## @@ -24,15 +24,8 @@ import org.apache.lucene.index.LeafReaderCo

Re: [PR] Make date parsing more flexible for linedocsfile (europarl, enwiki) [lucene]

2024-02-05 Thread via GitHub
dweiss commented on PR #13075: URL: https://github.com/apache/lucene/pull/13075#issuecomment-1926946487 > P.S.: Maybe add a quick test for the parsing logic to check that both formats are accepted. Maybe it'd be better to move this logic into LineFileDocs so that the date field's val

Re: [PR] clean up smoketester GPG leaks [lucene]

2024-02-05 Thread via GitHub
hurutoriya commented on PR #12882: URL: https://github.com/apache/lucene/pull/12882#issuecomment-1926941192 @janhoy Thank you for the suggestion. I've never done the QA. > If not, we need to QA that the script is not broken by this change. OK, let me try the QA 🙏 . How should I do

[PR] Adding binary Hamming distance as similarity option for byte vectors [lucene]

2024-02-05 Thread via GitHub
pmpailis opened a new pull request, #13076: URL: https://github.com/apache/lucene/pull/13076 This PR adds support for binary Hamming distance as a similarity metric for byte vectors. The drive behind this is that there is an increasing interest in applying hashing techniques for embeddi

Re: [PR] Make date parsing more flexible for linedocsfile (europarl, enwiki) [lucene]

2024-02-05 Thread via GitHub
dweiss commented on PR #13075: URL: https://github.com/apache/lucene/pull/13075#issuecomment-1926938848 > Looks fine. Why did you create a `Function` for parsing instead of a simple static method? I think you did this to hide the formatter instances? Yes, correct.

Re: [PR] Make date parsing more flexible for linedocsfile (europarl, enwiki) [lucene]

2024-02-05 Thread via GitHub
uschindler commented on PR #13075: URL: https://github.com/apache/lucene/pull/13075#issuecomment-1926758404 P.S.: Maybe add a quick test for the parsing logic to check that both formats are accepted.

Re: [PR] Make date parsing more flexible for linedocsfile (europarl, enwiki) [lucene]

2024-02-05 Thread via GitHub
uschindler commented on code in PR #13075: URL: https://github.com/apache/lucene/pull/13075#discussion_r1478028235 ## lucene/backward-codecs/src/test/org/apache/lucene/backward_index/TestIndexSortBackwardsCompatibility.java: ## @@ -147,6 +150,36 @@ public void testSortedIndex()

Re: [I] java.lang.AssertionError in backward compat tests ("failed to parse ... as date") [lucene]

2024-02-05 Thread via GitHub
dweiss commented on issue #13073: URL: https://github.com/apache/lucene/issues/13073#issuecomment-1926679038 There was actually just a single assertion that caused the problems. I've just removed it since it seems to be duplicated anyway with a term (body:the) that exists in both data sets.

Re: [I] java.lang.AssertionError in backward compat tests ("failed to parse ... as date") [lucene]

2024-02-05 Thread via GitHub
dweiss commented on issue #13073: URL: https://github.com/apache/lucene/issues/13073#issuecomment-1926670843 @s1monw - some of those assertions added in #13046 only hold for the built-in europarl. I fixed date parsing but I'm not sure how to deal with the problem that the line resource can

Re: [PR] Make date parsing more flexible for linedocsfile (europarl, enwiki) [lucene]

2024-02-05 Thread via GitHub
dweiss commented on PR #13075: URL: https://github.com/apache/lucene/pull/13075#issuecomment-1926668465 This makes date parsing accept both europarl and enwiki. The tests still assume the docs come from europarl though and fail on the large enwiki.random.lines.txt: ``` gradlew -p luce

Re: [PR] [Fix] Binary search the entries when all suffixes have the same length in a leaf block. [lucene]

2024-02-05 Thread via GitHub
vsop-479 commented on PR #11888: URL: https://github.com/apache/lucene/pull/11888#issuecomment-1926589435 @jpountz Can we push on this change by checking whether our test case has covered all the status, that `TermsEnum.seekExact` or `TermsEnum.seekCeil` may emit?

Re: [I] java.lang.AssertionError in backward compat tests ("failed to parse ... as date") [lucene]

2024-02-05 Thread via GitHub
dweiss commented on issue #13073: URL: https://github.com/apache/lucene/issues/13073#issuecomment-1926587718 The embedded "linedocsfile" (europarl) has a different date field format compared to the "large" enwiki used on jenkins. ``` (1) 2004-03-30 Istituzioni europee proteggerl

[I] java.lang.AssertionError in backward compat tests ("failed to parse ... as date") [lucene]

2024-02-05 Thread via GitHub
dweiss opened a new issue, #13073: URL: https://github.com/apache/lucene/issues/13073 ### Description These failures do reproduce but you need the linedocsfile. I'll take a look. ### Version and environment details _No response_