Re: [PR] Refactor Faiss-based vector format for easier backport [lucene]

2025-07-10 Thread via GitHub
kaivalnp commented on PR #14934: URL: https://github.com/apache/lucene/pull/14934#issuecomment-3060808264 Also ran benchmarks to ensure these changes don't adversely affect performance. `main`: ``` recall latency(ms) netCPU avgCpuCount nDoc topK fanout maxConn beamWi

Re: [PR] Refactor Faiss-based vector format for easier backport [lucene]

2025-07-10 Thread via GitHub
github-actions[bot] commented on PR #14934: URL: https://github.com/apache/lucene/pull/14934#issuecomment-3060801954 This PR does not have an entry in lucene/CHANGES.txt. Consider adding one. If the PR doesn't need a changelog entry, then add the skip-changelog label to it and you will stop

Re: [PR] Backport Faiss-based vector format to 10.x [lucene]

2025-07-10 Thread via GitHub
kaivalnp commented on code in PR #14843: URL: https://github.com/apache/lucene/pull/14843#discussion_r2199713156 ## lucene/sandbox/src/java21/org/apache/lucene/sandbox/codecs/faiss/LibFaissC.java: ## @@ -0,0 +1,636 @@ +/* + * Licensed to the Apache Software Foundation (ASF) unde

Re: [PR] Backport Faiss-based vector format to 10.x [lucene]

2025-07-10 Thread via GitHub
kaivalnp commented on code in PR #14843: URL: https://github.com/apache/lucene/pull/14843#discussion_r2199711854 ## lucene/sandbox/src/generated/jdk/jdk21.apijar: ## Review Comment: > only the LibFaissC should access native APIs and add abstractions for all code to get rid

Re: [PR] Backport Faiss-based vector format to 10.x [lucene]

2025-07-10 Thread via GitHub
kaivalnp commented on code in PR #14843: URL: https://github.com/apache/lucene/pull/14843#discussion_r2199711381 ## gradle/generation/extract-jdk-apis.gradle: ## @@ -17,7 +17,10 @@ def resources = scriptResources(buildscript) -configure(project(":lucene:core")) { +configure

Re: [PR] Backport Faiss-based vector format to 10.x [lucene]

2025-07-10 Thread via GitHub
kaivalnp commented on PR #14843: URL: https://github.com/apache/lucene/pull/14843#issuecomment-3060745112 Thanks a lot for the review @uschindler, it was super helpful! I've taken an initial pass at refactoring some classes in `main` to make this backport easier like you mentioned (#14934

Re: [PR] Refactor Faiss-based vector format for easier backport [lucene]

2025-07-10 Thread via GitHub
github-actions[bot] commented on PR #14934: URL: https://github.com/apache/lucene/pull/14934#issuecomment-3060678945 This PR does not have an entry in lucene/CHANGES.txt. Consider adding one. If the PR doesn't need a changelog entry, then add the skip-changelog label to it and you will stop

[PR] Refactor Faiss-based vector format for easier backport [lucene]

2025-07-10 Thread via GitHub
kaivalnp opened a new pull request, #14934: URL: https://github.com/apache/lucene/pull/14934 ### Description Refactor classes of the Faiss-based vector format to simplify backport to 10.x - Extract minimal functionality required for the format into a new `FaissLibrary` interface

Re: [I] Stop duplicating per-segment work across segment partitions [lucene]

2025-07-10 Thread via GitHub
expani commented on issue #13745: URL: https://github.com/apache/lucene/issues/13745#issuecomment-3060066462 Will go over all types of queries (other than PointRangeQuery) to check which need special handling by sharing the docId space, unless someone has already covered it. -- This is

Re: [PR] Introduce Impacts.forEach [lucene]

2025-07-10 Thread via GitHub
HUSTERGS commented on PR #14931: URL: https://github.com/apache/lucene/pull/14931#issuecomment-3059927176 Thanks for your explanation! I got your point, let's close this PR for now : ) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to

Re: [PR] Introduce Impacts.forEach [lucene]

2025-07-10 Thread via GitHub
HUSTERGS closed pull request #14931: Introduce Impacts.forEach URL: https://github.com/apache/lucene/pull/14931

Re: [PR] Fix off-heap byte vector scoring at query time [lucene]

2025-07-10 Thread via GitHub
msokolov commented on PR #14874: URL: https://github.com/apache/lucene/pull/14874#issuecomment-3059892629 I'm not really comfortable pushing the PR as it is given that it makes searching slower in the benchmark where we reindex first, and I think we should understand the hotspot hack a litt

Re: [PR] Introduce Impacts.forEach [lucene]

2025-07-10 Thread via GitHub
jpountz commented on PR #14931: URL: https://github.com/apache/lucene/pull/14931#issuecomment-3059281073 Thanks for identifying this room for improvement. I'm a bit hesitant about the extra complexity since `Term` and `OrHighRare` are among the fastest queries already. Maybe something to ke

Re: [PR] Fix off-heap byte vector scoring at query time [lucene]

2025-07-10 Thread via GitHub
vigyasharma commented on PR #14874: URL: https://github.com/apache/lucene/pull/14874#issuecomment-3059270247 > This seems to enable hotspot to separately optimize these two code paths. Ah okay! That makes sense. > There is yet another mystery here, which is: why, after adding t

Re: [I] Should we add a new "BulkScoreVectors" type api to the vector scorer interfaces? [lucene]

2025-07-10 Thread via GitHub
mccullocht commented on issue #14013: URL: https://github.com/apache/lucene/issues/14013#issuecomment-3059230678 Breadcrumb back to the mailing list discussion: https://lists.apache.org/thread/obc84kp3mxmd9nrbpxyj8bt0hbzfpxwv There's some evidence to suggest that a bulk scoring API wo

Re: [PR] Updating Dense#intoBitSet to properly set upTo if it exceeds bitset size [lucene]

2025-07-10 Thread via GitHub
jpountz commented on PR #14922: URL: https://github.com/apache/lucene/pull/14922#issuecomment-3059219790 > To me the contract here is that caller should guarantee there is no doc between offset + bitset.length() and upTo if offset + bitset.length() < upTo. Maybe we should clarify it in java

Re: [PR] Vectorize bitset to array [lucene]

2025-07-10 Thread via GitHub
jpountz commented on PR #14910: URL: https://github.com/apache/lucene/pull/14910#issuecomment-3059176116 This is very cool and the speedup makes sense to me. When dynamic pruning is enabled, only queries whose leading clauses are dense benefit significantly from this speedup (`OrStopWords`

Re: [PR] Fix off-heap byte vector scoring at query time [lucene]

2025-07-10 Thread via GitHub
msokolov commented on PR #14874: URL: https://github.com/apache/lucene/pull/14874#issuecomment-3059172791 > My understanding was that off heap document vectors helped by avoiding a copy back into the heap, plus avoiding the cost of reallocation and copy if some of them got garbage collected

Re: [PR] build: move forbidden-apis and more to java [lucene]

2025-07-10 Thread via GitHub
dweiss commented on PR #14924: URL: https://github.com/apache/lucene/pull/14924#issuecomment-3059160305 > I'd start with gradlew clean and possibly kill any still running daemons. I had something similar two days ago. I'm sorry to hear this. Never happened to me and I mess around havi

Re: [PR] Fix off-heap byte vector scoring at query time [lucene]

2025-07-10 Thread via GitHub
vigyasharma commented on PR #14874: URL: https://github.com/apache/lucene/pull/14874#issuecomment-3059155132 Wow, these are very impressive gains! Nice find @kaivalnp. So the key change is in `Arena.ofAuto().allocateFrom(JAVA_BYTE, queryVector);` which allocates an off heap `MemorySeg

Re: [PR] build: move forbidden-apis and more to java [lucene]

2025-07-10 Thread via GitHub
msokolov commented on PR #14924: URL: https://github.com/apache/lucene/pull/14924#issuecomment-3059030317 Thanks! Before, gradlew clean would not work either, but it is working now. I think possibly I just waited long enough and the daemons died?

Re: [PR] GroupVarInt Encoding Implementation for HNSW Graphs [lucene]

2025-07-10 Thread via GitHub
benwtrent commented on PR #14932: URL: https://github.com/apache/lucene/pull/14932#issuecomment-3059026742 @aylonsk great looking numbers! I expect for cheaper vector ops (e.g. single bit quantization), the impact is even higher.

Re: [PR] GroupVarInt Encoding Implementation for HNSW Graphs [lucene]

2025-07-10 Thread via GitHub
aylonsk commented on PR #14932: URL: https://github.com/apache/lucene/pull/14932#issuecomment-3059012712 Thanks for your response! My apologies, I forgot to post my results from LuceneUtil. Because I noticed variance between each run, I decided to test each set of hyperparameters 10

Re: [PR] build: move forbidden-apis and more to java [lucene]

2025-07-10 Thread via GitHub
uschindler commented on PR #14924: URL: https://github.com/apache/lucene/pull/14924#issuecomment-3059004884 I'd start with `gradlew clean` and possibly kill any still running daemons. I had something similar two days ago.

Re: [PR] Fix off-heap byte vector scoring at query time [lucene]

2025-07-10 Thread via GitHub
msokolov commented on PR #14874: URL: https://github.com/apache/lucene/pull/14874#issuecomment-3058892060 BTW the above results were on ARM/Graviton 2. I also tried on an Intel laptop and got speedups, although not as much, and the weird faster search after indexing also persists here

Re: [PR] build: move forbidden-apis and more to java [lucene]

2025-07-10 Thread via GitHub
msokolov commented on PR #14924: URL: https://github.com/apache/lucene/pull/14924#issuecomment-3058868762 This broke my local build: ``` FAILURE: Build failed with an exception. * Where: Build file '/home/ANT.AMAZON.COM/sokolovm/workspace/lucene/build.gradle' line: 30

Re: [I] Should we add a new "BulkScoreVectors" type api to the vector scorer interfaces? [lucene]

2025-07-10 Thread via GitHub
benwtrent commented on issue #14013: URL: https://github.com/apache/lucene/issues/14013#issuecomment-3058819130 @mccullocht Given the recent conversation on the Lucene list about making HNSW search faster.

Re: [PR] GroupVarInt Encoding Implementation for HNSW Graphs [lucene]

2025-07-10 Thread via GitHub
benwtrent commented on PR #14932: URL: https://github.com/apache/lucene/pull/14932#issuecomment-3058632703 Hi @aylonsk ! Thank you for digging into this issue. I am sure you are still working on it, but I had some feedback: - It would be interesting to get statistics around resulting

[PR] Move more gradle scripts to java [lucene]

2025-07-10 Thread via GitHub
dweiss opened a new pull request, #14933: URL: https://github.com/apache/lucene/pull/14933 Just another iteration. Draft until it reaches a reasonable size.

Re: [I] AbstractKnnVectorQuery breaks shallowAdvance contract, causing chaos [lucene]

2025-07-10 Thread via GitHub
benwtrent commented on issue #14857: URL: https://github.com/apache/lucene/issues/14857#issuecomment-3058536745 This has been patched in Lucene 9.12.x and will be available if/when another bugfix is released from that branch.

Re: [I] AbstractKnnVectorQuery breaks shallowAdvance contract, causing chaos [lucene]

2025-07-10 Thread via GitHub
benwtrent closed issue #14857: AbstractKnnVectorQuery breaks shallowAdvance contract, causing chaos URL: https://github.com/apache/lucene/issues/14857

Re: [I] Stop duplicating per-segment work across segment partitions [lucene]

2025-07-10 Thread via GitHub
expani commented on issue #13745: URL: https://github.com/apache/lucene/issues/13745#issuecomment-3058268142 I was looking to integrate Intra Segment Concurrent Search and found that this same problem also applies to downstream consumers of Lucene like OpenSearch/ElasticSearch/Solr who use

Re: [PR] Vectorize bitset to array [lucene]

2025-07-10 Thread via GitHub
github-actions[bot] commented on PR #14910: URL: https://github.com/apache/lucene/pull/14910#issuecomment-3058176567 This PR does not have an entry in lucene/CHANGES.txt. Consider adding one. If the PR doesn't need a changelog entry, then add the skip-changelog label to it and you will stop

Re: [PR] Vectorize bitset to array [lucene]

2025-07-10 Thread via GitHub
gf2121 commented on PR #14910: URL: https://github.com/apache/lucene/pull/14910#issuecomment-3058170259 Some more data: **Mac M2** ``` Task QPS baseline StdDev QPS my_modified_version StdDev Pct diff p-value

Re: [PR] Fix off-heap byte vector scoring at query time [lucene]

2025-07-10 Thread via GitHub
msokolov commented on PR #14874: URL: https://github.com/apache/lucene/pull/14874#issuecomment-3058165787 what I did: ``` @@ -305,7 +306,36 @@ final class PanamaVectorUtilSupport implements VectorUtilSupport { @Override public int dotProduct(byte[] a, byte[] b) {

Re: [PR] Fix off-heap byte vector scoring at query time [lucene]

2025-07-10 Thread via GitHub
msokolov commented on PR #14874: URL: https://github.com/apache/lucene/pull/14874#issuecomment-3058163612 OK I discovered the loss of recall was due to a silly bug. After fixing that, these are the results I'm seeing with the addition of a separate code path for `dotProduct(byte[], byte[])`

Re: [PR] build: move forbidden-apis and more to java [lucene]

2025-07-10 Thread via GitHub
dweiss commented on code in PR #14924: URL: https://github.com/apache/lucene/pull/14924#discussion_r2198055095 ## build-tools/build-infra/src/main/java/org/apache/lucene/gradle/plugins/java/ApplyForbiddenApisPlugin.java: ## @@ -0,0 +1,293 @@ +/* + * Licensed to the Apache Softwa

Re: [PR] GroupVarInt Encoding Implementation for HNSW Graphs [lucene]

2025-07-10 Thread via GitHub
github-actions[bot] commented on PR #14932: URL: https://github.com/apache/lucene/pull/14932#issuecomment-3057869141 This PR does not have an entry in lucene/CHANGES.txt. Consider adding one. If the PR doesn't need a changelog entry, then add the skip-changelog label to it and you will stop

[PR] GroupVarInt Encoding Implementation for HNSW Graphs [lucene]

2025-07-10 Thread via GitHub
aylonsk opened a new pull request, #14932: URL: https://github.com/apache/lucene/pull/14932 ### Description For HNSW Graphs, the alternate encoding I implemented was GroupVarInt encoding, which in theory should be less costly both in space and runtime. The pros of this encoding would
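For readers unfamiliar with the encoding named in this PR, the following is a minimal sketch of the general group-varint idea, not the code from the PR: four ints share one tag byte that stores each value's byte length (1 to 4) in two bits, followed by only the bytes each value needs. The class name `GroupVarIntSketch`, its methods, the tag bit order, and the little-endian payload layout are all assumptions chosen for illustration; Lucene's own on-disk layout may differ.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;

// Hypothetical illustration of group-varint encoding; not Lucene's implementation.
public class GroupVarIntSketch {

  // Number of bytes (1..4) needed to hold v when treated as an unsigned 32-bit value.
  static int byteLength(int v) {
    if ((v & 0xFFFFFF00) == 0) return 1;
    if ((v & 0xFFFF0000) == 0) return 2;
    if ((v & 0xFF000000) == 0) return 3;
    return 4;
  }

  // Encodes exactly four ints: one tag byte with four 2-bit (length - 1) fields,
  // then each value's significant bytes in little-endian order.
  static byte[] encode(int[] group) {
    if (group.length != 4) {
      throw new IllegalArgumentException("expected exactly 4 values");
    }
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    int tag = 0;
    for (int i = 0; i < 4; i++) {
      tag |= (byteLength(group[i]) - 1) << (6 - 2 * i);
    }
    out.write(tag);
    for (int v : group) {
      int n = byteLength(v);
      for (int b = 0; b < n; b++) {
        out.write((v >>> (8 * b)) & 0xFF);
      }
    }
    return out.toByteArray();
  }

  // Decodes one group produced by encode(); the tag byte tells how many bytes to read per value.
  static int[] decode(byte[] in) {
    int[] group = new int[4];
    int tag = in[0] & 0xFF;
    int pos = 1;
    for (int i = 0; i < 4; i++) {
      int n = ((tag >>> (6 - 2 * i)) & 0x3) + 1;
      int v = 0;
      for (int b = 0; b < n; b++) {
        v |= (in[pos++] & 0xFF) << (8 * b);
      }
      group[i] = v;
    }
    return group;
  }

  public static void main(String[] args) {
    int[] neighbors = {3, 300, 70_000, 20_000_000}; // made-up node ids
    byte[] encoded = encode(neighbors);
    System.out.println(encoded.length + " bytes: " + Arrays.toString(decode(encoded)));
  }
}
```

Compared with a per-value variable-length int, the decoder reads all four lengths from one byte up front instead of testing a continuation bit on every byte, which is the usual argument for the runtime benefit.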

Re: [PR] build: move forbidden-apis and more to java [lucene]

2025-07-10 Thread via GitHub
uschindler commented on code in PR #14924: URL: https://github.com/apache/lucene/pull/14924#discussion_r2198013800 ## build-tools/build-infra/src/main/java/org/apache/lucene/gradle/plugins/java/ApplyForbiddenApisPlugin.java: ## @@ -0,0 +1,293 @@ +/* + * Licensed to the Apache So

Re: [PR] Updating Dense#intoBitSet to properly set upTo if it exceeds bitset size [lucene]

2025-07-10 Thread via GitHub
gf2121 commented on PR #14922: URL: https://github.com/apache/lucene/pull/14922#issuecomment-3057858468 > i.e. not actually accounting for upTo bits, but instead just for bitSize Good point, I checked `BitsetIterator#intoBitset` and we had similar logic there. https://github.c

Re: [PR] build: move forbidden-apis and more to java [lucene]

2025-07-10 Thread via GitHub
uschindler commented on code in PR #14924: URL: https://github.com/apache/lucene/pull/14924#discussion_r2198006802 ## build-tools/build-infra/src/main/java/org/apache/lucene/gradle/plugins/java/ApplyForbiddenApisPlugin.java: ## @@ -0,0 +1,293 @@ +/* + * Licensed to the Apache So

Re: [PR] build: move forbidden-apis and more to java [lucene]

2025-07-10 Thread via GitHub
dweiss merged PR #14924: URL: https://github.com/apache/lucene/pull/14924

Re: [PR] Reapply "Update the IOContext rather than the ReadAdvice on IndexInput (#14702)" [lucene]

2025-07-10 Thread via GitHub
thecoop commented on code in PR #14844: URL: https://github.com/apache/lucene/pull/14844#discussion_r2197871752 ## lucene/test-framework/src/java/org/apache/lucene/tests/codecs/asserting/AssertingKnnVectorsFormat.java: ## @@ -228,8 +245,6 @@ public Map getOffHeapByteSize(FieldIn

Re: [I] VirtualMachineError is swallowed in IndexWriter [lucene]

2025-07-10 Thread via GitHub
uschindler commented on issue #14731: URL: https://github.com/apache/lucene/issues/14731#issuecomment-3057432027 > Note, we do not enable the InfoStream logger for IndexWriter ("IW"), which would have let us see the original error, I believe because it may be noisy given these fatal errors

Re: [PR] Fix off-heap byte vector scoring at query time [lucene]

2025-07-10 Thread via GitHub
msokolov commented on PR #14874: URL: https://github.com/apache/lucene/pull/14874#issuecomment-3057272315 Separately, I tried using the `Arena.ofAuto().allocateFrom()` construct in the on-heap case that is used during indexing and this made indexing incredibly slow. I guess it is because we

Re: [PR] Fix off-heap byte vector scoring at query time [lucene]

2025-07-10 Thread via GitHub
msokolov commented on PR #14874: URL: https://github.com/apache/lucene/pull/14874#issuecomment-3057200869 I did some deep-diving with profiler and I realized that when indexing, we call these dotProduct methods in a different context in which all of the vectors are on-heap. I'm surmising t

[PR] Introduce Impacts.forEach [lucene]

2025-07-10 Thread via GitHub
HUSTERGS opened a new pull request, #14931: URL: https://github.com/apache/lucene/pull/14931 ### Description This PR proposes to introduce a new `forEach` API on `Impacts`. It seems to help reduce the cost of `MaxScoreCache.computeMaxScore`. I've tried many other ways, to avo

[I] IndexWriter.merge does not properly clean up merge reader instances [lucene]

2025-07-10 Thread via GitHub
thecoop opened a new issue, #14930: URL: https://github.com/apache/lucene/issues/14930 ### Description Following on from https://github.com/apache/lucene/pull/14844#discussion_r2168818779, there are cases where `IndexWriter.merge` does not close merge instances before closing the 'r

Re: [I] VirtualMachineError is swallowed in IndexWriter [lucene]

2025-07-10 Thread via GitHub
thecoop commented on issue #14731: URL: https://github.com/apache/lucene/issues/14731#issuecomment-3057033604 The pattern of using `finally` blocks to handle cleanup is one which we are slowly removing, and replacing with suppressed exceptions; that case in particular is already modified to
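As a rough illustration of the difference this comment describes, here is a minimal sketch, not taken from `IndexWriter`, showing how cleanup in a `finally` block can replace the original error while the suppressed-exception style keeps it. The class `SuppressedCleanupSketch`, the `Step` interface, and both method names are hypothetical.

```java
import java.io.IOException;

// Hypothetical illustration; none of these names come from Lucene.
public class SuppressedCleanupSketch {

  interface Step {
    void run() throws Exception;
  }

  // finally-style cleanup: if both steps throw, the cleanup failure replaces the original one.
  static void finallyStyle(Step work, Step cleanup) throws Exception {
    try {
      work.run();
    } finally {
      cleanup.run();
    }
  }

  // Suppressed-exception style: the original failure is rethrown and the
  // cleanup failure is attached to it via addSuppressed.
  static void suppressedStyle(Step work, Step cleanup) throws Exception {
    try {
      work.run();
    } catch (Throwable primary) {
      try {
        cleanup.run();
      } catch (Throwable secondary) {
        primary.addSuppressed(secondary);
      }
      throw primary;
    }
    cleanup.run();
  }

  public static void main(String[] args) {
    Step work = () -> { throw new InternalError("simulated VM error"); };
    Step cleanup = () -> { throw new IOException("close failed"); };
    try {
      finallyStyle(work, cleanup);
    } catch (Throwable t) {
      System.out.println("finally style surfaced: " + t);
    }
    try {
      suppressedStyle(work, cleanup);
    } catch (Throwable t) {
      System.out.println("suppressed style surfaced: " + t + ", suppressed count: " + t.getSuppressed().length);
    }
  }
}
```

In the first method the `InternalError` is lost entirely once cleanup throws, which is the kind of swallowing the issue title refers to; in the second it remains the exception the caller sees.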

Re: [I] VirtualMachineError is swallowed in IndexWriter [lucene]

2025-07-10 Thread via GitHub
thecoop closed issue #14731: VirtualMachineError is swallowed in IndexWriter URL: https://github.com/apache/lucene/issues/14731

Re: [PR] Feat: Add auto formatting bot [lucene]

2025-07-10 Thread via GitHub
georgereuben commented on PR #14927: URL: https://github.com/apache/lucene/pull/14927#issuecomment-3056199571 Hi @dweiss @rmuir, I have updated the workflow. If the PR is raised in the same repo by a maintainer, it will raise a PR with formatting fixes, and if the PR is raised from a forked

Re: [PR] Feat: Add auto formatting bot [lucene]

2025-07-10 Thread via GitHub
georgereuben commented on code in PR #14927: URL: https://github.com/apache/lucene/pull/14927#discussion_r2196923309 ## .github/workflows/auto-format.yml: ## @@ -0,0 +1,268 @@ +name: Lucene Auto Format Bot + +on: + issue_comment: +types: [created] + +env: + DEVELOCITY_ACCE

Re: [PR] Feat: Add auto formatting bot [lucene]

2025-07-10 Thread via GitHub
georgereuben commented on code in PR #14927: URL: https://github.com/apache/lucene/pull/14927#discussion_r2196881674 ## .github/workflows/auto-format.yml: ## @@ -0,0 +1,359 @@ +name: Lucene Auto Format Bot + +on: + issue_comment: +types: [created] + +env: + DEVELOCITY_ACCE

Re: [PR] Feat: Add auto formatting bot [lucene]

2025-07-10 Thread via GitHub
georgereuben commented on code in PR #14927: URL: https://github.com/apache/lucene/pull/14927#discussion_r2196880437 ## .github/workflows/auto-format.yml: ## @@ -0,0 +1,359 @@ +name: Lucene Auto Format Bot + +on: + issue_comment: +types: [created] + +env: + DEVELOCITY_ACCE