[GitHub] [lucene-jira-archive] mocobeta opened a new issue, #36: Can we parallelize the converter script?
mocobeta opened a new issue, #36: URL: https://github.com/apache/lucene-jira-archive/issues/36 `jira2markdown_import.py` is single-threaded and it takes several hours to convert all Jira issues. I think it'd be easy to parallelize this with the [multiprocessing](https://docs.python.org/3/library/multiprocessing.html) module (it does not call any HTTP APIs), but I remember a few years ago, `logging` was not thread-safe. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-jira-archive] mocobeta commented on pull request #33: Polish wording of Legacy Jira details header, and each comment footer
mocobeta commented on PR #33: URL: https://github.com/apache/lucene-jira-archive/pull/33#issuecomment-1181456586 I'm also converting all the Jira issues myself; it looks like it takes several hours... (I think recent changes to fix conversion errors could have affected the conversion speed.) This shouldn't be so slow; I raised #36.
[GitHub] [lucene-jira-archive] mocobeta commented on issue #36: Can we parallelize the converter script?
mocobeta commented on issue #36: URL: https://github.com/apache/lucene-jira-archive/issues/36#issuecomment-1181497062 I found https://pypi.org/project/multiprocessing-logging/, but this works only on Linux.
[jira] [Commented] (LUCENE-10480) Specialize 2-clauses disjunctions
[ https://issues.apache.org/jira/browse/LUCENE-10480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17565375#comment-17565375 ] Adrien Grand commented on LUCENE-10480: --- +1 to explore this in a separate issue. bq. Do you think this slowdown to AndHighOrMedMed may be considered as blocker to 9.3 release? I wouldn't say blocker, but maybe we could give us time indeed by only using this new scorer on top-level disjunctions for now so that we have more time to figure out whether we should stick to BMW or switch to BMM for inner disjunctions. > Specialize 2-clauses disjunctions > - > > Key: LUCENE-10480 > URL: https://issues.apache.org/jira/browse/LUCENE-10480 > Project: Lucene - Core > Issue Type: Task >Reporter: Adrien Grand >Priority: Minor > Time Spent: 7h 20m > Remaining Estimate: 0h > > WANDScorer is nice, but it also has lots of overhead to maintain its > invariants: one linked list for the current candidates, one priority queue of > scorers that are behind, another one for scorers that are ahead. All this > could be simplified in the 2-clauses case, which feels worth specializing for > as it's very common that end users enter queries that only have two terms? -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Updated] (LUCENE-10600) SortedSetDocValues#docValueCount should be an int, not long
[ https://issues.apache.org/jira/browse/LUCENE-10600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand updated LUCENE-10600: -- Fix Version/s: 9.3 > SortedSetDocValues#docValueCount should be an int, not long > --- > > Key: LUCENE-10600 > URL: https://issues.apache.org/jira/browse/LUCENE-10600 > Project: Lucene - Core > Issue Type: Bug >Reporter: Adrien Grand >Assignee: Lu Xugang >Priority: Minor > Fix For: 9.3 > > Time Spent: 40m > Remaining Estimate: 0h
[GitHub] [lucene-jira-archive] mocobeta commented on issue #38: StackOverflowException on certain issue descriptions and comment text
mocobeta commented on issue #38: URL: https://github.com/apache/lucene-jira-archive/issues/38#issuecomment-1181605666 Thank you for opening this. While the stack overflow is rare, this recursion in parsing also causes a significant slowdown in conversion. I'm sure the root cause of both the slowdown and the stack overflow is this line (a customized version of the Jira list syntax parser): https://github.com/apache/lucene-jira-archive/blob/b4f125913eb77ed807d4f1a5836ac4d330f2352a/migration/src/markup/lists.py#L69 I'm trying to find other ways to parse lists correctly that do not cause infinite recursion.
[GitHub] [lucene] jpountz commented on a diff in pull request #966: LUCENE-10619: Optimize the writeBytes in TermsHashPerField
jpountz commented on code in PR #966: URL: https://github.com/apache/lucene/pull/966#discussion_r918804129 ## lucene/core/src/java/org/apache/lucene/index/TermsHashPerField.java: ## @@ -230,9 +230,29 @@ final void writeByte(int stream, byte b) { } final void writeBytes(int stream, byte[] b, int offset, int len) { -// TODO: optimize final int end = offset + len; -for (int i = offset; i < end; i++) writeByte(stream, b[i]); +int streamAddress = streamAddressOffset + stream; +int upto = termStreamAddressBuffer[streamAddress]; +byte[] slice = bytePool.buffers[upto >> ByteBlockPool.BYTE_BLOCK_SHIFT]; +assert slice != null; +int sliceOffset = upto & ByteBlockPool.BYTE_BLOCK_MASK; + +while (slice[sliceOffset] == 0 && offset < end) { + slice[sliceOffset++] = b[offset++]; + (termStreamAddressBuffer[streamAddress])++; +} Review Comment: Maybe in the future we could optimize this case a bit too by using `Arrays#mismatch` with an array that is full of zeroes. ## lucene/core/src/test/org/apache/lucene/index/TestTermsHashPerField.java: ## @@ -298,4 +299,23 @@ class Posting { assertTrue("the last posting must be EOF on the reader", eof); } } + + public void testWriteBytes() throws IOException { +for (int i = 0; i < 100; i++) { + AtomicInteger newCalled = new AtomicInteger(0); + AtomicInteger addCalled = new AtomicInteger(0); + TermsHashPerField hash = createNewHash(newCalled, addCalled); + hash.start(null, true); + hash.add(newBytesRef("start"), 0); // tid = 0; + int size = TestUtil.nextInt(random(), 5, 10); + byte[] randomData = new byte[size]; + random().nextBytes(randomData); + hash.writeBytes(0, randomData, 0, randomData.length); Review Comment: Maybe change this to write small chunks at once to better exercise the case when we're starting a write in the middle or at the end of a slice? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
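Editor's sketch for the `Arrays#mismatch` idea in the review comment above: comparing the target slice against an all-zero array gives the length of the writable zero run in one call, after which the bytes can be copied in bulk. The class and method names below (`ZeroRunSketch`, `zeroRunLength`, `bulkWrite`) are hypothetical and not part of Lucene. The sketch also assumes, as the `while (slice[sliceOffset] == 0 ...)` loop in the diff suggests, that a non-zero byte marks the end of the writable region.

```java
import java.util.Arrays;

/** Hypothetical helper, not part of Lucene: measures a run of zero bytes with Arrays.mismatch. */
final class ZeroRunSketch {
  // Shared all-zero array to compare against; its size caps how much one call can scan.
  private static final byte[] ZEROES = new byte[1 << 15];

  /** Returns how many consecutive zero bytes start at slice[from], scanning at most maxLen bytes. */
  public static int zeroRunLength(byte[] slice, int from, int maxLen) {
    int len = Math.min(Math.min(maxLen, slice.length - from), ZEROES.length);
    if (len <= 0) {
      return 0;
    }
    // mismatch returns the relative index of the first differing byte, or -1 if the ranges are equal.
    int firstNonZero = Arrays.mismatch(slice, from, from + len, ZEROES, 0, len);
    return firstNonZero == -1 ? len : firstNonZero;
  }

  /** Copies as many bytes of src as fit into the zero run at slice[from]; returns the count copied. */
  public static int bulkWrite(byte[] slice, int from, byte[] src, int srcOff, int srcLen) {
    int n = zeroRunLength(slice, from, srcLen);
    System.arraycopy(src, srcOff, slice, from, n);
    return n;
  }

  public static void main(String[] args) {
    byte[] slice = {0, 0, 0, 16, 0}; // the non-zero byte stands in for an end-of-slice marker
    byte[] data = {1, 2, 3, 4, 5};
    int written = bulkWrite(slice, 0, data, 0, data.length);
    System.out.println(written + " " + Arrays.toString(slice)); // prints "3 [1, 2, 3, 16, 0]"
  }
}
```

The remaining bytes (here `4, 5`) would still have to go through whatever slice-allocation path the real writer uses; the sketch only covers the fast path within a single slice.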
[GitHub] [lucene] jpountz commented on a diff in pull request #1003: LUCENE-10616: optimizing decompress when only retrieving some fields
jpountz commented on code in PR #1003: URL: https://github.com/apache/lucene/pull/1003#discussion_r918758391 ## lucene/core/src/java/org/apache/lucene/codecs/compressing/Decompressor.java: ## @@ -42,6 +44,13 @@ protected Decompressor() {} public abstract void decompress( DataInput in, int originalLength, int offset, int length, BytesRef bytes) throws IOException; + public InputStream decompress(DataInput in, int originalLength, int offset, int length) Review Comment: Maybe it would help to make it a `DataInput` and use `ByteArrayDataInput`. ## lucene/core/src/java/org/apache/lucene/document/DocumentStoredFieldVisitor.java: ## @@ -98,6 +100,16 @@ public void doubleField(FieldInfo fieldInfo, double value) { @Override public Status needsField(FieldInfo fieldInfo) throws IOException { +// return stop after collected all needed fields +if (fieldsToAdd != null +&& !fieldsToAdd.contains(fieldInfo.name) +&& fieldsToAdd.size() +== doc.getFields().stream() +.map(IndexableField::name) +.collect(Collectors.toSet()) +.size()) { + return Status.STOP; Review Comment: This isn't correct, some fields could have multiple values? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
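Editor's sketch of the multi-valued-field concern raised in the second review comment above. It replays a document's stored field names in order and applies the same early-stop condition as the diff (stop once the number of distinct collected names equals the number of requested names and the current field is not requested). The class `EarlyStopSketch` and its methods are hypothetical stand-ins, not Lucene code, and the sketch assumes that values of one field need not be adjacent in a stored document.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical model of a stored-field visitor; only field names are tracked. */
final class EarlyStopSketch {
  /** Collects requested fields, stopping early with the condition used in the diff. */
  public static List<String> collectWithEarlyStop(List<String> storedFields, Set<String> fieldsToAdd) {
    List<String> collected = new ArrayList<>();
    for (String name : storedFields) {
      Set<String> distinctCollected = new HashSet<>(collected);
      if (!fieldsToAdd.contains(name) && distinctCollected.size() == fieldsToAdd.size()) {
        break; // corresponds to returning Status.STOP
      }
      if (fieldsToAdd.contains(name)) {
        collected.add(name);
      }
    }
    return collected;
  }

  /** Collects requested fields without any early stop, for comparison. */
  public static List<String> collectAll(List<String> storedFields, Set<String> fieldsToAdd) {
    List<String> collected = new ArrayList<>();
    for (String name : storedFields) {
      if (fieldsToAdd.contains(name)) {
        collected.add(name);
      }
    }
    return collected;
  }

  public static void main(String[] args) {
    List<String> stored = List.of("title", "body", "title"); // "title" has two values
    Set<String> wanted = Set.of("title");
    System.out.println(collectWithEarlyStop(stored, wanted)); // prints "[title]"
    System.out.println(collectAll(stored, wanted)); // prints "[title, title]"
  }
}
```

Under these assumptions the early stop silently drops the second `title` value, which is the situation the comment warns about.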
[GitHub] [lucene-jira-archive] mikemccand commented on pull request #33: Polish wording of Legacy Jira details header, and each comment footer
mikemccand commented on PR #33: URL: https://github.com/apache/lucene-jira-archive/pull/33#issuecomment-1181821682 I'm closing this messed up PR -- I rebooted it into #40.
[GitHub] [lucene-jira-archive] mikemccand opened a new pull request, #40: #27: polish the legacy Jira text added to the issue a bit
mikemccand opened a new pull request, #40: URL: https://github.com/apache/lucene-jira-archive/pull/40 I "rebooted" my PR by downloading the diff off the messed up #33 PR, futzing it locally, applying, resolving conflicts. Messy messy. I'll try to more carefully manage the git merging steps next time ... I re-tested that this version is able to export tricky issues LUCENE-550 and LUCENE-4341, still showing the stack overflow error until we push @mocobeta's nice fix in #39. Closes #33.
[GitHub] [lucene-jira-archive] mikemccand closed pull request #33: Polish wording of Legacy Jira details header, and each comment footer
mikemccand closed pull request #33: Polish wording of Legacy Jira details header, and each comment footer URL: https://github.com/apache/lucene-jira-archive/pull/33
[GitHub] [lucene] mayya-sharipova commented on a diff in pull request #992: LUCENE-10592 Build HNSW Graph on indexing
mayya-sharipova commented on code in PR #992: URL: https://github.com/apache/lucene/pull/992#discussion_r919288022 ## lucene/core/src/java/org/apache/lucene/codecs/KnnVectorsWriter.java: ## @@ -24,28 +24,40 @@ import org.apache.lucene.index.DocIDMerger; import org.apache.lucene.index.FieldInfo; import org.apache.lucene.index.MergeState; +import org.apache.lucene.index.Sorter; import org.apache.lucene.index.VectorValues; import org.apache.lucene.search.TopDocs; +import org.apache.lucene.util.Accountable; import org.apache.lucene.util.Bits; import org.apache.lucene.util.BytesRef; /** Writes vectors to an index. */ -public abstract class KnnVectorsWriter implements Closeable { +public abstract class KnnVectorsWriter implements Accountable, Closeable { /** Sole constructor */ protected KnnVectorsWriter() {} - /** Write all values contained in the provided reader */ - public abstract void writeField(FieldInfo fieldInfo, KnnVectorsReader knnVectorsReader) + /** Add new field for indexing */ + public abstract void addField(FieldInfo fieldInfo) throws IOException; + + /** Add new docID with its vector value to the given field for indexing */ + public abstract void addValue(FieldInfo fieldInfo, int docID, float[] vectorValue) + throws IOException; + + /** Flush all buffered data on disk * */ + public abstract void flush(int maxDoc, Sorter.DocMap sortMap) throws IOException; Review Comment: @jtibshirani Thanks for the suggestion, I thought how to organize it, and I could not find a good way to do it, so I left the things as they are. 
In `IndexingChain#flush` we could have called `KnnFieldVectorsWriter#flush`, but the `flush` operation also requires calling `writer.finish();` and closing the writer, so it is better managed by `VectorValuesConsumer` than by individual `KnnFieldVectorsWriter` objects. > this would help make Lucene93HnswVectorsWriter easier to read, because we could separate out the complex sorting logic into a class like SortingFieldWriter This is also challenging to implement because whether a field writer is a `SortingFieldWriter` only becomes known during flush, so this would require converting the usual field writer object to a `SortingFieldWriter` on flush, which doesn't look nice.
[jira] [Commented] (LUCENE-10603) Improve iteration of ords for SortedSetDocValues
[ https://issues.apache.org/jira/browse/LUCENE-10603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17565402#comment-17565402 ] Adrien Grand commented on LUCENE-10603: --- +1 > Improve iteration of ords for SortedSetDocValues > > > Key: LUCENE-10603 > URL: https://issues.apache.org/jira/browse/LUCENE-10603 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Lu Xugang >Assignee: Lu Xugang >Priority: Trivial > Time Spent: 5h 20m > Remaining Estimate: 0h > > After SortedSetDocValues#docValueCount added since Lucene 9.2, should we > refactor the implementation of ords iterations using docValueCount instead of > NO_MORE_ORDS? > Similar how SortedNumericDocValues did > From > {code:java} > for (long ord = values.nextOrd();ord != SortedSetDocValues.NO_MORE_ORDS; ord > = values.nextOrd()) { > }{code} > to > {code:java} > for (int i = 0; i < values.docValueCount(); i++) { > long ord = values.nextOrd(); > }{code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
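Editor's sketch of the two iteration patterns compared in the issue description above. `DocOrds` is a hypothetical, array-backed stand-in for one document's ordinals; it only mirrors the `nextOrd()` and `docValueCount()` method names and the `NO_MORE_ORDS` sentinel mentioned in the issue, and is not the Lucene class. The count is hoisted out of the loop here, which is a small variation on the snippet in the issue.

```java
/** Hypothetical stand-in types; not the Lucene SortedSetDocValues API. */
final class OrdIterationSketch {
  static final long NO_MORE_ORDS = -1L;

  /** One document's ordinals, served in order like a doc-values iterator. */
  static final class DocOrds {
    private final long[] ords;
    private int pos;

    DocOrds(long... ords) {
      this.ords = ords;
    }

    int docValueCount() {
      return ords.length;
    }

    long nextOrd() {
      return pos < ords.length ? ords[pos++] : NO_MORE_ORDS;
    }
  }

  /** Sentinel-based loop: one extra nextOrd() call is needed to observe the end. */
  public static long sumWithSentinel(DocOrds values) {
    long sum = 0;
    for (long ord = values.nextOrd(); ord != NO_MORE_ORDS; ord = values.nextOrd()) {
      sum += ord;
    }
    return sum;
  }

  /** Count-based loop: exactly docValueCount() calls, no sentinel comparison. */
  public static long sumWithCount(DocOrds values) {
    long sum = 0;
    int count = values.docValueCount();
    for (int i = 0; i < count; i++) {
      sum += values.nextOrd();
    }
    return sum;
  }

  public static void main(String[] args) {
    System.out.println(sumWithSentinel(new DocOrds(2, 5, 9))); // prints "16"
    System.out.println(sumWithCount(new DocOrds(2, 5, 9))); // prints "16"
  }
}
```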
[GitHub] [lucene-jira-archive] mocobeta commented on pull request #33: Polish wording of Legacy Jira details header, and each comment footer
mocobeta commented on PR #33: URL: https://github.com/apache/lucene-jira-archive/pull/33#issuecomment-1181624324 > Thanks -- I was beginning to wonder if it was normal how long it was taking ;) Of course it's not normal; I remember it took two or three hours to convert the whole Jira snapshot when I did the last test migration. Not so fast, but it was an acceptable speed (taking into account that this is written entirely in Python). I've added some custom syntax parser components (e.g. https://github.com/apache/lucene-jira-archive/pull/19) to fix conversion errors since then - some of them are likely to cause a slowdown in parsing.
[jira] [Commented] (LUCENE-10628) Enable MatchingFacetSetCounts to use space partitioning data structures
[ https://issues.apache.org/jira/browse/LUCENE-10628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17565882#comment-17565882 ] Ignacio Vera commented on LUCENE-10628: --- I have mainly worked with two types of trees in Lucene. * [KD-tree|https://github.com/apache/lucene/blob/35ca2d79f73c6dfaf5e648fe241f7e0b37084a90/lucene/core/src/java/org/apache/lucene/util/bkd/BKDWriter.java#L79]: It is complex to build, so it is probably not suitable for building on the fly, but it is the best structure for an index. * [Interval tree|https://github.com/apache/lucene/blob/2d6ad2fee6dfd96388594f4de9b37c037efe8017/lucene/core/src/java/org/apache/lucene/geo/ComponentTree.java#L28] (I think originally introduced by Robert Muir): Not as efficient as a KD-tree but much cheaper to build and suitable for small data. From a quick look I think you would be looking for an interval tree, but mind you that I have only worked with that tree for very low dimensions. I guess this kind of tree will quickly degenerate due to the [curse of dimensionality|https://en.wikipedia.org/wiki/Curse_of_dimensionality]. How many dimensions are you expecting to support? > Enable MatchingFacetSetCounts to use space partitioning data structures > --- > > Key: LUCENE-10628 > URL: https://issues.apache.org/jira/browse/LUCENE-10628 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Marc D'Mello >Priority: Minor > > Currently, {{MatchingFacetSetCounts}} iterates over {{FacetSetMatcher}} > instances passed into it linearly. While this is fine in some cases, if we > have a large amount of {{FacetSetMatcher}}'s, this can be inefficient. We > should provide the option to users to enable the use of space partitioning > data structures (namely R trees and KD trees) so we can potentially scan over > these {{FacetSetMatcher}}'s in sub-linear time.
[GitHub] [lucene-jira-archive] mocobeta opened a new pull request, #39: Fix stack overflow when parsing lists
mocobeta opened a new pull request, #39: URL: https://github.com/apache/lucene-jira-archive/pull/39 Close #38 This ad-hoc patch fixes `'maximum recursion depth exceeded'` error, and also makes the script a bit faster. (8h -> 5h)
[jira] [Commented] (LUCENE-10649) Failure in TestDemoParallelLeafReader.testRandomMultipleSchemaGensSameField
[ https://issues.apache.org/jira/browse/LUCENE-10649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17565885#comment-17565885 ] Adrien Grand commented on LUCENE-10649: --- Good catch [~vigyas], it looks related indeed. The bug seems to be that {{ReindexingMergePolicy}} doesn't override {{findFullFlushMerges}} to wrap input readers, so the merged segment doesn't get fields from the parallel reader. Would you like to open a PR? > Failure in TestDemoParallelLeafReader.testRandomMultipleSchemaGensSameField > --- > > Key: LUCENE-10649 > URL: https://issues.apache.org/jira/browse/LUCENE-10649 > Project: Lucene - Core > Issue Type: Bug >Reporter: Vigya Sharma >Priority: Major > > Failing Build Link: > [https://jenkins.thetaphi.de/job/Lucene-main-Linux/35617/testReport/junit/org.apache.lucene.index/TestDemoParallelLeafReader/testRandomMultipleSchemaGensSameField/] > Repro: > {code:java} > gradlew test --tests > TestDemoParallelLeafReader.testRandomMultipleSchemaGensSameField > -Dtests.seed=A7496D7D3957981A -Dtests.multiplier=3 -Dtests.locale=sr-Latn-BA > -Dtests.timezone=Etc/GMT-7 -Dtests.asserts=true -Dtests.file.encoding=UTF-8 > {code} > Error: > {code:java} > java.lang.AssertionError: expected:<103> but was:<2147483647> > at > __randomizedtesting.SeedInfo.seed([A7496D7D3957981A:F71866BCCEA1C903]:0) > at org.junit.Assert.fail(Assert.java:89) > at org.junit.Assert.failNotEquals(Assert.java:835) > at org.junit.Assert.assertEquals(Assert.java:647) > at org.junit.Assert.assertEquals(Assert.java:633) > at > org.apache.lucene.index.TestDemoParallelLeafReader.testRandomMultipleSchemaGensSameField(TestDemoParallelLeafReader.java:1347) > at > java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104) > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] mayya-sharipova commented on a diff in pull request #992: LUCENE-10592 Build HNSW Graph on indexing
mayya-sharipova commented on code in PR #992: URL: https://github.com/apache/lucene/pull/992#discussion_r919332844 ## lucene/core/src/java/org/apache/lucene/codecs/perfield/PerFieldKnnVectorsFormat.java: ## @@ -102,9 +104,22 @@ private class FieldsWriter extends KnnVectorsWriter { } @Override -public void writeField(FieldInfo fieldInfo, KnnVectorsReader knnVectorsReader) +public KnnFieldVectorsWriter addField(FieldInfo fieldInfo) throws IOException { + KnnVectorsWriter writer = getInstance(fieldInfo); + return writer.addField(fieldInfo); +} + +@Override +public void flush(int maxDoc, Sorter.DocMap sortMap) throws IOException { + for (WriterAndSuffix was : formats.values()) { +was.writer.flush(maxDoc, sortMap); + } +} + +@Override +public void mergeOneField(FieldInfo fieldInfo, KnnVectorsReader knnVectorsReader) throws IOException { - getInstance(fieldInfo).writeField(fieldInfo, knnVectorsReader); + getInstance(fieldInfo).mergeOneField(fieldInfo, knnVectorsReader); Review Comment: @jtibshirani Thanks for the suggestion, but we can't at the same time make `KnnVectorsWriter#merge` final and also un-support this `mergeOneField`? Should we keep `mergeOneField` and make `KnnVectorsWriter#merge` final as your other suggestion? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] luyuncheng commented on pull request #987: LUCENE-10627: Using CompositeByteBuf to Reduce Memory Copy
luyuncheng commented on PR #987: URL: https://github.com/apache/lucene/pull/987#issuecomment-1181632413 > Would it be possible to remove all `CompressionMode#compress` variants that take a `byte[]` now that you introduced a new method that takes a `ByteBuffersDataInput`? > > Also maybe we should keep old codecs unmodified and only make this change to `Lucene90Codec` where it makes most sense? Hi @jpountz, thanks for reviewing this code. I prefer keeping old codecs unmodified: `CompressionMode#compress` is a public abstract method, so if we change it to take a `ByteBuffersDataInput` we would need to backport the change to many codecs, like [commits](https://github.com/apache/lucene/blob/382962f22df3ee3af3fb538b877c98d61a622ddb/lucene/backward-codecs/src/java/org/apache/lucene/backward_codecs/lucene50/Lucene50StoredFieldsFormat.java). And if we only use the compress method that takes a `ByteBuffersDataInput` in Lucene90, we cannot go through the abstract method `Compressor.compress` when we want to use another compression mode. Would it be possible to add a new method to Compressor, like the following? It keeps the old codecs unmodified, and the method that takes a `ByteBuffersDataInput` would only be called in Lucene90Codec. ``` public abstract void compress(byte[] bytes, int off, int len, DataOutput out) throws IOException; public void compress(CompositeByteBuf compositeByteBuf, DataOutput out) throws IOException; ```
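Editor's sketch of how the overload proposed in the comment above could keep old codecs unmodified: the abstract `byte[]` method stays, and the new overload gets a default body that falls back to it, so only codecs that want the multi-buffer path override it. Everything here is a hypothetical stand-in: `List<byte[]>` replaces the buffer type, `OutputStream` replaces `DataOutput`, and the "compressors" just pass bytes through. It illustrates the class shape only, not what the PR actually implemented.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.util.List;

/** Hypothetical stand-ins; not the Lucene Compressor API. */
final class CompressorOverloadSketch {
  abstract static class Compressor {
    /** Existing contract that every codec already implements. */
    public abstract void compress(byte[] bytes, int off, int len, OutputStream out)
        throws IOException;

    /** New overload with a default body: join the buffers, then delegate to the byte[] path. */
    public void compress(List<byte[]> buffers, OutputStream out) throws IOException {
      ByteArrayOutputStream joined = new ByteArrayOutputStream();
      for (byte[] buffer : buffers) {
        joined.write(buffer, 0, buffer.length);
      }
      byte[] all = joined.toByteArray();
      compress(all, 0, all.length, out);
    }
  }

  /** An "old codec": implements only the byte[] method and inherits the fallback. */
  static final class LegacyCompressor extends Compressor {
    @Override
    public void compress(byte[] bytes, int off, int len, OutputStream out) throws IOException {
      out.write(bytes, off, len); // pass-through placeholder for real compression
    }
  }

  /** A "new codec": overrides the overload to consume buffers without joining them first. */
  static final class BufferAwareCompressor extends Compressor {
    @Override
    public void compress(byte[] bytes, int off, int len, OutputStream out) throws IOException {
      out.write(bytes, off, len);
    }

    @Override
    public void compress(List<byte[]> buffers, OutputStream out) throws IOException {
      for (byte[] buffer : buffers) {
        out.write(buffer, 0, buffer.length); // no intermediate copy
      }
    }
  }

  /** Runs a compressor over the buffers and returns what it wrote. */
  public static byte[] run(Compressor compressor, List<byte[]> buffers) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    try {
      compressor.compress(buffers, out);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return out.toByteArray();
  }

  public static void main(String[] args) {
    List<byte[]> buffers = List.of(new byte[] {1, 2}, new byte[] {3});
    System.out.println(run(new LegacyCompressor(), buffers).length); // prints "3"
    System.out.println(run(new BufferAwareCompressor(), buffers).length); // prints "3"
  }
}
```

The trade-off is that callers can always use the new overload, while only codecs that override it avoid the extra copy.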
[jira] [Commented] (LUCENE-10471) Increase the number of dims for KNN vectors to 2048
[ https://issues.apache.org/jira/browse/LUCENE-10471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17565468#comment-17565468 ] Michael Sokolov commented on LUCENE-10471: -- We should not be imposing an arbitrary limit that prevents people with CNNs (image-processing models) from using this feature. It makes sense to me to increase the limit to the point where we would see actual bugs/failures, or where the large numbers might prevent us from making some future optimization, rather than trying to determine where the performance stops being acceptable - that's a question for users to decide for themselves. Of course we don't know where that place is that we might want to optimize in the future (Rob and I discussed an idea using all-integer math that would suffer from overflow, but still we should not just allow MAX_INT dimensions I think? To me a limit like 16K makes sense – well beyond any stated use case, but not effectively infinite? > Increase the number of dims for KNN vectors to 2048 > --- > > Key: LUCENE-10471 > URL: https://issues.apache.org/jira/browse/LUCENE-10471 > Project: Lucene - Core > Issue Type: Wish >Reporter: Mayya Sharipova >Priority: Trivial > Time Spent: 40m > Remaining Estimate: 0h > > The current maximum allowed number of dimensions is equal to 1024. But we see > in practice a couple well-known models that produce vectors with > 1024 > dimensions (e.g > [mobilenet_v2|https://tfhub.dev/google/imagenet/mobilenet_v2_035_224/feature_vector/1] > uses 1280d vectors, OpenAI / GPT-3 Babbage uses 2048d vectors). Increasing > max dims to `2048` will satisfy these use cases. > I am wondering if anybody has strong objections against this. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] luyuncheng commented on a diff in pull request #987: LUCENE-10627: Using CompositeByteBuf to Reduce Memory Copy
luyuncheng commented on code in PR #987: URL: https://github.com/apache/lucene/pull/987#discussion_r918848057 ## lucene/core/src/java/org/apache/lucene/codecs/compressing/CompressionMode.java: ## @@ -257,9 +270,13 @@ private static class DeflateCompressor extends Compressor { } @Override -public void compress(byte[] bytes, int off, int len, DataOutput out) throws IOException { +public void compress(ByteBuffersDataInput buffersInput, int off, int len, DataOutput out) Review Comment: It is a nice suggestion; I'll try to use the method `compress(CompositeByteBuf compositeByteBuf, DataOutput out)`.
[GitHub] [lucene] mayya-sharipova commented on a diff in pull request #992: LUCENE-10592 Build HNSW Graph on indexing
mayya-sharipova commented on code in PR #992: URL: https://github.com/apache/lucene/pull/992#discussion_r919288022 ## lucene/core/src/java/org/apache/lucene/codecs/KnnVectorsWriter.java: ## @@ -24,28 +24,40 @@ import org.apache.lucene.index.DocIDMerger; import org.apache.lucene.index.FieldInfo; import org.apache.lucene.index.MergeState; +import org.apache.lucene.index.Sorter; import org.apache.lucene.index.VectorValues; import org.apache.lucene.search.TopDocs; +import org.apache.lucene.util.Accountable; import org.apache.lucene.util.Bits; import org.apache.lucene.util.BytesRef; /** Writes vectors to an index. */ -public abstract class KnnVectorsWriter implements Closeable { +public abstract class KnnVectorsWriter implements Accountable, Closeable { /** Sole constructor */ protected KnnVectorsWriter() {} - /** Write all values contained in the provided reader */ - public abstract void writeField(FieldInfo fieldInfo, KnnVectorsReader knnVectorsReader) + /** Add new field for indexing */ + public abstract void addField(FieldInfo fieldInfo) throws IOException; + + /** Add new docID with its vector value to the given field for indexing */ + public abstract void addValue(FieldInfo fieldInfo, int docID, float[] vectorValue) + throws IOException; + + /** Flush all buffered data on disk * */ + public abstract void flush(int maxDoc, Sorter.DocMap sortMap) throws IOException; Review Comment: @jtibshirani Thanks for the suggestion, I thought how to organize it, and I could not find a good way to do it, so I left the things as they are. In `IndexingChain#flush` we could have called `KnnFieldVectorsWriter#flush`, but `flush` operation also requires do `writer.finish();` and close the writer, so it is better managed by `VectorValuesConsumer` than individual `KnnFieldVectorsWriter` objects. 
> this would help make Lucene93HnswVectorsWriter easier to read, because we could separate out the complex sorting logic into a class like SortingFieldWriter This is also challenging to implement because whether a field writer needs to be `SortingFieldWriter` only becomes known during flush (`if sortMap != null` ), so this would require converting usual field writer object to `SortingFieldWriter` on flush, which doesn't look nice. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] mayya-sharipova commented on a diff in pull request #992: LUCENE-10592 Build HNSW Graph on indexing
mayya-sharipova commented on code in PR #992: URL: https://github.com/apache/lucene/pull/992#discussion_r919343914 ## lucene/core/src/java/org/apache/lucene/codecs/lucene93/Lucene93HnswVectorsWriter.java: ## @@ -266,65 +470,128 @@ private void writeMeta( } } - private OnHeapHnswGraph writeGraph( - RandomAccessVectorValuesProducer vectorValues, VectorSimilarityFunction similarityFunction) + /** + * Writes the vector values to the output and returns a set of documents that contains vectors. + */ + private static DocsWithFieldSet writeVectorData(IndexOutput output, VectorValues vectors) throws IOException { +DocsWithFieldSet docsWithField = new DocsWithFieldSet(); +for (int docV = vectors.nextDoc(); docV != NO_MORE_DOCS; docV = vectors.nextDoc()) { + // write vector + BytesRef binaryValue = vectors.binaryValue(); + assert binaryValue.length == vectors.dimension() * Float.BYTES; + output.writeBytes(binaryValue.bytes, binaryValue.offset, binaryValue.length); + docsWithField.add(docV); +} +return docsWithField; + } -// build graph -HnswGraphBuilder hnswGraphBuilder = -new HnswGraphBuilder( -vectorValues, similarityFunction, M, beamWidth, HnswGraphBuilder.randSeed); -hnswGraphBuilder.setInfoStream(segmentWriteState.infoStream); -OnHeapHnswGraph graph = hnswGraphBuilder.build(vectorValues.randomAccess()); + @Override + public void close() throws IOException { +IOUtils.close(meta, vectorData, vectorIndex); + } -// write vectors' neighbours on each level into the vectorIndex file -int countOnLevel0 = graph.size(); -for (int level = 0; level < graph.numLevels(); level++) { - int maxConnOnLevel = level == 0 ? 
(M * 2) : M; - NodesIterator nodesOnLevel = graph.getNodesOnLevel(level); - while (nodesOnLevel.hasNext()) { -int node = nodesOnLevel.nextInt(); -NeighborArray neighbors = graph.getNeighbors(level, node); -int size = neighbors.size(); -vectorIndex.writeInt(size); -// Destructively modify; it's ok we are discarding it after this -int[] nnodes = neighbors.node(); -Arrays.sort(nnodes, 0, size); -for (int i = 0; i < size; i++) { - int nnode = nnodes[i]; - assert nnode < countOnLevel0 : "node too large: " + nnode + ">=" + countOnLevel0; - vectorIndex.writeInt(nnode); -} -// if number of connections < maxConn, add bogus values up to maxConn to have predictable -// offsets -for (int i = size; i < maxConnOnLevel; i++) { - vectorIndex.writeInt(0); -} + private static class FieldData extends KnnFieldVectorsWriter { Review Comment: Nice suggestion! Addressed in b47ddc9e6834e2b0a838adf7fc1bed791b24ce2e -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-jira-archive] mocobeta commented on issue #38: StackOverflowException on certain issue descriptions and comment text
mocobeta commented on issue #38: URL: https://github.com/apache/lucene-jira-archive/issues/38#issuecomment-1181776008
I opened #39. I cannot really explain _why the ad-hoc fix works_, but it works. There should be a better way, though this would be sufficient for the one-time batch.
- it parses Jira list syntax correctly (if the list is not a complex one)
- it does not cause stack overflows and improves the throughput (30~40% faster than current main on my desktop)
Still, this takes four or five hours for me; we could parallelize it (#36) so that we can improve/test it more often.
[GitHub] [lucene] mayya-sharipova commented on a diff in pull request #992: LUCENE-10592 Build HNSW Graph on indexing
mayya-sharipova commented on code in PR #992: URL: https://github.com/apache/lucene/pull/992#discussion_r919349095 ## lucene/core/src/java/org/apache/lucene/index/VectorValuesWriter.java: ## @@ -26,233 +26,153 @@ import org.apache.lucene.codecs.KnnVectorsWriter; import org.apache.lucene.search.DocIdSetIterator; import org.apache.lucene.search.TopDocs; +import org.apache.lucene.util.Accountable; import org.apache.lucene.util.ArrayUtil; import org.apache.lucene.util.Bits; import org.apache.lucene.util.BytesRef; -import org.apache.lucene.util.Counter; import org.apache.lucene.util.RamUsageEstimator; /** - * Buffers up pending vector value(s) per doc, then flushes when segment flushes. + * Buffers up pending vector value(s) per doc, then flushes when segment flushes. Used for {@code + * SimpleTextKnnVectorsWriter} and for vectors writers before v 9.3 . * * @lucene.experimental */ -class VectorValuesWriter { - - private final FieldInfo fieldInfo; - private final Counter iwBytesUsed; - private final List vectors = new ArrayList<>(); - private final DocsWithFieldSet docsWithField; - - private int lastDocID = -1; - - private long bytesUsed; - - VectorValuesWriter(FieldInfo fieldInfo, Counter iwBytesUsed) { -this.fieldInfo = fieldInfo; -this.iwBytesUsed = iwBytesUsed; -this.docsWithField = new DocsWithFieldSet(); -this.bytesUsed = docsWithField.ramBytesUsed(); -if (iwBytesUsed != null) { - iwBytesUsed.addAndGet(bytesUsed); +public abstract class VectorValuesWriter extends KnnVectorsWriter { Review Comment: @jtibshirani Thanks for the suggestion, I understood it and indeed I think it is a good idea. I will hold off implementing it until we finalize how we want to organize `KnnVectorsWriter` and `KnnFieldVectorsWriter` classes, so we can implement `SimpleTextKnnVectorsWriter` in the same way. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
[jira] [Commented] (LUCENE-10653) Should BlockMaxMaxscoreScorer rebuild its heap in bulk?
[ https://issues.apache.org/jira/browse/LUCENE-10653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17565380#comment-17565380 ]
Adrien Grand commented on LUCENE-10653:
+1 to doing a bulk heapify. In my opinion, the fact that this scorer only handles 2 clauses for now is only a way to give us more time to evaluate when we should use it vs. WANDScorer. Most likely it will be used for more than 2 clauses at some point in the future.
> Should BlockMaxMaxscoreScorer rebuild its heap in bulk?
> Key: LUCENE-10653
> URL: https://issues.apache.org/jira/browse/LUCENE-10653
> Project: Lucene - Core
> Issue Type: Improvement
> Components: core/search
> Reporter: Greg Miller
> Priority: Minor
>
> BMMScorer has to frequently rebuild its heap, and does so by clearing and then iteratively calling {{add}}. It would be more efficient to heapify in bulk. This is more academic than anything right now though, since BMMScorer is only used with two-clause disjunctions, so it's sort of a silly optimization if it's not supporting a greater number of clauses.
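The difference between the two rebuild strategies can be sketched outside Lucene. This is only an illustration in Python, assuming a plain binary min-heap of made-up scores; it is not BMMScorer's actual heap, which is not shown in this thread. Rebuilding by repeated insertion costs O(n log n) in the worst case, while a single bulk heapify is O(n).

```python
import heapq


def rebuild_iteratively(items):
    """Clear-and-add style rebuild: one sift-up per element, O(n log n) worst case."""
    heap = []
    for item in items:
        heapq.heappush(heap, item)
    return heap


def rebuild_in_bulk(items):
    """Bulk rebuild: copy the elements, then heapify once in O(n)."""
    heap = list(items)
    heapq.heapify(heap)
    return heap


def is_min_heap(heap):
    """Check the heap invariant: every parent is <= each of its children."""
    return all(heap[(i - 1) // 2] <= heap[i] for i in range(1, len(heap)))


scores = [7.5, 1.0, 9.25, 3.0, 4.5, 0.5, 8.0]  # made-up values
a = rebuild_iteratively(scores)
b = rebuild_in_bulk(scores)
```

Both functions return a valid heap over the same elements; the internal array layout may differ, which is fine because only the heap invariant matters. With two clauses the gap is negligible, which matches the "more academic than anything" remark in the issue.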
[GitHub] [lucene] mayya-sharipova commented on pull request #992: LUCENE-10592 Build HNSW Graph on indexing
mayya-sharipova commented on PR #992: URL: https://github.com/apache/lucene/pull/992#issuecomment-1182388563 @jtibshirani @jpountz Thanks for your review. I've tried to address your comments, but it looks like we are still not clear on how to organize the `merge` and `flush` methods. It would be nice if you could provide further comments.
[GitHub] [lucene] mayya-sharipova commented on a diff in pull request #992: LUCENE-10592 Build HNSW Graph on indexing
mayya-sharipova commented on code in PR #992: URL: https://github.com/apache/lucene/pull/992#discussion_r919332844 ## lucene/core/src/java/org/apache/lucene/codecs/perfield/PerFieldKnnVectorsFormat.java: ## @@ -102,9 +104,22 @@ private class FieldsWriter extends KnnVectorsWriter { } @Override -public void writeField(FieldInfo fieldInfo, KnnVectorsReader knnVectorsReader) +public KnnFieldVectorsWriter addField(FieldInfo fieldInfo) throws IOException { + KnnVectorsWriter writer = getInstance(fieldInfo); + return writer.addField(fieldInfo); +} + +@Override +public void flush(int maxDoc, Sorter.DocMap sortMap) throws IOException { + for (WriterAndSuffix was : formats.values()) { +was.writer.flush(maxDoc, sortMap); + } +} + +@Override +public void mergeOneField(FieldInfo fieldInfo, KnnVectorsReader knnVectorsReader) throws IOException { - getInstance(fieldInfo).writeField(fieldInfo, knnVectorsReader); + getInstance(fieldInfo).mergeOneField(fieldInfo, knnVectorsReader); Review Comment: @jtibshirani Thanks for the suggestion, but we can't at the same time make `KnnVectorsWriter#merge` final and also un-support this `mergeOneField` in `PerFieldKnnVectorsFormat` as `KnnVectorsWriter#merge` calls the corresponding `mergeOneField`. Should we keep `mergeOneField` and make `KnnVectorsWriter#merge` final as your other suggestion? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-jira-archive] mikemccand commented on issue #38: StackOverflowException on certain issue descriptions and comment text
mikemccand commented on issue #38: URL: https://github.com/apache/lucene-jira-archive/issues/38#issuecomment-1181644356 > I'm trying to find other ways that do not cause infinite recursion while parsing lists correctly. Awesome, thanks @mocobeta! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] jpountz commented on pull request #907: LUCENE-10357 Ghost fields and postings/points
jpountz commented on PR #907: URL: https://github.com/apache/lucene/pull/907#issuecomment-1181518177 @shahrs87 Can you look into removing all other instances of `terms == Terms.EMPTY` or `terms != Terms.EMPTY` as well? To do this while keeping tests passing, I think you'll need to create empty `Terms` instances that still honor the options of the `FieldInfo` as per my previous suggestion. E.g. you could add a new `Terms#empty(FieldInfo)` helper method that does the right thing for `hasFreqs()`, `hasPositions()`, etc. by looking at the index options of the `FieldInfo`. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-jira-archive] mikemccand merged pull request #40: #27: polish the legacy Jira text added to the issue a bit
mikemccand merged PR #40: URL: https://github.com/apache/lucene-jira-archive/pull/40 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10577) Quantize vector values
[ https://issues.apache.org/jira/browse/LUCENE-10577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17565914#comment-17565914 ] Michael Sokolov commented on LUCENE-10577: -- It would be nice if we could make this encoding an *entirely* internal detail, with no user-configuration, but I don't think we can, because: # the choice of quantization scaling factor has a significant impact on the lossiness and thus recall. It needs to be tuned for each dataset # Even if we were able to do this tuning in Lucene, automatically, we would have to do it per-segment, and then when we merge, we'd have to re-scale, and we would lose more precision then. Because of this, I think we need to expose the ability for users to provide quantized data, and then they need some way of specifying for a given field whether it is byte-encoded or float-encoded. Although I do see that it could be done using the PerFieldKnnVectorsFormat - is that what you were saying, [~julietibs] ? > Quantize vector values > -- > > Key: LUCENE-10577 > URL: https://issues.apache.org/jira/browse/LUCENE-10577 > Project: Lucene - Core > Issue Type: Improvement > Components: core/codecs >Reporter: Michael Sokolov >Priority: Major > Time Spent: 2h > Remaining Estimate: 0h > > The {{KnnVectorField}} api handles vectors with 4-byte floating point values. > These fields can be used (via {{KnnVectorsReader}}) in two main ways: > 1. The {{VectorValues}} iterator enables retrieving values > 2. Approximate nearest -neighbor search > The main point of this addition was to provide the search capability, and to > support that it is not really necessary to store vectors in full precision. > Perhaps users may also be willing to retrieve values in lower precision for > whatever purpose those serve, if they are able to store more samples. 
We know > that 8 bits is enough to provide a very near approximation to the same > recall/performance tradeoff that is achieved with the full-precision vectors. > I'd like to explore how we could enable 4:1 compression of these fields by > reducing their precision. > A few ways I can imagine this would be done: > 1. Provide a parallel byte-oriented API. This would allow users to provide > their data in reduced-precision format and give control over the quantization > to them. It would have a major impact on the Lucene API surface though, > essentially requiring us to duplicate all of the vector APIs. > 2. Automatically quantize the stored vector data when we can. This would > require no or perhaps very limited change to the existing API to enable the > feature. > I've been exploring (2), and what I find is that we can achieve very good > recall results using dot-product similarity scoring by simple linear scaling > + quantization of the vector values, so long as we choose the scale that > minimizes the quantization error. Dot-product is amenable to this treatment > since vectors are required to be unit-length when used with that similarity > function. > Even still there is variability in the ideal scale over different data sets. > A good choice seems to be max(abs(min-value), abs(max-value)), but of course > this assumes that the data set doesn't have a few outlier data points. A > theoretical range can be obtained by 1/sqrt(dimension), but this is only > useful when the samples are normally distributed. We could in theory > determine the ideal scale when flushing a segment and manage this > quantization per-segment, but then numerical error could creep in when > merging. > I'll post a patch/PR with an experimental setup I've been using for > evaluation purposes. It is pretty self-contained and simple, but has some > drawbacks that need to be addressed: > 1. 
No automated mechanism for determining quantization scale (it's a constant > that I have been playing with) > 2. Converts from byte/float when computing dot-product instead of directly > computing on byte values > I'd like to get people's feedback on the approach and whether in general we > should think about doing this compression under the hood, or expose a > byte-oriented API. Whatever we do I think a 4:1 compression ratio is pretty > compelling and we should pursue something. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
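The "simple linear scaling + quantization" described in the issue can be sketched as follows. This is a hedged illustration in Python, not the code from the patch: the 127 target range, the rounding, and the clipping are assumptions made here, and the sample vectors are made up. Only the choice of scale, max(abs(min-value), abs(max-value)), comes from the issue text. The sketch also assumes the data is not all zeros, since the scale is used as a divisor.

```python
def choose_scale(vectors):
    """Scale suggested in the issue: max(abs(min-value), abs(max-value)) over the data set."""
    values = [v for vec in vectors for v in vec]
    return max(abs(min(values)), abs(max(values)))


def quantize(vec, scale):
    """Map each float to a signed byte in [-127, 127] by linear scaling and rounding (assumed scheme)."""
    return [max(-127, min(127, round(v / scale * 127))) for v in vec]


def dequantize(qvec, scale):
    """Invert the linear scaling; the rounding error is not recoverable."""
    return [q * scale / 127 for q in qvec]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


vectors = [[0.6, -0.8, 0.0], [0.0, 0.6, 0.8]]  # made-up unit-length vectors
scale = choose_scale(vectors)
quantized = [quantize(v, scale) for v in vectors]
restored = [dequantize(q, scale) for q in quantized]
```

Each restored component differs from the original by at most half a quantization step (scale / 254 under these assumptions), so the dot product of the restored vectors stays close to the full-precision one. A single outlier inflates the scale and widens every step, which is the sensitivity the issue points out.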
[GitHub] [lucene-jira-archive] mikemccand commented on pull request #33: Polish wording of Legacy Jira details header, and each comment footer
mikemccand commented on PR #33: URL: https://github.com/apache/lucene-jira-archive/pull/33#issuecomment-1181587514 > I'm also converting the whole Jira issue myself; it looks like it takes several hours... (recent changes to fix conversion errors could affect the conversion speed I think). This shouldn't be so slow, raised #36. Thanks -- I was beginning to wonder if it was normal how long it was taking ;) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-jira-archive] mikemccand commented on pull request #33: Polish wording of Legacy Jira details header, and each comment footer
mikemccand commented on PR #33: URL: https://github.com/apache/lucene-jira-archive/pull/33#issuecomment-1181586644 > Sorry there should have been a "catch all" try~except clause. I made a quick fix in #35. No worries at all! No need to apologize! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-jira-archive] mocobeta commented on issue #36: Can we parallelize the converter script?
mocobeta commented on issue #36: URL: https://github.com/apache/lucene-jira-archive/issues/36#issuecomment-1181522090 https://docs.python.org/3/howto/logging-cookbook.html#logging-to-a-single-file-from-multiple-processes -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
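The cookbook recipe linked above sends log records from worker processes through a queue to a single listener, so only one process writes to the log file. A minimal sketch of that pattern, assuming a hypothetical logger name and a hypothetical `convert_issue` stand-in (neither is taken from `jira2markdown_import.py`); only `QueueHandler`, `QueueListener` and `multiprocessing.Pool` are real standard-library APIs here:

```python
import logging
import logging.handlers
import multiprocessing

LOGGER_NAME = "jira2markdown"  # hypothetical logger name


def init_worker(log_queue):
    """Pool initializer: send this process's log records to the shared queue."""
    logger = logging.getLogger(LOGGER_NAME)
    logger.handlers.clear()
    logger.addHandler(logging.handlers.QueueHandler(log_queue))
    logger.setLevel(logging.INFO)


def convert_issue(issue_number):
    """Hypothetical stand-in for converting a single Jira issue."""
    logging.getLogger(LOGGER_NAME).info("converted LUCENE-%d", issue_number)
    return issue_number


def convert_all(issue_numbers, handler, processes=4):
    """Convert issues in a process pool; only the listener thread in the parent uses `handler`."""
    log_queue = multiprocessing.Queue()
    listener = logging.handlers.QueueListener(log_queue, handler)
    listener.start()
    pool = multiprocessing.Pool(processes=processes, initializer=init_worker, initargs=(log_queue,))
    try:
        return pool.map(convert_issue, issue_numbers)
    finally:
        # close/join lets the workers exit normally so queued records are flushed
        pool.close()
        pool.join()
        listener.stop()
```

A caller would pass something like `logging.FileHandler("convert.log")` as `handler` and call `convert_all` from under an `if __name__ == "__main__":` guard, which the spawn start method on Windows and macOS requires. This uses only the standard library, so it should not be tied to Linux.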
[GitHub] [lucene-jira-archive] mikemccand commented on pull request #33: Polish wording of Legacy Jira details header, and each comment footer
mikemccand commented on PR #33: URL: https://github.com/apache/lucene-jira-archive/pull/33#issuecomment-1181589626 > It looks like a bug introduced in [cfbc821](https://github.com/apache/lucene-jira-archive/commit/cfbc821390859a7053e43028325b6bc616ec2b5b). (I have postponed testing it with the whole Jira dump.) > I'll take a look at it. Thanks for chasing this down -- I'll open a spinoff issue to track progress. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-jira-archive] mikemccand commented on pull request #33: Polish wording of Legacy Jira details header, and each comment footer
mikemccand commented on PR #33: URL: https://github.com/apache/lucene-jira-archive/pull/33#issuecomment-1181586767 And thank you for the quick fix! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-jira-archive] mikemccand opened a new issue, #37: Why are some Jira issues completely missing?
mikemccand opened a new issue, #37: URL: https://github.com/apache/lucene-jira-archive/issues/37 Spinoff from #33. This is not a blocker for migration; I'm mostly curious how Jira lost issues and how pervasive this problem might be -- maybe other Apache projects are affected? Or maybe we are doing something wrong in Lucene ;) Some Jira issues in the sequential numbering from 1 .. N just don't seem to exist, and seem to have never existed (Google searching, jirasearch, Jira's own (Lucene based!) search engine, and my long email archive all fail to find at least one of them):
```
[2022-07-11 07:57:25,815] WARNING:download_jira: Can't download LUCENE-498. status code=404, message={"errorMessages":["Issue Does Not Exist"],"errors":{}}
[2022-07-11 07:59:10,096] WARNING:download_jira: Can't download LUCENE-613. status code=404, message={"errorMessages":["Issue Does Not Exist"],"errors":{}}
[2022-07-11 07:59:10,978] WARNING:download_jira: Can't download LUCENE-614. status code=404, message={"errorMessages":["Issue Does Not Exist"],"errors":{}}
[2022-07-11 07:59:13,615] WARNING:download_jira: Can't download LUCENE-617. status code=404, message={"errorMessages":["Issue Does Not Exist"],"errors":{}}
[2022-07-11 08:10:36,059] WARNING:download_jira: Can't download LUCENE-1362. status code=404, message={"errorMessages":["Issue Does Not Exist"],"errors":{}}
[2022-07-11 08:10:36,932] WARNING:download_jira: Can't download LUCENE-1363. status code=404, message={"errorMessages":["Issue Does Not Exist"],"errors":{}}
[2022-07-11 08:10:37,798] WARNING:download_jira: Can't download LUCENE-1364. status code=404, message={"errorMessages":["Issue Does Not Exist"],"errors":{}}
[2022-07-11 08:26:22,112] WARNING:download_jira: Can't download LUCENE-2375. status code=404, message={"errorMessages":["Issue Does Not Exist"],"errors":{}}
[2022-07-11 08:27:02,304] WARNING:download_jira: Can't download LUCENE-2418. status code=404, message={"errorMessages":["Issue Does Not Exist"],"errors":{}}
```
Maybe Jira has some concurrency bug in how it numbers issues and sometimes leaves holes?
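To see how the holes cluster, the 404 warnings can be pulled out of the download log and grouped into consecutive runs. This is a small sketch, assuming the log format shown in the excerpt above; the regular expression and both helper functions are illustrative and not part of the repository.

```python
import re

# Matches the warning format shown in the log excerpt (an assumption about the full log).
MISSING = re.compile(r"Can't download LUCENE-(\d+)\. status code=404")


def missing_issue_numbers(log_lines):
    """Return the sorted issue numbers that the downloader reported as 404."""
    numbers = set()
    for line in log_lines:
        match = MISSING.search(line)
        if match:
            numbers.add(int(match.group(1)))
    return sorted(numbers)


def group_consecutive(numbers):
    """Group sorted numbers into (first, last) runs of consecutive values."""
    runs = []
    for n in numbers:
        if runs and n == runs[-1][1] + 1:
            runs[-1] = (runs[-1][0], n)
        else:
            runs.append((n, n))
    return runs


sample_log = """\
[2022-07-11 07:59:10,096] WARNING:download_jira: Can't download LUCENE-613. status code=404, message={"errorMessages":["Issue Does Not Exist"],"errors":{}}
[2022-07-11 07:59:10,978] WARNING:download_jira: Can't download LUCENE-614. status code=404, message={"errorMessages":["Issue Does Not Exist"],"errors":{}}
[2022-07-11 07:59:13,615] WARNING:download_jira: Can't download LUCENE-617. status code=404, message={"errorMessages":["Issue Does Not Exist"],"errors":{}}
[2022-07-11 08:10:36,059] WARNING:download_jira: Can't download LUCENE-1362. status code=404, message={"errorMessages":["Issue Does Not Exist"],"errors":{}}
[2022-07-11 08:10:36,932] WARNING:download_jira: Can't download LUCENE-1363. status code=404, message={"errorMessages":["Issue Does Not Exist"],"errors":{}}
[2022-07-11 08:10:37,798] WARNING:download_jira: Can't download LUCENE-1364. status code=404, message={"errorMessages":["Issue Does Not Exist"],"errors":{}}
"""
missing = missing_issue_numbers(sample_log.splitlines())
runs = group_consecutive(missing)
```

Runs such as 613..614 and 1362..1364 show that some holes are adjacent, which may be a useful detail when asking whether the gaps come from a numbering bug or from deleted issues.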
[GitHub] [lucene] jpountz commented on pull request #987: LUCENE-10627: Using CompositeByteBuf to Reduce Memory Copy
jpountz commented on PR #987: URL: https://github.com/apache/lucene/pull/987#issuecomment-1181718918 > if we only using compress method with variants ByteBuffersDataInput in LUCENE90, we can not using abstract method Compressor.compress, when we want to use other compression mode. I think that this downside is fine? We prefer codecs to evolve independently so when we start needing changes for a new codec, we prefer to fork the code so that old codecs still rely on the unchanged code (which should move to lucene/backward-codecs) while the new codecs only use the new code without carrying over legacy code. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10650) "after_effect": "no" was removed what replaces it?
[ https://issues.apache.org/jira/browse/LUCENE-10650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17565384#comment-17565384 ] Adrien Grand commented on LUCENE-10650: --- {{query.boost}} is the {{query.getBoost()}} from DFRSimilarity's {{double score(BasicStats stats, double freq, double docLen)}}, which does {{stats.getBoost() * basicModel.score(stats, tfn, aeTimes1pTfn)}}. The division by log(2) is not the tfn but a way to turn Math.log, which is a natural log (base e), into a log in base 2. I wouldn't expect latency to be higher, this should get compiled to more or less the same code that you used to rely on in DFRSimilarity. > "after_effect": "no" was removed what replaces it? > -- > > Key: LUCENE-10650 > URL: https://issues.apache.org/jira/browse/LUCENE-10650 > Project: Lucene - Core > Issue Type: Wish >Reporter: Nathan Meisels >Priority: Major > > Hi! > We have been using an old version of elasticsearch with the following > settings: > > {code:java} > "default": { > "queryNorm": "1", > "type": "DFR", > "basic_model": "in", > "after_effect": "no", > "normalization": "no" > }{code} > > I see [here|https://issues.apache.org/jira/browse/LUCENE-8015] that > "after_effect": "no" was removed. 
> In > [old|https://github.com/apache/lucene-solr/blob/releases/lucene-solr/5.5.0/lucene/core/src/java/org/apache/lucene/search/similarities/BasicModelIn.java#L33] > version score was: > {code:java} > return tfn * (float)(log2((N + 1) / (n + 0.5)));{code} > In > [new|https://github.com/apache/lucene-solr/blob/releases/lucene-solr/8.11.2/lucene/core/src/java/org/apache/lucene/search/similarities/BasicModelIn.java#L43] > version it's: > {code:java} > long N = stats.getNumberOfDocuments(); > long n = stats.getDocFreq(); > double A = log2((N + 1) / (n + 0.5)); > // basic model I should return A * tfn > // which we rewrite to A * (1 + tfn) - A > // so that it can be combined with the after effect while still guaranteeing > // that the result is non-decreasing with tfn > return A * aeTimes1pTfn * (1 - 1 / (1 + tfn)); > {code} > I tried changing {color:#172b4d}after_effect{color} to "l" but the scoring is > different than what we are used to. (We depend heavily on the exact scoring). > Do you have any advice how we can keep the same scoring as before? > Thanks -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
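The relationship between the two quoted formulas can be checked numerically. A sketch in Python, with made-up statistics: Java's `Math.log` is the natural logarithm, so dividing by log(2) yields a base-2 log. The sketch assumes that `aeTimes1pTfn` equals `1 + tfn` when the after effect contributes a factor of 1; that reading of the parameter name is an assumption here, not something confirmed in the thread.

```python
import math


def log2(x):
    """Base-2 log computed from the natural log, as the division by log(2) does."""
    return math.log(x) / math.log(2)


def basic_model_in_old(N, n, tfn):
    """Score as quoted from the 5.5.0 BasicModelIn: tfn * log2((N + 1) / (n + 0.5))."""
    return tfn * log2((N + 1) / (n + 0.5))


def basic_model_in_new(N, n, tfn, ae_times_1p_tfn):
    """Score as quoted from the 8.11.2 BasicModelIn."""
    A = log2((N + 1) / (n + 0.5))
    return A * ae_times_1p_tfn * (1 - 1 / (1 + tfn))


N, n, tfn = 1_000_000, 250, 3.2  # made-up statistics
old = basic_model_in_old(N, n, tfn)
# Assumption: an after effect of 1 means aeTimes1pTfn == 1 + tfn
new = basic_model_in_new(N, n, tfn, ae_times_1p_tfn=1 + tfn)
```

Under that assumption the new expression reduces to A * (1 + tfn) * tfn / (1 + tfn) = A * tfn, which is the old score up to floating-point rounding. Whether any after effect available in the newer version actually produces that factor is a separate question that this sketch does not answer.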
[jira] [Commented] (LUCENE-10628) Enable MatchingFacetSetCounts to use space partitioning data structures
[ https://issues.apache.org/jira/browse/LUCENE-10628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17565944#comment-17565944 ] Marc D'Mello commented on LUCENE-10628: --- Thanks for taking a look! As for the answer to your question - I'm not sure, it's really up to the user. The facet set matchers can theoretically have an unlimited number of dimensions. Just to make sure we are on the same page, I'll just define the relevant parts of the {{facetset}} package API (apologies if I'm just repeating information that you already know here): Essentially we store multi-dim points into a {{BinaryDocValues}} field, so for example a list like {{(1, 2, 3), (2, 3, 4), (3, 4, 5)...}}. We have an {{ExactFacetSetMatcher}} that represents a single point of the same dimension as the field, and we will count how many points in the BDV match that point represented by the {{ExactFacetSetMatcher}}. We can put multiple of these {{ExactFacetSetMatcher}}'s into a group in {{MatchingFacetSetCounts}} and count how many points in the BDV matched each {{ExactFacetSetMatcher}}. Currently, we linearly scan each point through each {{ExactFacetSetMatcher}} to get the counts, which is the part I want to optimize by putting the {{ExactFacetSetMatcher}}'s into a space partitioning data structure (either putting these into a KD tree, or as you suggested an interval tree). We also have {{RangeFacetSetMatcher}} which is similar to {{ExactFacetSetMatcher}}, except you can define ranges per dimension, for example something like {{(1 - 3, 3 - 4, 4 - 6)}} which would match if all of a point's dimensions lie within the range. I imagine you can put a group of these {{RangeFacetSetMatcher}}'s into an R tree to avoid linear scanning. So I'd imagine most use cases would be with a low dimensionality, but there might be some use cases that require higher dimensions. In the higher dimension case, would it just be best to resort to linear scanning then rather than building a KD tree? 
For the {{RangeFacetSetMatcher}} case, would bulk adding these into an R tree also be too complex? In the common use case, there would be many more points in the index than {{FacetSetMatcher}}'s, so another approach could also be to index the points in a KD tree. Sorry for all the questions! But I would be really interested in any suggestions you have here as I am inexperienced with these kinds of data structures. > Enable MatchingFacetSetCounts to use space partitioning data structures > --- > > Key: LUCENE-10628 > URL: https://issues.apache.org/jira/browse/LUCENE-10628 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Marc D'Mello >Priority: Minor > > Currently, {{MatchingFacetSetCounts}} iterates over {{FacetSetMatcher}} > instances passed into it linearly. While this is fine in some cases, if we > have a large number of {{FacetSetMatcher}}'s, this can be inefficient. We > should provide the option to users to enable the use of space partitioning > data structures (namely R trees and KD trees) so we can potentially scan over > these {{FacetSetMatcher}}'s in sub-linear time. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
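To make the baseline in the LUCENE-10628 comment concrete, here is a minimal Python sketch of the linear scan it describes: every stored point is tested against every matcher. It is not Lucene code. The function names, the tuple representation of points, and the inclusive range bounds are assumptions made for illustration only.

```python
from typing import Sequence, Tuple

Point = Tuple[int, ...]
Range = Tuple[int, int]  # (low, high); treated as inclusive here, which is an assumption


def count_exact_linear(points: Sequence[Point], matchers: Sequence[Point]) -> list:
    """Count, per exact matcher, how many points equal it (O(points * matchers))."""
    counts = [0] * len(matchers)
    for point in points:
        for i, matcher in enumerate(matchers):
            if point == matcher:
                counts[i] += 1
    return counts


def count_range_linear(points: Sequence[Point],
                       matchers: Sequence[Sequence[Range]]) -> list:
    """Count, per range matcher, how many points fall inside it on every dimension."""
    counts = [0] * len(matchers)
    for point in points:
        for i, ranges in enumerate(matchers):
            if all(low <= value <= high for value, (low, high) in zip(point, ranges)):
                counts[i] += 1
    return counts


# The example values from the comment above.
points = [(1, 2, 3), (2, 3, 4), (3, 4, 5)]
exact_counts = count_exact_linear(points, [(2, 3, 4), (9, 9, 9)])
range_counts = count_range_linear(points, [[(1, 3), (3, 4), (4, 6)]])
```

The inner loop over matchers is the part a KD tree or R tree would replace. With only a handful of matchers the plain scan is likely hard to beat.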
[jira] [Commented] (LUCENE-10577) Quantize vector values
[ https://issues.apache.org/jira/browse/LUCENE-10577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17565919#comment-17565919 ] Julie Tibshirani commented on LUCENE-10577: --- I wasn't suggesting making it entirely an internal detail, I just suggested moving the VectorEncoding configuration from FieldInfo (where it currently is in your PR) to the Lucene93HnswVectorsFormat constructor. It would still be user-configurable and have a good default, just like maxConn and beamWidth. I think I agree it would be complicated (and maybe unclear for users) if we tried to do it under-the-hood with no user config at all. > Quantize vector values > -- > > Key: LUCENE-10577 > URL: https://issues.apache.org/jira/browse/LUCENE-10577 > Project: Lucene - Core > Issue Type: Improvement > Components: core/codecs >Reporter: Michael Sokolov >Priority: Major > Time Spent: 2h > Remaining Estimate: 0h > > The {{KnnVectorField}} API handles vectors with 4-byte floating point values. > These fields can be used (via {{KnnVectorsReader}}) in two main ways: > 1. The {{VectorValues}} iterator enables retrieving values > 2. Approximate nearest-neighbor search > The main point of this addition was to provide the search capability, and to > support that it is not really necessary to store vectors in full precision. > Perhaps users may also be willing to retrieve values in lower precision for > whatever purpose those serve, if they are able to store more samples. We know > that 8 bits is enough to provide a very near approximation to the same > recall/performance tradeoff that is achieved with the full-precision vectors. > I'd like to explore how we could enable 4:1 compression of these fields by > reducing their precision. > A few ways I can imagine this would be done: > 1. Provide a parallel byte-oriented API. This would allow users to provide > their data in reduced-precision format and give control over the quantization > to them. 
It would have a major impact on the Lucene API surface though, > essentially requiring us to duplicate all of the vector APIs. > 2. Automatically quantize the stored vector data when we can. This would > require no or perhaps very limited change to the existing API to enable the > feature. > I've been exploring (2), and what I find is that we can achieve very good > recall results using dot-product similarity scoring by simple linear scaling > + quantization of the vector values, so long as we choose the scale that > minimizes the quantization error. Dot-product is amenable to this treatment > since vectors are required to be unit-length when used with that similarity > function. > Even still there is variability in the ideal scale over different data sets. > A good choice seems to be max(abs(min-value), abs(max-value)), but of course > this assumes that the data set doesn't have a few outlier data points. A > theoretical range can be obtained by 1/sqrt(dimension), but this is only > useful when the samples are normally distributed. We could in theory > determine the ideal scale when flushing a segment and manage this > quantization per-segment, but then numerical error could creep in when > merging. > I'll post a patch/PR with an experimental setup I've been using for > evaluation purposes. It is pretty self-contained and simple, but has some > drawbacks that need to be addressed: > 1. No automated mechanism for determining quantization scale (it's a constant > that I have been playing with) > 2. Converts from byte/float when computing dot-product instead of directly > computing on byte values > I'd like to get people's feedback on the approach and whether in general we > should think about doing this compression under the hood, or expose a > byte-oriented API. Whatever we do I think a 4:1 compression ratio is pretty > compelling and we should pursue something. 
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
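The LUCENE-10577 description above talks about "simple linear scaling + quantization" with a scale of max(abs(min-value), abs(max-value)). The following is a rough Python sketch of that idea, not the code from the PR. The function names, the [-127, 127] target range, and the rounding mode are assumptions for illustration.

```python
import math


def normalize(vector):
    """Scale a vector to unit length, as dot-product similarity expects."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def choose_scale(vectors):
    """max(abs(min-value), abs(max-value)) over the whole data set."""
    lowest = min(min(v) for v in vectors)
    highest = max(max(v) for v in vectors)
    return max(abs(lowest), abs(highest))


def quantize(vector, scale):
    """Map floats in [-scale, scale] to integers in [-127, 127] (one byte each)."""
    quantized = []
    for value in vector:
        q = round(value / scale * 127)
        quantized.append(max(-127, min(127, q)))  # clamp outliers beyond the scale
    return quantized


def quantized_dot(qa, qb, scale):
    """Approximate the float dot product from the byte values alone."""
    step = scale / 127  # size of one quantization step
    return sum(x * y for x, y in zip(qa, qb)) * step * step


a = normalize([1.0, 2.0, 3.0, 4.0])
b = normalize([4.0, 3.0, 2.0, 1.0])
scale = choose_scale([a, b])
exact = sum(x * y for x, y in zip(a, b))
approx = quantized_dot(quantize(a, scale), quantize(b, scale), scale)
```

Because the scale is taken from the extreme values, a single outlier stretches the range and wastes precision for every other value, which is the caveat the description raises.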
[jira] [Comment Edited] (LUCENE-10577) Quantize vector values
[ https://issues.apache.org/jira/browse/LUCENE-10577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17565919#comment-17565919 ] Julie Tibshirani edited comment on LUCENE-10577 at 7/12/22 4:23 PM: I wasn't suggesting making it entirely an internal detail, I just suggested moving the VectorEncoding configuration from FieldInfo (where it currently is in your PR) to the Lucene93HnswVectorsFormat constructor. It would still be user-configurable and have a good default, just like maxConn and beamWidth. I think I agree it would be complicated (and maybe unclear for users) if we tried to do it under-the-hood with no user config at all. Edit: and yes, exactly -- PerFieldKnnVectorsFormat is what allows you to have different format configuration parameters for different vector fields. You can use it, for example, to set maxConn=16 for one field, and maxConn=32 for some other field.
[jira] [Commented] (LUCENE-10577) Quantize vector values
[ https://issues.apache.org/jira/browse/LUCENE-10577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17565871#comment-17565871 ] Julie Tibshirani commented on LUCENE-10577: --- I checked out the latest PR changes, and I like the direction of using a new VectorEncoding class rather than squeezing this into VectorSimilarityFunction. I wonder if VectorEncoding should be a parameter on Lucene93HnswVectorsFormat though (alongside maxConn/ beamWidth) rather than on FieldInfo. My reasoning: FieldInfo contains information relevant to API consumers... but we want the encoding to be an internal detail of the format. Moreover, adding it to FieldInfo implies that all other vectors format implementations should support it. This could place more burden on implementing new formats. (I think this was part of the objection to me adding cosine similarity, that it increases the codec surface area without enough benefit? We discussed this in LUCENE-10191.)
[GitHub] [lucene] jpountz merged pull request #966: LUCENE-10619: Optimize the writeBytes in TermsHashPerField
jpountz merged PR #966: URL: https://github.com/apache/lucene/pull/966
[GitHub] [lucene-jira-archive] mikemccand opened a new issue, #38: StackOverflowException on certain issue descriptions and comment text
mikemccand opened a new issue, #38: URL: https://github.com/apache/lucene-jira-archive/issues/38 Spinoff from #33. Some issues' text hit a stack overflow exception, e.g. one of the comments on LUCENE-550:
```
(.venv) beast3:migration[polish_legacy_jira]$ python src/jira2github_import.py --min 1 --max 10649
[2022-07-11 15:01:02,826] INFO:jira2github_import: Converting Jira issues to GitHub issues in /l/jira-github-migration/migration/github-import-data
[2022-07-11 15:10:25,306] WARNING:jira2github_import: Jira dump file not found: /l/jira-github-migration/migration/jira-dump/LUCENE-498.json
ERROR: unhandled exception while converting LUCENE-550
Traceback (most recent call last):
  File "/l/jira-github-migration/migration/src/jira2github_import.py", line 229, in
    convert_issue(num, dump_dir, output_dir, account_map, github_att_repo, github_att_branch)
  File "/l/jira-github-migration/migration/src/jira2github_import.py", line 133, in convert_issue
    comment_body = f"""{convert_text(comment_body, att_replace_map, account_map)}
  File "/l/jira-github-migration/migration/src/jira_util.py", line 216, in convert_text
    text = jira2markdown.convert(text, elements=elements)
  File "/l/jira-github-migration/.venv/lib/python3.10/site-packages/jira2markdown/parser.py", line 20, in convert
    return markup.transformString(text)
  File "/l/jira-github-migration/.venv/lib/python3.10/site-packages/pyparsing.py", line 2059, in transformString
    for t, s, e in self.scanString(instring):
  File "/l/jira-github-migration/.venv/lib/python3.10/site-packages/pyparsing.py", line 2007, in scanString
    nextLoc, tokens = parseFn(instring, preloc, callPreParse=False)
  File "/l/jira-github-migration/.venv/lib/python3.10/site-packages/pyparsing.py", line 1683, in _parseNoCache
    loc, tokens = self.parseImpl(instring, preloc, doActions)
  File "/l/jira-github-migration/.venv/lib/python3.10/site-packages/pyparsing.py", line 4462, in parseImpl
    return self.expr._parse(instring, loc, doActions, callPreParse=False)
```
[GitHub] [lucene-jira-archive] mikemccand commented on issue #38: StackOverflowException on certain issue descriptions and comment text
mikemccand commented on issue #38: URL: https://github.com/apache/lucene-jira-archive/issues/38#issuecomment-1181596940 Note that it is pretty rare -- when I ran the full conversion, I saw four separate occurrences. Might not be so important to track down? We can just carry over the raw text, escaped in MD code block, in such cases.
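The fallback suggested in the comment above (carry the raw Jira text over inside a Markdown code block when conversion blows the stack) could look roughly like the sketch below. This is not the code that was pushed to the migration scripts. `convert_with_fallback` and its `convert` parameter are hypothetical names, and catching only `RecursionError` (how Python reports runaway recursion) is an assumption.

```python
from typing import Callable


def convert_with_fallback(text: str, convert: Callable[[str], str]) -> str:
    """Run `convert`; if it recurses too deeply, return the raw text in a code fence."""
    try:
        return convert(text)
    except RecursionError:
        fence = "`" * 3
        # If the raw text already contains a fence, use a longer one so it cannot
        # terminate the block early.
        while fence in text:
            fence += "`"
        return f"{fence}\n{text}\n{fence}"


def always_overflows(text: str) -> str:
    """Stand-in for a converter that recurses without bound on some input."""
    return always_overflows(text)


ok_result = convert_with_fallback("plain text", str.upper)
fallback_result = convert_with_fallback("* item\n** nested", always_overflows)
```

Passing the converter in as a parameter keeps the sketch independent of jira2markdown. In the real script the call would presumably wrap `convert_text`.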
[jira] [Commented] (LUCENE-10619) Optimize the writeBytes in TermsHashPerField
[ https://issues.apache.org/jira/browse/LUCENE-10619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17565872#comment-17565872 ] ASF subversion and git services commented on LUCENE-10619: -- Commit d7c2def019b8c1318d3c37a7065569e8d1a1af1f in lucene's branch refs/heads/main from tang donghai [ https://gitbox.apache.org/repos/asf?p=lucene.git;h=d7c2def019b ] LUCENE-10619: Optimize the writeBytes in TermsHashPerField (#966) > Optimize the writeBytes in TermsHashPerField > > > Key: LUCENE-10619 > URL: https://issues.apache.org/jira/browse/LUCENE-10619 > Project: Lucene - Core > Issue Type: Improvement > Components: core/index >Affects Versions: 9.2 >Reporter: tang donghai >Priority: Major > Time Spent: 1.5h > Remaining Estimate: 0h > > Because we don't know the length of the slice, writeBytes will always write bytes > one after another instead of writing a block of bytes. > Maybe we could return both offset and length in ByteBlockPool#allocSlice? > 1. BYTE_BLOCK_SIZE is 32768, offset is at most 15 bits. > 2. slice size is at most 200, so it could fit in 8 bits. > So we could put them together into an int offset | length > There are only two places where this function is used, so the cost of changing it > is relatively small. > When allocSlice could return the offset and length of new Slice, we could > change writeBytes like below > {code:java} > // write block of bytes each time > while(remaining > 0 ) { >int offsetAndLength = allocSlice(bytes, offset); >length = min(remaining, (offsetAndLength & 0xff) - 1); >offset = offsetAndLength >> 8; >System.arraycopy(src, srcPos, bytePool.buffer, offset, length); >remaining -= length; >offset+= (length + 1); > } > {code} > If it could work, I'd like to raise a PR.
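The "offset | length" packing proposed in the LUCENE-10619 description can be checked with a few lines of arithmetic. The sketch below is in Python rather than Java and is not the merged change. `pack` and `unpack` are hypothetical helper names, and the 8-bit length field mirrors the `& 0xff` and `>> 8` in the quoted snippet.

```python
LENGTH_BITS = 8                        # slice size is at most 200, so 8 bits suffice
LENGTH_MASK = (1 << LENGTH_BITS) - 1   # 0xff


def pack(offset: int, length: int) -> int:
    """Combine an offset and a slice length into one integer: offset << 8 | length."""
    if offset < 0:
        raise ValueError("offset must be non-negative")
    if not 0 <= length <= LENGTH_MASK:
        raise ValueError("length does not fit in 8 bits")
    return (offset << LENGTH_BITS) | length


def unpack(packed: int) -> tuple:
    """Recover (offset, length) from a packed value."""
    return packed >> LENGTH_BITS, packed & LENGTH_MASK


# Largest values named in the description: a 15-bit offset and a 200-byte slice.
packed = pack(32767, 200)
offset, length = unpack(packed)
```

A 15-bit offset plus an 8-bit length needs 23 bits, so the packed value fits comfortably in a signed 32-bit Java `int`.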
[GitHub] [lucene-jira-archive] mikemccand commented on pull request #33: Polish wording of Legacy Jira details header, and each comment footer
mikemccand commented on PR #33: URL: https://github.com/apache/lucene-jira-archive/pull/33#issuecomment-1181657754 I pushed a small change to make a best-effort when we hit exceptions from the converter. Such comments look like this: https://github.com/mikemccand/stargazers-migration-test/issues/52#issuecomment-1181652126 But hopefully this new code never runs w/ @mocobeta's better fix for the infinite / slow recursion.
[GitHub] [lucene-jira-archive] mikemccand commented on a diff in pull request #39: Stack overflows can occur when parsing Jira lists
mikemccand commented on code in PR #39: URL: https://github.com/apache/lucene-jira-archive/pull/39#discussion_r919015037 ## migration/src/markup/lists.py: ## @@ -40,6 +40,11 @@ def action(self, tokens: ParseResults) -> str: for line in tokens: # print(repr(line)) +if line == "\n": +# can't really explain but if this is the first item, an empty string should be added to preserve line feed Review Comment: LOL
[GitHub] [lucene] tang-hi commented on pull request #966: LUCENE-10619: Optimize the writeBytes in TermsHashPerField
tang-hi commented on PR #966: URL: https://github.com/apache/lucene/pull/966#issuecomment-1181886902 @jpountz thanks for the suggestion 😄 . I have changed testWriteBytes to write small chunks each time
[GitHub] [lucene-jira-archive] mocobeta commented on issue #38: StackOverflowException on certain issue descriptions and comment text
mocobeta commented on issue #38: URL: https://github.com/apache/lucene-jira-archive/issues/38#issuecomment-1181803770 I'll merge it once I've confirmed it parses all Jira issues without any errors. (I think nobody can review the quick and dirty fix...)
[GitHub] [lucene-jira-archive] mikemccand commented on pull request #33: Polish wording of Legacy Jira details header, and each comment footer
mikemccand commented on PR #33: URL: https://github.com/apache/lucene-jira-archive/pull/33#issuecomment-1181662032 OK don't merge this -- I somehow messed up and slurped in unrelated (already previously committed/pushed) changes. I have to drop off for now but will try to fix this a bit later ;)
[GitHub] [lucene-jira-archive] mocobeta commented on pull request #39: Stack overflows can occur when parsing Jira lists
mocobeta commented on PR #39: URL: https://github.com/apache/lucene-jira-archive/pull/39#issuecomment-1181804695 Thank you @mikemccand
[jira] [Resolved] (LUCENE-10619) Optimize the writeBytes in TermsHashPerField
[ https://issues.apache.org/jira/browse/LUCENE-10619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand resolved LUCENE-10619. --- Fix Version/s: 9.3 Resolution: Fixed
[jira] [Commented] (LUCENE-10619) Optimize the writeBytes in TermsHashPerField
[ https://issues.apache.org/jira/browse/LUCENE-10619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17565873#comment-17565873 ] ASF subversion and git services commented on LUCENE-10619: -- Commit 9f9786122b487f992119f45c5d8a51a8d9d4a6f8 in lucene's branch refs/heads/branch_9x from tang donghai [ https://gitbox.apache.org/repos/asf?p=lucene.git;h=9f9786122b4 ] LUCENE-10619: Optimize the writeBytes in TermsHashPerField (#966)
[GitHub] [lucene-jira-archive] mikemccand commented on pull request #33: Polish wording of Legacy Jira details header, and each comment footer
mikemccand commented on PR #33: URL: https://github.com/apache/lucene-jira-archive/pull/33#issuecomment-1181660019 Sorry -- not pushed to the PR yet -- struggling w/ git ;)
[GitHub] [lucene] jpountz commented on a diff in pull request #987: LUCENE-10627: Using CompositeByteBuf to Reduce Memory Copy
jpountz commented on code in PR #987: URL: https://github.com/apache/lucene/pull/987#discussion_r918752313 ## lucene/core/src/java/org/apache/lucene/codecs/compressing/CompressionMode.java: ## @@ -257,9 +270,13 @@ private static class DeflateCompressor extends Compressor { } @Override -public void compress(byte[] bytes, int off, int len, DataOutput out) throws IOException { +public void compress(ByteBuffersDataInput buffersInput, int off, int len, DataOutput out) Review Comment: Should we remove `off` and `len` and rely on callers to create a `ByteBuffersDataInput#slice` if they only need to compress a subset of the input?
[jira] [Commented] (LUCENE-10577) Quantize vector values
[ https://issues.apache.org/jira/browse/LUCENE-10577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17566015#comment-17566015 ] Michael Sokolov commented on LUCENE-10577: -- OK, that makes sense to me – I'll see about moving the setting to the `Lucene93HnswVectorsFormat` > Quantize vector values > -- > > Key: LUCENE-10577 > URL: https://issues.apache.org/jira/browse/LUCENE-10577 > Project: Lucene - Core > Issue Type: Improvement > Components: core/codecs >Reporter: Michael Sokolov >Priority: Major > Time Spent: 2h > Remaining Estimate: 0h > > The {{KnnVectorField}} api handles vectors with 4-byte floating point values. > These fields can be used (via {{KnnVectorsReader}}) in two main ways: > 1. The {{VectorValues}} iterator enables retrieving values > 2. Approximate nearest -neighbor search > The main point of this addition was to provide the search capability, and to > support that it is not really necessary to store vectors in full precision. > Perhaps users may also be willing to retrieve values in lower precision for > whatever purpose those serve, if they are able to store more samples. We know > that 8 bits is enough to provide a very near approximation to the same > recall/performance tradeoff that is achieved with the full-precision vectors. > I'd like to explore how we could enable 4:1 compression of these fields by > reducing their precision. > A few ways I can imagine this would be done: > 1. Provide a parallel byte-oriented API. This would allow users to provide > their data in reduced-precision format and give control over the quantization > to them. It would have a major impact on the Lucene API surface though, > essentially requiring us to duplicate all of the vector APIs. > 2. Automatically quantize the stored vector data when we can. This would > require no or perhaps very limited change to the existing API to enable the > feature. 
> I've been exploring (2), and what I find is that we can achieve very good > recall results using dot-product similarity scoring by simple linear scaling > + quantization of the vector values, so long as we choose the scale that > minimizes the quantization error. Dot-product is amenable to this treatment > since vectors are required to be unit-length when used with that similarity > function. > Even still there is variability in the ideal scale over different data sets. > A good choice seems to be max(abs(min-value), abs(max-value)), but of course > this assumes that the data set doesn't have a few outlier data points. A > theoretical range can be obtained by 1/sqrt(dimension), but this is only > useful when the samples are normally distributed. We could in theory > determine the ideal scale when flushing a segment and manage this > quantization per-segment, but then numerical error could creep in when > merging. > I'll post a patch/PR with an experimental setup I've been using for > evaluation purposes. It is pretty self-contained and simple, but has some > drawbacks that need to be addressed: > 1. No automated mechanism for determining quantization scale (it's a constant > that I have been playing with) > 2. Converts from byte/float when computing dot-product instead of directly > computing on byte values > I'd like to get people's feedback on the approach and whether in general we > should think about doing this compression under the hood, or expose a > byte-oriented API. Whatever we do I think a 4:1 compression ratio is pretty > compelling and we should pursue something. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
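The linear scaling and quantization described in the issue can be sketched as follows. This is a minimal illustration under stated assumptions, not the code from the patch: the class and method names are hypothetical, the scale is passed in by the caller (for example max(abs(min-value), abs(max-value)) as suggested above), and values are mapped to signed bytes in [-127, 127].

```java
public final class ByteQuantizationSketch {
  /** Returns max(|min|, |max|) over the values, the scale heuristic mentioned in the issue. */
  public static float maxAbs(float[] v) {
    float m = 0f;
    for (float x : v) {
      m = Math.max(m, Math.abs(x));
    }
    return m;
  }

  /** Linearly maps each value from [-scale, scale] to a signed byte in [-127, 127]. */
  public static byte[] quantize(float[] v, float scale) {
    byte[] out = new byte[v.length];
    if (scale == 0f) {
      return out; // all-zero input: nothing to scale
    }
    for (int i = 0; i < v.length; i++) {
      // Clamp outliers so they cannot overflow the byte range.
      float clamped = Math.max(-scale, Math.min(scale, v[i]));
      out[i] = (byte) Math.round(clamped / scale * 127f);
    }
    return out;
  }

  /** Dot product computed directly on bytes, rescaled to approximate the float dot product. */
  public static float dotProduct(byte[] a, byte[] b, float scale) {
    int sum = 0;
    for (int i = 0; i < a.length; i++) {
      sum += a[i] * b[i];
    }
    float unit = scale / 127f; // float value represented by one quantization step
    return sum * unit * unit;
  }
}
```

Computing the dot product on the byte values and rescaling once at the end avoids the per-element byte/float conversion listed as a drawback above; whether that matches the eventual implementation is not something this sketch claims.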
[jira] [Commented] (LUCENE-10471) Increase the number of dims for KNN vectors to 2048
[ https://issues.apache.org/jira/browse/LUCENE-10471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17566026#comment-17566026 ] Mayya Sharipova commented on LUCENE-10471: -- [~sstolpovskiy] [~sokolov] Thanks for providing your suggestions. It looks like we clearly see the need for up to 2048 dims for images, so I will be merging the linked PR. > Increase the number of dims for KNN vectors to 2048 > --- > > Key: LUCENE-10471 > URL: https://issues.apache.org/jira/browse/LUCENE-10471 > Project: Lucene - Core > Issue Type: Wish >Reporter: Mayya Sharipova >Priority: Trivial > Time Spent: 40m > Remaining Estimate: 0h > > The current maximum allowed number of dimensions is equal to 1024. But we see > in practice a couple well-known models that produce vectors with > 1024 > dimensions (e.g > [mobilenet_v2|https://tfhub.dev/google/imagenet/mobilenet_v2_035_224/feature_vector/1] > uses 1280d vectors, OpenAI / GPT-3 Babbage uses 2048d vectors). Increasing > max dims to `2048` will satisfy these use cases. > I am wondering if anybody has strong objections against this. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Created] (LUCENE-10654) New companion doc value format for LatLonShape and XYShape field types
Nick Knize created LUCENE-10654: --- Summary: New companion doc value format for LatLonShape and XYShape field types Key: LUCENE-10654 URL: https://issues.apache.org/jira/browse/LUCENE-10654 Project: Lucene - Core Issue Type: New Feature Reporter: Nick Knize {{XYDocValuesField}} provides doc value support for {{XYPoint}}. {{LatLonDocValuesField}} provides docvalue support for {{LatLonPoint}}. However, neither {{LatLonShape}} nor {{XYShape}} currently have a docvalue format. This lack of doc value support for shapes means facets, aggregations, and IndexOrDocValues queries are currently not possible for Shape field types. This gap needs be closed in lucene. To support IndexOrDocValues queries along with various geometry aggregations and facets, the ability to compute the spatial relation with the doc value is needed. This is straightforward with {{XYPoint}} and {{LatLonPoint}} since the doc value encoding is nothing more than a simple 2D integer encoding of the x,y and lat,lon dimensional components. Accomplishing the same with a naive integer encoded binary representation for N-vertex shapes would be costly. {{ComponentTree}} already provides an efficient in memory structure for quickly computing spatial relations over Shape types based on a binary tree of tessellated triangles provided by the {{Tessellator}}. Furthermore, this tessellation is already computed at index time. If we create an on-disk representation of {{ComponentTree}}s binary tree of tessellated triangles and use this as the doc value {{binaryValue}} format we will be able to efficiently compute spatial relations with this binary representation and achieve the same facet/aggregation result over shapes as we can with points today (e.g., grid facets, centroid, area, etc). -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Updated] (LUCENE-10654) New companion doc value format for LatLonShape and XYShape field types
[ https://issues.apache.org/jira/browse/LUCENE-10654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Knize updated LUCENE-10654: Description: {{XYDocValuesField}} provides doc value support for {{XYPoint}}. {{LatLonDocValuesField}} provides docvalue support for {{LatLonPoint}}. However, neither {{LatLonShape}} nor {{XYShape}} currently have a docvalue format. This lack of doc value support for shapes means facets, aggregations, and IndexOrDocValues queries are currently not possible for Shape field types. This gap needs be closed in lucene. To support IndexOrDocValues queries along with various geometry aggregations and facets, the ability to compute the spatial relation with the doc value is needed. This is straightforward with {{XYPoint}} and {{LatLonPoint}} since the doc value encoding is nothing more than a simple 2D integer encoding of the x,y and lat,lon dimensional components. Accomplishing the same with a naive integer encoded binary representation for N-vertex shapes would be costly. {{ComponentTree}} already provides an efficient in memory structure for quickly computing spatial relations over Shape types based on a binary tree of tessellated triangles provided by the {{Tessellator}}. Furthermore, this tessellation is already computed at index time. If we create an on-disk representation of {{ComponentTree}} 's binary tree of tessellated triangles and use this as the doc value {{binaryValue}} format we will be able to efficiently compute spatial relations with this binary representation and achieve the same facet/aggregation result over shapes as we can with points today (e.g., grid facets, centroid, area, etc). was: {{XYDocValuesField}} provides doc value support for {{XYPoint}}. {{LatLonDocValuesField}} provides docvalue support for {{LatLonPoint}}. However, neither {{LatLonShape}} nor {{XYShape}} currently have a docvalue format. 
This lack of doc value support for shapes means facets, aggregations, and IndexOrDocValues queries are currently not possible for Shape field types. This gap needs be closed in lucene. To support IndexOrDocValues queries along with various geometry aggregations and facets, the ability to compute the spatial relation with the doc value is needed. This is straightforward with {{XYPoint}} and {{LatLonPoint}} since the doc value encoding is nothing more than a simple 2D integer encoding of the x,y and lat,lon dimensional components. Accomplishing the same with a naive integer encoded binary representation for N-vertex shapes would be costly. {{ComponentTree}} already provides an efficient in memory structure for quickly computing spatial relations over Shape types based on a binary tree of tessellated triangles provided by the {{Tessellator}}. Furthermore, this tessellation is already computed at index time. If we create an on-disk representation of {{ComponentTree}}s binary tree of tessellated triangles and use this as the doc value {{binaryValue}} format we will be able to efficiently compute spatial relations with this binary representation and achieve the same facet/aggregation result over shapes as we can with points today (e.g., grid facets, centroid, area, etc). > New companion doc value format for LatLonShape and XYShape field types > -- > > Key: LUCENE-10654 > URL: https://issues.apache.org/jira/browse/LUCENE-10654 > Project: Lucene - Core > Issue Type: New Feature >Reporter: Nick Knize >Priority: Major > > {{XYDocValuesField}} provides doc value support for {{XYPoint}}. > {{LatLonDocValuesField}} provides docvalue support for {{LatLonPoint}}. > However, neither {{LatLonShape}} nor {{XYShape}} currently have a docvalue > format. > This lack of doc value support for shapes means facets, aggregations, and > IndexOrDocValues queries are currently not possible for Shape field types. > This gap needs be closed in lucene. 
> To support IndexOrDocValues queries along with various geometry aggregations > and facets, the ability to compute the spatial relation with the doc value is > needed. This is straightforward with {{XYPoint}} and {{LatLonPoint}} since > the doc value encoding is nothing more than a simple 2D integer encoding of > the x,y and lat,lon dimensional components. Accomplishing the same with a > naive integer encoded binary representation for N-vertex shapes would be > costly. > {{ComponentTree}} already provides an efficient in memory structure for > quickly computing spatial relations over Shape types based on a binary tree > of tessellated triangles provided by the {{Tessellator}}. Furthermore, this > tessellation is already computed at index time. If we create an on-disk > representation of {{ComponentTree}} 's binary tree of tessellated triangles > and use this as the doc valu
[GitHub] [lucene] Yuti-G commented on a diff in pull request #1013: LUCENE-10644: Facets#getAllChildren testing should ignore child order
Yuti-G commented on code in PR #1013: URL: https://github.com/apache/lucene/pull/1013#discussion_r919502708 ## lucene/facet/src/test/org/apache/lucene/facet/FacetTestCase.java: ## @@ -264,4 +264,24 @@ protected void assertFloatValuesEquals(FacetResult a, FacetResult b) { a.labelValues[i].value.floatValue() / 1e5); } } + + protected void assertNumericValuesEquals(Number a, Number b) { +assertTrue(a.getClass().isInstance(b)); +if (a instanceof Float) { + assertEquals(a.floatValue(), b.floatValue(), a.floatValue() / 1e5); +} else if (a instanceof Double) { + assertEquals(a.doubleValue(), b.doubleValue(), a.doubleValue() / 1e5); +} else { + assertEquals(a, b); +} + } + + protected void assertAllChildrenEqualsWithoutOrdering(FacetResult a, FacetResult b) { Review Comment: Thanks for the feedback! I addressed it in the new commit. Since we renamed the method to a generic name `assertFacetResult`, I added a comment `// assert children equal with no assumption of the children ordering` to inform future users in case they try to use this assert method but care about children ordering e.g., getTopChildren. Please let me know what you think. Thanks! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
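The idea of asserting that children are equal without assuming an order can be sketched outside of Lucene as follows. This is a hypothetical, self-contained illustration: `LabelValue` is a stand-in for a facet child and is not Lucene's own class, and the relative tolerance of value / 1e5 simply mirrors the diff quoted above.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public final class UnorderedChildrenSketch {
  /** Hypothetical stand-in for a facet child: a label and its numeric value. */
  public static final class LabelValue {
    final String label;
    final double value;

    public LabelValue(String label, double value) {
      this.label = label;
      this.value = value;
    }
  }

  /** Returns true if both lists hold the same children, regardless of their order. */
  public static boolean sameChildrenIgnoringOrder(List<LabelValue> a, List<LabelValue> b) {
    if (a.size() != b.size()) {
      return false;
    }
    // Sort copies by label so the inputs are left untouched.
    List<LabelValue> sortedA = new ArrayList<>(a);
    List<LabelValue> sortedB = new ArrayList<>(b);
    Comparator<LabelValue> byLabel = Comparator.comparing(lv -> lv.label);
    sortedA.sort(byLabel);
    sortedB.sort(byLabel);
    for (int i = 0; i < sortedA.size(); i++) {
      LabelValue x = sortedA.get(i);
      LabelValue y = sortedB.get(i);
      double delta = Math.abs(x.value) / 1e5; // relative tolerance, as in the diff
      if (!x.label.equals(y.label) || Math.abs(x.value - y.value) > delta) {
        return false;
      }
    }
    return true;
  }
}
```

A check like this is only appropriate where order carries no meaning, which is the caveat the added code comment is meant to convey for callers that do care about ordering.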
[GitHub] [lucene] nknize opened a new pull request, #1017: LUCENE-10654: Add new ShapeDocValuesField for LatLonShape and XYShape
nknize opened a new pull request, #1017: URL: https://github.com/apache/lucene/pull/1017 Adds new doc value field to support LatLonShape and XYShape doc values. The implementation is inspired by ComponentTree. A binary tree of tessellated components (point, line, or triangle) is created. This tree is then DFS serialized to a variable compressed DataOutput buffer to keep the doc value format as compact as possible. DocValue queries are performed on the serialized tree using a similar component relation logic as found in SpatialQuery for BKD indexed shapes. To make this possible some of the relation logic is refactored to make it accessible to the doc value query counterpart. Current limitations (to be addressed in follow up PR) 1. Only Polygon Doc Values are tested 2. CONTAINS relation not yet supported 3. Only BoundingBox queries are supported (General Geometry Queries will be added in a follow on enhancement PR) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10649) Failure in TestDemoParallelLeafReader.testRandomMultipleSchemaGensSameField
[ https://issues.apache.org/jira/browse/LUCENE-10649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17566071#comment-17566071 ] Vigya Sharma commented on LUCENE-10649: --- Great, thanks for confirming Adrien. I'll open a PR with the fix. > Failure in TestDemoParallelLeafReader.testRandomMultipleSchemaGensSameField > --- > > Key: LUCENE-10649 > URL: https://issues.apache.org/jira/browse/LUCENE-10649 > Project: Lucene - Core > Issue Type: Bug >Reporter: Vigya Sharma >Priority: Major > > Failing Build Link: > [https://jenkins.thetaphi.de/job/Lucene-main-Linux/35617/testReport/junit/org.apache.lucene.index/TestDemoParallelLeafReader/testRandomMultipleSchemaGensSameField/] > Repro: > {code:java} > gradlew test --tests > TestDemoParallelLeafReader.testRandomMultipleSchemaGensSameField > -Dtests.seed=A7496D7D3957981A -Dtests.multiplier=3 -Dtests.locale=sr-Latn-BA > -Dtests.timezone=Etc/GMT-7 -Dtests.asserts=true -Dtests.file.encoding=UTF-8 > {code} > Error: > {code:java} > java.lang.AssertionError: expected:<103> but was:<2147483647> > at > __randomizedtesting.SeedInfo.seed([A7496D7D3957981A:F71866BCCEA1C903]:0) > at org.junit.Assert.fail(Assert.java:89) > at org.junit.Assert.failNotEquals(Assert.java:835) > at org.junit.Assert.assertEquals(Assert.java:647) > at org.junit.Assert.assertEquals(Assert.java:633) > at > org.apache.lucene.index.TestDemoParallelLeafReader.testRandomMultipleSchemaGensSameField(TestDemoParallelLeafReader.java:1347) > at > java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104) > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Updated] (LUCENE-10654) New companion doc value format for LatLonShape and XYShape field types
[ https://issues.apache.org/jira/browse/LUCENE-10654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Knize updated LUCENE-10654: Fix Version/s: 9.3 > New companion doc value format for LatLonShape and XYShape field types > -- > > Key: LUCENE-10654 > URL: https://issues.apache.org/jira/browse/LUCENE-10654 > Project: Lucene - Core > Issue Type: New Feature >Reporter: Nick Knize >Priority: Major > Fix For: 9.3 > > Time Spent: 10m > Remaining Estimate: 0h > > {{XYDocValuesField}} provides doc value support for {{XYPoint}}. > {{LatLonDocValuesField}} provides docvalue support for {{LatLonPoint}}. > However, neither {{LatLonShape}} nor {{XYShape}} currently have a docvalue > format. > This lack of doc value support for shapes means facets, aggregations, and > IndexOrDocValues queries are currently not possible for Shape field types. > This gap needs be closed in lucene. > To support IndexOrDocValues queries along with various geometry aggregations > and facets, the ability to compute the spatial relation with the doc value is > needed. This is straightforward with {{XYPoint}} and {{LatLonPoint}} since > the doc value encoding is nothing more than a simple 2D integer encoding of > the x,y and lat,lon dimensional components. Accomplishing the same with a > naive integer encoded binary representation for N-vertex shapes would be > costly. > {{ComponentTree}} already provides an efficient in memory structure for > quickly computing spatial relations over Shape types based on a binary tree > of tessellated triangles provided by the {{Tessellator}}. Furthermore, this > tessellation is already computed at index time. 
If we create an on-disk > representation of {{ComponentTree}} 's binary tree of tessellated triangles > and use this as the doc value {{binaryValue}} format we will be able to > efficiently compute spatial relations with this binary representation and > achieve the same facet/aggregation result over shapes as we can with points > today (e.g., grid facets, centroid, area, etc). -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] gsmiller merged pull request #1010: Specialize ordinal encoding for SortedSetDocValues
gsmiller merged PR #1010: URL: https://github.com/apache/lucene/pull/1010
[GitHub] [lucene-jira-archive] mocobeta merged pull request #39: Stack overflows can occur when parsing Jira lists
mocobeta merged PR #39: URL: https://github.com/apache/lucene-jira-archive/pull/39
[GitHub] [lucene-jira-archive] mocobeta closed issue #38: StackOverflowException on certain issue descriptions and comment text
mocobeta closed issue #38: StackOverflowException on certain issue descriptions and comment text URL: https://github.com/apache/lucene-jira-archive/issues/38
[GitHub] [lucene] msokolov commented on pull request #947: LUCENE-10577: enable quantization of HNSW vectors to 8 bits
msokolov commented on PR #947: URL: https://github.com/apache/lucene/pull/947#issuecomment-1182694202 OK, this last round of commits moves the new vector encoding parameter out of IndexableField and FieldInfo into Codec constructor and internally to the codec, in FieldEntry. It certainly has less visible surface area now. I also merged from main and resolved a bunch of conflicts with the scoring change. I think it is correct (all the unit tests pass), but it wasn't trivial and I think it would be worth running some integration/performance tests just to make sure all is still well. There's a little bit of code duplication in HnswGraphSearcher where we now have the logic for switching from approximate to exact knn in two places that I don't like. Maybe that can be factored better?
[GitHub] [lucene] zacharymorn opened a new pull request, #1018: LUCENE-10480: Use BulkScorer to limit BMMScorer to only top-level disjunctions
zacharymorn opened a new pull request, #1018: URL: https://github.com/apache/lucene/pull/1018 ### Description (or a Jira issue link if you have one) Use BulkScorer to limit BMMScorer to only top-level disjunctions Note: Tests update pending
[jira] [Commented] (LUCENE-10480) Specialize 2-clauses disjunctions
[ https://issues.apache.org/jira/browse/LUCENE-10480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17566149#comment-17566149 ] Zach Chen commented on LUCENE-10480: {quote}I wouldn't say blocker, but maybe we could give us time indeed by only using this new scorer on top-level disjunctions for now so that we have more time to figure out whether we should stick to BMW or switch to BMM for inner disjunctions. {quote} Sounds good. I tried a few quick approaches to limit BMM scorer to top-level disjunctions in *BooleanWeight* or {*}Boolean2ScorerSupplier{*}, but they didn't work due to weight's / query's recursive logic. So I ended up wrapping the scorer inside a bulk scorer ([https://github.com/apache/lucene/pull/1018,] pending tests update) like your other PR. Please let me know if this approach looks good to you, or if there's a better approach. > Specialize 2-clauses disjunctions > - > > Key: LUCENE-10480 > URL: https://issues.apache.org/jira/browse/LUCENE-10480 > Project: Lucene - Core > Issue Type: Task >Reporter: Adrien Grand >Priority: Minor > Time Spent: 7.5h > Remaining Estimate: 0h > > WANDScorer is nice, but it also has lots of overhead to maintain its > invariants: one linked list for the current candidates, one priority queue of > scorers that are behind, another one for scorers that are ahead. All this > could be simplified in the 2-clauses case, which feels worth specializing for > as it's very common that end users enter queries that only have two terms? -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Comment Edited] (LUCENE-10480) Specialize 2-clauses disjunctions
[ https://issues.apache.org/jira/browse/LUCENE-10480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17566149#comment-17566149 ] Zach Chen edited comment on LUCENE-10480 at 7/13/22 5:09 AM: - {quote}I wouldn't say blocker, but maybe we could give us time indeed by only using this new scorer on top-level disjunctions for now so that we have more time to figure out whether we should stick to BMW or switch to BMM for inner disjunctions. {quote} Sounds good. I tried a few quick approaches to limit BMM scorer to top-level disjunctions in *BooleanWeight* or {*}Boolean2ScorerSupplier{*}, but they didn't work due to weight's / query's recursive logic. So I ended up wrapping the scorer inside a bulk scorer ([https://github.com/apache/lucene/pull/1018,] pending tests update) like your other PR. Please let me know if this approach looks good to you, or if there's a better approach.
> Specialize 2-clauses disjunctions > - > > Key: LUCENE-10480 > URL: https://issues.apache.org/jira/browse/LUCENE-10480 > Project: Lucene - Core > Issue Type: Task >Reporter: Adrien Grand >Priority: Minor > Time Spent: 7.5h > Remaining Estimate: 0h > > WANDScorer is nice, but it also has lots of overhead to maintain its > invariants: one linked list for the current candidates, one priority queue of > scorers that are behind, another one for scorers that are ahead. All this > could be simplified in the 2-clauses case, which feels worth specializing for > as it's very common that end users enter queries that only have two terms? -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] zacharymorn commented on pull request #1018: LUCENE-10480: Use BulkScorer to limit BMMScorer to only top-level disjunctions
zacharymorn commented on PR #1018: URL: https://github.com/apache/lucene/pull/1018#issuecomment-1182774748 Benchmark results with `wikinightly.tasks` boolean queries below:
```
Task                         QPS baseline   StdDev  QPS my_modified_version   StdDev  Pct diff                   p-value
BrowseMonthTaxoFacets               28.81  (37.2%)                    26.45  (32.0%)     -8.2%  ( -56% -   97%)    0.454
OrHighMedDayTaxoFacets              17.65   (4.5%)                    16.78   (5.3%)     -5.0%  ( -14% -    5%)    0.001
BrowseRandomLabelTaxoFacets         27.58  (50.2%)                    26.72  (45.1%)     -3.1%  ( -65% -  185%)    0.836
TermBGroup1M1P                      37.75   (7.6%)                    36.62   (6.5%)     -3.0%  ( -15% -   11%)    0.179
TermGroup100                        36.05   (5.4%)                    35.18   (4.5%)     -2.4%  ( -11% -    8%)    0.130
IntNRQ                              90.71   (4.7%)                    88.69   (7.2%)     -2.2%  ( -13% -   10%)    0.248
TermBGroup1M                        30.11   (5.3%)                    29.64   (5.1%)     -1.6%  ( -11% -    9%)    0.343
TermDateFacets                      48.93   (4.5%)                    48.28   (5.0%)     -1.3%  ( -10% -    8%)    0.377
SloppyPhrase                        13.21   (3.3%)                    13.05   (3.5%)     -1.2%  (  -7% -    5%)    0.256
IntervalsOrdered                   125.27   (7.0%)                   123.79   (7.9%)     -1.2%  ( -14% -   14%)    0.615
MedTermDayTaxoFacets                78.33   (4.2%)                    77.48   (4.5%)     -1.1%  (  -9% -    8%)    0.429
TermDayOfYearSort                  254.99   (3.5%)                   252.39   (2.9%)     -1.0%  (  -7% -    5%)    0.312
AndHighMedDayTaxoFacets            122.91   (2.6%)                   121.74   (2.8%)     -1.0%  (  -6% -    4%)    0.265
SpanNear                             6.11   (5.6%)                     6.05   (4.4%)     -0.9%  ( -10% -    9%)    0.583
AndHighMed                         144.28   (4.2%)                   143.04   (4.9%)     -0.9%  (  -9% -    8%)    0.556
AndHighHigh                         43.39   (2.6%)                    43.04   (4.0%)     -0.8%  (  -7% -    5%)    0.449
Phrase                              52.64   (4.4%)                    52.26   (4.6%)     -0.7%  (  -9% -    8%)    0.615
AndHighHighDayTaxoFacets            11.91   (2.9%)                    11.83   (3.6%)     -0.7%  (  -6% -    6%)    0.527
TermDTSort                         331.47   (3.4%)                   329.38   (3.3%)     -0.6%  (  -7% -    6%)    0.552
AndHighOrMedMed                     90.33   (4.4%)                    90.06   (4.8%)     -0.3%  (  -9% -    9%)    0.841
TermGroup10K                        42.46   (4.3%)                    42.38   (4.3%)     -0.2%  (  -8% -    8%)    0.886
BrowseMonthSSDVFacets               29.10  (14.2%)                    29.05   (9.5%)     -0.2%  ( -20% -   27%)    0.965
TermGroup1M                         40.35   (4.0%)                    40.30   (4.3%)     -0.1%  (  -8% -    8%)    0.932
AndMedOrHighHigh                    86.73   (3.5%)                    86.76   (3.9%)      0.0%  (  -7% -    7%)    0.978
TermMonthSort                      273.18   (7.7%)                   273.28   (8.4%)      0.0%  ( -14% -   17%)    0.989
Fuzzy2                              81.84   (2.8%)                    81.91   (2.9%)      0.1%  (  -5% -    5%)    0.918
PKLookup                           321.81   (5.4%)                   322.43   (5.8%)      0.2%  ( -10% -   12%)    0.914
TermTitleSort                      188.55   (8.0%)                   188.92   (8.3%)      0.2%  ( -14% -   17%)    0.939
Respell                            111.20   (2.5%)                   111.46   (3.7%)      0.2%  (  -5% -    6%)    0.815
Fuzzy1                              78.31   (2.9%)                    78.64   (2.9%)      0.4%  (  -5% -    6%)    0.648
BrowseRandomLabelSSDVFacets         19.92   (8.2%)                    20.03   (6.4%)      0.5%  ( -13% -   16%)    0.821
Term                              3440.49   (3.9%)                  3461.12   (4.8%)      0.6%  (  -7% -    9%)    0.664
BrowseDayOfYearSSDVFacets           26.22  (12.5%)                    26.47   (4.8%)      0.9%  ( -14% -   20%)    0.751
BrowseDateTaxoFacets                27.49  (32.2%)                    27.82  (32.6%)      1.2%  ( -48% -   97%)    0.905
BrowseDayOfYearTaxoFacets           27.84  (31.8%)                    28.20  (32.4%)      1.3%  ( -47% -   96%)    0.900
BrowseDateSSDVFacets                 3.75  (27.0%)                     3.80  (28.3%)      1.3%  ( -42% -   77%)    0.879
Wildcard                           113.02   (4.3%)                   114.66   (5.3%)      1.5%  (  -7% -   11%)    0.342
Prefix3                             83.80   (7.4%)                    85.97   (7.3%)      2.6%  ( -11% -   18%)    0.266
OrHighHigh                         113.87   (3.9%)                   156.08   (8.9%)     37.1%  (  23% -   51%)    0.000
OrHighMed                           92.87   (5.1%)                   210.48  (13.0%)    126.6%  ( 103% -  152%)    0.000
```
``` TaskQPS baseline StdDevQPS my_
[GitHub] [lucene] navneet1v commented on a diff in pull request #1017: LUCENE-10654: Add new ShapeDocValuesField for LatLonShape and XYShape
navneet1v commented on code in PR #1017: URL: https://github.com/apache/lucene/pull/1017#discussion_r919668826 ## lucene/core/src/java/org/apache/lucene/document/ShapeDocValuesField.java: ## @@ -0,0 +1,844 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ +package org.apache.lucene.document; + +import java.io.IOException; +import java.util.ArrayList; +import java.util.Comparator; +import java.util.List; +import org.apache.lucene.analysis.Analyzer; +import org.apache.lucene.analysis.TokenStream; +import org.apache.lucene.document.ShapeField.DecodedTriangle.TYPE; +import org.apache.lucene.document.ShapeField.QueryRelation; +import org.apache.lucene.document.SpatialQuery.EncodedRectangle; +import org.apache.lucene.index.DocValuesType; +import org.apache.lucene.index.IndexableFieldType; +import org.apache.lucene.index.PointValues.Relation; +import org.apache.lucene.search.Query; +import org.apache.lucene.store.ByteArrayDataInput; +import org.apache.lucene.store.ByteBuffersDataOutput; +import org.apache.lucene.store.DataInput; +import org.apache.lucene.util.ArrayUtil; +import org.apache.lucene.util.BytesRef; + +/** A doc values field representation for {@link LatLonShape} and {@link XYShape} */ +public final class ShapeDocValuesField extends Field { + private final ShapeComparator shapeComparator; + + private static final FieldType FIELD_TYPE = new FieldType(); + + static { +FIELD_TYPE.setDocValuesType(DocValuesType.BINARY); +FIELD_TYPE.setOmitNorms(true); +FIELD_TYPE.freeze(); + } + + /** + * Creates a {@ShapeDocValueField} instance from a shape tessellation + * + * @param name The Field Name (must not be null) + * @param tessellation The tessellation (must not be null) + */ + ShapeDocValuesField(String name, List tessellation) { +super(name, FIELD_TYPE); +BytesRef b = computeBinaryValue(tessellation); +this.fieldsData = b; +try { + this.shapeComparator = new ShapeComparator(b); +} catch (IOException e) { + throw new IllegalArgumentException("unable to read binary shape doc value field. 
", e); +} + } + + /** Creates a {@code ShapeDocValue} field from a given serialized value */ + ShapeDocValuesField(String name, BytesRef binaryValue) { Review Comment: The constructors are not public; how can clients use these doc values?
[jira] [Commented] (LUCENE-10650) "after_effect": "no" was removed what replaces it?
[ https://issues.apache.org/jira/browse/LUCENE-10650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17566191#comment-17566191 ] Nathan Meisels commented on LUCENE-10650: - Thanks for the response! I have another question. What happens if you upgrade to a new -elastic- Lucene version with these old settings (e.g. after effect "no")? Does Elasticsearch run multiple Lucene versions so that it knows how to handle them? Or do I have to reindex with the new setting before the upgrade? Thanks > "after_effect": "no" was removed what replaces it? > -- > > Key: LUCENE-10650 > URL: https://issues.apache.org/jira/browse/LUCENE-10650 > Project: Lucene - Core > Issue Type: Wish >Reporter: Nathan Meisels >Priority: Major > > Hi! > We have been using an old version of elasticsearch with the following > settings: > > {code:java} > "default": { > "queryNorm": "1", > "type": "DFR", > "basic_model": "in", > "after_effect": "no", > "normalization": "no" > }{code} > > I see [here|https://issues.apache.org/jira/browse/LUCENE-8015] that > "after_effect": "no" was removed. 
> In > [old|https://github.com/apache/lucene-solr/blob/releases/lucene-solr/5.5.0/lucene/core/src/java/org/apache/lucene/search/similarities/BasicModelIn.java#L33] > version score was: > {code:java} > return tfn * (float)(log2((N + 1) / (n + 0.5)));{code} > In > [new|https://github.com/apache/lucene-solr/blob/releases/lucene-solr/8.11.2/lucene/core/src/java/org/apache/lucene/search/similarities/BasicModelIn.java#L43] > version it's: > {code:java} > long N = stats.getNumberOfDocuments(); > long n = stats.getDocFreq(); > double A = log2((N + 1) / (n + 0.5)); > // basic model I should return A * tfn > // which we rewrite to A * (1 + tfn) - A > // so that it can be combined with the after effect while still guaranteeing > // that the result is non-decreasing with tfn > return A * aeTimes1pTfn * (1 - 1 / (1 + tfn)); > {code} > I tried changing after_effect to "l" but the scoring is > different than what we are used to. (We depend heavily on the exact scoring). > Do you have any advice how we can keep the same scoring as before? > Thanks -- This message was sent by Atlassian Jira (v8.20.10#820010)
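The difference between the two quoted formulas can be made concrete with a small standalone sketch. It is not Lucene code: the class and method names are hypothetical, everything is computed in double precision (the 5.5.0 snippet casts to float), and it assumes that with after effect "l" the factor `aeTimes1pTfn` is 1, which should be verified against the Lucene 8.x sources. Under that assumption the new score equals the old one divided by (1 + tfn), which would explain why the rankings differ.

```java
// Hypothetical standalone comparison of the two formulas quoted in LUCENE-10650.
// Nothing here calls Lucene; names are illustrative only.
final class DfrBasicModelInComparison {

  static double log2(double x) {
    return Math.log(x) / Math.log(2.0);
  }

  // Old (5.5.0-style) BasicModelIn with no after effect: tfn * log2((N + 1) / (n + 0.5)).
  static double oldScore(long N, long n, double tfn) {
    return tfn * log2((N + 1) / (n + 0.5));
  }

  // New (8.11.2-style) BasicModelIn: A * aeTimes1pTfn * (1 - 1 / (1 + tfn)).
  // aeTimes1pTfn is supplied by the after effect; 1.0 is ASSUMED here for "l".
  static double newScore(long N, long n, double tfn, double aeTimes1pTfn) {
    double A = log2((N + 1) / (n + 0.5));
    return A * aeTimes1pTfn * (1 - 1 / (1 + tfn));
  }

  public static void main(String[] args) {
    long N = 1000; // number of documents (made-up value)
    long n = 10;   // document frequency of the term (made-up value)
    for (double tfn : new double[] {0.5, 1.0, 2.0, 4.0}) {
      double oldS = oldScore(N, n, tfn);
      double newS = newScore(N, n, tfn, 1.0);
      System.out.printf("tfn=%.1f old=%.4f new=%.4f old/new=%.4f%n", tfn, oldS, newS, oldS / newS);
    }
  }
}
```

In this sketch the old score grows linearly with tfn while the new one saturates towards A, so no constant rescaling maps one onto the other.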