uschindler commented on code in PR #12436:
URL: https://github.com/apache/lucene/pull/12436#discussion_r1276162592
##########
lucene/core/src/java/org/apache/lucene/codecs/KnnVectorsFormat.java:
##########
@@ -76,6 +79,19 @@ public static KnnVectorsFormat forName(String name) {
/** Returns a {@link KnnVectorsReader} to read the vectors from the index. */
public abstract KnnVectorsReader fieldsReader(SegmentReadState state) throws
IOException;
+ /**
+ * Returns the maximum number of vector dimensions supported by this codec
for the given field
+ * name
+ *
+ * <p>Codecs should override this method to specify the maximum number of
dimensions they support.
+ *
+ * @param fieldName the field name
+ * @return the maximum number of vector dimensions.
+ */
+ public int getMaxDimensions(String fieldName) {
Review Comment:
In main branch this should be abstract, only in 9.x we should have a default
impl.
##########
lucene/test-framework/src/java/org/apache/lucene/tests/index/BaseKnnVectorsFormatTestCase.java:
##########
@@ -478,18 +473,42 @@ public void testIllegalDimensionTooLarge() throws
Exception {
try (Directory dir = newDirectory();
IndexWriter w = new IndexWriter(dir, newIndexWriterConfig())) {
Document doc = new Document();
- expectThrows(
- IllegalArgumentException.class,
- () ->
- doc.add(
- new KnnFloatVectorField(
- "f",
- new float[FloatVectorValues.MAX_DIMENSIONS + 1],
- VectorSimilarityFunction.DOT_PRODUCT)));
+ doc.add(
+ new KnnFloatVectorField(
+ "f",
+ new float[getVectorsMaxDimensions("f") + 1],
+ VectorSimilarityFunction.DOT_PRODUCT));
+ Exception exc = expectThrows(IllegalArgumentException.class, () ->
w.addDocument(doc));
Review Comment:
So basically the check is now delayed to `addDocument()`? Great!
I was afraid that the check comes delayed while indexing of flushing is
happening. So to me this looks good.
##########
lucene/core/src/java/org/apache/lucene/index/IndexingChain.java:
##########
@@ -621,6 +621,12 @@ private void initializeFieldInfo(PerField pf) throws
IOException {
final Sort indexSort = indexWriterConfig.getIndexSort();
validateIndexSortDVType(indexSort, pf.fieldName, s.docValuesType);
}
+ if (s.vectorDimension != 0) {
+ validateMaxVectorDimension(
+ pf.fieldName,
+ s.vectorDimension,
+ indexWriterConfig.getCodec().knnVectorsFormat().getMaxDimensions());
+ }
Review Comment:
Yes this looks fine. The hashtable lookup is no issue at all.
As getMaxDimensions() is a public codec method, we can let us do the check
here. A separate validateFieldDims() is not needed.
I only don't like the long call chain to actually get the vectors format
from the indexWriterConfig. But I can live with that.
##########
lucene/core/src/java/org/apache/lucene/codecs/perfield/PerFieldKnnVectorsFormat.java:
##########
@@ -80,6 +80,11 @@ public KnnVectorsReader fieldsReader(SegmentReadState state)
throws IOException
return new FieldsReader(state);
}
+ @Override
Review Comment:
is this the only subclass of KnnVectorsFormat that we have?
In addition, we should also add explicit number of dimensions in backwards
codecs, because at the time when it was implemented 1024 was their default. In
backwards codec the method should be final, IMHO.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]