absurdfarce commented on code in PR #1952:
URL:
https://github.com/apache/cassandra-java-driver/pull/1952#discussion_r1914214511
##########
core/src/main/java/com/datastax/oss/driver/internal/core/type/codec/BlobCodec.java:
##########
@@ -83,4 +84,10 @@ public ByteBuffer parse(@Nullable String value) {
? null
: Bytes.fromHexString(value);
}
+
+ @NonNull
+ @Override
+ public Optional<Integer> serializedSize() {
+ return Optional.absent();
+ }
Review Comment:
We already define a default version (which also returns Optional.absent())
of this method in the TypeCodec interface implemented by this class. Is there
a reason we need to define it here as well?
##########
core/src/test/java/com/datastax/oss/driver/api/core/data/CqlVectorTest.java:
##########
@@ -21,58 +21,75 @@
import static org.assertj.core.api.Assertions.assertThatThrownBy;
import static org.assertj.core.api.Assertions.fail;
+import com.datastax.oss.driver.api.core.type.codec.TypeCodec;
import com.datastax.oss.driver.api.core.type.codec.TypeCodecs;
+import com.datastax.oss.driver.api.core.type.codec.registry.CodecRegistry;
import com.datastax.oss.driver.internal.SerializationHelper;
-import com.datastax.oss.driver.shaded.guava.common.collect.Iterators;
+import com.tngtech.java.junit.dataprovider.DataProvider;
+import com.tngtech.java.junit.dataprovider.DataProviderRunner;
+import com.tngtech.java.junit.dataprovider.UseDataProvider;
import java.io.ByteArrayInputStream;
import java.io.ObjectInputStream;
import java.io.ObjectStreamException;
+import java.time.LocalTime;
import java.util.AbstractList;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
+import java.util.Iterator;
import java.util.List;
import java.util.stream.Collectors;
import org.apache.commons.codec.DecoderException;
import org.apache.commons.codec.binary.Hex;
import org.assertj.core.util.Lists;
import org.junit.Test;
+import org.junit.runner.RunWith;
+@RunWith(DataProviderRunner.class)
public class CqlVectorTest {
private static final Float[] VECTOR_ARGS = {1.0f, 2.5f};
Review Comment:
We don't need this anymore, right? If I'm reading this correctly this has
been entirely replaced by the equivalent value in dataProvider() (plus the
other values returned by that method, which perform similar comparisons for
the other types we now support).
##########
core/src/test/java/com/datastax/oss/driver/internal/core/type/codec/VectorCodecTest.java:
##########
@@ -20,122 +20,244 @@
import static org.assertj.core.api.Assertions.assertThat;
import static org.assertj.core.api.Assertions.assertThatThrownBy;
+import com.datastax.oss.driver.api.core.ProtocolVersion;
import com.datastax.oss.driver.api.core.data.CqlVector;
+import com.datastax.oss.driver.api.core.type.DataType;
import com.datastax.oss.driver.api.core.type.DataTypes;
-import com.datastax.oss.driver.api.core.type.VectorType;
+import com.datastax.oss.driver.api.core.type.codec.TypeCodec;
import com.datastax.oss.driver.api.core.type.codec.TypeCodecs;
-import com.datastax.oss.driver.api.core.type.reflect.GenericType;
+import com.datastax.oss.driver.api.core.type.codec.registry.CodecRegistry;
import com.datastax.oss.driver.internal.core.type.DefaultVectorType;
-import java.util.Arrays;
+import com.datastax.oss.protocol.internal.util.Bytes;
+import com.tngtech.java.junit.dataprovider.DataProvider;
+import com.tngtech.java.junit.dataprovider.DataProviderRunner;
+import com.tngtech.java.junit.dataprovider.UseDataProvider;
+import java.nio.ByteBuffer;
+import java.time.LocalTime;
+import java.util.HashMap;
+import org.apache.commons.lang3.ArrayUtils;
import org.junit.Test;
+import org.junit.runner.RunWith;
-public class VectorCodecTest extends CodecTestBase<CqlVector<Float>> {
+@RunWith(DataProviderRunner.class)
+public class VectorCodecTest {
- private static final Float[] VECTOR_ARGS = {1.0f, 2.5f};
-
- private static final CqlVector<Float> VECTOR =
CqlVector.newInstance(VECTOR_ARGS);
-
- private static final String VECTOR_HEX_STRING = "0x" + "3f800000" +
"40200000";
-
- private static final String FORMATTED_VECTOR = "[1.0, 2.5]";
-
- public VectorCodecTest() {
- VectorType vectorType = DataTypes.vectorOf(DataTypes.FLOAT, 2);
- this.codec = TypeCodecs.vectorOf(vectorType, TypeCodecs.FLOAT);
+ @DataProvider
+ public static Object[][] dataProvider() {
+ HashMap<Integer, String> map1 = new HashMap<>();
+ map1.put(1, "a");
+ HashMap<Integer, String> map2 = new HashMap<>();
+ map2.put(2, "b");
+ // For every row, data type, array of 2 values, formatted string, encoded
bytes
Review Comment:
I'm wondering if it might be better (from a code management perspective)
to create a small container class which contains these values and provides
named accessors for them (container.getDataType(), container.getValues(),
etc.). Not critical, but it might make this code easier to read and maintain.
##########
core/src/main/java/com/datastax/oss/driver/internal/core/type/codec/VectorCodec.java:
##########
@@ -156,4 +116,195 @@ public CqlVector<SubtypeT> parse(@Nullable String value) {
? null
: CqlVector.from(value, this.subtypeCodec);
}
+
+ private static class FixedLength<SubtypeT> implements
VectorCodecProxy<SubtypeT> {
+ private final VectorType cqlType;
+ private final TypeCodec<SubtypeT> subtypeCodec;
+
+ private FixedLength(VectorType cqlType, TypeCodec<SubtypeT> subtypeCodec) {
+ this.cqlType = cqlType;
+ this.subtypeCodec = subtypeCodec;
+ }
+
+ @Override
+ public ByteBuffer encode(
+ @Nullable CqlVector<SubtypeT> value, @NonNull ProtocolVersion
protocolVersion) {
+ if (value == null || cqlType.getDimensions() <= 0) {
+ return null;
+ }
+ ByteBuffer[] valueBuffs = new ByteBuffer[cqlType.getDimensions()];
+ Iterator<SubtypeT> values = value.iterator();
+ int allValueBuffsSize = 0;
+ for (int i = 0; i < cqlType.getDimensions(); ++i) {
+ ByteBuffer valueBuff;
+ SubtypeT valueObj;
+
+ try {
+ valueObj = values.next();
+ } catch (NoSuchElementException nsee) {
+ throw new IllegalArgumentException(
+ String.format(
+ "Not enough elements; must provide elements for %d
dimensions",
+ cqlType.getDimensions()));
+ }
+
+ try {
+ valueBuff = this.subtypeCodec.encode(valueObj, protocolVersion);
+ } catch (ClassCastException e) {
+ throw new IllegalArgumentException("Invalid type for element: " +
valueObj.getClass());
+ }
+ if (valueBuff == null) {
+ throw new NullPointerException("Vector elements cannot encode to CQL
NULL");
+ }
+ allValueBuffsSize += valueBuff.limit();
+ valueBuff.rewind();
+ valueBuffs[i] = valueBuff;
+ }
+ // if too many elements, throw
+ if (values.hasNext()) {
+ throw new IllegalArgumentException(
+ String.format(
+ "Too many elements; must provide elements for %d dimensions",
+ cqlType.getDimensions()));
Review Comment:
Just for my own edification: we did decide we were going to throw in this
case, right? The other option is that we can silently ignore any extra data
that might be present in the buffer and I recall some conversation on this
point. My recollection was that we had decided _not_ to throw in this case but
my PR for PYTHON-1369 explicitly [does
throw](https://github.com/datastax/python-driver/pull/1217/files#diff-18308cd34ffea992ea325d7d0c09eb78b27203def63c60dedbbcd8db8626cd7fR486-R489)
in this case as well.
I do remember that we had decided to keep this behaviour consistent across
drivers. @joao-r-reis can you confirm what the C# driver does in this case?
##########
core/src/main/java/com/datastax/oss/driver/internal/core/type/util/VIntCoding.java:
##########
@@ -174,10 +175,84 @@ public static int computeVIntSize(final long param) {
}
/** Compute the number of bytes that would be needed to encode an unsigned
varint. */
- private static int computeUnsignedVIntSize(final long value) {
+ public static int computeUnsignedVIntSize(final long value) {
int magnitude =
Long.numberOfLeadingZeros(
value | 1); // | with 1 to ensure magnitude <= 63, so (63 - 1) / 7
<= 8
return (639 - magnitude * 9) >> 6;
}
+
+ public static void writeUnsignedVInt32(int value, ByteBuffer output) {
+ writeUnsignedVInt((long) value, output);
+ }
+
+ public static void writeUnsignedVInt(long value, ByteBuffer output) {
+ int size = VIntCoding.computeUnsignedVIntSize(value);
+ if (size == 1) {
+ output.put((byte) value);
+ return;
+ }
+
+ output.put(VIntCoding.encodeVInt(value, size), 0, size);
+ }
+
+ /**
+ * Read up to a 32-bit integer back, using the unsigned (no zigzag) encoding.
+ *
+ * <p>Note this method is the same as {@link #readUnsignedVInt(DataInput)},
except that we do
+ * *not* block if there are not enough bytes in the buffer to reconstruct
the value.
+ *
+ * @throws VIntOutOfRangeException If the vint doesn't fit into a 32-bit
integer
+ */
+ public static int getUnsignedVInt32(ByteBuffer input, int readerIndex) {
+ return checkedCast(getUnsignedVInt(input, readerIndex));
+ }
+
+ public static long getUnsignedVInt(ByteBuffer input, int readerIndex) {
+ return getUnsignedVInt(input, readerIndex, input.limit());
+ }
+
+ public static long getUnsignedVInt(ByteBuffer input, int readerIndex, int
readerLimit) {
+ if (readerIndex < 0)
+ throw new IllegalArgumentException(
+ "Reader index should be non-negative, but was " + readerIndex);
+
+ if (readerIndex >= readerLimit) return -1;
+
+ int firstByte = input.get(readerIndex++);
+
+ // Bail out early if this is one byte, necessary or it fails later
+ if (firstByte >= 0) return firstByte;
+
+ int size = numberOfExtraBytesToRead(firstByte);
+ if (readerIndex + size > readerLimit) return -1;
+
+ long retval = firstByte & firstByteValueMask(size);
+ for (int ii = 0; ii < size; ii++) {
+ byte b = input.get(readerIndex++);
+ retval <<= 8;
+ retval |= b & 0xff;
+ }
+
+ return retval;
+ }
+
+ public static int checkedCast(long value) {
+ int result = (int) value;
+ if ((long) result != value) throw new VIntOutOfRangeException(value);
+ return result;
+ }
+
+ /**
+ * Throw when attempting to decode a vint and the output type doesn't have
enough space to fit the
+ * value that was decoded
+ */
+ public static class VIntOutOfRangeException extends RuntimeException {
+ public final long value;
+
+ private VIntOutOfRangeException(long value) {
+ super(value + " is out of range for a 32-bit integer");
+ this.value = value;
+ }
+ }
Review Comment:
Worth noting that there is a [server-side
implementation](https://github.com/apache/cassandra/commit/ae537abc6494564d7254a2126465522d86b44c1e#diff-5861c7f2b8ab811c316be3858b29b9a3c65f759828022dce0559d3f6023a7fd1R176-R224)
of the same functionality which we might want to leverage here. I don't want
to get in the habit of bringing in server-side code in general so I'm certainly
not arguing that we _have_ to use it... but it's worth at least considering.
The server-side implementation is at least subject to validation via a unit
test. While we're on the subject of testing... none of the functionality in
this class (including functionality that existed before this PR) is covered by
any existing unit test. That's semi-okay for the unsigned vint stuff since the
vector tests in the integration test implicitly cover it... but it would be
nice to have an explicit unit test for this as well. Ideally such a test would
fill in coverage of the previously existing methods as well but I'm fine
leaving that as "future work" if it'll take too long.
##########
core/src/main/java/com/datastax/oss/driver/api/core/type/codec/TypeCodec.java:
##########
@@ -234,4 +235,9 @@ default boolean accepts(@NonNull DataType cqlType) {
*/
@Nullable
JavaTypeT parse(@Nullable String value);
+
+ @NonNull
+ default Optional<Integer> serializedSize() {
+ return Optional.absent();
+ }
Review Comment:
Let's shift this to the JDK Optional (and Optional.empty()) rather than the
Guava variant. The Guava version doesn't give us anything we actually need
here beyond what we get from the JDK lib.
##########
core/src/main/java/com/datastax/oss/driver/internal/core/type/codec/VectorCodec.java:
##########
@@ -156,4 +116,195 @@ public CqlVector<SubtypeT> parse(@Nullable String value) {
? null
: CqlVector.from(value, this.subtypeCodec);
}
+
+ private static class FixedLength<SubtypeT> implements
VectorCodecProxy<SubtypeT> {
+ private final VectorType cqlType;
+ private final TypeCodec<SubtypeT> subtypeCodec;
+
+ private FixedLength(VectorType cqlType, TypeCodec<SubtypeT> subtypeCodec) {
+ this.cqlType = cqlType;
+ this.subtypeCodec = subtypeCodec;
+ }
+
+ @Override
+ public ByteBuffer encode(
+ @Nullable CqlVector<SubtypeT> value, @NonNull ProtocolVersion
protocolVersion) {
+ if (value == null || cqlType.getDimensions() <= 0) {
+ return null;
+ }
+ ByteBuffer[] valueBuffs = new ByteBuffer[cqlType.getDimensions()];
+ Iterator<SubtypeT> values = value.iterator();
+ int allValueBuffsSize = 0;
+ for (int i = 0; i < cqlType.getDimensions(); ++i) {
+ ByteBuffer valueBuff;
+ SubtypeT valueObj;
+
+ try {
+ valueObj = values.next();
+ } catch (NoSuchElementException nsee) {
+ throw new IllegalArgumentException(
+ String.format(
+ "Not enough elements; must provide elements for %d
dimensions",
+ cqlType.getDimensions()));
+ }
+
+ try {
+ valueBuff = this.subtypeCodec.encode(valueObj, protocolVersion);
+ } catch (ClassCastException e) {
+ throw new IllegalArgumentException("Invalid type for element: " +
valueObj.getClass());
+ }
+ if (valueBuff == null) {
+ throw new NullPointerException("Vector elements cannot encode to CQL
NULL");
+ }
+ allValueBuffsSize += valueBuff.limit();
+ valueBuff.rewind();
+ valueBuffs[i] = valueBuff;
+ }
+ // if too many elements, throw
+ if (values.hasNext()) {
+ throw new IllegalArgumentException(
+ String.format(
+ "Too many elements; must provide elements for %d dimensions",
+ cqlType.getDimensions()));
+ }
+ /* Since we already did an early return for <= 0 dimensions above */
+ assert valueBuffs.length > 0;
+ ByteBuffer rv = ByteBuffer.allocate(allValueBuffsSize);
+ for (int i = 0; i < cqlType.getDimensions(); ++i) {
+ rv.put(valueBuffs[i]);
+ }
+ rv.flip();
+ return rv;
+ }
+
+ @Override
+ public CqlVector<SubtypeT> decode(
+ @Nullable ByteBuffer bytes, @NonNull ProtocolVersion protocolVersion) {
+ if (bytes == null || bytes.remaining() == 0) {
+ return null;
+ }
+
+ int elementSize = subtypeCodec.serializedSize().get();
+ if (bytes.remaining() != cqlType.getDimensions() * elementSize) {
+ throw new IllegalArgumentException(
+ String.format(
+ "Expected elements of uniform size, observed %d elements with
total bytes %d",
+ cqlType.getDimensions(), bytes.remaining()));
+ }
+
+ ByteBuffer slice = bytes.slice();
+ List<SubtypeT> rv = new ArrayList<SubtypeT>(cqlType.getDimensions());
+ for (int i = 0; i < cqlType.getDimensions(); ++i) {
+ // Set the limit for the current element
+ int originalPosition = slice.position();
+ slice.limit(originalPosition + elementSize);
+ rv.add(this.subtypeCodec.decode(slice, protocolVersion));
+ // Move to the start of the next element
+ slice.position(originalPosition + elementSize);
+ // Reset the limit to the end of the buffer
+ slice.limit(slice.capacity());
+ }
+
+ return CqlVector.newInstance(rv);
+ }
+ }
+
+ private static class VariableLength<SubtypeT> implements
VectorCodecProxy<SubtypeT> {
+ private final VectorType cqlType;
+ private final TypeCodec<SubtypeT> subtypeCodec;
+
+ private VariableLength(VectorType cqlType, TypeCodec<SubtypeT>
subtypeCodec) {
+ this.cqlType = cqlType;
+ this.subtypeCodec = subtypeCodec;
+ }
+
+ @Override
+ public ByteBuffer encode(
+ @Nullable CqlVector<SubtypeT> value, @NonNull ProtocolVersion
protocolVersion) {
+ if (value == null || cqlType.getDimensions() <= 0) {
+ return null;
+ }
+ ByteBuffer[] valueBuffs = new ByteBuffer[cqlType.getDimensions()];
+ Iterator<SubtypeT> values = value.iterator();
+ int allValueBuffsSize = 0;
+ for (int i = 0; i < cqlType.getDimensions(); ++i) {
+ ByteBuffer valueBuff;
+ SubtypeT valueObj;
+
+ try {
+ valueObj = values.next();
+ } catch (NoSuchElementException nsee) {
+ throw new IllegalArgumentException(
+ String.format(
+ "Not enough elements; must provide elements for %d
dimensions",
+ cqlType.getDimensions()));
+ }
+
+ try {
+ valueBuff = this.subtypeCodec.encode(valueObj, protocolVersion);
+ } catch (ClassCastException e) {
+ throw new IllegalArgumentException("Invalid type for element: " +
valueObj.getClass());
+ }
+ if (valueBuff == null) {
+ throw new NullPointerException("Vector elements cannot encode to CQL
NULL");
+ }
+ int elementSize = valueBuff.limit();
+ allValueBuffsSize += elementSize +
VIntCoding.computeVIntSize(elementSize);
+ valueBuff.rewind();
+ valueBuffs[i] = valueBuff;
+ }
+
+ // if too many elements, throw
+ if (values.hasNext()) {
+ throw new IllegalArgumentException(
+ String.format(
+ "Too many elements; must provide elements for %d dimensions",
+ cqlType.getDimensions()));
+ }
+
+ /* Since we already did an early return for <= 0 dimensions above */
+ assert valueBuffs.length > 0;
+ ByteBuffer rv = ByteBuffer.allocate(allValueBuffsSize);
+ for (int i = 0; i < cqlType.getDimensions(); ++i) {
+ VIntCoding.writeUnsignedVInt32(valueBuffs[i].remaining(), rv);
+ rv.put(valueBuffs[i]);
+ }
+ rv.flip();
+ return rv;
+ }
+
+ @Override
+ public CqlVector<SubtypeT> decode(
+ @Nullable ByteBuffer bytes, @NonNull ProtocolVersion protocolVersion) {
+ if (bytes == null || bytes.remaining() == 0) {
+ return null;
+ }
+
+ ByteBuffer input = bytes.duplicate();
+ List<SubtypeT> rv = new ArrayList<SubtypeT>(cqlType.getDimensions());
+ for (int i = 0; i < cqlType.getDimensions(); ++i) {
+ int size = VIntCoding.getUnsignedVInt32(input, input.position());
+ input.position(input.position() +
VIntCoding.computeUnsignedVIntSize(size));
+
+ ByteBuffer value;
+ if (size < 0) {
+ value = null;
+ } else {
+ value = input.duplicate();
+ value.limit(value.position() + size);
+ input.position(input.position() + size);
+ }
+ rv.add(subtypeCodec.decode(value, protocolVersion));
+ }
+ // if too many elements, throw
+ if (input.hasRemaining()) {
+ throw new IllegalArgumentException(
+ String.format(
+ "Too many elements; must provide elements for %d dimensions",
+ cqlType.getDimensions()));
+ }
+
+ return CqlVector.newInstance(rv);
+ }
Review Comment:
I agree with @joao-r-reis. Python folds everything into a single impl,
although in fairness Python doesn't have codecs and handles this logic via
serialize() and deserialize() methods on the types themselves... which makes
multiple "codec" classes harder to implement.
Still, I agree with the general idea that it would be nice to consolidate as
much of this as possible into a single implementation in order to avoid the
duplication here.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]