yifan-c commented on code in PR #110:
URL: https://github.com/apache/cassandra-analytics/pull/110#discussion_r2136874334
##########
cassandra-analytics-common/src/main/java/org/apache/cassandra/cdc/api/CommitLog.java:
##########
@@ -54,7 +54,7 @@ static Optional<Pair<Integer, Long>> extractVersionAndSegmentId(@NotNull final S
try
{
final int version = matcher.group(2) == null ? 6 : Integer.parseInt(matcher.group(2));
- if (version != 6 && version != 7)
+ if (version != 6 && version != 7 && version != 8) // TODO(c4c5): What is the difference between values?
Review Comment:
The values correspond to the commit log version in Cassandra. You might have already found this out; sharing just in case. They can be found at
https://github.com/apache/cassandra/blob/701495afec90a9ce0e01430be6927b729a12ec72/src/java/org/apache/cassandra/db/commitlog/CommitLogDescriptor.java#L63-L66
Version 8 was added in this patch: https://issues.apache.org/jira/browse/CASSANDRA-14227
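For reference, a minimal sketch of the mapping the linked constants encode (the helper name is made up; verify the version-to-release mapping against the linked CommitLogDescriptor):

    // version 6 -> Cassandra 3.0, version 7 -> 4.0, version 8 -> 5.0 (CASSANDRA-14227)
    static boolean isSupportedCommitLogVersion(int version)
    {
        return version == 6 || version == 7 || version == 8;
    }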
##########
cassandra-analytics-common/src/main/java/org/apache/cassandra/spark/data/FileSystemSSTable.java:
##########
@@ -63,9 +61,11 @@ protected InputStream openInputStream(FileType fileType)
}
try
{
- return useBufferingInputStream
- ? new BufferingInputStream<>(new FileSystemSource<>(this, fileType, filePath), stats.get())
- : new BufferedInputStream(new FileInputStream(filePath.toFile()));
+ // TODO(c4c5): Any input stream returned here should support re-buffering at random position.
+// return useBufferingInputStream
Review Comment:
+1 on this
##########
cassandra-analytics-common/src/main/java/org/apache/cassandra/spark/utils/CqlUtils.java:
##########
@@ -166,7 +166,7 @@ public static String extractCleanedTableSchema(@NotNull String createStatementTo
@NotNull String keyspace,
@NotNull String table)
{
- Pattern pattern = Pattern.compile(String.format("CREATE TABLE ?\"?%s?\"?\\.{1}\"?%s\"?[^;]*;", keyspace, table));
+ Pattern pattern = Pattern.compile(String.format("CREATE TABLE (IF NOT EXISTS)? ?\"?%s?\"?\\.{1}\"?%s\"?[^;]*;", keyspace, table));
Review Comment:
The pattern was aimed at processing the describe-table output, which should not contain the "IF NOT EXISTS" portion. But the change is a good addition and improves robustness.
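A quick runnable check of what the updated pattern accepts (the keyspace/table names and statements are made-up examples):

    import java.util.regex.Pattern;

    public class CreateTablePatternDemo
    {
        public static void main(String[] args)
        {
            // Same format string as the diff above, instantiated with hypothetical names
            Pattern pattern = Pattern.compile(String.format("CREATE TABLE (IF NOT EXISTS)? ?\"?%s?\"?\\.{1}\"?%s\"?[^;]*;", "ks", "tbl"));
            System.out.println(pattern.matcher("CREATE TABLE ks.tbl (id int PRIMARY KEY);").find());                // true
            System.out.println(pattern.matcher("CREATE TABLE IF NOT EXISTS ks.tbl (id int PRIMARY KEY);").find()); // true
        }
    }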
##########
cassandra-analytics-common/src/main/java/org/apache/cassandra/spark/utils/streaming/BufferingInputStream.java:
##########
@@ -237,8 +252,8 @@ public void onRead(StreamBuffer buffer)
{
return;
}
- bytesWritten.addAndGet(length);
queue.add(buffer);
+ bytesWritten.addAndGet(length);
Review Comment:
Just curious, why relocate the line?
##########
cassandra-five-zero-bridge/src/main/java/org/apache/cassandra/spark/reader/CompressionMetadata.java:
##########
@@ -0,0 +1,134 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.cassandra.spark.reader;
+
+import java.io.DataInputStream;
+import java.io.EOFException;
+import java.io.IOException;
+import java.io.InputStream;
+import java.util.HashMap;
+import java.util.Map;
+
+import org.apache.cassandra.io.compress.ICompressor;
+import org.apache.cassandra.io.util.File;
+import org.apache.cassandra.io.util.Memory;
+import org.apache.cassandra.schema.CompressionParams;
+import org.apache.cassandra.spark.reader.common.AbstractCompressionMetadata;
+import org.apache.cassandra.spark.reader.common.BigLongArray;
+
+/**
+ * Holds metadata about compressed file
+ */
+// CompressionMetadata is mocked in IndexReaderTests and mockito does not support mocking final classes
+// CHECKSTYLE IGNORE: FinalClass
+public class CompressionMetadata extends AbstractCompressionMetadata
+{
+
+ private final CompressionParams parameters;
+ private final Memory chunkOffsetsMem;
+
+ private CompressionMetadata(long dataLength, Memory chunkOffsetsMem, BigLongArray chunkOffsets, CompressionParams parameters)
+ {
+ super(dataLength, chunkOffsets);
+ this.chunkOffsetsMem = chunkOffsetsMem;
+ this.parameters = parameters;
+ }
+
+ static CompressionMetadata fromInputStream(InputStream inStream, boolean hasCompressedLength) throws IOException
+ {
+ long dataLength;
+ BigLongArray chunkOffsets;
+
+ DataInputStream inData = new DataInputStream(inStream);
+
+ String compressorName = inData.readUTF();
+ int optionCount = inData.readInt();
+ Map<String, String> options = new HashMap<>(optionCount);
+ for (int option = 0; option < optionCount; ++option)
+ {
+ options.put(inData.readUTF(), inData.readUTF());
+ }
+
+ int chunkLength = inData.readInt();
+ int minCompressRatio = 2147483647;
+ if (hasCompressedLength)
+ {
+ minCompressRatio = inData.readInt();
+ }
+
+ CompressionParams params = new CompressionParams(compressorName, chunkLength, minCompressRatio, options);
+ // TODO(c4c5): CRC check change gone?
+ // params.setCrcCheckChance(AbstractCompressionMetadata.CRC_CHECK_CHANCE);
Review Comment:
See https://issues.apache.org/jira/browse/CASSANDRA-18872; the CRC check chance has been moved to the table params.
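A minimal sketch of where the value lives now, assuming the reader has the table schema at hand (the accessor path mirrors Cassandra's TableParams but should be verified against the ticket):

    import org.apache.cassandra.schema.TableMetadata;

    // Read the chance from the table schema instead of setting it on CompressionParams
    static double crcCheckChance(TableMetadata metadata)
    {
        return metadata.params.crcCheckChance;
    }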
##########
cassandra-five-zero-bridge/src/main/java/org/apache/cassandra/spark/reader/BtiIndexReader.java:
##########
@@ -0,0 +1,99 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.cassandra.spark.reader;
+
+import java.io.InputStream;
+import java.math.BigInteger;
+
+import org.apache.cassandra.analytics.stats.Stats;
+import org.apache.cassandra.bridge.TokenRange;
+import org.apache.cassandra.io.sstable.Descriptor;
+import org.apache.cassandra.io.sstable.format.bti.BtiReaderUtils;
+import org.apache.cassandra.io.util.File;
+import org.apache.cassandra.schema.TableMetadata;
+import org.apache.cassandra.spark.data.FileType;
+import org.apache.cassandra.spark.data.IncompleteSSTableException;
+import org.apache.cassandra.spark.data.SSTable;
+import org.apache.cassandra.spark.reader.common.IIndexReader;
+import org.apache.cassandra.spark.sparksql.filters.SparkRangeFilter;
+import org.jetbrains.annotations.NotNull;
+import org.jetbrains.annotations.Nullable;
+
+public class BtiIndexReader implements IIndexReader
+{
+ private TokenRange ssTableRange = null;
Review Comment:
I think it is always `null`. We want to initialize the range value from Partitions.db, which records the first and last token of the SSTable.
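Roughly what the initialization could look like once the first/last tokens have been read from Partitions.db (the `TokenRange.closed` factory and how the tokens are obtained are assumptions):

    import java.math.BigInteger;
    import org.apache.cassandra.bridge.TokenRange;

    // Build the SSTable's range from the first and last tokens recorded in the partitions index
    static TokenRange rangeFromPartitionsIndex(BigInteger firstToken, BigInteger lastToken)
    {
        return TokenRange.closed(firstToken, lastToken);
    }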
##########
cassandra-analytics-common/src/main/java/org/apache/cassandra/spark/utils/streaming/BufferingInputStream.java:
##########
@@ -100,6 +100,21 @@ public BufferingInputStream(CassandraFileSource<T> source, BufferingInputStreamS
this.stats = stats;
}
+ // Cassandra 4.x vs 5.x START
+ public BufferingInputStream(CassandraFileSource<T> source, BufferingInputStreamStats<T> stats, long position)
+ {
+ this(source, stats);
+ this.rangeStart = position;
+ this.bytesRead = position == 0 ? 0 : (position + 1);
+ this.bytesWritten.set(this.bytesRead);
+ }
+
+ public BufferingInputStream<T> reBuffer(long position)
+ {
+ return new BufferingInputStream<>(source, stats, position);
+ }
Review Comment:
Yep, it is needed for random file access for BTI.
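A usage sketch of the new constructor/`reBuffer` pair from the diff above; `source`, `stats`, `trieNodeOffset`, and the `SSTable` type argument are placeholders:

    // Random access is emulated by replacing the stream with a fresh one
    // that starts buffering from the requested offset.
    BufferingInputStream<SSTable> in = new BufferingInputStream<>(source, stats);
    in = in.reBuffer(trieNodeOffset); // subsequent reads start at trieNodeOffset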
##########
cassandra-analytics-common/src/main/java/org/apache/cassandra/spark/data/FileSystemSource.java:
##########
@@ -90,28 +90,30 @@ public void request(long start, long end, StreamConsumer consumer)
// Start-end range is inclusive but on the final request end == length so we need to exclude
int increment = close ? 0 : 1;
byte[] bytes = new byte[(int) (end - start + increment)];
- if (file.getChannel().read(ByteBuffer.wrap(bytes), start) >= 0)
+ int read = file.getChannel().read(ByteBuffer.wrap(bytes), start);
+ if (read >= 0)
{
consumer.onRead(StreamBuffer.wrap(bytes));
consumer.onEnd();
}
- else
- {
- close = true;
- }
+ // TODO(c4c5): Make configurable, so that remote file access does not accidentally close file.
Review Comment:
Yes, this TODO makes sense. The prior behavior only read the file sequentially, but the partitions trie requires random access, and we do not want to close the file prematurely.
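One possible shape for the TODO, sketched as a hypothetical helper (neither the method nor the flag exists yet):

    // Only treat a short read as end-of-stream when the source is consumed
    // sequentially; random-access readers (e.g. the BTI partitions trie) may
    // legitimately come back for earlier offsets after reading past EOF.
    static boolean shouldCloseAfterRead(int bytesRead, boolean sequentialAccess)
    {
        return bytesRead < 0 && sequentialAccess;
    }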
##########
cassandra-analytics-common/src/main/java/org/apache/cassandra/spark/data/SSTable.java:
##########
@@ -63,7 +63,13 @@ public InputStream openSummaryStream()
@Nullable
public InputStream openPrimaryIndexStream()
{
- return openInputStream(FileType.INDEX);
+ return openInputStream(isBigFormat() ? FileType.INDEX : FileType.PARTITIONS_INDEX); // Cassandra 4.x vs 5.x
+ }
+
+ @Nullable
+ public InputStream openSecondaryIndexStream()
Review Comment:
How about just calling it `openBtiRowIndexStream` to avoid any confusion with the secondary index feature in Cassandra?
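Roughly the suggested shape; `FileType.ROW_INDEX` is an assumed name for whatever enum constant maps to the BTI Rows.db component:

    @Nullable
    public InputStream openBtiRowIndexStream()
    {
        return openInputStream(FileType.ROW_INDEX); // ROW_INDEX name is assumed
    }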
##########
cassandra-analytics-common/src/main/java/org/apache/cassandra/spark/utils/streaming/BufferingInputStream.java:
##########
@@ -346,24 +361,28 @@ else if (count <= bytesBuffered())
* @param buffer the ByteBuffer
* @throws EOFException if attempts to read beyond the end of the file
* @throws IOException for failure during I/O
+ * @return number of bytes read
*/
- public void read(ByteBuffer buffer) throws IOException
+ public int read(ByteBuffer buffer) throws IOException
{
+ int read = 0; // Cassandra 4.x vs 5.x
Review Comment:
Not sure I understand the comment. Is it only a temporary marker used during development?
##########
cassandra-analytics-common/src/main/java/org/apache/cassandra/spark/utils/streaming/BufferingInputStream.java:
##########
@@ -346,24 +361,28 @@ else if (count <= bytesBuffered())
* @param buffer the ByteBuffer
* @throws EOFException if attempts to read beyond the end of the file
* @throws IOException for failure during I/O
+ * @return number of bytes read
*/
- public void read(ByteBuffer buffer) throws IOException
+ public int read(ByteBuffer buffer) throws IOException
{
+ int read = 0; // Cassandra 4.x vs 5.x
for (int remainingLength = buffer.remaining(); 0 < remainingLength; remainingLength = buffer.remaining())
{
if (checkState() < 0)
{
- throw new EOFException();
+ break;
Review Comment:
Should it still signal EOF (e.g. throw `EOFException`) when it has reached the end of the file?
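For comparison, a self-contained sketch of the `ReadableByteChannel`-style convention, where EOF is signalled with -1 only when nothing was read (method name and the use of `InputStream` are illustrative):

    import java.io.IOException;
    import java.io.InputStream;
    import java.nio.ByteBuffer;

    // Return -1 at end-of-stream only if no bytes were copied; a partial
    // read still returns its count, so callers can distinguish the cases.
    static int readInto(InputStream in, ByteBuffer buffer) throws IOException
    {
        int total = 0;
        while (buffer.hasRemaining())
        {
            int b = in.read();
            if (b < 0)
            {
                return total == 0 ? -1 : total;
            }
            buffer.put((byte) b);
            total++;
        }
        return total;
    }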
##########
cassandra-analytics-integration-framework/build.gradle:
##########
@@ -38,7 +38,9 @@ publishing {
}
}
-ext.dtestJar = System.getenv("DTEST_JAR") ?: "dtest-4.1.4.jar" // latest supported Cassandra build is 4.1
+// ext.dtestJar = System.getenv("DTEST_JAR") ?: "dtest-4.1.4.jar" // latest supported Cassandra build is 4.1
+// TODO(c4c5): Make configurable?
Review Comment:
It is already configurable via the environment variable, right?
##########
cassandra-five-zero-bridge/src/main/java/org/apache/cassandra/spark/reader/BtiIndexReader.java:
##########
@@ -0,0 +1,99 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.cassandra.spark.reader;
+
+import java.io.InputStream;
+import java.math.BigInteger;
+
+import org.apache.cassandra.analytics.stats.Stats;
+import org.apache.cassandra.bridge.TokenRange;
+import org.apache.cassandra.io.sstable.Descriptor;
+import org.apache.cassandra.io.sstable.format.bti.BtiReaderUtils;
+import org.apache.cassandra.io.util.File;
+import org.apache.cassandra.schema.TableMetadata;
+import org.apache.cassandra.spark.data.FileType;
+import org.apache.cassandra.spark.data.IncompleteSSTableException;
+import org.apache.cassandra.spark.data.SSTable;
+import org.apache.cassandra.spark.reader.common.IIndexReader;
+import org.apache.cassandra.spark.sparksql.filters.SparkRangeFilter;
+import org.jetbrains.annotations.NotNull;
+import org.jetbrains.annotations.Nullable;
+
+public class BtiIndexReader implements IIndexReader
+{
+ private TokenRange ssTableRange = null;
+
+ public BtiIndexReader(@NotNull SSTable ssTable,
Review Comment:
According to [the spec for partition index and row index](https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/io/sstable/format/bti/BtiFormat.md#partition-index), the node page size is fixed at 4096 bytes. You may want to hard-code the chunk size to match when opening the buffering input stream.
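A sketch of sizing reads to that page, assuming a helper like this near the reader (all names are illustrative):

    // Per the linked BtiFormat spec, trie nodes are laid out in 4096-byte
    // pages; aligning buffered reads to the page keeps a node from
    // straddling two fetches.
    static final int BTI_PAGE_SIZE = 4096;

    static long pageStart(long offset)
    {
        return offset & ~(long) (BTI_PAGE_SIZE - 1); // align offset down to its page boundary
    }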