liucao-dd commented on PR #213: URL: https://github.com/apache/cassandra-analytics/pull/213#issuecomment-4584108377
Good point. I checked Spark's `Partitioning` contract and several maintained Spark V2 connectors (Iceberg, Paimon, ClickHouse, StarRocks, YDB, Lance). The common pattern is to instantiate `UnknownPartitioning` directly when the scan cannot guarantee keyed grouping, and to reserve `KeyGroupedPartitioning` for cases where the connector can prove that rows are grouped by the reported key expressions.

That applies here: Cassandra analytics input partitions are token ranges, so a single Spark partition can contain many Cassandra partition keys/token values. We should not claim `KeyGroupedPartitioning`, and a Cassandra-specific subclass adds nothing over Spark's public `UnknownPartitioning`.

Updated: removed `CassandraPartitioning`, and `CassandraScanBuilder.outputPartitioning()` now returns `new UnknownPartitioning(dataLayer.partitionCount())` directly. The unit test still asserts that the reported partitioning is `UnknownPartitioning` with the correct partition count.
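To illustrate why token-range input partitions cannot be reported as key-grouped, here is a minimal, self-contained sketch. It does not use Spark or Cassandra classes: the class `TokenRangeGroupingSketch`, the toy token space of 100 values, the hash-based `token` function, and the `isKeyGrouped` check are all hypothetical stand-ins. The check assumes the key-grouped guarantee means every row in one partition evaluates to the same key value, which is my reading of the contract rather than a quote from it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

public class TokenRangeGroupingSketch {
    // Toy token space [0, 100). Assumption for illustration only;
    // a real cluster uses the partitioner's full token range.
    static final long TOKEN_SPACE = 100L;

    // Hypothetical stand-in for the partitioner: maps a partition key to a token.
    static long token(String partitionKey) {
        return Math.floorMod((long) partitionKey.hashCode(), TOKEN_SPACE);
    }

    // Splits the token space into numRanges contiguous ranges (one per input
    // partition) and records which partition keys land in each range.
    static List<Set<String>> assignToRanges(List<String> keys, int numRanges) {
        long width = Math.max(1L, TOKEN_SPACE / numRanges);
        List<Set<String>> partitions = new ArrayList<>();
        for (int i = 0; i < numRanges; i++) {
            partitions.add(new TreeSet<>());
        }
        for (String key : keys) {
            // Clamp so leftover tokens fall into the last range.
            int index = (int) Math.min(token(key) / width, numRanges - 1L);
            partitions.get(index).add(key);
        }
        return partitions;
    }

    // Assumed reading of the key-grouped guarantee: no partition may hold
    // more than one distinct key value.
    static boolean isKeyGrouped(List<Set<String>> partitions) {
        for (Set<String> keys : partitions) {
            if (keys.size() > 1) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<String> keys = List.of("a", "b", "c", "d");
        List<Set<String>> partitions = assignToRanges(keys, 2);
        System.out.println("keys per token range: " + partitions);
        System.out.println("key-grouped: " + isKeyGrouped(partitions));
    }
}
```

With two token ranges, the keys `a`, `b`, and `c` fall into the same range, so the layout fails the key-grouped check. That is the situation in which reporting only a partition count, as `UnknownPartitioning` does, is the honest answer.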
