[ https://issues.apache.org/jira/browse/CASSANDRA-16222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17219906#comment-17219906 ]
James Berragan edited comment on CASSANDRA-16222 at 10/24/20, 5:44 PM: ----------------------------------------------------------------------- [~doanduyhai] adding support for v-nodes is not complicated and contributions are welcome :). The token stuff in PartitionedDataLayer is largely about partitioning the Spark workers to evenly distribute across the token ring so the Spark job can scale linearly. I haven't tried but I think an implementation of the PartitionedDataLayer could support v-nodes without changing the core codebase. But, as I said, if any changes are required to make it easier, happy to accept pull requests and contributions. was (Author: jberragan): [~doanduyhai] adding support for v-nodes is not complicated and contributions are welcome :). The token stuff in PartitionedDataLayer is largely about partitioning the Spark workers to evenly distribute across the token ring so the Spark job can scale linearly. I haven't tried but I think an implementation of the PartitionedDataLayer could support v-nodes without changing the code codebase. But, as I said, if any changes are required to make it easier, happy to accept pull requests and contributions. > Spark-Cassandra Bulk Reader > --------------------------- > > Key: CASSANDRA-16222 > URL: https://issues.apache.org/jira/browse/CASSANDRA-16222 > Project: Cassandra > Issue Type: New Feature > Components: Tool/external > Reporter: James Berragan > Priority: Normal > Fix For: NA > > Attachments: sparkbulkreader.patch > > > *Description:* > This ticket introduces the Spark-Cassandra Bulk Reader: a Spark library able > to read and compact Cassandra raw sstables into SparkSQL along the principles > of streaming compaction. The full code is attached as a patch file and will > be submitted to a GitHub repo. > *Motivation* > For analytics or "select *" use cases at scale, the performance is > prohibitively expensive to read via the normal CQL read path - either using > the Java driver or the Open Source Spark connector. By reading the raw > sstables, the bulk reader is able to read with near-zero impact to a > production cluster at speeds many orders of magnitudes faster than > alternatives. We have seen very good performance results, exporting a 32TB > table (~46bn CQL rows) to HDFS in Parquet format in around 1h10m; a 20x > reduction compared to previous solutions. By reading from multiple replicas > and ‘compacting’ duplicate data together, it can achieve consistency at a > user defined level (i.e. ONE, TWO, LOCAL_QUORUM etc). > *Overview* > This library provides the core code for reading a set of SSTables into > SparkSQL through a DataLayer abstraction. The role of the DataLayer is to: > * return a SchemaStruct, mapping the Cassandra CQL table schema to the > SparkSQL schema. > * a list of sstables available for reading. > * a method to open an InputStream for any file component of an sstable (e.g. > data, compression, summary etc). > The PartitionedDataLayer abstraction builds on the DataLayer interface for > partitioning Spark workers across a Cassandra token ring - allowing the Spark > job to scale linearly - and reading from sufficient Cassandra replicas to > achieve a user-specified consistency level. > A simple example LocalDataLayer implementation is included for reading from a > local file system. Users of the library can build their own implementations > to read from wherever they wish e.g. reading from a backup in a cloud storage > system, or reading from the snapshot directory of a live Cassandra cluster. > > At the core, the bulk reader uses the Apache Cassandra CompactionIterator to > perform the streaming compaction. As it iterates through the > CompactionIterator it deserializes the ByteBuffers, converts into the > appropriate SparkSQL data type and pivots each cell into a SparkSQL row. > > Supporting this library is a robust property-based test framework for > writing Cassandra sstables with arbitrary schemas using the CQLSSTableWriter, > and reading back through SparkSQL to verify the library achieves both > consistency and correctness. -- This message was sent by Atlassian Jira (v8.3.4#803005) --------------------------------------------------------------------- To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org