[ https://issues.apache.org/jira/browse/CASSANDRA-16222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17234082#comment-17234082 ]
Erick Ramirez commented on CASSANDRA-16222: ------------------------------------------- [~jberragan] really nice work! Can we interest you in drafting a blog post for the site that we can feature in a future newsletter? (y) > Spark-Cassandra Bulk Reader > --------------------------- > > Key: CASSANDRA-16222 > URL: https://issues.apache.org/jira/browse/CASSANDRA-16222 > Project: Cassandra > Issue Type: New Feature > Components: Tool/external > Reporter: James Berragan > Priority: Normal > Fix For: NA > > Attachments: sparkbulkreader.patch > > > *Description:* > This ticket introduces the Spark-Cassandra Bulk Reader: a Spark library able > to read and compact Cassandra raw sstables into SparkSQL along the principles > of streaming compaction. The full code is attached as a patch file and will > be submitted to a GitHub repo. > *Motivation* > For analytics or "select *" use cases at scale, the performance is > prohibitively expensive to read via the normal CQL read path - either using > the Java driver or the Open Source Spark connector. By reading the raw > sstables, the bulk reader is able to read with near-zero impact to a > production cluster at speeds many orders of magnitudes faster than > alternatives. We have seen very good performance results, exporting a 32TB > table (~46bn CQL rows) to HDFS in Parquet format in around 1h10m; a 20x > reduction compared to previous solutions. By reading from multiple replicas > and ‘compacting’ duplicate data together, it can achieve consistency at a > user defined level (i.e. ONE, TWO, LOCAL_QUORUM etc). > *Overview* > This library provides the core code for reading a set of SSTables into > SparkSQL through a DataLayer abstraction. The role of the DataLayer is to: > * return a SchemaStruct, mapping the Cassandra CQL table schema to the > SparkSQL schema. > * a list of sstables available for reading. > * a method to open an InputStream for any file component of an sstable (e.g. > data, compression, summary etc). > The PartitionedDataLayer abstraction builds on the DataLayer interface for > partitioning Spark workers across a Cassandra token ring - allowing the Spark > job to scale linearly - and reading from sufficient Cassandra replicas to > achieve a user-specified consistency level. > A simple example LocalDataLayer implementation is included for reading from a > local file system. Users of the library can build their own implementations > to read from wherever they wish e.g. reading from a backup in a cloud storage > system, or reading from the snapshot directory of a live Cassandra cluster. > > At the core, the bulk reader uses the Apache Cassandra CompactionIterator to > perform the streaming compaction. As it iterates through the > CompactionIterator it deserializes the ByteBuffers, converts into the > appropriate SparkSQL data type and pivots each cell into a SparkSQL row. > > Supporting this library is a robust property-based test framework for > writing Cassandra sstables with arbitrary schemas using the CQLSSTableWriter, > and reading back through SparkSQL to verify the library achieves both > consistency and correctness. -- This message was sent by Atlassian Jira (v8.3.4#803005) --------------------------------------------------------------------- To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org