[ 
https://issues.apache.org/jira/browse/CASSANDRA-16222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17234082#comment-17234082
 ] 

Erick Ramirez commented on CASSANDRA-16222:
-------------------------------------------

[~jberragan] really nice work! Can we interest you in drafting a blog post for 
the site that we can feature in a future newsletter? (y)

> Spark-Cassandra Bulk Reader
> ---------------------------
>
>                 Key: CASSANDRA-16222
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-16222
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Tool/external
>            Reporter: James Berragan
>            Priority: Normal
>             Fix For: NA
>
>         Attachments: sparkbulkreader.patch
>
>
> *Description:*
>  This ticket introduces the Spark-Cassandra Bulk Reader: a Spark library able 
> to read and compact Cassandra raw sstables into SparkSQL along the principles 
> of streaming compaction. The full code is attached as a patch file and will 
> be submitted to a GitHub repo.
> *Motivation*
>  For analytics or "select *" use cases at scale, the performance is 
> prohibitively expensive to read via the normal CQL read path - either using 
> the Java driver or the Open Source Spark connector. By reading the raw 
> sstables, the bulk reader is able to read with near-zero impact to a 
> production cluster at speeds many orders of magnitudes faster than 
> alternatives. We have seen very good performance results, exporting a 32TB 
> table (~46bn CQL rows) to HDFS in Parquet format in around 1h10m; a 20x 
> reduction compared to previous solutions. By reading from multiple replicas 
> and ‘compacting’ duplicate data together, it can achieve consistency at a 
> user defined level (i.e. ONE, TWO, LOCAL_QUORUM etc). 
> *Overview*
>  This library provides the core code for reading a set of SSTables into 
> SparkSQL through a DataLayer abstraction. The role of the DataLayer is to:
>  * return a SchemaStruct, mapping the Cassandra CQL table schema to the 
> SparkSQL schema.
>  * a list of sstables available for reading.
>  * a method to open an InputStream for any file component of an sstable (e.g. 
> data, compression, summary etc).
> The PartitionedDataLayer abstraction builds on the DataLayer interface for 
> partitioning Spark workers across a Cassandra token ring - allowing the Spark 
> job to scale linearly - and reading from sufficient Cassandra replicas to 
> achieve a user-specified consistency level.
> A simple example LocalDataLayer implementation is included for reading from a 
> local file system. Users of the library can build their own implementations 
> to read from wherever they wish e.g. reading from a backup in a cloud storage 
> system, or reading from the snapshot directory of a live Cassandra cluster.
>   
>  At the core, the bulk reader uses the Apache Cassandra CompactionIterator to 
> perform the streaming compaction. As it iterates through the 
> CompactionIterator it deserializes the ByteBuffers, converts into the 
> appropriate SparkSQL data type and pivots each cell into a SparkSQL row.
>   
>  Supporting this library is a robust property-based test framework for 
> writing Cassandra sstables with arbitrary schemas using the CQLSSTableWriter, 
> and reading back through SparkSQL to verify the library achieves both 
> consistency and correctness.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org

Reply via email to