[jira] [Comment Edited] (CASSANDRA-16222) Spark-Cassandra Bulk Reader

James Berragan (Jira) Fri, 23 Oct 2020 16:27:08 -0700


    [ 
https://issues.apache.org/jira/browse/CASSANDRA-16222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17219906#comment-17219906
 ]


James Berragan edited comment on CASSANDRA-16222 at 10/23/20, 11:26 PM:
------------------------------------------------------------------------

[~doanduyhai] adding support for v-nodes is not complicated and contributions 
are welcome :).

The token stuff in PartitionedDataLayer is largely about partitioning the Spark 
workers to evenly distribute across the token ring so the Spark job can scale 
linearly. I haven't tried but I think an implementation of the 
PartitionedDataLayer could support v-nodes without changing the code codebase.

But, as I said, if any changes are required to make it easier, happy to accept 
pull requests and contributions.


was (Author: jberragan):
[~doanduyhai] adding support for v-nodes is not complicated and contributions 
are welcome :).

 

> Spark-Cassandra Bulk Reader
> ---------------------------
>
>                 Key: CASSANDRA-16222
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-16222
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Tool/external
>            Reporter: James Berragan
>            Priority: Normal
>             Fix For: NA
>
>         Attachments: sparkbulkreader.patch
>
>
> *Description:*
>  This ticket introduces the Spark-Cassandra Bulk Reader: a Spark library able 
> to read and compact Cassandra raw sstables into SparkSQL along the principles 
> of streaming compaction. The full code is attached as a patch file and will 
> be submitted to a GitHub repo.
> *Motivation*
>  For analytics or "select *" use cases at scale, the performance is 
> prohibitively expensive to read via the normal CQL read path - either using 
> the Java driver or the Open Source Spark connector. By reading the raw 
> sstables, the bulk reader is able to read with near-zero impact to a 
> production cluster at speeds many orders of magnitudes faster than 
> alternatives. We have seen very good performance results, exporting a 32TB 
> table (~46bn CQL rows) to HDFS in Parquet format in around 1h10m; a 20x 
> reduction compared to previous solutions. By reading from multiple replicas 
> and ‘compacting’ duplicate data together, it can achieve consistency at a 
> user defined level (i.e. ONE, TWO, LOCAL_QUORUM etc). 
> *Overview*
>  This library provides the core code for reading a set of SSTables into 
> SparkSQL through a DataLayer abstraction. The role of the DataLayer is to:
>  * return a SchemaStruct, mapping the Cassandra CQL table schema to the 
> SparkSQL schema.
>  * a list of sstables available for reading.
>  * a method to open an InputStream for any file component of an sstable (e.g. 
> data, compression, summary etc).
> The PartitionedDataLayer abstraction builds on the DataLayer interface for 
> partitioning Spark workers across a Cassandra token ring - allowing the Spark 
> job to scale linearly - and reading from sufficient Cassandra replicas to 
> achieve a user-specified consistency level.
> A simple example LocalDataLayer implementation is included for reading from a 
> local file system. Users of the library can build their own implementations 
> to read from wherever they wish e.g. reading from a backup in a cloud storage 
> system, or reading from the snapshot directory of a live Cassandra cluster.
>   
>  At the core, the bulk reader uses the Apache Cassandra CompactionIterator to 
> perform the streaming compaction. As it iterates through the 
> CompactionIterator it deserializes the ByteBuffers, converts into the 
> appropriate SparkSQL data type and pivots each cell into a SparkSQL row.
>   
>  Supporting this library is a robust property-based test framework for 
> writing Cassandra sstables with arbitrary schemas using the CQLSSTableWriter, 
> and reading back through SparkSQL to verify the library achieves both 
> consistency and correctness.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org

[jira] [Comment Edited] (CASSANDRA-16222) Spark-Cassandra Bulk Reader

Reply via email to