Hi,
It is very easy to integrate Cassandra in a use case such as this. For 
instance, do your joins in Spark and your data storage in Cassandra, which 
allows a very flexible schema (unlike a relational DB) and is fast and fault 
tolerant. With Spark colocated with the Cassandra nodes, reads also benefit 
from data locality and can be dramatically faster.
If you use the Spark Cassandra Connector, reading and writing to Cassandra is 
as simple as:

write - DStream or RDD:
stream.map(RawData(_)).saveToCassandra(keyspace, table)

read - SparkContext or StreamingContext:
ssc.cassandraTable[Double](keyspace, dailytable)
  .select("precipitation")
  .where("weather_station = ? AND year = ?", wsid, year)
  .map(doWork)
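
A slightly fuller, self-contained sketch (the keyspace, table, and RawData 
fields are invented for illustration; point spark.cassandra.connection.host 
at one of your Cassandra nodes):

import com.datastax.spark.connector._
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical case class; fields map to the table's columns
// (weatherStation -> weather_station, and so on).
case class RawData(weatherStation: String, year: Int, precipitation: Double)

val conf = new SparkConf()
  .setAppName("cassandra-example")
  .set("spark.cassandra.connection.host", "127.0.0.1")
val sc = new SparkContext(conf)

// write: an RDD of RawData persists straight to the table
sc.parallelize(Seq(RawData("ws-1", 2014, 1.2)))
  .saveToCassandra("weather", "raw_data")

// read: pull one column back as Doubles, filtering server-side
val precip = sc.cassandraTable[Double]("weather", "daily")
  .select("precipitation")
  .where("weather_station = ?", "ws-1")
precip.collect().foreach(println)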

In your build:
"com.datastax.spark"  %% "spark-cassandra-connector"          % 
"1.1.0-alpha4”// our 1.1.0 is for spark 1.1
 
https://github.com/datastax/spark-cassandra-connector
docs: https://github.com/datastax/spark-cassandra-connector/tree/master/doc

- Helena
twitter: @helenaedelson

On Oct 26, 2014, at 10:05 AM, Rick Richardson <rick.richard...@gmail.com> wrote:

> Spark's API definitely covers all of the things that a relational database 
> can do. It will probably outperform a relational star schema if your entire 
> *working* data set can fit into RAM on your cluster, and it will still 
> perform quite well if most of the data fits and some has to spill over to 
> disk.
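> For example, a minimal sketch of the spill-over case (the path is invented; 
> MEMORY_AND_DISK keeps what fits in RAM and spills the rest to local disk 
> rather than recomputing it):
> 
>   import org.apache.spark.storage.StorageLevel
> 
>   val facts = sc.textFile("hdfs:///facts")
>     .persist(StorageLevel.MEMORY_AND_DISK)
>   facts.count() // the first action materializes the cache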
> 
> What are your requirements, exactly?
> How much data is "massive amounts of data", exactly?
> How big is your cluster?
> 
> Note that Spark is not for data storage, only data analysis. It pulls data 
> into working data sets called RDDs (Resilient Distributed Datasets).
> 
> As a migration path, you could probably pull the data out of a relational 
> database to analyze. But in the long run, I would recommend a purpose-built, 
> large-scale storage database such as Cassandra. If your data is very static, 
> you could also just store it in files.
> On Oct 26, 2014 9:19 AM, "Peter Wolf" <opus...@gmail.com> wrote:
> My understanding is that Spark SQL allows one to access Spark data as if it 
> were stored in a relational database. It compiles SQL queries into a series 
> of calls to the Spark API.
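> For instance, a minimal Spark SQL sketch against the 1.1 API (the Sale 
> class and query are invented):
> 
>   import org.apache.spark.sql.SQLContext
> 
>   case class Sale(storeId: Int, amount: Double)
> 
>   val sqlContext = new SQLContext(sc)
>   import sqlContext.createSchemaRDD // implicit RDD[Sale] -> SchemaRDD
> 
>   sc.parallelize(Seq(Sale(1, 9.99), Sale(1, 4.50)))
>     .registerTempTable("sales")
>   val totals = sqlContext.sql(
>     "SELECT storeId, SUM(amount) FROM sales GROUP BY storeId")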
> 
> I need the performance of a SQL database, but I don't care about doing 
> queries with SQL.
> 
> I create the input to MLlib by doing a massive JOIN query. So, I am creating 
> a single collection by combining many collections. This sort of operation is 
> very inefficient in Mongo, Cassandra, or HDFS.
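> As a sketch, that join-then-train pipeline can stay entirely in Spark (ids 
> and values invented; the join runs in Spark and the result feeds MLlib 
> directly):
> 
>   import org.apache.spark.mllib.linalg.Vectors
>   import org.apache.spark.mllib.regression.LabeledPoint
> 
>   val labels   = sc.parallelize(Seq((1L, 0.0), (2L, 1.0)))
>   val features = sc.parallelize(Seq((1L, Array(0.5, 1.5)),
>                                     (2L, Array(2.0, 0.1))))
> 
>   // join by key in Spark, then hand the result to MLlib
>   val training = labels.join(features).map { case (_, (label, fs)) =>
>     LabeledPoint(label, Vectors.dense(fs))
>   }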
> 
> I could store my data in a relational database, and copy the query results to 
> Spark for processing.  However, I was hoping I could keep everything in Spark.
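> If you do keep the relational store, JdbcRDD is one way to pull query 
> results straight into Spark (URL, table, and bounds invented; the SQL must 
> keep the two ? placeholders that JdbcRDD uses to partition the key range):
> 
>   import java.sql.DriverManager
>   import org.apache.spark.rdd.JdbcRDD
> 
>   val rows = new JdbcRDD(
>     sc,
>     () => DriverManager.getConnection("jdbc:postgresql://db/warehouse"),
>     "SELECT id, amount FROM facts WHERE id >= ? AND id <= ?",
>     lowerBound = 1L, upperBound = 1000000L, numPartitions = 10,
>     mapRow = rs => (rs.getLong("id"), rs.getDouble("amount")))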
> 
> On Sat, Oct 25, 2014 at 11:34 PM, Soumya Simanta <soumya.sima...@gmail.com> 
> wrote:
> 1. What data store do you want to store your data in? HDFS, HBase, 
> Cassandra, S3, or something else?
> 2. Have you looked at SparkSQL (https://spark.apache.org/sql/)? 
> 
> One option is to process the data in Spark and then store it in the 
> relational database of your choice.
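> Writing back out is just as direct; a sketch with plain JDBC (the results 
> RDD is a hypothetical RDD[(Long, Double)]; table and URL invented, one 
> connection per partition):
> 
>   import java.sql.DriverManager
> 
>   results.foreachPartition { part =>
>     val conn = DriverManager.getConnection("jdbc:postgresql://db/warehouse")
>     val stmt = conn.prepareStatement("INSERT INTO totals VALUES (?, ?)")
>     part.foreach { case (id, amount) =>
>       stmt.setLong(1, id)
>       stmt.setDouble(2, amount)
>       stmt.executeUpdate()
>     }
>     conn.close()
>   }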
> 
> 
> 
> 
> On Sat, Oct 25, 2014 at 11:18 PM, Peter Wolf <opus...@gmail.com> wrote:
> Hello all,
> 
> We are considering Spark for our organization.  It is obviously a superb 
> platform for processing massive amounts of data... how about retrieving it?
> 
> We are currently storing our data in a relational database in a star schema.  
> Retrieving our data requires doing many complicated joins across many tables.
> 
> Can we use Spark as a relational database?  Or, if not, can we put Spark on 
> top of a relational database?
> 
> Note that we don't care about SQL.  Accessing our data via standard queries 
> is nice, but we are equally happy (or more happy) to write Scala code. 
> 
> What is important to us is doing relational queries on huge amounts of data.  
> Is Spark good at this?
> 
> Thank you very much in advance
> Peter
> 
> 
