Thanks for all the useful responses.

We have the usual task of mining a stream of events coming from our many
users.  We need to store these events, and process them.  We use a standard
Star Schema to represent our data.

For the moment, it looks like we should store these events in SQL.  When
appropriate, we will do analysis with relational queries.  Or, when
appropriate, we will extract data into working sets in Spark.
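A minimal sketch of the relational side of this plan, using SQLite as a
stand-in for whatever SQL store is actually chosen (all table and column
names here are invented for illustration):

```python
import sqlite3

# Hypothetical star schema: a fact table (events) and one dimension (users).
# These names are made up; they are not from the actual schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE events (event_id INTEGER, user_id INTEGER, action TEXT);
    CREATE TABLE users  (user_id INTEGER, name TEXT);
    INSERT INTO events VALUES (1, 100, 'click'), (2, 101, 'view');
    INSERT INTO users  VALUES (100, 'alice'), (101, 'bob');
""")

# A relational query joining fact to dimension.  The fetched rows are the
# kind of "working set" that would then be handed to Spark for processing.
working_set = conn.execute("""
    SELECT e.event_id, u.name, e.action
    FROM events e
    JOIN users u ON u.user_id = e.user_id
    ORDER BY e.event_id
""").fetchall()

print(working_set)
conn.close()
```

On the Spark side, the same working set could be loaded with a JDBC read
against the relational store, or parallelized directly from the fetched
rows, before any MLlib processing.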

I imagine this is a pretty common use case for Spark.

On Sun, Oct 26, 2014 at 10:05 AM, Rick Richardson <rick.richard...@gmail.com
> wrote:

> Spark's API definitely covers all of the things that a relational database
> can do. It will probably outperform a relational star schema if all of your
> *working* data set can fit into RAM on your cluster. It will still perform
> quite well if most of the data fits and some has to spill over to disk.
>
> What are your requirements exactly?
> What is "massive amounts of data" exactly?
> How big is your cluster?
>
> Note that Spark is not for data storage, only data analysis. It pulls data
> into working data sets called RDDs (resilient distributed datasets).
>
> As a migration path, you could probably pull the data out of a relational
> database to analyze. But in the long run, I would recommend a more
> purpose-built, large-scale storage database such as Cassandra. If your
> data is very static, you could also just store it in files.
>  On Oct 26, 2014 9:19 AM, "Peter Wolf" <opus...@gmail.com> wrote:
>
>> My understanding is that Spark SQL allows one to access Spark data as if
>> it were stored in a relational database.  It compiles SQL queries into a
>> series of calls to the Spark API.
>>
>> I need the performance of a SQL database, but I don't care about doing
>> queries with SQL.
>>
>> I create the input to MLlib by doing a massive JOIN query.  So, I am
>> creating a single collection by combining many collections.  This sort of
>> operation is very inefficient in Mongo, Cassandra or HDFS.
>>
>> I could store my data in a relational database, and copy the query
>> results to Spark for processing.  However, I was hoping I could keep
>> everything in Spark.
>>
>> On Sat, Oct 25, 2014 at 11:34 PM, Soumya Simanta <
>> soumya.sima...@gmail.com> wrote:
>>
>>> 1. What data store do you want to store your data in ? HDFS, HBase,
>>> Cassandra, S3 or something else?
>>> 2. Have you looked at SparkSQL (https://spark.apache.org/sql/)?
>>>
>>> One option is to process the data in Spark and then store it in the
>>> relational database of your choice.
>>>
>>>
>>>
>>>
>>> On Sat, Oct 25, 2014 at 11:18 PM, Peter Wolf <opus...@gmail.com> wrote:
>>>
>>>> Hello all,
>>>>
>>>> We are considering Spark for our organization.  It is obviously a
>>>> superb platform for processing massive amounts of data... how about
>>>> retrieving it?
>>>>
>>>> We are currently storing our data in a relational database in a star
>>>> schema.  Retrieving our data requires doing many complicated joins across
>>>> many tables.
>>>>
>>>> Can we use Spark as a relational database?  Or, if not, can we put
>>>> Spark on top of a relational database?
>>>>
>>>> Note that we don't care about SQL.  Accessing our data via standard
>>>> queries is nice, but we are equally happy (or happier) to write Scala
>>>> code.
>>>>
>>>> What is important to us is doing relational queries on huge amounts of
>>>> data.  Is Spark good at this?
>>>>
>>>> Thank you very much in advance
>>>> Peter
>>>>
>>>
>>>
>>
