Great.  Thank you very much, Michael :-D

On Mon, Oct 27, 2014 at 2:03 PM, Michael Armbrust <mich...@databricks.com>
wrote:

> I'd suggest checking out the Spark SQL programming guide to answer this
> type of query:
> http://spark.apache.org/docs/latest/sql-programming-guide.html
>
> You could also perform it using the raw Spark RDD API
> <http://spark.apache.org/docs/1.1.0/api/scala/index.html#org.apache.spark.rdd.RDD>,
> but it's often the case that the in-memory columnar caching of Spark SQL is
> faster and more space-efficient.
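>
> For the two collections described below, a rough sketch against the 1.1
> APIs might look like this (untested; the schema and sample data are made
> up):
>
>   import org.apache.spark.sql.SQLContext
>
>   case class Rec(id: Long, value: String)   // made-up schema for A and B
>
>   val sqlContext = new SQLContext(sc)       // sc: an existing SparkContext
>   import sqlContext.createSchemaRDD         // implicit RDD -> SchemaRDD
>
>   val a = sc.parallelize(Seq(Rec(1L, "a1"), Rec(2L, "a2")))
>   val b = sc.parallelize(Seq(Rec(1L, "b1"), Rec(3L, "b3")))
>   a.registerTempTable("A")
>   b.registerTempTable("B")
>
>   // in-memory columnar caching
>   sqlContext.cacheTable("A")
>   sqlContext.cacheTable("B")
>
>   val joined = sqlContext.sql(
>     "SELECT A.id, A.value, B.value FROM A JOIN B ON A.id = B.id")
>   joined.collect().foreach(println)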
>
> On Mon, Oct 27, 2014 at 6:27 AM, Peter Wolf <opus...@gmail.com> wrote:
>
>> I agree.  I'd like to avoid SQL.
>>
>> If I could store everything in Cassandra or Mongo and process it in Spark,
>> that would be far preferable to creating a temporary working set.
>>
>> I'd like to write a performance test.  Let's say I have two large
>> collections, A and B.  Each collection has two columns, Id and Value, and
>> many, many rows.
>>
>> I want to create a third collection that is the equivalent of the SQL
>> query
>>
>> select A.Id, A.Value, B.Value from A join B on A.Id = B.Id
>>
>> This new collection is the inner join of A and B.  It has three columns
>> (A.Id, A.Value, and B.Value) and one row for each Id that A and B have in
>> common.
>>
>> Furthermore, this table is only needed temporarily as part of
>> processing.  It needs to be created efficiently and accessed quickly.
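>>
>> In raw Spark terms, I imagine something like the following (just a sketch
>> of what I have in mind, using pair RDDs keyed by Id; sample data made up):
>>
>>   import org.apache.spark.SparkContext._   // pair-RDD operations like join
>>   import org.apache.spark.rdd.RDD
>>
>>   val a: RDD[(Long, String)] = sc.parallelize(Seq((1L, "a1"), (2L, "a2")))
>>   val b: RDD[(Long, String)] = sc.parallelize(Seq((1L, "b1"), (3L, "b3")))
>>
>>   // inner join: one row per Id that A and B have in common
>>   val joined: RDD[(Long, (String, String))] = a.join(b)
>>
>>   joined.cache()   // keep the temporary working set in memory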
>>
>> Can someone give me a pointer to the appropriate API and/or example code?
>>
>> Thanks again
>> P
>>
>> On Mon, Oct 27, 2014 at 1:04 AM, Michael Hausenblas <
>> michael.hausenb...@gmail.com> wrote:
>>
>>>
>>> > Given that you are storing event data (which is basically things that
>>> have happened in the past AND cannot be modified), you should definitely
>>> look at Event Sourcing.
>>> > http://martinfowler.com/eaaDev/EventSourcing.html
>>>
>>>
>>> Agreed. In this context, a lesser-known fact is that the Lambda
>>> Architecture is, in a nutshell, an extension of Fowler’s ES, so you might
>>> also want to check out:
>>>
>>> https://speakerdeck.com/mhausenblas/lambda-architecture-with-apache-spark
>>>
>>>
>>> Cheers,
>>>                 Michael
>>>
>>> --
>>> Michael Hausenblas
>>> Ireland, Europe
>>> http://mhausenblas.info/
>>>
>>> > On 27 Oct 2014, at 01:14, Soumya Simanta <soumya.sima...@gmail.com>
>>> wrote:
>>> >
>>> > Given that you are storing event data (which is basically things that
>>> have happened in the past AND cannot be modified), you should definitely
>>> look at Event Sourcing.
>>> > http://martinfowler.com/eaaDev/EventSourcing.html
>>> >
>>> > If all you are doing is storing events, then I don't think you need a
>>> relational database. Rather, an event log is ideal. Please see
>>> http://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying
>>> >
>>> > There are many other datastores that can do a better job of storing
>>> your events. You can process your data and then store the results in a
>>> relational database to query later.
>>> >
>>> >
>>> >
>>> >
>>> >
>>> > On Sun, Oct 26, 2014 at 9:01 PM, Peter Wolf <opus...@gmail.com> wrote:
>>> > Thanks for all the useful responses.
>>> >
>>> > We have the usual task of mining a stream of events coming from our
>>> many users.  We need to store these events and process them.  We use a
>>> standard star schema to represent our data.
>>> >
>>> > For the moment, it looks like we should store these events in SQL.
>>> When appropriate, we will do analysis with relational queries; when
>>> appropriate, we will extract data into working sets in Spark.
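>>> >
>>> > For the extraction step, I assume we would pull rows out in parallel
>>> > with something like JdbcRDD (rough, untested sketch; the connection
>>> > string, table name, and key range are made up):
>>> >
>>> >   import java.sql.DriverManager
>>> >   import org.apache.spark.rdd.JdbcRDD
>>> >
>>> >   val events = new JdbcRDD(
>>> >     sc,
>>> >     () => DriverManager.getConnection("jdbc:postgresql://db/warehouse"),
>>> >     "SELECT id, value FROM fact_events WHERE id >= ? AND id <= ?",
>>> >     1L, 10000000L, 10,   // key bounds and number of partitions
>>> >     rs => (rs.getLong("id"), rs.getString("value")))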
>>> >
>>> > I imagine this is a pretty common use case for Spark.
>>> >
>>> > On Sun, Oct 26, 2014 at 10:05 AM, Rick Richardson <
>>> rick.richard...@gmail.com> wrote:
>>> > Spark's API definitely covers all of the things that a relational
>>> database can do. It will probably outperform a relational star schema if
>>> your entire *working* data set can fit into RAM on your cluster. It will
>>> still perform quite well if most of the data fits and some has to spill
>>> over to disk.
>>> >
>>> > What are your requirements, exactly?
>>> > What does "massive amounts of data" mean, exactly?
>>> > How big is your cluster?
>>> >
>>> > Note that Spark is not for data storage, only data analysis. It pulls
>>> data into working data sets called RDDs.
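>>> >
>>> > For example, you can tell Spark to keep what fits in RAM and spill the
>>> > rest to local disk (sketch; the input path is made up):
>>> >
>>> >   import org.apache.spark.storage.StorageLevel
>>> >
>>> >   val events = sc.textFile("hdfs://namenode/events")
>>> >     .persist(StorageLevel.MEMORY_AND_DISK)   // spill what doesn't fit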
>>> >
>>> > As a migration path, you could probably pull the data out of a
>>> relational database to analyze. But in the long run, I would recommend
>>> using a more purpose-built, large-scale storage database such as
>>> Cassandra. If your data is very static, you could also just store it in
>>> files.
>>> > On Oct 26, 2014 9:19 AM, "Peter Wolf" <opus...@gmail.com> wrote:
>>> > My understanding is that Spark SQL allows one to access Spark data as
>>> if it were stored in a relational database.  It compiles SQL queries into
>>> a series of calls to the Spark API.
>>> >
>>> > I need the performance of a SQL database, but I don't care about doing
>>> queries with SQL.
>>> >
>>> > I create the input to MLlib by doing a massive JOIN query.  So, I am
>>> creating a single collection by combining many collections.  This sort of
>>> operation is very inefficient in Mongo, Cassandra, or HDFS.
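>>> >
>>> > Roughly, what I am doing looks like this (simplified, untested sketch
>>> > with made-up shapes; the real query joins many more collections):
>>> >
>>> >   import org.apache.spark.SparkContext._   // pair-RDD join
>>> >   import org.apache.spark.mllib.linalg.Vectors
>>> >   import org.apache.spark.mllib.regression.LabeledPoint
>>> >
>>> >   // labels and feature arrays keyed by Id (sample data made up)
>>> >   val labels   = sc.parallelize(Seq((1L, 0.0), (2L, 1.0)))
>>> >   val features = sc.parallelize(Seq((1L, Array(0.1, 0.2)),
>>> >                                     (2L, Array(0.3, 0.4))))
>>> >
>>> >   // the JOIN, then shape the result into MLlib's input type
>>> >   val training = labels.join(features).map { case (_, (label, fs)) =>
>>> >     LabeledPoint(label, Vectors.dense(fs))
>>> >   }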
>>> >
>>> > I could store my data in a relational database and copy the query
>>> results to Spark for processing.  However, I was hoping I could keep
>>> everything in Spark.
>>> >
>>> > On Sat, Oct 25, 2014 at 11:34 PM, Soumya Simanta <
>>> soumya.sima...@gmail.com> wrote:
>>> > 1. What data store do you want to keep your data in? HDFS, HBase,
>>> Cassandra, S3, or something else?
>>> > 2. Have you looked at Spark SQL (https://spark.apache.org/sql/)?
>>> >
>>> > One option is to process the data in Spark and then store it in the
>>> relational database of your choice.
>>> >
>>> >
>>> >
>>> >
>>> > On Sat, Oct 25, 2014 at 11:18 PM, Peter Wolf <opus...@gmail.com>
>>> wrote:
>>> > Hello all,
>>> >
>>> > We are considering Spark for our organization.  It is obviously a
>>> superb platform for processing massive amounts of data... but how about
>>> retrieving it?
>>> >
>>> > We are currently storing our data in a relational database in a star
>>> schema.  Retrieving our data requires doing many complicated joins across
>>> many tables.
>>> >
>>> > Can we use Spark as a relational database?  Or, if not, can we put
>>> Spark on top of a relational database?
>>> >
>>> > Note that we don't care about SQL.  Accessing our data via standard
>>> queries is nice, but we are equally happy (or happier) to write Scala
>>> code.
>>> >
>>> > What is important to us is doing relational queries on huge amounts of
>>> data.  Is Spark good at this?
>>> >
>>> > Thank you very much in advance
>>> > Peter
>>> >
>>> >
>>> >
>>> >
>>>
>>>
>>
>
