Re: Spark (SQL) as OLAP engine

2015-02-03 Thread McNerlin, Andrew (Agoda)
Hi Sean,

I'm interested in trying something similar.  How was your performance when you 
had many concurrent queries running against Spark?  I know this will work well 
where you have a low volume of queries against a large dataset, but I am 
concerned about a high volume of queries against the same large dataset. 
(I know I've not defined large, but hopefully you get the gist :))

I'm using Cassandra to handle workloads where we have a large volume of 
low-complexity queries, but want to move to an architecture which supports a 
similar(ish) large volume of higher-complexity queries.  I'd like to use Spark 
as the query-serving layer, but have concerns about how many concurrent queries 
it could handle.

I'd be interested in anyone's thoughts or experience with this.

Thanks,
Andrew

From: Sean McNamara sean.mcnam...@webtrends.com
Date: Wednesday, February 4, 2015 at 1:01
To: Adamantios Corais adamantios.cor...@gmail.com
Cc: user@spark.apache.org
Subject: Re: Spark (SQL) as OLAP engine

We have gone down a similar path at Webtrends, and Spark has worked amazingly 
well for us in this use case.  Our solution goes from REST, directly into 
Spark, and back out to the UI instantly.

Here is the resulting product in case you are curious (and please pardon the 
self-promotion): 
https://www.webtrends.com/support-training/training/explore-onboarding/


 How can I automatically cache the data once a day...

If you are not memory-bound you could easily cache the daily results for some 
span of time and re-union them together each time you add new data.  You would 
serve queries off the unioned RDD.


 ... and make them available on a web service

From the unioned RDD you could always step into Spark SQL at that point.  Or 
you could use a simple scatter/gather pattern for this.  As with all things 
Spark, this is super easy to do: just use aggregate()()!


Cheers,

Sean











Spark (SQL) as OLAP engine

2015-02-03 Thread Adamantios Corais
Hi,

After some research I have decided that Spark (SQL) would be ideal for
building an OLAP engine. My goal is to push aggregated data (to Cassandra
or other low-latency data storage) and then be able to project the results
on a web page (web service). New data will be added (aggregated) only once
a day. On the other hand, the web service must be able to run some fixed(?)
queries (either on Spark or Spark SQL) at any time and plot the results
with D3.js. Note that I can already achieve similar speeds in REPL mode by
caching the data. Therefore, I believe that my problem should be re-phrased
as follows: How can I automatically cache the data once a day and make it
available on a web service that is capable of running any Spark or Spark
SQL statement in order to plot the results with D3.js?

Note that I already have some experience with Spark (+Spark SQL) as well as
D3.js, but not at all with OLAP engines (at least in their traditional form).

Any ideas or suggestions?


*// Adamantios*


Re: Spark (SQL) as OLAP engine

2015-02-03 Thread Sean McNamara
We have gone down a similar path at Webtrends, and Spark has worked amazingly 
well for us in this use case.  Our solution goes from REST, directly into 
Spark, and back out to the UI instantly.

Here is the resulting product in case you are curious (and please pardon the 
self-promotion): 
https://www.webtrends.com/support-training/training/explore-onboarding/


 How can I automatically cache the data once a day...

If you are not memory-bound you could easily cache the daily results for some 
span of time and re-union them together each time you add new data.  You would 
serve queries off the unioned RDD.
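That cache-and-union pattern can be sketched in plain Python (a stand-in model, not real Spark code; `DailyCache` is an illustrative name, and the comments note the corresponding Spark calls on the assumption you keep a reference to each day's cached RDD):

```python
# Illustrative sketch of the suggestion above: cache each day's aggregated
# results and re-union them whenever a new day arrives. Plain Python lists
# stand in for cached RDDs; in real Spark you would call .cache() on each
# day's RDD and rebuild the served dataset with sc.union(daily_rdds).

class DailyCache:
    def __init__(self):
        self.daily = {}    # date -> that day's cached aggregate ("RDD")
        self.unioned = []  # the combined dataset queries are served from

    def add_day(self, date, rows):
        # In Spark: rdd = compute_daily_aggregate(date); rdd.cache()
        self.daily[date] = list(rows)
        # Re-union all cached days; queries run against this view.
        self.unioned = [r for day in sorted(self.daily)
                        for r in self.daily[day]]

    def query(self, predicate):
        # In Spark: self.unioned.filter(predicate).collect()
        return [r for r in self.unioned if predicate(r)]

cache = DailyCache()
cache.add_day("2015-02-02", [("page_a", 10), ("page_b", 3)])
cache.add_day("2015-02-03", [("page_a", 7)])
hits = cache.query(lambda r: r[0] == "page_a")  # rows for page_a across days
```

Note the trade-off this models: each new day only adds one small cached piece, so the daily refresh is cheap, while queries always see the full unioned history.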


 ... and make them available on a web service

From the unioned RDD you could always step into Spark SQL at that point.  Or 
you could use a simple scatter/gather pattern for this.  As with all things 
Spark, this is super easy to do: just use aggregate()()!
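For readers unfamiliar with the curried Scala signature `aggregate(zeroValue)(seqOp, combOp)` (hence the double parentheses): `seqOp` folds elements within each partition (the scatter), and `combOp` merges the per-partition results (the gather). A plain-Python model of those semantics, with an explicit `partitions` list standing in for an RDD's partitioning (illustrative, not real Spark):

```python
# Plain-Python model of RDD.aggregate(zeroValue)(seqOp, combOp):
# seq_op folds elements into an accumulator within each partition
# ("scatter"), comb_op merges the per-partition accumulators ("gather").

def aggregate(partitions, zero, seq_op, comb_op):
    partials = []
    for part in partitions:          # in Spark, one task per partition
        acc = zero
        for x in part:
            acc = seq_op(acc, x)
        partials.append(acc)
    result = zero                    # the driver merges partial results
    for p in partials:
        result = comb_op(result, p)
    return result

# Example: compute (sum, count) in a single pass to derive a mean.
partitions = [[1, 2, 3], [4, 5], [6]]
total, count = aggregate(
    partitions,
    (0, 0),                                    # zero value: (sum, count)
    lambda acc, x: (acc[0] + x, acc[1] + 1),   # seqOp: fold one element
    lambda a, b: (a[0] + b[0], a[1] + b[1]),   # combOp: merge partitions
)
mean = total / count
```

The key property is that `combOp` only sees small per-partition accumulators, which is what makes the gather step cheap enough to serve interactive queries.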


Cheers,

Sean


On Feb 3, 2015, at 9:59 AM, Adamantios Corais 
adamantios.cor...@gmail.com wrote:

Hi,

After some research I have decided that Spark (SQL) would be ideal for building 
an OLAP engine. My goal is to push aggregated data (to Cassandra or other 
low-latency data storage) and then be able to project the results on a web page 
(web service). New data will be added (aggregated) only once a day. On the 
other hand, the web service must be able to run some fixed(?) queries (either 
on Spark or Spark SQL) at any time and plot the results with D3.js. Note that I 
can already achieve similar speeds in REPL mode by caching the data. Therefore, 
I believe that my problem should be re-phrased as follows: How can I 
automatically cache the data once a day and make it available on a web service 
that is capable of running any Spark or Spark SQL statement in order to plot 
the results with D3.js?

Note that I already have some experience with Spark (+Spark SQL) as well as 
D3.js, but not at all with OLAP engines (at least in their traditional form).

Any ideas or suggestions?

// Adamantios





Re: Spark (SQL) as OLAP engine

2015-02-03 Thread Jonathan Haddad
Write out the RDD to a Cassandra table.  The DataStax Spark Cassandra
connector provides saveToCassandra() for this purpose.
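A rough sketch of that pipeline in plain Python, with an in-memory dict standing in for the Cassandra table and `save_to_cassandra` as an illustrative stand-in for the connector's `saveToCassandra()` (in Scala, roughly `rdd.saveToCassandra("ks", "daily_agg", SomeColumns(...))` — the keyspace, table, and column names here are made up for the example):

```python
# Illustrative stand-in for the pipeline described above: aggregate in
# Spark once a day, write the result out to a Cassandra table keyed by
# (day, key), and serve low-latency reads from that table. The dict
# below stands in for the Cassandra table; save_to_cassandra mimics the
# upsert semantics of the connector's saveToCassandra().

table = {}  # (day, key) -> value, standing in for a Cassandra table

def save_to_cassandra(rows):
    # Cassandra writes are upserts: the last write for a primary key wins.
    for day, key, value in rows:
        table[(day, key)] = value

# "RDD" of aggregated rows for one day: (day, key, value) tuples.
daily_aggregate = [
    ("2015-02-03", "page_a", 17),
    ("2015-02-03", "page_b", 3),
]
save_to_cassandra(daily_aggregate)

# The web service then reads by primary key instead of querying Spark,
# which sidesteps the concurrent-query concern for the fixed queries.
value = table[("2015-02-03", "page_a")]
```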

On Tue Feb 03 2015 at 8:59:15 AM Adamantios Corais 
adamantios.cor...@gmail.com wrote:

 Hi,

 After some research I have decided that Spark (SQL) would be ideal for
 building an OLAP engine. My goal is to push aggregated data (to Cassandra
 or other low-latency data storage) and then be able to project the results
 on a web page (web service). New data will be added (aggregated) only once
 a day. On the other hand, the web service must be able to run some fixed(?)
 queries (either on Spark or Spark SQL) at any time and plot the results
 with D3.js. Note that I can already achieve similar speeds in REPL mode by
 caching the data. Therefore, I believe that my problem should be re-phrased
 as follows: How can I automatically cache the data once a day and make it
 available on a web service that is capable of running any Spark or Spark
 SQL statement in order to plot the results with D3.js?

 Note that I already have some experience with Spark (+Spark SQL) as well as
 D3.js, but not at all with OLAP engines (at least in their traditional form).

 Any ideas or suggestions?


 *// Adamantios*





Re: Spark (SQL) as OLAP engine

2015-02-03 Thread Denny Lee
As Jonathan noted, a great presentation by Evan Chan on utilizing Cassandra
with Spark is: OLAP with Cassandra and Spark
http://www.slideshare.net/EvanChan2/2014-07olapcassspark.

On Tue Feb 03 2015 at 10:03:34 AM Jonathan Haddad j...@jonhaddad.com wrote:

 Write out the RDD to a Cassandra table.  The DataStax Spark Cassandra
 connector provides saveToCassandra() for this purpose.
