Hi Sean,

I'm interested in trying something similar. How was your performance when you had many concurrent queries running against Spark? I know this will work well where you have a low volume of queries against a large dataset, but I am concerned about having a high volume of queries against the same large dataset. (I know I've not defined "large", but hopefully you get the gist :))
I'm using Cassandra to handle workloads where we have a large volume of low-complexity queries, but I want to move to an architecture which supports a similar(ish) large volume of higher-complexity queries. I'd like to use Spark as the query-serving layer, but I have concerns about how many concurrent queries it could handle. I'd be interested in anyone's thoughts or experience with this.

Thanks,
Andrew

From: Sean McNamara <sean.mcnam...@webtrends.com>
Date: Wednesday, February 4, 2015 at 1:01
To: Adamantios Corais <adamantios.cor...@gmail.com>
Cc: "user@spark.apache.org" <user@spark.apache.org>
Subject: Re: Spark (SQL) as OLAP engine

We have gone down a similar path at Webtrends; Spark has worked amazingly well for us in this use case. Our solution goes from REST directly into Spark, and back out to the UI instantly. Here is the resulting product in case you are curious (and please pardon the self-promotion): https://www.webtrends.com/support-training/training/explore-onboarding/

> How can I automatically cache the data once a day...

If you are not memory-bound, you could easily cache the daily results for some span of time and re-union them together each time you add new data. You would then service queries off the unioned RDD.

> ... and make them available on a web service

From the unioned RDD you could always step into Spark SQL at that point. Or you could use a simple scatter/gather pattern for this. As with all things Spark, this is super easy to do: just use aggregate()()!

Cheers,
Sean

On Feb 3, 2015, at 9:59 AM, Adamantios Corais <adamantios.cor...@gmail.com> wrote:

Hi,

After some research I have decided that Spark (SQL) would be ideal for building an OLAP engine.
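[Editorial aside: Sean's aggregate()() refers to RDD.aggregate(zeroValue)(seqOp, combOp), which folds each partition locally (scatter) and then merges the per-partition results (gather). A rough illustration of that contract, simulated on plain Python lists standing in for partitions rather than a real RDD — the aggregate() helper below is a local stand-in, not Spark API:]

```python
from functools import reduce

def aggregate(partitions, zero, seq_op, comb_op):
    """Simulate RDD.aggregate: fold each partition with seq_op (scatter),
    then merge the per-partition partial results with comb_op (gather)."""
    partials = [reduce(seq_op, part, zero) for part in partitions]
    return reduce(comb_op, partials, zero)

# Example: count and sum computed in a single pass over "partitioned" data.
partitions = [[1.0, 2.0], [3.0], [4.0, 5.0]]
count, total = aggregate(
    partitions,
    (0, 0.0),                                 # zeroValue
    lambda acc, x: (acc[0] + 1, acc[1] + x),  # seqOp: fold one element in
    lambda a, b: (a[0] + b[0], a[1] + b[1]),  # combOp: merge two partials
)
print(count, total)  # 5 15.0
```

On a real RDD the equivalent call would run the seqOp on each executor and the combOp on the driver, which is what makes it a natural fit for a low-latency query-serving layer.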
My goal is to push aggregated data (to Cassandra or some other low-latency data store) and then be able to project the results on a web page (web service). New data will be added (aggregated) only once a day. On the other hand, the web service must be able to run some fixed(?) queries (either on Spark or Spark SQL) at any time and plot the results with D3.js. Note that I can already achieve similar speeds in REPL mode by caching the data. Therefore, I believe that my problem should be re-phrased as follows: "How can I automatically cache the data once a day and make it available to a web service that is capable of running any Spark or Spark SQL statement in order to plot the results with D3.js?"

Note that I already have some experience with Spark (+ Spark SQL) as well as D3.js, but none at all with OLAP engines (at least in their traditional form). Any ideas or suggestions?

// Adamantios
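[Editorial aside: the once-a-day cache-and-serve flow discussed in this thread can be sketched as follows. This is a minimal simulation with plain Python lists standing in for cached RDDs; the names DailyCache, add_day, and query are hypothetical illustrations, not Spark API:]

```python
class DailyCache:
    """Simulates the suggested pattern: keep each day's aggregated results
    cached, union them into one serving dataset, and answer web-service
    queries off that union."""

    def __init__(self):
        self._days = []   # one cached result set per day
        self._union = []  # the unioned, query-serving dataset

    def add_day(self, rows):
        # In Spark this step would be roughly:
        #   daily = load(path).cache()
        #   serving = serving.union(daily).cache()
        self._days.append(list(rows))
        self._union = [row for day in self._days for row in day]

    def query(self, predicate):
        # Stands in for a Spark / Spark SQL query against the unioned data,
        # e.g. the fixed queries behind a D3.js dashboard.
        return [row for row in self._union if predicate(row)]

cache = DailyCache()
cache.add_day([("page_a", 10), ("page_b", 3)])   # day 1 aggregates
cache.add_day([("page_a", 7)])                   # day 2 aggregates
print(cache.query(lambda r: r[0] == "page_a"))   # [('page_a', 10), ('page_a', 7)]
```

Because only one new day arrives at a time, the daily load is cheap and the union stays hot in memory, which is why queries off it return at interactive speed.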