Thanks, everyone. I am still not clear on the right way to support multiple users running concurrent queries with Spark. Is it through multiple Spark contexts, or through Livy (which, as far as I understand, creates only a single Spark context)?
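For reference, a rough sketch of what the Livy route might look like. It assumes Livy's REST API as Patrick describes it in the quoted reply (POST /sessions starts a session backed by its own Spark context, POST /sessions/{id}/statements runs code in it). The host name, user and queue names, and resource sizes are made up for illustration, and nothing is sent over the network; the code only builds the requests.

```python
import json
import urllib.request

# Hypothetical Livy endpoint; 8998 is Livy's default port.
LIVY_URL = "http://livy.example.com:8998"


def new_session_request(user: str, queue: str) -> urllib.request.Request:
    """Build (but do not send) a request asking Livy for a new session.

    Assumption: one session per user, so each user gets a separate
    Spark context with its own resource cap.
    """
    payload = {
        "kind": "spark",          # Scala interpreter session
        "proxyUser": user,        # run the session on behalf of this user
        "queue": queue,           # YARN queue that limits this user's share
        "executorMemory": "4g",   # illustrative sizes, not recommendations
        "executorCores": 2,
        "numExecutors": 4,
    }
    return urllib.request.Request(
        url=f"{LIVY_URL}/sessions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def statement_request(session_id: int, sql: str) -> urllib.request.Request:
    """Build (but do not send) a request that runs one SQL query in a session."""
    # The SQL is wrapped in a Scala spark.sql(...) call for the session above.
    payload = {"code": f'spark.sql("""{sql}""").show()'}
    return urllib.request.Request(
        url=f"{LIVY_URL}/sessions/{session_id}/statements",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing either request to urllib.request.urlopen would submit it to a real Livy server. Whether one session per user is affordable depends on cluster size, since every session holds its own driver and executors.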
Also, what kind of isolation is possible with Spark SQL? If one user fires a big query, would that choke all other queries in the cluster?

Regards,
Ashish

On Mon, Nov 13, 2017 at 3:10 AM, Patrick Alwell <palw...@hortonworks.com> wrote:

> Alcon,
>
> You can most certainly do this. I’ve done benchmarking with Spark SQL and
> the TPCDS queries using S3 as the filesystem.
>
> Zeppelin and Livy server work well for the dashboarding and concurrent
> query issues:
> https://hortonworks.com/blog/livy-a-rest-interface-for-apache-spark/
>
> Livy Server will allow you to create multiple spark contexts via REST:
> https://livy.incubator.apache.org/
>
> If you are looking for broad SQL functionality I’d recommend instantiating
> a Hive context. And Spark is able to spill to disk ->
> https://spark.apache.org/faq.html
>
> There are multiple companies running spark within their data warehouse
> solutions:
> https://ibmdatawarehousing.wordpress.com/2016/10/12/steinbach_dashdb_local_spark/
>
> Edmunds used Spark to allow business analysts to point Spark to files in
> S3 and infer schema: https://www.youtube.com/watch?v=gsR1ljgZLq0
>
> Recommend running some benchmarks and testing query scenarios for your end
> users; but it sounds like you’ll be using it for exploratory analysis.
> Spark is great for this ☺
>
> -Pat
>
> From: Vadim Semenov <vadim.seme...@datadoghq.com>
> Date: Sunday, November 12, 2017 at 1:06 PM
> To: Gourav Sengupta <gourav.sengu...@gmail.com>
> Cc: Phillip Henry <londonjava...@gmail.com>, ashish rawat
> <dceash...@gmail.com>, Jörn Franke <jornfra...@gmail.com>, Deepak Sharma
> <deepakmc...@gmail.com>, spark users <user@spark.apache.org>
> Subject: Re: Spark based Data Warehouse
>
> It's actually quite simple to answer
>
> > 1. Is Spark SQL and UDF, able to handle all the workloads?
>
> Yes
>
> > 2. What user interface did you provide for data scientist, data
> > engineers and analysts
>
> Home-grown platform, EMR, Zeppelin
>
> > What are the challenges in running concurrent queries, by many users,
> > over Spark SQL? Considering Spark still does not provide spill to disk, in
> > many scenarios, are there frequent query failures when executing concurrent
> > queries
>
> You can run separate Spark Contexts, so jobs will be isolated
>
> > Are there any open source implementations, which provide something
> > similar?
>
> Yes, many.
>
> On Sun, Nov 12, 2017 at 1:47 PM, Gourav Sengupta
> <gourav.sengu...@gmail.com> wrote:
>
> Dear Ashish,
>
> what you are asking for involves at least a few weeks of dedicated
> understanding of your use case and then it takes at least 3 to 4 months to
> even propose a solution. You can even build a fantastic data warehouse just
> using C++. The matter depends on lots of conditions. I just think that your
> approach and question need a lot of modification.
>
> Regards,
> Gourav
>
> On Sun, Nov 12, 2017 at 6:19 PM, Phillip Henry <londonjava...@gmail.com>
> wrote:
>
> Hi, Ashish.
>
> You are correct in saying that not *all* functionality of Spark is
> spill-to-disk but I am not sure how this pertains to a "concurrent user
> scenario". Each executor will run in its own JVM and is therefore isolated
> from others. That is, if the JVM of one user dies, this should not affect
> another user who is running their own jobs in their own JVMs. The amount of
> resources used by a user can be controlled by the resource manager.
>
> AFAIK, you configure something like YARN to limit the number of cores and
> the amount of memory in the cluster a certain user or group is allowed to
> use for their job. This is obviously quite a coarse-grained approach as (to
> my knowledge) IO is not throttled. I believe people generally use something
> like Apache Ambari to keep an eye on network and disk usage to mitigate
> problems in a shared cluster.
>
> If the user has badly designed their query, it may very well fail with
> OOMEs but this can happen irrespective of whether one user or many is using
> the cluster at a given moment in time.
>
> Does this help?
>
> Regards,
> Phillip
>
> On Sun, Nov 12, 2017 at 5:50 PM, ashish rawat <dceash...@gmail.com> wrote:
>
> Thanks Jorn and Phillip. My question was specifically to anyone who has
> tried creating a system using spark SQL, as Data Warehouse. I was trying to
> check, if someone has tried it and they can help with the kind of workloads
> which worked and the ones, which have problems.
>
> Regarding spill to disk, I might be wrong but not all functionality of
> spark is spill to disk. So it still doesn't provide DB like reliability in
> execution. In case of DBs, queries get slow but they don't fail or go out
> of memory, specifically in concurrent user scenarios.
>
> Regards,
> Ashish
>
> On Nov 12, 2017 3:02 PM, "Phillip Henry" <londonjava...@gmail.com> wrote:
>
> Agree with Jorn. The answer is: it depends.
>
> In the past, I've worked with data scientists who are happy to use the
> Spark CLI. Again, the answer is "it depends" (in this case, on the skills
> of your customers).
>
> Regarding sharing resources, different teams were limited to their own
> queue so they could not hog all the resources. However, people within a
> team had to do some horse trading if they had a particularly intensive job
> to run. I did feel that this was an area that could be improved. It may be
> by now, I've just not looked into it for a while.
>
> BTW I'm not sure what you mean by "Spark still does not provide spill to
> disk" as the FAQ says "Spark's operators spill data to disk if it does not
> fit in memory" (http://spark.apache.org/faq.html). So, your data will not
> normally cause OutOfMemoryErrors (certain terms and conditions may apply).
>
> My 2 cents.
>
> Phillip
>
> On Sun, Nov 12, 2017 at 9:14 AM, Jörn Franke <jornfra...@gmail.com> wrote:
>
> What do you mean all possible workloads?
> You cannot prepare any system to do all possible processing.
>
> We do not know the requirements of your data scientists now or in the
> future so it is difficult to say. How do they work currently without the
> new solution? Do they all work on the same data? I bet you will receive on
> your email a lot of private messages trying to sell their solution that
> solves everything - with the information you provided this is impossible to
> say.
>
> Then with every system: have incremental releases but have them in short
> time frames - do not engineer a big system that you will deliver in 2
> years. In the cloud you have the perfect possibility to scale feature but
> also infrastructure wise.
>
> Challenges with concurrent queries is the right definition of the
> scheduler (eg fairscheduler) that not one query take all the resources or
> that long running queries starve.
>
> User interfaces: what could help are notebooks (Jupyter etc) but you may
> need to train your data scientists. Some may know or prefer other tools.
>
> On 12. Nov 2017, at 08:32, Deepak Sharma <deepakmc...@gmail.com> wrote:
>
> I am looking for similar solution more aligned to data scientist group.
> The concern i have is about supporting complex aggregations at runtime.
>
> Thanks
> Deepak
>
> On Nov 12, 2017 12:51, "ashish rawat" <dceash...@gmail.com> wrote:
>
> Hello Everyone,
>
> I was trying to understand if anyone here has tried a data warehouse
> solution using S3 and Spark SQL. Out of multiple possible options
> (redshift, presto, hive etc), we were planning to go with Spark SQL, for
> our aggregates and processing requirements.
>
> If anyone has tried it out, would like to understand the following:
>
> 1. Is Spark SQL and UDF, able to handle all the workloads?
> 2. What user interface did you provide for data scientist, data
>    engineers and analysts
> 3. What are the challenges in running concurrent queries, by many
>    users, over Spark SQL? Considering Spark still does not provide spill to
>    disk, in many scenarios, are there frequent query failures when executing
>    concurrent queries
> 4. Are there any open source implementations, which provide something
>    similar?
>
> Regards,
> Ashish
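
On the scheduler point Jörn raises in the thread above ("eg fairscheduler"): a minimal sketch of how pools could be set up if all users share one Spark context, assuming he means Spark's in-application fair scheduler rather than YARN's. The pool names, weights, and the file path are hypothetical. The configuration keys (spark.scheduler.mode, spark.scheduler.allocation.file) and the pool properties (schedulingMode, weight, minShare) are standard Spark settings.

```python
import xml.etree.ElementTree as ET


def fair_scheduler_xml(pools: dict) -> str:
    """Render a Spark fair scheduler allocation file.

    `pools` maps a pool name to a (weight, min_share) tuple.
    """
    root = ET.Element("allocations")
    for name, (weight, min_share) in pools.items():
        pool = ET.SubElement(root, "pool", name=name)
        # FAIR shares resources between jobs inside the pool as well.
        ET.SubElement(pool, "schedulingMode").text = "FAIR"
        # A pool with weight 2 gets about twice the share of a weight-1 pool.
        ET.SubElement(pool, "weight").text = str(weight)
        # Minimum number of cores the pool is given before fair sharing applies.
        ET.SubElement(pool, "minShare").text = str(min_share)
    return ET.tostring(root, encoding="unicode")


# Settings the shared Spark application would be started with.
# The path is a placeholder for wherever the generated file is saved.
SPARK_CONF = {
    "spark.scheduler.mode": "FAIR",
    "spark.scheduler.allocation.file": "/path/to/fairscheduler.xml",
}

# Hypothetical pools: ad hoc exploration vs. scheduled reports.
ALLOCATION_XML = fair_scheduler_xml({"adhoc": (1, 0), "reports": (2, 4)})
```

Each thread submitting queries would then pick its pool with sc.setLocalProperty("spark.scheduler.pool", "adhoc"). This only stops one query from taking every task slot inside a shared context; it does not give the memory or JVM isolation that separate contexts do, so a badly designed query can still affect the others.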