If you have only one user, it's still possible to execute non-blocking, long-running queries. The best way is to have different users, each with pre-assigned resources, run their own queries.
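A minimal sketch of the single-context case, assuming Spark's fair scheduler is what keeps one long-running query from blocking the rest. The pool names, weights and file path are hypothetical; the XML elements (`allocations`, `pool`, `schedulingMode`, `weight`, `minShare`) and the `spark.scheduler.*` settings are the ones described in Spark's job scheduling documentation. The script only generates the allocation file and does not start Spark.

```python
import xml.etree.ElementTree as ET


def build_allocations(pools):
    """Return fairscheduler.xml text for a {pool_name: (weight, min_share)} map."""
    root = ET.Element("allocations")
    for name, (weight, min_share) in pools.items():
        pool = ET.SubElement(root, "pool", name=name)
        # FAIR inside the pool: jobs in the same pool share its slots evenly.
        ET.SubElement(pool, "schedulingMode").text = "FAIR"
        # weight: this pool's share of the cluster relative to other pools.
        ET.SubElement(pool, "weight").text = str(weight)
        # minShare: cores the pool should get before the rest is divided up.
        ET.SubElement(pool, "minShare").text = str(min_share)
    return ET.tostring(root, encoding="unicode")


# Hypothetical pools: one for dashboards, one for heavy ad hoc queries.
pools = {"dashboards": (2, 4), "adhoc": (1, 2)}
print(build_allocations(pools))

# Settings that tell Spark to use the generated file (path is a placeholder).
spark_conf = {
    "spark.scheduler.mode": "FAIR",
    "spark.scheduler.allocation.file": "/path/to/fairscheduler.xml",
}

# Inside the application, each thread that submits queries picks its pool:
#   sc.setLocalProperty("spark.scheduler.pool", "adhoc")
```

Pools only divide resources inside one Spark application. Giving separate users hard limits still needs separate applications sized by the resource manager.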
HTH
Thanks
Deepak

On Nov 13, 2017 23:56, "ashish rawat" <dceash...@gmail.com> wrote:

> Thanks Everyone. I am still not clear on what is the right way to
> support multiple users running concurrent queries with Spark. Is it
> through multiple spark contexts or through Livy (which creates a single
> spark context only)?
>
> Also, what kind of isolation is possible with Spark SQL? If one user fires
> a big query, then would that choke all other queries in the cluster?
>
> Regards,
> Ashish
>
> On Mon, Nov 13, 2017 at 3:10 AM, Patrick Alwell <palw...@hortonworks.com>
> wrote:
>
>> Alcon,
>>
>> You can most certainly do this. I’ve done benchmarking with Spark SQL and
>> the TPCDS queries using S3 as the filesystem.
>>
>> Zeppelin and Livy server work well for the dashboarding and concurrent
>> query issues:
>> https://hortonworks.com/blog/livy-a-rest-interface-for-apache-spark/
>>
>> Livy Server will allow you to create multiple spark contexts via REST:
>> https://livy.incubator.apache.org/
>>
>> If you are looking for broad SQL functionality I’d recommend
>> instantiating a Hive context. And Spark is able to spill to disk ->
>> https://spark.apache.org/faq.html
>>
>> There are multiple companies running spark within their data warehouse
>> solutions:
>> https://ibmdatawarehousing.wordpress.com/2016/10/12/steinbach_dashdb_local_spark/
>>
>> Edmunds used Spark to allow business analysts to point Spark to files in
>> S3 and infer schema: https://www.youtube.com/watch?v=gsR1ljgZLq0
>>
>> Recommend running some benchmarks and testing query scenarios for your
>> end users; but it sounds like you’ll be using it for exploratory analysis.
>> Spark is great for this ☺
>>
>> -Pat
>>
>> *From: *Vadim Semenov <vadim.seme...@datadoghq.com>
>> *Date: *Sunday, November 12, 2017 at 1:06 PM
>> *To: *Gourav Sengupta <gourav.sengu...@gmail.com>
>> *Cc: *Phillip Henry <londonjava...@gmail.com>, ashish rawat
>> <dceash...@gmail.com>, Jörn Franke <jornfra...@gmail.com>, Deepak Sharma
>> <deepakmc...@gmail.com>, spark users <user@spark.apache.org>
>> *Subject: *Re: Spark based Data Warehouse
>>
>> It's actually quite simple to answer
>>
>> > 1. Is Spark SQL and UDF, able to handle all the workloads?
>>
>> Yes
>>
>> > 2. What user interface did you provide for data scientist, data
>> > engineers and analysts
>>
>> Home-grown platform, EMR, Zeppelin
>>
>> > What are the challenges in running concurrent queries, by many users,
>> > over Spark SQL? Considering Spark still does not provide spill to disk,
>> > in many scenarios, are there frequent query failures when executing
>> > concurrent queries
>>
>> You can run separate Spark Contexts, so jobs will be isolated
>>
>> > Are there any open source implementations, which provide something
>> > similar?
>>
>> Yes, many.
>>
>> On Sun, Nov 12, 2017 at 1:47 PM, Gourav Sengupta
>> <gourav.sengu...@gmail.com> wrote:
>>
>> Dear Ashish,
>>
>> what you are asking for involves at least a few weeks of dedicated
>> understanding of your use case and then it takes at least 3 to 4 months to
>> even propose a solution. You can even build a fantastic data warehouse just
>> using C++. The matter depends on lots of conditions. I just think that your
>> approach and question need a lot of modification.
>>
>> Regards,
>>
>> Gourav
>>
>> On Sun, Nov 12, 2017 at 6:19 PM, Phillip Henry <londonjava...@gmail.com>
>> wrote:
>>
>> Hi, Ashish.
>>
>> You are correct in saying that not *all* functionality of Spark is
>> spill-to-disk but I am not sure how this pertains to a "concurrent user
>> scenario".
>> Each executor will run in its own JVM and is therefore isolated
>> from others. That is, if the JVM of one user dies, this should not affect
>> another user who is running their own jobs in their own JVMs. The amount of
>> resources used by a user can be controlled by the resource manager.
>>
>> AFAIK, you configure something like YARN to limit the number of cores and
>> the amount of memory in the cluster a certain user or group is allowed to
>> use for their job. This is obviously quite a coarse-grained approach as (to
>> my knowledge) IO is not throttled. I believe people generally use something
>> like Apache Ambari to keep an eye on network and disk usage to mitigate
>> problems in a shared cluster.
>>
>> If the user has badly designed their query, it may very well fail with
>> OOMEs but this can happen irrespective of whether one user or many is using
>> the cluster at a given moment in time.
>>
>> Does this help?
>>
>> Regards,
>>
>> Phillip
>>
>> On Sun, Nov 12, 2017 at 5:50 PM, ashish rawat <dceash...@gmail.com>
>> wrote:
>>
>> Thanks Jorn and Phillip. My question was specifically to anyone who has
>> tried creating a system using spark SQL, as Data Warehouse. I was trying to
>> check, if someone has tried it and they can help with the kind of workloads
>> which worked and the ones, which have problems.
>>
>> Regarding spill to disk, I might be wrong but not all functionality of
>> spark is spill to disk. So it still doesn't provide DB like reliability in
>> execution. In case of DBs, queries get slow but they don't fail or go out
>> of memory, specifically in concurrent user scenarios.
>>
>> Regards,
>>
>> Ashish
>>
>> On Nov 12, 2017 3:02 PM, "Phillip Henry" <londonjava...@gmail.com> wrote:
>>
>> Agree with Jorn. The answer is: it depends.
>>
>> In the past, I've worked with data scientists who are happy to use the
>> Spark CLI. Again, the answer is "it depends" (in this case, on the skills
>> of your customers).
>>
>> Regarding sharing resources, different teams were limited to their own
>> queue so they could not hog all the resources. However, people within a
>> team had to do some horse trading if they had a particularly intensive job
>> to run. I did feel that this was an area that could be improved. It may be
>> by now, I've just not looked into it for a while.
>>
>> BTW I'm not sure what you mean by "Spark still does not provide spill to
>> disk" as the FAQ says "Spark's operators spill data to disk if it does not
>> fit in memory" (http://spark.apache.org/faq.html). So, your data will
>> not normally cause OutOfMemoryErrors (certain terms and conditions may
>> apply).
>>
>> My 2 cents.
>>
>> Phillip
>>
>> On Sun, Nov 12, 2017 at 9:14 AM, Jörn Franke <jornfra...@gmail.com>
>> wrote:
>>
>> What do you mean all possible workloads?
>>
>> You cannot prepare any system to do all possible processing.
>>
>> We do not know the requirements of your data scientists now or in the
>> future so it is difficult to say. How do they work currently without the
>> new solution? Do they all work on the same data? I bet you will receive on
>> your email a lot of private messages trying to sell their solution that
>> solves everything - with the information you provided this is impossible to
>> say.
>>
>> Then with every system: have incremental releases but have them in short
>> time frames - do not engineer a big system that you will deliver in 2
>> years. In the cloud you have the perfect possibility to scale feature-wise
>> but also infrastructure-wise.
>>
>> The challenge with concurrent queries is defining the scheduler correctly
>> (e.g. the fair scheduler) so that no single query takes all the resources
>> and long-running queries do not starve.
>>
>> User interfaces: what could help are notebooks (Jupyter etc) but you may
>> need to train your data scientists. Some may know or prefer other tools.
>>
>> On 12. Nov 2017, at 08:32, Deepak Sharma <deepakmc...@gmail.com> wrote:
>>
>> I am looking for similar solution more aligned to data scientist group.
>>
>> The concern I have is about supporting complex aggregations at runtime.
>>
>> Thanks
>>
>> Deepak
>>
>> On Nov 12, 2017 12:51, "ashish rawat" <dceash...@gmail.com> wrote:
>>
>> Hello Everyone,
>>
>> I was trying to understand if anyone here has tried a data warehouse
>> solution using S3 and Spark SQL. Out of multiple possible options
>> (redshift, presto, hive etc), we were planning to go with Spark SQL, for
>> our aggregates and processing requirements.
>>
>> If anyone has tried it out, would like to understand the following:
>>
>> 1. Is Spark SQL and UDF, able to handle all the workloads?
>> 2. What user interface did you provide for data scientist, data
>>    engineers and analysts
>> 3. What are the challenges in running concurrent queries, by many
>>    users, over Spark SQL? Considering Spark still does not provide spill
>>    to disk, in many scenarios, are there frequent query failures when
>>    executing concurrent queries
>> 4. Are there any open source implementations, which provide something
>>    similar?
>>
>> Regards,
>>
>> Ashish
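
On the Livy versus multiple-contexts question quoted above, a rough sketch of the "one session per user, with pre-assigned resources" idea. Each Livy session is backed by its own Spark application, so requesting one session per user is one way to get the isolation discussed in the thread. The host name, user name, queue name and resource sizes are hypothetical. The body fields (`kind`, `proxyUser`, `queue`, `numExecutors`, `executorCores`, `executorMemory`) are taken from the Livy REST API's `POST /sessions` request. The script only builds the request and does not send it.

```python
import json
from urllib import request


def session_request(livy_url, user, queue, executors=2, cores=2, memory="4g"):
    """Build (but do not send) a POST /sessions request for one user's session."""
    body = {
        "kind": "spark",             # interactive Scala/Spark session
        "proxyUser": user,           # user the session should run as
        "queue": queue,              # YARN queue holding this user's quota
        "numExecutors": executors,   # resources fixed up front per session
        "executorCores": cores,
        "executorMemory": memory,
    }
    return request.Request(
        url=livy_url.rstrip("/") + "/sessions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical server, user and queue.
req = session_request("http://livy.example.internal:8998", "alice", "analysts")
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))

# request.urlopen(req) would submit it to a running Livy server.
```

With a queue per team and a fixed executor count per session, a heavy query from one user is limited to that session's executors, although IO is still shared, as noted earlier in the thread.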