Re: Spark based Data Warehouse

ashish rawat Mon, 13 Nov 2017 10:27:32 -0800

Thanks, everyone. I am still not clear on the right way to support multiple
users running concurrent queries with Spark. Is it through multiple Spark
contexts, or through Livy (which creates a single Spark context only)?


Also, what kind of isolation is possible with Spark SQL? If one user fires
a big query, then would that choke all other queries in the cluster?

Regards,
Ashish

On Mon, Nov 13, 2017 at 3:10 AM, Patrick Alwell <palw...@hortonworks.com>
wrote:

> Alcon,
>
>
>
> You can most certainly do this. I’ve done benchmarking with Spark SQL and
> the TPCDS queries using S3 as the filesystem.
>
>
>
> Zeppelin and Livy server work well for the dashboarding and concurrent
> query issues:
> https://hortonworks.com/blog/livy-a-rest-interface-for-apache-spark/
>
>
>
> Livy Server will allow you to create multiple spark contexts via REST:
> https://livy.incubator.apache.org/
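
A minimal sketch of what "multiple Spark contexts via REST" could look like from a client, assuming Livy's documented `POST /sessions` and `POST /sessions/{id}/statements` endpoints. The server URL, queue name, and resource sizes are hypothetical placeholders, and the requests are only built here, not sent. The idea is that each user gets a separate session, and each session is backed by its own Spark application.

```python
import json
import urllib.request

# Hypothetical Livy endpoint; replace with your own server.
LIVY_URL = "http://livy.example.com:8998"


def new_session_request(queue, executor_memory="4g", executor_cores=2):
    """Build (but do not send) a POST /sessions request for one user."""
    body = {
        "kind": "spark",                    # Scala interactive session
        "queue": queue,                     # YARN queue (assumed to exist)
        "executorMemory": executor_memory,  # per-executor memory
        "executorCores": executor_cores,    # per-executor cores
    }
    return urllib.request.Request(
        LIVY_URL + "/sessions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def statement_request(session_id, code):
    """Build (but do not send) a request that runs code in one session."""
    return urllib.request.Request(
        f"{LIVY_URL}/sessions/{session_id}/statements",
        data=json.dumps({"code": code}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# One session per user, so a heavy query runs in that user's own application.
alice = new_session_request("analysts")
query = statement_request(0, 'spark.sql("SELECT 1").show()')
print(alice.full_url, alice.data.decode("utf-8"))
print(query.full_url, query.data.decode("utf-8"))
```

Passing either object to `urllib.request.urlopen` would actually submit it; the session id `0` above is a stand-in for the id returned when the session is created.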
>
>
>
> If you are looking for broad SQL functionality I’d recommend instantiating
> a Hive context. And Spark is able to spill to disk:
> https://spark.apache.org/faq.html
>
>
>
> There are multiple companies running Spark within their data warehouse
> solutions:
> https://ibmdatawarehousing.wordpress.com/2016/10/12/steinbach_dashdb_local_spark/
>
>
>
> Edmunds used Spark to allow business analysts to point Spark to files in
> S3 and infer schema: https://www.youtube.com/watch?v=gsR1ljgZLq0
>
>
>
> Recommend running some benchmarks and testing query scenarios for your end
> users; but it sounds like you’ll be using it for exploratory analysis.
> Spark is great for this ☺
>
>
>
> -Pat
>
>
>
>
>
> *From: *Vadim Semenov <vadim.seme...@datadoghq.com>
> *Date: *Sunday, November 12, 2017 at 1:06 PM
> *To: *Gourav Sengupta <gourav.sengu...@gmail.com>
> *Cc: *Phillip Henry <londonjava...@gmail.com>, ashish rawat <
> dceash...@gmail.com>, Jörn Franke <jornfra...@gmail.com>, Deepak Sharma <
> deepakmc...@gmail.com>, spark users <user@spark.apache.org>
> *Subject: *Re: Spark based Data Warehouse
>
>
>
> It's actually quite simple to answer
>
>
>
> > 1. Is Spark SQL and UDF, able to handle all the workloads?
>
> Yes
>
>
>
> > 2. What user interface did you provide for data scientist, data
> engineers and analysts
>
> Home-grown platform, EMR, Zeppelin
>
>
>
> > What are the challenges in running concurrent queries, by many users,
> over Spark SQL? Considering Spark still does not provide spill to disk, in
> many scenarios, are there frequent query failures when executing concurrent
> queries
>
> You can run separate Spark Contexts, so jobs will be isolated
>
>
>
> > Are there any open source implementations, which provide something
> similar?
>
> Yes, many.
>
>
>
>
>
> On Sun, Nov 12, 2017 at 1:47 PM, Gourav Sengupta <
> gourav.sengu...@gmail.com> wrote:
>
> Dear Ashish,
>
> what you are asking for involves at least a few weeks of dedicated
> understanding of your use case, and then it takes at least 3 to 4 months to
> even propose a solution. You can even build a fantastic data warehouse just
> using C++. The matter depends on lots of conditions. I just think that your
> approach and question need a lot of modification.
>
>
>
> Regards,
>
> Gourav
>
>
>
> On Sun, Nov 12, 2017 at 6:19 PM, Phillip Henry <londonjava...@gmail.com>
> wrote:
>
> Hi, Ashish.
>
> You are correct in saying that not *all* functionality of Spark is
> spill-to-disk but I am not sure how this pertains to a "concurrent user
> scenario". Each executor will run in its own JVM and is therefore isolated
> from others. That is, if the JVM of one user dies, this should not affect
> another user who is running their own jobs in their own JVMs. The amount of
> resources used by a user can be controlled by the resource manager.
>
> AFAIK, you configure something like YARN to limit the number of cores and
> the amount of memory in the cluster a certain user or group is allowed to
> use for their job. This is obviously quite a coarse-grained approach as (to
> my knowledge) IO is not throttled. I believe people generally use something
> like Apache Ambari to keep an eye on network and disk usage to mitigate
> problems in a shared cluster.
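
A rough sketch of the per-job side of this, assuming the standard `spark-submit` options for YARN (`--queue`, `--num-executors`, `--executor-cores`, `--executor-memory`). The queue name, application file, and sizes are hypothetical. The queue's own limits are configured in YARN and are not shown here; this only illustrates how a job is pointed at a queue and given a bounded resource request.

```python
import shlex


def spark_submit_command(app, queue, num_executors, executor_cores, executor_memory):
    """Return the argv list for submitting one job to a given YARN queue."""
    return [
        "spark-submit",
        "--master", "yarn",
        "--queue", queue,                        # YARN queue for this team
        "--num-executors", str(num_executors),   # executors requested
        "--executor-cores", str(executor_cores),
        "--executor-memory", executor_memory,
        app,
    ]


# Hypothetical job for a team restricted to the "analysts" queue.
cmd = spark_submit_command("report.py", "analysts", 4, 2, "4g")
print(shlex.join(cmd))
```

Building the command as a list keeps it safe to hand to `subprocess.run(cmd)` without shell quoting problems.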
>
> If the user has badly designed their query, it may very well fail with
> OOMEs but this can happen irrespective of whether one user or many is using
> the cluster at a given moment in time.
>
>
>
> Does this help?
>
> Regards,
>
> Phillip
>
>
>
> On Sun, Nov 12, 2017 at 5:50 PM, ashish rawat <dceash...@gmail.com> wrote:
>
> Thanks Jorn and Phillip. My question was specifically for anyone who has
> tried creating a system using Spark SQL as a data warehouse. I was trying to
> check if someone has tried it and can help with the kinds of workloads
> that worked and the ones that had problems.
>
>
>
> Regarding spill to disk, I might be wrong but not all functionality of
> spark is spill to disk. So it still doesn't provide DB like reliability in
> execution. In case of DBs, queries get slow but they don't fail or go out
> of memory, specifically in concurrent user scenarios.
>
>
>
> Regards,
>
> Ashish
>
>
>
> On Nov 12, 2017 3:02 PM, "Phillip Henry" <londonjava...@gmail.com> wrote:
>
> Agree with Jorn. The answer is: it depends.
>
>
>
> In the past, I've worked with data scientists who are happy to use the
> Spark CLI. Again, the answer is "it depends" (in this case, on the skills
> of your customers).
>
> Regarding sharing resources, different teams were limited to their own
> queue so they could not hog all the resources. However, people within a
> team had to do some horse trading if they had a particularly intensive job
> to run. I did feel that this was an area that could be improved. It may be
> by now, I've just not looked into it for a while.
>
> BTW I'm not sure what you mean by "Spark still does not provide spill to
> disk" as the FAQ says "Spark's operators spill data to disk if it does not
> fit in memory" (http://spark.apache.org/faq.html). So, your data will not
> normally cause OutOfMemoryErrors (certain terms and conditions may apply).
>
> My 2 cents.
>
> Phillip
>
>
>
>
>
> On Sun, Nov 12, 2017 at 9:14 AM, Jörn Franke <jornfra...@gmail.com> wrote:
>
> What do you mean all possible workloads?
>
> You cannot prepare any system to do all possible processing.
>
>
>
> We do not know the requirements of your data scientists now or in the
> future so it is difficult to say. How do they work currently without the
> new solution? Do they all work on the same data? I bet you will receive on
> your email a lot of private messages trying to sell their solution that
> solves everything - with the information you provided this is impossible to
> say.
>
>
>
> Then with every system: have incremental releases but have them in short
> time frames - do not engineer a big system that you will deliver in 2
> years. In the cloud you have the perfect possibility to scale feature-wise
> but also infrastructure-wise.
>
>
>
> The challenge with concurrent queries is the right definition of the
> scheduler (e.g. the fair scheduler) so that no single query takes all the
> resources and long-running queries do not starve.
>
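
As an illustration of the scheduler definition mentioned above, here is a small sketch that generates a Spark fair scheduler allocation file. It assumes Spark's documented pool format (`schedulingMode`, `weight`, `minShare`) and the `spark.scheduler.mode` and `spark.scheduler.allocation.file` settings. The pool names, weights, and file path are hypothetical. Note that these pools share resources among jobs inside one Spark application, so they apply when many users share a single context.

```python
import xml.etree.ElementTree as ET


def fair_scheduler_xml(pools):
    """Render an allocation file; pools maps name -> (weight, min_share)."""
    root = ET.Element("allocations")
    for name, (weight, min_share) in pools.items():
        pool = ET.SubElement(root, "pool", name=name)
        ET.SubElement(pool, "schedulingMode").text = "FAIR"
        ET.SubElement(pool, "weight").text = str(weight)
        ET.SubElement(pool, "minShare").text = str(min_share)
    return ET.tostring(root, encoding="unicode")


# Hypothetical pools: dashboards get twice the share of ad hoc queries
# and a guaranteed minimum of 4 cores.
xml_text = fair_scheduler_xml({"dashboards": (2, 4), "adhoc": (1, 0)})
print(xml_text)

# Spark settings that switch on fair scheduling and point at the file
# (the path is a placeholder).
spark_conf = {
    "spark.scheduler.mode": "FAIR",
    "spark.scheduler.allocation.file": "/path/to/fairscheduler.xml",
}
```

Each thread that submits queries would then pick its pool by setting the `spark.scheduler.pool` local property on the Spark context before running the query.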
>
>
> User interfaces: what could help are notebooks (Jupyter etc) but you may
> need to train your data scientists. Some may know or prefer other tools.
>
>
> On 12. Nov 2017, at 08:32, Deepak Sharma <deepakmc...@gmail.com> wrote:
>
> I am looking for a similar solution, more aligned to the data scientist group.
>
> The concern I have is about supporting complex aggregations at runtime.
>
>
>
> Thanks
>
> Deepak
>
>
>
> On Nov 12, 2017 12:51, "ashish rawat" <dceash...@gmail.com> wrote:
>
> Hello Everyone,
>
>
>
> I was trying to understand if anyone here has tried a data warehouse
> solution using S3 and Spark SQL. Out of multiple possible options
> (redshift, presto, hive etc), we were planning to go with Spark SQL, for
> our aggregates and processing requirements.
>
>
>
> If anyone has tried it out, would like to understand the following:
>
>    1. Is Spark SQL and UDF, able to handle all the workloads?
>    2. What user interface did you provide for data scientist, data
>    engineers and analysts
>    3. What are the challenges in running concurrent queries, by many
>    users, over Spark SQL? Considering Spark still does not provide spill to
>    disk, in many scenarios, are there frequent query failures when executing
>    concurrent queries
>    4. Are there any open source implementations, which provide something
>    similar?
>
>
>
> Regards,
>
> Ashish
>
>
>
>
>
>
>
>
>
>
>
