Re: Spark based Data Warehouse

Deepak Sharma Mon, 13 Nov 2017 10:29:45 -0800

If you have only one user, it's still possible to execute non-blocking,
long-running queries.
The best way is to have different users, with pre-assigned resources, run
their queries.
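
For the single-user case, one way to keep a long-running query from blocking
shorter ones is Spark's FAIR scheduler with named pools inside one
application. The sketch below only generates the allocation file and lists
the two related settings. The pool names, weights, minShare values and the
file path are made up for illustration, and minShare is a minimum share, not
a hard reservation. Each submitting thread would then choose its pool with
sc.setLocalProperty("spark.scheduler.pool", "<pool name>").

```python
import xml.etree.ElementTree as ET


def build_allocation_xml(pools):
    """Render a fair-scheduler allocation file from {pool_name: settings}."""
    root = ET.Element("allocations")
    for name, settings in pools.items():
        pool = ET.SubElement(root, "pool", name=name)
        # FAIR shares resources between jobs in the pool; FIFO queues them.
        ET.SubElement(pool, "schedulingMode").text = settings.get("schedulingMode", "FAIR")
        # weight: relative share of the application's cores versus other pools.
        ET.SubElement(pool, "weight").text = str(settings.get("weight", 1))
        # minShare: minimum number of cores the pool should get when it has work.
        ET.SubElement(pool, "minShare").text = str(settings.get("minShare", 0))
    return ET.tostring(root, encoding="unicode")


# Hypothetical pools: one per user group, values chosen only as an example.
POOLS = {
    "analysts": {"weight": 2, "minShare": 4},
    "long_running": {"weight": 1, "minShare": 2},
}

# Settings that point the application at the allocation file.
SPARK_CONF = {
    "spark.scheduler.mode": "FAIR",
    "spark.scheduler.allocation.file": "/path/to/fairscheduler.xml",  # hypothetical path
}

if __name__ == "__main__":
    print(build_allocation_xml(POOLS))
    for key, value in SPARK_CONF.items():
        print(f"{key}={value}")
```

This only isolates jobs inside one application; separating users with hard
limits is still the job of the cluster resource manager.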


HTH

Thanks
Deepak

On Nov 13, 2017 23:56, "ashish rawat" <dceash...@gmail.com> wrote:

> Thanks Everyone. I am still not clear on the right way to support
> multiple users running concurrent queries with Spark. Is it through
> multiple Spark contexts or through Livy (which creates only a single
> Spark context)?
>
> Also, what kind of isolation is possible with Spark SQL? If one user fires
> a big query, then would that choke all other queries in the cluster?
>
> Regards,
> Ashish
>
> On Mon, Nov 13, 2017 at 3:10 AM, Patrick Alwell <palw...@hortonworks.com>
> wrote:
>
>> Alcon,
>>
>>
>>
>> You can most certainly do this. I’ve done benchmarking with Spark SQL and
>> the TPCDS queries using S3 as the filesystem.
>>
>>
>>
>> Zeppelin and Livy server work well for the dashboarding and concurrent
>> query issues:
>> https://hortonworks.com/blog/livy-a-rest-interface-for-apache-spark/
>>
>>
>>
>> Livy Server will allow you to create multiple spark contexts via REST:
>> https://livy.incubator.apache.org/
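
As a rough illustration of the REST flow mentioned above: each Livy
interactive session gets its own Spark context, so one session per user
gives each user a separate context. The sketch below only builds the two
requests (create a session, then run a statement in it) and does not send
anything. The host, user name, queue name and resource numbers are
assumptions for the example.

```python
import json

LIVY_URL = "http://localhost:8998"  # assumption: Livy on its default port, local host


def create_session_request(proxy_user, queue, executor_memory="4g",
                           executor_cores=2, num_executors=4):
    """Build the POST /sessions request for one user's interactive session."""
    body = {
        "kind": "spark",                  # Scala interpreter session
        "proxyUser": proxy_user,          # user to impersonate
        "queue": queue,                   # YARN queue the session is submitted to
        "executorMemory": executor_memory,
        "executorCores": executor_cores,
        "numExecutors": num_executors,
    }
    return "POST", f"{LIVY_URL}/sessions", json.dumps(body)


def run_statement_request(session_id, code):
    """Build the POST /sessions/{id}/statements request that runs code in a session."""
    url = f"{LIVY_URL}/sessions/{session_id}/statements"
    return "POST", url, json.dumps({"code": code})


if __name__ == "__main__":
    # Hypothetical user and queue names.
    print(create_session_request("alice", "analysts"))
    # The session id comes from the response to the first request; 0 is a placeholder.
    print(run_statement_request(0, 'spark.sql("SELECT 1").show()'))
```

Any HTTP client can send these with a Content-Type of application/json.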
>>
>>
>>
>> If you are looking for broad SQL functionality I’d recommend
>> instantiating a Hive context. And Spark is able to spill to disk:
>> https://spark.apache.org/faq.html
>>
>>
>>
>> There are multiple companies running Spark within their data warehouse
>> solutions:
>> https://ibmdatawarehousing.wordpress.com/2016/10/12/steinbach_dashdb_local_spark/
>>
>>
>>
>> Edmunds used Spark to allow business analysts to point Spark to files in
>> S3 and infer schema: https://www.youtube.com/watch?v=gsR1ljgZLq0
>>
>>
>>
>> Recommend running some benchmarks and testing query scenarios for your
>> end users; but it sounds like you’ll be using it for exploratory analysis.
>> Spark is great for this ☺
>>
>>
>>
>> -Pat
>>
>>
>>
>>
>>
>> *From: *Vadim Semenov <vadim.seme...@datadoghq.com>
>> *Date: *Sunday, November 12, 2017 at 1:06 PM
>> *To: *Gourav Sengupta <gourav.sengu...@gmail.com>
>> *Cc: *Phillip Henry <londonjava...@gmail.com>, ashish rawat <
>> dceash...@gmail.com>, Jörn Franke <jornfra...@gmail.com>, Deepak Sharma <
>> deepakmc...@gmail.com>, spark users <user@spark.apache.org>
>> *Subject: *Re: Spark based Data Warehouse
>>
>>
>>
>> It's actually quite simple to answer
>>
>>
>>
>> > 1. Is Spark SQL and UDF, able to handle all the workloads?
>>
>> Yes
>>
>>
>>
>> > 2. What user interface did you provide for data scientist, data
>> > engineers and analysts
>>
>> Home-grown platform, EMR, Zeppelin
>>
>>
>>
>> > What are the challenges in running concurrent queries, by many users,
>> > over Spark SQL? Considering Spark still does not provide spill to disk,
>> > in many scenarios, are there frequent query failures when executing
>> > concurrent queries
>>
>> You can run separate Spark Contexts, so jobs will be isolated
>>
>>
>>
>> > Are there any open source implementations, which provide something
>> > similar?
>>
>> Yes, many.
>>
>>
>>
>>
>>
>> On Sun, Nov 12, 2017 at 1:47 PM, Gourav Sengupta <
>> gourav.sengu...@gmail.com> wrote:
>>
>> Dear Ashish,
>>
>> what you are asking for involves at least a few weeks of dedicated
>> understanding of your use case and then it takes at least 3 to 4 months to
>> even propose a solution. You can even build a fantastic data warehouse just
>> using C++. The matter depends on lots of conditions. I just think that your
>> approach and question need a lot of modification.
>>
>>
>>
>> Regards,
>>
>> Gourav
>>
>>
>>
>> On Sun, Nov 12, 2017 at 6:19 PM, Phillip Henry <londonjava...@gmail.com>
>> wrote:
>>
>> Hi, Ashish.
>>
>> You are correct in saying that not *all* functionality of Spark is
>> spill-to-disk but I am not sure how this pertains to a "concurrent user
>> scenario". Each executor will run in its own JVM and is therefore isolated
>> from others. That is, if the JVM of one user dies, this should not affect
>> another user who is running their own jobs in their own JVMs. The amount of
>> resources used by a user can be controlled by the resource manager.
>>
>> AFAIK, you configure something like YARN to limit the number of cores and
>> the amount of memory in the cluster a certain user or group is allowed to
>> use for their job. This is obviously quite a coarse-grained approach as (to
>> my knowledge) IO is not throttled. I believe people generally use something
>> like Apache Ambari to keep an eye on network and disk usage to mitigate
>> problems in a shared cluster.
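
To make the queue idea above concrete, here is a small sketch that builds a
spark-submit command pinned to a YARN queue with explicit executor limits.
The queue name, application file and sizes are hypothetical. The per-queue
caps themselves live in the YARN scheduler configuration, which is not
shown here.

```python
import shlex


def spark_submit_command(app, queue, num_executors, executor_cores, executor_memory):
    """Return the spark-submit argument list for a job confined to one YARN queue."""
    return [
        "spark-submit",
        "--master", "yarn",
        "--queue", queue,                        # YARN queue that owns the resources
        "--num-executors", str(num_executors),   # executors requested for this job
        "--executor-cores", str(executor_cores),
        "--executor-memory", executor_memory,
        app,
    ]


if __name__ == "__main__":
    # Hypothetical job and queue, for illustration only.
    cmd = spark_submit_command("job.py", "analysts", 4, 2, "4g")
    print(" ".join(shlex.quote(arg) for arg in cmd))
```

Requests above what the queue allows are held back by YARN, which is what
stops one team from taking the whole cluster.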
>>
>> If the user has badly designed their query, it may very well fail with
>> OOMEs but this can happen irrespective of whether one user or many is using
>> the cluster at a given moment in time.
>>
>>
>>
>> Does this help?
>>
>> Regards,
>>
>> Phillip
>>
>>
>>
>> On Sun, Nov 12, 2017 at 5:50 PM, ashish rawat <dceash...@gmail.com>
>> wrote:
>>
>> Thanks Jorn and Phillip. My question was specifically for anyone who has
>> tried creating a system using Spark SQL as a data warehouse. I was trying
>> to check whether someone has tried it and can help with the kinds of
>> workloads that worked and the ones that had problems.
>>
>>
>>
>> Regarding spill to disk, I might be wrong, but not all functionality of
>> Spark can spill to disk. So it still doesn't provide DB-like reliability in
>> execution. In the case of DBs, queries get slow but they don't fail or run
>> out of memory, specifically in concurrent user scenarios.
>>
>>
>>
>> Regards,
>>
>> Ashish
>>
>>
>>
>> On Nov 12, 2017 3:02 PM, "Phillip Henry" <londonjava...@gmail.com> wrote:
>>
>> Agree with Jorn. The answer is: it depends.
>>
>>
>>
>> In the past, I've worked with data scientists who are happy to use the
>> Spark CLI. Again, the answer is "it depends" (in this case, on the skills
>> of your customers).
>>
>> Regarding sharing resources, different teams were limited to their own
>> queue so they could not hog all the resources. However, people within a
>> team had to do some horse trading if they had a particularly intensive job
>> to run. I did feel that this was an area that could be improved. It may be
>> by now, I've just not looked into it for a while.
>>
>> BTW I'm not sure what you mean by "Spark still does not provide spill to
>> disk" as the FAQ says "Spark's operators spill data to disk if it does not
>> fit in memory" (http://spark.apache.org/faq.html). So, your data will
>> not normally cause OutOfMemoryErrors (certain terms and conditions may
>> apply).
>>
>> My 2 cents.
>>
>> Phillip
>>
>>
>>
>>
>>
>> On Sun, Nov 12, 2017 at 9:14 AM, Jörn Franke <jornfra...@gmail.com>
>> wrote:
>>
>> What do you mean all possible workloads?
>>
>> You cannot prepare any system to do all possible processing.
>>
>>
>>
>> We do not know the requirements of your data scientists now or in the
>> future so it is difficult to say. How do they work currently without the
>> new solution? Do they all work on the same data? I bet you will receive a
>> lot of private emails trying to sell you a solution that solves
>> everything. With the information you provided, this is impossible to
>> say.
>>
>>
>>
>> Then, as with every system: have incremental releases, but have them in
>> short time frames. Do not engineer a big system that you will deliver in 2
>> years. In the cloud you have the perfect possibility to scale feature-wise
>> but also infrastructure-wise.
>>
>>
>>
>> The challenge with concurrent queries is the right definition of the
>> scheduler (e.g. the fair scheduler), so that no single query takes all the
>> resources and long-running queries do not starve.
>>
>>
>>
>> User interfaces: what could help are notebooks (Jupyter etc) but you may
>> need to train your data scientists. Some may know or prefer other tools.
>>
>>
>> On 12. Nov 2017, at 08:32, Deepak Sharma <deepakmc...@gmail.com> wrote:
>>
>> I am looking for a similar solution, more aligned to a data scientist group.
>>
>> The concern I have is about supporting complex aggregations at runtime.
>>
>>
>>
>> Thanks
>>
>> Deepak
>>
>>
>>
>> On Nov 12, 2017 12:51, "ashish rawat" <dceash...@gmail.com> wrote:
>>
>> Hello Everyone,
>>
>>
>>
>> I was trying to understand if anyone here has tried a data warehouse
>> solution using S3 and Spark SQL. Out of multiple possible options
>> (redshift, presto, hive etc), we were planning to go with Spark SQL, for
>> our aggregates and processing requirements.
>>
>>
>>
>> If anyone has tried it out, would like to understand the following:
>>
>>    1. Are Spark SQL and UDFs able to handle all the workloads?
>>    2. What user interface did you provide for data scientists, data
>>    engineers and analysts?
>>    3. What are the challenges in running concurrent queries, by many
>>    users, over Spark SQL? Considering Spark still does not provide spill to
>>    disk, in many scenarios, are there frequent query failures when executing
>>    concurrent queries?
>>    4. Are there any open source implementations, which provide something
>>    similar?
>>
>>
>>
>> Regards,
>>
>> Ashish
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>
>
