We are running Spark on AWS EMR as our data warehouse. All data is stored in
S3 and the metadata in a Hive metastore.
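
In case it helps, this is roughly what that setup looks like in
spark-defaults.conf — a sketch only; the metastore host and S3 bucket below
are placeholders, and `hive.metastore.uris` is often set in hive-site.xml
instead:

```properties
# Use the Hive metastore as the catalog (host is a placeholder)
spark.sql.catalogImplementation  hive
hive.metastore.uris              thrift://metastore-host:9083

# Warehouse location on S3 (bucket is a placeholder)
spark.sql.warehouse.dir          s3://my-warehouse-bucket/warehouse/
```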

We have internal tools to create Jupyter notebooks on the dev cluster. I
guess you could use Zeppelin or Livy instead.

We run Genie as a job server for the prod cluster, so users have to submit
their queries through Genie. For better resource utilization, we rely on
YARN dynamic allocation to balance the load across multiple jobs/queries in
Spark.
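
The dynamic-allocation settings we mean are roughly the following — the
numbers are just example values, tune them for your cluster:

```properties
# Let YARN grow/shrink each job's executor count with its load
spark.dynamicAllocation.enabled              true
# External shuffle service is required for dynamic allocation on YARN
spark.shuffle.service.enabled                true
# Example bounds on executors per application
spark.dynamicAllocation.minExecutors         2
spark.dynamicAllocation.maxExecutors         50
# Release executors that sit idle for this long
spark.dynamicAllocation.executorIdleTimeout  60s
```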

Hope this helps.

On Sat, Nov 11, 2017 at 11:21 PM ashish rawat <dceash...@gmail.com> wrote:

> Hello Everyone,
>
> I was trying to understand if anyone here has tried a data warehouse
> solution using S3 and Spark SQL. Out of multiple possible options
> (redshift, presto, hive etc), we were planning to go with Spark SQL, for
> our aggregates and processing requirements.
>
> If anyone has tried it out, would like to understand the following:
>
>    1. Are Spark SQL and UDFs able to handle all the workloads?
>    2. What user interface did you provide for data scientists, data
>    engineers, and analysts?
>    3. What are the challenges in running concurrent queries, by many
>    users, over Spark SQL? Considering Spark still does not provide spill to
>    disk in many scenarios, are there frequent query failures when executing
>    concurrent queries?
>    4. Are there any open source implementations, which provide something
>    similar?
>
>
> Regards,
> Ashish
>
