Another option that we are trying internally is to use Mesos for isolating
different jobs or groups. Within a single group, using Livy to create
different Spark contexts also works.
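
For reference, a rough sketch of what that looks like against Livy's REST API
(the host name and resource sizes below are just placeholders, assuming Livy
is running on its default port 8998):

import requests

LIVY_URL = "http://livy-server:8998"  # hypothetical host, default Livy port

def create_session(name, executors=2, executor_memory="4g"):
    # Each POST to /sessions spins up its own SparkContext on the cluster,
    # so separate groups never share executors or driver state.
    payload = {
        "kind": "pyspark",
        "name": name,
        "numExecutors": executors,
        "executorMemory": executor_memory,
    }
    resp = requests.post(f"{LIVY_URL}/sessions", json=payload)
    resp.raise_for_status()
    return resp.json()["id"]

# One isolated context per team/group (names are illustrative)
analytics_session = create_session("analytics")
etl_session = create_session("etl", executors=4, executor_memory="8g")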

- Affan

On Tue, Nov 14, 2017 at 8:43 AM, ashish rawat <dceash...@gmail.com> wrote:

> Thanks Sky Yin. This really helps.
>
> On Nov 14, 2017 12:11 AM, "Sky Yin" <sky....@gmail.com> wrote:
>
> We are running Spark on AWS EMR as a data warehouse. All the data is in S3
> and the metadata in the Hive metastore.
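>
> Roughly, querying that setup looks like the sketch below (database, table,
> and bucket names are made up for illustration; enableHiveSupport() assumes
> the cluster is already pointed at the external metastore):
>
> from pyspark.sql import SparkSession
>
> spark = (SparkSession.builder
>          .appName("warehouse-query")
>          .enableHiveSupport()   # table metadata comes from the Hive metastore
>          .getOrCreate())
>
> # The metastore resolves "sales.orders" to its S3 location automatically
> daily = spark.sql("""
>     SELECT order_date, SUM(amount) AS revenue
>     FROM sales.orders
>     GROUP BY order_date
> """)
>
> # Write the aggregate back to S3 (hypothetical bucket)
> daily.write.mode("overwrite").parquet("s3://my-bucket/marts/daily_revenue/")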
>
> We have internal tools to create Jupyter notebooks on the dev cluster. I
> guess you could use Zeppelin instead, or Livy?
>
> We run Genie as a job server for the prod cluster, so users have to submit
> their queries through Genie. For better resource utilization, we rely on
> YARN dynamic allocation to balance the load of multiple jobs/queries in
> Spark.
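>
> As an illustration, the dynamic allocation knobs we set look roughly like
> this (the min/max executor counts are placeholder values, not a
> recommendation):
>
> from pyspark.sql import SparkSession
>
> spark = (SparkSession.builder
>          .appName("adhoc-query")
>          .config("spark.dynamicAllocation.enabled", "true")
>          .config("spark.dynamicAllocation.minExecutors", "1")    # assumed floor
>          .config("spark.dynamicAllocation.maxExecutors", "50")   # assumed ceiling
>          # the external shuffle service is required for dynamic allocation on YARN
>          .config("spark.shuffle.service.enabled", "true")
>          .getOrCreate())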
>
> Hope this helps.
>
> On Sat, Nov 11, 2017 at 11:21 PM ashish rawat <dceash...@gmail.com> wrote:
>
>> Hello Everyone,
>>
>> I was trying to understand if anyone here has tried a data warehouse
>> solution using S3 and Spark SQL. Out of the multiple possible options
>> (Redshift, Presto, Hive, etc.), we were planning to go with Spark SQL for
>> our aggregation and processing requirements.
>>
>> If anyone has tried it out, would like to understand the following:
>>
>>    1. Are Spark SQL and UDFs able to handle all the workloads?
>>    2. What user interface did you provide for data scientists, data
>>    engineers, and analysts?
>>    3. What are the challenges in running concurrent queries, by many
>>    users, over Spark SQL? Considering that Spark still does not provide
>>    spill-to-disk in many scenarios, are there frequent query failures when
>>    executing concurrent queries?
>>    4. Are there any open source implementations which provide something
>>    similar?
>>
>>
>> Regards,
>> Ashish
>>
>
>
