Hi Ashish, bear in mind that EMR has some additional tooling available that
smooths out some of the S3 problems that you may (and almost certainly will)
encounter.

We are using Spark with S3, not on EMR, and have encountered issues with
file consistency. You can deal with them, but be aware that it's additional
technical debt that you'll need to own.  We didn't want to own an HDFS
cluster, so we consider it worthwhile.
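
To give a flavor of the kind of workaround involved, here is a minimal
sketch (an illustration, not our actual code) of polling a listing until
freshly written files become visible before a downstream step reads them.
The names wait_until_visible and list_keys are hypothetical; list_keys
stands in for whatever S3 listing call you use, and the retry count and
delays are arbitrary assumptions.

```python
import time
from typing import Callable, Iterable, Set


def wait_until_visible(
    list_keys: Callable[[], Iterable[str]],
    expected: Set[str],
    attempts: int = 5,
    delay_seconds: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll a listing until every expected key shows up, or give up.

    list_keys is a hypothetical callable that returns the keys currently
    visible under the output prefix (for example, a wrapper around an S3
    list call). Returns True if all expected keys were seen in time.
    """
    for attempt in range(attempts):
        missing = expected - set(list_keys())
        if not missing:
            return True
        # Back off exponentially between polls, but not after the last one.
        if attempt < attempts - 1:
            sleep(delay_seconds * (2 ** attempt))
    return False
```

A job would call this after a write and before the next stage lists the
same prefix, failing loudly if it returns False rather than silently
processing a partial listing.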

Here are some additional resources.  The video is Steve Loughran talking
about S3:
https://medium.com/@subhojit20_27731/apache-spark-and-amazon-s3-gotchas-and-best-practices-a767242f3d98
https://www.youtube.com/watch?v=ND4L_zSDqF0

For the record, we use S3 heavily but tend to drop our processed data into
databases so that it can be more easily consumed by visualization tools.

Good luck!

Gary Lucas

On 13 November 2017 at 20:04, Affan Syed <as...@an10.io> wrote:

> Another option that we are trying internally is to use Mesos for
> isolating different jobs or groups. Within a single group, using Livy to
> create different Spark contexts also works.
>
> - Affan
>
> On Tue, Nov 14, 2017 at 8:43 AM, ashish rawat <dceash...@gmail.com> wrote:
>
>> Thanks Sky Yin. This really helps.
>>
>> On Nov 14, 2017 12:11 AM, "Sky Yin" <sky....@gmail.com> wrote:
>>
>> We are running Spark on AWS EMR as a data warehouse. All data is in S3
>> and the metadata is in the Hive metastore.
>>
>> We have internal tools to create Jupyter notebooks on the dev cluster. I
>> guess you can use Zeppelin instead, or Livy?
>>
>> We run Genie as a job server for the prod cluster, so users have to
>> submit their queries through Genie. For better resource utilization, we
>> rely on YARN dynamic allocation to balance the load of multiple
>> jobs/queries in Spark.
>>
>> Hope this helps.
>>
>> On Sat, Nov 11, 2017 at 11:21 PM ashish rawat <dceash...@gmail.com>
>> wrote:
>>
>>> Hello Everyone,
>>>
>>> I was trying to understand if anyone here has tried a data warehouse
>>> solution using S3 and Spark SQL. Out of multiple possible options
>>> (Redshift, Presto, Hive, etc.), we were planning to go with Spark SQL for
>>> our aggregates and processing requirements.
>>>
>>> If anyone has tried it out, I would like to understand the following:
>>>
>>>    1. Are Spark SQL and UDFs able to handle all the workloads?
>>>    2. What user interface did you provide for data scientists, data
>>>    engineers, and analysts?
>>>    3. What are the challenges in running concurrent queries, by many
>>>    users, over Spark SQL? Considering that Spark still does not provide
>>>    spill to disk in many scenarios, are there frequent query failures
>>>    when executing concurrent queries?
>>>    4. Are there any open source implementations that provide
>>>    something similar?
>>>
>>>
>>> Regards,
>>> Ashish
>>>
>>
>>
>
