Hi Huizhe,

You can set the "fs.defaultFS" property in core-site.xml to an S3 path.
That way your Spark job will use S3 for all operations that need HDFS.
Intermediate data will still be stored on local disk, though.
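
A minimal core-site.xml sketch (the bucket name and credentials below are
placeholders; depending on your Hadoop version you may also need the
hadoop-aws jar and its AWS SDK dependency on the classpath):

  <configuration>
    <property>
      <name>fs.defaultFS</name>
      <value>s3a://your-bucket</value> <!-- placeholder bucket -->
    </property>
    <!-- s3a credentials; alternatively use an instance profile or a
         credential provider instead of putting keys in the file -->
    <property>
      <name>fs.s3a.access.key</name>
      <value>YOUR_ACCESS_KEY</value>
    </property>
    <property>
      <name>fs.s3a.secret.key</name>
      <value>YOUR_SECRET_KEY</value>
    </property>
  </configuration>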

Thanks,
Hari

On Mon, May 20, 2019 at 10:14 AM Abdeali Kothari <abdealikoth...@gmail.com>
wrote:

> While Spark can read from S3 directly in EMR, I believe it still needs
> HDFS to perform shuffles and to write intermediate data to disk when
> running jobs (i.e. when in-memory data needs to spill over to disk).
>
> For these operations, Spark does need a distributed file system - you
> could use something like EMRFS (which is like an HDFS backed by S3) on
> Amazon.
>
> The issue could be something else too - so a stacktrace or error message
> could help in understanding the problem.
>
>
>
> On Mon, May 20, 2019, 07:20 Huizhe Wang <wang.h...@husky.neu.edu> wrote:
>
>> Hi,
>>
>> I want to use Spark on YARN without HDFS. I store my resources on AWS S3 and
>> use s3a to access them. However, when I ran stop-dfs.sh to stop the NameNode
>> and DataNode, I got an error when running in YARN cluster mode. Can I use
>> YARN without starting DFS, and how should I set up this mode?
>>
>> Yours,
>> Jane
>>
>
