> When Spark loads data from S3 (sc.textFile('s3://...')), how will the data be
> spread across the workers?
The data is read directly by the workers; it does not pass through the master.
Just make sure the input is splittable, either by using a splittable format or
by passing a list of files, e.g.
  sc.textFile('s3://.../*.txt')
to achieve full parallelism. Otherwise (e.g., when reading a single gzipped
file) only one worker will read the data.
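
To check this on your own setup, here is a minimal sketch that compares the
number of partitions Spark creates for a glob of plain-text files with the
number it creates for a single gzipped file. It uses local temporary files
instead of S3 so it can be tried anywhere. The helper names
(write_sample_inputs, partition_counts, demo) and the file names are made up
for illustration, and demo() assumes pyspark is installed.

```python
import gzip
import os
import tempfile


def write_sample_inputs(base_dir, n_files=4, lines_per_file=1000):
    """Write n_files plain-text files plus one gzipped file with the same lines.

    Returns (glob pattern for the text files, path of the gzipped file).
    All file names here are hypothetical.
    """
    parts_dir = os.path.join(base_dir, "parts")
    os.makedirs(parts_dir, exist_ok=True)
    all_lines = []
    for i in range(n_files):
        lines = [f"file{i}-line{j}\n" for j in range(lines_per_file)]
        with open(os.path.join(parts_dir, f"part-{i}.txt"), "w") as f:
            f.writelines(lines)
        all_lines.extend(lines)
    gz_path = os.path.join(base_dir, "single.txt.gz")
    with gzip.open(gz_path, "wt") as f:
        f.writelines(all_lines)
    return os.path.join(parts_dir, "*.txt"), gz_path


def partition_counts(sc, glob_path, gz_path):
    """Return (partitions for the glob of text files, partitions for the .gz file)."""
    many = sc.textFile(glob_path)   # one or more partitions per file
    single = sc.textFile(gz_path)   # gzip is not splittable
    return many.getNumPartitions(), single.getNumPartitions()


def demo():
    """Run the comparison on a local SparkContext (assumes pyspark is installed)."""
    from pyspark import SparkContext
    sc = SparkContext("local[4]", "split-demo")
    try:
        with tempfile.TemporaryDirectory() as d:
            glob_path, gz_path = write_sample_inputs(d)
            return partition_counts(sc, glob_path, gz_path)
    finally:
        sc.stop()
```

Calling demo() should report several partitions for the glob (at least one per
file) and a single partition for the gzipped file, which is why only one worker
ends up reading it. The same check works against an S3 path by passing it to
partition_counts directly.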
> So it might be a trade-off compared to HDFS?
Accessing data on S3 from Hadoop is usually slower than HDFS, cf.
https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html#Other_issues
> In what respects is S3 better than HDFS?
It's independent of your Hadoop cluster: it's easier to share, you don't have
to worry about the data when maintaining your cluster, ...
Sebastian
On 12/13/2017 09:39 AM, Philip Lee wrote:
> Hi,
>
> I have a few questions about how HDFS and S3 are structured when Spark-like
> systems load data from these two storage systems.
>
> Generally, when Spark loads data from HDFS, HDFS supports data locality and
> already holds the files distributed across the datanodes, right? Spark can
> just process the data on the workers.
>
> What about S3? Many people in this field use S3 for storage or for loading
> data remotely. When Spark loads data from S3 (sc.textFile('s3://...')), how
> will the data be spread across the workers? Is the master node responsible
> for this task? Does it read all the data from S3 and then spread it to the
> workers? So it might be a trade-off compared to HDFS? Or have I got this
> wrong?
>
> In what respects is S3 better than HDFS?
>
> Thanks in advance