Hi

I have a few questions about the structure of HDFS and S3 when something
like Spark loads data from these two kinds of storage.


Generally, when Spark loads data from HDFS, HDFS supports data locality and
the files are already distributed across the datanodes, right? So Spark can
just process the data locally on the workers.
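
For context, here is a minimal PySpark sketch of the HDFS case I mean (the
namenode address and path are made up):

    # Minimal PySpark sketch of reading a text file from HDFS.
    # 'hdfs://namenode:8020/data/input.txt' is a placeholder path, and
    # this assumes the job is submitted to a cluster via spark-submit
    # (which supplies the master URL).
    from pyspark import SparkContext

    sc = SparkContext(appName="hdfs-read")
    lines = sc.textFile("hdfs://namenode:8020/data/input.txt")

    # Each partition maps to an HDFS block, so tasks can be scheduled
    # on the datanodes that already hold those blocks (data locality).
    print(lines.count())
    sc.stop()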


What about S3? Many people in this field use S3 for storage or for loading
data remotely. When Spark loads data from S3 (e.g.
sc.textFile('s3://...')), how does all the data get spread across the
workers? Is the master node responsible for this task, i.e. does it read
all the data from S3 and then distribute it to the workers? If so, that
might be a trade-off compared to HDFS, or have I misunderstood something?

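To make the question concrete, here is a minimal sketch of the S3 read
(bucket and path are made up; the s3a:// scheme assumes the hadoop-aws
connector and AWS credentials are configured):

    # Minimal PySpark sketch of reading the same kind of file from S3.
    # 's3a://my-bucket/data/input.txt' is a placeholder; the s3a
    # connector needs hadoop-aws on the classpath plus credentials
    # (e.g. the fs.s3a.access.key / fs.s3a.secret.key settings).
    from pyspark import SparkContext

    sc = SparkContext(appName="s3-read")
    lines = sc.textFile("s3a://my-bucket/data/input.txt")

    # My question: when count() runs, does each executor fetch its own
    # byte range from S3 directly, or does the driver pull the data
    # first and then ship it to the workers?
    print(lines.count())
    sc.stop()
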
In what ways is S3 better than HDFS?

Thanks in advance
