Re: Spark loads data from HDFS or S3

2017-12-13 Thread Jörn Franke
S3 can be run more cheaply than HDFS on Amazon. As you correctly describe, it does not support data locality: the data is pulled over the network to the workers. Depending on your use case, it can make sense to use HDFS as a temporary "cache" for S3 data.

Re: Spark loads data from HDFS or S3

2017-12-13 Thread Sebastian Nagel
> When Spark loads data from S3 (sc.textFile('s3://...')), how will the data be spread across the Workers?

The data is read by the workers. Just make sure the data is splittable, either by using a splittable format or by passing a list of files, e.g. sc.textFile('s3://.../*.txt'), to achieve full parallelism.
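The splittability point above can be demonstrated without Spark at all. The sketch below (plain Python, illustrative only) shows why Spark/Hadoop can split a plain text file among workers at arbitrary byte offsets, while a gzip file cannot be split and therefore becomes a single partition:

```python
import gzip
import os
import tempfile

# Same logical content, stored as plain text and as gzip.
lines = "".join(f"record {i}\n" for i in range(1000))

tmp = tempfile.mkdtemp()
txt_path = os.path.join(tmp, "data.txt")
gz_path = os.path.join(tmp, "data.txt.gz")
with open(txt_path, "w") as f:
    f.write(lines)
with gzip.open(gz_path, "wt") as f:
    f.write(lines)

# Plain text is splittable: a reader can seek to any byte offset and
# resynchronize on the next newline -- this is essentially how a text
# input format assigns file splits to workers.
with open(txt_path, "rb") as f:
    f.seek(os.path.getsize(txt_path) // 2)
    f.readline()  # discard the partial record at the split boundary
    rest = f.read().decode().splitlines()
assert all(r.startswith("record ") for r in rest)

# Gzip is not splittable: bytes from the middle of the stream are
# meaningless on their own, so decompression must start at byte 0 and
# the whole file ends up in one partition.
with open(gz_path, "rb") as f:
    f.seek(os.path.getsize(gz_path) // 2)
    chunk = f.read()
try:
    gzip.decompress(chunk)
    splittable = True
except OSError:
    splittable = False
print(splittable)  # False
```

The same reasoning is why passing many files (the '*.txt' glob above) restores parallelism even for non-splittable formats: each file becomes at least one partition.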

Spark loads data from HDFS or S3

2017-12-13 Thread Philip Lee
Hi, I have a few questions about how HDFS and S3 behave when Spark loads data from these two storage systems. Generally, when Spark loads data from HDFS, HDFS supports data locality and the files are already distributed across the datanodes, right? So Spark can just process the data on the workers that hold it. What about S3?