Re: Spark loads data from HDFS or S3

2017-12-13 Thread Sebastian Nagel
> When Spark loads data from S3 (sc.textFile('s3://...')), how will the data be
> spread across the workers?

The data is read directly by the workers. Just make sure that the input is
splittable, either by using a splittable format or by passing a list (or glob)
of files, e.g.
  sc.textFile('s3://.../*.txt')
to achieve full parallelism. Otherwise (e.g., when reading a single gzipped
file) only one worker will read the data.
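
For illustration, a minimal PySpark sketch (bucket name and paths are
hypothetical) showing the difference in partitioning:

  from pyspark import SparkContext

  sc = SparkContext(appName='s3-read-sketch')

  # Many plain-text files: each file (and each block of a large file) becomes
  # its own partition, so all workers read in parallel.
  txt = sc.textFile('s3://my-bucket/data/*.txt')       # hypothetical path
  print(txt.getNumPartitions())                        # typically > 1

  # A single gzipped file is not splittable: it ends up in a single partition
  # and is read by only one worker.
  gz = sc.textFile('s3://my-bucket/data/all.txt.gz')   # hypothetical path
  print(gz.getNumPartitions())                         # 1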

> So it might be a trade-off compared to HDFS?

Accessing data on S3 from Hadoop is usually slower than HDFS, cf.
  
https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html#Other_issues

> In what ways is S3 better than HDFS?

It's independent of your Hadoop cluster: the data is easier to share, and you
don't have to take care of it when maintaining your cluster, ...

Sebastian

On 12/13/2017 09:39 AM, Philip Lee wrote:
> Hi,
> 
> I have a few questions about the structure of HDFS and S3 when Spark loads
> data from these two storage systems.
> 
> Generally, when Spark loads data from HDFS, HDFS provides data locality and
> already holds the files distributed across the datanodes, right? So Spark can
> just process the data on the workers.
> 
> What about S3? Many people in this field use S3 for storage or for loading
> data remotely. When Spark loads data from S3 (sc.textFile('s3://...')), how
> will the data be spread across the workers? Is the master node responsible
> for this task, i.e. does it read all the data from S3 and then spread it to
> the workers? So it might be a trade-off compared to HDFS? Or did I get
> something wrong here?
> 
> In what ways is S3 better than HDFS?
> 
> Thanks in advance
> 


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



[Pyspark, Python 2.7] Executor hangup caused by Unicode error while logging uncaught exception in worker

2017-04-27 Thread Sebastian Nagel
Hi,

I've seen a hang-up of a job (or rather, of one of its executors) when the
message of an uncaught exception contains bytes which cannot be decoded as
Unicode characters. The last lines in the executor logs were:

PySpark worker failed with exception:
Traceback (most recent call last):
  File "/data/1/yarn/local/usercache/ubuntu/appcache/application_1492496523387_0009/container_1492496523387_0009_01_06/pyspark.zip/pyspark/worker.py", line 178, in main
    write_with_length(traceback.format_exc().encode("utf-8"), outfile)
UnicodeDecodeError: 'ascii' codec can't decode byte 0x8b in position 1386: ordinal not in range(128)

After that nothing happened for hours; no CPU was used on the machine running
the executor.
First seen with Spark on YARN:
 Spark 2.1.0, Scala 2.11.8
 Python 2.7.6
 Hadoop 2.6.0-cdh5.11.0

Reproduced with Spark 2.1.0 and Python 2.7.12 in local mode and traced down to this small script:
   https://gist.github.com/sebastian-nagel/310a5a5f39cc668fb71b6ace208706f7
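
For reference, a minimal Python 2 sketch (an illustrative assumption, not the
contents of the linked gist) of why the encode call in worker.py itself fails:

  import traceback

  try:
      # An exception whose message contains a non-ASCII byte (0x8b, e.g. a
      # gzip magic byte quoted in an error message).
      raise ValueError('bad byte: \x8b')
  except ValueError:
      tb = traceback.format_exc()   # a byte string ('str') under Python 2
      # Calling .encode('utf-8') on a byte string makes Python 2 first decode
      # it implicitly with the ASCII codec, which fails on 0x8b and raises a
      # second UnicodeDecodeError inside the error-reporting path itself.
      tb.encode('utf-8')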

Is this a known problem?

Of course, one may argue that the job would have failed anyway, but a hang-up
isn't nice: on YARN it blocks resources (containers) until the application is
killed.


Thanks,
Sebastian


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org