I am using a Spark cluster on Amazon EC2 (launched with the spark-ec2 script
from spark-1.6-prebuilt-with-hadoop-2.6) to run a Scala driver application
that reads S3 object content in parallel.

I have tried sc.textFile with an "s3n://bucket" path, and I have also tried
building an RDD of S3 keys and then using the Java AWS SDK to map each key to
an s3client.getObject(bucket, key) call that reads the content.
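Roughly, the two approaches look like this (simplified; the bucket name and
the key list are placeholders):

    import com.amazonaws.services.s3.AmazonS3Client
    import org.apache.commons.io.IOUtils

    // Approach 1: Hadoop's s3n filesystem reads everything under a prefix.
    val linesRDD = sc.textFile("s3n://my-bucket/some/prefix/*")

    // Approach 2: parallelize the key list and fetch each object
    // with the AWS SDK.
    val keys: Seq[String] = ??? // the object keys, listed beforehand
    val contentRDD = sc.parallelize(keys).map { key =>
      // One client per record is wasteful, but this mirrors the
      // straightforward version of my code.
      val s3 = new AmazonS3Client()
      val obj = s3.getObject("my-bucket", key)
      try IOUtils.toString(obj.getObjectContent)
      finally obj.close()
    }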

In both cases, reads get quite slow once the number of objects grows to even
a few thousand. For example, 8000 objects took almost 4 minutes. In contrast,
a custom C++ S3 HTTP client that fetches them in parallel with multiple
threads, running on a VM with a single CPU, took 1 minute with 8 threads and
12 seconds with 80 threads.

I observed two issues with both the s3n-based reads and the s3client-based
calls:

(a) Each slave in the Spark cluster has 2 vCPUs. The job (see NOTE below) is
partitioned into #tasks, either = #objects or = 2 x #nodes. But on each node,
Spark does not always start two or more executor worker threads as I
expected; instead, a single worker thread executes the tasks serially, one
after the other.
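For reference, this is roughly how I am setting up the driver; the values
are illustrative for an 8-slave x 2-vCPU cluster, and I may be missing the
right knob:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("s3-read-test")
      .set("spark.cores.max", "16")     // all 16 vCPUs across the 8 slaves
      .set("spark.executor.cores", "2") // expect 2 task threads per executor
    val sc = new SparkContext(conf)

    // Force enough partitions that every core gets a task.
    val keysRDD = sc.parallelize(keys, 16)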

(b) Each executor worker thread seems to go through its object keys
serially, and the thread ends up almost always blocked waiting on the socket
stream.

Has anyone seen this, or am I doing something wrong? How do I make sure all
the CPUs are used? And how do I instruct the executors to start multiple
threads to process data in parallel within a partition/node/task?
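To be concrete about the last question: the only workaround I can think of
is to spawn my own threads inside mapPartitions, along these lines (a rough
sketch; the thread count of 8 per task and the bucket name are placeholders).
Is this the intended approach, or is there a Spark setting for it?

    import java.util.concurrent.Executors
    import scala.concurrent.{Await, ExecutionContext, Future}
    import scala.concurrent.duration.Duration
    import com.amazonaws.services.s3.AmazonS3Client
    import org.apache.commons.io.IOUtils

    val contentRDD = sc.parallelize(keys, 16).mapPartitions { iter =>
      // Overlap the socket waits: fetch up to 8 objects concurrently
      // per task. The AWS SDK client is thread-safe, so it is shared.
      val pool = Executors.newFixedThreadPool(8)
      implicit val ec: ExecutionContext =
        ExecutionContext.fromExecutorService(pool)
      val s3 = new AmazonS3Client()
      val futures = iter.map { key =>
        Future {
          val obj = s3.getObject("my-bucket", key)
          try IOUtils.toString(obj.getObjectContent)
          finally obj.close()
        }
      }.toList // materialize so all fetches are in flight before blocking
      val results = futures.map(Await.result(_, Duration.Inf))
      pool.shutdown()
      results.iterator
    }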

NOTE: since sc.textFile and map are both transformations, I use
resultRDD.count to kick off the actual reads; the times above are for the
count job.
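That is, the timing is measured roughly like this:

    val start = System.nanoTime()
    val n = contentRDD.count() // triggers the actual S3 reads
    println(s"read $n objects in ${(System.nanoTime() - start) / 1e9} s")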



