Hi,

I have a question about how many workers Spark uses to parallelize the 
loading of files with sc.textFile. When I use sc.textFile to read multiple 
files from AWS S3, it seems to use only 2 workers, regardless of how many 
worker nodes I have in my cluster. How does Spark decide the degree of 
parallelism relative to the number of nodes in the cluster? In the following 
case, Spark has 896 tasks split between only two nodes, 10.162.97.235 and 
10.162.97.237, although I have 9 nodes in the cluster.

Thanks

Example of doing a count:
 scala> s3File.count
16/02/04 18:12:06 INFO SparkContext: Starting job: count at <console>:30
16/02/04 18:12:06 INFO DAGScheduler: Got job 0 (count at <console>:30) with 896 output partitions
16/02/04 18:12:06 INFO DAGScheduler: Final stage: ResultStage 0 (count at <console>:30)
16/02/04 18:12:06 INFO DAGScheduler: Parents of final stage: List()
16/02/04 18:12:06 INFO DAGScheduler: Missing parents: List()
16/02/04 18:12:06 INFO DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[1] at textFile at <console>:27), which has no missing parents
16/02/04 18:12:07 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 3.0 KB, free 228.3 KB)
16/02/04 18:12:07 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 1834.0 B, free 230.1 KB)
16/02/04 18:12:07 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on 10.162.98.112:46425 (size: 1834.0 B, free: 517.4 MB)
16/02/04 18:12:07 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:1006
16/02/04 18:12:07 INFO DAGScheduler: Submitting 896 missing tasks from ResultStage 0 (MapPartitionsRDD[1] at textFile at <console>:27)
16/02/04 18:12:07 INFO YarnScheduler: Adding task set 0.0 with 896 tasks
16/02/04 18:12:07 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, 10.162.97.235, partition 0,RACK_LOCAL, 2213 bytes)
16/02/04 18:12:07 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, 10.162.97.237, partition 1,RACK_LOCAL, 2213 bytes)
16/02/04 18:12:07 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on 10.162.97.235:38643 (size: 1834.0 B, free: 1259.8 MB)
16/02/04 18:12:07 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on 10.162.97.237:45360 (size: 1834.0 B, free: 1259.8 MB)
16/02/04 18:12:07 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 10.162.97.237:45360 (size: 23.8 KB, free: 1259.8 MB)
16/02/04 18:12:07 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 10.162.97.235:38643 (size: 23.8 KB, free: 1259.8 MB)
