Hi, I have a question on the number of workers that Spark enable to parallelize the loading of files using sc.textFile. When I used sc.textFile to access multiple files in AWS S3, it seems to only enable 2 workers regardless of how many worker nodes I have in my cluster. So how does Spark configure the parallelization in regard of the size of cluster nodes? In the following case, spark has 896 tasks split between only two nodes 10.162.97.235 and 10.162.97.237, while I have 9 nodes in the cluster.
thanks Example of doing a count: scala> s3File.count 16/02/04 18:12:06 INFO SparkContext: Starting job: count at <console>:30 16/02/04 18:12:06 INFO DAGScheduler: Got job 0 (count at <console>:30) with 896 output partitions 16/02/04 18:12:06 INFO DAGScheduler: Final stage: ResultStage 0 (count at <console>:30) 16/02/04 18:12:06 INFO DAGScheduler: Parents of final stage: List() 16/02/04 18:12:06 INFO DAGScheduler: Missing parents: List() 16/02/04 18:12:06 INFO DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[1] at textFile at <console>:27), which has no missing parents 16/02/04 18:12:07 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 3.0 KB, free 228.3 KB) 16/02/04 18:12:07 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 1834.0 B, free 230.1 KB) 16/02/04 18:12:07 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on 10.162.98.112:46425 (size: 1834.0 B, free: 517.4 MB) 16/02/04 18:12:07 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:1006 16/02/04 18:12:07 INFO DAGScheduler: Submitting 896 missing tasks from ResultStage 0 (MapPartitionsRDD[1] at textFile at <console>:27) 16/02/04 18:12:07 INFO YarnScheduler: Adding task set 0.0 with 896 tasks 16/02/04 18:12:07 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, 10.162.97.235, partition 0,RACK_LOCAL, 2213 bytes) 16/02/04 18:12:07 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, 10.162.97.237, partition 1,RACK_LOCAL, 2213 bytes) 16/02/04 18:12:07 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on 10.162.97.235:38643 (size: 1834.0 B, free: 1259.8 MB) 16/02/04 18:12:07 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on 10.162.97.237:45360 (size: 1834.0 B, free: 1259.8 MB) 16/02/04 18:12:07 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 10.162.97.237:45360 (size: 23.8 KB, free: 1259.8 MB) 16/02/04 18:12:07 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 10.162.97.235:38643 (size: 23.8 KB, free: 1259.8 MB) Confidentiality Notice:: This email, including attachments, may include non-public, proprietary, confidential or legally privileged information. If you are not an intended recipient or an authorized agent of an intended recipient, you are hereby notified that any dissemination, distribution or copying of the information contained in or transmitted with this e-mail is unauthorized and strictly prohibited. If you have received this email in error, please notify the sender by replying to this message and permanently delete this e-mail, its attachments, and any copies of it immediately. You should not retain, copy or use this e-mail or any attachment for any purpose, nor disclose all or any part of the contents to any other person. Thank you.