Hi,

It seems to me these tasks are simply assigned to executors on their preferred nodes, so how about repartitioning the RDD?

s3File.repartition(9).count

On Fri, Feb 5, 2016 at 5:04 AM, Lin, Hao <hao....@finra.org> wrote:

> Hi,
>
> I have a question about the number of workers Spark uses to parallelize
> the loading of files with sc.textFile. When I use sc.textFile to access
> multiple files in AWS S3, it seems to use only 2 workers, regardless of
> how many worker nodes I have in my cluster. How does Spark configure the
> parallelization with respect to the number of cluster nodes? In the
> following case, Spark splits 896 tasks between only two nodes,
> 10.162.97.235 and 10.162.97.237, while I have 9 nodes in the cluster.
>
> thanks
>
> Example of doing a count:
>
> scala> s3File.count
> 16/02/04 18:12:06 INFO SparkContext: Starting job: count at <console>:30
> 16/02/04 18:12:06 INFO DAGScheduler: Got job 0 (count at <console>:30) with 896 output partitions
> 16/02/04 18:12:06 INFO DAGScheduler: Final stage: ResultStage 0 (count at <console>:30)
> 16/02/04 18:12:06 INFO DAGScheduler: Parents of final stage: List()
> 16/02/04 18:12:06 INFO DAGScheduler: Missing parents: List()
> 16/02/04 18:12:06 INFO DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[1] at textFile at <console>:27), which has no missing parents
> 16/02/04 18:12:07 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 3.0 KB, free 228.3 KB)
> 16/02/04 18:12:07 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 1834.0 B, free 230.1 KB)
> 16/02/04 18:12:07 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on 10.162.98.112:46425 (size: 1834.0 B, free: 517.4 MB)
> 16/02/04 18:12:07 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:1006
> 16/02/04 18:12:07 INFO DAGScheduler: Submitting 896 missing tasks from ResultStage 0 (MapPartitionsRDD[1] at textFile at <console>:27)
> 16/02/04 18:12:07 INFO YarnScheduler: Adding task set 0.0 with 896 tasks
> 16/02/04 18:12:07 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, 10.162.97.235, partition 0,RACK_LOCAL, 2213 bytes)
> 16/02/04 18:12:07 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, 10.162.97.237, partition 1,RACK_LOCAL, 2213 bytes)
> 16/02/04 18:12:07 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on 10.162.97.235:38643 (size: 1834.0 B, free: 1259.8 MB)
> 16/02/04 18:12:07 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on 10.162.97.237:45360 (size: 1834.0 B, free: 1259.8 MB)
> 16/02/04 18:12:07 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 10.162.97.237:45360 (size: 23.8 KB, free: 1259.8 MB)
> 16/02/04 18:12:07 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 10.162.97.235:38643 (size: 23.8 KB, free: 1259.8 MB)

--
---
Takeshi Yamamuro
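A minimal spark-shell sketch of the suggestion above. It assumes a running SparkContext `sc`; the S3 path is a placeholder, not the original poster's actual bucket:

```scala
// Hypothetical path for illustration; substitute your own S3 location.
val s3File = sc.textFile("s3n://my-bucket/path/*")

// repartition() triggers a shuffle, redistributing the existing partitions
// so subsequent tasks can be spread across all 9 worker nodes instead of
// only the executors holding the preferred (data-local) locations.
s3File.repartition(9).count
```

Note that `repartition` incurs a full shuffle of the data. An alternative is to pass a `minPartitions` hint when reading, e.g. `sc.textFile(path, 900)`, which influences how the input splits are created in the first place, though task placement still depends on the scheduler's locality preferences.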