Or increase minPartitions when reading the files: sc.textFile(path, minPartitions = 9)
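As a rough sketch of how one might size that hint (the helper function and the 4-cores-per-executor figure are illustrative assumptions, not from this thread):

```scala
// Hypothetical helper (not a Spark API): pick a partition count that keeps
// every core busy. A common rule of thumb is 2-3 tasks per available core.
def suggestedPartitions(numExecutors: Int, coresPerExecutor: Int,
                        tasksPerCore: Int = 2): Int =
  numExecutors * coresPerExecutor * tasksPerCore

val n = suggestedPartitions(9, 4)  // 9 nodes x 4 cores x 2 tasks = 72
println(n)

// Then, in spark-shell with a live SparkContext `sc` and an S3 `path`:
//   val s3File = sc.textFile(path, minPartitions = n)
// Note that minPartitions is only a lower-bound hint to the underlying
// Hadoop input format; the actual partition count can come out higher.
```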
On Thu, Feb 4, 2016 at 11:41 PM, Takeshi Yamamuro <linguin....@gmail.com> wrote:

> Hi,
>
> ISTM these tasks are just assigned to executors on their preferred nodes, so
> how about repartitioning the rdd?
>
> s3File.repartition(9).count
>
> On Fri, Feb 5, 2016 at 5:04 AM, Lin, Hao <hao....@finra.org> wrote:
>
>> Hi,
>>
>> I have a question on the number of workers Spark uses to parallelize the
>> loading of files with sc.textFile. When I use sc.textFile to access
>> multiple files in AWS S3, it seems to use only 2 workers regardless of how
>> many worker nodes I have in my cluster. How does Spark configure the
>> parallelism with respect to the number of nodes in the cluster? In the
>> following case, Spark split 896 tasks between only two nodes,
>> 10.162.97.235 and 10.162.97.237, while I have 9 nodes in the cluster.
>>
>> thanks
>>
>> Example of doing a count:
>>
>> scala> s3File.count
>> 16/02/04 18:12:06 INFO SparkContext: Starting job: count at <console>:30
>> 16/02/04 18:12:06 INFO DAGScheduler: Got job 0 (count at <console>:30) with 896 output partitions
>> 16/02/04 18:12:06 INFO DAGScheduler: Final stage: ResultStage 0 (count at <console>:30)
>> 16/02/04 18:12:06 INFO DAGScheduler: Parents of final stage: List()
>> 16/02/04 18:12:06 INFO DAGScheduler: Missing parents: List()
>> 16/02/04 18:12:06 INFO DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[1] at textFile at <console>:27), which has no missing parents
>> 16/02/04 18:12:07 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 3.0 KB, free 228.3 KB)
>> 16/02/04 18:12:07 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 1834.0 B, free 230.1 KB)
>> 16/02/04 18:12:07 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on 10.162.98.112:46425 (size: 1834.0 B, free: 517.4 MB)
>> 16/02/04 18:12:07 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:1006
>> 16/02/04 18:12:07 INFO DAGScheduler: Submitting 896 missing tasks from ResultStage 0 (MapPartitionsRDD[1] at textFile at <console>:27)
>> 16/02/04 18:12:07 INFO YarnScheduler: Adding task set 0.0 with 896 tasks
>> 16/02/04 18:12:07 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, 10.162.97.235, partition 0,RACK_LOCAL, 2213 bytes)
>> 16/02/04 18:12:07 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, 10.162.97.237, partition 1,RACK_LOCAL, 2213 bytes)
>> 16/02/04 18:12:07 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on 10.162.97.235:38643 (size: 1834.0 B, free: 1259.8 MB)
>> 16/02/04 18:12:07 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on 10.162.97.237:45360 (size: 1834.0 B, free: 1259.8 MB)
>> 16/02/04 18:12:07 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 10.162.97.237:45360 (size: 23.8 KB, free: 1259.8 MB)
>> 16/02/04 18:12:07 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 10.162.97.235:38643 (size: 23.8 KB, free: 1259.8 MB)
>
> --
> ---
> Takeshi Yamamuro
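For scale, a plain-Scala back-of-the-envelope using the 896 tasks reported in the log above (no Spark session required to run this part):

```scala
// 896 tasks split across only the 2 nodes the scheduler preferred,
// versus an even spread over all 9 nodes in the cluster.
val tasks = 896
val perNodeNow   = math.ceil(tasks.toDouble / 2).toInt  // tasks per node today
val perNodeIdeal = math.ceil(tasks.toDouble / 9).toInt  // tasks per node if even
println(s"current: $perNodeNow/node, even spread: $perNodeIdeal/node")

// In spark-shell, the repartition suggested in the reply forces that even
// spread at the cost of a full shuffle:
//   s3File.repartition(9).count
// (repartition always shuffles; coalesce(n) avoids the shuffle but is only
// suitable for *reducing* the partition count.)
```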