Hi,

It seems those tasks are simply being assigned to executors on the nodes
that hold the preferred (data-local) splits, so how about repartitioning the
RDD across all of your nodes?

s3File.repartition(9).count
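
For example, in the spark-shell (a sketch only; the S3 path is a placeholder,
and 9 simply matches the number of nodes in your cluster):

```scala
// Sketch for the spark-shell, assuming Spark 1.6-era APIs.
// The second argument to textFile is a *minimum* partition hint,
// which can raise read parallelism without an extra shuffle.
val s3File = sc.textFile("s3n://your-bucket/path/*", 9)

// Alternatively, redistribute an already-loaded RDD evenly across
// executors; repartition() triggers a shuffle, so it costs one pass
// over the data but balances downstream stages.
s3File.repartition(9).count
```

Note that the scheduler may still prefer data-local executors when launching
tasks; lowering spark.locality.wait can also make it fall back to other nodes
sooner when only a few nodes hold the preferred splits.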

On Fri, Feb 5, 2016 at 5:04 AM, Lin, Hao <hao....@finra.org> wrote:

> Hi,
>
>
>
> I have a question about how many workers Spark uses to parallelize loading
> files with sc.textFile. When I use sc.textFile to access multiple files in
> AWS S3, only 2 workers seem to be used regardless of how many worker nodes
> I have in my cluster. How does Spark configure parallelization with respect
> to cluster size? In the following case, Spark splits 896 tasks between only
> two nodes, 10.162.97.235 and 10.162.97.237, while I have 9 nodes in the
> cluster.
>
>
>
> thanks
>
>
>
> Example of doing a count:
>
>  scala> s3File.count
>
> 16/02/04 18:12:06 INFO SparkContext: Starting job: count at <console>:30
>
> 16/02/04 18:12:06 INFO DAGScheduler: Got job 0 (count at <console>:30)
> with 896 output partitions
>
> 16/02/04 18:12:06 INFO DAGScheduler: Final stage: ResultStage 0 (count at
> <console>:30)
>
> 16/02/04 18:12:06 INFO DAGScheduler: Parents of final stage: List()
>
> 16/02/04 18:12:06 INFO DAGScheduler: Missing parents: List()
>
> 16/02/04 18:12:06 INFO DAGScheduler: Submitting ResultStage 0
> (MapPartitionsRDD[1] at textFile at <console>:27), which has no missing
> parents
>
> 16/02/04 18:12:07 INFO MemoryStore: Block broadcast_1 stored as values in
> memory (estimated size 3.0 KB, free 228.3 KB)
>
> 16/02/04 18:12:07 INFO MemoryStore: Block broadcast_1_piece0 stored as
> bytes in memory (estimated size 1834.0 B, free 230.1 KB)
>
> 16/02/04 18:12:07 INFO BlockManagerInfo: Added broadcast_1_piece0 in
> memory on 10.162.98.112:46425 (size: 1834.0 B, free: 517.4 MB)
>
> 16/02/04 18:12:07 INFO SparkContext: Created broadcast 1 from broadcast at
> DAGScheduler.scala:1006
>
> 16/02/04 18:12:07 INFO DAGScheduler: Submitting 896 missing tasks from
> ResultStage 0 (MapPartitionsRDD[1] at textFile at <console>:27)
>
> 16/02/04 18:12:07 INFO YarnScheduler: Adding task set 0.0 with 896 tasks
>
> 16/02/04 18:12:07 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID
> 0, 10.162.97.235, partition 0,RACK_LOCAL, 2213 bytes)
>
> 16/02/04 18:12:07 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID
> 1, 10.162.97.237, partition 1,RACK_LOCAL, 2213 bytes)
>
> 16/02/04 18:12:07 INFO BlockManagerInfo: Added broadcast_1_piece0 in
> memory on 10.162.97.235:38643 (size: 1834.0 B, free: 1259.8 MB)
>
> 16/02/04 18:12:07 INFO BlockManagerInfo: Added broadcast_1_piece0 in
> memory on 10.162.97.237:45360 (size: 1834.0 B, free: 1259.8 MB)
>
> 16/02/04 18:12:07 INFO BlockManagerInfo: Added broadcast_0_piece0 in
> memory on 10.162.97.237:45360 (size: 23.8 KB, free: 1259.8 MB)
>
> 16/02/04 18:12:07 INFO BlockManagerInfo: Added broadcast_0_piece0 in
> memory on 10.162.97.235:38643 (size: 23.8 KB, free: 1259.8 MB)
>



-- 
---
Takeshi Yamamuro
