Or increase minPartitions when reading the files: sc.textFile(path, minPartitions = 9)
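As a rough sketch of how one might size that hint (the helper function and the 4-cores-per-executor figure are illustrative assumptions, not from this thread):

```scala
// Hypothetical helper (not a Spark API): pick a partition count that keeps
// every core busy. A common rule of thumb is 2-3 tasks per available core.
def suggestedPartitions(numExecutors: Int, coresPerExecutor: Int,
                        tasksPerCore: Int = 2): Int =
  numExecutors * coresPerExecutor * tasksPerCore

val n = suggestedPartitions(9, 4)  // 9 nodes x 4 cores x 2 tasks = 72
println(n)

// Then, in spark-shell with a live SparkContext `sc` and an S3 `path`:
//   val s3File = sc.textFile(path, minPartitions = n)
// Note that minPartitions is only a lower-bound hint to the underlying
// Hadoop input format; the actual partition count can come out higher.
```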
On Thu, Feb 4, 2016 at 11:41 PM, Takeshi Yamamuro <linguin....@gmail.com> wrote:

> Hi,
>
> ISTM these tasks are just assigned to executors on their preferred nodes, so
> how about repartitioning the rdd?
>
> s3File.repartition(9).count
>
> On Fri, Feb 5, 2016 at 5:04 AM, Lin, Hao <hao....@finra.org> wrote:
>
>> Hi,
>>
>> I have a question on the number of workers Spark uses to parallelize the
>> loading of files with sc.textFile. When I use sc.textFile to access
>> multiple files in AWS S3, it seems to use only 2 workers regardless of how
>> many worker nodes I have in my cluster. How does Spark configure the
>> parallelism with respect to the number of nodes in the cluster? In the
>> following case, Spark split 896 tasks between only two nodes,
>> 10.162.97.235 and 10.162.97.237, while I have 9 nodes in the cluster.
>>
>> thanks
>>
>> Example of doing a count:
>>
>> scala> s3File.count
>> 16/02/04 18:12:06 INFO SparkContext: Starting job: count at <console>:30
>> 16/02/04 18:12:06 INFO DAGScheduler: Got job 0 (count at <console>:30) with 896 output partitions
>> 16/02/04 18:12:06 INFO DAGScheduler: Final stage: ResultStage 0 (count at <console>:30)
>> 16/02/04 18:12:06 INFO DAGScheduler: Parents of final stage: List()
>> 16/02/04 18:12:06 INFO DAGScheduler: Missing parents: List()
>> 16/02/04 18:12:06 INFO DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[1] at textFile at <console>:27), which has no missing parents
>> 16/02/04 18:12:07 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 3.0 KB, free 228.3 KB)
>> 16/02/04 18:12:07 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 1834.0 B, free 230.1 KB)
>> 16/02/04 18:12:07 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on 10.162.98.112:46425 (size: 1834.0 B, free: 517.4 MB)
>> 16/02/04 18:12:07 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:1006
>> 16/02/04 18:12:07 INFO DAGScheduler: Submitting 896 missing tasks from ResultStage 0 (MapPartitionsRDD[1] at textFile at <console>:27)
>> 16/02/04 18:12:07 INFO YarnScheduler: Adding task set 0.0 with 896 tasks
>> 16/02/04 18:12:07 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, 10.162.97.235, partition 0,RACK_LOCAL, 2213 bytes)
>> 16/02/04 18:12:07 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, 10.162.97.237, partition 1,RACK_LOCAL, 2213 bytes)
>> 16/02/04 18:12:07 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on 10.162.97.235:38643 (size: 1834.0 B, free: 1259.8 MB)
>> 16/02/04 18:12:07 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on 10.162.97.237:45360 (size: 1834.0 B, free: 1259.8 MB)
>> 16/02/04 18:12:07 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 10.162.97.237:45360 (size: 23.8 KB, free: 1259.8 MB)
>> 16/02/04 18:12:07 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 10.162.97.235:38643 (size: 23.8 KB, free: 1259.8 MB)
>
> --
> ---
> Takeshi Yamamuro
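For scale, a plain-Scala back-of-the-envelope using the 896 tasks reported in the log above (no Spark session required to run this part):

```scala
// 896 tasks split across only the 2 nodes the scheduler preferred,
// versus an even spread over all 9 nodes in the cluster.
val tasks = 896
val perNodeNow   = math.ceil(tasks.toDouble / 2).toInt  // tasks per node today
val perNodeIdeal = math.ceil(tasks.toDouble / 9).toInt  // tasks per node if even
println(s"current: $perNodeNow/node, even spread: $perNodeIdeal/node")

// In spark-shell, the repartition suggested in the reply forces that even
// spread at the cost of a full shuffle:
//   s3File.repartition(9).count
// (repartition always shuffles; coalesce(n) avoids the shuffle but is only
// suitable for *reducing* the partition count.)
```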