I am using Spark 2.3.0 and trying to read a CSV file which has 500 records.
When I read it, Spark shows two stages, 10 and 11, which then join into
stage 12.

This makes sense and is what I would expect, as I have 30 map-based UDFs,
after which I do a join, run another 10 UDFs, and then save the result as
Parquet.
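For reference, the job shape is roughly the following. This is only a
minimal sketch; the paths, column names, and UDF bodies are placeholders,
not my real code:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.udf

    object PipelineSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder.appName("csv-pipeline").getOrCreate()

        // Read the 500-record CSV (placeholder path).
        val left = spark.read.option("header", "true").csv("/data/input.csv")

        // Stand-in for the ~30 heavy map-side UDFs; each one derives a column.
        val heavyUdf = udf((s: String) => s.trim.toUpperCase)  // placeholder logic
        val enriched = left.withColumn("derived", heavyUdf(left("some_col")))

        // Join against a second dataset (also a placeholder).
        val lookup = spark.read.option("header", "true").csv("/data/lookup.csv")
        val joined = enriched.join(lookup, Seq("key"))

        // The remaining ~10 UDFs, then write the result as Parquet.
        val scored = joined.withColumn("score", heavyUdf(joined("derived")))
        scored.write.mode("overwrite").parquet("/data/output")
      }
    }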

Stages 10 and 11 have only 2 tasks according to the Spark UI. My cluster
allows a maximum of 20 executors, and I would like Spark to use all 20 for
this job.
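For context, the 20-executor cap comes from the cluster configuration.
Assuming it is expressed through dynamic allocation (I am not certain it is
set exactly this way), it would look something like:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder
      .appName("csv-pipeline")
      .config("spark.dynamicAllocation.enabled", "true")
      .config("spark.dynamicAllocation.maxExecutors", "20")
      .getOrCreate()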

*1 CSV + repartition*: Right after reading the file, if I do a
repartition, it still uses *2 tasks*.
*1 CSV + repartition + count()*: Right after reading the file, if I do a
repartition followed by an action such as count(), it still uses *2
tasks*.
*50 CSVs*: If I split my 500-line CSV into 50 files of 10 lines each, it
uses *18 tasks*.
*50 CSVs + repartition*: If I split my 500-line CSV into 50 files of 10
lines each, then do a repartition and a count, it uses *19 tasks*.
*500 CSVs + repartition*: If I split my 500-line CSV into 500 files of 1
line each, then do a repartition and a count, it uses *19 tasks*.

All repartition calls above are .repartition(200); a stripped-down version
of what I am running is below.
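This is only a minimal sketch of the experiment (the input path and CSV
options are placeholders):

    import org.apache.spark.sql.SparkSession

    object RepartitionCheck {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder.appName("repartition-check").getOrCreate()

        // Point this at the single 500-line CSV or at the split-up directory.
        val df = spark.read.option("header", "true").csv("/data/input")

        // Ask for 200 partitions, then force evaluation with an action.
        val repartitioned = df.repartition(200)
        println(repartitioned.count())

        // Check how many partitions the resulting DataFrame reports.
        println(repartitioned.rdd.getNumPartitions)
      }
    }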

I can't understand what Spark is trying to do here.
I expected that .repartition(200) would simply create 200 tasks after
shuffling the data, but it's not doing that.
I recall this working fine on Spark 1.6.x.

PS: The reason I want more tasks is that those UDFs are very heavy and
slow - I'd like to use more executors to reduce the computation time. I'm
sure they are parallelizable.
