I am using Spark 2.3.0 and trying to read a CSV file which has 500 records. When I read it, Spark shows two stages (10 and 11) which then join into stage 12.
This makes sense and is what I would expect: I have 30 map-based UDFs, after which I do a join, run another 10 UDFs, and then save the result as Parquet. However, stages 10 and 11 have only *2 tasks* according to Spark. My cluster allows a maximum of 20 executors, and I would like Spark to use all 20 for this job.

Here is what I have tried:

- *1 csv + repartition*: right after reading the file, if I do a repartition, it still takes *2 tasks*
- *1 csv + repartition + count()*: right after reading the file, if I do a repartition and then an action like count(), it still takes *2 tasks*
- *50 csv*: if I split my 500-line CSV into 50 files of 10 lines each, it takes *18 tasks*
- *50 csv + repartition*: if I split my 500-line CSV into 50 files of 10 lines each, then repartition and count, it takes *19 tasks*
- *500 csv + repartition*: if I split my 500-line CSV into 500 files of 1 line each, then repartition and count, it takes *19 tasks*

All repartitions above are `.repartition(200)`.

I can't understand what Spark is trying to do. I expected that `.repartition(200)` would simply create 200 tasks after shuffling the data, but it doesn't. I recall this working fine on Spark 1.6.x.

PS: The reason I want more tasks is that those UDFs are very heavy and slow; I'd like to use more executors to reduce computation time. I'm sure they are parallelizable.