Hi,
I have input data consisting of many very small files, each containing one JSON document.
For performance reasons (I use PySpark) I have to do repartitioning; currently I
do:
sc.textFile(files).coalesce(100)
The problem is that I have to guess the number of partitions in such a way that
it's as fast as possible while staying on the safe side with RAM.
So this is quite difficult.
For this reason I would like to ask whether there is some way to replace
coalesce(100) with something that creates N partitions of a given size. I went
through the documentation, but I was not able to find a way to do that.
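As far as I know there is no built-in "coalesce to partitions of size X" in the RDD API, but one workaround is to estimate the partition count from the total input size before calling coalesce. A rough sketch (the helper name, the target size, and the assumption that the files are on a locally readable filesystem are all mine, not from Spark):

```python
import glob
import os


def estimate_num_partitions(paths, target_partition_bytes=128 * 1024 * 1024):
    """Return a partition count so that each partition holds roughly
    `target_partition_bytes` of input data (default 128 MiB)."""
    total_bytes = sum(os.path.getsize(p) for p in paths)
    # Ceiling division, with at least one partition.
    return max(1, -(-total_bytes // target_partition_bytes))
```

Then something like:

rdd = sc.textFile(files).coalesce(estimate_num_partitions(glob.glob(files)))

For files on HDFS/S3 you would need to get the sizes from the filesystem API instead of os.path.getsize, but the idea is the same.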
Thank you in advance for any help or advice.
---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org