Hi,

My default parallelism is 100. When I join two dataframes with 20 partitions
each, the joined dataframe has 100 partitions. Is there a way to keep it at
20 (other than repartition and coalesce)?
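(For context, a sketch of what I mean, with hypothetical names df1/df2/c1/c2 — as I understand it, the post-join partition count for a shuffled DataFrame join comes from spark.sql.shuffle.partitions, so lowering that setting would change the output count without an explicit repartition/coalesce, unless the join is planned as a broadcast:)

```scala
// Sketch, Spark 1.5-era API; df1, df2 and column names are hypothetical.
// The shuffle that implements the join writes its output into
// spark.sql.shuffle.partitions partitions, so setting it to 20 should
// make the joined dataframe have 20 partitions.
sqlContext.setConf("spark.sql.shuffle.partitions", "20")
val joined = df1.join(df2, df1("c1") === df2("c1") && df1("c2") === df2("c2"))
println(joined.rdd.partitions.length)  // should now reflect the setting
```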

Also, when I join these two dataframes I use 4 columns as the join
columns. The dataframes are already partitioned on the first 2 of those
columns, so in effect each partition only needs to be joined with its
corresponding partition on the other side and not with the rest. Why does
Spark still shuffle all the data?
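(With pair RDDs this is possible today: if both sides share the same partitioner object, join reuses it and does not re-shuffle. A sketch with hypothetical names rdd1/rdd2, both keyed on the join key — I'd expect something equivalent for DataFrames:)

```scala
import org.apache.spark.HashPartitioner

// Co-partition both sides once, up front.
val part  = new HashPartitioner(20)
val left  = rdd1.partitionBy(part)  // shuffles here
val right = rdd2.partitionBy(part)  // shuffles here
// join() sees the shared partitioner and performs a partition-local
// join, with no further shuffle.
val joined = left.join(right)
```

My impression is that in 1.5 the DataFrame planner simply has no knowledge of how the input dataframes happen to be partitioned, so it cannot prove they are co-partitioned and shuffles both sides — but I'd like confirmation.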

Similarly, when my dataframe is partitioned by col1,col2 and I group by
col1,col2,col3,col4, why does it shuffle everything? All rows with the same
(col1,col2) are already in one partition, so it could just sort each
partition and do the grouping in place.
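(Again, the RDD API lets me express the behaviour I expected, assuming a pre-partitioned pair RDD keyed on the grouping columns — hypothetical names throughout:)

```scala
import org.apache.spark.HashPartitioner

val part = new HashPartitioner(20)
// Key by all four grouping columns and partition once.
val keyed = rdd.map(r => ((r.c1, r.c2, r.c3, r.c4), 1L)).partitionBy(part)
// reduceByKey with the same partitioner aggregates within each
// partition and skips the shuffle entirely.
val counts = keyed.reduceByKey(part, _ + _)
```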

This is a bit confusing. I am using 1.5.1 — is this fixed in later
versions?

Thanks
