Hi everyone, my environment is PySpark with Spark 2.0.0.
I'm using Spark to load data from a large number of files into a DataFrame with fields, say, field1 to field10. While loading, I ensured that records are partitioned by field1 and field2 (without using partitionBy); this was done while the data was still an RDD of lists, before the .toDF() call. So I assume Spark does not already know that such a partitioning exists, and it might trigger a shuffle if I call a shuffling transform keyed on field1 or field2 and then cache the result.

Is it possible to inform Spark about my custom partitioning scheme once I've created the DataFrame? Or would Spark have discovered this somehow before the shuffling transform is called?