Hi everyone, my environment is PySpark with Spark 2.0.0.
I'm using Spark to load data from a large number of files into a DataFrame with fields, say, field1 to field10. While loading, I ensured that records are partitioned by field1 and field2 (without using partitionBy); this was done while the data was still an RDD of lists, before the .toDF() call. So I assume Spark does not already know that such a partitioning exists, and it might trigger a shuffle if I call a shuffling transform keyed on field1 or field2 and then cache the result.

Is it possible to inform Spark about my custom partitioning scheme once I've created the DataFrame? Or would Spark have discovered this somehow before the shuffling transform is called?