Is there an efficient way to save an RDD with saveAsTextFile such that the data gets shuffled into separate directories according to a key? (My end goal is to wrap the result in a multi-partitioned Hive table.)
Suppose you have:

```scala
case class MyData(val0: Int, val1: String, directory_name: String)
```

and an RDD called myrdd of type RDD[MyData]. Suppose also that you already have an array of the distinct directory_name values, called distinct_directories. A very inefficient way to do this is:

```scala
distinct_directories.foreach { dir_name =>
  myrdd
    .filter(mydata => mydata.directory_name == dir_name)
    .map(mydata => Array(mydata.val0.toString, mydata.val1).mkString(","))
    .coalesce(5)
    .saveAsTextFile("base_dir_name/" + f"$dir_name")
}
```

I tried this solution, and it does not run the multiple myrdd.filter passes in parallel. I'm guessing partitionBy might be part of an efficient solution, if an easy efficient solution exists...

Thanks,
Arun
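In case it helps frame an answer, here is a rough sketch of the kind of single-pass approach I'm imagining (untested; it assumes Hadoop's `MultipleTextOutputFormat` from `org.apache.hadoop.mapred.lib` can route records to per-key subdirectories, and the class name `KeyBasedOutput` and the partition count are my own placeholders):

```scala
import org.apache.spark.HashPartitioner
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat

// Hypothetical output format that writes each record under a
// subdirectory named after its key, e.g. base_dir_name/<directory_name>/part-00000
class KeyBasedOutput extends MultipleTextOutputFormat[String, String] {
  override def generateFileNameForKeyValue(key: String, value: String, name: String): String =
    key + "/" + name
  // Drop the key so only the value appears in the output lines
  override def generateActualKey(key: String, value: String): String =
    null.asInstanceOf[String]
}

// Key by directory_name, shuffle once, and write everything in one job
myrdd
  .map(d => (d.directory_name, Array(d.val0.toString, d.val1).mkString(",")))
  .partitionBy(new HashPartitioner(20)) // placeholder partition count
  .saveAsHadoopFile("base_dir_name",
    classOf[String], classOf[String], classOf[KeyBasedOutput])
```

The idea would be that the shuffle happens once over the whole RDD instead of one filter pass per directory, but I'm not sure this is the idiomatic way to do it.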