Thanks Sean, I had seen that post you mentioned. What you suggest looks like an in-memory sort, which is fine if each partition is small enough to fit in memory. Is it true that rdd.sortByKey(...) requires partitions to fit in memory? I wasn't sure if there was some magic behind the scenes that supports arbitrarily large sorts.
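Just so we're talking about the same thing, here's roughly what I understood your suggestion to be: drain each partition into a list inside mapPartitions and sort it there. This is an untested sketch against the Spark 1.x Java API (where FlatMapFunction returns an Iterable); the Tuple2<Integer, String> element type and the name "rdd" are just for illustration:

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.Comparator;
    import java.util.Iterator;
    import java.util.List;

    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.function.FlatMapFunction;

    import scala.Tuple2;

    // Sort each partition independently, entirely in memory.
    JavaRDD<Tuple2<Integer, String>> sorted = rdd.mapPartitions(
        new FlatMapFunction<Iterator<Tuple2<Integer, String>>, Tuple2<Integer, String>>() {
            public Iterable<Tuple2<Integer, String>> call(Iterator<Tuple2<Integer, String>> it) {
                // Draining the iterator into a list is the step that
                // assumes the whole partition fits in memory.
                List<Tuple2<Integer, String>> buf = new ArrayList<Tuple2<Integer, String>>();
                while (it.hasNext()) {
                    buf.add(it.next());
                }
                Collections.sort(buf, new Comparator<Tuple2<Integer, String>>() {
                    public int compare(Tuple2<Integer, String> a, Tuple2<Integer, String> b) {
                        return a._1().compareTo(b._1());
                    }
                });
                return buf;
            }
        });

If that's the idea, it confirms my reading that each partition has to fit in memory.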
None of this is a showstopper; it just might require a little more code on the part of the developer. If Spark requires partitions to fit in memory, developers will have to be aware of that and plan accordingly. One nice feature of Hadoop MR is the ability to sort very large data sets without thinking about data size. If a developer repartitions an RDD such that some partitions don't fit in memory, sorting those partitions requires more work. For those cases, I think there is value in a robust partition sorting method that handles them efficiently and reliably.

Is there another solution for sorting arbitrarily large partitions? If not, I don't mind developing and contributing one; a rough sketch of what I have in mind is in the P.S. below.

--
Madhu
https://www.linkedin.com/in/msiddalingaiah
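P.S. To make the offer concrete, here's the kind of external merge sort I'm picturing for a single oversized partition: buffer up to a fixed number of records, sort and spill each full buffer to a temp file as a sorted run, then merge the runs with a priority queue. Untested sketch in plain Java with String records for simplicity; a real version would need to handle arbitrary key/value types, serialization, and temp file cleanup.

    import java.io.*;
    import java.util.*;

    public class ExternalSort {

        // Sort an arbitrarily large iterator using at most maxInMemory
        // records of heap at a time, spilling sorted runs to disk.
        public static Iterator<String> sort(Iterator<String> input, int maxInMemory)
                throws IOException {
            List<File> runs = new ArrayList<File>();
            List<String> buf = new ArrayList<String>(maxInMemory);
            while (input.hasNext()) {
                buf.add(input.next());
                if (buf.size() == maxInMemory) {
                    runs.add(spill(buf));
                    buf.clear();
                }
            }
            if (!buf.isEmpty()) {
                runs.add(spill(buf));
            }
            return merge(runs);
        }

        // Sort one buffer in memory and write it out as a sorted run.
        private static File spill(List<String> buf) throws IOException {
            Collections.sort(buf);
            File f = File.createTempFile("run", ".tmp");
            f.deleteOnExit();
            PrintWriter out = new PrintWriter(new BufferedWriter(new FileWriter(f)));
            for (String s : buf) {
                out.println(s);
            }
            out.close();
            return f;
        }

        // One open run, ordered by its current head record.
        private static class Run implements Comparable<Run> {
            String head;
            BufferedReader in;
            Run(BufferedReader in) throws IOException { this.in = in; head = in.readLine(); }
            boolean advance() throws IOException { head = in.readLine(); return head != null; }
            public int compareTo(Run o) { return head.compareTo(o.head); }
        }

        // Lazy k-way merge of the sorted runs via a priority queue.
        private static Iterator<String> merge(List<File> files) throws IOException {
            final PriorityQueue<Run> pq = new PriorityQueue<Run>();
            for (File f : files) {
                Run r = new Run(new BufferedReader(new FileReader(f)));
                if (r.head != null) pq.add(r);
            }
            return new Iterator<String>() {
                public boolean hasNext() { return !pq.isEmpty(); }
                public String next() {
                    Run r = pq.poll();
                    String result = r.head;
                    try {
                        if (r.advance()) pq.add(r);
                        else r.in.close();
                    } catch (IOException e) { throw new RuntimeException(e); }
                    return result;
                }
                public void remove() { throw new UnsupportedOperationException(); }
            };
        }
    }

The nice property is that the merge returns a lazy iterator, so it could be dropped straight into mapPartitions without ever materializing the whole partition in memory.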