Thanks Sean, I had seen that post you mentioned.

What you suggest looks like an in-memory sort, which is fine if each
partition is small enough to fit in memory. Is it true that rdd.sortByKey(...) requires
partitions to fit in memory? I wasn't sure if there was some magic behind
the scenes that supports arbitrarily large sorts.
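For reference, the in-memory approach boils down to something like the sketch below: drain a partition's iterator into a list, sort it, and return an iterator over the result (plain Java, no Spark dependency; `PartitionSorter` and `sortPartition` are illustrative names, not Spark API).

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;

public class PartitionSorter {
    // Drain the partition's iterator into a list, sort it, and hand back
    // an iterator over the sorted list. This is the essence of sorting a
    // partition via mapPartitions: the whole partition must fit in memory.
    public static <T extends Comparable<T>> Iterator<T> sortPartition(Iterator<T> partition) {
        List<T> buffer = new ArrayList<>();
        while (partition.hasNext()) {
            buffer.add(partition.next());
        }
        Collections.sort(buffer);
        return buffer.iterator();
    }
}
```

The memory requirement is explicit here: `buffer` holds every element of the partition before the sort runs.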

None of this is a show-stopper, it just might require a little more code on
the part of the developer. If there's a requirement for Spark partitions to
fit in memory, developers will have to be aware of that and plan
accordingly. One nice feature of Hadoop MR is the ability to sort very large
data sets without thinking about data size.

In the case that a developer repartitions an RDD such that some partitions
don't fit in memory, sorting those partitions requires more work. For these
cases, I think there is value in having a robust partition sorting method
that handles them efficiently and reliably.
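As a rough sketch of the kind of method I have in mind, here's a minimal external merge sort over an iterator: sort fixed-size chunks in memory, spill each sorted chunk to a temp file, then lazily k-way merge the spill files with a priority queue. This assumes elements are single-line strings (no embedded newlines); the class and method names are hypothetical, not existing Spark API.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

public class ExternalPartitionSort {

    // Holds the next line from one spill file, so the merge heap can
    // order readers by their current head element.
    private static final class Head implements Comparable<Head> {
        final String line;
        final BufferedReader reader;
        Head(String line, BufferedReader reader) { this.line = line; this.reader = reader; }
        public int compareTo(Head other) { return line.compareTo(other.line); }
    }

    // Sort an arbitrarily large stream of strings with bounded memory:
    // buffer up to maxInMemory elements, sort and spill each full buffer
    // to a temp file, then lazily merge the sorted spill files.
    public static Iterator<String> sort(Iterator<String> input, int maxInMemory) {
        try {
            List<File> spills = new ArrayList<>();
            List<String> buffer = new ArrayList<>();
            while (input.hasNext()) {
                buffer.add(input.next());
                if (buffer.size() >= maxInMemory) {
                    spills.add(spill(buffer));
                    buffer.clear();
                }
            }
            if (!buffer.isEmpty()) {
                spills.add(spill(buffer));
            }
            return merge(spills);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Sort one chunk in memory and write it to a temp file, one element per line.
    private static File spill(List<String> buffer) throws IOException {
        Collections.sort(buffer);
        File file = File.createTempFile("spill", ".tmp");
        file.deleteOnExit();
        try (BufferedWriter w = new BufferedWriter(new FileWriter(file))) {
            for (String s : buffer) {
                w.write(s);
                w.newLine();
            }
        }
        return file;
    }

    // Lazily merge the sorted spill files: the heap always holds each
    // file's current head, so next() runs in O(log k) for k spill files.
    private static Iterator<String> merge(List<File> spills) throws IOException {
        PriorityQueue<Head> heap = new PriorityQueue<>();
        for (File f : spills) {
            BufferedReader r = new BufferedReader(new FileReader(f));
            String first = r.readLine();
            if (first != null) {
                heap.add(new Head(first, r));
            } else {
                r.close();
            }
        }
        return new Iterator<String>() {
            public boolean hasNext() { return !heap.isEmpty(); }
            public String next() {
                Head top = heap.poll();
                try {
                    String next = top.reader.readLine();
                    if (next != null) {
                        heap.add(new Head(next, top.reader));
                    } else {
                        top.reader.close();
                    }
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
                return top.line;
            }
        };
    }
}
```

Peak memory is bounded by maxInMemory elements plus one buffered line per spill file, which is the same basic spill-and-merge idea Hadoop's shuffle uses.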

Is there another solution for sorting arbitrarily large partitions? If not,
I don't mind developing and contributing a solution.




--
Madhu
https://www.linkedin.com/in/msiddalingaiah
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Sorting-partitions-in-Java-tp6715p6719.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
