sortByKey is indeed O(n log n), it's a first pass to figure out even-sized partitions (by sampling the RDD), then a second pass to do a distributed merge-sort (first partition the data on each machine, then run a reduce phase that merges the data for each partition). The point where it becomes useful to scale out versus a single machine is probably pretty high, because communication over a network is *much* slower than memory bandwidth within a machine. Generally it would make the most sense for data that doesn't fit in memory on a single machine, or data that already starts out distributed.
Please also note that if you run Spark on just one multicore machine, it still goes through many of the same code paths as on a cluster (e.g. serializing data between tasks) -- it's not optimized to be as fast as, say, a multithreaded sort framework. So it wouldn't make a ton of sense to use it for that. Matei On September 15, 2014 at 10:32:14 PM, cjwang (c...@cjwang.us) wrote: I wonder what algorithm is used to implement sortByKey? I assume it is some O(n*log(n)) parallelized on x number of nodes, right? Then, what size of data would make it worthwhile to use sortByKey on multiple processors rather than use standard Scala sort functions on a single processor (considering the overhead of putting stuff into RDDs and collecting them back)? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Complexity-Efficiency-of-SortByKey-tp14328.html Sent from the Apache Spark User List mailing list archive at Nabble.com. --------------------------------------------------------------------- To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org