[
https://issues.apache.org/jira/browse/SPARK-22724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Hyukjin Kwon resolved SPARK-22724.
----------------------------------
Resolution: Incomplete
> TakeOrderedAndProjectExec operator has poor performance when sorting on low
> cardinality keys
> --------------------------------------------------------------------------------------------
>
> Key: SPARK-22724
> URL: https://issues.apache.org/jira/browse/SPARK-22724
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 2.0.2
> Reporter: Jonny Serencsa
> Priority: Major
> Labels: bulk-closed
>
> The com.google.guava.collect.Ordering implementation used by current versions
> of spark (including 2.0.2 which I use) has a performance issue when
> performing TopK operations using sort keys with relatively low cardinalities.
> For example, when performing a top-k for 1M rows of randomly chosen integers
> between 0-9 we have the following (approximate) number of compare operations
> for various k:
> k, # compares
> 1000, 1.6E6
> 2000, 3.2E6
> 4000, 2.6E7
> 8000, 1E9
> 16000, 4E9
> 32000, 16E9
> While the distribution isn't perfectly O(K^2), it's pretty close.
> Seems like guava has addressed this problem in their latest version.
> Here is the code for the debug script I used for the experiment.
> {noformat}
> import scala.collection.JavaConverters._
> import com.google.common.collect.{Ordering => GuavaOrdering}
> import scala.util.Random
> object SortTest {
> class OrderingWithCounter extends GuavaOrdering[Int] {
> var counter = 0L
> override def compare(t: Int, t1: Int): Int = {
> counter += 1
> t.compare(t1)
> }
> }
> def run(size: Int, limit: Int, card: Int): Unit = {
> val r = new Random(13)
> val e = 0 until size map(a => r.nextInt(card))
> least(e, limit)
> }
> def least(e: Seq[Int], limit:Int): Unit = {
> val o = new OrderingWithCounter
> o.leastOf(e.asJava, limit)
> println(f"${e.size},${e.distinct.size},$limit,${o.counter}")
> }
> /**
> *
> * Output:
> Limit test
> 1000000,10,1000,1557236
> 1000000,10,2000,3205997
> 1000000,10,4000,26011722
> 1000000,10,8000,102248988
> 1000000,10,16000,407248708
> 1000000,10,32000,1623467135
> *
> */
> def main(args: Array[String]) {
> val card = 10
> val size = 1e6.toInt
> println("Limit test")
> Seq(1000, 2000, 4000, 8000, 16000, 32000).foreach { limit =>
> run(size, limit, card)
> }
> }
> }
> {noformat}
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]