[ 
https://issues.apache.org/jira/browse/SPARK-22724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-22724.
----------------------------------
    Resolution: Incomplete

> TakeOrderedAndProjectExec operator has poor performance when sorting on low 
> cardinality keys
> --------------------------------------------------------------------------------------------
>
>                 Key: SPARK-22724
>                 URL: https://issues.apache.org/jira/browse/SPARK-22724
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.0.2
>            Reporter: Jonny Serencsa
>            Priority: Major
>              Labels: bulk-closed
>
> The com.google.guava.collect.Ordering implementation used by current versions 
> of spark (including 2.0.2 which I use) has a performance issue when 
> performing TopK operations using sort keys with relatively low cardinalities. 
> For example, when performing a top-k for 1M rows of randomly chosen integers 
> between 0-9 we have the following (approximate) number of compare operations 
> for various k:
> k, # compares
> 1000, 1.6E6
> 2000, 3.2E6
> 4000, 2.6E7
> 8000, 1E9
> 16000, 4E9
> 32000, 16E9
> While the distribution isn't perfectly O(K^2), it's pretty close. 
> Seems like guava has addressed this problem in their latest version. 
> Here is the code for the debug script I used for the experiment.
> {noformat}
> import scala.collection.JavaConverters._
> import com.google.common.collect.{Ordering => GuavaOrdering}
> import scala.util.Random
> object SortTest {
>   class OrderingWithCounter extends GuavaOrdering[Int] {
>     var counter = 0L
>     override def compare(t: Int, t1: Int): Int = {
>       counter += 1
>       t.compare(t1)
>     }
>   }
>   def run(size: Int, limit: Int, card: Int): Unit = {
>     val r = new Random(13)
>     val e = 0 until size map(a => r.nextInt(card))
>     least(e, limit)
>   }
>   def least(e: Seq[Int], limit:Int): Unit = {
>     val o = new OrderingWithCounter
>     o.leastOf(e.asJava, limit)
>     println(f"${e.size},${e.distinct.size},$limit,${o.counter}")
>   }
>   /**
>     *
>     * Output:
> Limit test
> 1000000,10,1000,1557236
> 1000000,10,2000,3205997
> 1000000,10,4000,26011722
> 1000000,10,8000,102248988
> 1000000,10,16000,407248708
> 1000000,10,32000,1623467135
>     *
>     */
>   def main(args: Array[String]) {
>     val card = 10
>     val size = 1e6.toInt
>     println("Limit test")
>     Seq(1000, 2000, 4000, 8000, 16000, 32000).foreach { limit =>
>       run(size, limit, card)
>     }
>   }
> }
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to