GitHub user wangyum opened a pull request: https://github.com/apache/spark/pull/22579
[SPARK-25429][SQL] Use Set instead of Array to improve lookup performance

## What changes were proposed in this pull request?

Use `Set` instead of `Array` to improve `accumulatorIds.contains(acc.id)` performance.

This PR closes https://github.com/apache/spark/pull/22420

## How was this patch tested?

Manual tests.

Benchmark code:
```scala
// Returns the wall-clock time, in milliseconds, taken to run func once.
def benchmark(func: () => Unit): Long = {
  val start = System.currentTimeMillis()
  func()
  val end = System.currentTimeMillis()
  end - start
}

val range = Range(1, 1000000)
val set = range.toSet
val array = range.toArray

// Five rounds of 500 random lookups against each collection.
for (_ <- 0 until 5) {
  val setExecutionTime =
    benchmark(() => for (_ <- 0 until 500) { set.contains(scala.util.Random.nextInt()) })
  val arrayExecutionTime =
    benchmark(() => for (_ <- 0 until 500) { array.contains(scala.util.Random.nextInt()) })
  println(s"set execution time: $setExecutionTime, array execution time: $arrayExecutionTime")
}
```

Benchmark result (times in milliseconds):
```
set execution time: 4, array execution time: 2760
set execution time: 1, array execution time: 1911
set execution time: 3, array execution time: 2043
set execution time: 12, array execution time: 2214
set execution time: 6, array execution time: 1770
```

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/wangyum/spark SPARK-25429

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/22579.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #22579

----
commit 5a92eeb636257ff5079f763405cfc6446dcffe09
Author: Yuming Wang <yumwang@...>
Date:   2018-09-28T08:04:47Z

    Use Set instead of Array to provide lookup performance

----
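
For readers who want to see the shape of the change in isolation, here is a minimal sketch of the Array-versus-Set lookup pattern the description refers to. It is not the code from the patch: `AccumUpdate`, `SetLookupSketch`, `filterWithArray`, and `filterWithSet` are hypothetical names invented for illustration, and only the `contains` call on a collection of accumulator ids mirrors the PR text.

```scala
// Hypothetical stand-in for an accumulator update; not Spark's actual class.
final case class AccumUpdate(id: Long, value: Long)

object SetLookupSketch {
  // Array lookup: each contains call scans the array, O(n) in the number of ids.
  def filterWithArray(ids: Array[Long], updates: Seq[AccumUpdate]): Seq[AccumUpdate] =
    updates.filter(u => ids.contains(u.id))

  // Set lookup: each contains call is a hash lookup, expected O(1).
  def filterWithSet(ids: Set[Long], updates: Seq[AccumUpdate]): Seq[AccumUpdate] =
    updates.filter(u => ids.contains(u.id))

  def main(args: Array[String]): Unit = {
    val trackedIds: Array[Long] = Array(1L, 3L, 5L)
    val updates = Seq(AccumUpdate(1L, 10L), AccumUpdate(2L, 20L), AccumUpdate(5L, 50L))

    val viaArray = filterWithArray(trackedIds, updates)
    // Convert once, then reuse the Set for every lookup.
    val viaSet = filterWithSet(trackedIds.toSet, updates)

    // Both variants keep the same updates; only the lookup cost differs.
    assert(viaArray == viaSet)
    println(viaSet.map(_.id).mkString(","))
  }
}
```

The two functions return the same result, so the swap is behavior-preserving as long as the ids are only used for membership checks. Note also that in the benchmark above `scala.util.Random.nextInt()` draws from the full `Int` range, so almost every lookup is a miss and the array case pays for a full scan each time.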