Github user ConeyLiu commented on the issue:

https://github.com/apache/spark/pull/19317

Test case:

```scala
import scala.collection.mutable.HashSet
import scala.util.Random

import org.apache.spark.storage.StorageLevel

test("performance of aggregateByKeyLocally") {
  val random = new Random(1)
  // 10,000,000 values in 20 partitions, keyed by a random Int in [0, 100).
  val pairs = sc.parallelize(0 until 10000000, 20)
    .map(p => (random.nextInt(100), p))
    .persist(StorageLevel.MEMORY_ONLY)
  // Materialize the cache so the timing below excludes data generation.
  pairs.count()

  val start = System.currentTimeMillis()
  // Swap the commented line in to measure the java.util.HashMap variant.
  // val jHashMap = pairs.aggregateByKeyLocallyWithJHashMap(new HashSet[Int]())(_ += _, _ ++= _).toArray
  val openHashMap = pairs.aggregateByKeyLocally(new HashSet[Int]())(_ += _, _ ++= _).toArray
  // Elapsed time in milliseconds.
  println(System.currentTimeMillis() - start)
}
```

Test result (elapsed time in ms for each of 10 runs):

| map | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | avg |
| ------ | ------ | ------ | ------ | ------ | ------ | ------ | ------ | ------ | ------ | ------ | ------ |
| JHashMap | 2921 | 2920 | 2843 | 2950 | 2898 | 3316 | 2770 | 2994 | 3016 | 3005 | 2963.3 |
| OpenHashMap | 3029 | 2884 | 3064 | 3023 | 3108 | 3194 | 3003 | 2961 | 3115 | 3023 | 3040.4 |

The performance looks almost the same: the averages differ by about 2.6% (2963.3 ms vs. 3040.4 ms).