Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/19317

I also have to point out that your implementation carries a high risk of causing OOM. The current implementation spills automatically when the local hash map grows too large, and it takes advantage of Spark's automatic memory management mechanism, which you should take a look at. Another thing: the JHashMap will be slow, and since the hash map is append-only in this case, it is better to use `org.apache.spark.util.collection.OpenHashSet`.
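To illustrate why an append-only, open-addressing set tends to be cheaper than a `java.util.HashMap`, here is a minimal sketch. It is not Spark's `OpenHashSet`; the class name `AppendOnlyLongSet`, the linear probing, and the 0.7 load factor are all assumptions made for the example. The point is only that keys live in flat primitive arrays, with no boxing and no per-entry node objects, and that dropping removal keeps probing simple.

```java
/** Hypothetical illustration only; this is NOT Spark's OpenHashSet. */
final class AppendOnlyLongSet {
    // Capacity is always a power of two so that (hash & mask) picks a slot.
    private long[] keys = new long[16];
    private boolean[] used = new boolean[16];
    private int size = 0;

    /** Adds a key; returns true if the key was not already present. */
    boolean add(long key) {
        // Grow before the table passes a load factor of 0.7 (assumed value).
        if ((size + 1) * 10 > keys.length * 7) {
            grow();
        }
        return insert(key);
    }

    /** Returns true if the key has been added before. */
    boolean contains(long key) {
        int mask = keys.length - 1;
        int pos = hash(key) & mask;
        // No removals ever happen, so the first empty slot ends the search.
        while (used[pos]) {
            if (keys[pos] == key) {
                return true;
            }
            pos = (pos + 1) & mask; // linear probing
        }
        return false;
    }

    int size() {
        return size;
    }

    private boolean insert(long key) {
        int mask = keys.length - 1;
        int pos = hash(key) & mask;
        while (used[pos]) {
            if (keys[pos] == key) {
                return false; // duplicate, nothing to do
            }
            pos = (pos + 1) & mask;
        }
        keys[pos] = key;
        used[pos] = true;
        size++;
        return true;
    }

    private void grow() {
        long[] oldKeys = keys;
        boolean[] oldUsed = used;
        keys = new long[oldKeys.length * 2];
        used = new boolean[oldUsed.length * 2];
        size = 0;
        // Re-insert every existing key into the larger table.
        for (int i = 0; i < oldKeys.length; i++) {
            if (oldUsed[i]) {
                insert(oldKeys[i]);
            }
        }
    }

    private static int hash(long key) {
        int h = Long.hashCode(key);
        return h ^ (h >>> 16); // mix high bits into the low bits
    }
}
```

Note that this sketch only addresses the per-entry overhead. It does nothing about the OOM concern, which is about spilling to disk and cooperating with Spark's memory manager.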