Github user justinuang commented on the pull request: https://github.com/apache/spark/pull/8662#issuecomment-141117878

The solution with the iterator wrapper was my first approach, which I prototyped (http://apache-spark-developers-list.1001551.n3.nabble.com/Python-UDF-performance-at-large-scale-td12843.html). It's dangerous because there is buffering at many levels, so we can run into a deadlock situation:

- NEW: the ForkingIterator LinkedBlockingDeque
- batching the rows before pickling them
- OS buffers on both sides
- pyspark.serializers.BatchedSerializer

We can avoid deadlock by being very disciplined. For example, we can have the ForkingIterator instead always check whether the LinkedBlockingDeque is full and, if so:

Java
- flush the Java pickling buffer
- send a flush command to the Python process
- os.flush the Java side

Python
- flush BatchedSerializer
- os.flush()

I'm not sure that this UDF performance regression for one UDF is going to hit many people. For one, most upstreams are not a range() call, which doesn't have to go back to disk and deserialize. My personal opinion is that the blocking performance shouldn't be the reason we reject this approach; the added complexity should be. If we want a quick fix that is safe, I would be in favor of passing the row through, which is indeed slower, but better than deadlocking or calculating the upstream twice. Either way, the current system is unacceptable.

Maybe we can also consider going with a complete architecture shift to a batching system that uses Thrift to serialize the Scala types to a language-agnostic format and also handles the blocking RPC. Then we could have PySpark and SparkR use the same simple UDF architecture. The main drawback is that I'm not sure how we would support broadcast variables or aggregators, but should those even be supported with UDFs?
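To make the disciplined-flush idea concrete, here is a minimal sketch of the Java side. Everything in it is hypothetical (the class, the `FLUSH_CMD` byte, and the `ByteArrayOutputStream` standing in for the pipe to the Python worker are illustration only, not Spark's actual API): before blocking on a full `LinkedBlockingDeque`, the producer flushes its own buffer and emits a flush command so the worker can drain its side and free space.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.concurrent.LinkedBlockingDeque;

// Hypothetical sketch of the flush discipline described above. The stream
// stands in for the pipe to the Python worker; in real code it would be the
// worker's stdin and the flush command part of the worker protocol.
class FlushingForwarder {
    static final int FLUSH_CMD = 0xFF;  // hypothetical control byte

    final LinkedBlockingDeque<byte[]> deque;
    final ByteArrayOutputStream toWorker = new ByteArrayOutputStream();
    int flushes = 0;  // counts how often the full-deque path was taken

    FlushingForwarder(int capacity) {
        deque = new LinkedBlockingDeque<>(capacity);
    }

    // Queue a pickled batch. If the deque is full, flush every buffering
    // layer first so the consumer can make progress, then block.
    void forward(byte[] batch) throws IOException, InterruptedException {
        if (!deque.offerLast(batch)) {
            toWorker.write(FLUSH_CMD);  // "send a flush command to the Python process"
            toWorker.flush();           // "os.flush the Java side"
            flushes++;
            deque.putLast(batch);       // now safe to block until drained
        }
    }
}
```

The point of the sketch is the ordering: the flush happens strictly before the blocking `putLast`, so no side ever waits on data that is stuck in another layer's buffer. The Python side would mirror this by flushing its BatchedSerializer and calling os.flush() on receipt of the command.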