Github user justinuang commented on the pull request: https://github.com/apache/spark/pull/8662#issuecomment-140936346

Hey davies, I think the performance regression for a single UDF may be because there were multiple threads per task that could potentially be taking up CPU time. I highly doubt that the actual IO over loopback adds much time compared to the time spent deserializing and serializing the individual items in the row.

The other approach of passing the entire row could be okay, and it wouldn't require many changes to PythonRDD and Python UDFs, but I'm afraid that the cost of serializing the entire row could be prohibitive. After all, isn't serialization from in-memory JVM types to the pickled representation the most expensive part? What if I have a giant row of 100 columns, and I only want to run a UDF on one column? Do I need to serialize the entire row to pickle?
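The concern about full-row serialization can be illustrated with a small standalone benchmark (hypothetical data, not Spark's actual serialization path): pickling a 100-column row versus pickling only the single column a UDF actually needs.

```python
import pickle
import timeit

# A hypothetical 100-column row, as it might look after conversion
# from in-memory JVM types to Python objects.
row = tuple(f"value_{i}" for i in range(100))

# Payload size: the whole row versus only the one column the UDF uses.
full_payload = pickle.dumps(row)
single_payload = pickle.dumps(row[0])
print("full row bytes:", len(full_payload))
print("one column bytes:", len(single_payload))

# Serialization time for each, over many iterations.
t_full = timeit.timeit(lambda: pickle.dumps(row), number=10_000)
t_single = timeit.timeit(lambda: pickle.dumps(row[0]), number=10_000)
print("full row is slower to pickle:", t_full > t_single)
```

This only measures Python-side pickling, but it shows the shape of the overhead being discussed: both the bytes shipped over loopback and the CPU spent serializing grow with the number of columns, even when the UDF touches just one.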