Github user justinuang commented on the pull request:

    https://github.com/apache/spark/pull/8662#issuecomment-140936346
  
    Hey davies, I think the performance regression for a single UDF may be 
because there were multiple threads per task that could potentially be taking 
up CPU time. I highly doubt that the actual IO over loopback adds much time 
compared to the time spent deserializing and serializing the individual items 
in the row.
    
    The other approach of passing the entire row could be okay, and it 
doesn't require many changes to PythonRDD and Python UDFs, but I'm afraid 
that the cost of serializing the entire row could be prohibitive. After all, 
isn't serialization from in-memory JVM types to the pickled representation the 
most expensive part? What if I have a giant row of 100 columns, and I only want 
to run a UDF on one column? Do I need to pickle the entire row?
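    A rough, hypothetical illustration of the concern (not Spark's actual 
code path): pickling a wide 100-column row produces far more bytes, and takes 
proportionally more work, than pickling only the one column the UDF needs.

```python
import pickle

# Hypothetical wide row: 100 string columns (not real Spark data).
row = tuple(f"value_{i}" * 10 for i in range(100))

# Only the single column the UDF actually consumes.
one_column = (row[0],)

full = pickle.dumps(row)        # serialize the entire row
single = pickle.dumps(one_column)  # serialize just the needed column

# The full-row payload dwarfs the single-column one.
print(len(full), len(single))
```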

