Github user djvulee commented on the issue:

    https://github.com/apache/spark/pull/15505
  
    >I agree with Kay that putting in a smaller change first is better, 
assuming it still has the performance gains. That doesn't preclude any further 
optimizations that are bigger changes.
    
    >I'm a little surprised that serializing tasks has much of an impact, 
given how little data is getting serialized. But if it really does, I feel 
like there is a much bigger optimization we're completely missing. Why are we 
repeating the work of serialization for each task in a taskset? The serialized 
data is almost exactly the same for every task. They only differ in the 
partition id (an int) and the preferred locations (which aren't even used by 
the executor at all).
    
    >Task serialization already leverages the idea of sharing info across all 
the tasks via the Broadcast for the task binary. We just need to apply that 
same idea to the rest of the task data that is sent to the executor. Then the 
only difference between the serialized task data sent to executors would be 
the int for the partitionId. You'd serialize into a ByteBuffer once, and then 
your per-task "serialization" becomes copying the buffer and modifying that 
int directly.
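
    A minimal sketch of that buffer-copy idea (hypothetical names, not 
Spark's actual scheduler code), assuming the byte offset of the partitionId 
int inside the shared payload was recorded while serializing:

    ```scala
    import java.nio.ByteBuffer

    object TaskBufferSketch {
      // Serialize the common task payload once; then, per task, copy the
      // shared bytes and overwrite the partitionId int at the known offset.
      def perTaskBuffers(shared: Array[Byte], partitionIdOffset: Int,
                         partitionIds: Seq[Int]): Seq[ByteBuffer] =
        partitionIds.map { pid =>
          val buf = ByteBuffer.allocate(shared.length)
          buf.put(shared)                     // copy the common payload
          buf.putInt(partitionIdOffset, pid)  // absolute put: patch the int in place
          buf.flip()                          // make the buffer readable from 0
          buf
        }
    }
    ```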
    
    @squito I like this idea very much. I just encountered a case where the 
de-serialization time is too long (more than 10s for some tasks). Is there 
any PR that tries to solve this?

