GitHub user kayousterhout commented on the pull request:

    https://github.com/apache/spark/pull/143#issuecomment-37685303
  
    I'm not sure this fixes the problem Reynold was referring to in his pull
request.  If you look at DAGScheduler.scala, line 773, it does essentially
the same thing you do here (serializes the closure to make sure it works);
that code runs as a result of dagScheduler.submitJob, which happens right
after the clean() function gets called on the RDD.  So I think the
functionality you added already exists; it just gets invoked a bit later.
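
    For concreteness, here is a minimal, Spark-free sketch of that
"serialize early to fail fast" check (ensureSerializable is a made-up
helper name for this comment; the real code path goes through SparkEnv's
configured closure serializer rather than raw Java serialization):

        import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

        object SerializationCheck {
          // Serialize the closure purely to validate that it *can* be
          // serialized; the resulting bytes are discarded.
          def ensureSerializable(closure: AnyRef): Unit = {
            val out = new ObjectOutputStream(new ByteArrayOutputStream())
            try {
              out.writeObject(closure)
            } catch {
              case e: NotSerializableException =>
                throw new IllegalArgumentException(
                  "Closure is not serializable: " + e.getMessage, e)
            } finally {
              out.close()
            }
          }
        }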
    
    I think what @rxin was referring to is the fact that a transformation
(e.g., calling map on an RDD) is lazily evaluated: you can see this in
RDD.scala:247, where the map() function just creates a new RDD object
without evaluating the transformation.  As a result, the serialization
error won't occur until potentially much later, when the user calls some
other function that forces the transformation to be computed.  My
understanding is that Reynold was suggesting adding the serialization check
in map() and the other transformation functions, as mentioned in the JIRA,
so that the serialization error gets triggered as soon as the user calls
map(); a rough sketch of that is below.
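
    To make the timing difference concrete, here is a hedged sketch of what
that might look like inside RDD.scala (MappedRDD and sc.clean() exist in
this version of the file; ensureSerializable is the hypothetical helper
from the sketch above, and the fragment assumes RDD.scala's existing
imports, e.g. scala.reflect.ClassTag):

        // Sketch of map() with an eager serializability check added.  The
        // transformation itself stays lazy: no partition data is touched
        // here, only the closure is validated.
        def map[U: ClassTag](f: T => U): RDD[U] = {
          val cleanF = sc.clean(f)
          SerializationCheck.ensureSerializable(cleanF)  // fail fast, at map() time
          new MappedRDD(this, cleanF)
        }

    With that in place, a non-serializable closure would fail at the map()
call itself instead of at the eventual action (where rdd stands for any
RDD[Int]):

        val lock = new Object()                               // not Serializable
        val doubled = rdd.map(x => lock.synchronized(x * 2))  // would fail here...
        doubled.collect()                                     // ...instead of here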

