Github user kayousterhout commented on the pull request:

https://github.com/apache/spark/pull/143#issuecomment-37685303

I'm not sure this fixes the problem Reynold was referring to in his pull request. If you look at DAGScheduler.scala, line 773, it does essentially the same thing you do here (serializes the closure to make sure it works); that code runs as a result of dagScheduler.submitJob, which happens right after the clean() function is called on the RDD. So I think the functionality you added already exists; it just gets invoked a bit later.

I think what @rxin was referring to is the fact that a transformation (e.g., calling map on an RDD) is lazily evaluated: if you look at the map() function at RDD.scala:247, it just creates a new RDD object and doesn't evaluate the transformation. As a result, the serialization error won't occur until potentially much later, when the user calls some other function that forces the transformation to be computed. My understanding is that Reynold was suggesting adding the serialization check in map() and the other transformation functions, as mentioned in the JIRA, so that the serialization error is triggered as soon as the user calls map().
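To illustrate the idea being discussed, here is a minimal, self-contained sketch of what an eager serializability check at transformation time could look like. The names (ClosureCheck, ClosureDemo, NotSer, goodClosure, badClosure) are all hypothetical and are not Spark's actual API; Spark's real check goes through its closure serializer inside the scheduler, while this sketch just uses plain Java serialization to show why a closure capturing a non-serializable value would fail as soon as map() runs the check, rather than later at job submission:

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

// Hypothetical stand-in for an eager check that map() could perform:
// attempt to Java-serialize the closure up front so a
// NotSerializableException surfaces at transformation time.
object ClosureCheck {
  def isSerializable(f: AnyRef): Boolean =
    try {
      val out = new ObjectOutputStream(new ByteArrayOutputStream())
      out.writeObject(f) // throws if the closure (or anything it captures) is not serializable
      out.close()
      true
    } catch {
      case _: NotSerializableException => false
    }
}

class NotSer // deliberately NOT Serializable

object ClosureDemo {
  // Captures nothing problematic: would pass an eager check in map().
  def goodClosure(): Int => Int = (x: Int) => x + 1

  // Captures a non-serializable local value: under lazy evaluation the
  // failure only shows up when a job is submitted; an eager check at
  // map() time would report it immediately.
  def badClosure(): Int => Int = {
    val captured = new NotSer
    (x: Int) => { captured.hashCode; x }
  }

  def main(args: Array[String]): Unit = {
    println(s"good closure serializable: ${ClosureCheck.isSerializable(goodClosure())}")
    println(s"bad closure serializable:  ${ClosureCheck.isSerializable(badClosure())}")
  }
}
```

The trade-off being debated follows from this: performing the check inside every transformation catches the error at the line the user wrote, at the cost of serializing each closure twice (once at definition, once at job submission).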