Github user JoshRosen commented on the pull request: https://github.com/apache/spark/pull/3518#issuecomment-64996599 Formatting nits aside, a couple of higher-level comments: If I understand this patch correctly, it only tells me which RDD contains the non-serializable object that caused the serialization error. It's still up to me to determine _how_ that object got captured in my closure, right? In my experience, that's the trickier part, since Scala's closure capture rules can sometimes be tricky to reason about. I have some experimental code for traversing Java object graphs: https://gist.github.com/JoshRosen/d6a8972c99992e97d040. This code happens to generate GraphViz DOT output, but I think you could adopt the same basic idea and write something to do a bounded search to find a path from some root object to a non-serializable object. You could have some function which decides which references / edges to follow: this could exclude transient references, references to certain Spark internals, etc. That could be complementary to PR's approach: we could first use this code to narrow the search space down to an individual RDD's closure and then perform a more fine-grained search through the object graph to find a direct path to the non-serializable object. Even if we don't decide to add this extra functionality, I think that we should still refactor this PR's design to make it easier to do that later. Specifically, I think we should separate the result reporting / formatting from the search: there should be one set of functions which return references to RDDs / objects / etc, and a separate function which uses this to format strings, etc. This would make testing easier, too, since the unit tests wouldn't have to match on the exact string output of the error message, but instead could match on the data used to produce that message.
--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- --------------------------------------------------------------------- To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org