[GitHub] spark pull request: [SPARK-3694] RDD and Task serialization debugg...

JoshRosen Sun, 30 Nov 2014 11:28:36 -0800

Github user JoshRosen commented on the pull request:

    https://github.com/apache/spark/pull/3518#issuecomment-64996599
  
    Formatting nits aside, a couple of higher-level comments:
    
    If I understand this patch correctly, it only tells me which RDD contains 
the non-serializable object that caused the serialization error.  It's still up 
to me to determine _how_ that object got captured in my closure, right?  In my 
experience, that's the harder part, since Scala's closure capture rules can be 
difficult to reason about.
    
    I have some experimental code for traversing Java object graphs: 
https://gist.github.com/JoshRosen/d6a8972c99992e97d040.  This code happens to 
generate GraphViz DOT output, but I think you could adopt the same basic idea 
and write something to do a bounded search to find a path from some root object 
to a non-serializable object.  You could have some function which decides which 
references / edges to follow: this could exclude transient references, 
references to certain Spark internals, etc.  That could be complementary to 
this PR's approach: we could first use the PR's code to narrow the search space 
down to an individual RDD's closure and then perform a more fine-grained search 
through the object graph to find a direct path to the non-serializable object.
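    
    To make the bounded-search idea concrete, here is a minimal sketch in plain Java reflection. It is not the code from the gist and not part of this PR: the class name `SerializationPathFinder`, the `findPath` method, the `DEFAULT_FILTER` predicate, and the `Task` / `Helper` / `Connection` demo classes are all hypothetical names made up for illustration. It assumes "non-serializable" simply means "does not implement `java.io.Serializable`", it does not descend into array elements, and it silently skips fields that reflection is not allowed to open (e.g. JDK internals on newer JVMs).

```java
import java.io.Serializable;
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Set;
import java.util.function.Predicate;

public class SerializationPathFinder {

    /** Hypothetical default edge filter: skip static, transient, and primitive fields. */
    public static final Predicate<Field> DEFAULT_FILTER = f -> {
        int m = f.getModifiers();
        return !Modifier.isStatic(m) && !Modifier.isTransient(m)
                && !f.getType().isPrimitive();
    };

    /** One queue entry: an object plus the field names followed to reach it. */
    private static final class Node {
        final Object obj;
        final List<String> path;
        Node(Object obj, List<String> path) { this.obj = obj; this.path = path; }
    }

    /**
     * Breadth-first search from root, following at most maxDepth references.
     * Returns the field names leading to the first object that does not
     * implement Serializable, or null if none is found within the bound.
     */
    public static List<String> findPath(Object root, int maxDepth, Predicate<Field> follow) {
        // Identity-based visited set, so cycles terminate and equals() is never called.
        Set<Object> visited = Collections.newSetFromMap(new IdentityHashMap<>());
        Deque<Node> queue = new ArrayDeque<>();
        queue.add(new Node(root, new ArrayList<>()));
        visited.add(root);
        while (!queue.isEmpty()) {
            Node node = queue.poll();
            if (!(node.obj instanceof Serializable)) {
                return node.path;
            }
            if (node.path.size() >= maxDepth) {
                continue; // depth bound reached; do not expand further
            }
            for (Class<?> c = node.obj.getClass(); c != null; c = c.getSuperclass()) {
                for (Field f : c.getDeclaredFields()) {
                    if (!follow.test(f)) {
                        continue;
                    }
                    Object child;
                    try {
                        f.setAccessible(true);
                        child = f.get(node.obj);
                    } catch (RuntimeException | IllegalAccessException e) {
                        continue; // field cannot be opened reflectively; skip it
                    }
                    if (child != null && visited.add(child)) {
                        List<String> childPath = new ArrayList<>(node.path);
                        childPath.add(f.getName());
                        queue.add(new Node(child, childPath));
                    }
                }
            }
        }
        return null;
    }

    // Demo classes (hypothetical): Task -> helper -> conn is the offending path.
    public static class Connection { }
    public static class Helper implements Serializable {
        Connection conn = new Connection();
        transient Connection cache = new Connection(); // ignored by DEFAULT_FILTER
    }
    public static class Task implements Serializable {
        int id = 1;
        Helper helper = new Helper();
    }

    public static void main(String[] args) {
        System.out.println(findPath(new Task(), 10, DEFAULT_FILTER));
    }
}
```

    Passing the edge filter in as a `Predicate<Field>` is what would let a caller exclude references to Spark internals without touching the search itself; which fields count as "internals" is left open here.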
    
    Even if we don't decide to add this extra functionality, I think that we 
should still refactor this PR's design to make it easier to do that later.  
Specifically, I think we should separate the result reporting / formatting from 
the search: there should be one set of functions which return references to 
RDDs / objects / etc, and a separate function which uses this to format 
strings, etc.  This would make testing easier, too, since the unit tests 
wouldn't have to match on the exact string output of the error message, but 
instead could match on the data used to produce that message.
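    
    As a rough illustration of that separation (a sketch only, with hypothetical names `SerializationReport`, `Failure`, and `format` that do not come from this PR): the search would return a plain data object, and a separate function would turn it into a message. Tests can then assert on the fields rather than on the exact wording.

```java
import java.util.Collections;
import java.util.ArrayList;
import java.util.List;

public class SerializationReport {

    /** Plain data describing what the search found; carries no message text. */
    public static final class Failure {
        public final String owner;            // hypothetical: e.g. a description of the RDD
        public final List<String> fieldPath;  // references followed from the owner
        public final Class<?> offendingClass; // class of the non-serializable object

        public Failure(String owner, List<String> fieldPath, Class<?> offendingClass) {
            this.owner = owner;
            this.fieldPath = Collections.unmodifiableList(new ArrayList<>(fieldPath));
            this.offendingClass = offendingClass;
        }
    }

    /** Formatting lives apart from the search, so its wording can change freely. */
    public static String format(Failure f) {
        return "Not serializable: " + f.offendingClass.getName()
                + " reached from " + f.owner
                + " via " + String.join(" -> ", f.fieldPath);
    }
}
```

    A unit test would build or obtain a `Failure` and compare `owner`, `fieldPath`, and `offendingClass` directly, leaving only a small test for `format` that depends on the message text.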



