Github user JoshRosen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3518#discussion_r21062401

    --- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala ---
    @@ -788,6 +793,63 @@ class DAGScheduler(
           }
         }
     
    +  /**
    +   * Helper function to check whether an RDD is serializable.
    +   *
    +   * Note: This function is defined separately from SerializationHelper.isSerializable()
    +   * since DAGScheduler.isSerializable() is passed as a parameter to the RDDWalker class's
    +   * graph traversal, which would otherwise require knowledge of the closureSerializer
    +   * (which was undesirable).
    +   *
    +   * @param rdd - RDD to attempt to serialize
    +   * @return - An output string qualifying success or failure.
    +   */
    +  private def isSerializable(rdd: RDD[_]): String = {
    +    SerializationHelper.isSerializable(closureSerializer, rdd)
    +  }
    +
    +  /**
    +   * Use the RDDWalker class to execute a graph traversal of an RDD and its dependencies to
    +   * help identify which RDDs are not serializable. In short, attempt to serialize the RDD
    +   * and catch any exceptions thrown (this is the same mechanism used within
    +   * submitMissingTasks() to deal with serialization failures).
    +   *
    +   * Note: This is defined here since it uses the isSerializable function, which in turn
    +   * uses the closure serializer. Although the better place for the serializer would be in
    +   * the SerializationHelper, the Helper is not guaranteed to run in a single thread,
    +   * unlike the DAGScheduler.
    +   *
    +   * @param rdd - The RDD for which to print the serialization trace to identify
    +   *              unserializable components
    +   * @return - String - The serialization trace
    +   */
    +  def getSerializationTrace(rdd: RDD[_]): String = {
    +    // Next, if there are dependencies, attempt to serialize those
    +    val results: util.ArrayList[String] = RDDWalker.walk(rdd, isSerializable)
    +
    +    var trace = "Serialization trace:\n"
    +
    +    val it = results.iterator()
    +    while (it.hasNext) {
    +      trace += it.next() + "\n"
    +    }
    +
    +    trace
    +  }
    +
    +  /**
    +   * Use the RDD toDebugString function to print a formatted dependency trace for an RDD
    +   * @param rdd - The RDD for which to print the dependency graph
    --- End diff --
    
    I'd drop this `@param` and the `@return` since this documentation is obvious from context and the previous line.
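For readers following along outside the diff, here is a minimal, hypothetical sketch of the mechanism the doc comment describes: walk the RDD's dependency graph, try to serialize each RDD with the closure serializer (the same check `submitMissingTasks()` relies on), and collect an outcome line per RDD. `SerializationTraceSketch` and its `trace` method are illustrative names only, not the PR's `RDDWalker`/`SerializationHelper` implementation, and the sketch pulls the closure serializer from `SparkEnv` rather than the DAGScheduler's private field:

```scala
import scala.collection.mutable

import org.apache.spark.SparkEnv
import org.apache.spark.rdd.RDD

// Hypothetical sketch, not the PR's RDDWalker/SerializationHelper: breadth-first walk
// of an RDD's dependency graph, recording whether each RDD serializes cleanly with the
// closure serializer (the same check submitMissingTasks() performs before scheduling).
object SerializationTraceSketch {
  def trace(root: RDD[_]): String = {
    val serializer = SparkEnv.get.closureSerializer.newInstance()
    val visited = mutable.Set[Int]()
    val queue = mutable.Queue[RDD[_]](root)
    val lines = mutable.ArrayBuffer("Serialization trace:")

    while (queue.nonEmpty) {
      val rdd = queue.dequeue()
      if (visited.add(rdd.id)) {
        val status =
          try {
            serializer.serialize(rdd: AnyRef) // throws if the RDD is not serializable
            "serializable"
          } catch {
            case e: Exception => s"NOT serializable (${e.getClass.getSimpleName})"
          }
        lines += s"(${rdd.id}) ${rdd.getClass.getSimpleName}: $status"
        rdd.dependencies.foreach(dep => queue.enqueue(dep.rdd))
      }
    }
    lines.mkString("\n")
  }
}
```

Building the result with `mkString("\n")` also sidesteps the `var trace` string concatenation that `getSerializationTrace` uses in the diff above.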