[GitHub] spark pull request: [SPARK-3694] RDD and Task serialization debugg...

JoshRosen Sun, 30 Nov 2014 10:53:06 -0800

Github user JoshRosen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3518#discussion_r21062406
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala ---
    @@ -788,6 +793,63 @@ class DAGScheduler(
         }
       }
     
    +  /**
    +   * Helper function to check whether an RDD is serializable.
    +   *
    +   * Note: This function is defined separately from
    +   * SerializationHelper.isSerializable() because DAGScheduler.isSerializable()
    +   * is passed as a parameter to the RDDWalker class's graph traversal, which
    +   * would otherwise require knowledge of the closureSerializer (which was
    +   * undesirable).
    +   *
    +   * @param rdd - RDD to attempt to serialize
    +   * @return - An output string qualifying success or failure.
    +   */
    +  private def isSerializable(rdd: RDD[_]): String = {
    +    SerializationHelper.isSerializable(closureSerializer,rdd)
    +  }
    +
    +  /**
    +   * Use the RDDWalker class to execute a graph traversal of an RDD and its
    +   * dependencies to help identify which RDDs are not serializable. In short,
    +   * attempt to serialize the RDD and catch any Exceptions thrown (this is the
    +   * same mechanism used within submitMissingTasks() to deal with
    +   * serialization failures).
    +   *
    +   * Note: This is defined here since it uses the isSerializable function,
    +   * which in turn uses the closure serializer. Although the better place for
    +   * the serializer would be in the SerializationHelper, the Helper is not
    +   * guaranteed to run in a single thread, unlike the DAGScheduler.
    +   *
    +   * @param rdd - The RDD for which to print the serialization trace to
    +   *              identify unserializable components
    +   * @return - String - The serialization trace
    +   */
    +  def getSerializationTrace(rdd : RDD[_]): String = {
    +    // Next, if there are dependencies, attempt to serialize those 
    +    val results: util.ArrayList[String] = RDDWalker.walk(rdd, isSerializable)
    +    
    +    var trace = "Serialization trace:\n"
    +    
    +    val it = results.iterator()
    +    while(it.hasNext){
    +      trace += it.next() + "\n"
    +      
    +    }
    +    
    +    trace
    +  }
    +
    +  /**
    +   * Use the RDD toDebugString function to print a formatted dependency
    +   * trace for an RDD
    +   * @param rdd - The RDD for which to print the dependency graph
    +   * @return
    +   */
    +  def getDependencyTrace(rdd: RDD[_]): String ={
    --- End diff --
    
    Why not just call `toDebugString` directly? What does this level of indirection add?
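
For context, the traversal described in the doc comments above can be sketched without Spark. This is a minimal, hypothetical illustration and not the PR's implementation: `Node`, `check`, and `trace` are made-up names standing in for an RDD, `isSerializable`, and `getSerializationTrace`, and plain Java serialization is assumed in place of the closure serializer.

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}
import scala.collection.mutable

// Hypothetical stand-in for an RDD: a named node holding some payload
// (e.g. a closure) plus the parent nodes it depends on.
final class Node(val name: String, val payload: AnyRef, val deps: Seq[Node] = Nil)

object SerializationTrace {

  // Attempt to serialize one node's payload and describe the outcome.
  // Assumption: Java serialization stands in for Spark's closure serializer.
  def check(node: Node): String = {
    try {
      val out = new ObjectOutputStream(new ByteArrayOutputStream())
      try out.writeObject(node.payload) finally out.close()
      s"${node.name}: serializable"
    } catch {
      case e: NotSerializableException =>
        s"${node.name}: NOT serializable (${e.getMessage})"
    }
  }

  // Depth-first walk over a node and its dependencies, visiting each node
  // once and collecting one result line per node.
  def trace(root: Node): String = {
    val visited = mutable.Set.empty[Node]
    val lines = mutable.ArrayBuffer.empty[String]

    def visit(node: Node): Unit = {
      if (visited.add(node)) {
        lines += check(node)
        node.deps.foreach(visit)
      }
    }

    visit(root)
    lines.mkString("Serialization trace:\n", "\n", "")
  }
}
```

Calling `SerializationTrace.trace` on a node whose parent holds a `new Object` payload would report the parent as not serializable while still listing the child, which is the per-node breakdown a single top-level serialization failure does not give.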



