Github user marmbrus commented on a diff in the pull request:

    https://github.com/apache/spark/pull/15307#discussion_r81875491

    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala ---
    @@ -525,8 +645,62 @@ class StreamExecution(
       case object TERMINATED extends State
     }
     
    -object StreamExecution {
    +object StreamExecution extends Logging {
       private val _nextId = new AtomicLong(0)
     
    +  /**
    +   * Get the number of input rows from the executed plan of the trigger.
    +   * @param triggerExecutionPlan Execution plan of the trigger
    +   * @param triggerLogicalPlan Logical plan of the trigger, generated from the query logical plan
    +   * @param sourceToDataframe Source to DataFrame returned by source.getBatch in this trigger
    +   */
    +  def getNumInputRowsFromTrigger(
    +      triggerExecutionPlan: SparkPlan,
    +      triggerLogicalPlan: LogicalPlan,
    +      sourceToDataframe: Map[Source, DataFrame]): Map[Source, Long] = {
    +
    +    // We want to associate execution plan leaves to the sources that generate them, so that we
    +    // match their metrics (e.g. numOutputRows) to the sources. To do this we do the following.
    +    // Consider the translation from the streaming logical plan to the final executed plan.
    +    //
    +    //   streaming logical plan (with sources) <==> trigger's logical plan <==> executed plan
    +    //
    +    // 1. We keep track of the streaming sources associated with each leaf in the trigger's
    +    //    logical plan.
    +    //    - Each logical plan leaf will be associated with a single streaming source.
    +    //    - There can be multiple logical plan leaves associated with a streaming source.
    +    //    - There can be leaves not associated with any streaming source, because they were
    +    //      generated from a batch source (e.g. stream-batch joins).
    +    //
    +    // 2. Assuming that the executed plan has the same number of leaves in the same order as
    +    //    the trigger logical plan, we associate executed plan leaves with the corresponding
    +    //    streaming sources.
    +    //
    +    // 3. For each source, we sum the metrics of the associated execution plan leaves.
    +    //
    +    val logicalPlanLeafToSource = sourceToDataframe.flatMap { case (source, df) =>
    +      df.logicalPlan.collectLeaves().map { leaf => leaf -> source }
    +    }
    +    val allLogicalPlanLeaves = triggerLogicalPlan.collectLeaves()  // includes non-streaming sources
    +    val allExecPlanLeaves = triggerExecutionPlan.collectLeaves()
    +    if (allLogicalPlanLeaves.size == allExecPlanLeaves.size) {
    +      val execLeafToSource = allLogicalPlanLeaves.zip(allExecPlanLeaves).flatMap {
    +        case (lp, ep) => logicalPlanLeafToSource.get(lp).map { source => ep -> source }
    +      }
    +      val sourceToNumInputRows = execLeafToSource.map { case (execLeaf, source) =>
    +        val numRows = execLeaf.metrics.get("numOutputRows").map(_.value).getOrElse(0L)
    +        source -> numRows
    +      }
    +      sourceToNumInputRows.groupBy(_._1).mapValues(_.map(_._2).sum)  // sum up rows for each source
    +    } else {
    +      def toString[T](seq: Seq[T]): String = s"(size = ${seq.size}), ${seq.mkString(", ")}"
    +      logWarning(
    +        "Could not report metrics as number of leaves in trigger logical plan did not match that"
    --- End diff --

    Seems this is going to flood the logs if it's ever triggered.
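    One possible way to avoid that, sketched below with hypothetical names
    (`metricWarningLogged` and `warnOnce` are not in the patch), is to emit the
    warning at most once per query instead of on every trigger; demoting
    subsequent occurrences to DEBUG would work similarly.

        // Sketch only: assumes this lives inside a class that mixes in Spark's
        // Logging trait (as StreamExecution does in this patch).
        import java.util.concurrent.atomic.AtomicBoolean

        private val metricWarningLogged = new AtomicBoolean(false)

        private def warnOnce(msg: => String): Unit = {
          // compareAndSet succeeds only for the first caller, so the warning is
          // logged at most once even if every trigger hits the leaf-count mismatch.
          if (metricWarningLogged.compareAndSet(false, true)) {
            logWarning(msg)
          }
        }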