[jira] [Created] (SPARK-11013) SparkPlan may mistakenly register child plan's accumulators for SQL metrics

Wenchen Fan (JIRA) Thu, 08 Oct 2015 13:23:27 -0700

Wenchen Fan created SPARK-11013:
-----------------------------------

             Summary: SparkPlan may mistakenly register child plan's 
accumulators for SQL metrics
                 Key: SPARK-11013
                 URL: https://issues.apache.org/jira/browse/SPARK-11013
             Project: Spark
          Issue Type: Bug
          Components: SQL
            Reporter: Wenchen Fan



The cause: when we call the RDD API inside a SparkPlan, the closure is very 
likely to reference the SparkPlan itself, so the whole SparkPlan tree is 
serialized and transferred to the executor side. When it is deserialized there, 
the accumulators in the child SparkPlans are also deserialized and registered, 
and they always report a zero value.
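
To make the mechanism concrete, here is a minimal sketch in plain Scala with 
Java serialization. It is a toy model, not Spark code: Registry, Metric, Plan 
and CapturingTask are hypothetical names, and CapturingTask stands in for a 
closure that captured the enclosing plan. It assumes only that a metric 
registers itself when it is deserialized, as described above.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}
import scala.collection.mutable.ArrayBuffer

// Hypothetical stand-in for the executor-side accumulator registry.
object Registry {
  val registered: ArrayBuffer[String] = ArrayBuffer.empty[String]
}

// Hypothetical metric that registers itself whenever it is deserialized.
class Metric(val name: String) extends Serializable {
  private def readObject(in: ObjectInputStream): Unit = {
    in.defaultReadObject()
    Registry.registered += name
  }
}

// Toy plan node: owns one metric and optionally points at a child node.
class Plan(val name: String, val child: Option[Plan]) extends Serializable {
  val rows = new Metric(s"$name: number of output rows")
}

// Stands in for a closure that references the enclosing plan.
class CapturingTask(val plan: Plan) extends Serializable

object CaptureDemo {
  // Serialize and deserialize in memory, like shipping a task to an executor.
  def roundTrip[T <: AnyRef](obj: T): T = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    out.writeObject(obj)
    out.close()
    val in = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray))
    try in.readObject().asInstanceOf[T] finally in.close()
  }

  def main(args: Array[String]): Unit = {
    val parent = new Plan("parent", Some(new Plan("child", None)))
    roundTrip(new CapturingTask(parent))
    // Both the parent's and the child's metric were registered.
    Registry.registered.foreach(println)
  }
}
```

Although the task only needs the parent, the child's metric is registered too, 
because the whole tree travels with the captured reference.
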
This is not a problem currently because we only have one operation to aggregate 
the accumulators: add. However, if we want to support more complex metrics such 
as min, the extra zero values will lead to wrong results.
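
A small sketch of why add tolerates the extra zeros while min does not. The 
numbers are made up for illustration and ZeroValueDemo is a hypothetical name; 
it only models the values reported for one metric when a task ends.

```scala
object ZeroValueDemo {
  // The one value produced by the operator that actually ran.
  val realUpdates: Seq[Long] = Seq(2L)
  // Zero values reported by accumulators that were only registered
  // as a side effect of deserialization and never updated.
  val spuriousZeros: Seq[Long] = Seq(0L, 0L)
  val reported: Seq[Long] = realUpdates ++ spuriousZeros

  def main(args: Array[String]): Unit = {
    // 0 is the identity for addition, so the sum is unaffected.
    println(s"sum: expected=${realUpdates.sum}, got=${reported.sum}")
    // 0 is not an identity for min, so it displaces the real minimum.
    println(s"min: expected=${realUpdates.min}, got=${reported.min}")
  }
}
```
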

Take TungstenAggregate as an example: I logged "stageId, partitionId, 
accumName, accumId" whenever an accumulator was deserialized and registered, 
and logged the "accumId -> accumValue" map when each task ended. The output is:
{code}
scala> val df = Seq(1 -> "a", 2 -> "b").toDF("a", "b").groupBy().count()
df: org.apache.spark.sql.DataFrame = [count: bigint]

scala> df.collect
register: 0 0 Some(number of input rows) 4
register: 0 0 Some(number of output rows) 5
register: 1 0 Some(number of input rows) 4
register: 1 0 Some(number of output rows) 5
register: 1 0 Some(number of input rows) 2
register: 1 0 Some(number of output rows) 3
Map(5 -> 1, 4 -> 2, 6 -> 4458496)
Map(5 -> 0, 2 -> 1, 7 -> 4458496, 3 -> 1, 4 -> 0)
res0: Array[org.apache.spark.sql.Row] = Array([2])
{code}


The best choice is to avoid serializing and deserializing a SparkPlan tree, 
which can be achieved with LocalNode.
Or we can work around the serialization problem in the problematic SparkPlans, 
such as TungstenAggregate and TungstenSort.
Or we can improve the SQL metrics framework to make it more robust in this case.
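
As one possible shape for the last option, a metric could distinguish "never 
updated" from "updated to zero", so that untouched accumulators are ignored 
when merging. The sketch below is only an illustration under that assumption. 
MinMetric is a hypothetical class, not part of the SQL metrics framework.

```scala
// Hypothetical metric value: None means "never updated".
final case class MinMetric(value: Option[Long] = None) {
  // Record one observation, keeping the smaller value.
  def update(v: Long): MinMetric =
    MinMetric(Some(value.fold(v)(math.min(_, v))))

  // Merge two partial results; an untouched side contributes nothing.
  def merge(other: MinMetric): MinMetric = (value, other.value) match {
    case (Some(a), Some(b)) => MinMetric(Some(math.min(a, b)))
    case (a, b)             => MinMetric(a.orElse(b))
  }
}

object MinMetricDemo {
  // One metric that saw real data, and one that was only registered.
  val fromTask: MinMetric = MinMetric().update(7L).update(3L)
  val untouched: MinMetric = MinMetric()

  // Merging in untouched metrics leaves the real minimum intact.
  val merged: MinMetric =
    Seq(fromTask, untouched, untouched).foldLeft(MinMetric())(_ merge _)

  def main(args: Array[String]): Unit = println(merged)
}
```

With this representation a genuine zero is still kept, since it is recorded as 
Some(0) rather than as the absence of a value.
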



