[ https://issues.apache.org/jira/browse/SPARK-6550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Michael Armbrust resolved SPARK-6550. ------------------------------------- Resolution: Fixed Fix Version/s: 1.4.0 1.3.1 Issue resolved by pull request 5217 [https://github.com/apache/spark/pull/5217] > Add PreAnalyzer to keep logical plan consistent across DataFrame > ---------------------------------------------------------------- > > Key: SPARK-6550 > URL: https://issues.apache.org/jira/browse/SPARK-6550 > Project: Spark > Issue Type: Bug > Components: SQL > Reporter: Liang-Chi Hsieh > Fix For: 1.3.1, 1.4.0 > > > h2. Problems > In some cases, the expressions in a logical plan will be modified to new ones > during analysis, e.g. the handling for self-join cases. If some expressions > are resolved based on the analyzed plan, they are referring to changed > expression ids, not original ids. > But the transformation of DataFrame will use logical plan to construct new > DataFrame, e.g. {{groupBy}} and aggregation. So in such cases, the > expressions in these DataFrames will be inconsistent. > The problems are specified as following: > # Expression ids in logical plan are possibly inconsistent if expression ids > are changed during analysis and some expressions are resolved after that > When we try to run the following codes: > {code} > val df = Seq(1,2,3).map(i => (i, i.toString)).toDF("int", "str") > val df2 = df.as('x).join(df.as('y), $"x.str" === > $"y.str").groupBy("y.str").min("y.int") > {code} > Because {{groupBy}} and {{min}} will perform resolving based on the analyzed > logical plan, their expression ids refer to analyzed plan, instead of logical > plan. > So the logical plan of df2 looks like: > {code} > 'Aggregate [str#5], [str#5,MIN(int#4) AS MIN(int)#6] > 'Join Inner, Some(('x.str = 'y.str)) > Subquery x > Project [_1#0 AS int#2,_2#1 AS str#3] > LocalRelation [_1#0,_2#1], [[1,1],[2,2],[3,3]] > Subquery y > Project [_1#0 AS int#2,_2#1 AS str#3] > LocalRelation [_1#0,_2#1], [[1,1],[2,2],[3,3]] > {code} > As you see, the expression ids in {{Aggregate}} are different to the > expression ids in {{Subquery y}}. This is the first problem. > # The {{df2}} can't be performed > The showing logical plan of {{df2}} can't be performed. Because the > expression ids of {{Subquery y}} will be modified for self-join handling > during analysis, the analyzed plan of {{df2}} becomes: > {code} > Aggregate [str#5], [str#5,MIN(int#4) AS MIN(int)#6] > Join Inner, Some((str#3 = str#8)) > Subquery x > Project [_1#0 AS int#2,_2#1 AS str#3] > LocalRelation [_1#0,_2#1], [[1,1],[2,2],[3,3]] > Subquery y > Project [_1#0 AS int#7,_2#1 AS str#8] > LocalRelation [_1#0,_2#1], [[1,1],[2,2],[3,3]] > {code} > The expressions referred in {{Aggregate}} are not matching to these in > {{Subquery y}}. This is the second problem. > h2. Proposed solution > We try to add a {{PreAnalyzer}}. When a logical plan {{rawPlan}} is given to > SQLContext, it uses PreAnalyzer to modify the logical plan before assigning > to {{QueryExecution.logical}}. Then later operations will based on the > pre-analyzed logical plan, instead of the original {{rawPlan}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org