[ 
https://issues.apache.org/jira/browse/SPARK-14986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15271120#comment-15271120
 ] 

Herman van Hovell edited comment on SPARK-14986 at 5/4/16 6:13 PM:
-------------------------------------------------------------------

I have taken a look at this. Your query yields the following plan:
{noformat}
== Parsed Logical Plan ==
'Project ['nil]
+- 'Generate 'EXPLODE('array()), true, true, Some(n), ['nil]
   +- SubqueryAlias x
      +- Project [1 AS x#0]
         +- OneRowRelation$

== Analyzed Logical Plan ==
nil: null
Project [nil#6]
+- Generate explode(array()), true, true, Some(n), [nil#6]
   +- SubqueryAlias x
      +- Project [1 AS x#0]
         +- OneRowRelation$

== Optimized Logical Plan ==
Generate explode([]), false, true, Some(n), [nil#6]
+- OneRowRelation$

== Physical Plan ==
Generate explode([]), false, true, [nil#6]
+- Scan OneRowRelation[]
{noformat}

The optimizer sets the {{join}} flag to false because no fields from the first 
relation ({{select 1 as x}}) are used. See: 
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala#L365

Setting the {{join}} flag to false triggers a different code path. This code path 
emits all the rows in the generated relation for each input row. It does not 
return any rows if the generated relation is empty, which is what you are 
seeing. The other code path would generate a row because it performs a 
left-join-like operation on the generated results.
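
The difference between the two code paths can be illustrated with a toy model. This is only a sketch in plain Python, not Spark's implementation: the function {{generate}}, its parameters, and the tuple-based rows are all hypothetical, and the {{join=False}} branch simply encodes the behaviour described above (the {{outer}} flag is not consulted there).

```python
from typing import Any, Callable, Iterable, List, Tuple

Row = Tuple[Any, ...]


def generate(
    child_rows: Iterable[Row],
    generator: Callable[[Row], Iterable[Row]],
    join: bool,
    outer: bool,
    generator_width: int = 1,
) -> List[Row]:
    """Toy model of a Generate operator (hypothetical, not Spark code)."""
    out: List[Row] = []
    for row in child_rows:
        generated = list(generator(row))
        if join:
            # Left-join-like path: with outer=True, an empty generator
            # result is padded with a single all-null generated row.
            if not generated and outer:
                generated = [(None,) * generator_width]
            out.extend(row + g for g in generated)
        else:
            # Path taken once join is set to false: only the generated
            # rows are emitted, so an empty result contributes nothing.
            out.extend(generated)
    return out


child = [(1,)]                    # stands in for: select 1 as x
explode_empty = lambda row: []    # stands in for: explode(array())

with_join = generate(child, explode_empty, join=True, outer=True)
without_join = generate(child, explode_empty, join=False, outer=True)
```

In this model {{with_join}} holds one row with a null generated column, while {{without_join}} is empty, which mirrors the 1-row versus 0-row difference in the report.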

This is only a problem for {{OUTER}} lateral views. We could add the {{outer}} 
flag to the optimizer rule. Does anyone know what the default behavior of Hive 
is?
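
One possible reading of "add the {{outer}} flag to the optimizer rule" is that the rule would clear {{join}} only when the generator is not {{outer}}. The guard below is a sketch under that assumption: {{pruned_join_flag}} and its boolean inputs are hypothetical and greatly simplify the rule; this is not the code in Optimizer.scala.

```python
def pruned_join_flag(join: bool, outer: bool, child_columns_used: bool) -> bool:
    """Return the join flag after a simplified pruning step (hypothetical)."""
    # Drop the join only when no child column is referenced AND the
    # lateral view is not OUTER; an OUTER view keeps the join so that
    # an empty generator result can still produce a null-padded row.
    if join and not child_columns_used and not outer:
        return False
    return join
```

Under this guard the query from the report (OUTER, no child columns referenced) would keep {{join}} set to true, while a non-OUTER lateral view would still be pruned as before.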


> Spark SQL returns incorrect results for LATERAL VIEW OUTER queries if all 
> inner columns are projected out
> ---------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-14986
>                 URL: https://issues.apache.org/jira/browse/SPARK-14986
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.5.2
>            Reporter: Andrey Balmin
>
> Repro:   using Hive context, run this SQL query:
>    select nil from (select 1 as x) x LATERAL VIEW OUTER EXPLODE(array()) n as nil
> Actual result:             returns 0 rows.
> Expected results:      should return 1 row with null value.
> Details:
> If the query is modified to also return x:
>    select x, nil from (select 1 as x) x LATERAL VIEW OUTER EXPLODE(array()) n as nil
> it works correctly and returns 1 row: [ 1, null ]
> Clearly, changing Select clause of a query should not change the number of 
> rows it returns.
> Looking at the query plan it seems that the Generator object was 
> (incorrectly) marked with "join=false"


