[ https://issues.apache.org/jira/browse/SPARK-6743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14543602#comment-14543602 ]

Santiago M. Mola commented on SPARK-6743:
-----------------------------------------

This problem only happens for cached relations. Here is the root of the problem:

{code}
/* Fails. Got: Array(Row("A1"), Row("A2")) */
assertResult(Array(Row(), Row()))(
  InMemoryColumnarTableScan(Nil, Nil,
    sqlc.table("tab0").queryExecution.sparkPlan
      .asInstanceOf[InMemoryColumnarTableScan].relation)
    .execute().collect()
)
{code}

InMemoryColumnarTableScan returns the narrowest column when no attributes are 
requested:

{code}
 // Find the ordinals and data types of the requested columns.  If none are 
requested, use the
 // narrowest (the field with minimum default element size).
      val (requestedColumnIndices, requestedColumnDataTypes) = if 
(attributes.isEmpty) {
        val (narrowestOrdinal, narrowestDataType) =
          relation.output.zipWithIndex.map { case (a, ordinal) =>
            ordinal -> a.dataType
          } minBy { case (_, dataType) =>
            ColumnType(dataType).defaultSize
          }
        Seq(narrowestOrdinal) -> Seq(narrowestDataType)
      } else {
        attributes.map { a =>
          relation.output.indexWhere(_.exprId == a.exprId) -> a.dataType
        }.unzip
      }
{code}

This narrowest-column fallback appears to be what leads to the incorrect results: with an empty projection, the scan still emits the values of the narrowest column instead of empty rows.
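The selection logic above can be reduced to a self-contained sketch, detached from Spark. Here {{Attr}}, the {{defaultSize}} map, and the column names are illustrative assumptions, not the real {{ColumnType}} machinery:

{code}
// Sketch of the narrowest-column selection in InMemoryColumnarTableScan.
// Attr and the defaultSize values are assumptions for illustration only.
object NarrowestColumnSketch extends App {
  case class Attr(name: String, dataType: String)

  val defaultSize = Map("IntegerType" -> 4, "StringType" -> 8)

  // Cached relation output: tab0 has three integer columns.
  val output = Seq(
    Attr("_1", "IntegerType"),
    Attr("_2", "IntegerType"),
    Attr("_3", "IntegerType"))

  def requestedColumns(attributes: Seq[Attr]): (Seq[Int], Seq[String]) =
    if (attributes.isEmpty) {
      // Same shape as the snippet above: pick the single narrowest column.
      val (ordinal, dataType) = output.zipWithIndex
        .map { case (a, i) => i -> a.dataType }
        .minBy { case (_, dt) => defaultSize(dt) }
      (Seq(ordinal), Seq(dataType))
    } else {
      attributes.map(a => output.indexWhere(_.name == a.name) -> a.dataType).unzip
    }

  // With no requested attributes, the scan still reads one column. All three
  // columns tie on size, so minBy keeps the first: ordinal 0, i.e. _1, which
  // matches the report that the wrong results correspond to cor0._1.
  println(requestedColumns(Nil)) // (List(0),List(IntegerType))
}
{code}

Downstream operators that asked for zero attributes then receive non-empty rows where they expect {{Row()}}, which would explain the corrupted join output.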

> Join with empty projection on one side produces invalid results
> ---------------------------------------------------------------
>
>                 Key: SPARK-6743
>                 URL: https://issues.apache.org/jira/browse/SPARK-6743
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.3.0
>            Reporter: Santiago M. Mola
>            Priority: Critical
>
> {code:java}
> val sqlContext = new SQLContext(sc)
> val tab0 = sc.parallelize(Seq(
>   (83, 0, 38),
>   (26, 0, 79),
>   (43, 81, 24)
> ))
> sqlContext.registerDataFrameAsTable(sqlContext.createDataFrame(tab0), "tab0")
> sqlContext.cacheTable("tab0")
> val df1 = sqlContext.sql(
>   "SELECT tab0._2, cor0._2 FROM tab0, tab0 cor0 GROUP BY tab0._2, cor0._2")
> val result1 = df1.collect()
> val df2 = sqlContext.sql("SELECT cor0._2 FROM tab0, tab0 cor0 GROUP BY cor0._2")
> val result2 = df2.collect()
> val df3 = sqlContext.sql("SELECT cor0._2 FROM tab0 cor0 GROUP BY cor0._2")
> val result3 = df3.collect()
> {code}
> Given the previous code, result2 equals Row(43), Row(83), Row(26), which is
> wrong: these values correspond to cor0._1 instead of cor0._2. The correct
> result would be Row(0), Row(81), which is what the third query returns. The
> first query also produces valid results; the only difference is that the left
> side of the join is not empty.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
