[ https://issues.apache.org/jira/browse/SPARK-15112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15271025#comment-15271025 ]

Cheng Lian commented on SPARK-15112:
------------------------------------

The following Spark shell session illustrates this issue:

{noformat}
scala> case class T(a: String, b: Int)
defined class T

scala> val ds = Seq(1 -> "foo", 2 -> "bar").toDF("b", "a").as[T]
ds: org.apache.spark.sql.Dataset[T] = [b: int, a: string]

scala> ds.show()
+---+---+
|  b|  a|
+---+---+
|foo|  1|
|bar|  2|
+---+---+


scala> ds.filter(_.b > 1).show()
+---+---+
|  a|  b|
+---+---+
| |  3|
+---+---+
{noformat}

Dataset encoders don't actually require the order of input columns to be 
exactly the same as the encoder's own schema. Essentially, the encoder 
performs a projection to adjust column order during the analysis phase. This 
can be quite helpful for data sources that support schema evolution, where 
the column order of the merged schema may be non-deterministic. The JSON data 
source falls into this category: it always sorts all input columns by name.
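As a sketch of how this plays out with the JSON source (hypothetical file path, and assuming a Spark 2.0-style {{spark}} session):

{noformat}
case class T(a: String, b: Int)

// The JSON source always sorts top-level columns by name, so the
// DataFrame's column order is (a, b) no matter what order the
// fields appear in within the file.
val df = spark.read.json("/path/to/data.json")

// The encoder projects the input columns into case-class field
// order during analysis, so as[T] works even when the plan's
// column order differs from the encoder schema:
val ds = df.as[T]
{noformat}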

This leads to the following facts for a Dataset {{ds}}:

# {{ds.resolvedTEncoder.schema}} may differ from {{ds.logicalPlan.schema}}, and
# {{ds.schema}} should conform to {{ds.resolvedTEncoder.schema}}, and
# {{ds.toDF()}} uses a {{RowEncoder}} to convert user space Scala objects to 
{{InternalRow}}s, and this {{RowEncoder}} should be initialized using 
{{ds.logicalPlan.schema}}.
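A hedged sketch of how these three requirements interact for the session above (the schemas in the comments are what the requirements imply, not verified output):

{noformat}
case class T(a: String, b: Int)

val ds = Seq(1 -> "foo", 2 -> "bar").toDF("b", "a").as[T]

// 1. ds.resolvedTEncoder.schema follows the case-class fields:
//    struct<a: string, b: int>, while ds.logicalPlan.schema keeps
//    the input column order: struct<b: int, a: string>.
// 2. Per requirement 2, ds.schema should report the encoder's
//    order, i.e. struct<a: string, b: int>.
// 3. ds.toDF() converts T instances back to InternalRows with a
//    RowEncoder built from ds.logicalPlan.schema, i.e. (b, a).
ds.schema  // should match the encoder schema per requirement 2
{noformat}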

Spark 1.6 conforms to the above requirements. For example:

{noformat}
scala> case class T(a: String, b: Int)
defined class T

scala> val ds = Seq(1 -> "foo", 2 -> "bar").toDF("b", "a").as[T]
ds: org.apache.spark.sql.Dataset[T] = [b: int, a: string]

scala> ds.show()
+---+---+
|  b|  a|
+---+---+
|foo|  1|
|bar|  2|
+---+---+


scala> ds.toDF().show()
+---+---+
|  a|  b|
+---+---+
|  1|foo|
|  2|bar|
+---+---+
{noformat}

However, while merging the DataFrame and Dataset APIs in Spark 2.0, 
requirement 2 was broken by accident: we are now using 
{{ds.logicalPlan.schema}} as {{ds.schema}}, which leads to this bug.
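Until the fix lands, one possible workaround (a sketch, not verified) is to reorder columns explicitly so the plan schema already matches the encoder schema before calling {{as\[T\]}}:

{noformat}
case class T(a: String, b: Int)

// Selecting columns in case-class field order makes
// ds.logicalPlan.schema agree with the encoder schema,
// sidestepping the broken requirement 2:
val ds = Seq(1 -> "foo", 2 -> "bar")
  .toDF("b", "a")
  .select("a", "b")
  .as[T]

ds.filter(_.b > 1).show()
{noformat}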

Working on a fix for it.

> Dataset filter returns garbage
> ------------------------------
>
>                 Key: SPARK-15112
>                 URL: https://issues.apache.org/jira/browse/SPARK-15112
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>            Reporter: Reynold Xin
>            Priority: Blocker
>         Attachments: demo 1 dataset - Databricks.htm
>
>
> See the following notebook:
> https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/6122906529858466/2727501386611535/5382278320999420/latest.html
> I think it happens only when using JSON. I'm also going to attach it to the 
> ticket.


