[ https://issues.apache.org/jira/browse/SPARK-15112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15271025#comment-15271025 ]
Cheng Lian commented on SPARK-15112:
------------------------------------

The following Spark shell session illustrates this issue:

{noformat}
scala> case class T(a: String, b: Int)
defined class T

scala> val ds = Seq(1 -> "foo", 2 -> "bar").toDF("b", "a").as[T]
ds: org.apache.spark.sql.Dataset[T] = [b: int, a: string]

scala> ds.show()
+---+---+
|  b|  a|
+---+---+
|foo|  1|
|bar|  2|
+---+---+

scala> ds.filter(_.b > 1).show()
+---+---+
|  a|  b|
+---+---+
|   |  3|
+---+---+
{noformat}

Dataset encoders don't actually require the order of input columns to exactly match their own schema. Essentially, a projection is performed during the analysis phase to adjust column order. This can be quite helpful for data sources that support schema evolution, where the column order of the merged schema may be non-deterministic. The JSON data source falls into this category, and it always sorts all input columns by name. This leads to the following facts, for a Dataset {{ds}}:

# {{ds.resolvedTEncoder.schema}} may differ from {{ds.logicalPlan.schema}}, and
# {{ds.schema}} should conform to {{ds.resolvedTEncoder.schema}}, and
# {{ds.toDF()}} uses a {{RowEncoder}} to convert user-space Scala objects to {{InternalRow}}s, and this {{RowEncoder}} should be initialized using {{ds.logicalPlan.schema}}.

Spark 1.6 conforms to the above requirements. For example:

{noformat}
scala> case class T(a: String, b: Int)
defined class T

scala> val ds = Seq(1 -> "foo", 2 -> "bar").toDF("b", "a").as[T]
ds: org.apache.spark.sql.Dataset[T] = [b: int, a: string]

scala> ds.show()
+---+---+
|  b|  a|
+---+---+
|foo|  1|
|bar|  2|
+---+---+

scala> ds.toDF().show()
+---+---+
|  a|  b|
+---+---+
|  1|foo|
|  2|bar|
+---+---+
{noformat}

However, while merging the DataFrame/Dataset APIs in Spark 2.0, requirement 2 was broken by accident: we are now using {{ds.logicalPlan.schema}} as {{ds.schema}}, which leads to this bug. Working on a fix for it.
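The by-name projection described above can be sketched without Spark. This is a minimal illustration, not Spark's actual encoder code; the names {{EncoderSketch}}, {{bindByPosition}}, and {{bindByName}} are hypothetical:

{noformat}
// Sketch of why an encoder must project columns by name rather than by
// position when the plan's column order differs from the case class schema.
// Illustrative only -- none of these names are Spark internals.
case class T(a: String, b: Int)

object EncoderSketch {
  // Columns as they arrive from the logical plan, in (b, a) order.
  val planSchema: Seq[String] = Seq("b", "a")

  // Broken: bind fields positionally against the encoder schema (a, b).
  // The Int lands in `a` and the String in `b` -- garbage values.
  def bindByPosition(values: Seq[Any]): (Any, Any) =
    (values(0), values(1))

  // Correct: resolve each field by column name first, mimicking the
  // projection the encoder performs during the analysis phase.
  def bindByName(values: Seq[Any]): T = {
    val byName = planSchema.zip(values).toMap
    T(byName("a").asInstanceOf[String], byName("b").asInstanceOf[Int])
  }

  def main(args: Array[String]): Unit = {
    val row = Seq(1, "foo")      // plan order: (b, a)
    println(bindByPosition(row)) // fields swapped
    println(bindByName(row))     // T(foo,1)
  }
}
{noformat}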
> Dataset filter returns garbage
> ------------------------------
>
>                 Key: SPARK-15112
>                 URL: https://issues.apache.org/jira/browse/SPARK-15112
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>            Reporter: Reynold Xin
>            Priority: Blocker
>         Attachments: demo 1 dataset - Databricks.htm
>
>
> See the following notebook:
> https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/6122906529858466/2727501386611535/5382278320999420/latest.html
> I think it happens only when using JSON. I'm also going to attach it to the
> ticket.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)