[ https://issues.apache.org/jira/browse/SPARK-15112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15273837#comment-15273837 ]

Cheng Lian edited comment on SPARK-15112 at 5/6/16 10:22 AM:
-------------------------------------------------------------

Actually there's another issue that contributes to this bug. The problem lies 
in the {{EmbedSerializerInFilter}} optimization rule. In short, this rule 
optimizes plan fragments like

{noformat}
SerializeFromObject     <--.
 Filter                    | column order may differ
  DeserializeToObject      |
   <child-plan>         <--'
{noformat}

into

{noformat}
Filter
 <child-plan>
{noformat}

by embedding the deserializer expression into the {{Filter}} condition 
expression. That is, when filtering an input row, the new {{Filter}} operator 
always deserializes the input row into a Scala object, and then passes that 
object as the argument to the user-provided Scala predicate function.
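
For reference, a typed filter over a JSON-backed Dataset is the kind of query 
that hits this plan fragment. Here's only a rough reproduction sketch (the case 
class, field names, and input path are made up for illustration):

{noformat}
import org.apache.spark.sql.SparkSession

// Hypothetical case class and path -- for illustration only.
case class Person(name: String, age: Long)

object Spark15112Sketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("SPARK-15112 sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // JSON schema inference may order columns differently (e.g. alphabetically)
    // from the order in which the Person encoder serializes them back.
    val people = spark.read.json("/tmp/people.json").as[Person]

    // Typed filter: produces SerializeFromObject / Filter / DeserializeToObject
    // around the JSON scan in the analyzed plan.
    people.filter(_.age > 21).explain(true)

    spark.stop()
  }
}
{noformat}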

The problem here is that the output column order of {{SerializeFromObject}} may 
differ from the column order of the child plan (as explained in my comment 
above). Thus the simplified plan fragment may produce wrong results because the 
column order isn't adjusted accordingly.
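
As a toy illustration of how a column-order mismatch turns into garbage values 
(this is not the actual code path, just positional reads against the wrong 
assumed order, e.g. in a Scala REPL):

{noformat}
import org.apache.spark.sql.Row

// Suppose the child plan emits columns as (age, name), but the consumer still
// reads them positionally assuming (name, age).
val childRow = Row(25L, "Alice")     // actual order: (age, name)
val name = childRow.get(0)           // expected the name, got 25
val age  = childRow.get(1)           // expected the age,  got "Alice"
println(s"name=$name, age=$age")     // name=25, age=Alice  -> wrong results
{noformat}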

To fix this issue, we should add a {{Project}} on top of the resulting 
{{Filter}} plan when necessary to adjust the output column order.
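
A minimal, self-contained sketch of that adjustment, using toy plan classes 
rather than the real Catalyst nodes (so the names and shapes here are simplified 
assumptions, runnable in a Scala REPL):

{noformat}
// Toy model of logical plan nodes -- NOT the actual Catalyst classes.
sealed trait Plan { def output: Seq[String] }
case class Scan(output: Seq[String]) extends Plan
case class Filter(child: Plan) extends Plan {
  def output: Seq[String] = child.output   // Filter preserves the child's order
}
case class Project(output: Seq[String], child: Plan) extends Plan

// After embedding the deserializer into the Filter, wrap the result in a
// Project whenever the Filter's output order differs from what
// SerializeFromObject would have produced.
def adjustColumnOrder(expectedOutput: Seq[String], filter: Filter): Plan =
  if (filter.output == expectedOutput) filter
  else Project(expectedOutput, filter)

// Example: the scan yields (age, name) but the serializer expects (name, age).
val optimized = adjustColumnOrder(Seq("name", "age"), Filter(Scan(Seq("age", "name"))))
println(optimized)  // Project(List(name, age), Filter(Scan(List(age, name))))
{noformat}

This mirrors the intent of the fix: the extra {{Project}} is only added when the 
orders actually differ, so the common case is left unchanged.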



> Dataset filter returns garbage
> ------------------------------
>
>                 Key: SPARK-15112
>                 URL: https://issues.apache.org/jira/browse/SPARK-15112
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>            Reporter: Reynold Xin
>            Assignee: Cheng Lian
>            Priority: Blocker
>         Attachments: demo 1 dataset - Databricks.htm
>
>
> See the following notebook:
> https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/6122906529858466/2727501386611535/5382278320999420/latest.html
> I think it happens only when using JSON. I'm also going to attach it to the 
> ticket.


