[GitHub] [spark] cloud-fan opened a new pull request #24442: [SPARK-27547][SQL] fix DataFrame self-join problems

GitBox Tue, 23 Apr 2019 06:53:15 -0700

cloud-fan opened a new pull request #24442: [SPARK-27547][SQL] fix DataFrame 
self-join problems
URL: https://github.com/apache/spark/pull/24442
 
 
   ## What changes were proposed in this pull request?
   
   This is a long-standing bug and I've seen many people complaining about it 
in JIRA/dev list.
   
   A typical example:
   ```
   val df1 = …
   val df2 = df1.filter(...)
   df1.join(df2, df1("a") > df2("a")) // returns empty result
   ```
   The root cause is, `Dataset.apply` is so powerful that users think it 
returns a column reference which can point to the column of the Dataset at 
anywhere. This is not true in many cases. `Dataset.apply` returns an 
`AttributeReference` . Different Datasets may share the same 
`AttributeReference`. In the example above, `df2` adds a Filter operator above 
the logical plan of `df1`, and the Filter operator reserves the output 
`AttributeReference` of its child. This means, `df1("a")` is exactly the same 
as `df2("a")`, and `df1("a") > df2("a")` always evaluates to false.
   
   The key problem here is, when analyzer resolves self-join by assigning new 
expr IDs to the conflicting attributes in the right side of the join, how can 
we map the Dataset column reference to the new attributes of the right side?
   
   The proposal here is:
   1. each Dataset has a globally unique ID
   2. the `AttributeReference` returned by `Dataset.apply` carries the ID and 
column position(e.g. 3rd column of the Dataset) via metadata
   3. In `Dataset.join`, the left and right side of the join node carry the ID 
via `SubqueryAlias`.
   4. Add a new analyzer rule, which finds the special `AttributeReference`s 
that have dataset id, and traverses the logical plan to find the corresponding 
`SubqueryAlias` with the same dataset id, to get the actual 
`AttributeReference`.
   
   When analyzer resolves self-join, it transforms down the right side plan to 
find nodes like `MultiInstanceRelation` to generate attributes with new exprId. 
If the right side is a `SubqueryAlias`, its output will be the deduplicated 
`AttributeReference`s after self-join is resolved.
   
   When we resolve dataset column reference(`AttributeReference` with dataset 
id and col position), and find a matching `SubqueryAlias` with the same dataset 
id, we can replace the column reference with the corresponding 
`AttributeReference` of `SubqueryAlias`.
   
   ## How was this patch tested?
   
   new test cases


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] cloud-fan opened a new pull request #24442: [SPARK-27547][SQL] fix DataFrame self-join problems

Reply via email to