[GitHub] spark pull request: [SPARK-10838][SPARK-11576][SQL][WIP] Incorrect...

2016-01-11 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/9548#issuecomment-170755883
  
@gatorsmile we will revisit this in the future. Do you mind closing the 
pull request for now?



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-10838][SPARK-11576][SQL][WIP] Incorrect...

2016-01-11 Thread gatorsmile
Github user gatorsmile commented on the pull request:

https://github.com/apache/spark/pull/9548#issuecomment-170756672
  
Ok, let me close it. Thank you!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-10838][SPARK-11576][SQL][WIP] Incorrect...

2016-01-11 Thread gatorsmile
Github user gatorsmile closed the pull request at:

https://github.com/apache/spark/pull/9548


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-10838][SPARK-11576][SQL][WIP] Incorrect...

2015-11-11 Thread gatorsmile
Github user gatorsmile commented on the pull request:

https://github.com/apache/spark/pull/9548#issuecomment-155835390
  
@cloud-fan Before discussing the solution details, let us first talk about 
the design issues.  

IMO, the `DataFrame` is a query language, kind of a dialect of SQL. Or, 
maybe, SQL is a dialect of `DataFrame`. We need to formalize it and clearly 
define the concepts of each major classes like `DataFrame` and `Column`. If 
`Column` represents a concept independent of `DataFrame`, could you define what 
it is? If one `Column` with the same ID can appear in different `DataFrame`, 
how to enforce such a "referential integrity" between different `DataFrame`? If 
two `Column` with different ID could represent the same entity, should we keep 
such a relation for generating a better physical plan? 

In the current implementation, each `Column` actually corresponds to an 
expression in logical plans, but we are unable to apply an expression above 
`Column` instances to generate a new expression. So far, `Column` is kind of a 
wrapper, but it is not a subclass of `TreeNode`.   

When more components are built on `DataFrame` to access and operate, we 
have to carefully think about this problem. If possible, I think we need to 
resolve it in the release of Spark 2.0.  

Will answer your design suggestion in a separate post.  


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-10838][SPARK-11576][SQL][WIP] Incorrect...

2015-11-11 Thread gatorsmile
Github user gatorsmile commented on the pull request:

https://github.com/apache/spark/pull/9548#issuecomment-155912100
  
@cloud-fan So far, we do not have an easy fix, but I believe we should 
never return a wrong result for self join. 

Let me post the test case I added. This test case will return an incorrect 
result without any exception:
```scala
test("[SPARK-10838] self join - conflicting attributes in condition - 
incorrect result 2") {
   val df1 = Seq((1, 3), (2, 1)).toDF("keyCol1", "keyCol2")
   val df2 = Seq((1, 4), (2, 1)).toDF("keyCol1", "keyCol3")
   val df3 = df1.join(df2, df1("keyCol1") === 
df2("keyCol1")).select(df1("keyCol1"), $"keyCol3")
   checkAnswer(
 df3.join(df1, df3("keyCol3") === df1("keyCol1") && df1("keyCol1") === 
df3("keyCol3")),
 Row(2, 1, 1, 3) :: Nil)
 }
```

Before resolving this problem, what we can do it is to detect it and let 
customers use the workaround you mentioned. The detection condition is simple. 
The incorrect result could happen when the conflicting attributes contain the 
`AttributeReference` that appear in join condition. 

Do you agree @cloud-fan @marmbrus ?

If OK, I will submit another PR for detecting it and issuing an exception 
with a meaningful message to users. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-10838][SPARK-11576][SQL][WIP] Incorrect...

2015-11-09 Thread gatorsmile
Github user gatorsmile commented on the pull request:

https://github.com/apache/spark/pull/9548#issuecomment-155183305
  
I can't fix the problem without a major code change. The current design of 
dataFrame has a fundamental problem. When using column references, we might hit 
various strange issues if the dataFrame has the columns with the same name and 
expression id. Note that this might occur even if we do not have self joins. 

For example, in the following code, 

```scala
val df1 = Seq((1, 3), (2, 1)).toDF("keyCol1", "keyCol2")
val df2 = Seq((1, 4, 0), (2, 1, 0)).toDF("keyCol1", "keyCol3", 
"keyColToDrop")
val df3 = df1.join(df2, df1("keyCol1") === df2("keyCol1"))

val col = df3("keyColToDrop")
val df = df2.drop(col)
df.printSchema() 
```

Above, we can use a column reference of df3 to drop the column in df2. That 
does not make sense, right?

In each column reference, we have to know the original data source. 

@marmbrus @rxin @liancheng 
Should I propose a solution to fix this problem? Does the new Dataset APIs 
resolve this issue?





---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-10838][SPARK-11576][SQL][WIP] Incorrect...

2015-11-09 Thread gatorsmile
Github user gatorsmile commented on the pull request:

https://github.com/apache/spark/pull/9548#issuecomment-155226523
  
@marmbrus Thank you for your suggestions! 

That is also like my initial idea. I did a try last night. Unfortunately, I 
hit a problem when adding such a field to `Column` API. In the current design, 
the class `Column` corresponds to the class `Expression`, which includes both 
`AttributeReference` and the other types. For `Column`, it makes sense to have 
such a dataFrame identifier. However, when `Column` is generated from the 
binary expression types (e.g., `gt`), it could have more than one dataFrame 
identifiers. Does that sound good to you? 

When implementing the idea, it becomes more difficult. For example, in the 
following binary operators,

```scala
  def === (other: Any): Column = {
val right = lit(other).expr
EqualTo(expr, right)
  }
```

`EqualTo` is an `Expression`. `expr` and `right` are not `Column`s. Thus, 
when accessing the `Column` generated from `===`, we are unable to know the 
dataframe sources of `expr` and `right` if we do not change 
`AttributeReference`.  

That is why I am thinking this could mean a major code change to 
`DataFrame` and `Column`. Thank you for any further suggestion. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-10838][SPARK-11576][SQL][WIP] Incorrect...

2015-11-08 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/9548#issuecomment-154881540
  
**[Test build #45319 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/45319/consoleFull)**
 for PR 9548 at commit 
[`7d04713`](https://github.com/apache/spark/commit/7d047136cd710ee0e9ff34aa37c1e6d299165233).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-10838][SPARK-11576][SQL][WIP] Incorrect...

2015-11-08 Thread gatorsmile
GitHub user gatorsmile opened a pull request:

https://github.com/apache/spark/pull/9548

[SPARK-10838][SPARK-11576][SQL][WIP] Incorrect results or exceptions when 
using self-joins

When resolving the attributeReference's ambiguity caused by self joins, the 
current solution only handles the conflicting attributes. However, this does 
not work when the join conditions use the column names that appear in both 
dataFrames since the join conditions are evaluated before resolving the 
ambiguity of conflicting attributes.  Currently, we did not update the 
search-condition. When generating the new expression IDs in the right tree, we 
must update the corresponding columns' expression ID in search condition.

Here, I am trying to propose a solution to resolve this issue. When 
evaluating the join conditions, we record the dataFrame of the search-condition 
columns. Then, when resolving the ambiguity of conflicting attributes, we can 
use this information to know which columns are from the right tree, and then 
update their expression IDs.  

When designing this solution, I am trying to reduce the code changes, and 
thus, I am using quantifiers to record this information. Ideally, I think each 
column needs to clearly correlate with its original source, but this requires a 
lot of code changes, but this will help to optimize the plan in the future. 

Thanks for any suggestion! 

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/gatorsmile/spark selfJoinConflictingConditions

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/9548.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #9548


commit 376691af5f639ca4f7ff07cc9e8f572d53e961bf
Author: xiaoli 
Date:   2015-11-08T18:28:12Z

Spark-10838

commit 7d047136cd710ee0e9ff34aa37c1e6d299165233
Author: xiaoli 
Date:   2015-11-08T21:07:37Z

Merge branch 'selfJoinCondition' into selfJoinConflictingConditions




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-10838][SPARK-11576][SQL][WIP] Incorrect...

2015-11-08 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/9548#issuecomment-154883241
  
**[Test build #45319 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/45319/consoleFull)**
 for PR 9548 at commit 
[`7d04713`](https://github.com/apache/spark/commit/7d047136cd710ee0e9ff34aa37c1e6d299165233).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-10838][SPARK-11576][SQL][WIP] Incorrect...

2015-11-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/9548#issuecomment-154883261
  
Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-10838][SPARK-11576][SQL][WIP] Incorrect...

2015-11-08 Thread gatorsmile
Github user gatorsmile commented on the pull request:

https://github.com/apache/spark/pull/9548#issuecomment-154881441
  
Since this solution requires adding quantifier comparison into the equation 
of attributeReferences, this will fail a couple test cases in expand. 

We have already identified the bugs in the expand and submitted pull 
requests to resolve this issue. https://github.com/apache/spark/pull/9216 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-10838][SPARK-11576][SQL][WIP] Incorrect...

2015-11-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/9548#issuecomment-154879753
  
 Merged build triggered.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-10838][SPARK-11576][SQL][WIP] Incorrect...

2015-11-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/9548#issuecomment-154879759
  
Merged build started.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-10838][SPARK-11576][SQL][WIP] Incorrect...

2015-11-08 Thread gatorsmile
Github user gatorsmile commented on the pull request:

https://github.com/apache/spark/pull/9548#issuecomment-154911463
  
To fix these failed cases, I will move the dataFrame's hashCode to the 
Column class, instead of directly putting the values to quantifiers. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org