[GitHub] spark pull request: [SPARK-10838][SPARK-11576][SQL][WIP] Incorrect...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/9548#issuecomment-170755883 @gatorsmile we will revisit this in the future. Do you mind closing the pull request for now? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-10838][SPARK-11576][SQL][WIP] Incorrect...
Github user gatorsmile commented on the pull request: https://github.com/apache/spark/pull/9548#issuecomment-170756672 Ok, let me close it. Thank you! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-10838][SPARK-11576][SQL][WIP] Incorrect...
Github user gatorsmile closed the pull request at: https://github.com/apache/spark/pull/9548 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-10838][SPARK-11576][SQL][WIP] Incorrect...
Github user gatorsmile commented on the pull request: https://github.com/apache/spark/pull/9548#issuecomment-155835390 @cloud-fan Before discussing the solution details, let us first talk about the design issues. IMO, the `DataFrame` is a query language, kind of a dialect of SQL. Or, maybe, SQL is a dialect of `DataFrame`. We need to formalize it and clearly define the concepts of each major classes like `DataFrame` and `Column`. If `Column` represents a concept independent of `DataFrame`, could you define what it is? If one `Column` with the same ID can appear in different `DataFrame`, how to enforce such a "referential integrity" between different `DataFrame`? If two `Column` with different ID could represent the same entity, should we keep such a relation for generating a better physical plan? In the current implementation, each `Column` actually corresponds to an expression in logical plans, but we are unable to apply an expression above `Column` instances to generate a new expression. So far, `Column` is kind of a wrapper, but it is not a subclass of `TreeNode`. When more components are built on `DataFrame` to access and operate, we have to carefully think about this problem. If possible, I think we need to resolve it in the release of Spark 2.0. Will answer your design suggestion in a separate post. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-10838][SPARK-11576][SQL][WIP] Incorrect...
Github user gatorsmile commented on the pull request: https://github.com/apache/spark/pull/9548#issuecomment-155912100 @cloud-fan So far, we do not have an easy fix, but I believe we should never return a wrong result for self join. Let me post the test case I added. This test case will return an incorrect result without any exception: ```scala test("[SPARK-10838] self join - conflicting attributes in condition - incorrect result 2") { val df1 = Seq((1, 3), (2, 1)).toDF("keyCol1", "keyCol2") val df2 = Seq((1, 4), (2, 1)).toDF("keyCol1", "keyCol3") val df3 = df1.join(df2, df1("keyCol1") === df2("keyCol1")).select(df1("keyCol1"), $"keyCol3") checkAnswer( df3.join(df1, df3("keyCol3") === df1("keyCol1") && df1("keyCol1") === df3("keyCol3")), Row(2, 1, 1, 3) :: Nil) } ``` Before resolving this problem, what we can do it is to detect it and let customers use the workaround you mentioned. The detection condition is simple. The incorrect result could happen when the conflicting attributes contain the `AttributeReference` that appear in join condition. Do you agree @cloud-fan @marmbrus ? If OK, I will submit another PR for detecting it and issuing an exception with a meaningful message to users. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-10838][SPARK-11576][SQL][WIP] Incorrect...
Github user gatorsmile commented on the pull request: https://github.com/apache/spark/pull/9548#issuecomment-155183305 I can't fix the problem without a major code change. The current design of dataFrame has a fundamental problem. When using column references, we might hit various strange issues if the dataFrame has the columns with the same name and expression id. Note that this might occur even if we do not have self joins. For example, in the following code, ```scala val df1 = Seq((1, 3), (2, 1)).toDF("keyCol1", "keyCol2") val df2 = Seq((1, 4, 0), (2, 1, 0)).toDF("keyCol1", "keyCol3", "keyColToDrop") val df3 = df1.join(df2, df1("keyCol1") === df2("keyCol1")) val col = df3("keyColToDrop") val df = df2.drop(col) df.printSchema() ``` Above, we can use a column reference of df3 to drop the column in df2. That does not make sense, right? In each column reference, we have to know the original data source. @marmbrus @rxin @liancheng Should I propose a solution to fix this problem? Does the new Dataset APIs resolve this issue? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-10838][SPARK-11576][SQL][WIP] Incorrect...
Github user gatorsmile commented on the pull request: https://github.com/apache/spark/pull/9548#issuecomment-155226523 @marmbrus Thank you for your suggestions! That is also like my initial idea. I did a try last night. Unfortunately, I hit a problem when adding such a field to `Column` API. In the current design, the class `Column` corresponds to the class `Expression`, which includes both `AttributeReference` and the other types. For `Column`, it makes sense to have such a dataFrame identifier. However, when `Column` is generated from the binary expression types (e.g., `gt`), it could have more than one dataFrame identifiers. Does that sound good to you? When implementing the idea, it becomes more difficult. For example, in the following binary operators, ```scala def === (other: Any): Column = { val right = lit(other).expr EqualTo(expr, right) } ``` `EqualTo` is an `Expression`. `expr` and `right` are not `Column`s. Thus, when accessing the `Column` generated from `===`, we are unable to know the dataframe sources of `expr` and `right` if we do not change `AttributeReference`. That is why I am thinking this could mean a major code change to `DataFrame` and `Column`. Thank you for any further suggestion. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-10838][SPARK-11576][SQL][WIP] Incorrect...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/9548#issuecomment-154881540 **[Test build #45319 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/45319/consoleFull)** for PR 9548 at commit [`7d04713`](https://github.com/apache/spark/commit/7d047136cd710ee0e9ff34aa37c1e6d299165233). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-10838][SPARK-11576][SQL][WIP] Incorrect...
GitHub user gatorsmile opened a pull request: https://github.com/apache/spark/pull/9548 [SPARK-10838][SPARK-11576][SQL][WIP] Incorrect results or exceptions when using self-joins When resolving the attributeReference's ambiguity caused by self joins, the current solution only handles the conflicting attributes. However, this does not work when the join conditions use the column names that appear in both dataFrames since the join conditions are evaluated before resolving the ambiguity of conflicting attributes. Currently, we did not update the search-condition. When generating the new expression IDs in the right tree, we must update the corresponding columns' expression ID in search condition. Here, I am trying to propose a solution to resolve this issue. When evaluating the join conditions, we record the dataFrame of the search-condition columns. Then, when resolving the ambiguity of conflicting attributes, we can use this information to know which columns are from the right tree, and then update their expression IDs. When designing this solution, I am trying to reduce the code changes, and thus, I am using quantifiers to record this information. Ideally, I think each column needs to clearly correlate with its original source, but this requires a lot of code changes, but this will help to optimize the plan in the future. Thanks for any suggestion! You can merge this pull request into a Git repository by running: $ git pull https://github.com/gatorsmile/spark selfJoinConflictingConditions Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/9548.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #9548 commit 376691af5f639ca4f7ff07cc9e8f572d53e961bf Author: xiaoliDate: 2015-11-08T18:28:12Z Spark-10838 commit 7d047136cd710ee0e9ff34aa37c1e6d299165233 Author: xiaoli Date: 2015-11-08T21:07:37Z Merge branch 'selfJoinCondition' into selfJoinConflictingConditions --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-10838][SPARK-11576][SQL][WIP] Incorrect...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/9548#issuecomment-154883241 **[Test build #45319 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/45319/consoleFull)** for PR 9548 at commit [`7d04713`](https://github.com/apache/spark/commit/7d047136cd710ee0e9ff34aa37c1e6d299165233). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-10838][SPARK-11576][SQL][WIP] Incorrect...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/9548#issuecomment-154883261 Merged build finished. Test FAILed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-10838][SPARK-11576][SQL][WIP] Incorrect...
Github user gatorsmile commented on the pull request: https://github.com/apache/spark/pull/9548#issuecomment-154881441 Since this solution requires adding quantifier comparison into the equation of attributeReferences, this will fail a couple test cases in expand. We have already identified the bugs in the expand and submitted pull requests to resolve this issue. https://github.com/apache/spark/pull/9216 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-10838][SPARK-11576][SQL][WIP] Incorrect...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/9548#issuecomment-154879753 Merged build triggered. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-10838][SPARK-11576][SQL][WIP] Incorrect...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/9548#issuecomment-154879759 Merged build started. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-10838][SPARK-11576][SQL][WIP] Incorrect...
Github user gatorsmile commented on the pull request: https://github.com/apache/spark/pull/9548#issuecomment-154911463 To fix these failed cases, I will move the dataFrame's hashCode to the Column class, instead of directly putting the values to quantifiers. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org