[GitHub] spark pull request: [SPARK-8658] [SQL] AttributeReference's equals...
GitHub user gatorsmile opened a pull request: https://github.com/apache/spark/pull/9216

[SPARK-8658] [SQL] AttributeReference's equals method compares all the members

This fix changes the equals method of AttributeReference to check all of its fields for equality.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/gatorsmile/spark namedExpressEqual

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/9216.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #9216

commit 029b5babcc842347d6c55df95b0fa51fff43f0e6
Author: gatorsmile <gatorsm...@gmail.com>
Date: 2015-10-22T04:08:53Z

    Spark-8658
[GitHub] spark pull request: [SPARK-8658] [SQL] AttributeReference's equals...
Github user gatorsmile commented on the pull request: https://github.com/apache/spark/pull/9216#issuecomment-150395202

My code change exposes a new defect: both rollup and cube produce incorrect results, whether or not the build includes my changes.

Without my changes, the outputs of the rollup query are:

    [3,2,-1]
    [3,null,null]
    [6,4,-2]
    [6,null,null]
    [null,null,null]

However, the expected results should be:

    [3,2,-1]
    [3,null,-1]
    [6,4,-2]
    [6,null,-2]
    [null,null,-3]

I need more time to find the root cause of why rollup and cube do not work. The current test cases in HiveDataFrameAnalyticsSuite hide the errors because the results of both SQL and DataFrame are wrong, yet identical to each other.

Thanks, Xiao Li
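For anyone trying to reproduce this, a minimal sketch along these lines should work (assumptions: the two-row `mytable` and the `a + b` rollup query used elsewhere in this thread, an existing SparkContext named `sc`, and a 1.5-era HiveContext, whose HiveQL parser is what accepts `with rollup`):

```scala
// Reproduction sketch (assumes an existing SparkContext `sc`; a HiveContext
// is needed because its parser handles `with rollup`).
import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)
import hiveContext.implicits._

Seq((1, 2), (2, 4)).toDF("a", "b").registerTempTable("mytable")

// Expected: [3,2,-1] [3,null,-1] [6,4,-2] [6,null,-2] [null,null,-3]
hiveContext.sql(
  "select a + b, b, sum(a - b) from mytable group by a + b, b with rollup"
).collect().foreach(println)
```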
[GitHub] spark pull request: [SPARK-11360] [Doc] Loss of nullability when w...
GitHub user gatorsmile opened a pull request: https://github.com/apache/spark/pull/9314

[SPARK-11360] [Doc] Loss of nullability when writing parquet files

This fix adds one line to the documentation to explain the current behavior of Spark SQL when writing Parquet files: all columns are forced to be nullable for compatibility reasons.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/gatorsmile/spark lossNull

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/9314.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #9314

commit 4a63fad3b432bcb16d0fa3774c86112a2425
Author: gatorsmile <gatorsm...@gmail.com>
Date: 2015-10-28T01:33:04Z

    Document fix: loss of nullability when writing parquet files
[GitHub] spark pull request: [SPARK-8658] [SQL] AttributeReference's equals...
Github user gatorsmile commented on the pull request: https://github.com/apache/spark/pull/9216#issuecomment-150475832

Hi, @cloud-fan Sure. Will do. I am trying to see if I can easily fix it. Anyway, I will open a JIRA tonight.

Thanks, Xiao Li
[GitHub] spark pull request: [SPARK-8658] [SQL] AttributeReference's equals...
Github user gatorsmile commented on the pull request: https://github.com/apache/spark/pull/9216#issuecomment-150486350

The JIRA has been opened: https://issues.apache.org/jira/browse/SPARK-11275 I will continue the investigation under that JIRA issue.
[GitHub] spark pull request: [SPARK-10838][SPARK-11576][SQL][WIP] Incorrect...
Github user gatorsmile commented on the pull request: https://github.com/apache/spark/pull/9548#issuecomment-155835390

@cloud-fan Before discussing the solution details, let us first talk about the design issues. IMO, `DataFrame` is a query language, kind of a dialect of SQL. Or, maybe, SQL is a dialect of `DataFrame`. We need to formalize it and clearly define the concepts behind each major class, like `DataFrame` and `Column`.

If `Column` represents a concept independent of `DataFrame`, can you define what it is? If a `Column` with the same ID can appear in different `DataFrame`s, how do we enforce such "referential integrity" between the `DataFrame`s? If two `Column`s with different IDs can represent the same entity, should we keep track of that relation for generating a better physical plan?

In the current implementation, each `Column` actually corresponds to an expression in logical plans, but we are unable to apply an expression on top of `Column` instances to generate a new expression. So far, `Column` is kind of a wrapper, but it is not a subclass of `TreeNode`. As more components are built on top of `DataFrame`, we have to think about this problem carefully. If possible, I think we need to resolve it in the Spark 2.0 release.

Will answer your design suggestion in a separate post.
[GitHub] spark pull request: [SPARK-11433] [SQL] Cleanup the subquery name ...
Github user gatorsmile closed the pull request at: https://github.com/apache/spark/pull/9385
[GitHub] spark pull request: [SPARK-11433] [SQL] Cleanup the subquery name ...
GitHub user gatorsmile reopened a pull request: https://github.com/apache/spark/pull/9385

[SPARK-11433] [SQL] Cleanup the subquery name after eliminating subquery

This fix removes the subquery name from the qualifiers after eliminating the subquery.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/gatorsmile/spark eliminateSubQ

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/9385.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #9385

commit db69ccf2c60679c4ca111a618190258d5b5cef62
Author: Xiao Li <xiaoli@xiaos-imac.local>
Date: 2015-10-30T22:12:19Z

    cleanup the subquery name after eliminating subquery

commit a26763d758bc58dacf81be171428ede215775532
Author: gatorsmile <gatorsm...@gmail.com>
Date: 2015-11-04T00:04:31Z

    flatmap->map
[GitHub] spark pull request: [SPARK-11433] [SQL] Cleanup the subquery name ...
Github user gatorsmile commented on the pull request: https://github.com/apache/spark/pull/9385#issuecomment-155605985

@marmbrus After rechecking the root cause of the Expand failures, I still think we should clean up the subquery name after subquery elimination. My current fix needs a change to enable a deeper cleanup of the subquery name. Let me explain what happened in Expand.

Before subquery elimination, the subquery name "mytable" is shown in all three upper levels (Aggregate, Expand and Project).

```scala
Aggregate [(a#2 + b#3)#7,b#3,grouping__id#6], [(a#2 + b#3)#7 AS _c0#4,b#3,sum(cast((a#2 - b#3) as bigint)) AS _c2#5L]
 Expand [0,1,3], [(a#2 + b#3)#7,b#3], grouping__id#6
  Project [a#2,b#3,(a#2 + b#3) AS (a#2 + b#3)#7]
   Subquery mytable
    Project [_1#0 AS a#2,_2#1 AS b#3]
     LocalRelation [_1#0,_2#1], [[1,2],[2,4]]
```

After subquery elimination, the subquery name "mytable" is not removed from these three levels.
[GitHub] spark pull request: [SPARK-10838][SPARK-11576][SQL][WIP] Incorrect...
GitHub user gatorsmile opened a pull request: https://github.com/apache/spark/pull/9548

[SPARK-10838][SPARK-11576][SQL][WIP] Incorrect results or exceptions when using self-joins

When resolving the ambiguity of attributeReferences caused by self joins, the current solution only handles the conflicting attributes. However, this does not work when the join conditions use column names that appear in both dataFrames, since the join conditions are evaluated before the ambiguity of conflicting attributes is resolved. Currently, we do not update the search condition: when generating the new expression IDs in the right tree, we must also update the corresponding columns' expression IDs in the search condition.

Here, I am trying to propose a solution to resolve this issue. When evaluating the join conditions, we record the dataFrame of the search-condition columns. Then, when resolving the ambiguity of conflicting attributes, we can use this information to know which columns are from the right tree, and update their expression IDs.

When designing this solution, I tried to minimize the code changes, and thus I am using quantifiers to record this information. Ideally, I think each column should clearly correlate with its original source; that requires a lot of code changes, but it would also help to optimize plans in the future.

Thanks for any suggestion!

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/gatorsmile/spark selfJoinConflictingConditions

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/9548.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #9548

commit 376691af5f639ca4f7ff07cc9e8f572d53e961bf
Author: xiaoli <lixiao1...@gmail.com>
Date: 2015-11-08T18:28:12Z

    Spark-10838

commit 7d047136cd710ee0e9ff34aa37c1e6d299165233
Author: xiaoli <lixiao1...@gmail.com>
Date: 2015-11-08T21:07:37Z

    Merge branch 'selfJoinCondition' into selfJoinConflictingConditions
[GitHub] spark pull request: [SPARK-10838][SPARK-11576][SQL][WIP] Incorrect...
Github user gatorsmile commented on the pull request: https://github.com/apache/spark/pull/9548#issuecomment-154881441

Since this solution adds quantifier comparison to the equality check of attributeReferences, it will fail a couple of test cases in Expand. We have already identified the bugs in Expand and submitted a pull request to resolve them: https://github.com/apache/spark/pull/9216
[GitHub] spark pull request: [SPARK-10838][SPARK-11576][SQL][WIP] Incorrect...
Github user gatorsmile commented on the pull request: https://github.com/apache/spark/pull/9548#issuecomment-155183305

I can't fix the problem without a major code change. The current design of dataFrame has a fundamental problem: when using column references, we might hit various strange issues if a dataFrame has columns with the same name and expression ID. Note that this might occur even if we do not have self joins. For example, in the following code:

```scala
val df1 = Seq((1, 3), (2, 1)).toDF("keyCol1", "keyCol2")
val df2 = Seq((1, 4, 0), (2, 1, 0)).toDF("keyCol1", "keyCol3", "keyColToDrop")
val df3 = df1.join(df2, df1("keyCol1") === df2("keyCol1"))
val col = df3("keyColToDrop")
val df = df2.drop(col)
df.printSchema()
```

Above, we can use a column reference of df3 to drop a column in df2. That does not make sense, right? Each column reference has to know its original data source.

@marmbrus @rxin @liancheng Should I propose a solution to fix this problem? Does the new Dataset API resolve this issue?
[GitHub] spark pull request: [SPARK-10838][SPARK-11576][SQL][WIP] Incorrect...
Github user gatorsmile commented on the pull request: https://github.com/apache/spark/pull/9548#issuecomment-154911463

To fix these failed cases, I will move the dataFrame's hashCode into the Column class, instead of directly putting the values into quantifiers.
[GitHub] spark pull request: [SPARK-11360] [Doc] Loss of nullability when w...
Github user gatorsmile commented on the pull request: https://github.com/apache/spark/pull/9314#issuecomment-155334571

Got it, thank you!
[GitHub] spark pull request: [SPARK-11360] [Doc] Loss of nullability when w...
Github user gatorsmile commented on the pull request: https://github.com/apache/spark/pull/9314#issuecomment-155309645

@marmbrus Should I reopen it? Thanks.
[GitHub] spark pull request: [SPARK-11275][SQL] Rollup and Cube Generates t...
Github user gatorsmile commented on the pull request: https://github.com/apache/spark/pull/9419#issuecomment-155973699

Thank you, Hao! Will do it in the next few days.
[GitHub] spark pull request: [Spark-11637][SQL] Regression in UDF: exceptio...
GitHub user gatorsmile opened a pull request: https://github.com/apache/spark/pull/9683

[Spark-11637][SQL] Regression in UDF: exceptions when using Stars and Alias

When using a UDF in Spark SQL, the query fails if a star and an alias are used at the same time. This worked in 1.4.x but fails in 1.5.x. For example:

```sql
select hash(*) as x from src
```

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/gatorsmile/spark hiveUDFStarAlias

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/9683.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #9683

commit 8e104094edd8e69f8c44ff8a0fc6d83a2d61dd07
Author: xiaoli <lixiao1...@gmail.com>
Date: 2015-11-13T03:05:21Z

    Spark-11637
[GitHub] spark pull request: [Spark-11637][SQL] Regression in UDF: exceptio...
Github user gatorsmile commented on the pull request: https://github.com/apache/spark/pull/9683#issuecomment-156319117

The issue has been fixed in https://github.com/apache/spark/pull/9343. I will close this PR.
[GitHub] spark pull request: [SPARK-11433] [SQL] Cleanup the subquery name ...
Github user gatorsmile commented on the pull request: https://github.com/apache/spark/pull/9385#issuecomment-156612181

Hi, @marmbrus Originally, I thought quantifiers were part of identifiers, like schema names in a traditional RDBMS. Based on your explanation, this is not true. I made a code change; please check whether the latest changes make sense. `semanticEquals` is now used, and all the test cases pass. https://github.com/gatorsmile/spark/commit/8e72b17561e4cc1a6cce86fc70f6ed968ebf5b38

I just merged with the latest master. Thank you for your time.
[GitHub] spark pull request: [SPARK-11433] [SQL] Cleanup the subquery name ...
Github user gatorsmile commented on the pull request: https://github.com/apache/spark/pull/9385#issuecomment-157239704

Sure, I will close it. Thank you for your time!
[GitHub] spark pull request: [SPARK-11433] [SQL] Cleanup the subquery name ...
Github user gatorsmile closed the pull request at: https://github.com/apache/spark/pull/9385
[GitHub] spark pull request: [SPARK-8658] [SQL] AttributeReference's equals...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/9216#discussion_r45011838

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/namedExpressions.scala ---
@@ -194,7 +194,9 @@ case class AttributeReference(
   def sameRef(other: AttributeReference): Boolean = this.exprId == other.exprId

   override def equals(other: Any): Boolean = other match {
-    case ar: AttributeReference => name == ar.name && exprId == ar.exprId && dataType == ar.dataType
+    case ar: AttributeReference =>
+      name == ar.name && dataType == ar.dataType && nullable == ar.nullable &&
+        metadata == ar.metadata && exprId == ar.exprId && qualifiers == ar.qualifiers
--- End diff --

sure, will do it tonight.
[GitHub] spark pull request: [SPARK-9928][SQL] Removal of LogicalLocalTable...
Github user gatorsmile commented on the pull request: https://github.com/apache/spark/pull/9717#issuecomment-156943032

Thanks!
[GitHub] spark pull request: [SPARK-11275][SQL] Rollup and Cube Generates t...
Github user gatorsmile commented on the pull request: https://github.com/apache/spark/pull/9419#issuecomment-155951403

Please let me know if I need to resolve these conflicts. @cloud-fan @chenghao-intel @marmbrus @rxin
[GitHub] spark pull request: [SPARK-10838][SPARK-11576][SQL][WIP] Incorrect...
Github user gatorsmile commented on the pull request: https://github.com/apache/spark/pull/9548#issuecomment-155912100

@cloud-fan So far, we do not have an easy fix, but I believe we should never return a wrong result for a self join. Let me post the test case I added. This test case returns an incorrect result without any exception:

```scala
test("[SPARK-10838] self join - conflicting attributes in condition - incorrect result 2") {
  val df1 = Seq((1, 3), (2, 1)).toDF("keyCol1", "keyCol2")
  val df2 = Seq((1, 4), (2, 1)).toDF("keyCol1", "keyCol3")
  val df3 = df1.join(df2, df1("keyCol1") === df2("keyCol1")).select(df1("keyCol1"), $"keyCol3")
  checkAnswer(
    df3.join(df1, df3("keyCol3") === df1("keyCol1") && df1("keyCol1") === df3("keyCol3")),
    Row(2, 1, 1, 3) :: Nil)
}
```

Before resolving this problem, what we can do is detect the situation and let customers use the workaround you mentioned. The detection condition is simple: the incorrect result can happen when the conflicting attributes contain an `AttributeReference` that appears in the join condition. Do you agree @cloud-fan @marmbrus? If OK, I will submit another PR for detecting it and issuing an exception with a meaningful message to users.
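For concreteness, the detection could look roughly like this (a hypothetical sketch with an invented helper name, not the final PR code):

```scala
import org.apache.spark.sql.catalyst.expressions.{Attribute, Expression}

// Hypothetical helper: the self join is ambiguous when any conflicting
// attribute from the duplicate-resolution step also occurs in the join
// condition; in that case we would raise an exception with a meaningful
// message instead of silently returning a wrong result.
def hasAmbiguousSelfJoinCondition(
    conflictingAttributes: Set[Attribute],
    joinCondition: Expression): Boolean =
  joinCondition.references.exists(conflictingAttributes.contains)
```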
[GitHub] spark pull request: [SPARK-8658] [SQL] [FOLLOW-UP] AttributeRefere...
Github user gatorsmile commented on the pull request: https://github.com/apache/spark/pull/9761#issuecomment-157414057

@nongli I saw you had a related discussion with @chenghao-intel. The failed test case was introduced in your PR https://github.com/apache/spark/pull/9480. I am not sure of the original reason why we intentionally exclude `name` from `hashCode` while the original `equals` includes `name`. It breaks a general principle of hashCode function design:

```
An object's hashCode method must take the same fields into account as its equals method.
```

Based on my understanding, in a case-sensitive HiveContext, we should still detect the difference when the case of `name` differs but the `exprId` values are the same.
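To spell out why that principle matters (a toy sketch with made-up classes, not the Spark ones): the dangerous direction of a mismatch is when `hashCode` consults a field that `equals` ignores, because then two equal objects can land in different hash buckets.

```scala
import scala.collection.mutable

// Toy example: `equals` compares only exprId, while `hashCode` also mixes
// in `name`, so equal objects can hash differently and hash-based
// collections silently misbehave.
class AttrRef(val name: String, val exprId: Long) {
  override def equals(other: Any): Boolean = other match {
    case a: AttrRef => exprId == a.exprId               // `name` ignored here...
    case _ => false
  }
  override def hashCode: Int = (name, exprId).hashCode  // ...but consulted here
}

val a = new AttrRef("col", 1)
val b = new AttrRef("COL", 1)
println(a == b)                          // true
println(a.hashCode == b.hashCode)        // false: the contract is broken
println(mutable.HashSet(a).contains(b))  // typically false, despite a == b
```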
[GitHub] spark pull request: [SPARK-8658] [SQL] [FOLLOW-UP] AttributeRefere...
Github user gatorsmile commented on the pull request: https://github.com/apache/spark/pull/9761#issuecomment-157434770

Ok. I will also add three more lines to cover the new `hashCode` and `equals` functions.
[GitHub] spark pull request: [SPARK-11072][SQL] simplify self join handling
Github user gatorsmile commented on the pull request: https://github.com/apache/spark/pull/9081#issuecomment-157440817

@cloud-fan I am wondering whether this will be merged soon. I am not sure whether I should fix a couple of self-join issues before your merge, or hold off until this PR is merged.
[GitHub] spark pull request: [SPARK-11433] [SQL] Cleanup the subquery name ...
Github user gatorsmile commented on the pull request: https://github.com/apache/spark/pull/9385#issuecomment-15565

Hi, @marmbrus After digging into the root cause of why the Expand cases failed, I found we still need a deeper cleanup of the subquery name after elimination. Let me use the following example to explain what happened in Expand. This query works only if we do not compare the qualifiers when comparing two AttributeReferences; I think it becomes a bug once https://github.com/apache/spark/pull/9216 is merged, right?

```scala
val sqlDF = sql("select a, b, sum(a) from mytable group by a, b with rollup").explain(true)
```

Before subquery elimination, the subquery name "mytable" is shown in both upper layers (Aggregate and Expand).

```scala
Aggregate [a#2,b#3,grouping__id#5], [a#2,b#3,sum(cast(a#2 as bigint)) AS _c2#4L]
 Expand [0,1,3], [a#2,b#3], grouping__id#5
  Subquery mytable
   Project [_1#0 AS a#2,_2#1 AS b#3]
    LocalRelation [_1#0,_2#1], [[1,2],[2,4]]
```

After subquery elimination, the subquery name "mytable" is not removed from these two upper layers.

```scala
Aggregate [a#2,b#3,grouping__id#5], [a#2,b#3,sum(cast(a#2 as bigint)) AS _c2#4L]
 Expand [0,1,3], [a#2,b#3], grouping__id#5
  Project [_1#0 AS a#2,_2#1 AS b#3]
   LocalRelation [_1#0,_2#1], [[1,2],[2,4]]
```

In SparkStrategies, we create an array of Projections for the child projection of Expand:

```scala
case e @ logical.Expand(_, _, _, child) =>
  execution.Expand(e.projections, e.output, planLater(child)) :: Nil
```

`e.projections` calls the function `expand()`. Inside `expand()`, I do not think we should use `semanticEquals`. Let me post the incorrect physical plan:

```scala
TungstenAggregate(key=[a#2,b#3,grouping__id#12], functions=[(sum(cast(a#2 as bigint)),mode=Final,isDistinct=false)], output=[a#2,b#3,_c2#11L])
 TungstenExchange hashpartitioning(a#2,b#3,grouping__id#12,5)
  TungstenAggregate(key=[a#2,b#3,grouping__id#12], functions=[(sum(cast(a#2 as bigint)),mode=Partial,isDistinct=false)], output=[a#2,b#3,grouping__id#12,currentSum#15L])
   Expand [List(a#2, b#3, 0),List(a#2, b#3, 1),List(a#2, b#3, 3)], [a#2,b#3,grouping__id#12]
    LocalTableScan [a#2,b#3], [[1,2],[2,4]]
```

For your convenience, below is the correct one:

```scala
TungstenAggregate(key=[a#2,b#3,grouping__id#12], functions=[(sum(cast(a#2 as bigint)),mode=Final,isDistinct=false)], output=[a#2,b#3,_c2#11L])
 TungstenExchange hashpartitioning(a#2,b#3,grouping__id#12,5)
  TungstenAggregate(key=[a#2,b#3,grouping__id#12], functions=[(sum(cast(a#2 as bigint)),mode=Partial,isDistinct=false)], output=[a#2,b#3,grouping__id#12,currentSum#15L])
   Expand [List(null, null, 0),List(a#2, null, 1),List(a#2, b#3, 3)], [a#2,b#3,grouping__id#12]
    LocalTableScan [a#2,b#3], [[1,2],[2,4]]
```

My current fix does not address this issue yet.
[GitHub] spark pull request: [SPARK-10838][SPARK-11576][SQL][WIP] Incorrect...
Github user gatorsmile commented on the pull request: https://github.com/apache/spark/pull/9548#issuecomment-155226523

@marmbrus Thank you for your suggestions! That was also my initial idea. I gave it a try last night. Unfortunately, I hit a problem when adding such a field to the `Column` API.

In the current design, the class `Column` corresponds to the class `Expression`, which includes both `AttributeReference` and other types. For `Column`, it makes sense to have such a dataFrame identifier. However, when a `Column` is generated from a binary expression type (e.g., `gt`), it could have more than one dataFrame identifier. Does that sound right to you?

When implementing the idea, it becomes more difficult. For example, in the following binary operator:

```scala
def === (other: Any): Column = {
  val right = lit(other).expr
  EqualTo(expr, right)
}
```

`EqualTo` is an `Expression`, and `expr` and `right` are not `Column`s. Thus, when accessing the `Column` generated from `===`, we are unable to know the dataFrame sources of `expr` and `right` unless we change `AttributeReference`. That is why I think this could mean a major code change to `DataFrame` and `Column`. Thank you for any further suggestions.
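To make the difficulty concrete, here is a hypothetical sketch of what the threading would require (all names invented for illustration; `sources` carries the originating dataFrames' hash codes, per the idea above):

```scala
import org.apache.spark.sql.catalyst.expressions.{EqualTo, Expression}

// Hypothetical wrapper, not the real API: each column carries the hash
// codes of the dataFrames it came from, and a binary operator such as ===
// has to union the source sets of both sides.
case class TrackedColumn(expr: Expression, sources: Set[Int]) {
  def === (other: TrackedColumn): TrackedColumn =
    TrackedColumn(EqualTo(expr, other.expr), sources ++ other.sources)
}
```

Even then, any plain `Expression` built outside such a wrapper would still lose the source information, which is why the change ultimately seems to reach down to `AttributeReference`.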
[GitHub] spark pull request: [SPARK-11275][SQL][WIP] Rollup and Cube Genera...
Github user gatorsmile commented on the pull request: https://github.com/apache/spark/pull/9419#issuecomment-153148409

@hvanhovell Your understanding is right. If we merge both grouping and aggregation together, it will introduce extra complexity into generating the logical plan for a case like `select a + b, b, sum(a - b), sum(a) from mytable group by a + b, b with rollup`. Of course, in theory, it is doable, but the code will be harder to maintain in the future. The extra Project will be collapsed by the optimizer; thus, in the analyzer, I just introduce the extra Project.

I am writing unit test cases. Will try to deliver them ASAP.
[GitHub] spark pull request: [SPARK-11275][SQL][WIP] Rollup and Cube Genera...
Github user gatorsmile commented on the pull request: https://github.com/apache/spark/pull/9419#issuecomment-153146798

@holdenk This is the PR I mentioned in the email. Could you review it too?
[GitHub] spark pull request: [SPARK-11275][SQL][WIP] Rollup and Cube Genera...
GitHub user gatorsmile opened a pull request: https://github.com/apache/spark/pull/9419

[SPARK-11275][SQL][WIP] Rollup and Cube Generates the Incorrect Results when Aggregation Functions Use Group By Columns

In the current implementation, Rollup and Cube are unable to generate correct results for the following cases.

When the aggregation functions use the group-by key columns:

```scala
sql("select b, a, sum(a), min(a), min(b+b) from mytable group by a, b with rollup").collect()
sql("select a, b, sum(a), min(a), min(b+b) from mytable group by b, a with cube").collect()
```

The problem becomes more complex if the group-by clause contains functions whose inputs also appear in the group by:

```scala
sql("select a + b, b, sum(a - b) from mytable group by a + b, b with rollup").collect()
sql("select a + b, b, sum(a - b) from mytable group by a + b, b with cube").collect()
```

The basic solution is to add an extra Projection when the query falls into one of the above situations. The projection duplicates the affected columns under alias names so that their values are not lost when Expand is evaluated at runtime (see the sketch after this message).

Working on the test cases. Will add more cases to the Hive golden answer files. Welcome any comment and suggestion! Thank you!

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/gatorsmile/spark rollupCube

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/9419.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #9419

commit b10418e161d5809f3b1de92cf4a33b2f362cd2b4
Author: Xiao Li <xiaoli@xiaos-imac.local>
Date: 2015-11-02T09:05:31Z

    Spark-11275

commit 7721442cdf65924af204d39e3b3b7bda6c41dfc6
Author: Xiao Li <xiaoli@xiaos-imac.local>
Date: 2015-11-02T09:51:16Z

    syntax cleaning
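In DataFrame terms, the injected projection corresponds to a hand-written aliasing step like the sketch below (illustration only; the analyzer performs this on the logical plan, and `sqlContext` is an assumed existing SQLContext):

```scala
import sqlContext.implicits._  // assumes an existing SQLContext named `sqlContext`

val mytable = Seq((1, 2), (2, 4)).toDF("a", "b")
// Duplicate the affected group-by column under an alias, so that Expand can
// substitute null for the grouping copy of `b` while an aggregate such as
// sum(a - b) still reads the original value through the aliased copy.
val withDup = mytable.select($"a", $"b", $"b".as("b_dup"))
```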
[GitHub] spark pull request: [SPARK-11275][SQL][WIP] Rollup and Cube Genera...
Github user gatorsmile commented on the pull request: https://github.com/apache/spark/pull/9419#issuecomment-153145967

Hi, Rick,

1) This is a defect I identified. It blocks my PR. It was introduced in the initial implementation; thus, it is not a regression.
2) I updated my PR summary with a few query examples.
3) It is limited to ResolveGroupingAnalytics; thus, the only affected queries are group-by queries. I tried to follow the original approach so that the coding style stays consistent. I am not sure whether I need to put more comments in the code.

Thanks, Xiao Li
[GitHub] spark pull request: [SPARK-11433] [SQL] Cleanup the subquery name ...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/9385#discussion_r43559826

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
@@ -1019,7 +1019,16 @@ class Analyzer(
  * scoping information for attributes and can be removed once analysis is complete.
  */
 object EliminateSubQueries extends Rule[LogicalPlan] {
-  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
+  def apply(plan: LogicalPlan): LogicalPlan = plan transformDown {
+    case Project(projectList, child: Subquery) => {
+      Project(
+        projectList.flatMap {
+          case ar: AttributeReference if ar.qualifiers.contains(child.alias) =>
--- End diff --

Should I use NamedExpression to replace AttributeReference?
[GitHub] spark pull request: [SPARK-11360] [Doc] Loss of nullability when w...
Github user gatorsmile commented on the pull request: https://github.com/apache/spark/pull/9314#issuecomment-152656067

@marmbrus : as you suggested, I submitted the pull request. Could you review it?

Thanks, Xiao Li
[GitHub] spark pull request: [SPARK-11433] [SQL] Cleanup the subquery name ...
GitHub user gatorsmile opened a pull request: https://github.com/apache/spark/pull/9385

[SPARK-11433] [SQL] Cleanup the subquery name after eliminating subquery

This fix removes the subquery name from the qualifiers after eliminating the subquery.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/gatorsmile/spark eliminateSubQ

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/9385.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #9385

commit db69ccf2c60679c4ca111a618190258d5b5cef62
Author: Xiao Li <xiaoli@xiaos-imac.local>
Date: 2015-10-30T22:12:19Z

    cleanup the subquery name after eliminating subquery
[GitHub] spark pull request: [SPARK-11433] [SQL] Cleanup the subquery name ...
Github user gatorsmile commented on the pull request: https://github.com/apache/spark/pull/9385#issuecomment-152708351

So far, I have only observed these strange leftover values when reading the optimized logical tree; my query did not trigger any actual issue. Based on my understanding, the usage of qualifiers is still limited in the current code base, but it could become a real issue when we support more complex SQL syntax/functions. Thus, I submitted this pull request to resolve it.
[GitHub] spark pull request: [SPARK-4226][SQL]Add subquery (not) in/exists ...
Github user gatorsmile commented on the pull request: https://github.com/apache/spark/pull/9055#issuecomment-153857289

@jameszhouyi We hit the same issue. Now, we bypass it by using joins.
[GitHub] spark pull request: [SPARK-4226][SQL]Add subquery (not) in/exists ...
Github user gatorsmile commented on the pull request: https://github.com/apache/spark/pull/9055#issuecomment-153920042

@jameszhouyi Agree. This is an important feature for any SQL engine. We are also waiting for it. So far, using joins is a workaround.
[GitHub] spark pull request: [SPARK-11275][SQL][WIP] Rollup and Cube Genera...
Github user gatorsmile commented on the pull request: https://github.com/apache/spark/pull/9419#issuecomment-153451771

@chenghao-intel @hvanhovell Unit test cases have been added. I will finish the code changes to resolve the comments from @holdenk @rick-ibm.

@rxin @marmbrus @liancheng @yhuai I am wondering whether my incremental, low-risk fix will be merged into Spark 1.6? If not, I personally prefer fixing all the bugs and improving on the solution by @aray (Andrew Ray). That solution simplifies the implementation of rollup and cube.
[GitHub] spark pull request: [SPARK-11433] [SQL] Cleanup the subquery name ...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/9385#discussion_r43817697

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
@@ -1019,7 +1019,16 @@ class Analyzer(
  * scoping information for attributes and can be removed once analysis is complete.
  */
 object EliminateSubQueries extends Rule[LogicalPlan] {
-  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
+  def apply(plan: LogicalPlan): LogicalPlan = plan transformDown {
--- End diff --

Thank you for your comments! If we do transformUp, the subquery is removed first, so the `Project(projectList, child: Subquery)` case is never applicable.
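For readers following along, here is a toy illustration of that ordering effect (simplified stand-ins, not the real Catalyst `TreeNode` API): with bottom-up traversal, the `Subquery` node is already gone by the time its parent `Project` is inspected, so the combined pattern can never fire.

```scala
// Toy stand-ins for Catalyst nodes (not the real classes), showing why the
// rule must be applied top-down.
sealed trait Plan {
  def children: Seq[Plan]
  def withChildren(cs: Seq[Plan]): Plan
  def transformDown(rule: PartialFunction[Plan, Plan]): Plan = {
    val applied = rule.applyOrElse(this, identity[Plan])
    applied.withChildren(applied.children.map(_.transformDown(rule)))
  }
  def transformUp(rule: PartialFunction[Plan, Plan]): Plan = {
    val applied = withChildren(children.map(_.transformUp(rule)))
    rule.applyOrElse(applied, identity[Plan])
  }
}
case class Relation(name: String) extends Plan {
  def children = Nil
  def withChildren(cs: Seq[Plan]) = this
}
case class Subquery(alias: String, child: Plan) extends Plan {
  def children = Seq(child)
  def withChildren(cs: Seq[Plan]) = copy(child = cs.head)
}
case class Project(cols: Seq[String], child: Plan) extends Plan {
  def children = Seq(child)
  def withChildren(cs: Seq[Plan]) = copy(child = cs.head)
}

// Clean "alias." qualifiers when a Project sits directly on a Subquery,
// and eliminate bare Subquery nodes otherwise.
val rule: PartialFunction[Plan, Plan] = {
  case Project(cols, Subquery(alias, grandchild)) =>
    Project(cols.map(_.stripPrefix(alias + ".")), grandchild)
  case Subquery(_, child) => child
}

val plan = Project(Seq("mytable.a"), Subquery("mytable", Relation("t")))
println(plan.transformDown(rule)) // Project(List(a),Relation(t)): qualifier cleaned
println(plan.transformUp(rule))   // Project(List(mytable.a),Relation(t)): Project case never matched
```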
[GitHub] spark pull request: [SPARK-6231][SQL/DF] Automatically resolve joi...
Github user gatorsmile commented on the pull request: https://github.com/apache/spark/pull/5919#issuecomment-154612297

@rxin @marmbrus This fix cannot resolve the condition ambiguity for nested self joins. I also found that self joins can generate incorrect results.
[GitHub] spark pull request: [SPARK-11433] [SQL] Cleanup the subquery name ...
Github user gatorsmile commented on the pull request: https://github.com/apache/spark/pull/9385#issuecomment-154650945

@marmbrus Thanks! I will try to change equals to semanticEquals in the pull request https://github.com/apache/spark/pull/9216. Then, you can decide whether this is the right solution.
[GitHub] spark pull request: [SPARK-11433] [SQL] Cleanup the subquery name ...
Github user gatorsmile commented on the pull request: https://github.com/apache/spark/pull/9385#issuecomment-154609690

@marmbrus I already hit this issue when resolving https://issues.apache.org/jira/browse/SPARK-8658. It means that, when comparing two AttributeReferences, we should not compare their qualifiers. That seems like a strange fix, right?
[GitHub] spark pull request: [SPARK-11433] [SQL] Cleanup the subquery name ...
Github user gatorsmile commented on the pull request: https://github.com/apache/spark/pull/9385#issuecomment-153529973

@cloud-fan @dbtsai, Jenkins did not start the tests. Could you trigger Jenkins to test it? Thank you!
[GitHub] spark pull request: [SPARK-11433] [SQL] Cleanup the subquery name ...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/9385#discussion_r43826123

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
@@ -1019,7 +1019,16 @@ class Analyzer(
  * scoping information for attributes and can be removed once analysis is complete.
  */
 object EliminateSubQueries extends Rule[LogicalPlan] {
-  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
+  def apply(plan: LogicalPlan): LogicalPlan = plan transformDown {
+    case Project(projectList, child: Subquery) => {
+      Project(
+        projectList.flatMap {
--- End diff --

Thank you! I did the change based on your suggestion. : )
[GitHub] spark pull request: [SPARK-8658] [SQL] AttributeReference's equals...
Github user gatorsmile commented on the pull request: https://github.com/apache/spark/pull/9216#issuecomment-153511090

@JoshRosen @cloud-fan I submitted a pull request for JIRA Spark-11275: https://github.com/apache/spark/pull/9419 Hopefully, after that problem is fixed, this one can pass all the tests. Thanks!
[GitHub] spark pull request: [SPARK-11433] [SQL] Cleanup the subquery name ...
Github user gatorsmile commented on the pull request: https://github.com/apache/spark/pull/9385#issuecomment-153620226

@dbtsai Thank you! Please let me know if you need any extra code change.
[GitHub] spark pull request: [SPARK-11275][SQL] Rollup and Cube Generates t...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/9419#discussion_r43850164

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
@@ -232,7 +232,7 @@ class Analyzer(
 // substitute the group by expressions.
 val newGroupByExprs = groupByExprPairs.map(_._2)
--- End diff --

Hi, @chenghao-intel, could you explain it a little more? So far, this query is processed correctly and returns a correct result. Since b is used inside an aggregate function, the fix added an extra column for b. Below is the generated plan:

```scala
== Analyzed Logical Plan ==
ab: bigint
Aggregate [b#3,grouping__id#12], [sum(cast((a#2 - b#3#13) as bigint)) AS ab#4L]
 Expand [0,1], [b#3], grouping__id#12
  Project [a#2,b#3,b#3 AS b#3#13]
   Subquery mytable
    Project [_1#0 AS a#2,_2#1 AS b#3]
     LocalRelation [_1#0,_2#1], [[1,2],[2,4],[2,9]]

== Optimized Logical Plan ==
Aggregate [b#3,grouping__id#12], [sum(cast((a#2 - b#3#13) as bigint)) AS ab#4L]
 Expand [0,1], [b#3], grouping__id#12
  LocalRelation [a#2,b#3,b#3#13], [[1,2,2],[2,4,4],[2,9,9]]
```
[GitHub] spark pull request: [SPARK-11275][SQL][WIP] Rollup and Cube Genera...
Github user gatorsmile commented on the pull request: https://github.com/apache/spark/pull/9419#issuecomment-153192515

@rick-ibm Will add more comments to explain it. In particular, I will emphasize that this design expects the optimizer to collapse these two projections into a single one. @chenghao-intel, could you also review the code changes? Does the solution look OK? I really appreciate your original work; it looks very concise to me. @holdenk Got it. Will follow your suggestions, do more code cleanup, and send you another review request. Thank you!
[GitHub] spark pull request: [SPARK-11275][SQL] Rollup and Cube Generates t...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/9419#discussion_r44107333

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
@@ -232,7 +232,7 @@ class Analyzer(
         // substitute the group by expressions.
         val newGroupByExprs = groupByExprPairs.map(_._2)
--- End diff --

@chenghao-intel, good catch! Thank you! This issue is fixed once you pull in the latest change. I also added two more test cases to cover it.
[GitHub] spark pull request: [SPARK-11633] [SQL] HiveContext's Case Insensi...
Github user gatorsmile commented on the pull request: https://github.com/apache/spark/pull/9762#issuecomment-157786406

@cloud-fan @marmbrus Will follow your suggestions to update the fix. Thanks!
[GitHub] spark pull request: [SPARK-11803][SQL] fix Dataset self-join
Github user gatorsmile commented on the pull request: https://github.com/apache/spark/pull/9806#issuecomment-157749632

Your code looks pretty clean to me. Let me share the test cases that still fail with this PR:

```
test("joinWith tuple - self join 1") {
  val ds = Seq(("a", 1), ("b", 2)).toDS()
  ds.joinWith(ds, $"_2" === $"_2").collect()
}

test("joinWith tuple - self join 2") {
  val ds1 = Seq(("a", 1), ("b", 2)).toDS()
  val ds2 = Seq(("a", 1), ("b", 2)).toDS().as("a")
  ds1.joinWith(ds2, $"_2" === $"a._2").collect()
}
```

Do you want me to send you a PR, or will you fix them? Thank you!
[GitHub] spark pull request: [SPARK-11803][SQL] fix Dataset self-join
Github user gatorsmile commented on the pull request: https://github.com/apache/spark/pull/9806#issuecomment-157807682

Sure. Will do. Thanks!

2015-11-18 10:16 GMT-08:00 Michael Armbrust <notificati...@github.com>:

> LGTM, merging to master and 1.6.
>
> @gatorsmile <https://github.com/gatorsmile> please open JIRAs targeted at
> 1.6.0 for the bugs you have found. (also use checkAnswer when writing
> test cases). Thanks!
>
> —
> Reply to this email directly or view it on GitHub
> <https://github.com/apache/spark/pull/9806#issuecomment-157806994>.
[GitHub] spark pull request: [SPARK-11633] [SQL] HiveContext's Case Insensi...
Github user gatorsmile commented on the pull request: https://github.com/apache/spark/pull/9762#issuecomment-157857086

@marmbrus @cloud-fan I made the change based on your comments; please review the new version. I also tried the fix after excluding the change in `attributeRewrites`, and the newly introduced test case still works fine; that means this should be the root cause. The fix still keeps the extra filter in `attributeRewrites`, since I think it avoids extra comparisons and replacements in the subsequent transformations. Please let me know if you want the filter removed. Thanks!
[GitHub] spark pull request: [SPARK-9928][SQL] Removal of LogicalLocalTable...
Github user gatorsmile commented on the pull request: https://github.com/apache/spark/pull/9717#issuecomment-156738825

retest this please.
[GitHub] spark pull request: [SPARK-11433] [SQL] Cleanup the subquery name ...
Github user gatorsmile commented on the pull request: https://github.com/apache/spark/pull/9385#issuecomment-156730810

@marmbrus CachedTableSuite failed for the same reason: we did not clean up the subquery names, so the check that decides whether an Exchange is needed could not give a correct result. I fixed it by using `semanticEquals`; please check whether the changes are appropriate. https://github.com/apache/spark/pull/9216 Now all the test cases pass. Thanks.
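As a hedged illustration of why `semanticEquals` helps here (Spark 1.6-era Catalyst API assumed): two attribute references that share an `exprId` but carry different qualifiers, such as a leftover subquery name, stop being `==` once equals compares all members, yet they still refer to the same column.

```scala
import org.apache.spark.sql.catalyst.expressions.AttributeReference
import org.apache.spark.sql.types.IntegerType

val a = AttributeReference("id", IntegerType)()      // no qualifier
val b = a.withQualifiers("mytable" :: Nil)           // same exprId, subquery-qualified

assert(!(a == b))           // equals now compares qualifiers too
assert(a semanticEquals b)  // but semantically it is the same column
```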
[GitHub] spark pull request: [SPARK-9928][SQL] Removal of LogicalLocalTable...
GitHub user gatorsmile opened a pull request: https://github.com/apache/spark/pull/9717

[SPARK-9928][SQL] Removal of LogicalLocalTable

LogicalLocalTable in ExistingRDD.scala was replaced by LocalRelation in LocalRelation.scala. Is there any reason why we still keep this class?

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/gatorsmile/spark LogicalLocalTable

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/9717.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #9717

commit 01e4cdfcfc4ac37644165923c6e8eb65fcfdf3ac
Author: gatorsmile <gatorsm...@gmail.com>
Date: 2015-11-13T22:50:39Z
Merge remote-tracking branch 'upstream/master'

commit e25b3785b923d44b3d48fe4100c4672d85787318
Author: gatorsmile <gatorsm...@gmail.com>
Date: 2015-11-14T02:43:37Z
Merge remote-tracking branch 'upstream/master' into LogicalLocalTable

commit 7555b76633fdeff6dd65c97be41e733cc28ba04c
Author: gatorsmile <gatorsm...@gmail.com>
Date: 2015-11-14T02:44:22Z
Merge branch 'master' of https://github.com/gatorsmile/spark into LogicalLocalTable merge.

commit 6835704c273abc13e8eda37f5a10715027e4d17b
Author: gatorsmile <gatorsm...@gmail.com>
Date: 2015-11-14T02:50:51Z
Merge remote-tracking branch 'upstream/master'

commit 3a7d4654e3c7e69bafbe14d7c5b6158666e36b0e
Author: gatorsmile <gatorsm...@gmail.com>
Date: 2015-11-14T16:59:31Z
Merge remote-tracking branch 'upstream/master' into LogicalLocalTable

commit cb49f8cb79b7f9fcca0b62b0645709c7c8c539dc
Author: gatorsmile <gatorsm...@gmail.com>
Date: 2015-11-14T16:59:56Z
Merge branch 'master' of https://github.com/gatorsmile/spark into LogicalLocalTable

commit 9180687775649f97763bdbd7c004fe6fc392989c
Author: gatorsmile <gatorsm...@gmail.com>
Date: 2015-11-14T17:01:59Z
Merge remote-tracking branch 'upstream/master'

commit 45ef950ff8c6c082e3fb1de84f85329060daf27c
Author: gatorsmile <gatorsm...@gmail.com>
Date: 2015-11-14T17:03:36Z
Merge branch 'master' of https://github.com/gatorsmile/spark into LogicalLocalTable

commit 195d176da9d4a58650690c2f5cc3ba27883b63ad
Author: gatorsmile <gatorsm...@gmail.com>
Date: 2015-11-14T17:45:22Z
SPARK-9928
[GitHub] spark pull request: [SPARK-9928][SQL] Removal of LogicalLocalTable...
Github user gatorsmile commented on the pull request: https://github.com/apache/spark/pull/9717#issuecomment-156738584

The failure of this test case is not related to the code changes.
[GitHub] spark pull request: [SPARK-9928][SQL] Removal of LogicalLocalTable...
Github user gatorsmile commented on the pull request: https://github.com/apache/spark/pull/9717#issuecomment-156739572

@srowen Could you review the changes? Thanks!
[GitHub] spark pull request: [SPARK-9928][SQL] Removal of LogicalLocalTable...
Github user gatorsmile commented on the pull request: https://github.com/apache/spark/pull/9717#issuecomment-156745623

Another case failed for the same reason:

```
[error] Test org.apache.spark.ml.util.JavaDefaultReadWriteSuite.testDefaultReadWrite failed: java.lang.IllegalStateException: Cannot call methods on a stopped SparkContext.
```

Timing issues? Or introduced by a recent merge?
[GitHub] spark pull request: [SPARK-11633] [SQL] HiveContext's Case Insensi...
GitHub user gatorsmile opened a pull request: https://github.com/apache/spark/pull/9762

[SPARK-11633] [SQL] HiveContext's Case Insensitivity in Self-Join Handling

When handling self-joins, the implementation did not take HiveContext's case insensitivity into account. It could cause an exception, as shown in the JIRA:

```
TreeNodeException: Failed to copy node.
```

The fix is low risk: it only avoids unnecessary attribute replacement, so it should not affect the existing behavior of self-joins. Also added a test case to cover this.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/gatorsmile/spark joinMakeCopy

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/9762.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #9762

commit 01e4cdfcfc4ac37644165923c6e8eb65fcfdf3ac
Author: gatorsmile <gatorsm...@gmail.com>
Date: 2015-11-13T22:50:39Z
Merge remote-tracking branch 'upstream/master'

commit 6835704c273abc13e8eda37f5a10715027e4d17b
Author: gatorsmile <gatorsm...@gmail.com>
Date: 2015-11-14T02:50:51Z
Merge remote-tracking branch 'upstream/master'

commit 9180687775649f97763bdbd7c004fe6fc392989c
Author: gatorsmile <gatorsm...@gmail.com>
Date: 2015-11-14T17:01:59Z
Merge remote-tracking branch 'upstream/master'

commit b38a21ef6146784e4b93ef4ce8c899f1eee14572
Author: gatorsmile <gatorsm...@gmail.com>
Date: 2015-11-17T02:30:26Z
SPARK-11633

commit d2b84af8cce7fc2c03c748a2d443c07bad3f0ed1
Author: gatorsmile <gatorsm...@gmail.com>
Date: 2015-11-17T02:32:12Z
Merge remote-tracking branch 'upstream/master' into joinMakeCopy

commit a15f267206215352f91f0699d813b0d71b15f11f
Author: gatorsmile <gatorsm...@gmail.com>
Date: 2015-11-17T03:40:41Z
scala style fix.

commit 7d48e1e95d39656317235d274b353c8645e3f93d
Author: gatorsmile <gatorsm...@gmail.com>
Date: 2015-11-17T04:55:24Z
Merge remote-tracking branch 'upstream/master' into joinMakeCopy
[GitHub] spark pull request: [SPARK-8658] [SQL] [FOLLOW-UP] AttributeRefere...
GitHub user gatorsmile opened a pull request: https://github.com/apache/spark/pull/9761

[SPARK-8658] [SQL] [FOLLOW-UP] AttributeReference's equals method compares all the members

Based on the comment from @cloud-fan, this updates AttributeReference's hashCode function to also include the hash codes of the other members: name, nullable, and qualifiers. I am not 100% sure whether we should include name in the hashCode calculation, since the original calculation does not include it. @marmbrus @cloud-fan Please review whether the changes are good.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/gatorsmile/spark hashCodeNamedExpression

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/9761.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #9761

commit eb63f097df5595cebf09954bcd188a87c5ebfdb0
Author: gatorsmile <gatorsm...@gmail.com>
Date: 2015-11-17T07:47:19Z
follow-up: SPARK-8658
[GitHub] spark pull request: [SPARK-12028] [SQL] get_json_object returns an...
GitHub user gatorsmile opened a pull request: https://github.com/apache/spark/pull/10018

[SPARK-12028] [SQL] get_json_object returns an incorrect result when the value is a null literal

When calling `get_json_object` for the following two cases, both results are the string `"null"`, but only the second case (where the value is the four-character string "null") should return that; in the first case the JSON value is the literal null, so the result should be SQL NULL:

```scala
val tuple: Seq[(String, String)] = ("5", """{"f1": null}""") :: Nil
val df: DataFrame = tuple.toDF("key", "jstring")
val res = df.select(functions.get_json_object($"jstring", "$.f1")).collect()
```

```scala
val tuple2: Seq[(String, String)] = ("5", """{"f1": "null"}""") :: Nil
val df2: DataFrame = tuple2.toDF("key", "jstring")
val res3 = df2.select(functions.get_json_object($"jstring", "$.f1")).collect()
```

Fixed the problem and also added a test case.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/gatorsmile/spark get_json_object

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/10018.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #10018

commit 06d9eae73e4b40a0451d1d21f7174aacbf05f780
Author: gatorsmile <gatorsm...@gmail.com>
Date: 2015-11-27T18:59:31Z
fixed a bug of get_json_object

commit 54edc84f21918c5cb69a0abfc51f680190f27a1f
Author: gatorsmile <gatorsm...@gmail.com>
Date: 2015-11-27T19:04:13Z
Merge remote-tracking branch 'upstream/master' into get_json_object
[GitHub] spark pull request: [SPARK-12195] [SQL] Adding BigDecimal, Date an...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/10188#discussion_r46917559

--- Diff: sql/core/src/test/java/test/org/apache/spark/sql/JavaDatasetSuite.java ---
@@ -386,6 +389,20 @@ public void testNestedTupleEncoder() {
   }

   @Test
+  public void testTypeEncoder() {
--- End diff --

Sure. Thank you! Let me change it now.
[GitHub] spark pull request: [SPARK-12091] [PYSPARK] [Minor] Default storag...
Github user gatorsmile commented on the pull request: https://github.com/apache/spark/pull/10092#issuecomment-161372323

@mateiz Thank you for your answer! Will try to do it soon.
[GitHub] spark pull request: [SPARK-12113] [SQL] Add some timing metrics fo...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/10116#discussion_r46501447

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/metric/SQLMetrics.scala ---
@@ -149,6 +149,32 @@ private[sql] object SQLMetrics {
   }

   /**
+   * Create a timing metric that reports duration in millis relative to startTime.
+   *
+   * The expected usage pattern is:
+   *   On the driver:
+   *     metric = createTimingMetric(..., System.currentTimeMillis)
+   *   On each executor:
+   *     <Do some work>
+   *     metric += System.currentTimeMillis
+   * The metric will then output the latest value across all the executors. This is a proxy for
+   * wall clock latency as it measures when the last executor finished this stage.
+   */
+  def createTimingMetric(sc: SparkContext, name: String, startTime: Long): LongSQLMetric = {
+    val stringValue = (values: Seq[Long]) => {
+      val validValues = values.filter(_ >= startTime)
+      if (validValues.isEmpty) {
+        // The clocks between the different machines are not perfectly synced so this can happen.
+        "0"
--- End diff --

This is a nice feature for performance investigation! Should we detect if the machine clocks are synced when starting Spark?
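A hedged usage sketch that just restates the doc comment in the diff above; the metric name is illustrative, and `SQLMetrics` is `private[sql]`, so this only compiles inside that package.

```scala
// Driver side: capture the stage start and create the metric.
val startTime = System.currentTimeMillis
val scanTime = SQLMetrics.createTimingMetric(sc, "scan time", startTime)

// Executor side, after finishing the partition's work: the accumulator
// keeps the latest finish time across executors, so the reported value
// approximates wall-clock stage latency relative to startTime.
scanTime += System.currentTimeMillis
```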
[GitHub] spark pull request: [SPARK-12091] [PYSPARK] Removal of the JAVA-sp...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/10092#discussion_r46520645

--- Diff: python/pyspark/storagelevel.py ---
@@ -49,12 +51,8 @@ def __str__(self):
 StorageLevel.DISK_ONLY = StorageLevel(True, False, False, False)
 StorageLevel.DISK_ONLY_2 = StorageLevel(True, False, False, False, 2)
-StorageLevel.MEMORY_ONLY = StorageLevel(False, True, False, True)
-StorageLevel.MEMORY_ONLY_2 = StorageLevel(False, True, False, True, 2)
-StorageLevel.MEMORY_ONLY_SER = StorageLevel(False, True, False, False)
--- End diff --

Agreed! Just updated the code with the deprecation notes, trying to follow the existing PySpark style. Please check whether they look good. : ) I am not sure whether this will be merged into 1.6; the note still says 1.6. Thank you!
[GitHub] spark pull request: [SPARK-12091] [PYSPARK] Removal of the JAVA-sp...
Github user gatorsmile commented on the pull request: https://github.com/apache/spark/pull/10092#issuecomment-161515703

Just saw the comments and will change the names soon. Thanks!
[GitHub] spark pull request: [SPARK-11269][SQL] Java API support & test cas...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/9358#discussion_r46657105

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/encoders/Encoder.scala ---
@@ -37,3 +37,120 @@ trait Encoder[T] extends Serializable {
   /** A ClassTag that can be used to construct and Array to contain a collection of `T`. */
   def clsTag: ClassTag[T]
 }
+
+object Encoder {
+  import scala.reflect.runtime.universe._
+
+  def BOOLEAN: Encoder[java.lang.Boolean] = ExpressionEncoder(flat = true)
+  def BYTE: Encoder[java.lang.Byte] = ExpressionEncoder(flat = true)
+  def SHORT: Encoder[java.lang.Short] = ExpressionEncoder(flat = true)
+  def INT: Encoder[java.lang.Integer] = ExpressionEncoder(flat = true)
+  def LONG: Encoder[java.lang.Long] = ExpressionEncoder(flat = true)
+  def FLOAT: Encoder[java.lang.Float] = ExpressionEncoder(flat = true)
+  def DOUBLE: Encoder[java.lang.Double] = ExpressionEncoder(flat = true)
+  def STRING: Encoder[java.lang.String] = ExpressionEncoder(flat = true)
+
+  def tuple[T1, T2](enc1: Encoder[T1], enc2: Encoder[T2]): Encoder[(T1, T2)] = {
+    tuple(Seq(enc1, enc2).map(_.asInstanceOf[ExpressionEncoder[_]]))
+      .asInstanceOf[ExpressionEncoder[(T1, T2)]]
+  }
+
+  def tuple[T1, T2, T3](
+      enc1: Encoder[T1],
+      enc2: Encoder[T2],
+      enc3: Encoder[T3]): Encoder[(T1, T2, T3)] = {
+    tuple(Seq(enc1, enc2, enc3).map(_.asInstanceOf[ExpressionEncoder[_]]))
+      .asInstanceOf[ExpressionEncoder[(T1, T2, T3)]]
+  }
+
+  def tuple[T1, T2, T3, T4](
+      enc1: Encoder[T1],
+      enc2: Encoder[T2],
+      enc3: Encoder[T3],
+      enc4: Encoder[T4]): Encoder[(T1, T2, T3, T4)] = {
+    tuple(Seq(enc1, enc2, enc3, enc4).map(_.asInstanceOf[ExpressionEncoder[_]]))
+      .asInstanceOf[ExpressionEncoder[(T1, T2, T3, T4)]]
+  }
+
+  def tuple[T1, T2, T3, T4, T5](
+      enc1: Encoder[T1],
+      enc2: Encoder[T2],
+      enc3: Encoder[T3],
+      enc4: Encoder[T4],
+      enc5: Encoder[T5]): Encoder[(T1, T2, T3, T4, T5)] = {
+    tuple(Seq(enc1, enc2, enc3, enc4, enc5).map(_.asInstanceOf[ExpressionEncoder[_]]))
+      .asInstanceOf[ExpressionEncoder[(T1, T2, T3, T4, T5)]]
+  }
+
+  private def tuple(encoders: Seq[ExpressionEncoder[_]]): ExpressionEncoder[_] = {
--- End diff --

Thank you!
[GitHub] spark pull request: [SPARK-11269][SQL] Java API support & test cas...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/9358#discussion_r46650956

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/encoders/Encoder.scala ---
@@ -37,3 +37,120 @@ trait Encoder[T] extends Serializable {
   /** A ClassTag that can be used to construct and Array to contain a collection of `T`. */
   def clsTag: ClassTag[T]
 }
+
+object Encoder {
+  import scala.reflect.runtime.universe._
+
+  def BOOLEAN: Encoder[java.lang.Boolean] = ExpressionEncoder(flat = true)
+  def BYTE: Encoder[java.lang.Byte] = ExpressionEncoder(flat = true)
+  def SHORT: Encoder[java.lang.Short] = ExpressionEncoder(flat = true)
+  def INT: Encoder[java.lang.Integer] = ExpressionEncoder(flat = true)
+  def LONG: Encoder[java.lang.Long] = ExpressionEncoder(flat = true)
+  def FLOAT: Encoder[java.lang.Float] = ExpressionEncoder(flat = true)
+  def DOUBLE: Encoder[java.lang.Double] = ExpressionEncoder(flat = true)
+  def STRING: Encoder[java.lang.String] = ExpressionEncoder(flat = true)
+
+  def tuple[T1, T2](enc1: Encoder[T1], enc2: Encoder[T2]): Encoder[(T1, T2)] = {
+    tuple(Seq(enc1, enc2).map(_.asInstanceOf[ExpressionEncoder[_]]))
+      .asInstanceOf[ExpressionEncoder[(T1, T2)]]
+  }
+
+  def tuple[T1, T2, T3](
+      enc1: Encoder[T1],
+      enc2: Encoder[T2],
+      enc3: Encoder[T3]): Encoder[(T1, T2, T3)] = {
+    tuple(Seq(enc1, enc2, enc3).map(_.asInstanceOf[ExpressionEncoder[_]]))
+      .asInstanceOf[ExpressionEncoder[(T1, T2, T3)]]
+  }
+
+  def tuple[T1, T2, T3, T4](
+      enc1: Encoder[T1],
+      enc2: Encoder[T2],
+      enc3: Encoder[T3],
+      enc4: Encoder[T4]): Encoder[(T1, T2, T3, T4)] = {
+    tuple(Seq(enc1, enc2, enc3, enc4).map(_.asInstanceOf[ExpressionEncoder[_]]))
+      .asInstanceOf[ExpressionEncoder[(T1, T2, T3, T4)]]
+  }
+
+  def tuple[T1, T2, T3, T4, T5](
+      enc1: Encoder[T1],
+      enc2: Encoder[T2],
+      enc3: Encoder[T3],
+      enc4: Encoder[T4],
+      enc5: Encoder[T5]): Encoder[(T1, T2, T3, T4, T5)] = {
+    tuple(Seq(enc1, enc2, enc3, enc4, enc5).map(_.asInstanceOf[ExpressionEncoder[_]]))
+      .asInstanceOf[ExpressionEncoder[(T1, T2, T3, T4, T5)]]
+  }
+
+  private def tuple(encoders: Seq[ExpressionEncoder[_]]): ExpressionEncoder[_] = {
--- End diff --

@cloud-fan, does that mean the limit will be 22? Do you think we should at least add it up to Tuple22, which is the limit of Scala?
[GitHub] spark pull request: [SPARK-11269][SQL] Java API support & test cas...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/9358#discussion_r46657310

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/encoders/Encoder.scala ---
@@ -37,3 +37,120 @@ trait Encoder[T] extends Serializable {
   /** A ClassTag that can be used to construct and Array to contain a collection of `T`. */
   def clsTag: ClassTag[T]
 }
+
+object Encoder {
+  import scala.reflect.runtime.universe._
+
+  def BOOLEAN: Encoder[java.lang.Boolean] = ExpressionEncoder(flat = true)
+  def BYTE: Encoder[java.lang.Byte] = ExpressionEncoder(flat = true)
+  def SHORT: Encoder[java.lang.Short] = ExpressionEncoder(flat = true)
+  def INT: Encoder[java.lang.Integer] = ExpressionEncoder(flat = true)
+  def LONG: Encoder[java.lang.Long] = ExpressionEncoder(flat = true)
+  def FLOAT: Encoder[java.lang.Float] = ExpressionEncoder(flat = true)
+  def DOUBLE: Encoder[java.lang.Double] = ExpressionEncoder(flat = true)
+  def STRING: Encoder[java.lang.String] = ExpressionEncoder(flat = true)
--- End diff --

@cloud-fan Could you share your thinking on why we do not add the other basic types, like DecimalType, DateType, and TimestampType? Thank you!

DecimalType -> java.math.BigDecimal
DateType -> java.sql.Date
TimestampType -> java.sql.Timestamp
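A hedged sketch of what those additions could look like inside the same `object Encoder`, following the pattern of the primitive encoders in the diff; the constant names are assumptions.

```scala
// Assumed additions mirroring the existing flat primitive encoders; the
// TypeTag needed by ExpressionEncoder comes from the runtime-universe
// import already in scope in this object.
def DECIMAL: Encoder[java.math.BigDecimal] = ExpressionEncoder(flat = true)
def DATE: Encoder[java.sql.Date] = ExpressionEncoder(flat = true)
def TIMESTAMP: Encoder[java.sql.Timestamp] = ExpressionEncoder(flat = true)
```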
[GitHub] spark pull request: [SPARK-12091] [PYSPARK] Removal of the JAVA-sp...
Github user gatorsmile commented on the pull request: https://github.com/apache/spark/pull/10092#issuecomment-161514977

- Removed all the constants whose `deserialized` values are true.
- Updated the comments of StorageLevel.
- Changed the default Kinesis storage level from `MEMORY_AND_DISK_2` to `MEMORY_AND_DISK_SER_2`.

Please verify whether my changes are OK. @mateiz @davies Thank you very much!
[GitHub] spark pull request: [SPARK-12091] [PYSPARK] Removal of the JAVA-sp...
Github user gatorsmile commented on the pull request: https://github.com/apache/spark/pull/10092#issuecomment-161522366

Based on the comments from @mateiz, the extra changes are:

- Renaming MEMORY_ONLY_SER to MEMORY_ONLY
- Renaming MEMORY_ONLY_SER_2 to MEMORY_ONLY_2
- Renaming MEMORY_AND_DISK_SER to MEMORY_AND_DISK
- Renaming MEMORY_AND_DISK_SER_2 to MEMORY_AND_DISK_2

Thanks!
[GitHub] spark pull request: [SPARK-12091] [PYSPARK] Deprecate the JAVA-spe...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/10092#discussion_r46522595

--- Diff: python/pyspark/storagelevel.py ---
@@ -49,12 +51,8 @@ def __str__(self):
 StorageLevel.DISK_ONLY = StorageLevel(True, False, False, False)
 StorageLevel.DISK_ONLY_2 = StorageLevel(True, False, False, False, 2)
-StorageLevel.MEMORY_ONLY = StorageLevel(False, True, False, True)
-StorageLevel.MEMORY_ONLY_2 = StorageLevel(False, True, False, True, 2)
-StorageLevel.MEMORY_ONLY_SER = StorageLevel(False, True, False, False)
--- End diff --

Sure. Just changed it. : )
[GitHub] spark pull request: [SPARK-12158] [R] [SQL] Fix 'sample' functions...
Github user gatorsmile commented on the pull request: https://github.com/apache/spark/pull/10160#issuecomment-162286394

@felixcheung @sun-rui Thank you! I made the changes based on your comments; please review them. : )
[GitHub] spark pull request: [SPARK-12158] [R] [SQL] Fix 'sample' functions...
Github user gatorsmile commented on the pull request: https://github.com/apache/spark/pull/10160#issuecomment-162328075

@felixcheung I am not sure if we need to add a test case for `sample`. Normally, using a specific seed is the common way to verify the result of `sample`. The existing test case may be enough?

```
sampled <- sample(df, FALSE, 1.0)
expect_equal(nrow(collect(sampled)), count(df))
```

If needed, maybe we can add something like the below:

```
repeat {
  if (count(sample(df, FALSE, 0.1)) != count(sample(df, FALSE, 0.1))) {
    break
  }
}
```
[GitHub] spark pull request: [SPARK-12164] [SQL] Display the binary/encoded...
GitHub user gatorsmile opened a pull request: https://github.com/apache/spark/pull/10165

[SPARK-12164] [SQL] Display the binary/encoded values

When the Dataset is Kryo-encoded, the existing display looks strange: rendering binary values as signed decimal bytes is uncommon.

```
implicit val kryoEncoder = Encoders.kryo[KryoClassData]
val ds = Seq(KryoClassData("a", 1), KryoClassData("b", 2), KryoClassData("c", 3)).toDS()
ds.show(20, false)
```

The output is like:

```
+--+
|value |
+--+
|[1, 0, 111, 114, 103, 46, 97, 112, 97, 99, 104, 101, 46, 115, 112, 97, 114, 107, 46, 115, 113, 108, 46, 75, 114, 121, 111, 67, 108, 97, 115, 115, 68, 97, 116, -31, 1, 1, -126, 97, 2]|
|[1, 0, 111, 114, 103, 46, 97, 112, 97, 99, 104, 101, 46, 115, 112, 97, 114, 107, 46, 115, 113, 108, 46, 75, 114, 121, 111, 67, 108, 97, 115, 115, 68, 97, 116, -31, 1, 1, -126, 98, 4]|
|[1, 0, 111, 114, 103, 46, 97, 112, 97, 99, 104, 101, 46, 115, 112, 97, 114, 107, 46, 115, 113, 108, 46, 75, 114, 121, 111, 67, 108, 97, 115, 115, 68, 97, 116, -31, 1, 1, -126, 99, 6]|
+--+
```

After the fix, it will be like the below:

```
++
|value |
++
|[01 00 6F 72 67 2E 61 70 61 63 68 65 2E 73 70 61 72 6B 2E 73 71 6C 2E 4B 72 79 6F 43 6C 61 73 73 44 61 74 E1 01 01 82 61 02]|
|[01 00 6F 72 67 2E 61 70 61 63 68 65 2E 73 70 61 72 6B 2E 73 71 6C 2E 4B 72 79 6F 43 6C 61 73 73 44 61 74 E1 01 01 82 62 04]|
|[01 00 6F 72 67 2E 61 70 61 63 68 65 2E 73 70 61 72 6B 2E 73 71 6C 2E 4B 72 79 6F 43 6C 61 73 73 44 61 74 E1 01 01 82 63 06]|
++
```

In addition, do we need to add a new method to decode and then display the contents?

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/gatorsmile/spark binaryOutput

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/10165.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #10165

commit f63c43519b2e8eeab9428397c519de1032e1ae45
Author: gatorsmile <gatorsm...@gmail.com>
Date: 2015-12-05T00:50:03Z
Merge remote-tracking branch 'upstream/master' into binaryOutput

commit 8754979da599743112f392250cee5606a3ce8864
Author: gatorsmile <gatorsm...@gmail.com>
Date: 2015-12-06T17:44:04Z
Displays the encoded content of the Dataset

commit 5d0d64c76772d8d8d1a164be130d61e0abb50352
Author: gatorsmile <gatorsm...@gmail.com>
Date: 2015-12-06T17:44:56Z
Merge remote-tracking branch 'upstream/master' into binaryOutput
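A minimal sketch of the hex rendering shown in the proposed output (two uppercase hex digits per byte, space-separated); an illustration, not the PR's exact code.

```scala
// Format a binary cell the way the proposed output renders it; for 'x'
// conversions java.util.Formatter treats a negative Byte as unsigned,
// so -31 prints as E1, matching the rows above.
def formatBinary(bytes: Array[Byte]): String =
  bytes.map("%02X".format(_)).mkString("[", " ", "]")

formatBinary(Array[Byte](1, 0, -31))  // "[01 00 E1]"
```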
[GitHub] spark pull request: [SPARK-12158] [R] [SQL] Fix 'sample' functions...
Github user gatorsmile commented on the pull request: https://github.com/apache/spark/pull/10160#issuecomment-162286420

ok to test
[GitHub] spark pull request: [SPARK-12150] [SQL] [Minor] Add range API with...
GitHub user gatorsmile opened a pull request: https://github.com/apache/spark/pull/10149

[SPARK-12150] [SQL] [Minor] Add range API without specifying the slice number

For usability, add another sqlContext.range() method. Users can specify start, end, and step without specifying the slice number; the slice number defaults to the sparkContext's defaultParallelism. This makes it consistent with the RDD range API.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/gatorsmile/spark range

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/10149.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #10149

commit 8c4bd8351f79db2ce2aebc8a641442ba882295b8
Author: gatorsmile <gatorsm...@gmail.com>
Date: 2015-12-04T20:23:36Z
range API with a default partition number

commit 6655b9d9515819cf81844c63c7105eb59882be12
Author: gatorsmile <gatorsm...@gmail.com>
Date: 2015-12-04T20:25:52Z
2.0->1.6

commit 72860c4e93de38da18ee13e46368493d04819094
Author: gatorsmile <gatorsm...@gmail.com>
Date: 2015-12-04T20:27:02Z
Merge remote-tracking branch 'upstream/master' into range
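A hedged usage sketch of the added overload; the exact signature is an assumption based on the description: start, end, and step, with the partition count defaulting to `sparkContext.defaultParallelism`.

```scala
// Three-argument form proposed by this PR: no slice count supplied, so
// the number of partitions falls back to sparkContext.defaultParallelism.
val df = sqlContext.range(0L, 1000L, 10L)
assert(df.count() == 100)  // rows 0, 10, ..., 990
```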
[GitHub] spark pull request: [SPARK-12164] [SQL] Display the binary/encoded...
Github user gatorsmile commented on the pull request: https://github.com/apache/spark/pull/10165#issuecomment-162400140

I have the exact same question when calling the show function. From the users' perspective, they might not care about the encoded values at all when calling `show`; the encoded values look weird to most users, I think.
[GitHub] spark pull request: [SPARK-12158] [SparkR] [SQL] Fix 'sample' func...
Github user gatorsmile commented on the pull request: https://github.com/apache/spark/pull/10160#issuecomment-162415274

@felixcheung @shivaram Sure, just added that test case. Please review it. Thank you! : )
[GitHub] spark pull request: [SPARK-12158] [SparkR] [SQL] Fix 'sample' func...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/10160#discussion_r46788421

--- Diff: R/pkg/R/DataFrame.R ---
@@ -677,13 +677,15 @@ setMethod("unique",
 #'   collect(sample(df, TRUE, 0.5))
 #' }
 setMethod("sample",
-          # TODO : Figure out how to send integer as java.lang.Long to JVM so
--- End diff --

True. Added it back.
[GitHub] spark pull request: [SPARK-12158] [SparkR] [SQL] Fix 'sample' func...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/10160#discussion_r46789803

--- Diff: R/pkg/R/DataFrame.R ---
@@ -692,8 +696,8 @@ setMethod("sample",
 setMethod("sample_frac",
           signature(x = "DataFrame", withReplacement = "logical", fraction = "numeric"),
-          function(x, withReplacement, fraction) {
-            sample(x, withReplacement, fraction)
+          function(x, withReplacement, fraction, seed) {
+            sample(x, withReplacement, fraction, as.integer(seed))
--- End diff --

Yeah, done! This is my first time reading and writing R. : ) Thank you!
[GitHub] spark pull request: [SPARK-12158] [SparkR] [SQL] Fix 'sample' func...
Github user gatorsmile commented on the pull request: https://github.com/apache/spark/pull/10160#issuecomment-162428939

@shivaram @felixcheung @sun-rui Thank you everyone! Hopefully, my code changes resolve all your concerns. I learned a lot from you! : )
[GitHub] spark pull request: [SPARK-12158] [SparkR] [SQL] Fix 'sample' func...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/10160#discussion_r46789542

--- Diff: R/pkg/R/DataFrame.R ---
@@ -692,8 +696,8 @@ setMethod("sample",
 setMethod("sample_frac",
           signature(x = "DataFrame", withReplacement = "logical", fraction = "numeric"),
-          function(x, withReplacement, fraction) {
-            sample(x, withReplacement, fraction)
+          function(x, withReplacement, fraction, seed) {
+            sample(x, withReplacement, fraction, as.integer(seed))
--- End diff --

Then you need to change the test case. If we do not use `as.integer(seed)`, we need to change the input type. For example,

```
sampled2 <- sample(df, FALSE, 0.1, 0)
```

needs to be changed to

```
sampled2 <- sample(df, FALSE, 0.1, 0L)
```
[GitHub] spark pull request: [SPARK-12150] [SQL] [Minor] Add range API with...
Github user gatorsmile commented on the pull request: https://github.com/apache/spark/pull/10149#issuecomment-163093870

@marmbrus @cloud-fan This PR changes the external API. I am not sure whether this will be merged now or whether we should revisit it after the 1.6 release. Thank you!
[GitHub] spark pull request: [SPARK-12164] [SQL] Decode the encoded values ...
GitHub user gatorsmile opened a pull request: https://github.com/apache/spark/pull/10215

[SPARK-12164] [SQL] Decode the encoded values and then display

Based on the suggestions from @marmbrus @cloud-fan in https://github.com/apache/spark/pull/10165, this PR prints the decoded values (user objects) in `Dataset.show`:

```
implicit val kryoEncoder = Encoders.kryo[KryoClassData]
val ds = Seq(KryoClassData("a", 1), KryoClassData("b", 2), KryoClassData("c", 3)).toDS()
ds.show(20, false)
```

The current output is like:

```
+--+
|value |
+--+
|[1, 0, 111, 114, 103, 46, 97, 112, 97, 99, 104, 101, 46, 115, 112, 97, 114, 107, 46, 115, 113, 108, 46, 75, 114, 121, 111, 67, 108, 97, 115, 115, 68, 97, 116, -31, 1, 1, -126, 97, 2]|
|[1, 0, 111, 114, 103, 46, 97, 112, 97, 99, 104, 101, 46, 115, 112, 97, 114, 107, 46, 115, 113, 108, 46, 75, 114, 121, 111, 67, 108, 97, 115, 115, 68, 97, 116, -31, 1, 1, -126, 98, 4]|
|[1, 0, 111, 114, 103, 46, 97, 112, 97, 99, 104, 101, 46, 115, 112, 97, 114, 107, 46, 115, 113, 108, 46, 75, 114, 121, 111, 67, 108, 97, 115, 115, 68, 97, 116, -31, 1, 1, -126, 99, 6]|
+--+
```

After the fix, it will be like the below:

```
+---+
|value |
+---+
|KryoClassData(a, 1)|
|KryoClassData(b, 2)|
|KryoClassData(c, 3)|
+---+
```

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/gatorsmile/spark showDecodedValue

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/10215.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #10215

commit 1e1ad1970a8bf3d9076165074f18ee7f28ab4acd
Author: gatorsmile <gatorsm...@gmail.com>
Date: 2015-12-09T04:08:17Z
show decoded values.
[GitHub] spark pull request: [SPARK-12188] [SQL] Code refactoring and comme...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/10184#discussion_r47046420

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
@@ -429,18 +432,18 @@ class Dataset[T] private[sql](
   /**
    * (Java-specific)
-   * Returns a [[GroupedDataset]] where the data is grouped by the given key function.
+   * Returns a [[GroupedDataset]] where the data is grouped by the given key `func`.
    * @since 1.6.0
    */
-  def groupBy[K](f: MapFunction[T, K], encoder: Encoder[K]): GroupedDataset[K, T] =
-    groupBy(f.call(_))(encoder)
+  def groupBy[K](func: MapFunction[T, K], encoder: Encoder[K]): GroupedDataset[K, T] =
--- End diff --

Sure, next time I will be careful. Thanks!
[GitHub] spark pull request: [SPARK-12188] [SQL] Code refactoring and comme...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/10184#discussion_r47046431

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
@@ -67,15 +67,21 @@ class Dataset[T] private[sql](
     tEncoder: Encoder[T]) extends Queryable with Serializable {

   /**
-   * An unresolved version of the internal encoder for the type of this dataset. This one is marked
-   * implicit so that we can use it when constructing new [[Dataset]] objects that have the same
-   * object type (that will be possibly resolved to a different schema).
+   * An unresolved version of the internal encoder for the type of this [[Dataset]]. This one is
+   * marked implicit so that we can use it when constructing new [[Dataset]] objects that have the
+   * same object type (that will be possibly resolved to a different schema).
    */
   private[sql] implicit val unresolvedTEncoder: ExpressionEncoder[T] = encoderFor(tEncoder)

   /** The encoder for this [[Dataset]] that has been resolved to its output schema. */
   private[sql] val resolvedTEncoder: ExpressionEncoder[T] =
-    unresolvedTEncoder.resolve(queryExecution.analyzed.output, OuterScopes.outerScopes)
+    unresolvedTEncoder.resolve(logicalPlan.output, OuterScopes.outerScopes)
+
+  /**
+   * The encoder where the expressions used to construct an object from an input row have been
+   * bound to the ordinals of the given schema.
--- End diff --

I see. Thank you!
[GitHub] spark pull request: [SPARK-12188] [SQL] Code refactoring and comme...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/10184#discussion_r47046739

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
@@ -67,15 +67,21 @@ class Dataset[T] private[sql](
     tEncoder: Encoder[T]) extends Queryable with Serializable {

   /**
-   * An unresolved version of the internal encoder for the type of this dataset. This one is marked
-   * implicit so that we can use it when constructing new [[Dataset]] objects that have the same
-   * object type (that will be possibly resolved to a different schema).
+   * An unresolved version of the internal encoder for the type of this [[Dataset]]. This one is
+   * marked implicit so that we can use it when constructing new [[Dataset]] objects that have the
+   * same object type (that will be possibly resolved to a different schema).
    */
   private[sql] implicit val unresolvedTEncoder: ExpressionEncoder[T] = encoderFor(tEncoder)

   /** The encoder for this [[Dataset]] that has been resolved to its output schema. */
   private[sql] val resolvedTEncoder: ExpressionEncoder[T] =
-    unresolvedTEncoder.resolve(queryExecution.analyzed.output, OuterScopes.outerScopes)
+    unresolvedTEncoder.resolve(logicalPlan.output, OuterScopes.outerScopes)
+
+  /**
+   * The encoder where the expressions used to construct an object from an input row have been
+   * bound to the ordinals of the given schema.
--- End diff --

Let me add the change in a follow-up PR. : )
[GitHub] spark pull request: [SPARK-12188] [SQL] [FOLLOW-UP] Code refactori...
GitHub user gatorsmile opened a pull request: https://github.com/apache/spark/pull/10214

[SPARK-12188] [SQL] [FOLLOW-UP] Code refactoring and comment correction in Dataset APIs

@marmbrus This PR is to address your comment. Thanks for your review!

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/gatorsmile/spark followup12188

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/10214.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #10214

commit 145cd5b5e5b0ad4a229e1621acaf26d02d25cd41
Author: gatorsmile <gatorsm...@gmail.com>
Date: 2015-12-09T03:10:33Z
address the comments.
[GitHub] spark pull request: [SPARK-12164] [SQL] Display the binary/encoded...
Github user gatorsmile commented on the pull request: https://github.com/apache/spark/pull/10165#issuecomment-163093753 Thank you! @cloud-fan Will this PR be merged into 1.6, or is it waiting on another PR for showing the decoded values? @marmbrus Thank you!
[GitHub] spark pull request: [SPARK-12158] [R] [SQL] Fix 'sample' functions...
GitHub user gatorsmile opened a pull request: https://github.com/apache/spark/pull/10160 [SPARK-12158] [R] [SQL] Fix 'sample' functions that break R unit test cases The existing `sample` functions are missing the parameter 'seed'; however, the corresponding function interface in `generics` does have such a parameter. Thus, although callers pass a 'seed' to the function, it is never used, which can cause SparkR unit tests to fail. For example, I hit it in another PR: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/47213/consoleFull You can merge this pull request into a Git repository by running: $ git pull https://github.com/gatorsmile/spark sampleR Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/10160.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #10160 commit ec770100452ca1a869058e448b1b41c8efb810d9 Author: gatorsmile <gatorsm...@gmail.com> Date: 2015-12-05T17:53:39Z add sample functions with seeds
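To make the failure mode concrete: a sampling API that silently drops the caller's seed is non-deterministic, so any test asserting on the sample's contents becomes flaky. Below is a minimal sketch of the contract being restored, written against the Spark 1.6 Scala DataFrame API rather than SparkR; the object name and local-mode setup are illustrative only.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object SampleSeedSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("sample-seed-sketch").setMaster("local[2]"))
    val sqlContext = new SQLContext(sc)

    val df = sqlContext.range(0, 100)  // a single-column DataFrame, "id"

    // With an explicit seed the sample is reproducible across invocations,
    // which is what deterministic unit tests rely on.
    val s1 = df.sample(withReplacement = false, fraction = 0.1, seed = 42L)
    val s2 = df.sample(withReplacement = false, fraction = 0.1, seed = 42L)
    assert(s1.collect().sameElements(s2.collect()))

    // Without a seed (or with a seed that is accepted but ignored, as in the
    // bug being fixed), each call may draw a different sample, so assertions
    // on the sample's exact size or contents can fail intermittently.
    val s3 = df.sample(withReplacement = false, fraction = 0.1)
    println(s3.count())

    sc.stop()
  }
}
```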
[GitHub] spark pull request: [SPARK-12138] [SQL] Escape \u in the generated...
Github user gatorsmile commented on the pull request: https://github.com/apache/spark/pull/10155#issuecomment-162140125 Weird... my code changes are unrelated to the failing SparkR test case:
```
count(sampled3) < 3 isn't true
```
[GitHub] spark pull request: [SPARK-12138] [SQL] Escape \u in the generated...
Github user gatorsmile commented on the pull request: https://github.com/apache/spark/pull/10155#issuecomment-162140292 retest this please
[GitHub] spark pull request: [SPARK-12138] [SQL] Escape \u in the generated...
Github user gatorsmile commented on the pull request: https://github.com/apache/spark/pull/10155#issuecomment-162148385 Found a bug in SparkR's `sample` function. Will submit a PR later. Thanks!
[GitHub] spark pull request: [SPARK-12158] [R] [SQL] Fix 'sample' functions...
Github user gatorsmile commented on the pull request: https://github.com/apache/spark/pull/10160#issuecomment-162241856 @davies Could you take a look at this PR? Thank you!
[GitHub] spark pull request: [WIP][SPARK-12069][SQL] Update documentation w...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/10060#discussion_r46624363

--- Diff: docs/sql-programming-guide.md ---
@@ -9,18 +9,51 @@ title: Spark SQL and DataFrames

 # Overview

-Spark SQL is a Spark module for structured data processing. It provides a programming abstraction called DataFrames and can also act as distributed SQL query engine.
+Spark SQL is a Spark module for structured data processing. Unlike the basic Spark RDD API, the interfaces provided
+by Spark SQL provide Spark with more about the structure of both the data and the computation being performed. Internally,
+Spark SQL uses this extra information to perform extra optimizations. There are several ways to
+interact with Spark SQL including SQL, the DataFrames API and the Datasets API. When computing a result
+the same execution engine is used, independent of which API/language you are using to express the
+computation. This unification means that developers can easily switch back and forth between the
+various APIs based on which provides the most natural way to express a given transformation.

-Spark SQL can also be used to read data from an existing Hive installation. For more on how to configure this feature, please refer to the [Hive Tables](#hive-tables) section.
+All of the examples on this page use sample data included in the Spark distribution and can be run in
+the `spark-shell`, `pyspark` shell, or `sparkR` shell.

-# DataFrames
+## SQL

-A DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood. DataFrames can be constructed from a wide array of sources such as: structured data files, tables in Hive, external databases, or existing RDDs.
+One use of Spark SQL is to execute SQL queries written using either a basic SQL syntax or HiveQL.
+Spark SQL can also be used to read data from an existing Hive installation. For more on how to
+configure this feature, please refer to the [Hive Tables](#hive-tables) section. When running
+SQL from within another programming language the results will be returned as a [DataFrame](#DataFrames).
+You can also interact with the SQL interface using the [command-line](#running-the-spark-sql-cli)
+or over [JDBC/ODBC](#running-the-thrift-jdbcodbc-server).

-The DataFrame API is available in [Scala](api/scala/index.html#org.apache.spark.sql.DataFrame), [Java](api/java/index.html?org/apache/spark/sql/DataFrame.html), [Python](api/python/pyspark.sql.html#pyspark.sql.DataFrame), and [R](api/R/index.html).
+## DataFrames

-All of the examples on this page use sample data included in the Spark distribution and can be run in the `spark-shell`, `pyspark` shell, or `sparkR` shell.
+A DataFrame is a distributed collection of data organized into named columns. It is conceptually
+equivalent to a table in a relational database or a data frame in R/Python, but with richer
+optimizations under the hood. DataFrames can be constructed from a wide array of [sources](#data-sources) such
+as: structured data files, tables in Hive, external databases, or existing RDDs.
+The DataFrame API is available in [Scala](api/scala/index.html#org.apache.spark.sql.DataFrame),
+[Java](api/java/index.html?org/apache/spark/sql/DataFrame.html),
+[Python](api/python/pyspark.sql.html#pyspark.sql.DataFrame), and [R](api/R/index.html).
+
+## Datasets
+
+A Dataset is a new experimental interface added in Spark 1.6 that tries to provide the benefits of
+RDDs (strong typing, ability to use powerful lambda functions) with the benifits of Spark SQL's
--- End diff --

benifits -> benefits
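Since the new documentation sections contrast these entry points, here is a minimal, self-contained sketch of the DataFrame/Dataset distinction they describe, using the Spark 1.6 Scala API; `Person`, the data, and the local-mode setup are illustrative only.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Hypothetical domain class; any case class with an implicit encoder works.
case class Person(name: String, age: Long)

object DataFrameVsDataset {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("df-vs-ds").setMaster("local[2]"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // DataFrame: rows with named columns; column references are untyped,
    // so a typo in "age" would only fail at runtime.
    val df = Seq(Person("Ann", 32), Person("Bob", 45)).toDF()
    df.filter($"age" > 40).show()

    // Dataset (experimental in 1.6): the same data, strongly typed, so the
    // lambda over Person is checked at compile time.
    val ds = df.as[Person]
    ds.filter(_.age > 40).collect().foreach(println)

    sc.stop()
  }
}
```

Both queries run through the same execution engine, which is exactly the unification point the revised overview makes.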