[GitHub] spark pull request: [SPARK-8658] [SQL] AttributeReference's equals...
GitHub user gatorsmile opened a pull request: https://github.com/apache/spark/pull/9216

[SPARK-8658] [SQL] AttributeReference's equals method compares all the members

This fix changes the equals method of AttributeReference to check all of its fields for equality.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/gatorsmile/spark namedExpressEqual

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/9216.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #9216

commit 029b5babcc842347d6c55df95b0fa51fff43f0e6
Author: gatorsmile <gatorsm...@gmail.com>
Date: 2015-10-22T04:08:53Z

    Spark-8658
[GitHub] spark pull request: [SPARK-8658] [SQL] AttributeReference's equals...
Github user gatorsmile commented on the pull request: https://github.com/apache/spark/pull/9216#issuecomment-150395202

My code change exposes a new defect: both rollup and cube produce incorrect results, whether or not the build includes my changes.

Without my changes, the outputs of the rollup query are:

    [3,2,-1]
    [3,null,null]
    [6,4,-2]
    [6,null,null]
    [null,null,null]

However, the expected results should be:

    [3,2,-1]
    [3,null,-1]
    [6,4,-2]
    [6,null,-2]
    [null,null,-3]

I need more time to find the root cause of why rollup and cube do not work. The current test cases in HiveDataFrameAnalyticsSuite hide the errors because the results of both SQL and DataFrame are wrong, yet identical to each other.

Thanks, Xiao Li
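For anyone trying to reproduce this, a minimal sketch along these lines should work (assumptions: the two-row `mytable` and the `a + b` rollup query used elsewhere in this thread, an existing SparkContext named `sc`, and a 1.5-era HiveContext, whose HiveQL parser is what accepts `with rollup`):

```scala
// Reproduction sketch (assumes an existing SparkContext `sc`; a HiveContext
// is needed because its parser handles `with rollup`).
import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)
import hiveContext.implicits._

Seq((1, 2), (2, 4)).toDF("a", "b").registerTempTable("mytable")

// Expected: [3,2,-1] [3,null,-1] [6,4,-2] [6,null,-2] [null,null,-3]
hiveContext.sql(
  "select a + b, b, sum(a - b) from mytable group by a + b, b with rollup"
).collect().foreach(println)
```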
[GitHub] spark pull request: [SPARK-11360] [Doc] Loss of nullability when w...
GitHub user gatorsmile opened a pull request: https://github.com/apache/spark/pull/9314

[SPARK-11360] [Doc] Loss of nullability when writing parquet files

This fix adds one line to the documentation to explain the current behavior of Spark SQL when writing Parquet files: all columns are forced to be nullable for compatibility reasons.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/gatorsmile/spark lossNull

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/9314.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #9314

commit 4a63fad3b432bcb16d0fa3774c86112a2425
Author: gatorsmile <gatorsm...@gmail.com>
Date: 2015-10-28T01:33:04Z

    Document fix: loss of nullability when writing parquet files
[GitHub] spark pull request: [SPARK-8658] [SQL] AttributeReference's equals...
Github user gatorsmile commented on the pull request: https://github.com/apache/spark/pull/9216#issuecomment-150475832

Hi, @cloud-fan Sure. Will do. I am trying to see if I can easily fix it. Anyway, I will open a JIRA tonight.

Thanks, Xiao Li
[GitHub] spark pull request: [SPARK-8658] [SQL] AttributeReference's equals...
Github user gatorsmile commented on the pull request: https://github.com/apache/spark/pull/9216#issuecomment-150486350

The JIRA has been opened: https://issues.apache.org/jira/browse/SPARK-11275 I will continue the investigation under that JIRA issue.
[GitHub] spark pull request: [SPARK-10838][SPARK-11576][SQL][WIP] Incorrect...
Github user gatorsmile commented on the pull request: https://github.com/apache/spark/pull/9548#issuecomment-155835390

@cloud-fan Before discussing the solution details, let us first talk about the design issues. IMO, `DataFrame` is a query language, kind of a dialect of SQL. Or, maybe, SQL is a dialect of `DataFrame`. We need to formalize it and clearly define the concepts behind each major class, like `DataFrame` and `Column`.

If `Column` represents a concept independent of `DataFrame`, can you define what it is? If a `Column` with the same ID can appear in different `DataFrame`s, how do we enforce such "referential integrity" between the `DataFrame`s? If two `Column`s with different IDs can represent the same entity, should we keep track of that relation for generating a better physical plan?

In the current implementation, each `Column` actually corresponds to an expression in logical plans, but we are unable to apply an expression on top of `Column` instances to generate a new expression. So far, `Column` is kind of a wrapper, but it is not a subclass of `TreeNode`. As more components are built on top of `DataFrame`, we have to think about this problem carefully. If possible, I think we need to resolve it in the Spark 2.0 release.

Will answer your design suggestion in a separate post.
[GitHub] spark pull request: [SPARK-11433] [SQL] Cleanup the subquery name ...
Github user gatorsmile closed the pull request at: https://github.com/apache/spark/pull/9385
[GitHub] spark pull request: [SPARK-11433] [SQL] Cleanup the subquery name ...
GitHub user gatorsmile reopened a pull request: https://github.com/apache/spark/pull/9385

[SPARK-11433] [SQL] Cleanup the subquery name after eliminating subquery

This fix removes the subquery name from the qualifiers after eliminating the subquery.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/gatorsmile/spark eliminateSubQ

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/9385.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #9385

commit db69ccf2c60679c4ca111a618190258d5b5cef62
Author: Xiao Li <xiaoli@xiaos-imac.local>
Date: 2015-10-30T22:12:19Z

    cleanup the subquery name after eliminating subquery

commit a26763d758bc58dacf81be171428ede215775532
Author: gatorsmile <gatorsm...@gmail.com>
Date: 2015-11-04T00:04:31Z

    flatmap->map
[GitHub] spark pull request: [SPARK-11433] [SQL] Cleanup the subquery name ...
Github user gatorsmile commented on the pull request: https://github.com/apache/spark/pull/9385#issuecomment-155605985

@marmbrus After rechecking the root cause of the Expand failures, I still think we should clean up the subquery name after subquery elimination. My current fix needs a change to enable a deeper cleanup of the subquery name. Let me explain what happened in Expand.

Before subquery elimination, the subquery name "mytable" is shown in all three upper levels (Aggregate, Expand and Project).

```scala
Aggregate [(a#2 + b#3)#7,b#3,grouping__id#6], [(a#2 + b#3)#7 AS _c0#4,b#3,sum(cast((a#2 - b#3) as bigint)) AS _c2#5L]
 Expand [0,1,3], [(a#2 + b#3)#7,b#3], grouping__id#6
  Project [a#2,b#3,(a#2 + b#3) AS (a#2 + b#3)#7]
   Subquery mytable
    Project [_1#0 AS a#2,_2#1 AS b#3]
     LocalRelation [_1#0,_2#1], [[1,2],[2,4]]
```

After subquery elimination, the subquery name "mytable" is not removed from these three levels.
[GitHub] spark pull request: [SPARK-10838][SPARK-11576][SQL][WIP] Incorrect...
GitHub user gatorsmile opened a pull request: https://github.com/apache/spark/pull/9548

[SPARK-10838][SPARK-11576][SQL][WIP] Incorrect results or exceptions when using self-joins

When resolving the ambiguity of attributeReferences caused by self joins, the current solution only handles the conflicting attributes. However, this does not work when the join conditions use column names that appear in both dataFrames, since the join conditions are evaluated before the ambiguity of conflicting attributes is resolved. Currently, we do not update the search condition: when generating the new expression IDs in the right tree, we must also update the corresponding columns' expression IDs in the search condition.

Here, I am trying to propose a solution to resolve this issue. When evaluating the join conditions, we record the dataFrame of the search-condition columns. Then, when resolving the ambiguity of conflicting attributes, we can use this information to know which columns are from the right tree, and update their expression IDs.

When designing this solution, I tried to minimize the code changes, and thus I am using quantifiers to record this information. Ideally, I think each column should clearly correlate with its original source; that requires a lot of code changes, but it would also help to optimize plans in the future.

Thanks for any suggestion!

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/gatorsmile/spark selfJoinConflictingConditions

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/9548.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #9548

commit 376691af5f639ca4f7ff07cc9e8f572d53e961bf
Author: xiaoli <lixiao1...@gmail.com>
Date: 2015-11-08T18:28:12Z

    Spark-10838

commit 7d047136cd710ee0e9ff34aa37c1e6d299165233
Author: xiaoli <lixiao1...@gmail.com>
Date: 2015-11-08T21:07:37Z

    Merge branch 'selfJoinCondition' into selfJoinConflictingConditions
[GitHub] spark pull request: [SPARK-10838][SPARK-11576][SQL][WIP] Incorrect...
Github user gatorsmile commented on the pull request: https://github.com/apache/spark/pull/9548#issuecomment-154881441

Since this solution adds quantifier comparison to the equality check of attributeReferences, it will fail a couple of test cases in Expand. We have already identified the bugs in Expand and submitted a pull request to resolve them: https://github.com/apache/spark/pull/9216
[GitHub] spark pull request: [SPARK-10838][SPARK-11576][SQL][WIP] Incorrect...
Github user gatorsmile commented on the pull request: https://github.com/apache/spark/pull/9548#issuecomment-155183305

I can't fix the problem without a major code change. The current design of dataFrame has a fundamental problem: when using column references, we might hit various strange issues if a dataFrame has columns with the same name and expression ID. Note that this might occur even if we do not have self joins. For example, in the following code:

```scala
val df1 = Seq((1, 3), (2, 1)).toDF("keyCol1", "keyCol2")
val df2 = Seq((1, 4, 0), (2, 1, 0)).toDF("keyCol1", "keyCol3", "keyColToDrop")
val df3 = df1.join(df2, df1("keyCol1") === df2("keyCol1"))
val col = df3("keyColToDrop")
val df = df2.drop(col)
df.printSchema()
```

Above, we can use a column reference of df3 to drop a column in df2. That does not make sense, right? Each column reference has to know its original data source.

@marmbrus @rxin @liancheng Should I propose a solution to fix this problem? Does the new Dataset API resolve this issue?
[GitHub] spark pull request: [SPARK-10838][SPARK-11576][SQL][WIP] Incorrect...
Github user gatorsmile commented on the pull request: https://github.com/apache/spark/pull/9548#issuecomment-154911463

To fix these failed cases, I will move the dataFrame's hashCode into the Column class, instead of directly putting the values into quantifiers.
[GitHub] spark pull request: [SPARK-11360] [Doc] Loss of nullability when w...
Github user gatorsmile commented on the pull request: https://github.com/apache/spark/pull/9314#issuecomment-155334571

Got it, thank you!
[GitHub] spark pull request: [SPARK-11360] [Doc] Loss of nullability when w...
Github user gatorsmile commented on the pull request: https://github.com/apache/spark/pull/9314#issuecomment-155309645

@marmbrus Should I reopen it? Thanks.
[GitHub] spark pull request: [SPARK-11275][SQL] Rollup and Cube Generates t...
Github user gatorsmile commented on the pull request: https://github.com/apache/spark/pull/9419#issuecomment-155973699

Thank you, Hao! Will do it in the next few days.
[GitHub] spark pull request: [Spark-11637][SQL] Regression in UDF: exceptio...
GitHub user gatorsmile opened a pull request: https://github.com/apache/spark/pull/9683

[Spark-11637][SQL] Regression in UDF: exceptions when using Stars and Alias

When using a UDF in Spark SQL, the query fails if a star and an alias are used at the same time. This worked in 1.4.x but fails in 1.5.x. For example:

```sql
select hash(*) as x from src
```

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/gatorsmile/spark hiveUDFStarAlias

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/9683.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #9683

commit 8e104094edd8e69f8c44ff8a0fc6d83a2d61dd07
Author: xiaoli <lixiao1...@gmail.com>
Date: 2015-11-13T03:05:21Z

    Spark-11637
[GitHub] spark pull request: [Spark-11637][SQL] Regression in UDF: exceptio...
Github user gatorsmile commented on the pull request: https://github.com/apache/spark/pull/9683#issuecomment-156319117

The issue has been fixed in https://github.com/apache/spark/pull/9343. I will close this PR.
[GitHub] spark pull request: [SPARK-11433] [SQL] Cleanup the subquery name ...
Github user gatorsmile commented on the pull request: https://github.com/apache/spark/pull/9385#issuecomment-156612181

Hi, @marmbrus Originally, I thought quantifiers were part of identifiers, like schema names in a traditional RDBMS. Based on your explanation, this is not true. I made a code change; please check whether the latest changes make sense. `semanticEquals` is now used, and all the test cases pass. https://github.com/gatorsmile/spark/commit/8e72b17561e4cc1a6cce86fc70f6ed968ebf5b38

I just merged with the latest master. Thank you for your time.
[GitHub] spark pull request: [SPARK-11433] [SQL] Cleanup the subquery name ...
Github user gatorsmile commented on the pull request: https://github.com/apache/spark/pull/9385#issuecomment-157239704

Sure, I will close it. Thank you for your time!
[GitHub] spark pull request: [SPARK-11433] [SQL] Cleanup the subquery name ...
Github user gatorsmile closed the pull request at: https://github.com/apache/spark/pull/9385
[GitHub] spark pull request: [SPARK-8658] [SQL] AttributeReference's equals...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/9216#discussion_r45011838

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/namedExpressions.scala ---
@@ -194,7 +194,9 @@ case class AttributeReference(
   def sameRef(other: AttributeReference): Boolean = this.exprId == other.exprId

   override def equals(other: Any): Boolean = other match {
-    case ar: AttributeReference => name == ar.name && exprId == ar.exprId && dataType == ar.dataType
+    case ar: AttributeReference =>
+      name == ar.name && dataType == ar.dataType && nullable == ar.nullable &&
+        metadata == ar.metadata && exprId == ar.exprId && qualifiers == ar.qualifiers
--- End diff --

sure, will do it tonight.
[GitHub] spark pull request: [SPARK-9928][SQL] Removal of LogicalLocalTable...
Github user gatorsmile commented on the pull request: https://github.com/apache/spark/pull/9717#issuecomment-156943032

Thanks!
[GitHub] spark pull request: [SPARK-11275][SQL] Rollup and Cube Generates t...
Github user gatorsmile commented on the pull request: https://github.com/apache/spark/pull/9419#issuecomment-155951403

Please let me know if I need to resolve these conflicts. @cloud-fan @chenghao-intel @marmbrus @rxin
[GitHub] spark pull request: [SPARK-10838][SPARK-11576][SQL][WIP] Incorrect...
Github user gatorsmile commented on the pull request: https://github.com/apache/spark/pull/9548#issuecomment-155912100

@cloud-fan So far, we do not have an easy fix, but I believe we should never return a wrong result for a self join. Let me post the test case I added. This test case returns an incorrect result without any exception:

```scala
test("[SPARK-10838] self join - conflicting attributes in condition - incorrect result 2") {
  val df1 = Seq((1, 3), (2, 1)).toDF("keyCol1", "keyCol2")
  val df2 = Seq((1, 4), (2, 1)).toDF("keyCol1", "keyCol3")
  val df3 = df1.join(df2, df1("keyCol1") === df2("keyCol1")).select(df1("keyCol1"), $"keyCol3")
  checkAnswer(
    df3.join(df1, df3("keyCol3") === df1("keyCol1") && df1("keyCol1") === df3("keyCol3")),
    Row(2, 1, 1, 3) :: Nil)
}
```

Before resolving this problem, what we can do is detect the situation and let customers use the workaround you mentioned. The detection condition is simple: the incorrect result can happen when the conflicting attributes contain an `AttributeReference` that appears in the join condition. Do you agree @cloud-fan @marmbrus? If OK, I will submit another PR for detecting it and issuing an exception with a meaningful message to users.
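For concreteness, the detection could look roughly like this (a hypothetical sketch with an invented helper name, not the final PR code):

```scala
import org.apache.spark.sql.catalyst.expressions.{Attribute, Expression}

// Hypothetical helper: the self join is ambiguous when any conflicting
// attribute from the duplicate-resolution step also occurs in the join
// condition; in that case we would raise an exception with a meaningful
// message instead of silently returning a wrong result.
def hasAmbiguousSelfJoinCondition(
    conflictingAttributes: Set[Attribute],
    joinCondition: Expression): Boolean =
  joinCondition.references.exists(conflictingAttributes.contains)
```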
[GitHub] spark pull request: [SPARK-8658] [SQL] [FOLLOW-UP] AttributeRefere...
Github user gatorsmile commented on the pull request: https://github.com/apache/spark/pull/9761#issuecomment-157414057

@nongli I saw you had a related discussion with @chenghao-intel. The failed test case was introduced in your PR https://github.com/apache/spark/pull/9480. I am not sure of the original reason why we intentionally exclude `name` from `hashCode` while the original `equals` includes `name`. It breaks a general principle of hashCode function design:

```
An object's hashCode method must take the same fields into account as its equals method.
```

Based on my understanding, in a case-sensitive HiveContext, we should still detect the difference when the case of `name` differs but the `exprId` values are the same.
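To spell out why that principle matters (a toy sketch with made-up classes, not the Spark ones): the dangerous direction of a mismatch is when `hashCode` consults a field that `equals` ignores, because then two equal objects can land in different hash buckets.

```scala
import scala.collection.mutable

// Toy example: `equals` compares only exprId, while `hashCode` also mixes
// in `name`, so equal objects can hash differently and hash-based
// collections silently misbehave.
class AttrRef(val name: String, val exprId: Long) {
  override def equals(other: Any): Boolean = other match {
    case a: AttrRef => exprId == a.exprId               // `name` ignored here...
    case _ => false
  }
  override def hashCode: Int = (name, exprId).hashCode  // ...but consulted here
}

val a = new AttrRef("col", 1)
val b = new AttrRef("COL", 1)
println(a == b)                          // true
println(a.hashCode == b.hashCode)        // false: the contract is broken
println(mutable.HashSet(a).contains(b))  // typically false, despite a == b
```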
[GitHub] spark pull request: [SPARK-8658] [SQL] [FOLLOW-UP] AttributeRefere...
Github user gatorsmile commented on the pull request: https://github.com/apache/spark/pull/9761#issuecomment-157434770

Ok. I will also add three more lines to cover the new `hashCode` and `equals` functions.
[GitHub] spark pull request: [SPARK-11072][SQL] simplify self join handling
Github user gatorsmile commented on the pull request: https://github.com/apache/spark/pull/9081#issuecomment-157440817

@cloud-fan I am wondering whether this will be merged soon. I am not sure whether I should fix a couple of self-join issues before your merge, or hold off until this PR is merged.
[GitHub] spark pull request: [SPARK-11433] [SQL] Cleanup the subquery name ...
Github user gatorsmile commented on the pull request: https://github.com/apache/spark/pull/9385#issuecomment-15565

Hi, @marmbrus After digging into the root cause of why the Expand cases failed, I found we still need a deeper cleanup of the subquery name after elimination. Let me use the following example to explain what happened in Expand. This query works only if we do not compare the qualifiers when comparing two AttributeReferences; I think it becomes a bug once https://github.com/apache/spark/pull/9216 is merged, right?

```scala
val sqlDF = sql("select a, b, sum(a) from mytable group by a, b with rollup").explain(true)
```

Before subquery elimination, the subquery name "mytable" is shown in both upper layers (Aggregate and Expand).

```scala
Aggregate [a#2,b#3,grouping__id#5], [a#2,b#3,sum(cast(a#2 as bigint)) AS _c2#4L]
 Expand [0,1,3], [a#2,b#3], grouping__id#5
  Subquery mytable
   Project [_1#0 AS a#2,_2#1 AS b#3]
    LocalRelation [_1#0,_2#1], [[1,2],[2,4]]
```

After subquery elimination, the subquery name "mytable" is not removed from these two upper layers.

```scala
Aggregate [a#2,b#3,grouping__id#5], [a#2,b#3,sum(cast(a#2 as bigint)) AS _c2#4L]
 Expand [0,1,3], [a#2,b#3], grouping__id#5
  Project [_1#0 AS a#2,_2#1 AS b#3]
   LocalRelation [_1#0,_2#1], [[1,2],[2,4]]
```

In SparkStrategies, we create an array of Projections for the child projection of Expand:

```scala
case e @ logical.Expand(_, _, _, child) =>
  execution.Expand(e.projections, e.output, planLater(child)) :: Nil
```

`e.projections` calls the function `expand()`. Inside `expand()`, I do not think we should use `semanticEquals`. Let me post the incorrect physical plan:

```scala
TungstenAggregate(key=[a#2,b#3,grouping__id#12], functions=[(sum(cast(a#2 as bigint)),mode=Final,isDistinct=false)], output=[a#2,b#3,_c2#11L])
 TungstenExchange hashpartitioning(a#2,b#3,grouping__id#12,5)
  TungstenAggregate(key=[a#2,b#3,grouping__id#12], functions=[(sum(cast(a#2 as bigint)),mode=Partial,isDistinct=false)], output=[a#2,b#3,grouping__id#12,currentSum#15L])
   Expand [List(a#2, b#3, 0),List(a#2, b#3, 1),List(a#2, b#3, 3)], [a#2,b#3,grouping__id#12]
    LocalTableScan [a#2,b#3], [[1,2],[2,4]]
```

For your convenience, below is the correct one:

```scala
TungstenAggregate(key=[a#2,b#3,grouping__id#12], functions=[(sum(cast(a#2 as bigint)),mode=Final,isDistinct=false)], output=[a#2,b#3,_c2#11L])
 TungstenExchange hashpartitioning(a#2,b#3,grouping__id#12,5)
  TungstenAggregate(key=[a#2,b#3,grouping__id#12], functions=[(sum(cast(a#2 as bigint)),mode=Partial,isDistinct=false)], output=[a#2,b#3,grouping__id#12,currentSum#15L])
   Expand [List(null, null, 0),List(a#2, null, 1),List(a#2, b#3, 3)], [a#2,b#3,grouping__id#12]
    LocalTableScan [a#2,b#3], [[1,2],[2,4]]
```

My current fix does not address this issue yet.
[GitHub] spark pull request: [SPARK-10838][SPARK-11576][SQL][WIP] Incorrect...
Github user gatorsmile commented on the pull request: https://github.com/apache/spark/pull/9548#issuecomment-155226523

@marmbrus Thank you for your suggestions! That was also my initial idea. I gave it a try last night. Unfortunately, I hit a problem when adding such a field to the `Column` API.

In the current design, the class `Column` corresponds to the class `Expression`, which includes both `AttributeReference` and other types. For `Column`, it makes sense to have such a dataFrame identifier. However, when a `Column` is generated from a binary expression type (e.g., `gt`), it could have more than one dataFrame identifier. Does that sound right to you?

When implementing the idea, it becomes more difficult. For example, in the following binary operator:

```scala
def === (other: Any): Column = {
  val right = lit(other).expr
  EqualTo(expr, right)
}
```

`EqualTo` is an `Expression`, and `expr` and `right` are not `Column`s. Thus, when accessing the `Column` generated from `===`, we are unable to know the dataFrame sources of `expr` and `right` unless we change `AttributeReference`. That is why I think this could mean a major code change to `DataFrame` and `Column`. Thank you for any further suggestions.
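To make the difficulty concrete, here is a hypothetical sketch of what the threading would require (all names invented for illustration; `sources` carries the originating dataFrames' hash codes, per the idea above):

```scala
import org.apache.spark.sql.catalyst.expressions.{EqualTo, Expression}

// Hypothetical wrapper, not the real API: each column carries the hash
// codes of the dataFrames it came from, and a binary operator such as ===
// has to union the source sets of both sides.
case class TrackedColumn(expr: Expression, sources: Set[Int]) {
  def === (other: TrackedColumn): TrackedColumn =
    TrackedColumn(EqualTo(expr, other.expr), sources ++ other.sources)
}
```

Even then, any plain `Expression` built outside such a wrapper would still lose the source information, which is why the change ultimately seems to reach down to `AttributeReference`.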
[GitHub] spark pull request: [SPARK-11275][SQL][WIP] Rollup and Cube Genera...
Github user gatorsmile commented on the pull request: https://github.com/apache/spark/pull/9419#issuecomment-153148409

@hvanhovell Your understanding is right. If we merge both grouping and aggregation together, it will introduce extra complexity into generating the logical plan for a case like `select a + b, b, sum(a - b), sum(a) from mytable group by a + b, b with rollup`. Of course, in theory, it is doable, but the code will be harder to maintain in the future. The extra Project will be collapsed by the optimizer; thus, in the analyzer, I just introduce the extra Project.

I am writing unit test cases. Will try to deliver them ASAP.
[GitHub] spark pull request: [SPARK-11275][SQL][WIP] Rollup and Cube Genera...
Github user gatorsmile commented on the pull request: https://github.com/apache/spark/pull/9419#issuecomment-153146798

@holdenk This is the PR I mentioned in the email. Could you review it too?
[GitHub] spark pull request: [SPARK-11275][SQL][WIP] Rollup and Cube Genera...
GitHub user gatorsmile opened a pull request: https://github.com/apache/spark/pull/9419

[SPARK-11275][SQL][WIP] Rollup and Cube Generates the Incorrect Results when Aggregation Functions Use Group By Columns

In the current implementation, Rollup and Cube are unable to generate correct results for the following cases.

When the aggregation functions use the group-by key columns:

```scala
sql("select b, a, sum(a), min(a), min(b+b) from mytable group by a, b with rollup").collect()
sql("select a, b, sum(a), min(a), min(b+b) from mytable group by b, a with cube").collect()
```

The problem becomes more complex if the group-by clause contains functions whose inputs also appear in the group by:

```scala
sql("select a + b, b, sum(a - b) from mytable group by a + b, b with rollup").collect()
sql("select a + b, b, sum(a - b) from mytable group by a + b, b with cube").collect()
```

The basic solution is to add an extra Projection when the query falls into one of the above situations. The projection duplicates the affected columns under alias names so that their values are not lost when Expand is evaluated at runtime (see the sketch after this message).

Working on the test cases. Will add more cases to the Hive golden answer files. Welcome any comment and suggestion! Thank you!

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/gatorsmile/spark rollupCube

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/9419.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #9419

commit b10418e161d5809f3b1de92cf4a33b2f362cd2b4
Author: Xiao Li <xiaoli@xiaos-imac.local>
Date: 2015-11-02T09:05:31Z

    Spark-11275

commit 7721442cdf65924af204d39e3b3b7bda6c41dfc6
Author: Xiao Li <xiaoli@xiaos-imac.local>
Date: 2015-11-02T09:51:16Z

    syntax cleaning
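In DataFrame terms, the injected projection corresponds to a hand-written aliasing step like the sketch below (illustration only; the analyzer performs this on the logical plan, and `sqlContext` is an assumed existing SQLContext):

```scala
import sqlContext.implicits._  // assumes an existing SQLContext named `sqlContext`

val mytable = Seq((1, 2), (2, 4)).toDF("a", "b")
// Duplicate the affected group-by column under an alias, so that Expand can
// substitute null for the grouping copy of `b` while an aggregate such as
// sum(a - b) still reads the original value through the aliased copy.
val withDup = mytable.select($"a", $"b", $"b".as("b_dup"))
```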
[GitHub] spark pull request: [SPARK-11275][SQL][WIP] Rollup and Cube Genera...
Github user gatorsmile commented on the pull request: https://github.com/apache/spark/pull/9419#issuecomment-153145967

Hi, Rick,

1) This is a defect I identified. It blocks my PR. It was introduced in the initial implementation; thus, it is not a regression.
2) I updated my PR summary with a few query examples.
3) It is limited to ResolveGroupingAnalytics; thus, the only affected queries are group-by queries. I tried to follow the original approach so that the coding style stays consistent. I am not sure whether I need to put more comments in the code.

Thanks, Xiao Li
[GitHub] spark pull request: [SPARK-11433] [SQL] Cleanup the subquery name ...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/9385#discussion_r43559826

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
@@ -1019,7 +1019,16 @@ class Analyzer(
  * scoping information for attributes and can be removed once analysis is complete.
  */
 object EliminateSubQueries extends Rule[LogicalPlan] {
-  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
+  def apply(plan: LogicalPlan): LogicalPlan = plan transformDown {
+    case Project(projectList, child: Subquery) => {
+      Project(
+        projectList.flatMap {
+          case ar: AttributeReference if ar.qualifiers.contains(child.alias) =>
--- End diff --

Should I use NamedExpression to replace AttributeReference?
[GitHub] spark pull request: [SPARK-11360] [Doc] Loss of nullability when w...
Github user gatorsmile commented on the pull request: https://github.com/apache/spark/pull/9314#issuecomment-152656067

@marmbrus : as you suggested, I submitted the pull request. Could you review it?

Thanks, Xiao Li
[GitHub] spark pull request: [SPARK-11433] [SQL] Cleanup the subquery name ...
GitHub user gatorsmile opened a pull request: https://github.com/apache/spark/pull/9385

[SPARK-11433] [SQL] Cleanup the subquery name after eliminating subquery

This fix removes the subquery name from the qualifiers after eliminating the subquery.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/gatorsmile/spark eliminateSubQ

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/9385.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #9385

commit db69ccf2c60679c4ca111a618190258d5b5cef62
Author: Xiao Li <xiaoli@xiaos-imac.local>
Date: 2015-10-30T22:12:19Z

    cleanup the subquery name after eliminating subquery
[GitHub] spark pull request: [SPARK-11433] [SQL] Cleanup the subquery name ...
Github user gatorsmile commented on the pull request: https://github.com/apache/spark/pull/9385#issuecomment-152708351

So far, I have only observed these strange leftover values when reading the optimized logical tree; my query did not trigger any actual issue. Based on my understanding, the usage of qualifiers is still limited in the current code base, but it could become a real issue when we support more complex SQL syntax/functions. Thus, I submitted this pull request to resolve it.
[GitHub] spark pull request: [SPARK-4226][SQL]Add subquery (not) in/exists ...
Github user gatorsmile commented on the pull request: https://github.com/apache/spark/pull/9055#issuecomment-153857289

@jameszhouyi We hit the same issue. Now, we bypass it by using joins.
[GitHub] spark pull request: [SPARK-4226][SQL]Add subquery (not) in/exists ...
Github user gatorsmile commented on the pull request: https://github.com/apache/spark/pull/9055#issuecomment-153920042

@jameszhouyi Agree. This is an important feature for any SQL engine. We are also waiting for it. So far, using joins is a workaround.
[GitHub] spark pull request: [SPARK-11275][SQL][WIP] Rollup and Cube Genera...
Github user gatorsmile commented on the pull request: https://github.com/apache/spark/pull/9419#issuecomment-153451771

@chenghao-intel @hvanhovell Unit test cases have been added. I will finish the code changes to resolve the comments from @holdenk @rick-ibm.

@rxin @marmbrus @liancheng @yhuai I am wondering whether my incremental, low-risk fix will be merged into Spark 1.6? If not, I personally prefer fixing all the bugs and improving on the solution by @aray (Andrew Ray). That solution simplifies the implementation of rollup and cube.
[GitHub] spark pull request: [SPARK-11433] [SQL] Cleanup the subquery name ...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/9385#discussion_r43817697

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
@@ -1019,7 +1019,16 @@ class Analyzer(
  * scoping information for attributes and can be removed once analysis is complete.
  */
 object EliminateSubQueries extends Rule[LogicalPlan] {
-  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
+  def apply(plan: LogicalPlan): LogicalPlan = plan transformDown {
--- End diff --

Thank you for your comments! If we do transformUp, the subquery is removed first, so the `Project(projectList, child: Subquery)` case is never applicable.
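For readers following along, here is a toy illustration of that ordering effect (simplified stand-ins, not the real Catalyst `TreeNode` API): with bottom-up traversal, the `Subquery` node is already gone by the time its parent `Project` is inspected, so the combined pattern can never fire.

```scala
// Toy stand-ins for Catalyst nodes (not the real classes), showing why the
// rule must be applied top-down.
sealed trait Plan {
  def children: Seq[Plan]
  def withChildren(cs: Seq[Plan]): Plan
  def transformDown(rule: PartialFunction[Plan, Plan]): Plan = {
    val applied = rule.applyOrElse(this, identity[Plan])
    applied.withChildren(applied.children.map(_.transformDown(rule)))
  }
  def transformUp(rule: PartialFunction[Plan, Plan]): Plan = {
    val applied = withChildren(children.map(_.transformUp(rule)))
    rule.applyOrElse(applied, identity[Plan])
  }
}
case class Relation(name: String) extends Plan {
  def children = Nil
  def withChildren(cs: Seq[Plan]) = this
}
case class Subquery(alias: String, child: Plan) extends Plan {
  def children = Seq(child)
  def withChildren(cs: Seq[Plan]) = copy(child = cs.head)
}
case class Project(cols: Seq[String], child: Plan) extends Plan {
  def children = Seq(child)
  def withChildren(cs: Seq[Plan]) = copy(child = cs.head)
}

// Clean "alias." qualifiers when a Project sits directly on a Subquery,
// and eliminate bare Subquery nodes otherwise.
val rule: PartialFunction[Plan, Plan] = {
  case Project(cols, Subquery(alias, grandchild)) =>
    Project(cols.map(_.stripPrefix(alias + ".")), grandchild)
  case Subquery(_, child) => child
}

val plan = Project(Seq("mytable.a"), Subquery("mytable", Relation("t")))
println(plan.transformDown(rule)) // Project(List(a),Relation(t)): qualifier cleaned
println(plan.transformUp(rule))   // Project(List(mytable.a),Relation(t)): Project case never matched
```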
[GitHub] spark pull request: [SPARK-6231][SQL/DF] Automatically resolve joi...
Github user gatorsmile commented on the pull request: https://github.com/apache/spark/pull/5919#issuecomment-154612297

@rxin @marmbrus This fix cannot resolve the condition ambiguity for nested self joins. I also found that self joins can generate incorrect results.
[GitHub] spark pull request: [SPARK-11433] [SQL] Cleanup the subquery name ...
Github user gatorsmile commented on the pull request: https://github.com/apache/spark/pull/9385#issuecomment-154650945

@marmbrus Thanks! I will try to change equals to semanticEquals in the pull request https://github.com/apache/spark/pull/9216. Then, you can decide whether this is the right solution.
[GitHub] spark pull request: [SPARK-11433] [SQL] Cleanup the subquery name ...
Github user gatorsmile commented on the pull request: https://github.com/apache/spark/pull/9385#issuecomment-154609690

@marmbrus I already hit this issue when resolving https://issues.apache.org/jira/browse/SPARK-8658. It means that, when comparing two AttributeReferences, we should not compare their qualifiers. That seems like a strange fix, right?
[GitHub] spark pull request: [SPARK-11433] [SQL] Cleanup the subquery name ...
Github user gatorsmile commented on the pull request: https://github.com/apache/spark/pull/9385#issuecomment-153529973

@cloud-fan @dbtsai, Jenkins did not start the tests. Could you trigger Jenkins to test it? Thank you!
[GitHub] spark pull request: [SPARK-11433] [SQL] Cleanup the subquery name ...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/9385#discussion_r43826123

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
@@ -1019,7 +1019,16 @@ class Analyzer(
  * scoping information for attributes and can be removed once analysis is complete.
  */
 object EliminateSubQueries extends Rule[LogicalPlan] {
-  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
+  def apply(plan: LogicalPlan): LogicalPlan = plan transformDown {
+    case Project(projectList, child: Subquery) => {
+      Project(
+        projectList.flatMap {
--- End diff --

Thank you! I did the change based on your suggestion. : )
[GitHub] spark pull request: [SPARK-8658] [SQL] AttributeReference's equals...
Github user gatorsmile commented on the pull request: https://github.com/apache/spark/pull/9216#issuecomment-153511090

@JoshRosen @cloud-fan I submitted a pull request for JIRA Spark-11275: https://github.com/apache/spark/pull/9419 Hopefully, after that problem is fixed, this one can pass all the tests. Thanks!
[GitHub] spark pull request: [SPARK-11433] [SQL] Cleanup the subquery name ...
Github user gatorsmile commented on the pull request: https://github.com/apache/spark/pull/9385#issuecomment-153620226

@dbtsai Thank you! Please let me know if you need any extra code change.
[GitHub] spark pull request: [SPARK-11275][SQL] Rollup and Cube Generates t...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/9419#discussion_r43850164

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
@@ -232,7 +232,7 @@ class Analyzer(
 // substitute the group by expressions.
 val newGroupByExprs = groupByExprPairs.map(_._2)
--- End diff --

Hi, @chenghao-intel, could you explain it a little more? So far, this query is processed correctly and returns a correct result. Since b is used inside an aggregate function, the fix added an extra column for b. Below is the generated plan:

```scala
== Analyzed Logical Plan ==
ab: bigint
Aggregate [b#3,grouping__id#12], [sum(cast((a#2 - b#3#13) as bigint)) AS ab#4L]
 Expand [0,1], [b#3], grouping__id#12
  Project [a#2,b#3,b#3 AS b#3#13]
   Subquery mytable
    Project [_1#0 AS a#2,_2#1 AS b#3]
     LocalRelation [_1#0,_2#1], [[1,2],[2,4],[2,9]]

== Optimized Logical Plan ==
Aggregate [b#3,grouping__id#12], [sum(cast((a#2 - b#3#13) as bigint)) AS ab#4L]
 Expand [0,1], [b#3], grouping__id#12
  LocalRelation [a#2,b#3,b#3#13], [[1,2,2],[2,4,4],[2,9,9]]
```
[GitHub] spark pull request: [SPARK-11275][SQL][WIP] Rollup and Cube Genera...
Github user gatorsmile commented on the pull request: https://github.com/apache/spark/pull/9419#issuecomment-153192515

@rick-ibm Will add more comments to explain it. In particular, I will emphasize that this design expects the optimizer to collapse these two projections into a single one. @chenghao-intel, could you also review the code changes? Does the solution look OK? I really appreciate your original work; it looks very concise to me. @holdenk Got it. Will follow your suggestions, do more code cleanup, and send you another review request. Thank you!
[GitHub] spark pull request: [SPARK-11275][SQL] Rollup and Cube Generates t...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/9419#discussion_r44107333

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
@@ -232,7 +232,7 @@ class Analyzer(
         // substitute the group by expressions.
         val newGroupByExprs = groupByExprPairs.map(_._2)
--- End diff --

@chenghao-intel, good catch! Thank you! This issue is fixed once you pull in the latest change. I also added two more test cases to cover it.
[GitHub] spark pull request: [SPARK-11633] [SQL] HiveContext's Case Insensi...
Github user gatorsmile commented on the pull request: https://github.com/apache/spark/pull/9762#issuecomment-157786406

@cloud-fan @marmbrus Will follow your suggestions to update the fix. Thanks!
[GitHub] spark pull request: [SPARK-11803][SQL] fix Dataset self-join
Github user gatorsmile commented on the pull request: https://github.com/apache/spark/pull/9806#issuecomment-157749632

Your code looks pretty clean to me. Let me share the test cases that still fail with this PR:

```
test("joinWith tuple - self join 1") {
  val ds = Seq(("a", 1), ("b", 2)).toDS()
  ds.joinWith(ds, $"_2" === $"_2").collect()
}

test("joinWith tuple - self join 2") {
  val ds1 = Seq(("a", 1), ("b", 2)).toDS()
  val ds2 = Seq(("a", 1), ("b", 2)).toDS().as("a")
  ds1.joinWith(ds2, $"_2" === $"a._2").collect()
}
```

Do you want me to send you a PR, or will you fix them? Thank you!
[GitHub] spark pull request: [SPARK-11803][SQL] fix Dataset self-join
Github user gatorsmile commented on the pull request: https://github.com/apache/spark/pull/9806#issuecomment-157807682

Sure. Will do. Thanks!

2015-11-18 10:16 GMT-08:00 Michael Armbrust <notificati...@github.com>:

> LGTM, merging to master and 1.6.
>
> @gatorsmile <https://github.com/gatorsmile> please open JIRAs targeted at
> 1.6.0 for the bugs you have found. (also use checkAnswer when writing
> test cases). Thanks!
>
> —
> Reply to this email directly or view it on GitHub
> <https://github.com/apache/spark/pull/9806#issuecomment-157806994>.
[GitHub] spark pull request: [SPARK-11633] [SQL] HiveContext's Case Insensi...
Github user gatorsmile commented on the pull request: https://github.com/apache/spark/pull/9762#issuecomment-157857086

@marmbrus @cloud-fan I made the change based on your comments; please review the new version. I also tried the fix after excluding the change in `attributeRewrites`, and the newly introduced test case still works fine; that means this should be the root cause. The fix still keeps the extra filter in `attributeRewrites`, since I think it avoids extra comparisons and replacements in the subsequent transformations. Please let me know if you want the filter removed. Thanks!
[GitHub] spark pull request: [SPARK-9928][SQL] Removal of LogicalLocalTable...
Github user gatorsmile commented on the pull request: https://github.com/apache/spark/pull/9717#issuecomment-156738825

retest this please.
[GitHub] spark pull request: [SPARK-11433] [SQL] Cleanup the subquery name ...
Github user gatorsmile commented on the pull request: https://github.com/apache/spark/pull/9385#issuecomment-156730810

@marmbrus CachedTableSuite failed for the same reason: we did not clean up the subquery names, so the check that decides whether an Exchange is needed could not give a correct result. I fixed it by using `semanticEquals`; please check whether the changes are appropriate. https://github.com/apache/spark/pull/9216 Now all the test cases pass. Thanks.
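As a hedged illustration of why `semanticEquals` helps here (Spark 1.6-era Catalyst API assumed): two attribute references that share an `exprId` but carry different qualifiers, such as a leftover subquery name, stop being `==` once equals compares all members, yet they still refer to the same column.

```scala
import org.apache.spark.sql.catalyst.expressions.AttributeReference
import org.apache.spark.sql.types.IntegerType

val a = AttributeReference("id", IntegerType)()      // no qualifier
val b = a.withQualifiers("mytable" :: Nil)           // same exprId, subquery-qualified

assert(!(a == b))           // equals now compares qualifiers too
assert(a semanticEquals b)  // but semantically it is the same column
```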
[GitHub] spark pull request: [SPARK-9928][SQL] Removal of LogicalLocalTable...
GitHub user gatorsmile opened a pull request: https://github.com/apache/spark/pull/9717

[SPARK-9928][SQL] Removal of LogicalLocalTable

LogicalLocalTable in ExistingRDD.scala was replaced by LocalRelation in LocalRelation.scala. Is there any reason why we still keep this class?

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/gatorsmile/spark LogicalLocalTable

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/9717.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #9717

commit 01e4cdfcfc4ac37644165923c6e8eb65fcfdf3ac
Author: gatorsmile <gatorsm...@gmail.com>
Date: 2015-11-13T22:50:39Z
Merge remote-tracking branch 'upstream/master'

commit e25b3785b923d44b3d48fe4100c4672d85787318
Author: gatorsmile <gatorsm...@gmail.com>
Date: 2015-11-14T02:43:37Z
Merge remote-tracking branch 'upstream/master' into LogicalLocalTable

commit 7555b76633fdeff6dd65c97be41e733cc28ba04c
Author: gatorsmile <gatorsm...@gmail.com>
Date: 2015-11-14T02:44:22Z
Merge branch 'master' of https://github.com/gatorsmile/spark into LogicalLocalTable merge.

commit 6835704c273abc13e8eda37f5a10715027e4d17b
Author: gatorsmile <gatorsm...@gmail.com>
Date: 2015-11-14T02:50:51Z
Merge remote-tracking branch 'upstream/master'

commit 3a7d4654e3c7e69bafbe14d7c5b6158666e36b0e
Author: gatorsmile <gatorsm...@gmail.com>
Date: 2015-11-14T16:59:31Z
Merge remote-tracking branch 'upstream/master' into LogicalLocalTable

commit cb49f8cb79b7f9fcca0b62b0645709c7c8c539dc
Author: gatorsmile <gatorsm...@gmail.com>
Date: 2015-11-14T16:59:56Z
Merge branch 'master' of https://github.com/gatorsmile/spark into LogicalLocalTable

commit 9180687775649f97763bdbd7c004fe6fc392989c
Author: gatorsmile <gatorsm...@gmail.com>
Date: 2015-11-14T17:01:59Z
Merge remote-tracking branch 'upstream/master'

commit 45ef950ff8c6c082e3fb1de84f85329060daf27c
Author: gatorsmile <gatorsm...@gmail.com>
Date: 2015-11-14T17:03:36Z
Merge branch 'master' of https://github.com/gatorsmile/spark into LogicalLocalTable

commit 195d176da9d4a58650690c2f5cc3ba27883b63ad
Author: gatorsmile <gatorsm...@gmail.com>
Date: 2015-11-14T17:45:22Z
SPARK-9928
[GitHub] spark pull request: [SPARK-9928][SQL] Removal of LogicalLocalTable...
Github user gatorsmile commented on the pull request: https://github.com/apache/spark/pull/9717#issuecomment-156738584

The failure of this test case is not related to the code changes.
[GitHub] spark pull request: [SPARK-9928][SQL] Removal of LogicalLocalTable...
Github user gatorsmile commented on the pull request: https://github.com/apache/spark/pull/9717#issuecomment-156739572

@srowen Could you review the changes? Thanks!
[GitHub] spark pull request: [SPARK-9928][SQL] Removal of LogicalLocalTable...
Github user gatorsmile commented on the pull request: https://github.com/apache/spark/pull/9717#issuecomment-156745623

Another case failed for the same reason:

```
[error] Test org.apache.spark.ml.util.JavaDefaultReadWriteSuite.testDefaultReadWrite failed: java.lang.IllegalStateException: Cannot call methods on a stopped SparkContext.
```

Timing issues? Or introduced by a recent merge?
[GitHub] spark pull request: [SPARK-11633] [SQL] HiveContext's Case Insensi...
GitHub user gatorsmile opened a pull request: https://github.com/apache/spark/pull/9762

[SPARK-11633] [SQL] HiveContext's Case Insensitivity in Self-Join Handling

When handling self-joins, the implementation did not take HiveContext's case insensitivity into account. It could cause an exception, as shown in the JIRA:

```
TreeNodeException: Failed to copy node.
```

The fix is low risk: it only avoids unnecessary attribute replacement, so it should not affect the existing behavior of self-joins. Also added a test case to cover this.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/gatorsmile/spark joinMakeCopy

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/9762.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #9762

commit 01e4cdfcfc4ac37644165923c6e8eb65fcfdf3ac
Author: gatorsmile <gatorsm...@gmail.com>
Date: 2015-11-13T22:50:39Z
Merge remote-tracking branch 'upstream/master'

commit 6835704c273abc13e8eda37f5a10715027e4d17b
Author: gatorsmile <gatorsm...@gmail.com>
Date: 2015-11-14T02:50:51Z
Merge remote-tracking branch 'upstream/master'

commit 9180687775649f97763bdbd7c004fe6fc392989c
Author: gatorsmile <gatorsm...@gmail.com>
Date: 2015-11-14T17:01:59Z
Merge remote-tracking branch 'upstream/master'

commit b38a21ef6146784e4b93ef4ce8c899f1eee14572
Author: gatorsmile <gatorsm...@gmail.com>
Date: 2015-11-17T02:30:26Z
SPARK-11633

commit d2b84af8cce7fc2c03c748a2d443c07bad3f0ed1
Author: gatorsmile <gatorsm...@gmail.com>
Date: 2015-11-17T02:32:12Z
Merge remote-tracking branch 'upstream/master' into joinMakeCopy

commit a15f267206215352f91f0699d813b0d71b15f11f
Author: gatorsmile <gatorsm...@gmail.com>
Date: 2015-11-17T03:40:41Z
scala style fix.

commit 7d48e1e95d39656317235d274b353c8645e3f93d
Author: gatorsmile <gatorsm...@gmail.com>
Date: 2015-11-17T04:55:24Z
Merge remote-tracking branch 'upstream/master' into joinMakeCopy
[GitHub] spark pull request: [SPARK-8658] [SQL] [FOLLOW-UP] AttributeRefere...
GitHub user gatorsmile opened a pull request: https://github.com/apache/spark/pull/9761

[SPARK-8658] [SQL] [FOLLOW-UP] AttributeReference's equals method compares all the members

Based on the comment from @cloud-fan, this updates AttributeReference's hashCode function to also include the hash codes of the other members: name, nullable, and qualifiers. I am not 100% sure whether we should include name in the hashCode calculation, since the original calculation does not include it. @marmbrus @cloud-fan Please review whether the changes are good.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/gatorsmile/spark hashCodeNamedExpression

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/9761.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #9761

commit eb63f097df5595cebf09954bcd188a87c5ebfdb0
Author: gatorsmile <gatorsm...@gmail.com>
Date: 2015-11-17T07:47:19Z
follow-up: SPARK-8658
[GitHub] spark pull request: [SPARK-12028] [SQL] get_json_object returns an...
GitHub user gatorsmile opened a pull request: https://github.com/apache/spark/pull/10018

[SPARK-12028] [SQL] get_json_object returns an incorrect result when the value is a null literal

When calling `get_json_object` for the following two cases, both results are the string `"null"`, but only the second case (where the value is the four-character string "null") should return that; in the first case the JSON value is the literal null, so the result should be SQL NULL:

```scala
val tuple: Seq[(String, String)] = ("5", """{"f1": null}""") :: Nil
val df: DataFrame = tuple.toDF("key", "jstring")
val res = df.select(functions.get_json_object($"jstring", "$.f1")).collect()
```

```scala
val tuple2: Seq[(String, String)] = ("5", """{"f1": "null"}""") :: Nil
val df2: DataFrame = tuple2.toDF("key", "jstring")
val res3 = df2.select(functions.get_json_object($"jstring", "$.f1")).collect()
```

Fixed the problem and also added a test case.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/gatorsmile/spark get_json_object

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/10018.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #10018

commit 06d9eae73e4b40a0451d1d21f7174aacbf05f780
Author: gatorsmile <gatorsm...@gmail.com>
Date: 2015-11-27T18:59:31Z
fixed a bug of get_json_object

commit 54edc84f21918c5cb69a0abfc51f680190f27a1f
Author: gatorsmile <gatorsm...@gmail.com>
Date: 2015-11-27T19:04:13Z
Merge remote-tracking branch 'upstream/master' into get_json_object
[GitHub] spark pull request: [SPARK-12195] [SQL] Adding BigDecimal, Date an...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/10188#discussion_r46917559

--- Diff: sql/core/src/test/java/test/org/apache/spark/sql/JavaDatasetSuite.java ---
@@ -386,6 +389,20 @@ public void testNestedTupleEncoder() {
   }

   @Test
+  public void testTypeEncoder() {
--- End diff --

Sure. Thank you! Let me change it now.
[GitHub] spark pull request: [SPARK-12091] [PYSPARK] [Minor] Default storag...
Github user gatorsmile commented on the pull request: https://github.com/apache/spark/pull/10092#issuecomment-161372323

@mateiz Thank you for your answer! Will try to do it soon.
[GitHub] spark pull request: [SPARK-12113] [SQL] Add some timing metrics fo...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/10116#discussion_r46501447

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/metric/SQLMetrics.scala ---
@@ -149,6 +149,32 @@ private[sql] object SQLMetrics {
   }

   /**
+   * Create a timing metric that reports duration in millis relative to startTime.
+   *
+   * The expected usage pattern is:
+   *   On the driver:
+   *     metric = createTimingMetric(..., System.currentTimeMillis)
+   *   On each executor:
+   *     <Do some work>
+   *     metric += System.currentTimeMillis
+   * The metric will then output the latest value across all the executors. This is a proxy for
+   * wall clock latency as it measures when the last executor finished this stage.
+   */
+  def createTimingMetric(sc: SparkContext, name: String, startTime: Long): LongSQLMetric = {
+    val stringValue = (values: Seq[Long]) => {
+      val validValues = values.filter(_ >= startTime)
+      if (validValues.isEmpty) {
+        // The clocks between the different machines are not perfectly synced so this can happen.
+        "0"
--- End diff --

This is a nice feature for performance investigation! Should we detect if the machine clocks are synced when starting Spark?
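A hedged usage sketch that just restates the doc comment in the diff above; the metric name is illustrative, and `SQLMetrics` is `private[sql]`, so this only compiles inside that package.

```scala
// Driver side: capture the stage start and create the metric.
val startTime = System.currentTimeMillis
val scanTime = SQLMetrics.createTimingMetric(sc, "scan time", startTime)

// Executor side, after finishing the partition's work: the accumulator
// keeps the latest finish time across executors, so the reported value
// approximates wall-clock stage latency relative to startTime.
scanTime += System.currentTimeMillis
```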
[GitHub] spark pull request: [SPARK-12091] [PYSPARK] Removal of the JAVA-sp...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/10092#discussion_r46520645

--- Diff: python/pyspark/storagelevel.py ---
@@ -49,12 +51,8 @@ def __str__(self):
 StorageLevel.DISK_ONLY = StorageLevel(True, False, False, False)
 StorageLevel.DISK_ONLY_2 = StorageLevel(True, False, False, False, 2)
-StorageLevel.MEMORY_ONLY = StorageLevel(False, True, False, True)
-StorageLevel.MEMORY_ONLY_2 = StorageLevel(False, True, False, True, 2)
-StorageLevel.MEMORY_ONLY_SER = StorageLevel(False, True, False, False)
--- End diff --

Agreed! Just updated the code with the deprecation notes, trying to follow the existing PySpark style. Please check whether they look good. : ) I am not sure whether this will be merged into 1.6; the note still says 1.6. Thank you!
[GitHub] spark pull request: [SPARK-12091] [PYSPARK] Removal of the JAVA-sp...
Github user gatorsmile commented on the pull request: https://github.com/apache/spark/pull/10092#issuecomment-161515703

Just saw the comments and will change the names soon. Thanks!
[GitHub] spark pull request: [SPARK-11269][SQL] Java API support & test cas...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/9358#discussion_r46657105

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/encoders/Encoder.scala ---
@@ -37,3 +37,120 @@ trait Encoder[T] extends Serializable {
   /** A ClassTag that can be used to construct and Array to contain a collection of `T`. */
   def clsTag: ClassTag[T]
 }
+
+object Encoder {
+  import scala.reflect.runtime.universe._
+
+  def BOOLEAN: Encoder[java.lang.Boolean] = ExpressionEncoder(flat = true)
+  def BYTE: Encoder[java.lang.Byte] = ExpressionEncoder(flat = true)
+  def SHORT: Encoder[java.lang.Short] = ExpressionEncoder(flat = true)
+  def INT: Encoder[java.lang.Integer] = ExpressionEncoder(flat = true)
+  def LONG: Encoder[java.lang.Long] = ExpressionEncoder(flat = true)
+  def FLOAT: Encoder[java.lang.Float] = ExpressionEncoder(flat = true)
+  def DOUBLE: Encoder[java.lang.Double] = ExpressionEncoder(flat = true)
+  def STRING: Encoder[java.lang.String] = ExpressionEncoder(flat = true)
+
+  def tuple[T1, T2](enc1: Encoder[T1], enc2: Encoder[T2]): Encoder[(T1, T2)] = {
+    tuple(Seq(enc1, enc2).map(_.asInstanceOf[ExpressionEncoder[_]]))
+      .asInstanceOf[ExpressionEncoder[(T1, T2)]]
+  }
+
+  def tuple[T1, T2, T3](
+      enc1: Encoder[T1],
+      enc2: Encoder[T2],
+      enc3: Encoder[T3]): Encoder[(T1, T2, T3)] = {
+    tuple(Seq(enc1, enc2, enc3).map(_.asInstanceOf[ExpressionEncoder[_]]))
+      .asInstanceOf[ExpressionEncoder[(T1, T2, T3)]]
+  }
+
+  def tuple[T1, T2, T3, T4](
+      enc1: Encoder[T1],
+      enc2: Encoder[T2],
+      enc3: Encoder[T3],
+      enc4: Encoder[T4]): Encoder[(T1, T2, T3, T4)] = {
+    tuple(Seq(enc1, enc2, enc3, enc4).map(_.asInstanceOf[ExpressionEncoder[_]]))
+      .asInstanceOf[ExpressionEncoder[(T1, T2, T3, T4)]]
+  }
+
+  def tuple[T1, T2, T3, T4, T5](
+      enc1: Encoder[T1],
+      enc2: Encoder[T2],
+      enc3: Encoder[T3],
+      enc4: Encoder[T4],
+      enc5: Encoder[T5]): Encoder[(T1, T2, T3, T4, T5)] = {
+    tuple(Seq(enc1, enc2, enc3, enc4, enc5).map(_.asInstanceOf[ExpressionEncoder[_]]))
+      .asInstanceOf[ExpressionEncoder[(T1, T2, T3, T4, T5)]]
+  }
+
+  private def tuple(encoders: Seq[ExpressionEncoder[_]]): ExpressionEncoder[_] = {
--- End diff --

Thank you!
[GitHub] spark pull request: [SPARK-11269][SQL] Java API support & test cas...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/9358#discussion_r46650956

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/encoders/Encoder.scala ---
@@ -37,3 +37,120 @@ trait Encoder[T] extends Serializable {
   /** A ClassTag that can be used to construct and Array to contain a collection of `T`. */
   def clsTag: ClassTag[T]
 }
+
+object Encoder {
+  import scala.reflect.runtime.universe._
+
+  def BOOLEAN: Encoder[java.lang.Boolean] = ExpressionEncoder(flat = true)
+  def BYTE: Encoder[java.lang.Byte] = ExpressionEncoder(flat = true)
+  def SHORT: Encoder[java.lang.Short] = ExpressionEncoder(flat = true)
+  def INT: Encoder[java.lang.Integer] = ExpressionEncoder(flat = true)
+  def LONG: Encoder[java.lang.Long] = ExpressionEncoder(flat = true)
+  def FLOAT: Encoder[java.lang.Float] = ExpressionEncoder(flat = true)
+  def DOUBLE: Encoder[java.lang.Double] = ExpressionEncoder(flat = true)
+  def STRING: Encoder[java.lang.String] = ExpressionEncoder(flat = true)
+
+  def tuple[T1, T2](enc1: Encoder[T1], enc2: Encoder[T2]): Encoder[(T1, T2)] = {
+    tuple(Seq(enc1, enc2).map(_.asInstanceOf[ExpressionEncoder[_]]))
+      .asInstanceOf[ExpressionEncoder[(T1, T2)]]
+  }
+
+  def tuple[T1, T2, T3](
+      enc1: Encoder[T1],
+      enc2: Encoder[T2],
+      enc3: Encoder[T3]): Encoder[(T1, T2, T3)] = {
+    tuple(Seq(enc1, enc2, enc3).map(_.asInstanceOf[ExpressionEncoder[_]]))
+      .asInstanceOf[ExpressionEncoder[(T1, T2, T3)]]
+  }
+
+  def tuple[T1, T2, T3, T4](
+      enc1: Encoder[T1],
+      enc2: Encoder[T2],
+      enc3: Encoder[T3],
+      enc4: Encoder[T4]): Encoder[(T1, T2, T3, T4)] = {
+    tuple(Seq(enc1, enc2, enc3, enc4).map(_.asInstanceOf[ExpressionEncoder[_]]))
+      .asInstanceOf[ExpressionEncoder[(T1, T2, T3, T4)]]
+  }
+
+  def tuple[T1, T2, T3, T4, T5](
+      enc1: Encoder[T1],
+      enc2: Encoder[T2],
+      enc3: Encoder[T3],
+      enc4: Encoder[T4],
+      enc5: Encoder[T5]): Encoder[(T1, T2, T3, T4, T5)] = {
+    tuple(Seq(enc1, enc2, enc3, enc4, enc5).map(_.asInstanceOf[ExpressionEncoder[_]]))
+      .asInstanceOf[ExpressionEncoder[(T1, T2, T3, T4, T5)]]
+  }
+
+  private def tuple(encoders: Seq[ExpressionEncoder[_]]): ExpressionEncoder[_] = {
--- End diff --

@cloud-fan, does that mean the limit will be 22? Do you think we should at least add it up to Tuple22, which is the limit of Scala?
[GitHub] spark pull request: [SPARK-11269][SQL] Java API support & test cas...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/9358#discussion_r46657310

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/encoders/Encoder.scala ---
@@ -37,3 +37,120 @@ trait Encoder[T] extends Serializable {
   /** A ClassTag that can be used to construct and Array to contain a collection of `T`. */
   def clsTag: ClassTag[T]
 }
+
+object Encoder {
+  import scala.reflect.runtime.universe._
+
+  def BOOLEAN: Encoder[java.lang.Boolean] = ExpressionEncoder(flat = true)
+  def BYTE: Encoder[java.lang.Byte] = ExpressionEncoder(flat = true)
+  def SHORT: Encoder[java.lang.Short] = ExpressionEncoder(flat = true)
+  def INT: Encoder[java.lang.Integer] = ExpressionEncoder(flat = true)
+  def LONG: Encoder[java.lang.Long] = ExpressionEncoder(flat = true)
+  def FLOAT: Encoder[java.lang.Float] = ExpressionEncoder(flat = true)
+  def DOUBLE: Encoder[java.lang.Double] = ExpressionEncoder(flat = true)
+  def STRING: Encoder[java.lang.String] = ExpressionEncoder(flat = true)
--- End diff --

@cloud-fan Could you share your thinking on why we do not add the other basic types, like DecimalType, DateType, and TimestampType? Thank you!

DecimalType -> java.math.BigDecimal
DateType -> java.sql.Date
TimestampType -> java.sql.Timestamp
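A hedged sketch of what those additions could look like inside the same `object Encoder`, following the pattern of the primitive encoders in the diff; the constant names are assumptions.

```scala
// Assumed additions mirroring the existing flat primitive encoders; the
// TypeTag needed by ExpressionEncoder comes from the runtime-universe
// import already in scope in this object.
def DECIMAL: Encoder[java.math.BigDecimal] = ExpressionEncoder(flat = true)
def DATE: Encoder[java.sql.Date] = ExpressionEncoder(flat = true)
def TIMESTAMP: Encoder[java.sql.Timestamp] = ExpressionEncoder(flat = true)
```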
[GitHub] spark pull request: [SPARK-12091] [PYSPARK] Removal of the JAVA-sp...
Github user gatorsmile commented on the pull request: https://github.com/apache/spark/pull/10092#issuecomment-161514977

- Removed all the constants whose `deserialized` values are true.
- Updated the comments of StorageLevel.
- Changed the default Kinesis storage level from `MEMORY_AND_DISK_2` to `MEMORY_AND_DISK_SER_2`.

Please verify whether my changes are OK. @mateiz @davies Thank you very much!
[GitHub] spark pull request: [SPARK-12091] [PYSPARK] Removal of the JAVA-sp...
Github user gatorsmile commented on the pull request: https://github.com/apache/spark/pull/10092#issuecomment-161522366

Based on the comments from @mateiz, the extra changes are:

- Renaming MEMORY_ONLY_SER to MEMORY_ONLY
- Renaming MEMORY_ONLY_SER_2 to MEMORY_ONLY_2
- Renaming MEMORY_AND_DISK_SER to MEMORY_AND_DISK
- Renaming MEMORY_AND_DISK_SER_2 to MEMORY_AND_DISK_2

Thanks!
[GitHub] spark pull request: [SPARK-12091] [PYSPARK] Deprecate the JAVA-spe...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/10092#discussion_r46522595

--- Diff: python/pyspark/storagelevel.py ---
@@ -49,12 +51,8 @@ def __str__(self):
 StorageLevel.DISK_ONLY = StorageLevel(True, False, False, False)
 StorageLevel.DISK_ONLY_2 = StorageLevel(True, False, False, False, 2)
-StorageLevel.MEMORY_ONLY = StorageLevel(False, True, False, True)
-StorageLevel.MEMORY_ONLY_2 = StorageLevel(False, True, False, True, 2)
-StorageLevel.MEMORY_ONLY_SER = StorageLevel(False, True, False, False)
--- End diff --

Sure. Just changed it. : )
[GitHub] spark pull request: [SPARK-12158] [R] [SQL] Fix 'sample' functions...
Github user gatorsmile commented on the pull request: https://github.com/apache/spark/pull/10160#issuecomment-162286394

@felixcheung @sun-rui Thank you! I made the changes based on your comments; please review them. : )
[GitHub] spark pull request: [SPARK-12158] [R] [SQL] Fix 'sample' functions...
Github user gatorsmile commented on the pull request: https://github.com/apache/spark/pull/10160#issuecomment-162328075

@felixcheung I am not sure if we need to add a test case for `sample`. Normally, using a specific seed is the common way to verify the result of `sample`. The existing test case may be enough?

```
sampled <- sample(df, FALSE, 1.0)
expect_equal(nrow(collect(sampled)), count(df))
```

If needed, maybe we can add something like the below:

```
repeat {
  if (count(sample(df, FALSE, 0.1)) != count(sample(df, FALSE, 0.1))) {
    break
  }
}
```
[GitHub] spark pull request: [SPARK-12164] [SQL] Display the binary/encoded...
GitHub user gatorsmile opened a pull request: https://github.com/apache/spark/pull/10165

[SPARK-12164] [SQL] Display the binary/encoded values

When the Dataset is Kryo-encoded, the existing display looks strange: rendering binary values as signed decimal bytes is uncommon.

```
implicit val kryoEncoder = Encoders.kryo[KryoClassData]
val ds = Seq(KryoClassData("a", 1), KryoClassData("b", 2), KryoClassData("c", 3)).toDS()
ds.show(20, false)
```

The output is like:

```
+--+
|value |
+--+
|[1, 0, 111, 114, 103, 46, 97, 112, 97, 99, 104, 101, 46, 115, 112, 97, 114, 107, 46, 115, 113, 108, 46, 75, 114, 121, 111, 67, 108, 97, 115, 115, 68, 97, 116, -31, 1, 1, -126, 97, 2]|
|[1, 0, 111, 114, 103, 46, 97, 112, 97, 99, 104, 101, 46, 115, 112, 97, 114, 107, 46, 115, 113, 108, 46, 75, 114, 121, 111, 67, 108, 97, 115, 115, 68, 97, 116, -31, 1, 1, -126, 98, 4]|
|[1, 0, 111, 114, 103, 46, 97, 112, 97, 99, 104, 101, 46, 115, 112, 97, 114, 107, 46, 115, 113, 108, 46, 75, 114, 121, 111, 67, 108, 97, 115, 115, 68, 97, 116, -31, 1, 1, -126, 99, 6]|
+--+
```

After the fix, it will be like the below:

```
++
|value |
++
|[01 00 6F 72 67 2E 61 70 61 63 68 65 2E 73 70 61 72 6B 2E 73 71 6C 2E 4B 72 79 6F 43 6C 61 73 73 44 61 74 E1 01 01 82 61 02]|
|[01 00 6F 72 67 2E 61 70 61 63 68 65 2E 73 70 61 72 6B 2E 73 71 6C 2E 4B 72 79 6F 43 6C 61 73 73 44 61 74 E1 01 01 82 62 04]|
|[01 00 6F 72 67 2E 61 70 61 63 68 65 2E 73 70 61 72 6B 2E 73 71 6C 2E 4B 72 79 6F 43 6C 61 73 73 44 61 74 E1 01 01 82 63 06]|
++
```

In addition, do we need to add a new method to decode and then display the contents?

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/gatorsmile/spark binaryOutput

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/10165.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #10165

commit f63c43519b2e8eeab9428397c519de1032e1ae45
Author: gatorsmile <gatorsm...@gmail.com>
Date: 2015-12-05T00:50:03Z
Merge remote-tracking branch 'upstream/master' into binaryOutput

commit 8754979da599743112f392250cee5606a3ce8864
Author: gatorsmile <gatorsm...@gmail.com>
Date: 2015-12-06T17:44:04Z
Displays the encoded content of the Dataset

commit 5d0d64c76772d8d8d1a164be130d61e0abb50352
Author: gatorsmile <gatorsm...@gmail.com>
Date: 2015-12-06T17:44:56Z
Merge remote-tracking branch 'upstream/master' into binaryOutput
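A minimal sketch of the hex rendering shown in the proposed output (two uppercase hex digits per byte, space-separated); an illustration, not the PR's exact code.

```scala
// Format a binary cell the way the proposed output renders it; for 'x'
// conversions java.util.Formatter treats a negative Byte as unsigned,
// so -31 prints as E1, matching the rows above.
def formatBinary(bytes: Array[Byte]): String =
  bytes.map("%02X".format(_)).mkString("[", " ", "]")

formatBinary(Array[Byte](1, 0, -31))  // "[01 00 E1]"
```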
[GitHub] spark pull request: [SPARK-12158] [R] [SQL] Fix 'sample' functions...
Github user gatorsmile commented on the pull request: https://github.com/apache/spark/pull/10160#issuecomment-162286420

ok to test
[GitHub] spark pull request: [SPARK-12150] [SQL] [Minor] Add range API with...
GitHub user gatorsmile opened a pull request: https://github.com/apache/spark/pull/10149

[SPARK-12150] [SQL] [Minor] Add range API without specifying the slice number

For usability, add another sqlContext.range() method. Users can specify start, end, and step without specifying the slice number; the slice number defaults to the sparkContext's defaultParallelism. This makes it consistent with the RDD range API.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/gatorsmile/spark range

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/10149.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #10149

commit 8c4bd8351f79db2ce2aebc8a641442ba882295b8
Author: gatorsmile <gatorsm...@gmail.com>
Date: 2015-12-04T20:23:36Z
range API with a default partition number

commit 6655b9d9515819cf81844c63c7105eb59882be12
Author: gatorsmile <gatorsm...@gmail.com>
Date: 2015-12-04T20:25:52Z
2.0->1.6

commit 72860c4e93de38da18ee13e46368493d04819094
Author: gatorsmile <gatorsm...@gmail.com>
Date: 2015-12-04T20:27:02Z
Merge remote-tracking branch 'upstream/master' into range
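A hedged usage sketch of the added overload; the exact signature is an assumption based on the description: start, end, and step, with the partition count defaulting to `sparkContext.defaultParallelism`.

```scala
// Three-argument form proposed by this PR: no slice count supplied, so
// the number of partitions falls back to sparkContext.defaultParallelism.
val df = sqlContext.range(0L, 1000L, 10L)
assert(df.count() == 100)  // rows 0, 10, ..., 990
```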
[GitHub] spark pull request: [SPARK-12164] [SQL] Display the binary/encoded...
Github user gatorsmile commented on the pull request: https://github.com/apache/spark/pull/10165#issuecomment-162400140

I have the exact same question when calling the show function. From the users' perspective, they might not care about the encoded values at all when calling `show`; the encoded values look weird to most users, I think.
[GitHub] spark pull request: [SPARK-12158] [SparkR] [SQL] Fix 'sample' func...
Github user gatorsmile commented on the pull request: https://github.com/apache/spark/pull/10160#issuecomment-162415274

@felixcheung @shivaram Sure, just added that test case. Please review it. Thank you! : )
[GitHub] spark pull request: [SPARK-12158] [SparkR] [SQL] Fix 'sample' func...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/10160#discussion_r46788421

--- Diff: R/pkg/R/DataFrame.R ---
@@ -677,13 +677,15 @@ setMethod("unique",
 #'   collect(sample(df, TRUE, 0.5))
 #' }
 setMethod("sample",
-          # TODO : Figure out how to send integer as java.lang.Long to JVM so
--- End diff --

True. Added it back.
[GitHub] spark pull request: [SPARK-12158] [SparkR] [SQL] Fix 'sample' func...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/10160#discussion_r46789803

--- Diff: R/pkg/R/DataFrame.R ---
@@ -692,8 +696,8 @@ setMethod("sample",
 setMethod("sample_frac",
           signature(x = "DataFrame", withReplacement = "logical", fraction = "numeric"),
-          function(x, withReplacement, fraction) {
-            sample(x, withReplacement, fraction)
+          function(x, withReplacement, fraction, seed) {
+            sample(x, withReplacement, fraction, as.integer(seed))
--- End diff --

Yeah, done! This is my first time reading and writing R. : ) Thank you!
[GitHub] spark pull request: [SPARK-12158] [SparkR] [SQL] Fix 'sample' func...
Github user gatorsmile commented on the pull request: https://github.com/apache/spark/pull/10160#issuecomment-162428939

@shivaram @felixcheung @sun-rui Thank you everyone! Hopefully, my code changes resolve all your concerns. I learned a lot from you! : )
[GitHub] spark pull request: [SPARK-12158] [SparkR] [SQL] Fix 'sample' func...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/10160#discussion_r46789542

--- Diff: R/pkg/R/DataFrame.R ---
@@ -692,8 +696,8 @@ setMethod("sample",
 setMethod("sample_frac",
           signature(x = "DataFrame", withReplacement = "logical", fraction = "numeric"),
-          function(x, withReplacement, fraction) {
-            sample(x, withReplacement, fraction)
+          function(x, withReplacement, fraction, seed) {
+            sample(x, withReplacement, fraction, as.integer(seed))
--- End diff --

Then you need to change the test case. If we do not use `as.integer(seed)`, we need to change the input type. For example,

```
sampled2 <- sample(df, FALSE, 0.1, 0)
```

needs to be changed to

```
sampled2 <- sample(df, FALSE, 0.1, 0L)
```
[GitHub] spark pull request: [SPARK-12150] [SQL] [Minor] Add range API with...
Github user gatorsmile commented on the pull request: https://github.com/apache/spark/pull/10149#issuecomment-163093870

@marmbrus @cloud-fan This PR changes the external API. I am not sure whether this will be merged now or whether we should revisit it after the 1.6 release. Thank you!
[GitHub] spark pull request: [SPARK-12164] [SQL] Decode the encoded values ...
GitHub user gatorsmile opened a pull request: https://github.com/apache/spark/pull/10215

[SPARK-12164] [SQL] Decode the encoded values and then display

Based on the suggestions from @marmbrus @cloud-fan in https://github.com/apache/spark/pull/10165, this PR prints the decoded values (user objects) in `Dataset.show`:

```
implicit val kryoEncoder = Encoders.kryo[KryoClassData]
val ds = Seq(KryoClassData("a", 1), KryoClassData("b", 2), KryoClassData("c", 3)).toDS()
ds.show(20, false)
```

The current output is like:

```
+--+
|value |
+--+
|[1, 0, 111, 114, 103, 46, 97, 112, 97, 99, 104, 101, 46, 115, 112, 97, 114, 107, 46, 115, 113, 108, 46, 75, 114, 121, 111, 67, 108, 97, 115, 115, 68, 97, 116, -31, 1, 1, -126, 97, 2]|
|[1, 0, 111, 114, 103, 46, 97, 112, 97, 99, 104, 101, 46, 115, 112, 97, 114, 107, 46, 115, 113, 108, 46, 75, 114, 121, 111, 67, 108, 97, 115, 115, 68, 97, 116, -31, 1, 1, -126, 98, 4]|
|[1, 0, 111, 114, 103, 46, 97, 112, 97, 99, 104, 101, 46, 115, 112, 97, 114, 107, 46, 115, 113, 108, 46, 75, 114, 121, 111, 67, 108, 97, 115, 115, 68, 97, 116, -31, 1, 1, -126, 99, 6]|
+--+
```

After the fix, it will be like the below:

```
+---+
|value |
+---+
|KryoClassData(a, 1)|
|KryoClassData(b, 2)|
|KryoClassData(c, 3)|
+---+
```

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/gatorsmile/spark showDecodedValue

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/10215.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #10215

commit 1e1ad1970a8bf3d9076165074f18ee7f28ab4acd
Author: gatorsmile <gatorsm...@gmail.com>
Date: 2015-12-09T04:08:17Z
show decoded values.
[GitHub] spark pull request: [SPARK-12188] [SQL] Code refactoring and comme...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/10184#discussion_r47046420

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
@@ -429,18 +432,18 @@ class Dataset[T] private[sql](
   /**
    * (Java-specific)
-   * Returns a [[GroupedDataset]] where the data is grouped by the given key function.
+   * Returns a [[GroupedDataset]] where the data is grouped by the given key `func`.
    * @since 1.6.0
    */
-  def groupBy[K](f: MapFunction[T, K], encoder: Encoder[K]): GroupedDataset[K, T] =
-    groupBy(f.call(_))(encoder)
+  def groupBy[K](func: MapFunction[T, K], encoder: Encoder[K]): GroupedDataset[K, T] =
--- End diff --

Sure, next time I will be careful. Thanks!
[GitHub] spark pull request: [SPARK-12188] [SQL] Code refactoring and comme...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/10184#discussion_r47046431

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
@@ -67,15 +67,21 @@ class Dataset[T] private[sql](
     tEncoder: Encoder[T]) extends Queryable with Serializable {

   /**
-   * An unresolved version of the internal encoder for the type of this dataset. This one is marked
-   * implicit so that we can use it when constructing new [[Dataset]] objects that have the same
-   * object type (that will be possibly resolved to a different schema).
+   * An unresolved version of the internal encoder for the type of this [[Dataset]]. This one is
+   * marked implicit so that we can use it when constructing new [[Dataset]] objects that have the
+   * same object type (that will be possibly resolved to a different schema).
    */
   private[sql] implicit val unresolvedTEncoder: ExpressionEncoder[T] = encoderFor(tEncoder)

   /** The encoder for this [[Dataset]] that has been resolved to its output schema. */
   private[sql] val resolvedTEncoder: ExpressionEncoder[T] =
-    unresolvedTEncoder.resolve(queryExecution.analyzed.output, OuterScopes.outerScopes)
+    unresolvedTEncoder.resolve(logicalPlan.output, OuterScopes.outerScopes)
+
+  /**
+   * The encoder where the expressions used to construct an object from an input row have been
+   * bound to the ordinals of the given schema.
--- End diff --

I see. Thank you!
[GitHub] spark pull request: [SPARK-12188] [SQL] Code refactoring and comme...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/10184#discussion_r47046739

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
@@ -67,15 +67,21 @@ class Dataset[T] private[sql](
     tEncoder: Encoder[T]) extends Queryable with Serializable {

   /**
-   * An unresolved version of the internal encoder for the type of this dataset. This one is marked
-   * implicit so that we can use it when constructing new [[Dataset]] objects that have the same
-   * object type (that will be possibly resolved to a different schema).
+   * An unresolved version of the internal encoder for the type of this [[Dataset]]. This one is
+   * marked implicit so that we can use it when constructing new [[Dataset]] objects that have the
+   * same object type (that will be possibly resolved to a different schema).
    */
   private[sql] implicit val unresolvedTEncoder: ExpressionEncoder[T] = encoderFor(tEncoder)

   /** The encoder for this [[Dataset]] that has been resolved to its output schema. */
   private[sql] val resolvedTEncoder: ExpressionEncoder[T] =
-    unresolvedTEncoder.resolve(queryExecution.analyzed.output, OuterScopes.outerScopes)
+    unresolvedTEncoder.resolve(logicalPlan.output, OuterScopes.outerScopes)
+
+  /**
+   * The encoder where the expressions used to construct an object from an input row have been
+   * bound to the ordinals of the given schema.
--- End diff --

Let me add the change in a follow-up PR. : )
[GitHub] spark pull request: [SPARK-12188] [SQL] [FOLLOW-UP] Code refactori...
GitHub user gatorsmile opened a pull request: https://github.com/apache/spark/pull/10214

[SPARK-12188] [SQL] [FOLLOW-UP] Code refactoring and comment correction in Dataset APIs

@marmbrus This PR is to address your comment. Thanks for your review!

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/gatorsmile/spark followup12188

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/10214.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #10214

commit 145cd5b5e5b0ad4a229e1621acaf26d02d25cd41
Author: gatorsmile <gatorsm...@gmail.com>
Date: 2015-12-09T03:10:33Z
address the comments.
[GitHub] spark pull request: [SPARK-12164] [SQL] Display the binary/encoded...
Github user gatorsmile commented on the pull request: https://github.com/apache/spark/pull/10165#issuecomment-163093753 Thank you! @cloud-fan Will this PR be merged into 1.6, or is it waiting on another PR for showing the decoded values? @marmbrus Thank you!
[GitHub] spark pull request: [SPARK-12158] [R] [SQL] Fix 'sample' functions...
GitHub user gatorsmile opened a pull request: https://github.com/apache/spark/pull/10160 [SPARK-12158] [R] [SQL] Fix 'sample' functions that break R unit test cases The existing `sample` functions are missing the parameter 'seed'; however, the corresponding function interface in `generics` does have such a parameter. Thus, although callers pass a 'seed' to the function, it is never used, which can cause SparkR unit tests to fail. For example, I hit it in another PR: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/47213/consoleFull You can merge this pull request into a Git repository by running: $ git pull https://github.com/gatorsmile/spark sampleR Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/10160.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #10160 commit ec770100452ca1a869058e448b1b41c8efb810d9 Author: gatorsmile <gatorsm...@gmail.com> Date: 2015-12-05T17:53:39Z add sample functions with seeds
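To make the failure mode concrete: a sampling API that silently drops the caller's seed is non-deterministic, so any test asserting on the sample's contents becomes flaky. Below is a minimal sketch of the contract being restored, written against the Spark 1.6 Scala DataFrame API rather than SparkR; the object name and local-mode setup are illustrative only.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object SampleSeedSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("sample-seed-sketch").setMaster("local[2]"))
    val sqlContext = new SQLContext(sc)

    val df = sqlContext.range(0, 100)  // a single-column DataFrame, "id"

    // With an explicit seed the sample is reproducible across invocations,
    // which is what deterministic unit tests rely on.
    val s1 = df.sample(withReplacement = false, fraction = 0.1, seed = 42L)
    val s2 = df.sample(withReplacement = false, fraction = 0.1, seed = 42L)
    assert(s1.collect().sameElements(s2.collect()))

    // Without a seed (or with a seed that is accepted but ignored, as in the
    // bug being fixed), each call may draw a different sample, so assertions
    // on the sample's exact size or contents can fail intermittently.
    val s3 = df.sample(withReplacement = false, fraction = 0.1)
    println(s3.count())

    sc.stop()
  }
}
```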
[GitHub] spark pull request: [SPARK-12138] [SQL] Escape \u in the generated...
Github user gatorsmile commented on the pull request: https://github.com/apache/spark/pull/10155#issuecomment-162140125 Weird... my code changes are unrelated to the failing SparkR test case:
```
count(sampled3) < 3 isn't true
```
[GitHub] spark pull request: [SPARK-12138] [SQL] Escape \u in the generated...
Github user gatorsmile commented on the pull request: https://github.com/apache/spark/pull/10155#issuecomment-162140292 retest this please
[GitHub] spark pull request: [SPARK-12138] [SQL] Escape \u in the generated...
Github user gatorsmile commented on the pull request: https://github.com/apache/spark/pull/10155#issuecomment-162148385 Found a bug in SparkR's `sample` function. Will submit a PR later. Thanks!
[GitHub] spark pull request: [SPARK-12158] [R] [SQL] Fix 'sample' functions...
Github user gatorsmile commented on the pull request: https://github.com/apache/spark/pull/10160#issuecomment-162241856 @davies Could you take a look at this PR? Thank you!
[GitHub] spark pull request: [WIP][SPARK-12069][SQL] Update documentation w...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/10060#discussion_r46624363

--- Diff: docs/sql-programming-guide.md ---
@@ -9,18 +9,51 @@ title: Spark SQL and DataFrames

 # Overview

-Spark SQL is a Spark module for structured data processing. It provides a programming abstraction called DataFrames and can also act as distributed SQL query engine.
+Spark SQL is a Spark module for structured data processing. Unlike the basic Spark RDD API, the interfaces provided
+by Spark SQL provide Spark with more about the structure of both the data and the computation being performed. Internally,
+Spark SQL uses this extra information to perform extra optimizations. There are several ways to
+interact with Spark SQL including SQL, the DataFrames API and the Datasets API. When computing a result
+the same execution engine is used, independent of which API/language you are using to express the
+computation. This unification means that developers can easily switch back and forth between the
+various APIs based on which provides the most natural way to express a given transformation.

-Spark SQL can also be used to read data from an existing Hive installation. For more on how to configure this feature, please refer to the [Hive Tables](#hive-tables) section.
+All of the examples on this page use sample data included in the Spark distribution and can be run in
+the `spark-shell`, `pyspark` shell, or `sparkR` shell.

-# DataFrames
+## SQL

-A DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood. DataFrames can be constructed from a wide array of sources such as: structured data files, tables in Hive, external databases, or existing RDDs.
+One use of Spark SQL is to execute SQL queries written using either a basic SQL syntax or HiveQL.
+Spark SQL can also be used to read data from an existing Hive installation. For more on how to
+configure this feature, please refer to the [Hive Tables](#hive-tables) section. When running
+SQL from within another programming language the results will be returned as a [DataFrame](#DataFrames).
+You can also interact with the SQL interface using the [command-line](#running-the-spark-sql-cli)
+or over [JDBC/ODBC](#running-the-thrift-jdbcodbc-server).

-The DataFrame API is available in [Scala](api/scala/index.html#org.apache.spark.sql.DataFrame), [Java](api/java/index.html?org/apache/spark/sql/DataFrame.html), [Python](api/python/pyspark.sql.html#pyspark.sql.DataFrame), and [R](api/R/index.html).
+## DataFrames

-All of the examples on this page use sample data included in the Spark distribution and can be run in the `spark-shell`, `pyspark` shell, or `sparkR` shell.
+A DataFrame is a distributed collection of data organized into named columns. It is conceptually
+equivalent to a table in a relational database or a data frame in R/Python, but with richer
+optimizations under the hood. DataFrames can be constructed from a wide array of [sources](#data-sources) such
+as: structured data files, tables in Hive, external databases, or existing RDDs.
+The DataFrame API is available in [Scala](api/scala/index.html#org.apache.spark.sql.DataFrame),
+[Java](api/java/index.html?org/apache/spark/sql/DataFrame.html),
+[Python](api/python/pyspark.sql.html#pyspark.sql.DataFrame), and [R](api/R/index.html).
+
+## Datasets
+
+A Dataset is a new experimental interface added in Spark 1.6 that tries to provide the benefits of
+RDDs (strong typing, ability to use powerful lambda functions) with the benifits of Spark SQL's
--- End diff --

benifits -> benefits
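Since the new documentation sections contrast these entry points, here is a minimal, self-contained sketch of the DataFrame/Dataset distinction they describe, using the Spark 1.6 Scala API; `Person`, the data, and the local-mode setup are illustrative only.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Hypothetical domain class; any case class with an implicit encoder works.
case class Person(name: String, age: Long)

object DataFrameVsDataset {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("df-vs-ds").setMaster("local[2]"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // DataFrame: rows with named columns; column references are untyped,
    // so a typo in "age" would only fail at runtime.
    val df = Seq(Person("Ann", 32), Person("Bob", 45)).toDF()
    df.filter($"age" > 40).show()

    // Dataset (experimental in 1.6): the same data, strongly typed, so the
    // lambda over Person is checked at compile time.
    val ds = df.as[Person]
    ds.filter(_.age > 40).collect().foreach(println)

    sc.stop()
  }
}
```

Both queries run through the same execution engine, which is exactly the unification point the revised overview makes.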