[GitHub] spark pull request #14326: [SPARK-3181] [ML] Implement RobustRegression with...

2016-07-23 Thread yanboliang
GitHub user yanboliang opened a pull request:

https://github.com/apache/spark/pull/14326

[SPARK-3181] [ML] Implement RobustRegression with huber loss.

## What changes were proposed in this pull request?
The current implementation is a straightforward port of Python 
scikit-learn's ```HuberRegressor```, so it produces the same results.
The code is meant for discussion, so please overlook trivial issues for now, 
since I think we may want a slightly different design for our Spark implementation.

Here are some major issues that should be discussed:
* Objective function.

We use Eq.(6) in [A robust hybrid of lasso and ridge 
regression](http://statweb.stanford.edu/~owen/reports/hhu.pdf) as the objective 
function.

![image](https://cloud.githubusercontent.com/assets/1962026/17076521/02a3f054-5069-11e6-895d-3c904e056ba2.png)
But its convention differs from that of other Spark ML code such as 
```LinearRegression``` in two respects:
• The loss is the total loss rather than the mean loss. In ```LinearRegression``` 
we use ```lossSum/weightSum``` as the mean loss.
• We do not multiply the loss function and the L2 regularization by 1/2. This 
is not a problem, since multiplying the whole formula by a positive constant 
does not change the minimizer.
So should we switch to a modified objective function like the following, 
which would be consistent with other Spark ML code?

![image](https://cloud.githubusercontent.com/assets/1962026/17076522/14eceb4e-5069-11e6-84ae-ecfaf3ea12ed.png)
* Implement a new class ```RobustRegression``` or a new loss function for 
```LinearRegression```.

Both ```LinearRegression``` and ```RobustRegression``` accomplish the same 
goal, but the output of ```fit``` differs: 
```LinearRegressionModel``` versus ```RobustRegressionModel```. The former 
contains only ```coefficients``` and ```intercept```, while the latter contains 
```coefficients```, ```intercept```, and ```scale/sigma``` (and possibly the outlier 
samples, similar to sklearn's ```HuberRegressor.outliers_```). Combining the two 
models into one would also raise save/load compatibility issues. 
One workaround is to drop ```scale/sigma``` so that ```fit``` with 
the huber cost function still outputs a ```LinearRegressionModel```, but I don't 
think that is appropriate since it would lose some model attributes. So I 
implemented ```RobustRegression``` as a new class, and we can port this loss 
function to ```LinearRegression``` later if needed. 
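
For readers who cannot see the images, the objective under discussion can be written out as follows. The first form is the one given in the scikit-learn documentation for ```HuberRegressor```, which this PR ports. The second is only an assumed reading of the proposed Spark-style variant (mean loss, and a factor of 1/2 on both the loss and the L2 term); the symbols $n$, $\alpha$, $\lambda$ and $\epsilon$ are chosen here for illustration, and the intercept is omitted for brevity.

```latex
% scikit-learn HuberRegressor objective: total loss, no 1/2 factor
\min_{w,\sigma}\; \sum_{i=1}^{n} \left( \sigma
  + H_{\epsilon}\!\left( \frac{x_i^{\top} w - y_i}{\sigma} \right) \sigma \right)
  + \alpha \lVert w \rVert_2^2,
\qquad
H_{\epsilon}(z) =
\begin{cases}
  z^2, & |z| < \epsilon \\
  2\epsilon |z| - \epsilon^2, & \text{otherwise}
\end{cases}

% Assumed Spark-style rescaling: mean loss and a 1/2 factor
\min_{w,\sigma}\; \frac{1}{2n} \sum_{i=1}^{n} \left( \sigma
  + H_{\epsilon}\!\left( \frac{x_i^{\top} w - y_i}{\sigma} \right) \sigma \right)
  + \frac{\lambda}{2} \lVert w \rVert_2^2
```

Multiplying the first objective by $1/(2n)$ yields the second with $\lambda = \alpha / n$, so under that mapping both have the same minimizer; only the meaning of the regularization parameter changes.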

## How was this patch tested?
Unit tests.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/yanboliang/spark spark-3181

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/14326.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #14326


commit 8fd0ca1954f964e89cf81379fdaff0844afd7253
Author: Yanbo Liang 
Date:   2016-07-23T06:54:58Z

Implement RobustRegression with huber loss.




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #14326: [SPARK-3181] [ML] Implement RobustRegression with huber ...

2016-07-23 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14326
  
**[Test build #62747 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62747/consoleFull)**
 for PR 14326 at commit 
[`8fd0ca1`](https://github.com/apache/spark/commit/8fd0ca1954f964e89cf81379fdaff0844afd7253).





[GitHub] spark pull request #3556: [SPARK-4693] [SQL] PruningPredicates may be wrong ...

2016-07-23 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/3556#discussion_r71968853
  
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveStrategies.scala ---
@@ -194,8 +194,9 @@ private[hive] trait HiveStrategies {
 // Filter out all predicates that only deal with partition keys, these are given to the
 // hive table scan operator to be used for partition pruning.
 val partitionKeyIds = AttributeSet(relation.partitionKeys)
-val (pruningPredicates, otherPredicates) = predicates.partition {
-  _.references.subsetOf(partitionKeyIds)
+val (pruningPredicates, otherPredicates) = predicates.partition { predicate =>
+  !predicate.references.isEmpty &&
--- End diff --

This line looks unnecessary in Spark 2.0.
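
For context on what the added emptiness check in the diff guards against, here is a minimal sketch using plain Scala sets rather than Catalyst classes. `Pred`, the column names, and the sample predicates are hypothetical, and since the diff is cut off after `&&`, the second conjunct is assumed to be the original `subsetOf` test. The point is that the empty set is a subset of every set, so a predicate that references no column passes `subsetOf` on its own.

```scala
object PruningPartitionSketch {
  // Hypothetical stand-in for a Catalyst predicate: a label plus the columns it references.
  final case class Pred(label: String, references: Set[String])

  val partitionKeyIds: Set[String] = Set("year", "month")

  val predicates: Seq[Pred] = Seq(
    Pred("year = 2016", Set("year")),
    Pred("value > 10", Set("value")),
    Pred("rand() < 0.5", Set.empty[String]) // references no column at all
  )

  // Old form: the reference-free predicate is classified as a pruning predicate,
  // because Set.empty.subsetOf(anything) is true.
  val (oldPruning, oldOther) =
    predicates.partition(_.references.subsetOf(partitionKeyIds))

  // Form in the diff (second conjunct assumed): also require a non-empty reference set.
  val (newPruning, newOther) = predicates.partition { p =>
    p.references.nonEmpty && p.references.subsetOf(partitionKeyIds)
  }
}
```

With the old form, `oldPruning` contains both `year = 2016` and `rand() < 0.5`; with the extra check, only `year = 2016` remains.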





[GitHub] spark issue #13756: [SPARK-16041][SQL] Disallow Duplicate Columns in partiti...

2016-07-23 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/13756
  
**[Test build #62746 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62746/consoleFull)**
 for PR 13756 at commit 
[`08b5374`](https://github.com/apache/spark/commit/08b5374e827f6680b4e4a00ed700ef689dce22ff).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #13756: [SPARK-16041][SQL] Disallow Duplicate Columns in partiti...

2016-07-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/13756
  
Merged build finished. Test PASSed.





[GitHub] spark issue #13756: [SPARK-16041][SQL] Disallow Duplicate Columns in partiti...

2016-07-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/13756
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/62746/
Test PASSed.





[GitHub] spark issue #14174: [SPARK-16524][SQL] Add RowBatch and RowBasedHashMapGener...

2016-07-23 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14174
  
**[Test build #62748 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62748/consoleFull)**
 for PR 14174 at commit 
[`bbaf568`](https://github.com/apache/spark/commit/bbaf5680e277d4d79f1710346807c1e4fb25ba93).





[GitHub] spark issue #14174: [SPARK-16524][SQL] Add RowBatch and RowBasedHashMapGener...

2016-07-23 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14174
  
**[Test build #62749 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62749/consoleFull)**
 for PR 14174 at commit 
[`7131a53`](https://github.com/apache/spark/commit/7131a536fe0605e9e04937e4f3ac1b13e37d7803).





[GitHub] spark issue #14326: [SPARK-3181] [ML] Implement RobustRegression with huber ...

2016-07-23 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14326
  
**[Test build #62747 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62747/consoleFull)**
 for PR 14326 at commit 
[`8fd0ca1`](https://github.com/apache/spark/commit/8fd0ca1954f964e89cf81379fdaff0844afd7253).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `class RobustRegression @Since("2.1.0") (@Since("2.1.0") override val uid: String)`





[GitHub] spark issue #14326: [SPARK-3181] [ML] Implement RobustRegression with huber ...

2016-07-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14326
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/62747/
Test PASSed.





[GitHub] spark issue #14326: [SPARK-3181] [ML] Implement RobustRegression with huber ...

2016-07-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14326
  
Merged build finished. Test PASSed.





[GitHub] spark issue #14295: [SPARK-16648][SQL] Overrides TreeNode.withNewChildren in...

2016-07-23 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14295
  
**[Test build #62750 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62750/consoleFull)**
 for PR 14295 at commit 
[`dd73681`](https://github.com/apache/spark/commit/dd7368169e60f84a8262866cda9946dd370aa11d).





[GitHub] spark issue #14295: [SPARK-16648][SQL] Overrides TreeNode.withNewChildren in...

2016-07-23 Thread liancheng
Github user liancheng commented on the issue:

https://github.com/apache/spark/pull/14295
  
Oh, that's a good point, should have realized both of them are affected. 
Updated. Thanks!





[GitHub] spark issue #14317: [SPARK-16380][EXAMPLES] Update SQL examples and programm...

2016-07-23 Thread liancheng
Github user liancheng commented on the issue:

https://github.com/apache/spark/pull/14317
  
@JoshRosen Would you mind taking a look at this? Thanks!





[GitHub] spark issue #14324: [SPARK-16664][SQL] Fix persist call on Data frames with ...

2016-07-23 Thread breakdawn
Github user breakdawn commented on the issue:

https://github.com/apache/spark/pull/14324
  
There is an 8118-column limit due to janino, with an exception like the 
following; that might be another story:
at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:889)
at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:941)
at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:938)
at com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599)
at com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2379)
... 25 more
Caused by: java.io.EOFException
at java.io.DataInputStream.readFully(DataInputStream.java:197)
at java.io.DataInputStream.readFully(DataInputStream.java:169)
at org.codehaus.janino.util.ClassFile.loadAttribute(ClassFile.java:1509)
at org.codehaus.janino.util.ClassFile.loadAttributes(ClassFile.java:644)
at org.codehaus.janino.util.ClassFile.loadFields(ClassFile.java:623)
at org.codehaus.janino.util.ClassFile.<init>(ClassFile.java:280)
at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$recordCompilationStats$1.apply(CodeGenerator.scala:914)
at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$recordCompilationStats$1.apply(CodeGenerator.scala:912)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.recordCompilationStats(CodeGenerator.scala:912)
at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:884)
... 29 more





[GitHub] spark issue #14320: [SPARK-16416] [Core] force eager creation of loggers to ...

2016-07-23 Thread mikaelstaldal
Github user mikaelstaldal commented on the issue:

https://github.com/apache/spark/pull/14320
  
It is, I just had to apply the same in several places.





[GitHub] spark issue #14174: [SPARK-16524][SQL] Add RowBatch and RowBasedHashMapGener...

2016-07-23 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14174
  
**[Test build #62748 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62748/consoleFull)**
 for PR 14174 at commit 
[`bbaf568`](https://github.com/apache/spark/commit/bbaf5680e277d4d79f1710346807c1e4fb25ba93).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #14174: [SPARK-16524][SQL] Add RowBatch and RowBasedHashMapGener...

2016-07-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14174
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/62748/
Test PASSed.





[GitHub] spark issue #14174: [SPARK-16524][SQL] Add RowBatch and RowBasedHashMapGener...

2016-07-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14174
  
Merged build finished. Test PASSed.





[GitHub] spark issue #14174: [SPARK-16524][SQL] Add RowBatch and RowBasedHashMapGener...

2016-07-23 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14174
  
**[Test build #62749 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62749/consoleFull)**
 for PR 14174 at commit 
[`7131a53`](https://github.com/apache/spark/commit/7131a536fe0605e9e04937e4f3ac1b13e37d7803).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #14174: [SPARK-16524][SQL] Add RowBatch and RowBasedHashMapGener...

2016-07-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14174
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/62749/
Test PASSed.





[GitHub] spark issue #14174: [SPARK-16524][SQL] Add RowBatch and RowBasedHashMapGener...

2016-07-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14174
  
Merged build finished. Test PASSed.





[GitHub] spark issue #14295: [SPARK-16648][SQL] Overrides TreeNode.withNewChildren in...

2016-07-23 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14295
  
**[Test build #62750 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62750/consoleFull)**
 for PR 14295 at commit 
[`dd73681`](https://github.com/apache/spark/commit/dd7368169e60f84a8262866cda9946dd370aa11d).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #14324: [SPARK-16664][SQL] Fix persist call on Data frames with ...

2016-07-23 Thread lw-lin
Github user lw-lin commented on the issue:

https://github.com/apache/spark/pull/14324
  
@breakdawn yes that's a different issue and I'm looking into it.

Regarding what this PR tries to fix, could you run this PR's change against 
[this test 
case](https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/execution/columnar/InMemoryColumnarQuerySuite.scala#L225)
 to see if more needs to be done?





[GitHub] spark issue #14295: [SPARK-16648][SQL] Overrides TreeNode.withNewChildren in...

2016-07-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14295
  
Merged build finished. Test PASSed.





[GitHub] spark issue #14295: [SPARK-16648][SQL] Overrides TreeNode.withNewChildren in...

2016-07-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14295
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/62750/
Test PASSed.





[GitHub] spark pull request #14327: [SPARK-16686][SQL] Project shouldn't be pushed do...

2016-07-23 Thread viirya
GitHub user viirya opened a pull request:

https://github.com/apache/spark/pull/14327

[SPARK-16686][SQL] Project shouldn't be pushed down through Sample if it 
has new output

## What changes were proposed in this pull request?

The `Optimizer` pushes `Project` down through `Sample`. However, if the 
projected columns produce new output, they are evaluated on the whole dataset 
instead of only the sampled data. This introduces an inconsistency between the 
original plan (Sample then Project) and the optimized plan (Project then Sample). 
In an extreme case such as the one attached to the JIRA, if the projected column 
is a UDF that is not supposed to see the sampled-out data, the result of the UDF 
will be incorrect.

We shouldn't push down `Project` through `Sample` if the `Project` produces new 
output.
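
The difference between the two plan orders can be illustrated with plain Scala collections. This is only a sketch under stated assumptions: `keep` is a hypothetical deterministic stand-in for sampling, `udf` stands in for the projected expression, and nothing here uses Spark's optimizer or its classes.

```scala
object SamplePushdownSketch {
  // Result rows plus the number of rows the projection function was applied to.
  final case class Outcome(output: Seq[Int], udfCalls: Int)

  // Original plan: Sample, then Project. The function sees only the sampled rows.
  def sampleThenProject(rows: Seq[Int], keep: Int => Boolean, udf: Int => Int): Outcome = {
    var calls = 0
    val out = rows.filter(keep).map { n => calls += 1; udf(n) }
    Outcome(out, calls)
  }

  // Plan after pushdown: Project, then Sample. The function sees every row,
  // including those that sampling later discards.
  def projectThenSample(rows: Seq[Int], keep: Int => Boolean, udf: Int => Int): Outcome = {
    var calls = 0
    val projected = rows.map { n => calls += 1; (n, udf(n)) }
    val out = projected.filter { case (n, _) => keep(n) }.map(_._2)
    Outcome(out, calls)
  }
}
```

For a pure function both orders return the same rows and differ only in how many rows the function is applied to. If the function throws or has side effects on a row that sampling would have removed, the two orders can diverge, which is the situation the PR description refers to.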

## How was this patch tested?

Jenkins tests.





You can merge this pull request into a Git repository by running:

$ git pull https://github.com/viirya/spark-1 fix-sample-pushdown

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/14327.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #14327


commit 9521a5aca87bead3dcfeabd7abe3468194984ea3
Author: Liang-Chi Hsieh 
Date:   2016-07-23T10:13:07Z

Project shouldn't be pushed down through Sample if it has new output.







[GitHub] spark issue #14327: [SPARK-16686][SQL] Project shouldn't be pushed down thro...

2016-07-23 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14327
  
**[Test build #62751 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62751/consoleFull)**
 for PR 14327 at commit 
[`9521a5a`](https://github.com/apache/spark/commit/9521a5aca87bead3dcfeabd7abe3468194984ea3).





[GitHub] spark pull request #14327: [SPARK-16686][SQL] Project shouldn't be pushed do...

2016-07-23 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/14327#discussion_r71971233
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala ---
@@ -422,6 +422,35 @@ class DatasetSuite extends QueryTest with SharedSQLContext {
   3, 17, 27, 58, 62)
   }
 
+  test("SPARK-16686: Dataset.sample with seed results shouldn't depend on downstream usage") {
+val udfOne = spark.udf.register("udfOne", (n: Int) => {
+  if (n == 1) {
+throw new RuntimeException("udfOne shouldn't see swid=1!")
--- End diff --

Use `require`? Generally `RuntimeException` isn't thrown directly. Really 
minor.
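
As a sketch of what that suggestion could look like: `require` from Scala's `Predef` throws an `IllegalArgumentException` whose message is prefixed with `requirement failed: `. The body of `udfOne` below is a hypothetical stand-in, since the diff above is cut off before the rest of the function.

```scala
import scala.util.Try

object RequireSketch {
  // Hypothetical stand-in for the UDF in the test: rejects the row that
  // sampling is expected to have removed, and passes other rows through.
  def udfOne(n: Int): Int = {
    require(n != 1, "udfOne shouldn't see swid=1!")
    n
  }

  // Helper for inspecting the failure without letting the exception escape.
  def failure(n: Int): Option[Throwable] = Try(udfOne(n)).failed.toOption
}
```

One thing to weigh is that `require` conventionally signals a bad argument from the caller, so whether it reads better than an explicit exception in a test is a matter of taste.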





[GitHub] spark pull request #14324: [SPARK-16664][SQL] Fix persist call on Data frame...

2016-07-23 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/14324#discussion_r71971259
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala ---
@@ -1571,4 +1571,12 @@ class DataFrameSuite extends QueryTest with SharedSQLContext {
 checkAnswer(joined, Row("x", null, null))
 checkAnswer(joined.filter($"new".isNull), Row("x", null, null))
   }
+
+  test("SPARK-16664: persist with more than 200 columns") {
+val size = 201l
--- End diff --

Nit: write 201L for a long literal; it's too easy to read this as 2011.
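
A small sketch of the suggested spelling, with hypothetical names: the uppercase suffix makes the type visible at a glance, and the `Long` type matters once the value takes part in arithmetic that would overflow an `Int`.

```scala
object LongLiteralSketch {
  // Uppercase suffix: clearly the Long value 201, not the Int 2011.
  val size = 201L

  // Long arithmetic: the Int operand is widened, so the product does not overflow.
  val product: Long = size * Int.MaxValue
}
```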





[GitHub] spark issue #14320: [SPARK-16416] [Core] force eager creation of loggers to ...

2016-07-23 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/14320
  
Should this just be necessary in the ShutdownHookManager?





[GitHub] spark issue #14216: [SPARK-16561][MLLib] fix multivarOnlineSummary min/max b...

2016-07-23 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/14216
  
Merged to master





[GitHub] spark issue #14301: [SPARK-16662][PySpark][SQL] fix HiveContext warning bug

2016-07-23 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/14301
  
Merged to master





[GitHub] spark issue #14242: Add a comment

2016-07-23 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/14242
  
@kzhang28 update or close this?





[GitHub] spark pull request #14301: [SPARK-16662][PySpark][SQL] fix HiveContext warni...

2016-07-23 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/14301





[GitHub] spark pull request #14216: [SPARK-16561][MLLib] fix multivarOnlineSummary mi...

2016-07-23 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/14216





[GitHub] spark issue #12983: [SPARK-15213][PySpark] Unify 'range' usages

2016-07-23 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/12983
  
Yeah, I can see that point; the change is ultimately a no-op. I'm neutral on
it, not much of a Python person myself.





[GitHub] spark issue #13986: [SPARK-16617] Upgrade to Avro 1.8.1

2016-07-23 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/13986
  
Have a look at `dev/test-dependencies.sh --replace-manifest`. I think the
big concern is matching the Hadoop dependency, which will be on 1.7.x for 2.x.
Updating to the latest 1.7.x seems OK. You can also test this change anyway
after making the deps change to see what happens.





[GitHub] spark issue #14320: [SPARK-16416] [Core] force eager creation of loggers to ...

2016-07-23 Thread mikaelstaldal
Github user mikaelstaldal commented on the issue:

https://github.com/apache/spark/pull/14320
  
I realized that it is necessary everywhere you register a shutdown hook,
if you log from within the shutdown hook.

Another way to solve it would be to refrain from logging from within
shutdown hooks.
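
A minimal sketch of the eager-creation idea, assuming `java.util.logging` as a stand-in for the logging backend Spark actually uses (the object and method names are hypothetical). The logger is obtained before the hook is registered, so the hook itself never has to create one during shutdown:

```scala
import java.util.logging.Logger

object ShutdownHookLoggingSketch {
  // Created when the object is initialized, i.e. before any hook is registered.
  private val log: Logger = Logger.getLogger("ShutdownHookLoggingSketch")

  // Registers a hook that only uses the already-created logger.
  def registerHook(): Thread = {
    val hook = new Thread(new Runnable {
      override def run(): Unit = log.info("shutting down")
    })
    Runtime.getRuntime.addShutdownHook(hook)
    hook
  }

  def main(args: Array[String]): Unit = {
    registerHook()
  }
}
```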





[GitHub] spark pull request #14328: Close old PRs that should be closed but have not ...

2016-07-23 Thread srowen
GitHub user srowen opened a pull request:

https://github.com/apache/spark/pull/14328

Close old PRs that should be closed but have not been

Closes #11598 
Closes #7278 
Closes #13882 
Closes #12053 
Closes #14125 
Closes #8760 
Closes #12848 
Closes #14224

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/srowen/spark CloseOldPRs

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/14328.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #14328


commit c5a50bd8f0947681f1cd2ceb2e14b6440f4f2ddc
Author: Sean Owen 
Date:   2016-07-23T11:51:20Z

Close old PRs that should be closed but have not been







[GitHub] spark issue #14328: [MINOR] Close old PRs that should be closed but have not...

2016-07-23 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14328
  
**[Test build #62752 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62752/consoleFull)** for PR 14328 at commit [`c5a50bd`](https://github.com/apache/spark/commit/c5a50bd8f0947681f1cd2ceb2e14b6440f4f2ddc).





[GitHub] spark pull request #14280: [SPARK-16515][SQL][FOLLOW-UP] Fix test `script` o...

2016-07-23 Thread lw-lin
Github user lw-lin commented on a diff in the pull request:

https://github.com/apache/spark/pull/14280#discussion_r71971600
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala 
---
@@ -64,14 +67,17 @@ class SQLQuerySuite extends QueryTest with SQLTestUtils with TestHiveSingleton {
   import spark.implicits._
 
   test("script") {
-    val df = Seq(("x1", "y1", "z1"), ("x2", "y2", "z2")).toDF("c1", "c2", "c3")
-    df.createOrReplaceTempView("script_table")
-    val query1 = sql(
-      """
-        |SELECT col1 FROM (from(SELECT c1, c2, c3 FROM script_table) tempt_table
-        |REDUCE c1, c2, c3 USING 'bash src/test/resources/test_script.sh' AS
-        |(col1 STRING, col2 STRING)) script_test_table""".stripMargin)
-    checkAnswer(query1, Row("x1_y1") :: Row("x2_y2") :: Nil)
+    if (testCommandAvailable("bash") && testCommandAvailable("echo | sed")) {
+      val df = Seq(("x1", "y1", "z1"), ("x2", "y2", "z2")).toDF("c1", "c2", "c3")
+      df.createOrReplaceTempView("script_table")
+      val query1 = sql(
+        """
+          |SELECT col1 FROM (from(SELECT c1, c2, c3 FROM script_table) tempt_table
+          |REDUCE c1, c2, c3 USING 'bash src/test/resources/test_script.sh' AS
+          |(col1 STRING, col2 STRING)) script_test_table""".stripMargin)
+      checkAnswer(query1, Row("x1_y1") :: Row("x2_y2") :: Nil)
+    }
+    // else skip this test
--- End diff --

The only change here was the if check; i.e.

if (testCommandAvailable("bash") && testCommandAvailable("echo | sed")) {
  // everything left unchanged
}
// else skip this test
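
For readers unfamiliar with the pattern, here is a rough sketch of such a guard. `commandAvailable` and `runIfAvailable` are hypothetical helpers written for this illustration; they are not the `testCommandAvailable` used in the PR, whose implementation isn't shown in this thread.

```scala
import scala.sys.process.{Process, ProcessLogger}
import scala.util.Try

object CommandGuardSketch {
  // True only if the command can be started and exits with status 0.
  // A missing executable makes run() throw, which Try turns into false.
  def commandAvailable(command: String): Boolean =
    Try(Process(command).run(ProcessLogger(_ => ())).exitValue() == 0).getOrElse(false)

  // Runs body only when every listed command is available; otherwise skips it.
  def runIfAvailable[T](commands: Seq[String])(body: => T): Option[T] =
    if (commands.forall(commandAvailable)) Some(body) else None

  def main(args: Array[String]): Unit = {
    val result = runIfAvailable(Seq("surely-not-a-real-command-xyz123")) {
      "test body ran"
    }
    println(result.getOrElse("skipped"))
  }
}
```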





[GitHub] spark issue #14280: [SPARK-16515][SQL][FOLLOW-UP] Fix test `script` on OS X/...

2016-07-23 Thread lw-lin
Github user lw-lin commented on the issue:

https://github.com/apache/spark/pull/14280
  
Maybe this is ready to go?





[GitHub] spark issue #14324: [SPARK-16664][SQL] Fix persist call on Data frames with ...

2016-07-23 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/14324
  
There are actually _55_ occurrences of this type of problem in the code 
base. I think I will open a PR separately to fix them. It might or might not 
cause a problem in practice in other cases, but many are in examples or tests, 
where we might not observe the consequence.





[GitHub] spark issue #14327: [SPARK-16686][SQL] Project shouldn't be pushed down thro...

2016-07-23 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14327
  
**[Test build #62751 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62751/consoleFull)** for PR 14327 at commit [`9521a5a`](https://github.com/apache/spark/commit/9521a5aca87bead3dcfeabd7abe3468194984ea3).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #14327: [SPARK-16686][SQL] Project shouldn't be pushed down thro...

2016-07-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14327
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/62751/
Test PASSed.





[GitHub] spark issue #14327: [SPARK-16686][SQL] Project shouldn't be pushed down thro...

2016-07-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14327
  
Merged build finished. Test PASSed.





[GitHub] spark issue #14320: [SPARK-16416] [Core] force eager creation of loggers to ...

2016-07-23 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/14320
  
OK, seems reasonable to me as is.





[GitHub] spark issue #14280: [SPARK-16515][SQL][FOLLOW-UP] Fix test `script` on OS X/...

2016-07-23 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/14280
  
LGTM





[GitHub] spark pull request #14327: [SPARK-16686][SQL] Project shouldn't be pushed do...

2016-07-23 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/14327#discussion_r71972546
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala 
---
@@ -422,6 +422,35 @@ class DatasetSuite extends QueryTest with SharedSQLContext {
   3, 17, 27, 58, 62)
   }
 
+  test("SPARK-16686: Dataset.sample with seed results shouldn't depend on downstream usage") {
+    val udfOne = spark.udf.register("udfOne", (n: Int) => {
+      if (n == 1) {
+        throw new RuntimeException("udfOne shouldn't see swid=1!")
--- End diff --

Thanks! I've updated it.





[GitHub] spark issue #14327: [SPARK-16686][SQL] Project shouldn't be pushed down thro...

2016-07-23 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14327
  
**[Test build #62753 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62753/consoleFull)** for PR 14327 at commit [`6d1616d`](https://github.com/apache/spark/commit/6d1616d41cc1158089ac0f38a6402a0fef58b191).





[GitHub] spark issue #14328: [MINOR] Close old PRs that should be closed but have not...

2016-07-23 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14328
  
**[Test build #62752 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62752/consoleFull)** for PR 14328 at commit [`c5a50bd`](https://github.com/apache/spark/commit/c5a50bd8f0947681f1cd2ceb2e14b6440f4f2ddc).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #14328: [MINOR] Close old PRs that should be closed but have not...

2016-07-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14328
  
Merged build finished. Test PASSed.





[GitHub] spark issue #14328: [MINOR] Close old PRs that should be closed but have not...

2016-07-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14328
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/62752/
Test PASSed.





[GitHub] spark pull request #14086: [SPARK-16463][SQL] Support `truncate` option in O...

2016-07-23 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/14086#discussion_r71973330
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCWriteSuite.scala ---
@@ -145,14 +153,24 @@ class JDBCWriteSuite extends SharedSQLContext with BeforeAndAfter {
     assert(2 === spark.read.jdbc(url, "TEST.APPENDTEST", new Properties()).collect()(0).length)
   }
 
-  test("CREATE then INSERT to truncate") {
+  test("Truncate") {
+    JdbcDialects.registerDialect(testH2Dialect)
     val df = spark.createDataFrame(sparkContext.parallelize(arr2x2), schema2)
     val df2 = spark.createDataFrame(sparkContext.parallelize(arr1x2), schema2)
+    val df3 = spark.createDataFrame(sparkContext.parallelize(arr2x3), schema3)
 
     df.write.jdbc(url1, "TEST.TRUNCATETEST", properties)
-    df2.write.mode(SaveMode.Overwrite).jdbc(url1, "TEST.TRUNCATETEST", properties)
+    df2.write.mode(SaveMode.Overwrite).option("truncate", true)
+      .jdbc(url1, "TEST.TRUNCATETEST", properties)
     assert(1 === spark.read.jdbc(url1, "TEST.TRUNCATETEST", properties).count())
     assert(2 === spark.read.jdbc(url1, "TEST.TRUNCATETEST", properties).collect()(0).length)
+
+    val m = intercept[SparkException] {

To check my understanding here, this overwrites the table with a different
schema (new column `seq`). This shows the truncate fails because the schema has
changed.

I guess it would be nice to at least test the case where the truncate works,
though we can't really test whether it truncates vs. drops.

Could you, for example, just repeat the code on lines 163-166 here to verify
that overwriting just produces the same results?





[GitHub] spark issue #14324: [SPARK-16664][SQL] Fix persist call on Data frames with ...

2016-07-23 Thread breakdawn
Github user breakdawn commented on the issue:

https://github.com/apache/spark/pull/14324
  
@lw-lin umm, thanks for pointing it out. Since the limit is 8117, 1
will fail; that case needs an update.





[GitHub] spark issue #14327: [SPARK-16686][SQL] Project shouldn't be pushed down thro...

2016-07-23 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14327
  
**[Test build #62753 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62753/consoleFull)** for PR 14327 at commit [`6d1616d`](https://github.com/apache/spark/commit/6d1616d41cc1158089ac0f38a6402a0fef58b191).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #14327: [SPARK-16686][SQL] Project shouldn't be pushed down thro...

2016-07-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14327
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/62753/
Test PASSed.





[GitHub] spark issue #14327: [SPARK-16686][SQL] Project shouldn't be pushed down thro...

2016-07-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14327
  
Merged build finished. Test PASSed.





[GitHub] spark issue #14194: [SPARK-16485][DOC][ML] Fixed several inline formatting i...

2016-07-23 Thread lins05
Github user lins05 commented on the issue:

https://github.com/apache/spark/pull/14194
  
@jkbradley Could you please take a look at this simple fix?





[GitHub] spark pull request #14242: Add a comment

2016-07-23 Thread kzhang28
Github user kzhang28 closed the pull request at:

https://github.com/apache/spark/pull/14242





[GitHub] spark issue #13986: [SPARK-16617] Upgrade to Avro 1.8.1

2016-07-23 Thread benmccann
Github user benmccann commented on the issue:

https://github.com/apache/spark/pull/13986
  
I'll close for now until Hadoop 3.x. Thanks





[GitHub] spark pull request #13986: [SPARK-16617] Upgrade to Avro 1.8.1

2016-07-23 Thread benmccann
Github user benmccann closed the pull request at:

https://github.com/apache/spark/pull/13986





[GitHub] spark issue #14242: Add a comment

2016-07-23 Thread kzhang28
Github user kzhang28 commented on the issue:

https://github.com/apache/spark/pull/14242
  
@srowen I closed it. Thank you for your kind reminder.





[GitHub] spark pull request #14326: [SPARK-3181] [ML] Implement RobustRegression with...

2016-07-23 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/14326#discussion_r71975650
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/regression/RobustRegression.scala ---
@@ -0,0 +1,466 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.regression
+
+import scala.collection.mutable
+
+import breeze.linalg.{DenseVector => BDV}
+import breeze.optimize.{CachedDiffFunction, DiffFunction, LBFGS => BreezeLBFGS, LBFGSB => BreezeLBFGSB}
+
+import org.apache.spark.SparkException
+import org.apache.spark.annotation.Since
+import org.apache.spark.internal.Logging
+import org.apache.spark.ml.PredictorParams
+import org.apache.spark.ml.feature.Instance
+import org.apache.spark.ml.linalg.{Vector, Vectors}
+import org.apache.spark.ml.linalg.BLAS._
+import org.apache.spark.ml.param.{DoubleParam, ParamMap, ParamValidators}
+import org.apache.spark.ml.param.shared._
+import org.apache.spark.ml.util._
+import org.apache.spark.mllib.linalg.VectorImplicits._
+import org.apache.spark.mllib.stat.MultivariateOnlineSummarizer
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.{Dataset, Row}
+import org.apache.spark.sql.functions._
+import org.apache.spark.storage.StorageLevel
+
+/**
+ * Params for robust regression.
+ */
+private[regression] trait RobustRegressionParams extends PredictorParams with HasRegParam
+  with HasMaxIter with HasTol with HasFitIntercept with HasStandardization with HasWeightCol {
+
+  /**
+   * The shape parameter to control the amount of robustness. Must be > 1.0.
+   * At larger values of M, the huber criterion becomes more similar to least squares regression;
+   * for small values of M, the criterion is more similar to L1 regression.
+   * Default is 1.35 to get as much robustness as possible while retaining
+   * 95% statistical efficiency for normally distributed data.
+   */
+  @Since("2.1.0")
+  final val m = new DoubleParam(this, "m", "The shape parameter to control the amount of " +
+    "robustness. Must be > 1.0.", ParamValidators.gt(1.0))
+
+  /** @group getParam */
+  @Since("2.1.0")
+  def getM: Double = $(m)
+}
+
+/**
+ * Robust regression.
+ *
+ * The learning objective is to minimize the huber loss, with regularization.
+ *
+ * The robust regression optimizes the squared loss for the samples where
+ * {{{ |\frac{(y - X \beta)}{\sigma}|\leq M }}}
+ * and the absolute loss for the samples where
+ * {{{ |\frac{(y - X \beta)}{\sigma}|\geq M }}},
+ * where \beta and \sigma are parameters to be optimized.
+ *
+ * This supports two types of regularization: None and L2.
+ *
+ * This estimator is different from the R implementation of Robust Regression
+ * ([[http://www.ats.ucla.edu/stat/r/dae/rreg.htm]]) because the R implementation does a
+ * weighted least squares implementation with weights given to each sample on the basis
+ * of how much the residual is greater than a certain threshold.
+ */
+@Since("2.1.0")
+class RobustRegression @Since("2.1.0") (@Since("2.1.0") override val uid: String)
+  extends Regressor[Vector, RobustRegression, RobustRegressionModel]
+  with RobustRegressionParams with Logging {
+
+  @Since("2.1.0")
+  def this() = this(Identifiable.randomUID("robReg"))
+
+  /**
+   * Sets the value of param [[m]].
+   * Default is 1.35.
+   * @group setParam
+   */
+  @Since("2.1.0")
+  def setM(value: Double): this.type = set(m, value)
+  setDefault(m -> 1.35)
+
+  /**
+   * Sets the regularization parameter.
+   * Default is 0.0.
+   * @group setParam
+   */
+  @Since("2.1.0")
+  def setRegParam(value: Double): this.type = set(regParam, value)
+  setDefault(regParam -> 0.0)
+
+  /**
+   * Sets if we should fit the intercept.
+   * Default is true.
+   * @group setParam
+   */
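
The scaladoc quoted above describes the objective only in words. As a rough illustration, here is a minimal Python sketch of that objective, assuming the form documented for scikit-learn's `HuberRegressor` (which the PR description says this code ports): a per-sample term `sigma + H_M(r / sigma) * sigma` plus an L2 penalty, with total rather than mean loss and no 1/2 factor. The names `huber` and `huber_objective` are illustrative and do not appear in the PR.

```python
def huber(z, m=1.35):
    """H_M(z): squared loss for |z| <= m, linear (absolute-style) loss beyond it."""
    if abs(z) <= m:
        return z * z
    # The linear branch is chosen so that the two pieces meet at |z| == m.
    return 2.0 * m * abs(z) - m * m


def huber_objective(xs, ys, beta, intercept, sigma, m=1.35, reg_param=0.0):
    """Total (not mean) Huber loss with an L2 penalty; sigma is the scale parameter."""
    total = 0.0
    for x, y in zip(xs, ys):
        prediction = sum(b * xi for b, xi in zip(beta, x)) + intercept
        residual = y - prediction
        total += sigma + huber(residual / sigma, m) * sigma
    return total + reg_param * sum(b * b for b in beta)
```

With `m` close to 1 most residuals fall in the linear branch, so outliers contribute far less than they would under a squared loss; a large `m` makes the objective behave like least squares.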
 

[GitHub] spark issue #14328: [MINOR] Close old PRs that should be closed but have not...

2016-07-23 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/14328
  
LGTM





[GitHub] spark issue #14318: [SPARK-16690][TEST] rename SQLTestUtils.withTempTable to...

2016-07-23 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/14318
  
Merging in master/2.0.






[GitHub] spark issue #14098: [SPARK-16380][SQL][Example]:Update SQL examples and prog...

2016-07-23 Thread wangmiao1981
Github user wangmiao1981 commented on the issue:

https://github.com/apache/spark/pull/14098
  
@liancheng Thanks! I will review PR #14317.





[GitHub] spark issue #14317: [SPARK-16380][EXAMPLES] Update SQL examples and programm...

2016-07-23 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/14317
  
Merging in master/2.0.






[GitHub] spark pull request #14318: [SPARK-16690][TEST] rename SQLTestUtils.withTempT...

2016-07-23 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/14318





[GitHub] spark issue #14318: [SPARK-16690][TEST] rename SQLTestUtils.withTempTable to...

2016-07-23 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/14318
  
I'm going to cherry-pick this into branch-2.0 to avoid conflicts in bug fixes.






[GitHub] spark pull request #14317: [SPARK-16380][EXAMPLES] Update SQL examples and p...

2016-07-23 Thread wangmiao1981
Github user wangmiao1981 commented on a diff in the pull request:

https://github.com/apache/spark/pull/14317#discussion_r71976352
  
--- Diff: docs/sql-programming-guide.md ---
@@ -79,7 +79,7 @@ The entry point into all functionality in Spark is the [`SparkSession`](api/java
 
 The entry point into all functionality in Spark is the [`SparkSession`](api/python/pyspark.sql.html#pyspark.sql.SparkSession) class. To create a basic `SparkSession`, just use `SparkSession.builder`:
 
-{% include_example init_session python/sql.py %}
+{% include_example init_session python/sql/basic.py %}
--- End diff --

The file name is not consistent with the Scala and Java versions, which are
named SparkSQLExample.scala and SparkSQLExample.java. The file names of the
Hive and Data Source examples are not consistent either.





[GitHub] spark pull request #14317: [SPARK-16380][EXAMPLES] Update SQL examples and p...

2016-07-23 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/14317





[GitHub] spark pull request #14307: [SPARK-16672][SQL] SQLBuilder should not raise ex...

2016-07-23 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/14307#discussion_r71976388
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/SQLBuilderSuite.scala ---
@@ -0,0 +1,33 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.hive
+
+import org.apache.spark.sql.QueryTest
+import org.apache.spark.sql.catalyst.SQLBuilder
+import org.apache.spark.sql.hive.test.TestHiveSingleton
+import org.apache.spark.sql.test.SQLTestUtils
+
+class SQLBuilderSuite extends QueryTest with SQLTestUtils with TestHiveSingleton {
--- End diff --

LogicalPlanToSQLSuite?





[GitHub] spark pull request #14317: [SPARK-16380][EXAMPLES] Update SQL examples and p...

2016-07-23 Thread wangmiao1981
Github user wangmiao1981 commented on a diff in the pull request:

https://github.com/apache/spark/pull/14317#discussion_r71976400
  
--- Diff: examples/src/main/python/sql/basic.py ---
@@ -0,0 +1,194 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+from __future__ import print_function
+
+# $example on:init_session$
+from pyspark.sql import SparkSession
+# $example off:init_session$
+
+# $example on:schema_inferring$
+from pyspark.sql import Row
+# $example off:schema_inferring$
+
+# $example on:programmatic_schema$
+# Import data types
+from pyspark.sql.types import *
+# $example off:programmatic_schema$
+
+"""
+A simple example demonstrating basic Spark SQL features.
+Run with:
+  ./bin/spark-submit examples/src/main/python/sql/basic.py
+"""
+
+
+def basic_df_example(spark):
+    # $example on:create_df$
+    # spark is an existing SparkSession
+    df = spark.read.json("examples/src/main/resources/people.json")
+    # Displays the content of the DataFrame to stdout
+    df.show()
+    # +----+-------+
+    # | age|   name|
+    # +----+-------+
+    # |null|Michael|
+    # |  30|   Andy|
+    # |  19| Justin|
+    # +----+-------+
+    # $example off:create_df$
+
+    # $example on:untyped_ops$
+    # spark, df are from the previous example
+    # Print the schema in a tree format
+    df.printSchema()
+    # root
+    # |-- age: long (nullable = true)
+    # |-- name: string (nullable = true)
+
+    # Select only the "name" column
+    df.select("name").show()
+    # +-------+
+    # |   name|
+    # +-------+
+    # |Michael|
+    # |   Andy|
+    # | Justin|
+    # +-------+
+
+    # Select everybody, but increment the age by 1
+    df.select(df['name'], df['age'] + 1).show()
--- End diff --

Do you want to use `col('...')`? I have tested it and it works.





[GitHub] spark pull request #14317: [SPARK-16380][EXAMPLES] Update SQL examples and p...

2016-07-23 Thread wangmiao1981
Github user wangmiao1981 commented on a diff in the pull request:

https://github.com/apache/spark/pull/14317#discussion_r71976456
  
--- Diff: examples/src/main/python/sql/datasource.py ---
@@ -0,0 +1,154 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+from __future__ import print_function
+
+from pyspark.sql import SparkSession
+# $example on:schema_merging$
+from pyspark.sql import Row
+# $example off:schema_merging$
+
+"""
+A simple example demonstrating Spark SQL data sources.
+Run with:
+  ./bin/spark-submit examples/src/main/python/sql/datasource.py
+"""
+
+
+def basic_datasource_example(spark):
+    # $example on:generic_load_save_functions$
+    df = spark.read.load("examples/src/main/resources/users.parquet")
+    df.select("name", "favorite_color").write.save("namesAndFavColors.parquet")
+    # $example off:generic_load_save_functions$
+
+    # $example on:manual_load_options$
+    df = spark.read.load("examples/src/main/resources/people.json", format="json")
+    df.select("name", "age").write.save("namesAndAges.parquet", format="parquet")
+    # $example off:manual_load_options$
+
+    # $example on:direct_sql$
+    df = spark.sql("SELECT * FROM parquet.`examples/src/main/resources/users.parquet`")
+    # $example off:direct_sql$
+
+
+def parquet_example(spark):
+    # $example on:basic_parquet_example$
+    peopleDF = spark.read.json("examples/src/main/resources/people.json")
+
+    # DataFrames can be saved as Parquet files, maintaining the schema information.
+    peopleDF.write.parquet("people.parquet")
+
+    # Read in the Parquet file created above.
+    # Parquet files are self-describing so the schema is preserved.
+    # The result of loading a parquet file is also a DataFrame.
+    parquetFile = spark.read.parquet("people.parquet")
+
+    # Parquet files can also be used to create a temporary view and then used in SQL statements.
+    parquetFile.createOrReplaceTempView("parquetFile")
+    teenagers = spark.sql("SELECT name FROM parquetFile WHERE age >= 13 AND age <= 19")
+    teenagers.show()
+    # +------+
+    # |  name|
+    # +------+
+    # |Justin|
+    # +------+
+    # $example off:basic_parquet_example$
+
+
+def parquet_schema_merging_example(spark):
+    # $example on:schema_merging$
+    # spark is from the previous example.
+    # Create a simple DataFrame, stored into a partition directory
+    sc = spark.sparkContext
+
+    squaresDF = spark.createDataFrame(sc.parallelize(range(1, 6))
+                                      .map(lambda i: Row(single=i, double=i ** 2)))
+    squaresDF.write.parquet("data/test_table/key=1")
+
+    # Create another DataFrame in a new partition directory,
+    # adding a new column and dropping an existing column
+    cubesDF = spark.createDataFrame(sc.parallelize(range(6, 11))
+                                    .map(lambda i: Row(single=i, triple=i ** 3)))
+    cubesDF.write.parquet("data/test_table/key=2")
+
+    # Read the partitioned table
+    mergedDF = spark.read.option("mergeSchema", "true").parquet("data/test_table")
+    mergedDF.printSchema()
+
+    # The final schema consists of all 3 columns in the Parquet files together
+    # with the partitioning column appeared in the partition directory paths.
+    # root
+    # |-- double: long (nullable = true)
+    # |-- single: long (nullable = true)
+    # |-- triple: long (nullable = true)
+    # |-- key: integer (nullable = true)
+    # $example off:schema_merging$
+
+
+def json_dataset_example(spark):
+    # $example on:json_dataset$
+    # spark is from the previous example.
+    sc = spark.sparkContext
+
+    # A JSON dataset is pointed to by path.
+    # The path can be either a single text file or a directory storing text files
+    path = "examples/src/main/resources/people.json

[GitHub] spark issue #14098: [SPARK-16380][SQL][Example]:Update SQL examples and prog...

2016-07-23 Thread wangmiao1981
Github user wangmiao1981 commented on the issue:

https://github.com/apache/spark/pull/14098
  
Since #14317 has been merged, I am closing this PR.





[GitHub] spark pull request #14098: [SPARK-16380][SQL][Example]:Update SQL examples a...

2016-07-23 Thread wangmiao1981
Github user wangmiao1981 closed the pull request at:

https://github.com/apache/spark/pull/14098





[GitHub] spark pull request #14086: [SPARK-16463][SQL] Support `truncate` option in O...

2016-07-23 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/14086#discussion_r71976641
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCWriteSuite.scala ---
@@ -145,14 +153,24 @@ class JDBCWriteSuite extends SharedSQLContext with BeforeAndAfter {
     assert(2 === spark.read.jdbc(url, "TEST.APPENDTEST", new Properties()).collect()(0).length)
   }
 
-  test("CREATE then INSERT to truncate") {
+  test("Truncate") {
+    JdbcDialects.registerDialect(testH2Dialect)
     val df = spark.createDataFrame(sparkContext.parallelize(arr2x2), schema2)
     val df2 = spark.createDataFrame(sparkContext.parallelize(arr1x2), schema2)
+    val df3 = spark.createDataFrame(sparkContext.parallelize(arr2x3), schema3)
 
     df.write.jdbc(url1, "TEST.TRUNCATETEST", properties)
-    df2.write.mode(SaveMode.Overwrite).jdbc(url1, "TEST.TRUNCATETEST", properties)
+    df2.write.mode(SaveMode.Overwrite).option("truncate", true)
+      .jdbc(url1, "TEST.TRUNCATETEST", properties)
     assert(1 === spark.read.jdbc(url1, "TEST.TRUNCATETEST", properties).count())
     assert(2 === spark.read.jdbc(url1, "TEST.TRUNCATETEST", properties).collect()(0).length)
+
+    val m = intercept[SparkException] {
--- End diff --

Sure, that would be better.
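
For readers skimming this thread, the difference the `truncate` option makes in the test above can be sketched outside Spark. The following Python example uses the standard-library `sqlite3` module as a stand-in for the JDBC target. It assumes that truncating means emptying the existing table while keeping its definition, and that a plain overwrite drops and recreates it. The helper names `overwrite` and `table_ddl` are hypothetical, and what the truncated `intercept[SparkException]` block checks is not visible here, so the schema-mismatch behaviour mentioned after the code is only an analogy.

```python
import sqlite3


def overwrite(conn, table, column_ddl, rows, truncate=False):
    """Replace the contents of `table` with `rows` (a non-empty list of tuples)."""
    cur = conn.cursor()
    if truncate:
        # Keep the existing table definition; only remove its rows.
        cur.execute(f"DELETE FROM {table}")
    else:
        # Discard the old definition and recreate the table from scratch.
        cur.execute(f"DROP TABLE IF EXISTS {table}")
        cur.execute(f"CREATE TABLE {table} ({column_ddl})")
    placeholders = ", ".join("?" for _ in rows[0])
    cur.executemany(f"INSERT INTO {table} VALUES ({placeholders})", rows)
    conn.commit()


def table_ddl(conn, table):
    """Return the CREATE TABLE statement SQLite stored for `table`."""
    row = conn.execute(
        "SELECT sql FROM sqlite_master WHERE type = 'table' AND name = ?",
        (table,),
    ).fetchone()
    return row[0]
```

In this sketch, a truncating overwrite leaves constraints such as `NOT NULL` in place, and inserting rows with a different number of columns fails because the old definition is still in force.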





[GitHub] spark issue #14182: [SPARK-16444][WIP][SparkR]: Isotonic Regression wrapper ...

2016-07-23 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14182
  
**[Test build #62754 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62754/consoleFull)**
 for PR 14182 at commit 
[`7f68211`](https://github.com/apache/spark/commit/7f68211e362677e3599f4af7d574962b06611ab5).





[GitHub] spark issue #14182: [SPARK-16444][WIP][SparkR]: Isotonic Regression wrapper ...

2016-07-23 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14182
  
**[Test build #62754 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62754/consoleFull)**
 for PR 14182 at commit 
[`7f68211`](https://github.com/apache/spark/commit/7f68211e362677e3599f4af7d574962b06611ab5).
 * This patch **fails R style tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #14182: [SPARK-16444][WIP][SparkR]: Isotonic Regression wrapper ...

2016-07-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14182
  
Merged build finished. Test FAILed.





[GitHub] spark issue #14086: [SPARK-16463][SQL] Support `truncate` option in Overwrit...

2016-07-23 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14086
  
**[Test build #62755 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62755/consoleFull)**
 for PR 14086 at commit 
[`8b452cb`](https://github.com/apache/spark/commit/8b452cb51814ed196a0cd16312074de3ea28330d).





[GitHub] spark issue #14182: [SPARK-16444][WIP][SparkR]: Isotonic Regression wrapper ...

2016-07-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14182
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/62754/
Test FAILed.





[GitHub] spark pull request #14307: [SPARK-16672][SQL] SQLBuilder should not raise ex...

2016-07-23 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/14307#discussion_r71976751
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/SQLBuilderSuite.scala ---
@@ -0,0 +1,33 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.hive
+
+import org.apache.spark.sql.QueryTest
+import org.apache.spark.sql.catalyst.SQLBuilder
+import org.apache.spark.sql.hive.test.TestHiveSingleton
+import org.apache.spark.sql.test.SQLTestUtils
+
+class SQLBuilderSuite extends QueryTest with SQLTestUtils with TestHiveSingleton {
--- End diff --

Oh, I see.





[GitHub] spark issue #14307: [SPARK-16672][SQL] SQLBuilder should not raise exception...

2016-07-23 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14307
  
**[Test build #62756 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62756/consoleFull)**
 for PR 14307 at commit 
[`70f5401`](https://github.com/apache/spark/commit/70f5401e5d1a606117f85b1caa6c29724c623dff).





[GitHub] spark pull request #14313: [SPARK-16674][SQL] Avoid per-record type dispatch...

2016-07-23 Thread jaceklaskowski
Github user jaceklaskowski commented on a diff in the pull request:

https://github.com/apache/spark/pull/14313#discussion_r71977310
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala
 ---
@@ -322,46 +322,134 @@ private[sql] class JDBCRDD(
     }
   }
 
-  // Each JDBC-to-Catalyst conversion corresponds to a tag defined here so that
-  // we don't have to potentially poke around in the Metadata once for every
-  // row.
-  // Is there a better way to do this?  I'd rather be using a type that
-  // contains only the tags I define.
-  abstract class JDBCConversion
-  case object BooleanConversion extends JDBCConversion
-  case object DateConversion extends JDBCConversion
-  case class  DecimalConversion(precision: Int, scale: Int) extends JDBCConversion
-  case object DoubleConversion extends JDBCConversion
-  case object FloatConversion extends JDBCConversion
-  case object IntegerConversion extends JDBCConversion
-  case object LongConversion extends JDBCConversion
-  case object BinaryLongConversion extends JDBCConversion
-  case object StringConversion extends JDBCConversion
-  case object TimestampConversion extends JDBCConversion
-  case object BinaryConversion extends JDBCConversion
-  case class ArrayConversion(elementConversion: JDBCConversion) extends JDBCConversion
+  // A `JDBCConversion` is responsible for converting a value from `ResultSet`
+  // to a value in a field for `InternalRow`.
+  private type JDBCConversion = (ResultSet, Int) => Any
+
+  // This `ArrayElementConversion` is responsible for converting elements in
+  // an array from `ResultSet`.
+  private type ArrayElementConversion = (Object) => Any
 
   /**
-   * Maps a StructType to a type tag list.
+   * Maps a StructType to conversions for each type.
    */
   def getConversions(schema: StructType): Array[JDBCConversion] =
     schema.fields.map(sf => getConversions(sf.dataType, sf.metadata))
 
   private def getConversions(dt: DataType, metadata: Metadata): JDBCConversion = dt match {
-    case BooleanType => BooleanConversion
-    case DateType => DateConversion
-    case DecimalType.Fixed(p, s) => DecimalConversion(p, s)
-    case DoubleType => DoubleConversion
-    case FloatType => FloatConversion
-    case IntegerType => IntegerConversion
-    case LongType => if (metadata.contains("binarylong")) BinaryLongConversion else LongConversion
-    case StringType => StringConversion
-    case TimestampType => TimestampConversion
-    case BinaryType => BinaryConversion
-    case ArrayType(et, _) => ArrayConversion(getConversions(et, metadata))
+    case BooleanType =>
+      (rs: ResultSet, pos: Int) => rs.getBoolean(pos)
+
+    case DateType =>
+      (rs: ResultSet, pos: Int) =>
+        // DateTimeUtils.fromJavaDate does not handle null value, so we need to check it.
+        val dateVal = rs.getDate(pos)
+        if (dateVal != null) {

`Option(dateVal).map(...).orNull`?
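
The idea in the diff above is to pick one conversion function per column when the schema is known, instead of matching on a type tag for every record. A minimal Python sketch of that pattern follows, with a null-safe date conversion in the spirit of the `Option(dateVal).map(...).orNull` suggestion. The type names, the helper names, and the days-since-epoch representation for dates are assumptions made for illustration and are not taken from the PR.

```python
import datetime

EPOCH = datetime.date(1970, 1, 1)


def date_to_days(value):
    """Null-safe conversion: None stays None, a date becomes days since the epoch."""
    return None if value is None else (value - EPOCH).days


def make_converter(type_name):
    """Return a function (row, pos) -> value, selected once from the column type."""
    if type_name == "boolean":
        return lambda row, pos: bool(row[pos])
    if type_name == "date":
        return lambda row, pos: date_to_days(row[pos])
    if type_name == "string":
        return lambda row, pos: row[pos]
    raise ValueError("unsupported type: " + type_name)


def convert_rows(rows, schema):
    """Convert every row using converters built once per schema, not once per record."""
    converters = [make_converter(t) for t in schema]  # type dispatch happens here only
    return [
        [convert(row, pos) for pos, convert in enumerate(converters)]
        for row in rows
    ]
```

The per-record loop then only calls the prepared functions, which is the cost the PR title refers to as per-record type dispatch.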





[GitHub] spark pull request #14313: [SPARK-16674][SQL] Avoid per-record type dispatch...

2016-07-23 Thread jaceklaskowski
Github user jaceklaskowski commented on a diff in the pull request:

https://github.com/apache/spark/pull/14313#discussion_r71977329
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala
 ---
@@ -322,46 +322,134 @@ private[sql] class JDBCRDD(
 }
   }
 
-  // Each JDBC-to-Catalyst conversion corresponds to a tag defined here so 
that
-  // we don't have to potentially poke around in the Metadata once for 
every
-  // row.
-  // Is there a better way to do this?  I'd rather be using a type that
-  // contains only the tags I define.
-  abstract class JDBCConversion
-  case object BooleanConversion extends JDBCConversion
-  case object DateConversion extends JDBCConversion
-  case class  DecimalConversion(precision: Int, scale: Int) extends 
JDBCConversion
-  case object DoubleConversion extends JDBCConversion
-  case object FloatConversion extends JDBCConversion
-  case object IntegerConversion extends JDBCConversion
-  case object LongConversion extends JDBCConversion
-  case object BinaryLongConversion extends JDBCConversion
-  case object StringConversion extends JDBCConversion
-  case object TimestampConversion extends JDBCConversion
-  case object BinaryConversion extends JDBCConversion
-  case class ArrayConversion(elementConversion: JDBCConversion) extends 
JDBCConversion
+  // A `JDBCConversion` is responsible for converting a value from 
`ResultSet`
+  // to a value in a field for `InternalRow`.
+  private type JDBCConversion = (ResultSet, Int) => Any
+
+  // This `ArrayElementConversion` is responsible for converting elements 
in
+  // an array from `ResultSet`.
+  private type ArrayElementConversion = (Object) => Any
 
   /**
-   * Maps a StructType to a type tag list.
+   * Maps a StructType to conversions for each type.
*/
   def getConversions(schema: StructType): Array[JDBCConversion] =
 schema.fields.map(sf => getConversions(sf.dataType, sf.metadata))
 
   private def getConversions(dt: DataType, metadata: Metadata): 
JDBCConversion = dt match {
-case BooleanType => BooleanConversion
-case DateType => DateConversion
-case DecimalType.Fixed(p, s) => DecimalConversion(p, s)
-case DoubleType => DoubleConversion
-case FloatType => FloatConversion
-case IntegerType => IntegerConversion
-case LongType => if (metadata.contains("binarylong")) 
BinaryLongConversion else LongConversion
-case StringType => StringConversion
-case TimestampType => TimestampConversion
-case BinaryType => BinaryConversion
-case ArrayType(et, _) => ArrayConversion(getConversions(et, metadata))
+case BooleanType =>
+  (rs: ResultSet, pos: Int) => rs.getBoolean(pos)
+
+case DateType =>
+  (rs: ResultSet, pos: Int) =>
+// DateTimeUtils.fromJavaDate does not handle null value, so we 
need to check it.
+val dateVal = rs.getDate(pos)
+if (dateVal != null) {
+  DateTimeUtils.fromJavaDate(dateVal)
+} else {
+  null
+}
+
+case DecimalType.Fixed(p, s) =>
+  (rs: ResultSet, pos: Int) =>
+val decimalVal = rs.getBigDecimal(pos)
+if (decimalVal == null) {
--- End diff --

Same as above. Also, this check uses `== null` while the one above uses 
`!= null`, which is inconsistent.





[GitHub] spark pull request #14313: [SPARK-16674][SQL] Avoid per-record type dispatch...

2016-07-23 Thread jaceklaskowski
Github user jaceklaskowski commented on a diff in the pull request:

https://github.com/apache/spark/pull/14313#discussion_r71977337
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala
 ---
@@ -322,46 +322,134 @@ private[sql] class JDBCRDD(
 }
   }
 
-  // Each JDBC-to-Catalyst conversion corresponds to a tag defined here so 
that
-  // we don't have to potentially poke around in the Metadata once for 
every
-  // row.
-  // Is there a better way to do this?  I'd rather be using a type that
-  // contains only the tags I define.
-  abstract class JDBCConversion
-  case object BooleanConversion extends JDBCConversion
-  case object DateConversion extends JDBCConversion
-  case class  DecimalConversion(precision: Int, scale: Int) extends 
JDBCConversion
-  case object DoubleConversion extends JDBCConversion
-  case object FloatConversion extends JDBCConversion
-  case object IntegerConversion extends JDBCConversion
-  case object LongConversion extends JDBCConversion
-  case object BinaryLongConversion extends JDBCConversion
-  case object StringConversion extends JDBCConversion
-  case object TimestampConversion extends JDBCConversion
-  case object BinaryConversion extends JDBCConversion
-  case class ArrayConversion(elementConversion: JDBCConversion) extends 
JDBCConversion
+  // A `JDBCConversion` is responsible for converting a value from 
`ResultSet`
+  // to a value in a field for `InternalRow`.
+  private type JDBCConversion = (ResultSet, Int) => Any
+
+  // This `ArrayElementConversion` is responsible for converting elements 
in
+  // an array from `ResultSet`.
+  private type ArrayElementConversion = (Object) => Any
 
   /**
-   * Maps a StructType to a type tag list.
+   * Maps a StructType to conversions for each type.
*/
   def getConversions(schema: StructType): Array[JDBCConversion] =
 schema.fields.map(sf => getConversions(sf.dataType, sf.metadata))
 
   private def getConversions(dt: DataType, metadata: Metadata): 
JDBCConversion = dt match {
-case BooleanType => BooleanConversion
-case DateType => DateConversion
-case DecimalType.Fixed(p, s) => DecimalConversion(p, s)
-case DoubleType => DoubleConversion
-case FloatType => FloatConversion
-case IntegerType => IntegerConversion
-case LongType => if (metadata.contains("binarylong")) 
BinaryLongConversion else LongConversion
-case StringType => StringConversion
-case TimestampType => TimestampConversion
-case BinaryType => BinaryConversion
-case ArrayType(et, _) => ArrayConversion(getConversions(et, metadata))
+case BooleanType =>
+  (rs: ResultSet, pos: Int) => rs.getBoolean(pos)
+
+case DateType =>
+  (rs: ResultSet, pos: Int) =>
+// DateTimeUtils.fromJavaDate does not handle null value, so we 
need to check it.
+val dateVal = rs.getDate(pos)
+if (dateVal != null) {
+  DateTimeUtils.fromJavaDate(dateVal)
+} else {
+  null
+}
+
+case DecimalType.Fixed(p, s) =>
+  (rs: ResultSet, pos: Int) =>
+val decimalVal = rs.getBigDecimal(pos)
+if (decimalVal == null) {
+  null
+} else {
+  Decimal(decimalVal, p, s)
+}
+
+case DoubleType =>
+  (rs: ResultSet, pos: Int) => rs.getDouble(pos)
+
+case FloatType =>
+  (rs: ResultSet, pos: Int) => rs.getFloat(pos)
+
+case IntegerType =>
+  (rs: ResultSet, pos: Int) => rs.getInt(pos)
+
+case LongType if metadata.contains("binarylong") =>
+  (rs: ResultSet, pos: Int) =>
+val bytes = rs.getBytes(pos)
+var ans = 0L
+var j = 0
+while (j < bytes.size) {
+  ans = 256 * ans + (255 & bytes(j))
+  j = j + 1
+}
+ans
+
+case LongType =>
+  (rs: ResultSet, pos: Int) => rs.getLong(pos)
+
+case StringType =>
+  (rs: ResultSet, pos: Int) =>
+// TODO(davies): use getBytes for better performance, if the 
encoding is UTF-8
+UTF8String.fromString(rs.getString(pos))
+
+case TimestampType =>
+  (rs: ResultSet, pos: Int) =>
+val t = rs.getTimestamp(pos)
+if (t != null) {
--- End diff --

same as above



[GitHub] spark pull request #14313: [SPARK-16674][SQL] Avoid per-record type dispatch...

2016-07-23 Thread jaceklaskowski
Github user jaceklaskowski commented on a diff in the pull request:

https://github.com/apache/spark/pull/14313#discussion_r71977344
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala
 ---
@@ -322,46 +322,134 @@ private[sql] class JDBCRDD(
 }
   }
 
-  // Each JDBC-to-Catalyst conversion corresponds to a tag defined here so 
that
-  // we don't have to potentially poke around in the Metadata once for 
every
-  // row.
-  // Is there a better way to do this?  I'd rather be using a type that
-  // contains only the tags I define.
-  abstract class JDBCConversion
-  case object BooleanConversion extends JDBCConversion
-  case object DateConversion extends JDBCConversion
-  case class  DecimalConversion(precision: Int, scale: Int) extends 
JDBCConversion
-  case object DoubleConversion extends JDBCConversion
-  case object FloatConversion extends JDBCConversion
-  case object IntegerConversion extends JDBCConversion
-  case object LongConversion extends JDBCConversion
-  case object BinaryLongConversion extends JDBCConversion
-  case object StringConversion extends JDBCConversion
-  case object TimestampConversion extends JDBCConversion
-  case object BinaryConversion extends JDBCConversion
-  case class ArrayConversion(elementConversion: JDBCConversion) extends 
JDBCConversion
+  // A `JDBCConversion` is responsible for converting a value from 
`ResultSet`
+  // to a value in a field for `InternalRow`.
+  private type JDBCConversion = (ResultSet, Int) => Any
+
+  // This `ArrayElementConversion` is responsible for converting elements 
in
+  // an array from `ResultSet`.
+  private type ArrayElementConversion = (Object) => Any
 
   /**
-   * Maps a StructType to a type tag list.
+   * Maps a StructType to conversions for each type.
*/
   def getConversions(schema: StructType): Array[JDBCConversion] =
 schema.fields.map(sf => getConversions(sf.dataType, sf.metadata))
 
   private def getConversions(dt: DataType, metadata: Metadata): 
JDBCConversion = dt match {
-case BooleanType => BooleanConversion
-case DateType => DateConversion
-case DecimalType.Fixed(p, s) => DecimalConversion(p, s)
-case DoubleType => DoubleConversion
-case FloatType => FloatConversion
-case IntegerType => IntegerConversion
-case LongType => if (metadata.contains("binarylong")) 
BinaryLongConversion else LongConversion
-case StringType => StringConversion
-case TimestampType => TimestampConversion
-case BinaryType => BinaryConversion
-case ArrayType(et, _) => ArrayConversion(getConversions(et, metadata))
+case BooleanType =>
+  (rs: ResultSet, pos: Int) => rs.getBoolean(pos)
+
+case DateType =>
+  (rs: ResultSet, pos: Int) =>
+// DateTimeUtils.fromJavaDate does not handle null value, so we 
need to check it.
+val dateVal = rs.getDate(pos)
+if (dateVal != null) {
+  DateTimeUtils.fromJavaDate(dateVal)
+} else {
+  null
+}
+
+case DecimalType.Fixed(p, s) =>
+  (rs: ResultSet, pos: Int) =>
+val decimalVal = rs.getBigDecimal(pos)
+if (decimalVal == null) {
+  null
+} else {
+  Decimal(decimalVal, p, s)
+}
+
+case DoubleType =>
+  (rs: ResultSet, pos: Int) => rs.getDouble(pos)
+
+case FloatType =>
+  (rs: ResultSet, pos: Int) => rs.getFloat(pos)
+
+case IntegerType =>
+  (rs: ResultSet, pos: Int) => rs.getInt(pos)
+
+case LongType if metadata.contains("binarylong") =>
+  (rs: ResultSet, pos: Int) =>
+val bytes = rs.getBytes(pos)
+var ans = 0L
+var j = 0
+while (j < bytes.size) {
+  ans = 256 * ans + (255 & bytes(j))
+  j = j + 1
+}
+ans
+
+case LongType =>
+  (rs: ResultSet, pos: Int) => rs.getLong(pos)
+
+case StringType =>
+  (rs: ResultSet, pos: Int) =>
+// TODO(davies): use getBytes for better performance, if the 
encoding is UTF-8
+UTF8String.fromString(rs.getString(pos))
+
+case TimestampType =>
+  (rs: ResultSet, pos: Int) =>
+val t = rs.getTimestamp(pos)
+if (t != null) {
+  DateTimeUtils.fromJavaTimestamp(t)
+} else {
+  null
+}
+
+case BinaryType =>
+  (rs: ResultSet, pos: Int) => rs.getBytes(pos)
+
+case ArrayType(et, _) =>
+  val elementConversion: ArrayElementConversion = 
getArrayElementConversion(et, metadata)
+  (rs: ResultSet, pos: Int) =>
   

[GitHub] spark pull request #14313: [SPARK-16674][SQL] Avoid per-record type dispatch...

2016-07-23 Thread jaceklaskowski
Github user jaceklaskowski commented on a diff in the pull request:

https://github.com/apache/spark/pull/14313#discussion_r71977368
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala
 ---
@@ -407,84 +495,8 @@ private[sql] class JDBCRDD(
 var i = 0
 while (i < conversions.length) {
--- End diff --

Why `while` rather than `foreach` or similar?
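
One possible reason, offered as an assumption rather than the author's stated rationale: an index-based `while` loop avoids the per-element closure call that `foreach` introduces, which is a common choice in per-row code paths. A self-contained sketch of the two forms follows; `Conversion`, `ConversionLoop`, and the array-backed `source` row are hypothetical stand-ins for `JDBCConversion` and `ResultSet`.

```scala
object ConversionLoop {
  // Hypothetical stand-in for JDBCConversion: reads one value from a row by position.
  type Conversion = (Array[Any], Int) => Any

  // Index-based while loop, in the style of the diff.
  def fillWithWhile(conversions: Array[Conversion], source: Array[Any]): Array[Any] = {
    val out = new Array[Any](conversions.length)
    var i = 0
    while (i < conversions.length) {
      out(i) = conversions(i)(source, i)
      i += 1
    }
    out
  }

  // Functional alternative: same result, but each step goes through a closure.
  def fillWithForeach(conversions: Array[Conversion], source: Array[Any]): Array[Any] = {
    val out = new Array[Any](conversions.length)
    conversions.indices.foreach { i => out(i) = conversions(i)(source, i) }
    out
  }
}
```

The two functions produce the same output, so the choice is about overhead and readability rather than behavior.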





[GitHub] spark pull request #14329: [SPARKR][DOCS] fix broken url in doc

2016-07-23 Thread felixcheung
GitHub user felixcheung opened a pull request:

https://github.com/apache/spark/pull/14329

[SPARKR][DOCS] fix broken url in doc

## What changes were proposed in this pull request?

Fix broken url, also,

sparkR.session.stop Rd should have it in the header

![image](https://cloud.githubusercontent.com/assets/8969467/17080129/26d41308-50d9-11e6-8967-79d6c920313f.png)


Data type section is in the middle of a list of gapply/gapplyCollect 
subsections:

![image](https://cloud.githubusercontent.com/assets/8969467/17080122/f992d00a-50d8-11e6-8f2c-fd5786213920.png)


## How was this patch tested?

manual test

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/felixcheung/spark rdoclinkfix

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/14329.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #14329


commit 40ca13c17e8e97732733e7bc200254459920d2f9
Author: Felix Cheung 
Date:   2016-07-23T20:13:49Z

doc fix

commit 06d8b415a3bce4c997683defce87b4833b56b1a9
Author: Felix Cheung 
Date:   2016-07-23T20:20:21Z

Merge branch 'master' of https://github.com/apache/spark into rdoclinkfix







[GitHub] spark issue #14329: [SPARKR][DOCS] fix broken url in doc

2016-07-23 Thread felixcheung
Github user felixcheung commented on the issue:

https://github.com/apache/spark/pull/14329
  
@shivaram 





[GitHub] spark issue #14329: [SPARKR][DOCS] fix broken url in doc

2016-07-23 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14329
  
**[Test build #62757 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62757/consoleFull)** for PR 14329 at commit [`06d8b41`](https://github.com/apache/spark/commit/06d8b415a3bce4c997683defce87b4833b56b1a9).





[GitHub] spark issue #14086: [SPARK-16463][SQL] Support `truncate` option in Overwrit...

2016-07-23 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14086
  
**[Test build #62755 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62755/consoleFull)** for PR 14086 at commit [`8b452cb`](https://github.com/apache/spark/commit/8b452cb51814ed196a0cd16312074de3ea28330d).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #14086: [SPARK-16463][SQL] Support `truncate` option in Overwrit...

2016-07-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14086
  
Merged build finished. Test PASSed.





[GitHub] spark issue #14086: [SPARK-16463][SQL] Support `truncate` option in Overwrit...

2016-07-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14086
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/62755/





[GitHub] spark issue #14307: [SPARK-16672][SQL] SQLBuilder should not raise exception...

2016-07-23 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14307
  
**[Test build #62756 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62756/consoleFull)** for PR 14307 at commit [`70f5401`](https://github.com/apache/spark/commit/70f5401e5d1a606117f85b1caa6c29724c623dff).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #14307: [SPARK-16672][SQL] SQLBuilder should not raise exception...

2016-07-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14307
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/62756/





[GitHub] spark issue #14329: [SPARKR][DOCS] fix broken url in doc

2016-07-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14329
  
Merged build finished. Test PASSed.




