[GitHub] spark pull request #14326: [SPARK-3181] [ML] Implement RobustRegression with...
GitHub user yanboliang opened a pull request: https://github.com/apache/spark/pull/14326 [SPARK-3181] [ML] Implement RobustRegression with huber loss. ## What changes were proposed in this pull request? The current implementation is a straightforward port of the Python scikit-learn ```HuberRegressor```, so it produces the same results. The code is posted for discussion, so please overlook trivial issues for now, since we may want a slightly different design for the Spark implementation. Here are the major issues that should be discussed: * Objective function. We use Eq. (6) in [A robust hybrid of lasso and ridge regression](http://statweb.stanford.edu/~owen/reports/hhu.pdf) as the objective function. ![image](https://cloud.githubusercontent.com/assets/1962026/17076521/02a3f054-5069-11e6-895d-3c904e056ba2.png) But this convention differs from other Spark ML code such as ```LinearRegression``` in two respects: • The loss is the total loss rather than the mean loss. In ```LinearRegression``` we use ```lossSum/weightSum``` as the mean loss. • We do not multiply the loss function and the L2 regularization by 1/2. This is not a problem, since multiplying the whole formula by a constant factor does not affect the result. So should we switch to a modified objective function like the following, which would be consistent with other Spark ML code? ![image](https://cloud.githubusercontent.com/assets/1962026/17076522/14eceb4e-5069-11e6-84ae-ecfaf3ea12ed.png) * Implement a new class ```RobustRegression``` or a new loss function for ```LinearRegression```. Both ```LinearRegression``` and ```RobustRegression``` accomplish the same goal, but the output of ```fit``` will differ: ```LinearRegressionModel``` versus ```RobustRegressionModel```. The former contains only ```coefficients``` and ```intercept```; the latter contains ```coefficients```, ```intercept```, and ```scale/sigma``` (and possibly even the outlier samples, similar to sklearn's ```HuberRegressor.outliers_```).
Combining the two models into one would also raise save/load compatibility issues. One workaround is to drop ```scale/sigma``` and have ```fit``` with this huber cost function still output a ```LinearRegressionModel```, but I don't think that is appropriate, since it would lose some model attributes. So I implemented ```RobustRegression``` as a new class, and we can port this loss function to ```LinearRegression``` later if needed. ## How was this patch tested? Unit tests. You can merge this pull request into a Git repository by running: $ git pull https://github.com/yanboliang/spark spark-3181 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/14326.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #14326 commit 8fd0ca1954f964e89cf81379fdaff0844afd7253 Author: Yanbo Liang Date: 2016-07-23T06:54:58Z Implement RobustRegression with huber loss. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
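The huber loss the PR implements is quadratic for small residuals and linear for large ones, which is what makes the regression robust to outliers. A minimal Python sketch of the pointwise loss (under one common convention with the 1/2 factor the PR discusses; the ```delta``` default of 1.35 mirrors scikit-learn's ```epsilon```, and is an illustrative choice, not the PR's code):

```python
def huber_loss(residual, delta=1.35):
    """Huber loss: quadratic near zero, linear in the tails.

    The two branches meet at |residual| == delta with the same value
    (0.5 * delta**2), so the loss is continuous and differentiable there.
    """
    r = abs(residual)
    if r <= delta:
        return 0.5 * r * r          # squared-error region
    return delta * (r - 0.5 * delta)  # linear region: outliers grow slowly

# A residual of 10 contributes ~12.6 instead of 50 under squared error,
# so a single outlier pulls the fit far less.
print(huber_loss(0.5), huber_loss(10.0))
```

Note this sketch omits the joint estimation of the scale ```sigma``` from Eq. (6) of the referenced paper, which is the part that makes the estimator concomitantly scale-invariant and motivates keeping ```scale/sigma``` in the model.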
[GitHub] spark issue #14326: [SPARK-3181] [ML] Implement RobustRegression with huber ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14326 **[Test build #62747 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62747/consoleFull)** for PR 14326 at commit [`8fd0ca1`](https://github.com/apache/spark/commit/8fd0ca1954f964e89cf81379fdaff0844afd7253).
[GitHub] spark pull request #3556: [SPARK-4693] [SQL] PruningPredicates may be wrong ...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/3556#discussion_r71968853 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveStrategies.scala --- @@ -194,8 +194,9 @@ private[hive] trait HiveStrategies { // Filter out all predicates that only deal with partition keys, these are given to the // hive table scan operator to be used for partition pruning. val partitionKeyIds = AttributeSet(relation.partitionKeys) -val (pruningPredicates, otherPredicates) = predicates.partition { - _.references.subsetOf(partitionKeyIds) +val (pruningPredicates, otherPredicates) = predicates.partition { predicate => + !predicate.references.isEmpty && --- End diff -- This line sounds useless in Spark 2.0
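The non-empty guard in the diff matters because a predicate that references no columns (e.g. a constant like `1 = 1`) has an empty reference set, and the empty set is a subset of every set, so a plain `subsetOf` check would misclassify it as a partition-pruning predicate. A minimal Python sketch of the set logic (hypothetical predicates, not the Spark code):

```python
partition_keys = {"dt", "country"}

# (predicate, set of columns it references); an empty set models a
# constant predicate such as `1 = 1`.
predicates = [
    ("dt = '2016-07-23'", {"dt"}),
    ("value > 10", {"value"}),
    ("1 = 1", set()),
]

# Without the guard: set() <= partition_keys is vacuously True, so the
# constant predicate is wrongly routed to partition pruning.
naive = [p for p, refs in predicates if refs <= partition_keys]

# With the guard from the diff: only predicates that actually reference
# partition keys are used for pruning.
guarded = [p for p, refs in predicates if refs and refs <= partition_keys]

print(naive)
print(guarded)
```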
[GitHub] spark issue #13756: [SPARK-16041][SQL] Disallow Duplicate Columns in partiti...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/13756 **[Test build #62746 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62746/consoleFull)** for PR 13756 at commit [`08b5374`](https://github.com/apache/spark/commit/08b5374e827f6680b4e4a00ed700ef689dce22ff). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #13756: [SPARK-16041][SQL] Disallow Duplicate Columns in partiti...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/13756 Merged build finished. Test PASSed.
[GitHub] spark issue #13756: [SPARK-16041][SQL] Disallow Duplicate Columns in partiti...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/13756 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/62746/
[GitHub] spark issue #14174: [SPARK-16524][SQL] Add RowBatch and RowBasedHashMapGener...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14174 **[Test build #62748 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62748/consoleFull)** for PR 14174 at commit [`bbaf568`](https://github.com/apache/spark/commit/bbaf5680e277d4d79f1710346807c1e4fb25ba93).
[GitHub] spark issue #14174: [SPARK-16524][SQL] Add RowBatch and RowBasedHashMapGener...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14174 **[Test build #62749 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62749/consoleFull)** for PR 14174 at commit [`7131a53`](https://github.com/apache/spark/commit/7131a536fe0605e9e04937e4f3ac1b13e37d7803).
[GitHub] spark issue #14326: [SPARK-3181] [ML] Implement RobustRegression with huber ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14326 **[Test build #62747 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62747/consoleFull)** for PR 14326 at commit [`8fd0ca1`](https://github.com/apache/spark/commit/8fd0ca1954f964e89cf81379fdaff0844afd7253). * This patch passes all tests. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `class RobustRegression @Since(\"2.1.0\") (@Since(\"2.1.0\") override val uid: String)`
[GitHub] spark issue #14326: [SPARK-3181] [ML] Implement RobustRegression with huber ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14326 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/62747/
[GitHub] spark issue #14326: [SPARK-3181] [ML] Implement RobustRegression with huber ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14326 Merged build finished. Test PASSed.
[GitHub] spark issue #14295: [SPARK-16648][SQL] Overrides TreeNode.withNewChildren in...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14295 **[Test build #62750 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62750/consoleFull)** for PR 14295 at commit [`dd73681`](https://github.com/apache/spark/commit/dd7368169e60f84a8262866cda9946dd370aa11d).
[GitHub] spark issue #14295: [SPARK-16648][SQL] Overrides TreeNode.withNewChildren in...
Github user liancheng commented on the issue: https://github.com/apache/spark/pull/14295 Oh, that's a good point; I should have realized both of them are affected. Updated. Thanks!
[GitHub] spark issue #14317: [SPARK-16380][EXAMPLES] Update SQL examples and programm...
Github user liancheng commented on the issue: https://github.com/apache/spark/pull/14317 @JoshRosen Would you mind taking a look at this? Thanks!
[GitHub] spark issue #14324: [SPARK-16664][SQL] Fix persist call on Data frames with ...
Github user breakdawn commented on the issue: https://github.com/apache/spark/pull/14324 The 8118-column limit is due to janino; the exception looks like the following, which might be another story:

    at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:889)
    at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:941)
    at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:938)
    at com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599)
    at com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2379)
    ... 25 more
    Caused by: java.io.EOFException
    at java.io.DataInputStream.readFully(DataInputStream.java:197)
    at java.io.DataInputStream.readFully(DataInputStream.java:169)
    at org.codehaus.janino.util.ClassFile.loadAttribute(ClassFile.java:1509)
    at org.codehaus.janino.util.ClassFile.loadAttributes(ClassFile.java:644)
    at org.codehaus.janino.util.ClassFile.loadFields(ClassFile.java:623)
    at org.codehaus.janino.util.ClassFile.<init>(ClassFile.java:280)
    at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$recordCompilationStats$1.apply(CodeGenerator.scala:914)
    at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$recordCompilationStats$1.apply(CodeGenerator.scala:912)
    at scala.collection.Iterator$class.foreach(Iterator.scala:893)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
    at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
    at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
    at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.recordCompilationStats(CodeGenerator.scala:912)
    at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:884)
    ... 29 more
[GitHub] spark issue #14320: [SPARK-16416] [Core] force eager creation of loggers to ...
Github user mikaelstaldal commented on the issue: https://github.com/apache/spark/pull/14320 It is; I just had to apply the same change in several places.
[GitHub] spark issue #14174: [SPARK-16524][SQL] Add RowBatch and RowBasedHashMapGener...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14174 **[Test build #62748 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62748/consoleFull)** for PR 14174 at commit [`bbaf568`](https://github.com/apache/spark/commit/bbaf5680e277d4d79f1710346807c1e4fb25ba93). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #14174: [SPARK-16524][SQL] Add RowBatch and RowBasedHashMapGener...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14174 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/62748/
[GitHub] spark issue #14174: [SPARK-16524][SQL] Add RowBatch and RowBasedHashMapGener...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14174 Merged build finished. Test PASSed.
[GitHub] spark issue #14174: [SPARK-16524][SQL] Add RowBatch and RowBasedHashMapGener...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14174 **[Test build #62749 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62749/consoleFull)** for PR 14174 at commit [`7131a53`](https://github.com/apache/spark/commit/7131a536fe0605e9e04937e4f3ac1b13e37d7803). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #14174: [SPARK-16524][SQL] Add RowBatch and RowBasedHashMapGener...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14174 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/62749/
[GitHub] spark issue #14174: [SPARK-16524][SQL] Add RowBatch and RowBasedHashMapGener...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14174 Merged build finished. Test PASSed.
[GitHub] spark issue #14295: [SPARK-16648][SQL] Overrides TreeNode.withNewChildren in...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14295 **[Test build #62750 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62750/consoleFull)** for PR 14295 at commit [`dd73681`](https://github.com/apache/spark/commit/dd7368169e60f84a8262866cda9946dd370aa11d). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #14324: [SPARK-16664][SQL] Fix persist call on Data frames with ...
Github user lw-lin commented on the issue: https://github.com/apache/spark/pull/14324 @breakdawn Yes, that's a different issue and I'm looking into it. Regarding what this PR tries to fix, could you run this PR's change against [this test case](https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/execution/columnar/InMemoryColumnarQuerySuite.scala#L225) to see if more needs to be done?
[GitHub] spark issue #14295: [SPARK-16648][SQL] Overrides TreeNode.withNewChildren in...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14295 Merged build finished. Test PASSed.
[GitHub] spark issue #14295: [SPARK-16648][SQL] Overrides TreeNode.withNewChildren in...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14295 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/62750/
[GitHub] spark pull request #14327: [SPARK-16686][SQL] Project shouldn't be pushed do...
GitHub user viirya opened a pull request: https://github.com/apache/spark/pull/14327 [SPARK-16686][SQL] Project shouldn't be pushed down through Sample if it has new output ## What changes were proposed in this pull request? We push down `Project` through `Sample` in `Optimizer`. However, if the projected columns produce new output, they will be evaluated against the whole data instead of the sampled data. This introduces an inconsistency between the original plan (Sample then Project) and the optimized plan (Project then Sample). In the extreme case attached in the JIRA, where the projected column is a UDF that is supposed never to see the sampled-out data, the UDF's result will be incorrect. We shouldn't push down Project through Sample if the Project produces new output. ## How was this patch tested? Jenkins tests. You can merge this pull request into a Git repository by running: $ git pull https://github.com/viirya/spark-1 fix-sample-pushdown Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/14327.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #14327 commit 9521a5aca87bead3dcfeabd7abe3468194984ea3 Author: Liang-Chi Hsieh Date: 2016-07-23T10:13:07Z Project shouldn't be pushed down through Sample if it has new output.
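The pushdown hazard can be illustrated without Spark: if a projection runs before sampling, its expressions are evaluated on rows the sample would have excluded. A minimal Python sketch (hypothetical data and function names, not the optimizer code; the JIRA's real UDF raised an error instead of recording what it saw):

```python
import random

rows = list(range(10))
seen = []

def udf(n):
    # Record every row the function observes.
    seen.append(n)
    return n * 2

random.seed(42)
sampled = [r for r in rows if random.random() < 0.5]

# Original plan: Sample, then Project -- the UDF sees only the sampled rows.
seen.clear()
plan_a = [udf(r) for r in sampled]
assert set(seen) == set(sampled)

# Plan after the over-eager pushdown: Project, then Sample -- the UDF has
# already run over every row, including those the sample would exclude.
seen.clear()
projected = [udf(r) for r in rows]
assert set(seen) == set(rows)  # the UDF saw rows it should never have seen
```

For a pure projection of existing columns the reordering is harmless, which is why the fix only blocks the pushdown when the Project produces new output.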
[GitHub] spark issue #14327: [SPARK-16686][SQL] Project shouldn't be pushed down thro...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14327 **[Test build #62751 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62751/consoleFull)** for PR 14327 at commit [`9521a5a`](https://github.com/apache/spark/commit/9521a5aca87bead3dcfeabd7abe3468194984ea3).
[GitHub] spark pull request #14327: [SPARK-16686][SQL] Project shouldn't be pushed do...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/14327#discussion_r71971233 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala --- @@ -422,6 +422,35 @@ class DatasetSuite extends QueryTest with SharedSQLContext { 3, 17, 27, 58, 62) } + test("SPARK-16686: Dataset.sample with seed results shouldn't depend on downstream usage") { +val udfOne = spark.udf.register("udfOne", (n: Int) => { + if (n == 1) { +throw new RuntimeException("udfOne shouldn't see swid=1!") --- End diff -- Use `require`? generally `RuntimeException` isn't used directly. Really minor
[GitHub] spark pull request #14324: [SPARK-16664][SQL] Fix persist call on Data frame...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/14324#discussion_r71971259 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala --- @@ -1571,4 +1571,12 @@ class DataFrameSuite extends QueryTest with SharedSQLContext { checkAnswer(joined, Row("x", null, null)) checkAnswer(joined.filter($"new".isNull), Row("x", null, null)) } + + test("SPARK-16664: persist with more than 200 columns") { +val size = 201l --- End diff -- Nit: write 201L for a long literal; it's too easy to read this as 2011.
[GitHub] spark issue #14320: [SPARK-16416] [Core] force eager creation of loggers to ...
Github user srowen commented on the issue: https://github.com/apache/spark/pull/14320 Should just be necessary in the ShutdownHookManager?
[GitHub] spark issue #14216: [SPARK-16561][MLLib] fix multivarOnlineSummary min/max b...
Github user srowen commented on the issue: https://github.com/apache/spark/pull/14216 Merged to master
[GitHub] spark issue #14301: [SPARK-16662][PySpark][SQL] fix HiveContext warning bug
Github user srowen commented on the issue: https://github.com/apache/spark/pull/14301 Merged to master
[GitHub] spark issue #14242: Add a comment
Github user srowen commented on the issue: https://github.com/apache/spark/pull/14242 @kzhang28 update or close this?
[GitHub] spark pull request #14301: [SPARK-16662][PySpark][SQL] fix HiveContext warni...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/14301
[GitHub] spark pull request #14216: [SPARK-16561][MLLib] fix multivarOnlineSummary mi...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/14216
[GitHub] spark issue #12983: [SPARK-15213][PySpark] Unify 'range' usages
Github user srowen commented on the issue: https://github.com/apache/spark/pull/12983 Yeah, I can see that point; the change is ultimately a no-op. I'm neutral on it; not much of a Python person myself.
[GitHub] spark issue #13986: [SPARK-16617] Upgrade to Avro 1.8.1
Github user srowen commented on the issue: https://github.com/apache/spark/pull/13986 Have a look at `dev/test-dependencies.sh --replace-manifest`. I think the big concern is matching the Hadoop dependency, which will be on 1.7.x for 2.x. Updating to the latest 1.7.x seems OK. You can also test this change anyway after making the deps change to see what happens.
[GitHub] spark issue #14320: [SPARK-16416] [Core] force eager creation of loggers to ...
Github user mikaelstaldal commented on the issue: https://github.com/apache/spark/pull/14320 I realized that it is necessary everywhere you register a shutdown hook, if you log from within the shutdown hook. Another way to solve it would be to refrain from logging within shutdown hooks.
[GitHub] spark pull request #14328: Close old PRs that should be closed but have not ...
GitHub user srowen opened a pull request: https://github.com/apache/spark/pull/14328 Close old PRs that should be closed but have not been

Closes #11598 Closes #7278 Closes #13882 Closes #12053 Closes #14125 Closes #8760 Closes #12848 Closes #14224

You can merge this pull request into a Git repository by running: $ git pull https://github.com/srowen/spark CloseOldPRs Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/14328.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #14328

commit c5a50bd8f0947681f1cd2ceb2e14b6440f4f2ddc Author: Sean Owen Date: 2016-07-23T11:51:20Z Close old PRs that should be closed but have not been
[GitHub] spark issue #14328: [MINOR] Close old PRs that should be closed but have not...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14328 **[Test build #62752 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62752/consoleFull)** for PR 14328 at commit [`c5a50bd`](https://github.com/apache/spark/commit/c5a50bd8f0947681f1cd2ceb2e14b6440f4f2ddc).
[GitHub] spark pull request #14280: [SPARK-16515][SQL][FOLLOW-UP] Fix test `script` o...
Github user lw-lin commented on a diff in the pull request: https://github.com/apache/spark/pull/14280#discussion_r71971600
--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala ---
@@ -64,14 +67,17 @@ class SQLQuerySuite extends QueryTest with SQLTestUtils with TestHiveSingleton {
   import spark.implicits._

   test("script") {
-    val df = Seq(("x1", "y1", "z1"), ("x2", "y2", "z2")).toDF("c1", "c2", "c3")
-    df.createOrReplaceTempView("script_table")
-    val query1 = sql(
-      """
-        |SELECT col1 FROM (from(SELECT c1, c2, c3 FROM script_table) tempt_table
-        |REDUCE c1, c2, c3 USING 'bash src/test/resources/test_script.sh' AS
-        |(col1 STRING, col2 STRING)) script_test_table""".stripMargin)
-    checkAnswer(query1, Row("x1_y1") :: Row("x2_y2") :: Nil)
+    if (testCommandAvailable("bash") && testCommandAvailable("echo | sed")) {
+      val df = Seq(("x1", "y1", "z1"), ("x2", "y2", "z2")).toDF("c1", "c2", "c3")
+      df.createOrReplaceTempView("script_table")
+      val query1 = sql(
+        """
+          |SELECT col1 FROM (from(SELECT c1, c2, c3 FROM script_table) tempt_table
+          |REDUCE c1, c2, c3 USING 'bash src/test/resources/test_script.sh' AS
+          |(col1 STRING, col2 STRING)) script_test_table""".stripMargin)
+      checkAnswer(query1, Row("x1_y1") :: Row("x2_y2") :: Nil)
+    }
+    // else skip this test
--- End diff --
The only change here was the if check; i.e.

    if (testCommandAvailable("bash") && testCommandAvailable("echo | sed")) {
      // everything left unchanged
    }
    // else skip this test
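The guard above skips a test when a required external tool is missing on the host. A minimal sketch of how a `testCommandAvailable`-style helper might work (this is my Java illustration of the pattern, not Spark's Scala implementation): run the command through a shell and treat a zero exit code as "available".

```java
import java.io.IOException;

public class CommandGuard {
    // Run the given command line through a shell; available == exit code 0.
    // Redirecting stdin/stdout keeps pipelines like "echo | sed" from blocking.
    static boolean commandAvailable(String command) {
        try {
            Process p = new ProcessBuilder(
                    "sh", "-c", command + " </dev/null >/dev/null 2>&1").start();
            return p.waitFor() == 0;
        } catch (IOException | InterruptedException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Only run the script-dependent test body when the tools exist.
        if (commandAvailable("bash -c true")) {
            System.out.println("bash present: run the script test");
        } else {
            System.out.println("bash missing: skip the script test");
        }
    }
}
```

Passing a pipeline such as `echo | sed` works because the whole string is handed to `sh -c`, whose exit status is that of the last command in the pipeline.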
[GitHub] spark issue #14280: [SPARK-16515][SQL][FOLLOW-UP] Fix test `script` on OS X/...
Github user lw-lin commented on the issue: https://github.com/apache/spark/pull/14280 Maybe this is ready to go?
[GitHub] spark issue #14324: [SPARK-16664][SQL] Fix persist call on Data frames with ...
Github user srowen commented on the issue: https://github.com/apache/spark/pull/14324 There are actually _55_ occurrences of this type of problem in the code base. I think I will open a PR separately to fix them. It might or might not cause a problem in practice in other cases, but many are in examples or tests, where we might not observe the consequence.
[GitHub] spark issue #14327: [SPARK-16686][SQL] Project shouldn't be pushed down thro...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14327 **[Test build #62751 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62751/consoleFull)** for PR 14327 at commit [`9521a5a`](https://github.com/apache/spark/commit/9521a5aca87bead3dcfeabd7abe3468194984ea3).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #14327: [SPARK-16686][SQL] Project shouldn't be pushed down thro...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14327 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/62751/ Test PASSed.
[GitHub] spark issue #14327: [SPARK-16686][SQL] Project shouldn't be pushed down thro...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14327 Merged build finished. Test PASSed.
[GitHub] spark issue #14320: [SPARK-16416] [Core] force eager creation of loggers to ...
Github user srowen commented on the issue: https://github.com/apache/spark/pull/14320 OK, seems reasonable to me as is.
[GitHub] spark issue #14280: [SPARK-16515][SQL][FOLLOW-UP] Fix test `script` on OS X/...
Github user srowen commented on the issue: https://github.com/apache/spark/pull/14280 LGTM
[GitHub] spark pull request #14327: [SPARK-16686][SQL] Project shouldn't be pushed do...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/14327#discussion_r71972546
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala ---
@@ -422,6 +422,35 @@ class DatasetSuite extends QueryTest with SharedSQLContext {
     3, 17, 27, 58, 62)
   }

+  test("SPARK-16686: Dataset.sample with seed results shouldn't depend on downstream usage") {
+    val udfOne = spark.udf.register("udfOne", (n: Int) => {
+      if (n == 1) {
+        throw new RuntimeException("udfOne shouldn't see swid=1!")
--- End diff --
Thanks! I've updated it.
[GitHub] spark issue #14327: [SPARK-16686][SQL] Project shouldn't be pushed down thro...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14327 **[Test build #62753 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62753/consoleFull)** for PR 14327 at commit [`6d1616d`](https://github.com/apache/spark/commit/6d1616d41cc1158089ac0f38a6402a0fef58b191).
[GitHub] spark issue #14328: [MINOR] Close old PRs that should be closed but have not...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14328 **[Test build #62752 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62752/consoleFull)** for PR 14328 at commit [`c5a50bd`](https://github.com/apache/spark/commit/c5a50bd8f0947681f1cd2ceb2e14b6440f4f2ddc).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #14328: [MINOR] Close old PRs that should be closed but have not...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14328 Merged build finished. Test PASSed.
[GitHub] spark issue #14328: [MINOR] Close old PRs that should be closed but have not...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14328 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/62752/ Test PASSed.
[GitHub] spark pull request #14086: [SPARK-16463][SQL] Support `truncate` option in O...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/14086#discussion_r71973330
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCWriteSuite.scala ---
@@ -145,14 +153,24 @@ class JDBCWriteSuite extends SharedSQLContext with BeforeAndAfter {
     assert(2 === spark.read.jdbc(url, "TEST.APPENDTEST", new Properties()).collect()(0).length)
   }

-  test("CREATE then INSERT to truncate") {
+  test("Truncate") {
+    JdbcDialects.registerDialect(testH2Dialect)
     val df = spark.createDataFrame(sparkContext.parallelize(arr2x2), schema2)
     val df2 = spark.createDataFrame(sparkContext.parallelize(arr1x2), schema2)
+    val df3 = spark.createDataFrame(sparkContext.parallelize(arr2x3), schema3)
     df.write.jdbc(url1, "TEST.TRUNCATETEST", properties)
-    df2.write.mode(SaveMode.Overwrite).jdbc(url1, "TEST.TRUNCATETEST", properties)
+    df2.write.mode(SaveMode.Overwrite).option("truncate", true)
+      .jdbc(url1, "TEST.TRUNCATETEST", properties)
     assert(1 === spark.read.jdbc(url1, "TEST.TRUNCATETEST", properties).count())
     assert(2 === spark.read.jdbc(url1, "TEST.TRUNCATETEST", properties).collect()(0).length)
+
+    val m = intercept[SparkException] {
--- End diff --
To check my understanding here, this overwrites the table with a different schema (new column `seq`), which shows the truncate fails because the schema has changed. I guess it would be nice to also test the case where the truncate works, though we can't really test whether it truncates vs. drops. Could you, for example, just repeat the code on lines 163-166 here to verify that overwriting again yields the same results?
[GitHub] spark issue #14324: [SPARK-16664][SQL] Fix persist call on Data frames with ...
Github user breakdawn commented on the issue: https://github.com/apache/spark/pull/14324 @lw-lin umm, thanks for pointing it out. Since the limit is 8117, 1 will fail, so that case needs an update.
[GitHub] spark issue #14327: [SPARK-16686][SQL] Project shouldn't be pushed down thro...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14327 **[Test build #62753 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62753/consoleFull)** for PR 14327 at commit [`6d1616d`](https://github.com/apache/spark/commit/6d1616d41cc1158089ac0f38a6402a0fef58b191).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #14327: [SPARK-16686][SQL] Project shouldn't be pushed down thro...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14327 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/62753/ Test PASSed.
[GitHub] spark issue #14327: [SPARK-16686][SQL] Project shouldn't be pushed down thro...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14327 Merged build finished. Test PASSed.
[GitHub] spark issue #14194: [SPARK-16485][DOC][ML] Fixed several inline formatting i...
Github user lins05 commented on the issue: https://github.com/apache/spark/pull/14194 @jkbradley Could you please take a look at this simple fix?
[GitHub] spark pull request #14242: Add a comment
Github user kzhang28 closed the pull request at: https://github.com/apache/spark/pull/14242
[GitHub] spark issue #13986: [SPARK-16617] Upgrade to Avro 1.8.1
Github user benmccann commented on the issue: https://github.com/apache/spark/pull/13986 I'll close for now until Hadoop 3.x. Thanks
[GitHub] spark pull request #13986: [SPARK-16617] Upgrade to Avro 1.8.1
Github user benmccann closed the pull request at: https://github.com/apache/spark/pull/13986
[GitHub] spark issue #14242: Add a comment
Github user kzhang28 commented on the issue: https://github.com/apache/spark/pull/14242 @srowen I closed it. Thank you for your kind reminder.
[GitHub] spark pull request #14326: [SPARK-3181] [ML] Implement RobustRegression with...
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/14326#discussion_r71975650
--- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/RobustRegression.scala ---
@@ -0,0 +1,466 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.regression
+
+import scala.collection.mutable
+
+import breeze.linalg.{DenseVector => BDV}
+import breeze.optimize.{CachedDiffFunction, DiffFunction, LBFGS => BreezeLBFGS, LBFGSB => BreezeLBFGSB}
+
+import org.apache.spark.SparkException
+import org.apache.spark.annotation.Since
+import org.apache.spark.internal.Logging
+import org.apache.spark.ml.PredictorParams
+import org.apache.spark.ml.feature.Instance
+import org.apache.spark.ml.linalg.{Vector, Vectors}
+import org.apache.spark.ml.linalg.BLAS._
+import org.apache.spark.ml.param.{DoubleParam, ParamMap, ParamValidators}
+import org.apache.spark.ml.param.shared._
+import org.apache.spark.ml.util._
+import org.apache.spark.mllib.linalg.VectorImplicits._
+import org.apache.spark.mllib.stat.MultivariateOnlineSummarizer
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.{Dataset, Row}
+import org.apache.spark.sql.functions._
+import org.apache.spark.storage.StorageLevel
+
+/**
+ * Params for robust regression.
+ */
+private[regression] trait RobustRegressionParams extends PredictorParams with HasRegParam
+  with HasMaxIter with HasTol with HasFitIntercept with HasStandardization with HasWeightCol {
+
+  /**
+   * The shape parameter to control the amount of robustness. Must be > 1.0.
+   * At larger values of M, the huber criterion becomes more similar to least squares regression;
+   * for small values of M, the criterion is more similar to L1 regression.
+   * Default is 1.35 to get as much robustness as possible while retaining
+   * 95% statistical efficiency for normally distributed data.
+   */
+  @Since("2.1.0")
+  final val m = new DoubleParam(this, "m", "The shape parameter to control the amount of " +
+    "robustness. Must be > 1.0.", ParamValidators.gt(1.0))
+
+  /** @group getParam */
+  @Since("2.1.0")
+  def getM: Double = $(m)
+}
+
+/**
+ * Robust regression.
+ *
+ * The learning objective is to minimize the huber loss, with regularization.
+ *
+ * The robust regression optimizes the squared loss for the samples where
+ * {{{ |\frac{(y - X \beta)}{\sigma}| \leq M }}}
+ * and the absolute loss for the samples where
+ * {{{ |\frac{(y - X \beta)}{\sigma}| \geq M }}},
+ * where \beta and \sigma are parameters to be optimized.
+ *
+ * This supports two types of regularization: None and L2.
+ *
+ * This estimator is different from the R implementation of Robust Regression
+ * ([[http://www.ats.ucla.edu/stat/r/dae/rreg.htm]]) because the R implementation does a
+ * weighted least squares implementation with weights given to each sample on the basis
+ * of how much the residual is greater than a certain threshold.
+ */
+@Since("2.1.0")
+class RobustRegression @Since("2.1.0") (@Since("2.1.0") override val uid: String)
+  extends Regressor[Vector, RobustRegression, RobustRegressionModel]
+  with RobustRegressionParams with Logging {
+
+  @Since("2.1.0")
+  def this() = this(Identifiable.randomUID("robReg"))
+
+  /**
+   * Sets the value of param [[m]].
+   * Default is 1.35.
+   * @group setParam
+   */
+  @Since("2.1.0")
+  def setM(value: Double): this.type = set(m, value)
+  setDefault(m -> 1.35)
+
+  /**
+   * Sets the regularization parameter.
+   * Default is 0.0.
+   * @group setParam
+   */
+  @Since("2.1.0")
+  def setRegParam(value: Double): this.type = set(regParam, value)
+  setDefault(regParam -> 0.0)
+
+  /**
+   * Sets if we should fit the intercept.
+   * Default is true.
+   * @group setParam
+   */
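For reference, the huber criterion discussed in this PR is quadratic for small residuals and linear for large ones, which is what makes it robust to outliers. A standalone Java sketch of the pointwise loss (my illustration, not the PR's code; the PR additionally scales residuals by a jointly estimated sigma, which this sketch omits):

```java
public class HuberLossDemo {
    // Pointwise huber loss with threshold m:
    // quadratic for |r| <= m, linear for |r| > m; the branches meet at |r| == m.
    static double huber(double residual, double m) {
        double r = Math.abs(residual);
        if (r <= m) {
            return 0.5 * r * r;
        } else {
            return m * (r - 0.5 * m);
        }
    }

    public static void main(String[] args) {
        double m = 1.35; // the PR's default: ~95% efficiency on Gaussian data
        System.out.println(huber(0.5, m));  // small residual: squared-loss region
        System.out.println(huber(10.0, m)); // outlier: grows linearly, not quadratically
    }
}
```

Because the penalty on large residuals grows linearly rather than quadratically, a single outlier cannot dominate the total loss the way it can under ordinary least squares.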
[GitHub] spark issue #14328: [MINOR] Close old PRs that should be closed but have not...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/14328 LGTM
[GitHub] spark issue #14318: [SPARK-16690][TEST] rename SQLTestUtils.withTempTable to...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/14318 Merging in master/2.0.
[GitHub] spark issue #14098: [SPARK-16380][SQL][Example]:Update SQL examples and prog...
Github user wangmiao1981 commented on the issue: https://github.com/apache/spark/pull/14098 @liancheng Thanks! I will review the PR #14317
[GitHub] spark issue #14317: [SPARK-16380][EXAMPLES] Update SQL examples and programm...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/14317 Merging in master/2.0.
[GitHub] spark pull request #14318: [SPARK-16690][TEST] rename SQLTestUtils.withTempT...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/14318
[GitHub] spark issue #14318: [SPARK-16690][TEST] rename SQLTestUtils.withTempTable to...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/14318 I'm going to cherry-pick this into branch-2.0 to avoid conflicts in bug fixes.
[GitHub] spark pull request #14317: [SPARK-16380][EXAMPLES] Update SQL examples and p...
Github user wangmiao1981 commented on a diff in the pull request: https://github.com/apache/spark/pull/14317#discussion_r71976352
--- Diff: docs/sql-programming-guide.md ---
@@ -79,7 +79,7 @@ The entry point into all functionality in Spark is the [`SparkSession`](api/java
 The entry point into all functionality in Spark is the [`SparkSession`](api/python/pyspark.sql.html#pyspark.sql.SparkSession) class. To create a basic `SparkSession`, just use `SparkSession.builder`:
 
-{% include_example init_session python/sql.py %}
+{% include_example init_session python/sql/basic.py %}
--- End diff --

The file name is not consistent with the Scala and Java versions, which are named SparkSQLExample.scala and SparkSQLExample.java. The Hive and Data Source example file names are not consistent either.
[GitHub] spark pull request #14317: [SPARK-16380][EXAMPLES] Update SQL examples and p...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/14317
[GitHub] spark pull request #14307: [SPARK-16672][SQL] SQLBuilder should not raise ex...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/14307#discussion_r71976388
--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/SQLBuilderSuite.scala ---
@@ -0,0 +1,33 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.hive
+
+import org.apache.spark.sql.QueryTest
+import org.apache.spark.sql.catalyst.SQLBuilder
+import org.apache.spark.sql.hive.test.TestHiveSingleton
+import org.apache.spark.sql.test.SQLTestUtils
+
+class SQLBuilderSuite extends QueryTest with SQLTestUtils with TestHiveSingleton {
--- End diff --

LogicalPlanToSQLSuite?
[GitHub] spark pull request #14317: [SPARK-16380][EXAMPLES] Update SQL examples and p...
Github user wangmiao1981 commented on a diff in the pull request: https://github.com/apache/spark/pull/14317#discussion_r71976400
--- Diff: examples/src/main/python/sql/basic.py ---
@@ -0,0 +1,194 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+from __future__ import print_function
+
+# $example on:init_session$
+from pyspark.sql import SparkSession
+# $example off:init_session$
+
+# $example on:schema_inferring$
+from pyspark.sql import Row
+# $example off:schema_inferring$
+
+# $example on:programmatic_schema$
+# Import data types
+from pyspark.sql.types import *
+# $example off:programmatic_schema$
+
+"""
+A simple example demonstrating basic Spark SQL features.
+Run with:
+  ./bin/spark-submit examples/src/main/python/sql/basic.py
+"""
+
+
+def basic_df_example(spark):
+    # $example on:create_df$
+    # spark is an existing SparkSession
+    df = spark.read.json("examples/src/main/resources/people.json")
+    # Displays the content of the DataFrame to stdout
+    df.show()
+    # +----+-------+
+    # | age|   name|
+    # +----+-------+
+    # |null|Michael|
+    # |  30|   Andy|
+    # |  19| Justin|
+    # +----+-------+
+    # $example off:create_df$
+
+    # $example on:untyped_ops$
+    # spark, df are from the previous example
+    # Print the schema in a tree format
+    df.printSchema()
+    # root
+    # |-- age: long (nullable = true)
+    # |-- name: string (nullable = true)
+
+    # Select only the "name" column
+    df.select("name").show()
+    # +-------+
+    # |   name|
+    # +-------+
+    # |Michael|
+    # |   Andy|
+    # | Justin|
+    # +-------+
+
+    # Select everybody, but increment the age by 1
+    df.select(df['name'], df['age'] + 1).show()
--- End diff --

Do you want to use `col('...')`? I have tested it and it works.
[GitHub] spark pull request #14317: [SPARK-16380][EXAMPLES] Update SQL examples and p...
Github user wangmiao1981 commented on a diff in the pull request: https://github.com/apache/spark/pull/14317#discussion_r71976456
--- Diff: examples/src/main/python/sql/datasource.py ---
@@ -0,0 +1,154 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+from __future__ import print_function
+
+from pyspark.sql import SparkSession
+# $example on:schema_merging$
+from pyspark.sql import Row
+# $example off:schema_merging$
+
+"""
+A simple example demonstrating Spark SQL data sources.
+Run with:
+  ./bin/spark-submit examples/src/main/python/sql/datasource.py
+"""
+
+
+def basic_datasource_example(spark):
+    # $example on:generic_load_save_functions$
+    df = spark.read.load("examples/src/main/resources/users.parquet")
+    df.select("name", "favorite_color").write.save("namesAndFavColors.parquet")
+    # $example off:generic_load_save_functions$
+
+    # $example on:manual_load_options$
+    df = spark.read.load("examples/src/main/resources/people.json", format="json")
+    df.select("name", "age").write.save("namesAndAges.parquet", format="parquet")
+    # $example off:manual_load_options$
+
+    # $example on:direct_sql$
+    df = spark.sql("SELECT * FROM parquet.`examples/src/main/resources/users.parquet`")
+    # $example off:direct_sql$
+
+
+def parquet_example(spark):
+    # $example on:basic_parquet_example$
+    peopleDF = spark.read.json("examples/src/main/resources/people.json")
+
+    # DataFrames can be saved as Parquet files, maintaining the schema information.
+    peopleDF.write.parquet("people.parquet")
+
+    # Read in the Parquet file created above.
+    # Parquet files are self-describing so the schema is preserved.
+    # The result of loading a parquet file is also a DataFrame.
+    parquetFile = spark.read.parquet("people.parquet")
+
+    # Parquet files can also be used to create a temporary view and then used in SQL statements.
+    parquetFile.createOrReplaceTempView("parquetFile")
+    teenagers = spark.sql("SELECT name FROM parquetFile WHERE age >= 13 AND age <= 19")
+    teenagers.show()
+    # +------+
+    # |  name|
+    # +------+
+    # |Justin|
+    # +------+
+    # $example off:basic_parquet_example$
+
+
+def parquet_schema_merging_example(spark):
+    # $example on:schema_merging$
+    # spark is from the previous example.
+    # Create a simple DataFrame, stored into a partition directory
+    sc = spark.sparkContext
+
+    squaresDF = spark.createDataFrame(sc.parallelize(range(1, 6))
+                                      .map(lambda i: Row(single=i, double=i ** 2)))
+    squaresDF.write.parquet("data/test_table/key=1")
+
+    # Create another DataFrame in a new partition directory,
+    # adding a new column and dropping an existing column
+    cubesDF = spark.createDataFrame(sc.parallelize(range(6, 11))
+                                    .map(lambda i: Row(single=i, triple=i ** 3)))
+    cubesDF.write.parquet("data/test_table/key=2")
+
+    # Read the partitioned table
+    mergedDF = spark.read.option("mergeSchema", "true").parquet("data/test_table")
+    mergedDF.printSchema()
+
+    # The final schema consists of all 3 columns in the Parquet files together
+    # with the partitioning column appeared in the partition directory paths.
+    # root
+    # |-- double: long (nullable = true)
+    # |-- single: long (nullable = true)
+    # |-- triple: long (nullable = true)
+    # |-- key: integer (nullable = true)
+    # $example off:schema_merging$
+
+
+def json_dataset_examplg(spark):
+    # $example on:json_dataset$
+    # spark is from the previous example.
+    sc = spark.sparkContext
+
+    # A JSON dataset is pointed to by path.
+    # The path can be either a single text file or a directory storing text files
+    path = "examples/src/main/resources/people.json
[GitHub] spark issue #14098: [SPARK-16380][SQL][Example]:Update SQL examples and prog...
Github user wangmiao1981 commented on the issue: https://github.com/apache/spark/pull/14098 As #14317 has been merged, I close this PR.
[GitHub] spark pull request #14098: [SPARK-16380][SQL][Example]:Update SQL examples a...
Github user wangmiao1981 closed the pull request at: https://github.com/apache/spark/pull/14098
[GitHub] spark pull request #14086: [SPARK-16463][SQL] Support `truncate` option in O...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/14086#discussion_r71976641
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCWriteSuite.scala ---
@@ -145,14 +153,24 @@ class JDBCWriteSuite extends SharedSQLContext with BeforeAndAfter {
     assert(2 === spark.read.jdbc(url, "TEST.APPENDTEST", new Properties()).collect()(0).length)
   }
 
-  test("CREATE then INSERT to truncate") {
+  test("Truncate") {
+    JdbcDialects.registerDialect(testH2Dialect)
     val df = spark.createDataFrame(sparkContext.parallelize(arr2x2), schema2)
     val df2 = spark.createDataFrame(sparkContext.parallelize(arr1x2), schema2)
+    val df3 = spark.createDataFrame(sparkContext.parallelize(arr2x3), schema3)
 
     df.write.jdbc(url1, "TEST.TRUNCATETEST", properties)
-    df2.write.mode(SaveMode.Overwrite).jdbc(url1, "TEST.TRUNCATETEST", properties)
+    df2.write.mode(SaveMode.Overwrite).option("truncate", true)
+      .jdbc(url1, "TEST.TRUNCATETEST", properties)
     assert(1 === spark.read.jdbc(url1, "TEST.TRUNCATETEST", properties).count())
     assert(2 === spark.read.jdbc(url1, "TEST.TRUNCATETEST", properties).collect()(0).length)
+
+    val m = intercept[SparkException] {
--- End diff --

Sure, that would be better.
[GitHub] spark issue #14182: [SPARK-16444][WIP][SparkR]: Isotonic Regression wrapper ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14182 **[Test build #62754 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62754/consoleFull)** for PR 14182 at commit [`7f68211`](https://github.com/apache/spark/commit/7f68211e362677e3599f4af7d574962b06611ab5).
[GitHub] spark issue #14182: [SPARK-16444][WIP][SparkR]: Isotonic Regression wrapper ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14182 **[Test build #62754 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62754/consoleFull)** for PR 14182 at commit [`7f68211`](https://github.com/apache/spark/commit/7f68211e362677e3599f4af7d574962b06611ab5).
* This patch **fails R style tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #14182: [SPARK-16444][WIP][SparkR]: Isotonic Regression wrapper ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14182 Merged build finished. Test FAILed.
[GitHub] spark issue #14086: [SPARK-16463][SQL] Support `truncate` option in Overwrit...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14086 **[Test build #62755 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62755/consoleFull)** for PR 14086 at commit [`8b452cb`](https://github.com/apache/spark/commit/8b452cb51814ed196a0cd16312074de3ea28330d).
[GitHub] spark issue #14182: [SPARK-16444][WIP][SparkR]: Isotonic Regression wrapper ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14182 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/62754/ Test FAILed.
[GitHub] spark pull request #14307: [SPARK-16672][SQL] SQLBuilder should not raise ex...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/14307#discussion_r71976751
--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/SQLBuilderSuite.scala ---
@@ -0,0 +1,33 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.hive
+
+import org.apache.spark.sql.QueryTest
+import org.apache.spark.sql.catalyst.SQLBuilder
+import org.apache.spark.sql.hive.test.TestHiveSingleton
+import org.apache.spark.sql.test.SQLTestUtils
+
+class SQLBuilderSuite extends QueryTest with SQLTestUtils with TestHiveSingleton {
--- End diff --

Oh, I see.
[GitHub] spark issue #14307: [SPARK-16672][SQL] SQLBuilder should not raise exception...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14307 **[Test build #62756 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62756/consoleFull)** for PR 14307 at commit [`70f5401`](https://github.com/apache/spark/commit/70f5401e5d1a606117f85b1caa6c29724c623dff).
[GitHub] spark pull request #14313: [SPARK-16674][SQL] Avoid per-record type dispatch...
Github user jaceklaskowski commented on a diff in the pull request: https://github.com/apache/spark/pull/14313#discussion_r71977310
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala ---
@@ -322,46 +322,134 @@ private[sql] class JDBCRDD(
     }
   }
 
-  // Each JDBC-to-Catalyst conversion corresponds to a tag defined here so that
-  // we don't have to potentially poke around in the Metadata once for every
-  // row.
-  // Is there a better way to do this?  I'd rather be using a type that
-  // contains only the tags I define.
-  abstract class JDBCConversion
-  case object BooleanConversion extends JDBCConversion
-  case object DateConversion extends JDBCConversion
-  case class DecimalConversion(precision: Int, scale: Int) extends JDBCConversion
-  case object DoubleConversion extends JDBCConversion
-  case object FloatConversion extends JDBCConversion
-  case object IntegerConversion extends JDBCConversion
-  case object LongConversion extends JDBCConversion
-  case object BinaryLongConversion extends JDBCConversion
-  case object StringConversion extends JDBCConversion
-  case object TimestampConversion extends JDBCConversion
-  case object BinaryConversion extends JDBCConversion
-  case class ArrayConversion(elementConversion: JDBCConversion) extends JDBCConversion
+  // A `JDBCConversion` is responsible for converting a value from `ResultSet`
+  // to a value in a field for `InternalRow`.
+  private type JDBCConversion = (ResultSet, Int) => Any
+
+  // This `ArrayElementConversion` is responsible for converting elements in
+  // an array from `ResultSet`.
+  private type ArrayElementConversion = (Object) => Any
 
   /**
-   * Maps a StructType to a type tag list.
+   * Maps a StructType to conversions for each type.
    */
   def getConversions(schema: StructType): Array[JDBCConversion] =
     schema.fields.map(sf => getConversions(sf.dataType, sf.metadata))
 
   private def getConversions(dt: DataType, metadata: Metadata): JDBCConversion = dt match {
-    case BooleanType => BooleanConversion
-    case DateType => DateConversion
-    case DecimalType.Fixed(p, s) => DecimalConversion(p, s)
-    case DoubleType => DoubleConversion
-    case FloatType => FloatConversion
-    case IntegerType => IntegerConversion
-    case LongType => if (metadata.contains("binarylong")) BinaryLongConversion else LongConversion
-    case StringType => StringConversion
-    case TimestampType => TimestampConversion
-    case BinaryType => BinaryConversion
-    case ArrayType(et, _) => ArrayConversion(getConversions(et, metadata))
+    case BooleanType =>
+      (rs: ResultSet, pos: Int) => rs.getBoolean(pos)
+
+    case DateType =>
+      (rs: ResultSet, pos: Int) =>
+        // DateTimeUtils.fromJavaDate does not handle null value, so we need to check it.
+        val dateVal = rs.getDate(pos)
+        if (dateVal != null) {
--- End diff --

`Option(dateVal).map(...).orNull`?
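The reviewer's `Option(dateVal).map(...).orNull` suggestion collapses the explicit null check into a single expression. A rough Python analogue of that pattern (the helper name `map_or_null` is mine, not Spark's):

```python
def map_or_null(value, convert):
    """Apply `convert` only when `value` is present, otherwise pass the
    null through -- the shape of Scala's Option(value).map(convert).orNull."""
    return None if value is None else convert(value)


# A possibly-null value never reaches the converter, so no null-pointer crash:
days = map_or_null(None, lambda d: d.toordinal())  # stays None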
[GitHub] spark pull request #14313: [SPARK-16674][SQL] Avoid per-record type dispatch...
Github user jaceklaskowski commented on a diff in the pull request: https://github.com/apache/spark/pull/14313#discussion_r71977329
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala ---
@@ -322,46 +322,134 @@ private[sql] class JDBCRDD(
     }
   }
 
-  // Each JDBC-to-Catalyst conversion corresponds to a tag defined here so that
-  // we don't have to potentially poke around in the Metadata once for every
-  // row.
-  // Is there a better way to do this?  I'd rather be using a type that
-  // contains only the tags I define.
-  abstract class JDBCConversion
-  case object BooleanConversion extends JDBCConversion
-  case object DateConversion extends JDBCConversion
-  case class DecimalConversion(precision: Int, scale: Int) extends JDBCConversion
-  case object DoubleConversion extends JDBCConversion
-  case object FloatConversion extends JDBCConversion
-  case object IntegerConversion extends JDBCConversion
-  case object LongConversion extends JDBCConversion
-  case object BinaryLongConversion extends JDBCConversion
-  case object StringConversion extends JDBCConversion
-  case object TimestampConversion extends JDBCConversion
-  case object BinaryConversion extends JDBCConversion
-  case class ArrayConversion(elementConversion: JDBCConversion) extends JDBCConversion
+  // A `JDBCConversion` is responsible for converting a value from `ResultSet`
+  // to a value in a field for `InternalRow`.
+  private type JDBCConversion = (ResultSet, Int) => Any
+
+  // This `ArrayElementConversion` is responsible for converting elements in
+  // an array from `ResultSet`.
+  private type ArrayElementConversion = (Object) => Any
 
   /**
-   * Maps a StructType to a type tag list.
+   * Maps a StructType to conversions for each type.
    */
   def getConversions(schema: StructType): Array[JDBCConversion] =
     schema.fields.map(sf => getConversions(sf.dataType, sf.metadata))
 
   private def getConversions(dt: DataType, metadata: Metadata): JDBCConversion = dt match {
-    case BooleanType => BooleanConversion
-    case DateType => DateConversion
-    case DecimalType.Fixed(p, s) => DecimalConversion(p, s)
-    case DoubleType => DoubleConversion
-    case FloatType => FloatConversion
-    case IntegerType => IntegerConversion
-    case LongType => if (metadata.contains("binarylong")) BinaryLongConversion else LongConversion
-    case StringType => StringConversion
-    case TimestampType => TimestampConversion
-    case BinaryType => BinaryConversion
-    case ArrayType(et, _) => ArrayConversion(getConversions(et, metadata))
+    case BooleanType =>
+      (rs: ResultSet, pos: Int) => rs.getBoolean(pos)
+
+    case DateType =>
+      (rs: ResultSet, pos: Int) =>
+        // DateTimeUtils.fromJavaDate does not handle null value, so we need to check it.
+        val dateVal = rs.getDate(pos)
+        if (dateVal != null) {
+          DateTimeUtils.fromJavaDate(dateVal)
+        } else {
+          null
+        }
+
+    case DecimalType.Fixed(p, s) =>
+      (rs: ResultSet, pos: Int) =>
+        val decimalVal = rs.getBigDecimal(pos)
+        if (decimalVal == null) {
--- End diff --

Same as above (plus you're checking equality with `null` opposite to the above -- consistency violated)
[GitHub] spark pull request #14313: [SPARK-16674][SQL] Avoid per-record type dispatch...
Github user jaceklaskowski commented on a diff in the pull request: https://github.com/apache/spark/pull/14313#discussion_r71977337 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala --- @@ -322,46 +322,134 @@ private[sql] class JDBCRDD( } } - // Each JDBC-to-Catalyst conversion corresponds to a tag defined here so that - // we don't have to potentially poke around in the Metadata once for every - // row. - // Is there a better way to do this? I'd rather be using a type that - // contains only the tags I define. - abstract class JDBCConversion - case object BooleanConversion extends JDBCConversion - case object DateConversion extends JDBCConversion - case class DecimalConversion(precision: Int, scale: Int) extends JDBCConversion - case object DoubleConversion extends JDBCConversion - case object FloatConversion extends JDBCConversion - case object IntegerConversion extends JDBCConversion - case object LongConversion extends JDBCConversion - case object BinaryLongConversion extends JDBCConversion - case object StringConversion extends JDBCConversion - case object TimestampConversion extends JDBCConversion - case object BinaryConversion extends JDBCConversion - case class ArrayConversion(elementConversion: JDBCConversion) extends JDBCConversion + // A `JDBCConversion` is responsible for converting a value from `ResultSet` + // to a value in a field for `InternalRow`. + private type JDBCConversion = (ResultSet, Int) => Any + + // This `ArrayElementConversion` is responsible for converting elements in + // an array from `ResultSet`. + private type ArrayElementConversion = (Object) => Any /** - * Maps a StructType to a type tag list. + * Maps a StructType to conversions for each type. 
    */
   def getConversions(schema: StructType): Array[JDBCConversion] =
     schema.fields.map(sf => getConversions(sf.dataType, sf.metadata))

   private def getConversions(dt: DataType, metadata: Metadata): JDBCConversion = dt match {
-    case BooleanType => BooleanConversion
-    case DateType => DateConversion
-    case DecimalType.Fixed(p, s) => DecimalConversion(p, s)
-    case DoubleType => DoubleConversion
-    case FloatType => FloatConversion
-    case IntegerType => IntegerConversion
-    case LongType => if (metadata.contains("binarylong")) BinaryLongConversion else LongConversion
-    case StringType => StringConversion
-    case TimestampType => TimestampConversion
-    case BinaryType => BinaryConversion
-    case ArrayType(et, _) => ArrayConversion(getConversions(et, metadata))
+    case BooleanType =>
+      (rs: ResultSet, pos: Int) => rs.getBoolean(pos)
+
+    case DateType =>
+      (rs: ResultSet, pos: Int) =>
+        // DateTimeUtils.fromJavaDate does not handle null value, so we need to check it.
+        val dateVal = rs.getDate(pos)
+        if (dateVal != null) {
+          DateTimeUtils.fromJavaDate(dateVal)
+        } else {
+          null
+        }
+
+    case DecimalType.Fixed(p, s) =>
+      (rs: ResultSet, pos: Int) =>
+        val decimalVal = rs.getBigDecimal(pos)
+        if (decimalVal == null) {
+          null
+        } else {
+          Decimal(decimalVal, p, s)
+        }
+
+    case DoubleType =>
+      (rs: ResultSet, pos: Int) => rs.getDouble(pos)
+
+    case FloatType =>
+      (rs: ResultSet, pos: Int) => rs.getFloat(pos)
+
+    case IntegerType =>
+      (rs: ResultSet, pos: Int) => rs.getInt(pos)
+
+    case LongType if metadata.contains("binarylong") =>
+      (rs: ResultSet, pos: Int) =>
+        val bytes = rs.getBytes(pos)
+        var ans = 0L
+        var j = 0
+        while (j < bytes.size) {
+          ans = 256 * ans + (255 & bytes(j))
+          j = j + 1
+        }
+        ans
+
+    case LongType =>
+      (rs: ResultSet, pos: Int) => rs.getLong(pos)
+
+    case StringType =>
+      (rs: ResultSet, pos: Int) =>
+        // TODO(davies): use getBytes for better performance, if the encoding is UTF-8
+        UTF8String.fromString(rs.getString(pos))
+
+    case TimestampType =>
+      (rs: ResultSet, pos: Int) =>
+        val t = rs.getTimestamp(pos)
+        if (t != null) {

--- End diff --

same as above
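The `LongType`/`binarylong` branch in the diff above folds the column's bytes into a long one byte at a time, big-endian, treating each byte as unsigned. The same arithmetic as a standalone Java sketch; the class and method names are illustrative, not Spark's:

```java
// Hypothetical standalone version of the "binarylong" decoding loop:
// fold a big-endian byte array into a long, each byte taken as unsigned.
final class BinaryLong {
    private BinaryLong() {}

    static long decode(byte[] bytes) {
        long ans = 0L;
        for (byte b : bytes) {
            // Equivalent to `ans = 256 * ans + (255 & bytes(j))` in the diff:
            // shift the accumulator one byte left, then add the next byte
            // interpreted as an unsigned value in 0..255.
            ans = (ans << 8) | (b & 0xFFL);
        }
        return ans;
    }
}
```

Like the original loop, this silently wraps on inputs longer than 8 bytes; it assumes the driver hands back the value in big-endian order.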
[GitHub] spark pull request #14313: [SPARK-16674][SQL] Avoid per-record type dispatch...
Github user jaceklaskowski commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14313#discussion_r71977344

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala ---

@@ -322,46 +322,134 @@ private[sql] class JDBCRDD(
     }
   }

-  // Each JDBC-to-Catalyst conversion corresponds to a tag defined here so that
-  // we don't have to potentially poke around in the Metadata once for every
-  // row.
-  // Is there a better way to do this?  I'd rather be using a type that
-  // contains only the tags I define.
-  abstract class JDBCConversion
-  case object BooleanConversion extends JDBCConversion
-  case object DateConversion extends JDBCConversion
-  case class DecimalConversion(precision: Int, scale: Int) extends JDBCConversion
-  case object DoubleConversion extends JDBCConversion
-  case object FloatConversion extends JDBCConversion
-  case object IntegerConversion extends JDBCConversion
-  case object LongConversion extends JDBCConversion
-  case object BinaryLongConversion extends JDBCConversion
-  case object StringConversion extends JDBCConversion
-  case object TimestampConversion extends JDBCConversion
-  case object BinaryConversion extends JDBCConversion
-  case class ArrayConversion(elementConversion: JDBCConversion) extends JDBCConversion
+  // A `JDBCConversion` is responsible for converting a value from `ResultSet`
+  // to a value in a field for `InternalRow`.
+  private type JDBCConversion = (ResultSet, Int) => Any
+
+  // This `ArrayElementConversion` is responsible for converting elements in
+  // an array from `ResultSet`.
+  private type ArrayElementConversion = (Object) => Any

   /**
-   * Maps a StructType to a type tag list.
+   * Maps a StructType to conversions for each type.
    */
   def getConversions(schema: StructType): Array[JDBCConversion] =
     schema.fields.map(sf => getConversions(sf.dataType, sf.metadata))

   private def getConversions(dt: DataType, metadata: Metadata): JDBCConversion = dt match {
-    case BooleanType => BooleanConversion
-    case DateType => DateConversion
-    case DecimalType.Fixed(p, s) => DecimalConversion(p, s)
-    case DoubleType => DoubleConversion
-    case FloatType => FloatConversion
-    case IntegerType => IntegerConversion
-    case LongType => if (metadata.contains("binarylong")) BinaryLongConversion else LongConversion
-    case StringType => StringConversion
-    case TimestampType => TimestampConversion
-    case BinaryType => BinaryConversion
-    case ArrayType(et, _) => ArrayConversion(getConversions(et, metadata))
+    case BooleanType =>
+      (rs: ResultSet, pos: Int) => rs.getBoolean(pos)
+
+    case DateType =>
+      (rs: ResultSet, pos: Int) =>
+        // DateTimeUtils.fromJavaDate does not handle null value, so we need to check it.
+        val dateVal = rs.getDate(pos)
+        if (dateVal != null) {
+          DateTimeUtils.fromJavaDate(dateVal)
+        } else {
+          null
+        }
+
+    case DecimalType.Fixed(p, s) =>
+      (rs: ResultSet, pos: Int) =>
+        val decimalVal = rs.getBigDecimal(pos)
+        if (decimalVal == null) {
+          null
+        } else {
+          Decimal(decimalVal, p, s)
+        }
+
+    case DoubleType =>
+      (rs: ResultSet, pos: Int) => rs.getDouble(pos)
+
+    case FloatType =>
+      (rs: ResultSet, pos: Int) => rs.getFloat(pos)
+
+    case IntegerType =>
+      (rs: ResultSet, pos: Int) => rs.getInt(pos)
+
+    case LongType if metadata.contains("binarylong") =>
+      (rs: ResultSet, pos: Int) =>
+        val bytes = rs.getBytes(pos)
+        var ans = 0L
+        var j = 0
+        while (j < bytes.size) {
+          ans = 256 * ans + (255 & bytes(j))
+          j = j + 1
+        }
+        ans
+
+    case LongType =>
+      (rs: ResultSet, pos: Int) => rs.getLong(pos)
+
+    case StringType =>
+      (rs: ResultSet, pos: Int) =>
+        // TODO(davies): use getBytes for better performance, if the encoding is UTF-8
+        UTF8String.fromString(rs.getString(pos))
+
+    case TimestampType =>
+      (rs: ResultSet, pos: Int) =>
+        val t = rs.getTimestamp(pos)
+        if (t != null) {
+          DateTimeUtils.fromJavaTimestamp(t)
+        } else {
+          null
+        }
+
+    case BinaryType =>
+      (rs: ResultSet, pos: Int) => rs.getBytes(pos)
+
+    case ArrayType(et, _) =>
+      val elementConversion: ArrayElementConversion = getArrayElementConversion(et, metadata)
+      (rs: ResultSet, pos: Int) =>
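The core idea of the diff under review is to resolve each column's type once into a conversion function and then only invoke the cached functions per row, instead of pattern-matching on the `DataType` for every record. A minimal illustration with simplified stand-in types; Spark's real code maps `DataType` to `(ResultSet, Int) => Any`, while everything below (string-keyed types, `String[]` rows) is invented for the sketch:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Simplified stand-in for the PR's technique: dispatch on the column type
// ONCE, cache a conversion function per column, and make the per-row path
// a plain array walk with no type dispatch.
class Dispatch {
    interface Conversion extends Function<String, Object> {}

    // Called once per column, when the schema is known.
    static Conversion conversionFor(String typeName) {
        switch (typeName) {
            case "int":    return Integer::valueOf;
            case "double": return Double::valueOf;
            default:       return raw -> raw; // passthrough for strings
        }
    }

    // Called once per row: applies the cached functions, no pattern match.
    static List<Object> convertRow(String[] raw, Conversion[] conversions) {
        List<Object> row = new ArrayList<>();
        for (int i = 0; i < raw.length; i++) {
            row.add(conversions[i].apply(raw[i]));
        }
        return row;
    }
}
```

The payoff is the same as in the PR title: the cost of deciding "which conversion applies to this column" is paid once per query rather than once per record.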
[GitHub] spark pull request #14313: [SPARK-16674][SQL] Avoid per-record type dispatch...
Github user jaceklaskowski commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14313#discussion_r71977368

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala ---

@@ -407,84 +495,8 @@ private[sql] class JDBCRDD(
       var i = 0
       while (i < conversions.length) {

--- End diff --

Why `while` not `foreach` or similar?
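One plausible answer to the question above: per-row hot paths in Spark's Scala code have historically used index-based `while` loops because `foreach` on a general collection can allocate an iterator and a closure per call, whereas the `while` form compiles to a plain counted loop. The two shapes, sketched in Java (the example itself is illustrative, not from the PR):

```java
// The two loop shapes discussed in the review, side by side. For plain
// arrays a modern JIT typically optimizes both well; the while form is a
// defensive habit for per-row hot paths.
class Loops {
    static long sumWhile(long[] xs) {
        long total = 0L;
        int i = 0;
        while (i < xs.length) { // no iterator object, no closure capture
            total += xs[i];
            i += 1;
        }
        return total;
    }

    static long sumForEach(long[] xs) {
        long total = 0L;
        for (long x : xs) { // reads better; same result
            total += x;
        }
        return total;
    }
}
```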
[GitHub] spark pull request #14329: [SPARKR][DOCS] fix broken url in doc
GitHub user felixcheung opened a pull request:

    https://github.com/apache/spark/pull/14329

    [SPARKR][DOCS] fix broken url in doc

## What changes were proposed in this pull request?

Fix a broken URL; also, the sparkR.session.stop Rd should have it in the header.

![image](https://cloud.githubusercontent.com/assets/8969467/17080129/26d41308-50d9-11e6-8967-79d6c920313f.png)

The Data type section is in the middle of a list of gapply/gapplyCollect subsections:

![image](https://cloud.githubusercontent.com/assets/8969467/17080122/f992d00a-50d8-11e6-8f2c-fd5786213920.png)

## How was this patch tested?

Manual test.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/felixcheung/spark rdoclinkfix

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/14329.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #14329

----
commit 40ca13c17e8e97732733e7bc200254459920d2f9
Author: Felix Cheung
Date:   2016-07-23T20:13:49Z

    doc fix

commit 06d8b415a3bce4c997683defce87b4833b56b1a9
Author: Felix Cheung
Date:   2016-07-23T20:20:21Z

    Merge branch 'master' of https://github.com/apache/spark into rdoclinkfix
[GitHub] spark issue #14329: [SPARKR][DOCS] fix broken url in doc
Github user felixcheung commented on the issue:

    https://github.com/apache/spark/pull/14329

    @shivaram
[GitHub] spark issue #14329: [SPARKR][DOCS] fix broken url in doc
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14329

    **[Test build #62757 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62757/consoleFull)** for PR 14329 at commit [`06d8b41`](https://github.com/apache/spark/commit/06d8b415a3bce4c997683defce87b4833b56b1a9).
[GitHub] spark issue #14086: [SPARK-16463][SQL] Support `truncate` option in Overwrit...
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14086

    **[Test build #62755 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62755/consoleFull)** for PR 14086 at commit [`8b452cb`](https://github.com/apache/spark/commit/8b452cb51814ed196a0cd16312074de3ea28330d).

     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.
[GitHub] spark issue #14086: [SPARK-16463][SQL] Support `truncate` option in Overwrit...
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14086

    Merged build finished. Test PASSed.
[GitHub] spark issue #14086: [SPARK-16463][SQL] Support `truncate` option in Overwrit...
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14086

    Test PASSed.
    Refer to this link for build results (access rights to CI server needed):
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/62755/
    Test PASSed.
[GitHub] spark issue #14307: [SPARK-16672][SQL] SQLBuilder should not raise exception...
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14307

    **[Test build #62756 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62756/consoleFull)** for PR 14307 at commit [`70f5401`](https://github.com/apache/spark/commit/70f5401e5d1a606117f85b1caa6c29724c623dff).

     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.
[GitHub] spark issue #14307: [SPARK-16672][SQL] SQLBuilder should not raise exception...
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14307

    Test PASSed.
    Refer to this link for build results (access rights to CI server needed):
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/62756/
    Test PASSed.
[GitHub] spark issue #14329: [SPARKR][DOCS] fix broken url in doc
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14329

    Merged build finished. Test PASSed.