[GitHub] spark issue #18953: [SPARK-20682][SQL] Implement new ORC data source based o...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18953 Merged build finished. Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18953: [SPARK-20682][SQL] Implement new ORC data source based o...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18953 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/80710/ Test PASSed.
[GitHub] spark issue #18953: [SPARK-20682][SQL] Implement new ORC data source based o...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18953 **[Test build #80710 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80710/testReport)** for PR 18953 at commit [`22dbe35`](https://github.com/apache/spark/commit/22dbe358041605d6afc9d510f29802ce1c0fb7b3). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #18956: [SPARK-21726][SQL] Check for structural integrity of the...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/18956 Interesting, the existing `PullupCorrelatedPredicates` produces an unresolved plan. I'll figure out the reason.
[GitHub] spark issue #18956: [SPARK-21726][SQL] Check for structural integrity of the...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18956 Merged build finished. Test FAILed.
[GitHub] spark issue #18956: [SPARK-21726][SQL] Check for structural integrity of the...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18956 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/80718/ Test FAILed.
[GitHub] spark issue #18956: [SPARK-21726][SQL] Check for structural integrity of the...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18956 **[Test build #80718 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80718/testReport)** for PR 18956 at commit [`c99011d`](https://github.com/apache/spark/commit/c99011ddbf60ae104cb91c578d56c971e6b87c86). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #18956: [SPARK-21726][SQL] Check for structural integrity of the...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18956 **[Test build #80717 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80717/testReport)** for PR 18956 at commit [`9170ceb`](https://github.com/apache/spark/commit/9170ceb69fda3ae6a064b1941cd380ee7a2a13ed). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #18955: [SPARK-21743][SQL] top-most limit should not cause memor...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18955 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/80713/ Test FAILed.
[GitHub] spark pull request #18953: [SPARK-20682][SQL] Implement new ORC data source ...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/18953#discussion_r133368809
--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/OrcQuerySuite.scala ---
@@ -343,7 +343,7 @@ class OrcQuerySuite extends QueryTest with BeforeAndAfterAll with OrcTest {
     }
   }
-  test("SPARK-8501: Avoids discovery schema from empty ORC files") {
+  ignore("SPARK-8501: Avoids discovery schema from empty ORC files") {
--- End diff --
This only happens on old Hive.
[GitHub] spark issue #18955: [SPARK-21743][SQL] top-most limit should not cause memor...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18955 **[Test build #80713 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80713/testReport)** for PR 18955 at commit [`67ac3aa`](https://github.com/apache/spark/commit/67ac3aa37ad7762f3d95c7e3f4900ba47124583b). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #18956: [SPARK-21726][SQL] Check for structural integrity of the...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18956 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/80717/ Test FAILed.
[GitHub] spark issue #18955: [SPARK-21743][SQL] top-most limit should not cause memor...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18955 Merged build finished. Test FAILed.
[GitHub] spark pull request #18953: [SPARK-20682][SQL] Implement new ORC data source ...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/18953#discussion_r133368613
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcFileFormat.scala ---
@@ -47,11 +47,11 @@ import org.apache.spark.util.SerializableConfiguration
  * `FileFormat` for reading ORC files. If this is moved or renamed, please update
  * `DataSource`'s backwardCompatibilityMap.
  */
-class OrcFileFormat extends FileFormat with DataSourceRegister with Serializable {
+class OrcFileFormatOld extends FileFormat with DataSourceRegister with Serializable {
--- End diff --
This change of name will be reverted after review.
[GitHub] spark issue #18956: [SPARK-21726][SQL] Check for structural integrity of the...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18956 Merged build finished. Test FAILed.
[GitHub] spark pull request #18953: [SPARK-20682][SQL] Implement new ORC data source ...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/18953#discussion_r133368561
--- Diff: sql/hive/src/main/resources/META-INF/services/org.apache.spark.sql.sources.DataSourceRegister ---
@@ -1,2 +1,2 @@
-org.apache.spark.sql.hive.orc.OrcFileFormat
+org.apache.spark.sql.hive.orc.OrcFileFormatOld
--- End diff --
This will be reverted after review.
[GitHub] spark issue #18956: [SPARK-21726][SQL] Check for structural integrity of the...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18956 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/80715/ Test FAILed.
[GitHub] spark issue #18956: [SPARK-21726][SQL] Check for structural integrity of the...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18956 Merged build finished. Test FAILed.
[GitHub] spark issue #18956: [SPARK-21726][SQL] Check for structural integrity of the...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18956 **[Test build #80715 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80715/testReport)** for PR 18956 at commit [`21d86ba`](https://github.com/apache/spark/commit/21d86bac80790d0b994df79b5e27a7d2d354e90f). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #18953: [SPARK-20682][SQL] Implement new ORC data source based o...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18953 **[Test build #80721 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80721/testReport)** for PR 18953 at commit [`07778ed`](https://github.com/apache/spark/commit/07778ed449bbf7ce2f1b5e8258e6ef58475b289c).
[GitHub] spark issue #18953: [SPARK-20682][SQL] Implement new ORC data source based o...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/18953 Rebased onto master since #18640 is merged.
[GitHub] spark issue #18958: [SPARK-21745][SQL] Refactor ColumnVector hierarchy to ma...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18958 **[Test build #80720 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80720/testReport)** for PR 18958 at commit [`cd0de39`](https://github.com/apache/spark/commit/cd0de397bba202cd5173e8aee0fc0bec2615295c).
[GitHub] spark issue #18640: [SPARK-21422][BUILD] Depend on Apache ORC 1.4.0
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/18640 Thank you, @gatorsmile !!!
[GitHub] spark issue #18958: [SPARK-21745][SQL] Refactor ColumnVector hierarchy to ma...
Github user ueshin commented on the issue: https://github.com/apache/spark/pull/18958 cc @cloud-fan @BryanCutler
[GitHub] spark pull request #18958: [SPARK-21745][SQL] Refactor ColumnVector hierarch...
GitHub user ueshin opened a pull request: https://github.com/apache/spark/pull/18958

[SPARK-21745][SQL] Refactor ColumnVector hierarchy to make ColumnVector read-only and to introduce MutableColumnVector.

## What changes were proposed in this pull request?

This is a refactoring of `ColumnVector` hierarchy and related classes.

1. make `ColumnVector` read-only
2. introduce `MutableColumnVector` with write interface
3. remove `ReadOnlyColumnVector`

## How was this patch tested?

Existing tests.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/ueshin/apache-spark issues/SPARK-21745

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/18958.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #18958

commit e4e22412c5ab23766a6908ec9e1a7931bcd52a54
Author: Takuya UESHIN
Date: 2017-08-15T04:09:16Z

    Refactor ColumnVector hierarchy to make ColumnVector read-only and to introduce MutableColumnVector.

commit cd0de397bba202cd5173e8aee0fc0bec2615295c
Author: Takuya UESHIN
Date: 2017-08-15T04:38:32Z

    Modify VectorizedHashMapGenerator to use OnHeapColumnVector directly.
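The split proposed in this PR (a read-only base type plus a subtype that adds the write interface) can be illustrated with a minimal sketch. All names below (`ReadableVector`, `WritableVector`, `OnHeapIntVector`, `VectorSketch`) are hypothetical stand-ins chosen for illustration; they are not Spark's actual classes or signatures, and the real `ColumnVector` API is considerably larger.

```scala
// Hypothetical, simplified stand-ins; NOT Spark's actual ColumnVector classes.

// Read-only view: consumers of a batch only need these accessors.
trait ReadableVector {
  def numRows: Int
  def isNullAt(rowId: Int): Boolean
  def getInt(rowId: Int): Int
}

// Write interface layered on top of the read-only one.
trait WritableVector extends ReadableVector {
  def putInt(rowId: Int, value: Int): Unit
  def putNull(rowId: Int): Unit
}

// A toy on-heap implementation backed by plain arrays.
final class OnHeapIntVector(capacity: Int) extends WritableVector {
  private val data = new Array[Int](capacity)
  private val nulls = new Array[Boolean](capacity)

  override def numRows: Int = capacity
  override def isNullAt(rowId: Int): Boolean = nulls(rowId)
  override def getInt(rowId: Int): Int = data(rowId)

  override def putInt(rowId: Int, value: Int): Unit = {
    data(rowId) = value
    nulls(rowId) = false
  }

  override def putNull(rowId: Int): Unit = {
    nulls(rowId) = true
  }
}

object VectorSketch {
  // Accepts only the read-only type, so this code cannot mutate the vector.
  def sumNonNull(v: ReadableVector): Long = {
    var total = 0L
    var i = 0
    while (i < v.numRows) {
      if (!v.isNullAt(i)) total += v.getInt(i)
      i += 1
    }
    total
  }
}
```

Under this assumed layout, a producer holds the `WritableVector` while filling it and hands the same object out typed as `ReadableVector`, which is presumably why a separate read-only wrapper class becomes unnecessary.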
[GitHub] spark issue #18315: [SPARK-21108] [ML] [WIP] convert LinearSVC to aggregator...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18315 **[Test build #80719 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80719/testReport)** for PR 18315 at commit [`94e0250`](https://github.com/apache/spark/commit/94e025055a7755460cb83afe375d11a99dda8c0c).
[GitHub] spark pull request #18640: [SPARK-21422][BUILD] Depend on Apache ORC 1.4.0
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/18640
[GitHub] spark issue #18315: [SPARK-21108] [ML] [WIP] convert LinearSVC to aggregator...
Github user yanboliang commented on the issue: https://github.com/apache/spark/pull/18315 @hhbyyh Would you mind removing ```WIP``` from the PR title if applicable? I'll take a look soon. Thanks.
[GitHub] spark issue #18640: [SPARK-21422][BUILD] Depend on Apache ORC 1.4.0
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/18640 Thanks! Merging to master.
[GitHub] spark issue #18315: [SPARK-21108] [ML] [WIP] convert LinearSVC to aggregator...
Github user yanboliang commented on the issue: https://github.com/apache/spark/pull/18315 Jenkins, test this please.
[GitHub] spark issue #17862: [SPARK-20602] [ML]Adding LBFGS optimizer and Squared_hin...
Github user yanboliang commented on the issue: https://github.com/apache/spark/pull/17862 cc @WeichenXu123 What do you think about this?
[GitHub] spark issue #18926: [SPARK-21712] [PySpark] Clarify type error for Column.su...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/18926 The current code around what this PR changes does not look quite clean to me either, and we should clean it up. But I think this PR itself is well-formed: the fix is valid, simple, and targeted, and it comes with tests.
[GitHub] spark issue #18640: [SPARK-21422][BUILD] Depend on Apache ORC 1.4.0
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/18640 Thank you so much, @rxin , @cloud-fan , @sameeragarwal , @mridulm , @viirya !
[GitHub] spark issue #18956: [SPARK-21726][SQL] Check for structural integrity of the...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18956 **[Test build #80718 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80718/testReport)** for PR 18956 at commit [`c99011d`](https://github.com/apache/spark/commit/c99011ddbf60ae104cb91c578d56c971e6b87c86).
[GitHub] spark issue #18926: [SPARK-21712] [PySpark] Clarify type error for Column.su...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/18926 To be honest, the current code does not look good to me. Since this change does not make the code worse, I will not revert it.
[GitHub] spark pull request #18956: [SPARK-21726][SQL] Check for structural integrity...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/18956#discussion_r133360995
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ---
@@ -37,6 +37,12 @@ import org.apache.spark.sql.types._
 abstract class Optimizer(sessionCatalog: SessionCatalog) extends RuleExecutor[LogicalPlan] {
+  // Check for structural integrity of the plan in test mode. Currently we only check if a plan is
+  // still resolved after the execution of each rule.
+  override protected def planChecker: Option[LogicalPlan => Boolean] = Some(
--- End diff --
Thanks. I will update it.
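The diff above only shows the start of the `planChecker` override, so the way the executor consumes it is not visible here. A minimal sketch of the general idea, an optional predicate evaluated after every rule, might look like the following. `Plan`, `Rule`, `ToyRuleExecutor`, and `CheckedExecutor` are hypothetical names for illustration only; they are not Catalyst's `LogicalPlan`, `Rule`, or `RuleExecutor`, and the failure behaviour is an assumption.

```scala
// Hypothetical toy types; NOT Catalyst's LogicalPlan / Rule / RuleExecutor.
final case class Plan(description: String, resolved: Boolean)

trait Rule {
  def name: String
  def apply(plan: Plan): Plan
}

class ToyRuleExecutor(rules: Seq[Rule]) {
  // None disables checking; Some(f) is evaluated after every rule.
  protected def planChecker: Option[Plan => Boolean] = None

  def execute(plan: Plan): Plan =
    rules.foldLeft(plan) { (current, rule) =>
      val next = rule(current)
      planChecker.foreach { check =>
        if (!check(next)) {
          // Assumed behaviour: fail fast and name the offending rule.
          throw new IllegalStateException(
            s"Rule ${rule.name} broke the structural integrity of the plan")
        }
      }
      next
    }
}

// An executor that requires the plan to stay resolved after each rule.
class CheckedExecutor(rules: Seq[Rule]) extends ToyRuleExecutor(rules) {
  override protected def planChecker: Option[Plan => Boolean] =
    Some((p: Plan) => p.resolved)
}
```

Checking after each rule rather than once at the end is what makes it possible to attribute an unresolved plan to a specific rule, which is how a case like the `PullupCorrelatedPredicates` one mentioned elsewhere in this thread would surface.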
[GitHub] spark pull request #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with...
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/18538#discussion_r133360674
--- Diff: mllib/src/test/scala/org/apache/spark/ml/evaluation/ClusteringEvaluatorSuite.scala ---
@@ -0,0 +1,225 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.evaluation
+
+import org.apache.spark.SparkFunSuite
+import org.apache.spark.ml.linalg.{Vectors, VectorUDT}
+import org.apache.spark.ml.param.ParamsSuite
+import org.apache.spark.ml.util.DefaultReadWriteTest
+import org.apache.spark.mllib.util.MLlibTestSparkContext
+import org.apache.spark.sql.Row
+import org.apache.spark.sql.types.{IntegerType, StructField, StructType}
+
+
+class ClusteringEvaluatorSuite
+  extends SparkFunSuite with MLlibTestSparkContext with DefaultReadWriteTest {
+
+  import testImplicits._
+
+  val dataset = Seq(Row(Vectors.dense(5.1, 3.5, 1.4, 0.2), 0),
--- End diff --
I think we can't put test data in a resource file, as resource files will be packaged into the final jar file. What about randomly generating some small data in Python and hard-coding it here? Just like what we did in [```GaussianMixtureSuite```](https://github.com/apache/spark/blob/master/mllib/src/test/scala/org/apache/spark/ml/clustering/GaussianMixtureSuite.scala#L195).
[GitHub] spark issue #18640: [SPARK-21422][BUILD] Depend on Apache ORC 1.4.0
Github user rxin commented on the issue: https://github.com/apache/spark/pull/18640 lgtm
[GitHub] spark issue #18957: [SPARK-21744][CORE] Add retry logic for new broadcast in...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18957 Can one of the admins verify this patch?
[GitHub] spark pull request #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with...
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/18538#discussion_r133360284 --- Diff: mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala --- @@ -0,0 +1,240 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.evaluation + +import org.apache.spark.SparkContext +import org.apache.spark.annotation.Experimental +import org.apache.spark.broadcast.Broadcast +import org.apache.spark.ml.linalg.{BLAS, DenseVector, Vector, Vectors, VectorUDT} +import org.apache.spark.ml.param.{Param, ParamMap, ParamValidators} +import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasPredictionCol} +import org.apache.spark.ml.util.{DefaultParamsReadable, DefaultParamsWritable, Identifiable, SchemaUtils} +import org.apache.spark.sql.{DataFrame, Dataset} +import org.apache.spark.sql.functions.{avg, col, udf} +import org.apache.spark.sql.types.IntegerType + +/** + * Evaluator for clustering results. + * At the moment, the supported metrics are: + * squaredSilhouette: silhouette measure using the squared Euclidean distance; + * cosineSilhouette: silhouette measure using the cosine distance. 
+ * The implementation follows the proposal explained + * <a href="https://drive.google.com/file/d/0B0Hyo%5f%5fbG%5f3fdkNvSVNYX2E3ZU0/view"> + * in this document</a>. + */ +@Experimental +class ClusteringEvaluator (val uid: String) + extends Evaluator with HasPredictionCol with HasFeaturesCol with DefaultParamsWritable { + + def this() = this(Identifiable.randomUID("SquaredEuclideanSilhouette")) + + override def copy(pMap: ParamMap): ClusteringEvaluator = this.defaultCopy(pMap) + + override def isLargerBetter: Boolean = true + + /** @group setParam */ + def setPredictionCol(value: String): this.type = set(predictionCol, value) + + /** @group setParam */ + def setFeaturesCol(value: String): this.type = set(featuresCol, value) + + /** + * param for metric name in evaluation + * (supports `"squaredSilhouette"` (default)) + * @group param + */ + val metricName: Param[String] = { +val allowedParams = ParamValidators.inArray(Array("squaredSilhouette")) --- End diff -- Yeah, I think we can add a new param for the distance metric in the future. As MLlib only supports _squared Euclidean distance_, we can ignore this param for now and add an annotation in the API to clarify it. You can check MLlib `KMeans`; there is no param to set the distance metric. cc @jkbradley @MLnick @hhbyyh
[GitHub] spark pull request #18957: [SPARK-21744][CORE] Add retry logic for new broad...
GitHub user caneGuy opened a pull request: https://github.com/apache/spark/pull/18957 [SPARK-21744][CORE] Add retry logic for new broadcast in BroadcastManager

## What changes were proposed in this pull request?

When the driver submits a new stage and there is a bad disk, the driver may exit because of the exception below:

`Job aborted due to stage failure: Task serialization failed: java.io.IOException: Failed to create local dir in /home/work/hdd5/yarn/xxx/appcache/application_1463372393999_144979/blockmgr-1f96b724-3e16-4c09-8601-1a2e3b758185/3b.
org.apache.spark.storage.DiskBlockManager.getFile(DiskBlockManager.scala:73)
org.apache.spark.storage.DiskStore.contains(DiskStore.scala:173)
org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$getCurrentBlockStatus(BlockManager.scala:391)
org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:801)
org.apache.spark.storage.BlockManager.putIterator(BlockManager.scala:629)
org.apache.spark.storage.BlockManager.putSingle(BlockManager.scala:987)
org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:99)
org.apache.spark.broadcast.TorrentBroadcast.<init>(TorrentBroadcast.scala:85)
org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:34)
org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:63)
org.apache.spark.SparkContext.broadcast(SparkContext.scala:1332)
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitMissingTasks(DAGScheduler.scala:863)
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskCompletion$14$$anonfun$apply$1.apply$mcVI$sp(DAGScheduler.scala:1090)
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskCompletion$14$$anonfun$apply$1.apply(DAGScheduler.scala:1086)
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskCompletion$14$$anonfun$apply$1.apply(DAGScheduler.scala:1086)
scala.Option.foreach(Option.scala:236)
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskCompletion$14.apply(DAGScheduler.scala:1086)
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskCompletion$14.apply(DAGScheduler.scala:1085)
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1085)
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1528)
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1493)
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1482)
org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)`

We can add retry logic when creating a broadcast to lower the probability of this scenario. There is no side effect in the normal scenario.

## How was this patch tested?

Unit test in BroadcastSuite

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/caneGuy/spark zhoukang/imporve-newbroadcast

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/18957.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #18957

commit 9083304b4b42357dc2717151db28882e01245838 Author: zhoukang Date: 2017-08-16T05:08:35Z [SPARK][CORE] Add retry logic for new broadcast in BroadcastManager
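For readers skimming the thread, the retry idea described in this pull request can be illustrated with a small sketch that does not depend on Spark. The object name `RetrySketch`, the helper `retry`, and the `maxAttempts` parameter are all hypothetical and are not taken from the patch; the actual change to `BroadcastManager.newBroadcast` may look quite different.

```scala
import java.io.IOException
import scala.annotation.tailrec
import scala.util.{Failure, Success, Try}

// Hypothetical helper, not the code from the patch: runs `op` and retries it
// when it fails with an IOException, up to `maxAttempts` attempts in total.
object RetrySketch {
  @tailrec
  def retry[T](maxAttempts: Int)(op: => T): T = {
    Try(op) match {
      case Success(value) => value
      // Only I/O failures (such as "Failed to create local dir") are retried.
      case Failure(_: IOException) if maxAttempts > 1 => retry(maxAttempts - 1)(op)
      // Out of attempts, or a different kind of failure: propagate it.
      case Failure(e) => throw e
    }
  }
}
```

A caller would wrap the failing operation, for example `RetrySketch.retry(3) { createBroadcast() }`, where `createBroadcast` stands in for whatever call creates the broadcast. In the normal case the operation succeeds on the first attempt and the wrapper adds no extra work.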
[GitHub] spark issue #18956: [SPARK-21726][SQL] Check for structural integrity of the...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18956 **[Test build #80717 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80717/testReport)** for PR 18956 at commit [`9170ceb`](https://github.com/apache/spark/commit/9170ceb69fda3ae6a064b1941cd380ee7a2a13ed).
[GitHub] spark pull request #18956: [SPARK-21726][SQL] Check for structural integrity...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/18956#discussion_r133360047 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala --- @@ -37,6 +37,12 @@ import org.apache.spark.sql.types._ abstract class Optimizer(sessionCatalog: SessionCatalog) extends RuleExecutor[LogicalPlan] { + // Check for structural integrity of the plan in test mode. Currently we only check if a plan is + // still resolved after the execution of each rule. + override protected def planChecker: Option[LogicalPlan => Boolean] = Some( --- End diff -- Can we move the check for whether this is a test into here, so that this method simply returns a Boolean?
[GitHub] spark issue #18956: [SPARK-21726][SQL] Check for structural integrity of the...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18956 **[Test build #80715 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80715/testReport)** for PR 18956 at commit [`21d86ba`](https://github.com/apache/spark/commit/21d86bac80790d0b994df79b5e27a7d2d354e90f).
[GitHub] spark issue #18855: [SPARK-3151] [Block Manager] DiskStore.getBytes fails fo...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18855 **[Test build #80716 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80716/testReport)** for PR 18855 at commit [`732073c`](https://github.com/apache/spark/commit/732073c5c73d4c12cc1059314c25f1ae94fc4469).
[GitHub] spark pull request #18955: [SPARK-21743][SQL] top-most limit should not caus...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/18955#discussion_r133359698 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala --- @@ -2658,4 +2658,9 @@ class SQLQuerySuite extends QueryTest with SharedSQLContext { checkAnswer(sql("SELECT __auto_generated_subquery_name.i from (SELECT i FROM v)"), Row(1)) } } + + test("SPARK-21743: top-most limit should not cause memory leak") { +// In unit test, Spark will fail the query if memory leak detected. --- End diff -- The test did not fail, but I saw the warning message: > 22:05:07.455 WARN org.apache.spark.executor.Executor: Managed memory leak detected; size = 33554432 bytes, TID = 2
[GitHub] spark pull request #18956: [SPARK-21726][SQL] Check for structural integrity...
GitHub user viirya opened a pull request: https://github.com/apache/spark/pull/18956 [SPARK-21726][SQL] Check for structural integrity of the plan in Optimizer in test mode.

## What changes were proposed in this pull request?

We have many optimization rules now in `Optimizer`. Right now the optimizer has no checks for the structural integrity of the plan (e.g. whether it is still resolved). When debugging, it is difficult to identify which rules return invalid plans. It would be great if, in test mode, we could check whether a plan is still resolved after the execution of each rule, so that we can catch rules that return invalid plans.

## How was this patch tested?

Added tests.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/viirya/spark-1 SPARK-21726

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/18956.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #18956

commit 21d86bac80790d0b994df79b5e27a7d2d354e90f Author: Liang-Chi Hsieh Date: 2017-08-16T04:53:49Z Check for structural integrity of the plan in Optimzer in test mode.
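The per-rule check proposed in this pull request can be pictured with a toy model. The sketch below is only an illustration under stated assumptions: `Plan`, `Rule`, `PlanCheckSketch`, and the `testMode` flag are hypothetical stand-ins, not Catalyst's `LogicalPlan`, `Rule`, or `RuleExecutor`, and the real patch may wire the check in differently (the review discussion is about exactly that).

```scala
// Toy stand-ins for Catalyst types; all names here are hypothetical.
final case class Plan(name: String, resolved: Boolean)
final case class Rule(name: String, transform: Plan => Plan)

object PlanCheckSketch {
  // Applies the rules in order. In test mode, fails fast and names the first
  // rule whose output plan is no longer resolved.
  def execute(plan: Plan, rules: Seq[Rule], testMode: Boolean): Plan =
    rules.foldLeft(plan) { (current, rule) =>
      val next = rule.transform(current)
      if (testMode && !next.resolved) {
        throw new IllegalStateException(
          s"Rule ${rule.name} produced an unresolved plan: ${next.name}")
      }
      next
    }
}
```

Checking after every rule, rather than once at the end, is what makes the offending rule easy to identify, which is the debugging problem the description mentions.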
[GitHub] spark issue #18492: [SPARK-19326] Speculated task attempts do not get launch...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18492 **[Test build #80714 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80714/testReport)** for PR 18492 at commit [`8b8b128`](https://github.com/apache/spark/commit/8b8b12820b3bcdf57488558be08a64c3acca3053).
[GitHub] spark issue #18955: [SPARK-21743][SQL] top-most limit should not cause memor...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18955 **[Test build #80713 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80713/testReport)** for PR 18955 at commit [`67ac3aa`](https://github.com/apache/spark/commit/67ac3aa37ad7762f3d95c7e3f4900ba47124583b).
[GitHub] spark issue #18955: [SPARK-21743][SQL] top-most limit should not cause memor...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/18955 cc @gengliangwang @sameeragarwal @hvanhovell
[GitHub] spark pull request #18955: [SPARK-21743][SQL] top-most limit should not caus...
GitHub user cloud-fan opened a pull request: https://github.com/apache/spark/pull/18955 [SPARK-21743][SQL] top-most limit should not cause memory leak

## What changes were proposed in this pull request?

For the top-most limit, we use a special operator to execute it: `CollectLimitExec`. `CollectLimitExec` retrieves `n` (the limit) rows from each partition of the child plan output; see https://github.com/apache/spark/blob/v2.2.0/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkPlan.scala#L311. It's very likely that we don't exhaust the child plan output.

This is fine when whole-stage codegen is off, as the child plan will release the resource via the task completion listener. However, when whole-stage codegen is on, the resource can only be released if all output is consumed.

To fix this memory leak, one simple approach is to make the child plan output contain only `n` rows when `CollectLimitExec` retrieves `n` rows from it, so that the output is exhausted and the resource is released. This can be done by wrapping the child plan with `LocalLimit`.

## How was this patch tested?

A regression test.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/cloud-fan/spark leak

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/18955.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #18955

commit 67ac3aa37ad7762f3d95c7e3f4900ba47124583b Author: Wenchen Fan Date: 2017-08-16T04:27:03Z top-most limit should not cause memory leak
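The mechanism described in this pull request can be mimicked with a toy model that does not use Spark. Everything in the sketch is an assumption made for illustration: `ScanIterator` stands in for a child plan whose resource is freed only once its output is exhausted, its `limit` parameter plays the role of wrapping the child with `LocalLimit`, and `LimitSketch.take` plays the role of a consumer that stops after `n` rows. None of these names come from Spark.

```scala
import scala.collection.mutable.ArrayBuffer

// Toy model, not Spark code: a row source holding a "resource" that is freed
// only when hasNext observes that the input is exhausted.
final class ScanIterator(rows: Seq[Int], limit: Option[Int] = None) extends Iterator[Int] {
  // With a limit, the source itself produces at most `limit` rows.
  private val underlying = limit.map(n => rows.take(n)).getOrElse(rows).iterator
  var released = false
  def hasNext: Boolean = {
    val more = underlying.hasNext
    if (!more) released = true // the resource is freed only on exhaustion
    more
  }
  def next(): Int = underlying.next()
}

object LimitSketch {
  // Hypothetical consumer: checks hasNext first, then stops after n rows.
  def take(it: Iterator[Int], n: Int): Seq[Int] = {
    val buf = ArrayBuffer.empty[Int]
    while (it.hasNext && buf.size < n) buf += it.next()
    buf.toSeq
  }
}
```

Taking 3 rows from `new ScanIterator(1 to 10)` leaves `released` as `false`, because the source never sees the end of its input. With `new ScanIterator(1 to 10, Some(3))` the same consumer reaches the end of the limited output and `released` becomes `true`.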
[GitHub] spark issue #18492: [SPARK-19326] Speculated task attempts do not get launch...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18492 **[Test build #80712 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80712/testReport)** for PR 18492 at commit [`f7cdad9`](https://github.com/apache/spark/commit/f7cdad9bfdc58a758dda69aa0204d3f5115897b2).
[GitHub] spark issue #18954: [SPARK-17654] [SQL] Enable creating hive bucketed tables
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18954 **[Test build #80711 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80711/testReport)** for PR 18954 at commit [`4b009a9`](https://github.com/apache/spark/commit/4b009a909768f2d8066fb58a45d1c54378fa8ff9).
[GitHub] spark issue #18954: [SPARK-17654] [SQL] Enable creating hive bucketed tables
Github user tejasapatil commented on the issue: https://github.com/apache/spark/pull/18954 Jenkins test this please
[GitHub] spark pull request #18954: [SPARK-17654] [SQL] Enable creating hive bucketed...
GitHub user tejasapatil opened a pull request: https://github.com/apache/spark/pull/18954 [SPARK-17654] [SQL] Enable creating hive bucketed tables

## What changes were proposed in this pull request?

### Semantics:

- If the Hive table is bucketed, then the INSERT node expects the child distribution to be based on the hash of the bucket columns. Otherwise it is empty. (For comparison, with Spark native bucketing the required distribution is not enforced whether or not the table is bucketed; this saves the shuffle in comparison with Hive.)
- Sort ordering for the INSERT node over a Hive bucketed table is determined as follows:

| Table type | Normal table | Bucketed table |
| - | - | - |
| non-partitioned insert | Nil | sort columns |
| static partition | Nil | sort columns |
| dynamic partitions | partition columns | (partition columns + bucketId + sort columns) |

For comparison, this is how sort ordering is expressed for Spark native bucketing:

| Table type | Normal table | Bucketed table |
| - | - | - |
| sort ordering | partition columns | (partition columns + bucketId + sort columns) |

Why is there a difference? With Hive, since bucketed insertions need a shuffle, sort ordering can be relaxed for both the non-partitioned and static partition cases. Every RDD partition gets rows corresponding to a single bucket, so those can be written to the corresponding output file after the sort. In the case of dynamic partitions, the rows need to be routed to the appropriate partition, which makes it similar to Spark's constraints.

- Only `Overwrite` mode is allowed for Hive bucketed tables, as any other mode would break the bucketing guarantees of the table. This is a difference with respect to how Spark bucketing works.
- With this PR, if no files are created for empty buckets, the query will fail. Creation of empty files will be supported in a coming iteration. This is a difference with respect to how Spark bucketing works, as it does NOT need files for empty buckets.

### Summary of changes done:

- `ClusteredDistribution` and `HashPartitioning` are modified to store the hashing function used.
- `RunnableCommand`s can now express the required distribution and ordering. This is used by `ExecutedCommandExec`, which runs these commands.
  - The good thing about this is that I could remove the logic for enforcing sort ordering inside `FileFormatWriter`, which felt out of place. Ideally, this kind of adding of physical nodes should be done within the planner, which is what happens with this PR.
- `InsertIntoHiveTable` enforces both distribution and sort ordering.
- `InsertIntoHadoopFsRelationCommand` enforces sort ordering ONLY (and not the distribution).
- Fixed a bug due to which any alter command on a bucketed table (e.g. updating stats) would wipe out the bucketing spec from the metastore. This made insertion into a bucketed table a non-idempotent operation.

## How was this patch tested?

- Added new unit tests

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/tejasapatil/spark bucket_write

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/18954.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #18954

commit 43fae74ff017959edbffa1cbd1405f58c5abe279 Author: Tejas Patil Date: 2017-08-03T22:57:54Z bucketed writer implementation

commit 4b009a909768f2d8066fb58a45d1c54378fa8ff9 Author: Tejas Patil Date: 2017-08-15T23:27:06Z Move `requiredOrdering` into RunnableCommand instead of `FileFormatWriter`
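The sort-ordering table for Hive tables in this description can be restated as a small function. This is only a sketch of the table as written in the pull request, with column names modelled as plain strings; `InsertKind`, `HiveOrderingSketch`, and `requiredOrdering` are hypothetical names and do not correspond to the API added by the patch.

```scala
// Hypothetical model of the three insert kinds named in the table.
sealed trait InsertKind
case object NonPartitioned extends InsertKind
case object StaticPartition extends InsertKind
case object DynamicPartition extends InsertKind

object HiveOrderingSketch {
  // Returns the required sort ordering (as column names) for an INSERT into a
  // Hive table, following the table in the PR description.
  def requiredOrdering(
      kind: InsertKind,
      bucketed: Boolean,
      partitionCols: Seq[String],
      sortCols: Seq[String]): Seq[String] =
    (kind, bucketed) match {
      case (DynamicPartition, true)  => (partitionCols :+ "bucketId") ++ sortCols
      case (DynamicPartition, false) => partitionCols
      case (_, true)                 => sortCols // non-partitioned or static partition
      case (_, false)                => Nil
    }
}
```

Only the dynamic-partition case needs the partition columns and the bucket id in the ordering. In the other two cases the description relies on each RDD partition holding rows for a single bucket after the shuffle.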
[GitHub] spark issue #18953: [SPARK-20682][SQL] Implement new ORC data source based o...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18953 Merged build finished. Test FAILed.
[GitHub] spark issue #18953: [SPARK-20682][SQL] Implement new ORC data source based o...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18953 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/80707/ Test FAILed.
[GitHub] spark issue #18953: [SPARK-20682][SQL] Implement new ORC data source based o...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18953 **[Test build #80707 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80707/testReport)** for PR 18953 at commit [`051ed1f`](https://github.com/apache/spark/commit/051ed1fd86ee1354d1e650b1cf51a41db2d83619). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `class OrcFileFormatOld extends FileFormat with DataSourceRegister with Serializable `
[GitHub] spark pull request #18492: [SPARK-19326] Speculated task attempts do not get...
Github user janewangfb commented on a diff in the pull request: https://github.com/apache/spark/pull/18492#discussion_r133355548 --- Diff: core/src/main/scala/org/apache/spark/scheduler/SparkListener.scala --- @@ -291,6 +297,16 @@ private[spark] trait SparkListenerInterface { def onBlockUpdated(blockUpdated: SparkListenerBlockUpdated): Unit /** + * Called when a speculative task is submitted + */ + def onSpeculativeTaskSubmitted(speculativeTask: SparkListenerSpeculativeTaskSubmitted): Unit + + /** + * Called when an extra executor is needed + */ + def onExtraExecutorNeeded(): Unit --- End diff -- @cloud-fan after thinking it over, yes, I think we can get rid of the extraExecutorNeeded event and handle it in ExecutorAllocationManager.scala.
[GitHub] spark issue #18953: [SPARK-20682][SQL] Implement new ORC data source based o...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18953 **[Test build #80710 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80710/testReport)** for PR 18953 at commit [`22dbe35`](https://github.com/apache/spark/commit/22dbe358041605d6afc9d510f29802ce1c0fb7b3).
[GitHub] spark issue #18896: [SPARK-21681][ML] fix bug of MLOR do not work correctly ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18896 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/80708/
[GitHub] spark issue #18896: [SPARK-21681][ML] fix bug of MLOR do not work correctly ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18896 Merged build finished. Test PASSed.
[GitHub] spark issue #18896: [SPARK-21681][ML] fix bug of MLOR do not work correctly ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18896 **[Test build #80708 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80708/testReport)** for PR 18896 at commit [`2eda876`](https://github.com/apache/spark/commit/2eda87658e655f9f4424d7ac621fd44ca6d0f0ed). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #18951: [SPARK-21738] Thriftserver doesn't cancel jobs when sess...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/18951 LGTM cc @cloud-fan @jiangxb1987 @wangyum @debugger87 @jerryshao
[GitHub] spark issue #18640: [SPARK-21422][BUILD] Depend on Apache ORC 1.4.0
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/18640 Hi, @cloud-fan, @rxin, @sameeragarwal and @mridulm. Could you merge this PR?
[GitHub] spark issue #18810: [SPARK-21603][SQL]The wholestage codegen will be much sl...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/18810 LGTM
[GitHub] spark issue #18949: [SPARK-12961][CORE][FOLLOW-UP] Remove wrapper code for S...
Github user maropu commented on the issue: https://github.com/apache/spark/pull/18949 @viirya Aha, OK, thanks. (By the way, since the comment is still important, maybe we had better keep it as a code comment.)
[GitHub] spark issue #16763: [SPARK-19422][ML][WIP] Cache input data in algorithms
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16763 Merged build finished. Test FAILed.
[GitHub] spark issue #16763: [SPARK-19422][ML][WIP] Cache input data in algorithms
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16763 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/80709/
[GitHub] spark issue #16763: [SPARK-19422][ML][WIP] Cache input data in algorithms
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16763 **[Test build #80709 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80709/testReport)** for PR 16763 at commit [`1742c15`](https://github.com/apache/spark/commit/1742c15275b16f732adf5c55b89fb445a09886e7). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #18949: [SPARK-12961][CORE][FOLLOW-UP] Remove wrapper code for S...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/18949 LGTM
[GitHub] spark issue #18949: [SPARK-12961][CORE][FOLLOW-UP] Remove wrapper code for S...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/18949 @maropu There is another reason we leave the workaround in place: https://github.com/apache/spark/pull/11524#issuecomment-192409933
[GitHub] spark issue #18902: [SPARK-21690][ML] one-pass imputer
Github user zhengruifeng commented on the issue: https://github.com/apache/spark/pull/18902 @hhbyyh I rewrote the implementation: all `NaN` and `missingValue` entries are now first transformed to `null`, and then the existing methods are used. For columns containing only `null`, `avg(col)` returns `null` and `median` returns `Array.empty[Double]`.
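The behaviour described in this comment can be sketched in plain Scala, without Spark. This is only an illustration under stated assumptions: `ImputerSketch`, `toNullable`, `mean`, and `median` are hypothetical names, `Option[Double]` stands in for a nullable SQL column, and the median here is the lower middle element rather than whatever quantile method the PR actually uses.

```scala
// Hypothetical helpers for illustration; this is not the Spark ML Imputer API.
object ImputerSketch {
  // Map NaN and the user-chosen missingValue placeholder to None,
  // which stands in for SQL null in this sketch.
  def toNullable(values: Seq[Double], missingValue: Double): Seq[Option[Double]] =
    values.map(v => if (v.isNaN || v == missingValue) None else Some(v))

  // Mean over the non-null entries; None when every entry is null,
  // mirroring avg(col) returning null for an all-null column.
  def mean(values: Seq[Option[Double]]): Option[Double] = {
    val present = values.flatten
    if (present.isEmpty) None else Some(present.sum / present.size)
  }

  // Lower median of the non-null entries (an assumption of this sketch);
  // an empty array when every entry is null, mirroring Array.empty[Double].
  def median(values: Seq[Option[Double]]): Array[Double] = {
    val sorted = values.flatten.sorted
    if (sorted.isEmpty) Array.empty[Double] else Array(sorted((sorted.size - 1) / 2))
  }
}
```

A caller would then have to decide what to impute when `mean` is `None` or `median` is empty; the comment above only states what the two aggregates return in that case.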
[GitHub] spark issue #16763: [SPARK-19422][ML][WIP] Cache input data in algorithms
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16763 **[Test build #80709 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80709/testReport)** for PR 16763 at commit [`1742c15`](https://github.com/apache/spark/commit/1742c15275b16f732adf5c55b89fb445a09886e7).
[GitHub] spark issue #16763: [SPARK-19422][ML][WIP] Cache input data in algorithms
Github user zhengruifeng commented on the issue: https://github.com/apache/spark/pull/16763 Jenkins, retest this please
[GitHub] spark pull request #18798: [SPARK-19634][ML] Multivariate summarizer - dataf...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/18798
[GitHub] spark issue #18798: [SPARK-19634][ML] Multivariate summarizer - dataframes A...
Github user yanboliang commented on the issue: https://github.com/apache/spark/pull/18798 Merged into master, thanks for all.
[GitHub] spark issue #18896: [SPARK-21681][ML] fix bug of MLOR do not work correctly ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18896 **[Test build #80708 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80708/testReport)** for PR 18896 at commit [`2eda876`](https://github.com/apache/spark/commit/2eda87658e655f9f4424d7ac621fd44ca6d0f0ed).
[GitHub] spark pull request #18930: [SPARK-21677][SQL] json_tuple throws NullPointExc...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/18930#discussion_r133347400

    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala ---
    @@ -2034,4 +2034,25 @@ class JsonSuite extends QueryTest with SharedSQLContext with TestJsonData {
           }
         }
       }
    +
    +  test("SPARK-21677: json_tuple throws NullPointException when column is null as string type") {
    --- End diff --

The end-to-end test at L2047 probably cannot be moved to `JsonExpressionsSuite`. We can add unit test cases similar to the one at L2039 to `JsonExpressionsSuite`, as @gatorsmile suggested. It would also be good to have similar end-to-end tests in `json-functions.sql`.
[GitHub] spark issue #18926: [SPARK-21712] [PySpark] Clarify type error for Column.su...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/18926 Merged to master. Please open JIRAs / PRs related to the discussion above if anyone is willing to proceed.
[GitHub] spark pull request #18926: [SPARK-21712] [PySpark] Clarify type error for Co...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/18926
[GitHub] spark issue #18926: [SPARK-21712] [PySpark] Clarify type error for Column.su...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/18926 I am merging this, as it looks like there is no explicit objection to the current change itself and the issue appears to be fixed by it. To summarize the discussion here:

- Cleaning up the type-checking logic, if possible.
- Supporting "mixed" types, for example `long` in Python 2, by casting. Another idea might be to just wrap the value with `Column` for different types.
[GitHub] spark pull request #18950: [SPARK-20589][Core][Scheduler] Allow limiting tas...
Github user markhamstra commented on a diff in the pull request: https://github.com/apache/spark/pull/18950#discussion_r133344532

    --- Diff: core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala ---
    @@ -602,6 +604,21 @@ private[spark] class ExecutorAllocationManager(
         // place the executors.
         private val stageIdToExecutorPlacementHints = new mutable.HashMap[Int, (Int, Map[String, Int])]

    +    override def onJobStart(jobStart: SparkListenerJobStart): Unit = {
    +      val jobGroupId = if (jobStart.properties != null) {
    +        jobStart.properties.getProperty(SparkContext.SPARK_JOB_GROUP_ID)
    +      } else {
    +        ""
    +      }
    +      val maxConcurrentTasks = conf.getInt(s"spark.job.$jobGroupId.maxConcurrentTasks",
    +        Int.MaxValue)
    +
    +      logInfo(s"Setting maximum concurrent tasks for group: ${jobGroupId} to $maxConcurrentTasks")
    +      allocationManager.synchronized {
    +        allocationManager.maxConcurrentTasks = maxConcurrentTasks
    --- End diff --

Ummm... what? It is entirely possible to set a job group, spawn a bunch of threads that will eventually create jobs in that job group, then set another job group and spawn more threads that will be creating jobs in this new group simultaneously with jobs being created in the prior group.
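The concern raised here is that a single `maxConcurrentTasks` field is overwritten by whichever job started last, even though jobs from several groups can run at once. One way to picture the alternative is to key the limit by job group. The sketch below is plain Scala and purely illustrative: `GroupTaskLimits`, `setLimit`, and `limitFor` are hypothetical names, and nothing here is claimed to be what the PR ended up doing.

```scala
import scala.collection.mutable

// Hypothetical class for illustration; not part of Spark's ExecutorAllocationManager.
class GroupTaskLimits(defaultLimit: Int = Int.MaxValue) {
  // One entry per job group, so a later job start cannot clobber another group's limit.
  private val limits = mutable.HashMap.empty[String, Int]

  // Record the limit for one job group without touching the other groups.
  def setLimit(jobGroupId: String, maxConcurrentTasks: Int): Unit = synchronized {
    limits(jobGroupId) = maxConcurrentTasks
  }

  // Look up the limit for a group, falling back to the default when none was set.
  def limitFor(jobGroupId: String): Int = synchronized {
    limits.getOrElse(jobGroupId, defaultLimit)
  }
}
```

With this shape, two groups started from different threads each keep their own limit, which is the situation the comment says a single shared field cannot represent.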
[GitHub] spark issue #18810: [SPARK-21603][SQL]The wholestage codegen will be much sl...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18810 Merged build finished. Test PASSed.
[GitHub] spark issue #18810: [SPARK-21603][SQL]The wholestage codegen will be much sl...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18810 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/80703/
[GitHub] spark issue #18810: [SPARK-21603][SQL]The wholestage codegen will be much sl...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18810 **[Test build #80703 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80703/testReport)** for PR 18810 at commit [`44ce894`](https://github.com/apache/spark/commit/44ce894fdc311febbac04fb70448c0081d0f4253). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #18953: [SPARK-20682][SQL] Implement new ORC data source based o...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18953 **[Test build #80707 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80707/testReport)** for PR 18953 at commit [`051ed1f`](https://github.com/apache/spark/commit/051ed1fd86ee1354d1e650b1cf51a41db2d83619).
[GitHub] spark pull request #18953: [SPARK-20682][SQL] Implement new ORC data source ...
GitHub user dongjoon-hyun opened a pull request: https://github.com/apache/spark/pull/18953

[SPARK-20682][SQL] Implement new ORC data source based on Apache ORC

## What changes were proposed in this pull request?

Since #17924, #17943, and #17980 are rather large PRs, this is a minimized version for the next review that excludes the files listed below. This PR still includes #18640; I will rebase after #18640 is merged.

- `OrcReadBenchmark.scala`
- `OrcColumnarBatchReader.scala`
- New ORC test suites in `sql/core`

This PR shows that the new ORC data source replaces the old ORC data source completely. After review, I will remove the changes to the old ORC data source. Choosing between the two will be possible in #17980.

## How was this patch tested?

Pass the Jenkins tests.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/dongjoon-hyun/spark SPARK-20682-3

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/18953.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #18953

commit 051ed1fd86ee1354d1e650b1cf51a41db2d83619
Author: Dongjoon Hyun
Date: 2017-08-16T01:32:37Z

    [SPARK-20682][SQL] Implement new ORC data source based on Apache ORC
[GitHub] spark issue #12646: [SPARK-14878][SQL] Trim characters string function suppo...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/12646 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/80706/
[GitHub] spark issue #12646: [SPARK-14878][SQL] Trim characters string function suppo...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/12646 Merged build finished. Test FAILed.
[GitHub] spark issue #12646: [SPARK-14878][SQL] Trim characters string function suppo...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/12646 **[Test build #80706 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80706/testReport)** for PR 12646 at commit [`5e155bd`](https://github.com/apache/spark/commit/5e155bd80276373aa9a79d69efdbaad1fc3e8d14). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #18887: [SPARK-20642][core] Store FsHistoryProvider listing data...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18887 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/80701/
[GitHub] spark issue #18887: [SPARK-20642][core] Store FsHistoryProvider listing data...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18887 Merged build finished. Test PASSed.
[GitHub] spark issue #18887: [SPARK-20642][core] Store FsHistoryProvider listing data...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18887 **[Test build #80701 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80701/testReport)** for PR 18887 at commit [`519dab0`](https://github.com/apache/spark/commit/519dab056964dae71309f65bcadee8ec08366284). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #18896: [SPARK-21681][ML] fix bug of MLOR do not work correctly ...
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/18896 LGTM except for making the test's title more descriptive. Thanks!
[GitHub] spark issue #18488: [SPARK-21255][SQL][WIP] Fixed NPE when creating encoder ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18488 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/80700/
[GitHub] spark issue #18488: [SPARK-21255][SQL][WIP] Fixed NPE when creating encoder ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18488 Merged build finished. Test PASSed.
[GitHub] spark issue #18488: [SPARK-21255][SQL][WIP] Fixed NPE when creating encoder ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18488 **[Test build #80700 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80700/testReport)** for PR 18488 at commit [`fbdc599`](https://github.com/apache/spark/commit/fbdc599b57711eef21da36a19bfb2e2ae4063344). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #15770: [SPARK-15784][ML]:Add Power Iteration Clustering to spar...
Github user wangmiao1981 commented on the issue: https://github.com/apache/spark/pull/15770
[info] Main Scala API documentation successful.
[error] (spark/javaunidoc:doc) javadoc returned nonzero exit code
[error] Total time: 95 s, completed Aug 15, 2017 4:59:59 PM
[error] running /home/jenkins/workspace/SparkPullRequestBuilder/build/sbt -Phadoop-2.6 -Pmesos -Pkinesis-asl -Pyarn -Phive-thriftserver -Phive unidoc ; received return code 1
It seems irrelevant.
[GitHub] spark pull request #18923: [SPARK-21710][StSt] Fix OOM on ConsoleSink with l...
Github user zsxwing commented on a diff in the pull request: https://github.com/apache/spark/pull/18923#discussion_r15831

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/console.scala ---
@@ -49,7 +49,7 @@ class ConsoleSink(options: Map[String, String]) extends Sink with Logging {
     println("---")
     // scalastyle:off println
     data.sparkSession.createDataFrame(
-      data.sparkSession.sparkContext.parallelize(data.collect()), data.schema)
--- End diff --

I think we also need to consume all data to change the internal states in stateful operators. How about this:

```Scala
val encoder = data.exprEnc.resolveAndBind(
  data.logicalPlan.output,
  data.sparkSession.sessionState.analyzer)
val numRowsToFetch = numRowsToShow + 1
val takeResult = data.queryExecution.toRdd.mapPartitions { iter =>
  var numFetched = 0
  val v = ArrayBuffer[Row]()
  while (numFetched < numRowsToFetch && iter.hasNext) {
    v += encoder.fromRow(iter.next())
    numFetched += 1
  }
  // Consume all data to update internal states in stateful operators.
  while (iter.hasNext) {
    iter.next()
  }
  v.iterator
}.collect().toSeq.take(numRowsToFetch)
data.sparkSession.createDataFrame(
  data.sparkSession.sparkContext.parallelize(takeResult), data.schema).show(numRowsToShow, isTruncated)
```
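The per-partition pattern in the suggestion above (keep a bounded number of elements, then drain the rest of the iterator so every element is still visited) can be tried without Spark. The following is only a minimal sketch in plain Scala under that assumption; `TakeAndDrain` and `takeAndDrain` are hypothetical names used for illustration and are not part of Spark or of this pull request.

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical helper, not Spark API: illustrates the "take, then drain" idea
// on an ordinary Scala Iterator standing in for one partition's rows.
object TakeAndDrain {
  /** Keeps at most `limit` elements from `iter`, then exhausts the remainder
    * so that any side effect attached to iteration runs for every element. */
  def takeAndDrain[T](iter: Iterator[T], limit: Int): Iterator[T] = {
    val kept = ArrayBuffer.empty[T]
    // Buffer only the first `limit` elements to bound memory use.
    while (kept.size < limit && iter.hasNext) {
      kept += iter.next()
    }
    // Drain what is left; the values themselves are discarded.
    while (iter.hasNext) {
      iter.next()
    }
    kept.iterator
  }
}
```

As a usage note: wrapping a source such as `Iterator.range(0, 10)` in a `map` that increments a counter and passing it to `TakeAndDrain.takeAndDrain(source, 3)` should yield the first three elements while the counter still reaches ten, which mirrors why the suggestion keeps iterating after enough rows have been fetched.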