[GitHub] spark issue #19119: [SPARK-21845] [SQL] Make codegen fallback of expressions...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19119 **[Test build #81377 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81377/testReport)** for PR 19119 at commit [`b96da49`](https://github.com/apache/spark/commit/b96da49aa0893f8bf34da2a2c111499fdbad7b5a).
[GitHub] spark pull request #19119: [SPARK-21845] [SQL] Make codegen fallback of expr...
GitHub user gatorsmile opened a pull request: https://github.com/apache/spark/pull/19119

[SPARK-21845] [SQL] Make codegen fallback of expressions configurable

## What changes were proposed in this pull request?
We should make the codegen fallback of expressions configurable. So far, it is always on; it might hide compilation bugs in our codegen. Thus, we should also disable the codegen fallback when running test cases.

## How was this patch tested?
Added test cases.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/gatorsmile/spark fallbackCodegen

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19119.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #19119

commit 9f3a1cd0c2e95bd6239facf3584d590a88aafce4
Author: gatorsmile
Date: 2017-08-26T22:16:02Z
fix.

commit 26dcbd6310507ad96d5cf5b124cd93e6b788a6ea
Author: gatorsmile
Date: 2017-08-29T19:48:07Z
fix

commit 4f39a5220f6b3d318ad0c07eeec54aed5dad20e8
Author: gatorsmile
Date: 2017-08-29T19:56:37Z
fix

commit b96da49aa0893f8bf34da2a2c111499fdbad7b5a
Author: gatorsmile
Date: 2017-09-04T06:39:49Z
fix.
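For readers unfamiliar with the mechanism, here is a self-contained sketch of the fallback pattern the PR makes configurable. All names are illustrative, not Spark's internals; in Spark the switch would live in `SQLConf`. With the flag on, a compilation failure silently falls back to interpreted evaluation; with it off, as proposed for test runs, the codegen bug surfaces:

```scala
// Sketch only: models "codegen with configurable interpreted fallback".
object CodegenFallbackSketch {
  // In Spark this would be a SQLConf entry, e.g. something like "spark.sql.codegen.fallback".
  var fallbackEnabled: Boolean = true

  def evaluate(compiled: () => Int, interpreted: () => Int): Int = {
    try {
      compiled()
    } catch {
      // Silent fallback: may hide a real codegen bug.
      case _: Exception if fallbackEnabled => interpreted()
      // With fallback disabled (e.g. in tests), the exception propagates and the bug surfaces.
    }
  }

  def main(args: Array[String]): Unit = {
    val broken = () => throw new RuntimeException("codegen failed"): Int
    println(evaluate(broken, () => 42)) // prints 42: failure is masked

    fallbackEnabled = false // what the PR proposes for test runs
    try evaluate(broken, () => 42)
    catch { case e: RuntimeException => println(s"surfaced: ${e.getMessage}") }
  }
}
```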
[GitHub] spark pull request #19112: [SPARK-21901][SS] Define toString for StateOperat...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/19112#discussion_r136752724

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/streaming/progress.scala ---
@@ -200,7 +202,7 @@ class SourceProgress protected[sql](
 */
@InterfaceStability.Evolving
class SinkProgress protected[sql](
-    val description: String) extends Serializable {
--- End diff --

Sorry, I'm not in a position to say that. Maybe committers will review this and give a direction.
[GitHub] spark issue #19118: [SPARK-21882][CORE] OutputMetrics doesn't count written ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19118 Can one of the admins verify this patch?
[GitHub] spark issue #18975: [SPARK-4131] Support "Writing data into the filesystem f...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/18975

@tejasapatil So far, we disallow overwriting a path that is also being read from. Atomicity is not guaranteed even if we use a staging directory. Atomicity still cannot be achieved even if we move/rename the directory, right?
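For context on the staging-directory pattern under discussion, a sketch using the Hadoop `FileSystem` API (paths and helper names are illustrative): the final rename is fast, but delete-then-rename is two separate operations, so a concurrent reader can observe an empty or partial destination. That is why moving/renaming the staging directory by itself does not give atomicity:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Illustrative staged overwrite: write under `staging`, then swap it over `destination`.
object StagedOverwriteSketch {
  def overwrite(fs: FileSystem, staging: Path, destination: Path): Unit = {
    // ... the job writes its output files under `staging` ...
    if (fs.exists(destination)) {
      fs.delete(destination, true) // window #1: destination is gone for readers
    }
    if (!fs.rename(staging, destination)) { // window #2: rename can fail or land partially
      throw new java.io.IOException(s"Failed to rename $staging to $destination")
    }
  }

  def main(args: Array[String]): Unit = {
    val fs = FileSystem.get(new Configuration())
    overwrite(fs, new Path("/tmp/.staging-job-0"), new Path("/tmp/output"))
  }
}
```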
[GitHub] spark issue #17014: [SPARK-18608][ML] Fix double-caching in ML algorithms
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17014 **[Test build #81376 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81376/testReport)** for PR 17014 at commit [`f8fa957`](https://github.com/apache/spark/commit/f8fa9573a1b40ff236e9c52cf429e2742c8f2bd0).
[GitHub] spark pull request #19118: [SPARK-21882][CORE] OutputMetrics doesn't count w...
GitHub user awarrior opened a pull request: https://github.com/apache/spark/pull/19118

[SPARK-21882][CORE] OutputMetrics doesn't count written bytes correctly in the saveAsHadoopDataset function

spark-21882

## What changes were proposed in this pull request?
Switch the initialization order of HadoopOutputMetrics and SparkHadoopWriter.

## How was this patch tested?
Existing tests.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/awarrior/spark master

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19118.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #19118

commit 0f0c3b1c91b4f06c7e48874b8f6329c5c1c1b3ce
Author: Jarvis
Date: 2017-09-04T06:21:13Z
Update SparkHadoopWriter.scala spark-21882
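A minimal runnable model of the ordering issue (all names hypothetical, not the actual `SparkHadoopWriter` code): if the bytes-written callback snapshots its baseline after the writer has already written something, those bytes are folded into the baseline and never reported, so the metrics must be initialized before the writer:

```scala
// Sketch: the callback reports "bytes written since it was created" (a baseline snapshot).
object MetricsOrderSketch {
  final class FakeFs { var totalBytes = 0L }

  def bytesWrittenCallback(fs: FakeFs): () => Long = {
    val baseline = fs.totalBytes
    () => fs.totalBytes - baseline
  }

  def main(args: Array[String]): Unit = {
    // Buggy order: open the writer (it writes a 10-byte header), THEN init metrics.
    val fs1 = new FakeFs
    fs1.totalBytes += 10                       // writer setup
    val cb1 = bytesWrittenCallback(fs1)        // baseline already includes the header
    fs1.totalBytes += 90                       // task writes records
    println(s"buggy order counts: ${cb1()}")   // 90 -- header bytes are lost

    // Fixed order: init metrics first, THEN open the writer.
    val fs2 = new FakeFs
    val cb2 = bytesWrittenCallback(fs2)
    fs2.totalBytes += 10
    fs2.totalBytes += 90
    println(s"fixed order counts: ${cb2()}")   // 100
  }
}
```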
[GitHub] spark pull request #19112: [SPARK-21901][SS] Define toString for StateOperat...
Github user jaceklaskowski commented on a diff in the pull request: https://github.com/apache/spark/pull/19112#discussion_r136750244

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/streaming/progress.scala ---
@@ -200,7 +202,7 @@ class SourceProgress protected[sql](
 */
@InterfaceStability.Evolving
class SinkProgress protected[sql](
-    val description: String) extends Serializable {
--- End diff --

Should I fix the other places then?
[GitHub] spark issue #19117: [SPARK-21904] [SQL] Rename tempTables to tempViews in Se...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19117 **[Test build #81375 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81375/testReport)** for PR 19117 at commit [`02815e7`](https://github.com/apache/spark/commit/02815e7faae23a32b04c7af08c826f4428c60f5c).
[GitHub] spark issue #19117: [SPARK-21904] [SQL] Rename tempTables to tempViews in Se...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/19117 @cloud-fan @sameeragarwal
[GitHub] spark pull request #19117: [SPARK-21904] [SQL] Rename tempTables to tempView...
GitHub user gatorsmile opened a pull request: https://github.com/apache/spark/pull/19117

[SPARK-21904] [SQL] Rename tempTables to tempViews in SessionCatalog

### What changes were proposed in this pull request?
`tempTables` is not right. To be consistent, we need to rename the internal variable names/comments to `tempViews` in SessionCatalog too.

### How was this patch tested?
N/A

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/gatorsmile/spark renameTempTablesToTempViews

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19117.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #19117

commit 02815e7faae23a32b04c7af08c826f4428c60f5c
Author: gatorsmile
Date: 2017-09-04T05:39:53Z
rename temp tables to temp views.
[GitHub] spark issue #17014: [SPARK-18608][ML] Fix double-caching in ML algorithms
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17014 Merged build finished. Test FAILed.
[GitHub] spark issue #17014: [SPARK-18608][ML] Fix double-caching in ML algorithms
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17014

**[Test build #81373 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81373/testReport)** for PR 17014 at commit [`971e52c`](https://github.com/apache/spark/commit/971e52c4a18b4261d82ac14fefa6bb849367562c).

* This patch **fails MiMa tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #17014: [SPARK-18608][ML] Fix double-caching in ML algorithms
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17014 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/81373/ Test FAILed.
[GitHub] spark issue #19115: [SPARK-21882][CORE] OutputMetrics doesn't count written ...
Github user awarrior commented on the issue: https://github.com/apache/spark/pull/19115 ok, thx
[GitHub] spark pull request #19115: [SPARK-21882][CORE] OutputMetrics doesn't count w...
Github user awarrior closed the pull request at: https://github.com/apache/spark/pull/19115
[GitHub] spark issue #19086: [SPARK-21874][SQL] Support changing database when rename...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19086 **[Test build #81374 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81374/testReport)** for PR 19086 at commit [`2adc205`](https://github.com/apache/spark/commit/2adc20526a30cadb08d673ac7fdf219f35d8268a).
[GitHub] spark issue #17014: [SPARK-18608][ML] Fix double-caching in ML algorithms
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17014 **[Test build #81373 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81373/testReport)** for PR 17014 at commit [`971e52c`](https://github.com/apache/spark/commit/971e52c4a18b4261d82ac14fefa6bb849367562c).
[GitHub] spark issue #19086: [SPARK-21874][SQL] Support changing database when rename...
Github user jinxing64 commented on the issue: https://github.com/apache/spark/pull/19086 Sure, the current behavior is Hive's behavior.
[GitHub] spark issue #19086: [SPARK-21874][SQL] Support changing database when rename...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/19086 Please hold on. This means it is a behavior change. Let me consider it more.
[GitHub] spark issue #19086: [SPARK-21874][SQL] Support changing database when rename...
Github user jinxing64 commented on the issue: https://github.com/apache/spark/pull/19086 yes, correct
[GitHub] spark issue #19086: [SPARK-21874][SQL] Support changing database when rename...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/19086

> use db2;
> alter table db1.t2 rename to t1;

After this PR, it is renamed to `db2.t1`, right? Before this PR, it is renamed to `db1.t1`, right?
[GitHub] spark issue #19115: [SPARK-21882][CORE] OutputMetrics doesn't count written ...
Github user jerryshao commented on the issue: https://github.com/apache/spark/pull/19115 Please create a new PR against the master branch and close this one. If the issue doesn't exist in the master branch, then consider backporting the fix to the 2.2 branch.
[GitHub] spark pull request #19086: [SPARK-21874][SQL] Support changing database when...
Github user jinxing64 commented on a diff in the pull request: https://github.com/apache/spark/pull/19086#discussion_r136747432

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala ---
@@ -569,46 +569,51 @@ class SessionCatalog(
   /**
    * Rename a table.
    *
-   * If a database is specified in `oldName`, this will rename the table in that database.
+   * If the database specified in `newName` is different from the one specified in `oldName`,
+   * It will result in moving table across databases.
+   *
    * If no database is specified, this will first attempt to rename a temporary table with
-   * the same name, then, if that does not exist, rename the table in the current database.
+   * the same name, then, if that does not exist, current database will be used for locating
+   * `oldName` or `newName`.
--- End diff --

So should I leave the comment here unchanged?
[GitHub] spark pull request #19086: [SPARK-21874][SQL] Support changing database when...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/19086#discussion_r136747159

--- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalogSuite.scala ---
@@ -418,6 +439,42 @@ abstract class SessionCatalogSuite extends AnalysisTest {
     }
   }

+  test("rename global temp view") {
+    withBasicCatalog { catalog =>
+      val globalTempTable = Range(1, 10, 2, 10)
+      catalog.createGlobalTempView("tbl1", globalTempTable, overrideIfExists = false)
+      assert(catalog.getGlobalTempView("tbl1") == Option(globalTempTable))
+      assert(catalog.externalCatalog.listTables("db2").toSet == Set("tbl1", "tbl2"))
+      // If database is not specified, global temp view will not be renamed
+      catalog.setCurrentDatabase("db1")
+      val thrown1 = intercept[AnalysisException] {
+        catalog.renameTable(TableIdentifier("tbl1"), TableIdentifier("tblone"))
+      }
+      assert(thrown1.getMessage.contains("Table or view 'tbl1' not found in database 'db1'"))
+      catalog.setCurrentDatabase("db2")
+      catalog.renameTable(TableIdentifier("tbl1"), TableIdentifier("tblone"))
+      assert(catalog.getGlobalTempView("tbl1") == Option(globalTempTable))
+      assert(catalog.externalCatalog.listTables("db2").toSet == Set("tblone", "tbl2"))
+      // Moving global temp view to another database is forbidden
+      val thrown2 = intercept[AnalysisException] {
+        catalog.renameTable(
+          TableIdentifier("tbl1", Some("global_temp")), TableIdentifier("tbl3", Some("db2")))
+      }
+      assert(thrown2.getMessage.contains("Cannot change database of table 'tbl1'"))
+      // Moving table from regular database to be a global temp view is forbidden
+      val thrown3 = intercept[AnalysisException] {
+        catalog.renameTable(
+          TableIdentifier("tbl2", Some("db2")), TableIdentifier("tbltwo", Some("global_temp")))
+      }
--- End diff --

Normally, we put `.getMessage` here. Thus, we can make the next line shorter.
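Concretely, the suggested style would look something like this (a sketch assuming the suite's `catalog` fixture, not the final patch): chaining `.getMessage` onto the `intercept` block keeps the assertion line short.

```scala
// Suggested style: take .getMessage directly off the intercept block.
val message = intercept[AnalysisException] {
  catalog.renameTable(
    TableIdentifier("tbl2", Some("db2")), TableIdentifier("tbltwo", Some("global_temp")))
}.getMessage
assert(message.contains("Cannot change database of table"))
```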
[GitHub] spark issue #17014: [SPARK-18608][ML] Fix double-caching in ML algorithms
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17014

**[Test build #81372 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81372/testReport)** for PR 17014 at commit [`df4d263`](https://github.com/apache/spark/commit/df4d263d21c3e4413b4d74aaabcd3a10348a0001).

* This patch **fails to build**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #17014: [SPARK-18608][ML] Fix double-caching in ML algorithms
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17014 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/81372/ Test FAILed.
[GitHub] spark issue #17014: [SPARK-18608][ML] Fix double-caching in ML algorithms
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17014 Merged build finished. Test FAILed.
[GitHub] spark pull request #18931: [SPARK-21717][SQL] Decouple consume functions of ...
Github user maropu commented on a diff in the pull request: https://github.com/apache/spark/pull/18931#discussion_r136746833

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/ExpandExec.scala ---
@@ -89,6 +89,8 @@ case class ExpandExec(
     child.asInstanceOf[CodegenSupport].inputRDDs()
   }

+  override protected def doConsumeInChainOfFunc: Boolean = false
--- End diff --

yea, probably we might need to describe more about the exceptional cases where we can't use this optimization, like `HashAggregateExec` in https://github.com/apache/spark/pull/18931/files#diff-28cb12941b992ff680c277c651b59aa0R204
[GitHub] spark issue #17014: [SPARK-18608][ML] Fix double-caching in ML algorithms
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17014 **[Test build #81372 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81372/testReport)** for PR 17014 at commit [`df4d263`](https://github.com/apache/spark/commit/df4d263d21c3e4413b4d74aaabcd3a10348a0001).
[GitHub] spark pull request #18931: [SPARK-21717][SQL] Decouple consume functions of ...
Github user maropu commented on a diff in the pull request: https://github.com/apache/spark/pull/18931#discussion_r136746468

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/SortExec.scala ---
@@ -177,6 +177,8 @@ case class SortExec(
     """.stripMargin.trim
   }

+  override protected def doConsumeInChainOfFunc: Boolean = false
--- End diff --

yea, other guys might give good suggestions on the naming...
[GitHub] spark issue #19086: [SPARK-21874][SQL] Support changing database when rename...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19086 Merged build finished. Test FAILed.
[GitHub] spark issue #19086: [SPARK-21874][SQL] Support changing database when rename...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19086 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/81370/ Test FAILed.
[GitHub] spark issue #19086: [SPARK-21874][SQL] Support changing database when rename...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19086

**[Test build #81370 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81370/testReport)** for PR 19086 at commit [`d4a0f97`](https://github.com/apache/spark/commit/d4a0f97b9600dd661d81a79e71019a781eb812ff).

* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request #19086: [SPARK-21874][SQL] Support changing database when...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/19086#discussion_r136745765

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala ---
@@ -569,46 +569,51 @@ class SessionCatalog(
   /**
    * Rename a table.
    *
-   * If a database is specified in `oldName`, this will rename the table in that database.
+   * If the database specified in `newName` is different from the one specified in `oldName`,
+   * It will result in moving table across databases.
+   *
    * If no database is specified, this will first attempt to rename a temporary table with
-   * the same name, then, if that does not exist, rename the table in the current database.
+   * the same name, then, if that does not exist, current database will be used for locating
+   * `oldName` or `newName`.
--- End diff --

This is wrong, right? This is for temp views.
[GitHub] spark pull request #19116: [SPARK-21903][BUILD] Upgrade scalastyle to 1.0.0.
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/19116#discussion_r136745459

--- Diff: project/plugins.sbt ---
@@ -7,8 +7,7 @@ addSbtPlugin("com.typesafe.sbteclipse" % "sbteclipse-plugin" % "5.1.0")

 // sbt 1.0.0 support: https://github.com/jrudolph/sbt-dependency-graph/issues/134
 addSbtPlugin("net.virtual-void" % "sbt-dependency-graph" % "0.8.2")

-// need to make changes to uptake sbt 1.0 support in "org.scalastyle" %% "scalastyle-sbt-plugin" % "0.9.0"
--- End diff --

Looks like we are okay to remove this - https://github.com/scalastyle/scalastyle-sbt-plugin/blob/v1.0.0/README.md
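For reference, the upgraded declaration would presumably look like the following, using the coordinates quoted in the old comment and the version from the PR title (sketch, not a quote of the actual diff):

```scala
// project/plugins.sbt -- scalastyle-sbt-plugin at 1.0.0 supports sbt 1.0,
// so the old "need to make changes to uptake sbt 1.0 support" note can go away.
addSbtPlugin("org.scalastyle" %% "scalastyle-sbt-plugin" % "1.0.0")
```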
[GitHub] spark issue #19116: [SPARK-21903][BUILD] Upgrade scalastyle to 1.0.0.
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/19116 cc @srowen and @vanzin, could you please take a look when you have some time?
[GitHub] spark pull request #19116: [SPARK-21903][BUILD] Upgrade scalastyle to 1.0.0.
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/19116#discussion_r136745412

--- Diff: scalastyle-config.xml ---
@@ -268,10 +268,7 @@ This file is divided into 3 sections:
-    ^Override$
-    override modifier should be used instead of @java.lang.Override.
+
--- End diff --

This removes the workaround we used and uses `org.scalastyle.scalariform.OverrideJavaChecker` - https://github.com/scalastyle/scalastyle/issues/214. I manually tested with `@Override`:

Before:

```
[error] ...:435:3: override modifier should be used instead of @java.lang.Override.
```

After:

```
[error] ...:435:2: override keyword should be used instead of @java.lang.Override
```
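For illustration, an example of the kind of code the checker flags versus the preferred form (my own example, not taken from the PR):

```scala
abstract class Parent { def name(): String }

class Annotated extends Parent {
  @java.lang.Override          // flagged: override keyword should be used instead
  def name(): String = "annotated"
}

class Preferred extends Parent {
  override def name(): String = "preferred" // idiomatic Scala
}
```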
[GitHub] spark pull request #19116: [SPARK-21903][BUILD] Upgrade scalastyle to 1.0.0.
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/19116#discussion_r136745559

--- Diff: project/SparkBuild.scala ---
@@ -163,14 +163,15 @@ object SparkBuild extends PomBuild {
     val configUrlV = scalastyleConfigUrl.in(config).value
     val streamsV = streams.in(config).value
     val failOnErrorV = true
+    val failOnWarningV = true
     val scalastyleTargetV = scalastyleTarget.in(config).value
     val configRefreshHoursV = scalastyleConfigRefreshHours.in(config).value
     val targetV = target.in(config).value
     val configCacheFileV = scalastyleConfigUrlCacheFile.in(config).value

     logger.info(s"Running scalastyle on ${name.value} in ${config.name}")
-    Tasks.doScalastyle(args, configV, configUrlV, failOnErrorV, scalaSourceV, scalastyleTargetV,
-      streamsV, configRefreshHoursV, targetV, configCacheFileV)
+    Tasks.doScalastyle(args, configV, configUrlV, failOnErrorV, failOnWarningV, scalaSourceV,
--- End diff --

This change looks to be required - https://github.com/scalastyle/scalastyle-sbt-plugin/commit/cbbb9c9d1c33b9e90bf187edb0dd2d6b4eceaa21
[GitHub] spark pull request #19086: [SPARK-21874][SQL] Support changing database when...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/19086#discussion_r136745597

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala ---
@@ -569,46 +569,51 @@ class SessionCatalog(
   /**
    * Rename a table.
    *
-   * If a database is specified in `oldName`, this will rename the table in that database.
+   * If the database specified in `newName` is different from the one specified in `oldName`,
+   * It will result in moving table across databases.
--- End diff --

`It will result` -> `it results`
[GitHub] spark issue #19116: [SPARK-21903][BUILD] Upgrade scalastyle to 1.0.0.
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19116 **[Test build #81371 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81371/testReport)** for PR 19116 at commit [`2447fd0`](https://github.com/apache/spark/commit/2447fd0e152ace4dd074a92bd0d3cdc638b09b1a).
[GitHub] spark pull request #19116: [SPARK-21903][BUILD] Upgrade scalastyle to 1.0.0.
GitHub user HyukjinKwon opened a pull request: https://github.com/apache/spark/pull/19116

[SPARK-21903][BUILD] Upgrade scalastyle to 1.0.0.

## What changes were proposed in this pull request?
1.0.0 fixes issues with import order, explicit types for public methods, the line length limit, and comment validation:

```
[error] .../spark/repl/scala-2.11/src/main/scala/org/apache/spark/repl/Main.scala:50:16: Are you sure you want to println? If yes, wrap the code block with
[error]   // scalastyle:off println
[error]   println(...)
[error]   // scalastyle:on println
[error] .../spark/repl/scala-2.11/src/main/scala/org/apache/spark/repl/SparkILoop.scala:49: File line length exceeds 100 characters
[error] .../spark/repl/scala-2.11/src/main/scala/org/apache/spark/repl/SparkILoop.scala:22:21: Are you sure you want to println? If yes, wrap the code block with
[error]   // scalastyle:off println
[error]   println(...)
[error]   // scalastyle:on println
[error] .../spark/streaming/src/test/java/org/apache/spark/streaming/JavaTestUtils.scala:35:6: Public method must have explicit type
[error] .../spark/streaming/src/test/java/org/apache/spark/streaming/JavaTestUtils.scala:51:6: Public method must have explicit type
[error] .../spark/streaming/src/test/java/org/apache/spark/streaming/JavaTestUtils.scala:93:15: Public method must have explicit type
[error] .../spark/streaming/src/test/java/org/apache/spark/streaming/JavaTestUtils.scala:98:15: Public method must have explicit type
[error] .../spark/streaming/src/test/java/org/apache/spark/streaming/JavaTestUtils.scala:47:2: Insert a space after the start of the comment
[error] .../spark/streaming/src/test/java/org/apache/spark/streaming/JavaTestUtils.scala:26:43: JavaDStream should come before JavaDStreamLike.
```

This PR also replaces the workaround added in SPARK-16877 with the `org.scalastyle.scalariform.OverrideJavaChecker` feature, added in 0.9.0.

## How was this patch tested?
Manually tested.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/HyukjinKwon/spark scalastyle-1.0.0

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19116.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #19116

commit 2447fd0e152ace4dd074a92bd0d3cdc638b09b1a
Author: hyukjinkwon
Date: 2017-09-04T04:26:19Z
Upgrade scalastyle to 1.0.0.
[GitHub] spark issue #18902: [SPARK-21690][ML] one-pass imputer
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/18902 hmm... that's interesting. So I found a performance gap between DataFrame codegen aggregation and simple RDD aggregation. I will discuss this with the SQL team later. Thanks!
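For context, the two aggregation styles being compared, as a self-contained sketch rather than the PR's code: a DataFrame aggregation that goes through codegen, and a plain one-pass RDD `treeAggregate` computing the same per-column means.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.avg

object AggGapSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("agg-gap").getOrCreate()
    import spark.implicits._

    val df = Seq((1.0, 10.0), (2.0, 20.0), (3.0, 30.0)).toDF("a", "b")

    // DataFrame (codegen) aggregation:
    val dfMeans = df.agg(avg($"a"), avg($"b")).head()

    // Equivalent one-pass RDD aggregation: fold (per-column sums, row count).
    val (sums, count) = df.rdd
      .map(r => (Array(r.getDouble(0), r.getDouble(1)), 1L))
      .treeAggregate((Array(0.0, 0.0), 0L))(
        seqOp = { case ((acc, n), (row, m)) =>
          (acc.zip(row).map { case (x, y) => x + y }, n + m) },
        combOp = { case ((a1, n1), (a2, n2)) =>
          (a1.zip(a2).map { case (x, y) => x + y }, n1 + n2) })
    val rddMeans = sums.map(_ / count)

    println(s"DataFrame: $dfMeans, RDD: ${rddMeans.mkString(", ")}")
    spark.stop()
  }
}
```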
[GitHub] spark issue #18902: [SPARK-21690][ML] one-pass imputer
Github user zhengruifeng commented on the issue: https://github.com/apache/spark/pull/18902 @WeichenXu123 No, I only cache the DataFrame. And the RDD version is [here](https://github.com/apache/spark/pull/18902/commits/8daffc9007c65f04e005ffe5dcfbeca634480465). I used the same test suite above to test those implementations.
[GitHub] spark issue #19113: [SPARK-20978][SQL] Bump up Univocity version to 2.5.4
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/19113 How about the other popular open source projects? Do you know which projects are using Univocity 2.5?
[GitHub] spark issue #17014: [SPARK-18608][ML] Fix double-caching in ML algorithms
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/17014 @zhengruifeng `KMeans` is regarded as a bugfix (SPARK-21799) because the double-caching issue was introduced in 2.2 and causes a performance regression. Other algos also have the same issue, but the issue has existed in those algos for a long time and they are related to `Predictor`, so it is not so easy to fix; we can leave them here for more discussion.
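For illustration, the usual guard against double-caching (a sketch of the general pattern, not the exact change in this PR): persist only when the caller hasn't already persisted the input, and unpersist only what was persisted here.

```scala
import org.apache.spark.sql.Dataset
import org.apache.spark.storage.StorageLevel

// Persist the training data only if the caller hasn't already done so,
// and undo only our own persist when training finishes.
def withHandledCaching[T, R](data: Dataset[T])(train: Dataset[T] => R): R = {
  val handlePersistence = data.storageLevel == StorageLevel.NONE
  if (handlePersistence) data.persist(StorageLevel.MEMORY_AND_DISK)
  try train(data)
  finally if (handlePersistence) data.unpersist()
}
```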
[GitHub] spark issue #19113: [SPARK-20978][SQL] Bump up Univocity version to 2.5.4
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/19113 Any performance measurements from 2.2 to 2.5?
[GitHub] spark issue #18865: [SPARK-21610][SQL] Corrupt records are not handled prope...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/18865 Yea, it is, and that's what we discussed - https://github.com/apache/spark/pull/18865#discussion_r131887972. In that way, I was thinking the https://github.com/apache/spark/pull/18865#issuecomment-326858521 case could make sense, as no columns are selected.
[GitHub] spark pull request #18869: [SPARK-21654][SQL] Complement SQL predicates expr...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/18869
[GitHub] spark issue #18869: [SPARK-21654][SQL] Complement SQL predicates expression ...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/18869 Thanks @HyukjinKwon @gatorsmile
[GitHub] spark issue #18869: [SPARK-21654][SQL] Complement SQL predicates expression ...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/18869 LGTM
[GitHub] spark issue #18869: [SPARK-21654][SQL] Complement SQL predicates expression ...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/18869 Thanks! Merging to master.
[GitHub] spark issue #18865: [SPARK-21610][SQL] Corrupt records are not handled prope...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/18865 But isn't this behavior, where the `_corrupt_record` content depends on the selected JSON fields, by design in the first place?
[GitHub] spark issue #18865: [SPARK-21610][SQL] Corrupt records are not handled prope...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/18865

> the usage can cause weird results too
> `_corrupt_record` should return all the records that Spark SQL fails to parse

I think another point should be: this issue still exists even if we disallow selecting `_corrupt_record` alone here. If we select a few columns together with `_corrupt_record`, this case will still exist and the results won't be consistent. For example,

```
echo '{"fieldA": 1, "fieldB": 2}
{"fieldA": 3, "fieldB": 4}
{"fieldA": "5", "fieldB": 6}' >/tmp/sample.json
```

```scala
val file = "/tmp/sample.json"
val dfFromFile = spark.read.schema("fieldA BYTE, fieldB BYTE, _corrupt_record STRING").json(file)
dfFromFile.select($"fieldA", $"_corrupt_record").show()
dfFromFile.select($"fieldB", $"_corrupt_record").show()
```

```
scala> dfFromFile.select($"fieldA", $"_corrupt_record").show()
+------+--------------------+
|fieldA|     _corrupt_record|
+------+--------------------+
|     1|                null|
|     3|                null|
|  null|{"fieldA": "5", "...|
+------+--------------------+

scala> dfFromFile.select($"fieldB", $"_corrupt_record").show()
+------+---------------+
|fieldB|_corrupt_record|
+------+---------------+
|     2|           null|
|     4|           null|
|     6|           null|
+------+---------------+
```

If we should disallow this, I think we should rather deprecate this option first with some warnings, or explain this behaviour in the documentation.
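One way to get consistent contents regardless of which columns are selected later (a sketch of the commonly suggested workaround, assuming the same session and `/tmp/sample.json` as above) is to materialize the fully parsed result first, so column pruning can no longer change what gets parsed:

```scala
// Cache (or save) the fully parsed DataFrame before projecting it; subsequent
// selects then read the materialized rows instead of re-parsing the JSON
// with a pruned schema.
val parsed = spark.read
  .schema("fieldA BYTE, fieldB BYTE, _corrupt_record STRING")
  .json("/tmp/sample.json")
  .cache()
parsed.count() // force materialization of all columns

parsed.select($"fieldA", $"_corrupt_record").show()
parsed.select($"fieldB", $"_corrupt_record").show() // now consistent
```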
[GitHub] spark pull request #18931: [SPARK-21717][SQL] Decouple consume functions of ...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/18931#discussion_r136743435

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/SortExec.scala ---
@@ -177,6 +177,8 @@ case class SortExec(
     """.stripMargin.trim
   }

+  override protected def doConsumeInChainOfFunc: Boolean = false
--- End diff --

I revised this variable name several times, but didn't find a good name to convey its meaning.
[GitHub] spark issue #19115: [SPARK-21882][CORE] OutputMetrics doesn't count written ...
Github user awarrior commented on the issue: https://github.com/apache/spark/pull/19115 @jerryshao hi~ I have modified this PR, but this patch only works on 2.2.0 (some of the changes don't apply to master now). I want to confirm whether I need to create a new PR. Thanks!
[GitHub] spark issue #19113: [SPARK-20978][SQL] Bump up Univocity version to 2.5.4
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19113 Merged build finished. Test PASSed.
[GitHub] spark issue #19113: [SPARK-20978][SQL] Bump up Univocity version to 2.5.4
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19113 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/81368/ Test PASSed.
[GitHub] spark issue #19113: [SPARK-20978][SQL] Bump up Univocity version to 2.5.4
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19113

**[Test build #81368 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81368/testReport)** for PR 19113 at commit [`fa7eb51`](https://github.com/apache/spark/commit/fa7eb514cfd91bb405fd74680b08d5865911e3f0).

* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #18865: [SPARK-21610][SQL] Corrupt records are not handled prope...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/18865 BTW, this change should be documented in the Spark SQL migration guide. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18865: [SPARK-21610][SQL] Corrupt records are not handled prope...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/18865 From the viewpoint of Spark's end users, `dfFromFile.select($"_corrupt_record").show()` might not return all the expected records. ``_corrupt_record`` should return all the records that Spark SQL fails to parse. If we are unable to output the expected results, we need to disallow users from doing it. cc @cloud-fan --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18931: [SPARK-21717][SQL] Decouple consume functions of ...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/18931#discussion_r136742515 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/ExpandExec.scala --- @@ -89,6 +89,8 @@ case class ExpandExec( child.asInstanceOf[CodegenSupport].inputRDDs() } + override protected def doConsumeInChainOfFunc: Boolean = false --- End diff -- The `doConsume` produces something like:
```
|for (int $i = 0; $i < ${projections.length}; $i ++) {
|  switch ($i) {
|    ${cases.mkString("\n").trim}
|  }
|  $numOutput.add(1);
|  ${consume(ctx, outputColumns)}
|}
```
So the consume logic of its parent node is actually wrapped in a local for-loop. It has the same effect as not chaining the next consume. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18931: [SPARK-21717][SQL] Decouple consume functions of ...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/18931#discussion_r136742331 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/WholeStageCodegenExec.scala --- @@ -149,14 +149,146 @@ trait CodegenSupport extends SparkPlan { ctx.freshNamePrefix = parent.variablePrefix val evaluated = evaluateRequiredVariables(output, inputVars, parent.usedInputs) + +// Under certain conditions, we can put the logic to consume the rows of this operator into +// another function. So we can prevent a generated function too long to be optimized by JIT. +// The conditions: +// 1. The parent uses all variables in output. we can't defer variable evaluation when consume +//in another function. +// 2. The output variables are not empty. If it's empty, we don't bother to do that. +// 3. We don't use row variable. The construction of row uses deferred variable evaluation. We +//can't do it. +// 4. The number of output variables must less than maximum number of parameters in Java method +//declaration. +val requireAllOutput = output.forall(parent.usedInputs.contains(_)) +val consumeFunc = + if (row == null && outputVars.nonEmpty && requireAllOutput && outputVars.length < 255) { +constructDoConsumeFunction(ctx, inputVars) + } else { +parent.doConsume(ctx, inputVars, rowVar) + } s""" |${ctx.registerComment(s"CONSUME: ${parent.simpleString}")} |$evaluated - |${parent.doConsume(ctx, inputVars, rowVar)} + |$consumeFunc + """.stripMargin + } + + /** + * To prevent concatenated function growing too long to be optimized by JIT. Instead of inlining, + * we may put the consume logic of parent operator into a function and set this flag to `true`. + * The parent operator can know if its consume logic is inlined or in separated function. + */ + private var doConsumeInFunc: Boolean = false + + /** + * Returning true means we have at least one consume logic from child operator or this operator is + * separated in a function. If this is `true`, this operator shouldn't use `continue` statement to + * continue on next row, because its generated codes aren't enclosed in main while-loop. + * + * For example, we have generated codes for a query plan like: + * Op1Exec + * Op2Exec + * Op3Exec + * + * If we put the consume code of Op2Exec into a separated function, the generated codes are like: + * while (...) { + * ... // logic of Op3Exec. + * Op2Exec_doConsume(...); + * } + * private boolean Op2Exec_doConsume(...) { + * ... // logic of Op2Exec to consume rows. + * } + * For now, `doConsumeInChainOfFunc` of Op2Exec will be `true`. + * + * Notice for some operators like `HashAggregateExec`, it doesn't chain previous consume functions + * but begins with its produce framework. We should override `doConsumeInChainOfFunc` to return + * `false`. + */ + protected def doConsumeInChainOfFunc: Boolean = { +val codegenChildren = children.map(_.asInstanceOf[CodegenSupport]) +doConsumeInFunc || codegenChildren.exists(_.doConsumeInChainOfFunc) + } + + /** + * The actual java statement this operator should use if there is a need to continue on next row + * in its `doConsume` codes. + * + * while (...) { + * ... // logic of Op3Exec. + * Op2Exec_doConsume(...); + * } + * private boolean Op2Exec_doConsume(...) { + * ... // logic of Op2Exec to consume rows. + * continue; // Wrong. We can't use continue with the while-loop. + * } + * In above code, we can't use `continue` in `Op2Exec_doConsume`. + * + * Instead, we do something like: + * while (...) { + * ... // logic of Op3Exec. 
+ * boolean continueForLoop = Op2Exec_doConsume(...); + * if (continueForLoop) continue; + * } + * private boolean Op2Exec_doConsume(...) { + * ... // logic of Op2Exec to consume rows. + * return true; // When we need to do continue, we return true. + * } + */ + protected def continueStatementInDoConsume: String = if (doConsumeInChainOfFunc) { +"return true;"; --- End diff -- Thanks. I'll fix it. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
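To make the call-site transformation discussed above concrete, here is a hedged sketch, written in the style of Spark's codegen string templates, of the shape of code an operator could emit under this scheme. The operator name `Op2Exec_doConsume`, the column variable `value_0`, and the iterator name are all illustrative, not taken from the actual generated code:
```scala
// Illustrative only: instantiates the while-loop / return-true pattern from
// the doc comment above. All identifiers in the template are hypothetical.
val funcName = "Op2Exec_doConsume"
val generatedSketch =
  s"""
     |while (inputIterator.hasNext()) {
     |  // ... logic of Op3Exec ...
     |  if ($funcName(value_0)) continue;  // `continue` is handled at the call site
     |}
     |
     |private boolean $funcName(UTF8String value_0) {
     |  // ... consume logic of Op2Exec ...
     |  return true;  // stands in for `continue` in the main while-loop
     |}
   """.stripMargin
```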
[GitHub] spark pull request #18931: [SPARK-21717][SQL] Decouple consume functions of ...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/18931#discussion_r136742282 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/WholeStageCodegenExec.scala --- @@ -149,14 +149,146 @@ trait CodegenSupport extends SparkPlan { ctx.freshNamePrefix = parent.variablePrefix val evaluated = evaluateRequiredVariables(output, inputVars, parent.usedInputs) + +// Under certain conditions, we can put the logic to consume the rows of this operator into +// another function. So we can prevent a generated function too long to be optimized by JIT. +// The conditions: +// 1. The parent uses all variables in output. we can't defer variable evaluation when consume +//in another function. +// 2. The output variables are not empty. If it's empty, we don't bother to do that. +// 3. We don't use row variable. The construction of row uses deferred variable evaluation. We +//can't do it. +// 4. The number of output variables must less than maximum number of parameters in Java method +//declaration. +val requireAllOutput = output.forall(parent.usedInputs.contains(_)) +val consumeFunc = + if (row == null && outputVars.nonEmpty && requireAllOutput && outputVars.length < 255) { +constructDoConsumeFunction(ctx, inputVars) --- End diff -- I was thinking of checking it. But whole-stage codegen is a non-breaking process in which produce/consume calls are embedded together, so there is no break point here at which to check the function length. Actually, I think always splitting consume functions should have no negative effect: from the benchmark numbers, it looks like it does no harm to normal queries. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18931: [SPARK-21717][SQL] Decouple consume functions of ...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/18931#discussion_r136742019 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/WholeStageCodegenExec.scala --- @@ -149,14 +149,146 @@ trait CodegenSupport extends SparkPlan { ctx.freshNamePrefix = parent.variablePrefix val evaluated = evaluateRequiredVariables(output, inputVars, parent.usedInputs) + +// Under certain conditions, we can put the logic to consume the rows of this operator into +// another function. So we can prevent a generated function too long to be optimized by JIT. +// The conditions: +// 1. The parent uses all variables in output. we can't defer variable evaluation when consume +//in another function. +// 2. The output variables are not empty. If it's empty, we don't bother to do that. +// 3. We don't use row variable. The construction of row uses deferred variable evaluation. We --- End diff -- The same reason as above: the variables used to evaluate the row can be out of scope, because row construction is deferred until the row is actually used. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18931: [SPARK-21717][SQL] Decouple consume functions of ...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/18931#discussion_r136741637 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/WholeStageCodegenExec.scala --- @@ -149,14 +149,146 @@ trait CodegenSupport extends SparkPlan { ctx.freshNamePrefix = parent.variablePrefix val evaluated = evaluateRequiredVariables(output, inputVars, parent.usedInputs) + +// Under certain conditions, we can put the logic to consume the rows of this operator into +// another function. So we can prevent a generated function too long to be optimized by JIT. +// The conditions: +// 1. The parent uses all variables in output. we can't defer variable evaluation when consume --- End diff -- E.g., a `ProjectExec` node doesn't necessarily evaluate all its output variables before continuing with the `doConsume` of its parent node. It can defer the evaluation until the variables are needed in the parent node's consume logic. Once a variable's evaluation is deferred and we create a consume function, the variable will be evaluated inside that function, but the references to this variable are then out of scope. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
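For readers less familiar with the codegen internals, the scoping problem described above can be illustrated with a plain-Scala analogy (the names here are hypothetical, not from the generated code): a deferred evaluation that closes over a caller-local name cannot simply be moved into a separately declared method body.
```scala
// Analogy only: `inputRow` plays the role of a caller-local such as `input_row`
// in the generated Java code.
def caller(): Unit = {
  val inputRow = Array(1L, 2L)              // caller-local
  def consumeInline(): Long = inputRow(1)   // fine: inputRow is still in scope
  consumeInline()
}
// A split-out method cannot see the caller's local:
// def consumeSplit(): Long = inputRow(1)   // would not compile here
```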
[GitHub] spark issue #19050: [SPARK-21835][SQL] RewritePredicateSubquery should not p...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/19050 ping @cloud-fan @hvanhovell This blocks #18956 from going forward; can you help review this change? Thanks. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18869: [SPARK-21654][SQL] Complement SQL predicates expression ...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/18869 @HyukjinKwon @gatorsmile Any more comments on this? The added tests should be enough. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18975: [SPARK-4131] Support "Writing data into the files...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/18975#discussion_r136740379 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/InsertSuite.scala --- @@ -534,4 +534,132 @@ class InsertIntoHiveTableSuite extends QueryTest with TestHiveSingleton with Bef } } } + + test("insert overwrite to dir from hive metastore table") { +withTempDir { dir => + val path = dir.toURI.getPath + + sql(s"INSERT OVERWRITE LOCAL DIRECTORY '${path}' SELECT * FROM src where key < 10") + + sql( +s""" + |INSERT OVERWRITE LOCAL DIRECTORY '${path}' + |STORED AS orc + |SELECT * FROM src where key < 10 + """.stripMargin) + + // use orc data source to check the data of path is right. + withTempView("orc_source") { +sql( + s""" + |CREATE TEMPORARY VIEW orc_source + |USING org.apache.spark.sql.hive.orc + |OPTIONS ( + | PATH '${dir.getCanonicalPath}' + |) + """.stripMargin) + +checkAnswer( + sql("select * from orc_source"), + sql("select * from src where key < 10").collect()) + } +} + } + + test("insert overwrite to local dir from temp table") { +withTempView("test_insert_table") { + spark.range(10).selectExpr("id", "id AS str").createOrReplaceTempView("test_insert_table") + + withTempDir { dir => +val path = dir.toURI.getPath + +sql( + s""" + |INSERT OVERWRITE LOCAL DIRECTORY '${path}' + |ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' + |SELECT * FROM test_insert_table + """.stripMargin) + +sql( + s""" + |INSERT OVERWRITE LOCAL DIRECTORY '${path}' + |STORED AS orc + |SELECT * FROM test_insert_table + """.stripMargin) + +// use orc data source to check the data of path is right. +withTempView("orc_source") { + sql( +s""" + |CREATE TEMPORARY VIEW orc_source + |USING org.apache.spark.sql.hive.orc + |OPTIONS ( + | PATH '${dir.getCanonicalPath}' + |) + """.stripMargin) + + checkAnswer( +sql("select * from orc_source"), +sql("select * from test_insert_table").collect()) +} + } +} + } + + test("insert overwrite to dir from temp table") { +withTempView("test_insert_table") { + spark.range(10).selectExpr("id", "id AS str").createOrReplaceTempView("test_insert_table") + + withTempDir { dir => +val pathUri = dir.toURI + +sql( + s""" + |INSERT OVERWRITE DIRECTORY '${pathUri}' + |ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' + |SELECT * FROM test_insert_table + """.stripMargin) + +sql( + s""" + |INSERT OVERWRITE DIRECTORY '${pathUri}' + |STORED AS orc + |SELECT * FROM test_insert_table + """.stripMargin) + +// use orc data source to check the data of path is right. +withTempView("orc_source") { + sql( +s""" + |CREATE TEMPORARY VIEW orc_source + |USING org.apache.spark.sql.hive.orc + |OPTIONS ( + | PATH '${dir.getCanonicalPath}' + |) + """.stripMargin) + + checkAnswer( +sql("select * from orc_source"), --- End diff -- You do not need the temp view `orc_source `. Below is an example. ```Scala checkAnswer( spark.read.orc(dir.getCanonicalPath), sql("select * from test_insert_table")) ``` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19114: Update PairRDDFunctions.scala
Github user awarrior closed the pull request at: https://github.com/apache/spark/pull/19114 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18975: [SPARK-4131] Support "Writing data into the files...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/18975#discussion_r136739979 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/InsertSuite.scala --- @@ -534,4 +534,132 @@ class InsertIntoHiveTableSuite extends QueryTest with TestHiveSingleton with Bef } } } + + test("insert overwrite to dir from hive metastore table") { +withTempDir { dir => + val path = dir.toURI.getPath + + sql(s"INSERT OVERWRITE LOCAL DIRECTORY '${path}' SELECT * FROM src where key < 10") + + sql( +s""" + |INSERT OVERWRITE LOCAL DIRECTORY '${path}' + |STORED AS orc + |SELECT * FROM src where key < 10 + """.stripMargin) + + // use orc data source to check the data of path is right. + withTempView("orc_source") { +sql( + s""" + |CREATE TEMPORARY VIEW orc_source + |USING org.apache.spark.sql.hive.orc + |OPTIONS ( + | PATH '${dir.getCanonicalPath}' + |) + """.stripMargin) + +checkAnswer( + sql("select * from orc_source"), + sql("select * from src where key < 10").collect()) + } +} + } + + test("insert overwrite to local dir from temp table") { +withTempView("test_insert_table") { + spark.range(10).selectExpr("id", "id AS str").createOrReplaceTempView("test_insert_table") + + withTempDir { dir => +val path = dir.toURI.getPath + +sql( + s""" + |INSERT OVERWRITE LOCAL DIRECTORY '${path}' + |ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' + |SELECT * FROM test_insert_table + """.stripMargin) + +sql( + s""" + |INSERT OVERWRITE LOCAL DIRECTORY '${path}' + |STORED AS orc + |SELECT * FROM test_insert_table + """.stripMargin) + +// use orc data source to check the data of path is right. +withTempView("orc_source") { + sql( +s""" + |CREATE TEMPORARY VIEW orc_source + |USING org.apache.spark.sql.hive.orc + |OPTIONS ( + | PATH '${dir.getCanonicalPath}' + |) + """.stripMargin) + + checkAnswer( +sql("select * from orc_source"), +sql("select * from test_insert_table").collect()) --- End diff -- The same here. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18975: [SPARK-4131] Support "Writing data into the files...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/18975#discussion_r136739964 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/InsertSuite.scala --- @@ -534,4 +534,132 @@ class InsertIntoHiveTableSuite extends QueryTest with TestHiveSingleton with Bef } } } + + test("insert overwrite to dir from hive metastore table") { +withTempDir { dir => + val path = dir.toURI.getPath + + sql(s"INSERT OVERWRITE LOCAL DIRECTORY '${path}' SELECT * FROM src where key < 10") + + sql( +s""" + |INSERT OVERWRITE LOCAL DIRECTORY '${path}' + |STORED AS orc + |SELECT * FROM src where key < 10 + """.stripMargin) + + // use orc data source to check the data of path is right. + withTempView("orc_source") { +sql( + s""" + |CREATE TEMPORARY VIEW orc_source + |USING org.apache.spark.sql.hive.orc + |OPTIONS ( + | PATH '${dir.getCanonicalPath}' + |) + """.stripMargin) + +checkAnswer( + sql("select * from orc_source"), + sql("select * from src where key < 10").collect()) --- End diff -- Nit: `.collect()` can be removed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18975: [SPARK-4131] Support "Writing data into the files...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/18975#discussion_r136739990 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/InsertSuite.scala --- @@ -534,4 +534,132 @@ class InsertIntoHiveTableSuite extends QueryTest with TestHiveSingleton with Bef } } } + + test("insert overwrite to dir from hive metastore table") { +withTempDir { dir => + val path = dir.toURI.getPath + + sql(s"INSERT OVERWRITE LOCAL DIRECTORY '${path}' SELECT * FROM src where key < 10") + + sql( +s""" + |INSERT OVERWRITE LOCAL DIRECTORY '${path}' + |STORED AS orc + |SELECT * FROM src where key < 10 + """.stripMargin) + + // use orc data source to check the data of path is right. + withTempView("orc_source") { +sql( + s""" + |CREATE TEMPORARY VIEW orc_source + |USING org.apache.spark.sql.hive.orc + |OPTIONS ( + | PATH '${dir.getCanonicalPath}' + |) + """.stripMargin) + +checkAnswer( + sql("select * from orc_source"), + sql("select * from src where key < 10").collect()) + } +} + } + + test("insert overwrite to local dir from temp table") { +withTempView("test_insert_table") { + spark.range(10).selectExpr("id", "id AS str").createOrReplaceTempView("test_insert_table") + + withTempDir { dir => +val path = dir.toURI.getPath + +sql( + s""" + |INSERT OVERWRITE LOCAL DIRECTORY '${path}' + |ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' + |SELECT * FROM test_insert_table + """.stripMargin) + +sql( + s""" + |INSERT OVERWRITE LOCAL DIRECTORY '${path}' + |STORED AS orc + |SELECT * FROM test_insert_table + """.stripMargin) + +// use orc data source to check the data of path is right. +withTempView("orc_source") { + sql( +s""" + |CREATE TEMPORARY VIEW orc_source + |USING org.apache.spark.sql.hive.orc + |OPTIONS ( + | PATH '${dir.getCanonicalPath}' + |) + """.stripMargin) + + checkAnswer( +sql("select * from orc_source"), +sql("select * from test_insert_table").collect()) +} + } +} + } + + test("insert overwrite to dir from temp table") { +withTempView("test_insert_table") { + spark.range(10).selectExpr("id", "id AS str").createOrReplaceTempView("test_insert_table") + + withTempDir { dir => +val pathUri = dir.toURI + +sql( + s""" + |INSERT OVERWRITE DIRECTORY '${pathUri}' + |ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' + |SELECT * FROM test_insert_table + """.stripMargin) + +sql( + s""" + |INSERT OVERWRITE DIRECTORY '${pathUri}' + |STORED AS orc + |SELECT * FROM test_insert_table + """.stripMargin) + +// use orc data source to check the data of path is right. +withTempView("orc_source") { + sql( +s""" + |CREATE TEMPORARY VIEW orc_source + |USING org.apache.spark.sql.hive.orc + |OPTIONS ( + | PATH '${dir.getCanonicalPath}' + |) + """.stripMargin) + + checkAnswer( +sql("select * from orc_source"), +sql("select * from test_insert_table").collect()) --- End diff -- The same here --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18902: [SPARK-21690][ML] one-pass imputer
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/18902 +1 for using the DataFrame-based version of the code. @zhengruifeng One thing I want to confirm: I checked your testing code, and both the RDD-based and the DataFrame-based versions pay the deserialization cost:
```
...
val df = spark.createDataFrame(rows, struct)
df.persist()
df.count()
...
// do `imputer.fit`
```
When running `imputer.fit`, it extracts the required columns from the cached input DataFrame, and you then compare the performance of `RDD.aggregate` against the DataFrame `avg`. Both need to deserialize data from the input before computing, and the DataFrame `avg` takes advantage of codegen, so it should be faster. But here the test shows the RDD version being slower than the DataFrame version to a degree that is not very reasonable, so I want to confirm: in your RDD-version testing, did you cache again after getting the `RDD` from the input `DataFrame`? If not, your testing has no problem, and I would guess there is some other performance issue in the SQL layer; cc @cloud-fan to take a look. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
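For reference, a minimal sketch of the fair comparison being asked about, assuming a `spark` session in scope; `rows` and `struct` are from the quoted test code, and the column name `"value"` is an assumption:
```scala
import org.apache.spark.sql.functions.avg
import org.apache.spark.storage.StorageLevel

val df = spark.createDataFrame(rows, struct)
df.persist(StorageLevel.MEMORY_AND_DISK)
df.count()  // materialize the cache

// DataFrame-based path: whole-stage codegen applies to the aggregate.
val dfAvg = df.select(avg("value")).first().getDouble(0)

// RDD-based path: extracting the column deserializes rows from the cached
// DataFrame. Caching the extracted RDD (and materializing it) before timing
// `aggregate` is what would make the two measurements comparable.
val rdd = df.select("value").rdd.map(_.getDouble(0))
rdd.persist(StorageLevel.MEMORY_AND_DISK)
rdd.count()
val rddAvg = rdd.aggregate(0.0)(_ + _, _ + _) / rdd.count()
```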
[GitHub] spark issue #19086: [SPARK-21874][SQL] Support changing database when rename...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19086 **[Test build #81370 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81370/testReport)** for PR 19086 at commit [`d4a0f97`](https://github.com/apache/spark/commit/d4a0f97b9600dd661d81a79e71019a781eb812ff). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19086: [SPARK-21874][SQL] Support changing database when rename...
Github user jinxing64 commented on the issue: https://github.com/apache/spark/pull/19086 @gatorsmile I have updated; let me know if there are still unresolved comments. Thanks again for the review. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18865: [SPARK-21610][SQL] Corrupt records are not handled prope...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/18865 @HyukjinKwon 's provided use case looks pretty fair. There, the corrupt record is a whole line that doesn't follow the JSON format, which is somewhat different from the case where some JSON fields can't be correctly converted to the desired data type. These two kinds of corrupt records can be mixed in one JSON file. E.g.,
```
echo '{"field": 1
{"field" 2}
{"field": 3}
{"field": "4"}' > /tmp/sample.json

scala> dfFromFile.show(false)
+-----+---------------+
|field|_corrupt_record|
+-----+---------------+
|null |{"field": 1    |
|null |{"field" 2}    |
|3    |null           |
|null |{"field": "4"} |
+-----+---------------+

scala> dfFromFile.select($"_corrupt_record").show()
+---------------+
|_corrupt_record|
+---------------+
|    {"field": 1|
|    {"field" 2}|
|           null|
|           null|
+---------------+
```
At least we should clearly explain the difference in the error message. Maybe something like: after optimization, the query to execute requires only `_corrupt_record`. When there are corrupt records caused by JSON field conversion errors, those corrupt records might not be generated correctly in the end, because no other JSON fields are actually required. To obtain the most accurate result, we recommend that users cache or save the dataset before running queries that require only `_corrupt_record`. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
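A sketch of the recommended workaround under the same example, with the schema spelled out explicitly; this assumes a `spark` session in scope and the `/tmp/sample.json` file from above:
```scala
import org.apache.spark.sql.types.{IntegerType, StringType, StructType}

val schema = new StructType()
  .add("field", IntegerType)
  .add("_corrupt_record", StringType)

val dfFromFile = spark.read.schema(schema).json("/tmp/sample.json")

// Cache (or save) first, so corrupt records are computed while all fields
// are still required; the later column-pruned query then reads the cache.
dfFromFile.cache()
dfFromFile.count()

dfFromFile.select("_corrupt_record").show(false)
```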
[GitHub] spark pull request #18975: [SPARK-4131] Support "Writing data into the files...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/18975#discussion_r136739126 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveDirCommand.scala --- @@ -0,0 +1,145 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.hive.execution + +import scala.language.existentials + +import org.apache.hadoop.fs.{FileSystem, Path} +import org.apache.hadoop.hive.common.FileUtils +import org.apache.hadoop.hive.ql.plan.TableDesc +import org.apache.hadoop.hive.serde.serdeConstants +import org.apache.hadoop.hive.serde2.`lazy`.LazySimpleSerDe +import org.apache.hadoop.mapred._ + +import org.apache.spark.sql.{Row, SparkSession} +import org.apache.spark.sql.catalyst.TableIdentifier +import org.apache.spark.sql.catalyst.catalog.{CatalogStorageFormat, CatalogTable} +import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan +import org.apache.spark.sql.execution.SparkPlan +import org.apache.spark.sql.hive.client.HiveClientImpl + +/** + * Command for writing the results of `query` to file system. + * + * The syntax of using this command in SQL is: + * {{{ + * INSERT OVERWRITE [LOCAL] DIRECTORY + * path + * [ROW FORMAT row_format] + * [STORED AS file_format] + * SELECT ... + * }}} + * + * @param isLocal whether the path specified in `storage` is a local directory + * @param storage storage format used to describe how the query result is stored. + * @param query the logical plan representing data to write to + * @param overwrite whthere overwrites existing directory + */ +case class InsertIntoHiveDirCommand( +isLocal: Boolean, +storage: CatalogStorageFormat, +query: LogicalPlan, +overwrite: Boolean) extends SaveAsHiveFile { + + override def children: Seq[LogicalPlan] = query :: Nil + + override def run(sparkSession: SparkSession, children: Seq[SparkPlan]): Seq[Row] = { +assert(children.length == 1) +assert(storage.locationUri.nonEmpty) + +//val Array(cols, types) = children.head.output.foldLeft(Array("", "")) { case (r, a) => +// r(0) = r(0) + a.name + "," +// r(1) = r(1) + a.dataType.catalogString + ":" +// r +//} +// +//val properties = new Properties() +//properties.put("columns", cols.dropRight(1)) +//properties.put("columns.types", types.dropRight(1)) +//properties.put(serdeConstants.SERIALIZATION_LIB, +// storage.serde.getOrElse(classOf[LazySimpleSerDe].getName)) +// +//import scala.collection.JavaConverters._ +//properties.putAll(storage.properties.asJava) +// +//val tableDesc = new TableDesc( +// Utils.classForName(storage.inputFormat.get).asInstanceOf[Class[_ <: InputFormat[_, _]]], +// Utils.classForName(storage.outputFormat.get), +// properties +//) --- End diff -- Yes. The following dummy table looks fine to me. 
Could you remove the above lines? Thanks! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19079: [SPARK-21859][CORE] Fix SparkFiles.get failed on ...
Github user lgrcyanny closed the pull request at: https://github.com/apache/spark/pull/19079 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19102: [SPARK-21859][CORE] Fix SparkFiles.get failed on ...
Github user lgrcyanny commented on a diff in the pull request: https://github.com/apache/spark/pull/19102#discussion_r136738525 --- Diff: core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala --- @@ -496,7 +496,7 @@ object SparkSubmit extends CommandLineUtils with Logging { sysProp = "spark.executor.memory"), OptionAssigner(args.totalExecutorCores, STANDALONE | MESOS, ALL_DEPLOY_MODES, sysProp = "spark.cores.max"), - OptionAssigner(args.files, LOCAL | STANDALONE | MESOS, ALL_DEPLOY_MODES, + OptionAssigner(args.files, ALL_CLUSTER_MGRS, ALL_DEPLOY_MODES, --- End diff -- For yarn-client mode, --files is already added to "spark.yarn.dist.files". I agree with you that we should just addFile in SparkContext for "spark.yarn.dist.files" in yarn-client mode. BTW, I will fix the Spark-Submit doc as well. Thanks @vanzin --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
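For context, a minimal usage sketch of the user-facing API whose resolution this PR touches; `sc` is assumed to be a `SparkContext`, and the file name is hypothetical. Files shipped via `--files` (or `spark.yarn.dist.files`) are resolved by name the same way:
```scala
import org.apache.spark.SparkFiles

// On the driver: register a file to be shipped to every node
// (equivalent to passing it with --files at submit time).
sc.addFile("/path/to/app.conf")

// On the driver or on executors: resolve the local copy by file name.
val localPath = SparkFiles.get("app.conf")
println(s"app.conf resolved to: $localPath")
```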
[GitHub] spark issue #19115: Update PairRDDFunctions.scala
Github user jerryshao commented on the issue: https://github.com/apache/spark/pull/19115 @awarrior please follow the [doc](https://spark.apache.org/contributing.html) to submit a patch. You need to change the PR title like other PRs do, by adding the JIRA id and component tag. Add the details of your problem to the PR description, not just a bare JIRA link. And submit the PR against the master branch. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18975: [SPARK-4131] Support "Writing data into the files...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/18975#discussion_r136738263 --- Diff: sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4 --- @@ -241,11 +241,21 @@ query : ctes? queryNoWith ; -insertInto +insertIntoTable --- End diff -- These can be combined to
```
insertInto
    : INSERT OVERWRITE TABLE tableIdentifier (partitionSpec (IF NOT EXISTS)?)?   #insertIntoTable
    | INSERT INTO TABLE? tableIdentifier partitionSpec?                          #insertIntoTable
    | INSERT OVERWRITE LOCAL? DIRECTORY path=STRING rowFormat? createFileFormat? #insertOverwriteHiveDir
    | INSERT OVERWRITE LOCAL? DIRECTORY (path=STRING)? tableProvider (OPTIONS options=tablePropertyList)? #insertOverwriteDir
    ;
```
--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
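For reference, a statement the `insertOverwriteHiveDir` alternative above matches, mirroring the tests elsewhere in this PR and wrapped as it would be issued from Spark; the directory path here is hypothetical:
```scala
// Writes the query result as delimited text into the given local directory.
spark.sql(
  """
    |INSERT OVERWRITE LOCAL DIRECTORY '/tmp/out'
    |ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    |SELECT * FROM test_insert_table
  """.stripMargin)
```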
[GitHub] spark pull request #18975: [SPARK-4131] Support "Writing data into the files...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/18975#discussion_r136738182 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala --- @@ -178,11 +179,50 @@ class AstBuilder(conf: SQLConf) extends SqlBaseBaseVisitor[AnyRef] with Logging } /** - * Add an INSERT INTO [TABLE]/INSERT OVERWRITE TABLE operation to the logical plan. + * Parameters used for writing query to a directory: (isLocal, CatalogStorageFormat, provider). + */ + type InsertDirParams = (Boolean, CatalogStorageFormat, Option[String]) + + /** + * Add an + * INSERT INTO [TABLE] or + * INSERT OVERWRITE TABLE or + * INSERT OVERWRITE [LOCAL] DIRECTORY + * operation to logical plan */ private def withInsertInto( ctx: InsertIntoContext, query: LogicalPlan): LogicalPlan = withOrigin(ctx) { +assert(ctx.children.size == 1) --- End diff -- It can be rewritten to
```Scala
private def withInsertInto(
    ctx: InsertIntoContext,
    query: LogicalPlan): LogicalPlan = withOrigin(ctx) {
  ctx match {
    case c: InsertIntoTableContext =>
      withInsertIntoTable(c, query)
    case dir: InsertOverwriteDirContext =>
      val (isLocal, storage, provider) = visitInsertOverwriteDir(dir)
      InsertIntoDir(isLocal, storage, provider, query, overwrite = true)
    case hiveDir: InsertOverwriteHiveDirContext =>
      val (isLocal, storage, provider) = visitInsertOverwriteHiveDir(hiveDir)
      InsertIntoDir(isLocal, storage, provider, query, overwrite = true)
  }
}
```
--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17014: [SPARK-18608][ML] Fix double-caching in ML algorithms
Github user zhengruifeng commented on the issue: https://github.com/apache/spark/pull/17014 @WeichenXu123 @jkbradley I am curious why `ml.KMeans` is so special that it needs a separate PR. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with the im...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18538 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/81369/ Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with the im...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18538 Merged build finished. Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with the im...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18538 **[Test build #81369 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81369/testReport)** for PR 18538 at commit [`9abe9e5`](https://github.com/apache/spark/commit/9abe9e560ae12405a480eab325f7a707e8cb1f14). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19115: Update PairRDDFunctions.scala
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19115 Can one of the admins verify this patch? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19114: Update PairRDDFunctions.scala
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19114 Can one of the admins verify this patch? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19115: Update PairRDDFunctions.scala
GitHub user awarrior opened a pull request: https://github.com/apache/spark/pull/19115 Update PairRDDFunctions.scala https://issues.apache.org/jira/browse/SPARK-21882 You can merge this pull request into a Git repository by running: $ git pull https://github.com/awarrior/spark branch-2.2 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19115.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #19115 commit a096970b2f2cfa497a96870ebd26f83a106b4e07 Author: Jarvis Date: 2017-09-04T02:48:35Z Update PairRDDFunctions.scala spark-21882 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17014: [SPARK-18608][ML] Fix double-caching in ML algori...
Github user zhengruifeng commented on a diff in the pull request: https://github.com/apache/spark/pull/17014#discussion_r136737427 --- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala --- @@ -304,16 +304,14 @@ class KMeans @Since("1.5.0") ( override def fit(dataset: Dataset[_]): KMeansModel = { transformSchema(dataset.schema, logging = true) -val handlePersistence = dataset.rdd.getStorageLevel == StorageLevel.NONE +val handlePersistence = dataset.storageLevel == StorageLevel.NONE val instances: RDD[OldVector] = dataset.select(col($(featuresCol))).rdd.map { case Row(point: Vector) => OldVectors.fromML(point) } -if (handlePersistence) { - instances.persist(StorageLevel.MEMORY_AND_DISK) -} +if (handlePersistence) instances.persist(StorageLevel.MEMORY_AND_DISK) --- End diff -- OK, I will revert those small changes --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19114: Update PairRDDFunctions.scala
GitHub user awarrior opened a pull request: https://github.com/apache/spark/pull/19114 Update PairRDDFunctions.scala https://issues.apache.org/jira/browse/SPARK-21882 You can merge this pull request into a Git repository by running: $ git pull https://github.com/awarrior/spark branch-1.6 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19114.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #19114 commit e7e42802b07c5148ba02761af1edd2ee81d6ef95 Author: Jarvis Date: 2017-09-04T02:52:01Z Update PairRDDFunctions.scala spark-21882 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18957: [SPARK-21744][CORE] Add retry logic for new broad...
Github user caneGuy closed the pull request at: https://github.com/apache/spark/pull/18957 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19111: [SPARK-21801][SPARKR][TEST][WIP] set random seed for pre...
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/19111 I found that `NaiveBayes` can also fail. Found here in #18538. Hope this can work! https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81316/console
```
2. Error: spark.naiveBayes (@test_mllib_classification.R#482) --
java.lang.IllegalArgumentException: requirement failed: The input column stridx_5a965aeed344 should have at least two distinct values.
at scala.Predef$.require(Predef.scala:224)
at org.apache.spark.ml.feature.OneHotEncoder$$anonfun$5.apply(OneHotEncoder.scala:113)
...
1: spark.naiveBayes(traindf, clicked ~ ., smoothing = 0) at /home/jenkins/workspace/SparkPullRequestBuilder/R/pkg/tests/fulltests/test_mllib_classification.R:482
2: spark.naiveBayes(traindf, clicked ~ ., smoothing = 0)
3: .local(data, formula, ...)
4: callJStatic("org.apache.spark.ml.r.NaiveBayesWrapper", "fit", formula, data@sdf, smoothing, handleInvalid)
5: invokeJava(isStatic = TRUE, className, methodName, ...)
6: handleErrors(returnStatus, conn)
7: stop(readString(conn))
DONE ===
Error: Test failures
Execution halted
```
--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with the im...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18538 **[Test build #81369 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81369/testReport)** for PR 18538 at commit [`9abe9e5`](https://github.com/apache/spark/commit/9abe9e560ae12405a480eab325f7a707e8cb1f14). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with the im...
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/18538 Jenkins, test this please. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19079: [SPARK-21859][CORE] Fix SparkFiles.get failed on driver ...
Github user jerryshao commented on the issue: https://github.com/apache/spark/pull/19079 Please close this PR @lgrcyanny thanks! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19113: [SPARK-20978][SQL] Bump up Univocity version to 2.5.4
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19113 **[Test build #81368 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81368/testReport)** for PR 19113 at commit [`fa7eb51`](https://github.com/apache/spark/commit/fa7eb514cfd91bb405fd74680b08d5865911e3f0). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19113: [SPARK-20978][SQL] Bump up Univocity version to 2...
GitHub user HyukjinKwon opened a pull request: https://github.com/apache/spark/pull/19113 [SPARK-20978][SQL] Bump up Univocity version to 2.5.4 ## What changes were proposed in this pull request? There was a bug in Univocity Parser that caused the issue in SPARK-20978; it has been fixed upstream, as the before/after below shows:
```scala
val df = spark.read.schema("a string, b string, unparsed string").option("columnNameOfCorruptRecord", "unparsed").csv(Seq("a").toDS())
df.show()
```
**Before**
```
java.lang.NullPointerException
at scala.collection.immutable.StringLike$class.stripLineEnd(StringLike.scala:89)
at scala.collection.immutable.StringOps.stripLineEnd(StringOps.scala:29)
at org.apache.spark.sql.execution.datasources.csv.UnivocityParser.org$apache$spark$sql$execution$datasources$csv$UnivocityParser$$getCurrentInput(UnivocityParser.scala:56)
at org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anonfun$org$apache$spark$sql$execution$datasources$csv$UnivocityParser$$convert$1.apply(UnivocityParser.scala:207)
at org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anonfun$org$apache$spark$sql$execution$datasources$csv$UnivocityParser$$convert$1.apply(UnivocityParser.scala:207)
...
```
**After**
```
+---+----+--------+
|  a|   b|unparsed|
+---+----+--------+
|  a|null|       a|
+---+----+--------+
```
The fix landed in 2.5.0, and 2.5.4 has since been released. I guess it'd be safe to upgrade. ## How was this patch tested? Unit test added in `CSVSuite.scala`. You can merge this pull request into a Git repository by running: $ git pull https://github.com/HyukjinKwon/spark bump-up-univocity Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19113.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #19113 commit fa7eb514cfd91bb405fd74680b08d5865911e3f0 Author: hyukjinkwon Date: 2017-08-26T03:52:40Z Bump up Univocity version to 2.5.4 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17862: [SPARK-20602] [ML]Adding LBFGS optimizer and Squared_hin...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17862 Merged build finished. Test FAILed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org