[GitHub] spark pull request #15421: [SPARK-17811] SparkR cannot parallelize data.fram...
Github user falaki commented on a diff in the pull request:

    https://github.com/apache/spark/pull/15421#discussion_r82940884

    --- Diff: core/src/main/scala/org/apache/spark/api/r/SerDe.scala ---
    @@ -125,15 +125,34 @@ private[spark] object SerDe {
       }

       def readDate(in: DataInputStream): Date = {
    -    Date.valueOf(readString(in))
    +    try {
    +      val inStr = readString(in)
    +      if (inStr == "NA") {
    +        null
    +      } else {
    +        Date.valueOf(inStr)
    +      }
    +    } catch {
    +      // On windows we get NegativeArraySizeException for NAs in R
    --- End diff --

    No. I will revert this change.

---
If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA.
---

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
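For readers following the diff above, the NA check can be looked at in isolation. This is only a minimal standalone sketch, not the SerDe code: `NaDateSketch` and `parseDate` are hypothetical names, and returning `Option` instead of `null` is an assumption made for the illustration.

```scala
import java.sql.Date

object NaDateSketch {
  // Hypothetical helper: treats R's "NA" marker (or a null string) as a
  // missing value instead of passing it to Date.valueOf, which would throw
  // IllegalArgumentException for a string that is not yyyy-[m]m-[d]d.
  def parseDate(raw: String): Option[Date] =
    if (raw == null || raw == "NA") None
    else Some(Date.valueOf(raw))
}
```

A caller that needs the nullable form used in the diff could write `NaDateSketch.parseDate(s).orNull`.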
[GitHub] spark issue #15447: [SPARK-14804][Graphx] Graph vertexRDD/EdgeRDD checkpoint...
Github user apivovarov commented on the issue: https://github.com/apache/spark/pull/15447 Related PRs https://github.com/apache/spark/pull/15396 https://github.com/apache/spark/pull/12576
[GitHub] spark issue #15447: [SPARK-14804][Graphx] Graph vertexRDD/EdgeRDD checkpoint...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15447 Can one of the admins verify this patch?
[GitHub] spark issue #15375: [SPARK-17790][SPARKR] Support for parallelizing R data.f...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15375 Merged build finished. Test PASSed.
[GitHub] spark issue #15375: [SPARK-17790][SPARKR] Support for parallelizing R data.f...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15375 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/66792/ Test PASSed.
[GitHub] spark pull request #15447: [SPARK-14804][Graphx] Graph vertexRDD/EdgeRDD che...
GitHub user apivovarov opened a pull request:

    https://github.com/apache/spark/pull/15447

    [SPARK-14804][Graphx] Graph vertexRDD/EdgeRDD checkpoint results Clas…

    EdgeRDD/VertexRDD wraps partitionsRDD, e.g. `EdgeRDDImpl.checkpoint()` calls `partitionsRDD.checkpoint()`.

    The EdgeRDD/VertexRDD `isCheckpointed()` method should be implemented the same way: it should call `partitionsRDD.isCheckpointed`.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/apivovarov/spark 14804

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/15447.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #15447

commit b123b68589d59d65db6210f1792a48d7f94e09bb
Author: Alexander Pivovarov
Date:   2016-10-12T05:48:37Z

    [SPARK-14804][Graphx] Graph vertexRDD/EdgeRDD checkpoint results ClassCastException
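The delegation described in this pull request can be sketched without Spark. Everything below is a hypothetical stand-in (`InnerRdd`, `WrapperRdd`), and the inner class flips its flag immediately, which is a simplification: it is only meant to show that a wrapper which forwards `checkpoint()` should forward `isCheckpointed` to the same object, so the two calls agree.

```scala
object CheckpointDelegationSketch {
  // Hypothetical stand-in for the wrapped RDD; it records the checkpoint
  // request in a plain flag (an assumption made to keep the sketch small).
  class InnerRdd {
    private var checkpointed = false
    def checkpoint(): Unit = { checkpointed = true }
    def isCheckpointed: Boolean = checkpointed
  }

  // Hypothetical wrapper in the style of EdgeRDD/VertexRDD: both methods
  // delegate to the wrapped partitionsRDD, so they cannot disagree.
  class WrapperRdd(val partitionsRDD: InnerRdd) {
    def checkpoint(): Unit = partitionsRDD.checkpoint()
    def isCheckpointed: Boolean = partitionsRDD.isCheckpointed
  }
}
```

If `isCheckpointed` consulted the wrapper's own state instead, it would keep returning `false` after `checkpoint()` had been forwarded to the wrapped RDD.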
[GitHub] spark issue #15445: [SPARK-17817][PySpark][FOLLOWUP] PySpark RDD Repartition...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15445 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/66789/ Test PASSed.
[GitHub] spark issue #15445: [SPARK-17817][PySpark][FOLLOWUP] PySpark RDD Repartition...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15445 Merged build finished. Test PASSed.
[GitHub] spark issue #15375: [SPARK-17790][SPARKR] Support for parallelizing R data.f...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15375

**[Test build #66792 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66792/consoleFull)** for PR 15375 at commit [`836e874`](https://github.com/apache/spark/commit/836e8745c346c59f78958e10aec1c6f9537242b9).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #15445: [SPARK-17817][PySpark][FOLLOWUP] PySpark RDD Repartition...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15445

**[Test build #66789 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66789/consoleFull)** for PR 15445 at commit [`be6d153`](https://github.com/apache/spark/commit/be6d1537e9bbd2cc2484e4d8da9d901b16725c97).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #9766: [SPARK-11775][PYSPARK][SQL] Allow PySpark to register Jav...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/9766

**[Test build #66794 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66794/consoleFull)** for PR 9766 at commit [`45a9b7a`](https://github.com/apache/spark/commit/45a9b7af6afbb2ab1287cc41fafbaa1de823eafa).
* This patch **fails Scala style tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #9766: [SPARK-11775][PYSPARK][SQL] Allow PySpark to register Jav...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/9766 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/66794/ Test FAILed.
[GitHub] spark issue #9766: [SPARK-11775][PYSPARK][SQL] Allow PySpark to register Jav...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/9766 Merged build finished. Test FAILed.
[GitHub] spark issue #9766: [SPARK-11775][PYSPARK][SQL] Allow PySpark to register Jav...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/9766 **[Test build #66794 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66794/consoleFull)** for PR 9766 at commit [`45a9b7a`](https://github.com/apache/spark/commit/45a9b7af6afbb2ab1287cc41fafbaa1de823eafa).
[GitHub] spark pull request #15230: [SPARK-17657] [SQL] Disallow Users to Change Tabl...
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/15230#discussion_r82940270

    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala ---
    @@ -225,6 +225,11 @@ case class AlterTableSetPropertiesCommand(
         val catalog = sparkSession.sessionState.catalog
         val table = catalog.getTableMetadata(tableName)
         DDLUtils.verifyAlterTableType(catalog, table, isView)
    +    // Not allowed to switch the table type.
    +    if (properties.contains("EXTERNAL")) {
    --- End diff --

    This is officially documented in the Hive documentation, as shown in the [link](https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL):

    `TBLPROPERTIES ("EXTERNAL"="TRUE") in release 0.6.0+ (HIVE-1329) - Change a managed table to an external table and vice versa for "FALSE".`

    This is the only property users are not allowed to change. The other Hive-specific properties can still be changed, because Hive also allows it. Users are also not allowed to change our Spark-reserved properties. See the function call `verifyTableProperties` in [`alterTable`](https://github.com/apache/spark/blob/b9a147181d5e38d9abed0c7215f4c5cb695f579c/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala#L393).
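The guard shown in the diff can be illustrated on a plain `Map`. This is a minimal sketch under assumptions: `TablePropertyGuardSketch` and `verifyProperties` are hypothetical names, and `IllegalArgumentException` stands in for whatever exception the real command raises.

```scala
object TablePropertyGuardSketch {
  // Hypothetical guard: rejects any attempt to set the "EXTERNAL" key, since
  // changing it would switch the table between managed and external.
  // Other keys pass through untouched.
  def verifyProperties(properties: Map[String, String]): Unit =
    if (properties.contains("EXTERNAL")) {
      throw new IllegalArgumentException(
        "Cannot set or change the table property 'EXTERNAL'")
    }
}
```

Note that the check is on the exact key `"EXTERNAL"`, as in the diff; whether other spellings should also be rejected is not covered here.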
[GitHub] spark issue #15173: [SPARK-15698][SQL][Streaming][Follw-up]Fix FileStream so...
Github user tdas commented on the issue: https://github.com/apache/spark/pull/15173 @zsxwing Why was this not merged to 2.0?
[GitHub] spark pull request #15439: [SPARK-17880][DOC] The url linking to `Accumulato...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/15439
[GitHub] spark issue #15427: [SPARK-17866][SPARK-17867][SQL] Fix Dataset.dropduplicat...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15427 Merged build finished. Test PASSed.
[GitHub] spark issue #15427: [SPARK-17866][SPARK-17867][SQL] Fix Dataset.dropduplicat...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15427 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/66790/ Test PASSed.
[GitHub] spark issue #15427: [SPARK-17866][SPARK-17867][SQL] Fix Dataset.dropduplicat...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15427

**[Test build #66790 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66790/consoleFull)** for PR 15427 at commit [`81339dc`](https://github.com/apache/spark/commit/81339dc429104633ee28cf078f643b5050564557).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #15439: [SPARK-17880][DOC] The url linking to `AccumulatorV2` in...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/15439 Thanks - merging in master/2.0.
[GitHub] spark pull request #15440: Fix hadoop.version in building-spark.md
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/15440
[GitHub] spark issue #15440: Fix hadoop.version in building-spark.md
Github user rxin commented on the issue: https://github.com/apache/spark/pull/15440 Thanks - merging in master/branch-2.0.
[GitHub] spark pull request #15434: [SPARK-17873][SQL] ALTER TABLE RENAME TO should a...
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/15434#discussion_r82938529

    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala ---
    @@ -459,11 +459,20 @@ class SessionCatalog(
        * If a database is specified in `oldName`, this will rename the table in that database.
        * If no database is specified, this will first attempt to rename a temporary table with
        * the same name, then, if that does not exist, rename the table in the current database.
    +   *
    +   * This assumes the database specified in `newName` matches the one in `oldName`.
        */
    -  def renameTable(oldName: TableIdentifier, newName: String): Unit = synchronized {
    +  def renameTable(oldName: TableIdentifier, newName: TableIdentifier): Unit = synchronized {
         val db = formatDatabaseName(oldName.database.getOrElse(currentDb))
    +    newName.database.map(formatDatabaseName).foreach { newDb =>
    --- End diff --

    See the PR description: we should use the database of the source table, so that users can just write `db.tbl1 RENAME TO tbl2`. This is different from Hive, as we don't support moving a table from one database to another.
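The rule described in this comment (take the database from the source table, and reject a target that names a different database) can be sketched without Spark. The names `TableId` and `resolveRename` are hypothetical, and using `require` for the mismatch check is an assumption, since the body of the `foreach` is cut off in the diff.

```scala
object RenameTableSketch {
  // Hypothetical, simplified table identifier: a name plus an optional database.
  case class TableId(table: String, database: Option[String] = None)

  // Resolves the fully qualified target of a rename. The database always comes
  // from the source table (or the current database if the source has none);
  // a target that names a different database is rejected.
  def resolveRename(oldName: TableId, newName: TableId, currentDb: String): TableId = {
    val db = oldName.database.getOrElse(currentDb)
    newName.database.foreach { newDb =>
      require(newDb == db, s"Cannot rename across databases: '$db' != '$newDb'")
    }
    TableId(newName.table, Some(db))
  }
}
```

With this shape, `db.tbl1 RENAME TO tbl2` resolves to `db.tbl2`, while `db1.tbl RENAME TO db2.tbl2` fails the check.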
[GitHub] spark pull request #15423: [SPARK-17860][SQL] SHOW COLUMN's database conflic...
Github user dilipbiswal commented on a diff in the pull request:

    https://github.com/apache/spark/pull/15423#discussion_r82938410

    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/SQLQueryTestSuite.scala ---
    @@ -207,6 +208,7 @@ class SQLQueryTestSuite extends QueryTest with SharedSQLContext {
         // Returns true if the plan is supposed to be sorted.
         def isSorted(plan: LogicalPlan): Boolean = plan match {
           case _: Join | _: Aggregate | _: Generate | _: Sample | _: Distinct => false
    +      case _: ShowColumnsCommand => true
    --- End diff --

    @cloud-fan @viirya Thanks :-) I will change it.
[GitHub] spark issue #15434: [SPARK-17873][SQL] ALTER TABLE RENAME TO should allow us...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/15434

Just FYI. Hive allows the following changes:

```SQL
ALTER TABLE db1.tbl RENAME TO db2.tbl2
```
[GitHub] spark issue #15406: [Spark-17745][ml][PySpark] update NB python api - add we...
Github user sethah commented on the issue: https://github.com/apache/spark/pull/15406 We should add weights to the doctests to demonstrate them and make sure they're working.
[GitHub] spark pull request #15423: [SPARK-17860][SQL] SHOW COLUMN's database conflic...
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/15423#discussion_r82937473

    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/SQLQueryTestSuite.scala ---
    @@ -207,6 +208,7 @@ class SQLQueryTestSuite extends QueryTest with SharedSQLContext {
         // Returns true if the plan is supposed to be sorted.
         def isSorted(plan: LogicalPlan): Boolean = plan match {
           case _: Join | _: Aggregate | _: Generate | _: Sample | _: Distinct => false
    +      case _: ShowColumnsCommand => true
    --- End diff --

    +1 as mentioned in previous comment.
[GitHub] spark issue #15307: [SPARK-17731][SQL][STREAMING] Metrics for structured str...
Github user tdas commented on the issue: https://github.com/apache/spark/pull/15307 @marmbrus Could you take a look?
[GitHub] spark issue #11610: [SPARK-13777] [ML] Remove constant features from trainin...
Github user sethah commented on the issue: https://github.com/apache/spark/pull/11610 This problem should be handled by https://github.com/apache/spark/pull/15394 if it is merged. It seems this is no longer active, and we are pursuing alternative solutions. Shall we close this?
[GitHub] spark pull request #15423: [SPARK-17860][SQL] SHOW COLUMN's database conflic...
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/15423#discussion_r82937255

    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/SQLQueryTestSuite.scala ---
    @@ -207,6 +208,7 @@ class SQLQueryTestSuite extends QueryTest with SharedSQLContext {
         // Returns true if the plan is supposed to be sorted.
         def isSorted(plan: LogicalPlan): Boolean = plan match {
           case _: Join | _: Aggregate | _: Generate | _: Sample | _: Distinct => false
    +      case _: ShowColumnsCommand => true
    --- End diff --

    Marking `ShowColumnsCommand` as sorted is even weirder; I'd like to leave the result sorted.
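To make the discussion around `isSorted` easier to follow, here is a toy version of such a check on a hypothetical plan tree. Only the first `case` mirrors the diff; the `Sort` case, the recursion into children, and all node types below are assumptions made for illustration, not the test suite's code.

```scala
object IsSortedSketch {
  // Hypothetical miniature plan tree.
  sealed trait Plan { def children: Seq[Plan] }
  case class Sort(child: Plan) extends Plan { def children: Seq[Plan] = Seq(child) }
  case class Aggregate(child: Plan) extends Plan { def children: Seq[Plan] = Seq(child) }
  case class Project(child: Plan) extends Plan { def children: Seq[Plan] = Seq(child) }
  case object Scan extends Plan { def children: Seq[Plan] = Nil }

  // Returns true if the output order of the plan is meaningful, so a test
  // harness should compare rows in order instead of sorting them first.
  def isSorted(plan: Plan): Boolean = plan match {
    case _: Aggregate => false                    // operator discards any ordering below it
    case _: Sort      => true                     // explicit ordering
    case other        => other.children.exists(isSorted)
  }
}
```

In this picture, adding a case that returns `true` for a command node would declare its output order significant, which is the design choice being questioned in the thread.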
[GitHub] spark issue #9008: [SPARK-9478] [ml] Add class weights to Random Forest
Github user sethah commented on the issue: https://github.com/apache/spark/pull/9008 @rotationsymmetry Could you please close this?
[GitHub] spark issue #15375: [SPARK-17790][SPARKR] Support for parallelizing R data.f...
Github user tdas commented on the issue: https://github.com/apache/spark/pull/15375 @falaki @felixcheung The DirectKafkaStreamSuite is a known flaky test. Nothing in this patch should affect Kafka.
[GitHub] spark pull request #15414: [SPARK-17848][ML] Move LabelCol datatype cast int...
Github user sethah commented on a diff in the pull request:

    https://github.com/apache/spark/pull/15414#discussion_r82931901

    --- Diff: mllib/src/test/scala/org/apache/spark/ml/PredictorSuite.scala ---
    @@ -0,0 +1,57 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements. See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License. You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml
    +
    +import org.apache.spark.SparkFunSuite
    +import org.apache.spark.ml.linalg._
    +import org.apache.spark.ml.param.ParamMap
    +import org.apache.spark.ml.util._
    +import org.apache.spark.mllib.util.MLlibTestSparkContext
    +import org.apache.spark.sql.{DataFrame, Dataset}
    +import org.apache.spark.sql.types._
    +
    +class PredictorSuite extends SparkFunSuite with MLlibTestSparkContext with DefaultReadWriteTest {
    +
    +  import testImplicits._
    +
    +  class MockPredictor(override val uid: String)
    --- End diff --

    move into companion object.
[GitHub] spark pull request #15414: [SPARK-17848][ML] Move LabelCol datatype cast int...
Github user sethah commented on a diff in the pull request:

    https://github.com/apache/spark/pull/15414#discussion_r82932068

    --- Diff: mllib/src/main/scala/org/apache/spark/ml/Predictor.scala ---
    @@ -121,10 +122,18 @@ abstract class Predictor[
        * and put it in an RDD with strong types.
        */
       protected def extractLabeledPoints(dataset: Dataset[_]): RDD[LabeledPoint] = {
    -    dataset.select(col($(labelCol)).cast(DoubleType), col($(featuresCol))).rdd.map {
    +    dataset.select(col($(labelCol)), col($(featuresCol))).rdd.map {
           case Row(label: Double, features: Vector) => LabeledPoint(label, features)
         }
       }
    +
    +  /**
    +   * Return the given DataFrame, with [[labelCol]] casted to DoubleType.
    +   */
    +    protected def castDataSet(dataset: Dataset[_]): DataFrame = {
    --- End diff --

    let's just put this logic directly in `fit`
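Why the cast has to happen somewhere before `extractLabeledPoints` can be seen in a small Spark-free analogue. The types `Row` and `LabeledPoint` below are hypothetical stand-ins that only borrow the names, and `castLabel` plays the role of the cast that the review suggests doing inside `fit`.

```scala
object LabelCastSketch {
  // Hypothetical row: the label may arrive as Int, Long, Float or Double.
  case class Row(label: Any, features: Vector[Double])
  case class LabeledPoint(label: Double, features: Vector[Double])

  // Stand-in for casting the label column to a double type.
  def castLabel(row: Row): Row = row.label match {
    case n: java.lang.Number => row.copy(label = n.doubleValue())
    case other =>
      throw new IllegalArgumentException(s"Non-numeric label: $other")
  }

  // Like the pattern match in the diff: only rows whose label is already a
  // Double are accepted; anything else fails with a MatchError.
  def extractLabeledPoints(rows: Seq[Row]): Seq[LabeledPoint] = rows.map {
    case Row(label: Double, features) => LabeledPoint(label, features)
  }
}
```

Calling `extractLabeledPoints(rows.map(castLabel))` succeeds for integer labels, while skipping the cast makes the same rows fail the pattern match.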
[GitHub] spark pull request #15414: [SPARK-17848][ML] Move LabelCol datatype cast int...
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15414#discussion_r82935295 --- Diff: mllib/src/test/scala/org/apache/spark/ml/PredictorSuite.scala --- @@ -0,0 +1,57 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.spark.ml + +import org.apache.spark.SparkFunSuite +import org.apache.spark.ml.linalg._ +import org.apache.spark.ml.param.ParamMap +import org.apache.spark.ml.util._ +import org.apache.spark.mllib.util.MLlibTestSparkContext +import org.apache.spark.sql.{DataFrame, Dataset} +import org.apache.spark.sql.types._ + +class PredictorSuite extends SparkFunSuite with MLlibTestSparkContext with DefaultReadWriteTest { + + import testImplicits._ + + class MockPredictor(override val uid: String) +extends Predictor[Vector, MockPredictor, MockPredictionModel] { + +override def train(dataset: Dataset[_]): MockPredictionModel = { + require(dataset.schema("label").dataType == DoubleType) + new MockPredictionModel(uid) +} + +override def copy(extra: ParamMap): MockPredictor = defaultCopy(extra) + } + + class MockPredictionModel(override val uid: String) +extends PredictionModel[Vector, MockPredictionModel] { + +override def predict(features: Vector): Double = 1.0 --- End diff -- `override def predict(features: Vector): Double = throw new NotImplementedError()` We can do this for everything except `train`. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
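The suggestion above, throwing from every mock method except `train`, can be sketched without any Spark dependency. This is only an illustration under stated assumptions: `Model`, `Estimator`, `MockModel` and `MockEstimator` are hypothetical stand-ins for `PredictionModel`, `Predictor` and the mocks in the diff, and the `Double` check loosely mirrors the `require` on the label type.

```scala
object MockSketch {
  // Hypothetical, simplified stand-ins for PredictionModel and Predictor.
  trait Model { def predict(x: Double): Double }
  trait Estimator { def train(label: Any): Model }

  class MockModel extends Model {
    // Never expected to be called by the test, so fail loudly if it is.
    override def predict(x: Double): Double = throw new NotImplementedError()
  }

  class MockEstimator extends Estimator {
    // The only method with real logic: it checks the label's runtime type.
    override def train(label: Any): Model = {
      require(label.isInstanceOf[Double], "label must be a Double")
      new MockModel
    }
  }
}
```

Any accidental call to `predict` then fails the test immediately instead of silently returning a placeholder value such as `1.0`.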
[GitHub] spark pull request #15414: [SPARK-17848][ML] Move LabelCol datatype cast int...
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15414#discussion_r82932894 --- Diff: mllib/src/test/scala/org/apache/spark/ml/PredictorSuite.scala --- @@ -0,0 +1,57 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.spark.ml + +import org.apache.spark.SparkFunSuite +import org.apache.spark.ml.linalg._ +import org.apache.spark.ml.param.ParamMap +import org.apache.spark.ml.util._ +import org.apache.spark.mllib.util.MLlibTestSparkContext +import org.apache.spark.sql.{DataFrame, Dataset} +import org.apache.spark.sql.types._ + +class PredictorSuite extends SparkFunSuite with MLlibTestSparkContext with DefaultReadWriteTest { + + import testImplicits._ + + class MockPredictor(override val uid: String) +extends Predictor[Vector, MockPredictor, MockPredictionModel] { + +override def train(dataset: Dataset[_]): MockPredictionModel = { + require(dataset.schema("label").dataType == DoubleType) + new MockPredictionModel(uid) +} + +override def copy(extra: ParamMap): MockPredictor = defaultCopy(extra) + } + + class MockPredictionModel(override val uid: String) +extends PredictionModel[Vector, MockPredictionModel] { + +override def predict(features: Vector): Double = 1.0 + +override def copy(extra: ParamMap): MockPredictionModel = defaultCopy(extra) + } + + test("should support all NumericType labels and not support other types") { +val predictor = new MockPredictor("mock") +MLTestingUtils.checkNumericTypes[MockPredictionModel, MockPredictor]( --- End diff -- Why don't we just cycle through the types here and call `fit`. I think it's a bit confusing the way it is now. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15414: [SPARK-17848][ML] Move LabelCol datatype cast int...
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15414#discussion_r82932799 --- Diff: mllib/src/test/scala/org/apache/spark/ml/util/MLTestingUtils.scala --- @@ -117,7 +117,7 @@ object MLTestingUtils extends SparkFunSuite { Seq(ShortType, LongType, IntegerType, FloatType, ByteType, DoubleType, DecimalType(10, 0)) types.map { t => val castDF = df.select(col(labelColName).cast(t), col(featuresColName)) -t -> TreeTests.setMetadata(castDF, 2, labelColName, featuresColName) +t -> TreeTests.setMetadata(castDF, 0, labelColName, featuresColName) --- End diff -- What is this for? If the intent is to force `getNumClasses` to infer the number of classes, then you're no longer testing the not inferred case. Further, the point of this PR is to eliminate the need to do that since it is not a robust solution, IMO. Also, I'd like to remove the dependence on `TreeTests` here (and `genRegressionDF`) and just explicitly set the attributes in the functions. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15172: [SPARK-13331] AES support for over-the-wire encryption
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15172 Merged build finished. Test PASSed.
[GitHub] spark issue #15172: [SPARK-13331] AES support for over-the-wire encryption
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15172 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/66786/ Test PASSed.
[GitHub] spark issue #15172: [SPARK-13331] AES support for over-the-wire encryption
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15172 **[Test build #66786 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66786/consoleFull)** for PR 15172 at commit [`46b52e6`](https://github.com/apache/spark/commit/46b52e63918376dcf5dde0359fdfe1efa2456dfd). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #15307: [SPARK-17731][SQL][STREAMING] Metrics for structured str...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15307 Merged build finished. Test PASSed.
[GitHub] spark issue #15172: [SPARK-13331] AES support for over-the-wire encryption
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15172 Merged build finished. Test PASSed.
[GitHub] spark issue #15307: [SPARK-17731][SQL][STREAMING] Metrics for structured str...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15307 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/66784/ Test PASSed.
[GitHub] spark issue #15172: [SPARK-13331] AES support for over-the-wire encryption
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15172 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/66785/ Test PASSed.
[GitHub] spark issue #15307: [SPARK-17731][SQL][STREAMING] Metrics for structured str...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15307 **[Test build #66784 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66784/consoleFull)** for PR 15307 at commit [`35bf508`](https://github.com/apache/spark/commit/35bf5089f0d79ba0ba007ca9983a75616f1a553d). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #15172: [SPARK-13331] AES support for over-the-wire encryption
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15172 **[Test build #66785 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66785/consoleFull)** for PR 15172 at commit [`0bf663f`](https://github.com/apache/spark/commit/0bf663f0d8a71b2944d4030dc0ef95e36ee35471). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #15446: [SPARK-17882][SparkR] Fix swallowed exception in RBacken...
Github user falaki commented on the issue: https://github.com/apache/spark/pull/15446 @shivaram yes, I just noticed it while debugging and fixed it.
[GitHub] spark pull request #15335: [SPARK-17769][Core][Scheduler]Some FetchFailure r...
Github user squito commented on a diff in the pull request: https://github.com/apache/spark/pull/15335#discussion_r82933318 --- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala --- @@ -1255,27 +1255,46 @@ class DAGScheduler( s"longer running") } - if (disallowStageRetryForTest) { -abortStage(failedStage, "Fetch failure will not retry stage due to testing config", - None) - } else if (failedStage.failedOnFetchAndShouldAbort(task.stageAttemptId)) { -abortStage(failedStage, s"$failedStage (${failedStage.name}) " + - s"has failed the maximum allowable number of " + - s"times: ${Stage.MAX_CONSECUTIVE_FETCH_FAILURES}. " + - s"Most recent failure reason: ${failureMessage}", None) - } else { -if (failedStages.isEmpty) { - // Don't schedule an event to resubmit failed stages if failed isn't empty, because - // in that case the event will already have been scheduled. - // TODO: Cancel running tasks in the stage - logInfo(s"Resubmitting $mapStage (${mapStage.name}) and " + -s"$failedStage (${failedStage.name}) due to fetch failure") - messageScheduler.schedule(new Runnable { -override def run(): Unit = eventProcessLoop.post(ResubmitFailedStages) - }, DAGScheduler.RESUBMIT_TIMEOUT, TimeUnit.MILLISECONDS) + val shouldAbortStage = +failedStage.failedOnFetchAndShouldAbort(task.stageAttemptId) || +disallowStageRetryForTest + + if (shouldAbortStage) { +val abortMessage = if (disallowStageRetryForTest) { + "Fetch failure will not retry stage due to testing config" +} else { + s"""$failedStage (${failedStage.name}) + |has failed the maximum allowable number of + |times: ${Stage.MAX_CONSECUTIVE_FETCH_FAILURES}. + |Most recent failure reason: $failureMessage""".stripMargin.replaceAll("\n", " ") } +abortStage(failedStage, abortMessage, None) + } else { // update failedStages and make sure a ResubmitFailedStages event is enqueued +// TODO: Cancel running tasks in the failed stage -- cf. 
SPARK-17064 +val noResubmitEnqueued = !failedStages.contains(failedStage) --- End diff --

I think I was worried about the opposite problem -- perhaps we add `mapStage` to `failedStages`, but fail to fire a `Resubmit` event. Maybe there are too many negatives to think through this clearly -- my intention was *more* logging & resubmission, not less. I suppose I was thinking of it as:

```scala
val addedToFailedStages = failedStages.add(failedStage) | failedStages.add(mapStage)
if (addedToFailedStages) {
  logStuff()
  resubmit()
}
```

The point is to avoid another case of the bug which started all this -- you add to `failedStages`, but fail to ever `Resubmit`.

I was thinking of a scenario like this (though as you'll see, this case is fine). Say you have two jobs submitted concurrently, which share the first few stages: A -> B -> C and A -> B -> D. There is an executor failure while they are both running their independent parts, C & D, concurrently. The failure is detected in C first, so it marks B & C as failed. Later on, the failure is detected in D, and it marks B & D as failed. If the first resubmit was already processed, it's fine: B is already running, and we mark D as waiting on B. Similarly, it's fine if the resubmit wasn't processed yet when the failure is detected in D -- then when the resubmit is processed, we resubmit all 3 stages. I think it also works out even if stage A needs to get resubmitted as well -- it's handled in the same call that does the resubmit for B, when it checks for missing parents.

(In fact, thinking through these cases makes me think we don't even need to resubmit the `mapStage` at all -- the `failedStage` will submit itself on its resubmit, since it will notice its parents aren't ready. Which is why there isn't a case where this check would really matter.)

Anyway, the point is not that I could show you a case where we *do* need to make sure there is a resubmit. The point is that I'm *not* sure that we do *not* need it, which is why I thought it was better to err on the side of over-logging / resubmitting.
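The "add, and resubmit if anything was newly added" idea in the comment above can be sketched in isolation. This is a minimal illustration, not `DAGScheduler` code: it assumes `failedStages` behaves like a `scala.collection.mutable.HashSet`, whose `add` returns `true` only when the element was not already present. Stages are plain strings, and `recordFailure` and `resubmitCount` are hypothetical names standing in for the logging and the `ResubmitFailedStages` event. The non-short-circuiting `|` matters: with `||`, `mapStage` would never be added whenever `failedStage` was newly added.

```scala
import scala.collection.mutable

object ResubmitSketch {
  // Hypothetical stand-ins: stages are plain strings, and a counter
  // replaces posting a ResubmitFailedStages event.
  val failedStages = mutable.HashSet.empty[String]
  var resubmitCount = 0

  // Returns true when a resubmit was scheduled for this failure.
  def recordFailure(failedStage: String, mapStage: String): Boolean = {
    // `|` evaluates both sides, so both stages are always recorded.
    val added = failedStages.add(failedStage) | failedStages.add(mapStage)
    if (added) resubmitCount += 1
    added
  }
}
```

In the two-job scenario, `recordFailure("C", "B")` and then `recordFailure("D", "B")` each schedule a resubmit, because each call adds at least one new stage. Repeating the second call schedules nothing. The sketch never clears the set, whereas the real scheduler would when it processes the resubmit.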
[GitHub] spark pull request #15422: [SPARK-17850][Core]Add a flag to ignore corrupt f...
Github user mridulm commented on a diff in the pull request: https://github.com/apache/spark/pull/15422#discussion_r82932947 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala --- @@ -588,6 +588,12 @@ object SQLConf { .doubleConf .createWithDefault(0.05) + val IGNORE_CORRUPT_FILES = SQLConfigBuilder("spark.sql.files.ignoreCorruptFiles") +.doc("Whether to ignore corrupt files. If true, the Spark jobs will continue to run when " + + "encountering corrupt files and contents that have been read will still be returned.") +.booleanConf +.createWithDefault(false) + --- End diff -- Curious why we are duplicating the parameter in the sql namespace. Won't `spark.files.ignoreCorruptFiles` do?
[GitHub] spark pull request #15422: [SPARK-17850][Core]Add a flag to ignore corrupt f...
Github user mridulm commented on a diff in the pull request: https://github.com/apache/spark/pull/15422#discussion_r82933077 --- Diff: core/src/main/scala/org/apache/spark/internal/config/package.scala --- @@ -170,4 +170,9 @@ package object config { .doc("Port to use for the block managed on the driver.") .fallbackConf(BLOCK_MANAGER_PORT) + private[spark] val IGNORE_CORRUPT_FILES = ConfigBuilder("spark.files.ignoreCorruptFiles") +.doc("Whether to ignore corrupt files. If true, the Spark jobs will continue to run when " + + "encountering corrupt files and contents that have been read will still be returned.") +.booleanConf +.createWithDefault(false) --- End diff -- So either way we will have a behavioral change, in either NewHadoopRDD or HadoopRDD. IMO that is fine, given that we are standardizing the behavior and this was a corner case anyway. Setting the default to false makes sense.
[GitHub] spark pull request #15422: [SPARK-17850][Core]Add a flag to ignore corrupt f...
Github user mridulm commented on a diff in the pull request: https://github.com/apache/spark/pull/15422#discussion_r82932992 --- Diff: core/src/main/scala/org/apache/spark/rdd/NewHadoopRDD.scala --- @@ -179,7 +183,16 @@ class NewHadoopRDD[K, V]( override def hasNext: Boolean = { if (!finished && !havePair) { - finished = !reader.nextKeyValue + try { +finished = !reader.nextKeyValue + } catch { +case e: IOException => + if (ignoreCorruptFiles) { +finished = true + } else { +throw e + } + } --- End diff -- Thanks for changing this too ! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15422: [SPARK-17850][Core]Add a flag to ignore corrupt f...
Github user mridulm commented on a diff in the pull request: https://github.com/apache/spark/pull/15422#discussion_r82932645 --- Diff: core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala --- @@ -253,8 +256,12 @@ class HadoopRDD[K, V]( try { finished = !reader.next(key, value) } catch { - case eof: EOFException => -finished = true + case e: IOException => +if (ignoreCorruptFiles) { + finished = true +} else { + throw e +} --- End diff -- nit: `case e: IOException if ignoreCorruptFiles =>` would have been more concise.
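The nit above can be sketched as follows. This is a self-contained illustration rather than the actual `HadoopRDD` code: `GuardSketch`, `advance` and `readNext` are hypothetical stand-ins for the iterator body and `reader.next(key, value)`, and only the `ignoreCorruptFiles` flag name comes from the diff. With a pattern guard, an `IOException` that does not satisfy the guard simply fails to match and propagates, so the explicit `else { throw e }` branch is unnecessary.

```scala
import java.io.IOException

object GuardSketch {
  // Hypothetical stand-in for the iterator step: `readNext` returns true
  // while records remain; the result is the new value of `finished`.
  def advance(readNext: () => Boolean, ignoreCorruptFiles: Boolean): Boolean = {
    try {
      !readNext()
    } catch {
      // Matches only when the flag is set; otherwise the exception
      // is not caught here and propagates to the caller unchanged.
      case e: IOException if ignoreCorruptFiles => true
    }
  }
}
```

Calling `advance` with a reader that throws `IOException` returns `true` (finished) when `ignoreCorruptFiles` is set and rethrows otherwise, which matches the behavior of the longer `if`/`else` form in the diff.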
[GitHub] spark issue #15444: [SPARK-17870][MLLIB][ML]Change statistic to pValue for S...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15444 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/66787/ Test PASSed.
[GitHub] spark issue #15444: [SPARK-17870][MLLIB][ML]Change statistic to pValue for S...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15444 Merged build finished. Test PASSed.
[GitHub] spark issue #15444: [SPARK-17870][MLLIB][ML]Change statistic to pValue for S...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15444 **[Test build #66787 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66787/consoleFull)** for PR 15444 at commit [`b98ccdf`](https://github.com/apache/spark/commit/b98ccdfd696cb89cb4793a140c87c498ce5c3086). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #9766: [SPARK-11775][PYSPARK][SQL] Allow PySpark to register Jav...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/9766 Merged build finished. Test FAILed.
[GitHub] spark issue #9766: [SPARK-11775][PYSPARK][SQL] Allow PySpark to register Jav...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/9766 **[Test build #66793 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66793/consoleFull)** for PR 9766 at commit [`dc6d5f9`](https://github.com/apache/spark/commit/dc6d5f927d93566ee1c3b935db864f2e517bc7e0). * This patch **fails to build**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #9766: [SPARK-11775][PYSPARK][SQL] Allow PySpark to register Jav...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/9766 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/66793/ Test FAILed.
[GitHub] spark issue #9766: [SPARK-11775][PYSPARK][SQL] Allow PySpark to register Jav...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/9766 **[Test build #66793 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66793/consoleFull)** for PR 9766 at commit [`dc6d5f9`](https://github.com/apache/spark/commit/dc6d5f927d93566ee1c3b935db864f2e517bc7e0).
[GitHub] spark issue #15443: [SPARK-17881] [SQL] Aggregation function for generating ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15443 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/66782/ Test PASSed.
[GitHub] spark issue #15443: [SPARK-17881] [SQL] Aggregation function for generating ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15443 Merged build finished. Test PASSed.
[GitHub] spark issue #15443: [SPARK-17881] [SQL] Aggregation function for generating ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15443 **[Test build #66782 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66782/consoleFull)** for PR 15443 at commit [`a843920`](https://github.com/apache/spark/commit/a843920983914de7efd21608b8f0e39c70b210d7). * This patch passes all tests. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `case class StringHistogram(` * ` case class StringHistogramInfo(` * ` class StringHistogramInfoSerializer `
[GitHub] spark issue #15375: [SPARK-17790][SPARKR] Support for parallelizing R data.f...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15375 **[Test build #66792 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66792/consoleFull)** for PR 15375 at commit [`836e874`](https://github.com/apache/spark/commit/836e8745c346c59f78958e10aec1c6f9537242b9).
[GitHub] spark pull request #15398: [SPARK-17647][SQL] Fix backslash escaping in 'LIK...
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/15398#discussion_r82931395

    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/StringUtils.scala ---
    @@ -25,26 +25,25 @@ object StringUtils {
       // replace the _ with .{1} exactly match 1 time of any character
       // replace the % with .*, match 0 or more times with any character
    -  def escapeLikeRegex(v: String): String = {
    -    if (!v.isEmpty) {
    -      "(?s)" + (' ' +: v.init).zip(v).flatMap {
    -        case (prev, '\\') => ""
    -        case ('\\', c) =>
    -          c match {
    -            case '_' => "_"
    -            case '%' => "%"
    -            case _ => Pattern.quote("\\" + c)
    -          }
    -        case (prev, c) =>
    -          c match {
    -            case '_' => "."
    -            case '%' => ".*"
    -            case _ => Pattern.quote(Character.toString(c))
    -          }
    -      }.mkString
    -    } else {
    -      v
    +  def escapeLikeRegex(str: String): String = {
    +    val builder = new StringBuilder()
    +    var escaping = false
    +    for (next <- str) {
    +      if (escaping) {
    +        builder ++= Pattern.quote(Character.toString(next))
    --- End diff --

    `\Q\\E\Qa\E` is correct. But doesn't it become `\Qa\E` in this change?

    For `\\a`, the leading `\\` goes to the next branch and enables `escaping`. Then the next char `a` is quoted here, so the result is `\Qa\E`.

    BTW, before this change, it was `\Q\a\E`.
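The escaping question raised in this comment can be illustrated with a small state machine. This is a minimal Python sketch, not the code from the PR: `like_to_regex` is a hypothetical name, `re.escape` stands in for Java's `Pattern.quote`, and the sketch assumes that a backslash before an ordinary character such as `a` should remain a literal backslash (the `\Q\\E\Qa\E` outcome the comment calls correct). Treating `\\` as one literal backslash and a trailing backslash as a literal are further assumptions.

```python
import re


def like_to_regex(pattern: str, escape: str = "\\") -> str:
    """Translate a SQL LIKE pattern into a regex string (illustrative sketch)."""
    out = []
    escaping = False
    for ch in pattern:
        if escaping:
            if ch in ("_", "%", escape):
                # Escaped wildcard or escaped escape char: match it literally.
                out.append(re.escape(ch))
            else:
                # Assumption: the escape char before an ordinary character is
                # itself literal, so both characters must appear in the input.
                out.append(re.escape(escape) + re.escape(ch))
            escaping = False
        elif ch == escape:
            escaping = True
        elif ch == "_":
            out.append(".")    # exactly one character
        elif ch == "%":
            out.append(".*")   # zero or more characters
        else:
            out.append(re.escape(ch))
    if escaping:
        # Assumption: a trailing escape char is matched literally.
        out.append(re.escape(escape))
    # (?s) lets "." match newlines, as in the Scala code quoted above.
    return "(?s)" + "".join(out)
```

With this sketch, the pattern `\a` only matches the two-character string backslash-`a`, whereas dropping the backslash (the `\Qa\E` outcome questioned above) would make it match plain `a`.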
[GitHub] spark issue #15446: [SPARK-17882][SparkR] Fix swallowed exception in RBacken...
Github user shivaram commented on the issue: https://github.com/apache/spark/pull/15446 cc @falaki Is this also a part of #15375 ?
[GitHub] spark issue #15446: [SPARK-17882][SparkR] Fix swallowed exception in RBacken...
Github user shivaram commented on the issue: https://github.com/apache/spark/pull/15446 Thanks @jrshust for the PR. Jenkins, ok to test
[GitHub] spark pull request #15335: [SPARK-17769][Core][Scheduler]Some FetchFailure r...
Github user squito commented on a diff in the pull request:

    https://github.com/apache/spark/pull/15335#discussion_r82931294

    --- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala ---
    @@ -1255,27 +1255,46 @@ class DAGScheduler(
                   s"longer running")
               }
     
    -          if (disallowStageRetryForTest) {
    -            abortStage(failedStage, "Fetch failure will not retry stage due to testing config",
    -              None)
    -          } else if (failedStage.failedOnFetchAndShouldAbort(task.stageAttemptId)) {
    -            abortStage(failedStage, s"$failedStage (${failedStage.name}) " +
    -              s"has failed the maximum allowable number of " +
    -              s"times: ${Stage.MAX_CONSECUTIVE_FETCH_FAILURES}. " +
    -              s"Most recent failure reason: ${failureMessage}", None)
    -          } else {
    -            if (failedStages.isEmpty) {
    -              // Don't schedule an event to resubmit failed stages if failed isn't empty, because
    -              // in that case the event will already have been scheduled.
    -              // TODO: Cancel running tasks in the stage
    -              logInfo(s"Resubmitting $mapStage (${mapStage.name}) and " +
    -                s"$failedStage (${failedStage.name}) due to fetch failure")
    -              messageScheduler.schedule(new Runnable {
    -                override def run(): Unit = eventProcessLoop.post(ResubmitFailedStages)
    -              }, DAGScheduler.RESUBMIT_TIMEOUT, TimeUnit.MILLISECONDS)
    +          val shouldAbortStage =
    +            failedStage.failedOnFetchAndShouldAbort(task.stageAttemptId) ||
    +            disallowStageRetryForTest
    +
    +          if (shouldAbortStage) {
    +            val abortMessage = if (disallowStageRetryForTest) {
    +              "Fetch failure will not retry stage due to testing config"
    +            } else {
    +              s"""$failedStage (${failedStage.name})
    +                 |has failed the maximum allowable number of
    +                 |times: ${Stage.MAX_CONSECUTIVE_FETCH_FAILURES}.
    +                 |Most recent failure reason: $failureMessage""".stripMargin.replaceAll("\n", " ")
                 }
    +            abortStage(failedStage, abortMessage, None)
    +          } else { // update failedStages and make sure a ResubmitFailedStages event is enqueued
    +            // TODO: Cancel running tasks in the failed stage -- cf. SPARK-17064
    +            val noResubmitEnqueued = !failedStages.contains(failedStage)
                 failedStages += failedStage
                 failedStages += mapStage
    +            if (noResubmitEnqueued) {
    +              // We expect one executor failure to trigger many FetchFailures in rapid succession,
    +              // but all of those task failures can typically be handled by a single resubmission of
    +              // the failed stage.  We avoid flooding the scheduler's event queue with resubmit
    +              // messages by checking whether a resubmit is already in the event queue for the
    +              // failed stage.  If there is already a resubmit enqueued for a different failed
    +              // stage, that event would also be sufficient to handle the current failed stage, but
    +              // producing a resubmit for each failed stage makes debugging and logging a little
    +              // simpler while not producing an overwhelming number of scheduler events.
    +              logInfo(
    +                s"Resubmitting $mapStage (${mapStage.name}) and " +
    +                s"$failedStage (${failedStage.name}) due to fetch failure"
    +              )
    +              messageScheduler.schedule(
    --- End diff --

    yeah probably a separate PR, sorry this was just an opportunity for me to rant :)

    And sorry if I worded it poorly, but I was not suggesting the one w/ "Periodically" as a better comment -- in fact I think it's a *bad* comment, just wanted to mention it was another description which used to be there long ago.

    This was my suggestion:

    ```
    If we get one fetch-failure, we often get more fetch failures across multiple executors.  We will get better parallelism when we resubmit the mapStage if we can resubmit when we know about as many of those failures as possible.  So this is a heuristic to add a small delay to see if we gather a few more failures before we resubmit.
    ```
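The bookkeeping in the new branch of this diff (enqueue one resubmit per failed stage, however many fetch failures arrive for it) can be modelled in a few lines. This is a toy Python sketch rather than Spark code: `FetchFailureTracker`, `on_fetch_failure`, and `on_resubmit` are hypothetical names, and the `schedule` callable stands in for `messageScheduler.schedule(...)`, with the delay left out so the behaviour stays deterministic.

```python
class FetchFailureTracker:
    """Toy model of the failedStages bookkeeping (hypothetical class)."""

    def __init__(self, schedule):
        # `schedule` is any callable that enqueues an event; in this sketch it
        # replaces the delayed messageScheduler.schedule(...) call.
        self.failed_stages = set()
        self._schedule = schedule

    def on_fetch_failure(self, failed_stage, map_stage):
        # Check *before* adding: if the failed stage is already recorded, a
        # resubmit event for it is assumed to be waiting in the queue.
        no_resubmit_enqueued = failed_stage not in self.failed_stages
        self.failed_stages.add(failed_stage)
        self.failed_stages.add(map_stage)
        if no_resubmit_enqueued:
            self._schedule(("ResubmitFailedStages", failed_stage))

    def on_resubmit(self):
        # One resubmit handles every stage gathered so far, then resets.
        stages = sorted(self.failed_stages)
        self.failed_stages.clear()
        return stages


events = []
tracker = FetchFailureTracker(events.append)
for _ in range(3):
    tracker.on_fetch_failure("reduce-1", "map-0")  # burst from one lost executor
tracker.on_fetch_failure("reduce-2", "map-0")      # a different failed stage
print(len(events))            # 2: one event per failed stage, not per failure
print(tracker.on_resubmit())  # ['map-0', 'reduce-1', 'reduce-2']
```

In the real scheduler the event is posted after a delay, which is what gives the heuristic described in the suggested comment time to gather more failures before resubmitting.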
[GitHub] spark issue #9766: [SPARK-11775][PYSPARK][SQL] Allow PySpark to register Jav...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/9766 Build finished. Test FAILed.
[GitHub] spark issue #9766: [SPARK-11775][PYSPARK][SQL] Allow PySpark to register Jav...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/9766 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/66791/ Test FAILed.
[GitHub] spark issue #15375: [SPARK-17790][SPARKR] Support for parallelizing R data.f...
Github user shivaram commented on the issue: https://github.com/apache/spark/pull/15375 Jenkins, retest this please
[GitHub] spark issue #9766: [SPARK-11775][PYSPARK][SQL] Allow PySpark to register Jav...
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/9766

    **[Test build #66791 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66791/consoleFull)** for PR 9766 at commit [`9de8c0e`](https://github.com/apache/spark/commit/9de8c0e7c0a2108b519c8adce7af5162578b04c9).
     * This patch **fails RAT tests**.
     * This patch **does not merge cleanly**.
     * This patch adds no public classes.
[GitHub] spark issue #15427: [SPARK-17866][SPARK-17867][SQL] Fix Dataset.dropduplicat...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15427 **[Test build #66790 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66790/consoleFull)** for PR 15427 at commit [`81339dc`](https://github.com/apache/spark/commit/81339dc429104633ee28cf078f643b5050564557).
[GitHub] spark issue #9766: [SPARK-11775][PYSPARK][SQL] Allow PySpark to register Jav...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/9766 **[Test build #66791 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66791/consoleFull)** for PR 9766 at commit [`9de8c0e`](https://github.com/apache/spark/commit/9de8c0e7c0a2108b519c8adce7af5162578b04c9).
[GitHub] spark issue #15295: [SPARK-17720][SQL] introduce static SQL conf
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/15295 Merging to master! Thanks!
[GitHub] spark pull request #15295: [SPARK-17720][SQL] introduce static SQL conf
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/15295
[GitHub] spark issue #15445: [SPARK-17817][PySpark][FOLLOWUP] PySpark RDD Repartition...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15445 **[Test build #66789 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66789/consoleFull)** for PR 15445 at commit [`be6d153`](https://github.com/apache/spark/commit/be6d1537e9bbd2cc2484e4d8da9d901b16725c97).
[GitHub] spark issue #15446: [SPARK-17882][SPARKR] Fix swallowed exception in RBacken...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15446 Can one of the admins verify this patch?
[GitHub] spark pull request #15446: [SPARK-17882][SPARKR] Fix swallowed exception in ...
GitHub user jrshust opened a pull request:

    https://github.com/apache/spark/pull/15446

    [SPARK-17882][SPARKR] Fix swallowed exception in RBackendHandler

    ## What changes were proposed in this pull request?
    Log the exception that is swallowed in handleMethodCall. This allows issues in invoked Java code to be debugged easily when using SparkR.

    ## How was this patch tested?
    Manual tests to verify the logged exception shows up.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/jrshust/spark rbackend-logging

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/15446.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #15446

commit 083f57a16c7153364f8686a28f24afa917e33219
Author: James Shuster
Date:   2016-10-12T03:19:11Z

    log exception object
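The pattern this PR describes (keep returning an error to the caller, but log the exception object so the stack trace is not lost) looks roughly as follows. This is a Python analogue offered only as a sketch; it is not the Scala `RBackendHandler` code, and `handle_method_call`, its parameters, and the logger name are hypothetical.

```python
import logging

logger = logging.getLogger("rbackend_sketch")  # hypothetical logger name


def handle_method_call(obj_id, method_name, func, *args):
    """Invoke func(*args); on failure, log the exception and report an error."""
    try:
        return ("ok", func(*args))
    except Exception as exc:
        # logger.exception records the message at ERROR level together with
        # the full traceback, so the failure is no longer silently swallowed.
        logger.exception("%s on %s failed", method_name, obj_id)
        # The caller still only receives a short message, as before.
        return ("error", str(exc))
```

A call such as `handle_method_call("obj1", "div", lambda a, b: a / b, 1, 0)` returns an `("error", ...)` tuple while the traceback goes to the log.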
[GitHub] spark issue #15295: [SPARK-17720][SQL] introduce static SQL conf
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/15295 LGTM
[GitHub] spark pull request #15389: [SPARK-17817][PySpark] PySpark RDD Repartitioning...
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/15389#discussion_r82930615

    --- Diff: python/pyspark/rdd.py ---
    @@ -2029,7 +2028,15 @@ def coalesce(self, numPartitions, shuffle=False):
             >>> sc.parallelize([1, 2, 3, 4, 5], 3).coalesce(1).glom().collect()
             [[1, 2, 3, 4, 5]]
             """
    -        jrdd = self._jrdd.coalesce(numPartitions, shuffle)
    +        if shuffle:
    +            # In Scala's repartition code, we will distribute elements evenly across output
    +            # partitions. However, the RDD from Python is serialized as a single binary data,
    +            # so the distribution fails and produces highly skewed partitions. We need to
    +            # convert it to a RDD of java object before repartitioning.
    +            data_java_rdd = self._to_java_object_rdd().coalesce(numPartitions, shuffle)
    --- End diff --

    @davies The followup is at #15445. Can you take a look? Thanks!
[GitHub] spark pull request #15445: [SPARK-17817][PySpark][FOLLOWUP] PySpark RDD Repa...
GitHub user viirya opened a pull request:

    https://github.com/apache/spark/pull/15445

    [SPARK-17817][PySpark][FOLLOWUP] PySpark RDD Repartitioning Results in Highly Skewed Partition Sizes

    ## What changes were proposed in this pull request?

    This change is a followup for #15389, which calls `_to_java_object_rdd()` to solve this issue. Because that call may be expensive, we can instead decrease the batch size to solve the issue.

    ## How was this patch tested?

    Jenkins tests.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/viirya/spark-1 repartition-batch-size

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/15445.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #15445

commit 60e2abd9616016dce8e5dc2faf5c75be8e07335f
Author: Liang-Chi Hsieh
Date:   2016-10-07T04:59:37Z

    Decrease the batch size for repartition.

commit be6d1537e9bbd2cc2484e4d8da9d901b16725c97
Author: Liang-Chi Hsieh
Date:   2016-10-12T03:08:38Z

    Merge remote-tracking branch 'upstream/master' into repartition-batch-size
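Why a smaller batch size can reduce the skew is easy to see in a pure-Python model. The sketch below makes simplifying assumptions: each serialized batch is treated as one opaque record on the JVM side, and records are dealt out to partitions in plain round-robin order, which is only an approximation of what Spark's repartition does. The helper names `batches` and `round_robin_sizes`, the element count, and the batch sizes are illustrative rather than taken from the PR.

```python
def batches(elements, batch_size):
    """Group elements into lists of at most batch_size (one list = one record)."""
    return [elements[i:i + batch_size] for i in range(0, len(elements), batch_size)]


def round_robin_sizes(records, num_partitions):
    """Deal records out round-robin; return the element count per partition."""
    sizes = [0] * num_partitions
    for i, record in enumerate(records):
        sizes[i % num_partitions] += len(record)
    return sizes


elements = list(range(2000))
# Large batches: only two records exist, so two of the four partitions stay empty.
print(round_robin_sizes(batches(elements, 1024), 4))  # [1024, 976, 0, 0]
# Small batches: 200 records spread evenly over the four partitions.
print(round_robin_sizes(batches(elements, 10), 4))    # [500, 500, 500, 500]
```

Converting to an RDD of Java objects, as #15389 does, corresponds to distributing individual elements rather than batches, while decreasing the batch size keeps batching but makes each record small enough that the distribution evens out.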
[GitHub] spark issue #12064: [SPARK-14272][ML] Evaluate GaussianMixtureModel with Log...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/12064 Merged build finished. Test FAILed.
[GitHub] spark issue #12064: [SPARK-14272][ML] Evaluate GaussianMixtureModel with Log...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/12064 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/66788/ Test FAILed.
[GitHub] spark issue #12064: [SPARK-14272][ML] Evaluate GaussianMixtureModel with Log...
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/12064

    **[Test build #66788 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66788/consoleFull)** for PR 12064 at commit [`cdd829a`](https://github.com/apache/spark/commit/cdd829aa56663c8bdb36c85c8599a99fb2fbf643).
     * This patch **fails to build**.
     * This patch merges cleanly.
     * This patch adds no public classes.
[GitHub] spark issue #12064: [SPARK-14272][ML] Evaluate GaussianMixtureModel with Log...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/12064 **[Test build #66788 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66788/consoleFull)** for PR 12064 at commit [`cdd829a`](https://github.com/apache/spark/commit/cdd829aa56663c8bdb36c85c8599a99fb2fbf643).
[GitHub] spark issue #15444: [SPARK-17870][MLLIB][ML]Change statistic to pValue for S...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15444 **[Test build #66787 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66787/consoleFull)** for PR 15444 at commit [`b98ccdf`](https://github.com/apache/spark/commit/b98ccdfd696cb89cb4793a140c87c498ce5c3086).
[GitHub] spark pull request #15389: [SPARK-17817][PySpark] PySpark RDD Repartitioning...
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/15389#discussion_r82929167

    --- Diff: python/pyspark/rdd.py ---
    @@ -2029,7 +2028,15 @@ def coalesce(self, numPartitions, shuffle=False):
             >>> sc.parallelize([1, 2, 3, 4, 5], 3).coalesce(1).glom().collect()
             [[1, 2, 3, 4, 5]]
             """
    -        jrdd = self._jrdd.coalesce(numPartitions, shuffle)
    +        if shuffle:
    +            # In Scala's repartition code, we will distribute elements evenly across output
    +            # partitions. However, the RDD from Python is serialized as a single binary data,
    +            # so the distribution fails and produces highly skewed partitions. We need to
    +            # convert it to a RDD of java object before repartitioning.
    +            data_java_rdd = self._to_java_object_rdd().coalesce(numPartitions, shuffle)
    --- End diff --

    @davies Thank you! I ran a simple benchmark as above with a decreased batch size, and I don't see an improvement in running time. I.e.,

        import time
        num_partitions = 2
        a = sc.parallelize(range(int(1e6)), 2)
        start = time.time()
        l = a.repartition(num_partitions).glom().map(len).collect()
        end = time.time()
        print(end - start)

    Before: 419.447577953
    _to_java_object_rdd(): 421.916361094
    decreasing the batch size: 423.712255955

    Maybe it depends on how expensive converting to Java objects actually is, case by case. Is it generally faster than `_to_java_object_rdd()`? I will open a followup for this change.
[GitHub] spark issue #15431: [SPARK-15153] [ML] [SparkR] Fix SparkR spark.naiveBayes ...
Github user yanboliang commented on the issue: https://github.com/apache/spark/pull/15431 @jkbradley I agree it's not necessary to get in branch-2.0, since it requires a new public API. Thanks.
[GitHub] spark issue #15434: [SPARK-17873][SQL] ALTER TABLE RENAME TO should allow us...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15434 Merged build finished. Test PASSed.
[GitHub] spark issue #15434: [SPARK-17873][SQL] ALTER TABLE RENAME TO should allow us...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15434 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/66778/ Test PASSed.
[GitHub] spark issue #15434: [SPARK-17873][SQL] ALTER TABLE RENAME TO should allow us...
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/15434

    **[Test build #66778 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66778/consoleFull)** for PR 15434 at commit [`65c1885`](https://github.com/apache/spark/commit/65c1885818e4b712c2132e7e97e0b96ceb3f6dd7).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.
[GitHub] spark pull request #14847: [SPARK-17254][SQL] Filter can stop when the condi...
Github user viirya closed the pull request at: https://github.com/apache/spark/pull/14847
[GitHub] spark issue #14847: [SPARK-17254][SQL] Filter can stop when the condition is...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/14847 @rxin Thanks for the recommendation! Let me close it now and work on it.
[GitHub] spark issue #15295: [SPARK-17720][SQL] introduce static SQL conf
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15295 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/66777/ Test PASSed.
[GitHub] spark issue #15295: [SPARK-17720][SQL] introduce static SQL conf
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15295 Merged build finished. Test PASSed.
[GitHub] spark issue #15295: [SPARK-17720][SQL] introduce static SQL conf
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15295 **[Test build #66777 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66777/consoleFull)** for PR 15295 at commit [`595b220`](https://github.com/apache/spark/commit/595b22097dba8716545cd405fa36448065ce779d). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #15172: [SPARK-13331] AES support for over-the-wire encryption
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15172 **[Test build #66786 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66786/consoleFull)** for PR 15172 at commit [`46b52e6`](https://github.com/apache/spark/commit/46b52e63918376dcf5dde0359fdfe1efa2456dfd).
[GitHub] spark issue #15444: [SPARK-17870][MLLIB][ML]Change statistic to pValue for S...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15444 **[Test build #66783 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66783/consoleFull)** for PR 15444 at commit [`59ee17d`](https://github.com/apache/spark/commit/59ee17df3b46996bcf62f427c21d0f89b6ced204). * This patch **fails PySpark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #15444: [SPARK-17870][MLLIB][ML]Change statistic to pValue for S...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15444 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/66783/ Test FAILed.