[GitHub] spark issue #19647: [SPARK-22211][SQL] Remove incorrect FOJ limit pushdown
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19647 **[Test build #83432 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83432/testReport)** for PR 19647 at commit [`0e0123b`](https://github.com/apache/spark/commit/0e0123b0887aa33c382a63ee7843be5f64d38b41). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19647: [SPARK-22211][SQL] Remove incorrect FOJ limit pushdown
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/19647 retest this please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19650: [SPARK-22254][core] Fix the arrayMax in BufferHol...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/19650 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19650: [SPARK-22254][core] Fix the arrayMax in BufferHolder
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/19650 Thanks! Merged to master. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19651: [SPARK-20682][SPARK-15474][SPARK-21791] Add new ORCFileF...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19651 **[Test build #83431 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83431/testReport)** for PR 19651 at commit [`fdde274`](https://github.com/apache/spark/commit/fdde27416fe54036afbd9b809a363e7871df67cf). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19651: [SPARK-20682][SPARK-15474][SPARK-21791] Add new ORCFileF...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/19651 Retest this please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19218: [SPARK-21786][SQL] The 'spark.sql.parquet.compression.co...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19218 Can one of the admins verify this patch? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19653: [SPARK-22418][SQL][TEST] Add test cases for NULL ...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/19653 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19653: [SPARK-22418][SQL][TEST] Add test cases for NULL Handlin...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/19653 Thanks! Merged to master. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19647: [SPARK-22211][SQL] Remove incorrect FOJ limit pushdown
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19647 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83430/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19647: [SPARK-22211][SQL] Remove incorrect FOJ limit pushdown
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19647 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19647: [SPARK-22211][SQL] Remove incorrect FOJ limit pushdown
Github user henryr commented on the issue: https://github.com/apache/spark/pull/19647 retest this please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19468: [SPARK-18278] [Scheduler] Spark on Kubernetes - Basic Sc...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19468 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19468: [SPARK-18278] [Scheduler] Spark on Kubernetes - Basic Sc...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19468 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83425/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19468: [SPARK-18278] [Scheduler] Spark on Kubernetes - Basic Sc...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19468 **[Test build #83425 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83425/testReport)** for PR 19468 at commit [`6cf4ed7`](https://github.com/apache/spark/commit/6cf4ed7eec3f8a1798d260622ab5641db92ab13d). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19582: [SPARK-20644][core] Initial ground work for kvstore UI b...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19582 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83424/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19582: [SPARK-20644][core] Initial ground work for kvstore UI b...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19582 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19582: [SPARK-20644][core] Initial ground work for kvstore UI b...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19582 **[Test build #83424 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83424/testReport)** for PR 19582 at commit [`537c7b4`](https://github.com/apache/spark/commit/537c7b4bf27e1392fd4868db688f7d2f48baa3a5). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19656: [SPARK-22445][SQL] move CodegenContext.copyResult to Cod...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/19656 A rough glance looks good. Will review carefully after solving the build. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19642: [SPARK-22410][SQL] Remove unnecessary output from BatchE...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/19642 About why `ExtractPythonUDFs` applies on physical plans, I think it may partly because the wrapper of Python function `PythonFunction` also encapsulates core concepts like `Broadcast` and `Accumulator`, it is out of the scope of catalyst APIs. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19642: [SPARK-22410][SQL] Remove unnecessary output from BatchE...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/19642 Another concern is, by seeing python udfs as normal expressions without specific operator, we can apply necessary optimization such as CollapseProject. If we extract python udfs earlier in logical plan, we might have multiple python runners that were collapsed originally now. To deal with it, we may need to add specific optimization rule for collapsing python runners. This adds more complexity, IMHO. We should consider if it's worth. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19656: [SPARK-22445][SQL] move CodegenContext.copyResult to Cod...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19656 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83429/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19656: [SPARK-22445][SQL] move CodegenContext.copyResult to Cod...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19656 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19656: [SPARK-22445][SQL] move CodegenContext.copyResult to Cod...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19656 **[Test build #83429 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83429/testReport)** for PR 19656 at commit [`94beff0`](https://github.com/apache/spark/commit/94beff0e229da4c5dd3cb3ddb677e5fe6e42499c). * This patch **fails to build**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19656: [SPARK-22445][SQL] move CodegenContext.copyResult to Cod...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19656 **[Test build #83429 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83429/testReport)** for PR 19656 at commit [`94beff0`](https://github.com/apache/spark/commit/94beff0e229da4c5dd3cb3ddb677e5fe6e42499c). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19656: [SPARK-22445][SQL] move CodegenContext.copyResult to Cod...
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/19656 Jenkins, retest this please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19638: [SPARK-22422][ML] Add Adjusted R2 to RegressionMe...
Github user tengpeng commented on a diff in the pull request: https://github.com/apache/spark/pull/19638#discussion_r148919520 --- Diff: mllib/src/test/scala/org/apache/spark/ml/regression/LinearRegressionSuite.scala --- @@ -764,13 +764,17 @@ class LinearRegressionSuite (Intercept) 6.3022157 0.00186003388 <2e-16 *** V2 4.6982442 0.00118053980 <2e-16 *** V3 7.1994344 0.00090447961 <2e-16 *** + + # R code for r2adj + summary(lm(X1 ~ ., data = part_0)) --- End diff -- The existing test "Linear regression model training summary" does not test without intercept case. If there is a "Linear regression model training summary wo intercept", I am glad to add in it. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19623: [SPARK-22078][SQL] clarify exception behaviors for all d...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19623 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83428/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19623: [SPARK-22078][SQL] clarify exception behaviors for all d...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19623 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19651: [SPARK-20682][SPARK-15474][SPARK-21791] Add new ORCFileF...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19651 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19651: [SPARK-20682][SPARK-15474][SPARK-21791] Add new ORCFileF...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19651 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83427/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19651: [SPARK-20682][SPARK-15474][SPARK-21791] Add new ORCFileF...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/19651 Retest this please. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19623: [SPARK-22078][SQL] clarify exception behaviors fo...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19623#discussion_r148918617 --- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/ReadSupport.java --- @@ -30,6 +30,9 @@ /** * Creates a {@link DataSourceV2Reader} to scan the data from this data source. * + * If this method fails (by throwing an exception), the action would fail and no Spark job was --- End diff -- no exception is recommended here, you can throw any exception and Spark will fail your job. This document just describes what will happen if an exception is thrown, but no exception is expected actually. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19623: [SPARK-22078][SQL] clarify exception behaviors fo...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19623#discussion_r148918542 --- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/writer/DataSourceV2Writer.java --- @@ -50,28 +53,34 @@ /** * Creates a writer factory which will be serialized and sent to executors. + * + * If this method fails (by throwing an exception), the action would fail and no Spark job was + * submitted. */ DataWriterFactory createWriterFactory(); /** * Commits this writing job with a list of commit messages. The commit messages are collected from - * successful data writers and are produced by {@link DataWriter#commit()}. If this method - * fails(throw exception), this writing job is considered to be failed, and - * {@link #abort(WriterCommitMessage[])} will be called. The written data should only be visible - * to data source readers if this method succeeds. + * successful data writers and are produced by {@link DataWriter#commit()}. + * + * If this method fails (by throwing an exception), this writing job is considered to to have been + * failed, and {@link #abort(WriterCommitMessage[])} would be called. The state of the destination + * is undefined and @{@link #abort(WriterCommitMessage[])} may not be able to deal with it. * * Note that, one partition may have multiple committed data writers because of speculative tasks. * Spark will pick the first successful one and get its commit message. Implementations should be --- End diff -- it's really out of the scope of this PR... This PR is adding documents to describe exception behavior, please send a new PR to update the document for the commit stuff and we can move our discussion there. Thanks! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19653: [SPARK-22418][SQL][TEST] Add test cases for NULL Handlin...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19653 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83421/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19653: [SPARK-22418][SQL][TEST] Add test cases for NULL Handlin...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19653 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19653: [SPARK-22418][SQL][TEST] Add test cases for NULL Handlin...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19653 **[Test build #83421 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83421/testReport)** for PR 19653 at commit [`92308a4`](https://github.com/apache/spark/commit/92308a4341849258caf549d1bcbeabd9002d3ead). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19583: [WIP][SPARK-22339] [CORE] [NETWORK-SHUFFLE] Push ...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19583#discussion_r148917664 --- Diff: core/src/main/scala/org/apache/spark/HeartbeatReceiver.scala --- @@ -51,7 +51,26 @@ private case class ExecutorRegistered(executorId: String) private case class ExecutorRemoved(executorId: String) -private[spark] case class HeartbeatResponse(reregisterBlockManager: Boolean) +private[spark] case class UpdatedEpoch(epoch: Long) + +private[spark] object HeartbeatResponse { + def apply(reregisterBlockManager: Boolean, +updatedEpoch: Option[Long] = None): HeartbeatResponse = + updatedEpoch.fold[HeartbeatResponse](BasicHeartbeatResponse(reregisterBlockManager)) { + epoch => HeartbeatResponseWithEpoch(reregisterBlockManager, Some(epoch)) +} +} + +private[spark] sealed trait HeartbeatResponse { + def reregisterBlockManager: Boolean + def updatedEpoch: Option[Long] = None +} + +private[spark] case class BasicHeartbeatResponse(reregisterBlockManager: Boolean) + extends HeartbeatResponse +private[spark] case class HeartbeatResponseWithEpoch(reregisterBlockManager: Boolean, + override val updatedEpoch: Option[Long]) + extends HeartbeatResponse --- End diff -- seems not a big saving, I think we can even always include the epoch in the heartbeat response. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19656: [SPARK-22445][SQL] move CodegenContext.copyResult to Cod...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19656 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83426/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19656: [SPARK-22445][SQL] move CodegenContext.copyResult to Cod...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19656 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19656: [SPARK-22445][SQL] move CodegenContext.copyResult...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19656#discussion_r148917096 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/WholeStageCodegenExec.scala --- @@ -213,19 +213,32 @@ trait CodegenSupport extends SparkPlan { } /** - * For optimization to suppress shouldStop() in a loop of WholeStageCodegen. - * Returning true means we need to insert shouldStop() into the loop producing rows, if any. + * Whether or not the result rows of this operator should be copied before putting into a buffer. + * + * If any operator inside WholeStageCodegen generate multiple rows from a single row (for + * example, Join), this should be true. + * + * If an operator starts a new pipeline, this should be false. */ - def isShouldStopRequired: Boolean = { -return shouldStopRequired && (this.parent == null || this.parent.isShouldStopRequired) + protected def needCopyResult: Boolean = { +if (children.isEmpty) { + false +} else if (children.length == 1) { + children.head.asInstanceOf[CodegenSupport].needStopCheck +} else { + throw new UnsupportedOperationException +} } /** - * Set to false if this plan consumes all rows produced by children but doesn't output row - * to buffer by calling append(), so the children don't require shouldStop() - * in the loop of producing rows. + * Whether or not the children of this operator should generate a stop check when consuming input + * rows. This is used to suppress shouldStop() in a loop of WholeStageCodegen. + * + * This should be false if an operator starts a new pipeline, which means it consumes all rows + * produced by children but doesn't output row to buffer by calling append(), so the children + * don't require shouldStop() in the loop of producing rows. */ - protected def shouldStopRequired: Boolean = true + protected def needStopCheck: Boolean = parent.needStopCheck --- End diff -- it's unrelated but a simple clean up: we only need one method instead of 2. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19656: [SPARK-22445][SQL] move CodegenContext.copyResult to Cod...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/19656 @juliuszsompolski @rednaxelafx @kiszk @viirya @gatorsmile --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19656: [SPARK-22445][SQL] move CodegenContext.copyResult...
GitHub user cloud-fan opened a pull request: https://github.com/apache/spark/pull/19656 [SPARK-22445][SQL] move CodegenContext.copyResult to CodegenSupport ## What changes were proposed in this pull request? `CodegenContext.copyResult` is kind of a global status for whole stage codegen. But the tricky part is, it is only used to transfer an information from child to parent when calling the `consume` chain. We have to be super careful in `produce`/`consume`, to set it to true when producing multiple result rows, and set it to false in operators that start new pipeline(like sort). This PR moves the `copyResult` to `CodegenSupport`, and call it at `WholeStageCodegenExec`. This is much easier to reason about. ## How was this patch tested? existing tests You can merge this pull request into a Git repository by running: $ git pull https://github.com/cloud-fan/spark whole-sage Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19656.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #19656 commit 94beff0e229da4c5dd3cb3ddb677e5fe6e42499c Author: Wenchen Fan Date: 2017-11-04T00:14:22Z move CodegenContext.copyResult to CodegenSupport --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19468: [SPARK-18278] [Scheduler] Spark on Kubernetes - Basic Sc...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19468 **[Test build #83425 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83425/testReport)** for PR 19468 at commit [`6cf4ed7`](https://github.com/apache/spark/commit/6cf4ed7eec3f8a1798d260622ab5641db92ab13d). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19468: [SPARK-18278] [Scheduler] Spark on Kubernetes - B...
Github user mccheah commented on a diff in the pull request: https://github.com/apache/spark/pull/19468#discussion_r148916581 --- Diff: core/src/main/scala/org/apache/spark/scheduler/cluster/SchedulerBackendUtils.scala --- @@ -0,0 +1,47 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.spark.scheduler.cluster + +import org.apache.spark.SparkConf +import org.apache.spark.internal.config.{DYN_ALLOCATION_MAX_EXECUTORS, DYN_ALLOCATION_MIN_EXECUTORS, EXECUTOR_INSTANCES} +import org.apache.spark.util.Utils + +private[spark] object SchedulerBackendUtils { + val DEFAULT_NUMBER_EXECUTORS = 2 --- End diff -- Done --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19468: [SPARK-18278] [Scheduler] Spark on Kubernetes - B...
Github user mccheah commented on a diff in the pull request: https://github.com/apache/spark/pull/19468#discussion_r148916566 --- Diff: resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodFactory.scala --- @@ -0,0 +1,227 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.spark.scheduler.cluster.k8s + +import scala.collection.JavaConverters._ + +import io.fabric8.kubernetes.api.model._ + +import org.apache.spark.{SparkConf, SparkException} +import org.apache.spark.deploy.k8s.ConfigurationUtils +import org.apache.spark.deploy.k8s.config._ +import org.apache.spark.deploy.k8s.constants._ +import org.apache.spark.util.Utils + +/** + * Configures executor pods. Construct one of these with a SparkConf to set up properties that are + * common across all executors. Then, pass in dynamic parameters into createExecutorPod. + */ +private[spark] trait ExecutorPodFactory { + def createExecutorPod( + executorId: String, + applicationId: String, + driverUrl: String, + executorEnvs: Seq[(String, String)], + driverPod: Pod, + nodeToLocalTaskCount: Map[String, Int]): Pod +} + +private[spark] class ExecutorPodFactoryImpl(sparkConf: SparkConf) + extends ExecutorPodFactory { + + import ExecutorPodFactoryImpl._ + + private val executorExtraClasspath = sparkConf.get( +org.apache.spark.internal.config.EXECUTOR_CLASS_PATH) + + private val executorLabels = ConfigurationUtils.parsePrefixedKeyValuePairs( +sparkConf, +KUBERNETES_EXECUTOR_LABEL_PREFIX, +"executor label") + require( +!executorLabels.contains(SPARK_APP_ID_LABEL), +s"Custom executor labels cannot contain $SPARK_APP_ID_LABEL as it is reserved for Spark.") + require( +!executorLabels.contains(SPARK_EXECUTOR_ID_LABEL), +s"Custom executor labels cannot contain $SPARK_EXECUTOR_ID_LABEL as it is reserved for" + + s" Spark.") + + private val executorAnnotations = +ConfigurationUtils.parsePrefixedKeyValuePairs( + sparkConf, + KUBERNETES_EXECUTOR_ANNOTATION_PREFIX, + "executor annotation") + private val nodeSelector = +ConfigurationUtils.parsePrefixedKeyValuePairs( + sparkConf, + KUBERNETES_NODE_SELECTOR_PREFIX, + "node selector") + + private val executorDockerImage = sparkConf.get(EXECUTOR_DOCKER_IMAGE) + private val dockerImagePullPolicy = sparkConf.get(DOCKER_IMAGE_PULL_POLICY) + private val executorPort = sparkConf.getInt("spark.executor.port", DEFAULT_STATIC_PORT) + private val blockmanagerPort = sparkConf +.getInt("spark.blockmanager.port", DEFAULT_BLOCKMANAGER_PORT) + private val kubernetesDriverPodName = sparkConf +.get(KUBERNETES_DRIVER_POD_NAME) +.getOrElse(throw new SparkException("Must specify the driver pod name")) + + private val executorPodNamePrefix = sparkConf.get(KUBERNETES_EXECUTOR_POD_NAME_PREFIX) + + private val executorMemoryMiB = sparkConf.get(org.apache.spark.internal.config.EXECUTOR_MEMORY) + private val executorMemoryString = sparkConf.get( --- End diff -- More context here: https://github.com/apache-spark-on-k8s/spark/pull/470 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19468: [SPARK-18278] [Scheduler] Spark on Kubernetes - B...
Github user mccheah commented on a diff in the pull request: https://github.com/apache/spark/pull/19468#discussion_r148916475 --- Diff: resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodFactory.scala --- @@ -0,0 +1,227 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.spark.scheduler.cluster.k8s + +import scala.collection.JavaConverters._ + +import io.fabric8.kubernetes.api.model._ + +import org.apache.spark.{SparkConf, SparkException} +import org.apache.spark.deploy.k8s.ConfigurationUtils +import org.apache.spark.deploy.k8s.config._ +import org.apache.spark.deploy.k8s.constants._ +import org.apache.spark.util.Utils + +/** + * Configures executor pods. Construct one of these with a SparkConf to set up properties that are + * common across all executors. Then, pass in dynamic parameters into createExecutorPod. + */ +private[spark] trait ExecutorPodFactory { + def createExecutorPod( + executorId: String, + applicationId: String, + driverUrl: String, + executorEnvs: Seq[(String, String)], + driverPod: Pod, + nodeToLocalTaskCount: Map[String, Int]): Pod +} + +private[spark] class ExecutorPodFactoryImpl(sparkConf: SparkConf) + extends ExecutorPodFactory { + + import ExecutorPodFactoryImpl._ + + private val executorExtraClasspath = sparkConf.get( +org.apache.spark.internal.config.EXECUTOR_CLASS_PATH) + + private val executorLabels = ConfigurationUtils.parsePrefixedKeyValuePairs( +sparkConf, +KUBERNETES_EXECUTOR_LABEL_PREFIX, +"executor label") + require( --- End diff -- Done --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19468: [SPARK-18278] [Scheduler] Spark on Kubernetes - B...
Github user mccheah commented on a diff in the pull request: https://github.com/apache/spark/pull/19468#discussion_r148916344 --- Diff: resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodFactory.scala --- @@ -0,0 +1,227 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.spark.scheduler.cluster.k8s + +import scala.collection.JavaConverters._ + +import io.fabric8.kubernetes.api.model._ + +import org.apache.spark.{SparkConf, SparkException} +import org.apache.spark.deploy.k8s.ConfigurationUtils +import org.apache.spark.deploy.k8s.config._ +import org.apache.spark.deploy.k8s.constants._ +import org.apache.spark.util.Utils + +/** + * Configures executor pods. Construct one of these with a SparkConf to set up properties that are + * common across all executors. Then, pass in dynamic parameters into createExecutorPod. + */ +private[spark] trait ExecutorPodFactory { + def createExecutorPod( + executorId: String, + applicationId: String, + driverUrl: String, + executorEnvs: Seq[(String, String)], + driverPod: Pod, + nodeToLocalTaskCount: Map[String, Int]): Pod +} + +private[spark] class ExecutorPodFactoryImpl(sparkConf: SparkConf) + extends ExecutorPodFactory { + + import ExecutorPodFactoryImpl._ + + private val executorExtraClasspath = sparkConf.get( +org.apache.spark.internal.config.EXECUTOR_CLASS_PATH) + + private val executorLabels = ConfigurationUtils.parsePrefixedKeyValuePairs( +sparkConf, +KUBERNETES_EXECUTOR_LABEL_PREFIX, +"executor label") + require( +!executorLabels.contains(SPARK_APP_ID_LABEL), +s"Custom executor labels cannot contain $SPARK_APP_ID_LABEL as it is reserved for Spark.") + require( +!executorLabels.contains(SPARK_EXECUTOR_ID_LABEL), +s"Custom executor labels cannot contain $SPARK_EXECUTOR_ID_LABEL as it is reserved for" + + s" Spark.") + + private val executorAnnotations = +ConfigurationUtils.parsePrefixedKeyValuePairs( + sparkConf, + KUBERNETES_EXECUTOR_ANNOTATION_PREFIX, + "executor annotation") + private val nodeSelector = +ConfigurationUtils.parsePrefixedKeyValuePairs( + sparkConf, + KUBERNETES_NODE_SELECTOR_PREFIX, + "node selector") + + private val executorDockerImage = sparkConf.get(EXECUTOR_DOCKER_IMAGE) + private val dockerImagePullPolicy = sparkConf.get(DOCKER_IMAGE_PULL_POLICY) + private val executorPort = sparkConf.getInt("spark.executor.port", DEFAULT_STATIC_PORT) + private val blockmanagerPort = sparkConf +.getInt("spark.blockmanager.port", DEFAULT_BLOCKMANAGER_PORT) + private val kubernetesDriverPodName = sparkConf +.get(KUBERNETES_DRIVER_POD_NAME) +.getOrElse(throw new SparkException("Must specify the driver pod name")) + + private val executorPodNamePrefix = sparkConf.get(KUBERNETES_EXECUTOR_POD_NAME_PREFIX) + + private val executorMemoryMiB = sparkConf.get(org.apache.spark.internal.config.EXECUTOR_MEMORY) + private val executorMemoryString = sparkConf.get( --- End diff -- Kubernetes will treat MiB differently from MB I believe. The JVM's formatting of memory strings is different from what Kubernetes expects in its API. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19468: [SPARK-18278] [Scheduler] Spark on Kubernetes - B...
Github user mccheah commented on a diff in the pull request: https://github.com/apache/spark/pull/19468#discussion_r148916191 --- Diff: resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/SparkKubernetesClientFactory.scala --- @@ -0,0 +1,103 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.spark.deploy.k8s + +import java.io.File + +import com.google.common.base.Charsets +import com.google.common.io.Files +import io.fabric8.kubernetes.client.{Config, ConfigBuilder, DefaultKubernetesClient, KubernetesClient} --- End diff -- @jiangxb1987 Is this the convention across the codebase? I don't think most IDEs will preserve that formatting. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19655: [SPARK-22444][core] Spark History Server missing /enviro...
Github user vanzin commented on the issue: https://github.com/apache/spark/pull/19655 @ambud please close this, we don't add features to old releases (only bugs fixes). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19655: [SPARK-22444][core] Spark History Server missing /enviro...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19655 Can one of the admins verify this patch? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19655: [SPARK-22444][core] Spark History Server missing ...
GitHub user ambud opened a pull request: https://github.com/apache/spark/pull/19655 [SPARK-22444][core] Spark History Server missing /environment endpoint/api ## What changes were proposed in this pull request? Spark History Server REST API is missing the /environment endpoint. This endpoint is otherwise available in Spark 2.x however it's missing from 1.6.x. Since the environment endpoint provides programmatic access to critical launch parameters it would be great to have this back ported. This feature was contributed under: https://issues.apache.org/jira/browse/SPARK-16122 ## How was this patch tested? Will update this shortly! (Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests) (If this patch involves UI changes, please attach a screenshot; otherwise, remove this) Please review http://spark.apache.org/contributing.html before opening a pull request. You can merge this pull request into a Git repository by running: $ git pull https://github.com/ambud/spark branch-1.6 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19655.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #19655 commit a0d549dd566d4683bd44fd162e776fa340f87428 Author: ambud Date: 2017-11-03T23:40:26Z Adding environment api SPARK-22444 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19582: [SPARK-20644][core] Initial ground work for kvstore UI b...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19582 **[Test build #83424 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83424/testReport)** for PR 19582 at commit [`537c7b4`](https://github.com/apache/spark/commit/537c7b4bf27e1392fd4868db688f7d2f48baa3a5). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19582: [SPARK-20644][core] Initial ground work for kvstore UI b...
Github user vanzin commented on the issue: https://github.com/apache/spark/pull/19582 retest this please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19582: [SPARK-20644][core] Initial ground work for kvsto...
Github user vanzin commented on a diff in the pull request: https://github.com/apache/spark/pull/19582#discussion_r148913864 --- Diff: core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala --- @@ -486,8 +580,21 @@ private[history] class FsHistoryProvider(conf: SparkConf, clock: Clock) bus.addListener(listener) replay(fileStatus, isApplicationCompleted(fileStatus), bus, eventsFilter) -listener.applicationInfo.foreach(addListing) -listing.write(LogInfo(logPath.toString(), fileStatus.getLen())) +listener.applicationInfo.foreach { app => + // Invalidate the existing UI for the reloaded app attempt, if any. Note that this does + // not remove the UI from the active list; that has to be done in onUIDetached, so that + // cleanup of files can be done in a thread-safe manner. It does mean the UI will remain + // in memory for longer than it should. --- End diff -- Done. Also updated a bunch of other stale comments. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19651: [SPARK-20682][SPARK-15474][SPARK-21791] Add new ORCFileF...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19651 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19439: [SPARK-21866][ML][PySpark] Adding spark image rea...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/19439#discussion_r148909027 --- Diff: python/pyspark/ml/image.py --- @@ -0,0 +1,192 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +""" +.. attribute:: ImageSchema + +A singleton-like attribute of :class:`_ImageSchema` in this module. --- End diff -- Thanks for the thoughts! Keeping it as is sounds good to me. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19651: [SPARK-20682][SPARK-15474][SPARK-21791] Add new ORCFileF...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19651 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83423/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19439: [SPARK-21866][ML][PySpark] Adding spark image rea...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/19439#discussion_r148908923 --- Diff: mllib/src/main/scala/org/apache/spark/ml/image/ImageSchema.scala --- @@ -0,0 +1,236 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.image + +import java.awt.Color +import java.awt.color.ColorSpace +import java.io.ByteArrayInputStream +import javax.imageio.ImageIO + +import scala.collection.JavaConverters._ + +import org.apache.spark.annotation.{Experimental, Since} +import org.apache.spark.input.PortableDataStream +import org.apache.spark.sql.{DataFrame, Row, SparkSession} +import org.apache.spark.sql.types._ + +@Experimental +@Since("2.3.0") +object ImageSchema { + + val undefinedImageType = "Undefined" + + val imageFields: Array[String] = Array("origin", "height", "width", "nChannels", "mode", "data") + + val ocvTypes: Map[String, Int] = Map( +undefinedImageType -> -1, +"CV_8U" -> 0, "CV_8UC1" -> 0, "CV_8UC3" -> 16, "CV_8UC4" -> 24 + ) + + /** + * Used for conversion to python + */ + val _ocvTypes: java.util.Map[String, Int] = ocvTypes.asJava + + /** + * Schema for the image column: Row(String, Int, Int, Int, Int, Array[Byte]) + */ + val columnSchema = StructType( +StructField(imageFields(0), StringType, true) :: +StructField(imageFields(1), IntegerType, false) :: +StructField(imageFields(2), IntegerType, false) :: +StructField(imageFields(3), IntegerType, false) :: +// OpenCV-compatible type: CV_8UC3 in most cases +StructField(imageFields(4), IntegerType, false) :: +// Bytes in OpenCV-compatible order: row-wise BGR in most cases +StructField(imageFields(5), BinaryType, false) :: Nil) + + /** + * DataFrame with a single column of images named "image" (nullable) + */ + val imageSchema = StructType(StructField("image", columnSchema, true) :: Nil) + + /** + * :: Experimental :: + * Gets the origin of the image + * + * @return The origin of the image + */ + def getOrigin(row: Row): String = row.getString(0) + + /** + * :: Experimental :: + * Gets the height of the image + * + * @return The height of the image + */ + def getHeight(row: Row): Int = row.getInt(1) + + /** + * :: Experimental :: + * Gets the width of the image + * + * @return The width of the image + */ + def getWidth(row: Row): Int = row.getInt(2) + + /** + * :: Experimental :: + * Gets the number of channels in the image + * + * @return The number of channels in the image + */ + def getNChannels(row: Row): Int = row.getInt(3) + + /** + * :: Experimental :: + * Gets the OpenCV representation as an int + * + * @return The OpenCV representation as an int + */ + def getMode(row: Row): Int = row.getInt(4) + + /** + * :: Experimental :: + * Gets the image data + * + * @return The image data + */ + def getData(row: Row): Array[Byte] = row.getAs[Array[Byte]](5) + + /** + * Default values for the invalid image + * + * @param origin Origin of the invalid image + * @return Row with the default values + */ + private def invalidImageRow(origin: String): Row = +Row(Row(origin, -1, -1, -1, ocvTypes(undefinedImageType), Array.ofDim[Byte](0))) + + /** + * Convert the compressed image (jpeg, png, etc.) into OpenCV + * representation and store it in DataFrame Row + * + * @param origin Arbitrary string that identifies the image + * @param bytes Image bytes (for example, jpeg) + * @return DataFrame Row or None (if the decompression fails) + */ + private[spark] def decode(origin: String, bytes: Array[Byte]): Option[Row] = { + +
[GitHub] spark issue #19651: [SPARK-20682][SPARK-15474][SPARK-21791] Add new ORCFileF...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/19651 Retest this please. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19208: [SPARK-21087] [ML] CrossValidator, TrainValidationSplit ...
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/19208 Done with review. I mainly review CrossValidator since some comments will apply to TrainValidationSplit as well. Thanks for the PR! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19208: [SPARK-21087] [ML] CrossValidator, TrainValidatio...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/19208#discussion_r148885057 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala --- @@ -271,6 +303,20 @@ class CrossValidatorModel private[ml] ( @Since("1.6.0") object CrossValidatorModel extends MLReadable[CrossValidatorModel] { + private[CrossValidatorModel] def copySubModels(subModels: Option[Array[Array[Model[_) = { --- End diff -- style: state return value explicitly --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19208: [SPARK-21087] [ML] CrossValidator, TrainValidatio...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/19208#discussion_r148885486 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala --- @@ -271,6 +303,20 @@ class CrossValidatorModel private[ml] ( @Since("1.6.0") object CrossValidatorModel extends MLReadable[CrossValidatorModel] { + private[CrossValidatorModel] def copySubModels(subModels: Option[Array[Array[Model[_) = { +subModels.map { subModels => --- End diff -- Can this be simplified using map? ``` subModels.map(_.map(_.map(_.copy(...).asInstanceOf[...]))) ``` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19208: [SPARK-21087] [ML] CrossValidator, TrainValidatio...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/19208#discussion_r148886008 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala --- @@ -282,12 +328,40 @@ object CrossValidatorModel extends MLReadable[CrossValidatorModel] { ValidatorParams.validateParams(instance) +protected var shouldPersistSubModels: Boolean = if (instance.hasSubModels) true else false + +/** + * Extra options for CrossValidatorModelWriter, current support "persistSubModels". + * if sub models exsit, the default value for option "persistSubModels" is "true". + */ +@Since("2.3.0") +override def option(key: String, value: String): this.type = { + key.toLowerCase(Locale.ROOT) match { +case "persistsubmodels" => shouldPersistSubModels = value.toBoolean +case _ => throw new IllegalArgumentException( + s"Illegal option ${key} for CrossValidatorModelWriter") + } + this +} + override protected def saveImpl(path: String): Unit = { import org.json4s.JsonDSL._ - val extraMetadata = "avgMetrics" -> instance.avgMetrics.toSeq + val extraMetadata = ("avgMetrics" -> instance.avgMetrics.toSeq) ~ +("shouldPersistSubModels" -> shouldPersistSubModels) ValidatorParams.saveImpl(path, instance, sc, Some(extraMetadata)) val bestModelPath = new Path(path, "bestModel").toString instance.bestModel.asInstanceOf[MLWritable].save(bestModelPath) + if (shouldPersistSubModels) { +require(instance.hasSubModels, "Cannot get sub models to persist.") --- End diff -- This error message may be unclear. How about adding: "When persisting tuning models, you can only set persistSubModels to true if the tuning was done with collectSubModels set to true. To save the sub-models, try rerunning fitting with collectSubModels set to true." --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19208: [SPARK-21087] [ML] CrossValidator, TrainValidatio...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/19208#discussion_r148886190 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala --- @@ -282,12 +328,40 @@ object CrossValidatorModel extends MLReadable[CrossValidatorModel] { ValidatorParams.validateParams(instance) +protected var shouldPersistSubModels: Boolean = if (instance.hasSubModels) true else false + +/** + * Extra options for CrossValidatorModelWriter, current support "persistSubModels". + * if sub models exsit, the default value for option "persistSubModels" is "true". + */ +@Since("2.3.0") +override def option(key: String, value: String): this.type = { + key.toLowerCase(Locale.ROOT) match { +case "persistsubmodels" => shouldPersistSubModels = value.toBoolean +case _ => throw new IllegalArgumentException( + s"Illegal option ${key} for CrossValidatorModelWriter") + } + this +} + override protected def saveImpl(path: String): Unit = { import org.json4s.JsonDSL._ - val extraMetadata = "avgMetrics" -> instance.avgMetrics.toSeq + val extraMetadata = ("avgMetrics" -> instance.avgMetrics.toSeq) ~ +("shouldPersistSubModels" -> shouldPersistSubModels) ValidatorParams.saveImpl(path, instance, sc, Some(extraMetadata)) val bestModelPath = new Path(path, "bestModel").toString instance.bestModel.asInstanceOf[MLWritable].save(bestModelPath) + if (shouldPersistSubModels) { +require(instance.hasSubModels, "Cannot get sub models to persist.") +val subModelsPath = new Path(path, "subModels") +for (splitIndex <- 0 until instance.getNumFolds) { + val splitPath = new Path(subModelsPath, splitIndex.toString) --- End diff -- How about naming this with the string "fold": ```splitIndex.toString``` --> ```"fold" + splitIndex.toString```? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19208: [SPARK-21087] [ML] CrossValidator, TrainValidatio...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/19208#discussion_r148885817 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala --- @@ -282,12 +328,40 @@ object CrossValidatorModel extends MLReadable[CrossValidatorModel] { ValidatorParams.validateParams(instance) +protected var shouldPersistSubModels: Boolean = if (instance.hasSubModels) true else false + +/** + * Extra options for CrossValidatorModelWriter, current support "persistSubModels". + * if sub models exsit, the default value for option "persistSubModels" is "true". + */ +@Since("2.3.0") +override def option(key: String, value: String): this.type = { + key.toLowerCase(Locale.ROOT) match { +case "persistsubmodels" => shouldPersistSubModels = value.toBoolean +case _ => throw new IllegalArgumentException( + s"Illegal option ${key} for CrossValidatorModelWriter") + } + this +} + override protected def saveImpl(path: String): Unit = { import org.json4s.JsonDSL._ - val extraMetadata = "avgMetrics" -> instance.avgMetrics.toSeq + val extraMetadata = ("avgMetrics" -> instance.avgMetrics.toSeq) ~ +("shouldPersistSubModels" -> shouldPersistSubModels) --- End diff -- Let's have 1 name for this argument: "persistSubModels" --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19208: [SPARK-21087] [ML] CrossValidator, TrainValidatio...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/19208#discussion_r148908525 --- Diff: mllib/src/test/scala/org/apache/spark/ml/tuning/CrossValidatorSuite.scala --- @@ -187,6 +191,50 @@ class CrossValidatorSuite .compareParamMaps(cv.getEstimatorParamMaps, cv2.getEstimatorParamMaps) } + test("CrossValidator expose sub models") { +val lr = new LogisticRegression +val lrParamMaps = new ParamGridBuilder() + .addGrid(lr.regParam, Array(0.001, 1000.0)) + .addGrid(lr.maxIter, Array(0, 3)) + .build() +val eval = new BinaryClassificationEvaluator +val numFolds = 3 +val subPath = new File(tempDir, "testCrossValidatorSubModels") +val persistSubModelsPath = new File(subPath, "subModels").toString + +val cv = new CrossValidator() + .setEstimator(lr) + .setEstimatorParamMaps(lrParamMaps) + .setEvaluator(eval) + .setNumFolds(numFolds) + .setParallelism(1) + .setCollectSubModels(true) + +val cvModel = cv.fit(dataset) + +assert(cvModel.hasSubModels && cvModel.subModels.length == numFolds) +cvModel.subModels.foreach(array => assert(array.length == lrParamMaps.length)) + +// Test the default value for option "persistSubModel" to be "true" +val savingPathWithSubModels = new File(subPath, "cvModel3").getPath +cvModel.save(savingPathWithSubModels) +val cvModel3 = CrossValidatorModel.load(savingPathWithSubModels) +assert(cvModel3.hasSubModels && cvModel3.subModels.length == numFolds) +cvModel3.subModels.foreach(array => assert(array.length == lrParamMaps.length)) + +val savingPathWithoutSubModels = new File(subPath, "cvModel2").getPath +cvModel.write.option("persistSubModels", "false").save(savingPathWithoutSubModels) +val cvModel2 = CrossValidatorModel.load(savingPathWithoutSubModels) --- End diff -- Shall we try saving cvModel2 with persistSubModels = true and check for an Exception? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19651: [SPARK-20682][SPARK-15474][SPARK-21791] Add new ORCFileF...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19651 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19651: [SPARK-20682][SPARK-15474][SPARK-21791] Add new ORCFileF...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19651 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83422/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19651: [SPARK-20682][SPARK-15474][SPARK-21791] Add new ORCFileF...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/19651 Retest this please. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19653: [SPARK-22418][SQL][TEST] Add test cases for NULL Handlin...
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/19653 I checked with Postgres and we have the same results of Postgres too. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19653: [SPARK-22418][SQL][TEST] Add test cases for NULL Handlin...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19653 **[Test build #83421 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83421/testReport)** for PR 19653 at commit [`92308a4`](https://github.com/apache/spark/commit/92308a4341849258caf549d1bcbeabd9002d3ead). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19653: [SPARK-22418][SQL][TEST] Add test cases for NULL Handlin...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/19653 retest this please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19653: [SPARK-22418][SQL][TEST] Add test cases for NULL Handlin...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/19653 Our results are exactly the same as Oracle. LGTM --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19653: [SPARK-22418][SQL][TEST] Add test cases for NULL ...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/19653#discussion_r148903491 --- Diff: sql/core/src/test/resources/sql-tests/results/null-handling.sql.out --- @@ -0,0 +1,305 @@ +-- Automatically generated by SQLQueryTestSuite +-- Number of queries: 28 + + +-- !query 0 +create table t1(a int, b int, c int) using parquet +-- !query 0 schema +struct<> +-- !query 0 output + + + +-- !query 1 +insert into t1 values(1,0,0) +-- !query 1 schema +struct<> +-- !query 1 output + + + +-- !query 2 +insert into t1 values(2,0,1) +-- !query 2 schema +struct<> +-- !query 2 output + + + +-- !query 3 +insert into t1 values(3,1,0) +-- !query 3 schema +struct<> +-- !query 3 output + + + +-- !query 4 +insert into t1 values(4,1,1) +-- !query 4 schema +struct<> +-- !query 4 output + + + +-- !query 5 +insert into t1 values(5,null,0) +-- !query 5 schema +struct<> +-- !query 5 output + + + +-- !query 6 +insert into t1 values(6,null,1) +-- !query 6 schema +struct<> +-- !query 6 output + + + +-- !query 7 +insert into t1 values(7,null,null) +-- !query 7 schema +struct<> +-- !query 7 output + + + +-- !query 8 +select a, b+c from t1 +-- !query 8 schema +struct +-- !query 8 output +1 0 +2 1 +3 1 +4 2 +5 NULL +6 NULL +7 NULL + + +-- !query 9 +select a+10, b*0 from t1 +-- !query 9 schema +struct<(a + 10):int,(b * 0):int> +-- !query 9 output +11 0 +12 0 +13 0 +14 0 +15 NULL +16 NULL +17 NULL + + +-- !query 10 +select distinct b from t1 +-- !query 10 schema +struct +-- !query 10 output +0 +1 +NULL + + +-- !query 11 +select b from t1 union select b from t1 +-- !query 11 schema +struct +-- !query 11 output +0 +1 +NULL + + +-- !query 12 +select a+20, case b when c then 1 else 0 end from t1 +-- !query 12 schema +struct<(a + 20):int,CASE WHEN (b = c) THEN 1 ELSE 0 END:int> +-- !query 12 output +21 1 +22 0 +23 0 +24 1 +25 0 +26 0 +27 0 + + +-- !query 13 +select a+30, case c when b then 1 else 0 end from t1 +-- !query 13 schema +struct<(a + 30):int,CASE WHEN (c = b) THEN 1 ELSE 0 END:int> +-- !query 13 output +31 1 +32 0 +33 0 +34 1 +35 0 +36 0 +37 0 + + +-- !query 14 +select a+40, case when b<>0 then 1 else 0 end from t1 +-- !query 14 schema +struct<(a + 40):int,CASE WHEN (NOT (b = 0)) THEN 1 ELSE 0 END:int> +-- !query 14 output +41 0 +42 0 +43 1 +44 1 +45 0 +46 0 +47 0 + + +-- !query 15 +select a+50, case when not b<>0 then 1 else 0 end from t1 +-- !query 15 schema +struct<(a + 50):int,CASE WHEN (NOT (NOT (b = 0))) THEN 1 ELSE 0 END:int> +-- !query 15 output +51 1 +52 1 +53 0 +54 0 +55 0 +56 0 +57 0 + + +-- !query 16 +select a+60, case when b<>0 and c<>0 then 1 else 0 end from t1 +-- !query 16 schema +struct<(a + 60):int,CASE WHEN ((NOT (b = 0)) AND (NOT (c = 0))) THEN 1 ELSE 0 END:int> +-- !query 16 output +61 0 +62 0 +63 0 +64 1 +65 0 +66 0 +67 0 + + +-- !query 17 +select a+70, case when not (b<>0 and c<>0) then 1 else 0 end from t1 +-- !query 17 schema +struct<(a + 70):int,CASE WHEN (NOT ((NOT (b = 0)) AND (NOT (c = 0 THEN 1 ELSE 0 END:int> +-- !query 17 output +71 1 +72 1 +73 1 +74 0 +75 1 +76 0 +77 0 + + +-- !query 18 +select a+80, case when b<>0 or c<>0 then 1 else 0 end from t1 +-- !query 18 schema +struct<(a + 80):int,CASE WHEN ((NOT (b = 0)) OR (NOT (c = 0))) THEN 1 ELSE 0 END:int> +-- !query 18 output +81 0 +82 1 +83 1 +84 1 +85 0 +86 1 +87 0 + + +-- !query 19 +select a+90, case when not (b<>0 or c<>0) then 1 else 0 end from t1 +-- !query 19 schema +struct<(a + 90):int,CASE WHEN (NOT ((NOT (b = 0)) OR (NOT (c = 0 THEN 1 ELSE 0 END:int> +-- !query 19 output +91 1 +92 0 +93 0 +94 0 +95 0 +96 0 +97 0 + + +-- !query 20 +select count(*), count(b), sum(b), avg(b), min(b), max(b) from t1 +-- !query 20 schema +struct +-- !query 20 output +7 4 2 0.5 0 1 + + +-- !query 21 +select a+100 from t1 where b<10 +-- !query 21 schema +struct<(a + 100):int> +-- !query 21 output +101 +102 +103 +104 + + +-- !query 22 +select a+110 from t1 where not b>10 +-- !query 22 sc
[GitHub] spark pull request #19582: [SPARK-20644][core] Initial ground work for kvsto...
Github user squito commented on a diff in the pull request: https://github.com/apache/spark/pull/19582#discussion_r148901343 --- Diff: core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala --- @@ -486,8 +580,21 @@ private[history] class FsHistoryProvider(conf: SparkConf, clock: Clock) bus.addListener(listener) replay(fileStatus, isApplicationCompleted(fileStatus), bus, eventsFilter) -listener.applicationInfo.foreach(addListing) -listing.write(LogInfo(logPath.toString(), fileStatus.getLen())) +listener.applicationInfo.foreach { app => + // Invalidate the existing UI for the reloaded app attempt, if any. Note that this does + // not remove the UI from the active list; that has to be done in onUIDetached, so that + // cleanup of files can be done in a thread-safe manner. It does mean the UI will remain + // in memory for longer than it should. --- End diff -- it took me some time to figure how the cache & invalidation worked, mostly because I wasn't looking in the right places. I don't think you've made this any more confusing than it was before (in fact its probably better), but seems like a good opportunity to improve commenting a little. I think it might help to have one comment in the code where the entire sequence is described ( here on `mergeApplicationListing`, or on `AppCache`, or `ApplicationCacheCheckFilter`, doesn't really matter, but they could all reference the longer comment). if I understand correctly, it would be something like: Logs of incomplete apps are regularly polled to see if they have been updated (based on an increase in file size). If they have, the existing data for that app is marked as invalid in LoadedAppUI. However, no memory is freed, no files are cleaned up at this time, nor is a new UI built. On each request for one app's UI, the application cache is checked to see if it has a valid `LoadedAppUI` in the cache. If there is data in the cache and its valid, then its served. If there is data in the cache but it is invalid, then the UI is rebuilt from the raw event logs. If there is nothing in the cache, then the UI is built from the raw event logs and added to the cache. This may kick another entry out of the cache -- if its for an incomplete app, then any KVStore data written to disk is deleted (as the KVStore for an incomplete app is always regenerated from scratch anyway). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19582: [SPARK-20644][core] Initial ground work for kvsto...
Github user squito commented on a diff in the pull request: https://github.com/apache/spark/pull/19582#discussion_r148879455 --- Diff: core/src/main/scala/org/apache/spark/status/storeTypes.scala --- @@ -17,12 +17,17 @@ package org.apache.spark.status +import java.lang.{Integer => JInteger, Long => JLong} + import com.fasterxml.jackson.annotation.JsonIgnore import org.apache.spark.status.KVUtils._ import org.apache.spark.status.api.v1._ import org.apache.spark.util.kvstore.KVIndex +private[spark] case class AppStatusStoreMetadata( +val version: Long) --- End diff -- nit: `val` is unnecessary --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19583: [WIP][SPARK-22339] [CORE] [NETWORK-SHUFFLE] Push epoch u...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19583 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19582: [SPARK-20644][core] Initial ground work for kvstore UI b...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19582 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19583: [WIP][SPARK-22339] [CORE] [NETWORK-SHUFFLE] Push epoch u...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19583 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83419/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19582: [SPARK-20644][core] Initial ground work for kvstore UI b...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19582 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83420/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19653: [SPARK-22418][SQL][TEST] Add test cases for NULL Handlin...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19653 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19653: [SPARK-22418][SQL][TEST] Add test cases for NULL Handlin...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19653 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83418/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19623: [SPARK-22078][SQL] clarify exception behaviors fo...
Github user steveloughran commented on a diff in the pull request: https://github.com/apache/spark/pull/19623#discussion_r14237 --- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/writer/DataSourceV2Writer.java --- @@ -50,28 +53,34 @@ /** * Creates a writer factory which will be serialized and sent to executors. + * + * If this method fails (by throwing an exception), the action would fail and no Spark job was + * submitted. */ DataWriterFactory createWriterFactory(); /** * Commits this writing job with a list of commit messages. The commit messages are collected from - * successful data writers and are produced by {@link DataWriter#commit()}. If this method - * fails(throw exception), this writing job is considered to be failed, and - * {@link #abort(WriterCommitMessage[])} will be called. The written data should only be visible - * to data source readers if this method succeeds. + * successful data writers and are produced by {@link DataWriter#commit()}. + * + * If this method fails (by throwing an exception), this writing job is considered to to have been + * failed, and {@link #abort(WriterCommitMessage[])} would be called. The state of the destination + * is undefined and @{@link #abort(WriterCommitMessage[])} may not be able to deal with it. * * Note that, one partition may have multiple committed data writers because of speculative tasks. * Spark will pick the first successful one and get its commit message. Implementations should be --- End diff -- thinking about this some more (and still reserving the right to be provably wrong), I could imagine a job commit protocol which allowed 1+ Task attempt to commit, making the decision about which completed tasks to accept into the final results at job commit. That is, 1. every task attempt could (atomically) promote its output to the job commit dir 1. the job commit would enumerate all promoted task attempts, and choose which ones to accept. 1. The ones it didn't want would be aborted by the job committer. Example: task attempts rename their output to a dir `dest/_tempt/jobId/completed/$taskId_$task_attemptId`; job committer would enum all directories in the completed dir, and for all dirs where the task ID was the same: pick one to commit, abort the others. With that variant, you don't need to coordinate task commit across workers, even with speculation enabled. You'd just need to be confident that the promotion of task commit information was atomic, the view of the output consistent, and that job commit is not initiated until at least one attempt per task has succeeded. Job commit is potentially slower though. Because you can turn off use of the output co-ordinator when writing to hive/hadoop tables, then a commit protocol like this could work today. Interestingly, it's not that far off the FileOutputCommitter v1 protocol, you'd just renane the task attempt output dir to the promoted dir & let the job commit filter out duplicates from its directory listing. Be a bit more expensive in terms of storage use between task commit and job commit though. Maybe a good policy here is "a job must not commit 1+ task attempt", but give committers the option to bypass the output coordinator. if the job committer can handle the conflict resolution & so make that guarantee in its commit phase. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19638: [SPARK-22422][ML] Add Adjusted R2 to RegressionMe...
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/19638#discussion_r148881376 --- Diff: mllib/src/test/scala/org/apache/spark/ml/regression/LinearRegressionSuite.scala --- @@ -764,13 +764,17 @@ class LinearRegressionSuite (Intercept) 6.3022157 0.00186003388 <2e-16 *** V2 4.6982442 0.00118053980 <2e-16 *** V3 7.1994344 0.00090447961 <2e-16 *** + + # R code for r2adj --- End diff -- What is `X1`? it's never defined. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19623: [SPARK-22078][SQL] clarify exception behaviors fo...
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/19623#discussion_r148875607 --- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/writer/DataSourceV2Writer.java --- @@ -50,28 +53,34 @@ /** * Creates a writer factory which will be serialized and sent to executors. + * + * If this method fails (by throwing an exception), the action would fail and no Spark job was + * submitted. */ DataWriterFactory createWriterFactory(); /** * Commits this writing job with a list of commit messages. The commit messages are collected from - * successful data writers and are produced by {@link DataWriter#commit()}. If this method - * fails(throw exception), this writing job is considered to be failed, and - * {@link #abort(WriterCommitMessage[])} will be called. The written data should only be visible - * to data source readers if this method succeeds. + * successful data writers and are produced by {@link DataWriter#commit()}. + * + * If this method fails (by throwing an exception), this writing job is considered to to have been + * failed, and {@link #abort(WriterCommitMessage[])} would be called. The state of the destination + * is undefined and @{@link #abort(WriterCommitMessage[])} may not be able to deal with it. * * Note that, one partition may have multiple committed data writers because of speculative tasks. * Spark will pick the first successful one and get its commit message. Implementations should be --- End diff -- I haven't read Steve's points here entirely, but I agree that Spark should be primarily responsible for task commit coordination. Most implementations would be fine using the current [output commit coordinator](https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/OutputCommitCoordinator.scala), which does a good job balancing the trade-offs that you've been discussing. It ensures that only one task is authorized to commit and has well-defined failure cases (when a network partition prevents the authorized committer from responding before its commit authorization times out). I think that Spark should use the current commit coordinator unless an implementation opts out of using it (and I'm not sure that opting out is a use case we care to support at this point). It's fine if Spark documents how its coordinator works and there are some drawbacks, but expecting implementations to handle their own commit coordination (which requires RPC for Spark) is, I think, unreasonable. Let's use the one we have by default, however imperfect. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19634: [SPARK-22412][SQL] Fix incorrect comment in DataS...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/19634#discussion_r148875065 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala --- @@ -469,7 +469,7 @@ case class FileSourceScanExec( currentSize = 0 } -// Assign files to partitions using "First Fit Decreasing" (FFD) +// Assign files to partitions using "Next Fit Decreasing" --- End diff -- @liancheng do you agree? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19208: [SPARK-21087] [ML] CrossValidator, TrainValidatio...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/19208#discussion_r148859381 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala --- @@ -252,19 +252,29 @@ object CrossValidator extends MLReadable[CrossValidator] { class CrossValidatorModel private[ml] ( @Since("1.4.0") override val uid: String, @Since("1.2.0") val bestModel: Model[_], -@Since("1.5.0") val avgMetrics: Array[Double], -@Since("2.3.0") val subModels: Option[Array[Array[Model[_) +@Since("1.5.0") val avgMetrics: Array[Double]) extends Model[CrossValidatorModel] with CrossValidatorParams with MLWritable { /** A Python-friendly auxiliary constructor. */ private[ml] def this(uid: String, bestModel: Model[_], avgMetrics: JList[Double]) = { -this(uid, bestModel, avgMetrics.asScala.toArray, null) +this(uid, bestModel, avgMetrics.asScala.toArray) } - private[ml] def this(uid: String, bestModel: Model[_], avgMetrics: Array[Double]) = { -this(uid, bestModel, avgMetrics, null) + private var _subModels: Option[Array[Array[Model[_ = None + + @Since("2.3.0") + private[tuning] def setSubModels(subModels: Option[Array[Array[Model[_) +: CrossValidatorModel = { +_subModels = subModels +this } + @Since("2.3.0") + def subModels: Array[Array[Model[_]]] = _subModels.get --- End diff -- Also, can you please add a better Exception message? If submodels are not available, then we should tell users to set the collectSubModels Param before fitting. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19208: [SPARK-21087] [ML] CrossValidator, TrainValidatio...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/19208#discussion_r148859543 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala --- @@ -325,14 +328,19 @@ object CrossValidatorModel extends MLReadable[CrossValidatorModel] { ValidatorParams.validateParams(instance) -protected var shouldPersistSubModels: Boolean = false +protected var shouldPersistSubModels: Boolean = if (instance.hasSubModels) true else false /** - * Set option for persist sub models. + * Extra options for CrossValidatorModelWriter, current support "persistSubModels". + * if sub models exsit, the default value for option "persistSubModels" is "true". --- End diff -- typo: exsit -> exist --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19208: [SPARK-21087] [ML] CrossValidator, TrainValidatio...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/19208#discussion_r148859043 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala --- @@ -252,19 +252,29 @@ object CrossValidator extends MLReadable[CrossValidator] { class CrossValidatorModel private[ml] ( @Since("1.4.0") override val uid: String, @Since("1.2.0") val bestModel: Model[_], -@Since("1.5.0") val avgMetrics: Array[Double], -@Since("2.3.0") val subModels: Option[Array[Array[Model[_) +@Since("1.5.0") val avgMetrics: Array[Double]) extends Model[CrossValidatorModel] with CrossValidatorParams with MLWritable { /** A Python-friendly auxiliary constructor. */ private[ml] def this(uid: String, bestModel: Model[_], avgMetrics: JList[Double]) = { -this(uid, bestModel, avgMetrics.asScala.toArray, null) +this(uid, bestModel, avgMetrics.asScala.toArray) } - private[ml] def this(uid: String, bestModel: Model[_], avgMetrics: Array[Double]) = { -this(uid, bestModel, avgMetrics, null) + private var _subModels: Option[Array[Array[Model[_ = None + + @Since("2.3.0") --- End diff -- Only use Since annotations for public APIs --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19208: [SPARK-21087] [ML] CrossValidator, TrainValidatio...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/19208#discussion_r148859168 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala --- @@ -252,19 +252,29 @@ object CrossValidator extends MLReadable[CrossValidator] { class CrossValidatorModel private[ml] ( @Since("1.4.0") override val uid: String, @Since("1.2.0") val bestModel: Model[_], -@Since("1.5.0") val avgMetrics: Array[Double], -@Since("2.3.0") val subModels: Option[Array[Array[Model[_) +@Since("1.5.0") val avgMetrics: Array[Double]) extends Model[CrossValidatorModel] with CrossValidatorParams with MLWritable { /** A Python-friendly auxiliary constructor. */ private[ml] def this(uid: String, bestModel: Model[_], avgMetrics: JList[Double]) = { -this(uid, bestModel, avgMetrics.asScala.toArray, null) +this(uid, bestModel, avgMetrics.asScala.toArray) } - private[ml] def this(uid: String, bestModel: Model[_], avgMetrics: Array[Double]) = { -this(uid, bestModel, avgMetrics, null) + private var _subModels: Option[Array[Array[Model[_ = None + + @Since("2.3.0") + private[tuning] def setSubModels(subModels: Option[Array[Array[Model[_) +: CrossValidatorModel = { +_subModels = subModels +this } + @Since("2.3.0") + def subModels: Array[Array[Model[_]]] = _subModels.get --- End diff -- Let's add Scala doc. We'll need to explain what the inner and outer array are and which one corresponds to the ordering of estimatorParamsMaps. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19208: [SPARK-21087] [ML] CrossValidator, TrainValidatio...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/19208#discussion_r148860542 --- Diff: mllib/src/main/scala/org/apache/spark/ml/util/ReadWrite.scala --- @@ -108,6 +108,13 @@ abstract class MLWriter extends BaseReadWrite with Logging { protected def saveImpl(path: String): Unit /** + * `option()` handles extra options. If subclasses need to support extra options, override this + * method. + */ + @Since("2.3.0") + def option(key: String, value: String): this.type = this --- End diff -- Rather than overriding this in each subclass, let's have this option() method collect the specified options in a map which is consumed by the subclass when saveImpl() is called. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19638: [SPARK-22422][ML] Add Adjusted R2 to RegressionMe...
Github user tengpeng commented on a diff in the pull request: https://github.com/apache/spark/pull/19638#discussion_r148869278 --- Diff: mllib/src/test/scala/org/apache/spark/ml/regression/LinearRegressionSuite.scala --- @@ -764,13 +764,17 @@ class LinearRegressionSuite (Intercept) 6.3022157 0.00186003388 <2e-16 *** V2 4.6982442 0.00118053980 <2e-16 *** V3 7.1994344 0.00090447961 <2e-16 *** + + # R code for r2adj --- End diff -- I just want to confirm: is `summary(lm(X1 ~ ., data = part_0))` this does not work on your side? Is it because you used different variable name or data name? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning
Github user mallman commented on the issue: https://github.com/apache/spark/pull/16578 @viirya Can you please take a look at my latest revisions and replies to your comments? Cheers. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16578: [SPARK-4502][SQL] Parquet nested column pruning
Github user mallman commented on a diff in the pull request: https://github.com/apache/spark/pull/16578#discussion_r148866084 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetReadSupport.scala --- @@ -63,9 +74,22 @@ private[parquet] class ParquetReadSupport extends ReadSupport[UnsafeRow] with Lo StructType.fromString(schemaString) } -val parquetRequestedSchema = +val clippedParquetSchema = ParquetReadSupport.clipParquetSchema(context.getFileSchema, catalystRequestedSchema) +val parquetRequestedSchema = if (parquetMrCompatibility) { + // Parquet-mr will throw an exception if we try to read a superset of the file's schema. + // Therefore, we intersect our clipped schema with the underlying file's schema --- End diff -- On the other hand, if we fix `parquetMrCompatibility` to `true`, then a couple of other tests fail. Namely, these tests are "Filter applied on merged Parquet schema with new column should work" in `ParquetFilterSuite.scala` "read partitioned table - merging compatible schemas" in `ParquetPartitionDiscoverySuite.scala` In both cases, the failures involve queries over multiple files with compatible but different schema. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19653: [SPARK-22418][SQL][TEST] Add test cases for NULL ...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/19653#discussion_r148865915 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/NullHandlingSuite.scala --- @@ -0,0 +1,130 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.spark.sql + +import org.apache.spark.sql.functions._ +import org.apache.spark.sql.test.SharedSQLContext + +case class T1(a: Int, b: Option[Int], c: Option[Int]) + +/** + * This test suite takes https://sqlite.org/nulls.html as a reference. + */ +class NullHandlingSuite extends QueryTest with SharedSQLContext { --- End diff -- Maybe using SQL? It will be easier for the others who knew SQL only to understand our NULL handling logics. Thanks again! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16578: [SPARK-4502][SQL] Parquet nested column pruning
Github user mallman commented on a diff in the pull request: https://github.com/apache/spark/pull/16578#discussion_r148863673 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetReadSupport.scala --- @@ -63,9 +74,22 @@ private[parquet] class ParquetReadSupport extends ReadSupport[UnsafeRow] with Lo StructType.fromString(schemaString) } -val parquetRequestedSchema = +val clippedParquetSchema = ParquetReadSupport.clipParquetSchema(context.getFileSchema, catalystRequestedSchema) +val parquetRequestedSchema = if (parquetMrCompatibility) { + // Parquet-mr will throw an exception if we try to read a superset of the file's schema. + // Therefore, we intersect our clipped schema with the underlying file's schema --- End diff -- As for the problem of requesting a read of a superset of a file's fields, if we disable the `parquetMrCompatibility` code, then the "partial schema intersection - select missing subfield" test in `ParquetSchemaPruningSuite.scala` fails with the following stack trace: ``` [info] Cause: org.apache.parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file file:/Volumes/VideoAmpCS/msa/workspace/spark-public/target/tmp/spark-a0bda193-9d3f-4cd1-885c-9e8b5b0fc1ed/contacts/p=2/part-1-4a8671f1-afb2-482f-8c4d-4f6f4df896bc-c000.snappy.parquet [info] at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:223) [info] at org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:215) [info] at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39) [info] at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) [info] at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:106) [info] at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:182) [info] at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:106) [info] at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source) [info] at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) [info] at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$7$$anon$1.hasNext(WholeStageCodegenExec.scala:432) [info] at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) [info] at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) [info] at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1820) [info] at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1160) [info] at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1160) [info] at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2064) [info] at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2064) [info] at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) [info] at org.apache.spark.scheduler.Task.run(Task.scala:108) [info] at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345) [info] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [info] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [info] at java.lang.Thread.run(Thread.java:748) [info] Cause: java.lang.IndexOutOfBoundsException: Index: 0, Size: 0 [info] at java.util.ArrayList.rangeCheck(ArrayList.java:653) [info] at java.util.ArrayList.get(ArrayList.java:429) [info] at org.apache.parquet.io.GroupColumnIO.getFirst(GroupColumnIO.java:103) [info] at org.apache.parquet.io.GroupColumnIO.getFirst(GroupColumnIO.java:103) [info] at org.apache.parquet.io.PrimitiveColumnIO.getFirst(PrimitiveColumnIO.java:102) [info] at org.apache.parquet.io.PrimitiveColumnIO.isFirst(PrimitiveColumnIO.java:97) [info] at org.apache.parquet.io.RecordReaderImplementation.(RecordReaderImplementation.java:278) [info] at org.apache.parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:141) [info] at org.apache.parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:107) [info] at org.apache.parquet.filter2.compat.FilterCompat$NoOpFilter.accept(FilterCompat.java:155) [info] at org.apache.parquet.io.MessageColumnIO.getRecordReader(MessageColumnIO.java:107) [info] at org.apache.parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:136) [info] at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:194) [info] at org.apache.parquet.hadoop
[GitHub] spark pull request #19648: [SPARK-14516][ML][FOLLOW-UP] Move ClusteringEvalu...
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/19648#discussion_r148861446 --- Diff: mllib/src/test/scala/org/apache/spark/ml/evaluation/ClusteringEvaluatorSuite.scala --- @@ -22,15 +22,21 @@ import org.apache.spark.ml.param.ParamsSuite import org.apache.spark.ml.util.DefaultReadWriteTest import org.apache.spark.ml.util.TestingUtils._ import org.apache.spark.mllib.util.MLlibTestSparkContext -import org.apache.spark.sql.{DataFrame, SparkSession} -import org.apache.spark.sql.types.IntegerType +import org.apache.spark.sql.Dataset class ClusteringEvaluatorSuite extends SparkFunSuite with MLlibTestSparkContext with DefaultReadWriteTest { import testImplicits._ + @transient var irisDataset: Dataset[_] = _ --- End diff -- Either is ok, no difference on performance, we just keep consistent with other test suites. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19208: [SPARK-21087] [ML] CrossValidator, TrainValidatio...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/19208#discussion_r148857274 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tuning/TrainValidationSplit.scala --- @@ -276,12 +315,32 @@ object TrainValidationSplitModel extends MLReadable[TrainValidationSplitModel] { ValidatorParams.validateParams(instance) +protected var shouldPersistSubModels: Boolean = false + +/** + * Set option for persist sub models. + */ +@Since("2.3.0") +def persistSubModels(persist: Boolean): this.type = { + shouldPersistSubModels = persist + this +} + override protected def saveImpl(path: String): Unit = { import org.json4s.JsonDSL._ - val extraMetadata = "validationMetrics" -> instance.validationMetrics.toSeq + val extraMetadata = ("validationMetrics" -> instance.validationMetrics.toSeq) ~ +("shouldPersistSubModels" -> shouldPersistSubModels) ValidatorParams.saveImpl(path, instance, sc, Some(extraMetadata)) val bestModelPath = new Path(path, "bestModel").toString instance.bestModel.asInstanceOf[MLWritable].save(bestModelPath) + if (shouldPersistSubModels) { +require(instance.subModels != null, "Cannot get sub models to persist.") +val subModelsPath = new Path(path, "subModels") +for (paramIndex <- 0 until instance.getEstimatorParamMaps.length) { + val modelPath = new Path(subModelsPath, paramIndex.toString).toString + instance.subModels(paramIndex).asInstanceOf[MLWritable].save(modelPath) --- End diff -- Good question about cleaning up partially saved models. I agree it'd be nice to do in the future, rather than now. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19641: [SPARK-21911][ML][FOLLOW-UP] Fix doc for parallel ML Tun...
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/19641 LGTM pending tests --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org