[GitHub] spark issue #23262: [SPARK-26312][SQL] Converting converters in RDDConversion...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23262 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5879/
[GitHub] spark issue #23262: [SPARK-26312][SQL] Converting converters in RDDConversion...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/23262 **[Test build #99861 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99861/testReport)** for PR 23262 at commit [`ddb2528`](https://github.com/apache/spark/commit/ddb252892a439281b16bc14fdfdb7faf756f1067).
[GitHub] spark pull request #23256: [SPARK-24207][R] follow-up PR for SPARK-24207 to ...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/23256#discussion_r239997109

--- Diff: R/pkg/tests/fulltests/test_mllib_fpm.R ---
@@ -84,19 +84,20 @@ test_that("spark.fpGrowth", {
 })

 test_that("spark.prefixSpan", {
-df <- createDataFrame(list(list(list(list(1L, 2L), list(3L))),
-                      list(list(list(1L), list(3L, 2L), list(1L, 2L))),
-                      list(list(list(1L, 2L), list(5L))),
-                      list(list(list(6L,
-                      schema = c("sequence"))
-result1 <- spark.findFrequentSequentialPatterns(df, minSupport = 0.5, maxPatternLength = 5L,
-                                                maxLocalProjDBSize = 3200L)
-
-expected_result <- createDataFrame(list(list(list(list(1L)), 3L),
-                                   list(list(list(3L)), 2L),
-                                   list(list(list(2L)), 3L),
-                                   list(list(list(1L, 2L)), 3L),
-                                   list(list(list(1L), list(3L)), 2L)),
-                                   schema = c("sequence", "freq"))
-  })
+  df <- createDataFrame(list(list(list(list(1L, 2L), list(3L))),
+                        list(list(list(1L), list(3L, 2L), list(1L, 2L))),
+                        list(list(list(1L, 2L), list(5L))),
+                        list(list(list(6L,
+                        schema = c("sequence"))
+  result <- spark.findFrequentSequentialPatterns(df, minSupport = 0.5, maxPatternLength = 5L,
+                                                 maxLocalProjDBSize = 3200L)
+
+  expected_result <- createDataFrame(list(list(list(list(1L)), 3L), list(list(list(3L)), 2L),
+                                     list(list(list(2L)), 3L), list(list(list(1L, 2L)), 3L),
+                                     list(list(list(1L), list(3L)), 2L)),
+                                     schema = c("sequence", "freq"))
+
+  expect_equivalent(expected_result, result)
--- End diff --

this is an important fix..
[GitHub] spark issue #23262: [SPARK-26312][SQL] Converting converters in RDDConversion...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23262 Merged build finished. Test PASSed.
[GitHub] spark issue #23261: [WIP][SPARK-23674][ML] Adds Spark ML Events
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23261 Merged build finished. Test FAILed.
[GitHub] spark issue #23261: [WIP][SPARK-23674][ML] Adds Spark ML Events
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23261 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99859/
[GitHub] spark issue #23261: [WIP][SPARK-23674][ML] Adds Spark ML Events
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/23261 **[Test build #99859 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99859/testReport)** for PR 23261 at commit [`56805d6`](https://github.com/apache/spark/commit/56805d60004142106b400b94eb94f8cf87486494).
* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
  * `case class TransformStart(transformer: Transformer, input: Dataset[_]) extends MLEvent`
  * `case class TransformEnd(transformer: Transformer, output: Dataset[_]) extends MLEvent`
  * `case class FitStart[M <: Model[M]](estimator: Estimator[M], dataset: Dataset[_]) extends MLEvent`
  * `case class FitEnd[M <: Model[M]](estimator: Estimator[M], model: M) extends MLEvent`
  * `case class LoadInstanceStart[T](reader: MLReader[T], path: String) extends MLEvent`
  * `case class LoadInstanceEnd[T](reader: MLReader[T], instance: T) extends MLEvent`
  * `case class SaveInstanceStart(writer: MLWriter, path: String) extends MLEvent`
  * `case class SaveInstanceEnd(writer: MLWriter, path: String) extends MLEvent`
[GitHub] spark pull request #23262: [SPARK-26312][SQL] Converting converters in RDDCon...
GitHub user eatoncys opened a pull request: https://github.com/apache/spark/pull/23262 [SPARK-26312][SQL] Converting converters in RDDConversions into arrays to improve their access performance

## What changes were proposed in this pull request?

`RDDConversions` would get disproportionately slower as the number of columns in the query increased. This PR converts the `converters` in `RDDConversions` into arrays to improve their access performance; the type of `converters` before was `scala.collection.immutable.::`, which is a subtype of `List`. The test of `PrunedScanSuite` for 2000 columns and 20k rows takes 409 seconds before this PR, and 361 seconds after.

## How was this patch tested?

Test case of `PrunedScanSuite`.

commit ddb252892a439281b16bc14fdfdb7faf756f1067
Author: 10129659
Date: 2018-12-08T07:15:10Z

    Converting converters in RDDConversions into arrays to improve their access performance
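To spell out the reasoning behind this fix: `List.apply(i)` walks i cons cells, so converting every column of a row by index costs O(n^2) in the column count, while an `Array` lookup is O(1), making a full row O(n). A minimal, self-contained sketch of the access pattern (hypothetical names, not the PR's actual code):

```scala
object ConverterAccessSketch {
  def main(args: Array[String]): Unit = {
    val numColumns = 2000
    // Stand-ins for the per-column converter functions held by RDDConversions.
    val convertersList: List[Any => Any] = List.fill(numColumns)(identity[Any] _)
    val convertersArray: Array[Any => Any] = convertersList.toArray

    val row = new Array[Any](numColumns)
    var i = 0
    while (i < numColumns) {
      // convertersList(i) would traverse i cons cells on every call;
      // convertersArray(i) is a single direct load, which is the whole point
      // of the change described above.
      row(i) = convertersArray(i)("value")
      i += 1
    }
  }
}
```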
[GitHub] spark issue #23252: [SPARK-26239] File-based secret key loading for SASL.
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23252 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99850/
[GitHub] spark issue #23252: [SPARK-26239] File-based secret key loading for SASL.
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23252 Merged build finished. Test PASSed.
[GitHub] spark pull request #22707: [SPARK-25717][SQL] Insert overwrite a recreated e...
Github user maropu commented on a diff in the pull request: https://github.com/apache/spark/pull/22707#discussion_r239996822

--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/InsertSuite.scala ---
@@ -774,4 +774,23 @@ class InsertSuite extends QueryTest with TestHiveSingleton with BeforeAndAfter
     }
   }
 }
+
+  test("SPARK-25717: Insert overwrite a recreated external and partitioned table " +
+    "should remove the historical partition first") {
+    withTempDir { tmpDir =>
+      withTable("test_table") {
+        (0 until 3).foreach { _ =>
+          sql("DROP TABLE IF EXISTS test_table")
+          sql(
+            s"""
+              |CREATE EXTERNAL TABLE test_table (key int)
+              |PARTITIONED BY (p int)
+              |LOCATION '${tmpDir.toURI.toString.stripSuffix("/")}/test_table'
--- End diff --

nit: `|LOCATION '${tmpDir.toURI}'`?
[GitHub] spark issue #23252: [SPARK-26239] File-based secret key loading for SASL.
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/23252 **[Test build #99850 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99850/testReport)** for PR 23252 at commit [`662e38e`](https://github.com/apache/spark/commit/662e38e6f1a01297d75132b088afb6a6a761f70a).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #22707: [SPARK-25717][SQL] Insert overwrite a recreated external...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22707 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5878/
[GitHub] spark issue #22707: [SPARK-25717][SQL] Insert overwrite a recreated external...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22707 Merged build finished. Test PASSed.
[GitHub] spark issue #22707: [SPARK-25717][SQL] Insert overwrite a recreated external...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22707 **[Test build #99860 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99860/testReport)** for PR 22707 at commit [`89f7223`](https://github.com/apache/spark/commit/89f7223593b171133bb716b0004aa13a592dee6a).
[GitHub] spark pull request #22707: [SPARK-25717][SQL] Insert overwrite a recreated e...
Github user maropu commented on a diff in the pull request: https://github.com/apache/spark/pull/22707#discussion_r239996528

--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/InsertSuite.scala ---
@@ -774,4 +774,23 @@ class InsertSuite extends QueryTest with TestHiveSingleton with BeforeAndAfter
     }
   }
 }
+
+  test("SPARK-25717: Insert overwrite a recreated external and partitioned table " +
--- End diff --

Can you make the title shorter, in a single line?
[GitHub] spark issue #22707: [SPARK-25717][SQL] Insert overwrite a recreated external...
Github user maropu commented on the issue: https://github.com/apache/spark/pull/22707 ok to test
[GitHub] spark issue #23260: [SPARK-26311][YARN] New feature: custom log URL for stdo...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23260 Merged build finished. Test PASSed.
[GitHub] spark issue #23260: [SPARK-26311][YARN] New feature: custom log URL for stdo...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23260 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99857/
[GitHub] spark issue #23260: [SPARK-26311][YARN] New feature: custom log URL for stdo...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/23260 **[Test build #99857 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99857/testReport)** for PR 23260 at commit [`65cc6a3`](https://github.com/apache/spark/commit/65cc6a32729cccba340f66c766c7255be4d7f356).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #23261: [SPARK-23674][ML] Adds Spark ML Events
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23261 Merged build finished. Test PASSed.
[GitHub] spark issue #23224: [SPARK-26277][SQL][TEST] WholeStageCodegen metrics shoul...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/23224 **[Test build #99858 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99858/testReport)** for PR 23224 at commit [`71e569b`](https://github.com/apache/spark/commit/71e569b2a89ff0fe83b3f94fee6f57596c093590).
[GitHub] spark issue #23261: [SPARK-23674][ML] Adds Spark ML Events
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/23261 **[Test build #99859 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99859/testReport)** for PR 23261 at commit [`56805d6`](https://github.com/apache/spark/commit/56805d60004142106b400b94eb94f8cf87486494).
[GitHub] spark issue #23261: [SPARK-23674][ML] Adds Spark ML Events
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23261 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5877/
[GitHub] spark pull request #23261: [SPARK-23674][ML] Adds Spark ML Events
GitHub user HyukjinKwon opened a pull request: https://github.com/apache/spark/pull/23261 [SPARK-23674][ML] Adds Spark ML Events

## What changes were proposed in this pull request?

This PR proposes to add ML events so that other developers can track them and add some actions for them.

## Introduction

This PR proposes to send some ML events, as SQL does. This is quite useful when people want to track corresponding ML operations and take actions on them. For instance, I have been working on integrating Spark with Atlas, and with some custom changes on top of this PR, I can visualise an ML pipeline as below:

![spark_ml_streaming_lineage](https://user-images.githubusercontent.com/6477701/49682779-394bca80-faf5-11e8-85b8-5fae28b784b3.png)

Needless to say how useful it is to track the SQL operations. Likewise, I would like to propose ML events as well (as lowest-stability `@Unstable` APIs for now).

## Implementation Details

### Send events (but do not expose an ML-specific listener)

In `events.scala`, it adds:

```scala
@Unstable
case class ...Events

object MLEvents {
  // Wrappers to send events:
  // def with...Event(body) = {
  //   body()
  //   SparkContext.getOrCreate().listenerBus.post(event)
  // }
}
```

This mimics both:

**1. Catalog events (see `org.apache.spark.sql.catalyst.catalog.events.scala`)**

- This allows a catalog-specific listener to be added: `ExternalCatalogEventListener`.
- It is implemented by wrapping the whole `ExternalCatalog` in `ExternalCatalogWithListener`, which delegates the operations to `ExternalCatalog`.

This is not quite possible in our case because most instances (like `Pipeline`) are created directly in most cases. We might be able to do that by extending `ListenerBus` for all possible instances, but IMHO it's invasive. Also, exposing another ML-specific listener sounds a bit too much at the current status. Therefore, I simply borrowed the file name and structure here.

**2. SQL execution events (see `org.apache.spark.sql.execution.SQLExecution.scala`)**

- Add an object that wraps a body to send events.

The current approach is rather close to this: it has a `with...` wrapper to send events. I borrowed this approach to be consistent.

### Add `...Impl` methods and wrap each to send events

**1. `mllib/src/main/scala/org/apache/spark/ml/util/ReadWrite.scala`**

```diff
- def save(...) = { saveImpl(...) }
+ def save(...) = MLEvents.withSaveInstanceEvent { saveImpl(...) }
  def saveImpl(...): Unit = ...
```

Note that `saveImpl` was already implemented, unlike the other instances below.

```diff
- def load(...): T
+ def load(...): T = MLEvents.withLoadInstanceEvent { loadImpl(...) }
+ def loadImpl(...): T
```

**2. `mllib/src/main/scala/org/apache/spark/ml/Estimator.scala`**

```diff
- def fit(...): Model
+ def fit(...): Model = MLEvents.withFitEvent { fitImpl() }
+ def fitImpl(...): Model
```

**3. `mllib/src/main/scala/org/apache/spark/ml/Transformer.scala`**

```diff
- def transform(...): DataFrame
+ def transform(...): DataFrame = MLEvents.withTransformEvent { transformImpl() }
+ def transformImpl(...): DataFrame
```

This approach follows the existing way in ML, as below:

**1. `transform` and `transformImpl`**

https://github.com/apache/spark/blob/9b1f6c8bab5401258c653d4e2efb50e97c6d282f/mllib/src/main/scala/org/apache/spark/ml/Predictor.scala#L202-L213
https://github.com/apache/spark/blob/9b1f6c8bab5401258c653d4e2efb50e97c6d282f/mllib/src/main/scala/org/apache/spark/ml/regression/DecisionTreeRegressor.scala#L190-L196
https://github.com/apache/spark/blob/9b1f6c8bab5401258c653d4e2efb50e97c6d282f/mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala#L1037-L1042

**2. `save` and `saveImpl`**

https://github.com/apache/spark/blob/9b1f6c8bab5401258c653d4e2efb50e97c6d282f/mllib/src/main/scala/org/apache/spark/ml/util/ReadWrite.scala#L166-L176

Inherited ones are intentionally omitted here for simplicity. They are inherited and implemented in multiple places.

## Backward Compatibility

This keeps both source and binary backward compatibility. I considered enforcing `...Impl` by leaving them abstract methods, but decided to leave a body that throws `UnsupportedOperationException` so that we can keep full source and binary compatibility.

- From the user-facing API perspective, there is no difference. `...Impl` methods are protected and not visible to end users.
- From the developer API perspective, if some developers want to override the `...` methods instead of `...Impl`, that's still fine. It only does n
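To make the wrapper pattern above concrete, here is a minimal sketch of what `withFitEvent` could look like. It follows the PR description's pseudocode and the `FitStart`/`FitEnd` case classes listed in the earlier test report, but the body is an illustration, not the PR's actual implementation; since `SparkContext.listenerBus` is `private[spark]`, it would only compile inside the Spark source tree (hence the package declaration):

```scala
package org.apache.spark.ml

import org.apache.spark.SparkContext
import org.apache.spark.scheduler.SparkListenerEvent
import org.apache.spark.sql.Dataset

// Event definitions mirroring those listed in the test report above.
sealed trait MLEvent extends SparkListenerEvent
case class FitStart[M <: Model[M]](estimator: Estimator[M], dataset: Dataset[_]) extends MLEvent
case class FitEnd[M <: Model[M]](estimator: Estimator[M], model: M) extends MLEvent

object MLEvents {
  // Run `body` (e.g. the protected fitImpl) between a start and an end event.
  def withFitEvent[M <: Model[M]](
      estimator: Estimator[M], dataset: Dataset[_])(body: => M): M = {
    val bus = SparkContext.getOrCreate().listenerBus
    bus.post(FitStart(estimator, dataset))
    val model = body
    bus.post(FitEnd(estimator, model))
    model
  }
}
```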
[GitHub] spark issue #23260: [SPARK-26311][YARN] New feature: custom log URL for stdo...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/23260 **[Test build #99857 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99857/testReport)** for PR 23260 at commit [`65cc6a3`](https://github.com/apache/spark/commit/65cc6a32729cccba340f66c766c7255be4d7f356).
[GitHub] spark issue #23260: [SPARK-26311][YARN] New feature: custom log URL for stdo...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23260 Can one of the admins verify this patch?
[GitHub] spark pull request #23260: [SPARK-26311][YARN] New feature: custom log URL f...
GitHub user HeartSaVioR opened a pull request: https://github.com/apache/spark/pull/23260 [SPARK-26311][YARN] New feature: custom log URL for stdout/stderr

## What changes were proposed in this pull request?

This patch proposes adding a new configuration for YARN mode: a custom log URL. This enables end users to point application logs to an external log service, which can keep serving logs even when the NodeManager becomes unavailable. Some pre-defined patterns are available for the custom log URL and can be specified like path variables.

## How was this patch tested?

Manual test. The run below changes executor log URLs in the UI pages.

```
./bin/spark-submit --conf spark.yarn.custom.log.url="{{HttpScheme}}{{NodeHttpAddress}}/test/cluster/{{ClusterId}}/container/{{ContainerId}}/user/{{User}}/filename/{{FileName}}" --class org.apache.spark.examples.SparkPi examples/jars/spark-examples_2.11-.jar
```

An example of a stdout log URL is below:

`http://node-address:node-port`/test/cluster/`workload1`/container/`container_e08_1542798098040_0012_01_02`/user/`spark`/filename/`stdout`

commit 65cc6a32729cccba340f66c766c7255be4d7f356
Author: Jungtaek Lim (HeartSaVioR)
Date: 2018-12-08T06:32:46Z

    [SPARK-26311][YARN] New feature: custom log URL for stdout/stderr
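The `{{Pattern}}` placeholders above are plain token substitutions. A hedged sketch of the idea (hypothetical helper and values, not the PR's implementation; at runtime the values would come from YARN container info):

```scala
object LogUrlSketch {
  def main(args: Array[String]): Unit = {
    // Hypothetical values standing in for real YARN container attributes.
    val values = Map(
      "HttpScheme" -> "http://",
      "NodeHttpAddress" -> "node-address:node-port",
      "ClusterId" -> "workload1",
      "ContainerId" -> "container_e08_1542798098040_0012_01_000002",
      "User" -> "spark",
      "FileName" -> "stdout")
    val pattern =
      "{{HttpScheme}}{{NodeHttpAddress}}/test/cluster/{{ClusterId}}/container/{{ContainerId}}" +
        "/user/{{User}}/filename/{{FileName}}"
    // Replace each {{Token}} with its looked-up value.
    val url = """\{\{(\w+)\}\}""".r.replaceAllIn(pattern, m => values(m.group(1)))
    println(url)
  }
}
```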
[GitHub] spark pull request #23258: [SPARK-23375][SQL][FOLLOWUP][TEST] Test Sort metr...
Github user seancxmao commented on a diff in the pull request: https://github.com/apache/spark/pull/23258#discussion_r239995952

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/metric/SQLMetricsSuite.scala ---
@@ -182,10 +182,13 @@ class SQLMetricsSuite extends SparkFunSuite with SQLMetricsTestUtils with Shared
   }

   test("Sort metrics") {
-    // Assume the execution plan is
-    // WholeStageCodegen(nodeId = 0, Range(nodeId = 2) -> Sort(nodeId = 1))
-    val ds = spark.range(10).sort('id)
-    testSparkPlanMetrics(ds.toDF(), 2, Map.empty)
+    // Assume the execution plan with node id is
+    // Sort(nodeId = 0)
+    //   Exchange(nodeId = 1)
+    //     Range(nodeId = 2)
+    val df = spark.range(9, -1, -1).sort('id).toDF()
--- End diff --

Using `Seq` instead of `Range` makes things simpler and more readable. I'll change to use `Seq`.
[GitHub] spark issue #23258: [SPARK-23375][SQL][FOLLOWUP][TEST] Test Sort metrics whi...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/23258 **[Test build #99856 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99856/testReport)** for PR 23258 at commit [`6e36336`](https://github.com/apache/spark/commit/6e3633620af4e624c8e562ff290e4264cf33eb8b).
[GitHub] spark pull request #23258: [SPARK-23375][SQL][FOLLOWUP][TEST] Test Sort metr...
Github user seancxmao commented on a diff in the pull request: https://github.com/apache/spark/pull/23258#discussion_r239995901

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/metric/SQLMetricsSuite.scala ---
@@ -182,10 +182,13 @@ class SQLMetricsSuite extends SparkFunSuite with SQLMetricsTestUtils with Shared
   }

   test("Sort metrics") {
-    // Assume the execution plan is
-    // WholeStageCodegen(nodeId = 0, Range(nodeId = 2) -> Sort(nodeId = 1))
-    val ds = spark.range(10).sort('id)
-    testSparkPlanMetrics(ds.toDF(), 2, Map.empty)
+    // Assume the execution plan with node id is
+    // Sort(nodeId = 0)
+    //   Exchange(nodeId = 1)
+    //     Range(nodeId = 2)
+    val df = spark.range(9, -1, -1).sort('id).toDF()
+    testSparkPlanMetrics(df, 2, Map.empty)
+    df.queryExecution.executedPlan.find(_.isInstanceOf[SortExec]).getOrElse(assert(false))
--- End diff --

OK, I think this is more readable. Fixed.
[GitHub] spark pull request #23258: [SPARK-23375][SQL][FOLLOWUP][TEST] Test Sort metr...
Github user seancxmao commented on a diff in the pull request: https://github.com/apache/spark/pull/23258#discussion_r239995882

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/metric/SQLMetricsSuite.scala ---
@@ -26,7 +26,7 @@ import org.apache.spark.SparkFunSuite
 import org.apache.spark.sql._
 import org.apache.spark.sql.catalyst.expressions.aggregate.{Final, Partial}
 import org.apache.spark.sql.catalyst.plans.logical.LocalRelation
-import org.apache.spark.sql.execution.{FilterExec, RangeExec, SparkPlan, WholeStageCodegenExec}
+import org.apache.spark.sql.execution._
--- End diff --

Ok, unfolded.
[GitHub] spark pull request #23207: [SPARK-26193][SQL] Implement shuffle write metric...
Github user xuanyuanking commented on a diff in the pull request: https://github.com/apache/spark/pull/23207#discussion_r239995006

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/limit.scala ---
@@ -38,13 +38,21 @@ case class CollectLimitExec(limit: Int, child: SparkPlan) extends UnaryExecNode
   override def outputPartitioning: Partitioning = SinglePartition
   override def executeCollect(): Array[InternalRow] = child.executeTake(limit)
   private val serializer: Serializer = new UnsafeRowSerializer(child.output.size)
-  override lazy val metrics = SQLShuffleMetricsReporter.createShuffleReadMetrics(sparkContext)
+  private lazy val writeMetrics =
+    SQLShuffleWriteMetricsReporter.createShuffleWriteMetrics(sparkContext)
+  private lazy val readMetrics =
+    SQLShuffleMetricsReporter.createShuffleReadMetrics(sparkContext)
--- End diff --

Yeah, that was done and reverted in https://github.com/apache/spark/pull/23207/commits/7d104ebe854effb3d8ceb63cd408b9749cee1a8a; I will separate it into another PR after this.
[GitHub] spark issue #22952: [SPARK-20568][SS] Provide option to clean up completed f...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22952 **[Test build #99855 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99855/testReport)** for PR 22952 at commit [`998e769`](https://github.com/apache/spark/commit/998e769c3b552c39736af7814f60be895dbd90d4).
[GitHub] spark issue #22952: [SPARK-20568][SS] Provide option to clean up completed f...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22952 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99851/
[GitHub] spark issue #22952: [SPARK-20568][SS] Provide option to clean up completed f...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22952 Merged build finished. Test FAILed.
[GitHub] spark issue #22952: [SPARK-20568][SS] Provide option to clean up completed f...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22952 **[Test build #99851 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99851/testReport)** for PR 22952 at commit [`22dce0c`](https://github.com/apache/spark/commit/22dce0c1d17de360c0d887a0352c1e8f57761db7).
* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
  * ` class FileStreamSourceCleaner(fileSystem: FileSystem, sourcePath: Path,`
[GitHub] spark pull request #23242: [SPARK-26285][CORE] accumulator metrics sources f...
Github user redsanket commented on a diff in the pull request: https://github.com/apache/spark/pull/23242#discussion_r239994508

--- Diff: examples/src/main/scala/org/apache/spark/examples/AccumulatorMetricsTest.scala ---
@@ -0,0 +1,68 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+// scalastyle:off println
+package org.apache.spark.examples
+
+import org.apache.spark.metrics.source.{DoubleAccumulatorSource, LongAccumulatorSource}
+import org.apache.spark.sql.SparkSession
+
+/**
+ * Usage: AccumulatorMetricsTest [partitions] [numElem] [blockSize]
+ */
+object AccumulatorMetricsTest {
--- End diff --

Naming an example `...Test` is a bit confusing, I think... thoughts?
[GitHub] spark pull request #23259: [SPARK-26215][SQL][WIP] Define reserved/non-reser...
Github user maropu commented on a diff in the pull request: https://github.com/apache/spark/pull/23259#discussion_r239994423

--- Diff: sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4 ---
@@ -769,7 +774,7 @@ nonReserved
     | REVOKE | GRANT | LOCK | UNLOCK | MSCK | REPAIR | RECOVER | EXPORT | IMPORT | LOAD | VALUES
     | COMMENT | ROLE | ROLES | COMPACTIONS | PRINCIPALS | TRANSACTIONS | INDEX | INDEXES | LOCKS
     | OPTION | LOCAL | INPATH | ASC | DESC | LIMIT | RENAME | SETS
-    | AT | NULLS | OVERWRITE | ALL | ANY | ALTER | AS | BETWEEN | BY | CREATE | DELETE
+    | AT | NULLS | OVERWRITE | ANY | ALTER | AS | BETWEEN | BY | CREATE | DELETE
--- End diff --

Yeah, thanks. You're right.
[GitHub] spark pull request #23259: [SPARK-26215][SQL][WIP] Define reserved/non-reser...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/23259#discussion_r239994385

--- Diff: sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4 ---
@@ -769,7 +774,7 @@ nonReserved
     | REVOKE | GRANT | LOCK | UNLOCK | MSCK | REPAIR | RECOVER | EXPORT | IMPORT | LOAD | VALUES
     | COMMENT | ROLE | ROLES | COMPACTIONS | PRINCIPALS | TRANSACTIONS | INDEX | INDEXES | LOCKS
     | OPTION | LOCAL | INPATH | ASC | DESC | LIMIT | RENAME | SETS
-    | AT | NULLS | OVERWRITE | ALL | ANY | ALTER | AS | BETWEEN | BY | CREATE | DELETE
+    | AT | NULLS | OVERWRITE | ANY | ALTER | AS | BETWEEN | BY | CREATE | DELETE
--- End diff --

Doesn't `ANY` move to `reserved`?
[GitHub] spark pull request #23253: [SPARK-26303][SQL] Return partial results for bad...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/23253#discussion_r239994326

--- Diff: docs/sql-migration-guide-upgrade.md ---
@@ -37,6 +37,8 @@ displayTitle: Spark SQL Upgrading Guide
   - In Spark version 2.4 and earlier, CSV datasource converts a malformed CSV string to a row with all `null`s in the PERMISSIVE mode. Since Spark 3.0, returned row can contain non-`null` fields if some of CSV column values were parsed and converted to desired types successfully.

+  - In Spark version 2.4 and earlier, JSON datasource and JSON functions like `from_json` convert a bad JSON record to a row with all `null`s in the PERMISSIVE mode when specified schema is `StructType`. Since Spark 3.0, returned row can contain non-`null` fields if some of JSON column values were parsed and converted to desired types successfully.
+
--- End diff --

What if there is no corrupt column?
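A small, self-contained illustration of the behavior change described in this migration note (a sketch under stated assumptions: the record is cut off mid-value so only field `a` parses; the app boilerplate is hypothetical):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types.{IntegerType, StringType, StructType}

object PartialResultDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("demo").getOrCreate()
    import spark.implicits._

    val schema = new StructType().add("a", IntegerType).add("b", StringType)
    // The record is truncated after `"b":`, so field "b" cannot be parsed.
    val df = Seq("""{"a": 1, "b":""").toDF("json")

    // PERMISSIVE is the default parse mode. In 2.4 this yields a struct of
    // all nulls; with the change above, the parsed field "a" is kept.
    df.select(from_json($"json", schema)).show(false)
    spark.stop()
  }
}
```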
[GitHub] spark issue #23259: [SPARK-26215][SQL][WIP] Define reserved/non-reserved key...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/23259 **[Test build #99854 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99854/testReport)** for PR 23259 at commit [`01bc383`](https://github.com/apache/spark/commit/01bc38347496a6194b46ace0feb7d2cd1adb614e).
[GitHub] spark issue #23259: [SPARK-26215][SQL][WIP] Define reserved/non-reserved key...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23259 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5876/
[GitHub] spark issue #23259: [SPARK-26215][SQL][WIP] Define reserved/non-reserved key...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23259 Merged build finished. Test PASSed.
[GitHub] spark issue #23259: [SPARK-26215][SQL][WIP] Define reserved/non-reserved key...
Github user maropu commented on the issue: https://github.com/apache/spark/pull/23259 To help discuss this topic smoothly, I made this PR. Any comments/suggestions are welcome. cc: @gatorsmile @cloud-fan @viirya
[GitHub] spark pull request #23259: [SPARK-26215][SQL][WIP] Define reserved/non-reser...
GitHub user maropu opened a pull request: https://github.com/apache/spark/pull/23259 [SPARK-26215][SQL][WIP] Define reserved/non-reserved keywords based on the ANSI SQL standard

## What changes were proposed in this pull request?

This PR aims to define reserved/non-reserved keywords based on the ANSI SQL standard.

TODO:
- Where should we handle reserved keywords?
- Which SQL standard does Spark SQL follow (e.g., 2011 or 2016)?
- Where should we document the list of reserved/non-reserved keywords?
- Others?

## How was this patch tested?

Added tests in `TableIdentifierParserSuite`.

commit 01bc38347496a6194b46ace0feb7d2cd1adb614e
Author: Takeshi Yamamuro
Date: 2018-12-06T08:04:49Z

    WIP: SQL Reserved/Non-Reserved Key Words
[GitHub] spark issue #20146: [SPARK-11215][ML] Add multiple columns support to String...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/20146 Thanks @holdenk for reviewing! I've resolved some comments and replied to the others. Please take a look. Thanks.
[GitHub] spark pull request #20146: [SPARK-11215][ML] Add multiple columns support to...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/20146#discussion_r239994064

--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala ---
@@ -130,21 +159,60 @@ class StringIndexer @Since("1.4.0") (
   @Since("1.4.0")
   def setOutputCol(value: String): this.type = set(outputCol, value)

+  /** @group setParam */
+  @Since("2.4.0")
+  def setInputCols(value: Array[String]): this.type = set(inputCols, value)
+
+  /** @group setParam */
+  @Since("2.4.0")
+  def setOutputCols(value: Array[String]): this.type = set(outputCols, value)
+
+  private def countByValue(
+      dataset: Dataset[_],
+      inputCols: Array[String]): Array[OpenHashMap[String, Long]] = {
+
+    val aggregator = new StringIndexerAggregator(inputCols.length)
+    implicit val encoder = Encoders.kryo[Array[OpenHashMap[String, Long]]]
+
+    dataset.select(inputCols.map(col(_).cast(StringType)): _*)
+      .toDF
+      .groupBy().agg(aggregator.toColumn)
+      .as[Array[OpenHashMap[String, Long]]]
+      .collect()(0)
+  }
+
   @Since("2.0.0")
   override def fit(dataset: Dataset[_]): StringIndexerModel = {
     transformSchema(dataset.schema, logging = true)
-    val values = dataset.na.drop(Array($(inputCol)))
-      .select(col($(inputCol)).cast(StringType))
-      .rdd.map(_.getString(0))
-    val labels = $(stringOrderType) match {
-      case StringIndexer.frequencyDesc => values.countByValue().toSeq.sortBy(-_._2)
-        .map(_._1).toArray
-      case StringIndexer.frequencyAsc => values.countByValue().toSeq.sortBy(_._2)
-        .map(_._1).toArray
-      case StringIndexer.alphabetDesc => values.distinct.collect.sortWith(_ > _)
-      case StringIndexer.alphabetAsc => values.distinct.collect.sortWith(_ < _)
-    }
-    copyValues(new StringIndexerModel(uid, labels).setParent(this))
+
+    val (inputCols, _) = getInOutCols()
+
+    val filteredDF = dataset.na.drop(inputCols)
+
+    // In case of equal frequency when frequencyDesc/Asc, we further sort the strings by alphabet.
+    val labelsArray = $(stringOrderType) match {
+      case StringIndexer.frequencyDesc =>
+        countByValue(filteredDF, inputCols).map { counts =>
+          counts.toSeq.sortBy(_._1).sortBy(-_._2).map(_._1).toArray
--- End diff --

As one sorts by string and the other sorts by count, can we replace them with a compound expression in a single `sortBy`?
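For illustration, the single-pass variant suggested here could sort by a tuple key; tuple `Ordering` is lexicographic, so it matches the stable two-pass `sortBy(_._1).sortBy(-_._2)`. A sketch, assuming the counts are `(String, Long)` pairs:

```scala
object CompoundSortSketch {
  def main(args: Array[String]): Unit = {
    // frequencyDesc: most frequent first, ties broken alphabetically ascending.
    val counts = Seq(("b", 3L), ("a", 3L), ("c", 1L))
    val labels = counts.sortBy { case (label, count) => (-count, label) }.map(_._1).toArray
    println(labels.mkString(", "))  // prints: a, b, c
  }
}
```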
[GitHub] spark pull request #23258: [SPARK-23375][SQL][FOLLOWUP][TEST] Test Sort metr...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/23258#discussion_r239993927

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/metric/SQLMetricsSuite.scala ---
@@ -182,10 +182,13 @@ class SQLMetricsSuite extends SparkFunSuite with SQLMetricsTestUtils with Shared
   }

   test("Sort metrics") {
-    // Assume the execution plan is
-    // WholeStageCodegen(nodeId = 0, Range(nodeId = 2) -> Sort(nodeId = 1))
-    val ds = spark.range(10).sort('id)
-    testSparkPlanMetrics(ds.toDF(), 2, Map.empty)
+    // Assume the execution plan with node id is
+    // Sort(nodeId = 0)
+    //   Exchange(nodeId = 1)
+    //     Range(nodeId = 2)
+    val df = spark.range(9, -1, -1).sort('id).toDF()
--- End diff --

Either is fine with me, as we now add an assert to make sure the Sort node exists.
[GitHub] spark pull request #23258: [SPARK-23375][SQL][FOLLOWUP][TEST] Test Sort metr...
Github user maropu commented on a diff in the pull request: https://github.com/apache/spark/pull/23258#discussion_r239993889

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/metric/SQLMetricsSuite.scala ---
@@ -182,10 +182,13 @@ class SQLMetricsSuite extends SparkFunSuite with SQLMetricsTestUtils with Shared
   }

   test("Sort metrics") {
-    // Assume the execution plan is
-    // WholeStageCodegen(nodeId = 0, Range(nodeId = 2) -> Sort(nodeId = 1))
-    val ds = spark.range(10).sort('id)
-    testSparkPlanMetrics(ds.toDF(), 2, Map.empty)
+    // Assume the execution plan with node id is
+    // Sort(nodeId = 0)
+    //   Exchange(nodeId = 1)
+    //     Range(nodeId = 2)
+    val df = spark.range(9, -1, -1).sort('id).toDF()
--- End diff --

Do we need to use `range` here? How about just writing `Seq(1, 3, 2, ...).toDF("id")`?
[GitHub] spark issue #20146: [SPARK-11215][ML] Add multiple columns support to String...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20146 **[Test build #99853 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99853/testReport)** for PR 20146 at commit [`301fa4c`](https://github.com/apache/spark/commit/301fa4cbb4d62d9a180dcbafbe0d2b68dac5a3c8).
[GitHub] spark issue #20146: [SPARK-11215][ML] Add multiple columns support to String...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20146 Merged build finished. Test PASSed.
[GitHub] spark issue #20146: [SPARK-11215][ML] Add multiple columns support to String...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20146 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5875/
[GitHub] spark pull request #23258: [SPARK-23375][SQL][FOLLOWUP][TEST] Test Sort metr...
Github user maropu commented on a diff in the pull request: https://github.com/apache/spark/pull/23258#discussion_r239993718

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/metric/SQLMetricsSuite.scala ---
@@ -182,10 +182,13 @@ class SQLMetricsSuite extends SparkFunSuite with SQLMetricsTestUtils with Shared
   }

   test("Sort metrics") {
-    // Assume the execution plan is
-    // WholeStageCodegen(nodeId = 0, Range(nodeId = 2) -> Sort(nodeId = 1))
-    val ds = spark.range(10).sort('id)
-    testSparkPlanMetrics(ds.toDF(), 2, Map.empty)
+    // Assume the execution plan with node id is
+    // Sort(nodeId = 0)
+    //   Exchange(nodeId = 1)
+    //     Range(nodeId = 2)
+    val df = spark.range(9, -1, -1).sort('id).toDF()
+    testSparkPlanMetrics(df, 2, Map.empty)
+    df.queryExecution.executedPlan.find(_.isInstanceOf[SortExec]).getOrElse(assert(false))
--- End diff --

`assert(df.queryExecution.executedPlan.find(_.isInstanceOf[SortExec]).isDefined)`?
[GitHub] spark pull request #23258: [SPARK-23375][SQL][FOLLOWUP][TEST] Test Sort metr...
Github user maropu commented on a diff in the pull request: https://github.com/apache/spark/pull/23258#discussion_r239993561

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/metric/SQLMetricsSuite.scala ---
@@ -26,7 +26,7 @@ import org.apache.spark.SparkFunSuite
 import org.apache.spark.sql._
 import org.apache.spark.sql.catalyst.expressions.aggregate.{Final, Partial}
 import org.apache.spark.sql.catalyst.plans.logical.LocalRelation
-import org.apache.spark.sql.execution.{FilterExec, RangeExec, SparkPlan, WholeStageCodegenExec}
+import org.apache.spark.sql.execution._
--- End diff --

nit: unfold?
[GitHub] spark issue #23258: [SPARK-23375][SQL][FOLLOWUP][TEST] Test Sort metrics whi...
Github user maropu commented on the issue: https://github.com/apache/spark/pull/23258 cc: @mgaido91
[GitHub] spark pull request #20146: [SPARK-11215][ML] Add multiple columns support to...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/20146#discussion_r239993480

--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala ---
@@ -130,21 +159,60 @@ class StringIndexer @Since("1.4.0") (
   @Since("1.4.0")
   def setOutputCol(value: String): this.type = set(outputCol, value)

+  /** @group setParam */
+  @Since("2.4.0")
+  def setInputCols(value: Array[String]): this.type = set(inputCols, value)
+
+  /** @group setParam */
+  @Since("2.4.0")
+  def setOutputCols(value: Array[String]): this.type = set(outputCols, value)
+
+  private def countByValue(
+      dataset: Dataset[_],
+      inputCols: Array[String]): Array[OpenHashMap[String, Long]] = {
+
+    val aggregator = new StringIndexerAggregator(inputCols.length)
+    implicit val encoder = Encoders.kryo[Array[OpenHashMap[String, Long]]]
+
+    dataset.select(inputCols.map(col(_).cast(StringType)): _*)
+      .toDF
+      .groupBy().agg(aggregator.toColumn)
+      .as[Array[OpenHashMap[String, Long]]]
+      .collect()(0)
+  }
+
   @Since("2.0.0")
   override def fit(dataset: Dataset[_]): StringIndexerModel = {
     transformSchema(dataset.schema, logging = true)
-    val values = dataset.na.drop(Array($(inputCol)))
-      .select(col($(inputCol)).cast(StringType))
-      .rdd.map(_.getString(0))
-    val labels = $(stringOrderType) match {
-      case StringIndexer.frequencyDesc => values.countByValue().toSeq.sortBy(-_._2)
-        .map(_._1).toArray
-      case StringIndexer.frequencyAsc => values.countByValue().toSeq.sortBy(_._2)
-        .map(_._1).toArray
-      case StringIndexer.alphabetDesc => values.distinct.collect.sortWith(_ > _)
-      case StringIndexer.alphabetAsc => values.distinct.collect.sortWith(_ < _)
-    }
-    copyValues(new StringIndexerModel(uid, labels).setParent(this))
+
+    val (inputCols, _) = getInOutCols()
+
+    val filteredDF = dataset.na.drop(inputCols)
+
+    // In case of equal frequency when frequencyDesc/Asc, we further sort the strings by alphabet.
--- End diff --

I'll also add it to the ML migration document.
[GitHub] spark issue #23218: [SPARK-26266][BUILD] Update to Scala 2.12.8
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23218 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99849/
[GitHub] spark issue #23218: [SPARK-26266][BUILD] Update to Scala 2.12.8
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23218 Merged build finished. Test PASSed.
[GitHub] spark issue #23218: [SPARK-26266][BUILD] Update to Scala 2.12.8
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/23218 **[Test build #99849 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99849/testReport)** for PR 23218 at commit [`b667d37`](https://github.com/apache/spark/commit/b667d37e9ee2d8cdce459806925cdc0fe725b7bf).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request #20146: [SPARK-11215][ML] Add multiple columns support to...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/20146#discussion_r239992942

--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala ---
@@ -310,11 +439,23 @@ object StringIndexerModel extends MLReadable[StringIndexerModel] {
     override def load(path: String): StringIndexerModel = {
       val metadata = DefaultParamsReader.loadMetadata(path, sc, className)
       val dataPath = new Path(path, "data").toString
-      val data = sparkSession.read.parquet(dataPath)
-        .select("labels")
-        .head()
-      val labels = data.getAs[Seq[String]](0).toArray
-      val model = new StringIndexerModel(metadata.uid, labels)
+
+      val (majorVersion, minorVersion) = majorMinorVersion(metadata.sparkVersion)
+      val labelsArray = if (majorVersion < 2 || (majorVersion == 2 && minorVersion <= 3)) {
--- End diff --

But this needs to change for Spark 3.0 now.
[GitHub] spark pull request #23224: [SPARK-26277][SQL][TEST] WholeStageCodegen metric...
Github user seancxmao commented on a diff in the pull request: https://github.com/apache/spark/pull/23224#discussion_r239992863

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/metric/SQLMetricsSuite.scala ---
@@ -80,8 +80,10 @@ class SQLMetricsSuite extends SparkFunSuite with SQLMetricsTestUtils with Shared
   // Assume the execution plan is
   // WholeStageCodegen(nodeId = 0, Range(nodeId = 2) -> Filter(nodeId = 1))
   // TODO: update metrics in generated operators
-  val ds = spark.range(10).filter('id < 5)
-  testSparkPlanMetrics(ds.toDF(), 1, Map.empty)
+  val df = spark.range(10).filter('id < 5).toDF()
+  testSparkPlanMetrics(df, 1, Map.empty, true)
+  df.queryExecution.executedPlan.find(_.isInstanceOf[WholeStageCodegenExec])
+    .getOrElse(assert(false))
--- End diff --

Thank you @viirya. Very good suggestions. After investigation, besides the whole-stage codegen related issue, I found another one. #20560/[SPARK-23375](https://issues.apache.org/jira/browse/SPARK-23375) introduced an optimizer rule to eliminate redundant Sorts. For the test case named "Sort metrics" in `SQLMetricsSuite`, because a range is already sorted, the Sort is removed by `RemoveRedundantSorts`, which makes the test case meaningless. This seems to be a pretty different issue, so I opened a new PR. See #23258 for details.
[GitHub] spark pull request #20146: [SPARK-11215][ML] Add multiple columns support to...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/20146#discussion_r239992845

--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala ---
@@ -310,11 +439,23 @@ object StringIndexerModel extends MLReadable[StringIndexerModel] {
     override def load(path: String): StringIndexerModel = {
       val metadata = DefaultParamsReader.loadMetadata(path, sc, className)
       val dataPath = new Path(path, "data").toString
-      val data = sparkSession.read.parquet(dataPath)
-        .select("labels")
-        .head()
-      val labels = data.getAs[Seq[String]](0).toArray
-      val model = new StringIndexerModel(metadata.uid, labels)
+
+      val (majorVersion, minorVersion) = majorMinorVersion(metadata.sparkVersion)
+      val labelsArray = if (majorVersion < 2 || (majorVersion == 2 && minorVersion <= 3)) {
--- End diff --

This is for loading an old StringIndexerModel saved by previous Spark versions. The previous model has `labels`, but the new model has `labelsArray`.
[GitHub] spark issue #23258: [SPARK-23375][SQL][FOLLOWUP][TEST] Test Sort metrics whi...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/23258 **[Test build #99852 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99852/testReport)** for PR 23258 at commit [`408ccf8`](https://github.com/apache/spark/commit/408ccf88325c23c6f2f6119cd6ba99c5056f95a2).
[GitHub] spark issue #23258: [SPARK-23375][SQL][FOLLOWUP][TEST] Test Sort metrics whi...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23258 Can one of the admins verify this patch?
[GitHub] spark pull request #23258: [SPARK-23375][SQL][FOLLOWUP][TEST] Test Sort metr...
GitHub user seancxmao opened a pull request: https://github.com/apache/spark/pull/23258 [SPARK-23375][SQL][FOLLOWUP][TEST] Test Sort metrics while Sort is missing

## What changes were proposed in this pull request?

#20560/[SPARK-23375](https://issues.apache.org/jira/browse/SPARK-23375) introduced an optimizer rule to eliminate redundant Sorts. For the test case named "Sort metrics" in `SQLMetricsSuite`, because a range is already sorted, the Sort is removed by `RemoveRedundantSorts`, which makes this test case meaningless. This PR modifies the query used to test Sort metrics and checks that a Sort exists in the plan.

## How was this patch tested?

Modified the existing test case.

commit 408ccf88325c23c6f2f6119cd6ba99c5056f95a2
Author: seancxmao
Date: 2018-12-08T03:58:21Z

    [SPARK-23375][SQL][FOLLOWUP][TEST] Test Sort metrics while Sort is missing
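The plan difference is easy to check in a spark-shell. The following sketch restates the behavior described above under stated assumptions (a running `spark` session; this is not the PR's actual test code):

```scala
import org.apache.spark.sql.execution.SortExec
import spark.implicits._

// range(10) already satisfies the ascending order, so RemoveRedundantSorts
// drops the Sort node and the old test exercised no Sort at all.
val noSort = spark.range(10).sort('id).queryExecution.executedPlan
assert(noSort.find(_.isInstanceOf[SortExec]).isEmpty)

// range(9, -1, -1) emits ids in descending order, so the ascending Sort survives.
val hasSort = spark.range(9, -1, -1).sort('id).queryExecution.executedPlan
assert(hasSort.find(_.isInstanceOf[SortExec]).isDefined)
```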
[GitHub] spark pull request #20146: [SPARK-11215][ML] Add multiple columns support to...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/20146#discussion_r239992579

--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala ---
@@ -130,21 +159,60 @@ class StringIndexer @Since("1.4.0") (
   @Since("1.4.0")
   def setOutputCol(value: String): this.type = set(outputCol, value)

+  /** @group setParam */
+  @Since("2.4.0")
+  def setInputCols(value: Array[String]): this.type = set(inputCols, value)
+
+  /** @group setParam */
+  @Since("2.4.0")
+  def setOutputCols(value: Array[String]): this.type = set(outputCols, value)
+
+  private def countByValue(
+      dataset: Dataset[_],
+      inputCols: Array[String]): Array[OpenHashMap[String, Long]] = {
+
+    val aggregator = new StringIndexerAggregator(inputCols.length)
+    implicit val encoder = Encoders.kryo[Array[OpenHashMap[String, Long]]]
+
+    dataset.select(inputCols.map(col(_).cast(StringType)): _*)
+      .toDF
+      .groupBy().agg(aggregator.toColumn)
+      .as[Array[OpenHashMap[String, Long]]]
+      .collect()(0)
+  }
+
   @Since("2.0.0")
   override def fit(dataset: Dataset[_]): StringIndexerModel = {
     transformSchema(dataset.schema, logging = true)
-    val values = dataset.na.drop(Array($(inputCol)))
-      .select(col($(inputCol)).cast(StringType))
-      .rdd.map(_.getString(0))
-    val labels = $(stringOrderType) match {
-      case StringIndexer.frequencyDesc => values.countByValue().toSeq.sortBy(-_._2)
-        .map(_._1).toArray
-      case StringIndexer.frequencyAsc => values.countByValue().toSeq.sortBy(_._2)
-        .map(_._1).toArray
-      case StringIndexer.alphabetDesc => values.distinct.collect.sortWith(_ > _)
-      case StringIndexer.alphabetAsc => values.distinct.collect.sortWith(_ < _)
-    }
-    copyValues(new StringIndexerModel(uid, labels).setParent(this))
+
+    val (inputCols, _) = getInOutCols()
+
+    val filteredDF = dataset.na.drop(inputCols)
+
+    // In case of equal frequency when frequencyDesc/Asc, we further sort the strings by alphabet.
+    val labelsArray = $(stringOrderType) match {
+      case StringIndexer.frequencyDesc =>
+        countByValue(filteredDF, inputCols).map { counts =>
+          counts.toSeq.sortBy(_._1).sortBy(-_._2).map(_._1).toArray
+        }
+      case StringIndexer.frequencyAsc =>
+        countByValue(filteredDF, inputCols).map { counts =>
+          counts.toSeq.sortBy(_._1).sortBy(_._2).map(_._1).toArray
+        }
+      case StringIndexer.alphabetDesc =>
+        import dataset.sparkSession.implicits._
+        inputCols.map { inputCol =>
+          filteredDF.select(inputCol).distinct().sort(dataset(s"$inputCol").desc)
--- End diff --

That's good. I missed this.
[GitHub] spark issue #23196: [SPARK-26243][SQL] Use java.time API for parsing timesta...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23196 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23196: [SPARK-26243][SQL] Use java.time API for parsing timesta...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23196 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99848/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23196: [SPARK-26243][SQL] Use java.time API for parsing timesta...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/23196 **[Test build #99848 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99848/testReport)** for PR 23196 at commit [`244654b`](https://github.com/apache/spark/commit/244654b95ae789de83e853f74feade2a66adf432). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20146: [SPARK-11215][ML] Add multiple columns support to...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/20146#discussion_r239992360

--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala ---
@@ -79,26 +81,53 @@ private[feature] trait StringIndexerBase extends Params with HasHandleInvalid wi
   @Since("2.3.0")
   def getStringOrderType: String = $(stringOrderType)

-  /** Validates and transforms the input schema. */
-  protected def validateAndTransformSchema(schema: StructType): StructType = {
-    val inputColName = $(inputCol)
+  /** Returns the input and output column names corresponding in pair. */
+  private[feature] def getInOutCols(): (Array[String], Array[String]) = {
+    ParamValidators.checkSingleVsMultiColumnParams(this, Seq(outputCol), Seq(outputCols))
+
+    if (isSet(inputCol)) {
+      (Array($(inputCol)), Array($(outputCol)))
+    } else {
+      require($(inputCols).length == $(outputCols).length,
+        "The number of input columns does not match output columns")
+      ($(inputCols), $(outputCols))
+    }
+  }
+
+  private def validateAndTransformField(
+      schema: StructType,
+      inputColName: String,
+      outputColName: String): StructField = {
     val inputDataType = schema(inputColName).dataType
     require(inputDataType == StringType || inputDataType.isInstanceOf[NumericType],
       s"The input column $inputColName must be either string type or numeric type, " +
         s"but got $inputDataType.")
-    val inputFields = schema.fields
-    val outputColName = $(outputCol)
-    require(inputFields.forall(_.name != outputColName),
+    require(schema.fields.forall(_.name != outputColName),
--- End diff --

Sounds good to me. Added.

--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
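[Editor's note] As context for the single-versus-multi column validation above, a usage sketch with hypothetical column names; the multi-column setters are the ones being added in this PR, and `getInOutCols()` rejects mixing the two styles or arrays of different lengths:

    import org.apache.spark.ml.feature.StringIndexer

    // Either set a single input/output column pair...
    val single = new StringIndexer()
      .setInputCol("category")
      .setOutputCol("categoryIndex")

    // ...or matching arrays of input and output columns.
    val multi = new StringIndexer()
      .setInputCols(Array("c1", "c2"))
      .setOutputCols(Array("i1", "i2"))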
[GitHub] spark pull request #20146: [SPARK-11215][ML] Add multiple columns support to...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/20146#discussion_r239992378

--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala ---
@@ -130,21 +159,60 @@ class StringIndexer @Since("1.4.0") (
   @Since("1.4.0")
   def setOutputCol(value: String): this.type = set(outputCol, value)

+  /** @group setParam */
+  @Since("2.4.0")
+  def setInputCols(value: Array[String]): this.type = set(inputCols, value)
+
+  /** @group setParam */
+  @Since("2.4.0")
+  def setOutputCols(value: Array[String]): this.type = set(outputCols, value)
+
+  private def countByValue(
+      dataset: Dataset[_],
+      inputCols: Array[String]): Array[OpenHashMap[String, Long]] = {
+
+    val aggregator = new StringIndexerAggregator(inputCols.length)
+    implicit val encoder = Encoders.kryo[Array[OpenHashMap[String, Long]]]
+
+    dataset.select(inputCols.map(col(_).cast(StringType)): _*)
+      .toDF
+      .groupBy().agg(aggregator.toColumn)
+      .as[Array[OpenHashMap[String, Long]]]
+      .collect()(0)
+  }
+
   @Since("2.0.0")
   override def fit(dataset: Dataset[_]): StringIndexerModel = {
     transformSchema(dataset.schema, logging = true)
-    val values = dataset.na.drop(Array($(inputCol)))
-      .select(col($(inputCol)).cast(StringType))
-      .rdd.map(_.getString(0))
-    val labels = $(stringOrderType) match {
-      case StringIndexer.frequencyDesc => values.countByValue().toSeq.sortBy(-_._2)
-        .map(_._1).toArray
-      case StringIndexer.frequencyAsc => values.countByValue().toSeq.sortBy(_._2)
-        .map(_._1).toArray
-      case StringIndexer.alphabetDesc => values.distinct.collect.sortWith(_ > _)
-      case StringIndexer.alphabetAsc => values.distinct.collect.sortWith(_ < _)
-    }
-    copyValues(new StringIndexerModel(uid, labels).setParent(this))
+
+    val (inputCols, _) = getInOutCols()
+
+    val filteredDF = dataset.na.drop(inputCols)
+
+    // In case of equal frequency when frequencyDesc/Asc, we further sort the strings by alphabet.
--- End diff --

Moved to `stringOrderType`'s doc.

--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23196: [SPARK-26243][SQL] Use java.time API for parsing timesta...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23196 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23196: [SPARK-26243][SQL] Use java.time API for parsing timesta...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23196 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99847/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23196: [SPARK-26243][SQL] Use java.time API for parsing timesta...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/23196 **[Test build #99847 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99847/testReport)** for PR 23196 at commit [`6b6ea8a`](https://github.com/apache/spark/commit/6b6ea8a89128a8f459a846dddc48aa6af09a2c51). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20146: [SPARK-11215][ML] Add multiple columns support to...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/20146#discussion_r239991238

--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala ---
@@ -130,21 +159,60 @@ class StringIndexer @Since("1.4.0") (
   @Since("1.4.0")
   def setOutputCol(value: String): this.type = set(outputCol, value)

+  /** @group setParam */
+  @Since("2.4.0")
--- End diff --

Yes. Thanks.

--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23226: [SPARK-26286][TEST] Add MAXIMUM_PAGE_SIZE_BYTES exceptio...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/23226 **[Test build #4462 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4462/testReport)** for PR 23226 at commit [`486f2b5`](https://github.com/apache/spark/commit/486f2b5c8b51f0f12782d9c9d1907c96a9298c89). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23207: [SPARK-26193][SQL] Implement shuffle write metric...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/23207#discussion_r239990986

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/limit.scala ---
@@ -38,13 +38,21 @@ case class CollectLimitExec(limit: Int, child: SparkPlan) extends UnaryExecNode
   override def outputPartitioning: Partitioning = SinglePartition
   override def executeCollect(): Array[InternalRow] = child.executeTake(limit)
   private val serializer: Serializer = new UnsafeRowSerializer(child.output.size)
-  override lazy val metrics = SQLShuffleMetricsReporter.createShuffleReadMetrics(sparkContext)
+  private lazy val writeMetrics =
+    SQLShuffleWriteMetricsReporter.createShuffleWriteMetrics(sparkContext)
+  private lazy val readMetrics =
+    SQLShuffleMetricsReporter.createShuffleReadMetrics(sparkContext)
--- End diff --

I feel it is better to rename SQLShuffleMetricsReporter to SQLShuffleReadMetricsReporter so that it matches SQLShuffleWriteMetricsReporter. It can be done in a follow-up.

--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22952: [SPARK-20568][SS] Provide option to clean up completed f...
Github user HeartSaVioR commented on the issue: https://github.com/apache/spark/pull/22952 @zsxwing Please also take a look: I believe I addressed the glob overlap issue as well. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22952: [SPARK-20568][SS] Provide option to clean up completed f...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22952 **[Test build #99851 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99851/testReport)** for PR 22952 at commit [`22dce0c`](https://github.com/apache/spark/commit/22dce0c1d17de360c0d887a0352c1e8f57761db7). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23252: [SPARK-26239] File-based secret key loading for SASL.
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23252 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5874/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23252: [SPARK-26239] File-based secret key loading for SASL.
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23252 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23252: [SPARK-26239] File-based secret key loading for SASL.
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/23252 **[Test build #99850 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99850/testReport)** for PR 23252 at commit [`662e38e`](https://github.com/apache/spark/commit/662e38e6f1a01297d75132b088afb6a6a761f70a). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23252: [SPARK-26239] File-based secret key loading for S...
Github user mccheah commented on a diff in the pull request: https://github.com/apache/spark/pull/23252#discussion_r239990434 --- Diff: resource-managers/kubernetes/core/src/test/scala/org/apache/spark/deploy/k8s/features/BasicExecutorFeatureStepSuite.scala --- @@ -16,10 +16,13 @@ */ package org.apache.spark.deploy.k8s.features -import scala.collection.JavaConverters._ +import java.io.File +import java.nio.charset.StandardCharsets +import java.nio.file.Files import io.fabric8.kubernetes.api.model._ import org.scalatest.BeforeAndAfter +import scala.collection.JavaConverters._ --- End diff -- Think I fixed this. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23225: [SPARK-26287][CORE]Don't need to create an empty ...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/23225#discussion_r239990036

--- Diff: core/src/main/java/org/apache/spark/shuffle/sort/ShuffleExternalSorter.java ---
@@ -161,6 +161,10 @@ private void writeSortedFile(boolean isLastFile) {
     final ShuffleInMemorySorter.ShuffleSorterIterator sortedRecords =
       inMemSorter.getSortedIterator();

+    // If there are no sorted records, we don't need to create an empty spill file.
+    if (!sortedRecords.hasNext()) {
+      return;
+    }
--- End diff --

If you're going to short-circuit, why not do it at the start of the function and save the rest of the work done above?

--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
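[Editor's note] The suggestion above is the standard guard-clause refactor: bail out before doing any work whose result the guard would discard. A small self-contained Scala sketch of the shape being discussed (hypothetical names; the real method is Java and builds a sorted iterator before the emptiness check):

    def writeSortedFile(records: Seq[Int]): Option[String] = {
      // Early return before any sorting work, per the review suggestion.
      if (records.isEmpty) return None
      val sorted = records.sorted        // only pay for sorting when needed
      Some(sorted.mkString(","))         // stand-in for writing a spill file
    }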
[GitHub] spark pull request #23225: [SPARK-26287][CORE]Don't need to create an empty ...
Github user wangjiaochun commented on a diff in the pull request: https://github.com/apache/spark/pull/23225#discussion_r239989805

--- Diff: core/src/main/java/org/apache/spark/shuffle/sort/ShuffleExternalSorter.java ---
@@ -161,6 +161,10 @@ private void writeSortedFile(boolean isLastFile) {
     final ShuffleInMemorySorter.ShuffleSorterIterator sortedRecords =
       inMemSorter.getSortedIterator();

+    // If there are no sorted records, we don't need to create an empty spill file.
+    if (!sortedRecords.hasNext()) {
+      return;
+    }
--- End diff --

I think it's better not to do that, because it would change the original code style and doesn't make an appreciable difference in readability.

--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23252: [SPARK-26239] File-based secret key loading for SASL.
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23252 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23252: [SPARK-26239] File-based secret key loading for SASL.
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23252 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99845/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23218: [SPARK-26266][BUILD] Update to Scala 2.12.8
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/23218 **[Test build #4461 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4461/testReport)** for PR 23218 at commit [`b667d37`](https://github.com/apache/spark/commit/b667d37e9ee2d8cdce459806925cdc0fe725b7bf). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23252: [SPARK-26239] File-based secret key loading for SASL.
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/23252 **[Test build #99845 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99845/testReport)** for PR 23252 at commit [`a6a44c6`](https://github.com/apache/spark/commit/a6a44c6099154bad8e1776df13aa1b00b258215a). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23252: [SPARK-26239] File-based secret key loading for SASL.
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23252 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99846/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23252: [SPARK-26239] File-based secret key loading for SASL.
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23252 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23252: [SPARK-26239] File-based secret key loading for SASL.
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/23252 **[Test build #99846 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99846/testReport)** for PR 23252 at commit [`9443646`](https://github.com/apache/spark/commit/9443646e72b51a0cc4b94e98a495159942929c73). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23257: [SPARK-26310][SQL] Verify applicability of JSON options
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23257 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23257: [SPARK-26310][SQL] Verify applicability of JSON options
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23257 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99844/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23257: [SPARK-26310][SQL] Verify applicability of JSON options
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/23257 **[Test build #99844 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99844/testReport)** for PR 23257 at commit [`af15070`](https://github.com/apache/spark/commit/af1507093f42a206c2c22f6c40cf4c43290244b8). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23160: [SPARK-26196][SPARK-26281][WebUI] Total tasks title in t...
Github user shahidki31 commented on the issue: https://github.com/apache/spark/pull/23160 Thanks @srowen. This issue happens only in the master branch. Thank you all for the review and comments. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23256: [SPARK-24207][R] follow-up PR for SPARK-24207 to fix cod...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23256 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org