[GitHub] spark issue #23169: [SPARK-26103][SQL] Limit the length of debug strings for...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/23169 **[Test build #99876 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99876/testReport)** for PR 23169 at commit [`f6d0efc`](https://github.com/apache/spark/commit/f6d0efc7c0e1d461e5854c6e04f3347f174bf13a). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23169: [SPARK-26103][SQL] Limit the length of debug strings for...
Github user DaveDeCaprio commented on the issue: https://github.com/apache/spark/pull/23169 @HeartSaVioR I've updated the description in SQLConf.scala. Is there some other documentation that should be updated? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23264: [SPARK-26266][BUILD] Update to Scala 2.12.8 (branch-2.4)
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23264 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23264: [SPARK-26266][BUILD] Update to Scala 2.12.8 (branch-2.4)
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23264 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99871/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23264: [SPARK-26266][BUILD] Update to Scala 2.12.8 (branch-2.4)
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/23264 **[Test build #99871 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99871/testReport)** for PR 23264 at commit [`4427e9f`](https://github.com/apache/spark/commit/4427e9f183f5f0aae7e32643508b8a3b1c9bf234). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23262: [SPARK-26312][SQL]Converting converters in RDDConversion...
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/23262 Good catch, LGTM cc @cloud-fan --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23263: [SPARK-23674][ML] Adds Spark ML Events
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23263 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23263: [SPARK-23674][ML] Adds Spark ML Events
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23263 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99873/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23263: [SPARK-23674][ML] Adds Spark ML Events
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/23263 **[Test build #99873 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99873/testReport)** for PR 23263 at commit [`ba9db6e`](https://github.com/apache/spark/commit/ba9db6eb2c98a1c84f982a093ff982a030b9eab7). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23263: [SPARK-23674][ML] Adds Spark ML Events
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23263 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99872/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23263: [SPARK-23674][ML] Adds Spark ML Events
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23263 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23263: [SPARK-23674][ML] Adds Spark ML Events
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/23263 **[Test build #99872 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99872/testReport)** for PR 23263 at commit [`4f2fda2`](https://github.com/apache/spark/commit/4f2fda2272ee7e7d62df139c2f994ef7a122bf7c). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23253: [SPARK-26303][SQL] Return partial results for bad JSON r...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23253 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23253: [SPARK-26303][SQL] Return partial results for bad JSON r...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/23253 **[Test build #99875 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99875/testReport)** for PR 23253 at commit [`b19b3e1`](https://github.com/apache/spark/commit/b19b3e1ce25836a419b056a63bc320f1b82dc1b1). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23253: [SPARK-26303][SQL] Return partial results for bad JSON r...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23253 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5891/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23253: [SPARK-26303][SQL] Return partial results for bad JSON r...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23253 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23169: [SPARK-26103][SQL] Limit the length of debug strings for...
Github user HeartSaVioR commented on the issue: https://github.com/apache/spark/pull/23169 Thanks for addressong review comments. It looks great overall. We may want to document the new config so that we can guide setting the value to lower when end users suffer from memory pressure due to long physical plans in UI pages. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23253: [SPARK-26303][SQL] Return partial results for bad JSON r...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/23253 **[Test build #99874 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99874/testReport)** for PR 23253 at commit [`62a6795`](https://github.com/apache/spark/commit/62a6795713c64b48a6e56449e161da8dbf8ec0aa). * This patch **fails Scala style tests**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23253: [SPARK-26303][SQL] Return partial results for bad JSON r...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23253 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99874/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23253: [SPARK-26303][SQL] Return partial results for bad JSON r...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23253 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5890/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23253: [SPARK-26303][SQL] Return partial results for bad JSON r...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23253 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23253: [SPARK-26303][SQL] Return partial results for bad JSON r...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/23253 **[Test build #99874 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99874/testReport)** for PR 23253 at commit [`62a6795`](https://github.com/apache/spark/commit/62a6795713c64b48a6e56449e161da8dbf8ec0aa). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23253: [SPARK-26303][SQL] Return partial results for bad...
Github user MaxGekk commented on a diff in the pull request: https://github.com/apache/spark/pull/23253#discussion_r240006479 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala --- @@ -2563,4 +2563,18 @@ class JsonSuite extends QueryTest with SharedSQLContext with TestJsonData { assert(!files.exists(_.getName.endsWith("json"))) } } + + test("return partial result for bad records") { +val schema = new StructType() + .add("a", DoubleType) + .add("b", ArrayType(IntegerType)) + .add("c", StringType) + .add("_corrupt_record", StringType) --- End diff -- replaced --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23256: [SPARK-24207][R] follow-up PR for SPARK-24207 to ...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/23256 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23256: [SPARK-24207][R] follow-up PR for SPARK-24207 to fix cod...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23256 Ooops, > tbh, more meaningful dataset as example would be better... did you expect to fix more instances here Felix? Sorry, I misread. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23256: [SPARK-24207][R] follow-up PR for SPARK-24207 to fix cod...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23256 Merged to master. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23196: [SPARK-26243][SQL] Use java.time API for parsing timesta...
Github user MaxGekk commented on the issue: https://github.com/apache/spark/pull/23196 @srowen @HyukjinKwon @gatorsmile Could you take a look at the PR. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23196: [SPARK-26243][SQL] Use java.time API for parsing timesta...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23196 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99870/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23196: [SPARK-26243][SQL] Use java.time API for parsing timesta...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23196 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23196: [SPARK-26243][SQL] Use java.time API for parsing timesta...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/23196 **[Test build #99870 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99870/testReport)** for PR 23196 at commit [`015fdce`](https://github.com/apache/spark/commit/015fdceb637ba831a357aae150c20ceafdec4a50). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23263: [SPARK-23674][ML] Adds Spark ML Events
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/23263 **[Test build #99873 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99873/testReport)** for PR 23263 at commit [`ba9db6e`](https://github.com/apache/spark/commit/ba9db6eb2c98a1c84f982a093ff982a030b9eab7). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23263: [SPARK-23674][ML] Adds Spark ML Events
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23263 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5889/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23263: [SPARK-23674][ML] Adds Spark ML Events
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23263 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23263: [SPARK-23674][ML] Adds Spark ML Events
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/23263 **[Test build #99872 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99872/testReport)** for PR 23263 at commit [`4f2fda2`](https://github.com/apache/spark/commit/4f2fda2272ee7e7d62df139c2f994ef7a122bf7c). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23224: [SPARK-26277][SQL][TEST] WholeStageCodegen metrics shoul...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23224 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23224: [SPARK-26277][SQL][TEST] WholeStageCodegen metrics shoul...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23224 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99869/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23224: [SPARK-26277][SQL][TEST] WholeStageCodegen metrics shoul...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/23224 **[Test build #99869 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99869/testReport)** for PR 23224 at commit [`71e569b`](https://github.com/apache/spark/commit/71e569b2a89ff0fe83b3f94fee6f57596c093590). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20146: [SPARK-11215][ML] Add multiple columns support to String...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20146 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99863/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20146: [SPARK-11215][ML] Add multiple columns support to String...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20146 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20146: [SPARK-11215][ML] Add multiple columns support to String...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20146 **[Test build #99863 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99863/testReport)** for PR 20146 at commit [`301fa4c`](https://github.com/apache/spark/commit/301fa4cbb4d62d9a180dcbafbe0d2b68dac5a3c8). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23263: [SPARK-23674][ML] Adds Spark ML Events
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/23263#discussion_r240004006 --- Diff: mllib/src/main/scala/org/apache/spark/ml/Estimator.scala --- @@ -65,7 +65,19 @@ abstract class Estimator[M <: Model[M]] extends PipelineStage { * Fits a model to the input data. */ @Since("2.0.0") - def fit(dataset: Dataset[_]): M + def fit(dataset: Dataset[_]): M = MLEvents.withFitEvent(this, dataset) { +fitImpl(dataset) + } + + /** + * `fit()` handles events and then calls this method. Subclasses should override this + * method to implement the actual fiting a model to the input data. + */ + @Since("3.0.0") + protected def fitImpl(dataset: Dataset[_]): M = { +// Keep this default body for backward compatibility. +throw new UnsupportedOperationException("fitImpl is not implemented.") --- End diff -- Yes, that was my intention. I wanted to force to implement `fitImpl` but was thinking that might be too breaking change (it's going to at least break source compatibility). I am willing to follow other suggestions - I am pretty sure you or other guys are more familiar with ML side. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23263: [SPARK-23674][ML] Adds Spark ML Events
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/23263#discussion_r240003952 --- Diff: mllib/src/main/scala/org/apache/spark/ml/Estimator.scala --- @@ -65,7 +65,19 @@ abstract class Estimator[M <: Model[M]] extends PipelineStage { * Fits a model to the input data. */ @Since("2.0.0") - def fit(dataset: Dataset[_]): M + def fit(dataset: Dataset[_]): M = MLEvents.withFitEvent(this, dataset) { +fitImpl(dataset) + } + + /** + * `fit()` handles events and then calls this method. Subclasses should override this + * method to implement the actual fiting a model to the input data. + */ + @Since("3.0.0") + protected def fitImpl(dataset: Dataset[_]): M = { +// Keep this default body for backward compatibility. +throw new UnsupportedOperationException("fitImpl is not implemented.") --- End diff -- For current change, Spark ML developers can still choose to override `fit` instead `fitImpl` so their ML model can work without ML event? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23263: [SPARK-23674][ML] Adds Spark ML Events
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/23263#discussion_r240003885 --- Diff: mllib/src/test/scala/org/apache/spark/ml/MLEventsSuite.scala --- @@ -0,0 +1,199 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml + +import java.io.File + +import scala.collection.mutable +import scala.concurrent.duration._ + +import org.apache.hadoop.fs.Path +import org.mockito.Matchers.{any, eq => meq} +import org.mockito.Mockito.when +import org.scalatest.BeforeAndAfterEach +import org.scalatest.concurrent.Eventually +import org.scalatest.mockito.MockitoSugar.mock + +import org.apache.spark.{SparkContext, SparkFunSuite} +import org.apache.spark.ml.param.ParamMap +import org.apache.spark.ml.util._ +import org.apache.spark.scheduler.{SparkListener, SparkListenerEvent} +import org.apache.spark.sql._ +import org.apache.spark.util.Utils + + +class MLEventsSuite +extends SparkFunSuite +with BeforeAndAfterEach +with DefaultReadWriteTest +with Eventually { + + private var spark: SparkSession = _ + private var sc: SparkContext = _ + private var checkpointDir: String = _ + private var listener: SparkListener = _ + private val dirName: String = "pipeline" + private val events = mutable.ArrayBuffer.empty[MLEvent] + + override def beforeAll(): Unit = { +super.beforeAll() +sc = new SparkContext("local[2]", "SparkListenerSuite") +listener = new SparkListener { + override def onOtherEvent(event: SparkListenerEvent): Unit = event match { +case e: FitStart[_] => events.append(e) +case e: FitEnd[_] => events.append(e) +case e: TransformStart => events.append(e) +case e: TransformEnd => events.append(e) +case e: SaveInstanceStart if e.path.endsWith(dirName) => events.append(e) +case e: SaveInstanceEnd if e.path.endsWith(dirName) => events.append(e) +case _ => + } +} +sc.addSparkListener(listener) + +spark = SparkSession.builder() + .sparkContext(sc) + .getOrCreate() + +checkpointDir = Utils.createDirectory(tempDir.getCanonicalPath, "checkpoints").toString +sc.setCheckpointDir(checkpointDir) --- End diff -- Let me double check and address this while fixing the test. I just copied this from `MLlibTestSparkContext`. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23263: [SPARK-23674][ML] Adds Spark ML Events
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/23263#discussion_r240003869 --- Diff: mllib/src/main/scala/org/apache/spark/ml/Pipeline.scala --- @@ -132,7 +132,8 @@ class Pipeline @Since("1.4.0") ( * @return fitted pipeline */ @Since("2.0.0") - override def fit(dataset: Dataset[_]): PipelineModel = { + override def fit(dataset: Dataset[_]): PipelineModel = super.fit(dataset) --- End diff -- Ah, it's there just only to keep the `@Since`. Looks some classes don't explicitly note that so I didn't call `super` in other places. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23263: [SPARK-23674][ML] Adds Spark ML Events
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/23263#discussion_r240003674 --- Diff: mllib/src/test/scala/org/apache/spark/ml/MLEventsSuite.scala --- @@ -0,0 +1,199 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml + +import java.io.File + +import scala.collection.mutable +import scala.concurrent.duration._ + +import org.apache.hadoop.fs.Path +import org.mockito.Matchers.{any, eq => meq} +import org.mockito.Mockito.when +import org.scalatest.BeforeAndAfterEach +import org.scalatest.concurrent.Eventually +import org.scalatest.mockito.MockitoSugar.mock + +import org.apache.spark.{SparkContext, SparkFunSuite} +import org.apache.spark.ml.param.ParamMap +import org.apache.spark.ml.util._ +import org.apache.spark.scheduler.{SparkListener, SparkListenerEvent} +import org.apache.spark.sql._ +import org.apache.spark.util.Utils + + +class MLEventsSuite +extends SparkFunSuite +with BeforeAndAfterEach +with DefaultReadWriteTest +with Eventually { + + private var spark: SparkSession = _ + private var sc: SparkContext = _ + private var checkpointDir: String = _ + private var listener: SparkListener = _ + private val dirName: String = "pipeline" + private val events = mutable.ArrayBuffer.empty[MLEvent] + + override def beforeAll(): Unit = { +super.beforeAll() +sc = new SparkContext("local[2]", "SparkListenerSuite") +listener = new SparkListener { + override def onOtherEvent(event: SparkListenerEvent): Unit = event match { +case e: FitStart[_] => events.append(e) +case e: FitEnd[_] => events.append(e) +case e: TransformStart => events.append(e) +case e: TransformEnd => events.append(e) +case e: SaveInstanceStart if e.path.endsWith(dirName) => events.append(e) +case e: SaveInstanceEnd if e.path.endsWith(dirName) => events.append(e) +case _ => + } +} +sc.addSparkListener(listener) + +spark = SparkSession.builder() + .sparkContext(sc) + .getOrCreate() + +checkpointDir = Utils.createDirectory(tempDir.getCanonicalPath, "checkpoints").toString +sc.setCheckpointDir(checkpointDir) --- End diff -- I may miss it, where do we use checkpoint? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23263: [SPARK-23674][ML] Adds Spark ML Events
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/23263#discussion_r240003563 --- Diff: mllib/src/main/scala/org/apache/spark/ml/Pipeline.scala --- @@ -132,7 +132,8 @@ class Pipeline @Since("1.4.0") ( * @return fitted pipeline */ @Since("2.0.0") - override def fit(dataset: Dataset[_]): PipelineModel = { + override def fit(dataset: Dataset[_]): PipelineModel = super.fit(dataset) --- End diff -- Is there any `fit` method which doesn't do `super.fit()`? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23253: [SPARK-26303][SQL] Return partial results for bad...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/23253#discussion_r240003434 --- Diff: docs/sql-migration-guide-upgrade.md --- @@ -37,6 +37,8 @@ displayTitle: Spark SQL Upgrading Guide - In Spark version 2.4 and earlier, CSV datasource converts a malformed CSV string to a row with all `null`s in the PERMISSIVE mode. Since Spark 3.0, returned row can contain non-`null` fields if some of CSV column values were parsed and converted to desired types successfully. + - In Spark version 2.4 and earlier, JSON datasource and JSON functions like `from_json` convert a bad JSON record to a row with all `null`s in the PERMISSIVE mode when specified schema is `StructType`. Since Spark 3.0, returned row can contain non-`null` fields if some of JSON column values were parsed and converted to desired types successfully. + --- End diff -- Ok. Sounds reasonable. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23262: [SPARK-26312][SQL]Converting converters in RDDConversion...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23262 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99866/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23262: [SPARK-26312][SQL]Converting converters in RDDConversion...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23262 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23262: [SPARK-26312][SQL]Converting converters in RDDConversion...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/23262 **[Test build #99866 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99866/testReport)** for PR 23262 at commit [`ddb2528`](https://github.com/apache/spark/commit/ddb252892a439281b16bc14fdfdb7faf756f1067). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23264: [SPARK-26266][BUILD] Update to Scala 2.12.8 (branch-2.4)
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23264 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23264: [SPARK-26266][BUILD] Update to Scala 2.12.8 (branch-2.4)
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23264 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5888/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23264: [SPARK-26266][BUILD] Update to Scala 2.12.8 (branch-2.4)
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/23264 **[Test build #99871 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99871/testReport)** for PR 23264 at commit [`4427e9f`](https://github.com/apache/spark/commit/4427e9f183f5f0aae7e32643508b8a3b1c9bf234). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23264: Update to Scala 2.12.8
GitHub user srowen opened a pull request: https://github.com/apache/spark/pull/23264 Update to Scala 2.12.8 ## What changes were proposed in this pull request? Back-port of https://github.com/apache/spark/pull/23218 ; updates Scala 2.12 build to 2.12.8 ## How was this patch tested? Existing tests. You can merge this pull request into a Git repository by running: $ git pull https://github.com/srowen/spark SPARK-26266.2 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/23264.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #23264 commit 4427e9f183f5f0aae7e32643508b8a3b1c9bf234 Author: Sean Owen Date: 2018-12-08T12:09:30Z Update to Scala 2.12.8 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23258: [SPARK-23375][SQL][FOLLOWUP][TEST] Test Sort metr...
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/23258#discussion_r240003036 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/metric/SQLMetricsSuite.scala --- @@ -182,10 +182,13 @@ class SQLMetricsSuite extends SparkFunSuite with SQLMetricsTestUtils with Shared } test("Sort metrics") { -// Assume the execution plan is -// WholeStageCodegen(nodeId = 0, Range(nodeId = 2) -> Sort(nodeId = 1)) -val ds = spark.range(10).sort('id) -testSparkPlanMetrics(ds.toDF(), 2, Map.empty) +// Assume the execution plan with node id is +// Sort(nodeId = 0) +// Exchange(nodeId = 1) +// LocalTableScan(nodeId = 2) +val df = Seq(1, 3, 2).toDF("id").sort('id) +testSparkPlanMetrics(df, 2, Map.empty) --- End diff -- Thanks for pinging me @maropu. What is the point about checking that `LocalTableScan` contains no metrics? I checked the original PR which introduced this UT by @sameeragarwal who can maybe help us stating the goal of the test here (unless someone else can answer me, because I have not understood it). It doesn't seem even related to the Sort operator to me. Maybe I am missing something. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23218: [SPARK-26266][BUILD] Update to Scala 2.12.8
Github user srowen commented on the issue: https://github.com/apache/spark/pull/23218 Merged to master. I'll open a separate PR for branch-2.4 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23258: [SPARK-23375][SQL][FOLLOWUP][TEST] Test Sort metrics whi...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23258 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23218: [SPARK-26266][BUILD] Update to Scala 2.12.8
Github user srowen closed the pull request at: https://github.com/apache/spark/pull/23218 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23258: [SPARK-23375][SQL][FOLLOWUP][TEST] Test Sort metrics whi...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23258 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99864/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23258: [SPARK-23375][SQL][FOLLOWUP][TEST] Test Sort metrics whi...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/23258 **[Test build #99864 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99864/testReport)** for PR 23258 at commit [`6e36336`](https://github.com/apache/spark/commit/6e3633620af4e624c8e562ff290e4264cf33eb8b). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22952: [SPARK-20568][SS] Provide option to clean up completed f...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22952 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22952: [SPARK-20568][SS] Provide option to clean up completed f...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22952 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99862/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22952: [SPARK-20568][SS] Provide option to clean up completed f...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22952 **[Test build #99862 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99862/testReport)** for PR 22952 at commit [`998e769`](https://github.com/apache/spark/commit/998e769c3b552c39736af7814f60be895dbd90d4). * This patch passes all tests. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * ` class FileStreamSourceCleaner(fileSystem: FileSystem, sourcePath: Path,` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22707: [SPARK-25717][SQL] Insert overwrite a recreated e...
Github user maropu commented on a diff in the pull request: https://github.com/apache/spark/pull/22707#discussion_r240002357 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala --- @@ -227,18 +227,22 @@ case class InsertIntoHiveTable( // Newer Hive largely improves insert overwrite performance. As Spark uses older Hive // version and we may not want to catch up new Hive version every time. We delete the // Hive partition first and then load data file into the Hive partition. - if (oldPart.nonEmpty && overwrite) { -oldPart.get.storage.locationUri.foreach { uri => - val partitionPath = new Path(uri) - val fs = partitionPath.getFileSystem(hadoopConf) - if (fs.exists(partitionPath)) { -if (!fs.delete(partitionPath, true)) { - throw new RuntimeException( -"Cannot remove partition directory '" + partitionPath.toString) -} -// Don't let Hive do overwrite operation since it is slower. -doHiveOverwrite = false + if (overwrite) { +val oldPartitionPath = oldPart.flatMap(_.storage.locationUri.map(new Path(_))) + .getOrElse { +ExternalCatalogUtils.generatePartitionPath( + partitionSpec, + partitionColumnNames, + new Path(table.location)) --- End diff -- We still need to consider the old path, `oldPart`? Can't we write this?; ``` val oldPartitionPath = ExternalCatalogUtils.generatePartitionPath( partitionSpec, partitionColumnNames, new Path(table.location)) ``` Also, can you write a comment about how to solve this issue here and in the pr description? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23263: [SPARK-23674][ML] Adds Spark ML Events
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23263 The tests pass in my local. I'll fix them shortly. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23253: [SPARK-26303][SQL] Return partial results for bad...
Github user MaxGekk commented on a diff in the pull request: https://github.com/apache/spark/pull/23253#discussion_r24694 --- Diff: docs/sql-migration-guide-upgrade.md --- @@ -37,6 +37,8 @@ displayTitle: Spark SQL Upgrading Guide - In Spark version 2.4 and earlier, CSV datasource converts a malformed CSV string to a row with all `null`s in the PERMISSIVE mode. Since Spark 3.0, returned row can contain non-`null` fields if some of CSV column values were parsed and converted to desired types successfully. + - In Spark version 2.4 and earlier, JSON datasource and JSON functions like `from_json` convert a bad JSON record to a row with all `null`s in the PERMISSIVE mode when specified schema is `StructType`. Since Spark 3.0, returned row can contain non-`null` fields if some of JSON column values were parsed and converted to desired types successfully. + --- End diff -- @cloud-fan This PR propose similar changes as in https://github.com/apache/spark/pull/23120 . Could you take a look at it. > For such behavior change, shall we add a config to roll back to previous behavior? I don't think it makes sense to introduce global SQL config for this particular case. The risk of breaking users apps is low because apps logic cannot based only on presence of all nulls in row. All nulls don't differentiate bad and not-bad JSON records. From my point of view, a note in the migration guide is enough. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23263: [SPARK-23674][ML] Adds Spark ML Events
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23263 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23263: [SPARK-23674][ML] Adds Spark ML Events
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23263 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99868/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23263: [SPARK-23674][ML] Adds Spark ML Events
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/23263 **[Test build #99868 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99868/testReport)** for PR 23263 at commit [`a9112f3`](https://github.com/apache/spark/commit/a9112f33ff8fbfb66bad76bff6898abdef5b6881). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `case class TransformStart(transformer: Transformer, input: Dataset[_]) extends MLEvent` * `case class TransformEnd(transformer: Transformer, output: Dataset[_]) extends MLEvent` * `case class FitStart[M <: Model[M]](estimator: Estimator[M], dataset: Dataset[_]) extends MLEvent` * `case class FitEnd[M <: Model[M]](estimator: Estimator[M], model: M) extends MLEvent` * `case class LoadInstanceStart[T](reader: MLReader[T], path: String) extends MLEvent` * `case class LoadInstanceEnd[T](reader: MLReader[T], instance: T) extends MLEvent` * `case class SaveInstanceStart(writer: MLWriter, path: String) extends MLEvent` * `case class SaveInstanceEnd(writer: MLWriter, path: String) extends MLEvent` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23201: [SPARK-26246][SQL] Infer date and timestamp types...
Github user MaxGekk commented on a diff in the pull request: https://github.com/apache/spark/pull/23201#discussion_r24411 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JsonInferSchema.scala --- @@ -121,7 +122,26 @@ private[sql] class JsonInferSchema(options: JSONOptions) extends Serializable { DecimalType(bigDecimal.precision, bigDecimal.scale) } decimalTry.getOrElse(StringType) - case VALUE_STRING => StringType + case VALUE_STRING => +val stringValue = parser.getText --- End diff -- The order can be matter if you have the same pattern (or similar) for dates and timestamps. `DateType` can be preferable because it requires less memory. It seems reasonable to move from `DateType` to `TimestampType` during schema inferring since opposite one is impossible without loosing info. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23196: [SPARK-26243][SQL] Use java.time API for parsing timesta...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23196 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5887/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23224: [SPARK-26277][SQL][TEST] WholeStageCodegen metrics shoul...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/23224 **[Test build #99869 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99869/testReport)** for PR 23224 at commit [`71e569b`](https://github.com/apache/spark/commit/71e569b2a89ff0fe83b3f94fee6f57596c093590). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23196: [SPARK-26243][SQL] Use java.time API for parsing timesta...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23196 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23196: [SPARK-26243][SQL] Use java.time API for parsing timesta...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/23196 **[Test build #99870 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99870/testReport)** for PR 23196 at commit [`015fdce`](https://github.com/apache/spark/commit/015fdceb637ba831a357aae150c20ceafdec4a50). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23224: [SPARK-26277][SQL][TEST] WholeStageCodegen metrics shoul...
Github user heary-cao commented on the issue: https://github.com/apache/spark/pull/23224 retest this please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23196: [SPARK-26243][SQL] Use java.time API for parsing ...
Github user MaxGekk commented on a diff in the pull request: https://github.com/apache/spark/pull/23196#discussion_r24119 --- Diff: docs/sql-migration-guide-upgrade.md --- @@ -33,6 +33,8 @@ displayTitle: Spark SQL Upgrading Guide - Spark applications which are built with Spark version 2.4 and prior, and call methods of `UserDefinedFunction`, need to be re-compiled with Spark 3.0, as they are not binary compatible with Spark 3.0. + - Since Spark 3.0, JSON datasource uses java.time API for parsing and generating JSON content. New formatting implementation supports date/timestamp patterns conformed to ISO 8601. To switch back to the implementation used in Spark 2.4 and earlier, set `spark.sql.legacy.timeParser.enabled` to `true`. --- End diff -- I added an example when there is a difference, and updated the migration guide. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21289: [SPARK-24240] Add a config to control whether InMemoryFi...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21289 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5886/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21289: [SPARK-24240] Add a config to control whether InMemoryFi...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21289 Build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23263: [SPARK-23674][ML] Adds Spark ML Events
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23263 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5885/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23263: [SPARK-23674][ML] Adds Spark ML Events
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23263 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23263: [SPARK-23674][ML] Adds Spark ML Events
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23263 cc @srowen, @cloud-fan (since it mimics SQL's event listener), @jkbradley, @mengxr and @yanboliang. Mind if I ask to take a look please? WDYT about this? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23263: [SPARK-23674][ML] Adds Spark ML Events
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/23263#discussion_r23747 --- Diff: mllib/src/main/scala/org/apache/spark/ml/Predictor.scala --- @@ -210,7 +214,7 @@ abstract class PredictionModel[FeaturesType, M <: PredictionModel[FeaturesType, } } - protected def transformImpl(dataset: Dataset[_]): DataFrame = { + override protected def transformImpl(dataset: Dataset[_]): DataFrame = { --- End diff -- `transformImpl` for some abstraction and `saveImpl` are already existent. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23263: [SPARK-23674][ML] Adds Spark ML Events
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/23263 **[Test build #99868 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99868/testReport)** for PR 23263 at commit [`a9112f3`](https://github.com/apache/spark/commit/a9112f33ff8fbfb66bad76bff6898abdef5b6881). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23261: [WIP][SPARK-23674][ML] Adds Spark ML Events
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23261 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23261: [WIP][SPARK-23674][ML] Adds Spark ML Events
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23261 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5884/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23261: [WIP][SPARK-23674][ML] Adds Spark ML Events
Github user HyukjinKwon closed the pull request at: https://github.com/apache/spark/pull/23261 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23263: [SPARK-23674][ML] Adds Spark ML Events
GitHub user HyukjinKwon opened a pull request: https://github.com/apache/spark/pull/23263 [SPARK-23674][ML] Adds Spark ML Events ## What changes were proposed in this pull request? This PR proposes to add ML events so that other developers can track and add some actions for them. ## Introduction This PR proposes to send some ML events like SQL. This is quite useful when people want to track and make some actions for corresponding ML operations. For instance, I have been working on integrating Apache Spark with [Apache Atlas](https://atlas.apache.org/QuickStart.html). With some custom changes with this PR, I can visualise ML pipeline as below: ![spark_ml_streaming_lineage](https://user-images.githubusercontent.com/6477701/49682779-394bca80-faf5-11e8-85b8-5fae28b784b3.png) I think not to mention how useful it is to track the SQL operations. Likewise, I would like to propose ML events as well (as lowest stability `@Unstable` APIs for now - no guarantee about stability). ## Implementation Details ### Sends event (but not expose ML specific listener) In `events.scala`, it adds: ```scala @Unstable case class ...StartEvent(caller, input) @Unstable case class ...EndEvent(caller, output) object MLEvents { // Wrappers to send events: // def with...Event(body) = { // body() // SparkContext.getOrCreate().listenerBus.post(event) // } } ``` This way mimics both: **1. Catalog events (see `org.apache.spark.sql.catalyst.catalog.events.scala`)** - This allows a Catalog specific listener to be added `ExternalCatalogEventListener` - It's implemented in a way of wrapping whole `ExternalCatalog` named `ExternalCatalogWithListener` which delegates the operations to `ExternalCatalog` This is not quite possible in this case because most of instances (like `Pipeline`) will be directly created in most of cases. We might be able to do that via extending `ListenerBus` for all possible instances but IMHO it's too invasive. Also, exposing another ML specific listener sounds a bit too much at this stage. Therefore, I simply borrowed file name and structures here **2. SQL execution events (see `org.apache.spark.sql.execution.SQLExecution.scala`)** - Add an object that wraps a body to send events Current apporach is rather close to this. It has a `with...` wrapper to send events. I borrowed this approach to be consistent. ### Add `...Impl` methods to wrap each to send events **1. `mllib/src/main/scala/org/apache/spark/ml/util/ReadWrite.scala`** ```diff - def save(...) = { saveImpl(...) } + def save(...) = MLEvents.withSaveInstanceEvent { saveImpl(...) } def saveImpl(...): Unit = ... ``` Note that `saveImpl` was already implemented unlike other instances below. ```diff - def load(...): T + def load(...): T = MLEvents.withLoadInstanceEvent { loadImple(...) } + def loadImpl(...): T ``` **2. `mllib/src/main/scala/org/apache/spark/ml/Estimator.scala`** ```diff - def fit(...): Model + def fit(...): Model = MLEvents.withFitEvent { fitImpl(...) } + def fitImpl(...): Model ``` **3. `mllib/src/main/scala/org/apache/spark/ml/Transformer.scala`** ```diff - def transform(...): DataFrame + def transform(...): DataFrame = MLEvents.withTransformEvent { transformImpl(...) } + def transformImpl(...): DataFrame ``` This approach follows the existing way as below in ML: **1. `transform` and `transformImpl`** https://github.com/apache/spark/blob/9b1f6c8bab5401258c653d4e2efb50e97c6d282f/mllib/src/main/scala/org/apache/spark/ml/Predictor.scala#L202-L213 https://github.com/apache/spark/blob/9b1f6c8bab5401258c653d4e2efb50e97c6d282f/mllib/src/main/scala/org/apache/spark/ml/regression/DecisionTreeRegressor.scala#L191-L196 https://github.com/apache/spark/blob/9b1f6c8bab5401258c653d4e2efb50e97c6d282f/mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala#L1037-L1042 **2. `save` and `saveImpl`** https://github.com/apache/spark/blob/9b1f6c8bab5401258c653d4e2efb50e97c6d282f/mllib/src/main/scala/org/apache/spark/ml/util/ReadWrite.scala#L166-L176 Inherited ones are intentionally omitted here for simplicity. They are inherited and implemented at multiple places. ## Backward Compatibility _This keeps both source and binary backward compatibility_. I was thinking enforcing `...Impl` by leaving it abstract methods to force to implement but just decided to leave a body that throws `UnsupportedOperationException` so that we can keep full source and binary compatibilities. - For user-faced API perspective, _there's no difference_.
[GitHub] spark issue #23261: [WIP][SPARK-23674][ML] Adds Spark ML Events
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23261 I'm going to rebase and reopen a PR to retrigger AppVeyor tests. The failure looks unrelated --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23261: [WIP][SPARK-23674][ML] Adds Spark ML Events
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/23261 **[Test build #99867 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99867/testReport)** for PR 23261 at commit [`6dbe14c`](https://github.com/apache/spark/commit/6dbe14cec3892ce1238eadd0f41454a8922c5d33). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23261: [WIP][SPARK-23674][ML] Adds Spark ML Events
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23261 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99867/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23261: [WIP][SPARK-23674][ML] Adds Spark ML Events
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23261 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23261: [WIP][SPARK-23674][ML] Adds Spark ML Events
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23261 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23261: [WIP][SPARK-23674][ML] Adds Spark ML Events
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/23261 **[Test build #99865 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99865/testReport)** for PR 23261 at commit [`a44e30e`](https://github.com/apache/spark/commit/a44e30e03002c1efd2b8a482c5392f329d15d073). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `case class TransformStart(transformer: Transformer, input: Dataset[_]) extends MLEvent` * `case class TransformEnd(transformer: Transformer, output: Dataset[_]) extends MLEvent` * `case class FitStart[M <: Model[M]](estimator: Estimator[M], dataset: Dataset[_]) extends MLEvent` * `case class FitEnd[M <: Model[M]](estimator: Estimator[M], model: M) extends MLEvent` * `case class LoadInstanceStart[T](reader: MLReader[T], path: String) extends MLEvent` * `case class LoadInstanceEnd[T](reader: MLReader[T], instance: T) extends MLEvent` * `case class SaveInstanceStart(writer: MLWriter, path: String) extends MLEvent` * `case class SaveInstanceEnd(writer: MLWriter, path: String) extends MLEvent` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23261: [WIP][SPARK-23674][ML] Adds Spark ML Events
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23261 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99865/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23261: [WIP][SPARK-23674][ML] Adds Spark ML Events
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/23261#discussion_r239998861 --- Diff: mllib/src/main/scala/org/apache/spark/ml/util/ReadWrite.scala --- @@ -163,7 +163,7 @@ abstract class MLWriter extends BaseReadWrite with Logging { */ @Since("1.6.0") @throws[IOException]("If the input path already exists but overwrite is not enabled.") - def save(path: String): Unit = { + def save(path: String): Unit = MLEvents.withSaveInstanceEvent(this, path) { new FileSystemOverwrite().handleOverwrite(path, shouldOverwrite, sc) saveImpl(path) --- End diff -- and `saveImpl` is already existent as well. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23261: [WIP][SPARK-23674][ML] Adds Spark ML Events
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/23261#discussion_r239998855 --- Diff: mllib/src/main/scala/org/apache/spark/ml/Predictor.scala --- @@ -210,7 +214,7 @@ abstract class PredictionModel[FeaturesType, M <: PredictionModel[FeaturesType, } } - protected def transformImpl(dataset: Dataset[_]): DataFrame = { + override protected def transformImpl(dataset: Dataset[_]): DataFrame = { --- End diff -- For clarification, some of abstraction already has `transformImpl`. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23261: [WIP][SPARK-23674][ML] Adds Spark ML Events
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23261 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5883/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23261: [WIP][SPARK-23674][ML] Adds Spark ML Events
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23261 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23261: [WIP][SPARK-23674][ML] Adds Spark ML Events
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/23261 **[Test build #99867 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99867/testReport)** for PR 23261 at commit [`6dbe14c`](https://github.com/apache/spark/commit/6dbe14cec3892ce1238eadd0f41454a8922c5d33). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23262: [SPARK-26312][SQL]Converting converters in RDDConversion...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/23262 **[Test build #99866 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99866/testReport)** for PR 23262 at commit [`ddb2528`](https://github.com/apache/spark/commit/ddb252892a439281b16bc14fdfdb7faf756f1067). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org