[GitHub] spark pull request #19893: [SPARK-16139][TEST] Add logging functionality for...
Github user gaborgsomogyi commented on a diff in the pull request: https://github.com/apache/spark/pull/19893#discussion_r159831366

--- Diff: core/src/test/scala/org/apache/spark/ThreadAudit.scala ---
@@ -0,0 +1,126 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark
+
+import scala.collection.JavaConverters._
+
+import org.apache.spark.internal.Logging
+
+/**
+ * Thread audit for test suites.
+ *
+ * The thread audit normally happens automatically in [[SparkFunSuite]] when a new test suite
+ * is created.
--- End diff --

Good point, moving.
[GitHub] spark pull request #19893: [SPARK-16139][TEST] Add logging functionality for...
Github user gaborgsomogyi commented on a diff in the pull request: https://github.com/apache/spark/pull/19893#discussion_r159831631

--- Diff: core/src/test/scala/org/apache/spark/ThreadAudit.scala ---
@@ -0,0 +1,126 @@
[...]
+ * The only prerequisite is that the test class must extend [[SparkFunSuite]].
+ *
+ * Some test suites do initialization before [[SparkFunSuite#beforeAll]] is executed. In such
+ * cases the audit can be moved to another place in the call sequence.
+ *
+ * To do the audit in a custom place/way the following can be done:
+ *
+ * class MyTestSuite extends SparkFunSuite {
+ *
+ *   override val doThreadAuditInSparkFunSuite = false
+ *
+ *   protected override def beforeAll(): Unit = {
+ *     doThreadPreAudit
+ *     super.beforeAll
+ *   }
+ *
+ *   protected override def afterAll(): Unit = {
+ *     super.afterAll
+ *     doThreadPostAudit
+ *   }
+ * }
+ */
+trait ThreadAudit extends Logging {
+
+  val threadWhiteList = Set(
+    /**
+     * Netty related internal threads.
+     * These are excluded because their lifecycle is handled by Netty itself
+     * and Spark has no explicit effect on them.
+     */
+    "netty.*",
+
+    /**
+     * Netty related internal threads.
+     * A single-threaded singleton EventExecutor inside Netty which creates such threads.
+     * These are excluded because their lifecycle is handled by Netty itself
+     * and Spark has no explicit effect on them.
+     */
+    "globalEventExecutor.*",
+
+    /**
+     * Netty related internal threads.
+     * Checks periodically whether a thread is alive and runs a task when a thread dies.
+     * These are excluded because their lifecycle is handled by Netty itself
+     * and Spark has no explicit effect on them.
+     */
+    "threadDeathWatcher.*",
+
+    /**
+     * During [[SparkContext]] creation [[org.apache.spark.rpc.netty.NettyRpcEnv]]
+     * creates event loops. One is wrapped inside
+     * [[org.apache.spark.network.server.TransportServer]],
+     * the other one is inside [[org.apache.spark.network.client.TransportClient]].
+     * The thread pools behind them shut down asynchronously, triggered by
+     * [[SparkContext#stop]]. Manually checked and all of them stopped properly.
+     */
+    "rpc-client.*",
+    "rpc-server.*",
+
+    /**
+     * During [[SparkContext]] creation BlockManager
+     * creates event loops. One is wrapped inside
--- End diff --

Fixed.
[GitHub] spark pull request #19893: [SPARK-16139][TEST] Add logging functionality for...
Github user gaborgsomogyi commented on a diff in the pull request: https://github.com/apache/spark/pull/19893#discussion_r159832598

--- Diff: core/src/test/scala/org/apache/spark/ThreadAudit.scala ---
@@ -0,0 +1,126 @@
[...]
+     * [[org.apache.spark.network.server.TransportServer]],
+     * the other one is inside [[org.apache.spark.network.client.TransportClient]].
+     * The thread pools behind them shut down asynchronously, triggered by
+     * [[SparkContext#stop]]. Manually checked and all of them stopped properly.
+     */
+    "shuffle-client.*",
+    "shuffle-server.*"
+  )
+  private var threadNamesSnapshot: Set[String] = Set.empty
+
+  protected def doThreadPreAudit(): Unit = snapshotRunningThreadNames
+  protected def doThreadPostAudit(): Unit = printRemainingThreadNames
+
+  private def snapshotRunningThreadNames(): Unit = {
--- End diff --

Inlined.
[GitHub] spark pull request #19893: [SPARK-16139][TEST] Add logging functionality for...
Github user gaborgsomogyi commented on a diff in the pull request: https://github.com/apache/spark/pull/19893#discussion_r159832729

--- Diff: core/src/test/scala/org/apache/spark/ThreadAudit.scala ---
@@ -0,0 +1,126 @@
[...]
+  private def snapshotRunningThreadNames(): Unit = {
+    threadNamesSnapshot = runningThreadNames
--- End diff --

Fixed.
[GitHub] spark pull request #19893: [SPARK-16139][TEST] Add logging functionality for...
Github user gaborgsomogyi commented on a diff in the pull request: https://github.com/apache/spark/pull/19893#discussion_r159834709

--- Diff: core/src/test/scala/org/apache/spark/ThreadAudit.scala ---
@@ -0,0 +1,126 @@
[...]
+  private def snapshotRunningThreadNames(): Unit = {
+    threadNamesSnapshot = runningThreadNames
+  }
+
+  private def printRemainingThreadNames(): Unit = {
--- End diff --

Inlined.
[GitHub] spark pull request #19893: [SPARK-16139][TEST] Add logging functionality for...
Github user gaborgsomogyi commented on a diff in the pull request: https://github.com/apache/spark/pull/19893#discussion_r159837804

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/SessionStateSuite.scala ---
@@ -39,6 +41,7 @@ class SessionStateSuite extends SparkFunSuite
   protected var activeSession: SparkSession = _

   override def beforeAll(): Unit = {
+    doThreadPreAudit()
--- End diff --

Fixed.
[GitHub] spark pull request #19893: [SPARK-16139][TEST] Add logging functionality for...
Github user gaborgsomogyi commented on a diff in the pull request: https://github.com/apache/spark/pull/19893#discussion_r159838691

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/sources/DataSourceAnalysisSuite.scala ---
@@ -29,16 +29,24 @@ import org.apache.spark.sql.types.{DataType, IntegerType, StructType}

 class DataSourceAnalysisSuite extends SparkFunSuite with BeforeAndAfterAll {

+  protected override val doThreadAuditInSparkFunSuite = false
+
   private var targetAttributes: Seq[Attribute] = _
   private var targetPartitionSchema: StructType = _

   override def beforeAll(): Unit = {
+    doThreadPreAudit()
--- End diff --

Fixed.
[GitHub] spark issue #20096: [SPARK-22908] Add kafka source and sink for continuous p...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20096

**[Test build #85720 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85720/testReport)** for PR 20096 at commit [`be6c378`](https://github.com/apache/spark/commit/be6c378969920aeda6506d1c2cfb91a33dfe7027).
[GitHub] spark pull request #19893: [SPARK-16139][TEST] Add logging functionality for...
Github user gaborgsomogyi commented on a diff in the pull request: https://github.com/apache/spark/pull/19893#discussion_r159840754

--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveSessionStateSuite.scala ---
@@ -28,14 +28,18 @@ import org.apache.spark.sql.hive.test.TestHiveSingleton
 class HiveSessionStateSuite extends SessionStateSuite with TestHiveSingleton
   with BeforeAndAfterEach {

+  override protected val doThreadAuditInSparkFunSuite = false
+
   override def beforeAll(): Unit = {
     // Reuse the singleton session
     activeSession = spark
+    doThreadPreAudit()
--- End diff --

Fixed.
[GitHub] spark pull request #19893: [SPARK-16139][TEST] Add logging functionality for...
Github user gaborgsomogyi commented on a diff in the pull request: https://github.com/apache/spark/pull/19893#discussion_r159842007

--- Diff: core/src/test/scala/org/apache/spark/SparkFunSuite.scala ---
@@ -31,15 +31,28 @@ import org.apache.spark.util.AccumulatorContext
 abstract class SparkFunSuite
   extends FunSuite
   with BeforeAndAfterAll
+  with ThreadAudit
   with Logging {
 // scalastyle:on

+  protected val doThreadAuditInSparkFunSuite = true
--- End diff --

Renamed to enableAutoThreadAudit.
[GitHub] spark pull request #19893: [SPARK-16139][TEST] Add logging functionality for...
Github user gaborgsomogyi commented on a diff in the pull request: https://github.com/apache/spark/pull/19893#discussion_r159841944

--- Diff: core/src/test/scala/org/apache/spark/SparkFunSuite.scala ---
[...]
+  protected val doThreadAuditInSparkFunSuite = true
--- End diff --

I was thinking about proper naming before. The last suggested one is definitely better: it names no exact place where the audit happens, but it also doesn't suggest that auditing is completely turned off.
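A minimal sketch of how a suite would opt out under the new name; it mirrors the scaladoc example quoted earlier, with the flag renamed to `enableAutoThreadAudit` (assumed here):

```scala
// Sketch only: assumes the flag is now called enableAutoThreadAudit.
class MyTestSuite extends SparkFunSuite {

  // Disable the automatic audit in SparkFunSuite ...
  override protected val enableAutoThreadAudit = false

  // ... and run it manually at a custom point in the lifecycle.
  protected override def beforeAll(): Unit = {
    doThreadPreAudit()
    super.beforeAll()
  }

  protected override def afterAll(): Unit = {
    super.afterAll()
    doThreadPostAudit()
  }
}
```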
[GitHub] spark pull request #20164: [SPARK-22971][ML] OneVsRestModel should use tempo...
GitHub user zhengruifeng opened a pull request: https://github.com/apache/spark/pull/20164

[SPARK-22971][ML] OneVsRestModel should use temporary RawPredictionCol

## What changes were proposed in this pull request?
Use a temporary RawPredictionCol in `OneVsRestModel#transform`.

## How was this patch tested?
Existing tests and added tests.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/zhengruifeng/spark ovr_not_use_getRawPredictionCol

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/20164.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #20164

commit f155e1cc6b175ac06a5f2ab710d4c053b0776507
Author: Zheng RuiFeng
Date: 2018-01-05T09:29:25Z
    create pr

commit 9b0dcc69535b6731c9b6cdc0030c846c3352a5de
Author: Zheng RuiFeng
Date: 2018-01-05T10:19:59Z
    create pr

commit 6c567ffb02738346fc83e467752add0d00a42e07
Author: Zheng RuiFeng
Date: 2018-01-05T10:26:16Z
    add test
[GitHub] spark issue #20164: [SPARK-22971][ML] OneVsRestModel should use temporary Ra...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20164

**[Test build #85721 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85721/testReport)** for PR 20164 at commit [`6c567ff`](https://github.com/apache/spark/commit/6c567ffb02738346fc83e467752add0d00a42e07).
[GitHub] spark pull request #20013: [SPARK-20657][core] Speed up rendering of the sta...
Github user gengliangwang commented on a diff in the pull request: https://github.com/apache/spark/pull/20013#discussion_r159844819

--- Diff: core/src/main/scala/org/apache/spark/status/AppStatusListener.scala ---
@@ -848,7 +853,7 @@ private[spark] class AppStatusListener(
     }

     stages.foreach { s =>
-      val key = s.id
+      val key = Array(s.info.stageId, s.info.attemptId)
--- End diff --

Use a case class instead? Or create a function `xxxKey` that returns the Array/Tuple. Using `Array(s.info.stageId, s.info.attemptId)` in several places does not look robust enough.
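A minimal sketch of both variants of that suggestion; `StageKey` and `stageKey` are illustrative names, not the final API:

```scala
package org.apache.spark.status

// Option 1: a dedicated key type.
private[spark] case class StageKey(stageId: Int, attemptId: Int)

// Option 2: a single helper that builds the Array key, so the shape of the
// composite key lives in exactly one place instead of at every call site.
private[spark] object StageKeys {
  def stageKey(stageId: Int, attemptId: Int): Array[Int] = Array(stageId, attemptId)
}

// A call site would then read:
//   val key = StageKeys.stageKey(s.info.stageId, s.info.attemptId)
```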
[GitHub] spark pull request #20013: [SPARK-20657][core] Speed up rendering of the sta...
Github user gengliangwang commented on a diff in the pull request: https://github.com/apache/spark/pull/20013#discussion_r159835376

--- Diff: core/src/main/scala/org/apache/spark/status/LiveEntity.scala ---
@@ -119,118 +121,115 @@ private class LiveTask(

   import LiveEntityHelpers._

-  private var recordedMetrics: v1.TaskMetrics = null
+  private var metrics: MetricsTracker = new MetricsTracker()

   var errorMessage: Option[String] = None

   /**
    * Update the metrics for the task and return the difference between the previous and new
    * values.
    */
-  def updateMetrics(metrics: TaskMetrics): v1.TaskMetrics = {
+  def updateMetrics(metrics: TaskMetrics): MetricsTracker = {
     if (metrics != null) {
-      val old = recordedMetrics
-      recordedMetrics = new v1.TaskMetrics(
-        metrics.executorDeserializeTime,
-        metrics.executorDeserializeCpuTime,
-        metrics.executorRunTime,
-        metrics.executorCpuTime,
-        metrics.resultSize,
-        metrics.jvmGCTime,
-        metrics.resultSerializationTime,
-        metrics.memoryBytesSpilled,
-        metrics.diskBytesSpilled,
-        metrics.peakExecutionMemory,
-        new v1.InputMetrics(
-          metrics.inputMetrics.bytesRead,
-          metrics.inputMetrics.recordsRead),
-        new v1.OutputMetrics(
-          metrics.outputMetrics.bytesWritten,
-          metrics.outputMetrics.recordsWritten),
-        new v1.ShuffleReadMetrics(
-          metrics.shuffleReadMetrics.remoteBlocksFetched,
-          metrics.shuffleReadMetrics.localBlocksFetched,
-          metrics.shuffleReadMetrics.fetchWaitTime,
-          metrics.shuffleReadMetrics.remoteBytesRead,
-          metrics.shuffleReadMetrics.remoteBytesReadToDisk,
-          metrics.shuffleReadMetrics.localBytesRead,
-          metrics.shuffleReadMetrics.recordsRead),
-        new v1.ShuffleWriteMetrics(
-          metrics.shuffleWriteMetrics.bytesWritten,
-          metrics.shuffleWriteMetrics.writeTime,
-          metrics.shuffleWriteMetrics.recordsWritten))
-      if (old != null) calculateMetricsDelta(recordedMetrics, old) else recordedMetrics
+      val old = this.metrics
+      val newMetrics = new MetricsTracker()
--- End diff --

Create a method in `MetricsTracker` which accepts a `TaskMetrics` and updates the metrics?
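A hedged sketch of that suggestion: `MetricsTracker` gains an `update(TaskMetrics)` method so the field-by-field copying lives in one place. Only two fields are shown; the real class would mirror the full field list in the diff above.

```scala
import org.apache.spark.executor.TaskMetrics

private class MetricsTracker {
  var executorRunTime = -1L
  var executorCpuTime = -1L
  // ... the remaining metric fields from the diff ...

  // Copy the relevant values out of a TaskMetrics in one place, instead of
  // assigning each field inside LiveTask.updateMetrics.
  def update(metrics: TaskMetrics): Unit = {
    executorRunTime = metrics.executorRunTime
    executorCpuTime = metrics.executorCpuTime
    // ... copy the remaining fields the same way ...
  }
}
```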
[GitHub] spark pull request #19893: [SPARK-16139][TEST] Add logging functionality for...
Github user gaborgsomogyi commented on a diff in the pull request: https://github.com/apache/spark/pull/19893#discussion_r159851476

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/test/SharedSQLContext.scala ---
@@ -17,4 +17,17 @@

 package org.apache.spark.sql.test

-trait SharedSQLContext extends SQLTestUtils with SharedSparkSession
+trait SharedSQLContext extends SQLTestUtils with SharedSparkSession {
+
+  override protected val doThreadAuditInSparkFunSuite = false
+
+  protected override def beforeAll(): Unit = {
+    doThreadPreAudit()
--- End diff --

It's kind of similar but not the same. Comment added.
[GitHub] spark issue #19893: [SPARK-16139][TEST] Add logging functionality for leaked...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19893

**[Test build #85722 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85722/testReport)** for PR 19893 at commit [`0851ef2`](https://github.com/apache/spark/commit/0851ef2b572ed507155358e541a4fa53a8e92173).
[GitHub] spark issue #20164: [SPARK-22971][ML] OneVsRestModel should use temporary Ra...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20164

**[Test build #85721 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85721/testReport)** for PR 20164 at commit [`6c567ff`](https://github.com/apache/spark/commit/6c567ffb02738346fc83e467752add0d00a42e07).

* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #20164: [SPARK-22971][ML] OneVsRestModel should use temporary Ra...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20164

Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85721/
Test PASSed.
[GitHub] spark issue #20164: [SPARK-22971][ML] OneVsRestModel should use temporary Ra...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20164

Merged build finished. Test PASSed.
[GitHub] spark issue #20013: [SPARK-20657][core] Speed up rendering of the stages pag...
Github user gengliangwang commented on the issue: https://github.com/apache/spark/pull/20013

@vanzin I haven't run the code. I wonder which changes double the disk usage? The new indices or the cached quantiles?
[GitHub] spark issue #20163: [SPARK-22966][PySpark] Spark SQL should handle Python UD...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/20163

The problem here seems to be that `returnType` is mismatched to the value. In the case of `DateType`, it needs an explicit conversion into integers:

https://github.com/apache/spark/blob/1c9f95cb771ac78775a77edd1abfeb2d8ae2a124/python/pyspark/sql/types.py#L170-L171
https://github.com/apache/spark/blob/1c9f95cb771ac78775a77edd1abfeb2d8ae2a124/python/pyspark/sql/types.py#L173-L175

which will be called via `worker.py`:

https://github.com/apache/spark/blob/64817c423c0d82a805abd69a3e166e5bfd79c739/python/pyspark/worker.py#L70-L74

If the `returnType` is `StringType`, then it doesn't need the conversion because Pyrolite and serialization work fine between them, up to my knowledge:

https://github.com/apache/spark/blob/1c9f95cb771ac78775a77edd1abfeb2d8ae2a124/python/pyspark/sql/types.py#L141-L145
https://github.com/apache/spark/blob/1c9f95cb771ac78775a77edd1abfeb2d8ae2a124/python/pyspark/sql/types.py#L76-L82

So, here:

https://github.com/apache/spark/blob/64817c423c0d82a805abd69a3e166e5bfd79c739/python/pyspark/worker.py#L70-L74

we will send the return values as they are, without conversion, which ends up with `datetime.date` -> `java.util.Calendar` as you described in the PR description. Therefore, I don't think the current fix in `EvaluatePython.scala` is reachable in the reproducer above.

As for the fix on the Python side in `udf.py`, this is a band-aid fix. To deal with this problem correctly, I believe we should do something like:

```diff
diff --git a/python/pyspark/sql/types.py b/python/pyspark/sql/types.py
index 146e673ae97..37137e02c08 100644
--- a/python/pyspark/sql/types.py
+++ b/python/pyspark/sql/types.py
@@ -144,6 +144,17 @@ class StringType(AtomicType):

     __metaclass__ = DataTypeSingleton

+    def needConversion(self):
+        return True
+
+    def toInternal(self, v):
+        if v is not None:
+            return str(v)
+
+    def fromInternal(self, v):
+        if v is not None:
+            return str(v)
+
```

but then this will bring a performance regression because `str` would have to be called on every value. This extra function call could cause a performance regression; for example, see both https://github.com/apache/spark/pull/19246 and https://github.com/apache/spark/pull/19249. I am less sure this is something we should allow. Can we simply document this, saying `returnType` should be compatible with the actual return value?
[GitHub] spark issue #20096: [SPARK-22908] Add kafka source and sink for continuous p...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20096

**[Test build #85720 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85720/testReport)** for PR 20096 at commit [`be6c378`](https://github.com/apache/spark/commit/be6c378969920aeda6506d1c2cfb91a33dfe7027).

* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #20096: [SPARK-22908] Add kafka source and sink for continuous p...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20096

Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85720/
Test FAILed.
[GitHub] spark issue #20096: [SPARK-22908] Add kafka source and sink for continuous p...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20096

Merged build finished. Test FAILed.
[GitHub] spark pull request #20162: [SPARK-22965] [PySpark] [SQL] Add deterministic p...
Github user gatorsmile closed the pull request at: https://github.com/apache/spark/pull/20162
[GitHub] spark pull request #20165: [SPARK-22972] Couldn't find corresponding Hive Se...
GitHub user xubo245 opened a pull request: https://github.com/apache/spark/pull/20165

[SPARK-22972] Couldn't find corresponding Hive SerDe for data source provider org.apache.spark.sql.hive.orc

## What changes were proposed in this pull request?
Fix the warning: Couldn't find corresponding Hive SerDe for data source provider org.apache.spark.sql.hive.orc.

## How was this patch tested?

    test("SPARK-22972: hive orc source")
    assert(HiveSerDe.sourceToSerDe("org.apache.spark.sql.hive.orc")
      .equals(HiveSerDe.sourceToSerDe("orc")))

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/xubo245/spark HiveSerDe

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/20165.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #20165

commit fa902d6d3fb635236ac01ee5b43470359f16cfdd
Author: xubo245 <601450868@...>
Date: 2018-01-05T13:20:53Z
    [SPARK-22972] Couldn't find corresponding Hive SerDe for data source provider org.apache.spark.sql.hive.orc.
[GitHub] spark issue #20155: [SPARK-22961][REGRESSION] Constant columns should genera...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/20155

Thanks! Merged to master/2.3
[GitHub] spark issue #20165: [SPARK-22972] Couldn't find corresponding Hive SerDe for...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20165

Can one of the admins verify this patch?
[GitHub] spark pull request #20155: [SPARK-22961][REGRESSION] Constant columns should...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/20155
[GitHub] spark issue #20165: [SPARK-22972] Couldn't find corresponding Hive SerDe for...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/20165

ok to test
[GitHub] spark issue #20165: [SPARK-22972] Couldn't find corresponding Hive SerDe for...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20165

**[Test build #85723 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85723/testReport)** for PR 20165 at commit [`fa902d6`](https://github.com/apache/spark/commit/fa902d6d3fb635236ac01ee5b43470359f16cfdd).
[GitHub] spark pull request #20165: [SPARK-22972] Couldn't find corresponding Hive Se...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/20165#discussion_r159877971

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/internal/HiveSerDe.scala ---
@@ -72,7 +72,7 @@ object HiveSerDe {
   def sourceToSerDe(source: String): Option[HiveSerDe] = {
     val key = source.toLowerCase(Locale.ROOT) match {
       case s if s.startsWith("org.apache.spark.sql.parquet") => "parquet"
-      case s if s.startsWith("org.apache.spark.sql.orc") => "orc"
--- End diff --

We also need to keep the original one. We do not want to introduce a behavior change.
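A sketch of keeping both mappings, based on the match shown in the diff (not the final patch). Since "org.apache.spark.sql.hive.orc" does not start with "org.apache.spark.sql.orc", the two cases can coexist:

```scala
import java.util.Locale

// Both the old provider prefix and the new hive.orc one map to "orc".
def normalizeProvider(source: String): String = source.toLowerCase(Locale.ROOT) match {
  case s if s.startsWith("org.apache.spark.sql.parquet") => "parquet"
  case s if s.startsWith("org.apache.spark.sql.orc") => "orc"
  case s if s.startsWith("org.apache.spark.sql.hive.orc") => "orc"
  case s => s
}
```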
[GitHub] spark pull request #20165: [SPARK-22972] Couldn't find corresponding Hive Se...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/20165#discussion_r159878203

--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/HiveOrcSourceSuite.scala ---
@@ -62,6 +63,22 @@ class HiveOrcSourceSuite extends OrcSuite with TestHiveSingleton {
       """.stripMargin)
   }

+  test("SPARK-22972: hive orc source") {
+    spark.sql(
+      s"""CREATE TABLE normal_orc_as_source_hive
+         |USING org.apache.spark.sql.hive.orc
+         |OPTIONS (
+         |  PATH '${new File(orcTableAsDir.getAbsolutePath).toURI}'
+         |)
+       """.
+        stripMargin)
--- End diff --

Combine lines 73 and 74.
[GitHub] spark pull request #20165: [SPARK-22972] Couldn't find corresponding Hive Se...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/20165#discussion_r159878257

--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/HiveOrcSourceSuite.scala ---
[...]
+       """.
+        stripMargin)
+    spark.sql(
+      "desc formatted normal_orc_as_source_hive").show()
--- End diff --

?
[GitHub] spark pull request #20165: [SPARK-22972] Couldn't find corresponding Hive Se...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/20165#discussion_r159878556

--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/HiveOrcSourceSuite.scala ---
[...]
+    checkAnswer(sql("SELECT COUNT(*) FROM normal_orc_as_source_hive"), Row(10))
+    assert(HiveSerDe.sourceToSerDe("org.apache.spark.sql.hive.orc")
+      .equals(HiveSerDe.sourceToSerDe("orc")))
--- End diff --

Also add the checks:

```
assert(HiveSerDe.sourceToSerDe("org.apache.spark.sql.orc")
  .equals(HiveSerDe.sourceToSerDe("orc")))
```
[GitHub] spark pull request #20165: [SPARK-22972] Couldn't find corresponding Hive Se...
Github user xubo245 commented on a diff in the pull request: https://github.com/apache/spark/pull/20165#discussion_r159878683

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/internal/HiveSerDe.scala ---
[...]
-      case s if s.startsWith("org.apache.spark.sql.orc") => "orc"
--- End diff --

ok
[GitHub] spark pull request #20165: [SPARK-22972] Couldn't find corresponding Hive Se...
Github user xubo245 commented on a diff in the pull request: https://github.com/apache/spark/pull/20165#discussion_r159878689

--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/HiveOrcSourceSuite.scala ---
[...]
+       """.
+        stripMargin)
--- End diff --

ok
[GitHub] spark pull request #20165: [SPARK-22972] Couldn't find corresponding Hive Se...
Github user xubo245 commented on a diff in the pull request: https://github.com/apache/spark/pull/20165#discussion_r159879115

--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/HiveOrcSourceSuite.scala ---
[...]
+    spark.sql(
+      "desc formatted normal_orc_as_source_hive").show()
--- End diff --

Changed it to `spark.sql("desc formatted normal_orc_as_source_hive").show()`, is that ok? How can I capture the warning and verify it in code?
[GitHub] spark pull request #20165: [SPARK-22972] Couldn't find corresponding Hive Se...
Github user xubo245 commented on a diff in the pull request: https://github.com/apache/spark/pull/20165#discussion_r159879314

--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/HiveOrcSourceSuite.scala ---
[...]
+    checkAnswer(sql("SELECT COUNT(*) FROM normal_orc_as_source_hive"), Row(10))
+    assert(HiveSerDe.sourceToSerDe("org.apache.spark.sql.hive.orc")
+      .equals(HiveSerDe.sourceToSerDe("orc")))
--- End diff --

ok
[GitHub] spark issue #20165: [SPARK-22972] Couldn't find corresponding Hive SerDe for...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20165

**[Test build #85724 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85724/testReport)** for PR 20165 at commit [`cf7cbce`](https://github.com/apache/spark/commit/cf7cbce6061894eacbfd334f75476268068446c9).
[GitHub] spark issue #20165: [SPARK-22972] Couldn't find corresponding Hive SerDe for...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20165

Merged build finished. Test FAILed.
[GitHub] spark issue #20165: [SPARK-22972] Couldn't find corresponding Hive SerDe for...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20165

**[Test build #85724 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85724/testReport)** for PR 20165 at commit [`cf7cbce`](https://github.com/apache/spark/commit/cf7cbce6061894eacbfd334f75476268068446c9).

* This patch **fails Scala style tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #20165: [SPARK-22972] Couldn't find corresponding Hive SerDe for...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20165

Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85724/
Test FAILed.
[GitHub] spark issue #20165: [SPARK-22972] Couldn't find corresponding Hive SerDe for...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20165

**[Test build #85725 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85725/testReport)** for PR 20165 at commit [`b64ce53`](https://github.com/apache/spark/commit/b64ce532d36442cde636db54d8ecbc08d6030825).
[GitHub] spark issue #20080: [SPARK-22870][CORE] Dynamic allocation should allow 0 id...
Github user srowen commented on the issue: https://github.com/apache/spark/pull/20080

Good point @felixcheung, though the polling is hard-coded to 100ms, so it will be close enough to "immediately". Allowing a value of 0 on both timeouts seems OK, but yeah, it's not clear whether it actually has the desired effect.
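For context, a hedged sketch of the configuration being discussed; whether the 0 values have the desired effect is exactly the open question above:

```scala
import org.apache.spark.SparkConf

// With 0s timeouts, idle executors would be reclaimed on the next scheduler
// poll (hard-coded at 100ms), i.e. near-immediately.
val conf = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.dynamicAllocation.executorIdleTimeout", "0s")
  .set("spark.dynamicAllocation.cachedExecutorIdleTimeout", "0s")
```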
[GitHub] spark pull request #20076: [SPARK-21786][SQL] When acquiring 'compressionCod...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/20076#discussion_r159885686

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetCompressionCodecPrecedenceSuite.scala ---
@@ -0,0 +1,117 @@
+/*
[...]
+ */
+
+package org.apache.spark.sql.execution.datasources.parquet
+
+import java.io.File
+
+import scala.collection.JavaConverters._
+
+import org.apache.hadoop.fs.Path
+import org.apache.parquet.hadoop.ParquetOutputFormat
+
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.test.SharedSQLContext
+
+class ParquetCompressionCodecPrecedenceSuite extends ParquetTest with SharedSQLContext {
+  test("Test `spark.sql.parquet.compression.codec` config") {
+    Seq("NONE", "UNCOMPRESSED", "SNAPPY", "GZIP", "LZO").foreach { c =>
+      withSQLConf(SQLConf.PARQUET_COMPRESSION.key -> c) {
+        val expected = if (c == "NONE") "UNCOMPRESSED" else c
+        val option = new ParquetOptions(Map.empty[String, String], spark.sessionState.conf)
+        assert(option.compressionCodecClassName == expected)
+      }
+    }
+  }
+
+  test("[SPARK-21786] Test Acquiring 'compressionCodecClassName' for parquet in right order.") {
+    // When "compression" is configured, it should be the first choice.
+    withSQLConf(SQLConf.PARQUET_COMPRESSION.key -> "snappy") {
+      val props = Map("compression" -> "uncompressed", ParquetOutputFormat.COMPRESSION -> "gzip")
+      val option = new ParquetOptions(props, spark.sessionState.conf)
+      assert(option.compressionCodecClassName == "UNCOMPRESSED")
+    }
+
+    // When "compression" is not configured, "parquet.compression" should be the preferred choice.
+    withSQLConf(SQLConf.PARQUET_COMPRESSION.key -> "snappy") {
+      val props = Map(ParquetOutputFormat.COMPRESSION -> "gzip")
+      val option = new ParquetOptions(props, spark.sessionState.conf)
+      assert(option.compressionCodecClassName == "GZIP")
+    }
+
+    // When both "compression" and "parquet.compression" are not configured,
+    // spark.sql.parquet.compression.codec should be the right choice.
+    withSQLConf(SQLConf.PARQUET_COMPRESSION.key -> "snappy") {
+      val props = Map.empty[String, String]
+      val option = new ParquetOptions(props, spark.sessionState.conf)
+      assert(option.compressionCodecClassName == "SNAPPY")
+    }
+  }
+
+  private def getTableCompressionCodec(path: String): Seq[String] = {
+    val hadoopConf = spark.sessionState.newHadoopConf()
+    val codecs = for {
+      footer <- readAllFootersWithoutSummaryFiles(new Path(path), hadoopConf)
+      block <- footer.getParquetMetadata.getBlocks.asScala
+      column <- block.getColumns.asScala
+    } yield column.getCodec.name()
+    codecs.distinct
+  }
+
+  private def createTableWithCompression(
+      tableName: String,
+      isPartitioned: Boolean,
+      compressionCodec: String,
+      rootDir: File): Unit = {
+    val options =
+      s"""OPTIONS('path'='${rootDir.toURI.toString.stripSuffix("/")}/$tableName',
+         |'parquet.compression'='$compressionCodec')""".stripMargin
+    val partitionCreate = if (isPartitioned) "PARTITIONED BY (p)" else ""
+    sql(s"""CREATE TABLE $tableName USING Parquet $options $partitionCreate
+           |as select 1 as col1, 2 as p""".stripMargin)
--- End diff --

```
val options =
  s"""
    |OPTIONS('path'='${rootDir.toURI.toString.stripSuffix("/")}/$tableName',
    |'parquet.compression'='$compressionCodec')
  """.stripMargin
val partitionCreate = if (isPartitioned) "PARTITIONED BY (p)" else ""
sql(
  s"""
    |CREATE TABLE $tableName USING Parquet $options $partitionCreate
    |AS SELECT 1 AS col1, 2 AS p
  """.stripMargin)
```
[GitHub] spark pull request #20076: [SPARK-21786][SQL] When acquiring 'compressionCod...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/20076#discussion_r159889119

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -323,11 +323,13 @@ object SQLConf {
     .createWithDefault(false)

   val PARQUET_COMPRESSION = buildConf("spark.sql.parquet.compression.codec")
-    .doc("Sets the compression codec use when writing Parquet files. Acceptable values include: " +
-      "uncompressed, snappy, gzip, lzo.")
+    .doc("Sets the compression codec used when writing Parquet files. If other compression codec " +
+      "configuration was found through hive or parquet, the precedence would be `compression`, " +
--- End diff --

> Sets the compression codec used when writing Parquet files. If either `compression` or `parquet.compression` is specified in the table-specific options/properties, the precedence would be `compression`, ...
[GitHub] spark pull request #19247: [Spark-21996][SQL] read files with space in name ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/19247#discussion_r159893158

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSource.scala ---
@@ -233,7 +233,7 @@ class FileStreamSource(
     }

     val files = allFiles.sortBy(_.getModificationTime)(fileSortOrder).map { status =>
-      (status.getPath.toUri.toString, status.getModificationTime)
+      (status.getPath.toUri.getPath, status.getModificationTime)
--- End diff --

ping @xysun ^
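The distinction the replacement relies on, as a short self-contained check using the same Hadoop `Path` class:

```scala
import org.apache.hadoop.fs.Path

object PathEncodingCheck extends App {
  val p = new Path("/data/file with space.txt")
  // toUri.toString keeps the percent-encoding, toUri.getPath decodes it,
  // so a file name containing a space round-trips correctly with getPath.
  println(p.toUri.toString) // /data/file%20with%20space.txt
  println(p.toUri.getPath)  // /data/file with space.txt
}
```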
[GitHub] spark issue #19893: [SPARK-16139][TEST] Add logging functionality for leaked...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19893

**[Test build #85722 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85722/testReport)** for PR 19893 at commit [`0851ef2`](https://github.com/apache/spark/commit/0851ef2b572ed507155358e541a4fa53a8e92173).

* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #19893: [SPARK-16139][TEST] Add logging functionality for leaked...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19893

Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85722/
Test PASSed.
[GitHub] spark issue #19893: [SPARK-16139][TEST] Add logging functionality for leaked...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19893

Merged build finished. Test PASSed.
[GitHub] spark pull request #20165: [SPARK-22972] Couldn't find corresponding Hive Se...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/20165#discussion_r159900236

--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/HiveOrcSourceSuite.scala ---
[...]
+  test("SPARK-22972: hive orc source") {
+    spark.sql(
+      s"""CREATE TABLE normal_orc_as_source_hive
+         |USING org.apache.spark.sql.hive.orc
+         |OPTIONS (
+         |  PATH '${new File(orcTableAsDir.getAbsolutePath).toURI}'
+         |)
+       """.stripMargin)
+    spark.sql("desc formatted normal_orc_as_source_hive").show()
--- End diff --

Replace it by

```
val tableMetadata = spark.sessionState.catalog.getTableMetadata(
  TableIdentifier("normal_orc_as_source_hive"))
assert(tableMetadata.storage.inputFormat ==
  Option("org.apache.hadoop.hive.ql.io.orc.OrcInputFormat"))
assert(tableMetadata.storage.outputFormat ==
  Option("org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat"))
assert(tableMetadata.storage.serde ==
  Option("org.apache.hadoop.hive.ql.io.orc.OrcSerde"))
```
[GitHub] spark pull request #20165: [SPARK-22972] Couldn't find corresponding Hive Se...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/20165#discussion_r159900297 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/HiveOrcSourceSuite.scala --- @@ -62,6 +63,22 @@ class HiveOrcSourceSuite extends OrcSuite with TestHiveSingleton { """.stripMargin) } + test("SPARK-22972: hive orc source") { +spark.sql( --- End diff -- `withTable("normal_orc_as_source_hive")`
[GitHub] spark pull request #20165: [SPARK-22972] Couldn't find corresponding Hive Se...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/20165#discussion_r159900404 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/HiveOrcSourceSuite.scala --- @@ -62,6 +63,22 @@ class HiveOrcSourceSuite extends OrcSuite with TestHiveSingleton { """.stripMargin) } + test("SPARK-22972: hive orc source") { +spark.sql( + s"""CREATE TABLE normal_orc_as_source_hive --- End diff -- Put it on the next line. You can refer to the other test cases of `CREATE TABLE`.
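Taken together, the two suggestions above would make the test body look roughly like this sketch (it assumes the suite's `orcTableAsDir` member and the `withTable` helper from `SQLTestUtils`, which drops the table when the body finishes):

```
test("SPARK-22972: hive orc source") {
  withTable("normal_orc_as_source_hive") {
    spark.sql(
      s"""
         |CREATE TABLE normal_orc_as_source_hive
         |USING org.apache.spark.sql.hive.orc
         |OPTIONS (
         |  PATH '${new File(orcTableAsDir.getAbsolutePath).toURI}'
         |)
       """.stripMargin)
    // metadata assertions from the earlier review comment go here
  }
}
```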
[GitHub] spark issue #20165: [SPARK-22972] Couldn't find corresponding Hive SerDe for...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20165 **[Test build #85723 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85723/testReport)** for PR 20165 at commit [`fa902d6`](https://github.com/apache/spark/commit/fa902d6d3fb635236ac01ee5b43470359f16cfdd).

* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #20165: [SPARK-22972] Couldn't find corresponding Hive SerDe for...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20165 Merged build finished. Test FAILed.
[GitHub] spark issue #20165: [SPARK-22972] Couldn't find corresponding Hive SerDe for...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20165 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85723/ Test FAILed.
[GitHub] spark issue #20163: [SPARK-22966][PySpark] Spark SQL should handle Python UD...
Github user icexelloss commented on the issue: https://github.com/apache/spark/pull/20163

I ran some experiments:

```
py_date = udf(datetime.date, DateType())
py_timestamp = udf(datetime.datetime, TimestampType())
```

This works correctly:

```
spark.range(1).select(py_date(lit(2017), lit(10), lit(30))).show()
spark.range(1).select(py_timestamp(lit(2017), lit(10), lit(30))).show()
```

Result:

```
+------------------+
|date(2017, 10, 30)|
+------------------+
|        2017-10-30|
+------------------+

+----------------------+
|datetime(2017, 10, 30)|
+----------------------+
|   2017-10-30 00:00:00|
+----------------------+
```

The change the PR proposes seems to coerce Python `datetime.datetime` and `datetime.date` values to the Python datetime string representation rather than the Java one. We could call `str` on the return value of the Python UDF when the declared return type is a string type to get the Python representation, but that probably needs some microbenchmarking to assess the performance implications.
[GitHub] spark pull request #20166: [SPARK-22973][SQL] Fix incorrect results of Casti...
GitHub user maropu opened a pull request: https://github.com/apache/spark/pull/20166 [SPARK-22973][SQL] Fix incorrect results of Casting Map to String

## What changes were proposed in this pull request?

This PR fixes incorrect results when casting maps to strings:

```
scala> Seq(Map(1 -> "a", 2 -> "b")).toDF("a").write.saveAsTable("t")
scala> sql("SELECT cast(a as String) FROM t").show(false)
+----------------------------------------------------------------+
|a                                                               |
+----------------------------------------------------------------+
|org.apache.spark.sql.catalyst.expressions.UnsafeMapData@38bdd75d|
+----------------------------------------------------------------+
```

With this PR, the result becomes:

```
+----------------+
|a               |
+----------------+
|[1 -> a, 2 -> b]|
+----------------+
```

## How was this patch tested?

Added tests in `CastSuite`.

You can merge this pull request into a Git repository by running: $ git pull https://github.com/maropu/spark SPARK-22973 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/20166.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #20166 commit efc28e3aa2a0add8e325d60162090c9ee3fdfcba Author: Takeshi Yamamuro Date: 2018-01-05T14:33:14Z Cast maps to strings
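The target rendering is easy to see on a plain Scala map; this one-liner is illustrative only (the actual change lives in `Cast`'s eval/codegen paths, not in a helper like this):

```
def mapToString(m: Map[Int, String]): String =
  m.map { case (k, v) => s"$k -> $v" }.mkString("[", ", ", "]")

mapToString(Map(1 -> "a", 2 -> "b"))  // "[1 -> a, 2 -> b]"
```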
[GitHub] spark issue #20166: [SPARK-22973][SQL] Fix incorrect results of Casting Map ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20166 **[Test build #85726 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85726/testReport)** for PR 20166 at commit [`efc28e3`](https://github.com/apache/spark/commit/efc28e3aa2a0add8e325d60162090c9ee3fdfcba).
[GitHub] spark issue #20166: [SPARK-22973][SQL] Fix incorrect results of Casting Map ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20166 **[Test build #85727 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85727/testReport)** for PR 20166 at commit [`be04e64`](https://github.com/apache/spark/commit/be04e64733d6051864d6597420d2c982c72606e6).
[GitHub] spark issue #20165: [SPARK-22972] Couldn't find corresponding Hive SerDe for...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20165 **[Test build #85725 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85725/testReport)** for PR 20165 at commit [`b64ce53`](https://github.com/apache/spark/commit/b64ce532d36442cde636db54d8ecbc08d6030825).

* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #20165: [SPARK-22972] Couldn't find corresponding Hive SerDe for...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20165 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85725/ Test FAILed.
[GitHub] spark issue #20165: [SPARK-22972] Couldn't find corresponding Hive SerDe for...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20165 Merged build finished. Test FAILed.
[GitHub] spark pull request #20167: Allow providing Mesos principal & secret via file...
GitHub user rvesse opened a pull request: https://github.com/apache/spark/pull/20167 Allow providing Mesos principal & secret via files (SPARK-16501)

## What changes were proposed in this pull request?

This commit modifies the Mesos submission client to allow the principal and secret to be provided indirectly via files. The path to these files can be specified either via Spark configuration or via environment variable. Assuming these files are appropriately protected by FS/OS permissions, this means we never leak the actual values in process info such as `ps`.

Environment variable specification is useful because it allows you to interpolate the location of this file when using per-user Mesos credentials.

For some background as to why we have taken this approach, I will briefly describe our setup. On our systems we provide each authorised user account with their own Mesos credentials to provide certain security and audit guarantees to our customers. These credentials are managed by a central secret management service. In our `spark-env.sh` we determine the appropriate secret and principal files to use depending on the user who is invoking Spark, hence the need to inject these via environment variables as well as by configuration properties. So we set these environment variables appropriately, and Spark reads in the contents of those files to authenticate itself with Mesos.

## How was this patch tested?

This is functionality we have been using in production across multiple customer sites for some time. It has been in the field for around 18 months with no reported issues. These changes have been sufficient to meet our customer security and audit requirements.

We have been building and deploying custom builds of Apache Spark with various minor tweaks like this, which we are now looking to contribute back to the community so that we can rely upon stock Apache Spark builds and stop maintaining our own internal fork.

You can merge this pull request into a Git repository by running: $ git pull https://github.com/rvesse/spark SPARK-16501 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/20167.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #20167 commit d1bd57d29a79bfcfde33e37a6935e761456f438c Author: Rob Vesse Date: 2018-01-05T17:19:44Z Allow providing Mesos principal & secret via files (SPARK-16501) This commit modifies the Mesos submission client to allow the principal and secret to be provided indirectly via files. The path to these files can be specified either via Spark configuration or via environment variable. Assuming these files are appropriately protected by FS/OS permissions this means we don't ever leak the actual values in process info like ps Environment variable specification is useful because it allows you to interpolate the location of this file when using per-user Mesos credentials.
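A minimal sketch of the file-indirection pattern described above. The property and environment-variable names here are assumptions for illustration; the PR itself defines the actual ones:

```
import java.nio.file.{Files, Paths}

// Resolve a credential: prefer a file path from config, then a file path from
// an env var, then the literal value from config, then the literal env var.
def resolveCredential(
    conf: Map[String, String],
    prop: String,  // e.g. "spark.mesos.principal" (assumed name)
    env: String    // e.g. "SPARK_MESOS_PRINCIPAL" (assumed name)
  ): Option[String] = {
  conf.get(s"$prop.file")
    .orElse(sys.env.get(s"${env}_FILE"))
    .map(p => new String(Files.readAllBytes(Paths.get(p)), "UTF-8").trim)
    .orElse(conf.get(prop))
    .orElse(sys.env.get(env))
}
```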
[GitHub] spark issue #20167: Allow providing Mesos principal & secret via files (SPAR...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20167 Can one of the admins verify this patch?
[GitHub] spark issue #20154: [SPARK-22960][k8s] Make build-push-docker-images.sh more...
Github user vanzin commented on the issue: https://github.com/apache/spark/pull/20154 ARGs can have default values, so we could do that if we decide to use the Docker Hub infra.
[GitHub] spark issue #20155: [SPARK-22961][REGRESSION] Constant columns should genera...
Github user gengliangwang commented on the issue: https://github.com/apache/spark/pull/20155 A late LGTM!
[GitHub] spark pull request #20147: [SPARK-22940][SQL] HiveExternalCatalogVersionsSui...
Github user vanzin commented on a diff in the pull request: https://github.com/apache/spark/pull/20147#discussion_r159939270 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveExternalCatalogVersionsSuite.scala --- @@ -85,6 +93,34 @@ class HiveExternalCatalogVersionsSuite extends SparkSubmitTestUtils { new File(tmpDataDir, name).getCanonicalPath } + private def getFileFromUrl(urlString: String, targetDir: String, filename: String): Unit = { +val conf = new SparkConf +// if the caller passes the name of an existing file, we want doFetchFile to write over it with +// the contents from the specified url. +conf.set("spark.files.overwrite", "true") +val securityManager = new SecurityManager(conf) +val hadoopConf = new Configuration + +val outDir = new File(targetDir) +if (!outDir.exists()) { + outDir.mkdirs() +} + +// propagate exceptions up to the caller of getFileFromUrl --- End diff -- We generally don't add this kind of comment, since it's implied in every statement outside of a try...catch.
[GitHub] spark pull request #20147: [SPARK-22940][SQL] HiveExternalCatalogVersionsSui...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/20147
[GitHub] spark issue #20013: [SPARK-20657][core] Speed up rendering of the stages pag...
Github user vanzin commented on the issue: https://github.com/apache/spark/pull/20013

> I wonder which changes double the disk usage?

It's the new indices, specifically the values, not the keys. I tried changing the disk layout to write all the indices in a new namespace with a very short key length, and that didn't change the resulting store size at all.
[GitHub] spark issue #20096: [SPARK-22908] Add kafka source and sink for continuous p...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20096 **[Test build #85728 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85728/testReport)** for PR 20096 at commit [`1e5c7a9`](https://github.com/apache/spark/commit/1e5c7a99c5a75539500c572770c03ee4c7e6d4a0).
[GitHub] spark pull request #20013: [SPARK-20657][core] Speed up rendering of the sta...
Github user vanzin commented on a diff in the pull request: https://github.com/apache/spark/pull/20013#discussion_r159943385 --- Diff: core/src/main/scala/org/apache/spark/status/LiveEntity.scala --- @@ -119,118 +121,115 @@ private class LiveTask( import LiveEntityHelpers._ - private var recordedMetrics: v1.TaskMetrics = null + private var metrics: MetricsTracker = new MetricsTracker() var errorMessage: Option[String] = None /** * Update the metrics for the task and return the difference between the previous and new * values. */ - def updateMetrics(metrics: TaskMetrics): v1.TaskMetrics = { + def updateMetrics(metrics: TaskMetrics): MetricsTracker = { if (metrics != null) { - val old = recordedMetrics - recordedMetrics = new v1.TaskMetrics( -metrics.executorDeserializeTime, -metrics.executorDeserializeCpuTime, -metrics.executorRunTime, -metrics.executorCpuTime, -metrics.resultSize, -metrics.jvmGCTime, -metrics.resultSerializationTime, -metrics.memoryBytesSpilled, -metrics.diskBytesSpilled, -metrics.peakExecutionMemory, -new v1.InputMetrics( - metrics.inputMetrics.bytesRead, - metrics.inputMetrics.recordsRead), -new v1.OutputMetrics( - metrics.outputMetrics.bytesWritten, - metrics.outputMetrics.recordsWritten), -new v1.ShuffleReadMetrics( - metrics.shuffleReadMetrics.remoteBlocksFetched, - metrics.shuffleReadMetrics.localBlocksFetched, - metrics.shuffleReadMetrics.fetchWaitTime, - metrics.shuffleReadMetrics.remoteBytesRead, - metrics.shuffleReadMetrics.remoteBytesReadToDisk, - metrics.shuffleReadMetrics.localBytesRead, - metrics.shuffleReadMetrics.recordsRead), -new v1.ShuffleWriteMetrics( - metrics.shuffleWriteMetrics.bytesWritten, - metrics.shuffleWriteMetrics.writeTime, - metrics.shuffleWriteMetrics.recordsWritten)) - if (old != null) calculateMetricsDelta(recordedMetrics, old) else recordedMetrics + val old = this.metrics + val newMetrics = new MetricsTracker() --- End diff -- This would be the only place where it's used, so I don't see any gains.
[GitHub] spark issue #20166: [SPARK-22973][SQL] Fix incorrect results of Casting Map ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20166 **[Test build #85726 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85726/testReport)** for PR 20166 at commit [`efc28e3`](https://github.com/apache/spark/commit/efc28e3aa2a0add8e325d60162090c9ee3fdfcba).

* This patch **fails PySpark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #20166: [SPARK-22973][SQL] Fix incorrect results of Casting Map ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20166 Merged build finished. Test FAILed.
[GitHub] spark issue #20166: [SPARK-22973][SQL] Fix incorrect results of Casting Map ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20166 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85726/ Test FAILed.
[GitHub] spark issue #20168: SPARK-22730 Add ImageSchema support for non-integer imag...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20168 Can one of the admins verify this patch?
[GitHub] spark pull request #20168: SPARK-22730 Add ImageSchema support for non-integ...
GitHub user tomasatdatabricks opened a pull request: https://github.com/apache/spark/pull/20168 SPARK-22730 Add ImageSchema support for non-integer image formats

## What changes were proposed in this pull request?

Added functionality to handle all OpenCV modes in ImageSchema:
1. updated the toImage and toNDArray functions to handle non-uint8-based images.
2. added information about the individual OpenCV modes.

## How was this patch tested?

Added a test for conversion between numpy arrays and images stored in all possible OpenCV modes.

You can merge this pull request into a Git repository by running: $ git pull https://github.com/tomasatdatabricks/spark tomas/ImageSchemaUpdate Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/20168.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #20168 commit fe87dd112709ca2b101d78c97c4826536ec7da7d Author: tomasatdatabricks Date: 2017-12-29T22:56:28Z Added functionality for handling non-uint8-based images for ImageSchema commit 70bae2f7e9d85a5f464f1bfc3a9426136259d5d1 Author: tomasatdatabricks Date: 2018-01-03T19:14:06Z Added test for conversion between array and image struct for all ocv types.
[GitHub] spark issue #20166: [SPARK-22973][SQL] Fix incorrect results of Casting Map ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20166 **[Test build #85727 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85727/testReport)** for PR 20166 at commit [`be04e64`](https://github.com/apache/spark/commit/be04e64733d6051864d6597420d2c982c72606e6).

* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #20166: [SPARK-22973][SQL] Fix incorrect results of Casting Map ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20166 Merged build finished. Test PASSed.
[GitHub] spark issue #20166: [SPARK-22973][SQL] Fix incorrect results of Casting Map ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20166 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85727/ Test PASSed.
[GitHub] spark pull request #19893: [SPARK-16139][TEST] Add logging functionality for...
Github user vanzin commented on a diff in the pull request: https://github.com/apache/spark/pull/19893#discussion_r159955132 --- Diff: core/src/test/scala/org/apache/spark/ThreadAudit.scala --- @@ -0,0 +1,126 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark + +import scala.collection.JavaConverters._ + +import org.apache.spark.internal.Logging + +/** + * Thread audit for test suites. + * + * Thread audit happens normally in [[SparkFunSuite]] automatically when a new test suite created. --- End diff -- I'd just remove this paragraph since this class is independent of `SparkFunSuite`.
[GitHub] spark pull request #19893: [SPARK-16139][TEST] Add logging functionality for...
Github user vanzin commented on a diff in the pull request: https://github.com/apache/spark/pull/19893#discussion_r159954589 --- Diff: core/src/test/scala/org/apache/spark/SparkFunSuite.scala --- @@ -27,19 +27,55 @@ import org.apache.spark.util.AccumulatorContext /** * Base abstract class for all unit tests in Spark for handling common functionality. + * + * Thread audit happens normally here automatically when a new test suite created. + * The only prerequisite for that is that the test class must extend [[SparkFunSuite]]. + * + * There are some test suites which are doing initialization before [[SparkFunSuite#beforeAll]] + * executed. This case auditing can be moved into another place in the call sequence. + * + * To do the audit in a custom place/way the following can be done: + * + * class MyTestSuite extends SparkFunSuite { + * + * override val doThreadAuditInSparkFunSuite = false --- End diff -- `enableAutoThreadAudit` now
[GitHub] spark pull request #19893: [SPARK-16139][TEST] Add logging functionality for...
Github user vanzin commented on a diff in the pull request: https://github.com/apache/spark/pull/19893#discussion_r159954825 --- Diff: core/src/test/scala/org/apache/spark/SparkFunSuite.scala --- @@ -27,19 +27,55 @@ import org.apache.spark.util.AccumulatorContext /** * Base abstract class for all unit tests in Spark for handling common functionality. + * + * Thread audit happens normally here automatically when a new test suite created. + * The only prerequisite for that is that the test class must extend [[SparkFunSuite]] + * + * There are some test suites which are doing initialization before [[SparkFunSuite#beforeAll]] --- End diff -- Better: " It is possible to override the default thread audit behavior by setting `enableAutoThreadAudit` to false and manually calling the audit methods, if desired. For example: // Code "
[GitHub] spark pull request #19893: [SPARK-16139][TEST] Add logging functionality for...
Github user vanzin commented on a diff in the pull request: https://github.com/apache/spark/pull/19893#discussion_r159955747 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/test/SharedSQLContext.scala --- @@ -17,4 +17,22 @@ package org.apache.spark.sql.test -trait SharedSQLContext extends SQLTestUtils with SharedSparkSession +trait SharedSQLContext extends SQLTestUtils with SharedSparkSession { + + /** + * Auto thread audit is turned off here intentionally and done manually. --- End diff -- I'm still not entirely convinced that this is needed. I still think that any reported leaks here are caused by bugs in the test suites, not by the default audit placement. The code you have here is basically the same thing as `SparkFunSuite`. For example, if a suite extending this does not call `super.beforeAll()` but calls `super.afterAll()`, won't you get false positives in the output?
[GitHub] spark issue #19848: [SPARK-22162] Executors and the driver should use consis...
Github user steveloughran commented on the issue: https://github.com/apache/spark/pull/19848 Done.
[GitHub] spark issue #20132: [SPARK-13030][ML] Follow-up cleanups for OneHotEncoderEs...
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/20132 Thanks! Merging with master and branch-2.3
[GitHub] spark pull request #20132: [SPARK-13030][ML] Follow-up cleanups for OneHotEn...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/20132
[GitHub] spark issue #20096: [SPARK-22908] Add kafka source and sink for continuous p...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20096 **[Test build #85728 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85728/testReport)** for PR 20096 at commit [`1e5c7a9`](https://github.com/apache/spark/commit/1e5c7a99c5a75539500c572770c03ee4c7e6d4a0).

* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #20096: [SPARK-22908] Add kafka source and sink for continuous p...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20096 Merged build finished. Test PASSed.
[GitHub] spark issue #20096: [SPARK-22908] Add kafka source and sink for continuous p...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20096 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85728/ Test PASSed.
[GitHub] spark pull request #20097: [SPARK-22912] v2 data source support in MicroBatc...
Github user tdas commented on a diff in the pull request: https://github.com/apache/spark/pull/20097#discussion_r159958467 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala --- @@ -33,15 +40,21 @@ class MicroBatchExecution( name: String, checkpointRoot: String, analyzedPlan: LogicalPlan, -sink: Sink, +sink: BaseStreamingSink, trigger: Trigger, triggerClock: Clock, outputMode: OutputMode, +extraOptions: Map[String, String], deleteCheckpointOnStop: Boolean) extends StreamExecution( sparkSession, name, checkpointRoot, analyzedPlan, sink, trigger, triggerClock, outputMode, deleteCheckpointOnStop) { + private def toJava( --- End diff -- super nit: I usually prefer moving such small, less-important methods to the end of the class
[GitHub] spark pull request #20097: [SPARK-22912] v2 data source support in MicroBatc...
Github user tdas commented on a diff in the pull request: https://github.com/apache/spark/pull/20097#discussion_r159960120 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala --- @@ -236,14 +264,27 @@ class MicroBatchExecution( val hasNewData = { awaitProgressLock.lock() try { -val latestOffsets: Map[Source, Option[Offset]] = uniqueSources.map { +// Generate a map from each unique source to the next available offset. +val latestOffsets: Map[BaseStreamingSource, Option[Offset]] = uniqueSources.map { case s: Source => updateStatusMessage(s"Getting offsets from $s") reportTimeTaken("getOffset") { (s, s.getOffset) } + case s: MicroBatchReader => +updateStatusMessage(s"Getting offsets from $s") +reportTimeTaken("getOffset") { + // Once v1 streaming source execution is gone, we can refactor this away. + // For now, we set the range here to get the source to infer the available end offset, + // get that offset, and then set the range again when we later execute. +s.setOffsetRange( --- End diff -- incorrect indentation.
[GitHub] spark pull request #20097: [SPARK-22912] v2 data source support in MicroBatc...
Github user tdas commented on a diff in the pull request: https://github.com/apache/spark/pull/20097#discussion_r159980863 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Relation.scala --- @@ -35,6 +35,16 @@ case class DataSourceV2Relation( } } +/** + * A specialization of DataSourceV2Relation with the streaming bit set to true. Otherwwise identical --- End diff -- nit: `Otherwwise`?? :)
[GitHub] spark pull request #20097: [SPARK-22912] v2 data source support in MicroBatc...
Github user tdas commented on a diff in the pull request: https://github.com/apache/spark/pull/20097#discussion_r159981274 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/sources/RateStreamSourceV2.scala --- @@ -28,17 +28,38 @@ import org.json4s.jackson.Serialization import org.apache.spark.sql.Row import org.apache.spark.sql.catalyst.util.DateTimeUtils import org.apache.spark.sql.execution.streaming.{RateStreamOffset, ValueRunTimeMsPair} -import org.apache.spark.sql.sources.v2.DataSourceV2Options +import org.apache.spark.sql.sources.DataSourceRegister +import org.apache.spark.sql.sources.v2.{DataSourceV2, DataSourceV2Options} import org.apache.spark.sql.sources.v2.reader._ +import org.apache.spark.sql.sources.v2.streaming.MicroBatchReadSupport import org.apache.spark.sql.sources.v2.streaming.reader.{MicroBatchReader, Offset} import org.apache.spark.sql.types.{LongType, StructField, StructType, TimestampType} -import org.apache.spark.util.SystemClock +import org.apache.spark.util.{ManualClock, SystemClock} -class RateStreamV2Reader(options: DataSourceV2Options) +/** + * This is a temporary register as we build out v2 migration. Microbatch read support should + * be implemented in the same register as v1. + */ +class RateSourceProviderV2 extends DataSourceV2 with MicroBatchReadSupport with DataSourceRegister { + override def createMicroBatchReader( + schema: Optional[StructType], + checkpointLocation: String, + options: DataSourceV2Options): MicroBatchReader = { +new MicroBatchRateStreamReader(options) + } + + override def shortName(): String = "ratev2" +} + +class MicroBatchRateStreamReader(options: DataSourceV2Options) --- End diff -- As with the other kafka PR, can you rename these classes to start with "RateStream"? Only if it is not too much refactoring, otherwise we can clean this up later.
[GitHub] spark pull request #20097: [SPARK-22912] v2 data source support in MicroBatc...
Github user tdas commented on a diff in the pull request: https://github.com/apache/spark/pull/20097#discussion_r159961186 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala --- @@ -357,31 +400,39 @@ class MicroBatchExecution( s"DataFrame returned by getBatch from $source did not have isStreaming=true\n" + s"${batch.queryExecution.logical}") logDebug(s"Retrieving data from $source: $current -> $available") - Some(source -> batch) + Some(source -> batch.logicalPlan) +case (reader: MicroBatchReader, available) + if committedOffsets.get(reader).map(_ != available).getOrElse(true) => + val current = committedOffsets.get(reader).map(off => reader.deserializeOffset(off.json)) + reader.setOffsetRange( +toJava(current), + Optional.of(available.asInstanceOf[v2.streaming.reader.Offset])) --- End diff -- `v2.streaming.reader.Offset` is being used in a lot of places. Please rename it to OffsetV2 in the imports and use that in all places.
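For reference, the renamed-import pattern being asked for looks like this; the `toJava` body is a guess at the existing helper, shown inside the class only to illustrate the alias in use:

```
import java.util.Optional

import org.apache.spark.sql.sources.v2.streaming.reader.{Offset => OffsetV2}

// Inside MicroBatchExecution:
private def toJava(scalaOption: Option[OffsetV2]): Optional[OffsetV2] =
  Optional.ofNullable(scalaOption.orNull)
```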
[GitHub] spark pull request #20097: [SPARK-22912] v2 data source support in MicroBatc...
Github user tdas commented on a diff in the pull request: https://github.com/apache/spark/pull/20097#discussion_r159978319 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala --- @@ -392,6 +443,21 @@ class MicroBatchExecution( cd.dataType, cd.timeZoneId) } +val triggerLogicalPlan = sink match { + case _: Sink => newAttributePlan + case s: MicroBatchWriteSupport => +val writer = s.createMicroBatchWriter( + s"$runId", + currentBatchId, + newAttributePlan.schema, + outputMode, + new DataSourceV2Options(extraOptions.asJava)) +Option(writer.orElse(null)).map(WriteToDataSourceV2(_, newAttributePlan)).getOrElse { --- End diff -- can you add a comment explaining why the fallback is a LocalRelation, and when the writer can be empty?
[GitHub] spark issue #20142: [SPARK-22930][PYTHON][SQL] Improve the description of Ve...
Github user icexelloss commented on the issue: https://github.com/apache/spark/pull/20142 I added the test. @gatorsmile, do you have time to take a look? Or let me know who I should ping for review.
[GitHub] spark pull request #20097: [SPARK-22912] v2 data source support in MicroBatc...
Github user tdas commented on a diff in the pull request: https://github.com/apache/spark/pull/20097#discussion_r159980158 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala --- @@ -33,15 +40,21 @@ class MicroBatchExecution( name: String, checkpointRoot: String, analyzedPlan: LogicalPlan, -sink: Sink, +sink: BaseStreamingSink, trigger: Trigger, triggerClock: Clock, outputMode: OutputMode, +extraOptions: Map[String, String], deleteCheckpointOnStop: Boolean) extends StreamExecution( sparkSession, name, checkpointRoot, analyzedPlan, sink, trigger, triggerClock, outputMode, deleteCheckpointOnStop) { + private def toJava( + scalaOption: Option[v2.streaming.reader.Offset]): Optional[v2.streaming.reader.Offset] = { --- End diff -- mentioned elsewhere as well, import new Offset as OffsetV2 instead of using the full package path.