[GitHub] spark pull request #19893: [SPARK-16139][TEST] Add logging functionality for...

2018-01-05 Thread gaborgsomogyi
Github user gaborgsomogyi commented on a diff in the pull request:

https://github.com/apache/spark/pull/19893#discussion_r159831366
  
--- Diff: core/src/test/scala/org/apache/spark/ThreadAudit.scala ---
@@ -0,0 +1,126 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark
+
+import scala.collection.JavaConverters._
+
+import org.apache.spark.internal.Logging
+
+/**
+ * Thread audit for test suites.
+ *
+ * Thread audit happens normally in [[SparkFunSuite]] automatically when a 
new test suite created.
--- End diff --

Good point, moving.


---




[GitHub] spark pull request #19893: [SPARK-16139][TEST] Add logging functionality for...

2018-01-05 Thread gaborgsomogyi
Github user gaborgsomogyi commented on a diff in the pull request:

https://github.com/apache/spark/pull/19893#discussion_r159831631
  
--- Diff: core/src/test/scala/org/apache/spark/ThreadAudit.scala ---
@@ -0,0 +1,126 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark
+
+import scala.collection.JavaConverters._
+
+import org.apache.spark.internal.Logging
+
+/**
+ * Thread audit for test suites.
+ *
+ * Thread audit happens normally in [[SparkFunSuite]] automatically when a 
new test suite created.
+ * The only prerequisite for that is that the test class must extend 
[[SparkFunSuite]].
+ *
+ * There are some test suites which are doing initialization before 
[[SparkFunSuite#beforeAll]]
+ * executed. This case auditing can be moved into another place in the 
call sequence.
+ *
+ * To do the audit in a custom place/way the following can be done:
+ *
+ * class MyTestSuite extends SparkFunSuite {
+ *
+ *   override val doThreadAuditInSparkFunSuite = false
+ *
+ *   protected override def beforeAll(): Unit = {
+ * doThreadPreAudit
+ * super.beforeAll
+ *   }
+ *
+ *   protected override def afterAll(): Unit = {
+ * super.afterAll
+ * doThreadPostAudit
+ *   }
+ * }
+ */
+trait ThreadAudit extends Logging {
+
+  val threadWhiteList = Set(
+/**
+ * Netty related internal threads.
+ * These are excluded because their lifecycle is handled by the netty 
itself
+ * and spark has no explicit effect on them.
+ */
+"netty.*",
+
+/**
+ * Netty related internal threads.
+ * A Single-thread singleton EventExecutor inside netty which creates 
such threads.
+ * These are excluded because their lifecycle is handled by the netty 
itself
+ * and spark has no explicit effect on them.
+ */
+"globalEventExecutor.*",
+
+/**
+ * Netty related internal threads.
+ * Checks if a thread is alive periodically and runs a task when a 
thread dies.
+ * These are excluded because their lifecycle is handled by the netty 
itself
+ * and spark has no explicit effect on them.
+ */
+"threadDeathWatcher.*",
+
+/**
+ * During [[SparkContext]] creation 
[[org.apache.spark.rpc.netty.NettyRpcEnv]]
+ * creates event loops. One is wrapped inside
+ * [[org.apache.spark.network.server.TransportServer]]
+ * the other one is inside 
[[org.apache.spark.network.client.TransportClient]].
+ * The thread pools behind shut down asynchronously triggered by 
[[SparkContext#stop]].
+ * Manually checked and all of them stopped properly.
+ */
+"rpc-client.*",
+"rpc-server.*",
+
+/**
+ * During [[SparkContext]] creation BlockManager
+ * creates event loops. One is wrapped inside
--- End diff --

Fixed.


---




[GitHub] spark pull request #19893: [SPARK-16139][TEST] Add logging functionality for...

2018-01-05 Thread gaborgsomogyi
Github user gaborgsomogyi commented on a diff in the pull request:

https://github.com/apache/spark/pull/19893#discussion_r159832598
  
--- Diff: core/src/test/scala/org/apache/spark/ThreadAudit.scala ---
@@ -0,0 +1,126 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark
+
+import scala.collection.JavaConverters._
+
+import org.apache.spark.internal.Logging
+
+/**
+ * Thread audit for test suites.
+ *
+ * Thread audit happens normally in [[SparkFunSuite]] automatically when a 
new test suite created.
+ * The only prerequisite for that is that the test class must extend 
[[SparkFunSuite]].
+ *
+ * There are some test suites which are doing initialization before 
[[SparkFunSuite#beforeAll]]
+ * executed. This case auditing can be moved into another place in the 
call sequence.
+ *
+ * To do the audit in a custom place/way the following can be done:
+ *
+ * class MyTestSuite extends SparkFunSuite {
+ *
+ *   override val doThreadAuditInSparkFunSuite = false
+ *
+ *   protected override def beforeAll(): Unit = {
+ * doThreadPreAudit
+ * super.beforeAll
+ *   }
+ *
+ *   protected override def afterAll(): Unit = {
+ * super.afterAll
+ * doThreadPostAudit
+ *   }
+ * }
+ */
+trait ThreadAudit extends Logging {
+
+  val threadWhiteList = Set(
+/**
+ * Netty related internal threads.
+ * These are excluded because their lifecycle is handled by the netty 
itself
+ * and spark has no explicit effect on them.
+ */
+"netty.*",
+
+/**
+ * Netty related internal threads.
+ * A Single-thread singleton EventExecutor inside netty which creates 
such threads.
+ * These are excluded because their lifecycle is handled by the netty 
itself
+ * and spark has no explicit effect on them.
+ */
+"globalEventExecutor.*",
+
+/**
+ * Netty related internal threads.
+ * Checks if a thread is alive periodically and runs a task when a 
thread dies.
+ * These are excluded because their lifecycle is handled by the netty 
itself
+ * and spark has no explicit effect on them.
+ */
+"threadDeathWatcher.*",
+
+/**
+ * During [[SparkContext]] creation 
[[org.apache.spark.rpc.netty.NettyRpcEnv]]
+ * creates event loops. One is wrapped inside
+ * [[org.apache.spark.network.server.TransportServer]]
+ * the other one is inside 
[[org.apache.spark.network.client.TransportClient]].
+ * The thread pools behind shut down asynchronously triggered by 
[[SparkContext#stop]].
+ * Manually checked and all of them stopped properly.
+ */
+"rpc-client.*",
+"rpc-server.*",
+
+/**
+ * During [[SparkContext]] creation BlockManager
+ * creates event loops. One is wrapped inside
+ * [[org.apache.spark.network.server.TransportServer]]
+ * the other one is inside 
[[org.apache.spark.network.client.TransportClient]].
+ * The thread pools behind shut down asynchronously triggered by 
[[SparkContext#stop]].
+ * Manually checked and all of them stopped properly.
+ */
+"shuffle-client.*",
+"shuffle-server.*"
+  )
+  private var threadNamesSnapshot: Set[String] = Set.empty
+
+  protected def doThreadPreAudit(): Unit = snapshotRunningThreadNames
+  protected def doThreadPostAudit(): Unit = printRemainingThreadNames
+
+  private def snapshotRunningThreadNames(): Unit = {
--- End diff --

Inlined.


---




[GitHub] spark pull request #19893: [SPARK-16139][TEST] Add logging functionality for...

2018-01-05 Thread gaborgsomogyi
Github user gaborgsomogyi commented on a diff in the pull request:

https://github.com/apache/spark/pull/19893#discussion_r159832729
  
--- Diff: core/src/test/scala/org/apache/spark/ThreadAudit.scala ---
@@ -0,0 +1,126 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark
+
+import scala.collection.JavaConverters._
+
+import org.apache.spark.internal.Logging
+
+/**
+ * Thread audit for test suites.
+ *
+ * Thread audit happens normally in [[SparkFunSuite]] automatically when a 
new test suite created.
+ * The only prerequisite for that is that the test class must extend 
[[SparkFunSuite]].
+ *
+ * There are some test suites which are doing initialization before 
[[SparkFunSuite#beforeAll]]
+ * executed. This case auditing can be moved into another place in the 
call sequence.
+ *
+ * To do the audit in a custom place/way the following can be done:
+ *
+ * class MyTestSuite extends SparkFunSuite {
+ *
+ *   override val doThreadAuditInSparkFunSuite = false
+ *
+ *   protected override def beforeAll(): Unit = {
+ * doThreadPreAudit
+ * super.beforeAll
+ *   }
+ *
+ *   protected override def afterAll(): Unit = {
+ * super.afterAll
+ * doThreadPostAudit
+ *   }
+ * }
+ */
+trait ThreadAudit extends Logging {
+
+  val threadWhiteList = Set(
+/**
+ * Netty related internal threads.
+ * These are excluded because their lifecycle is handled by the netty 
itself
+ * and spark has no explicit effect on them.
+ */
+"netty.*",
+
+/**
+ * Netty related internal threads.
+ * A Single-thread singleton EventExecutor inside netty which creates 
such threads.
+ * These are excluded because their lifecycle is handled by the netty 
itself
+ * and spark has no explicit effect on them.
+ */
+"globalEventExecutor.*",
+
+/**
+ * Netty related internal threads.
+ * Checks if a thread is alive periodically and runs a task when a 
thread dies.
+ * These are excluded because their lifecycle is handled by the netty 
itself
+ * and spark has no explicit effect on them.
+ */
+"threadDeathWatcher.*",
+
+/**
+ * During [[SparkContext]] creation 
[[org.apache.spark.rpc.netty.NettyRpcEnv]]
+ * creates event loops. One is wrapped inside
+ * [[org.apache.spark.network.server.TransportServer]]
+ * the other one is inside 
[[org.apache.spark.network.client.TransportClient]].
+ * The thread pools behind shut down asynchronously triggered by 
[[SparkContext#stop]].
+ * Manually checked and all of them stopped properly.
+ */
+"rpc-client.*",
+"rpc-server.*",
+
+/**
+ * During [[SparkContext]] creation BlockManager
+ * creates event loops. One is wrapped inside
+ * [[org.apache.spark.network.server.TransportServer]]
+ * the other one is inside 
[[org.apache.spark.network.client.TransportClient]].
+ * The thread pools behind shut down asynchronously triggered by 
[[SparkContext#stop]].
+ * Manually checked and all of them stopped properly.
+ */
+"shuffle-client.*",
+"shuffle-server.*"
+  )
+  private var threadNamesSnapshot: Set[String] = Set.empty
+
+  protected def doThreadPreAudit(): Unit = snapshotRunningThreadNames
+  protected def doThreadPostAudit(): Unit = printRemainingThreadNames
+
+  private def snapshotRunningThreadNames(): Unit = {
+threadNamesSnapshot = runningThreadNames
--- End diff --

Fixed.


---




[GitHub] spark pull request #19893: [SPARK-16139][TEST] Add logging functionality for...

2018-01-05 Thread gaborgsomogyi
Github user gaborgsomogyi commented on a diff in the pull request:

https://github.com/apache/spark/pull/19893#discussion_r159834709
  
--- Diff: core/src/test/scala/org/apache/spark/ThreadAudit.scala ---
@@ -0,0 +1,126 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark
+
+import scala.collection.JavaConverters._
+
+import org.apache.spark.internal.Logging
+
+/**
+ * Thread audit for test suites.
+ *
+ * Thread audit happens normally in [[SparkFunSuite]] automatically when a 
new test suite created.
+ * The only prerequisite for that is that the test class must extend 
[[SparkFunSuite]].
+ *
+ * There are some test suites which are doing initialization before 
[[SparkFunSuite#beforeAll]]
+ * executed. This case auditing can be moved into another place in the 
call sequence.
+ *
+ * To do the audit in a custom place/way the following can be done:
+ *
+ * class MyTestSuite extends SparkFunSuite {
+ *
+ *   override val doThreadAuditInSparkFunSuite = false
+ *
+ *   protected override def beforeAll(): Unit = {
+ * doThreadPreAudit
+ * super.beforeAll
+ *   }
+ *
+ *   protected override def afterAll(): Unit = {
+ * super.afterAll
+ * doThreadPostAudit
+ *   }
+ * }
+ */
+trait ThreadAudit extends Logging {
+
+  val threadWhiteList = Set(
+/**
+ * Netty related internal threads.
+ * These are excluded because their lifecycle is handled by the netty 
itself
+ * and spark has no explicit effect on them.
+ */
+"netty.*",
+
+/**
+ * Netty related internal threads.
+ * A Single-thread singleton EventExecutor inside netty which creates 
such threads.
+ * These are excluded because their lifecycle is handled by the netty 
itself
+ * and spark has no explicit effect on them.
+ */
+"globalEventExecutor.*",
+
+/**
+ * Netty related internal threads.
+ * Checks if a thread is alive periodically and runs a task when a 
thread dies.
+ * These are excluded because their lifecycle is handled by the netty 
itself
+ * and spark has no explicit effect on them.
+ */
+"threadDeathWatcher.*",
+
+/**
+ * During [[SparkContext]] creation 
[[org.apache.spark.rpc.netty.NettyRpcEnv]]
+ * creates event loops. One is wrapped inside
+ * [[org.apache.spark.network.server.TransportServer]]
+ * the other one is inside 
[[org.apache.spark.network.client.TransportClient]].
+ * The thread pools behind shut down asynchronously triggered by 
[[SparkContext#stop]].
+ * Manually checked and all of them stopped properly.
+ */
+"rpc-client.*",
+"rpc-server.*",
+
+/**
+ * During [[SparkContext]] creation BlockManager
+ * creates event loops. One is wrapped inside
+ * [[org.apache.spark.network.server.TransportServer]]
+ * the other one is inside 
[[org.apache.spark.network.client.TransportClient]].
+ * The thread pools behind shut down asynchronously triggered by 
[[SparkContext#stop]].
+ * Manually checked and all of them stopped properly.
+ */
+"shuffle-client.*",
+"shuffle-server.*"
+  )
+  private var threadNamesSnapshot: Set[String] = Set.empty
+
+  protected def doThreadPreAudit(): Unit = snapshotRunningThreadNames
+  protected def doThreadPostAudit(): Unit = printRemainingThreadNames
+
+  private def snapshotRunningThreadNames(): Unit = {
+threadNamesSnapshot = runningThreadNames
+  }
+
+  private def printRemainingThreadNames(): Unit = {
--- End diff --

Inlined.


---




[GitHub] spark pull request #19893: [SPARK-16139][TEST] Add logging functionality for...

2018-01-05 Thread gaborgsomogyi
Github user gaborgsomogyi commented on a diff in the pull request:

https://github.com/apache/spark/pull/19893#discussion_r159837804
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/SessionStateSuite.scala ---
@@ -39,6 +41,7 @@ class SessionStateSuite extends SparkFunSuite
   protected var activeSession: SparkSession = _
 
   override def beforeAll(): Unit = {
+doThreadPreAudit()
--- End diff --

Fixed.


---




[GitHub] spark pull request #19893: [SPARK-16139][TEST] Add logging functionality for...

2018-01-05 Thread gaborgsomogyi
Github user gaborgsomogyi commented on a diff in the pull request:

https://github.com/apache/spark/pull/19893#discussion_r159838691
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/sources/DataSourceAnalysisSuite.scala
 ---
@@ -29,16 +29,24 @@ import org.apache.spark.sql.types.{DataType, 
IntegerType, StructType}
 
 class DataSourceAnalysisSuite extends SparkFunSuite with BeforeAndAfterAll 
{
 
+  protected override val doThreadAuditInSparkFunSuite = false
+
   private var targetAttributes: Seq[Attribute] = _
   private var targetPartitionSchema: StructType = _
 
   override def beforeAll(): Unit = {
+doThreadPreAudit()
--- End diff --

Fixed.


---




[GitHub] spark issue #20096: [SPARK-22908] Add kafka source and sink for continuous p...

2018-01-05 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20096
  
**[Test build #85720 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85720/testReport)**
 for PR 20096 at commit 
[`be6c378`](https://github.com/apache/spark/commit/be6c378969920aeda6506d1c2cfb91a33dfe7027).


---




[GitHub] spark pull request #19893: [SPARK-16139][TEST] Add logging functionality for...

2018-01-05 Thread gaborgsomogyi
Github user gaborgsomogyi commented on a diff in the pull request:

https://github.com/apache/spark/pull/19893#discussion_r159840754
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveSessionStateSuite.scala 
---
@@ -28,14 +28,18 @@ import org.apache.spark.sql.hive.test.TestHiveSingleton
 class HiveSessionStateSuite extends SessionStateSuite
   with TestHiveSingleton with BeforeAndAfterEach {
 
+  override protected val doThreadAuditInSparkFunSuite = false
+
   override def beforeAll(): Unit = {
 // Reuse the singleton session
 activeSession = spark
+doThreadPreAudit()
--- End diff --

Fixed.


---




[GitHub] spark pull request #19893: [SPARK-16139][TEST] Add logging functionality for...

2018-01-05 Thread gaborgsomogyi
Github user gaborgsomogyi commented on a diff in the pull request:

https://github.com/apache/spark/pull/19893#discussion_r159842007
  
--- Diff: core/src/test/scala/org/apache/spark/SparkFunSuite.scala ---
@@ -31,15 +31,28 @@ import org.apache.spark.util.AccumulatorContext
 abstract class SparkFunSuite
   extends FunSuite
   with BeforeAndAfterAll
+  with ThreadAudit
   with Logging {
 // scalastyle:on
 
+  protected val doThreadAuditInSparkFunSuite = true
--- End diff --

Renamed to enableAutoThreadAudit.
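
For readers of the archive, here is a minimal sketch of how a suite would opt out under the renamed flag. It mirrors the Scaladoc example quoted earlier in this thread; the flag name follows the comment above and should be treated as an assumption about the final code.

```
// Sketch only: assumes SparkFunSuite exposes enableAutoThreadAudit (per the rename above)
// plus doThreadPreAudit()/doThreadPostAudit() from the ThreadAudit trait.
class MyCustomInitSuite extends SparkFunSuite {

  // Disable the audit that SparkFunSuite would otherwise run automatically...
  override protected val enableAutoThreadAudit = false

  protected override def beforeAll(): Unit = {
    doThreadPreAudit()   // ...and take the thread snapshot before the custom initialization
    super.beforeAll()
  }

  protected override def afterAll(): Unit = {
    super.afterAll()
    doThreadPostAudit()  // log any threads still running after teardown
  }
}
```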


---




[GitHub] spark pull request #19893: [SPARK-16139][TEST] Add logging functionality for...

2018-01-05 Thread gaborgsomogyi
Github user gaborgsomogyi commented on a diff in the pull request:

https://github.com/apache/spark/pull/19893#discussion_r159841944
  
--- Diff: core/src/test/scala/org/apache/spark/SparkFunSuite.scala ---
@@ -31,15 +31,28 @@ import org.apache.spark.util.AccumulatorContext
 abstract class SparkFunSuite
   extends FunSuite
   with BeforeAndAfterAll
+  with ThreadAudit
   with Logging {
 // scalastyle:on
 
+  protected val doThreadAuditInSparkFunSuite = true
--- End diff --

I was thinking about the proper naming before. The last suggested one is 
definitely better: it doesn't state an exact place where the audit happens, but it 
also doesn't suggest that auditing is completely turned off.


---




[GitHub] spark pull request #20164: [SPARK-22971][ML] OneVsRestModel should use tempo...

2018-01-05 Thread zhengruifeng
GitHub user zhengruifeng opened a pull request:

https://github.com/apache/spark/pull/20164

[SPARK-22971][ML] OneVsRestModel should use temporary RawPredictionCol

## What changes were proposed in this pull request?
Use a temporary RawPredictionCol in `OneVsRestModel#transform`.
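
A hedged sketch of the idea (the helper and naming scheme below are illustrative assumptions, not the PR's actual code): give the per-class classifiers a throwaway rawPrediction column so that `transform` neither collides with nor leaks the user-visible rawPredictionCol.

```
import java.util.UUID

object TempColumns {
  // Hypothetical helper: a unique, temporary column name for the sub-models' raw predictions.
  def temporaryRawPredictionCol(): String =
    "rawPrediction_" + UUID.randomUUID().toString.takeRight(12)
}
// Each binary sub-model would write into this temporary column, which is dropped from the
// output DataFrame after the per-class scores are combined, instead of reusing the
// user-facing rawPredictionCol.
```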

## How was this patch tested?
existing tests and added tests

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/zhengruifeng/spark 
ovr_not_use_getRawPredictionCol

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/20164.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #20164


commit f155e1cc6b175ac06a5f2ab710d4c053b0776507
Author: Zheng RuiFeng 
Date:   2018-01-05T09:29:25Z

create pr

commit 9b0dcc69535b6731c9b6cdc0030c846c3352a5de
Author: Zheng RuiFeng 
Date:   2018-01-05T10:19:59Z

create pr

commit 6c567ffb02738346fc83e467752add0d00a42e07
Author: Zheng RuiFeng 
Date:   2018-01-05T10:26:16Z

add test




---




[GitHub] spark issue #20164: [SPARK-22971][ML] OneVsRestModel should use temporary Ra...

2018-01-05 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20164
  
**[Test build #85721 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85721/testReport)**
 for PR 20164 at commit 
[`6c567ff`](https://github.com/apache/spark/commit/6c567ffb02738346fc83e467752add0d00a42e07).


---




[GitHub] spark pull request #20013: [SPARK-20657][core] Speed up rendering of the sta...

2018-01-05 Thread gengliangwang
Github user gengliangwang commented on a diff in the pull request:

https://github.com/apache/spark/pull/20013#discussion_r159844819
  
--- Diff: 
core/src/main/scala/org/apache/spark/status/AppStatusListener.scala ---
@@ -848,7 +853,7 @@ private[spark] class AppStatusListener(
 }
 
 stages.foreach { s =>
-  val key = s.id
+  val key = Array(s.info.stageId, s.info.attemptId)
--- End diff --

Use a case class instead?
Or create a function `xxxKey` that returns the Array/Tuple. 
Using `Array(s.info.stageId, s.info.attemptId)` in several places doesn't look 
robust enough.
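
A sketch of the first suggestion, with hypothetical names (not code from the PR):

```
// Hypothetical helper: a case class key instead of repeating Array(stageId, attemptId)
// at every call site.
private[spark] case class StageKey(stageId: Int, attemptId: Int)

// At the call site shown in the diff above (sketch):
// val key = StageKey(s.info.stageId, s.info.attemptId)
```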


---




[GitHub] spark pull request #20013: [SPARK-20657][core] Speed up rendering of the sta...

2018-01-05 Thread gengliangwang
Github user gengliangwang commented on a diff in the pull request:

https://github.com/apache/spark/pull/20013#discussion_r159835376
  
--- Diff: core/src/main/scala/org/apache/spark/status/LiveEntity.scala ---
@@ -119,118 +121,115 @@ private class LiveTask(
 
   import LiveEntityHelpers._
 
-  private var recordedMetrics: v1.TaskMetrics = null
+  private var metrics: MetricsTracker = new MetricsTracker()
 
   var errorMessage: Option[String] = None
 
   /**
* Update the metrics for the task and return the difference between the 
previous and new
* values.
*/
-  def updateMetrics(metrics: TaskMetrics): v1.TaskMetrics = {
+  def updateMetrics(metrics: TaskMetrics): MetricsTracker = {
 if (metrics != null) {
-  val old = recordedMetrics
-  recordedMetrics = new v1.TaskMetrics(
-metrics.executorDeserializeTime,
-metrics.executorDeserializeCpuTime,
-metrics.executorRunTime,
-metrics.executorCpuTime,
-metrics.resultSize,
-metrics.jvmGCTime,
-metrics.resultSerializationTime,
-metrics.memoryBytesSpilled,
-metrics.diskBytesSpilled,
-metrics.peakExecutionMemory,
-new v1.InputMetrics(
-  metrics.inputMetrics.bytesRead,
-  metrics.inputMetrics.recordsRead),
-new v1.OutputMetrics(
-  metrics.outputMetrics.bytesWritten,
-  metrics.outputMetrics.recordsWritten),
-new v1.ShuffleReadMetrics(
-  metrics.shuffleReadMetrics.remoteBlocksFetched,
-  metrics.shuffleReadMetrics.localBlocksFetched,
-  metrics.shuffleReadMetrics.fetchWaitTime,
-  metrics.shuffleReadMetrics.remoteBytesRead,
-  metrics.shuffleReadMetrics.remoteBytesReadToDisk,
-  metrics.shuffleReadMetrics.localBytesRead,
-  metrics.shuffleReadMetrics.recordsRead),
-new v1.ShuffleWriteMetrics(
-  metrics.shuffleWriteMetrics.bytesWritten,
-  metrics.shuffleWriteMetrics.writeTime,
-  metrics.shuffleWriteMetrics.recordsWritten))
-  if (old != null) calculateMetricsDelta(recordedMetrics, old) else 
recordedMetrics
+  val old = this.metrics
+  val newMetrics = new MetricsTracker()
--- End diff --

Create a method in `MetricsTracker` which accepts a `TaskMetrics` and 
updates the metrics?
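
A minimal sketch of that suggestion; the field list is abbreviated and the method name is an assumption:

```
import org.apache.spark.executor.TaskMetrics

private class MetricsTracker {
  var executorRunTime = -1L
  var resultSize = -1L
  // ... the remaining metric fields elided ...

  // Copy values from a TaskMetrics here, instead of assigning every field one by one
  // at the call site in LiveTask.updateMetrics.
  def update(metrics: TaskMetrics): Unit = {
    executorRunTime = metrics.executorRunTime
    resultSize = metrics.resultSize
    // ... copy the remaining fields the same way ...
  }
}
```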


---




[GitHub] spark pull request #19893: [SPARK-16139][TEST] Add logging functionality for...

2018-01-05 Thread gaborgsomogyi
Github user gaborgsomogyi commented on a diff in the pull request:

https://github.com/apache/spark/pull/19893#discussion_r159851476
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/test/SharedSQLContext.scala ---
@@ -17,4 +17,17 @@
 
 package org.apache.spark.sql.test
 
-trait SharedSQLContext extends SQLTestUtils with SharedSparkSession
+trait SharedSQLContext extends SQLTestUtils with SharedSparkSession {
+
+  override protected val doThreadAuditInSparkFunSuite = false
+
+  protected override def beforeAll(): Unit = {
+doThreadPreAudit()
--- End diff --

It's kind of similar but not the same. Comment added.


---




[GitHub] spark issue #19893: [SPARK-16139][TEST] Add logging functionality for leaked...

2018-01-05 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19893
  
**[Test build #85722 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85722/testReport)**
 for PR 19893 at commit 
[`0851ef2`](https://github.com/apache/spark/commit/0851ef2b572ed507155358e541a4fa53a8e92173).


---




[GitHub] spark issue #20164: [SPARK-22971][ML] OneVsRestModel should use temporary Ra...

2018-01-05 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20164
  
**[Test build #85721 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85721/testReport)**
 for PR 20164 at commit 
[`6c567ff`](https://github.com/apache/spark/commit/6c567ffb02738346fc83e467752add0d00a42e07).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #20164: [SPARK-22971][ML] OneVsRestModel should use temporary Ra...

2018-01-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20164
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85721/
Test PASSed.


---




[GitHub] spark issue #20164: [SPARK-22971][ML] OneVsRestModel should use temporary Ra...

2018-01-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20164
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #20013: [SPARK-20657][core] Speed up rendering of the stages pag...

2018-01-05 Thread gengliangwang
Github user gengliangwang commented on the issue:

https://github.com/apache/spark/pull/20013
  
@vanzin I haven't run the code. I wonder which change doubles the disk 
usage: the new indices or the cached quantiles?


---




[GitHub] spark issue #20163: [SPARK-22966][PySpark] Spark SQL should handle Python UD...

2018-01-05 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/20163
  
The problem here seems to be that `returnType` is mismatched with the actual value. In 
the case of `DateType`, the value needs an explicit conversion into an integer:


https://github.com/apache/spark/blob/1c9f95cb771ac78775a77edd1abfeb2d8ae2a124/python/pyspark/sql/types.py#L170-L171


https://github.com/apache/spark/blob/1c9f95cb771ac78775a77edd1abfeb2d8ae2a124/python/pyspark/sql/types.py#L173-L175

which will be called in `worker.py`:


https://github.com/apache/spark/blob/64817c423c0d82a805abd69a3e166e5bfd79c739/python/pyspark/worker.py#L70-L74

If the `returnType` is `StringType`, it doesn't need the conversion 
because, to my knowledge, Pyrolite and its serialization handle strings fine:


https://github.com/apache/spark/blob/1c9f95cb771ac78775a77edd1abfeb2d8ae2a124/python/pyspark/sql/types.py#L141-L145


https://github.com/apache/spark/blob/1c9f95cb771ac78775a77edd1abfeb2d8ae2a124/python/pyspark/sql/types.py#L76-L82


So, here:



https://github.com/apache/spark/blob/64817c423c0d82a805abd69a3e166e5bfd79c739/python/pyspark/worker.py#L70-L74


we will send the return values as they are, without conversion, which ends up 
with the `datetime.date` -> `java.util.Calendar` mapping you described in the PR 
description. Therefore, I don't think the current fix in `EvaluatePython.scala` 
is reachable in the reproducer above.

As for the fix on the Python side in `udf.py`, it is a band-aid. To deal 
with this problem correctly, I believe we should do something like:

```diff
diff --git a/python/pyspark/sql/types.py b/python/pyspark/sql/types.py
index 146e673ae97..37137e02c08 100644
--- a/python/pyspark/sql/types.py
+++ b/python/pyspark/sql/types.py
@@ -144,6 +144,17 @@ class StringType(AtomicType):
 
     __metaclass__ = DataTypeSingleton
 
+    def needConversion(self):
+        return True
+
+    def toInternal(self, v):
+        if v is not None:
+            return str(v)
+
+    def fromInternal(self, v):
+        if v is not None:
+            return str(v)
+
```

but then this would introduce a performance regression because `str` would have 
to be called on every value. For examples of how such an extra function call can 
regress performance, see both https://github.com/apache/spark/pull/19246 
and https://github.com/apache/spark/pull/19249.

I am not so sure this is something we should allow. Can we simply document 
that `returnType` should be compatible with the actual return value?



---




[GitHub] spark issue #20096: [SPARK-22908] Add kafka source and sink for continuous p...

2018-01-05 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20096
  
**[Test build #85720 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85720/testReport)**
 for PR 20096 at commit 
[`be6c378`](https://github.com/apache/spark/commit/be6c378969920aeda6506d1c2cfb91a33dfe7027).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #20096: [SPARK-22908] Add kafka source and sink for continuous p...

2018-01-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20096
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85720/
Test FAILed.


---




[GitHub] spark issue #20096: [SPARK-22908] Add kafka source and sink for continuous p...

2018-01-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20096
  
Merged build finished. Test FAILed.


---




[GitHub] spark pull request #20162: [SPARK-22965] [PySpark] [SQL] Add deterministic p...

2018-01-05 Thread gatorsmile
Github user gatorsmile closed the pull request at:

https://github.com/apache/spark/pull/20162


---




[GitHub] spark pull request #20165: [SPARK-22972] Couldn't find corresponding Hive Se...

2018-01-05 Thread xubo245
GitHub user xubo245 opened a pull request:

https://github.com/apache/spark/pull/20165

[SPARK-22972] Couldn't find corresponding Hive SerDe for data source 
provider org.apache.spark.sql.hive.orc


## What changes were proposed in this pull request?
Fix the warning: Couldn't find corresponding Hive SerDe for data source 
provider org.apache.spark.sql.hive.orc.

## How was this patch tested?
 test("SPARK-22972: hive orc source") 
assert(HiveSerDe.sourceToSerDe("org.apache.spark.sql.hive.orc")
  .equals(HiveSerDe.sourceToSerDe("orc")))



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/xubo245/spark HiveSerDe

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/20165.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #20165


commit fa902d6d3fb635236ac01ee5b43470359f16cfdd
Author: xubo245 <601450868@...>
Date:   2018-01-05T13:20:53Z

[SPARK-22972] Couldn't find corresponding Hive SerDe for data source 
provider org.apache.spark.sql.hive.orc.




---




[GitHub] spark issue #20155: [SPARK-22961][REGRESSION] Constant columns should genera...

2018-01-05 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/20155
  
Thanks! Merged to master/2.3


---




[GitHub] spark issue #20165: [SPARK-22972] Couldn't find corresponding Hive SerDe for...

2018-01-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20165
  
Can one of the admins verify this patch?


---




[GitHub] spark pull request #20155: [SPARK-22961][REGRESSION] Constant columns should...

2018-01-05 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/20155


---




[GitHub] spark issue #20165: [SPARK-22972] Couldn't find corresponding Hive SerDe for...

2018-01-05 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/20165
  
ok to test


---




[GitHub] spark issue #20165: [SPARK-22972] Couldn't find corresponding Hive SerDe for...

2018-01-05 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20165
  
**[Test build #85723 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85723/testReport)**
 for PR 20165 at commit 
[`fa902d6`](https://github.com/apache/spark/commit/fa902d6d3fb635236ac01ee5b43470359f16cfdd).


---




[GitHub] spark pull request #20165: [SPARK-22972] Couldn't find corresponding Hive Se...

2018-01-05 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/20165#discussion_r159877971
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/internal/HiveSerDe.scala ---
@@ -72,7 +72,7 @@ object HiveSerDe {
   def sourceToSerDe(source: String): Option[HiveSerDe] = {
 val key = source.toLowerCase(Locale.ROOT) match {
   case s if s.startsWith("org.apache.spark.sql.parquet") => "parquet"
-  case s if s.startsWith("org.apache.spark.sql.orc") => "orc"
--- End diff --

We also need to keep the original one. We do not want to introduce a breaking 
behavior change.
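
In other words, a sketch of the resulting match (assuming the new case is added alongside the existing one; the wrapper method below is only for illustration):

```
// Illustration only, not the final patch.
def providerToSerDeKey(source: String): String =
  source.toLowerCase(java.util.Locale.ROOT) match {
    case s if s.startsWith("org.apache.spark.sql.hive.orc") => "orc"  // new mapping
    case s if s.startsWith("org.apache.spark.sql.orc") => "orc"       // keep the original
    case s => s
  }
```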


---




[GitHub] spark pull request #20165: [SPARK-22972] Couldn't find corresponding Hive Se...

2018-01-05 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/20165#discussion_r159878203
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/HiveOrcSourceSuite.scala 
---
@@ -62,6 +63,22 @@ class HiveOrcSourceSuite extends OrcSuite with 
TestHiveSingleton {
""".stripMargin)
   }
 
+  test("SPARK-22972: hive orc source") {
+spark.sql(
+  s"""CREATE TABLE normal_orc_as_source_hive
+ |USING org.apache.spark.sql.hive.orc
+ |OPTIONS (
+ |  PATH '${new File(orcTableAsDir.getAbsolutePath).toURI}'
+ |)
+   """.
+stripMargin)
--- End diff --

Combine lines 73 and 74.


---




[GitHub] spark pull request #20165: [SPARK-22972] Couldn't find corresponding Hive Se...

2018-01-05 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/20165#discussion_r159878257
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/HiveOrcSourceSuite.scala 
---
@@ -62,6 +63,22 @@ class HiveOrcSourceSuite extends OrcSuite with 
TestHiveSingleton {
""".stripMargin)
   }
 
+  test("SPARK-22972: hive orc source") {
+spark.sql(
+  s"""CREATE TABLE normal_orc_as_source_hive
+ |USING org.apache.spark.sql.hive.orc
+ |OPTIONS (
+ |  PATH '${new File(orcTableAsDir.getAbsolutePath).toURI}'
+ |)
+   """.
+stripMargin)
+spark.sql(
+  "desc formatted normal_orc_as_source_hive").show()
--- End diff --

?


---




[GitHub] spark pull request #20165: [SPARK-22972] Couldn't find corresponding Hive Se...

2018-01-05 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/20165#discussion_r159878556
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/HiveOrcSourceSuite.scala 
---
@@ -62,6 +63,22 @@ class HiveOrcSourceSuite extends OrcSuite with 
TestHiveSingleton {
""".stripMargin)
   }
 
+  test("SPARK-22972: hive orc source") {
+spark.sql(
+  s"""CREATE TABLE normal_orc_as_source_hive
+ |USING org.apache.spark.sql.hive.orc
+ |OPTIONS (
+ |  PATH '${new File(orcTableAsDir.getAbsolutePath).toURI}'
+ |)
+   """.
+stripMargin)
+spark.sql(
+  "desc formatted normal_orc_as_source_hive").show()
+checkAnswer(sql("SELECT COUNT(*) FROM normal_orc_as_source_hive"), 
Row(10))
+assert(HiveSerDe.sourceToSerDe("org.apache.spark.sql.hive.orc")
+  .equals(HiveSerDe.sourceToSerDe("orc")))
--- End diff --

Also add the checks:
```
assert(HiveSerDe.sourceToSerDe("org.apache.spark.sql.orc")
  .equals(HiveSerDe.sourceToSerDe("orc")))
```


---




[GitHub] spark pull request #20165: [SPARK-22972] Couldn't find corresponding Hive Se...

2018-01-05 Thread xubo245
Github user xubo245 commented on a diff in the pull request:

https://github.com/apache/spark/pull/20165#discussion_r159878683
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/internal/HiveSerDe.scala ---
@@ -72,7 +72,7 @@ object HiveSerDe {
   def sourceToSerDe(source: String): Option[HiveSerDe] = {
 val key = source.toLowerCase(Locale.ROOT) match {
   case s if s.startsWith("org.apache.spark.sql.parquet") => "parquet"
-  case s if s.startsWith("org.apache.spark.sql.orc") => "orc"
--- End diff --

ok


---




[GitHub] spark pull request #20165: [SPARK-22972] Couldn't find corresponding Hive Se...

2018-01-05 Thread xubo245
Github user xubo245 commented on a diff in the pull request:

https://github.com/apache/spark/pull/20165#discussion_r159878689
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/HiveOrcSourceSuite.scala 
---
@@ -62,6 +63,22 @@ class HiveOrcSourceSuite extends OrcSuite with 
TestHiveSingleton {
""".stripMargin)
   }
 
+  test("SPARK-22972: hive orc source") {
+spark.sql(
+  s"""CREATE TABLE normal_orc_as_source_hive
+ |USING org.apache.spark.sql.hive.orc
+ |OPTIONS (
+ |  PATH '${new File(orcTableAsDir.getAbsolutePath).toURI}'
+ |)
+   """.
+stripMargin)
--- End diff --

ok


---




[GitHub] spark pull request #20165: [SPARK-22972] Couldn't find corresponding Hive Se...

2018-01-05 Thread xubo245
Github user xubo245 commented on a diff in the pull request:

https://github.com/apache/spark/pull/20165#discussion_r159879115
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/HiveOrcSourceSuite.scala 
---
@@ -62,6 +63,22 @@ class HiveOrcSourceSuite extends OrcSuite with 
TestHiveSingleton {
""".stripMargin)
   }
 
+  test("SPARK-22972: hive orc source") {
+spark.sql(
+  s"""CREATE TABLE normal_orc_as_source_hive
+ |USING org.apache.spark.sql.hive.orc
+ |OPTIONS (
+ |  PATH '${new File(orcTableAsDir.getAbsolutePath).toURI}'
+ |)
+   """.
+stripMargin)
+spark.sql(
+  "desc formatted normal_orc_as_source_hive").show()
--- End diff --

Changed it to `spark.sql("desc formatted normal_orc_as_source_hive").show()`; 
is that OK?

Also, how can I capture the warning and verify it in code?


---




[GitHub] spark pull request #20165: [SPARK-22972] Couldn't find corresponding Hive Se...

2018-01-05 Thread xubo245
Github user xubo245 commented on a diff in the pull request:

https://github.com/apache/spark/pull/20165#discussion_r159879314
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/HiveOrcSourceSuite.scala 
---
@@ -62,6 +63,22 @@ class HiveOrcSourceSuite extends OrcSuite with 
TestHiveSingleton {
""".stripMargin)
   }
 
+  test("SPARK-22972: hive orc source") {
+spark.sql(
+  s"""CREATE TABLE normal_orc_as_source_hive
+ |USING org.apache.spark.sql.hive.orc
+ |OPTIONS (
+ |  PATH '${new File(orcTableAsDir.getAbsolutePath).toURI}'
+ |)
+   """.
+stripMargin)
+spark.sql(
+  "desc formatted normal_orc_as_source_hive").show()
+checkAnswer(sql("SELECT COUNT(*) FROM normal_orc_as_source_hive"), 
Row(10))
+assert(HiveSerDe.sourceToSerDe("org.apache.spark.sql.hive.orc")
+  .equals(HiveSerDe.sourceToSerDe("orc")))
--- End diff --

ok


---




[GitHub] spark issue #20165: [SPARK-22972] Couldn't find corresponding Hive SerDe for...

2018-01-05 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20165
  
**[Test build #85724 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85724/testReport)**
 for PR 20165 at commit 
[`cf7cbce`](https://github.com/apache/spark/commit/cf7cbce6061894eacbfd334f75476268068446c9).


---




[GitHub] spark issue #20165: [SPARK-22972] Couldn't find corresponding Hive SerDe for...

2018-01-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20165
  
Merged build finished. Test FAILed.


---




[GitHub] spark issue #20165: [SPARK-22972] Couldn't find corresponding Hive SerDe for...

2018-01-05 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20165
  
**[Test build #85724 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85724/testReport)**
 for PR 20165 at commit 
[`cf7cbce`](https://github.com/apache/spark/commit/cf7cbce6061894eacbfd334f75476268068446c9).
 * This patch **fails Scala style tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #20165: [SPARK-22972] Couldn't find corresponding Hive SerDe for...

2018-01-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20165
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85724/
Test FAILed.


---




[GitHub] spark issue #20165: [SPARK-22972] Couldn't find corresponding Hive SerDe for...

2018-01-05 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20165
  
**[Test build #85725 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85725/testReport)**
 for PR 20165 at commit 
[`b64ce53`](https://github.com/apache/spark/commit/b64ce532d36442cde636db54d8ecbc08d6030825).


---




[GitHub] spark issue #20080: [SPARK-22870][CORE] Dynamic allocation should allow 0 id...

2018-01-05 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/20080
  
Good point @felixcheung, though the polling is hard-coded to 100ms, so it will 
be close enough to "immediately". Allowing a value of 0 on both timeouts seems 
OK, but it's not clear whether it actually has the desired effect.
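
For context, a sketch of the configuration under discussion, with both idle timeouts set to 0 (the config key names are taken from the dynamic allocation documentation and should be treated as assumptions here):

```
import org.apache.spark.SparkConf

// Sketch: with 0s timeouts, idle executors would be reclaimed on the next poll (~100ms).
val conf = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.shuffle.service.enabled", "true")
  .set("spark.dynamicAllocation.executorIdleTimeout", "0s")
  .set("spark.dynamicAllocation.cachedExecutorIdleTimeout", "0s")
```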


---




[GitHub] spark pull request #20076: [SPARK-21786][SQL] When acquiring 'compressionCod...

2018-01-05 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/20076#discussion_r159885686
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetCompressionCodecPrecedenceSuite.scala
 ---
@@ -0,0 +1,117 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources.parquet
+
+import java.io.File
+
+import scala.collection.JavaConverters._
+
+import org.apache.hadoop.fs.Path
+import org.apache.parquet.hadoop.ParquetOutputFormat
+
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.test.SharedSQLContext
+
+class ParquetCompressionCodecPrecedenceSuite extends ParquetTest with 
SharedSQLContext {
+  test("Test `spark.sql.parquet.compression.codec` config") {
+Seq("NONE", "UNCOMPRESSED", "SNAPPY", "GZIP", "LZO").foreach { c =>
+  withSQLConf(SQLConf.PARQUET_COMPRESSION.key -> c) {
+val expected = if (c == "NONE") "UNCOMPRESSED" else c
+val option = new ParquetOptions(Map.empty[String, String], 
spark.sessionState.conf)
+assert(option.compressionCodecClassName == expected)
+  }
+}
+  }
+
+  test("[SPARK-21786] Test Acquiring 'compressionCodecClassName' for 
parquet in right order.") {
+// When "compression" is configured, it should be the first choice.
+withSQLConf(SQLConf.PARQUET_COMPRESSION.key -> "snappy") {
+  val props = Map("compression" -> "uncompressed", 
ParquetOutputFormat.COMPRESSION -> "gzip")
+  val option = new ParquetOptions(props, spark.sessionState.conf)
+  assert(option.compressionCodecClassName == "UNCOMPRESSED")
+}
+
+// When "compression" is not configured, "parquet.compression" should 
be the preferred choice.
+withSQLConf(SQLConf.PARQUET_COMPRESSION.key -> "snappy") {
+  val props = Map(ParquetOutputFormat.COMPRESSION -> "gzip")
+  val option = new ParquetOptions(props, spark.sessionState.conf)
+  assert(option.compressionCodecClassName == "GZIP")
+}
+
+// When both "compression" and "parquet.compression" are not 
configured,
+// spark.sql.parquet.compression.codec should be the right choice.
+withSQLConf(SQLConf.PARQUET_COMPRESSION.key -> "snappy") {
+  val props = Map.empty[String, String]
+  val option = new ParquetOptions(props, spark.sessionState.conf)
+  assert(option.compressionCodecClassName == "SNAPPY")
+}
+  }
+
+  private def getTableCompressionCodec(path: String): Seq[String] = {
+val hadoopConf = spark.sessionState.newHadoopConf()
+val codecs = for {
+  footer <- readAllFootersWithoutSummaryFiles(new Path(path), 
hadoopConf)
+  block <- footer.getParquetMetadata.getBlocks.asScala
+  column <- block.getColumns.asScala
+} yield column.getCodec.name()
+codecs.distinct
+  }
+
+  private def createTableWithCompression(
+  tableName: String,
+  isPartitioned: Boolean,
+  compressionCodec: String,
+  rootDir: File): Unit = {
+val options =
+  
s"""OPTIONS('path'='${rootDir.toURI.toString.stripSuffix("/")}/$tableName',
+ |'parquet.compression'='$compressionCodec')""".stripMargin
+val partitionCreate = if (isPartitioned) "PARTITIONED BY (p)" else ""
+sql(s"""CREATE TABLE $tableName USING Parquet $options $partitionCreate
+|as select 1 as col1, 2 as p""".stripMargin)
--- End diff --

```
val options =
  s"""
     |OPTIONS('path'='${rootDir.toURI.toString.stripSuffix("/")}/$tableName',
     |'parquet.compression'='$compressionCodec')
   """.stripMargin
val partitionCreate = if (isPartitioned) "PARTITIONED BY (p)" else ""
sql(
  s"""
     |CREATE TABLE $tableName USING Parquet $options $partitionCreate
     |AS SELECT 1 AS col1, 2 AS p
   """.stripMargin)
```


---

[GitHub] spark pull request #20076: [SPARK-21786][SQL] When acquiring 'compressionCod...

2018-01-05 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/20076#discussion_r159889119
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -323,11 +323,13 @@ object SQLConf {
 .createWithDefault(false)
 
   val PARQUET_COMPRESSION = 
buildConf("spark.sql.parquet.compression.codec")
-.doc("Sets the compression codec use when writing Parquet files. 
Acceptable values include: " +
-  "uncompressed, snappy, gzip, lzo.")
+.doc("Sets the compression codec used when writing Parquet files. If 
other compression codec " +
+  "configuration was found through hive or parquet, the precedence 
would be `compression`, " +
--- End diff --

> Sets the compression codec used when writing Parquet files. If either 
`compression` or `parquet.compression` is specified in the table-specific 
options/properties, the precedence would be `compression`, ...


---




[GitHub] spark pull request #19247: [Spark-21996][SQL] read files with space in name ...

2018-01-05 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/19247#discussion_r159893158
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSource.scala
 ---
@@ -233,7 +233,7 @@ class FileStreamSource(
 }
 
 val files = allFiles.sortBy(_.getModificationTime)(fileSortOrder).map 
{ status =>
-  (status.getPath.toUri.toString, status.getModificationTime)
+  (status.getPath.toUri.getPath, status.getModificationTime)
--- End diff --

ping @xysun ^


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19893: [SPARK-16139][TEST] Add logging functionality for leaked...

2018-01-05 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19893
  
**[Test build #85722 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85722/testReport)**
 for PR 19893 at commit 
[`0851ef2`](https://github.com/apache/spark/commit/0851ef2b572ed507155358e541a4fa53a8e92173).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19893: [SPARK-16139][TEST] Add logging functionality for leaked...

2018-01-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19893
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85722/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19893: [SPARK-16139][TEST] Add logging functionality for leaked...

2018-01-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19893
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20165: [SPARK-22972] Couldn't find corresponding Hive Se...

2018-01-05 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/20165#discussion_r159900236
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/HiveOrcSourceSuite.scala 
---
@@ -62,6 +63,22 @@ class HiveOrcSourceSuite extends OrcSuite with 
TestHiveSingleton {
""".stripMargin)
   }
 
+  test("SPARK-22972: hive orc source") {
+spark.sql(
+  s"""CREATE TABLE normal_orc_as_source_hive
+ |USING org.apache.spark.sql.hive.orc
+ |OPTIONS (
+ |  PATH '${new File(orcTableAsDir.getAbsolutePath).toURI}'
+ |)
+   """.stripMargin)
+spark.sql("desc formatted normal_orc_as_source_hive").show()
--- End diff --

Replace it by
```
val tableMetadata = spark.sessionState.catalog.getTableMetadata(
  TableIdentifier("normal_orc_as_source_hive"))
assert(tableMetadata.storage.inputFormat ==
  Option("org.apache.hadoop.hive.ql.io.orc.OrcInputFormat"))
assert(tableMetadata.storage.outputFormat ==
  Option("org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat"))
assert(tableMetadata.storage.serde ==
  Option("org.apache.hadoop.hive.ql.io.orc.OrcSerde"))
```


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20165: [SPARK-22972] Couldn't find corresponding Hive Se...

2018-01-05 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/20165#discussion_r159900297
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/HiveOrcSourceSuite.scala 
---
@@ -62,6 +63,22 @@ class HiveOrcSourceSuite extends OrcSuite with 
TestHiveSingleton {
""".stripMargin)
   }
 
+  test("SPARK-22972: hive orc source") {
+spark.sql(
--- End diff --

`withTable("normal_orc_as_source_hive")`
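
A minimal sketch of what that could look like, assuming the `withTable` helper from 
`SQLTestUtils` (the SQL is copied from the diff above; the assertions are only placeholders):

```
withTable("normal_orc_as_source_hive") {
  spark.sql(
    s"""
       |CREATE TABLE normal_orc_as_source_hive
       |USING org.apache.spark.sql.hive.orc
       |OPTIONS (
       |  PATH '${new File(orcTableAsDir.getAbsolutePath).toURI}'
       |)
     """.stripMargin)
  // metadata assertions go here; the table is dropped automatically afterwards
}
```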


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20165: [SPARK-22972] Couldn't find corresponding Hive Se...

2018-01-05 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/20165#discussion_r159900404
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/HiveOrcSourceSuite.scala 
---
@@ -62,6 +63,22 @@ class HiveOrcSourceSuite extends OrcSuite with 
TestHiveSingleton {
""".stripMargin)
   }
 
+  test("SPARK-22972: hive orc source") {
+spark.sql(
+  s"""CREATE TABLE normal_orc_as_source_hive
--- End diff --

Put it on the next line. You can refer to the other test cases of `CREATE TABLE`.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20165: [SPARK-22972] Couldn't find corresponding Hive SerDe for...

2018-01-05 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20165
  
**[Test build #85723 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85723/testReport)**
 for PR 20165 at commit 
[`fa902d6`](https://github.com/apache/spark/commit/fa902d6d3fb635236ac01ee5b43470359f16cfdd).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20165: [SPARK-22972] Couldn't find corresponding Hive SerDe for...

2018-01-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20165
  
Merged build finished. Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20165: [SPARK-22972] Couldn't find corresponding Hive SerDe for...

2018-01-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20165
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85723/
Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20163: [SPARK-22966][PySpark] Spark SQL should handle Python UD...

2018-01-05 Thread icexelloss
Github user icexelloss commented on the issue:

https://github.com/apache/spark/pull/20163
  
I ran some experiments:
```
py_date = udf(datetime.date, DateType())
py_timestamp = udf(datetime.datetime, TimestampType())
```
This works correctly
```
spark.range(1).select(py_date(lit(2017), lit(10), lit(30))).show()
spark.range(1).select(py_timestamp(lit(2017), lit(10), lit(30))).show()
```
Result:
```
+------------------+
|date(2017, 10, 30)|
+------------------+
|        2017-10-30|
+------------------+

+----------------------+
|datetime(2017, 10, 30)|
+----------------------+
|   2017-10-30 00:00:00|
+----------------------+
```

The change that the PR proposes seems to be coercing Python `datetime.datetime` 
and `datetime.date` to the Python datetime string representation rather than the 
Java one. We could call `str` on the return value of the Python UDF if it's a 
String type to get the Python string representation, but this probably needs a 
microbenchmark to see the performance implications.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20166: [SPARK-22973][SQL] Fix incorrect results of Casti...

2018-01-05 Thread maropu
GitHub user maropu opened a pull request:

https://github.com/apache/spark/pull/20166

[SPARK-22973][SQL] Fix incorrect results of Casting Map to String

## What changes were proposed in this pull request?
This PR fixes an issue when casting maps to strings:
```
scala> Seq(Map(1 -> "a", 2 -> "b")).toDF("a").write.saveAsTable("t")
scala> sql("SELECT cast(a as String) FROM t").show(false)
+----------------------------------------------------------------+
|a                                                               |
+----------------------------------------------------------------+
|org.apache.spark.sql.catalyst.expressions.UnsafeMapData@38bdd75d|
+----------------------------------------------------------------+
```
This PR changes the result to:
```
+----------------+
|a               |
+----------------+
|[1 -> a, 2 -> b]|
+----------------+
```

## How was this patch tested?
Added tests in `CastSuite`.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/maropu/spark SPARK-22973

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/20166.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #20166


commit efc28e3aa2a0add8e325d60162090c9ee3fdfcba
Author: Takeshi Yamamuro 
Date:   2018-01-05T14:33:14Z

Cast maps to strings




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20166: [SPARK-22973][SQL] Fix incorrect results of Casting Map ...

2018-01-05 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20166
  
**[Test build #85726 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85726/testReport)**
 for PR 20166 at commit 
[`efc28e3`](https://github.com/apache/spark/commit/efc28e3aa2a0add8e325d60162090c9ee3fdfcba).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20166: [SPARK-22973][SQL] Fix incorrect results of Casting Map ...

2018-01-05 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20166
  
**[Test build #85727 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85727/testReport)**
 for PR 20166 at commit 
[`be04e64`](https://github.com/apache/spark/commit/be04e64733d6051864d6597420d2c982c72606e6).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20165: [SPARK-22972] Couldn't find corresponding Hive SerDe for...

2018-01-05 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20165
  
**[Test build #85725 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85725/testReport)**
 for PR 20165 at commit 
[`b64ce53`](https://github.com/apache/spark/commit/b64ce532d36442cde636db54d8ecbc08d6030825).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20165: [SPARK-22972] Couldn't find corresponding Hive SerDe for...

2018-01-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20165
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85725/
Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20165: [SPARK-22972] Couldn't find corresponding Hive SerDe for...

2018-01-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20165
  
Merged build finished. Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20167: Allow providing Mesos principal & secret via file...

2018-01-05 Thread rvesse
GitHub user rvesse opened a pull request:

https://github.com/apache/spark/pull/20167

Allow providing Mesos principal & secret via files (SPARK-16501)

## What changes were proposed in this pull request?

This commit modifies the Mesos submission client to allow the principal
and secret to be provided indirectly via files.  The path to these files
can be specified either via Spark configuration or via environment
variable.

Assuming these files are appropriately protected by FS/OS permissions
this means we don't ever leak the actual values in process info like ps

Environment variable specification is useful because it allows you to
interpolate the location of this file when using per-user Mesos
credentials.

For some background as to why we have taken this approach, I will briefly 
describe our setup. On our systems we provide each authorised user account 
with their own Mesos credentials to provide certain security and audit 
guarantees to our customers. These credentials are managed by a central secret 
management service. In our `spark-env.sh` we determine the appropriate secret 
and principal files to use depending on the user who is invoking Spark, hence 
the need to inject these via environment variables as well as by configuration 
properties. We set these environment variables appropriately, and our Spark 
build reads in the contents of those files to authenticate itself with Mesos.
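
As a rough illustration of the indirection being described (the property names and 
paths below are assumptions for illustration only, not necessarily the keys 
introduced by this patch):

```
import org.apache.spark.SparkConf

// Hypothetical keys and paths, shown only to illustrate the file-based indirection;
// the real configuration names are defined by the patch itself.
val conf = new SparkConf()
  .set("spark.mesos.principal.file", "/etc/secrets/mesos-principal")
  .set("spark.mesos.secret.file", "/etc/secrets/mesos-secret")
// The same paths could instead be resolved per user in spark-env.sh and exported
// as environment variables, as described above.
```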

## How was this patch tested?

This is functionality we have been using in production across multiple 
customer sites for some time. It has been in the field for around 18 months 
with no reported issues. These changes have been sufficient to meet our 
customer security and audit requirements.

We have been building and deploying custom builds of Apache Spark with 
various minor tweaks like this, which we are now looking to contribute back to 
the community so that we can rely on stock Apache Spark builds and stop 
maintaining our own internal fork.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/rvesse/spark SPARK-16501

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/20167.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #20167


commit d1bd57d29a79bfcfde33e37a6935e761456f438c
Author: Rob Vesse 
Date:   2018-01-05T17:19:44Z

Allow providing Mesos principal & secret via files (SPARK-16501)

This commit modifies the Mesos submission client to allow the principal
and secret to be provided indirectly via files.  The path to these files
can be specified either via Spark configuration or via environment
variable.

Assuming these files are appropriately protected by FS/OS permissions
this means we don't ever leak the actual values in process info like ps

Environment variable specification is useful because it allows you to
interpolate the location of this file when using per-user Mesos
credentials.




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20167: Allow providing Mesos principal & secret via files (SPAR...

2018-01-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20167
  
Can one of the admins verify this patch?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20154: [SPARK-22960][k8s] Make build-push-docker-images.sh more...

2018-01-05 Thread vanzin
Github user vanzin commented on the issue:

https://github.com/apache/spark/pull/20154
  
ARGs can have default values, so we could do that if we decide to use the 
Docker Hub infra.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20155: [SPARK-22961][REGRESSION] Constant columns should genera...

2018-01-05 Thread gengliangwang
Github user gengliangwang commented on the issue:

https://github.com/apache/spark/pull/20155
  
A late LGTM!


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20147: [SPARK-22940][SQL] HiveExternalCatalogVersionsSui...

2018-01-05 Thread vanzin
Github user vanzin commented on a diff in the pull request:

https://github.com/apache/spark/pull/20147#discussion_r159939270
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveExternalCatalogVersionsSuite.scala
 ---
@@ -85,6 +93,34 @@ class HiveExternalCatalogVersionsSuite extends 
SparkSubmitTestUtils {
 new File(tmpDataDir, name).getCanonicalPath
   }
 
+  private def getFileFromUrl(urlString: String, targetDir: String, 
filename: String): Unit = {
+val conf = new SparkConf
+// if the caller passes the name of an existing file, we want 
doFetchFile to write over it with
+// the contents from the specified url.
+conf.set("spark.files.overwrite", "true")
+val securityManager = new SecurityManager(conf)
+val hadoopConf = new Configuration
+
+val outDir = new File(targetDir)
+if (!outDir.exists()) {
+  outDir.mkdirs()
+}
+
+// propagate exceptions up to the caller of getFileFromUrl
--- End diff --

We generally don't add these kinds of comments, since it's implied for every 
statement outside of a try...catch.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20147: [SPARK-22940][SQL] HiveExternalCatalogVersionsSui...

2018-01-05 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/20147


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20013: [SPARK-20657][core] Speed up rendering of the stages pag...

2018-01-05 Thread vanzin
Github user vanzin commented on the issue:

https://github.com/apache/spark/pull/20013
  
> I wonder which changes double the disk usage? 

It's the new indices, more specifically the values, not the keys. I tried 
changing the disk layout to write all the indices in a new namespace with a 
very short key length, and that didn't change the resulting store size at all.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20096: [SPARK-22908] Add kafka source and sink for continuous p...

2018-01-05 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20096
  
**[Test build #85728 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85728/testReport)**
 for PR 20096 at commit 
[`1e5c7a9`](https://github.com/apache/spark/commit/1e5c7a99c5a75539500c572770c03ee4c7e6d4a0).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20013: [SPARK-20657][core] Speed up rendering of the sta...

2018-01-05 Thread vanzin
Github user vanzin commented on a diff in the pull request:

https://github.com/apache/spark/pull/20013#discussion_r159943385
  
--- Diff: core/src/main/scala/org/apache/spark/status/LiveEntity.scala ---
@@ -119,118 +121,115 @@ private class LiveTask(
 
   import LiveEntityHelpers._
 
-  private var recordedMetrics: v1.TaskMetrics = null
+  private var metrics: MetricsTracker = new MetricsTracker()
 
   var errorMessage: Option[String] = None
 
   /**
* Update the metrics for the task and return the difference between the 
previous and new
* values.
*/
-  def updateMetrics(metrics: TaskMetrics): v1.TaskMetrics = {
+  def updateMetrics(metrics: TaskMetrics): MetricsTracker = {
 if (metrics != null) {
-  val old = recordedMetrics
-  recordedMetrics = new v1.TaskMetrics(
-metrics.executorDeserializeTime,
-metrics.executorDeserializeCpuTime,
-metrics.executorRunTime,
-metrics.executorCpuTime,
-metrics.resultSize,
-metrics.jvmGCTime,
-metrics.resultSerializationTime,
-metrics.memoryBytesSpilled,
-metrics.diskBytesSpilled,
-metrics.peakExecutionMemory,
-new v1.InputMetrics(
-  metrics.inputMetrics.bytesRead,
-  metrics.inputMetrics.recordsRead),
-new v1.OutputMetrics(
-  metrics.outputMetrics.bytesWritten,
-  metrics.outputMetrics.recordsWritten),
-new v1.ShuffleReadMetrics(
-  metrics.shuffleReadMetrics.remoteBlocksFetched,
-  metrics.shuffleReadMetrics.localBlocksFetched,
-  metrics.shuffleReadMetrics.fetchWaitTime,
-  metrics.shuffleReadMetrics.remoteBytesRead,
-  metrics.shuffleReadMetrics.remoteBytesReadToDisk,
-  metrics.shuffleReadMetrics.localBytesRead,
-  metrics.shuffleReadMetrics.recordsRead),
-new v1.ShuffleWriteMetrics(
-  metrics.shuffleWriteMetrics.bytesWritten,
-  metrics.shuffleWriteMetrics.writeTime,
-  metrics.shuffleWriteMetrics.recordsWritten))
-  if (old != null) calculateMetricsDelta(recordedMetrics, old) else 
recordedMetrics
+  val old = this.metrics
+  val newMetrics = new MetricsTracker()
--- End diff --

This would be the only place where it's used, so I don't see any gains.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20166: [SPARK-22973][SQL] Fix incorrect results of Casting Map ...

2018-01-05 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20166
  
**[Test build #85726 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85726/testReport)**
 for PR 20166 at commit 
[`efc28e3`](https://github.com/apache/spark/commit/efc28e3aa2a0add8e325d60162090c9ee3fdfcba).
 * This patch **fails PySpark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20166: [SPARK-22973][SQL] Fix incorrect results of Casting Map ...

2018-01-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20166
  
Merged build finished. Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20166: [SPARK-22973][SQL] Fix incorrect results of Casting Map ...

2018-01-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20166
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85726/
Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20168: SPARK-22730 Add ImageSchema support for non-integer imag...

2018-01-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20168
  
Can one of the admins verify this patch?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20168: SPARK-22730 Add ImageSchema support for non-integ...

2018-01-05 Thread tomasatdatabricks
GitHub user tomasatdatabricks opened a pull request:

https://github.com/apache/spark/pull/20168

SPARK-22730 Add ImageSchema support for non-integer image formats

## What changes were proposed in this pull request?
Added functionality to handle all OpenCV modes in ImageSchema:
  1. Updated the toImage and toNDArray functions to handle non-uint8-based images.
  2. Added information about individual OpenCV modes.

## How was this patch tested?
Added a test for conversion between numpy arrays and images stored in all 
possible OpenCV modes.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/tomasatdatabricks/spark 
tomas/ImageSchemaUpdate

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/20168.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #20168


commit fe87dd112709ca2b101d78c97c4826536ec7da7d
Author: tomasatdatabricks 
Date:   2017-12-29T22:56:28Z

Added functionality for handling non-uint8-based images for ImageSchema

commit 70bae2f7e9d85a5f464f1bfc3a9426136259d5d1
Author: tomasatdatabricks 
Date:   2018-01-03T19:14:06Z

Added test for conversion between array and image struct for all ocv types.




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20166: [SPARK-22973][SQL] Fix incorrect results of Casting Map ...

2018-01-05 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20166
  
**[Test build #85727 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85727/testReport)**
 for PR 20166 at commit 
[`be04e64`](https://github.com/apache/spark/commit/be04e64733d6051864d6597420d2c982c72606e6).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20166: [SPARK-22973][SQL] Fix incorrect results of Casting Map ...

2018-01-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20166
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20166: [SPARK-22973][SQL] Fix incorrect results of Casting Map ...

2018-01-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20166
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85727/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19893: [SPARK-16139][TEST] Add logging functionality for...

2018-01-05 Thread vanzin
Github user vanzin commented on a diff in the pull request:

https://github.com/apache/spark/pull/19893#discussion_r159955132
  
--- Diff: core/src/test/scala/org/apache/spark/ThreadAudit.scala ---
@@ -0,0 +1,126 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark
+
+import scala.collection.JavaConverters._
+
+import org.apache.spark.internal.Logging
+
+/**
+ * Thread audit for test suites.
+ *
+ * Thread audit happens normally in [[SparkFunSuite]] automatically when a 
new test suite created.
--- End diff --

I'd just remove this paragraph since this class is independent of 
`SparkFunSuite`.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19893: [SPARK-16139][TEST] Add logging functionality for...

2018-01-05 Thread vanzin
Github user vanzin commented on a diff in the pull request:

https://github.com/apache/spark/pull/19893#discussion_r159954589
  
--- Diff: core/src/test/scala/org/apache/spark/SparkFunSuite.scala ---
@@ -27,19 +27,55 @@ import org.apache.spark.util.AccumulatorContext
 
 /**
  * Base abstract class for all unit tests in Spark for handling common 
functionality.
+ *
+ * Thread audit happens normally here automatically when a new test suite 
created.
+ * The only prerequisite for that is that the test class must extend 
[[SparkFunSuite]].
+ *
+ * There are some test suites which are doing initialization before 
[[SparkFunSuite#beforeAll]]
+ * executed. This case auditing can be moved into another place in the 
call sequence.
+ *
+ * To do the audit in a custom place/way the following can be done:
+ *
+ * class MyTestSuite extends SparkFunSuite {
+ *
+ *   override val doThreadAuditInSparkFunSuite = false
--- End diff --

`enableAutoThreadAudit` now


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19893: [SPARK-16139][TEST] Add logging functionality for...

2018-01-05 Thread vanzin
Github user vanzin commented on a diff in the pull request:

https://github.com/apache/spark/pull/19893#discussion_r159954825
  
--- Diff: core/src/test/scala/org/apache/spark/SparkFunSuite.scala ---
@@ -27,19 +27,55 @@ import org.apache.spark.util.AccumulatorContext
 
 /**
  * Base abstract class for all unit tests in Spark for handling common 
functionality.
+ *
+ * Thread audit happens normally here automatically when a new test suite 
created.
+ * The only prerequisite for that is that the test class must extend 
[[SparkFunSuite]].
+ *
+ * There are some test suites which are doing initialization before 
[[SparkFunSuite#beforeAll]]
--- End diff --

Better:

"
It is possible to override the default thread audit behavior by setting 
`enableAutoThreadAudit` to false and manually calling the audit methods, if 
desired. For example:

// Code
"


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19893: [SPARK-16139][TEST] Add logging functionality for...

2018-01-05 Thread vanzin
Github user vanzin commented on a diff in the pull request:

https://github.com/apache/spark/pull/19893#discussion_r159955747
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/test/SharedSQLContext.scala ---
@@ -17,4 +17,22 @@
 
 package org.apache.spark.sql.test
 
-trait SharedSQLContext extends SQLTestUtils with SharedSparkSession
+trait SharedSQLContext extends SQLTestUtils with SharedSparkSession {
+
+  /**
+   * Auto thread audit is turned off here intentionally and done manually.
--- End diff --

I'm still not quite convinced that this is needed.

I still think that any reported leaks here are caused by bugs in the test 
suites and not because of this. The code you have here is basically the same 
thing as `SparkFunSuite`.

For example, if a suite extending this does not call `super.beforeAll()` 
but calls `super.afterAll()`, won't you get false positives in the output?
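
To make the concern concrete, an illustrative sketch only (not code from the PR):

```
// Illustrative only: if a suite skips super.beforeAll(), the pre-audit snapshot
// is never taken, so the audit performed in afterAll() has nothing valid to
// compare against and could flag threads that were already running beforehand.
class BadlyWiredSuite extends SharedSQLContext {
  protected override def beforeAll(): Unit = {
    // forgot super.beforeAll(), so the pre-audit never runs
  }

  protected override def afterAll(): Unit = {
    super.afterAll() // the post-audit still runs here
  }
}
```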


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19848: [SPARK-22162] Executors and the driver should use consis...

2018-01-05 Thread steveloughran
Github user steveloughran commented on the issue:

https://github.com/apache/spark/pull/19848
  
Done. 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20132: [SPARK-13030][ML] Follow-up cleanups for OneHotEncoderEs...

2018-01-05 Thread jkbradley
Github user jkbradley commented on the issue:

https://github.com/apache/spark/pull/20132
  
Thanks!  Merging with master and branch-2.3


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20132: [SPARK-13030][ML] Follow-up cleanups for OneHotEn...

2018-01-05 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/20132


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20096: [SPARK-22908] Add kafka source and sink for continuous p...

2018-01-05 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20096
  
**[Test build #85728 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85728/testReport)**
 for PR 20096 at commit 
[`1e5c7a9`](https://github.com/apache/spark/commit/1e5c7a99c5a75539500c572770c03ee4c7e6d4a0).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20096: [SPARK-22908] Add kafka source and sink for continuous p...

2018-01-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20096
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20096: [SPARK-22908] Add kafka source and sink for continuous p...

2018-01-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20096
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85728/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20097: [SPARK-22912] v2 data source support in MicroBatc...

2018-01-05 Thread tdas
Github user tdas commented on a diff in the pull request:

https://github.com/apache/spark/pull/20097#discussion_r159958467
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala
 ---
@@ -33,15 +40,21 @@ class MicroBatchExecution(
 name: String,
 checkpointRoot: String,
 analyzedPlan: LogicalPlan,
-sink: Sink,
+sink: BaseStreamingSink,
 trigger: Trigger,
 triggerClock: Clock,
 outputMode: OutputMode,
+extraOptions: Map[String, String],
 deleteCheckpointOnStop: Boolean)
   extends StreamExecution(
 sparkSession, name, checkpointRoot, analyzedPlan, sink,
 trigger, triggerClock, outputMode, deleteCheckpointOnStop) {
 
+  private def toJava(
--- End diff --

super nit: I usually prefer moving such small, less important methods to the 
end of the class.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20097: [SPARK-22912] v2 data source support in MicroBatc...

2018-01-05 Thread tdas
Github user tdas commented on a diff in the pull request:

https://github.com/apache/spark/pull/20097#discussion_r159960120
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala
 ---
@@ -236,14 +264,27 @@ class MicroBatchExecution(
 val hasNewData = {
   awaitProgressLock.lock()
   try {
-val latestOffsets: Map[Source, Option[Offset]] = uniqueSources.map 
{
+// Generate a map from each unique source to the next available 
offset.
+val latestOffsets: Map[BaseStreamingSource, Option[Offset]] = 
uniqueSources.map {
   case s: Source =>
 updateStatusMessage(s"Getting offsets from $s")
 reportTimeTaken("getOffset") {
   (s, s.getOffset)
 }
+  case s: MicroBatchReader =>
+updateStatusMessage(s"Getting offsets from $s")
+reportTimeTaken("getOffset") {
+  // Once v1 streaming source execution is gone, we can 
refactor this away.
+  // For now, we set the range here to get the source to infer 
the available end offset,
+  // get that offset, and then set the range again when we 
later execute.
+s.setOffsetRange(
--- End diff --

incorrect indentation.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20097: [SPARK-22912] v2 data source support in MicroBatc...

2018-01-05 Thread tdas
Github user tdas commented on a diff in the pull request:

https://github.com/apache/spark/pull/20097#discussion_r159980863
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Relation.scala
 ---
@@ -35,6 +35,16 @@ case class DataSourceV2Relation(
   }
 }
 
+/**
+ * A specialization of DataSourceV2Relation with the streaming bit set to 
true. Otherwwise identical
--- End diff --

nit: `Otherwwise`?? :)


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20097: [SPARK-22912] v2 data source support in MicroBatc...

2018-01-05 Thread tdas
Github user tdas commented on a diff in the pull request:

https://github.com/apache/spark/pull/20097#discussion_r159981274
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/sources/RateStreamSourceV2.scala
 ---
@@ -28,17 +28,38 @@ import org.json4s.jackson.Serialization
 import org.apache.spark.sql.Row
 import org.apache.spark.sql.catalyst.util.DateTimeUtils
 import org.apache.spark.sql.execution.streaming.{RateStreamOffset, 
ValueRunTimeMsPair}
-import org.apache.spark.sql.sources.v2.DataSourceV2Options
+import org.apache.spark.sql.sources.DataSourceRegister
+import org.apache.spark.sql.sources.v2.{DataSourceV2, DataSourceV2Options}
 import org.apache.spark.sql.sources.v2.reader._
+import org.apache.spark.sql.sources.v2.streaming.MicroBatchReadSupport
 import org.apache.spark.sql.sources.v2.streaming.reader.{MicroBatchReader, 
Offset}
 import org.apache.spark.sql.types.{LongType, StructField, StructType, 
TimestampType}
-import org.apache.spark.util.SystemClock
+import org.apache.spark.util.{ManualClock, SystemClock}
 
-class RateStreamV2Reader(options: DataSourceV2Options)
+/**
+ * This is a temporary register as we build out v2 migration. Microbatch 
read support should
+ * be implemented in the same register as v1.
+ */
+class RateSourceProviderV2 extends DataSourceV2 with MicroBatchReadSupport 
with DataSourceRegister {
+  override def createMicroBatchReader(
+  schema: Optional[StructType],
+  checkpointLocation: String,
+  options: DataSourceV2Options): MicroBatchReader = {
+new MicroBatchRateStreamReader(options)
+  }
+
+  override def shortName(): String = "ratev2"
+}
+
+class MicroBatchRateStreamReader(options: DataSourceV2Options)
--- End diff --

As with the other kafka PR, can you rename these classes to start with 
"RateStream"? Only if it is not too much refactoring, otherwise we can clean 
this up later.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20097: [SPARK-22912] v2 data source support in MicroBatc...

2018-01-05 Thread tdas
Github user tdas commented on a diff in the pull request:

https://github.com/apache/spark/pull/20097#discussion_r159961186
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala
 ---
@@ -357,31 +400,39 @@ class MicroBatchExecution(
 s"DataFrame returned by getBatch from $source did not have 
isStreaming=true\n" +
   s"${batch.queryExecution.logical}")
   logDebug(s"Retrieving data from $source: $current -> $available")
-  Some(source -> batch)
+  Some(source -> batch.logicalPlan)
+case (reader: MicroBatchReader, available)
+  if committedOffsets.get(reader).map(_ != 
available).getOrElse(true) =>
+  val current = committedOffsets.get(reader).map(off => 
reader.deserializeOffset(off.json))
+  reader.setOffsetRange(
+toJava(current),
+
Optional.of(available.asInstanceOf[v2.streaming.reader.Offset]))
--- End diff --

`v2.streaming.reader.Offset` is being used in a lot of places. Please 
rename it to OffsetV2 in the imports and use that in all places.
  


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20097: [SPARK-22912] v2 data source support in MicroBatc...

2018-01-05 Thread tdas
Github user tdas commented on a diff in the pull request:

https://github.com/apache/spark/pull/20097#discussion_r159978319
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala
 ---
@@ -392,6 +443,21 @@ class MicroBatchExecution(
   cd.dataType, cd.timeZoneId)
 }
 
+val triggerLogicalPlan = sink match {
+  case _: Sink => newAttributePlan
+  case s: MicroBatchWriteSupport =>
+val writer = s.createMicroBatchWriter(
+  s"$runId",
+  currentBatchId,
+  newAttributePlan.schema,
+  outputMode,
+  new DataSourceV2Options(extraOptions.asJava))
+Option(writer.orElse(null)).map(WriteToDataSourceV2(_, 
newAttributePlan)).getOrElse {
--- End diff --

can you add a comment explaining why the fallback is a LocalRelation, and 
when the writer can be empty?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20142: [SPARK-22930][PYTHON][SQL] Improve the description of Ve...

2018-01-05 Thread icexelloss
Github user icexelloss commented on the issue:

https://github.com/apache/spark/pull/20142
  
I added the test. @gatorsmile, do you have time to take a look, or could you 
let me know whom I should ping for review?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20097: [SPARK-22912] v2 data source support in MicroBatc...

2018-01-05 Thread tdas
Github user tdas commented on a diff in the pull request:

https://github.com/apache/spark/pull/20097#discussion_r159980158
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala
 ---
@@ -33,15 +40,21 @@ class MicroBatchExecution(
 name: String,
 checkpointRoot: String,
 analyzedPlan: LogicalPlan,
-sink: Sink,
+sink: BaseStreamingSink,
 trigger: Trigger,
 triggerClock: Clock,
 outputMode: OutputMode,
+extraOptions: Map[String, String],
 deleteCheckpointOnStop: Boolean)
   extends StreamExecution(
 sparkSession, name, checkpointRoot, analyzedPlan, sink,
 trigger, triggerClock, outputMode, deleteCheckpointOnStop) {
 
+  private def toJava(
+  scalaOption: Option[v2.streaming.reader.Offset]): 
Optional[v2.streaming.reader.Offset] = {
--- End diff --

As mentioned elsewhere, import the new Offset as OffsetV2 instead of using the 
full package path.
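
A small sketch of the suggested alias, with the helper above rewritten against it 
(the body here is a plausible implementation, not necessarily the one in this PR):

```
import java.util.Optional

import org.apache.spark.sql.sources.v2.streaming.reader.{Offset => OffsetV2}

private def toJava(scalaOption: Option[OffsetV2]): Optional[OffsetV2] = {
  Optional.ofNullable(scalaOption.orNull)
}
```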


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org


