[GitHub] spark pull request #21092: [SPARK-23984][K8S] Initial Python Bindings for Py...

2018-06-06 Thread kokes
Github user kokes commented on a diff in the pull request:

https://github.com/apache/spark/pull/21092#discussion_r193639554
  
--- Diff: resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/Config.scala ---
@@ -154,6 +176,24 @@ private[spark] object Config extends Logging {
   .checkValue(interval => interval > 0, s"Logging interval must be a positive time value.")
   .createWithDefaultString("1s")
 
+  val MEMORY_OVERHEAD_FACTOR =
+    ConfigBuilder("spark.kubernetes.memoryOverheadFactor")
+      .doc("This sets the Memory Overhead Factor that will allocate memory to non-JVM jobs " +
+        "which in the case of JVM tasks will default to 0.10 and 0.40 for non-JVM jobs")
+      .doubleConf
+      .checkValue(mem_overhead => mem_overhead >= 0 && mem_overhead < 1,
+        "Ensure that memory overhead is a double between 0 --> 1.0")
+      .createWithDefault(0.1)
+
+  val PYSPARK_MAJOR_PYTHON_VERSION =
+    ConfigBuilder("spark.kubernetes.pyspark.pythonversion")
+      .doc("This sets the python version. Either 2 or 3. (Python2 or Python3)")
+      .stringConf
+      .checkValue(pv => List("2", "3").contains(pv),
+        "Ensure that Python Version is either Python2 or Python3")
+      .createWithDefault("2")
--- End diff --

Am I reading this right that the default is Python 2? Is there a reason for 
that? Thanks!
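
In the meantime, Python 3 has to be opted into explicitly. A minimal sketch, assuming the key from the diff above goes through the usual `SparkConf` route:

```python
from pyspark import SparkConf

# Select Python 3 explicitly instead of relying on the "2" default;
# per the checkValue above, only "2" or "3" are accepted.
conf = SparkConf().set("spark.kubernetes.pyspark.pythonversion", "3")
```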


---




[GitHub] spark issue #20929: [SPARK-23772][SQL] Provide an option to ignore column of...

2018-06-06 Thread maropu
Github user maropu commented on the issue:

https://github.com/apache/spark/pull/20929
  
Yea, thanks for the comments! I'll try to fix it based on them.


---




[GitHub] spark pull request #21494: [WIP][SPARK-24375][Prototype] Support barrier sch...

2018-06-06 Thread galv
Github user galv commented on a diff in the pull request:

https://github.com/apache/spark/pull/21494#discussion_r193290266
  
--- Diff: python/pyspark/worker.py ---
@@ -232,6 +236,13 @@ def main(infile, outfile):
         shuffle.DiskBytesSpilled = 0
         _accumulatorRegistry.clear()
 
+        if (isBarrier):
+            port = 25333 + 2 + 2 * taskContext._partitionId
+            paras = GatewayParameters(port=port)
--- End diff --

paras -> params


---




[GitHub] spark pull request #21494: [WIP][SPARK-24375][Prototype] Support barrier sch...

2018-06-06 Thread galv
Github user galv commented on a diff in the pull request:

https://github.com/apache/spark/pull/21494#discussion_r193269255
  
--- Diff: core/src/test/scala/org/apache/spark/SparkContextSuite.scala ---
@@ -627,6 +627,52 @@ class SparkContextSuite extends SparkFunSuite with LocalSparkContext with Eventu
     assert(exc.getCause() != null)
     stream.close()
   }
+
+  test("support barrier sync under local mode") {
+    val conf = new SparkConf().setAppName("test").setMaster("local[2]")
+    sc = new SparkContext(conf)
+    val rdd = sc.makeRDD(Seq(1, 2, 3, 4), 2).barrier()
+    val rdd2 = rdd.mapPartitions { it =>
+      val tc = TaskContext.get.asInstanceOf[org.apache.spark.barrier.BarrierTaskContext]
+      // If we don't get the expected taskInfos, the job shall abort due to stage failure.
+      if (tc.hosts().length != 2) {
+        throw new SparkException("Expected taksInfos length is 2, actual length is " +
--- End diff --

`taksInfos` -> `taskInfos`


---




[GitHub] spark pull request #21494: [WIP][SPARK-24375][Prototype] Support barrier sch...

2018-06-06 Thread galv
Github user galv commented on a diff in the pull request:

https://github.com/apache/spark/pull/21494#discussion_r193289530
  
--- Diff: python/pyspark/worker.py ---
@@ -232,6 +236,13 @@ def main(infile, outfile):
         shuffle.DiskBytesSpilled = 0
         _accumulatorRegistry.clear()
 
+        if (isBarrier):
--- End diff --

Style: `if (isBarrier):` -> `if isBarrier:`


---




[GitHub] spark pull request #21494: [WIP][SPARK-24375][Prototype] Support barrier sch...

2018-06-06 Thread galv
Github user galv commented on a diff in the pull request:

https://github.com/apache/spark/pull/21494#discussion_r193291076
  
--- Diff: python/pyspark/worker.py ---
@@ -232,6 +236,13 @@ def main(infile, outfile):
         shuffle.DiskBytesSpilled = 0
         _accumulatorRegistry.clear()
 
+        if (isBarrier):
+            port = 25333 + 2 + 2 * taskContext._partitionId
--- End diff --

I recommend using DEFAULT_PORT and DEFAULT_PYTHON_PORT. They are exposed as 
part of the public API of py4j:  
https://github.com/bartdag/py4j/blob/216432d859de41441f0d1a0d55b31b5d8d09dd28/py4j-python/src/py4j/java_gateway.py#L54

By the way, acquiring ports like this is a little hacky and may require 
more thought.
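
A minimal sketch of that suggestion (the helper name is invented for illustration):

```python
from py4j.java_gateway import DEFAULT_PORT, GatewayParameters

def gateway_params_for(partition_id):
    # Same arithmetic as the diff above, but anchored on py4j's published
    # DEFAULT_PORT (25333) rather than a hard-coded magic number.
    port = DEFAULT_PORT + 2 + 2 * partition_id
    return GatewayParameters(port=port)
```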


---




[GitHub] spark pull request #21494: [WIP][SPARK-24375][Prototype] Support barrier sch...

2018-06-06 Thread galv
Github user galv commented on a diff in the pull request:

https://github.com/apache/spark/pull/21494#discussion_r193555968
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala ---
@@ -123,6 +124,21 @@ private[spark] class TaskSetManager(
   // TODO: We should kill any running task attempts when the task set manager becomes a zombie.
   private[scheduler] var isZombie = false
 
+  private[scheduler] lazy val barrierCoordinator = {
--- End diff --

I recommend adding a return type here for readability.


---




[GitHub] spark pull request #21494: [WIP][SPARK-24375][Prototype] Support barrier sch...

2018-06-06 Thread galv
Github user galv commented on a diff in the pull request:

https://github.com/apache/spark/pull/21494#discussion_r193269297
  
--- Diff: core/src/test/scala/org/apache/spark/SparkContextSuite.scala ---
@@ -627,6 +627,52 @@ class SparkContextSuite extends SparkFunSuite with LocalSparkContext with Eventu
     assert(exc.getCause() != null)
     stream.close()
   }
+
+  test("support barrier sync under local mode") {
+    val conf = new SparkConf().setAppName("test").setMaster("local[2]")
+    sc = new SparkContext(conf)
+    val rdd = sc.makeRDD(Seq(1, 2, 3, 4), 2).barrier()
+    val rdd2 = rdd.mapPartitions { it =>
+      val tc = TaskContext.get.asInstanceOf[org.apache.spark.barrier.BarrierTaskContext]
+      // If we don't get the expected taskInfos, the job shall abort due to stage failure.
+      if (tc.hosts().length != 2) {
+        throw new SparkException("Expected taksInfos length is 2, actual length is " +
+          s"${tc.hosts().length}.")
+      }
+      // println(tc.getTaskInfos().toList)
--- End diff --

Remove comment


---




[GitHub] spark issue #21500: Scalable Memory option for HDFSBackedStateStore

2018-06-06 Thread HeartSaVioR
Github user HeartSaVioR commented on the issue:

https://github.com/apache/spark/pull/21500
  
Retaining versions of state is also relevant to snapshotting the last version to files: HDFSBackedStateStoreProvider doesn't snapshot a version if it doesn't exist in loadedMaps. So we may want to check whether this option also works with the current approach to snapshotting.


---




[GitHub] spark pull request #21501: [SPARK-15064][ML] Locale support in StopWordsRemo...

2018-06-06 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/21501#discussion_r193635361
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/StopWordsRemover.scala ---
@@ -84,7 +86,36 @@ class StopWordsRemover @Since("1.5.0") (@Since("1.5.0") override val uid: String
   @Since("1.5.0")
   def getCaseSensitive: Boolean = $(caseSensitive)
 
-  setDefault(stopWords -> StopWordsRemover.loadDefaultStopWords("english"), caseSensitive -> false)
+  /**
+   * [[https://docs.oracle.com/javase/8/docs/api/java/util/Locale.html Locale]] of the input.
+   * Ignored when [[caseSensitive]] is false.
+   * Default: Locale.English
+   * @see `StopWordsRemover.loadDefaultStopWords()`
+   * @group param
+   */
+  @Since("2.4.0")
+  val locale: Param[Locale] = new Param[Locale](this, "locale",
--- End diff --

+1


---




[GitHub] spark pull request #21477: [WIP] [SPARK-24396] [SS] [PYSPARK] Add Structured...

2018-06-06 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/21477#discussion_r193634436
  
--- Diff: python/pyspark/sql/streaming.py ---
@@ -843,6 +844,169 @@ def trigger(self, processingTime=None, once=None, continuous=None):
         self._jwrite = self._jwrite.trigger(jTrigger)
         return self
 
+    def foreach(self, f):
+        """
+        Sets the output of the streaming query to be processed using the provided writer ``f``.
+        This is often used to write the output of a streaming query to arbitrary storage systems.
+        The processing logic can be specified in two ways.
+
+        #. A **function** that takes a row as input.
+            This is a simple way to express your processing logic. Note that this does
+            not allow you to deduplicate generated data when failures cause reprocessing of
+            some input data. That would require you to specify the processing logic in the next
+            way.
+
+        #. An **object** with a ``process`` method and optional ``open`` and ``close`` methods.
+            The object can have the following methods.
+
+            * ``open(partition_id, epoch_id)``: *Optional* method that initializes the processing
+                (for example, open a connection, start a transaction, etc). Additionally, you can
+                use the `partition_id` and `epoch_id` to deduplicate regenerated data
+                (discussed later).
+
+            * ``process(row)``: *Non-optional* method that processes each :class:`Row`.
+
+            * ``close(error)``: *Optional* method that finalizes and cleans up (for example,
+                close connection, commit transaction, etc.) after all rows have been processed.
+
+            The object will be used by Spark in the following way.
+
+            * A single copy of this object is responsible for all the data generated by a
+                single task in a query. In other words, one instance is responsible for
+                processing one partition of the data generated in a distributed manner.
+
+            * This object must be serializable because each task will get a fresh
+                serialized-deserialized copy of the provided object. Hence, it is strongly
+                recommended that any initialization for writing data (e.g. opening a
+                connection or starting a transaction) be done after the `open(...)`
+                method has been called, which signifies that the task is ready to generate data.
+
+            * The lifecycle of the methods is as follows.
+
+                For each partition with ``partition_id``:
+
+                ... For each batch/epoch of streaming data with ``epoch_id``:
+
+                    ... Method ``open(partitionId, epochId)`` is called.
+
+                    ... If ``open(...)`` returns true, for each row in the partition and
+                        batch/epoch, method ``process(row)`` is called.
+
+                    ... Method ``close(errorOrNull)`` is called with error (if any) seen while
+                        processing rows.
+
+        Important points to note:
+
+        * The `partitionId` and `epochId` can be used to deduplicate generated data when
+            failures cause reprocessing of some input data. This depends on the execution
+            mode of the query. If the streaming query is being executed in the micro-batch
+            mode, then every partition represented by a unique tuple (partition_id, epoch_id)
+            is guaranteed to have the same data. Hence, (partition_id, epoch_id) can be used
+            to deduplicate and/or transactionally commit data and achieve exactly-once
+            guarantees. However, if the streaming query is being executed in the continuous
+            mode, then this guarantee does not hold and therefore should not be used for
+            deduplication.
+
+        * The ``close()`` method (if it exists) will be called if the ``open()`` method exists and
+            returns successfully (irrespective of the return value), except if the Python
+            process crashes in the middle.
+
+        .. note:: Evolving.
+
+        >>> # Print every row using a function
+        >>> writer = sdf.writeStream.foreach(lambda x: print(x))
+        >>> # Print every row using an object with process() method
+        >>> class RowPrinter:
+        ...     def open(self, partition_id, epoch_id):
+        ...         print("Opened %d, %d" % (partition_id, epoch_id))
+        ...         return True
+        ...     d

[GitHub] spark pull request #21477: [WIP] [SPARK-24396] [SS] [PYSPARK] Add Structured...

2018-06-06 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/21477#discussion_r193633540
  
--- Diff: python/pyspark/sql/streaming.py ---
@@ -843,6 +844,169 @@ def trigger(self, processingTime=None, once=None, continuous=None):
         self._jwrite = self._jwrite.trigger(jTrigger)
         return self
 
+    def foreach(self, f):
+        """
+        Sets the output of the streaming query to be processed using the provided writer ``f``.
+        This is often used to write the output of a streaming query to arbitrary storage systems.
+        The processing logic can be specified in two ways.
+
+        #. A **function** that takes a row as input.
--- End diff --

I mean, we could maybe consider the other ways, but wouldn't it be better to have the consistent support as the primary, and then see if the other ways are really requested by users? I think we could still incrementally add the attribute-checking way, or the lambda (or function, to be more correct) way, later.


---




[GitHub] spark pull request #21477: [WIP] [SPARK-24396] [SS] [PYSPARK] Add Structured...

2018-06-06 Thread tdas
Github user tdas commented on a diff in the pull request:

https://github.com/apache/spark/pull/21477#discussion_r193632947
  
--- Diff: python/pyspark/sql/streaming.py ---
@@ -843,6 +844,169 @@ def trigger(self, processingTime=None, once=None, continuous=None):
         self._jwrite = self._jwrite.trigger(jTrigger)
         return self
 
+    def foreach(self, f):
+        """
+        Sets the output of the streaming query to be processed using the provided writer ``f``.
+        This is often used to write the output of a streaming query to arbitrary storage systems.
+        The processing logic can be specified in two ways.
+
+        #. A **function** that takes a row as input.
+            This is a simple way to express your processing logic. Note that this does
+            not allow you to deduplicate generated data when failures cause reprocessing of
+            some input data. That would require you to specify the processing logic in the next
+            way.
+
+        #. An **object** with a ``process`` method and optional ``open`` and ``close`` methods.
+            The object can have the following methods.
+
+            * ``open(partition_id, epoch_id)``: *Optional* method that initializes the processing
+                (for example, open a connection, start a transaction, etc). Additionally, you can
+                use the `partition_id` and `epoch_id` to deduplicate regenerated data
+                (discussed later).
+
+            * ``process(row)``: *Non-optional* method that processes each :class:`Row`.
+
+            * ``close(error)``: *Optional* method that finalizes and cleans up (for example,
+                close connection, commit transaction, etc.) after all rows have been processed.
+
+            The object will be used by Spark in the following way.
+
+            * A single copy of this object is responsible for all the data generated by a
+                single task in a query. In other words, one instance is responsible for
+                processing one partition of the data generated in a distributed manner.
+
+            * This object must be serializable because each task will get a fresh
+                serialized-deserialized copy of the provided object. Hence, it is strongly
+                recommended that any initialization for writing data (e.g. opening a
+                connection or starting a transaction) be done after the `open(...)`
+                method has been called, which signifies that the task is ready to generate data.
+
+            * The lifecycle of the methods is as follows.
+
+                For each partition with ``partition_id``:
+
+                ... For each batch/epoch of streaming data with ``epoch_id``:
+
+                    ... Method ``open(partitionId, epochId)`` is called.
+
+                    ... If ``open(...)`` returns true, for each row in the partition and
+                        batch/epoch, method ``process(row)`` is called.
+
+                    ... Method ``close(errorOrNull)`` is called with error (if any) seen while
+                        processing rows.
+
+        Important points to note:
+
+        * The `partitionId` and `epochId` can be used to deduplicate generated data when
+            failures cause reprocessing of some input data. This depends on the execution
+            mode of the query. If the streaming query is being executed in the micro-batch
+            mode, then every partition represented by a unique tuple (partition_id, epoch_id)
+            is guaranteed to have the same data. Hence, (partition_id, epoch_id) can be used
+            to deduplicate and/or transactionally commit data and achieve exactly-once
+            guarantees. However, if the streaming query is being executed in the continuous
+            mode, then this guarantee does not hold and therefore should not be used for
+            deduplication.
+
+        * The ``close()`` method (if it exists) will be called if the ``open()`` method exists and
+            returns successfully (irrespective of the return value), except if the Python
+            process crashes in the middle.
+
+        .. note:: Evolving.
+
+        >>> # Print every row using a function
+        >>> writer = sdf.writeStream.foreach(lambda x: print(x))
+        >>> # Print every row using an object with process() method
+        >>> class RowPrinter:
+        ...     def open(self, partition_id, epoch_id):
+        ...         print("Opened %d, %d" % (partition_id, epoch_id))
+        ...         return True
+        ...     def proc

[GitHub] spark pull request #21477: [WIP] [SPARK-24396] [SS] [PYSPARK] Add Structured...

2018-06-06 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/21477#discussion_r193632883
  
--- Diff: python/pyspark/sql/streaming.py ---
@@ -843,6 +844,169 @@ def trigger(self, processingTime=None, once=None, continuous=None):
         self._jwrite = self._jwrite.trigger(jTrigger)
         return self
 
+    def foreach(self, f):
+        """
+        Sets the output of the streaming query to be processed using the provided writer ``f``.
+        This is often used to write the output of a streaming query to arbitrary storage systems.
+        The processing logic can be specified in two ways.
+
+        #. A **function** that takes a row as input.
--- End diff --

(including the response to https://github.com/apache/spark/pull/21477#discussion_r193631209) I kind of agree that it's an okay idea, but so far I think we have usually provided consistent API support unless something is language specific, for example, ContextManager, decorator in Python, etc.

Just for clarification, does the Scala side support the function form too?

Also, I know the attribute-checking way is more "Pythonic", but I am seeing that the documentation has already diverged between Scala and Python. It costs maintenance overhead on the other hand.


---




[GitHub] spark issue #21483: [SPARK-24477][SPARK-24454][ML][PYTHON] Imports submodule...

2018-06-06 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21483
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #21483: [SPARK-24477][SPARK-24454][ML][PYTHON] Imports submodule...

2018-06-06 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21483
  
**[Test build #91516 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91516/testReport)** for PR 21483 at commit [`49323a6`](https://github.com/apache/spark/commit/49323a6d3207e7d4c6b04ab3c3b1729d9eeaca16).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #21483: [SPARK-24477][SPARK-24454][ML][PYTHON] Imports submodule...

2018-06-06 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21483
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/91516/
Test PASSed.


---




[GitHub] spark pull request #21477: [WIP] [SPARK-24396] [SS] [PYSPARK] Add Structured...

2018-06-06 Thread tdas
Github user tdas commented on a diff in the pull request:

https://github.com/apache/spark/pull/21477#discussion_r193631566
  
--- Diff: python/pyspark/sql/streaming.py ---
@@ -843,6 +844,169 @@ def trigger(self, processingTime=None, once=None, continuous=None):
         self._jwrite = self._jwrite.trigger(jTrigger)
         return self
 
+    def foreach(self, f):
+        """
+        Sets the output of the streaming query to be processed using the provided writer ``f``.
+        This is often used to write the output of a streaming query to arbitrary storage systems.
+        The processing logic can be specified in two ways.
+
+        #. A **function** that takes a row as input.
--- End diff --

This is a superset of what we support in Scala. Python users are more likely to use simple lambdas than to define classes, but they may also want to write transactional stuff in Python with open and close methods. Hence providing both alternatives seems to be a good idea.
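
For concreteness, a sketch of the two styles from the docstring (`sdf` is assumed to be a streaming DataFrame):

```python
# Style 1: a plain function. Simple, but offers no way to deduplicate
# output when failures cause input to be reprocessed.
sdf.writeStream.foreach(lambda row: print(row)).start()

# Style 2: any object with a process() method and optional open()/close().
# No ForeachWriter import is needed; (partition_id, epoch_id) can be used
# for transactional, exactly-once writes in micro-batch mode.
class RowPrinter:
    def open(self, partition_id, epoch_id):
        print("Opened %d, %d" % (partition_id, epoch_id))
        return True

    def process(self, row):
        print(row)

    def close(self, error):
        print("Closed with error: %s" % str(error))

sdf.writeStream.foreach(RowPrinter()).start()
```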


---




[GitHub] spark issue #21482: [SPARK-24393][SQL] SQL builtin: isinf

2018-06-06 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21482
  
**[Test build #91518 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91518/testReport)** for PR 21482 at commit [`6a4d46e`](https://github.com/apache/spark/commit/6a4d46e0a9ab403364e26a7b8f16c9ca94c31a2e).


---




[GitHub] spark pull request #21477: [WIP] [SPARK-24396] [SS] [PYSPARK] Add Structured...

2018-06-06 Thread tdas
Github user tdas commented on a diff in the pull request:

https://github.com/apache/spark/pull/21477#discussion_r193631209
  
--- Diff: python/pyspark/sql/streaming.py ---
@@ -843,6 +844,169 @@ def trigger(self, processingTime=None, once=None, continuous=None):
         self._jwrite = self._jwrite.trigger(jTrigger)
         return self
 
+    def foreach(self, f):
+        """
+        Sets the output of the streaming query to be processed using the provided writer ``f``.
+        This is often used to write the output of a streaming query to arbitrary storage systems.
+        The processing logic can be specified in two ways.
+
+        #. A **function** that takes a row as input.
+            This is a simple way to express your processing logic. Note that this does
+            not allow you to deduplicate generated data when failures cause reprocessing of
+            some input data. That would require you to specify the processing logic in the next
+            way.
+
+        #. An **object** with a ``process`` method and optional ``open`` and ``close`` methods.
--- End diff --

I discussed this with @marmbrus. If there is a ForeachWriter class in Python, then users will have to additionally import it. That's just another overhead that can be avoided by just allowing any class with the appropriate methods. One less step for Python users.


---




[GitHub] spark issue #21483: [SPARK-24477][SPARK-24454][ML][PYTHON] Imports submodule...

2018-06-06 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21483
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #21483: [SPARK-24477][SPARK-24454][ML][PYTHON] Imports submodule...

2018-06-06 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21483
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #21483: [SPARK-24477][SPARK-24454][ML][PYTHON] Imports submodule...

2018-06-06 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21483
  
**[Test build #91515 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91515/testReport)** for PR 21483 at commit [`5e293d5`](https://github.com/apache/spark/commit/5e293d5a543bef35e422e7f8d816853a6f37314b).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #21483: [SPARK-24477][SPARK-24454][ML][PYTHON] Imports submodule...

2018-06-06 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21483
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/91515/
Test PASSed.


---




[GitHub] spark issue #21483: [SPARK-24477][SPARK-24454][ML][PYTHON] Imports submodule...

2018-06-06 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21483
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3829/
Test PASSed.


---




[GitHub] spark pull request #21482: [SPARK-24393][SQL] SQL builtin: isinf

2018-06-06 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/21482#discussion_r193630041
  
--- Diff: R/pkg/NAMESPACE ---
@@ -281,6 +281,8 @@ exportMethods("%<=>%",
   "initcap",
   "input_file_name",
   "instr",
+  "isInf",
+  "isinf",
--- End diff --

I really don't understand why we have both. We usually have the one matching the Scala side, or an R-specific function. Otherwise, I don't think we should have both.


---




[GitHub] spark issue #21483: [SPARK-24477][SPARK-24454][ML][PYTHON] Imports submodule...

2018-06-06 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21483
  
**[Test build #91516 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91516/testReport)** for PR 21483 at commit [`49323a6`](https://github.com/apache/spark/commit/49323a6d3207e7d4c6b04ab3c3b1729d9eeaca16).


---




[GitHub] spark issue #21482: [SPARK-24393][SQL] SQL builtin: isinf

2018-06-06 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21482
  
**[Test build #91517 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91517/testReport)** for PR 21482 at commit [`f240fdf`](https://github.com/apache/spark/commit/f240fdf3a410e2fdec1fa668bc0218ac61078423).


---




[GitHub] spark issue #21483: [SPARK-24477][SPARK-24454][ML][PYTHON] Imports submodule...

2018-06-06 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21483
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/91514/
Test PASSed.


---




[GitHub] spark issue #21483: [SPARK-24477][SPARK-24454][ML][PYTHON] Imports submodule...

2018-06-06 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21483
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #21483: [SPARK-24477][SPARK-24454][ML][PYTHON] Imports submodule...

2018-06-06 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21483
  
**[Test build #91514 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91514/testReport)** for PR 21483 at commit [`55eef7c`](https://github.com/apache/spark/commit/55eef7ca382a7eab7c80404b2b3d8c6ba95d806c).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #21483: [SPARK-24477][SPARK-24454][ML][PYTHON] Imports image mod...

2018-06-06 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21483
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #21483: [SPARK-24477][SPARK-24454][ML][PYTHON] Imports image mod...

2018-06-06 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21483
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3828/
Test PASSed.


---




[GitHub] spark issue #21483: [SPARK-24477][SPARK-24454][ML][PYTHON] Imports image mod...

2018-06-06 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21483
  
**[Test build #91515 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91515/testReport)** for PR 21483 at commit [`5e293d5`](https://github.com/apache/spark/commit/5e293d5a543bef35e422e7f8d816853a6f37314b).


---




[GitHub] spark issue #21483: [SPARK-24477][SPARK-24454][ML][PYTHON] Imports image mod...

2018-06-06 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21483
  
Merged build finished. Test FAILed.


---




[GitHub] spark issue #21483: [SPARK-24477][SPARK-24454][ML][PYTHON] Imports image mod...

2018-06-06 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21483
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3827/
Test FAILed.


---




[GitHub] spark issue #21483: [SPARK-24477][SPARK-24454][ML][PYTHON] Imports image mod...

2018-06-06 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21483
  
**[Test build #91514 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91514/testReport)** for PR 21483 at commit [`55eef7c`](https://github.com/apache/spark/commit/55eef7ca382a7eab7c80404b2b3d8c6ba95d806c).


---




[GitHub] spark issue #21504: SPARK-24479: Added config for registering streamingQuery...

2018-06-06 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21504
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/91510/
Test PASSed.


---




[GitHub] spark issue #21504: SPARK-24479: Added config for registering streamingQuery...

2018-06-06 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21504
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #21504: SPARK-24479: Added config for registering streamingQuery...

2018-06-06 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21504
  
**[Test build #91510 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91510/testReport)** for PR 21504 at commit [`d3a3baa`](https://github.com/apache/spark/commit/d3a3baa8bba65da6a30d1d449f97cbeb467ca14b).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `.doc(\"List of class names implementing StreamingQueryListener that will be automatically \" +`


---




[GitHub] spark issue #21500: Scalable Memory option for HDFSBackedStateStore

2018-06-06 Thread HeartSaVioR
Github user HeartSaVioR commented on the issue:

https://github.com/apache/spark/pull/21500
  
@TomaszGaweda @aalobaidi 
Please correct me if I'm missing here.

From every start of a batch, the state store loads the previous version of state so that it can be read and written. If we unload all the versions "after committing", the cache will no longer contain the previous version of state, and the store will try to load it by reading files, adding huge latency when starting a batch. That's why I described the three cases earlier: to avoid loading state from files when starting a new batch.

Please apply #21469 manually and see how much memory HDFSBackedStateStoreProvider consumes due to storing multiple versions (it will show the state size of the latest version as well as the overall state size in the cache). Please also observe and provide latency numbers to show how much it is now and how much it will be after the patch. We always have to ask ourselves whether we are addressing the issue correctly.


---




[GitHub] spark pull request #21469: [SPARK-24441][SS] Expose total estimated size of ...

2018-06-06 Thread HeartSaVioR
Github user HeartSaVioR commented on a diff in the pull request:

https://github.com/apache/spark/pull/21469#discussion_r193622940
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamingQueryListenerSuite.scala ---
@@ -231,7 +231,7 @@ class StreamingQueryListenerSuite extends StreamTest with BeforeAndAfter {
   test("event ordering") {
     val listener = new EventCollector
     withListenerAdded(listener) {
-      for (i <- 1 to 100) {
+      for (i <- 1 to 50) {
--- End diff --

After the patch this test starts failing: it just means more time is needed to run this loop 100 times, not that the logic is broken. Decreasing the number works for me.


---




[GitHub] spark issue #18900: [SPARK-21687][SQL] Spark SQL should set createTime for H...

2018-06-06 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18900
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/91513/
Test FAILed.


---




[GitHub] spark issue #18900: [SPARK-21687][SQL] Spark SQL should set createTime for H...

2018-06-06 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18900
  
**[Test build #91513 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91513/testReport)** for PR 18900 at commit [`e3a0cc4`](https://github.com/apache/spark/commit/e3a0cc43b10828b8111f7cd9523391cd3a2fdb6f).
 * This patch **fails Scala style tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #18900: [SPARK-21687][SQL] Spark SQL should set createTime for H...

2018-06-06 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18900
  
Merged build finished. Test FAILed.


---




[GitHub] spark issue #18900: [SPARK-21687][SQL] Spark SQL should set createTime for H...

2018-06-06 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18900
  
**[Test build #91513 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91513/testReport)** for PR 18900 at commit [`e3a0cc4`](https://github.com/apache/spark/commit/e3a0cc43b10828b8111f7cd9523391cd3a2fdb6f).


---




[GitHub] spark issue #18900: [SPARK-21687][SQL] Spark SQL should set createTime for H...

2018-06-06 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18900
  
Merged build finished. Test FAILed.


---




[GitHub] spark issue #18900: [SPARK-21687][SQL] Spark SQL should set createTime for H...

2018-06-06 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18900
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/91512/
Test FAILed.


---




[GitHub] spark issue #18900: [SPARK-21687][SQL] Spark SQL should set createTime for H...

2018-06-06 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18900
  
**[Test build #91512 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91512/testReport)** for PR 18900 at commit [`a00e943`](https://github.com/apache/spark/commit/a00e943a7097f386c842fd725cb1474e3a7f74c8).
 * This patch **fails Scala style tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #18900: [SPARK-21687][SQL] Spark SQL should set createTime for H...

2018-06-06 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18900
  
**[Test build #91512 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91512/testReport)** for PR 18900 at commit [`a00e943`](https://github.com/apache/spark/commit/a00e943a7097f386c842fd725cb1474e3a7f74c8).


---




[GitHub] spark pull request #21499: [SPARK-24468][SQL] Handle negative scale when adj...

2018-06-06 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/21499#discussion_r193618762
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/types/DecimalType.scala ---
@@ -161,13 +161,17 @@ object DecimalType extends AbstractDataType {
    * This method is used only when `spark.sql.decimalOperations.allowPrecisionLoss` is set to true.
    */
   private[sql] def adjustPrecisionScale(precision: Int, scale: Int): DecimalType = {
-    // Assumptions:
+    // Assumption:
     assert(precision >= scale)
-    assert(scale >= 0)
 
     if (precision <= MAX_PRECISION) {
       // Adjustment only needed when we exceed max precision
       DecimalType(precision, scale)
+    } else if (scale < 0) {
+      // Decimal can have negative scale (SPARK-24468). In this case, we cannot allow a precision
+      // loss since we would cause a loss of digits in the integer part.
--- End diff --

Ok, makes sense. Do we have an end-to-end test case for returning null?


---




[GitHub] spark issue #21469: [SPARK-24441][SS] Expose total estimated size of states ...

2018-06-06 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21469
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/91509/
Test PASSed.


---




[GitHub] spark issue #21469: [SPARK-24441][SS] Expose total estimated size of states ...

2018-06-06 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21469
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #21469: [SPARK-24441][SS] Expose total estimated size of states ...

2018-06-06 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21469
  
**[Test build #91509 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91509/testReport)** for PR 21469 at commit [`7ec3242`](https://github.com/apache/spark/commit/7ec32427cf0fda82d2f936fa0aab62e274a8c034).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark pull request #21498: [SPARK-24410][SQL][Core] Optimization for Union o...

2018-06-06 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/21498#discussion_r193618338
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -1099,6 +1099,17 @@ object SQLConf {
       .intConf
       .createWithDefault(SHUFFLE_SPILL_NUM_ELEMENTS_FORCE_SPILL_THRESHOLD.defaultValue.get)
 
+  val UNION_IN_SAME_PARTITION =
+    buildConf("spark.sql.unionInSamePartition")
+      .internal()
+      .doc("When true, Union operator will union children results in the same corresponding " +
+        "partitions if they have same partitioning. This eliminates unnecessary shuffle in later " +
+        "operators like aggregation. Note that because non-deterministic functions such as " +
+        "monotonically_increasing_id are depended on partition id. By doing this, the values of " +
--- End diff --

I'm not quite convinced by the reason for, and behavior of, keeping the values of non-deterministic functions the same after a union. Take the following queries:

```scala
val df1 = spark.range(10).select(monotonically_increasing_id())
val df2 = spark.range(10).select(monotonically_increasing_id())
val union = df1.union(df2)
```

Now we keep the values of `monotonically_increasing_id` returned by `df1`, `df2` and `union` the same. However, for non-deterministic functions, values that change with data layout/sequence still sound reasonable.

Anyway, that is the current behavior, and this config requires users to enable the feature explicitly.



---




[GitHub] spark issue #21498: [SPARK-24410][SQL][Core] Optimization for Union outputPa...

2018-06-06 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/21498
  
I set up a Spark cluster with 5 nodes on EC2.

```scala
def benchmark(func: () => Unit): Unit = {
  val t0 = System.nanoTime()
  func()
  val t1 = System.nanoTime()
  println("Elapsed time: " + (t1 - t0) + "ns")
}

val N = 1L

spark.range(N).selectExpr("id as key", "id % 2 as t1", "id % 3 as t2").repartition(col("key")).write.mode("overwrite").bucketBy(3, "key").sortBy("t1").saveAsTable("a1")
spark.range(N).selectExpr("id as key", "id % 2 as t1", "id % 3 as t2").repartition(col("key")).write.mode("overwrite").bucketBy(3, "key").sortBy("t1").saveAsTable("a2")

spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)

val df = sql("select key,count(*) from (select * from a1 union all select * from a2)z group by key")
val df2 = sql("select * from a1 union all select * from a2")

val df3 = df.sample(0.8).filter($"key" > 100).sample(0.4)
val df4 = df2.sample(0.8).filter($"key" > 100).sample(0.4)

benchmark(() => df.collect)
benchmark(() => df2.collect)
benchmark(() => df3.collect)
benchmark(() => df4.collect)
```

Before:
```
scala> benchmark(() => df.collect)
Elapsed time: 371007018ns
scala> benchmark(() => df2.collect)
Elapsed time: 93056619ns
scala> benchmark(() => df3.collect)
Elapsed time: 477603242ns
scala> benchmark(() => df4.collect)
Elapsed time: 150106354ns
```

After:
```
scala> benchmark(() => df.collect)
Elapsed time: 101199791ns
scala> benchmark(() => df2.collect)
Elapsed time: 80275158ns
scala> benchmark(() => df3.collect)
Elapsed time: 292775244ns
scala> benchmark(() => df4.collect)
Elapsed time: 151129518ns
```

It improves the queries of `df` and `df3` by eliminating shuffle. For `df2` 
and `df4` which don't involve shuffle, there is no performance regression.



---




[GitHub] spark issue #18900: [SPARK-21687][SQL] Spark SQL should set createTime for H...

2018-06-06 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18900
  
**[Test build #91511 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91511/testReport)** for PR 18900 at commit [`478e205`](https://github.com/apache/spark/commit/478e2051c775a594ad729256c3ef78cc311c992d).
 * This patch **fails Scala style tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #18900: [SPARK-21687][SQL] Spark SQL should set createTime for H...

2018-06-06 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18900
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/91511/
Test FAILed.


---




[GitHub] spark issue #18900: [SPARK-21687][SQL] Spark SQL should set createTime for H...

2018-06-06 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18900
  
Merged build finished. Test FAILed.


---




[GitHub] spark issue #18900: [SPARK-21687][SQL] Spark SQL should set createTime for H...

2018-06-06 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18900
  
**[Test build #91511 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91511/testReport)** for PR 18900 at commit [`478e205`](https://github.com/apache/spark/commit/478e2051c775a594ad729256c3ef78cc311c992d).


---




[GitHub] spark issue #18900: [SPARK-21687][SQL] Spark SQL should set createTime for H...

2018-06-06 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/18900
  
ok to test


---




[GitHub] spark issue #18900: [SPARK-21687][SQL] Spark SQL should set createTime for H...

2018-06-06 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18900
  
Can one of the admins verify this patch?


---




[GitHub] spark issue #18900: [SPARK-21687][SQL] Spark SQL should set createTime for H...

2018-06-06 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18900
  
Can one of the admins verify this patch?


---




[GitHub] spark pull request #18900: [SPARK-21687][SQL] Spark SQL should set createTim...

2018-06-06 Thread debugger87
GitHub user debugger87 reopened a pull request:

https://github.com/apache/spark/pull/18900

[SPARK-21687][SQL] Spark SQL should set createTime for Hive partition

## What changes were proposed in this pull request?

Set createTime for every Hive partition created in Spark SQL, which could be used to manage the data lifecycle in the Hive warehouse. We found that almost every partition created by Spark SQL does not have createTime set.

```
mysql> select * from partitions where create_time=0 limit 1\G;
*************************** 1. row ***************************
         PART_ID: 1028584
     CREATE_TIME: 0
LAST_ACCESS_TIME: 1502203611
       PART_NAME: date=20170130
           SD_ID: 1543605
          TBL_ID: 211605
  LINK_TARGET_ID: NULL
1 row in set (0.27 sec)
```
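
A hypothetical repro sketch (table and partition names invented for illustration; a Hive-enabled `spark` session is assumed):

```python
# Create a partitioned Hive table and add a partition through Spark SQL;
# the resulting metastore entry lacked a createTime, per the report above.
spark.sql("CREATE TABLE logs (a INT) PARTITIONED BY (dt STRING) STORED AS PARQUET")
spark.sql("INSERT INTO logs PARTITION (dt = '20170130') VALUES (1)")
```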

## How was this patch tested?
 N/A

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/debugger87/spark 
fix/set-create-time-for-hive-partition

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/18900.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #18900


commit 71a660ac8dad869d9ba3b4e206b74f5c44660ee6
Author: debugger87 
Date:   2017-08-10T04:17:00Z

[SPARK-21687][SQL] Spark SQL should set createTime for Hive partition

commit f668ce8837ee553c61687bd03d04cddd32e5f36f
Author: debugger87 
Date:   2017-08-11T07:50:26Z

added createTime and lastAccessTime into CatalogTablePartition

commit 2fb1ddabdb2ab8f7b585ee7aea93280f96a23467
Author: debugger87 
Date:   2017-08-11T07:54:26Z

minor tweak

commit c833ce7aa5f2ba0b684494fd1b24b7995f1c09c9
Author: debugger87 
Date:   2017-08-11T08:07:57Z

fix type missmatch

commit bf2a1052f807a7ae36004c819e66fff5c4b45820
Author: debugger87 
Date:   2017-08-11T23:26:29Z

added createTime and lastAccessTime into partition map for display




---




[GitHub] spark issue #18900: [SPARK-21687][SQL] Spark SQL should set createTime for H...

2018-06-06 Thread debugger87
Github user debugger87 commented on the issue:

https://github.com/apache/spark/pull/18900
  
@cxzl25 OK, reopen it


---




[GitHub] spark pull request #21501: [SPARK-15064][ML] Locale support in StopWordsRemo...

2018-06-06 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/21501#discussion_r193604620
  
--- Diff: python/pyspark/ml/feature.py ---
@@ -2582,25 +2582,27 @@ class StopWordsRemover(JavaTransformer, HasInputCol, HasOutputCol, JavaMLReadabl
                           typeConverter=TypeConverters.toListString)
     caseSensitive = Param(Params._dummy(), "caseSensitive", "whether to do a case sensitive " +
                           "comparison over the stop words", typeConverter=TypeConverters.toBoolean)
+    locale = Param(Params._dummy(), "locale", "locale of the input. ignored when case sensitive is false",
+  typeConverter=TypeConverters.toString)
--- End diff --

nit: indentation
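
For illustration, how the new param would be used from Python, as a sketch assuming it gets the usual getter/setter treatment like the other params:

```python
from pyspark.ml.feature import StopWordsRemover

# "tr" (Turkish) is a classic locale where case-insensitive matching
# differs from the default; setLocale is the assumed generated setter.
remover = StopWordsRemover(inputCol="raw", outputCol="filtered")
remover.setLocale("tr")
```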


---




[GitHub] spark pull request #21504: SPARK-24479: Added config for registering streami...

2018-06-06 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/21504#discussion_r193604356
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamingQueryListenersConfSuite.scala ---
@@ -0,0 +1,66 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.streaming
+
+import scala.language.reflectiveCalls
+
+import org.scalatest.BeforeAndAfter
+
+import org.apache.spark.SparkConf
+import org.apache.spark.sql.execution.streaming._
+import org.apache.spark.sql.streaming.StreamingQueryListener.{QueryStartedEvent, QueryTerminatedEvent, _}
--- End diff --

nit: import looks a bit odd `{QueryStartedEvent, QueryTerminatedEvent, _}`.


---




[GitHub] spark pull request #21504: SPARK-24479: Added config for registering streami...

2018-06-06 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/21504#discussion_r193603810
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/streaming/StreamingQueryManager.scala
 ---
@@ -55,6 +56,12 @@ class StreamingQueryManager private[sql] (sparkSession: 
SparkSession) extends Lo
   @GuardedBy("awaitTerminationLock")
   private var lastTerminatedQuery: StreamingQuery = null
 
+  sparkSession.sparkContext.conf.get(STREAMING_QUERY_LISTENERS)
+  .foreach { classNames =>
--- End diff --

nit:

```scala
sparkSession.sparkContext.conf.get(STREAMING_QUERY_LISTENERS).foreach { 
classNames =>
  ...
}
```



---




[GitHub] spark issue #21504: SPARK-24479: Added config for registering streamingQuery...

2018-06-06 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/21504
  
Mind fixing the PR title to `[SPARK-24479][SS] Added config for registering 
streamingQueryListeners`?


---




[GitHub] spark issue #21482: [SPARK-24393][SQL] SQL builtin: isinf

2018-06-06 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/21482
  
@henryr 1.0/0.0 also returns null in Spark SQL ...

```
scala> sql("select cast(1.0 as double)/cast(0 as double)").show()
+-----------------------------------------+
|(CAST(1.0 AS DOUBLE) / CAST(0 AS DOUBLE))|
+-----------------------------------------+
|                                     null|
+-----------------------------------------+
```


---




[GitHub] spark issue #21504: SPARK-24480: Added config for registering streamingQuery...

2018-06-06 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21504
  
**[Test build #91510 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91510/testReport)**
 for PR 21504 at commit 
[`d3a3baa`](https://github.com/apache/spark/commit/d3a3baa8bba65da6a30d1d449f97cbeb467ca14b).


---




[GitHub] spark pull request #21504: SPARK-24480: Added config for registering streami...

2018-06-06 Thread arunmahadevan
GitHub user arunmahadevan opened a pull request:

https://github.com/apache/spark/pull/21504

SPARK-24480: Added config for registering streamingQueryListeners

## What changes were proposed in this pull request?

Currently a "StreamingQueryListener" can only be registered 
programmatically. We could have a new config "spark.sql.streamingQueryListeners", 
similar to "spark.sql.queryExecutionListeners" and "spark.extraListeners", for 
users to register custom streaming listeners.
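
To make the intended usage concrete, here is a minimal sketch (my own 
illustration: the listener uses the existing `StreamingQueryListener` API, the 
config key is the one proposed in this PR, and `LoggingQueryListener` is a 
hypothetical class name):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener._

// A listener that, without this PR, must be wired up in code via
// spark.streams.addListener(new LoggingQueryListener())
class LoggingQueryListener extends StreamingQueryListener {
  override def onQueryStarted(event: QueryStartedEvent): Unit =
    println(s"query started: ${event.id}")
  override def onQueryProgress(event: QueryProgressEvent): Unit =
    println(s"rows in last batch: ${event.progress.numInputRows}")
  override def onQueryTerminated(event: QueryTerminatedEvent): Unit =
    println(s"query terminated: ${event.id}")
}

// With the proposed config, the same listener would be registered declaratively:
val spark = SparkSession.builder()
  .master("local[2]")
  .config("spark.sql.streamingQueryListeners", classOf[LoggingQueryListener].getName)
  .getOrCreate()
```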

## How was this patch tested?

New unit test and running example programs.

Please review http://spark.apache.org/contributing.html before opening a 
pull request.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/arunmahadevan/spark SPARK-24480

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/21504.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #21504


commit d3a3baa8bba65da6a30d1d449f97cbeb467ca14b
Author: Arun Mahadevan 
Date:   2018-06-07T00:57:22Z

SPARK-24480: Added config for registering streamingQueryListeners




---




[GitHub] spark issue #21504: SPARK-24480: Added config for registering streamingQuery...

2018-06-06 Thread arunmahadevan
Github user arunmahadevan commented on the issue:

https://github.com/apache/spark/pull/21504
  
ping @tdas @jose-torres @HeartSaVioR 


---




[GitHub] spark issue #21504: SPARK-24480: Added config for registering streamingQuery...

2018-06-06 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21504
  
Can one of the admins verify this patch?


---




[GitHub] spark issue #21503: [SPARK-24478][SQL] Move projection and filter push down ...

2018-06-06 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21503
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/91507/
Test PASSed.


---




[GitHub] spark issue #21503: [SPARK-24478][SQL] Move projection and filter push down ...

2018-06-06 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21503
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #21503: [SPARK-24478][SQL] Move projection and filter push down ...

2018-06-06 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21503
  
**[Test build #91507 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91507/testReport)**
 for PR 21503 at commit 
[`9d3a11e`](https://github.com/apache/spark/commit/9d3a11e68bca6c5a56a2be47fb09395350362ac5).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `trait SupportsPhysicalStats extends LeafNode `


---




[GitHub] spark issue #21469: [SPARK-24441][SS] Expose total estimated size of states ...

2018-06-06 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21469
  
**[Test build #91509 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91509/testReport)**
 for PR 21469 at commit 
[`7ec3242`](https://github.com/apache/spark/commit/7ec32427cf0fda82d2f936fa0aab62e274a8c034).


---




[GitHub] spark issue #21482: [SPARK-24393][SQL] SQL builtin: isinf

2018-06-06 Thread henryr
Github user henryr commented on the issue:

https://github.com/apache/spark/pull/21482
  
@rxin Other engines are all over the place:

* MySQL doesn't appear to support infinity (based on my cursory look) - 1.0 
/ 0.0 evaluates to `null`. The same seems true of SQLite.
* Postgres has type-specific literals (e.g. `FLOAT8 '+Infinity'`) which you 
can use for comparison checks. 
* Impala has `is_inf()`
* Oracle has a literal value, and also a built-in predicate `SELECT f FROM 
foo WHERE f is infinite`
* SQL Server does not appear to support infinity (but that's based on 
anecdotal evidence)

One of my motivations for suggesting this builtin was to make sharing workloads 
with Impala easier. It's also convenient in its own right, since it's more 
intuitive than figuring out the right literal incantation.
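
For comparison, a short sketch of what the builtin would allow next to the 
workaround available today (`isinf` is the function proposed in this PR, not an 
existing Spark builtin; `vals` and its double column `d` are hypothetical):

```scala
// Proposed: a dedicated predicate, analogous to Impala's is_inf().
spark.sql("SELECT isinf(1e308 * 10)").show()  // would show true

// Today's workaround: compare against infinity values obtained via string casts.
spark.sql(
  "SELECT d = CAST('Infinity' AS DOUBLE) OR d = CAST('-Infinity' AS DOUBLE) FROM vals"
).show()
```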


---




[GitHub] spark issue #21488: SPARK-18057 Update structured streaming kafka from 0.10....

2018-06-06 Thread tedyu
Github user tedyu commented on the issue:

https://github.com/apache/spark/pull/21488
  
There is only 
target/surefire-reports/TEST-org.apache.spark.sql.kafka010.KafkaMicroBatchV2SourceSuite.xml
 under target/surefire-reports

That file doesn't contain test output.


---




[GitHub] spark issue #20929: [SPARK-23772][SQL] Provide an option to ignore column of...

2018-06-06 Thread mengxr
Github user mengxr commented on the issue:

https://github.com/apache/spark/pull/20929
  
@maropu Thanks for updating this PR! It would be easier to maintain the 
logic in one place. I think it should be feasible to do everything inside 
`canonicalizeType` without modifying `JsonParser` or other methods in 
`JsonInferSchema`. The following code outlines my logic, though I didn't test 
it ...:

~~~scala
/**
 * Canonicalize data types and remove StructTypes with no fields.
 * @return Some(canonicalizedType) or None if nothing is left.
 */
private def canonicalizeType(tpe: DataType, options: JSONOptions): Option[DataType] = tpe match {
  case at @ ArrayType(elementType, _) =>
    canonicalizeType(elementType, options).map(t => at.copy(elementType = t))

  case StructType(fields) =>
    val canonicalizedFields = fields.flatMap { f =>
      canonicalizeType(f.dataType, options).map(t => f.copy(dataType = t))
    }
    // per SPARK-8093: empty structs should be deleted
    if (canonicalizedFields.isEmpty) {
      None
    } else {
      Some(StructType(canonicalizedFields))
    }

  case NullType =>
    if (options.dropFieldIfAllNull) {
      None
    } else {
      Some(StringType)
    }

  case other => Some(other)
}
~~~

In the test, we should also include scenarios with nested "null" fields 
like `[[], null, [[]]]`.
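
A rough sketch of that scenario as a reader-facing test, assuming the 
`dropFieldIfAllNull` option name from the outline above and a spark-shell 
style `spark` session:

```scala
import spark.implicits._

// `a` holds only empty arrays and nulls at every nesting level, so with the
// option enabled the whole field should be dropped from the inferred schema.
val ds = Seq("""{"a": [[], null, [[]]], "b": 1}""").toDS()
val df = spark.read.option("dropFieldIfAllNull", "true").json(ds)
df.printSchema()  // expected: only `b` remains
```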



---




[GitHub] spark pull request #21488: SPARK-18057 Update structured streaming kafka fro...

2018-06-06 Thread guozhangwang
Github user guozhangwang commented on a diff in the pull request:

https://github.com/apache/spark/pull/21488#discussion_r193574097
  
--- Diff: 
external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaTestUtils.scala
 ---
@@ -203,7 +215,13 @@ class KafkaTestUtils(withBrokerProps: Map[String, 
Object] = Map.empty) extends L
 
   /** Add new partitions to a Kafka topic */
   def addPartitions(topic: String, partitions: Int): Unit = {
-AdminUtils.addPartitions(zkUtils, topic, partitions)
+val existingAssignment = zkClient.getReplicaAssignmentForTopics(
+  collection.immutable.Set(topic)).map {
+case (topicPartition, replicas) => topicPartition.partition -> replicas
+}
--- End diff --

+1


---




[GitHub] spark issue #21194: [SPARK-24046][SS] Fix rate source when rowsPerSecond <= ...

2018-06-06 Thread maasg
Github user maasg commented on the issue:

https://github.com/apache/spark/pull/21194
  
@zsxwing Thanks for dropping by. This patch fixes the rate ramp-up when 
`rowsPerSecond <= rampUpTime`, a case where the Rate Source currently produces 
no data until `rampUpTime` has elapsed (see 
[SPARK-24046](https://issues.apache.org/jira/browse/SPARK-24046)).

The point under review in this PR is that, while fixing this issue, I 
introduced a new way of calculating the `rampUp` that makes the previously 
working scenario of `rowsPerSecond > rampUpTime` smoother and more consistent 
(as shown in the charts above). 
The original tests verified the ramp-up against some hard-coded values that 
change under the new formula. While the semantics of the ramp-up behavior are 
preserved, the intermediate ramp-up values produced are different, which is 
reflected in the updated tests. 

I believe the overall approach is an improvement over the original, and the 
behavior it produces is what we would expect from the description of the 
ramp-up feature.

What do you think?  Could you review the code changes?
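
For anyone skimming the thread, a minimal model of the smooth ramp-up being 
described, assuming a rate that grows linearly from 0 to `rowsPerSecond` over 
`rampUpTime` (my own illustration of the idea, not the PR's exact formula):

```scala
// Cumulative rows emitted by time `seconds`: the integral of a linearly
// growing rate. Stays well-defined even when rowsPerSecond <= rampUpTime,
// so rows trickle out during ramp-up instead of arriving only at the end.
def totalRows(seconds: Double, rowsPerSecond: Double, rampUpTime: Double): Long = {
  if (seconds <= rampUpTime) {
    // area under rate(t) = rowsPerSecond * t / rampUpTime, from 0 to seconds
    (rowsPerSecond * seconds * seconds / (2 * rampUpTime)).toLong
  } else {
    // ramp-up triangle plus steady-state rows after the ramp completes
    (rowsPerSecond * rampUpTime / 2 + (seconds - rampUpTime) * rowsPerSecond).toLong
  }
}
```

For example, with `rowsPerSecond = 1` and `rampUpTime = 10`, `totalRows(5) = 1` 
and `totalRows(10) = 5`, rather than zero rows for the first 10 seconds.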



---




[GitHub] spark issue #21488: SPARK-18057 Update structured streaming kafka from 0.10....

2018-06-06 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21488
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #21488: SPARK-18057 Update structured streaming kafka from 0.10....

2018-06-06 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21488
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3826/
Test PASSed.


---




[GitHub] spark issue #21488: SPARK-18057 Update structured streaming kafka from 0.10....

2018-06-06 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21488
  
Merged build finished. Test FAILed.


---




[GitHub] spark issue #21488: SPARK-18057 Update structured streaming kafka from 0.10....

2018-06-06 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21488
  
**[Test build #91508 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91508/testReport)**
 for PR 21488 at commit 
[`b773982`](https://github.com/apache/spark/commit/b77398229a190fac7587970a803cfee974f5f5f4).
 * This patch **fails build dependency tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #21488: SPARK-18057 Update structured streaming kafka from 0.10....

2018-06-06 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21488
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/91508/
Test FAILed.


---




[GitHub] spark issue #21488: SPARK-18057 Update structured streaming kafka from 0.10....

2018-06-06 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21488
  
**[Test build #91508 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91508/testReport)**
 for PR 21488 at commit 
[`b773982`](https://github.com/apache/spark/commit/b77398229a190fac7587970a803cfee974f5f5f4).


---




[GitHub] spark issue #21503: [SPARK-24478][SQL] Move projection and filter push down ...

2018-06-06 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21503
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3825/
Test PASSed.


---




[GitHub] spark issue #21503: [SPARK-24478][SQL] Move projection and filter push down ...

2018-06-06 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21503
  
Merged build finished. Test PASSed.


---




[GitHub] spark pull request #21488: SPARK-18057 Update structured streaming kafka fro...

2018-06-06 Thread eric-maynard
Github user eric-maynard commented on a diff in the pull request:

https://github.com/apache/spark/pull/21488#discussion_r193549547
  
--- Diff: external/kafka-0-10-sql/pom.xml ---
@@ -29,7 +29,7 @@
   <artifactId>spark-sql-kafka-0-10_2.11</artifactId>
   <properties>
     <sbt.project.name>sql-kafka-0-10</sbt.project.name>
-    <kafka.version>0.10.0.1</kafka.version>
+    <kafka.version>2.0.0-SNAPSHOT</kafka.version>
   </properties>
   <packaging>jar</packaging>
   <name>Kafka 0.10 Source for Structured Streaming</name>
--- End diff --

We should change this line to reflect the change too


---




[GitHub] spark issue #21319: [SPARK-24267][SQL] explicitly keep DataSourceReader in D...

2018-06-06 Thread rdblue
Github user rdblue commented on the issue:

https://github.com/apache/spark/pull/21319
  
Here's the commit with my changes to support v2 stats in the visitor, sorry 
it took so long for me to find the time!

https://github.com/apache/spark/pull/21503/commits/9d3a11e68bca6c5a56a2be47fb09395350362ac5

The stats visitor now matches on PhysicalOperation to get accurate stats 
for v2, or for any other data source that wants to report more accurate stats, 
which would otherwise be unavailable because push-down happens while converting 
from the logical plan to the physical plan.


---




[GitHub] spark issue #21503: [SPARK-24478][SQL] Move projection and filter push down ...

2018-06-06 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21503
  
**[Test build #91507 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91507/testReport)**
 for PR 21503 at commit 
[`9d3a11e`](https://github.com/apache/spark/commit/9d3a11e68bca6c5a56a2be47fb09395350362ac5).


---




[GitHub] spark pull request #21503: [SPARK-24478][SQL] Move projection and filter pus...

2018-06-06 Thread rdblue
GitHub user rdblue opened a pull request:

https://github.com/apache/spark/pull/21503

[SPARK-24478][SQL] Move projection and filter push down to physical 
conversion

## What changes were proposed in this pull request?

This removes the v2 optimizer rule for push-down and instead pushes filters 
and required columns when converting to a physical plan, as suggested by 
@marmbrus. This makes the v2 relation cleaner because the output and filters do 
not change in the logical plan.

To solve the problem of getting accurate statistics in the optimizer 
(push-down now happens later), this adds a new trait, `SupportsPhysicalStats`, 
that calculates LeafNode stats using the filters and projection. This trait can 
also be implemented by v1 data sources to get more accurate stats for CBO.
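
For reviewers skimming the description, a hedged sketch of the shape such a 
trait could take (the trait name and its `LeafNode` parent come from this PR; 
the exact method signature is my assumption):

```scala
import org.apache.spark.sql.catalyst.expressions.{Attribute, Expression}
import org.apache.spark.sql.catalyst.plans.logical.{LeafNode, Statistics}

// Lets a leaf node report statistics as if the given projection and filters
// had already been pushed down, even though push-down now happens at
// physical planning time.
trait SupportsPhysicalStats extends LeafNode {
  def computeStats(projection: Seq[Attribute], filters: Seq[Expression]): Statistics
}
```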

The first commit was proposed in #21262. This PR replaces #21262.

## How was this patch tested?

Existing tests.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/rdblue/spark 
SPARK-24478-move-push-down-to-physical-conversion

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/21503.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #21503


commit c8517e145b1a460a8be07164c17ce20b1db86659
Author: Ryan Blue 
Date:   2018-05-07T20:08:02Z

DataSourceV2: push projection, filters when converting to physical plan.

commit 9d3a11e68bca6c5a56a2be47fb09395350362ac5
Author: Ryan Blue 
Date:   2018-06-06T20:17:16Z

SPARK-24478: Add trait to report stats with filters and projection.




---




[GitHub] spark issue #21180: [SPARK-22674][PYTHON] Disabled _hack_namedtuple for pick...

2018-06-06 Thread superbobry
Github user superbobry commented on the issue:

https://github.com/apache/spark/pull/21180
  
Friendly ping @HyukjinKwon.


---




[GitHub] spark issue #21502: [SPARK-22575][SQL] Add destroy to Dataset

2018-06-06 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21502
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/91505/
Test FAILed.


---




[GitHub] spark issue #21502: [SPARK-22575][SQL] Add destroy to Dataset

2018-06-06 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21502
  
Merged build finished. Test FAILed.


---




[GitHub] spark issue #21502: [SPARK-22575][SQL] Add destroy to Dataset

2018-06-06 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21502
  
**[Test build #91505 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91505/testReport)**
 for PR 21502 at commit 
[`ec365d6`](https://github.com/apache/spark/commit/ec365d628126255cea3eadd4a8357030e5bf1f2e).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #21483: [SPARK-24454][ML][PYTHON] Imports image module in ml/__i...

2018-06-06 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/21483
  
Ah, maybe I rushed reading the JIRA. For 2., there is already a single 
attribute being loaded, `ImageSchema`, since all the other attributes have an 
underscore; however, sure, it would be best to define it explicitly. Will do 
both here if you are fine with it.


---



