[GitHub] spark pull request #21092: [SPARK-23984][K8S] Initial Python Bindings for Py...
Github user kokes commented on a diff in the pull request: https://github.com/apache/spark/pull/21092#discussion_r193639554 --- Diff: resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/Config.scala --- @@ -154,6 +176,24 @@ private[spark] object Config extends Logging { .checkValue(interval => interval > 0, s"Logging interval must be a positive time value.") .createWithDefaultString("1s") + val MEMORY_OVERHEAD_FACTOR = +ConfigBuilder("spark.kubernetes.memoryOverheadFactor") + .doc("This sets the Memory Overhead Factor that will allocate memory to non-JVM jobs " + +"which in the case of JVM tasks will default to 0.10 and 0.40 for non-JVM jobs") + .doubleConf + .checkValue(mem_overhead => mem_overhead >= 0 && mem_overhead < 1, +"Ensure that memory overhead is a double between 0 --> 1.0") + .createWithDefault(0.1) + + val PYSPARK_MAJOR_PYTHON_VERSION = +ConfigBuilder("spark.kubernetes.pyspark.pythonversion") + .doc("This sets the python version. Either 2 or 3. (Python2 or Python3)") + .stringConf + .checkValue(pv => List("2", "3").contains(pv), +"Ensure that Python Version is either Python2 or Python3") + .createWithDefault("2") --- End diff -- Am I reading this right that the default is Python 2? Is there a reason for that? Thanks! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
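A minimal sketch of how the two settings in the diff above might be exercised from a PySpark application. The config keys and the Python 2 default are taken from the diff under review, so treat them (and whether the names survive review) as assumptions:

```python
# Hypothetical usage sketch: override the Python 2 default and the non-JVM
# memory overhead factor discussed in the diff above.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("pyspark-on-k8s-example")
         # key taken from the diff; the default would otherwise be "2"
         .config("spark.kubernetes.pyspark.pythonversion", "3")
         # key taken from the diff; must be a double in [0, 1)
         .config("spark.kubernetes.memoryOverheadFactor", "0.4")
         .getOrCreate())
```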
[GitHub] spark issue #20929: [SPARK-23772][SQL] Provide an option to ignore column of...
Github user maropu commented on the issue: https://github.com/apache/spark/pull/20929 yea, thanks for the comments! I'll try to fix based on the comments. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21494: [WIP][SPARK-24375][Prototype] Support barrier sch...
Github user galv commented on a diff in the pull request: https://github.com/apache/spark/pull/21494#discussion_r193290266 --- Diff: python/pyspark/worker.py --- @@ -232,6 +236,13 @@ def main(infile, outfile): shuffle.DiskBytesSpilled = 0 _accumulatorRegistry.clear() +if (isBarrier): +port = 25333 + 2 + 2 * taskContext._partitionId +paras = GatewayParameters(port=port) --- End diff -- paras -> params --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21494: [WIP][SPARK-24375][Prototype] Support barrier sch...
Github user galv commented on a diff in the pull request: https://github.com/apache/spark/pull/21494#discussion_r193269255 --- Diff: core/src/test/scala/org/apache/spark/SparkContextSuite.scala --- @@ -627,6 +627,52 @@ class SparkContextSuite extends SparkFunSuite with LocalSparkContext with Eventu assert(exc.getCause() != null) stream.close() } + + test("support barrier sync under local mode") { +val conf = new SparkConf().setAppName("test").setMaster("local[2]") +sc = new SparkContext(conf) +val rdd = sc.makeRDD(Seq(1, 2, 3, 4), 2).barrier() +val rdd2 = rdd.mapPartitions { it => + val tc = TaskContext.get.asInstanceOf[org.apache.spark.barrier.BarrierTaskContext] + // If we don't get the expected taskInfos, the job shall abort due to stage failure. + if (tc.hosts().length != 2) { +throw new SparkException("Expected taksInfos length is 2, actual length is " + --- End diff -- `taksInfos` -> `taskInfos` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21494: [WIP][SPARK-24375][Prototype] Support barrier sch...
Github user galv commented on a diff in the pull request: https://github.com/apache/spark/pull/21494#discussion_r193289530 --- Diff: python/pyspark/worker.py --- @@ -232,6 +236,13 @@ def main(infile, outfile): shuffle.DiskBytesSpilled = 0 _accumulatorRegistry.clear() +if (isBarrier): --- End diff -- Style: `if (isBarrier):` -> `if isBarrier:` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21494: [WIP][SPARK-24375][Prototype] Support barrier sch...
Github user galv commented on a diff in the pull request: https://github.com/apache/spark/pull/21494#discussion_r193291076 --- Diff: python/pyspark/worker.py --- @@ -232,6 +236,13 @@ def main(infile, outfile): shuffle.DiskBytesSpilled = 0 _accumulatorRegistry.clear() +if (isBarrier): +port = 25333 + 2 + 2 * taskContext._partitionId --- End diff -- I recommend using DEFAULT_PORT and DEFAULT_PYTHON_PORT. They are exposed as part of the public API of py4j: https://github.com/bartdag/py4j/blob/216432d859de41441f0d1a0d55b31b5d8d09dd28/py4j-python/src/py4j/java_gateway.py#L54 By the way, acquiring ports like this is a little hacky and may require more thought. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
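To make the suggestion concrete, here is a hedged sketch of deriving the port from py4j's exported constant instead of hard-coding 25333. The `+ 2 + 2 * partition_id` offset scheme comes from the PR's worker.py and is an assumption here, not part of py4j itself:

```python
# Sketch only: reuse py4j's DEFAULT_PORT rather than the magic number 25333.
from py4j.java_gateway import DEFAULT_PORT, GatewayParameters, JavaGateway

def connect_barrier_gateway(partition_id):
    # Same per-partition offset scheme as the PR, anchored on the public constant.
    port = DEFAULT_PORT + 2 + 2 * partition_id
    return JavaGateway(gateway_parameters=GatewayParameters(port=port))
```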
[GitHub] spark pull request #21494: [WIP][SPARK-24375][Prototype] Support barrier sch...
Github user galv commented on a diff in the pull request: https://github.com/apache/spark/pull/21494#discussion_r193555968 --- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala --- @@ -123,6 +124,21 @@ private[spark] class TaskSetManager( // TODO: We should kill any running task attempts when the task set manager becomes a zombie. private[scheduler] var isZombie = false + private[scheduler] lazy val barrierCoordinator = { --- End diff -- I recommend adding a return type here for readability. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21494: [WIP][SPARK-24375][Prototype] Support barrier sch...
Github user galv commented on a diff in the pull request: https://github.com/apache/spark/pull/21494#discussion_r193269297 --- Diff: core/src/test/scala/org/apache/spark/SparkContextSuite.scala --- @@ -627,6 +627,52 @@ class SparkContextSuite extends SparkFunSuite with LocalSparkContext with Eventu assert(exc.getCause() != null) stream.close() } + + test("support barrier sync under local mode") { +val conf = new SparkConf().setAppName("test").setMaster("local[2]") +sc = new SparkContext(conf) +val rdd = sc.makeRDD(Seq(1, 2, 3, 4), 2).barrier() +val rdd2 = rdd.mapPartitions { it => + val tc = TaskContext.get.asInstanceOf[org.apache.spark.barrier.BarrierTaskContext] + // If we don't get the expected taskInfos, the job shall abort due to stage failure. + if (tc.hosts().length != 2) { +throw new SparkException("Expected taksInfos length is 2, actual length is " + + s"${tc.hosts().length}.") + } + // println(tc.getTaskInfos().toList) --- End diff -- Remove comment --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21500: Scalable Memory option for HDFSBackedStateStore
Github user HeartSaVioR commented on the issue: https://github.com/apache/spark/pull/21500 Retaining versions of state is also relevant to snapshotting the last version to files: HDFSBackedStateStoreProvider doesn't snapshot a version if it doesn't exist in loadedMaps. So we may want to check whether this option also works with the current approach to snapshotting. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21501: [SPARK-15064][ML] Locale support in StopWordsRemo...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/21501#discussion_r193635361 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/StopWordsRemover.scala --- @@ -84,7 +86,36 @@ class StopWordsRemover @Since("1.5.0") (@Since("1.5.0") override val uid: String @Since("1.5.0") def getCaseSensitive: Boolean = $(caseSensitive) - setDefault(stopWords -> StopWordsRemover.loadDefaultStopWords("english"), caseSensitive -> false) + /** + * [[https://docs.oracle.com/javase/8/docs/api/java/util/Locale.html Locale]] of the input. + * Ignored when [[caseSensitive]] is false. + * Default: Locale.English + * @see `StopWordsRemover.loadDefaultStopWords()` + * @group param + */ + @Since("2.4.0") + val locale: Param[Locale] = new Param[Locale](this, "locale", --- End diff -- +1 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21477: [WIP] [SPARK-24396] [SS] [PYSPARK] Add Structured...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/21477#discussion_r193634436 --- Diff: python/pyspark/sql/streaming.py --- @@ -843,6 +844,169 @@ def trigger(self, processingTime=None, once=None, continuous=None): self._jwrite = self._jwrite.trigger(jTrigger) return self +def foreach(self, f): +""" +Sets the output of the streaming query to be processed using the provided writer ``f``. +This is often used to write the output of a streaming query to arbitrary storage systems. +The processing logic can be specified in two ways. + +#. A **function** that takes a row as input. +This is a simple way to express your processing logic. Note that this does +not allow you to deduplicate generated data when failures cause reprocessing of +some input data. That would require you to specify the processing logic in the next +way. + +#. An **object** with a ``process`` method and optional ``open`` and ``close`` methods. +The object can have the following methods. + +* ``open(partition_id, epoch_id)``: *Optional* method that initializes the processing +(for example, open a connection, start a transaction, etc). Additionally, you can +use the `partition_id` and `epoch_id` to deduplicate regenerated data +(discussed later). + +* ``process(row)``: *Non-optional* method that processes each :class:`Row`. + +* ``close(error)``: *Optional* method that finalizes and cleans up (for example, +close connection, commit transaction, etc.) after all rows have been processed. + +The object will be used by Spark in the following way. + +* A single copy of this object is responsible of all the data generated by a +single task in a query. In other words, one instance is responsible for +processing one partition of the data generated in a distributed manner. + +* This object must be serializable because each task will get a fresh +serialized-deserializedcopy of the provided object. Hence, it is strongly +recommended that any initialization for writing data (e.g. opening a +connection or starting a transaction) be done open after the `open(...)` +method has been called, which signifies that the task is ready to generate data. + +* The lifecycle of the methods are as follows. + +For each partition with ``partition_id``: + +... For each batch/epoch of streaming data with ``epoch_id``: + +... Method ``open(partitionId, epochId)`` is called. + +... If ``open(...)`` returns true, for each row in the partition and +batch/epoch, method ``process(row)`` is called. + +... Method ``close(errorOrNull)`` is called with error (if any) seen while +processing rows. + +Important points to note: + +* The `partitionId` and `epochId` can be used to deduplicate generated data when +failures cause reprocessing of some input data. This depends on the execution +mode of the query. If the streaming query is being executed in the micro-batch +mode, then every partition represented by a unique tuple (partition_id, epoch_id) +is guaranteed to have the same data. Hence, (partition_id, epoch_id) can be used +to deduplicate and/or transactionally commit data and achieve exactly-once +guarantees. However, if the streaming query is being executed in the continuous +mode, then this guarantee does not hold and therefore should not be used for +deduplication. + +* The ``close()`` method (if exists) is will be called if `open()` method exists and +returns successfully (irrespective of the return value), except if the Python +crashes in the middle. + +.. note:: Evolving. 
+ +>>> # Print every row using a function +>>> writer = sdf.writeStream.foreach(lambda x: print(x)) +>>> # Print every row using a object with process() method +>>> class RowPrinter: +... def open(self, partition_id, epoch_id): +... print("Opened %d, %d" % (partition_id, epoch_id)) +... return True +... d
[GitHub] spark pull request #21477: [WIP] [SPARK-24396] [SS] [PYSPARK] Add Structured...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/21477#discussion_r193633540 --- Diff: python/pyspark/sql/streaming.py --- @@ -843,6 +844,169 @@ def trigger(self, processingTime=None, once=None, continuous=None): self._jwrite = self._jwrite.trigger(jTrigger) return self +def foreach(self, f): +""" +Sets the output of the streaming query to be processed using the provided writer ``f``. +This is often used to write the output of a streaming query to arbitrary storage systems. +The processing logic can be specified in two ways. + +#. A **function** that takes a row as input. --- End diff -- I mean, we could maybe consider the other ways, but wouldn't it be better to have consistent support as the primary, and then see if the other ways are really requested by users? I think we could still incrementally add the attribute-checking way or the lambda (or, more correctly, function) way later. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21477: [WIP] [SPARK-24396] [SS] [PYSPARK] Add Structured...
Github user tdas commented on a diff in the pull request: https://github.com/apache/spark/pull/21477#discussion_r193632947 --- Diff: python/pyspark/sql/streaming.py --- @@ -843,6 +844,169 @@ def trigger(self, processingTime=None, once=None, continuous=None): self._jwrite = self._jwrite.trigger(jTrigger) return self +def foreach(self, f): +""" +Sets the output of the streaming query to be processed using the provided writer ``f``. +This is often used to write the output of a streaming query to arbitrary storage systems. +The processing logic can be specified in two ways. + +#. A **function** that takes a row as input. +This is a simple way to express your processing logic. Note that this does +not allow you to deduplicate generated data when failures cause reprocessing of +some input data. That would require you to specify the processing logic in the next +way. + +#. An **object** with a ``process`` method and optional ``open`` and ``close`` methods. +The object can have the following methods. + +* ``open(partition_id, epoch_id)``: *Optional* method that initializes the processing +(for example, open a connection, start a transaction, etc). Additionally, you can +use the `partition_id` and `epoch_id` to deduplicate regenerated data +(discussed later). + +* ``process(row)``: *Non-optional* method that processes each :class:`Row`. + +* ``close(error)``: *Optional* method that finalizes and cleans up (for example, +close connection, commit transaction, etc.) after all rows have been processed. + +The object will be used by Spark in the following way. + +* A single copy of this object is responsible of all the data generated by a +single task in a query. In other words, one instance is responsible for +processing one partition of the data generated in a distributed manner. + +* This object must be serializable because each task will get a fresh +serialized-deserializedcopy of the provided object. Hence, it is strongly +recommended that any initialization for writing data (e.g. opening a +connection or starting a transaction) be done open after the `open(...)` +method has been called, which signifies that the task is ready to generate data. + +* The lifecycle of the methods are as follows. + +For each partition with ``partition_id``: + +... For each batch/epoch of streaming data with ``epoch_id``: + +... Method ``open(partitionId, epochId)`` is called. + +... If ``open(...)`` returns true, for each row in the partition and +batch/epoch, method ``process(row)`` is called. + +... Method ``close(errorOrNull)`` is called with error (if any) seen while +processing rows. + +Important points to note: + +* The `partitionId` and `epochId` can be used to deduplicate generated data when +failures cause reprocessing of some input data. This depends on the execution +mode of the query. If the streaming query is being executed in the micro-batch +mode, then every partition represented by a unique tuple (partition_id, epoch_id) +is guaranteed to have the same data. Hence, (partition_id, epoch_id) can be used +to deduplicate and/or transactionally commit data and achieve exactly-once +guarantees. However, if the streaming query is being executed in the continuous +mode, then this guarantee does not hold and therefore should not be used for +deduplication. + +* The ``close()`` method (if exists) is will be called if `open()` method exists and +returns successfully (irrespective of the return value), except if the Python +crashes in the middle. + +.. note:: Evolving. 
+ +>>> # Print every row using a function +>>> writer = sdf.writeStream.foreach(lambda x: print(x)) +>>> # Print every row using a object with process() method +>>> class RowPrinter: +... def open(self, partition_id, epoch_id): +... print("Opened %d, %d" % (partition_id, epoch_id)) +... return True +... def proc
[GitHub] spark pull request #21477: [WIP] [SPARK-24396] [SS] [PYSPARK] Add Structured...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/21477#discussion_r193632883 --- Diff: python/pyspark/sql/streaming.py --- @@ -843,6 +844,169 @@ def trigger(self, processingTime=None, once=None, continuous=None): self._jwrite = self._jwrite.trigger(jTrigger) return self +def foreach(self, f): +""" +Sets the output of the streaming query to be processed using the provided writer ``f``. +This is often used to write the output of a streaming query to arbitrary storage systems. +The processing logic can be specified in two ways. + +#. A **function** that takes a row as input. --- End diff -- (including the response to https://github.com/apache/spark/pull/21477#discussion_r193631209) I kind of agree that it's an okay idea, but so far we have usually provided a consistent API unless something is language specific, for example, ContextManager, decorator, etc. in Python. Just for clarification, does the Scala side support the function-only form too? Also, I know the attribute-checking way is more "Pythonic", but I am seeing the documentation has already diverged between Scala and Python. On the other hand, that costs maintenance overhead. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21483: [SPARK-24477][SPARK-24454][ML][PYTHON] Imports submodule...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21483 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21483: [SPARK-24477][SPARK-24454][ML][PYTHON] Imports submodule...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21483 **[Test build #91516 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91516/testReport)** for PR 21483 at commit [`49323a6`](https://github.com/apache/spark/commit/49323a6d3207e7d4c6b04ab3c3b1729d9eeaca16). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21483: [SPARK-24477][SPARK-24454][ML][PYTHON] Imports submodule...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21483 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/91516/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21477: [WIP] [SPARK-24396] [SS] [PYSPARK] Add Structured...
Github user tdas commented on a diff in the pull request: https://github.com/apache/spark/pull/21477#discussion_r193631566 --- Diff: python/pyspark/sql/streaming.py --- @@ -843,6 +844,169 @@ def trigger(self, processingTime=None, once=None, continuous=None): self._jwrite = self._jwrite.trigger(jTrigger) return self +def foreach(self, f): +""" +Sets the output of the streaming query to be processed using the provided writer ``f``. +This is often used to write the output of a streaming query to arbitrary storage systems. +The processing logic can be specified in two ways. + +#. A **function** that takes a row as input. --- End diff -- This is a superset of what we support in Scala. Python users are more likely to use simple lambdas instead of defining classes. But they may also want to write transactional logic in Python with open and close methods. Hence providing both alternatives seems to be a good idea. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21482: [SPARK-24393][SQL] SQL builtin: isinf
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21482 **[Test build #91518 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91518/testReport)** for PR 21482 at commit [`6a4d46e`](https://github.com/apache/spark/commit/6a4d46e0a9ab403364e26a7b8f16c9ca94c31a2e). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21477: [WIP] [SPARK-24396] [SS] [PYSPARK] Add Structured...
Github user tdas commented on a diff in the pull request: https://github.com/apache/spark/pull/21477#discussion_r193631209 --- Diff: python/pyspark/sql/streaming.py --- @@ -843,6 +844,169 @@ def trigger(self, processingTime=None, once=None, continuous=None): self._jwrite = self._jwrite.trigger(jTrigger) return self +def foreach(self, f): +""" +Sets the output of the streaming query to be processed using the provided writer ``f``. +This is often used to write the output of a streaming query to arbitrary storage systems. +The processing logic can be specified in two ways. + +#. A **function** that takes a row as input. +This is a simple way to express your processing logic. Note that this does +not allow you to deduplicate generated data when failures cause reprocessing of +some input data. That would require you to specify the processing logic in the next +way. + +#. An **object** with a ``process`` method and optional ``open`` and ``close`` methods. --- End diff -- I discussed this with @marmbrus. If there is a ForeachWriter class in Python, then users will have to additionally import it. That's just another overhead that can be avoided by allowing any class with the appropriate methods. One less step for Python users. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
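For illustration, a hedged sketch of the duck-typed writer being discussed: any object exposing process() (plus optional open()/close()) would be accepted by the foreach() API proposed in this PR, with no ForeachWriter import. The file path, class name, and the `sdf` DataFrame below are made up for the example.

```python
# Sketch of a writer object usable with the foreach() API proposed in this PR.
class FilePerPartitionWriter(object):
    def open(self, partition_id, epoch_id):
        # (partition_id, epoch_id) could be used to deduplicate reprocessed data.
        self._out = open("/tmp/out-%d-%d.txt" % (partition_id, epoch_id), "w")
        return True  # returning True means process() is called for each row

    def process(self, row):
        self._out.write(str(row) + "\n")

    def close(self, error):
        self._out.close()

# Hypothetical usage, assuming `sdf` is a streaming DataFrame:
# query = sdf.writeStream.foreach(FilePerPartitionWriter()).start()
```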
[GitHub] spark issue #21483: [SPARK-24477][SPARK-24454][ML][PYTHON] Imports submodule...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21483 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21483: [SPARK-24477][SPARK-24454][ML][PYTHON] Imports submodule...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21483 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21483: [SPARK-24477][SPARK-24454][ML][PYTHON] Imports submodule...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21483 **[Test build #91515 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91515/testReport)** for PR 21483 at commit [`5e293d5`](https://github.com/apache/spark/commit/5e293d5a543bef35e422e7f8d816853a6f37314b). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21483: [SPARK-24477][SPARK-24454][ML][PYTHON] Imports submodule...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21483 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/91515/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21483: [SPARK-24477][SPARK-24454][ML][PYTHON] Imports submodule...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21483 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3829/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21482: [SPARK-24393][SQL] SQL builtin: isinf
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/21482#discussion_r193630041 --- Diff: R/pkg/NAMESPACE --- @@ -281,6 +281,8 @@ exportMethods("%<=>%", "initcap", "input_file_name", "instr", + "isInf", + "isinf", --- End diff -- I really don't understand why we have both. Usually we only have the ones matching the Scala side, or an R-specific function. Otherwise, I don't think we should have both. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21483: [SPARK-24477][SPARK-24454][ML][PYTHON] Imports submodule...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21483 **[Test build #91516 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91516/testReport)** for PR 21483 at commit [`49323a6`](https://github.com/apache/spark/commit/49323a6d3207e7d4c6b04ab3c3b1729d9eeaca16). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21482: [SPARK-24393][SQL] SQL builtin: isinf
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21482 **[Test build #91517 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91517/testReport)** for PR 21482 at commit [`f240fdf`](https://github.com/apache/spark/commit/f240fdf3a410e2fdec1fa668bc0218ac61078423). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21483: [SPARK-24477][SPARK-24454][ML][PYTHON] Imports submodule...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21483 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/91514/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21483: [SPARK-24477][SPARK-24454][ML][PYTHON] Imports submodule...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21483 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21483: [SPARK-24477][SPARK-24454][ML][PYTHON] Imports submodule...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21483 **[Test build #91514 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91514/testReport)** for PR 21483 at commit [`55eef7c`](https://github.com/apache/spark/commit/55eef7ca382a7eab7c80404b2b3d8c6ba95d806c). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21483: [SPARK-24477][SPARK-24454][ML][PYTHON] Imports image mod...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21483 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21483: [SPARK-24477][SPARK-24454][ML][PYTHON] Imports image mod...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21483 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3828/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21483: [SPARK-24477][SPARK-24454][ML][PYTHON] Imports image mod...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21483 **[Test build #91515 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91515/testReport)** for PR 21483 at commit [`5e293d5`](https://github.com/apache/spark/commit/5e293d5a543bef35e422e7f8d816853a6f37314b). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21483: [SPARK-24477][SPARK-24454][ML][PYTHON] Imports image mod...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21483 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21483: [SPARK-24477][SPARK-24454][ML][PYTHON] Imports image mod...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21483 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3827/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21483: [SPARK-24477][SPARK-24454][ML][PYTHON] Imports image mod...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21483 **[Test build #91514 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91514/testReport)** for PR 21483 at commit [`55eef7c`](https://github.com/apache/spark/commit/55eef7ca382a7eab7c80404b2b3d8c6ba95d806c). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21504: SPARK-24479: Added config for registering streamingQuery...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21504 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/91510/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21504: SPARK-24479: Added config for registering streamingQuery...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21504 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21504: SPARK-24479: Added config for registering streamingQuery...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21504 **[Test build #91510 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91510/testReport)** for PR 21504 at commit [`d3a3baa`](https://github.com/apache/spark/commit/d3a3baa8bba65da6a30d1d449f97cbeb467ca14b). * This patch passes all tests. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `.doc(\"List of class names implementing StreamingQueryListener that will be automatically \" +` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21500: Scalable Memory option for HDFSBackedStateStore
Github user HeartSaVioR commented on the issue: https://github.com/apache/spark/pull/21500 @TomaszGaweda @aalobaidi Please correct me if I'm missing something here. At the start of every batch, the state store loads the previous version of the state so that it can be read and written. If we unload all versions "after committing", the cache will no longer contain the previous version of the state and it will have to load the state by reading files, adding huge latency when starting a batch. That's why I described three cases earlier for avoiding loading state from files when starting a new batch. Please apply #21469 manually and see how much memory HDFSBackedStateStoreProvider consumes due to storing multiple versions (it will show the state size of the latest version as well as the overall state size in the cache). Please also measure and provide latency numbers showing how much it is now and how much it would be after the patch. We always have to ask ourselves whether we are addressing the issue correctly. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21469: [SPARK-24441][SS] Expose total estimated size of ...
Github user HeartSaVioR commented on a diff in the pull request: https://github.com/apache/spark/pull/21469#discussion_r193622940 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamingQueryListenerSuite.scala --- @@ -231,7 +231,7 @@ class StreamingQueryListenerSuite extends StreamTest with BeforeAndAfter { test("event ordering") { val listener = new EventCollector withListenerAdded(listener) { - for (i <- 1 to 100) { + for (i <- 1 to 50) { --- End diff -- After the patch this test starts failing: it just means more time is needed to run this loop 100 times, not that the logic is broken. Decreasing the number works for me. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18900: [SPARK-21687][SQL] Spark SQL should set createTime for H...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18900 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/91513/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18900: [SPARK-21687][SQL] Spark SQL should set createTime for H...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18900 **[Test build #91513 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91513/testReport)** for PR 18900 at commit [`e3a0cc4`](https://github.com/apache/spark/commit/e3a0cc43b10828b8111f7cd9523391cd3a2fdb6f). * This patch **fails Scala style tests**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18900: [SPARK-21687][SQL] Spark SQL should set createTime for H...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18900 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18900: [SPARK-21687][SQL] Spark SQL should set createTime for H...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18900 **[Test build #91513 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91513/testReport)** for PR 18900 at commit [`e3a0cc4`](https://github.com/apache/spark/commit/e3a0cc43b10828b8111f7cd9523391cd3a2fdb6f). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18900: [SPARK-21687][SQL] Spark SQL should set createTime for H...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18900 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18900: [SPARK-21687][SQL] Spark SQL should set createTime for H...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18900 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/91512/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18900: [SPARK-21687][SQL] Spark SQL should set createTime for H...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18900 **[Test build #91512 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91512/testReport)** for PR 18900 at commit [`a00e943`](https://github.com/apache/spark/commit/a00e943a7097f386c842fd725cb1474e3a7f74c8). * This patch **fails Scala style tests**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18900: [SPARK-21687][SQL] Spark SQL should set createTime for H...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18900 **[Test build #91512 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91512/testReport)** for PR 18900 at commit [`a00e943`](https://github.com/apache/spark/commit/a00e943a7097f386c842fd725cb1474e3a7f74c8). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21499: [SPARK-24468][SQL] Handle negative scale when adj...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/21499#discussion_r193618762 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/types/DecimalType.scala --- @@ -161,13 +161,17 @@ object DecimalType extends AbstractDataType { * This method is used only when `spark.sql.decimalOperations.allowPrecisionLoss` is set to true. */ private[sql] def adjustPrecisionScale(precision: Int, scale: Int): DecimalType = { -// Assumptions: +// Assumption: assert(precision >= scale) -assert(scale >= 0) if (precision <= MAX_PRECISION) { // Adjustment only needed when we exceed max precision DecimalType(precision, scale) +} else if (scale < 0) { + // Decimal can have negative scale (SPARK-24468). In this case, we cannot allow a precision + // loss since we would cause a loss of digits in the integer part. --- End diff -- OK, makes sense. Do we have an end-to-end test case for returning null? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
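One way such an end-to-end case could look, as a sketch only (not the PR's actual test; the expression, the `spark` session variable, and whether this exact case is covered by the PR are assumptions): an arithmetic result whose digits cannot fit even after adjustment should come back as null rather than silently dropping digits from the integer part.

```python
# Sketch: an overflowing decimal multiplication is expected to yield null
# rather than lose digits from the integer part.
df = spark.sql(
    "SELECT CAST(1e20 AS DECIMAL(38, 0)) * CAST(1e20 AS DECIMAL(38, 0)) AS product")
df.show()  # expected: a single row with `null` for `product`
```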
[GitHub] spark issue #21469: [SPARK-24441][SS] Expose total estimated size of states ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21469 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/91509/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21469: [SPARK-24441][SS] Expose total estimated size of states ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21469 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21469: [SPARK-24441][SS] Expose total estimated size of states ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21469 **[Test build #91509 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91509/testReport)** for PR 21469 at commit [`7ec3242`](https://github.com/apache/spark/commit/7ec32427cf0fda82d2f936fa0aab62e274a8c034). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21498: [SPARK-24410][SQL][Core] Optimization for Union o...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/21498#discussion_r193618338 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala --- @@ -1099,6 +1099,17 @@ object SQLConf { .intConf .createWithDefault(SHUFFLE_SPILL_NUM_ELEMENTS_FORCE_SPILL_THRESHOLD.defaultValue.get) + val UNION_IN_SAME_PARTITION = +buildConf("spark.sql.unionInSamePartition") + .internal() + .doc("When true, Union operator will union children results in the same corresponding " + +"partitions if they have same partitioning. This eliminates unnecessary shuffle in later " + +"operators like aggregation. Note that because non-deterministic functions such as " + +"monotonically_increasing_id are depended on partition id. By doing this, the values of " + --- End diff -- I'm not quite convinced by the reasoning and behavior of keeping the values of non-deterministic functions after a union. Take the following queries: ```scala val df1 = spark.range(10).select(monotonically_increasing_id()) val df2 = spark.range(10).select(monotonically_increasing_id()) val union = df1.union(df2) ``` Now the values of `monotonically_increasing_id` returned by `df1`, `df2` and `union` are kept the same. However, for non-deterministic functions, having the values change with data layout/ordering still sounds reasonable. Anyway, that is the current behavior, and this config requires users to enable the feature explicitly. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21498: [SPARK-24410][SQL][Core] Optimization for Union outputPa...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/21498 I set up a Spark cluster with 5 nodes on EC2. ```scala def benchmark(func: () => Unit): Unit = { val t0 = System.nanoTime() func() val t1 = System.nanoTime() println("Elapsed time: " + (t1 - t0) + "ns") } val N = 1L spark.range(N).selectExpr("id as key", "id % 2 as t1", "id % 3 as t2").repartition(col("key")).write.mode("overwrite").bucketBy(3, "key").sortBy("t1").saveAsTable("a1") spark.range(N).selectExpr("id as key", "id % 2 as t1", "id % 3 as t2").repartition(col("key")).write.mode("overwrite").bucketBy(3, "key").sortBy("t1").saveAsTable("a2") spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1) val df = sql("select key,count(*) from (select * from a1 union all select * from a2)z group by key") val df2 = sql("select * from a1 union all select * from a2") val df3 = df.sample(0.8).filter($"key" > 100).sample(0.4) val df4 = df2.sample(0.8).filter($"key" > 100).sample(0.4) benchmark(() => df.collect) benchmark(() => df2.collect) benchmark(() => df3.collect) benchmark(() => df4.collect) ``` Before: ``` scala> benchmark(() => df.collect) Elapsed time: 371007018ns scala> benchmark(() => df2.collect) Elapsed time: 93056619ns scala> benchmark(() => df3.collect) Elapsed time: 477603242ns scala> benchmark(() => df4.collect) Elapsed time: 150106354ns ``` After: ``` scala> benchmark(() => df.collect) Elapsed time: 101199791ns scala> benchmark(() => df2.collect) Elapsed time: 80275158ns scala> benchmark(() => df3.collect) Elapsed time: 292775244ns scala> benchmark(() => df4.collect) Elapsed time: 151129518ns ``` It improves the queries of `df` and `df3` by eliminating shuffle. For `df2` and `df4` which don't involve shuffle, there is no performance regression. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18900: [SPARK-21687][SQL] Spark SQL should set createTime for H...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18900 **[Test build #91511 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91511/testReport)** for PR 18900 at commit [`478e205`](https://github.com/apache/spark/commit/478e2051c775a594ad729256c3ef78cc311c992d). * This patch **fails Scala style tests**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18900: [SPARK-21687][SQL] Spark SQL should set createTime for H...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18900 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/91511/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18900: [SPARK-21687][SQL] Spark SQL should set createTime for H...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18900 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18900: [SPARK-21687][SQL] Spark SQL should set createTime for H...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18900 **[Test build #91511 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91511/testReport)** for PR 18900 at commit [`478e205`](https://github.com/apache/spark/commit/478e2051c775a594ad729256c3ef78cc311c992d). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18900: [SPARK-21687][SQL] Spark SQL should set createTime for H...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/18900 ok to test --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18900: [SPARK-21687][SQL] Spark SQL should set createTime for H...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18900 Can one of the admins verify this patch? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18900: [SPARK-21687][SQL] Spark SQL should set createTime for H...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18900 Can one of the admins verify this patch? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18900: [SPARK-21687][SQL] Spark SQL should set createTim...
GitHub user debugger87 reopened a pull request: https://github.com/apache/spark/pull/18900 [SPARK-21687][SQL] Spark SQL should set createTime for Hive partition ## What changes were proposed in this pull request? Set createTime for every hive partition created in Spark SQL, which could be used to manage data lifecycle in Hive warehouse. We found that almost every partition created by spark sql has not been set createTime. ``` mysql> select * from partitions where create_time=0 limit 1\G; *** 1. row *** PART_ID: 1028584 CREATE_TIME: 0 LAST_ACCESS_TIME: 1502203611 PART_NAME: date=20170130 SD_ID: 1543605 TBL_ID: 211605 LINK_TARGET_ID: NULL 1 row in set (0.27 sec) ``` ## How was this patch tested? N/A You can merge this pull request into a Git repository by running: $ git pull https://github.com/debugger87/spark fix/set-create-time-for-hive-partition Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/18900.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #18900 commit 71a660ac8dad869d9ba3b4e206b74f5c44660ee6 Author: debugger87 Date: 2017-08-10T04:17:00Z [SPARK-21687][SQL] Spark SQL should set createTime for Hive partition commit f668ce8837ee553c61687bd03d04cddd32e5f36f Author: debugger87 Date: 2017-08-11T07:50:26Z added createTime and lastAccessTime into CatalogTablePartition commit 2fb1ddabdb2ab8f7b585ee7aea93280f96a23467 Author: debugger87 Date: 2017-08-11T07:54:26Z minor tweak commit c833ce7aa5f2ba0b684494fd1b24b7995f1c09c9 Author: debugger87 Date: 2017-08-11T08:07:57Z fix type missmatch commit bf2a1052f807a7ae36004c819e66fff5c4b45820 Author: debugger87 Date: 2017-08-11T23:26:29Z added createTime and lastAccessTime into partition map for display --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18900: [SPARK-21687][SQL] Spark SQL should set createTime for H...
Github user debugger87 commented on the issue: https://github.com/apache/spark/pull/18900 @cxzl25 OK, reopen it --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21501: [SPARK-15064][ML] Locale support in StopWordsRemo...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/21501#discussion_r193604620 --- Diff: python/pyspark/ml/feature.py --- @@ -2582,25 +2582,27 @@ class StopWordsRemover(JavaTransformer, HasInputCol, HasOutputCol, JavaMLReadabl typeConverter=TypeConverters.toListString) caseSensitive = Param(Params._dummy(), "caseSensitive", "whether to do a case sensitive " + "comparison over the stop words", typeConverter=TypeConverters.toBoolean) +locale = Param(Params._dummy(), "locale", "locale of the input. ignored when case sensitive is false", + typeConverter=TypeConverters.toString) --- End diff -- nit: indentation --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21504: SPARK-24479: Added config for registering streami...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/21504#discussion_r193604356 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamingQueryListenersConfSuite.scala --- @@ -0,0 +1,66 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.streaming + +import scala.language.reflectiveCalls + +import org.scalatest.BeforeAndAfter + +import org.apache.spark.SparkConf +import org.apache.spark.sql.execution.streaming._ +import org.apache.spark.sql.streaming.StreamingQueryListener.{QueryStartedEvent, QueryTerminatedEvent, _} --- End diff -- nit: import looks a bit odd `{QueryStartedEvent, QueryTerminatedEvent, _}`. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21504: SPARK-24479: Added config for registering streami...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/21504#discussion_r193603810 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/streaming/StreamingQueryManager.scala --- @@ -55,6 +56,12 @@ class StreamingQueryManager private[sql] (sparkSession: SparkSession) extends Lo @GuardedBy("awaitTerminationLock") private var lastTerminatedQuery: StreamingQuery = null + sparkSession.sparkContext.conf.get(STREAMING_QUERY_LISTENERS) + .foreach { classNames => --- End diff -- nit: ```scala sparkSession.sparkContext.conf.get(STREAMING_QUERY_LISTENERS).foreach { classNames => ... } ``` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21504: SPARK-24479: Added config for registering streamingQuery...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/21504 Mind fixing the PR title to `[SPARK-24479][SS] Added config for registering streamingQueryListeners`? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21482: [SPARK-24393][SQL] SQL builtin: isinf
Github user rxin commented on the issue: https://github.com/apache/spark/pull/21482 @henryr 1.0/0.0 also returns null in Spark SQL ...
```
scala> sql("select cast(1.0 as double)/cast(0 as double)").show()
+-----------------------------------------+
|(CAST(1.0 AS DOUBLE) / CAST(0 AS DOUBLE))|
+-----------------------------------------+
|                                     null|
+-----------------------------------------+
```
--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21504: SPARK-24480: Added config for registering streamingQuery...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21504 **[Test build #91510 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91510/testReport)** for PR 21504 at commit [`d3a3baa`](https://github.com/apache/spark/commit/d3a3baa8bba65da6a30d1d449f97cbeb467ca14b). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21504: SPARK-24480: Added config for registering streami...
GitHub user arunmahadevan opened a pull request: https://github.com/apache/spark/pull/21504 SPARK-24480: Added config for registering streamingQueryListeners ## What changes were proposed in this pull request? Currently a "StreamingQueryListener" can only be registered programmatically. We could have a new config "spark.sql.streamingQueryListeners" similar to "spark.sql.queryExecutionListeners" and "spark.extraListeners" for users to register custom streaming listeners. ## How was this patch tested? New unit test and running example programs. Please review http://spark.apache.org/contributing.html before opening a pull request. You can merge this pull request into a Git repository by running: $ git pull https://github.com/arunmahadevan/spark SPARK-24480 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/21504.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #21504 commit d3a3baa8bba65da6a30d1d449f97cbeb467ca14b Author: Arun Mahadevan Date: 2018-06-07T00:57:22Z SPARK-24480: Added config for registering streamingQueryListeners --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
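A hedged sketch of how the proposed config would be used. The key `spark.sql.streamingQueryListeners` is taken from the PR description and the listener class name is purely illustrative:

```python
# Sketch: register a custom StreamingQueryListener declaratively via config
# instead of registering it programmatically in application code.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("listener-config-example")
         .config("spark.sql.streamingQueryListeners",
                 "com.example.MyStreamingQueryListener")  # hypothetical class
         .getOrCreate())
```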
[GitHub] spark issue #21504: SPARK-24480: Added config for registering streamingQuery...
Github user arunmahadevan commented on the issue: https://github.com/apache/spark/pull/21504 ping @tdas @jose-torres @HeartSaVioR --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21504: SPARK-24480: Added config for registering streamingQuery...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21504 Can one of the admins verify this patch? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21503: [SPARK-24478][SQL] Move projection and filter push down ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21503 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/91507/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21503: [SPARK-24478][SQL] Move projection and filter push down ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21503 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21503: [SPARK-24478][SQL] Move projection and filter push down ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21503 **[Test build #91507 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91507/testReport)** for PR 21503 at commit [`9d3a11e`](https://github.com/apache/spark/commit/9d3a11e68bca6c5a56a2be47fb09395350362ac5). * This patch passes all tests. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `trait SupportsPhysicalStats extends LeafNode ` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21469: [SPARK-24441][SS] Expose total estimated size of states ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21469 **[Test build #91509 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91509/testReport)** for PR 21469 at commit [`7ec3242`](https://github.com/apache/spark/commit/7ec32427cf0fda82d2f936fa0aab62e274a8c034). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21482: [SPARK-24393][SQL] SQL builtin: isinf
Github user henryr commented on the issue: https://github.com/apache/spark/pull/21482 @rxin Other engines are all over the place:
* MySQL doesn't have support for infinity (based on my cursory look) - 1.0 / 0.0 evaluates to `null`. The same seems to be true of SQLite.
* Postgres has type-specific literals (e.g. `FLOAT8 '+Infinity'`) which you can use for comparison checks.
* Impala has `is_inf()`.
* Oracle has a literal value, and also a built-in predicate: `SELECT f FROM foo WHERE f is infinite`.
* SQL Server does not appear to support infinity (but that's based on anecdotal evidence).

One of my motivations for suggesting this builtin was to make sharing workloads with Impala easier. I also think it's convenient to have, since it's more intuitive than figuring out what the right literal incantation should be. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
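For context, a sketch of the kind of literal incantation a Spark SQL user needs today versus the proposed builtin (`isinf` is the name proposed in this PR, not an existing function; `foo` and `f` are hypothetical):

```scala
// Today: build the infinity literals via casts and compare against them.
sql("SELECT f FROM foo WHERE f = CAST('Infinity' AS DOUBLE) OR f = CAST('-Infinity' AS DOUBLE)")

// With the proposed builtin:
sql("SELECT f FROM foo WHERE isinf(f)")
```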
[GitHub] spark issue #21488: SPARK-18057 Update structured streaming kafka from 0.10....
Github user tedyu commented on the issue: https://github.com/apache/spark/pull/21488 There is only target/surefire-reports/TEST-org.apache.spark.sql.kafka010.KafkaMicroBatchV2SourceSuite.xml under target/surefire-reports. That file doesn't contain test output. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20929: [SPARK-23772][SQL] Provide an option to ignore column of...
Github user mengxr commented on the issue: https://github.com/apache/spark/pull/20929 @maropu Thanks for updating this PR! It would be easier to maintain the logic in one place. I think it should be feasible to do everything inside `canonicalizeType` without modifying `JsonParser` or other methods in `JsonInferSchema`. The following code outlines my logic, though I didn't test it ...:

~~~scala
/**
 * Canonicalize data types and remove StructTypes with no fields.
 * @return Some(canonicalizedType) or None if nothing left.
 */
private def canonicalizeType(tpe: DataType, options: JSONOptions): Option[DataType] = tpe match {
  case at @ ArrayType(elementType, _) =>
    canonicalizeType(elementType, options).map(t => at.copy(elementType = t))

  case StructType(fields) =>
    val canonicalizedFields = fields.flatMap { f =>
      canonicalizeType(f.dataType, options).map(t => f.copy(dataType = t))
    }
    // per SPARK-8093: empty structs should be deleted
    if (canonicalizedFields.isEmpty) {
      None
    } else {
      Some(StructType(canonicalizedFields))
    }

  case NullType =>
    if (options.dropFieldIfAllNull) {
      None
    } else {
      Some(StringType)
    }

  case other => Some(other)
}
~~~

In the test, we should also include scenarios with nested "null" fields like `[[], null, [[]]]`. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
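To make the nested-null test scenario concrete, a rough sketch of an input that exercises it (the option name `dropFieldIfAllNull` is taken from the snippet above and is still subject to review in this PR):

```scala
import spark.implicits._

// Hypothetical records where "b" never carries a usable value: only nulls,
// empty structs/arrays, or nested combinations such as [[], null, [[]]].
val jsonRecords = Seq(
  """{"a": 1, "b": null}""",
  """{"a": 2, "b": [[], null, [[]]]}""",
  """{"a": 3}""").toDS()

val df = spark.read
  .option("dropFieldIfAllNull", "true") // assumed option name, per the snippet above
  .json(jsonRecords)

// Expected once the change lands: only "a" remains in the inferred schema.
df.printSchema()
```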
[GitHub] spark pull request #21488: SPARK-18057 Update structured streaming kafka fro...
Github user guozhangwang commented on a diff in the pull request: https://github.com/apache/spark/pull/21488#discussion_r193574097 --- Diff: external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaTestUtils.scala --- @@ -203,7 +215,13 @@ class KafkaTestUtils(withBrokerProps: Map[String, Object] = Map.empty) extends L /** Add new partitions to a Kafka topic */ def addPartitions(topic: String, partitions: Int): Unit = { -AdminUtils.addPartitions(zkUtils, topic, partitions) +val existingAssignment = zkClient.getReplicaAssignmentForTopics( + collection.immutable.Set(topic)).map { +case (topicPartition, replicas) => topicPartition.partition -> replicas +} --- End diff -- +1 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21194: [SPARK-24046][SS] Fix rate source when rowsPerSecond <= ...
Github user maasg commented on the issue: https://github.com/apache/spark/pull/21194 @zsxwing Thanks for dropping by. This patch fixes the rate ramp-up when `rowsPerSecond <= rampUpTime`, a case in which the Rate Source currently produces no data at all until `rampUpTime` has elapsed (see [SPARK-24046](https://issues.apache.org/jira/browse/SPARK-24046)). The review discussion in this PR is that, while fixing this issue, I introduced a new way of calculating the ramp-up that also makes the previously working scenario of `rowsPerSecond > rampUpTime` smoother and more consistent (as shown in the charts above). The original tests verified the ramp-up against some hard-coded values that change under the new formula. While the semantics of the ramp-up behavior are preserved, the intermediate ramp-up values produced are different, which is what the test shows. I believe the overall code approach is an improvement over the original and the behavior it shows is what we would expect from the description of the ramp-up feature. What do you think? Could you review the code changes? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
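For reference, the scenario from the JIRA can be reproduced with the rate source's existing options; the concrete values below are just one illustration of `rowsPerSecond <= rampUpTime`:

```scala
// rowsPerSecond (5) <= rampUpTime (10s): before this fix the source emits
// no rows at all until rampUpTime has elapsed.
val stream = spark.readStream
  .format("rate")
  .option("rowsPerSecond", "5")
  .option("rampUpTime", "10s")
  .load()

stream.writeStream
  .format("console")
  .start()
```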
[GitHub] spark issue #21488: SPARK-18057 Update structured streaming kafka from 0.10....
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21488 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21488: SPARK-18057 Update structured streaming kafka from 0.10....
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21488 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3826/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21488: SPARK-18057 Update structured streaming kafka from 0.10....
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21488 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21488: SPARK-18057 Update structured streaming kafka from 0.10....
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21488 **[Test build #91508 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91508/testReport)** for PR 21488 at commit [`b773982`](https://github.com/apache/spark/commit/b77398229a190fac7587970a803cfee974f5f5f4). * This patch **fails build dependency tests**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21488: SPARK-18057 Update structured streaming kafka from 0.10....
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21488 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/91508/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21488: SPARK-18057 Update structured streaming kafka from 0.10....
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21488 **[Test build #91508 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91508/testReport)** for PR 21488 at commit [`b773982`](https://github.com/apache/spark/commit/b77398229a190fac7587970a803cfee974f5f5f4). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21503: [SPARK-24478][SQL] Move projection and filter push down ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21503 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3825/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21503: [SPARK-24478][SQL] Move projection and filter push down ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21503 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21488: SPARK-18057 Update structured streaming kafka fro...
Github user eric-maynard commented on a diff in the pull request: https://github.com/apache/spark/pull/21488#discussion_r193549547 --- Diff: external/kafka-0-10-sql/pom.xml --- @@ -29,7 +29,7 @@ spark-sql-kafka-0-10_2.11 sql-kafka-0-10 -0.10.0.1 +2.0.0-SNAPSHOT jar Kafka 0.10 Source for Structured Streaming --- End diff -- We should change this line to reflect the change too --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21319: [SPARK-24267][SQL] explicitly keep DataSourceReader in D...
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/21319 Here's the commit with my changes to support v2 stats in the visitor, sorry it took so long for me to find the time! https://github.com/apache/spark/pull/21503/commits/9d3a11e68bca6c5a56a2be47fb09395350362ac5 The stats visitor now matches on PhysicalOperation to get accurate stats for v2, or for any other data source that wants to report more accurate stats that would otherwise be unavailable because push-down happens when converting from the logical plan to the physical plan. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21503: [SPARK-24478][SQL] Move projection and filter push down ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21503 **[Test build #91507 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91507/testReport)** for PR 21503 at commit [`9d3a11e`](https://github.com/apache/spark/commit/9d3a11e68bca6c5a56a2be47fb09395350362ac5). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21503: [SPARK-24478][SQL] Move projection and filter pus...
GitHub user rdblue opened a pull request: https://github.com/apache/spark/pull/21503 [SPARK-24478][SQL] Move projection and filter push down to physical conversion

## What changes were proposed in this pull request?

This removes the v2 optimizer rule for push-down and instead pushes filters and required columns when converting to a physical plan, as suggested by @marmbrus. This makes the v2 relation cleaner because the output and filters do not change in the logical plan.

To solve the problem of getting accurate statistics in the optimizer (push-down now happens later), this adds a new trait, `SupportsPhysicalStats`, that calculates `LeafNode` stats using the filters and projection. This trait can also be implemented by v1 data sources to get more accurate stats for CBO.

The first commit was proposed in #21262. This PR replaces #21262.

## How was this patch tested?

Existing tests.

You can merge this pull request into a Git repository by running: $ git pull https://github.com/rdblue/spark SPARK-24478-move-push-down-to-physical-conversion Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/21503.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #21503 commit c8517e145b1a460a8be07164c17ce20b1db86659 Author: Ryan Blue Date: 2018-05-07T20:08:02Z DataSourceV2: push projection, filters when converting to physical plan. commit 9d3a11e68bca6c5a56a2be47fb09395350362ac5 Author: Ryan Blue Date: 2018-06-06T20:17:16Z SPARK-24478: Add trait to report stats with filters and projection. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
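As a rough sketch only (the actual definition lives in commit 9d3a11e and may differ in names and signature), the trait described above could look roughly like this, computing `LeafNode` stats from the pushed projection and filters:

```scala
import org.apache.spark.sql.catalyst.expressions.{Expression, NamedExpression}
import org.apache.spark.sql.catalyst.plans.logical.{LeafNode, Statistics}

// Hypothetical shape of the proposed trait; the real signature is in the PR.
trait SupportsPhysicalStats extends LeafNode {
  /** Stats computed after projection and filter push-down. */
  def computeStats(projection: Seq[NamedExpression], filters: Seq[Expression]): Statistics
}
```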
[GitHub] spark issue #21180: [SPARK-22674][PYTHON] Disabled _hack_namedtuple for pick...
Github user superbobry commented on the issue: https://github.com/apache/spark/pull/21180 Friendly ping @HyukjinKwon. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21502: [SPARK-22575][SQL] Add destroy to Dataset
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21502 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/91505/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21502: [SPARK-22575][SQL] Add destroy to Dataset
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21502 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21502: [SPARK-22575][SQL] Add destroy to Dataset
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21502 **[Test build #91505 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91505/testReport)** for PR 21502 at commit [`ec365d6`](https://github.com/apache/spark/commit/ec365d628126255cea3eadd4a8357030e5bf1f2e). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21483: [SPARK-24454][ML][PYTHON] Imports image module in ml/__i...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/21483 Ah, maybe I rushed to read the JIRA. For 2. there are already single attribute being loaded `ImageSchema` since all other attributes has an underscore; however, sure, it should be the best to explicitly define. Will do this both here if you are fine with it. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org