[spark] branch master updated (04f66bf -> e9337f5)
This is an automated email from the ASF dual-hosted git repository.

sarutak pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

    from 04f66bf  [MINOR][SS][DOCS] fileNameOnly parameter description re-unite
     add e9337f5  [SPARK-30119][WEBUI] Add Pagination Support to Streaming Page

No new revisions were added by this update.

Summary of changes:
 .../resources/org/apache/spark/ui/static/webui.js |   3 +-
 .../spark/streaming/ui/AllBatchesTable.scala      | 282 +++--
 .../apache/spark/streaming/ui/StreamingPage.scala | 113 ++---
 .../apache/spark/streaming/UISeleniumSuite.scala  |  45 +++-
 4 files changed, 267 insertions(+), 176 deletions(-)

---
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch master updated: [SPARK-30119][WEBUI] Add Pagination Support to Streaming Page
This is an automated email from the ASF dual-hosted git repository.

sarutak pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new e9337f5  [SPARK-30119][WEBUI] Add Pagination Support to Streaming Page

e9337f5 is described below

commit e9337f505b737f4501d4173baa9c5739626b06a8
Author: iRakson
AuthorDate: Sun Jun 7 13:08:50 2020 +0900

    [SPARK-30119][WEBUI] Add Pagination Support to Streaming Page

    ### What changes were proposed in this pull request?
    * Pagination support is added to all tables of the streaming page in the Spark web UI. The existing pagination classes from #7399 are reused.
    * Earlier, the streaming page had two tables, `Active Batches` and `Completed Batches`. Now there are three tables: `Running Batches`, `Waiting Batches`, and `Completed Batches`. With a large number of waiting and running batches, keeping track of them in a single table is difficult; other pages already use separate tables for different types of data.
    * Earlier, empty tables were also shown; now only non-empty tables are rendered. The old `Active Batches` table showed details of waiting batches followed by running batches.

    ### Why are the changes needed?
    Pagination lets users analyse the tables much more easily. All other Spark web UI pages already support pagination, so this also adds consistency. It may additionally fix potential OOM errors that can arise from rendering very large tables.

    ### Does this PR introduce _any_ user-facing change?
    Yes. The `Active Batches` table is split into two tables, `Running Batches` and `Waiting Batches`, and pagination support is added to all the tables. Every other functionality is unchanged.

    ### How was this patch tested?
    Manually.

    Before changes: https://user-images.githubusercontent.com/15366835/80915680-8fb44b80-8d71-11ea-9957-c4a3769b8b67.png
    After changes: https://user-images.githubusercontent.com/15366835/80915694-a9ee2980-8d71-11ea-8fc5-246413a4951d.png

    Closes #28439 from iRakson/streamingPagination.

    Authored-by: iRakson
    Signed-off-by: Kousuke Saruta
---
 .../resources/org/apache/spark/ui/static/webui.js |   3 +-
 .../spark/streaming/ui/AllBatchesTable.scala      | 282 +++--
 .../apache/spark/streaming/ui/StreamingPage.scala | 113 ++---
 .../apache/spark/streaming/UISeleniumSuite.scala  |  45 +++-
 4 files changed, 267 insertions(+), 176 deletions(-)

diff --git a/core/src/main/resources/org/apache/spark/ui/static/webui.js b/core/src/main/resources/org/apache/spark/ui/static/webui.js
index 4f8409c..bb37256 100644
--- a/core/src/main/resources/org/apache/spark/ui/static/webui.js
+++ b/core/src/main/resources/org/apache/spark/ui/static/webui.js
@@ -87,7 +87,8 @@ $(function() {
   collapseTablePageLoad('collapse-aggregated-poolActiveStages','aggregated-poolActiveStages');
   collapseTablePageLoad('collapse-aggregated-tasks','aggregated-tasks');
   collapseTablePageLoad('collapse-aggregated-rdds','aggregated-rdds');
-  collapseTablePageLoad('collapse-aggregated-activeBatches','aggregated-activeBatches');
+  collapseTablePageLoad('collapse-aggregated-waitingBatches','aggregated-waitingBatches');
+  collapseTablePageLoad('collapse-aggregated-runningBatches','aggregated-runningBatches');
   collapseTablePageLoad('collapse-aggregated-completedBatches','aggregated-completedBatches');
   collapseTablePageLoad('collapse-aggregated-runningExecutions','aggregated-runningExecutions');
   collapseTablePageLoad('collapse-aggregated-completedExecutions','aggregated-completedExecutions');

diff --git a/streaming/src/main/scala/org/apache/spark/streaming/ui/AllBatchesTable.scala b/streaming/src/main/scala/org/apache/spark/streaming/ui/AllBatchesTable.scala
index 1e443f6..c0eec0e 100644
--- a/streaming/src/main/scala/org/apache/spark/streaming/ui/AllBatchesTable.scala
+++ b/streaming/src/main/scala/org/apache/spark/streaming/ui/AllBatchesTable.scala
@@ -17,30 +17,41 @@
 package org.apache.spark.streaming.ui

-import scala.xml.Node
-
-import org.apache.spark.ui.{UIUtils => SparkUIUtils}
+import java.net.URLEncoder
+import java.nio.charset.StandardCharsets.UTF_8
+import javax.servlet.http.HttpServletRequest

-private[ui] abstract class BatchTableBase(tableId: String, batchInterval: Long) {
+import scala.xml.Node

-  protected def columns: Seq[Node] = {
-    Batch Time
-    Records
-    Scheduling Delay
-      {SparkUIUtils.tooltip("Time taken by Streaming scheduler to submit jobs of a batch", "top")}
-    Processing Time
-      {SparkUIUtils.tooltip("Time taken to process all jobs of a batch", "top")}
-  }
+import org.apache.spark.ui.{PagedDataSource, PagedTable, UIUtils => SparkUIUtils}
+
+private[ui] class StreamingPagedTable(
+    request: HttpServletRequest,
+    tableTag: String,
+    batches
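Conceptually, the PagedTable/PagedDataSource classes reused here (from #7399) reduce to slicing a row source one page at a time and reporting how many pages exist. A minimal Python sketch of that idea (a hypothetical `paginate` helper, not Spark's actual Scala API):

```python
import math

def paginate(rows, page, page_size):
    """Return the rows for a 1-indexed page plus the total page count,
    mirroring what a paged data source does for a UI table."""
    total_pages = max(1, math.ceil(len(rows) / page_size))
    if not 1 <= page <= total_pages:
        raise ValueError(f"page {page} out of range 1..{total_pages}")
    start = (page - 1) * page_size
    return rows[start:start + page_size], total_pages

# 25 completed batches rendered 10 per page -> 3 pages.
completed = [f"batch-{i}" for i in range(25)]
rows, total = paginate(completed, page=3, page_size=10)
print(rows)    # the last five batches
print(total)   # 3
```

The out-of-range check matters for a web UI: the page number comes from a request parameter, so invalid values must be rejected rather than silently returning an empty table.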
svn commit: r39960 - in /dev/spark/v3.0.0-rc3-docs: ./ _site/ _site/api/ _site/api/R/ _site/api/java/ _site/api/java/lib/ _site/api/java/org/ _site/api/java/org/apache/ _site/api/java/org/apache/parqu
Author: rxin
Date: Sat Jun 6 14:03:25 2020
New Revision: 39960

Log:
Apache Spark v3.0.0-rc3 docs

[This commit notification would consist of 1920 parts, which exceeds the limit of 50 ones, so it was shortened to the summary.]
svn commit: r39959 - /dev/spark/v3.0.0-rc3-bin/
Author: rxin
Date: Sat Jun 6 13:35:40 2020
New Revision: 39959

Log:
Apache Spark v3.0.0-rc3

Added:
    dev/spark/v3.0.0-rc3-bin/
    dev/spark/v3.0.0-rc3-bin/SparkR_3.0.0.tar.gz   (with props)
    dev/spark/v3.0.0-rc3-bin/SparkR_3.0.0.tar.gz.asc
    dev/spark/v3.0.0-rc3-bin/SparkR_3.0.0.tar.gz.sha512
    dev/spark/v3.0.0-rc3-bin/pyspark-3.0.0.tar.gz   (with props)
    dev/spark/v3.0.0-rc3-bin/pyspark-3.0.0.tar.gz.asc
    dev/spark/v3.0.0-rc3-bin/pyspark-3.0.0.tar.gz.sha512
    dev/spark/v3.0.0-rc3-bin/spark-3.0.0-bin-hadoop2.7-hive1.2.tgz   (with props)
    dev/spark/v3.0.0-rc3-bin/spark-3.0.0-bin-hadoop2.7-hive1.2.tgz.asc
    dev/spark/v3.0.0-rc3-bin/spark-3.0.0-bin-hadoop2.7-hive1.2.tgz.sha512
    dev/spark/v3.0.0-rc3-bin/spark-3.0.0-bin-hadoop2.7.tgz   (with props)
    dev/spark/v3.0.0-rc3-bin/spark-3.0.0-bin-hadoop2.7.tgz.asc
    dev/spark/v3.0.0-rc3-bin/spark-3.0.0-bin-hadoop2.7.tgz.sha512
    dev/spark/v3.0.0-rc3-bin/spark-3.0.0-bin-hadoop3.2.tgz   (with props)
    dev/spark/v3.0.0-rc3-bin/spark-3.0.0-bin-hadoop3.2.tgz.asc
    dev/spark/v3.0.0-rc3-bin/spark-3.0.0-bin-hadoop3.2.tgz.sha512
    dev/spark/v3.0.0-rc3-bin/spark-3.0.0-bin-without-hadoop.tgz   (with props)
    dev/spark/v3.0.0-rc3-bin/spark-3.0.0-bin-without-hadoop.tgz.asc
    dev/spark/v3.0.0-rc3-bin/spark-3.0.0-bin-without-hadoop.tgz.sha512
    dev/spark/v3.0.0-rc3-bin/spark-3.0.0.tgz   (with props)
    dev/spark/v3.0.0-rc3-bin/spark-3.0.0.tgz.asc
    dev/spark/v3.0.0-rc3-bin/spark-3.0.0.tgz.sha512

Added: dev/spark/v3.0.0-rc3-bin/SparkR_3.0.0.tar.gz
==============================================================================
Binary file - no diff available.
Propchange: dev/spark/v3.0.0-rc3-bin/SparkR_3.0.0.tar.gz
------------------------------------------------------------------------------
    svn:mime-type = application/octet-stream

Added: dev/spark/v3.0.0-rc3-bin/SparkR_3.0.0.tar.gz.asc
==============================================================================
--- dev/spark/v3.0.0-rc3-bin/SparkR_3.0.0.tar.gz.asc (added)
+++ dev/spark/v3.0.0-rc3-bin/SparkR_3.0.0.tar.gz.asc Sat Jun 6 13:35:40 2020
@@ -0,0 +1,17 @@
+-----BEGIN PGP SIGNATURE-----
+
+iQJEBAABCgAuFiEESovaSObiEqc0YyUC3qlj4uk0fWYFAl7bh3gQHHJ4aW5AYXBh
+Y2hlLm9yZwAKCRDeqWPi6TR9ZjGIEACG3gsdARN8puRHS2YL+brOmjbrS4wVY/Av
+l+ZR59moZ7QuwjYoixyqNnztIKgIyleYJq9DL5TqqMxFgGpuoDrnuWVqI+8MngVA
+gau/QDmYINabZsJxFfDn1IjxxSQBsgf6pwfqQbB+fGSjLSPnDq+u3DIWr3fRMh4X
+DrTuATNewKiiBIwQHUKAtPMAbsdDvXv0DRL7CGTiIJri43opAntQzHec3sP9hgRU
+J5J2HnjOlamgv58S7zrUw/Wo1xPLmz2PGIsP0aq9DRRw0bLnesrtEaWAKFp2HL5E
+QlbjfboaDQz/X+meruW57/sO/DDwA90/XvF44z4Gu6kbS8nRuTsU5wVfZ/1iyWZk
+PLP2nFoWl7O85k/DLB5ADYgce3e6k2qD2obKxzsEx0nr0Wu13cxCR2+IBQmv05jb
+4Kwi7iE0iKIxt3cESDH6j9GqZoTrcxt6Jb88KSQ+YM2TBNUr1ZZNmkjgYdmLvm7a
+wH6vLtdpZzUKIGd6bt1grEwoQJBMnQjkoDYxhx+ugjbs8CwwxcdUNd2Q5xz0WaSn
+p443ZlMR5lbGf6D6U4PUigaIrdD8d+ef/rRTDtXdoDqC+FdNuepyS9+2+dUZGErx
+N2IMNunKIdKw57GZGcILey1hY45SSuQFw5JAe+nWqCAzCmFX72ulkv9The7rLdlE
+YdLu6XQIBA==
+=HhHH
+-----END PGP SIGNATURE-----

Added: dev/spark/v3.0.0-rc3-bin/SparkR_3.0.0.tar.gz.sha512
==============================================================================
--- dev/spark/v3.0.0-rc3-bin/SparkR_3.0.0.tar.gz.sha512 (added)
+++ dev/spark/v3.0.0-rc3-bin/SparkR_3.0.0.tar.gz.sha512 Sat Jun 6 13:35:40 2020
@@ -0,0 +1,3 @@
+SparkR_3.0.0.tar.gz: 394DCFEB 4E202A8E 5C58BF94 A77548FD 79A00F92 34538535
+                     B0242E1B 96068E3E 80F78188 D71831F8 4A350224 41AA14B1
+                     D72AE704 F2390842 DBEAB41F 5AC9859A

Added: dev/spark/v3.0.0-rc3-bin/pyspark-3.0.0.tar.gz
==============================================================================
Binary file - no diff available.

Propchange: dev/spark/v3.0.0-rc3-bin/pyspark-3.0.0.tar.gz
------------------------------------------------------------------------------
    svn:mime-type = application/octet-stream

Added: dev/spark/v3.0.0-rc3-bin/pyspark-3.0.0.tar.gz.asc
==============================================================================
--- dev/spark/v3.0.0-rc3-bin/pyspark-3.0.0.tar.gz.asc (added)
+++ dev/spark/v3.0.0-rc3-bin/pyspark-3.0.0.tar.gz.asc Sat Jun 6 13:35:40 2020
@@ -0,0 +1,17 @@
+-----BEGIN PGP SIGNATURE-----
+
+iQJEBAABCgAuFiEESovaSObiEqc0YyUC3qlj4uk0fWYFAl7bh3oQHHJ4aW5AYXBh
+Y2hlLm9yZwAKCRDeqWPi6TR9ZvhPD/9Vyrywk4kYFaUo36Kg6nickTvWMm+yDfGZ
+tKrUo3dAoMta7OwUjT6W0Roo8a4BBgumaDv47Dm6mlquF2DuLuBrFsqFo8c5VNA/
+jT1tdSdHiTzjq7LfY9GQDn8Wkgp1gyIKON70XFdZifduW0gcFDkJ+FjhPYWcA6jy
+GGOGK5qboCdi9C+KowUVj4VB9bbxPbWvW7FVF3+VlcrKvkmNx+EmqmIrqsh72w8O
+EL70za2uBRUUiFcaOpY/wpmEN1raCAkMzQ+dPl7p1PFgmLFrMN9RaRXJ1stF+fXO
+rDLBLNPqb85TvvOOHpcr4PSP38GrdZvDAvljCOEbBzacF719bewu/IVRcNi9lPZE
+HDPUcZLgnocNIF6kafykrm3JhagzmPIhQ8d4DFTuH6ePxgWqdUa9lWKQL54z3mjU
+LT2CJ8gMDY0Wz5zSKc/sI/ZwL+Q6U8xiIGYSzQgT9yPztbhDd5AM2DgohJkZSD4b
+jOrEsSyNRJiwwRAHlbeOOVPb4UNYzsx1USPbPEBeXTt8X8VUb8jsU84o/RhXexk9
+EMJjxz/aChB+NefbmUjBZmXSaa/zYubprJrWnUgPw7hFxAnmtgIUdjSWSNIOJ6bp
+EV1M6xwuvrmGhOa3D0C+lYyAuYZca2FQrcAtzNiL6iOMQ6USFZvzjxGWQiV2CDGQ
+O8CNfkwOGA=
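Each binary artifact above ships with an `.asc` PGP signature and a `.sha512` checksum so downloaders can verify it. Note that the `.sha512` files shown here store the digest as spaced, upper-case blocks, while `hashlib` reports a lower-case unbroken hex string, so a naive string comparison fails. A small sketch of the normalization (stand-in payload, hypothetical helper names, not an official Apache verification tool):

```python
import hashlib

def sha512_hex(data: bytes) -> str:
    """Hex digest as hashlib reports it: lower-case, no spaces."""
    return hashlib.sha512(data).hexdigest()

def matches_recorded(digest_hex: str, recorded: str) -> bool:
    """Compare a computed digest against the spaced, upper-case
    block format used in the .sha512 files above."""
    normalize = lambda s: "".join(s.split()).lower()
    return normalize(digest_hex) == normalize(recorded)

payload = b"release payload"            # stand-in for the real .tgz bytes
computed = sha512_hex(payload)
# Re-render the digest in the grouped style for demonstration only:
grouped = " ".join(computed[i:i + 8] for i in range(0, len(computed), 8)).upper()
print(matches_recorded(computed, grouped))  # True
```

The `.asc` files are verified separately with PGP (e.g. `gpg --verify`), against the release manager's key from the project's KEYS file.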
svn commit: r39958 - /dev/spark/v3.0.0-rc3-bin/
Author: rxin
Date: Sat Jun 6 11:18:32 2020
New Revision: 39958

Log:
remove 3.0 rc3 binary

Removed:
    dev/spark/v3.0.0-rc3-bin/
[spark] branch branch-2.4 updated: [SPARK-31903][SQL][PYSPARK][2.4] Fix toPandas with Arrow enabled to show metrics in Query UI
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch branch-2.4
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-2.4 by this push:
     new 476010a  [SPARK-31903][SQL][PYSPARK][2.4] Fix toPandas with Arrow enabled to show metrics in Query UI

476010a is described below

commit 476010aedd101e1a807c202d71328415109660ae
Author: Takuya UESHIN
AuthorDate: Sat Jun 6 16:50:40 2020 +0900

    [SPARK-31903][SQL][PYSPARK][2.4] Fix toPandas with Arrow enabled to show metrics in Query UI

    ### What changes were proposed in this pull request?
    This is a backport of #28730.

    In `Dataset.collectAsArrowToPython`, since the code block for `serveToStream` runs in a separate thread, `withAction` finishes as soon as it starts that thread. As a result, it doesn't collect the metrics of the actual action, and the Query UI shows the plan graph without metrics. We should call `serveToStream` first, then `withAction` inside it.

    ### Why are the changes needed?
    When calling toPandas, the Query UI usually shows each plan node's metrics:

    ```py
    >>> df = spark.createDataFrame([(1, 10, 'abc'), (2, 20, 'def')], schema=['x', 'y', 'z'])
    >>> df.toPandas()
       x   y    z
    0  1  10  abc
    1  2  20  def
    ```

    ![Screen Shot 2020-06-05 at 10 58 30 AM](https://user-images.githubusercontent.com/506656/83914110-6f3b3080-a725-11ea-88c0-de83a833b05c.png)

    but if Arrow execution is enabled, it shows only plan nodes, and the duration is not correct:

    ```py
    >>> spark.conf.set('spark.sql.execution.arrow.enabled', True)
    >>> df.toPandas()
       x   y    z
    0  1  10  abc
    1  2  20  def
    ```

    ![Screen Shot 2020-06-05 at 10 58 42 AM](https://user-images.githubusercontent.com/506656/83914127-782c0200-a725-11ea-84e4-74d861d5c20a.png)

    ### Does this PR introduce _any_ user-facing change?
    Yes, the Query UI will show the plan with the correct metrics.

    ### How was this patch tested?
    I checked it manually in my local environment.

    ![Screen Shot 2020-06-05 at 11 29 48 AM](https://user-images.githubusercontent.com/506656/83914142-7e21e300-a725-11ea-8925-edc22df16388.png)

    Closes #28740 from ueshin/issues/SPARK-31903/2.4/to_pandas_with_arrow_query_ui.

    Authored-by: Takuya UESHIN
    Signed-off-by: HyukjinKwon
---
 sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala b/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
index 6bc0d0b..a755a6f 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
@@ -3283,8 +3283,8 @@ class Dataset[T] private[sql](
   private[sql] def collectAsArrowToPython(): Array[Any] = {
     val timeZoneId = sparkSession.sessionState.conf.sessionLocalTimeZone
-    withAction("collectAsArrowToPython", queryExecution) { plan =>
-      PythonRDD.serveToStreamWithSync("serve-Arrow") { out =>
+    PythonRDD.serveToStreamWithSync("serve-Arrow") { out =>
+      withAction("collectAsArrowToPython", queryExecution) { plan =>
         val batchWriter = new ArrowBatchStreamWriter(schema, out, timeZoneId)
         val arrowBatchRdd = toArrowBatchRdd(plan)
         val numPartitions = arrowBatchRdd.partitions.length
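The root cause generalizes beyond Spark: if an instrumented wrapper only *starts* a thread, it measures thread startup, not the work. A Python sketch of the before/after ordering (illustrative stand-ins for `withAction`/`serveToStream`, not Spark's actual API):

```python
import threading
import time

def with_action(body):
    """Instrumented wrapper (stand-in for withAction): times whatever
    runs inside it and returns (elapsed_seconds, result)."""
    start = time.perf_counter()
    result = body()
    return time.perf_counter() - start, result

def serve_to_stream(body):
    """Stand-in for serveToStream: hands the body to a separate
    serving thread and returns immediately, without waiting."""
    t = threading.Thread(target=body)
    t.start()
    return t

def work():
    time.sleep(0.2)  # the "actual action" whose metrics we want

# Buggy order (before the patch): the timed block only starts the
# serving thread, so the recorded duration is near zero.
buggy_elapsed, server = with_action(lambda: serve_to_stream(work))
server.join()

# Fixed order (the patch): serve first, and run the instrumented action
# inside the serving thread, so the timer wraps the real computation.
fixed_elapsed = None
def serve_body():
    global fixed_elapsed
    fixed_elapsed, _ = with_action(work)

serve_to_stream(serve_body).join()

print(buggy_elapsed < 0.15)   # True: the metric missed the work
print(fixed_elapsed >= 0.19)  # True: the metric now covers the work
```

Swapping the nesting is the whole fix: the timed region must enclose the computation, not merely the call that schedules it.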
[spark] branch branch-3.0 updated: [MINOR][SS][DOCS] fileNameOnly parameter description re-unite
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch branch-3.0
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-3.0 by this push:
     new c5683fc  [MINOR][SS][DOCS] fileNameOnly parameter description re-unite

c5683fc is described below

commit c5683fce6f43dd561677830d74af196bc6c97134
Author: Gabor Somogyi
AuthorDate: Sat Jun 6 16:49:48 2020 +0900

    [MINOR][SS][DOCS] fileNameOnly parameter description re-unite

    ### What changes were proposed in this pull request?
    The `fileNameOnly` parameter description was split into two pieces by [this](https://github.com/apache/spark/commit/dbb8143501ab87865d6e202c17297b9a73a0b1c3) commit. This PR re-unites it.

    ### Why are the changes needed?
    The parameter description is split in the doc.

    ### Does this PR introduce _any_ user-facing change?
    No.

    ### How was this patch tested?
    ```
    cd docs/
    SKIP_API=1 jekyll build
    ```
    Manual webpage check.

    Closes #28739 from gaborgsomogyi/datasettxtfix.

    Authored-by: Gabor Somogyi
    Signed-off-by: HyukjinKwon
    (cherry picked from commit 04f66bfd4eb863253ac9c30594055b8d5997c321)
    Signed-off-by: HyukjinKwon
---
 docs/structured-streaming-programming-guide.md | 7 ---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/docs/structured-streaming-programming-guide.md b/docs/structured-streaming-programming-guide.md
index 1776d23..69d744d 100644
--- a/docs/structured-streaming-programming-guide.md
+++ b/docs/structured-streaming-programming-guide.md
@@ -540,12 +540,13 @@ Here are the details of all the sources in Spark.
     fileNameOnly: whether to check new files based on only the filename instead of on the full path (default: false). With this set to `true`, the following files would be considered as the same file, because their filenames, "dataset.txt", are the same:
-    maxFileAge: Maximum age of a file that can be found in this directory, before it is ignored. For the first batch all files will be considered valid. If latestFirst is set to `true` and maxFilesPerTrigger is set, then this parameter will be ignored, because old files that are valid, and should be processed, may be ignored. The max age is specified with respect to the timestamp of the latest file, and not the timestamp of the current system.(d [...]
-
     "file:///dataset.txt"
     "s3://a/dataset.txt"
     "s3n://a/b/dataset.txt"
-    "s3a://a/b/c/dataset.txt"
+    "s3a://a/b/c/dataset.txt"
+
+    maxFileAge: Maximum age of a file that can be found in this directory, before it is ignored. For the first batch all files will be considered valid. If latestFirst is set to `true` and maxFilesPerTrigger is set, then this parameter will be ignored, because old files that are valid, and should be processed, may be ignored. The max age is specified with respect to the timestamp of the latest file, and not the timestamp of the current system.(d [...]
+
     cleanSource: option to clean up completed files after processing. Available options are "archive", "delete", "off". If the option is not provided, the default value is "off". When "archive" is provided, additional option sourceArchiveDir must be provided as well. The value of "sourceArchiveDir" must not match with source pattern in depth (the number of directories from the root directory), where the depth is minimum of depth on both paths. This will ensure archived files are never included as new source files.
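The re-united `fileNameOnly` description can be illustrated with a sketch of the comparison it defines: with the option on, two URIs count as the same file when their basenames match, regardless of scheme or directory. Illustrative Python only (`same_file` is a hypothetical helper, not Spark's FileStreamSource code):

```python
import posixpath
from urllib.parse import urlparse

def same_file(a: str, b: str, file_name_only: bool = False) -> bool:
    """With file_name_only, compare only the basenames of the two URIs;
    otherwise compare the full URIs."""
    if file_name_only:
        basename = lambda uri: posixpath.basename(urlparse(uri).path)
        return basename(a) == basename(b)
    return a == b

# The four example paths from the doc: all "the same file" under
# fileNameOnly because every basename is "dataset.txt".
paths = ["file:///dataset.txt", "s3://a/dataset.txt",
         "s3n://a/b/dataset.txt", "s3a://a/b/c/dataset.txt"]
print(all(same_file(paths[0], p, file_name_only=True) for p in paths))  # True
print(same_file(paths[0], paths[1]))                                    # False
```

This is why the doc warns about the option: distinct objects in different buckets or directories are deduplicated as one file once only the basename is considered.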
[spark] branch master updated (5079831 -> 04f66bf)
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

    from 5079831  [SPARK-31904][SQL] Fix case sensitive problem of char and varchar partition columns
     add 04f66bf  [MINOR][SS][DOCS] fileNameOnly parameter description re-unite

No new revisions were added by this update.

Summary of changes:
 docs/structured-streaming-programming-guide.md | 7 ---
 1 file changed, 4 insertions(+), 3 deletions(-)