Re: [PR] [SPARK-45826][SQL] Add a SQL config for stack traces in DataFrame query context [spark]

2023-11-25 Thread via GitHub


MaxGekk commented on PR #43695:
URL: https://github.com/apache/spark/pull/43695#issuecomment-1826715119

   @cloud-fan Quite the same:
   ```scala
   scala> val divCol = lit(1) / lit(0)
   val divCol: org.apache.spark.sql.Column = `/`(1, 0)
   
   scala> spark.range(1).select(divCol).collect()
   org.apache.spark.SparkArithmeticException: [DIVIDE_BY_ZERO] Division by zero. Use `try_divide` to tolerate divisor being 0 and return NULL instead. If necessary set "spark.sql.ansi.enabled" to "false" to bypass this error. SQLSTATE: 22012
   == DataFrame ==
   "div" was called from
   (<console>:1)
   (<console>:15)
   .<init>(<console>:1)
   ```
   
   but when I create it in an object:
   ```scala
   scala> object Obj1 {
| val divCol = lit(1) / lit(0)
| }
   object Obj1
   
   scala> spark.range(1).select(Obj1.divCol).collect()
   org.apache.spark.SparkArithmeticException: [DIVIDE_BY_ZERO] Division by zero. Use `try_divide` to tolerate divisor being 0 and return NULL instead. If necessary set "spark.sql.ansi.enabled" to "false" to bypass this error. SQLSTATE: 22012
   == DataFrame ==
   "div" was called from
   Obj1$.<init>(<console>:2)
   Obj1$lzycompute$1(<console>:1)
   Obj1(<console>:1)
   ```
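   
   To make the effect concrete, here is a minimal sketch of how the new config could be exercised in the same shell session; the key is assumed to be `spark.sql.stackTracesInDataFrameContext` (the final name and default may differ in the merged patch):
   ```scala
   import org.apache.spark.sql.functions.lit
   
   // Assumed config key from SPARK-45826: a larger value should make the
   // "== DataFrame ==" block above list more caller frames instead of one.
   spark.conf.set("spark.sql.stackTracesInDataFrameContext", 3L)
   spark.range(1).select(lit(1) / lit(0)).collect()
   ```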


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-45826][SQL] Add a SQL config for stack traces in DataFrame query context [spark]

2023-11-25 Thread via GitHub


cloud-fan commented on PR #43695:
URL: https://github.com/apache/spark/pull/43695#issuecomment-1826686951

   @MaxGekk Not quite related to this PR, but what if the expression is created separately from the DataFrame? For example:
   ```scala
   val divCol = lit(1) / lit(0)
   spark.range(1).select(divCol).collect()
   ```





Re: [PR] [SPARK-46084][PS][FOLLOWUP] More refactoring by using `create_map` [spark]

2023-11-25 Thread via GitHub


itholic commented on PR #44015:
URL: https://github.com/apache/spark/pull/44015#issuecomment-1826538856

   I believe this is the last refactoring for the current code base of 
`CategoricalOps`.





[PR] [SPARK-46084][PS][FOLLOWUP] More refactoring by using `create_map` [spark]

2023-11-25 Thread via GitHub


itholic opened a new pull request, #44015:
URL: https://github.com/apache/spark/pull/44015

   
   
   ### What changes were proposed in this pull request?
   
   This PR is a follow-up of https://github.com/apache/spark/pull/43993 and applies further refactoring to `CategoricalOps`.
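   
   For readers unfamiliar with the pattern, a rough Scala sketch of the map-literal lookup that `create_map` provides on the Python side (illustrative only, not taken from the PR diff; the column names are made up):
   ```scala
   import org.apache.spark.sql.SparkSession
   import org.apache.spark.sql.functions.{col, lit, map}
   
   val spark = SparkSession.builder().master("local[*]").getOrCreate()
   import spark.implicits._
   
   val df = Seq(0, 1, 2).toDF("code")
   // A literal map column replaces hand-rolled chains of when/otherwise
   // for categorical code -> label lookups.
   val labels = map(lit(0), lit("a"), lit(1), lit("b"), lit(2), lit("c"))
   df.select(col("code"), labels(col("code")).alias("label")).show()
   ```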
   
   ### Why are the changes needed?
   
   To improve performance, debuggability, and readability by using the official API.
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   No, it's internal refactoring
   
   
   ### How was this patch tested?
   
   The existing CI should pass.
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No.
   





Re: [PR] [SPARK-46103][PYTHON][INFRA][BUILD][DOCS] Enhancing PySpark documentation [spark]

2023-11-25 Thread via GitHub


itholic commented on code in PR #44012:
URL: https://github.com/apache/spark/pull/44012#discussion_r1405234673


##
python/docs/source/conf.py:
##
@@ -194,7 +194,11 @@
 # further.  For a list of options available for each theme, see the
 # documentation.
 html_theme_options = {
-"navbar_end": ["version-switcher"]
+"navbar_end": ["version-switcher", "theme-switcher"],
+"logo": {
+"image_light": "_static/spark-logo-light.png",
+"image_dark": "_static/spark-logo-dark.png",

Review Comment:
   FYI: The default mode for light/dark is `auto`, which chooses a theme based on the user's system settings, but we can specify either `dark` or `light` as the default manually if we want.






Re: [PR] [SPARK-46103][PYTHON][INFRA][BUILD][DOCS] Enhancing PySpark documentation [spark]

2023-11-25 Thread via GitHub


itholic commented on code in PR #44012:
URL: https://github.com/apache/spark/pull/44012#discussion_r1405239729


##
dev/requirements.txt:
##
@@ -31,12 +31,12 @@ pandas-stubs<1.2.0.54
 mkdocs
 
 # Documentation (Python)
-pydata_sphinx_theme
+pydata_sphinx_theme==0.13
 ipython
 nbsphinx
 numpydoc
-jinja2<3.0.0
-sphinx<3.1.0
+jinja2
+sphinx==4.2.0

Review Comment:
   I found that other Sphinx versions do not generate the documentation properly for some reason. I have tested as many combinations as possible with Jinja2 and pydata_sphinx_theme, and confirmed that Sphinx 4.2.0 currently renders the documentation best. I will investigate further in the future to support the latest Sphinx if necessary.






Re: [PR] [SPARK-46103][PYTHON][INFRA][BUILD][DOCS] Enhancing PySpark documentation [spark]

2023-11-25 Thread via GitHub


itholic commented on code in PR #44012:
URL: https://github.com/apache/spark/pull/44012#discussion_r1405239729


##
dev/requirements.txt:
##
@@ -31,12 +31,12 @@ pandas-stubs<1.2.0.54
 mkdocs
 
 # Documentation (Python)
-pydata_sphinx_theme
+pydata_sphinx_theme==0.13
 ipython
 nbsphinx
 numpydoc
-jinja2<3.0.0
-sphinx<3.1.0
+jinja2
+sphinx==4.2.0

Review Comment:
   I found that other Sphinx versions do not generate the documentation properly for some reason. I have tested as many combinations as possible with Jinja2 and pydata_sphinx_theme, and confirmed that Sphinx 4.2.0 currently renders the documentation best. I will investigate further in the future to support the latest Sphinx if necessary.






Re: [PR] [SPARK-46100][CORE][SQL][CONNECT][AVRO] Replace the remaining (string|array).size with (string|array).length [spark]

2023-11-25 Thread via GitHub


LuciferYang commented on PR #44014:
URL: https://github.com/apache/spark/pull/44014#issuecomment-1826508000

   https://github.com/LuciferYang/spark/actions/runs/6993380440/job/19026004116





[PR] [SPARK-46100][CORE][SQL][CONNECT][AVRO] Replace the remaining (string|array).size with (string|array).length [spark]

2023-11-25 Thread via GitHub


LuciferYang opened a new pull request, #44014:
URL: https://github.com/apache/spark/pull/44014

   
   
   ### What changes were proposed in this pull request?
   
   
   
   ### Why are the changes needed?
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   
   ### How was this patch tested?
   
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   





Re: [PR] [SPARK-46103][PYTHON][INFRA][BUILD][DOCS] Enhancing PySpark documentation [spark]

2023-11-25 Thread via GitHub


itholic commented on PR #44012:
URL: https://github.com/apache/spark/pull/44012#issuecomment-1826501628

   Documentation build passed: 
https://github.com/itholic/spark/actions/runs/6991575840/job/19023836477





Re: [PR] [SPARK-46096][BUILD] Upgrade `sbt` to 1.9.7 [spark]

2023-11-25 Thread via GitHub


LuciferYang commented on PR #44008:
URL: https://github.com/apache/spark/pull/44008#issuecomment-1826480761

   https://github.com/panbingkun/spark/actions/runs/6758634955/job/18370477191
   
   
![image](https://github.com/apache/spark/assets/1475305/938771ba-65b0-4d97-8e85-d249a600e6a6)
   
   
   https://github.com/apache/spark/pull/43079
   
   As I recall, there were some dependencies in the lint task that could not be 
downloaded, including the MIMA check and some subsequent subtasks, but 
@panbingkun would know more about the specific details.
   
   





Re: [PR] [SPARK-46100][CORE][PYTHON] Reduce stack depth by replace (string|array).size with (string|array).length [spark]

2023-11-25 Thread via GitHub


beliefer commented on PR #44011:
URL: https://github.com/apache/spark/pull/44011#issuecomment-1826478614

   @srowen @LuciferYang Thank you!





[PR] [SPARK-46090][SQL] Support plan fragment level SQL configs in AQE [spark]

2023-11-25 Thread via GitHub


ulysses-you opened a new pull request, #44013:
URL: https://github.com/apache/spark/pull/44013

   
   
   ### What changes were proposed in this pull request?
   
   This PR introduces `case class AdaptiveRuleContext(isSubquery: Boolean, isFinalStage: Boolean)`, which can be used inside adaptive SQL extension rules through a thread local, so that developers can modify the configs of the next plan fragment using `AdaptiveRuleContext.get()`.
   
   To make it easy to modify the shuffle partition config, this PR adds a new extension entry point, `postPlannerStrategyRules`, which is applied between `plannerStrategy` and `queryStagePrepRules`, so it can see the whole plan before exchanges are injected.
   
   The plan fragment configs can be propagated through multiple phases, e.g., a config set in `postPlannerStrategyRules` can be read in `queryStagePrepRules`, `queryStageOptimizerRules`, and `columnarRules`. The configs are cleaned up before execution, so they are empty in the next round.
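   
   A hypothetical sketch of a rule consulting this context; only `AdaptiveRuleContext(isSubquery, isFinalStage)` and `AdaptiveRuleContext.get()` come from the description above, while the accessor's return type, the rule name, and the injection point are assumptions:
   ```scala
   import org.apache.spark.sql.catalyst.rules.Rule
   import org.apache.spark.sql.execution.SparkPlan
   
   // Illustrative only: imagined as injected via queryStagePrepRules.
   case class TuneFinalStagePartitions() extends Rule[SparkPlan] {
     override def apply(plan: SparkPlan): SparkPlan = {
       val ctx = AdaptiveRuleContext.get()  // assumed to return the current context
       if (ctx.isFinalStage && !ctx.isSubquery) {
         // set a plan-fragment-level config here, e.g. shuffle partitions
       }
       plan
     }
   }
   ```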
   
   ### Why are the changes needed?
   
   To support modifying plan-fragment-level SQL configs through AQE rules.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No, it only affects developers.
   
   ### How was this patch tested?
   
   Added new tests.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No.





Re: [PR] [SPARK-46097][SQL] Push down limit 1 though Union and Aggregate [spark]

2023-11-25 Thread via GitHub


wangyum commented on code in PR #44009:
URL: https://github.com/apache/spark/pull/44009#discussion_r140529


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala:
##
@@ -771,6 +771,17 @@ object LimitPushDown extends Rule[LogicalPlan] {
   Limit(le, Project(a.aggregateExpressions, LocalLimit(le, a.child)))
 case Limit(le @ IntegerLiteral(1), p @ Project(_, a: Aggregate)) if 
a.groupOnly =>
   Limit(le, p.copy(child = Project(a.aggregateExpressions, LocalLimit(le, 
a.child
+// Push down limit 1 though Union and Aggregate
+case Limit(le @ IntegerLiteral(1), u: Union) =>
+  val newUnionChildren = u.children.map {
+case a: Aggregate if a.groupOnly =>

Review Comment:
   We have supported this before:
   
https://github.com/apache/spark/blob/ddb36f9e0151d3bff8b9c8a2e3bb14a6f6ece218/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala#L769-L773
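   
   For context, a small self-contained query of the shape the new `Limit(1, Union)` case above targets; with the proposed rule, the optimized plan should show a `LocalLimit(1)` under each union branch (illustrative, not taken from the PR tests):
   ```scala
   import org.apache.spark.sql.SparkSession
   
   val spark = SparkSession.builder().master("local[*]").getOrCreate()
   spark.range(10).createOrReplaceTempView("t1")
   spark.range(10).createOrReplaceTempView("t2")
   
   // LIMIT 1 over a UNION ALL of group-only aggregates.
   spark.sql(
     """SELECT id FROM t1 GROUP BY id
       |UNION ALL
       |SELECT id FROM t2 GROUP BY id
       |LIMIT 1""".stripMargin).explain(true)
   ```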






Re: [PR] [SPARK-46103][PYTHON][INFRA][BUILD][DOCS] Enhancing PySpark documentation [spark]

2023-11-25 Thread via GitHub


HyukjinKwon commented on PR #44012:
URL: https://github.com/apache/spark/pull/44012#issuecomment-1826463424

   This is a nice improvement!





Re: [PR] [SPARK-42746][SQL] Add the LISTAGG() aggregate function [spark]

2023-11-25 Thread via GitHub


Hisoka-X commented on code in PR #42398:
URL: https://github.com/apache/spark/pull/42398#discussion_r1405287835


##
sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala:
##
@@ -158,6 +158,69 @@ class SQLQuerySuite extends QueryTest with 
SharedSparkSession with AdaptiveSpark
 }
   }
 
+  test("SPARK-42746: listagg function") {
+withTempView("df", "df2") {
+  Seq(("a", "b"), ("a", "c"), ("b", "c"), ("b", "d"), (null, 
null)).toDF("a", "b")
+.createOrReplaceTempView("df")
+  checkAnswer(
+sql("select listagg(b) from df group by a"),
+Row("") :: Row("b,c") :: Row("c,d") :: Nil)
+
+  checkAnswer(
+sql("select listagg(b) from df where 1 != 1"),
+Row("") :: Nil)
+
+  checkAnswer(
+sql("select listagg(b, '|') from df group by a"),
+Row("b|c") :: Row("c|d") :: Row("") :: Nil)
+
+  checkAnswer(
+sql("SELECT LISTAGG(a) FROM df"),
+Row("a,a,b,b") :: Nil)
+
+  checkAnswer(
+sql("SELECT LISTAGG(DISTINCT a) FROM df"),
+Row("a,b") :: Nil)
+
+  checkAnswer(
+sql("SELECT LISTAGG(a) WITHIN GROUP (ORDER BY a) FROM df"),
+Row("a,a,b,b") :: Nil)
+
+  checkAnswer(
+sql("SELECT LISTAGG(a) WITHIN GROUP (ORDER BY a DESC) FROM df"),
+Row("b,b,a,a") :: Nil)
+
+  checkAnswer(
+sql("SELECT LISTAGG(a) WITHIN GROUP (ORDER BY a DESC) " +
+  "OVER (PARTITION BY b) FROM df"),
+Row("a") :: Row("b,a") :: Row("b,a") :: Row("b") :: Row("") :: Nil)
+
+  checkAnswer(
+sql("SELECT LISTAGG(a) WITHIN GROUP (ORDER BY b) FROM df"),
+Row("a,a,b,b") :: Nil)
+
+  checkAnswer(
+sql("SELECT LISTAGG(a) WITHIN GROUP (ORDER BY b DESC) FROM df"),

Review Comment:
   Thanks @hopefulnick, and sorry for the late response; let me fix it today.






Re: [PR] [SPARK-44209] Expose amount of shuffle data available on the node [spark]

2023-11-25 Thread via GitHub


github-actions[bot] commented on PR #42071:
URL: https://github.com/apache/spark/pull/42071#issuecomment-1826450649

   We're closing this PR because it hasn't been updated in a while. This isn't 
a judgement on the merit of the PR in any way. It's just a way of keeping the 
PR queue manageable.
   If you'd like to revive this PR, please reopen it and ask a committer to 
remove the Stale tag!





Re: [PR] [SPARK-44850][CONNECT] Heartbeat in scala's Spark Connect [spark]

2023-11-25 Thread via GitHub


github-actions[bot] commented on PR #42538:
URL: https://github.com/apache/spark/pull/42538#issuecomment-1826450643

   We're closing this PR because it hasn't been updated in a while. This isn't 
a judgement on the merit of the PR in any way. It's just a way of keeping the 
PR queue manageable.
   If you'd like to revive this PR, please reopen it and ask a committer to 
remove the Stale tag!





Re: [PR] [SPARK-46103][PYTHON][BUILD][DOCS] Enhancing PySpark documentation [spark]

2023-11-25 Thread via GitHub


itholic commented on code in PR #44012:
URL: https://github.com/apache/spark/pull/44012#discussion_r1405240579


##
dev/requirements.txt:
##
@@ -31,12 +31,12 @@ pandas-stubs<1.2.0.54
 mkdocs
 
 # Documentation (Python)
-pydata_sphinx_theme
+pydata_sphinx_theme==0.13

Review Comment:
   I followed [the version used in Pandas](https://github.com/pandas-dev/pandas/blob/main/requirements-dev.txt#L64), and I believe this version renders the documentation best, based on testing several versions.






Re: [PR] [SPARK-46103][PYTHON][BUILD][DOCS] Enhancing PySpark documentation [spark]

2023-11-25 Thread via GitHub


itholic commented on code in PR #44012:
URL: https://github.com/apache/spark/pull/44012#discussion_r1405240579


##
dev/requirements.txt:
##
@@ -31,12 +31,12 @@ pandas-stubs<1.2.0.54
 mkdocs
 
 # Documentation (Python)
-pydata_sphinx_theme
+pydata_sphinx_theme==0.13

Review Comment:
   I followed [the version used in Pandas](https://github.com/pandas-dev/pandas/blob/main/requirements-dev.txt#L64), and I believe this version renders the documentation best.






Re: [PR] [SPARK-46103][PYTHON][BUILD][DOCS] Enhancing PySpark documentation [spark]

2023-11-25 Thread via GitHub


itholic commented on code in PR #44012:
URL: https://github.com/apache/spark/pull/44012#discussion_r1405240335


##
dev/requirements.txt:
##
@@ -31,12 +31,12 @@ pandas-stubs<1.2.0.54
 mkdocs
 
 # Documentation (Python)
-pydata_sphinx_theme
+pydata_sphinx_theme==0.13

Review Comment:
   I followed [the version used in Pandas](https://github.com/pandas-dev/pandas/blob/main/requirements-dev.txt#L64), and after testing several versions, this one renders the documentation best.







Re: [PR] [SPARK-46103][PYTHON][BUILD][DOCS] Enhancing PySpark documentation [spark]

2023-11-25 Thread via GitHub


itholic commented on code in PR #44012:
URL: https://github.com/apache/spark/pull/44012#discussion_r1405239729


##
dev/requirements.txt:
##
@@ -31,12 +31,12 @@ pandas-stubs<1.2.0.54
 mkdocs
 
 # Documentation (Python)
-pydata_sphinx_theme
+pydata_sphinx_theme==0.13
 ipython
 nbsphinx
 numpydoc
-jinja2<3.0.0
-sphinx<3.1.0
+jinja2
+sphinx==4.2.0

Review Comment:
   My testing showed that other Sphinx versions do not generate the documentation properly for some reason. I have tested as many combinations as possible with Jinja2 and pydata_sphinx_theme, and confirmed that Sphinx 4.2.0 currently renders the documentation best. I will investigate further in the future to support the latest Sphinx if necessary.






Re: [PR] [SPARK-46103][PYTHON][BUILD][DOCS] Enhancing PySpark documentation [spark]

2023-11-25 Thread via GitHub


itholic commented on code in PR #44012:
URL: https://github.com/apache/spark/pull/44012#discussion_r1405238199


##
python/docs/source/reference/pyspark.pandas/frame.rst:
##
@@ -319,8 +320,8 @@ specific plotting methods of the form 
``DataFrame.plot.``.
 
 .. autosummary::
:toctree: api/
+   :template: autosummary/accessor_method.rst
 
-   DataFrame.plot

Review Comment:
   In newer versions of Sphinx, the build fails because `DataFrame.plot` and `Series.plot` are treated as duplicates of the functions listed below, such as `DataFrame.plot.area`, `DataFrame.plot.barh`, `DataFrame.plot.bar`, etc.
   
   In fact, this behavior seems reasonable since `.plot` is simply an accessor keyword and not a function, so I believe we can simply leave it out of the documentation.






Re: [PR] [SPARK-46103][PYTHON][BUILD][DOCS] Enhancing PySpark documentation [spark]

2023-11-25 Thread via GitHub


itholic commented on code in PR #44012:
URL: https://github.com/apache/spark/pull/44012#discussion_r1405238199


##
python/docs/source/reference/pyspark.pandas/frame.rst:
##
@@ -319,8 +320,8 @@ specific plotting methods of the form 
``DataFrame.plot.``.
 
 .. autosummary::
:toctree: api/
+   :template: autosummary/accessor_method.rst
 
-   DataFrame.plot

Review Comment:
   In newer versions of Sphinx, the build fails because `DataFrame.plot` and `Series.plot` are treated as duplicates of the functions listed below.
   
   In fact, this behavior seems reasonable since `.plot` is simply an accessor keyword and not a function, so I believe we can simply leave it out of the documentation.






Re: [PR] [SPARK-46103][PYTHON][BUILD][DOCS] Enhancing PySpark documentation [spark]

2023-11-25 Thread via GitHub


itholic commented on code in PR #44012:
URL: https://github.com/apache/spark/pull/44012#discussion_r1405236858


##
python/docs/source/_templates/autosummary/class_with_docs.rst:
##
@@ -47,7 +47,9 @@
 
 .. autosummary::
 {% for item in attributes %}
-   ~{{ name }}.{{ item }}
+{% if not (item == 'uid') %}

Review Comment:
   We should manually exclude `uid` from the documentation because it is an internal property. We don't include it in the current documentation either, but for some reason newer Sphinx versions try to generate the internal property unexpectedly.






Re: [PR] [SPARK-46103][PYTHON][BUILD][DOCS] Enhancing PySpark documentation [spark]

2023-11-25 Thread via GitHub


itholic commented on code in PR #44012:
URL: https://github.com/apache/spark/pull/44012#discussion_r1405236561


##
python/docs/source/_templates/autosummary/class_with_docs.rst:
##
@@ -47,7 +47,9 @@
 
 .. autosummary::
 {% for item in attributes %}
-   ~{{ name }}.{{ item }}
+{% if not (item == 'uid') %}

Review Comment:
   We should manually exclude `uid` from the documentation because it is an internal property. We don't include it in the current documentation either, but for some reason newer Sphinx versions try to generate the internal property unexpectedly.









Re: [PR] [SPARK-46103][PYTHON][BUILD][DOCS] Enhancing PySpark documentation [spark]

2023-11-25 Thread via GitHub


itholic commented on code in PR #44012:
URL: https://github.com/apache/spark/pull/44012#discussion_r1405236561


##
python/docs/source/_templates/autosummary/class_with_docs.rst:
##
@@ -47,7 +47,9 @@
 
 .. autosummary::
 {% for item in attributes %}
-   ~{{ name }}.{{ item }}
+{% if not (item == 'uid') %}

Review Comment:
   We should manually exclude `uid` from the documentation because it is an internal property. We don't include it in the current documentation either, but for some reason newer Sphinx versions try to generate the internal property, which is not correct.






Re: [PR] [SPARK-46103][PYTHON][BUILD][DOCS] Enhancing PySpark documentation [spark]

2023-11-25 Thread via GitHub


itholic commented on code in PR #44012:
URL: https://github.com/apache/spark/pull/44012#discussion_r1405236561


##
python/docs/source/_templates/autosummary/class_with_docs.rst:
##
@@ -47,7 +47,9 @@
 
 .. autosummary::
 {% for item in attributes %}
-   ~{{ name }}.{{ item }}
+{% if not (item == 'uid') %}

Review Comment:
   We should manually exclude `uid` from the documentation because it is an internal property. We don't include it in the current documentation either, but for some reason newer Sphinx versions try to generate the internal property.






Re: [PR] [SPARK-46103][PYTHON][BUILD][DOCS] Enhancing PySpark documentation [spark]

2023-11-25 Thread via GitHub


itholic commented on code in PR #44012:
URL: https://github.com/apache/spark/pull/44012#discussion_r1405234673


##
python/docs/source/conf.py:
##
@@ -194,7 +194,11 @@
 # further.  For a list of options available for each theme, see the
 # documentation.
 html_theme_options = {
-"navbar_end": ["version-switcher"]
+"navbar_end": ["version-switcher", "theme-switcher"],
+"logo": {
+"image_light": "_static/spark-logo-light.png",
+"image_dark": "_static/spark-logo-dark.png",

Review Comment:
   FYI: The default mode for light/dark is `auto`, which chooses a theme based on the user's system settings, but we can specify either `dark` or `light` as the default manually.






Re: [PR] [SPARK-46103][PYTHON][BUILD][DOCS] Enhancing PySpark documentation [spark]

2023-11-25 Thread via GitHub


itholic commented on code in PR #44012:
URL: https://github.com/apache/spark/pull/44012#discussion_r1405234673


##
python/docs/source/conf.py:
##
@@ -194,7 +194,11 @@
 # further.  For a list of options available for each theme, see the
 # documentation.
 html_theme_options = {
-"navbar_end": ["version-switcher"]
+"navbar_end": ["version-switcher", "theme-switcher"],
+"logo": {
+"image_light": "_static/spark-logo-light.png",
+"image_dark": "_static/spark-logo-dark.png",

Review Comment:
   FYI: The default mode for light/dark is `auto`, which chooses a theme based on the user's system settings, but we can choose either `dark` or `light` as the default manually.






Re: [PR] [SPARK-46103][PYTHON][BUILD][DOCS] Enhancing PySpark documentation [spark]

2023-11-25 Thread via GitHub


itholic commented on code in PR #44012:
URL: https://github.com/apache/spark/pull/44012#discussion_r1405233405


##
python/docs/source/reference/pyspark.pandas/frame.rst:
##
@@ -299,6 +299,7 @@ in Spark. These can be accessed by 
``DataFrame.spark.``.
 
 .. autosummary::
:toctree: api/
+   :template: autosummary/accessor_method.rst

Review Comment:
   For example, previously the rst file was created as follows:
   
   ```rst
   pyspark.sql.SparkSession.builder.appName
   
   
   .. currentmodule:: pyspark.sql.SparkSession
   
   .. automethod:: builder.appName
   ```
   However, in newer Sphinx versions it is generated like this:
   
   ```rst
   pyspark.sql.SparkSession.builder.appName
   
   
   .. currentmodule:: pyspark.sql
   
   .. automethod:: SparkSession.builder.appName
   ```
   
   In the case of functions used through internal classes or accessors like 
this, the package paths created in a new way will cause Sphinx build to fail. 
That's why we should use the customized template to correct the module path.
   
   See also https://github.com/sphinx-doc/sphinx/issues/7551.






Re: [PR] [SPARK-46103][PYTHON][BUILD][DOCS] Enhancing PySpark documentation [spark]

2023-11-25 Thread via GitHub


itholic commented on code in PR #44012:
URL: https://github.com/apache/spark/pull/44012#discussion_r1405232095


##
python/docs/source/reference/pyspark.pandas/frame.rst:
##
@@ -299,6 +299,7 @@ in Spark. These can be accessed by 
``DataFrame.spark.``.
 
 .. autosummary::
:toctree: api/
+   :template: autosummary/accessor_method.rst

Review Comment:
   For example, previously the rst file was created as follows:
   
   ```rst
   pyspark.sql.SparkSession.builder.appName
   
   
   .. currentmodule:: pyspark.sql.SparkSession
   
   .. automethod:: builder.appName
   ```
   However, in newer Sphinx versions it is generated like this:
   
   ```rst
   pyspark.sql.SparkSession.builder.appName
   
   
   .. currentmodule:: pyspark.sql
   
   .. automethod:: SparkSession.builder.appName
   ```
   
   In the case of functions used through internal classes or accessors like 
this, the package paths created in a new way will cause Sphinx build to fail.
   
   See also https://github.com/sphinx-doc/sphinx/issues/7551.






Re: [PR] [SPARK-46103][PYTHON][BUILD][DOCS] Enhancing PySpark documentation [spark]

2023-11-25 Thread via GitHub


itholic commented on code in PR #44012:
URL: https://github.com/apache/spark/pull/44012#discussion_r1405232095


##
python/docs/source/reference/pyspark.pandas/frame.rst:
##
@@ -299,6 +299,7 @@ in Spark. These can be accessed by 
``DataFrame.spark.``.
 
 .. autosummary::
:toctree: api/
+   :template: autosummary/accessor_method.rst

Review Comment:
   For example, previously the rst file was created as follows:
   
   ```rst
   pyspark.sql.SparkSession.builder.appName
   
   
   .. currentmodule:: pyspark.sql.SparkSession
   
   .. automethod:: builder.appName
   ```
   However, in newer Sphinx versions it is generated like this:
   
   ```rst
   pyspark.sql.SparkSession.builder.appName
   
   
   .. currentmodule:: pyspark.sql
   
   .. automethod:: SparkSession.builder.appName
   ```
   
   In the case of functions used through internal classes or accessors like 
this, the package paths created in a new way will cause Sphinx build to fail. 
See https://github.com/sphinx-doc/sphinx/issues/7551 related issue. Therefore, 
we should use the customized template such as Pandas does.







Re: [PR] [SPARK-46103][PYTHON][BUILD][DOCS] Enhancing PySpark documentation [spark]

2023-11-25 Thread via GitHub


itholic commented on code in PR #44012:
URL: https://github.com/apache/spark/pull/44012#discussion_r1405232095


##
python/docs/source/reference/pyspark.pandas/frame.rst:
##
@@ -299,6 +299,7 @@ in Spark. These can be accessed by 
``DataFrame.spark.``.
 
 .. autosummary::
:toctree: api/
+   :template: autosummary/accessor_method.rst

Review Comment:
   For example, previously the rst file was created as follows:
   
   ```rst
   pyspark.sql.SparkSession.builder.appName
   
   
   .. currentmodule:: pyspark.sql.SparkSession
   
   .. automethod:: builder.appName
   ```
   However, in newer Sphinx versions it is generated like this:
   
   ```rst
   pyspark.sql.SparkSession.builder.appName
   
   
   .. currentmodule:: pyspark.sql
   
   .. automethod:: SparkSession.builder.appName
   ```
   
   In the case of functions used through internal classes or accessors like 
this, the package paths created in a new way will cause Sphinx build to fail. 
See https://github.com/sphinx-doc/sphinx/issues/7551 related issue.
   
   Therefore, I modified the path to be customized for these cases using a 
custom template such as Pandas does.






Re: [PR] [SPARK-46103][PYTHON][BUILD][DOCS] Enhancing PySpark documentation [spark]

2023-11-25 Thread via GitHub


itholic commented on code in PR #44012:
URL: https://github.com/apache/spark/pull/44012#discussion_r1405232095


##
python/docs/source/reference/pyspark.pandas/frame.rst:
##
@@ -299,6 +299,7 @@ in Spark. These can be accessed by 
``DataFrame.spark.``.
 
 .. autosummary::
:toctree: api/
+   :template: autosummary/accessor_method.rst

Review Comment:
   For example, previously the rst file was created as follows:
   
   ```rst
   pyspark.sql.SparkSession.builder.appName
   
   
   .. currentmodule:: pyspark.sql.SparkSession
   
   .. automethod:: builder.appName
   ```
   However, in newer Sphinx versions it is generated like this:
   
   ```rst
   pyspark.sql.SparkSession.builder.appName
   
   
   .. currentmodule:: pyspark.sql
   
   .. automethod:: SparkSession.builder.appName
   ```
   
   In the case of functions used through internal classes or accessors like 
this, the package paths created in a new way will cause Sphinx build to fail.
   
   Therefore, I modified the path to be customized for these cases using a 
custom template such as Pandas does.







Re: [PR] [SPARK-46103][PYTHON][BUILD][DOCS] Enhancing PySpark documentation [spark]

2023-11-25 Thread via GitHub


itholic commented on code in PR #44012:
URL: https://github.com/apache/spark/pull/44012#discussion_r1405231062


##
python/docs/source/reference/pyspark.pandas/frame.rst:
##
@@ -299,6 +299,7 @@ in Spark. These can be accessed by 
``DataFrame.spark.``.
 
 .. autosummary::
:toctree: api/
+   :template: autosummary/accessor_method.rst

Review Comment:
   With the new version of Sphinx, the package name creation rules for `rst` 
files that are automatically created when building documents have changed, so 
we must manually adjust the package path using these templates.
   
   This behavior is [used in the same way in 
Pandas](https://github.com/pandas-dev/pandas/tree/main/doc/_templates/autosummary),
 so I referred to it.






Re: [PR] [SPARK-46100][CORE][PYTHON] Reduce stack depth by replace (string|array).size with (string|array).length [spark]

2023-11-25 Thread via GitHub


srowen commented on PR #44011:
URL: https://github.com/apache/spark/pull/44011#issuecomment-1826415496

   Merged to master





Re: [PR] [SPARK-46100][CORE][PYTHON] Reduce stack depth by replace (string|array).size with (string|array).length [spark]

2023-11-25 Thread via GitHub


srowen closed pull request #44011: [SPARK-46100][CORE][PYTHON] Reduce stack 
depth by replace (string|array).size with (string|array).length
URL: https://github.com/apache/spark/pull/44011





Re: [PR] [SPARK-46096][BUILD] Upgrade `sbt` to 1.9.7 [spark]

2023-11-25 Thread via GitHub


dongjoon-hyun commented on PR #44008:
URL: https://github.com/apache/spark/pull/44008#issuecomment-1826410068

   Is that a MIMA issue? 
   
   https://github.com/apache/spark/assets/9700541/ef55b97a-c13a-49eb-ad7b-987cd74e93b3
   
   From GitHub Log, only `mima` complains while the others are green. 
   https://github.com/apache/spark/assets/9700541/472800a2-f210-46a1-9cfb-581eec50f18c
   





Re: [PR] [SPARK-46075][CONNECT] Improvements to SparkConnectSessionManager [spark]

2023-11-25 Thread via GitHub


rangadi commented on code in PR #43985:
URL: https://github.com/apache/spark/pull/43985#discussion_r1405191450


##
connector/connect/server/src/main/scala/org/apache/spark/sql/connect/service/SessionHolder.scala:
##
@@ -47,9 +47,19 @@ case class SessionKey(userId: String, sessionId: String)
 case class SessionHolder(userId: String, sessionId: String, session: 
SparkSession)
 extends Logging {
 
-  @volatile private var lastRpcAccessTime: Option[Long] = None
+  // Time when the session was started.
+  private val startTime: Long = System.currentTimeMillis()

Review Comment:
   Convention is to use unit suffix (`startTimeMs` for milliseconds here) when 
the type is long. `Time` is ok if this was a type like `Instant`.  
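   
   A tiny sketch of that convention (hypothetical field names):
   ```scala
   import java.time.Instant
   
   // Raw epoch millis held in a Long: carry the unit in the name.
   val startTimeMs: Long = System.currentTimeMillis()
   // A typed value such as Instant does not need the unit suffix.
   val startTime: Instant = Instant.now()
   ```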



##
connector/connect/server/src/main/scala/org/apache/spark/sql/connect/service/SparkConnectSessionManager.scala:
##
@@ -95,47 +92,134 @@ class SparkConnectSessionManager extends Logging {
 Option(getSession(key, None))
   }
 
-  private def getSession(
-  key: SessionKey,
-  default: Option[Callable[SessionHolder]]): SessionHolder = {
-val session = default match {
-  case Some(callable) => sessionStore.get(key, callable)
-  case None => sessionStore.getIfPresent(key)
-}
-// record access time before returning
-session match {
-  case null =>
-null
-  case s: SessionHolder =>
-s.updateAccessTime()
-s
+  private def getSession(key: SessionKey, default: Option[() => 
SessionHolder]): SessionHolder = {
+sessionsLock.synchronized {
+  // try to get existing session from store
+  val sessionOpt = sessionStore.get(key)
+  // create using default if missing
+  val session = sessionOpt match {
+case Some(s) => s
+case None =>
+  default match {
+case Some(callable) =>
+  val session = callable()
+  sessionStore.put(key, session)
+  session
+case None =>
+  null
+  }
+  }
+  // record access time before returning
+  session match {
+case null =>
+  null
+case s: SessionHolder =>
+  s.updateAccessTime()
+  s
+  }
 }
   }
 
   def closeSession(key: SessionKey): Unit = {
-// Invalidate will trigger RemoveSessionListener
-sessionStore.invalidate(key)
-  }
-
-  private class RemoveSessionListener extends RemovalListener[SessionKey, 
SessionHolder] {
-override def onRemoval(notification: RemovalNotification[SessionKey, 
SessionHolder]): Unit = {
-  val sessionHolder = notification.getValue
-  sessionsLock.synchronized {
+var sessionHolder: Option[SessionHolder] = None
+sessionsLock.synchronized {
+  sessionHolder = sessionStore.remove(key)
+  sessionHolder.foreach { s =>
 // First put into closedSessionsCache, so that it cannot get 
accidentally recreated by
 // getOrCreateIsolatedSession.
-closedSessionsCache.put(sessionHolder.key, 
sessionHolder.getSessionHolderInfo)
+closedSessionsCache.put(s.key, s.getSessionHolderInfo)
   }
-  // Rest of the cleanup outside sessionLock - the session cannot be 
accessed anymore by
-  // getOrCreateIsolatedSession.
-  sessionHolder.close()
+}
+// Rest of the cleanup outside sessionLock - the session cannot be 
accessed anymore by
+// getOrCreateIsolatedSession.
+sessionHolder.foreach { s =>
+  s.close()
+  // Update in closedSessionsCache: above it wasn't updated with 
closedTime etc. yet.

Review Comment:
   Any concerns if `close()` above throws? 
   Btw, when do we clean up `closedSessionsCache`?



##
connector/connect/server/src/main/scala/org/apache/spark/sql/connect/service/SessionHolder.scala:
##
@@ -186,7 +196,13 @@ case class SessionHolder(userId: String, sessionId: 
String, session: SparkSessio
   def classloader: ClassLoader = artifactManager.classloader
 
   private[connect] def updateAccessTime(): Unit = {
-lastRpcAccessTime = Some(System.currentTimeMillis())
+lastAccessTime = System.currentTimeMillis()
+logInfo(s"Session $key accessed, time $lastAccessTime.")

Review Comment:
   Temporary log?



##
connector/connect/server/src/main/scala/org/apache/spark/sql/connect/service/SessionHolder.scala:
##
@@ -205,14 +220,26 @@ case class SessionHolder(userId: String, sessionId: 
String, session: SparkSessio
* Called only by SparkConnectSessionManager.
*/
   private[connect] def close(): Unit = {
+// It is called only by SparkConnectSessionManager.closeSession, only 
after it's removed from
+// the sessionStore there guarantees that it is called only once.
+if (closedTime.isDefined) {
+  throw new IllegalStateException(s"Session $key is already closed.")
+}
+
 logInfo(s"Closing session with userId: $userId and sessionId: $sessionId")
 
-// After isClosing=true, 

Re: [PR] [SPARK-45826][SQL] Add a SQL config for stack traces in DataFrame query context [spark]

2023-11-25 Thread via GitHub


MaxGekk commented on PR #43695:
URL: https://github.com/apache/spark/pull/43695#issuecomment-1826388775

   > Can we show the impact to the real error message
   
   @cloud-fan I added an example; please take a look at the PR.





Re: [PR] [SPARK-45943][SQL] Add a tag to mark resolved subquery plan for ResolveSubquery [spark]

2023-11-25 Thread via GitHub


cloud-fan commented on PR #43867:
URL: https://github.com/apache/spark/pull/43867#issuecomment-1826351872

   > This may also affect other Rules. Since HiveTableRelation is not resolved, 
the Project.projectList of the parent plan will not be resolved.
   
   Good point. Why is `DetermineTableStats` a post-hoc resolution rule then...





Re: [PR] [SPARK-45649][SQL] Unify the prepare framework for OffsetWindowFunctionFrame [spark]

2023-11-25 Thread via GitHub


cloud-fan commented on code in PR #43958:
URL: https://github.com/apache/spark/pull/43958#discussion_r1405122383


##
sql/core/src/main/scala/org/apache/spark/sql/execution/window/WindowFunctionFrame.scala:
##
@@ -175,6 +178,23 @@ abstract class OffsetWindowFunctionFrameBase(
 }
   }
 
+  override def prepare(rows: ExternalAppendOnlyUnsafeRowArray): Unit = {
+resetStates(rows)

Review Comment:
   So the fix is to always call `resetStates`?
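   For readers following along, a standalone sketch of the unified pattern (toy types, not the real `OffsetWindowFunctionFrameBase`), where the shared `prepare` always calls `resetStates` before any frame-specific logic:
   
   ```scala
   // Simplified model: every offset frame resets its per-partition state in prepare().
   abstract class OffsetFrameBase {
     protected var input: IndexedSeq[Int] = IndexedSeq.empty
   
     protected def resetStates(rows: IndexedSeq[Int]): Unit = {
       input = rows
     }
   
     // The unified entry point always resets state first.
     def prepare(rows: IndexedSeq[Int]): Unit = resetStates(rows)
   
     def write(currentIndex: Int): Option[Int]
   }
   
   // A lead-like frame: the value at (current position + offset), if it exists.
   class LeadLikeFrame(offset: Int) extends OffsetFrameBase {
     override def write(currentIndex: Int): Option[Int] = {
       val target = currentIndex + offset
       if (target >= 0 && target < input.length) Some(input(target)) else None
     }
   }
   
   object OffsetFrameDemo extends App {
     val frame = new LeadLikeFrame(offset = 1)
     frame.prepare(IndexedSeq(10, 20, 30))
     println((0 until 3).map(frame.write))  // Vector(Some(20), Some(30), None)
   }
   ```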






Re: [PR] [SPARK-45760][SQL][FOLLOWUP] Inline With inside conditional branches [spark]

2023-11-25 Thread via GitHub


cloud-fan commented on code in PR #43978:
URL: https://github.com/apache/spark/pull/43978#discussion_r1405081903


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/RewriteWithExpression.scala:
##
@@ -35,56 +35,82 @@ object RewriteWithExpression extends Rule[LogicalPlan] {
   override def apply(plan: LogicalPlan): LogicalPlan = {
 plan.transformWithPruning(_.containsPattern(WITH_EXPRESSION)) {
   case p if p.expressions.exists(_.containsPattern(WITH_EXPRESSION)) =>
-var newChildren = p.children
-var newPlan: LogicalPlan = p.transformExpressionsUp {
-  case With(child, defs) =>
-val refToExpr = mutable.HashMap.empty[Long, Expression]
-val childProjections = 
Array.fill(newChildren.size)(mutable.ArrayBuffer.empty[Alias])
+val inputPlans = p.children.toArray
+var newPlan: LogicalPlan = p.mapExpressions { expr =>
+  rewriteWithExprAndInputPlans(expr, inputPlans)
+}
+newPlan = newPlan.withNewChildren(inputPlans.toIndexedSeq)
+if (p.output == newPlan.output) {
+  newPlan
+} else {
+  Project(p.output, newPlan)
+}
+}
+  }
 
-defs.zipWithIndex.foreach { case (CommonExpressionDef(child, id), 
index) =>
-  if (CollapseProject.isCheap(child)) {
-refToExpr(id) = child
-  } else {
-val childProjectionIndex = newChildren.indexWhere(
-  c => child.references.subsetOf(c.outputSet)
-)
-if (childProjectionIndex == -1) {
-  // When we cannot rewrite the common expressions, force to 
inline them so that the
-  // query can still run. This can happen if the join 
condition contains `With` and
-  // the common expression references columns from both join 
sides.
-  // TODO: things can go wrong if the common expression is 
nondeterministic. We
-  //   don't fix it for now to match the old buggy 
behavior when certain
-  //   `RuntimeReplaceable` did not use the `With` 
expression.
-  // TODO: we should calculate the ref count and also inline 
the common expression
-  //   if it's ref count is 1.
-  refToExpr(id) = child
-} else {
-  val alias = Alias(child, s"_common_expr_$index")()
-  childProjections(childProjectionIndex) += alias
-  refToExpr(id) = alias.toAttribute
-}
-  }
-}
+  private def rewriteWithExprAndInputPlans(
+  e: Expression,
+  inputPlans: Array[LogicalPlan]): Expression = {
+if (!e.containsPattern(WITH_EXPRESSION)) return e
+e match {
+  case w: With =>
+// Rewrite nested With expression in CommonExpressionDef first.
+val defs = w.defs.map(rewriteWithExprAndInputPlans(_, inputPlans))

Review Comment:
   Oh, a correlated nested `With`! I'm not sure whether we want to support it 
or not... but at least we should fail if we don't want to support it.
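   As a rough standalone illustration of the fail-fast option (a toy expression tree, not Spark's `With`/`CommonExpressionDef` classes):
   
   ```scala
   // Toy AST: reject a With whose definitions still contain another With.
   sealed trait Expr
   final case class Literal(v: Int) extends Expr
   final case class Add(l: Expr, r: Expr) extends Expr
   final case class With(child: Expr, defs: Seq[Expr]) extends Expr
   
   object RejectNestedWith {
     private def containsWith(e: Expr): Boolean = e match {
       case _: With   => true
       case Add(l, r) => containsWith(l) || containsWith(r)
       case _         => false
     }
   
     def check(w: With): Unit =
       if (w.defs.exists(containsWith)) {
         throw new UnsupportedOperationException(
           "Nested With inside a common expression definition is not supported")
       }
   
     def main(args: Array[String]): Unit = {
       check(With(Literal(1), Seq(Add(Literal(1), Literal(2)))))  // passes
       try check(With(Literal(1), Seq(With(Literal(2), Seq(Literal(3))))))
       catch { case e: UnsupportedOperationException => println(e.getMessage) }
     }
   }
   ```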






Re: [PR] [SPARK-46036][SQL] Removing error-class from raise_error function [spark]

2023-11-25 Thread via GitHub


cloud-fan commented on code in PR #44004:
URL: https://github.com/apache/spark/pull/44004#discussion_r1405045604


##
sql/core/src/test/resources/sql-tests/inputs/misc-functions.sql:
##
@@ -21,21 +21,30 @@ CREATE TEMPORARY VIEW tbl_misc AS SELECT * FROM (VALUES 
(1), (8), (2)) AS T(v);
 SELECT raise_error('error message');
 SELECT if(v > 5, raise_error('too big: ' || v), v + 1) FROM tbl_misc;
 
-SELECT raise_error('VIEW_NOT_FOUND', Map('relationName', '`v`'));
--- Error class is case insensitive
-SELECT raise_error('VIEW_NOT_FOund', Map('relationName', '`v`'));
--- parameters are case sensitive
-SELECT raise_error('VIEW_NOT_FOund', Map('relationNAME', '`v`'));
--- Too few parameters
-SELECT raise_error('VIEW_NOT_FOUND', Map());
 -- Too many parameters
-SELECT raise_error('VIEW_NOT_FOUND', Map('relationName', '`v`', 
'totallymadeup', '5'));
+SELECT raise_error('error message', Map());
+
+-- Too many parameters
+SELECT raise_error('error message', 'some args');
+
+-- Too few parameters
+SELECT raise_error();
+
+-- Passing null as message
+SELECT raise_error(NULL);
+
+-- Passing non-string type
+SELECT raise_error(1);
 
--- Empty parameter list
-SELECT raise_error('ALL_PARTITION_COLUMNS_NOT_ALLOWED', Map());
-SELECT raise_error('ALL_PARTITION_COLUMNS_NOT_ALLOWED', NULL);
+-- Passing expression
+SELECT raise_error(1 + 1);
 
-SELECT raise_error(NULL, NULL);
+-- TODO: Need feedback on proper behaviour here:

Review Comment:
   The implicit cast does not support converting complex types to string type. 
Actually, we can skip these three tests, since implicit casting is orthogonal to 
the raise_error function.






Re: [PR] [SPARK-46036][SQL] Removing error-class from raise_error function [spark]

2023-11-25 Thread via GitHub


cloud-fan commented on PR #44004:
URL: https://github.com/apache/spark/pull/44004#issuecomment-1826332136

   cc @srielau 





Re: [PR] [SPARK-46036][SQL] Removing error-class from raise_error function [spark]

2023-11-25 Thread via GitHub


cloud-fan commented on code in PR #44004:
URL: https://github.com/apache/spark/pull/44004#discussion_r1405037748


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/misc.scala:
##
@@ -139,6 +128,31 @@ object RaiseError {
 new RaiseError(errorClass, parms)
 }
 
+/**
+ * Throw with the result of an expression (used for debugging).
+ */
+// scalastyle:off line.size.limit
+@ExpressionDescription(
+  usage = "_FUNC_( expr ) - Throws a USER_RAISED_EXCEPTION with `expr` as 
message.",
+  examples = """
+Examples:
+  > SELECT _FUNC_('custom error message');
+   [USER_RAISED_EXCEPTION] custom error message
+  """,
+  since = "3.1.0",
+  group = "misc_funcs")
+// scalastyle:on line.size.limit
+object RaiseErrorExpressionBuilder extends ExpressionBuilder {
+  override def build(funcName: String, expressions: Seq[Expression]): 
Expression = {
+// for some reason pattern matching doesn't work here...

Review Comment:
   I think an explicit length check is better than a pattern match here. We 
don't need this comment.
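   For comparison, a small standalone sketch of the two styles (plain Scala with made-up types, not the real `ExpressionBuilder` API):
   
   ```scala
   object ArityCheckSketch {
     final case class Expr(sql: String)
   
     // Explicit length check: the error branch is obvious without a comment.
     def buildWithLengthCheck(funcName: String, args: Seq[Expr]): Expr = {
       if (args.length != 1) {
         throw new IllegalArgumentException(
           s"$funcName expects exactly 1 argument, got ${args.length}")
       }
       Expr(s"raise_error(${args.head.sql})")
     }
   
     // Pattern-match alternative, for comparison.
     def buildWithMatch(funcName: String, args: Seq[Expr]): Expr = args match {
       case Seq(single) => Expr(s"raise_error(${single.sql})")
       case _ => throw new IllegalArgumentException(
         s"$funcName expects exactly 1 argument, got ${args.length}")
     }
   
     def main(args: Array[String]): Unit =
       println(buildWithLengthCheck("raise_error", Seq(Expr("'custom error message'"))).sql)
   }
   ```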






Re: [PR] [SPARK-46091][KUBERNETES] Respect the defined container SPARK_LOCAL_DIRS env [spark]

2023-11-25 Thread via GitHub


turboFei commented on PR #44005:
URL: https://github.com/apache/spark/pull/44005#issuecomment-1826300772

   gentle ping @dongjoon-hyun 





Re: [PR] [SPARK-46036][SQL] Removing error-class from raise_error function [spark]

2023-11-25 Thread via GitHub


beliefer commented on code in PR #44004:
URL: https://github.com/apache/spark/pull/44004#discussion_r1404849612


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/misc.scala:
##
@@ -139,6 +128,31 @@ object RaiseError {
 new RaiseError(errorClass, parms)
 }
 
+/**
+ * Throw with the result of an expression (used for debugging).
+ */
+// scalastyle:off line.size.limit
+@ExpressionDescription(
+  usage = "_FUNC_( expr ) - Throws a USER_RAISED_EXCEPTION with `expr` as 
message.",
+  examples = """
+Examples:
+  > SELECT _FUNC_('custom error message');
+   [USER_RAISED_EXCEPTION] custom error message
+  """,
+  since = "3.1.0",
+  group = "misc_funcs")
+// scalastyle:on line.size.limit
+object RaiseErrorExpressionBuilder extends ExpressionBuilder {

Review Comment:
   `RaiseError` works fine; why did you create the builder?



##
sql/core/src/main/scala/org/apache/spark/sql/functions.scala:
##
@@ -3269,14 +3269,6 @@ object functions {
*/
   def raise_error(c: Column): Column = Column.fn("raise_error", c)
 
-  /**
-   * Throws an exception with the provided error class and parameter map.
-   *
-   * @group misc_funcs
-   * @since 4.0.0
-   */
-  def raise_error(c: Column, e: Column): Column = Column.fn("raise_error", c, 
e)

Review Comment:
   It seems this API was added not long ago.






Re: [PR] [SPARK-46100][CORE][PYTHON] Reduce stack depth by replace (string|array).size with (string|array).length [spark]

2023-11-25 Thread via GitHub


beliefer commented on PR #44011:
URL: https://github.com/apache/spark/pull/44011#issuecomment-1826279786

   > Are these all the similar cases in the project? I think we can fix all 
cases in one PR :)
   
   It's very hard. Fixing the cases in the core module alone took me two hours.
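   For context, a small sketch of why `.length` is preferred on `String` and `Array`; the exact wrapper behaviour depends on the Scala version, so treat the comments as an approximation:
   
   ```scala
   object SizeVsLength {
     def main(args: Array[String]): Unit = {
       val s = "spark"
       val a = Array(1, 2, 3)
   
       // Direct calls: java.lang.String.length() and the JVM array length.
       println(s.length)  // 5
       println(a.length)  // 3
   
       // Same results, but .size goes through an implicit conversion to
       // StringOps / ArrayOps before delegating to length, adding extra call frames.
       println(s.size)    // 5
       println(a.size)    // 3
     }
   }
   ```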





Re: [PR] [SPARK-46097][SQL] Push down limit 1 though Union and Aggregate [spark]

2023-11-25 Thread via GitHub


beliefer commented on code in PR #44009:
URL: https://github.com/apache/spark/pull/44009#discussion_r1404844214


##
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/LimitPushdownSuite.scala:
##
@@ -352,4 +352,21 @@ class LimitPushdownSuite extends PlanTest {
   comparePlans(Optimize.execute(originalQuery2), originalQuery2)
 }
   }
+
+  test("SPARK-46097: Push down limit 1 though Union and Aggregate") {
+val unionQuery = Union(
+  Union(
+testRelation.groupBy($"a", $"b")($"a", $"b"),
+testRelation2.groupBy($"d", $"e")($"d", $"e"),
+  ),
+  testRelation2.groupBy($"e", $"f")($"e", $"f")).limit(1)
+
+val correctAnswer = Union(
+  Union(
+LocalLimit(1, testRelation).select($"a", $"b"),
+LocalLimit(1, testRelation2).select($"d", $"e")).limit(1),
+  LocalLimit(1, testRelation2).select($"e", $"f")).limit(1)
+
+comparePlans(Optimize.execute(unionQuery.analyze), correctAnswer.analyze)

Review Comment:
   This test case only verifies the logical plan. Could you add a test case 
that compares the output data?
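   Purely as an illustration, a data-level check could look roughly like this (assumes a local SparkSession; the input data and names are made up):
   
   ```scala
   import org.apache.spark.sql.SparkSession
   
   object LimitOnePushdownDataCheck {
     def main(args: Array[String]): Unit = {
       val spark = SparkSession.builder().master("local[2]").appName("limit1-check").getOrCreate()
       import spark.implicits._
   
       val t1 = Seq((1, "a"), (2, "b")).toDF("a", "b")
       val t2 = Seq((2, "b"), (3, "c")).toDF("d", "e")
   
       val result = t1.distinct().union(t2.distinct()).limit(1).collect()
   
       // LIMIT 1 without ORDER BY does not pin a specific row, so assert on the
       // row count and on membership in the inputs rather than on an exact row.
       assert(result.length == 1)
       val inputRows = (t1.collect() ++ t2.collect()).map(_.toSeq).toSet
       assert(inputRows.contains(result.head.toSeq))
   
       spark.stop()
     }
   }
   ```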



##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala:
##
@@ -771,6 +771,17 @@ object LimitPushDown extends Rule[LogicalPlan] {
   Limit(le, Project(a.aggregateExpressions, LocalLimit(le, a.child)))
 case Limit(le @ IntegerLiteral(1), p @ Project(_, a: Aggregate)) if 
a.groupOnly =>
   Limit(le, p.copy(child = Project(a.aggregateExpressions, LocalLimit(le, 
a.child
+// Push down limit 1 though Union and Aggregate
+case Limit(le @ IntegerLiteral(1), u: Union) =>
+  val newUnionChildren = u.children.map {
+case a: Aggregate if a.groupOnly =>

Review Comment:
   I suspect the output may change:
   `limit 1 after shuffle` is not the same as `limit 1 before shuffle`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-46100][CORE][PYTHON] Reduce stack depth by replace (string|array).size with (string|array).length [spark]

2023-11-25 Thread via GitHub


LuciferYang commented on PR #44011:
URL: https://github.com/apache/spark/pull/44011#issuecomment-1826278609

   Are these all the similar cases in the project? I think we can fix all 
cases in one PR :)





Re: [PR] [SPARK-46096][BUILD] Upgrade `sbt` to 1.9.7 [spark]

2023-11-25 Thread via GitHub


LuciferYang commented on PR #44008:
URL: https://github.com/apache/spark/pull/44008#issuecomment-1826278081

   IIRC, did @panbingkun encounter some unresolved issues when he tried to 
upgrade to this version before?
   
   





Re: [PR] [SPARK-45954][SQL] Remove redundant shuffles [spark]

2023-11-25 Thread via GitHub


beliefer commented on code in PR #43841:
URL: https://github.com/apache/spark/pull/43841#discussion_r1404840863


##
sql/core/src/test/scala/org/apache/spark/sql/execution/RemoveRedundantShufflesSuite.scala:
##
@@ -0,0 +1,101 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution
+
+import org.apache.spark.sql.{DataFrame, QueryTest}
+import org.apache.spark.sql.execution.adaptive.{AdaptiveSparkPlanHelper, 
DisableAdaptiveExecutionSuite, EnableAdaptiveExecutionSuite}
+import org.apache.spark.sql.execution.exchange.ShuffleExchangeExec
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.test.SharedSparkSession
+
+abstract class RemoveRedundantShufflesSuiteBase
+extends QueryTest
+with SharedSparkSession
+with AdaptiveSparkPlanHelper {
+  import testImplicits._
+
+  private def getShuffleExchangeNum(df: DataFrame): Int = {
+collect(df.queryExecution.executedPlan) {
+  case s: ShuffleExchangeExec => s
+}.size
+  }
+
+  test("Remove redundant shuffle") {
+withTempView("t1", "t2") {
+  spark.range(10).select($"id").createOrReplaceTempView("t1")
+  spark.range(20).select($"id").createOrReplaceTempView("t2")
+  Seq(-1, 100).foreach { threshold =>
+withSQLConf(SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> 
threshold.toString) {
+  val query = 
spark.table("t1").repartition(10).join(spark.table("t2"), "id")
+  query.collect()
+  val shuffleNum = getShuffleExchangeNum(query)
+  if (threshold > 0) {

Review Comment:
   This test condition doesn't look good.
   How about `threshold == 100`?






[PR] [SPARK-46099][PS][DOCS] Refactor "Supported pandas API" generation script [spark]

2023-11-25 Thread via GitHub


itholic opened a new pull request, #44010:
URL: https://github.com/apache/spark/pull/44010

   ### What changes were proposed in this pull request?
   
   This PR proposes to refactor the script used to generate the [Supported 
pandas 
API](https://spark.apache.org/docs/latest/api/python/user_guide/pandas_on_spark/supported_pandas_api.html)
 documentation. The script has been restructured for better readability and 
maintainability. The refactoring includes:
   
   - Simplifying complex functions and breaking them into smaller, more 
manageable pieces.
   - Improving variable and function naming for clarity.
   - Adding comprehensive docstrings in the NumPy docstyle.
   - Streamlining the flow of the script to enhance logical coherence.
   
   ### Why are the changes needed?
   
   The previous version of the script was hard to understand and maintain due 
to its complexity and lack of documentation. This refactoring makes the script 
more accessible to new contributors and easier to modify or extend in the 
future. It also ensures that the script adheres to best practices in Python 
coding, making it a more reliable tool for generating accurate and up-to-date 
documentation.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No user-facing changes. This PR only affects the internal documentation 
generation process.
   
   ### How was this patch tested?
   
   Tested by generating the documentation manually and verifying that the 
output remains consistent with the previous version.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No.

