date:20230920

[GitHub] [spark] zhengruifeng commented on a diff in pull request #43010: [SPARK-41086][SQL] Use DataFrame ID to semantically validate CollectMetrics

2023-09-20 Thread via GitHub



zhengruifeng commented on code in PR #43010:
URL: https://github.com/apache/spark/pull/43010#discussion_r1332491666


##
connector/connect/common/src/main/protobuf/spark/connect/relations.proto:
##
@@ -990,6 +990,9 @@ message CollectMetrics {
 
   // (Required) The metric sequence.
   repeated Expression metrics = 3;
+
+  // (Required) A unique DataFrame id.
+  int64 dataframe_id = 4;

Review Comment:
   since `dataframe_id` is set the `plan_id`
   
   why not reusing the `plan_id` in `RelationCommon`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] zhengruifeng commented on a diff in pull request #43010: [SPARK-41086][SQL] Use DataFrame ID to semantically validate CollectMetrics

2023-09-20 Thread via GitHub



zhengruifeng commented on code in PR #43010:
URL: https://github.com/apache/spark/pull/43010#discussion_r1332489254


##
python/pyspark/sql/connect/plan.py:
##
@@ -1197,6 +1197,7 @@ def plan(self, session: "SparkConnectClient") -> 
proto.Relation:
 plan.collect_metrics.input.CopyFrom(self._child.plan(session))
 plan.collect_metrics.name = self._name
 plan.collect_metrics.metrics.extend([self.col_to_expr(x, session) for 
x in self._exprs])
+plan.collect_metrics.dataframe_id = self._child._plan_id

Review Comment:
   the `_plan_id` is unique:
   
   
https://github.com/apache/spark/blob/5299e5423f439ce702d14666c79f9d42eb446943/python/pyspark/sql/connect/plan.py#L57-L72



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] cloud-fan commented on a diff in pull request #42612: [SPARK-44913][SQL] DS V2 supports push down V2 UDF that has magic method

2023-09-20 Thread via GitHub



cloud-fan commented on code in PR #42612:
URL: https://github.com/apache/spark/pull/42612#discussion_r1332486360


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala:
##
@@ -270,6 +271,8 @@ object SerializerSupport {
  *   non-null value.
  * @param isDeterministic Whether the method invocation is deterministic or 
not. If false, Spark
  *will not apply certain optimizations such as 
constant folding.
+ * @param scalarFunction the [[ScalarFunction]] object if this is calling the 
magic method of the
+ *   [[ScalarFunction]] otherwise is unset.
  */
 case class StaticInvoke(

Review Comment:
   we can override `stringArgs` in `StaticInvoke`, to exclude the new parameter 
if it's None, to avoid the golden file changes.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] cloud-fan commented on a diff in pull request #43010: [SPARK-41086][SQL] Use DataFrame ID to semantically validate CollectMetrics

2023-09-20 Thread via GitHub



cloud-fan commented on code in PR #43010:
URL: https://github.com/apache/spark/pull/43010#discussion_r1332484719


##
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/AnalysisSuite.scala:
##
@@ -779,34 +779,35 @@ class AnalysisSuite extends AnalysisTest with Matchers {
 val literal = Literal(1).as("lit")
 
 // Ok
-assert(CollectMetrics("event", literal :: sum :: random_sum :: Nil, 
testRelation).resolved)
+assert(CollectMetrics("event", literal :: sum :: random_sum :: Nil, 
testRelation, 0).resolved)
 
 // Bad name
-assert(!CollectMetrics("", sum :: Nil, testRelation).resolved)
+assert(!CollectMetrics("", sum :: Nil, testRelation, 1).resolved)
 assertAnalysisErrorClass(
-  CollectMetrics("", sum :: Nil, testRelation),
+  CollectMetrics("", sum :: Nil, testRelation, 1),
   expectedErrorClass = "INVALID_OBSERVED_METRICS.MISSING_NAME",
   expectedMessageParameters = Map(
-"operator" -> "'CollectMetrics , [sum(a#x) AS sum#xL]\n+- 
LocalRelation , [a#x]\n")
+"operator" ->
+  "'CollectMetrics , [sum(a#x) AS sum#xL], 1\n+- LocalRelation 
, [a#x]\n")
 )
 
 // No columns
-assert(!CollectMetrics("evt", Nil, testRelation).resolved)
+assert(!CollectMetrics("evt", Nil, testRelation, 2).resolved)
 
 def checkAnalysisError(exprs: Seq[NamedExpression], errors: String*): Unit 
= {
-  assertAnalysisError(CollectMetrics("event", exprs, testRelation), errors)
+  assertAnalysisError(CollectMetrics("event", exprs, testRelation, 3), 
errors)
 }
 
 // Unwrapped attribute
 assertAnalysisErrorClass(
-  CollectMetrics("event", a :: Nil, testRelation),
+  CollectMetrics("event", a :: Nil, testRelation, 4),
   expectedErrorClass = 
"INVALID_OBSERVED_METRICS.NON_AGGREGATE_FUNC_ARG_IS_ATTRIBUTE",
   expectedMessageParameters = Map("expr" -> "\"a\"")
 )
 
 // Unwrapped non-deterministic expression
 assertAnalysisErrorClass(
-  CollectMetrics("event", Rand(10).as("rnd") :: Nil, testRelation),
+  CollectMetrics("event", Rand(10).as("rnd") :: Nil, testRelation, 5),

Review Comment:
   I think we can always use `0` in this test suite? Unless we have multiple 
`CollectMetrics` in one plan tree



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] cloud-fan commented on a diff in pull request #43010: [SPARK-41086][SQL] Use DataFrame ID to semantically validate CollectMetrics

2023-09-20 Thread via GitHub



cloud-fan commented on code in PR #43010:
URL: https://github.com/apache/spark/pull/43010#discussion_r1332484719


##
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/AnalysisSuite.scala:
##
@@ -779,34 +779,35 @@ class AnalysisSuite extends AnalysisTest with Matchers {
 val literal = Literal(1).as("lit")
 
 // Ok
-assert(CollectMetrics("event", literal :: sum :: random_sum :: Nil, 
testRelation).resolved)
+assert(CollectMetrics("event", literal :: sum :: random_sum :: Nil, 
testRelation, 0).resolved)
 
 // Bad name
-assert(!CollectMetrics("", sum :: Nil, testRelation).resolved)
+assert(!CollectMetrics("", sum :: Nil, testRelation, 1).resolved)
 assertAnalysisErrorClass(
-  CollectMetrics("", sum :: Nil, testRelation),
+  CollectMetrics("", sum :: Nil, testRelation, 1),
   expectedErrorClass = "INVALID_OBSERVED_METRICS.MISSING_NAME",
   expectedMessageParameters = Map(
-"operator" -> "'CollectMetrics , [sum(a#x) AS sum#xL]\n+- 
LocalRelation , [a#x]\n")
+"operator" ->
+  "'CollectMetrics , [sum(a#x) AS sum#xL], 1\n+- LocalRelation 
, [a#x]\n")
 )
 
 // No columns
-assert(!CollectMetrics("evt", Nil, testRelation).resolved)
+assert(!CollectMetrics("evt", Nil, testRelation, 2).resolved)
 
 def checkAnalysisError(exprs: Seq[NamedExpression], errors: String*): Unit 
= {
-  assertAnalysisError(CollectMetrics("event", exprs, testRelation), errors)
+  assertAnalysisError(CollectMetrics("event", exprs, testRelation, 3), 
errors)
 }
 
 // Unwrapped attribute
 assertAnalysisErrorClass(
-  CollectMetrics("event", a :: Nil, testRelation),
+  CollectMetrics("event", a :: Nil, testRelation, 4),
   expectedErrorClass = 
"INVALID_OBSERVED_METRICS.NON_AGGREGATE_FUNC_ARG_IS_ATTRIBUTE",
   expectedMessageParameters = Map("expr" -> "\"a\"")
 )
 
 // Unwrapped non-deterministic expression
 assertAnalysisErrorClass(
-  CollectMetrics("event", Rand(10).as("rnd") :: Nil, testRelation),
+  CollectMetrics("event", Rand(10).as("rnd") :: Nil, testRelation, 5),

Review Comment:
   I think we can always use `0` in this test suite?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] cloud-fan commented on a diff in pull request #43010: [SPARK-41086][SQL] Use DataFrame ID to semantically validate CollectMetrics

2023-09-20 Thread via GitHub



cloud-fan commented on code in PR #43010:
URL: https://github.com/apache/spark/pull/43010#discussion_r1332484136


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala:
##
@@ -1097,17 +1097,15 @@ trait CheckAnalysis extends PredicateHelper with 
LookupCatalog with QueryErrorsB
* are allowed (e.g. self-joins).
*/
   private def checkCollectedMetrics(plan: LogicalPlan): Unit = {
-val metricsMap = mutable.Map.empty[String, LogicalPlan]
+val metricsMap = mutable.Map.empty[String, CollectMetrics]
 def check(plan: LogicalPlan): Unit = plan.foreach { node =>
   node match {
-case metrics @ CollectMetrics(name, _, _) =>
-  val simplifiedMetrics = 
simplifyPlanForCollectedMetrics(metrics.canonicalized)

Review Comment:
   nit: we can remove `simplifyPlanForCollectedMetrics` now



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] cloud-fan commented on a diff in pull request #43010: [SPARK-41086][SQL] Use DataFrame ID to semantically validate CollectMetrics

2023-09-20 Thread via GitHub



cloud-fan commented on code in PR #43010:
URL: https://github.com/apache/spark/pull/43010#discussion_r1332483781


##
python/pyspark/sql/connect/plan.py:
##
@@ -1197,6 +1197,7 @@ def plan(self, session: "SparkConnectClient") -> 
proto.Relation:
 plan.collect_metrics.input.CopyFrom(self._child.plan(session))
 plan.collect_metrics.name = self._name
 plan.collect_metrics.metrics.extend([self.col_to_expr(x, session) for 
x in self._exprs])
+plan.collect_metrics.dataframe_id = self._child._plan_id

Review Comment:
   does Spark Connect DataFrame also have the unique id? cc @HyukjinKwon 
@zhengruifeng 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] itholic opened a new pull request, #43024: [SPARK-45246][BUILD][PS] Encourage using latest `jinja2` other than documentation build

2023-09-20 Thread via GitHub



itholic opened a new pull request, #43024:
URL: https://github.com/apache/spark/pull/43024

   ### What changes were proposed in this pull request?
   
   This PR proposes to update the `jinja2` version from `dev/requirements.txt` 
to the latest version for PySpark.
   
   
   ### Why are the changes needed?
   
   Currently, `jinja2<3.0.0` is installed when running `pip install -r 
dev/requirements.txt`, but it's not working for some functionality for Pandas 
API on Spark.
   
   Because Pandas requires the latest version of `jinja2` for their LaTex 
render engine, so Pandas API on Spark also requires the same version since 
we're leverage the Pandas internally.
   
   https://github.com/apache/spark/assets/44108233/1f0bbb15-829c-473b-bb47-e602fa6cb04d;>
   
   ### Does this PR introduce _any_ user-facing change?
   
   No.
   
   ### How was this patch tested?
   
   The existing CI should pass.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] rangadi commented on a diff in pull request #43023: [SPARK-45245] PythonWorkerFactory: Timeout if worker does not connect back.

2023-09-20 Thread via GitHub



rangadi commented on code in PR #43023:
URL: https://github.com/apache/spark/pull/43023#discussion_r1332415585


##
core/src/main/scala/org/apache/spark/api/python/PythonWorkerFactory.scala:
##
@@ -184,10 +185,20 @@ private[spark] class PythonWorkerFactory(
   redirectStreamsToStderr(workerProcess.getInputStream, 
workerProcess.getErrorStream)
 
   // Wait for it to connect to our socket, and validate the auth secret.
-  serverSocketChannel.socket().setSoTimeout(1)

Review Comment:
   There was supposed to be 10 second timeout. But this call does not seem to 
affect `serverSocketChannel.accept()`.  
   This set might only take effect if we did 
`serverSocketChannel.socket().accept()`, but hat returns a socket, not a 
channel.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] rangadi commented on pull request #43023: [SPARK-45245] PythonWorkerFactory: Timeout if worker does not connect back.

2023-09-20 Thread via GitHub



rangadi commented on PR #43023:
URL: https://github.com/apache/spark/pull/43023#issuecomment-1728734215

   cc: @HyukjinKwon, @WweiL 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] rangadi opened a new pull request, #43023: [SPARK-45245] PythonWorkerFactory: Timeout if worker does not connect back.

2023-09-20 Thread via GitHub



rangadi opened a new pull request, #43023:
URL: https://github.com/apache/spark/pull/43023

   ### What changes were proposed in this pull request?
   
   `createSimpleWorker()` method in `PythonWorkerFactory` waits forever if the 
worker fails to connect back to the server. 
   
   This is because it calls accept() without a timeout. If the worker does not 
connect back, accept() waits forever. There is supposed to be 10 seconds 
timeout, but it was not implemented correctly.
   
   This PR adds a 10 second timeout. 
   
   ### Why are the changes needed?
   
   Otherwise create method could be stuck forever.
   
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   
   ### How was this patch tested?
- Unit test
- Manual
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   Generated-by: ChatGPT 4.0
   Asked ChatGPT to generate sample code to do non-blocking accept() on a 
socket channel in Java.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] zwangsheng commented on pull request #43022: [SPARK-45244][IT] Correct spelling in VolcanoTestsSuite

2023-09-20 Thread via GitHub



zwangsheng commented on PR #43022:
URL: https://github.com/apache/spark/pull/43022#issuecomment-1728726734

   friendly ping @Yikun, PTAL.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] zwangsheng opened a new pull request, #43022: [SPARK-45244][IT] Correct spelling in VolcanoTestsSuite

2023-09-20 Thread via GitHub



zwangsheng opened a new pull request, #43022:
URL: https://github.com/apache/spark/pull/43022

   
   
   ### What changes were proposed in this pull request?
   
   
   
   ### Why are the changes needed?
   
   Correct typo in VolcanoTestsSuite, which naming methods with 
`checkAnnotaion`.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No
   
   ### How was this patch tested?
   
   Exited UT
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] HeartSaVioR commented on pull request #42895: [SPARK-45138][SS] Define a new error class and apply it when checkpointing state to DFS fails

2023-09-20 Thread via GitHub



HeartSaVioR commented on PR #42895:
URL: https://github.com/apache/spark/pull/42895#issuecomment-1728709440

   UPDATE: we sorted the issue with the JIRA account out now.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] HeartSaVioR commented on pull request #42895: [SPARK-45138][SS] Define a new error class and apply it when checkpointing state to DFS fails

2023-09-20 Thread via GitHub



HeartSaVioR commented on PR #42895:
URL: https://github.com/apache/spark/pull/42895#issuecomment-1728702545

   Please send the mail to priv...@spark.apache.org and look for action from 
PMC members.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] itholic commented on a diff in pull request #40420: [SPARK-42617][PS] Support `isocalendar` from the pandas 2.0.0

2023-09-20 Thread via GitHub



itholic commented on code in PR #40420:
URL: https://github.com/apache/spark/pull/40420#discussion_r1332390021


##
python/pyspark/pandas/datetimes.py:
##
@@ -116,26 +117,59 @@ def pandas_microsecond(s) -> ps.Series[np.int32]:  # 
type: ignore[no-untyped-def
 def nanosecond(self) -> "ps.Series":
 raise NotImplementedError()
 
-# TODO(SPARK-42617): Support isocalendar.week and replace it.
-# See also https://github.com/pandas-dev/pandas/pull/33595.
-@property
-def week(self) -> "ps.Series":
+def isocalendar(self) -> "ps.DataFrame":
 """
-The week ordinal of the year.
+Calculate year, week, and day according to the ISO 8601 standard.
 
-.. deprecated:: 3.4.0
-"""
-warnings.warn(
-"weekofyear and week have been deprecated.",
-FutureWarning,
-)
-return self._data.spark.transform(lambda c: 
F.weekofyear(c).cast(LongType()))
+.. versionadded:: 4.0.0
 
-@property
-def weekofyear(self) -> "ps.Series":
-return self.week
+Returns
+---
+DataFrame
+With columns year, week and day.
 
-weekofyear.__doc__ = week.__doc__
+.. note:: Returns have int64 type instead of UInt32 as is in pandas 
due to UInt32
+is not supported by spark
+
+Examples
+
+>>> dfs = ps.from_pandas(pd.date_range(start='2019-12-29', freq='D', 
periods=4).to_series())
+>>> dfs.dt.isocalendar()
+year  week  day
+2019-12-29  2019527
+2019-12-30  2020 11
+2019-12-31  2020 12
+2020-01-01  2020 13
+
+>>> dfs.dt.isocalendar().week
+2019-12-2952
+2019-12-30 1
+2019-12-31 1
+2020-01-01 1
+Name: week, dtype: int64
+"""
+
+return_types = [self._data.index.dtype, int, int, int]
+
+def pandas_isocalendar(  # type: ignore[no-untyped-def]
+pdf,
+) -> ps.DataFrame[return_types]:  # type: ignore[valid-type]

Review Comment:
   ```suggestion
   def pandas_isocalendar(
   pdf: pd.DataFrame,
   ) -> ps.DataFrame[Any]: 
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] neilramaswamy commented on pull request #42895: [SPARK-45138][SS] Define a new error class and apply it when checkpointing state to DFS fails

2023-09-20 Thread via GitHub



neilramaswamy commented on PR #42895:
URL: https://github.com/apache/spark/pull/42895#issuecomment-1728699388

   My account hasn't been approved yet (submitted request 8 days ago). I 
vaguely remember there being an email/channel to ping about this, but I can't 
find the name. Do you recall?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] HeartSaVioR commented on pull request #42895: [SPARK-45138][SS] Define a new error class and apply it when checkpointing state to DFS fails

2023-09-20 Thread via GitHub



HeartSaVioR commented on PR #42895:
URL: https://github.com/apache/spark/pull/42895#issuecomment-1728695227

   @neilramaswamy Could you please give your JIRA account so that I can add you 
as contributor and assign the ticket to you? Thanks!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] HeartSaVioR closed pull request #42895: [SPARK-45138][SS] Define a new error class and apply it when checkpointing state to DFS fails

2023-09-20 Thread via GitHub



HeartSaVioR closed pull request #42895: [SPARK-45138][SS] Define a new error 
class and apply it when checkpointing state to DFS fails
URL: https://github.com/apache/spark/pull/42895


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] HeartSaVioR commented on pull request #42895: [SPARK-45138][SS] Define a new error class and apply it when checkpointing state to DFS fails

2023-09-20 Thread via GitHub



HeartSaVioR commented on PR #42895:
URL: https://github.com/apache/spark/pull/42895#issuecomment-1728693932

   Thanks! Merging to master.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] HeartSaVioR commented on pull request #42895: [SPARK-45138][SS] Define a new error class and apply it when checkpointing state to DFS fails

2023-09-20 Thread via GitHub



HeartSaVioR commented on PR #42895:
URL: https://github.com/apache/spark/pull/42895#issuecomment-1728693569

   Looks like these test failures are flaky. As long as I see the previous test 
failure is gone, I could say this PR passes all tests.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] itholic commented on a diff in pull request #40420: [SPARK-42617][PS] Support `isocalendar` from the pandas 2.0.0

2023-09-20 Thread via GitHub



itholic commented on code in PR #40420:
URL: https://github.com/apache/spark/pull/40420#discussion_r1332376664


##
python/pyspark/pandas/datetimes.py:
##
@@ -116,26 +117,59 @@ def pandas_microsecond(s) -> ps.Series[np.int32]:  # 
type: ignore[no-untyped-def
 def nanosecond(self) -> "ps.Series":
 raise NotImplementedError()
 
-# TODO(SPARK-42617): Support isocalendar.week and replace it.
-# See also https://github.com/pandas-dev/pandas/pull/33595.
-@property
-def week(self) -> "ps.Series":
+def isocalendar(self) -> "ps.DataFrame":
 """
-The week ordinal of the year.
+Calculate year, week, and day according to the ISO 8601 standard.
 
-.. deprecated:: 3.4.0
-"""
-warnings.warn(
-"weekofyear and week have been deprecated.",
-FutureWarning,
-)
-return self._data.spark.transform(lambda c: 
F.weekofyear(c).cast(LongType()))
+.. versionadded:: 4.0.0
 
-@property
-def weekofyear(self) -> "ps.Series":
-return self.week
+Returns
+---
+DataFrame
+With columns year, week and day.
 
-weekofyear.__doc__ = week.__doc__
+.. note:: Returns have int64 type instead of UInt32 as is in pandas 
due to UInt32
+is not supported by spark
+
+Examples
+

Review Comment:
   We can merge it before Spark 4.0.0, so we have enough time tho.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] itholic commented on a diff in pull request #40420: [SPARK-42617][PS] Support `isocalendar` from the pandas 2.0.0

2023-09-20 Thread via GitHub



itholic commented on code in PR #40420:
URL: https://github.com/apache/spark/pull/40420#discussion_r1332376264


##
python/pyspark/pandas/datetimes.py:
##
@@ -116,26 +117,59 @@ def pandas_microsecond(s) -> ps.Series[np.int32]:  # 
type: ignore[no-untyped-def
 def nanosecond(self) -> "ps.Series":
 raise NotImplementedError()
 
-# TODO(SPARK-42617): Support isocalendar.week and replace it.
-# See also https://github.com/pandas-dev/pandas/pull/33595.
-@property
-def week(self) -> "ps.Series":
+def isocalendar(self) -> "ps.DataFrame":
 """
-The week ordinal of the year.
+Calculate year, week, and day according to the ISO 8601 standard.
 
-.. deprecated:: 3.4.0
-"""
-warnings.warn(
-"weekofyear and week have been deprecated.",
-FutureWarning,
-)
-return self._data.spark.transform(lambda c: 
F.weekofyear(c).cast(LongType()))
+.. versionadded:: 4.0.0
 
-@property
-def weekofyear(self) -> "ps.Series":
-return self.week
+Returns
+---
+DataFrame
+With columns year, week and day.
 
-weekofyear.__doc__ = week.__doc__
+.. note:: Returns have int64 type instead of UInt32 as is in pandas 
due to UInt32
+is not supported by spark
+
+Examples
+

Review Comment:
   Sure! It's not very urgent, so please take your time



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] HeartSaVioR commented on pull request #42940: [SPARK-45178][SS] Fallback to execute a single batch for Trigger.AvailableNow with unsupported sources rather than using wrapper

2023-09-20 Thread via GitHub



HeartSaVioR commented on PR #42940:
URL: https://github.com/apache/spark/pull/42940#issuecomment-1728653787

   https://lists.apache.org/thread/ljronxf6bymvqjmlwpzy84gzgvnqrmoh
   ^^^ DISCUSSION thread in dev@.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] HyukjinKwon commented on a diff in pull request #43019: [SPARK-45219][PYTHON][DOCS] Refine docstring of withColumn(s)Renamed

2023-09-20 Thread via GitHub



HyukjinKwon commented on code in PR #43019:
URL: https://github.com/apache/spark/pull/43019#discussion_r1332312402


##
python/pyspark/sql/dataframe.py:
##
@@ -5797,25 +5798,53 @@ def withColumnRenamed(self, existing: str, new: str) -> 
"DataFrame":
 Parameters
 --
 existing : str
-string, name of the existing column to rename.
+The name of the existing column to be renamed.
 new : str
-string, new name of the column.
+The new name to be assigned to the column.
 
 Returns
 ---
 :class:`DataFrame`
-DataFrame with renamed column.
+A new DataFrame with renamed column.
+
+See Also
+
+:meth:`withColumnsRenamed`
 
 Examples
 
 >>> df = spark.createDataFrame([(2, "Alice"), (5, "Bob")], 
schema=["age", "name"])
->>> df.withColumnRenamed('age', 'age2').show()
+
+Example 1: Rename a single column
+
+>>> df.withColumnRenamed("age", "age2").show()
 ++-+
 |age2| name|
 ++-+
 |   2|Alice|
 |   5|  Bob|
 ++-+
+
+Example 2: Rename a column that does not exist (no-op)
+
+>>> df.withColumnRenamed("non_existing", "new_name").show()
++---+-+
+|age| name|
++---+-+
+|  2|Alice|
+|  5|  Bob|
++---+-+
+
+Example 3: Rename multiple columns
+
+>>> df.withColumnRenamed("age", "age2") \\
+... .withColumnRenamed("name", "name2").show()

Review Comment:
   actually, let's don't use `\\`. That's sort of discouraged by PEP 8., and 
seems it fits within 100 characters.
   
   ```suggestion
   >>> df.withColumnRenamed("age", "age2").withColumnRenamed("name", 
"name2").show()
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] zhengruifeng commented on pull request #43011: [WIP][SPARK-45232][DOCS] Add missing function groups to SQL references

2023-09-20 Thread via GitHub



zhengruifeng commented on PR #43011:
URL: https://github.com/apache/spark/pull/43011#issuecomment-1728625579

   we can check the documents built in the GA of this PR,  
https://github.com/zhengruifeng/spark/actions/runs/6249096629
   
   
![image](https://github.com/apache/spark/assets/7322292/ea1f1389-754b-479c-9f7c-1b9083180667)
   
   
   however, it expires after one day


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] zhengruifeng commented on pull request #43011: [WIP][SPARK-45232][DOCS] Add missing function groups to SQL references

2023-09-20 Thread via GitHub



zhengruifeng commented on PR #43011:
URL: https://github.com/apache/spark/pull/43011#issuecomment-1728624460

   
   
![image](https://github.com/apache/spark/assets/7322292/09ae6ec9-a2a6-4f00-b260-6c680ff3)
   
   
![image](https://github.com/apache/spark/assets/7322292/15308ec1-aa0a-4495-9769-3e8707662b89)
   
   
![image](https://github.com/apache/spark/assets/7322292/fdd0e404-bffd-4663-bec1-82ead49052ce)
   
   
![image](https://github.com/apache/spark/assets/7322292/b8c11c51-979a-4f8e-9c29-75769b52dae1)
   
   
![image](https://github.com/apache/spark/assets/7322292/0f5426c4-6086-46b2-8d67-05f57e10f0fb)
   
   
![image](https://github.com/apache/spark/assets/7322292/066e7bea-646d-4bca-8d14-7e72e91aefd1)
   
   
![image](https://github.com/apache/spark/assets/7322292/e94734d1-3894-453f-b2e0-968e619ab513)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] zhengruifeng commented on pull request #42382: [ML] Remove usage of RDD APIs for load/save in spark-ml

2023-09-20 Thread via GitHub



zhengruifeng commented on PR #42382:
URL: https://github.com/apache/spark/pull/42382#issuecomment-1728616265

   what about adding a implicit conversions `sc -> spark` in ml:
   ml only use `spark`; 3-rd lib will need to import this implicit conversion;
   
   or
   
   just keep two `saveMetadata` methods?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] HyukjinKwon commented on a diff in pull request #43019: [SPARK-45219][PYTHON][DOCS] Refine docstring of withColumn(s)Renamed

2023-09-20 Thread via GitHub



HyukjinKwon commented on code in PR #43019:
URL: https://github.com/apache/spark/pull/43019#discussion_r1332312402


##
python/pyspark/sql/dataframe.py:
##
@@ -5797,25 +5798,53 @@ def withColumnRenamed(self, existing: str, new: str) -> 
"DataFrame":
 Parameters
 --
 existing : str
-string, name of the existing column to rename.
+The name of the existing column to be renamed.
 new : str
-string, new name of the column.
+The new name to be assigned to the column.
 
 Returns
 ---
 :class:`DataFrame`
-DataFrame with renamed column.
+A new DataFrame with renamed column.
+
+See Also
+
+:meth:`withColumnsRenamed`
 
 Examples
 
 >>> df = spark.createDataFrame([(2, "Alice"), (5, "Bob")], 
schema=["age", "name"])
->>> df.withColumnRenamed('age', 'age2').show()
+
+Example 1: Rename a single column
+
+>>> df.withColumnRenamed("age", "age2").show()
 ++-+
 |age2| name|
 ++-+
 |   2|Alice|
 |   5|  Bob|
 ++-+
+
+Example 2: Rename a column that does not exist (no-op)
+
+>>> df.withColumnRenamed("non_existing", "new_name").show()
++---+-+
+|age| name|
++---+-+
+|  2|Alice|
+|  5|  Bob|
++---+-+
+
+Example 3: Rename multiple columns
+
+>>> df.withColumnRenamed("age", "age2") \\
+... .withColumnRenamed("name", "name2").show()

Review Comment:
   actually, let's don't use `\\`. That's sort of discouraged by PEP 8., and 
seems it fits within 100 characters.
   
   ```suggestion
   >>> df.withColumnRenamed("age", "age2").withColumnRenamed("name", 
"name2").show()
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] xiongbo-sjtu opened a new pull request, #43021: [SPARK-45227][CORE] Fix an issue with CoarseGrainedExecutorBackend wh…

2023-09-20 Thread via GitHub



xiongbo-sjtu opened a new pull request, #43021:
URL: https://github.com/apache/spark/pull/43021

   ### What changes were proposed in this pull request?
   Fix a subtle thread-safety issue with CoarseGrainedExecutorBackend where an 
executor process randomly gets stuck
   
   ### Why are the changes needed?
   For each executor, the single-threaded dispatcher can run into an "infinite 
loop" (as explained in the SPARK-45227). Once an executor process runs into a 
state, it'd stop launching tasks from the driver or reporting task status back.
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   ### How was this patch tested?
   Existing unit tests
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   No
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] HyukjinKwon closed pull request #43013: [MINOR][DOCS][CONNECT] Update notes about supported modules in PySpark API reference

2023-09-20 Thread via GitHub



HyukjinKwon closed pull request #43013: [MINOR][DOCS][CONNECT] Update notes 
about supported modules in PySpark API reference
URL: https://github.com/apache/spark/pull/43013


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] HyukjinKwon commented on pull request #43013: [MINOR][DOCS][CONNECT] Update notes about supported modules in PySpark API reference

2023-09-20 Thread via GitHub



HyukjinKwon commented on PR #43013:
URL: https://github.com/apache/spark/pull/43013#issuecomment-1728598145

   Merged to master.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] HyukjinKwon commented on pull request #43013: [MINOR][DOCS][CONNECT] Update notes about supported modules in PySpark API reference

2023-09-20 Thread via GitHub



HyukjinKwon commented on PR #43013:
URL: https://github.com/apache/spark/pull/43013#issuecomment-1728597889

   Yeah. We should probably document `pyspark.ml.connect` separately but there 
haven't been examples out yet. Let's document that separately.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] HyukjinKwon closed pull request #42997: [SPARK-45216][SQL] Fix non-deterministic seeded Dataset APIs

2023-09-20 Thread via GitHub



HyukjinKwon closed pull request #42997: [SPARK-45216][SQL] Fix 
non-deterministic seeded Dataset APIs
URL: https://github.com/apache/spark/pull/42997


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] HyukjinKwon commented on pull request #42997: [SPARK-45216][SQL] Fix non-deterministic seeded Dataset APIs

2023-09-20 Thread via GitHub



HyukjinKwon commented on PR #42997:
URL: https://github.com/apache/spark/pull/42997#issuecomment-1728596271

   Merged to master.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] gatorsmile commented on pull request #43011: [WIP][SPARK-45232][DOCS] Add missing function groups to SQL references

2023-09-20 Thread via GitHub



gatorsmile commented on PR #43011:
URL: https://github.com/apache/spark/pull/43011#issuecomment-1728585646

   cc @srielau 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] github-actions[bot] commented on pull request #41384: [SPARK-43297][ML]Use scala parallel collection ParVector to accelarate LocalKMeans.

2023-09-20 Thread via GitHub



github-actions[bot] commented on PR #41384:
URL: https://github.com/apache/spark/pull/41384#issuecomment-1728583712

   We're closing this PR because it hasn't been updated in a while. This isn't 
a judgement on the merit of the PR in any way. It's just a way of keeping the 
PR queue manageable.
   If you'd like to revive this PR, please reopen it and ask a committer to 
remove the Stale tag!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] zhengruifeng commented on pull request #43011: [WIP][SPARK-45232][DOCS] Add missing function groups to SQL references

2023-09-20 Thread via GitHub



zhengruifeng commented on PR #43011:
URL: https://github.com/apache/spark/pull/43011#issuecomment-1728581850

   @allisonwang-db I am not sure, I don't see document for FROM clause, you may 
check 3 places:
   
   - https://spark.apache.org/docs/latest/api/sql/index.html#explode
   - 
https://spark.apache.org/docs/latest/sql-ref-functions-builtin.html#generator-functions
   - https://spark.apache.org/docs/latest/sql-ref-syntax.html


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] itholic commented on pull request #40420: [SPARK-42617][PS] Support `isocalendar` from the pandas 2.0.0

2023-09-20 Thread via GitHub



itholic commented on PR #40420:
URL: https://github.com/apache/spark/pull/40420#issuecomment-1728575805

   Yeah, mypy check always tricky   Let me take a look


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] heyihong commented on pull request #43017: [SPARK-45239] Reduce default spark.connect.jvmStacktrace.maxSize

2023-09-20 Thread via GitHub



heyihong commented on PR #43017:
URL: https://github.com/apache/spark/pull/43017#issuecomment-1728573383

   > Hi @heyihong , since you already have this error enrichment PR: #42987 
Does it make sense to still support this `spark.connect.jvmStacktrace.maxSize` 
in spark connect? Also, just wondering, why do we want to enable this 
`spark.sql.pyspark.jvmStacktrace.enabled` config?
   
   For long term, no. But for short term, we need to be backward compatible 
with the old clients. We should turn off 
`spark.sql.pyspark.jvmStacktrace.enabled` once error enrichment feature is 
done. But we still need to keep the config for a while since it is the only way 
for old clients to get jvmStacktrace in the error messages. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] zhengruifeng commented on pull request #42860: [SPARK-45107][PYTHON][DOCS] Refine docstring of explode

2023-09-20 Thread via GitHub



zhengruifeng commented on PR #42860:
URL: https://github.com/apache/spark/pull/42860#issuecomment-1728571438

   thanks, merged to master


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] zhengruifeng closed pull request #42860: [SPARK-45107][PYTHON][DOCS] Refine docstring of explode

2023-09-20 Thread via GitHub



zhengruifeng closed pull request #42860: [SPARK-45107][PYTHON][DOCS] Refine 
docstring of explode
URL: https://github.com/apache/spark/pull/42860


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] zhengruifeng commented on pull request #43001: [SPARK-45218][PYTHON][DOCS] Refine docstring of Column.isin

2023-09-20 Thread via GitHub



zhengruifeng commented on PR #43001:
URL: https://github.com/apache/spark/pull/43001#issuecomment-1728570548

   thanks, merged to master


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] zhengruifeng closed pull request #43001: [SPARK-45218][PYTHON][DOCS] Refine docstring of Column.isin

2023-09-20 Thread via GitHub



zhengruifeng closed pull request #43001: [SPARK-45218][PYTHON][DOCS] Refine 
docstring of Column.isin
URL: https://github.com/apache/spark/pull/43001


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] zhengruifeng commented on pull request #43012: [SPARK-45234][PYTHON][DOCS] Refine DocString of `regr_*` functions

2023-09-20 Thread via GitHub



zhengruifeng commented on PR #43012:
URL: https://github.com/apache/spark/pull/43012#issuecomment-1728568181

   thanks, merged to master


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] zhengruifeng closed pull request #43012: [SPARK-45234][PYTHON][DOCS] Refine DocString of `regr_*` functions

2023-09-20 Thread via GitHub



zhengruifeng closed pull request #43012: [SPARK-45234][PYTHON][DOCS] Refine 
DocString of `regr_*` functions
URL: https://github.com/apache/spark/pull/43012


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] zhengruifeng commented on a diff in pull request #43014: [SPARK-45235][CONNECT][PYTHON] Support map and array parameters by `sql()`

2023-09-20 Thread via GitHub



zhengruifeng commented on code in PR #43014:
URL: https://github.com/apache/spark/pull/43014#discussion_r1332291118


##
python/pyspark/sql/tests/connect/test_connect_basic.py:
##
@@ -1237,13 +1237,23 @@ def test_sql(self):
 self.assertEqual(1, len(pdf.index))
 
 def test_sql_with_named_args(self):
-df = self.connect.sql("SELECT * FROM range(10) WHERE id > :minId", 
args={"minId": 7})
-df2 = self.spark.sql("SELECT * FROM range(10) WHERE id > :minId", 
args={"minId": 7})
+from pyspark.sql.functions import create_map, lit
+from pyspark.sql.connect.functions import lit as clit
+from pyspark.sql.connect.functions import create_map as ccreate_map

Review Comment:
   @allisonwang-db since they are just used in tests, so I guess doesn't matter.
   
   for new test file, I think we can start with `sf` and `cf`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] HyukjinKwon commented on pull request #42949: [SPARK-45093][CONNECT][PYTHON] Error reporting for addArtifacts query

2023-09-20 Thread via GitHub



HyukjinKwon commented on PR #42949:
URL: https://github.com/apache/spark/pull/42949#issuecomment-1728565241

   Mind running `dev/reformat-python` one more last time? Seems like there's a 
diff between old and new black versions.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] xiongbo-sjtu closed pull request #43020: [SPARK-45227][CORE] Fix an issue with CoarseGrainedExecutorBackend

2023-09-20 Thread via GitHub



xiongbo-sjtu closed pull request #43020: [SPARK-45227][CORE] Fix an issue with 
CoarseGrainedExecutorBackend
URL: https://github.com/apache/spark/pull/43020


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] xiongbo-sjtu opened a new pull request, #43020: [SPARK-45227][CORE] Fix an issue with CoarseGrainedExecutorBackend

2023-09-20 Thread via GitHub



xiongbo-sjtu opened a new pull request, #43020:
URL: https://github.com/apache/spark/pull/43020

   ### What changes were proposed in this pull request?
   Fix a subtle thread-safety issue with CoarseGrainedExecutorBackend where an 
executor process randomly gets stuck
   
   ### Why are the changes needed?
   For each executor, the single-threaded dispatcher can run into an "infinite 
loop" (as explained in the SPARK-45227). Once an executor process runs into a 
state, it'd stop launching tasks from the driver or reporting task status back.
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   ### How was this patch tested?
   Existing unit tests
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   No
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] dongjoon-hyun commented on pull request #43002: [SPARK-43498][PS][TESTS] Enable `StatsTests.test_axis_on_dataframe` for pandas 2.0.0.

2023-09-20 Thread via GitHub



dongjoon-hyun commented on PR #43002:
URL: https://github.com/apache/spark/pull/43002#issuecomment-1728516043

   +1, LGTM, too.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] amaliujia commented on pull request #43010: [SPARK-41086][SQL] Use DataFrame ID to semantically validate CollectMetrics

2023-09-20 Thread via GitHub



amaliujia commented on PR #43010:
URL: https://github.com/apache/spark/pull/43010#issuecomment-1728458545

   @cloud-fan trying this idea


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] dongjoon-hyun commented on pull request #42993: [SPARK-45231][INFRA] Remove unrecognized and meaningless command about `Ammonite` from the GA testing workflow.

2023-09-20 Thread via GitHub



dongjoon-hyun commented on PR #42993:
URL: https://github.com/apache/spark/pull/42993#issuecomment-1728412993

   cc @hvanhovell


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] dongjoon-hyun commented on pull request #43008: [SPARK-44113][BUILD][INFRA][DOCS] Drop support for Scala 2.12

2023-09-20 Thread via GitHub



dongjoon-hyun commented on PR #43008:
URL: https://github.com/apache/spark/pull/43008#issuecomment-1728334196

   I'm good with this single PR. Could you resolve the conflict?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] dongjoon-hyun closed pull request #43018: [SPARK-45241][INFRA] Use Zulu JDK in `build_and_test` and `build_java21` GitHub Action and Java 21

2023-09-20 Thread via GitHub



dongjoon-hyun closed pull request #43018: [SPARK-45241][INFRA] Use Zulu JDK in 
`build_and_test` and `build_java21` GitHub Action and Java 21
URL: https://github.com/apache/spark/pull/43018


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] dongjoon-hyun commented on pull request #43018: [SPARK-45241][INFRA] Use Zulu JDK in `build_and_test` and `build_java21` GitHub Action and Java 21

2023-09-20 Thread via GitHub



dongjoon-hyun commented on PR #43018:
URL: https://github.com/apache/spark/pull/43018#issuecomment-1728265731

   The one Parquet test failure. I verified manually. It seems to be a flaky 
test.
   ```
   $ java -version
   openjdk version "21" 2023-09-19 LTS
   ...
   
   $ build/sbt "sql/testOnly *.ParquetV1FilterSuite -- -z StringEndsWith"
   [info] ParquetV1FilterSuite:
   11:49:50.643 WARN org.apache.hadoop.util.NativeCodeLoader: Unable to load 
native-hadoop library for your platform... using builtin-java classes where 
applicable
   [info] - filter pushdown - StringEndsWith (6 seconds, 241 milliseconds)
   11:49:57.656 WARN 
org.apache.spark.sql.execution.datasources.parquet.ParquetV1FilterSuite:
   
   = POSSIBLE THREAD LEAK IN SUITE 
o.a.s.sql.execution.datasources.parquet.ParquetV1FilterSuite, threads: 
ForkJoinPool.commonPool-worker-6 (daemon=true), 
ForkJoinPool.commonPool-worker-4 (daemon=true), 
ForkJoinPool.commonPool-worker-7 (daemon=true), rpc-boss-3-1 (daemon=true), 
ForkJoinPool.commonPool-worker-8 (daemon=true), 
ForkJoinPool.commonPool-worker-5 (daemon=true), shuffle-boss-6-1 (daemon=true), 
ForkJoinPool.commonPool-worker-1 (daemon=true), 
ForkJoinPool.commonPool-worker-3 (daemon=true), ForkJoi...
   [info] Run completed in 8 seconds, 36 milliseconds.
   [info] Total number of tests run: 1
   [info] Suites: completed 1, aborted 0
   [info] Tests: succeeded 1, failed 0, canceled 0, ignored 0, pending 0
   [info] All tests passed.
   [success] Total time: 21 s, completed Sep 20, 2023, 11:49:57 AM
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] dongjoon-hyun commented on pull request #43018: [SPARK-45241][INFRA] Use Zulu JDK in `build_and_test` and `build_java21` GitHub Action and Java 21

2023-09-20 Thread via GitHub



dongjoon-hyun commented on PR #43018:
URL: https://github.com/apache/spark/pull/43018#issuecomment-1728258545

   Thank you, @viirya .


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] viirya commented on pull request #43018: [SPARK-45241][INFRA] Use Zulu JDK in `build_and_test` and `build_java21` GitHub Action and Java 21

2023-09-20 Thread via GitHub



viirya commented on PR #43018:
URL: https://github.com/apache/spark/pull/43018#issuecomment-1728214949

   Looks good.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] robreeves closed pull request #43006: [SPARK-TODO][CORE][SQL] Eliminate intermediate staging directory for dynamic partition overwrite commits

2023-09-20 Thread via GitHub



robreeves closed pull request #43006: [SPARK-TODO][CORE][SQL] Eliminate 
intermediate staging directory for dynamic partition overwrite commits
URL: https://github.com/apache/spark/pull/43006


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] dongjoon-hyun commented on pull request #43018: [SPARK-45241][INFRA] Use Zulu JDK in `build_and_test` and `build_java21` GitHub Action and Java 21

2023-09-20 Thread via GitHub



dongjoon-hyun commented on PR #43018:
URL: https://github.com/apache/spark/pull/43018#issuecomment-1728175236

   Could you review this PR, @viirya ?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] dongjoon-hyun commented on pull request #43018: [SPARK-45241][INFRA] Use Zulu JDK in `build_and_test` and `build_java21` GitHub Action and Java 21

2023-09-20 Thread via GitHub



dongjoon-hyun commented on PR #43018:
URL: https://github.com/apache/spark/pull/43018#issuecomment-1728174982

   Java 21 passed. 
   
   ![Screenshot 2023-09-20 at 10 39 23 
AM](https://github.com/apache/spark/assets/9700541/ee99d4e7-cad4-4ad3-9324-631a00b7730a)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] heyihong commented on a diff in pull request #42987: [SPARK-45207][SQL][CONNECT] Implement Error Enrichment for Scala Client

2023-09-20 Thread via GitHub



heyihong commented on code in PR #42987:
URL: https://github.com/apache/spark/pull/42987#discussion_r1331977811


##
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:
##
@@ -2882,8 +2882,7 @@ object SQLConf {
 "level settings.")
   .version("3.0.0")
   .booleanConf
-  // show full stacktrace in tests but hide in production by default.
-  .createWithDefault(Utils.isTesting)

Review Comment:
   Here is the fix: https://github.com/apache/spark/pull/43017



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] sunchao commented on a diff in pull request #42612: [SPARK-44913][SQL] DS V2 supports push down V2 UDF that has magic method

2023-09-20 Thread via GitHub



sunchao commented on code in PR #42612:
URL: https://github.com/apache/spark/pull/42612#discussion_r1331964553


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala:
##
@@ -279,7 +283,9 @@ case class StaticInvoke(
 inputTypes: Seq[AbstractDataType] = Nil,
 propagateNull: Boolean = true,
 returnNullable: Boolean = true,
-isDeterministic: Boolean = true) extends InvokeLike {
+isDeterministic: Boolean = true,
+scalarFunctionName: Option[String] = None,

Review Comment:
   I'm OK to keep as it is. For me it is just a nit and one parameter makes the 
change less intrusive.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] peter-toth commented on a diff in pull request #42997: [SPARK-45216][SQL] Fix non-deterministic seeded Dataset APIs

2023-09-20 Thread via GitHub



peter-toth commented on code in PR #42997:
URL: https://github.com/apache/spark/pull/42997#discussion_r1331955820


##
python/pyspark/sql/connect/functions.py:
##
@@ -388,7 +390,7 @@ def rand(seed: Optional[int] = None) -> Column:
 if seed is not None:
 return _invoke_function("rand", lit(seed))
 else:
-return _invoke_function("rand")
+return _invoke_function("rand", lit(random.randint(0, sys.maxsize)))

Review Comment:
   No, we don't. The sql Python functions just invoke the Scala implementation 
of the functions, which have been fixed.
   
   BTW, I covered sql Python functions with a test: 
https://github.com/apache/spark/pull/42997/files#diff-a4984706554eae562eb0c7a66d284b1855d9bc463b03474f73493c7694244b66R1356-R1366



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] peter-toth commented on a diff in pull request #42997: [SPARK-45216][SQL] Fix non-deterministic seeded Dataset APIs

2023-09-20 Thread via GitHub



peter-toth commented on code in PR #42997:
URL: https://github.com/apache/spark/pull/42997#discussion_r1331955820


##
python/pyspark/sql/connect/functions.py:
##
@@ -388,7 +390,7 @@ def rand(seed: Optional[int] = None) -> Column:
 if seed is not None:
 return _invoke_function("rand", lit(seed))
 else:
-return _invoke_function("rand")
+return _invoke_function("rand", lit(random.randint(0, sys.maxsize)))

Review Comment:
   No, we don't. The sql Python functions just invoke the Scala 
implementations, which has been fixed.
   
   BTW, I covered sql Python functions with a test: 
https://github.com/apache/spark/pull/42997/files#diff-a4984706554eae562eb0c7a66d284b1855d9bc463b03474f73493c7694244b66R1356-R1366



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] allisonwang-db commented on a diff in pull request #42997: [SPARK-45216][SQL] Fix non-deterministic seeded Dataset APIs

2023-09-20 Thread via GitHub



allisonwang-db commented on code in PR #42997:
URL: https://github.com/apache/spark/pull/42997#discussion_r1331943027


##
python/pyspark/sql/connect/functions.py:
##
@@ -388,7 +390,7 @@ def rand(seed: Optional[int] = None) -> Column:
 if seed is not None:
 return _invoke_function("rand", lit(seed))
 else:
-return _invoke_function("rand")
+return _invoke_function("rand", lit(random.randint(0, sys.maxsize)))

Review Comment:
   These are spark connect functions. Do we also need to update 
`python/pyspark/sql/functions.py`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] allisonwang-db opened a new pull request, #43019: [SPARK-45219][PYTHON][DOCS] Refine docstring of withColumn(s)Renamed

2023-09-20 Thread via GitHub



allisonwang-db opened a new pull request, #43019:
URL: https://github.com/apache/spark/pull/43019

   
   
   ### What changes were proposed in this pull request?
   
   This PR refines the docstring of `DataFrame.withColumnRenamed` and 
`DataFrame.withColumnsRenamed`.
   
   ### Why are the changes needed?
   
   To improve PySpark documentations.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No
   
   ### How was this patch tested?
   
   doctest
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] cdkrot commented on a diff in pull request #42949: [SPARK-45093][CONNECT][PYTHON] Error reporting for addArtifacts query

2023-09-20 Thread via GitHub



cdkrot commented on code in PR #42949:
URL: https://github.com/apache/spark/pull/42949#discussion_r1331929198


##
python/pyspark/sql/connect/client/artifact.py:
##
@@ -243,11 +244,15 @@ def _create_requests(
 self, *path: str, pyfile: bool, archive: bool, file: bool
 ) -> Iterator[proto.AddArtifactsRequest]:
 """Separated for the testing purpose."""
-return self._add_artifacts(
-chain(
-*(self._parse_artifacts(p, pyfile=pyfile, archive=archive, 
file=file) for p in path)
+try:
+yield from self._add_artifacts(
+chain(
+*(self._parse_artifacts(p, pyfile=pyfile, archive=archive, 
file=file) for p in path)
+)
 )
-)
+except Exception as e:
+logger.error(f"Failed to submit addArtifacts request: {e}")

Review Comment:
   >> I just think we shouldn't leave users this error message, when they don't 
enable the spark connect logger:
   
   We can wrap the outside exception in some SparkException saying that 
AddArtifacts failed. WDYT?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] allisonwang-db commented on a diff in pull request #43011: [WIP][SPARK-45232][DOC] Add missing function groups to SQL references

2023-09-20 Thread via GitHub



allisonwang-db commented on code in PR #43011:
URL: https://github.com/apache/spark/pull/43011#discussion_r1331925178


##
sql/gen-sql-functions-docs.py:
##
@@ -34,6 +34,8 @@
 "math_funcs", "conditional_funcs", "generator_funcs",
 "predicate_funcs", "string_funcs", "misc_funcs",
 "bitwise_funcs", "conversion_funcs", "csv_funcs",
+"xml_funcs", "lambda_funcs", "collection_funcs",
+"url_funcs", "hash_funcs", "struct_funcs",

Review Comment:
   QQ: For generator_funcs, do we have documentation for them when used in the 
FROM clause of a query? Functions like explode are typically considered 
table-valued generator functions.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] allisonwang-db commented on a diff in pull request #43014: [SPARK-45235][CONNECT][PYTHON] Support map and array parameters by `sql()`

2023-09-20 Thread via GitHub



allisonwang-db commented on code in PR #43014:
URL: https://github.com/apache/spark/pull/43014#discussion_r1331920397


##
python/pyspark/sql/tests/connect/test_connect_basic.py:
##
@@ -1237,13 +1237,23 @@ def test_sql(self):
 self.assertEqual(1, len(pdf.index))
 
 def test_sql_with_named_args(self):
-df = self.connect.sql("SELECT * FROM range(10) WHERE id > :minId", 
args={"minId": 7})
-df2 = self.spark.sql("SELECT * FROM range(10) WHERE id > :minId", 
args={"minId": 7})
+from pyspark.sql.functions import create_map, lit
+from pyspark.sql.connect.functions import lit as clit
+from pyspark.sql.connect.functions import create_map as ccreate_map

Review Comment:
   @zhengruifeng just wondering should `SF` and `CF` be capitalized here? 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] cdkrot commented on a diff in pull request #42949: [SPARK-45093][CONNECT][PYTHON] Error reporting for addArtifacts query

2023-09-20 Thread via GitHub



cdkrot commented on code in PR #42949:
URL: https://github.com/apache/spark/pull/42949#discussion_r1331900167


##
python/pyspark/sql/connect/client/artifact.py:
##
@@ -243,11 +244,15 @@ def _create_requests(
 self, *path: str, pyfile: bool, archive: bool, file: bool
 ) -> Iterator[proto.AddArtifactsRequest]:
 """Separated for the testing purpose."""
-return self._add_artifacts(
-chain(
-*(self._parse_artifacts(p, pyfile=pyfile, archive=archive, 
file=file) for p in path)
+try:
+yield from self._add_artifacts(
+chain(
+*(self._parse_artifacts(p, pyfile=pyfile, archive=archive, 
file=file) for p in path)
+)
 )
-)
+except Exception as e:
+logger.error(f"Failed to submit addArtifacts request: {e}")

Review Comment:
   > Could you be more specific on what Exception can be thrown here? From your 
PR description, it looks like a GPRC error.
   
   Doesn't matter, any exception thrown here would be suppressed by grpc and 
only generic exception "Exception iterating requests" will be visible.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] allisonwang-db commented on a diff in pull request #42949: [SPARK-45093][CONNECT][PYTHON] Error reporting for addArtifacts query

2023-09-20 Thread via GitHub



allisonwang-db commented on code in PR #42949:
URL: https://github.com/apache/spark/pull/42949#discussion_r1331897585


##
python/pyspark/sql/connect/client/artifact.py:
##
@@ -243,11 +244,15 @@ def _create_requests(
 self, *path: str, pyfile: bool, archive: bool, file: bool
 ) -> Iterator[proto.AddArtifactsRequest]:
 """Separated for the testing purpose."""
-return self._add_artifacts(
-chain(
-*(self._parse_artifacts(p, pyfile=pyfile, archive=archive, 
file=file) for p in path)
+try:
+yield from self._add_artifacts(
+chain(
+*(self._parse_artifacts(p, pyfile=pyfile, archive=archive, 
file=file) for p in path)
+)
 )
-)
+except Exception as e:
+logger.error(f"Failed to submit addArtifacts request: {e}")

Review Comment:
   What's the behavior after your change when the SPARK_CONNECT logger is 
enabled?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] allisonwang-db commented on a diff in pull request #42949: [SPARK-45093][CONNECT][PYTHON] Error reporting for addArtifacts query

2023-09-20 Thread via GitHub



allisonwang-db commented on code in PR #42949:
URL: https://github.com/apache/spark/pull/42949#discussion_r1331899412


##
python/pyspark/sql/connect/client/artifact.py:
##
@@ -243,11 +244,15 @@ def _create_requests(
 self, *path: str, pyfile: bool, archive: bool, file: bool
 ) -> Iterator[proto.AddArtifactsRequest]:
 """Separated for the testing purpose."""
-return self._add_artifacts(
-chain(
-*(self._parse_artifacts(p, pyfile=pyfile, archive=archive, 
file=file) for p in path)
+try:
+yield from self._add_artifacts(
+chain(
+*(self._parse_artifacts(p, pyfile=pyfile, archive=archive, 
file=file) for p in path)
+)
 )
-)
+except Exception as e:
+logger.error(f"Failed to submit addArtifacts request: {e}")

Review Comment:
   I just think we shouldn't leave users this error message, when they don't 
enable the spark connect logger:
   ```
   status = StatusCode.UNKNOWN
   details = "Exception iterating requests!"
   debug_error_string = "None"
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] cdkrot commented on a diff in pull request #42949: [SPARK-45093][CONNECT][PYTHON] Error reporting for addArtifacts query

2023-09-20 Thread via GitHub



cdkrot commented on code in PR #42949:
URL: https://github.com/apache/spark/pull/42949#discussion_r1331901403


##
python/pyspark/sql/connect/client/artifact.py:
##
@@ -243,11 +244,15 @@ def _create_requests(
 self, *path: str, pyfile: bool, archive: bool, file: bool
 ) -> Iterator[proto.AddArtifactsRequest]:
 """Separated for the testing purpose."""
-return self._add_artifacts(
-chain(
-*(self._parse_artifacts(p, pyfile=pyfile, archive=archive, 
file=file) for p in path)
+try:
+yield from self._add_artifacts(
+chain(
+*(self._parse_artifacts(p, pyfile=pyfile, archive=archive, 
file=file) for p in path)
+)
 )
-)
+except Exception as e:
+logger.error(f"Failed to submit addArtifacts request: {e}")

Review Comment:
   > What's the behavior after your change when the SPARK_CONNECT logger is 
enabled?
   described in pr
   



##
python/pyspark/sql/connect/client/artifact.py:
##
@@ -243,11 +244,15 @@ def _create_requests(
 self, *path: str, pyfile: bool, archive: bool, file: bool
 ) -> Iterator[proto.AddArtifactsRequest]:
 """Separated for the testing purpose."""
-return self._add_artifacts(
-chain(
-*(self._parse_artifacts(p, pyfile=pyfile, archive=archive, 
file=file) for p in path)
+try:
+yield from self._add_artifacts(
+chain(
+*(self._parse_artifacts(p, pyfile=pyfile, archive=archive, 
file=file) for p in path)
+)
 )
-)
+except Exception as e:
+logger.error(f"Failed to submit addArtifacts request: {e}")

Review Comment:
   > What's the behavior after your change when the SPARK_CONNECT logger is 
enabled?
   
   described in pr
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] cdkrot commented on a diff in pull request #42949: [SPARK-45093][CONNECT][PYTHON] Error reporting for addArtifacts query

2023-09-20 Thread via GitHub



cdkrot commented on code in PR #42949:
URL: https://github.com/apache/spark/pull/42949#discussion_r1331900167


##
python/pyspark/sql/connect/client/artifact.py:
##
@@ -243,11 +244,15 @@ def _create_requests(
 self, *path: str, pyfile: bool, archive: bool, file: bool
 ) -> Iterator[proto.AddArtifactsRequest]:
 """Separated for the testing purpose."""
-return self._add_artifacts(
-chain(
-*(self._parse_artifacts(p, pyfile=pyfile, archive=archive, 
file=file) for p in path)
+try:
+yield from self._add_artifacts(
+chain(
+*(self._parse_artifacts(p, pyfile=pyfile, archive=archive, 
file=file) for p in path)
+)
 )
-)
+except Exception as e:
+logger.error(f"Failed to submit addArtifacts request: {e}")

Review Comment:
   > Could you be more specific on what Exception can be thrown here? From your 
PR description, it looks like a GPRC error.
   
   Doesn't matter, any exception thrown here would be suppressed by grpc and 
only generic exception "Fail to iterate response" will be visible.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] cdkrot commented on a diff in pull request #42949: [SPARK-45093][CONNECT][PYTHON] Error reporting for addArtifacts query

2023-09-20 Thread via GitHub



cdkrot commented on code in PR #42949:
URL: https://github.com/apache/spark/pull/42949#discussion_r1331901152


##
python/pyspark/sql/connect/client/artifact.py:
##
@@ -243,11 +244,15 @@ def _create_requests(
 self, *path: str, pyfile: bool, archive: bool, file: bool
 ) -> Iterator[proto.AddArtifactsRequest]:
 """Separated for the testing purpose."""
-return self._add_artifacts(
-chain(
-*(self._parse_artifacts(p, pyfile=pyfile, archive=archive, 
file=file) for p in path)
+try:
+yield from self._add_artifacts(
+chain(
+*(self._parse_artifacts(p, pyfile=pyfile, archive=archive, 
file=file) for p in path)
+)
 )
-)
+except Exception as e:
+logger.error(f"Failed to submit addArtifacts request: {e}")

Review Comment:
   In my example I'm intercepting FileNotFound error



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] peter-toth commented on pull request #42864: [SPARK-45112][SQL] Use UnresolvedFunction based resolution in SQL Dataset functions

2023-09-20 Thread via GitHub



peter-toth commented on PR #42864:
URL: https://github.com/apache/spark/pull/42864#issuecomment-1728048096

   Thanks for the review!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] dongjoon-hyun closed pull request #43015: [SPARK-45237][DOCS] Change the default value of `spark.history.store.hybridStore.diskBackend` in `monitoring.md` to `ROCKSDB`

2023-09-20 Thread via GitHub



dongjoon-hyun closed pull request #43015: [SPARK-45237][DOCS] Change the 
default value of `spark.history.store.hybridStore.diskBackend` in 
`monitoring.md` to `ROCKSDB`
URL: https://github.com/apache/spark/pull/43015


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] dongjoon-hyun opened a new pull request, #43018: [SPARK-45241][INFRA] Use Zulu JDK in java-other-versions pipeline and Java 21

2023-09-20 Thread via GitHub



dongjoon-hyun opened a new pull request, #43018:
URL: https://github.com/apache/spark/pull/43018

   … 
   
   
   
   ### What changes were proposed in this pull request?
   
   
   
   ### Why are the changes needed?
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   
   ### How was this patch tested?
   
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] dongjoon-hyun commented on pull request #43015: [SPARK-45237][DOCS] Change the default value of `spark.history.store.hybridStore.diskBackend` in `monitoring.md` to `ROCKSDB`

2023-09-20 Thread via GitHub



dongjoon-hyun commented on PR #43015:
URL: https://github.com/apache/spark/pull/43015#issuecomment-1727988603

   Merged to master/3.5/3.4.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] yaooqinn commented on a diff in pull request #42199: [SPARK-44579][SQL] Support Interrupt On Cancel in SQLExecution

2023-09-20 Thread via GitHub



yaooqinn commented on code in PR #42199:
URL: https://github.com/apache/spark/pull/42199#discussion_r1331832544


##
sql/core/src/main/scala/org/apache/spark/sql/execution/SQLExecution.scala:
##
@@ -77,6 +79,11 @@ object SQLExecution {
 }
 val rootExecutionId = sc.getLocalProperty(EXECUTION_ROOT_ID_KEY).toLong
 executionIdToQueryExecution.put(executionId, queryExecution)
+val originalInterruptOnCancel = 
sc.getLocalProperty(SPARK_JOB_INTERRUPT_ON_CANCEL)
+if (originalInterruptOnCancel == null) {
+  val interruptOnCancel = 
sparkSession.conf.get(SQLConf.INTERRUPT_ON_CANCEL)
+  sc.setInterruptOnCancel(interruptOnCancel)

Review Comment:
   STS call cancelJobGroup after query failed



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] allisonwang-db commented on a diff in pull request #42949: [SPARK-45093][CONNECT][PYTHON] Error reporting for addArtifacts query

2023-09-20 Thread via GitHub



allisonwang-db commented on code in PR #42949:
URL: https://github.com/apache/spark/pull/42949#discussion_r1331894962


##
python/pyspark/sql/connect/client/artifact.py:
##
@@ -243,11 +244,15 @@ def _create_requests(
 self, *path: str, pyfile: bool, archive: bool, file: bool
 ) -> Iterator[proto.AddArtifactsRequest]:
 """Separated for the testing purpose."""
-return self._add_artifacts(
-chain(
-*(self._parse_artifacts(p, pyfile=pyfile, archive=archive, 
file=file) for p in path)
+try:
+yield from self._add_artifacts(
+chain(
+*(self._parse_artifacts(p, pyfile=pyfile, archive=archive, 
file=file) for p in path)
+)
 )
-)
+except Exception as e:
+logger.error(f"Failed to submit addArtifacts request: {e}")

Review Comment:
   Could you be more specific on what `Exception` can be thrown here? From your 
PR description, it looks like a GPRC error.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] cloud-fan closed pull request #42864: [SPARK-45112][SQL] Use UnresolvedFunction based resolution in SQL Dataset functions

2023-09-20 Thread via GitHub



cloud-fan closed pull request #42864: [SPARK-45112][SQL] Use UnresolvedFunction 
based resolution in SQL Dataset functions
URL: https://github.com/apache/spark/pull/42864


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] cdkrot commented on a diff in pull request #42949: [SPARK-45093][CONNECT][PYTHON] Error reporting for addArtifacts query

2023-09-20 Thread via GitHub



cdkrot commented on code in PR #42949:
URL: https://github.com/apache/spark/pull/42949#discussion_r1331785918


##
python/pyspark/sql/connect/client/artifact.py:
##
@@ -243,11 +244,15 @@ def _create_requests(
 self, *path: str, pyfile: bool, archive: bool, file: bool
 ) -> Iterator[proto.AddArtifactsRequest]:
 """Separated for the testing purpose."""
-return self._add_artifacts(
-chain(
-*(self._parse_artifacts(p, pyfile=pyfile, archive=archive, 
file=file) for p in path)
+try:
+yield from self._add_artifacts(
+chain(
+*(self._parse_artifacts(p, pyfile=pyfile, archive=archive, 
file=file) for p in path)
+)
 )
-)
+except Exception as e:
+logger.error(f"Failed to submit addArtifacts request: {e}")

Review Comment:
   I don't think we can do something like that here. We are not having 
server-side error response needed to process. We have purely client-side error 
of exception being thrown while grpc consumes iterator of requests. It's not 
possible to extract any information on outside of grpc being called since it 
suppresses error entirely.
   
   I think the only option to make proper exception classes is to preload all 
the artifacts into memory and construct corresponding requests before streaming 
them into grpc.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] cdkrot commented on pull request #42949: [SPARK-45093][CONNECT][PYTHON] Error reporting for addArtifacts query

2023-09-20 Thread via GitHub



cdkrot commented on PR #42949:
URL: https://github.com/apache/spark/pull/42949#issuecomment-1727917945

   Updated fork's master


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] hdaikoku commented on a diff in pull request #42426: [SPARK-44756][CORE] Executor hangs when RetryingBlockTransferor fails to initiate retry

2023-09-20 Thread via GitHub



hdaikoku commented on code in PR #42426:
URL: https://github.com/apache/spark/pull/42426#discussion_r1331742974


##
common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/RetryingBlockTransferor.java:
##
@@ -274,7 +287,13 @@ private void handleBlockTransferFailure(String blockId, 
Throwable exception) {
   synchronized (RetryingBlockTransferor.this) {
 if (this == currentListener && outstandingBlocksIds.contains(blockId)) 
{
   if (shouldRetry(exception)) {
-initiateRetry(exception);
+try {
+  initiateRetry(exception);
+} catch (Throwable t) {
+  logger.error("Exception while trying to initiate retry", t);

Review Comment:
   Thank you for the suggestion, refactored accordingly: 
[32498ff](https://github.com/apache/spark/pull/42426/commits/32498ff9a906a21746b619f1ddd316777d188417)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] cloud-fan commented on pull request #42864: [SPARK-45112][SQL] Use UnresolvedFunction based resolution in SQL Dataset functions

2023-09-20 Thread via GitHub



cloud-fan commented on PR #42864:
URL: https://github.com/apache/spark/pull/42864#issuecomment-1727813318

   thanks, merging to master!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] heyihong commented on a diff in pull request #42987: [SPARK-45207][SQL][CONNECT] Implement Complete Error Reconstruction for Scala Client

2023-09-20 Thread via GitHub



heyihong commented on code in PR #42987:
URL: https://github.com/apache/spark/pull/42987#discussion_r1331247912


##
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:
##
@@ -2882,8 +2882,7 @@ object SQLConf {
 "level settings.")
   .version("3.0.0")
   .booleanConf
-  // show full stacktrace in tests but hide in production by default.
-  .createWithDefault(Utils.isTesting)

Review Comment:
   The size of the ErrorInfo exceeds the header limit sometimes in the tests I 
added. So disable this by default. It should not be an issue after we migrate 
to use FetchErrorDetails RPC



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] heyihong commented on a diff in pull request #42987: [SPARK-45207][SQL][CONNECT] Implement Error Reconstruction for Scala Client

2023-09-20 Thread via GitHub



heyihong commented on code in PR #42987:
URL: https://github.com/apache/spark/pull/42987#discussion_r1331243346


##
connector/connect/common/src/main/scala/org/apache/spark/sql/connect/client/GrpcExceptionConverter.scala:
##
@@ -93,33 +179,65 @@ private[client] object GrpcExceptionConverter extends 
JsonUtils {
   new SparkArrayIndexOutOfBoundsException(message)),
 errorConstructor[DateTimeException]((message, _) => new 
SparkDateTimeException(message)),
 errorConstructor((message, cause) => new SparkRuntimeException(message, 
cause)),
-errorConstructor((message, cause) => new SparkUpgradeException(message, 
cause)))
-
-  private def errorInfoToThrowable(info: ErrorInfo, message: String): 
Option[Throwable] = {
-val classes =
-  mapper.readValue(info.getMetadataOrDefault("classes", "[]"), 
classOf[Array[String]])
+errorConstructor((message, cause) => new SparkUpgradeException(message, 
cause)),
+errorConstructor((message, cause) => new SparkException(message, 
cause.orNull)))
+
+  /**
+   * errorsToThrowable reconstructs the exception based on a list of protobuf 
messages
+   * FetchErrorDetailsResponse.Error with un-truncated error messages and 
server-side stacktrace
+   * (if set).
+   */
+  private def errorsToThrowable(
+  errorIdx: Int,
+  errors: Seq[FetchErrorDetailsResponse.Error]): Throwable = {
+
+val error = errors(errorIdx)
+
+val classHierarchy = error.getErrorTypeHierarchyList.asScala
+
+val constructor =
+  classHierarchy
+.flatMap(errorFactory.get)
+.headOption
+.getOrElse((message: String, cause: Option[Throwable]) =>
+  new SparkException(

Review Comment:
   This is from 
https://github.com/apache/spark/pull/42377#discussion_r1329224205. But this one 
seems to be simpler



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] heyihong opened a new pull request, #43017: [SPARK-45239] Reduce default spark.connect.jvmStacktrace.maxSize

2023-09-20 Thread via GitHub



heyihong opened a new pull request, #43017:
URL: https://github.com/apache/spark/pull/43017

   
   
   ### What changes were proposed in this pull request?
   
   
   - Reduce default spark.connect.jvmStacktrace.maxSize
   
   
   ### Why are the changes needed?
   
   
   - `spark.sql.pyspark.jvmStacktrace.enabled` is partially broken (i.e. 
hitting the 8K header limit) and has to be disable in some tests
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   - No
   
   ### How was this patch tested?
   
   
   - Existing tests
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] heyihong commented on a diff in pull request #42987: [SPARK-45207][SQL][CONNECT] Implement Complete Error Reconstruction for Scala Client

2023-09-20 Thread via GitHub



heyihong commented on code in PR #42987:
URL: https://github.com/apache/spark/pull/42987#discussion_r1331693297


##
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:
##
@@ -2882,8 +2882,7 @@ object SQLConf {
 "level settings.")
   .version("3.0.0")
   .booleanConf
-  // show full stacktrace in tests but hide in production by default.
-  .createWithDefault(Utils.isTesting)

Review Comment:
   The change is rebased already. I reenabled the pyspark jvm stacktrace flag 
for tests but disable this flag in a specific test case instead



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] cloud-fan commented on pull request #42997: [SPARK-45216][SQL] Fix non-deterministic seeded Dataset APIs

2023-09-20 Thread via GitHub



cloud-fan commented on PR #42997:
URL: https://github.com/apache/spark/pull/42997#issuecomment-1727800532

   cc @HyukjinKwon 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] LuciferYang commented on a diff in pull request #43005: [WIP][SPARK-44112][BUILD][INFRA] Drop Java 8 and 11 support

2023-09-20 Thread via GitHub



LuciferYang commented on code in PR #43005:
URL: https://github.com/apache/spark/pull/43005#discussion_r1331245458


##
.github/workflows/build_coverage.yml:
##
@@ -17,7 +17,7 @@
 # under the License.
 #
 
-name: "Build / Coverage (master, Scala 2.12, Hadoop 3, JDK 8)"
+name: "Build / Coverage (master, Scala 2.12, Hadoop 17, JDK 8)"

Review Comment:
   good catch ~ thanks



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] zhengruifeng commented on pull request #43013: [MINOR][DOCS][CONNECT] Update notes about supported modules in PySpark API reference

2023-09-20 Thread via GitHub



zhengruifeng commented on PR #43013:
URL: https://github.com/apache/spark/pull/43013#issuecomment-1727396313

   `/mllib` in scala, `pyspark.ml` and `pyspark.mllib` in python, don't work on 
connect.
   
   only new module `pyspark.ml.connect` works on connect.
   
   `pyspark.ml` contains many classification/clustering/etc, they are not 
supported in `pyspark.ml.connect`
   
   cc @WeichenXu123 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] heyihong commented on a diff in pull request #42987: [SPARK-45207][SQL][CONNECT] Implement Complete Error Reconstruction for Scala Client

2023-09-20 Thread via GitHub



heyihong commented on code in PR #42987:
URL: https://github.com/apache/spark/pull/42987#discussion_r1331247912


##
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:
##
@@ -2882,8 +2882,7 @@ object SQLConf {
 "level settings.")
   .version("3.0.0")
   .booleanConf
-  // show full stacktrace in tests but hide in production by default.
-  .createWithDefault(Utils.isTesting)

Review Comment:
   The size of the ErrorInfo is about to exceed the header limit even in tests. 
So disable this by default.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] cloud-fan commented on pull request #42971: [SPARK-43979][SQL][FOLLOWUP] Handle non alias-only project case

2023-09-20 Thread via GitHub



cloud-fan commented on PR #42971:
URL: https://github.com/apache/spark/pull/42971#issuecomment-1727363616

   The failed streaming test is unrelated, I'm merging it to master/3.5, thanks!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] LuciferYang commented on pull request #43008: [SPARK-44113][BUILD][INFRA][DOCS] Drop support for Scala 2.12

2023-09-20 Thread via GitHub



LuciferYang commented on PR #43008:
URL: https://github.com/apache/spark/pull/43008#issuecomment-1727210378

   cc @dongjoon-hyun FYI
   
   Do we need to further split this PR ?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] dongjoon-hyun closed pull request #43007: [SPARK-45229][CORE][UI] Show the number of drivers waiting in SUBMITTED status in MasterPage

2023-09-20 Thread via GitHub



dongjoon-hyun closed pull request #43007: [SPARK-45229][CORE][UI] Show the 
number of drivers waiting in SUBMITTED status in MasterPage
URL: https://github.com/apache/spark/pull/43007


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] LuciferYang commented on pull request #43005: [WIP][SPARK-44112][BUILD][INFRA] Drop Java 8 and 11 support

2023-09-20 Thread via GitHub



LuciferYang commented on PR #43005:
URL: https://github.com/apache/spark/pull/43005#issuecomment-1727201913

   wait https://github.com/apache/spark/pull/43008


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] cloud-fan commented on a diff in pull request #42199: [SPARK-44579][SQL] Support Interrupt On Cancel in SQLExecution

2023-09-20 Thread via GitHub



cloud-fan commented on code in PR #42199:
URL: https://github.com/apache/spark/pull/42199#discussion_r1331684745


##
sql/core/src/main/scala/org/apache/spark/sql/execution/SQLExecution.scala:
##
@@ -77,6 +79,11 @@ object SQLExecution {
 }
 val rootExecutionId = sc.getLocalProperty(EXECUTION_ROOT_ID_KEY).toLong
 executionIdToQueryExecution.put(executionId, queryExecution)
+val originalInterruptOnCancel = 
sc.getLocalProperty(SPARK_JOB_INTERRUPT_ON_CANCEL)
+if (originalInterruptOnCancel == null) {
+  val interruptOnCancel = 
sparkSession.conf.get(SQLConf.INTERRUPT_ON_CANCEL)
+  sc.setInterruptOnCancel(interruptOnCancel)

Review Comment:
   > does this PR just leverage a feature that SC already have?
   
   Yes, the problem is, now we enable this feature by default, while this 
feature can lead to memory leak.
   
   That's why I asked, how does thriftserver leverage this feature? After this 
PR, all queries will use this feature, not only thriftserver.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

1 2 >

1 - 100 of 166 matches

Mail list logo