[GitHub] [spark] zhengruifeng commented on a diff in pull request #43010: [SPARK-41086][SQL] Use DataFrame ID to semantically validate CollectMetrics
zhengruifeng commented on code in PR #43010: URL: https://github.com/apache/spark/pull/43010#discussion_r1332491666 ## connector/connect/common/src/main/protobuf/spark/connect/relations.proto: ## @@ -990,6 +990,9 @@ message CollectMetrics { // (Required) The metric sequence. repeated Expression metrics = 3; + + // (Required) A unique DataFrame id. + int64 dataframe_id = 4; Review Comment: since `dataframe_id` is set the `plan_id` why not reusing the `plan_id` in `RelationCommon`? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] zhengruifeng commented on a diff in pull request #43010: [SPARK-41086][SQL] Use DataFrame ID to semantically validate CollectMetrics
zhengruifeng commented on code in PR #43010: URL: https://github.com/apache/spark/pull/43010#discussion_r1332489254 ## python/pyspark/sql/connect/plan.py: ## @@ -1197,6 +1197,7 @@ def plan(self, session: "SparkConnectClient") -> proto.Relation: plan.collect_metrics.input.CopyFrom(self._child.plan(session)) plan.collect_metrics.name = self._name plan.collect_metrics.metrics.extend([self.col_to_expr(x, session) for x in self._exprs]) +plan.collect_metrics.dataframe_id = self._child._plan_id Review Comment: the `_plan_id` is unique: https://github.com/apache/spark/blob/5299e5423f439ce702d14666c79f9d42eb446943/python/pyspark/sql/connect/plan.py#L57-L72 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on a diff in pull request #42612: [SPARK-44913][SQL] DS V2 supports push down V2 UDF that has magic method
cloud-fan commented on code in PR #42612: URL: https://github.com/apache/spark/pull/42612#discussion_r1332486360 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala: ## @@ -270,6 +271,8 @@ object SerializerSupport { * non-null value. * @param isDeterministic Whether the method invocation is deterministic or not. If false, Spark *will not apply certain optimizations such as constant folding. + * @param scalarFunction the [[ScalarFunction]] object if this is calling the magic method of the + * [[ScalarFunction]] otherwise is unset. */ case class StaticInvoke( Review Comment: we can override `stringArgs` in `StaticInvoke`, to exclude the new parameter if it's None, to avoid the golden file changes. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on a diff in pull request #43010: [SPARK-41086][SQL] Use DataFrame ID to semantically validate CollectMetrics
cloud-fan commented on code in PR #43010: URL: https://github.com/apache/spark/pull/43010#discussion_r1332484719 ## sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/AnalysisSuite.scala: ## @@ -779,34 +779,35 @@ class AnalysisSuite extends AnalysisTest with Matchers { val literal = Literal(1).as("lit") // Ok -assert(CollectMetrics("event", literal :: sum :: random_sum :: Nil, testRelation).resolved) +assert(CollectMetrics("event", literal :: sum :: random_sum :: Nil, testRelation, 0).resolved) // Bad name -assert(!CollectMetrics("", sum :: Nil, testRelation).resolved) +assert(!CollectMetrics("", sum :: Nil, testRelation, 1).resolved) assertAnalysisErrorClass( - CollectMetrics("", sum :: Nil, testRelation), + CollectMetrics("", sum :: Nil, testRelation, 1), expectedErrorClass = "INVALID_OBSERVED_METRICS.MISSING_NAME", expectedMessageParameters = Map( -"operator" -> "'CollectMetrics , [sum(a#x) AS sum#xL]\n+- LocalRelation , [a#x]\n") +"operator" -> + "'CollectMetrics , [sum(a#x) AS sum#xL], 1\n+- LocalRelation , [a#x]\n") ) // No columns -assert(!CollectMetrics("evt", Nil, testRelation).resolved) +assert(!CollectMetrics("evt", Nil, testRelation, 2).resolved) def checkAnalysisError(exprs: Seq[NamedExpression], errors: String*): Unit = { - assertAnalysisError(CollectMetrics("event", exprs, testRelation), errors) + assertAnalysisError(CollectMetrics("event", exprs, testRelation, 3), errors) } // Unwrapped attribute assertAnalysisErrorClass( - CollectMetrics("event", a :: Nil, testRelation), + CollectMetrics("event", a :: Nil, testRelation, 4), expectedErrorClass = "INVALID_OBSERVED_METRICS.NON_AGGREGATE_FUNC_ARG_IS_ATTRIBUTE", expectedMessageParameters = Map("expr" -> "\"a\"") ) // Unwrapped non-deterministic expression assertAnalysisErrorClass( - CollectMetrics("event", Rand(10).as("rnd") :: Nil, testRelation), + CollectMetrics("event", Rand(10).as("rnd") :: Nil, testRelation, 5), Review Comment: I think we can always use `0` in this test suite? Unless we have multiple `CollectMetrics` in one plan tree -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on a diff in pull request #43010: [SPARK-41086][SQL] Use DataFrame ID to semantically validate CollectMetrics
cloud-fan commented on code in PR #43010: URL: https://github.com/apache/spark/pull/43010#discussion_r1332484719 ## sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/AnalysisSuite.scala: ## @@ -779,34 +779,35 @@ class AnalysisSuite extends AnalysisTest with Matchers { val literal = Literal(1).as("lit") // Ok -assert(CollectMetrics("event", literal :: sum :: random_sum :: Nil, testRelation).resolved) +assert(CollectMetrics("event", literal :: sum :: random_sum :: Nil, testRelation, 0).resolved) // Bad name -assert(!CollectMetrics("", sum :: Nil, testRelation).resolved) +assert(!CollectMetrics("", sum :: Nil, testRelation, 1).resolved) assertAnalysisErrorClass( - CollectMetrics("", sum :: Nil, testRelation), + CollectMetrics("", sum :: Nil, testRelation, 1), expectedErrorClass = "INVALID_OBSERVED_METRICS.MISSING_NAME", expectedMessageParameters = Map( -"operator" -> "'CollectMetrics , [sum(a#x) AS sum#xL]\n+- LocalRelation , [a#x]\n") +"operator" -> + "'CollectMetrics , [sum(a#x) AS sum#xL], 1\n+- LocalRelation , [a#x]\n") ) // No columns -assert(!CollectMetrics("evt", Nil, testRelation).resolved) +assert(!CollectMetrics("evt", Nil, testRelation, 2).resolved) def checkAnalysisError(exprs: Seq[NamedExpression], errors: String*): Unit = { - assertAnalysisError(CollectMetrics("event", exprs, testRelation), errors) + assertAnalysisError(CollectMetrics("event", exprs, testRelation, 3), errors) } // Unwrapped attribute assertAnalysisErrorClass( - CollectMetrics("event", a :: Nil, testRelation), + CollectMetrics("event", a :: Nil, testRelation, 4), expectedErrorClass = "INVALID_OBSERVED_METRICS.NON_AGGREGATE_FUNC_ARG_IS_ATTRIBUTE", expectedMessageParameters = Map("expr" -> "\"a\"") ) // Unwrapped non-deterministic expression assertAnalysisErrorClass( - CollectMetrics("event", Rand(10).as("rnd") :: Nil, testRelation), + CollectMetrics("event", Rand(10).as("rnd") :: Nil, testRelation, 5), Review Comment: I think we can always use `0` in this test suite? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on a diff in pull request #43010: [SPARK-41086][SQL] Use DataFrame ID to semantically validate CollectMetrics
cloud-fan commented on code in PR #43010: URL: https://github.com/apache/spark/pull/43010#discussion_r1332484136 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala: ## @@ -1097,17 +1097,15 @@ trait CheckAnalysis extends PredicateHelper with LookupCatalog with QueryErrorsB * are allowed (e.g. self-joins). */ private def checkCollectedMetrics(plan: LogicalPlan): Unit = { -val metricsMap = mutable.Map.empty[String, LogicalPlan] +val metricsMap = mutable.Map.empty[String, CollectMetrics] def check(plan: LogicalPlan): Unit = plan.foreach { node => node match { -case metrics @ CollectMetrics(name, _, _) => - val simplifiedMetrics = simplifyPlanForCollectedMetrics(metrics.canonicalized) Review Comment: nit: we can remove `simplifyPlanForCollectedMetrics` now -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on a diff in pull request #43010: [SPARK-41086][SQL] Use DataFrame ID to semantically validate CollectMetrics
cloud-fan commented on code in PR #43010: URL: https://github.com/apache/spark/pull/43010#discussion_r1332483781 ## python/pyspark/sql/connect/plan.py: ## @@ -1197,6 +1197,7 @@ def plan(self, session: "SparkConnectClient") -> proto.Relation: plan.collect_metrics.input.CopyFrom(self._child.plan(session)) plan.collect_metrics.name = self._name plan.collect_metrics.metrics.extend([self.col_to_expr(x, session) for x in self._exprs]) +plan.collect_metrics.dataframe_id = self._child._plan_id Review Comment: does Spark Connect DataFrame also have the unique id? cc @HyukjinKwon @zhengruifeng -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] itholic opened a new pull request, #43024: [SPARK-45246][BUILD][PS] Encourage using latest `jinja2` other than documentation build
itholic opened a new pull request, #43024: URL: https://github.com/apache/spark/pull/43024 ### What changes were proposed in this pull request? This PR proposes to update the `jinja2` version from `dev/requirements.txt` to the latest version for PySpark. ### Why are the changes needed? Currently, `jinja2<3.0.0` is installed when running `pip install -r dev/requirements.txt`, but it's not working for some functionality for Pandas API on Spark. Because Pandas requires the latest version of `jinja2` for their LaTex render engine, so Pandas API on Spark also requires the same version since we're leverage the Pandas internally. https://github.com/apache/spark/assets/44108233/1f0bbb15-829c-473b-bb47-e602fa6cb04d;> ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? The existing CI should pass. ### Was this patch authored or co-authored using generative AI tooling? No. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] rangadi commented on a diff in pull request #43023: [SPARK-45245] PythonWorkerFactory: Timeout if worker does not connect back.
rangadi commented on code in PR #43023: URL: https://github.com/apache/spark/pull/43023#discussion_r1332415585 ## core/src/main/scala/org/apache/spark/api/python/PythonWorkerFactory.scala: ## @@ -184,10 +185,20 @@ private[spark] class PythonWorkerFactory( redirectStreamsToStderr(workerProcess.getInputStream, workerProcess.getErrorStream) // Wait for it to connect to our socket, and validate the auth secret. - serverSocketChannel.socket().setSoTimeout(1) Review Comment: There was supposed to be 10 second timeout. But this call does not seem to affect `serverSocketChannel.accept()`. This set might only take effect if we did `serverSocketChannel.socket().accept()`, but hat returns a socket, not a channel. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] rangadi commented on pull request #43023: [SPARK-45245] PythonWorkerFactory: Timeout if worker does not connect back.
rangadi commented on PR #43023: URL: https://github.com/apache/spark/pull/43023#issuecomment-1728734215 cc: @HyukjinKwon, @WweiL -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] rangadi opened a new pull request, #43023: [SPARK-45245] PythonWorkerFactory: Timeout if worker does not connect back.
rangadi opened a new pull request, #43023: URL: https://github.com/apache/spark/pull/43023 ### What changes were proposed in this pull request? `createSimpleWorker()` method in `PythonWorkerFactory` waits forever if the worker fails to connect back to the server. This is because it calls accept() without a timeout. If the worker does not connect back, accept() waits forever. There is supposed to be 10 seconds timeout, but it was not implemented correctly. This PR adds a 10 second timeout. ### Why are the changes needed? Otherwise create method could be stuck forever. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - Unit test - Manual ### Was this patch authored or co-authored using generative AI tooling? Generated-by: ChatGPT 4.0 Asked ChatGPT to generate sample code to do non-blocking accept() on a socket channel in Java. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] zwangsheng commented on pull request #43022: [SPARK-45244][IT] Correct spelling in VolcanoTestsSuite
zwangsheng commented on PR #43022: URL: https://github.com/apache/spark/pull/43022#issuecomment-1728726734 friendly ping @Yikun, PTAL. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] zwangsheng opened a new pull request, #43022: [SPARK-45244][IT] Correct spelling in VolcanoTestsSuite
zwangsheng opened a new pull request, #43022: URL: https://github.com/apache/spark/pull/43022 ### What changes were proposed in this pull request? ### Why are the changes needed? Correct typo in VolcanoTestsSuite, which naming methods with `checkAnnotaion`. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Exited UT ### Was this patch authored or co-authored using generative AI tooling? No -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HeartSaVioR commented on pull request #42895: [SPARK-45138][SS] Define a new error class and apply it when checkpointing state to DFS fails
HeartSaVioR commented on PR #42895: URL: https://github.com/apache/spark/pull/42895#issuecomment-1728709440 UPDATE: we sorted the issue with the JIRA account out now. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HeartSaVioR commented on pull request #42895: [SPARK-45138][SS] Define a new error class and apply it when checkpointing state to DFS fails
HeartSaVioR commented on PR #42895: URL: https://github.com/apache/spark/pull/42895#issuecomment-1728702545 Please send the mail to priv...@spark.apache.org and look for action from PMC members. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] itholic commented on a diff in pull request #40420: [SPARK-42617][PS] Support `isocalendar` from the pandas 2.0.0
itholic commented on code in PR #40420: URL: https://github.com/apache/spark/pull/40420#discussion_r1332390021 ## python/pyspark/pandas/datetimes.py: ## @@ -116,26 +117,59 @@ def pandas_microsecond(s) -> ps.Series[np.int32]: # type: ignore[no-untyped-def def nanosecond(self) -> "ps.Series": raise NotImplementedError() -# TODO(SPARK-42617): Support isocalendar.week and replace it. -# See also https://github.com/pandas-dev/pandas/pull/33595. -@property -def week(self) -> "ps.Series": +def isocalendar(self) -> "ps.DataFrame": """ -The week ordinal of the year. +Calculate year, week, and day according to the ISO 8601 standard. -.. deprecated:: 3.4.0 -""" -warnings.warn( -"weekofyear and week have been deprecated.", -FutureWarning, -) -return self._data.spark.transform(lambda c: F.weekofyear(c).cast(LongType())) +.. versionadded:: 4.0.0 -@property -def weekofyear(self) -> "ps.Series": -return self.week +Returns +--- +DataFrame +With columns year, week and day. -weekofyear.__doc__ = week.__doc__ +.. note:: Returns have int64 type instead of UInt32 as is in pandas due to UInt32 +is not supported by spark + +Examples + +>>> dfs = ps.from_pandas(pd.date_range(start='2019-12-29', freq='D', periods=4).to_series()) +>>> dfs.dt.isocalendar() +year week day +2019-12-29 2019527 +2019-12-30 2020 11 +2019-12-31 2020 12 +2020-01-01 2020 13 + +>>> dfs.dt.isocalendar().week +2019-12-2952 +2019-12-30 1 +2019-12-31 1 +2020-01-01 1 +Name: week, dtype: int64 +""" + +return_types = [self._data.index.dtype, int, int, int] + +def pandas_isocalendar( # type: ignore[no-untyped-def] +pdf, +) -> ps.DataFrame[return_types]: # type: ignore[valid-type] Review Comment: ```suggestion def pandas_isocalendar( pdf: pd.DataFrame, ) -> ps.DataFrame[Any]: ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] neilramaswamy commented on pull request #42895: [SPARK-45138][SS] Define a new error class and apply it when checkpointing state to DFS fails
neilramaswamy commented on PR #42895: URL: https://github.com/apache/spark/pull/42895#issuecomment-1728699388 My account hasn't been approved yet (submitted request 8 days ago). I vaguely remember there being an email/channel to ping about this, but I can't find the name. Do you recall? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HeartSaVioR commented on pull request #42895: [SPARK-45138][SS] Define a new error class and apply it when checkpointing state to DFS fails
HeartSaVioR commented on PR #42895: URL: https://github.com/apache/spark/pull/42895#issuecomment-1728695227 @neilramaswamy Could you please give your JIRA account so that I can add you as contributor and assign the ticket to you? Thanks! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HeartSaVioR closed pull request #42895: [SPARK-45138][SS] Define a new error class and apply it when checkpointing state to DFS fails
HeartSaVioR closed pull request #42895: [SPARK-45138][SS] Define a new error class and apply it when checkpointing state to DFS fails URL: https://github.com/apache/spark/pull/42895 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HeartSaVioR commented on pull request #42895: [SPARK-45138][SS] Define a new error class and apply it when checkpointing state to DFS fails
HeartSaVioR commented on PR #42895: URL: https://github.com/apache/spark/pull/42895#issuecomment-1728693932 Thanks! Merging to master. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HeartSaVioR commented on pull request #42895: [SPARK-45138][SS] Define a new error class and apply it when checkpointing state to DFS fails
HeartSaVioR commented on PR #42895: URL: https://github.com/apache/spark/pull/42895#issuecomment-1728693569 Looks like these test failures are flaky. As long as I see the previous test failure is gone, I could say this PR passes all tests. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] itholic commented on a diff in pull request #40420: [SPARK-42617][PS] Support `isocalendar` from the pandas 2.0.0
itholic commented on code in PR #40420: URL: https://github.com/apache/spark/pull/40420#discussion_r1332376664 ## python/pyspark/pandas/datetimes.py: ## @@ -116,26 +117,59 @@ def pandas_microsecond(s) -> ps.Series[np.int32]: # type: ignore[no-untyped-def def nanosecond(self) -> "ps.Series": raise NotImplementedError() -# TODO(SPARK-42617): Support isocalendar.week and replace it. -# See also https://github.com/pandas-dev/pandas/pull/33595. -@property -def week(self) -> "ps.Series": +def isocalendar(self) -> "ps.DataFrame": """ -The week ordinal of the year. +Calculate year, week, and day according to the ISO 8601 standard. -.. deprecated:: 3.4.0 -""" -warnings.warn( -"weekofyear and week have been deprecated.", -FutureWarning, -) -return self._data.spark.transform(lambda c: F.weekofyear(c).cast(LongType())) +.. versionadded:: 4.0.0 -@property -def weekofyear(self) -> "ps.Series": -return self.week +Returns +--- +DataFrame +With columns year, week and day. -weekofyear.__doc__ = week.__doc__ +.. note:: Returns have int64 type instead of UInt32 as is in pandas due to UInt32 +is not supported by spark + +Examples + Review Comment: We can merge it before Spark 4.0.0, so we have enough time tho. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] itholic commented on a diff in pull request #40420: [SPARK-42617][PS] Support `isocalendar` from the pandas 2.0.0
itholic commented on code in PR #40420: URL: https://github.com/apache/spark/pull/40420#discussion_r1332376264 ## python/pyspark/pandas/datetimes.py: ## @@ -116,26 +117,59 @@ def pandas_microsecond(s) -> ps.Series[np.int32]: # type: ignore[no-untyped-def def nanosecond(self) -> "ps.Series": raise NotImplementedError() -# TODO(SPARK-42617): Support isocalendar.week and replace it. -# See also https://github.com/pandas-dev/pandas/pull/33595. -@property -def week(self) -> "ps.Series": +def isocalendar(self) -> "ps.DataFrame": """ -The week ordinal of the year. +Calculate year, week, and day according to the ISO 8601 standard. -.. deprecated:: 3.4.0 -""" -warnings.warn( -"weekofyear and week have been deprecated.", -FutureWarning, -) -return self._data.spark.transform(lambda c: F.weekofyear(c).cast(LongType())) +.. versionadded:: 4.0.0 -@property -def weekofyear(self) -> "ps.Series": -return self.week +Returns +--- +DataFrame +With columns year, week and day. -weekofyear.__doc__ = week.__doc__ +.. note:: Returns have int64 type instead of UInt32 as is in pandas due to UInt32 +is not supported by spark + +Examples + Review Comment: Sure! It's not very urgent, so please take your time -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HeartSaVioR commented on pull request #42940: [SPARK-45178][SS] Fallback to execute a single batch for Trigger.AvailableNow with unsupported sources rather than using wrapper
HeartSaVioR commented on PR #42940: URL: https://github.com/apache/spark/pull/42940#issuecomment-1728653787 https://lists.apache.org/thread/ljronxf6bymvqjmlwpzy84gzgvnqrmoh ^^^ DISCUSSION thread in dev@. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HyukjinKwon commented on a diff in pull request #43019: [SPARK-45219][PYTHON][DOCS] Refine docstring of withColumn(s)Renamed
HyukjinKwon commented on code in PR #43019: URL: https://github.com/apache/spark/pull/43019#discussion_r1332312402 ## python/pyspark/sql/dataframe.py: ## @@ -5797,25 +5798,53 @@ def withColumnRenamed(self, existing: str, new: str) -> "DataFrame": Parameters -- existing : str -string, name of the existing column to rename. +The name of the existing column to be renamed. new : str -string, new name of the column. +The new name to be assigned to the column. Returns --- :class:`DataFrame` -DataFrame with renamed column. +A new DataFrame with renamed column. + +See Also + +:meth:`withColumnsRenamed` Examples >>> df = spark.createDataFrame([(2, "Alice"), (5, "Bob")], schema=["age", "name"]) ->>> df.withColumnRenamed('age', 'age2').show() + +Example 1: Rename a single column + +>>> df.withColumnRenamed("age", "age2").show() ++-+ |age2| name| ++-+ | 2|Alice| | 5| Bob| ++-+ + +Example 2: Rename a column that does not exist (no-op) + +>>> df.withColumnRenamed("non_existing", "new_name").show() ++---+-+ +|age| name| ++---+-+ +| 2|Alice| +| 5| Bob| ++---+-+ + +Example 3: Rename multiple columns + +>>> df.withColumnRenamed("age", "age2") \\ +... .withColumnRenamed("name", "name2").show() Review Comment: actually, let's don't use `\\`. That's sort of discouraged by PEP 8., and seems it fits within 100 characters. ```suggestion >>> df.withColumnRenamed("age", "age2").withColumnRenamed("name", "name2").show() ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] zhengruifeng commented on pull request #43011: [WIP][SPARK-45232][DOCS] Add missing function groups to SQL references
zhengruifeng commented on PR #43011: URL: https://github.com/apache/spark/pull/43011#issuecomment-1728625579 we can check the documents built in the GA of this PR, https://github.com/zhengruifeng/spark/actions/runs/6249096629 ![image](https://github.com/apache/spark/assets/7322292/ea1f1389-754b-479c-9f7c-1b9083180667) however, it expires after one day -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] zhengruifeng commented on pull request #43011: [WIP][SPARK-45232][DOCS] Add missing function groups to SQL references
zhengruifeng commented on PR #43011: URL: https://github.com/apache/spark/pull/43011#issuecomment-1728624460 ![image](https://github.com/apache/spark/assets/7322292/09ae6ec9-a2a6-4f00-b260-6c680ff3) ![image](https://github.com/apache/spark/assets/7322292/15308ec1-aa0a-4495-9769-3e8707662b89) ![image](https://github.com/apache/spark/assets/7322292/fdd0e404-bffd-4663-bec1-82ead49052ce) ![image](https://github.com/apache/spark/assets/7322292/b8c11c51-979a-4f8e-9c29-75769b52dae1) ![image](https://github.com/apache/spark/assets/7322292/0f5426c4-6086-46b2-8d67-05f57e10f0fb) ![image](https://github.com/apache/spark/assets/7322292/066e7bea-646d-4bca-8d14-7e72e91aefd1) ![image](https://github.com/apache/spark/assets/7322292/e94734d1-3894-453f-b2e0-968e619ab513) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] zhengruifeng commented on pull request #42382: [ML] Remove usage of RDD APIs for load/save in spark-ml
zhengruifeng commented on PR #42382: URL: https://github.com/apache/spark/pull/42382#issuecomment-1728616265 what about adding a implicit conversions `sc -> spark` in ml: ml only use `spark`; 3-rd lib will need to import this implicit conversion; or just keep two `saveMetadata` methods? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HyukjinKwon commented on a diff in pull request #43019: [SPARK-45219][PYTHON][DOCS] Refine docstring of withColumn(s)Renamed
HyukjinKwon commented on code in PR #43019: URL: https://github.com/apache/spark/pull/43019#discussion_r1332312402 ## python/pyspark/sql/dataframe.py: ## @@ -5797,25 +5798,53 @@ def withColumnRenamed(self, existing: str, new: str) -> "DataFrame": Parameters -- existing : str -string, name of the existing column to rename. +The name of the existing column to be renamed. new : str -string, new name of the column. +The new name to be assigned to the column. Returns --- :class:`DataFrame` -DataFrame with renamed column. +A new DataFrame with renamed column. + +See Also + +:meth:`withColumnsRenamed` Examples >>> df = spark.createDataFrame([(2, "Alice"), (5, "Bob")], schema=["age", "name"]) ->>> df.withColumnRenamed('age', 'age2').show() + +Example 1: Rename a single column + +>>> df.withColumnRenamed("age", "age2").show() ++-+ |age2| name| ++-+ | 2|Alice| | 5| Bob| ++-+ + +Example 2: Rename a column that does not exist (no-op) + +>>> df.withColumnRenamed("non_existing", "new_name").show() ++---+-+ +|age| name| ++---+-+ +| 2|Alice| +| 5| Bob| ++---+-+ + +Example 3: Rename multiple columns + +>>> df.withColumnRenamed("age", "age2") \\ +... .withColumnRenamed("name", "name2").show() Review Comment: actually, let's don't use `\\`. That's sort of discouraged by PEP 8., and seems it fits within 100 characters. ```suggestion >>> df.withColumnRenamed("age", "age2").withColumnRenamed("name", "name2").show() ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] xiongbo-sjtu opened a new pull request, #43021: [SPARK-45227][CORE] Fix an issue with CoarseGrainedExecutorBackend wh…
xiongbo-sjtu opened a new pull request, #43021: URL: https://github.com/apache/spark/pull/43021 ### What changes were proposed in this pull request? Fix a subtle thread-safety issue with CoarseGrainedExecutorBackend where an executor process randomly gets stuck ### Why are the changes needed? For each executor, the single-threaded dispatcher can run into an "infinite loop" (as explained in the SPARK-45227). Once an executor process runs into a state, it'd stop launching tasks from the driver or reporting task status back. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing unit tests ### Was this patch authored or co-authored using generative AI tooling? No -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HyukjinKwon closed pull request #43013: [MINOR][DOCS][CONNECT] Update notes about supported modules in PySpark API reference
HyukjinKwon closed pull request #43013: [MINOR][DOCS][CONNECT] Update notes about supported modules in PySpark API reference URL: https://github.com/apache/spark/pull/43013 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HyukjinKwon commented on pull request #43013: [MINOR][DOCS][CONNECT] Update notes about supported modules in PySpark API reference
HyukjinKwon commented on PR #43013: URL: https://github.com/apache/spark/pull/43013#issuecomment-1728598145 Merged to master. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HyukjinKwon commented on pull request #43013: [MINOR][DOCS][CONNECT] Update notes about supported modules in PySpark API reference
HyukjinKwon commented on PR #43013: URL: https://github.com/apache/spark/pull/43013#issuecomment-1728597889 Yeah. We should probably document `pyspark.ml.connect` separately but there haven't been examples out yet. Let's document that separately. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HyukjinKwon closed pull request #42997: [SPARK-45216][SQL] Fix non-deterministic seeded Dataset APIs
HyukjinKwon closed pull request #42997: [SPARK-45216][SQL] Fix non-deterministic seeded Dataset APIs URL: https://github.com/apache/spark/pull/42997 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HyukjinKwon commented on pull request #42997: [SPARK-45216][SQL] Fix non-deterministic seeded Dataset APIs
HyukjinKwon commented on PR #42997: URL: https://github.com/apache/spark/pull/42997#issuecomment-1728596271 Merged to master. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] gatorsmile commented on pull request #43011: [WIP][SPARK-45232][DOCS] Add missing function groups to SQL references
gatorsmile commented on PR #43011: URL: https://github.com/apache/spark/pull/43011#issuecomment-1728585646 cc @srielau -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] github-actions[bot] commented on pull request #41384: [SPARK-43297][ML]Use scala parallel collection ParVector to accelarate LocalKMeans.
github-actions[bot] commented on PR #41384: URL: https://github.com/apache/spark/pull/41384#issuecomment-1728583712 We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] zhengruifeng commented on pull request #43011: [WIP][SPARK-45232][DOCS] Add missing function groups to SQL references
zhengruifeng commented on PR #43011: URL: https://github.com/apache/spark/pull/43011#issuecomment-1728581850 @allisonwang-db I am not sure, I don't see document for FROM clause, you may check 3 places: - https://spark.apache.org/docs/latest/api/sql/index.html#explode - https://spark.apache.org/docs/latest/sql-ref-functions-builtin.html#generator-functions - https://spark.apache.org/docs/latest/sql-ref-syntax.html -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] itholic commented on pull request #40420: [SPARK-42617][PS] Support `isocalendar` from the pandas 2.0.0
itholic commented on PR #40420: URL: https://github.com/apache/spark/pull/40420#issuecomment-1728575805 Yeah, mypy check always tricky Let me take a look -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] heyihong commented on pull request #43017: [SPARK-45239] Reduce default spark.connect.jvmStacktrace.maxSize
heyihong commented on PR #43017: URL: https://github.com/apache/spark/pull/43017#issuecomment-1728573383 > Hi @heyihong , since you already have this error enrichment PR: #42987 Does it make sense to still support this `spark.connect.jvmStacktrace.maxSize` in spark connect? Also, just wondering, why do we want to enable this `spark.sql.pyspark.jvmStacktrace.enabled` config? For long term, no. But for short term, we need to be backward compatible with the old clients. We should turn off `spark.sql.pyspark.jvmStacktrace.enabled` once error enrichment feature is done. But we still need to keep the config for a while since it is the only way for old clients to get jvmStacktrace in the error messages. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] zhengruifeng commented on pull request #42860: [SPARK-45107][PYTHON][DOCS] Refine docstring of explode
zhengruifeng commented on PR #42860: URL: https://github.com/apache/spark/pull/42860#issuecomment-1728571438 thanks, merged to master -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] zhengruifeng closed pull request #42860: [SPARK-45107][PYTHON][DOCS] Refine docstring of explode
zhengruifeng closed pull request #42860: [SPARK-45107][PYTHON][DOCS] Refine docstring of explode URL: https://github.com/apache/spark/pull/42860 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] zhengruifeng commented on pull request #43001: [SPARK-45218][PYTHON][DOCS] Refine docstring of Column.isin
zhengruifeng commented on PR #43001: URL: https://github.com/apache/spark/pull/43001#issuecomment-1728570548 thanks, merged to master -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] zhengruifeng closed pull request #43001: [SPARK-45218][PYTHON][DOCS] Refine docstring of Column.isin
zhengruifeng closed pull request #43001: [SPARK-45218][PYTHON][DOCS] Refine docstring of Column.isin URL: https://github.com/apache/spark/pull/43001 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] zhengruifeng commented on pull request #43012: [SPARK-45234][PYTHON][DOCS] Refine DocString of `regr_*` functions
zhengruifeng commented on PR #43012: URL: https://github.com/apache/spark/pull/43012#issuecomment-1728568181 thanks, merged to master -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] zhengruifeng closed pull request #43012: [SPARK-45234][PYTHON][DOCS] Refine DocString of `regr_*` functions
zhengruifeng closed pull request #43012: [SPARK-45234][PYTHON][DOCS] Refine DocString of `regr_*` functions URL: https://github.com/apache/spark/pull/43012 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] zhengruifeng commented on a diff in pull request #43014: [SPARK-45235][CONNECT][PYTHON] Support map and array parameters by `sql()`
zhengruifeng commented on code in PR #43014: URL: https://github.com/apache/spark/pull/43014#discussion_r1332291118 ## python/pyspark/sql/tests/connect/test_connect_basic.py: ## @@ -1237,13 +1237,23 @@ def test_sql(self): self.assertEqual(1, len(pdf.index)) def test_sql_with_named_args(self): -df = self.connect.sql("SELECT * FROM range(10) WHERE id > :minId", args={"minId": 7}) -df2 = self.spark.sql("SELECT * FROM range(10) WHERE id > :minId", args={"minId": 7}) +from pyspark.sql.functions import create_map, lit +from pyspark.sql.connect.functions import lit as clit +from pyspark.sql.connect.functions import create_map as ccreate_map Review Comment: @allisonwang-db since they are just used in tests, so I guess doesn't matter. for new test file, I think we can start with `sf` and `cf`. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HyukjinKwon commented on pull request #42949: [SPARK-45093][CONNECT][PYTHON] Error reporting for addArtifacts query
HyukjinKwon commented on PR #42949: URL: https://github.com/apache/spark/pull/42949#issuecomment-1728565241 Mind running `dev/reformat-python` one more last time? Seems like there's a diff between old and new black versions. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] xiongbo-sjtu closed pull request #43020: [SPARK-45227][CORE] Fix an issue with CoarseGrainedExecutorBackend
xiongbo-sjtu closed pull request #43020: [SPARK-45227][CORE] Fix an issue with CoarseGrainedExecutorBackend URL: https://github.com/apache/spark/pull/43020 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] xiongbo-sjtu opened a new pull request, #43020: [SPARK-45227][CORE] Fix an issue with CoarseGrainedExecutorBackend
xiongbo-sjtu opened a new pull request, #43020: URL: https://github.com/apache/spark/pull/43020 ### What changes were proposed in this pull request? Fix a subtle thread-safety issue with CoarseGrainedExecutorBackend where an executor process randomly gets stuck ### Why are the changes needed? For each executor, the single-threaded dispatcher can run into an "infinite loop" (as explained in the SPARK-45227). Once an executor process runs into a state, it'd stop launching tasks from the driver or reporting task status back. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing unit tests ### Was this patch authored or co-authored using generative AI tooling? No -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun commented on pull request #43002: [SPARK-43498][PS][TESTS] Enable `StatsTests.test_axis_on_dataframe` for pandas 2.0.0.
dongjoon-hyun commented on PR #43002: URL: https://github.com/apache/spark/pull/43002#issuecomment-1728516043 +1, LGTM, too. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] amaliujia commented on pull request #43010: [SPARK-41086][SQL] Use DataFrame ID to semantically validate CollectMetrics
amaliujia commented on PR #43010: URL: https://github.com/apache/spark/pull/43010#issuecomment-1728458545 @cloud-fan trying this idea -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun commented on pull request #42993: [SPARK-45231][INFRA] Remove unrecognized and meaningless command about `Ammonite` from the GA testing workflow.
dongjoon-hyun commented on PR #42993: URL: https://github.com/apache/spark/pull/42993#issuecomment-1728412993 cc @hvanhovell -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun commented on pull request #43008: [SPARK-44113][BUILD][INFRA][DOCS] Drop support for Scala 2.12
dongjoon-hyun commented on PR #43008: URL: https://github.com/apache/spark/pull/43008#issuecomment-1728334196 I'm good with this single PR. Could you resolve the conflict? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun closed pull request #43018: [SPARK-45241][INFRA] Use Zulu JDK in `build_and_test` and `build_java21` GitHub Action and Java 21
dongjoon-hyun closed pull request #43018: [SPARK-45241][INFRA] Use Zulu JDK in `build_and_test` and `build_java21` GitHub Action and Java 21 URL: https://github.com/apache/spark/pull/43018 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun commented on pull request #43018: [SPARK-45241][INFRA] Use Zulu JDK in `build_and_test` and `build_java21` GitHub Action and Java 21
dongjoon-hyun commented on PR #43018: URL: https://github.com/apache/spark/pull/43018#issuecomment-1728265731 The one Parquet test failure. I verified manually. It seems to be a flaky test. ``` $ java -version openjdk version "21" 2023-09-19 LTS ... $ build/sbt "sql/testOnly *.ParquetV1FilterSuite -- -z StringEndsWith" [info] ParquetV1FilterSuite: 11:49:50.643 WARN org.apache.hadoop.util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable [info] - filter pushdown - StringEndsWith (6 seconds, 241 milliseconds) 11:49:57.656 WARN org.apache.spark.sql.execution.datasources.parquet.ParquetV1FilterSuite: = POSSIBLE THREAD LEAK IN SUITE o.a.s.sql.execution.datasources.parquet.ParquetV1FilterSuite, threads: ForkJoinPool.commonPool-worker-6 (daemon=true), ForkJoinPool.commonPool-worker-4 (daemon=true), ForkJoinPool.commonPool-worker-7 (daemon=true), rpc-boss-3-1 (daemon=true), ForkJoinPool.commonPool-worker-8 (daemon=true), ForkJoinPool.commonPool-worker-5 (daemon=true), shuffle-boss-6-1 (daemon=true), ForkJoinPool.commonPool-worker-1 (daemon=true), ForkJoinPool.commonPool-worker-3 (daemon=true), ForkJoi... [info] Run completed in 8 seconds, 36 milliseconds. [info] Total number of tests run: 1 [info] Suites: completed 1, aborted 0 [info] Tests: succeeded 1, failed 0, canceled 0, ignored 0, pending 0 [info] All tests passed. [success] Total time: 21 s, completed Sep 20, 2023, 11:49:57 AM ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun commented on pull request #43018: [SPARK-45241][INFRA] Use Zulu JDK in `build_and_test` and `build_java21` GitHub Action and Java 21
dongjoon-hyun commented on PR #43018: URL: https://github.com/apache/spark/pull/43018#issuecomment-1728258545 Thank you, @viirya . -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] viirya commented on pull request #43018: [SPARK-45241][INFRA] Use Zulu JDK in `build_and_test` and `build_java21` GitHub Action and Java 21
viirya commented on PR #43018: URL: https://github.com/apache/spark/pull/43018#issuecomment-1728214949 Looks good. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] robreeves closed pull request #43006: [SPARK-TODO][CORE][SQL] Eliminate intermediate staging directory for dynamic partition overwrite commits
robreeves closed pull request #43006: [SPARK-TODO][CORE][SQL] Eliminate intermediate staging directory for dynamic partition overwrite commits URL: https://github.com/apache/spark/pull/43006 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun commented on pull request #43018: [SPARK-45241][INFRA] Use Zulu JDK in `build_and_test` and `build_java21` GitHub Action and Java 21
dongjoon-hyun commented on PR #43018: URL: https://github.com/apache/spark/pull/43018#issuecomment-1728175236 Could you review this PR, @viirya ? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun commented on pull request #43018: [SPARK-45241][INFRA] Use Zulu JDK in `build_and_test` and `build_java21` GitHub Action and Java 21
dongjoon-hyun commented on PR #43018: URL: https://github.com/apache/spark/pull/43018#issuecomment-1728174982 Java 21 passed. ![Screenshot 2023-09-20 at 10 39 23 AM](https://github.com/apache/spark/assets/9700541/ee99d4e7-cad4-4ad3-9324-631a00b7730a) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] heyihong commented on a diff in pull request #42987: [SPARK-45207][SQL][CONNECT] Implement Error Enrichment for Scala Client
heyihong commented on code in PR #42987: URL: https://github.com/apache/spark/pull/42987#discussion_r1331977811 ## sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala: ## @@ -2882,8 +2882,7 @@ object SQLConf { "level settings.") .version("3.0.0") .booleanConf - // show full stacktrace in tests but hide in production by default. - .createWithDefault(Utils.isTesting) Review Comment: Here is the fix: https://github.com/apache/spark/pull/43017 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] sunchao commented on a diff in pull request #42612: [SPARK-44913][SQL] DS V2 supports push down V2 UDF that has magic method
sunchao commented on code in PR #42612: URL: https://github.com/apache/spark/pull/42612#discussion_r1331964553 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala: ## @@ -279,7 +283,9 @@ case class StaticInvoke( inputTypes: Seq[AbstractDataType] = Nil, propagateNull: Boolean = true, returnNullable: Boolean = true, -isDeterministic: Boolean = true) extends InvokeLike { +isDeterministic: Boolean = true, +scalarFunctionName: Option[String] = None, Review Comment: I'm OK to keep as it is. For me it is just a nit and one parameter makes the change less intrusive. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] peter-toth commented on a diff in pull request #42997: [SPARK-45216][SQL] Fix non-deterministic seeded Dataset APIs
peter-toth commented on code in PR #42997: URL: https://github.com/apache/spark/pull/42997#discussion_r1331955820 ## python/pyspark/sql/connect/functions.py: ## @@ -388,7 +390,7 @@ def rand(seed: Optional[int] = None) -> Column: if seed is not None: return _invoke_function("rand", lit(seed)) else: -return _invoke_function("rand") +return _invoke_function("rand", lit(random.randint(0, sys.maxsize))) Review Comment: No, we don't. The sql Python functions just invoke the Scala implementation of the functions, which have been fixed. BTW, I covered sql Python functions with a test: https://github.com/apache/spark/pull/42997/files#diff-a4984706554eae562eb0c7a66d284b1855d9bc463b03474f73493c7694244b66R1356-R1366 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] peter-toth commented on a diff in pull request #42997: [SPARK-45216][SQL] Fix non-deterministic seeded Dataset APIs
peter-toth commented on code in PR #42997: URL: https://github.com/apache/spark/pull/42997#discussion_r1331955820 ## python/pyspark/sql/connect/functions.py: ## @@ -388,7 +390,7 @@ def rand(seed: Optional[int] = None) -> Column: if seed is not None: return _invoke_function("rand", lit(seed)) else: -return _invoke_function("rand") +return _invoke_function("rand", lit(random.randint(0, sys.maxsize))) Review Comment: No, we don't. The sql Python functions just invoke the Scala implementations, which has been fixed. BTW, I covered sql Python functions with a test: https://github.com/apache/spark/pull/42997/files#diff-a4984706554eae562eb0c7a66d284b1855d9bc463b03474f73493c7694244b66R1356-R1366 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] allisonwang-db commented on a diff in pull request #42997: [SPARK-45216][SQL] Fix non-deterministic seeded Dataset APIs
allisonwang-db commented on code in PR #42997: URL: https://github.com/apache/spark/pull/42997#discussion_r1331943027 ## python/pyspark/sql/connect/functions.py: ## @@ -388,7 +390,7 @@ def rand(seed: Optional[int] = None) -> Column: if seed is not None: return _invoke_function("rand", lit(seed)) else: -return _invoke_function("rand") +return _invoke_function("rand", lit(random.randint(0, sys.maxsize))) Review Comment: These are spark connect functions. Do we also need to update `python/pyspark/sql/functions.py`? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] allisonwang-db opened a new pull request, #43019: [SPARK-45219][PYTHON][DOCS] Refine docstring of withColumn(s)Renamed
allisonwang-db opened a new pull request, #43019: URL: https://github.com/apache/spark/pull/43019 ### What changes were proposed in this pull request? This PR refines the docstring of `DataFrame.withColumnRenamed` and `DataFrame.withColumnsRenamed`. ### Why are the changes needed? To improve PySpark documentations. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? doctest ### Was this patch authored or co-authored using generative AI tooling? No -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cdkrot commented on a diff in pull request #42949: [SPARK-45093][CONNECT][PYTHON] Error reporting for addArtifacts query
cdkrot commented on code in PR #42949: URL: https://github.com/apache/spark/pull/42949#discussion_r1331929198 ## python/pyspark/sql/connect/client/artifact.py: ## @@ -243,11 +244,15 @@ def _create_requests( self, *path: str, pyfile: bool, archive: bool, file: bool ) -> Iterator[proto.AddArtifactsRequest]: """Separated for the testing purpose.""" -return self._add_artifacts( -chain( -*(self._parse_artifacts(p, pyfile=pyfile, archive=archive, file=file) for p in path) +try: +yield from self._add_artifacts( +chain( +*(self._parse_artifacts(p, pyfile=pyfile, archive=archive, file=file) for p in path) +) ) -) +except Exception as e: +logger.error(f"Failed to submit addArtifacts request: {e}") Review Comment: >> I just think we shouldn't leave users this error message, when they don't enable the spark connect logger: We can wrap the outside exception in some SparkException saying that AddArtifacts failed. WDYT? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] allisonwang-db commented on a diff in pull request #43011: [WIP][SPARK-45232][DOC] Add missing function groups to SQL references
allisonwang-db commented on code in PR #43011: URL: https://github.com/apache/spark/pull/43011#discussion_r1331925178 ## sql/gen-sql-functions-docs.py: ## @@ -34,6 +34,8 @@ "math_funcs", "conditional_funcs", "generator_funcs", "predicate_funcs", "string_funcs", "misc_funcs", "bitwise_funcs", "conversion_funcs", "csv_funcs", +"xml_funcs", "lambda_funcs", "collection_funcs", +"url_funcs", "hash_funcs", "struct_funcs", Review Comment: QQ: For generator_funcs, do we have documentation for them when used in the FROM clause of a query? Functions like explode are typically considered table-valued generator functions. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] allisonwang-db commented on a diff in pull request #43014: [SPARK-45235][CONNECT][PYTHON] Support map and array parameters by `sql()`
allisonwang-db commented on code in PR #43014: URL: https://github.com/apache/spark/pull/43014#discussion_r1331920397 ## python/pyspark/sql/tests/connect/test_connect_basic.py: ## @@ -1237,13 +1237,23 @@ def test_sql(self): self.assertEqual(1, len(pdf.index)) def test_sql_with_named_args(self): -df = self.connect.sql("SELECT * FROM range(10) WHERE id > :minId", args={"minId": 7}) -df2 = self.spark.sql("SELECT * FROM range(10) WHERE id > :minId", args={"minId": 7}) +from pyspark.sql.functions import create_map, lit +from pyspark.sql.connect.functions import lit as clit +from pyspark.sql.connect.functions import create_map as ccreate_map Review Comment: @zhengruifeng just wondering should `SF` and `CF` be capitalized here? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cdkrot commented on a diff in pull request #42949: [SPARK-45093][CONNECT][PYTHON] Error reporting for addArtifacts query
cdkrot commented on code in PR #42949: URL: https://github.com/apache/spark/pull/42949#discussion_r1331900167 ## python/pyspark/sql/connect/client/artifact.py: ## @@ -243,11 +244,15 @@ def _create_requests( self, *path: str, pyfile: bool, archive: bool, file: bool ) -> Iterator[proto.AddArtifactsRequest]: """Separated for the testing purpose.""" -return self._add_artifacts( -chain( -*(self._parse_artifacts(p, pyfile=pyfile, archive=archive, file=file) for p in path) +try: +yield from self._add_artifacts( +chain( +*(self._parse_artifacts(p, pyfile=pyfile, archive=archive, file=file) for p in path) +) ) -) +except Exception as e: +logger.error(f"Failed to submit addArtifacts request: {e}") Review Comment: > Could you be more specific on what Exception can be thrown here? From your PR description, it looks like a GPRC error. Doesn't matter, any exception thrown here would be suppressed by grpc and only generic exception "Exception iterating requests" will be visible. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] allisonwang-db commented on a diff in pull request #42949: [SPARK-45093][CONNECT][PYTHON] Error reporting for addArtifacts query
allisonwang-db commented on code in PR #42949: URL: https://github.com/apache/spark/pull/42949#discussion_r1331897585 ## python/pyspark/sql/connect/client/artifact.py: ## @@ -243,11 +244,15 @@ def _create_requests( self, *path: str, pyfile: bool, archive: bool, file: bool ) -> Iterator[proto.AddArtifactsRequest]: """Separated for the testing purpose.""" -return self._add_artifacts( -chain( -*(self._parse_artifacts(p, pyfile=pyfile, archive=archive, file=file) for p in path) +try: +yield from self._add_artifacts( +chain( +*(self._parse_artifacts(p, pyfile=pyfile, archive=archive, file=file) for p in path) +) ) -) +except Exception as e: +logger.error(f"Failed to submit addArtifacts request: {e}") Review Comment: What's the behavior after your change when the SPARK_CONNECT logger is enabled? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] allisonwang-db commented on a diff in pull request #42949: [SPARK-45093][CONNECT][PYTHON] Error reporting for addArtifacts query
allisonwang-db commented on code in PR #42949: URL: https://github.com/apache/spark/pull/42949#discussion_r1331899412 ## python/pyspark/sql/connect/client/artifact.py: ## @@ -243,11 +244,15 @@ def _create_requests( self, *path: str, pyfile: bool, archive: bool, file: bool ) -> Iterator[proto.AddArtifactsRequest]: """Separated for the testing purpose.""" -return self._add_artifacts( -chain( -*(self._parse_artifacts(p, pyfile=pyfile, archive=archive, file=file) for p in path) +try: +yield from self._add_artifacts( +chain( +*(self._parse_artifacts(p, pyfile=pyfile, archive=archive, file=file) for p in path) +) ) -) +except Exception as e: +logger.error(f"Failed to submit addArtifacts request: {e}") Review Comment: I just think we shouldn't leave users this error message, when they don't enable the spark connect logger: ``` status = StatusCode.UNKNOWN details = "Exception iterating requests!" debug_error_string = "None" ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cdkrot commented on a diff in pull request #42949: [SPARK-45093][CONNECT][PYTHON] Error reporting for addArtifacts query
cdkrot commented on code in PR #42949: URL: https://github.com/apache/spark/pull/42949#discussion_r1331901403 ## python/pyspark/sql/connect/client/artifact.py: ## @@ -243,11 +244,15 @@ def _create_requests( self, *path: str, pyfile: bool, archive: bool, file: bool ) -> Iterator[proto.AddArtifactsRequest]: """Separated for the testing purpose.""" -return self._add_artifacts( -chain( -*(self._parse_artifacts(p, pyfile=pyfile, archive=archive, file=file) for p in path) +try: +yield from self._add_artifacts( +chain( +*(self._parse_artifacts(p, pyfile=pyfile, archive=archive, file=file) for p in path) +) ) -) +except Exception as e: +logger.error(f"Failed to submit addArtifacts request: {e}") Review Comment: > What's the behavior after your change when the SPARK_CONNECT logger is enabled? described in pr ## python/pyspark/sql/connect/client/artifact.py: ## @@ -243,11 +244,15 @@ def _create_requests( self, *path: str, pyfile: bool, archive: bool, file: bool ) -> Iterator[proto.AddArtifactsRequest]: """Separated for the testing purpose.""" -return self._add_artifacts( -chain( -*(self._parse_artifacts(p, pyfile=pyfile, archive=archive, file=file) for p in path) +try: +yield from self._add_artifacts( +chain( +*(self._parse_artifacts(p, pyfile=pyfile, archive=archive, file=file) for p in path) +) ) -) +except Exception as e: +logger.error(f"Failed to submit addArtifacts request: {e}") Review Comment: > What's the behavior after your change when the SPARK_CONNECT logger is enabled? described in pr -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cdkrot commented on a diff in pull request #42949: [SPARK-45093][CONNECT][PYTHON] Error reporting for addArtifacts query
cdkrot commented on code in PR #42949: URL: https://github.com/apache/spark/pull/42949#discussion_r1331900167 ## python/pyspark/sql/connect/client/artifact.py: ## @@ -243,11 +244,15 @@ def _create_requests( self, *path: str, pyfile: bool, archive: bool, file: bool ) -> Iterator[proto.AddArtifactsRequest]: """Separated for the testing purpose.""" -return self._add_artifacts( -chain( -*(self._parse_artifacts(p, pyfile=pyfile, archive=archive, file=file) for p in path) +try: +yield from self._add_artifacts( +chain( +*(self._parse_artifacts(p, pyfile=pyfile, archive=archive, file=file) for p in path) +) ) -) +except Exception as e: +logger.error(f"Failed to submit addArtifacts request: {e}") Review Comment: > Could you be more specific on what Exception can be thrown here? From your PR description, it looks like a GPRC error. Doesn't matter, any exception thrown here would be suppressed by grpc and only generic exception "Fail to iterate response" will be visible. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cdkrot commented on a diff in pull request #42949: [SPARK-45093][CONNECT][PYTHON] Error reporting for addArtifacts query
cdkrot commented on code in PR #42949: URL: https://github.com/apache/spark/pull/42949#discussion_r1331901152 ## python/pyspark/sql/connect/client/artifact.py: ## @@ -243,11 +244,15 @@ def _create_requests( self, *path: str, pyfile: bool, archive: bool, file: bool ) -> Iterator[proto.AddArtifactsRequest]: """Separated for the testing purpose.""" -return self._add_artifacts( -chain( -*(self._parse_artifacts(p, pyfile=pyfile, archive=archive, file=file) for p in path) +try: +yield from self._add_artifacts( +chain( +*(self._parse_artifacts(p, pyfile=pyfile, archive=archive, file=file) for p in path) +) ) -) +except Exception as e: +logger.error(f"Failed to submit addArtifacts request: {e}") Review Comment: In my example I'm intercepting FileNotFound error -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] peter-toth commented on pull request #42864: [SPARK-45112][SQL] Use UnresolvedFunction based resolution in SQL Dataset functions
peter-toth commented on PR #42864: URL: https://github.com/apache/spark/pull/42864#issuecomment-1728048096 Thanks for the review! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun closed pull request #43015: [SPARK-45237][DOCS] Change the default value of `spark.history.store.hybridStore.diskBackend` in `monitoring.md` to `ROCKSDB`
dongjoon-hyun closed pull request #43015: [SPARK-45237][DOCS] Change the default value of `spark.history.store.hybridStore.diskBackend` in `monitoring.md` to `ROCKSDB` URL: https://github.com/apache/spark/pull/43015 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun opened a new pull request, #43018: [SPARK-45241][INFRA] Use Zulu JDK in java-other-versions pipeline and Java 21
dongjoon-hyun opened a new pull request, #43018: URL: https://github.com/apache/spark/pull/43018 … ### What changes were proposed in this pull request? ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? ### Was this patch authored or co-authored using generative AI tooling? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun commented on pull request #43015: [SPARK-45237][DOCS] Change the default value of `spark.history.store.hybridStore.diskBackend` in `monitoring.md` to `ROCKSDB`
dongjoon-hyun commented on PR #43015: URL: https://github.com/apache/spark/pull/43015#issuecomment-1727988603 Merged to master/3.5/3.4. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] yaooqinn commented on a diff in pull request #42199: [SPARK-44579][SQL] Support Interrupt On Cancel in SQLExecution
yaooqinn commented on code in PR #42199: URL: https://github.com/apache/spark/pull/42199#discussion_r1331832544 ## sql/core/src/main/scala/org/apache/spark/sql/execution/SQLExecution.scala: ## @@ -77,6 +79,11 @@ object SQLExecution { } val rootExecutionId = sc.getLocalProperty(EXECUTION_ROOT_ID_KEY).toLong executionIdToQueryExecution.put(executionId, queryExecution) +val originalInterruptOnCancel = sc.getLocalProperty(SPARK_JOB_INTERRUPT_ON_CANCEL) +if (originalInterruptOnCancel == null) { + val interruptOnCancel = sparkSession.conf.get(SQLConf.INTERRUPT_ON_CANCEL) + sc.setInterruptOnCancel(interruptOnCancel) Review Comment: STS call cancelJobGroup after query failed -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] allisonwang-db commented on a diff in pull request #42949: [SPARK-45093][CONNECT][PYTHON] Error reporting for addArtifacts query
allisonwang-db commented on code in PR #42949: URL: https://github.com/apache/spark/pull/42949#discussion_r1331894962 ## python/pyspark/sql/connect/client/artifact.py: ## @@ -243,11 +244,15 @@ def _create_requests( self, *path: str, pyfile: bool, archive: bool, file: bool ) -> Iterator[proto.AddArtifactsRequest]: """Separated for the testing purpose.""" -return self._add_artifacts( -chain( -*(self._parse_artifacts(p, pyfile=pyfile, archive=archive, file=file) for p in path) +try: +yield from self._add_artifacts( +chain( +*(self._parse_artifacts(p, pyfile=pyfile, archive=archive, file=file) for p in path) +) ) -) +except Exception as e: +logger.error(f"Failed to submit addArtifacts request: {e}") Review Comment: Could you be more specific on what `Exception` can be thrown here? From your PR description, it looks like a GPRC error. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan closed pull request #42864: [SPARK-45112][SQL] Use UnresolvedFunction based resolution in SQL Dataset functions
cloud-fan closed pull request #42864: [SPARK-45112][SQL] Use UnresolvedFunction based resolution in SQL Dataset functions URL: https://github.com/apache/spark/pull/42864 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cdkrot commented on a diff in pull request #42949: [SPARK-45093][CONNECT][PYTHON] Error reporting for addArtifacts query
cdkrot commented on code in PR #42949: URL: https://github.com/apache/spark/pull/42949#discussion_r1331785918 ## python/pyspark/sql/connect/client/artifact.py: ## @@ -243,11 +244,15 @@ def _create_requests( self, *path: str, pyfile: bool, archive: bool, file: bool ) -> Iterator[proto.AddArtifactsRequest]: """Separated for the testing purpose.""" -return self._add_artifacts( -chain( -*(self._parse_artifacts(p, pyfile=pyfile, archive=archive, file=file) for p in path) +try: +yield from self._add_artifacts( +chain( +*(self._parse_artifacts(p, pyfile=pyfile, archive=archive, file=file) for p in path) +) ) -) +except Exception as e: +logger.error(f"Failed to submit addArtifacts request: {e}") Review Comment: I don't think we can do something like that here. We are not having server-side error response needed to process. We have purely client-side error of exception being thrown while grpc consumes iterator of requests. It's not possible to extract any information on outside of grpc being called since it suppresses error entirely. I think the only option to make proper exception classes is to preload all the artifacts into memory and construct corresponding requests before streaming them into grpc. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cdkrot commented on pull request #42949: [SPARK-45093][CONNECT][PYTHON] Error reporting for addArtifacts query
cdkrot commented on PR #42949: URL: https://github.com/apache/spark/pull/42949#issuecomment-1727917945 Updated fork's master -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] hdaikoku commented on a diff in pull request #42426: [SPARK-44756][CORE] Executor hangs when RetryingBlockTransferor fails to initiate retry
hdaikoku commented on code in PR #42426: URL: https://github.com/apache/spark/pull/42426#discussion_r1331742974 ## common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/RetryingBlockTransferor.java: ## @@ -274,7 +287,13 @@ private void handleBlockTransferFailure(String blockId, Throwable exception) { synchronized (RetryingBlockTransferor.this) { if (this == currentListener && outstandingBlocksIds.contains(blockId)) { if (shouldRetry(exception)) { -initiateRetry(exception); +try { + initiateRetry(exception); +} catch (Throwable t) { + logger.error("Exception while trying to initiate retry", t); Review Comment: Thank you for the suggestion, refactored accordingly: [32498ff](https://github.com/apache/spark/pull/42426/commits/32498ff9a906a21746b619f1ddd316777d188417) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on pull request #42864: [SPARK-45112][SQL] Use UnresolvedFunction based resolution in SQL Dataset functions
cloud-fan commented on PR #42864: URL: https://github.com/apache/spark/pull/42864#issuecomment-1727813318 thanks, merging to master! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] heyihong commented on a diff in pull request #42987: [SPARK-45207][SQL][CONNECT] Implement Complete Error Reconstruction for Scala Client
heyihong commented on code in PR #42987: URL: https://github.com/apache/spark/pull/42987#discussion_r1331247912 ## sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala: ## @@ -2882,8 +2882,7 @@ object SQLConf { "level settings.") .version("3.0.0") .booleanConf - // show full stacktrace in tests but hide in production by default. - .createWithDefault(Utils.isTesting) Review Comment: The size of the ErrorInfo exceeds the header limit sometimes in the tests I added. So disable this by default. It should not be an issue after we migrate to use FetchErrorDetails RPC -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] heyihong commented on a diff in pull request #42987: [SPARK-45207][SQL][CONNECT] Implement Error Reconstruction for Scala Client
heyihong commented on code in PR #42987: URL: https://github.com/apache/spark/pull/42987#discussion_r1331243346 ## connector/connect/common/src/main/scala/org/apache/spark/sql/connect/client/GrpcExceptionConverter.scala: ## @@ -93,33 +179,65 @@ private[client] object GrpcExceptionConverter extends JsonUtils { new SparkArrayIndexOutOfBoundsException(message)), errorConstructor[DateTimeException]((message, _) => new SparkDateTimeException(message)), errorConstructor((message, cause) => new SparkRuntimeException(message, cause)), -errorConstructor((message, cause) => new SparkUpgradeException(message, cause))) - - private def errorInfoToThrowable(info: ErrorInfo, message: String): Option[Throwable] = { -val classes = - mapper.readValue(info.getMetadataOrDefault("classes", "[]"), classOf[Array[String]]) +errorConstructor((message, cause) => new SparkUpgradeException(message, cause)), +errorConstructor((message, cause) => new SparkException(message, cause.orNull))) + + /** + * errorsToThrowable reconstructs the exception based on a list of protobuf messages + * FetchErrorDetailsResponse.Error with un-truncated error messages and server-side stacktrace + * (if set). + */ + private def errorsToThrowable( + errorIdx: Int, + errors: Seq[FetchErrorDetailsResponse.Error]): Throwable = { + +val error = errors(errorIdx) + +val classHierarchy = error.getErrorTypeHierarchyList.asScala + +val constructor = + classHierarchy +.flatMap(errorFactory.get) +.headOption +.getOrElse((message: String, cause: Option[Throwable]) => + new SparkException( Review Comment: This is from https://github.com/apache/spark/pull/42377#discussion_r1329224205. But this one seems to be simpler -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] heyihong opened a new pull request, #43017: [SPARK-45239] Reduce default spark.connect.jvmStacktrace.maxSize
heyihong opened a new pull request, #43017: URL: https://github.com/apache/spark/pull/43017 ### What changes were proposed in this pull request? - Reduce default spark.connect.jvmStacktrace.maxSize ### Why are the changes needed? - `spark.sql.pyspark.jvmStacktrace.enabled` is partially broken (i.e. hitting the 8K header limit) and has to be disable in some tests ### Does this PR introduce _any_ user-facing change? - No ### How was this patch tested? - Existing tests ### Was this patch authored or co-authored using generative AI tooling? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] heyihong commented on a diff in pull request #42987: [SPARK-45207][SQL][CONNECT] Implement Complete Error Reconstruction for Scala Client
heyihong commented on code in PR #42987: URL: https://github.com/apache/spark/pull/42987#discussion_r1331693297 ## sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala: ## @@ -2882,8 +2882,7 @@ object SQLConf { "level settings.") .version("3.0.0") .booleanConf - // show full stacktrace in tests but hide in production by default. - .createWithDefault(Utils.isTesting) Review Comment: The change is rebased already. I reenabled the pyspark jvm stacktrace flag for tests but disable this flag in a specific test case instead -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on pull request #42997: [SPARK-45216][SQL] Fix non-deterministic seeded Dataset APIs
cloud-fan commented on PR #42997: URL: https://github.com/apache/spark/pull/42997#issuecomment-1727800532 cc @HyukjinKwon -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] LuciferYang commented on a diff in pull request #43005: [WIP][SPARK-44112][BUILD][INFRA] Drop Java 8 and 11 support
LuciferYang commented on code in PR #43005: URL: https://github.com/apache/spark/pull/43005#discussion_r1331245458 ## .github/workflows/build_coverage.yml: ## @@ -17,7 +17,7 @@ # under the License. # -name: "Build / Coverage (master, Scala 2.12, Hadoop 3, JDK 8)" +name: "Build / Coverage (master, Scala 2.12, Hadoop 17, JDK 8)" Review Comment: good catch ~ thanks -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] zhengruifeng commented on pull request #43013: [MINOR][DOCS][CONNECT] Update notes about supported modules in PySpark API reference
zhengruifeng commented on PR #43013: URL: https://github.com/apache/spark/pull/43013#issuecomment-1727396313 `/mllib` in scala, `pyspark.ml` and `pyspark.mllib` in python, don't work on connect. only new module `pyspark.ml.connect` works on connect. `pyspark.ml` contains many classification/clustering/etc, they are not supported in `pyspark.ml.connect` cc @WeichenXu123 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] heyihong commented on a diff in pull request #42987: [SPARK-45207][SQL][CONNECT] Implement Complete Error Reconstruction for Scala Client
heyihong commented on code in PR #42987: URL: https://github.com/apache/spark/pull/42987#discussion_r1331247912 ## sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala: ## @@ -2882,8 +2882,7 @@ object SQLConf { "level settings.") .version("3.0.0") .booleanConf - // show full stacktrace in tests but hide in production by default. - .createWithDefault(Utils.isTesting) Review Comment: The size of the ErrorInfo is about to exceed the header limit even in tests. So disable this by default. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on pull request #42971: [SPARK-43979][SQL][FOLLOWUP] Handle non alias-only project case
cloud-fan commented on PR #42971: URL: https://github.com/apache/spark/pull/42971#issuecomment-1727363616 The failed streaming test is unrelated, I'm merging it to master/3.5, thanks! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] LuciferYang commented on pull request #43008: [SPARK-44113][BUILD][INFRA][DOCS] Drop support for Scala 2.12
LuciferYang commented on PR #43008: URL: https://github.com/apache/spark/pull/43008#issuecomment-1727210378 cc @dongjoon-hyun FYI Do we need to further split this PR ? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun closed pull request #43007: [SPARK-45229][CORE][UI] Show the number of drivers waiting in SUBMITTED status in MasterPage
dongjoon-hyun closed pull request #43007: [SPARK-45229][CORE][UI] Show the number of drivers waiting in SUBMITTED status in MasterPage URL: https://github.com/apache/spark/pull/43007 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] LuciferYang commented on pull request #43005: [WIP][SPARK-44112][BUILD][INFRA] Drop Java 8 and 11 support
LuciferYang commented on PR #43005: URL: https://github.com/apache/spark/pull/43005#issuecomment-1727201913 wait https://github.com/apache/spark/pull/43008 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on a diff in pull request #42199: [SPARK-44579][SQL] Support Interrupt On Cancel in SQLExecution
cloud-fan commented on code in PR #42199: URL: https://github.com/apache/spark/pull/42199#discussion_r1331684745 ## sql/core/src/main/scala/org/apache/spark/sql/execution/SQLExecution.scala: ## @@ -77,6 +79,11 @@ object SQLExecution { } val rootExecutionId = sc.getLocalProperty(EXECUTION_ROOT_ID_KEY).toLong executionIdToQueryExecution.put(executionId, queryExecution) +val originalInterruptOnCancel = sc.getLocalProperty(SPARK_JOB_INTERRUPT_ON_CANCEL) +if (originalInterruptOnCancel == null) { + val interruptOnCancel = sparkSession.conf.get(SQLConf.INTERRUPT_ON_CANCEL) + sc.setInterruptOnCancel(interruptOnCancel) Review Comment: > does this PR just leverage a feature that SC already have? Yes, the problem is, now we enable this feature by default, while this feature can lead to memory leak. That's why I asked, how does thriftserver leverage this feature? After this PR, all queries will use this feature, not only thriftserver. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org