[GitHub] [spark] yaooqinn commented on pull request #40524: [SPARK-42898][SQL] Mark that string/date casts do not need time zone id
yaooqinn commented on PR #40524: URL: https://github.com/apache/spark/pull/40524#issuecomment-1643292939 cc @ulysses-you who is the last one to touch this part -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] yaooqinn commented on pull request #41951: [SPARK-44367][SQL][UI] Show error message on UI for each failed query
yaooqinn commented on PR #41951: URL: https://github.com/apache/spark/pull/41951#issuecomment-1643284132 Thanks, merged to master and 3.5. PS, K8s IT failures are unrelated.
[GitHub] [spark] yaooqinn closed pull request #41951: [SPARK-44367][SQL][UI] Show error message on UI for each failed query
yaooqinn closed pull request #41951: [SPARK-44367][SQL][UI] Show error message on UI for each failed query URL: https://github.com/apache/spark/pull/41951
[GitHub] [spark] LuciferYang commented on pull request #42088: [SPARK-44491][INFRA] Add `branch-3.5` to `publish_snapshot` GitHub Action job
LuciferYang commented on PR #42088: URL: https://github.com/apache/spark/pull/42088#issuecomment-1643272217 cc @HyukjinKwon @zhengruifeng
[GitHub] [spark] panbingkun opened a new pull request, #42088: [SPARK-44491][INFRA] Add `branch-3.5` to `publish_snapshot` GitHub Action job
panbingkun opened a new pull request, #42088: URL: https://github.com/apache/spark/pull/42088

### What changes were proposed in this pull request?
This PR aims to add `branch-3.5` to the `publish_snapshot` GitHub Action job.

### Why are the changes needed?
Since the GitHub Action cron job is executed only on the `master` branch, we need to register the newly added branch on `master`.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
N/A. Manual check.
[GitHub] [spark] mathewjacob1002 opened a new pull request, #42087: [SPARK-44264] Added Example to Deepspeed Distributor
mathewjacob1002 opened a new pull request, #42087: URL: https://github.com/apache/spark/pull/42087

### What changes were proposed in this pull request?
Added examples to the docstring showing how to use `DeepspeedTorchDistributor`.

### Why are the changes needed?
More concrete examples allow for a better understanding of the feature.

### Does this PR introduce _any_ user-facing change?
Yes, docs changes.

### How was this patch tested?
`make html`
[GitHub] [spark] zhengruifeng closed pull request #42040: [WIP][SPARK-43611][SQL][PS][CONNCECT] Fix unexpected `AnalysisException` from Spark Connect client
zhengruifeng closed pull request #42040: [WIP][SPARK-43611][SQL][PS][CONNCECT] Fix unexpected `AnalysisException` from Spark Connect client URL: https://github.com/apache/spark/pull/42040
[GitHub] [spark] zhengruifeng commented on pull request #42040: [WIP][SPARK-43611][SQL][PS][CONNCECT] Fix unexpected `AnalysisException` from Spark Connect client
zhengruifeng commented on PR #42040: URL: https://github.com/apache/spark/pull/42040#issuecomment-1643204561 close this one in favor of https://github.com/apache/spark/pull/42086
[GitHub] [spark] beliefer commented on pull request #41932: [SPARK-44131][SQL][PYTHON][CONNECT][FOLLOWUP] Support qualified function name for call_function
beliefer commented on PR #41932: URL: https://github.com/apache/spark/pull/41932#issuecomment-1643189573 The CI failure is unrelated to this PR.
[GitHub] [spark] zhengruifeng commented on pull request #42086: [SPARK-43611][SQL][PS][CONNCECT] Make `ExtractWindowExpressions` retain the `PLAN_ID_TAG `
zhengruifeng commented on PR #42086: URL: https://github.com/apache/spark/pull/42086#issuecomment-1643187065 I have checked with @cloud-fan that we might have to modify the rules one by one. The good news is that 90% of the UTs can be resolved by this single rule, and I think we only need to touch another 3~4 rules. @HyukjinKwon @itholic
[GitHub] [spark] zhengruifeng opened a new pull request, #42086: [SPARK-43611][SQL][PS][CONNCECT] Make `ExtractWindowExpressions` retain the `PLAN_ID_TAG `
zhengruifeng opened a new pull request, #42086: URL: https://github.com/apache/spark/pull/42086

### What changes were proposed in this pull request?
Make `ExtractWindowExpressions` retain the `PLAN_ID_TAG`.

### Why are the changes needed?
In https://github.com/apache/spark/pull/39925, we introduced a new mechanism to resolve an expression against a specified plan. However, sometimes the plan ID may be eliminated by the analyzer, and then some expressions cannot be correctly resolved; this issue is the main blocker of PS on Connect.

### Does this PR introduce _any_ user-facing change?
Yes, a lot of Pandas APIs are enabled.

### How was this patch tested?
Enabled 113 out of all the 122 related UTs.
[GitHub] [spark] cxzl25 opened a new pull request, #42085: [SPARK-44490][WEBUI] Remove `TaskPagedTable` in StagePage
cxzl25 opened a new pull request, #42085: URL: https://github.com/apache/spark/pull/42085

### What changes were proposed in this pull request?
Remove `TaskPagedTable`.

### Why are the changes needed?
In [SPARK-21809](https://issues.apache.org/jira/browse/SPARK-21809), we introduced `stagespage-template.html` to show the running status of a Stage. `TaskPagedTable` is no longer in effect, but there are still many PRs updating the related code.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Local test
[GitHub] [spark] HyukjinKwon closed pull request #41948: [SPARK-44380][SQL][PYTHON] Support for Python UDTF to analyze in Python
HyukjinKwon closed pull request #41948: [SPARK-44380][SQL][PYTHON] Support for Python UDTF to analyze in Python URL: https://github.com/apache/spark/pull/41948
[GitHub] [spark] HyukjinKwon commented on pull request #41948: [SPARK-44380][SQL][PYTHON] Support for Python UDTF to analyze in Python
HyukjinKwon commented on PR #41948: URL: https://github.com/apache/spark/pull/41948#issuecomment-1643063875 Merged to master.
[GitHub] [spark] beliefer opened a new pull request, #42084: [SPARK-44292][SQL][FOLLOWUP] Make TYPE_CHECK_FAILURE_WITH_HINT use correct name
beliefer opened a new pull request, #42084: URL: https://github.com/apache/spark/pull/42084

### What changes were proposed in this pull request?
https://github.com/apache/spark/pull/41850 uses `TYPE_CHECK_FAILURE_WITH_HINT`; it should be `DATATYPE_MISMATCH.TYPE_CHECK_FAILURE_WITH_HINT`.

### Why are the changes needed?
Fix a bug.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
N/A
[GitHub] [spark] cloud-fan closed pull request #42007: [SPARK-44431][SQL] Fix behavior of null IN (empty list) in optimization rules
cloud-fan closed pull request #42007: [SPARK-44431][SQL] Fix behavior of null IN (empty list) in optimization rules URL: https://github.com/apache/spark/pull/42007
[GitHub] [spark] cloud-fan commented on pull request #42007: [SPARK-44431][SQL] Fix behavior of null IN (empty list) in optimization rules
cloud-fan commented on PR #42007: URL: https://github.com/apache/spark/pull/42007#issuecomment-1643037992 thanks, merging to master/3.5!
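For context on the semantics this PR title refers to: under SQL's three-valued logic, `NULL IN (<empty list>)` must evaluate to FALSE rather than NULL, because an IN predicate over an empty list has no elements to be unknown about. A minimal Python sketch of these semantics (illustrative only, not Spark's optimizer code; `sql_in` is a hypothetical helper):

```python
def sql_in(x, values):
    """Evaluate SQL `x IN (values)` with three-valued logic.

    Returns True, False, or None, where None models SQL NULL/unknown.
    """
    if not values:
        # Empty IN-list: always FALSE, even when x is NULL -- this is
        # the case the optimization rules got wrong.
        return False
    if x is None:
        # NULL compared against a non-empty list: unknown.
        return None
    if any(v == x for v in values if v is not None):
        return True
    if any(v is None for v in values):
        # x might equal the NULL element: unknown.
        return None
    return False
```

The first branch is the empty-list case named in the PR title; the remaining branches follow the usual NULL-propagation rules for IN.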
[GitHub] [spark] richardc-db opened a new pull request, #42083: Support deserializing long types when creating `Metadata` object from JObject
richardc-db opened a new pull request, #42083: URL: https://github.com/apache/spark/pull/42083

### What changes were proposed in this pull request?
Adds support for deserializing long types when creating `Metadata` objects from `JObject`s.

### Why are the changes needed?
Previously, the code would crash when adding a `long` value to the `Metadata` object.

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
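As a hypothetical illustration of the failure mode (a Python sketch, not the actual Scala/json4s deserializer; both decoder names are made up): a decoder whose integer case only accepts 32-bit values crashes as soon as a metadata value needs a long, while matching the full integer range handles both sizes in one branch:

```python
import json

INT32_MIN, INT32_MAX = -2**31, 2**31 - 1

def decode_strict(v):
    """Buggy decoder: accepts only 32-bit integers (no 'long' case)."""
    if isinstance(v, int) and not isinstance(v, bool) and INT32_MIN <= v <= INT32_MAX:
        return v
    raise ValueError(f"unsupported metadata value: {v!r}")

def decode_fixed(v):
    """Fixed decoder: one branch covers int- and long-sized values."""
    if isinstance(v, int) and not isinstance(v, bool):
        return v
    raise ValueError(f"unsupported metadata value: {v!r}")

# A value that exceeds the 32-bit range, as parsed from JSON metadata.
meta = json.loads('{"rowCount": 3000000000}')
```

`decode_strict(meta["rowCount"])` raises, while `decode_fixed` returns the value unchanged.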
[GitHub] [spark] HyukjinKwon commented on a diff in pull request #42037: [SPARK-44305][SQL] Dynamically choose whether to broadcast hadoop conf
HyukjinKwon commented on code in PR #42037: URL: https://github.com/apache/spark/pull/42037#discussion_r1268879200

## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFileFormat.scala:

```diff
@@ -152,15 +153,25 @@ class OrcFileFormat
       assert(supportBatch(sparkSession, resultSchema))
     }

-    OrcConf.IS_SCHEMA_EVOLUTION_CASE_SENSITIVE.setBoolean(hadoopConf, sqlConf.caseSensitiveAnalysis)
-
-    val broadcastedConf =
-      sparkSession.sparkContext.broadcast(new SerializableConfiguration(hadoopConf))
     val isCaseSensitive = sparkSession.sessionState.conf.caseSensitiveAnalysis
+    OrcConf.IS_SCHEMA_EVOLUTION_CASE_SENSITIVE.setBoolean(hadoopConf, isCaseSensitive)
+    val broadcastedHadoopConf = if (options.isEmpty) {
+      Option.empty
+    } else {
+      Option(sparkSession.sparkContext.broadcast(new SerializableConfiguration(hadoopConf)))
+    }
+
     val orcFilterPushDown = sparkSession.sessionState.conf.orcFilterPushDown
+    val sparkConf = sparkSession.sparkContext.conf
     (file: PartitionedFile) => {
-      val conf = broadcastedConf.value.value
+      val conf = if (broadcastedHadoopConf.isDefined) {
+        broadcastedHadoopConf.get.value.value
```

Review Comment: Oh, I see. You're passing the original one here. Okay, I am good with this approach.
[GitHub] [spark] panbingkun commented on a diff in pull request #41349: [SPARK-43839][SQL] Convert `_LEGACY_ERROR_TEMP_1337` to `UNSUPPORTED_FEATURE.TIME_TRAVEL`
panbingkun commented on code in PR #41349: URL: https://github.com/apache/spark/pull/41349#discussion_r1268872661

## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/V2SessionCatalog.scala:

```diff
@@ -89,10 +89,12 @@ class V2SessionCatalog(catalog: SessionCatalog)
       throw QueryCompilationErrors.timeTravelUnsupportedError(
```

Review Comment: @cloud-fan Follow-up PR: https://github.com/apache/spark/pull/42082
[GitHub] [spark] panbingkun commented on pull request #42082: [SPARK-43839][SQL][FOLLOWUP] Convert _LEGACY_ERROR_TEMP_1337 to UNSUPPORTED_FEATURE.TIME_TRAVEL
panbingkun commented on PR #42082: URL: https://github.com/apache/spark/pull/42082#issuecomment-1643026104 cc @cloud-fan @MaxGekk
[GitHub] [spark] panbingkun commented on a diff in pull request #42082: [SPARK-43839][SQL][FOLLOWUP] Convert _LEGACY_ERROR_TEMP_1337 to UNSUPPORTED_FEATURE.TIME_TRAVEL
panbingkun commented on code in PR #42082: URL: https://github.com/apache/spark/pull/42082#discussion_r1268872155

## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/V2SessionCatalog.scala:

```diff
@@ -88,19 +88,11 @@ class V2SessionCatalog(catalog: SessionCatalog)
   }

   private def failTimeTravel(ident: Identifier, t: Table): Table = {
-    t match {
-      case V1Table(catalogTable) =>
-        if (catalogTable.tableType == CatalogTableType.VIEW) {
-          throw QueryCompilationErrors.timeTravelUnsupportedError(
-            toSQLId(catalogTable.identifier.nameParts))
-        } else {
-          throw QueryCompilationErrors.timeTravelUnsupportedError(
-            toSQLId(catalogTable.identifier.nameParts))
-        }
-
-      case _ => throw QueryCompilationErrors.timeTravelUnsupportedError(
-        toSQLId(ident.asTableIdentifier.nameParts))
+    val nameParts = t match {
+      case V1Table(catalogTable) => catalogTable.identifier.nameParts
+      case _ => ident.asTableIdentifier.nameParts
```

Review Comment: If the code here is simplified to `val nameParts = ident.asTableIdentifier.nameParts`, in some scenarios `relationId` will not attach the catalog name in the error message, such as: https://github.com/apache/spark/blob/e0c79c637c16df92aa3c6fb14f151132ea26e92f/sql/core/src/test/scala/org/apache/spark/sql/errors/QueryExecutionErrorsSuite.scala#L979-L990
[GitHub] [spark] panbingkun opened a new pull request, #42082: [SPARK-43839][SQL][FOLLOWUP] Convert _LEGACY_ERROR_TEMP_1337 to UNSUPPORTED_FEATURE.TIME_TRAVEL
panbingkun opened a new pull request, #42082: URL: https://github.com/apache/spark/pull/42082

### What changes were proposed in this pull request?
- This PR follows up https://github.com/apache/spark/pull/41349.
- It aims to simplify the code logic after merging `_LEGACY_ERROR_TEMP_1337` into `UNSUPPORTED_FEATURE.TIME_TRAVEL`.

### Why are the changes needed?
Simplify the code logic.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Pass GA.
[GitHub] [spark] Hisoka-X commented on pull request #42081: [SPARK-44487][TEST] Fix KubernetesSuite report NPE when not set spark.kubernetes.test.unpackSparkDir
Hisoka-X commented on PR #42081: URL: https://github.com/apache/spark/pull/42081#issuecomment-1643018942 > Is this related to the GA failure of the master? No, just a bug I found while debugging.
[GitHub] [spark] LuciferYang commented on pull request #42081: [SPARK-44487][TEST] Fix KubernetesSuite report NPE when not set spark.kubernetes.test.unpackSparkDir
LuciferYang commented on PR #42081: URL: https://github.com/apache/spark/pull/42081#issuecomment-1643016092 Is this related to the GA failure of the master?
[GitHub] [spark] beliefer commented on a diff in pull request #41850: [SPARK-44292][SQL] Assign names to the error class _LEGACY_ERROR_TEMP_[2315-2319]
beliefer commented on code in PR #41850: URL: https://github.com/apache/spark/pull/41850#discussion_r1268862800

## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala:

```diff
@@ -277,13 +277,13 @@ trait CheckAnalysis extends PredicateHelper with LookupCatalog with QueryErrorsB
             e.dataTypeMismatch(e, checkRes)
           case TypeCheckResult.TypeCheckFailure(message) =>
             e.setTagValue(DATA_TYPE_MISMATCH_ERROR, true)
-            extraHintForAnsiTypeCoercionExpression(operator)
+            val extraHint = extraHintForAnsiTypeCoercionExpression(operator)
             e.failAnalysis(
-              errorClass = "_LEGACY_ERROR_TEMP_2315",
+              errorClass = "TYPE_CHECK_FAILURE_WITH_HINT",
```

Review Comment: OK, let me see it.
[GitHub] [spark] Hisoka-X commented on pull request #41347: [SPARK-43838][SQL] Fix subquery on single table with having clause can't be optimized
Hisoka-X commented on PR #41347: URL: https://github.com/apache/spark/pull/41347#issuecomment-1643006858 Thanks @cloud-fan for your help and @allisonwang-db
[GitHub] [spark] cloud-fan closed pull request #41347: [SPARK-43838][SQL] Fix subquery on single table with having clause can't be optimized
cloud-fan closed pull request #41347: [SPARK-43838][SQL] Fix subquery on single table with having clause can't be optimized URL: https://github.com/apache/spark/pull/41347
[GitHub] [spark] cloud-fan commented on pull request #41347: [SPARK-43838][SQL] Fix subquery on single table with having clause can't be optimized
cloud-fan commented on PR #41347: URL: https://github.com/apache/spark/pull/41347#issuecomment-1643004942 The k8s failure is unrelated, I'm merging it to master, thanks!
[GitHub] [spark] panbingkun commented on a diff in pull request #41349: [SPARK-43839][SQL] Convert `_LEGACY_ERROR_TEMP_1337` to `UNSUPPORTED_FEATURE.TIME_TRAVEL`
panbingkun commented on code in PR #41349: URL: https://github.com/apache/spark/pull/41349#discussion_r1268852208

## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/V2SessionCatalog.scala:

```diff
@@ -89,10 +89,12 @@ class V2SessionCatalog(catalog: SessionCatalog)
       throw QueryCompilationErrors.timeTravelUnsupportedError(
```

Review Comment: Yeah, after merging, the code here can be simplified. It's my bad, let me do it immediately.
[GitHub] [spark] Hisoka-X commented on pull request #42081: [SPARK-44487][TEST] Fix KubernetesSuite report NPE when not set spark.kubernetes.test.unpackSparkDir
Hisoka-X commented on PR #42081: URL: https://github.com/apache/spark/pull/42081#issuecomment-1642997959 cc @Yikun @HyukjinKwon
[GitHub] [spark] Hisoka-X opened a new pull request, #42081: [SPARK-44487][TEST] Fix KubernetesSuite report NPE when not set spark.kubernetes.test.unpackSparkDir
Hisoka-X opened a new pull request, #42081: URL: https://github.com/apache/spark/pull/42081

### What changes were proposed in this pull request?
Fix KubernetesSuite reporting an NPE when `spark.kubernetes.test.unpackSparkDir` is not set:

```java
Exception encountered when invoking run on a nested suite.
java.lang.NullPointerException
  at sun.nio.fs.UnixPath.normalizeAndCheck(UnixPath.java:77)
  at sun.nio.fs.UnixPath.<init>(UnixPath.java:71)
  at sun.nio.fs.UnixFileSystem.getPath(UnixFileSystem.java:281)
  at java.nio.file.Paths.get(Paths.java:84)
  at org.apache.spark.deploy.k8s.integrationtest.KubernetesSuite.$anonfun$beforeAll$4(KubernetesSuite.scala:164)
  at org.apache.spark.deploy.k8s.integrationtest.KubernetesSuite.$anonfun$beforeAll$4$adapted(KubernetesSuite.scala:163)
  at scala.collection.LinearSeqOptimized.find(LinearSeqOptimized.scala:115)
  at scala.collection.LinearSeqOptimized.find$(LinearSeqOptimized.scala:112)
  at scala.collection.immutable.List.find(List.scala:91)
  at org.apache.spark.deploy.k8s.integrationtest.KubernetesSuite.beforeAll(KubernetesSuite.scala:163)
```

### Why are the changes needed?
`spark.kubernetes.test.unpackSparkDir` is an optional config, not a required one.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing tests; verified locally.
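The fix pattern, sketched in Python with a hypothetical helper (the actual suite is Scala): treat the setting as optional and check for absence before constructing a path, instead of passing a missing value straight into the path API, which is what triggered the NPE above:

```python
from pathlib import Path
from typing import Optional

def resolve_unpack_dir(conf: dict) -> Optional[Path]:
    """Return the unpack dir as a Path, or None when the key is unset.

    Hypothetical helper: guards the optional setting instead of calling
    the path constructor with a missing (null) value.
    """
    value = conf.get("spark.kubernetes.test.unpackSparkDir")
    if value is None:
        # Optional setting: skip the path logic rather than crash.
        return None
    return Path(value)
```

With the guard in place, an empty configuration yields `None` and the suite can simply skip the unpack step.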
[GitHub] [spark] panbingkun commented on a diff in pull request #41349: [SPARK-43839][SQL] Convert `_LEGACY_ERROR_TEMP_1337` to `UNSUPPORTED_FEATURE.TIME_TRAVEL`
panbingkun commented on code in PR #41349: URL: https://github.com/apache/spark/pull/41349#discussion_r1268851681

## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/V2SessionCatalog.scala:

```diff
@@ -89,10 +89,12 @@ class V2SessionCatalog(catalog: SessionCatalog)
       throw QueryCompilationErrors.timeTravelUnsupportedError(
         toSQLId(catalogTable.identifier.nameParts))
     } else {
-      throw QueryCompilationErrors.tableNotSupportTimeTravelError(ident)
```

Review Comment: Its corresponding error class `_LEGACY_ERROR_TEMP_1337` has been merged into `UNSUPPORTED_FEATURE.TIME_TRAVEL`, so it has been removed here.
[GitHub] [spark] itholic commented on a diff in pull request #41711: [SPARK-44155] Adding a dev utility to improve error messages based on LLM
itholic commented on code in PR #41711: URL: https://github.com/apache/spark/pull/41711#discussion_r1268843077 ## dev/error_message_refiner.py: ## @@ -0,0 +1,265 @@ +#!/usr/bin/env python3 + +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +""" +Utility for refining error messages based on LLM. + +Usage: +python error_message_refiner.py [--model_name=] + +Arguments: + Required. +The name of the error class to refine the messages for. +The list of error classes is located in + `common/utils/src/main/resources/error/error-classes.json`. + +Options: +--model_name= Optional. +The version of Chat GPT to use for refining the error messages. +If not provided, the default version("gpt-3.5-turbo") will be used. + +Example usage: +python error_message_refiner.py CANNOT_DECODE_URL --model_name=gpt-4 + +Description: +This script refines error messages using the LLM based approach. +It takes the name of the error class as a required argument and, optionally, +allows specifying the version of Chat GPT to use for refining the messages. + +Options: +--model_name: Specifies the version of Chat GPT. + If not provided, the default version("gpt-3.5-turbo") will be used. 
+ +Note: +- Ensure that the necessary dependencies are installed before running the script. +- Ensure that the valid API key is entered in the `api-key.txt`. +- The refined error messages will be displayed in the console output. +- To use the gpt-4 model, you need to join the waitlist. Please refer to + https://help.openai.com/en/articles/7102672-how-can-i-access-gpt-4 for more details. +""" + +import argparse +import json +import openai +import re +import subprocess +import random +import os +from typing import Tuple, Optional + +from sparktestsupport import SPARK_HOME + +PATH_TO_ERROR_CLASS = f"{SPARK_HOME}/common/utils/src/main/resources/error/error-classes.json" +# Register your own open API key for the environment variable. +# The API key can be obtained from https://platform.openai.com/account/api-keys. +OPENAI_API_KEY = os.environ.get("OPENAI_API_KEY") + +openai.api_key = OPENAI_API_KEY + + +def _git_grep_files(search_string: str, exclude: str = None) -> str: +""" +Executes 'git grep' command to search for files containing the given search string. +Returns the file path where the search string is found. +""" +result = subprocess.run( +["git", "grep", "-l", search_string, "--", f"{SPARK_HOME}/*.scala"], +capture_output=True, +text=True, +) +output = result.stdout.strip() + +files = output.split("\n") +files = [file for file in files if "Suite" not in file] +if exclude is not None: +files = [file for file in files if exclude not in file] +file = random.choice(files) +return file + + +def _find_function(file_name: str, search_string: str) -> Optional[str]: +""" +Searches for a function in the given file containing the specified search string. +Returns the name of the function if found, otherwise None. 
+""" +with open(file_name, "r") as file: +content = file.read() +functions = re.findall(r"def\s+(\w+)\s*\(", content) + +for function in functions: +function_content = re.search( + rf"def\s+{re.escape(function)}(?:(?!def).)*?{re.escape(search_string)}", +content, +re.DOTALL, +) +if function_content and search_string in function_content.group(0): +return function + +return None + + +def _find_func_body(file_name: str, search_string: str) -> Optional[str]: +""" +Searches for a function body in the given file containing the specified search string. +Returns the function body if found, otherwise None. +""" +with open(file_name, "r") as file: +content = file.read() +functions = re.findall(r"def\s+(\w+)\s*\(", content) + +for function in functions: +function_content = re.search( +
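The `_find_function` helper quoted in the diff above works in two regex passes: collect every `def` name, then lazily match from each `def` up to (but never across) the next `def` to see whether that body mentions the search string. A minimal standalone sketch of the same approach (the sample source and names are illustrative, not Spark's actual error helpers):

```python
import re
from typing import Optional

SAMPLE_SCALA = """
def cannotDecodeUrlError(url: String): Throwable = {
  new SparkIllegalArgumentException("CANNOT_DECODE_URL", Map("url" -> url))
}
def otherError(): Throwable = {
  new SparkException("OTHER")
}
"""

def find_function(content: str, search_string: str) -> Optional[str]:
    """Return the first function whose body contains search_string, else None."""
    for function in re.findall(r"def\s+(\w+)\s*\(", content):
        # Lazily scan from this def toward search_string; the (?!def)
        # lookahead stops the scan from leaking into the next function.
        match = re.search(
            rf"def\s+{re.escape(function)}(?:(?!def).)*?{re.escape(search_string)}",
            content,
            re.DOTALL,
        )
        if match and search_string in match.group(0):
            return function
    return None
```

The lookahead trick is what scopes the match to one function body without needing a real Scala parser.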
[GitHub] [spark] panbingkun commented on a diff in pull request #41349: [SPARK-43839][SQL] Convert `_LEGACY_ERROR_TEMP_1337` to `UNSUPPORTED_FEATURE.TIME_TRAVEL`
panbingkun commented on code in PR #41349: URL: https://github.com/apache/spark/pull/41349#discussion_r1268843065 ## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/V2SessionCatalog.scala: ## @@ -89,10 +89,12 @@ class V2SessionCatalog(catalog: SessionCatalog) throw QueryCompilationErrors.timeTravelUnsupportedError( toSQLId(catalogTable.identifier.nameParts)) } else { - throw QueryCompilationErrors.tableNotSupportTimeTravelError(ident) Review Comment: Let me investigate it.
[GitHub] [spark] zhenlineo commented on a diff in pull request #42009: [SPARK-44422][CONNECT] Spark Connect fine grained interrupt
zhenlineo commented on code in PR #42009: URL: https://github.com/apache/spark/pull/42009#discussion_r1268824443 ## connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/SparkSessionE2ESuite.scala: ## @@ -96,5 +103,151 @@ class SparkSessionE2ESuite extends RemoteSparkSession { assert(e2.getMessage.contains("OPERATION_CANCELED"), s"Unexpected exception: $e2") finished = true assert(ThreadUtils.awaitResult(interruptor, 10.seconds)) +assert(interrupted.distinct.length == 2, s"Interrupted operations: ${interrupted.distinct}.") + } + + test("interrupt tag") { +val session = spark +import session.implicits._ + +// global ExecutionContext has only 2 threads in Apache Spark CI +// create own thread pool for four Futures used in this test +val numThreads = 4 +val fpool = ThreadUtils.newForkJoinPool("job-tags-test-thread-pool", numThreads) +val executionContext = ExecutionContext.fromExecutorService(fpool) + +val q1 = Future { + assert(spark.getTags() == Set()) + spark.addTag("two") + assert(spark.getTags() == Set("two")) + spark.clearTags() // check that clearing all tags works + assert(spark.getTags() == Set()) + spark.addTag("one") + assert(spark.getTags() == Set("one")) + try { +spark + .range(10) + .map(n => { +Thread.sleep(3); n + }) + .collect() + } finally { +spark.clearTags() // clear for the case of thread reuse by another Future + } +}(executionContext) +val q2 = Future { + assert(spark.getTags() == Set()) + spark.addTag("one") + spark.addTag("two") + spark.addTag("one") + spark.addTag("two") // duplicates shouldn't matter + try { +spark + .range(10) + .map(n => { +Thread.sleep(3); n + }) + .collect() + } finally { +spark.clearTags() // clear for the case of thread reuse by another Future + } +}(executionContext) +val q3 = Future { + assert(spark.getTags() == Set()) + spark.addTag("foo") + spark.removeTag("foo") + assert(spark.getTags() == Set()) // check that remove works removing the last tag + spark.addTag("two") + assert(spark.getTags() == 
Set("two")) + try { +spark + .range(10) + .map(n => { +Thread.sleep(3); n + }) + .collect() + } finally { +spark.clearTags() // clear for the case of thread reuse by another Future + } +}(executionContext) +val q4 = Future { + assert(spark.getTags() == Set()) + spark.addTag("one") + spark.addTag("two") + spark.addTag("two") + assert(spark.getTags() == Set("one", "two")) + spark.removeTag("two") // check that remove works, despite duplicate add + assert(spark.getTags() == Set("one")) + try { +spark + .range(10) + .map(n => { +Thread.sleep(3); n + }) + .collect() + } finally { +spark.clearTags() // clear for the case of thread reuse by another Future + } +}(executionContext) +val interrupted = mutable.ListBuffer[String]() + +// q2 and q3 should be cancelled +interrupted.clear() +eventually(timeout(20.seconds), interval(1.seconds)) { + val ids = spark.interruptTag("two") + interrupted ++= ids + assert( +interrupted.distinct.length == 2, +s"Interrupted operations: ${interrupted.distinct}.") +} +val e2 = intercept[SparkException] { + ThreadUtils.awaitResult(q2, 1.minute) +} +assert(e2.getCause.getMessage contains "OPERATION_CANCELED") +val e3 = intercept[SparkException] { + ThreadUtils.awaitResult(q3, 1.minute) +} +assert(e3.getCause.getMessage contains "OPERATION_CANCELED") +assert(interrupted.distinct.length == 2, s"Interrupted operations: ${interrupted.distinct}.") + +// q1 and q4 should be cancelled +interrupted.clear() +eventually(timeout(20.seconds), interval(1.seconds)) { + val ids = spark.interruptTag("one") + interrupted ++= ids + assert( +interrupted.distinct.length == 2, +s"Interrupted operations: ${interrupted.distinct}.") +} +val e1 = intercept[SparkException] { + ThreadUtils.awaitResult(q1, 1.minute) +} +assert(e1.getCause.getMessage contains "OPERATION_CANCELED") +val e4 = intercept[SparkException] { + ThreadUtils.awaitResult(q4, 1.minute) +} +assert(e4.getCause.getMessage contains "OPERATION_CANCELED") +assert(interrupted.distinct.length == 2, 
s"Interrupted operations: ${interrupted.distinct}.") + } + + test("interrupt operation") { +val session = spark +import session.implicits._ + +val result = spark + .range(10) + .map(n => { +Thread.sleep(5000); n + }) + .collectResult()
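The tag contract exercised by the test above — duplicate adds collapse, `removeTag` deletes the tag no matter how many times it was added, `clearTags` empties everything — is exactly plain set semantics. A toy Python model of that contract (method names mirror the Connect API, but this is only a sketch, not the client implementation):

```python
class TagScope:
    """Toy model of per-thread operation tags in the Spark Connect client."""

    def __init__(self):
        self._tags = set()

    def add_tag(self, tag: str) -> None:
        self._tags.add(tag)        # duplicate adds are no-ops

    def remove_tag(self, tag: str) -> None:
        self._tags.discard(tag)    # removes the tag despite duplicate adds

    def clear_tags(self) -> None:
        self._tags.clear()

    def get_tags(self) -> frozenset:
        return frozenset(self._tags)
```

Interrupting by tag then reduces to "cancel every operation whose tag set contains the given tag", which is why `interruptTag("two")` in the test cancels exactly q2 and q3.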
[GitHub] [spark] ericm-db opened a new pull request, #42080: Statestoresuite threadpool
ericm-db opened a new pull request, #42080: URL: https://github.com/apache/spark/pull/42080 ### What changes were proposed in this pull request? ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ### How was this patch tested?
[GitHub] [spark] allisonwang-db commented on pull request #42075: [SPARK-43966][SQL][PYTHON] Support non-deterministic table-valued functions
allisonwang-db commented on PR #42075: URL: https://github.com/apache/spark/pull/42075#issuecomment-1642927436 cc @HyukjinKwon @cloud-fan
[GitHub] [spark] github-actions[bot] commented on pull request #40608: [SPARK-35198][CONNECT][CORE][PYTHON][SQL] Add support for calling debugCodegen from Python & Java
github-actions[bot] commented on PR #40608: URL: https://github.com/apache/spark/pull/40608#issuecomment-1642927305 We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!
[GitHub] [spark] github-actions[bot] commented on pull request #40728: [WIP][SPARK-39634][SQL] Allow file splitting in combination with row index generation
github-actions[bot] commented on PR #40728: URL: https://github.com/apache/spark/pull/40728#issuecomment-1642927268 We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!
[GitHub] [spark] xinrong-meng opened a new pull request, #42079: [WIP][SPARK-44486][PYTHON][CONNECT] Implement PyArrow `self_destruct` feature for `toPandas`
xinrong-meng opened a new pull request, #42079: URL: https://github.com/apache/spark/pull/42079 ### What changes were proposed in this pull request? Implement the Arrow `self_destruct` option for `toPandas` to reduce memory usage. The Spark configuration `spark.sql.execution.arrow.pyspark.selfDestruct.enabled` can now be used to enable PyArrow's `self_destruct` feature in Spark Connect, which saves memory when creating a Pandas DataFrame via `toPandas` by freeing Arrow-allocated memory while the Pandas DataFrame is being built. ### Why are the changes needed? Reach parity with vanilla PySpark. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? TBD
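Conceptually, `self_destruct` frees each Arrow batch as soon as it has been converted, instead of keeping the entire Arrow copy alive alongside the growing Pandas copy until the end. A toy Python illustration of that idea (no PyArrow involved; "freeing" is simulated by dropping references, and the peak-memory proxy just counts live elements):

```python
def convert_batches(batches, self_destruct=False):
    """Convert a list of 'batches' into one result list, optionally freeing
    each batch right after it is consumed (the self_destruct idea)."""
    result = []
    peak_live = 0
    live = list(batches)  # stands in for Arrow-allocated batches
    for batch in list(live):
        result.extend(batch)  # stands in for the Pandas-side copy
        # Peak is measured while both copies of this batch are still alive.
        peak_live = max(peak_live, len(result) + sum(len(b) for b in live))
        if self_destruct:
            live.remove(batch)  # release the Arrow-side copy eagerly
    return result, peak_live
```

The result is identical either way; only the peak number of simultaneously live elements shrinks, which is exactly the trade the real feature makes (at the cost of Arrow data not being reusable afterwards).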
[GitHub] [spark] HyukjinKwon closed pull request #42072: [SPARK-44481][CONNECT][PYTHON] Make pyspark.sql.is_remote an API
HyukjinKwon closed pull request #42072: [SPARK-44481][CONNECT][PYTHON] Make pyspark.sql.is_remote an API URL: https://github.com/apache/spark/pull/42072
[GitHub] [spark] HyukjinKwon commented on pull request #42072: [SPARK-44481][CONNECT][PYTHON] Make pyspark.sql.is_remote an API
HyukjinKwon commented on PR #42072: URL: https://github.com/apache/spark/pull/42072#issuecomment-1642913899 Merged to master.
[GitHub] [spark] HyukjinKwon closed pull request #41831: [SPARK-44278][CONNECT] Implement a GRPC server interceptor that cleans up thread local properties
HyukjinKwon closed pull request #41831: [SPARK-44278][CONNECT] Implement a GRPC server interceptor that cleans up thread local properties URL: https://github.com/apache/spark/pull/41831
[GitHub] [spark] HyukjinKwon commented on pull request #41831: [SPARK-44278][CONNECT] Implement a GRPC server interceptor that cleans up thread local properties
HyukjinKwon commented on PR #41831: URL: https://github.com/apache/spark/pull/41831#issuecomment-1642910764 Merged to master.
[GitHub] [spark] HyukjinKwon commented on a diff in pull request #42072: [SPARK-44481][CONNECT][PYTHON] Make pyspark.sql.is_remote an API
HyukjinKwon commented on code in PR #42072: URL: https://github.com/apache/spark/pull/42072#discussion_r1268791343 ## python/pyspark/sql/__init__.py: ## @@ -72,4 +73,5 @@ "DataFrameWriter", "DataFrameWriterV2", "PandasCogroupedOps", +"is_remote", Review Comment: I think we should just expose it under `pyspark.sql` because APIs are usually exposed under that namespace. I think you can use it as: ```python is_remote = getattr(pyspark.sql, "is_remote", getattr(pyspark.sql.utils, "is_remote", None))() ```
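The fallback pattern suggested in the snippet above — look the attribute up in its new home first, then fall back to the old one — can be exercised with stand-in namespaces (the "modules" here are dummies, not pyspark itself):

```python
from types import SimpleNamespace
from typing import Optional

# Dummy stand-ins for pyspark.sql (new home) and pyspark.sql.utils (old home).
new_home = SimpleNamespace(is_remote=lambda: True)
old_home = SimpleNamespace(is_remote=lambda: False)

def resolve_is_remote(primary, fallback) -> Optional[bool]:
    """Prefer `is_remote` on primary, fall back to fallback, else None."""
    fn = getattr(primary, "is_remote", getattr(fallback, "is_remote", None))
    return fn() if fn is not None else None
```

The nested `getattr` keeps callers working across versions where the function lived only in the old location.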
[GitHub] [spark] juliuszsompolski commented on a diff in pull request #42009: [SPARK-44422][CONNECT] Spark Connect fine grained interrupt
juliuszsompolski commented on code in PR #42009: URL: https://github.com/apache/spark/pull/42009#discussion_r1268751980 ## connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/connect/client/SparkResult.scala: ## @@ -40,6 +40,7 @@ private[sql] class SparkResult[T]( extends AutoCloseable with Cleanable { self => + private[this] var opId: String = null Review Comment: Yeah, but I adapted to the surrounding style of ``` private[this] var opId: String = null private[this] var numRecords: Int = 0 private[this] var structType: StructType = _ private[this] var arrowSchema: pojo.Schema = _ private[this] var nextResultIndex: Int = 0 ``` From ``` def schema: StructType = { if (structType == null) { processResponses(stopOnSchema = true) } structType } ``` it looks like I can assume `_` is null, so I could make it `_` and maybe that's more idiomatic.
[GitHub] [spark] asl3 commented on a diff in pull request #41711: [SPARK-44155] Adding a dev utility to improve error messages based on LLM
asl3 commented on code in PR #41711: URL: https://github.com/apache/spark/pull/41711#discussion_r1268722533 ## dev/error_message_refiner.py: ## @@ -0,0 +1,265 @@ +#!/usr/bin/env python3 + +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +""" +Utility for refining error messages based on LLM. + +Usage: +python error_message_refiner.py [--model_name=] + +Arguments: + Required. +The name of the error class to refine the messages for. +The list of error classes is located in + `common/utils/src/main/resources/error/error-classes.json`. + +Options: +--model_name= Optional. +The version of Chat GPT to use for refining the error messages. +If not provided, the default version("gpt-3.5-turbo") will be used. + +Example usage: +python error_message_refiner.py CANNOT_DECODE_URL --model_name=gpt-4 + +Description: +This script refines error messages using the LLM based approach. +It takes the name of the error class as a required argument and, optionally, +allows specifying the version of Chat GPT to use for refining the messages. + +Options: +--model_name: Specifies the version of Chat GPT. + If not provided, the default version("gpt-3.5-turbo") will be used. 
+ +Note: +- Ensure that the necessary dependencies are installed before running the script. +- Ensure that the valid API key is entered in the `api-key.txt`. +- The refined error messages will be displayed in the console output. +- To use the gpt-4 model, you need to join the waitlist. Please refer to + https://help.openai.com/en/articles/7102672-how-can-i-access-gpt-4 for more details. +""" + +import argparse +import json +import openai +import re +import subprocess +import random +import os +from typing import Tuple, Optional + +from sparktestsupport import SPARK_HOME + +PATH_TO_ERROR_CLASS = f"{SPARK_HOME}/common/utils/src/main/resources/error/error-classes.json" +# Register your own open API key for the environment variable. +# The API key can be obtained from https://platform.openai.com/account/api-keys. +OPENAI_API_KEY = os.environ.get("OPENAI_API_KEY") + +openai.api_key = OPENAI_API_KEY + + +def _git_grep_files(search_string: str, exclude: str = None) -> str: +""" +Executes 'git grep' command to search for files containing the given search string. +Returns the file path where the search string is found. +""" +result = subprocess.run( +["git", "grep", "-l", search_string, "--", f"{SPARK_HOME}/*.scala"], +capture_output=True, +text=True, +) +output = result.stdout.strip() + +files = output.split("\n") +files = [file for file in files if "Suite" not in file] +if exclude is not None: +files = [file for file in files if exclude not in file] +file = random.choice(files) +return file + + +def _find_function(file_name: str, search_string: str) -> Optional[str]: +""" +Searches for a function in the given file containing the specified search string. +Returns the name of the function if found, otherwise None. 
+""" +with open(file_name, "r") as file: +content = file.read() +functions = re.findall(r"def\s+(\w+)\s*\(", content) + +for function in functions: +function_content = re.search( + rf"def\s+{re.escape(function)}(?:(?!def).)*?{re.escape(search_string)}", +content, +re.DOTALL, +) +if function_content and search_string in function_content.group(0): +return function + +return None + + +def _find_func_body(file_name: str, search_string: str) -> Optional[str]: +""" +Searches for a function body in the given file containing the specified search string. +Returns the function body if found, otherwise None. +""" +with open(file_name, "r") as file: +content = file.read() +functions = re.findall(r"def\s+(\w+)\s*\(", content) + +for function in functions: +function_content = re.search( +
[GitHub] [spark] asl3 commented on a diff in pull request #41711: [SPARK-44155] Adding a dev utility to improve error messages based on LLM
asl3 commented on code in PR #41711: URL: https://github.com/apache/spark/pull/41711#discussion_r1268719990

## dev/error_message_refiner.py:

@@ -0,0 +1,265 @@
+#!/usr/bin/env python3
+
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+"""
+Utility for refining error messages based on LLM.
+
+Usage:
+    python error_message_refiner.py [--model_name=]
+
+Arguments:
+     Required.
+        The name of the error class to refine the messages for.
+        The list of error classes is located in
+        `common/utils/src/main/resources/error/error-classes.json`.
+
+Options:
+    --model_name=  Optional.
+        The version of Chat GPT to use for refining the error messages.
+        If not provided, the default version ("gpt-3.5-turbo") will be used.
+
+Example usage:
+    python error_message_refiner.py CANNOT_DECODE_URL --model_name=gpt-4
+
+Description:
+    This script refines error messages using the LLM-based approach.
+    It takes the name of the error class as a required argument and, optionally,
+    allows specifying the version of Chat GPT to use for refining the messages.
+
+Options:
+    --model_name: Specifies the version of Chat GPT.
+        If not provided, the default version ("gpt-3.5-turbo") will be used.
+
+Note:
+- Ensure that the necessary dependencies are installed before running the script.
+- Ensure that a valid API key is entered in `api-key.txt`.
+- The refined error messages will be displayed in the console output.
+- To use the gpt-4 model, you need to join the waitlist. Please refer to
+  https://help.openai.com/en/articles/7102672-how-can-i-access-gpt-4 for more details.
+"""
+
+import argparse
+import json
+import openai
+import re
+import subprocess
+import random
+import os
+from typing import Tuple, Optional
+
+from sparktestsupport import SPARK_HOME
+
+PATH_TO_ERROR_CLASS = f"{SPARK_HOME}/common/utils/src/main/resources/error/error-classes.json"
+# Register your own OpenAI API key in the environment variable.
+# The API key can be obtained from https://platform.openai.com/account/api-keys.
+OPENAI_API_KEY = os.environ.get("OPENAI_API_KEY")
+
+openai.api_key = OPENAI_API_KEY
+
+
+def _git_grep_files(search_string: str, exclude: str = None) -> str:
+    """
+    Executes 'git grep' command to search for files containing the given search string.
+    Returns the file path where the search string is found.
+    """
+    result = subprocess.run(
+        ["git", "grep", "-l", search_string, "--", f"{SPARK_HOME}/*.scala"],
+        capture_output=True,
+        text=True,
+    )
+    output = result.stdout.strip()
+
+    files = output.split("\n")
+    files = [file for file in files if "Suite" not in file]
+    if exclude is not None:
+        files = [file for file in files if exclude not in file]
+    file = random.choice(files)
+    return file
+
+
+def _find_function(file_name: str, search_string: str) -> Optional[str]:
+    """
+    Searches for a function in the given file containing the specified search string.
+    Returns the name of the function if found, otherwise None.
+    """
+    with open(file_name, "r") as file:
+        content = file.read()
+    functions = re.findall(r"def\s+(\w+)\s*\(", content)
+
+    for function in functions:
+        function_content = re.search(
+            rf"def\s+{re.escape(function)}(?:(?!def).)*?{re.escape(search_string)}",
+            content,
+            re.DOTALL,
+        )
+        if function_content and search_string in function_content.group(0):
+            return function
+
+    return None
+
+
+def _find_func_body(file_name: str, search_string: str) -> Optional[str]:
+    """
+    Searches for a function body in the given file containing the specified search string.
+    Returns the function body if found, otherwise None.
+    """
+    with open(file_name, "r") as file:
+        content = file.read()
+    functions = re.findall(r"def\s+(\w+)\s*\(", content)
+
+    for function in functions:
+        function_content = re.search(
[GitHub] [spark] grundprinzip commented on a diff in pull request #42009: [SPARK-44422][CONNECT] Spark Connect fine grained interrupt
grundprinzip commented on code in PR #42009: URL: https://github.com/apache/spark/pull/42009#discussion_r1268715542

## connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/connect/client/SparkResult.scala:

@@ -40,6 +40,7 @@ private[sql] class SparkResult[T](
     extends AutoCloseable
     with Cleanable { self =>
+  private[this] var opId: String = null

Review Comment: is using `Option[String]` more idiomatic?

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] mathewjacob1002 opened a new pull request, #42078: [WIP][DO NOT REVIEW] Testing stuff
mathewjacob1002 opened a new pull request, #42078: URL: https://github.com/apache/spark/pull/42078

First commit
[GitHub] [spark] juliuszsompolski commented on pull request #42009: [SPARK-44422][CONNECT] Spark Connect fine grained interrupt
juliuszsompolski commented on PR #42009: URL: https://github.com/apache/spark/pull/42009#issuecomment-1642777980

cc @zhenlineo @HyukjinKwon
[GitHub] [spark] WweiL opened a new pull request, #42077: [SPARK-44484][SS]Add batchDuration to StreamingQueryProgress json method
WweiL opened a new pull request, #42077: URL: https://github.com/apache/spark/pull/42077

### What changes were proposed in this pull request?
Add the missing field batchDuration to the StreamingQueryProgress json method. Also modify tests accordingly.

### Why are the changes needed?
Add a missing field.

### Does this PR introduce _any_ user-facing change?
Probably yes - in calls to `query.lastProgress` or `query.recentProgress`, and inside listeners, this new field will show up.

### How was this patch tested?
Existing unit tests.
[GitHub] [spark] ueshin commented on a diff in pull request #41948: [SPARK-44380][SQL][PYTHON] Support for Python UDTF to analyze in Python
ueshin commented on code in PR #41948: URL: https://github.com/apache/spark/pull/41948#discussion_r1268593999

## python/pyspark/sql/udtf.py:

@@ -153,6 +175,19 @@ def _validate_udtf_handler(cls: Any) -> None:
             error_class="INVALID_UDTF_NO_EVAL", message_parameters={"name": cls.__name__}
         )
+    has_analyze_staticmethod = hasattr(cls, "analyze") and isinstance(
+        inspect.getattr_static(cls, "analyze"), staticmethod

Review Comment: So now at [4e2b6be](https://github.com/apache/spark/pull/41948/commits/4e2b6befbd015fcdad1ddf6ccecd79a630ac7d60):
- When `returnType` is not provided, the class must have an `analyze` static method; otherwise raise an error `INVALID_UDTF_RETURN_TYPE`.
- When `returnType` is provided and the class also has any `analyze` attribute, raise an error `INVALID_UDTF_BOTH_RETURN_TYPE_AND_ANALYZE`.
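The two validation rules described in that comment can be sketched as a simplified stand-in (this is an illustration of the rules, not the actual code in `pyspark.sql.udtf`; the function and error strings are assumptions):

```python
import inspect
from typing import Any, Optional


def validate_udtf_handler(cls: Any, return_type: Optional[str]) -> None:
    """Toy version of the validation: returnType and a static analyze()
    are mutually exclusive, and at least one of them must be present."""
    has_analyze = hasattr(cls, "analyze")
    has_analyze_staticmethod = has_analyze and isinstance(
        inspect.getattr_static(cls, "analyze"), staticmethod
    )
    if return_type is None and not has_analyze_staticmethod:
        raise TypeError("INVALID_UDTF_RETURN_TYPE")
    if return_type is not None and has_analyze:
        raise TypeError("INVALID_UDTF_BOTH_RETURN_TYPE_AND_ANALYZE")


class GoodUDTF:
    @staticmethod
    def analyze(*args):
        return "id int"

    def eval(self, *args):
        yield (1,)


# Passes: no returnType given, and analyze is a proper staticmethod.
validate_udtf_handler(GoodUDTF, None)
```

Note the use of `inspect.getattr_static`, which retrieves the class attribute without triggering the descriptor protocol, so a `staticmethod` can still be recognized as such.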
[GitHub] [spark] hvanhovell opened a new pull request, #42076: [SPARK-44449][CONNECT] Upcasting for direct Arrow Deserialization
hvanhovell opened a new pull request, #42076: URL: https://github.com/apache/spark/pull/42076

### What changes were proposed in this pull request?

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
[GitHub] [spark] dtenedor commented on a diff in pull request #41948: [SPARK-44380][SQL][PYTHON] Support for Python UDTF to analyze in Python
dtenedor commented on code in PR #41948: URL: https://github.com/apache/spark/pull/41948#discussion_r1268552267

## python/pyspark/sql/udtf.py:

@@ -153,6 +175,19 @@ def _validate_udtf_handler(cls: Any) -> None:
             error_class="INVALID_UDTF_NO_EVAL", message_parameters={"name": cls.__name__}
         )
+    has_analyze_staticmethod = hasattr(cls, "analyze") and isinstance(
+        inspect.getattr_static(cls, "analyze"), staticmethod

Review Comment: I would recommend returning an error in this case if it's easy to do so, in order to prevent confusion. This would be in addition to the existing error that @ueshin pointed out that we are already returning:

> It IS checking if the UDTF class has an analyze method and throw an error if it is not static, and the error message will be like:

```
The UDTF '' is invalid. It does not specify its return type or implement the required 'analyze' static method. Please specify the return type or implement the 'analyze' static method in '' and try again.
```
[GitHub] [spark] juliuszsompolski commented on a diff in pull request #42009: [SPARK-44422][CONNECT] Spark Connect fine grained interrupt
juliuszsompolski commented on code in PR #42009: URL: https://github.com/apache/spark/pull/42009#discussion_r1268542613

## connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/SparkSession.scala:

@@ -613,16 +613,30 @@ class SparkSession private[sql] (
   /**
    * Interrupt all operations of this session currently running on the connected server.
    *
-   * TODO/WIP: Currently it will interrupt the Spark Jobs running on the server, triggered from
-   * ExecutePlan requests. If an operation is not running a Spark Job, it becomes an noop and the
-   * operation will continue afterwards, possibly with more Spark Jobs.
-

Review Comment: This has actually been fixed by https://github.com/apache/spark/pull/41315 (now the execution runs in a different thread, and the interrupt interrupts that thread, not only Spark Jobs).
[GitHub] [spark] ueshin commented on pull request #41948: [SPARK-44380][SQL][PYTHON] Support for Python UDTF to analyze in Python
ueshin commented on PR #41948: URL: https://github.com/apache/spark/pull/41948#issuecomment-1642586143

The failing test is [Spark on Kubernetes Integration test](https://github.com/ueshin/apache-spark/actions/runs/5596078914/jobs/10247102939#logs) that seems to be broken in the `master` branch.
[GitHub] [spark] allisonwang-db opened a new pull request, #42075: [SPARK-43966][SQL][PYTHON] Support non-deterministic table-valued functions
allisonwang-db opened a new pull request, #42075: URL: https://github.com/apache/spark/pull/42075

### What changes were proposed in this pull request?
This PR supports non-deterministic table-valued functions. More specifically, it supports running non-deterministic Python UDTFs and built-in table-valued generator functions with non-deterministic input values.

### Why are the changes needed?
To make table-valued functions more versatile.

### Does this PR introduce _any_ user-facing change?
Yes. Before this PR, Spark will throw an exception when running a non-deterministic Python UDTF:
```
select * from random_udtf(1)
AnalysisException: [INVALID_NON_DETERMINISTIC_EXPRESSIONS] The operator expects a deterministic expression,
```
After this PR, it is supported.

### How was this patch tested?
Existing and new unit tests.
[GitHub] [spark] siying opened a new pull request, #42074: [SPARK-44464][SS] Fix applyInPandasWithStatePythonRunner to output rows that have Null as first column value
siying opened a new pull request, #42074: URL: https://github.com/apache/spark/pull/42074

### What changes were proposed in this pull request?
Change the serialization format for group-by-with-state outputs: include an explicit hidden column indicating how many data and state records there are.

### Why are the changes needed?
The current implementation of ApplyInPandasWithStatePythonRunner cannot deal with outputs where the first column of the row is null, as it cannot distinguish the case where the column is null from the case where the field is left empty because there are fewer data records than state records. It causes incorrect results for the former case.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Add unit tests that cover null cases and different other scenarios.

Closes #42046 from siying/pypanda.

Authored-by: Siying Dong
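The core idea of the fix - carry explicit record counts instead of inferring them from null columns - can be sketched with a toy encoding (this is an illustration only, not the actual Arrow-based format used by ApplyInPandasWithStatePythonRunner):

```python
from typing import List, Optional, Tuple

Row = Tuple[Optional[int], Optional[str]]


def encode(data: List[Row], state: List[Row]) -> dict:
    # Carry explicit counts so a None in the first column of a data row
    # cannot be confused with padding emitted when len(data) < len(state).
    return {"num_data": len(data), "num_state": len(state), "rows": data + state}


def decode(batch: dict) -> Tuple[List[Row], List[Row]]:
    n = batch["num_data"]
    return batch["rows"][:n], batch["rows"][n:]


data = [(None, "a")]              # first column is legitimately null
state = [(1, "s1"), (2, "s2")]    # more state records than data records
data_out, state_out = decode(encode(data, state))
print(data_out)  # [(None, 'a')] - the null survives unambiguously
```

Without the count fields, a decoder that detects the data/state boundary by looking for a null first column would misclassify the legitimate `(None, "a")` row, which is exactly the bug described above.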
[GitHub] [spark] siying commented on pull request #42046: [SPARK-44464][SS] Implement applyInPandasWithState in PySpark
siying commented on PR #42046: URL: https://github.com/apache/spark/pull/42046#issuecomment-1642557386

> @siying There was a conflict. Could you please create a PR against branch-3.4? Thanks in advance!
>
> (Btw, I didn't indicate that title is not accurate. Could you please fix the title when you submit a PR for 3.4?)

Oops. I didn't realize the title was wrong. Perhaps I copied the wrong title.
[GitHub] [spark] siying commented on pull request #42046: [SPARK-44464][SS] Implement applyInPandasWithState in PySpark
siying commented on PR #42046: URL: https://github.com/apache/spark/pull/42046#issuecomment-1642555659

Sure. Will do that.
[GitHub] [spark] ericm-db commented on pull request #42066: [SPARK-44480][SS] Use thread pool to perform maintenance activity for hdfs/rocksdb state store providers
ericm-db commented on PR #42066: URL: https://github.com/apache/spark/pull/42066#issuecomment-1642470993

> CI looks failing. Could you please look into it? https://github.com/ericm-db/spark/actions/runs/5595467996/jobs/10231376237

@HeartSaVioR looks like I was setting the runnable maintenance task incorrectly. It should be fine now.
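The pattern in the PR title - fanning per-provider maintenance work out to a thread pool instead of running it in one sequential loop - can be sketched as follows (provider names and the maintenance body are made up for illustration; this is not Spark's actual state store maintenance code):

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical state store providers whose maintenance (snapshotting,
# old-file cleanup, ...) should proceed in parallel.
providers = ["hdfs-0", "hdfs-1", "rocksdb-0", "rocksdb-1"]


def do_maintenance(provider: str) -> str:
    # Stand-in for StateStoreProvider.doMaintenance(); just reports completion.
    return f"maintained:{provider}"


# A bounded pool caps concurrent maintenance tasks while letting a slow
# provider stop blocking the others.
with ThreadPoolExecutor(max_workers=2) as pool:
    results = list(pool.map(do_maintenance, providers))

print(results)
```

`ThreadPoolExecutor.map` preserves submission order in its results, so the output lists the providers in the same order they were scheduled even though execution interleaves.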
[GitHub] [spark] amaliujia commented on a diff in pull request #41928: [SPARK-44475][SQL][CONNECT] Relocate DataType and Parser to sql/api
amaliujia commented on code in PR #41928: URL: https://github.com/apache/spark/pull/41928#discussion_r1268235385

## sql/api/src/main/scala/org/apache/spark/sql/catalyst/util/SparkAnalysisUtils.scala:

@@ -0,0 +1,68 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql.catalyst.util
+
+import org.apache.spark.sql.errors.DataTypeErrors
+
+trait SparkAnalysisUtils {

Review Comment: done
[GitHub] [spark] amaliujia commented on a diff in pull request #41928: [SPARK-44475][SQL][CONNECT] Relocate DataType and Parser to sql/api
amaliujia commented on code in PR #41928: URL: https://github.com/apache/spark/pull/41928#discussion_r1268233819

## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/ParserUtils.scala:

@@ -16,34 +16,20 @@
  */
 package org.apache.spark.sql.catalyst.parser

-import java.lang.{Long => JLong}
-import java.nio.CharBuffer
 import java.util
 import java.util.Locale

 import org.antlr.v4.runtime.{ParserRuleContext, Token}
 import org.antlr.v4.runtime.misc.Interval
-import org.antlr.v4.runtime.tree.{ParseTree, TerminalNode, TerminalNodeImpl}
+import org.antlr.v4.runtime.tree.{ParseTree, TerminalNodeImpl}

-import org.apache.spark.sql.catalyst.trees.{CurrentOrigin, Origin}
+import org.apache.spark.sql.catalyst.util.SparkParserUtils
 import org.apache.spark.sql.errors.QueryParsingErrors

 /**
  * A collection of utility methods for use during the parsing process.
  */
-object ParserUtils {
-
-  val U16_CHAR_PATTERN = """\\u([a-fA-F0-9]{4})(?s).*""".r
-  val U32_CHAR_PATTERN = """\\U([a-fA-F0-9]{8})(?s).*""".r
-  val OCTAL_CHAR_PATTERN = """\\([01][0-7]{2})(?s).*""".r
-  val ESCAPED_CHAR_PATTERN = """\\((?s).)(?s).*""".r
-
-  /** Get the command which created the token. */
-  def command(ctx: ParserRuleContext): String = {
-    val stream = ctx.getStart.getInputStream
-    stream.getText(Interval.of(0, stream.size() - 1))
-  }
-
+object ParserUtils extends SparkParserUtils {

Review Comment: We could. I happened to move only the smallest part needed for this refactoring PR, and it seems to be enough for moving a small portion.
[GitHub] [spark] cloud-fan commented on a diff in pull request #41349: [SPARK-43839][SQL] Convert `_LEGACY_ERROR_TEMP_1337` to `UNSUPPORTED_FEATURE.TIME_TRAVEL`
cloud-fan commented on code in PR #41349: URL: https://github.com/apache/spark/pull/41349#discussion_r1268198286

## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/V2SessionCatalog.scala:

@@ -89,10 +89,12 @@ class V2SessionCatalog(catalog: SessionCatalog)
         throw QueryCompilationErrors.timeTravelUnsupportedError(

Review Comment: do we still need the `if-else`?
[GitHub] [spark] cloud-fan commented on a diff in pull request #41349: [SPARK-43839][SQL] Convert `_LEGACY_ERROR_TEMP_1337` to `UNSUPPORTED_FEATURE.TIME_TRAVEL`
cloud-fan commented on code in PR #41349: URL: https://github.com/apache/spark/pull/41349#discussion_r1268192638

## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/V2SessionCatalog.scala:

@@ -89,10 +89,12 @@ class V2SessionCatalog(catalog: SessionCatalog)
         throw QueryCompilationErrors.timeTravelUnsupportedError(
           toSQLId(catalogTable.identifier.nameParts))
       } else {
-        throw QueryCompilationErrors.tableNotSupportTimeTravelError(ident)

Review Comment: can we remove `tableNotSupportTimeTravelError` from `QueryCompilationErrors`? @panbingkun @beliefer
[GitHub] [spark] cloud-fan commented on a diff in pull request #41850: [SPARK-44292][SQL] Assign names to the error class _LEGACY_ERROR_TEMP_[2315-2319]
cloud-fan commented on code in PR #41850: URL: https://github.com/apache/spark/pull/41850#discussion_r1268138809

## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala:

@@ -277,13 +277,13 @@ trait CheckAnalysis extends PredicateHelper with LookupCatalog with QueryErrorsB
                 e.dataTypeMismatch(e, checkRes)
               case TypeCheckResult.TypeCheckFailure(message) =>
                 e.setTagValue(DATA_TYPE_MISMATCH_ERROR, true)
-                extraHintForAnsiTypeCoercionExpression(operator)
+                val extraHint = extraHintForAnsiTypeCoercionExpression(operator)
                 e.failAnalysis(
-                  errorClass = "_LEGACY_ERROR_TEMP_2315",
+                  errorClass = "TYPE_CHECK_FAILURE_WITH_HINT",

Review Comment: This should be `DATATYPE_MISMATCH.TYPE_CHECK_FAILURE_WITH_HINT`. Can we add a test? @beliefer
[GitHub] [spark] hvanhovell closed pull request #42011: [SPARK-44396][Connect] Direct Arrow Deserialization
hvanhovell closed pull request #42011: [SPARK-44396][Connect] Direct Arrow Deserialization URL: https://github.com/apache/spark/pull/42011
[GitHub] [spark] hvanhovell commented on pull request #42011: [SPARK-44396][Connect] Direct Arrow Deserialization
hvanhovell commented on PR #42011: URL: https://github.com/apache/spark/pull/42011#issuecomment-1642080475

Merging this. Test failure is unrelated.
[GitHub] [spark] jchen5 commented on pull request #42007: [SPARK-44431][SQL] Fix behavior of null IN (empty list) in optimization rules
jchen5 commented on PR #42007: URL: https://github.com/apache/spark/pull/42007#issuecomment-1642057004

@cloud-fan PR is updated
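For context on the semantics this PR title refers to: under SQL three-valued logic, `x IN (empty list)` is definitively false even when `x` is NULL, because there is nothing to match; the subtle case is only `NULL IN (non-empty list)`, which is unknown. A sketch of the intended evaluation (an illustration of standard SQL semantics, not Spark's actual `In` expression code):

```python
from typing import List, Optional


def sql_in(value: Optional[int], values: List[Optional[int]]) -> Optional[bool]:
    """Three-valued IN: returns True, False, or None (SQL NULL/unknown)."""
    if not values:
        return False          # empty list: nothing can match, even for NULL
    if value is None:
        return None           # NULL IN (non-empty ...) is unknown
    if value in [v for v in values if v is not None]:
        return True
    # No match found; a NULL in the list makes the result unknown.
    return None if any(v is None for v in values) else False


print(sql_in(None, []))       # False, not None
print(sql_in(None, [1, 2]))   # None (unknown)
print(sql_in(1, [1, None]))   # True
```

The bug class the PR addresses is optimization rules that fold the empty-list case to NULL instead of false, which diverges from this truth table.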
[GitHub] [spark] heyihong commented on pull request #41831: [SPARK-44278][CONNECT] Implement a GRPC server interceptor that cleans up thread local properties
heyihong commented on PR #41831: URL: https://github.com/apache/spark/pull/41831#issuecomment-1642046358

> Should this interceptor be included by default (as the outermost interceptor)?

I am not sure... But maybe we can do this in a separate pr if needed
[GitHub] [spark] harupy commented on a diff in pull request #42072: [SPARK-44481][CONNECT][PYTHON] Make pyspark.sql.is_remote an API
harupy commented on code in PR #42072: URL: https://github.com/apache/spark/pull/42072#discussion_r1267980107

## python/pyspark/sql/__init__.py:

@@ -72,4 +73,5 @@
     "DataFrameWriter",
     "DataFrameWriterV2",
     "PandasCogroupedOps",
+    "is_remote",

Review Comment: Can we expose this as `pyspark.sql.utils.is_remote`?
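The if-else dispatch that exposing `is_remote` enables can be sketched as follows. This is a simplified stand-in: the toy `is_remote` below only checks the `SPARK_REMOTE` environment variable, which is one of the signals the real pyspark helper considers, and `make_session_description` is a made-up consumer.

```python
import os


def is_remote() -> bool:
    """Simplified stand-in for the is_remote API: reports whether we are
    in Spark Connect mode based on the SPARK_REMOTE environment variable."""
    return "SPARK_REMOTE" in os.environ


def make_session_description() -> str:
    # The dispatch pattern end users can now write: route to the Connect
    # code path or the legacy code path.
    if is_remote():
        return "connect"
    return "legacy"


os.environ.pop("SPARK_REMOTE", None)
print(make_session_description())        # legacy

os.environ["SPARK_REMOTE"] = "sc://localhost"
print(make_session_description())        # connect
```

Libraries built on top of PySpark use exactly this kind of check to avoid touching JVM-only APIs when running against a Connect session.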
[GitHub] [spark] beliefer commented on a diff in pull request #41932: [SPARK-44131][SQL][PYTHON][CONNECT][FOLLOWUP] Support qualified function name for call_function
beliefer commented on code in PR #41932: URL: https://github.com/apache/spark/pull/41932#discussion_r1267967007

## connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/connect/client/util/RemoteSparkSession.scala:

@@ -139,7 +139,18 @@ object SparkConnectServerUtils {
       .map(clientTestJar => Seq("--jars", clientTestJar.getCanonicalPath))
       .getOrElse(Seq.empty)

-    writerV2Configs ++ hiveTestConfigs ++ udfTestConfigs

Review Comment: Thank you for the double check. @LuciferYang
[GitHub] [spark] LuciferYang commented on a diff in pull request #41932: [SPARK-44131][SQL][PYTHON][CONNECT][FOLLOWUP] Support qualified function name for call_function
LuciferYang commented on code in PR #41932: URL: https://github.com/apache/spark/pull/41932#discussion_r1267940632

## connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/connect/client/util/RemoteSparkSession.scala:

@@ -139,7 +139,18 @@ object SparkConnectServerUtils {
       .map(clientTestJar => Seq("--jars", clientTestJar.getCanonicalPath))
       .getOrElse(Seq.empty)

-    writerV2Configs ++ hiveTestConfigs ++ udfTestConfigs

Review Comment: Both Maven and sbt are OK now, thanks @beliefer
[GitHub] [spark] panbingkun commented on pull request #42073: [SPARK-44482][CONNECT] Connect server should can specify the bind address
panbingkun commented on PR #42073: URL: https://github.com/apache/spark/pull/42073#issuecomment-1641909732

The manual testing process is as follows:
1. My local env is as follows: 172.xxx.xxx.xxx
2. Set the following configuration in `spark-defaults.conf`:
   - spark.connect.grpc.binding.address 127.0.0.1
   - spark.connect.grpc.binding.port 5000
3. Start the connect server: `sh sbin/start-connect-server.sh`
4. Specify the host name and port to connect to the server:
   - Success scenario: https://github.com/apache/spark/assets/15246973/01e2b7ce-5e1c-4c3b-b321-e97df43b8a26
   - Failure scenario: https://github.com/apache/spark/assets/15246973/d7de08dc-ac46-4e45-98ca-2d54646fbdbb
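The bind-address behavior being tested above - a server bound to a configured address (cf. `spark.connect.grpc.binding.address 127.0.0.1`) accepting connections on that address - can be sketched with a plain TCP socket (a toy illustration, not the gRPC server itself; port 0 is used so the OS picks a free port):

```python
import socket

# Bind a listener to the configured address, as the connect server would.
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))      # port 0: let the OS choose a free port
server.listen(1)
host, port = server.getsockname()

# A client connecting to the bound host:port succeeds; connecting via any
# other local address would be refused, which is the failure scenario above.
client = socket.create_connection((host, port), timeout=2)
conn, _ = server.accept()
print("connected to", host)

client.close()
conn.close()
server.close()
```

Binding to `127.0.0.1` rather than `0.0.0.0` restricts the server to loopback clients, which is the point of making the bind address configurable.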
[GitHub] [spark] panbingkun opened a new pull request, #42073: [SPARK-44482][CONNECT] Connect server should can specify the bind address
panbingkun opened a new pull request, #42073: URL: https://github.com/apache/spark/pull/42073

### What changes were proposed in this pull request?

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
- Manually test.
- Pass GA.
[GitHub] [spark] beliefer commented on a diff in pull request #41932: [SPARK-44131][SQL][PYTHON][CONNECT][FOLLOWUP] Support qualified function name for call_function
beliefer commented on code in PR #41932: URL: https://github.com/apache/spark/pull/41932#discussion_r1267892365

## connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/ClientE2ETestSuite.scala:

@@ -1161,6 +1161,27 @@ class ClientE2ETestSuite extends RemoteSparkSession with SQLHelper with PrivateM
     val joined = ds1.joinWith(ds2, $"a.value._1" === $"b.value._2", "inner")
     checkSameResult(Seq((Some((2, 3)), Some((1, 2, joined)
   }
+
+  test("call_function") {
+    val session: SparkSession = spark
+    import session.implicits._
+    val testData = spark.range(5).repartition(1)
+    try {
+      session.sql("CREATE FUNCTION custom_sum AS 'test.org.apache.spark.sql.MyDoubleSum'")

Review Comment: I have fixed the issue. Please wait for the CI.
[GitHub] [spark] HyukjinKwon opened a new pull request, #42072: [SPARK-44481][CONNECT][PYTHON] Make pyspark.sql.is_remote an API
HyukjinKwon opened a new pull request, #42072: URL: https://github.com/apache/spark/pull/42072 ### What changes were proposed in this pull request? ### Why are the changes needed? For the end users to be able to do if-else, e.g., for dispatching the code path to the legacy mode or connect mode. ### Does this PR introduce _any_ user-facing change? Yes, it exposes a method as an API. ### How was this patch tested? Manually built and checked the documentation.
[GitHub] [spark] yaooqinn closed pull request #42054: [SPARK-44470][BUILD] Setting version to 4.0.0-SNAPSHOT
yaooqinn closed pull request #42054: [SPARK-44470][BUILD] Setting version to 4.0.0-SNAPSHOT URL: https://github.com/apache/spark/pull/42054
[GitHub] [spark] yaooqinn commented on pull request #42054: [SPARK-44470][BUILD] Setting version to 4.0.0-SNAPSHOT
yaooqinn commented on PR #42054: URL: https://github.com/apache/spark/pull/42054#issuecomment-1641794593 Closed as a duplicate; fixed by SPARK-44467
[GitHub] [spark] Deependra-Patel opened a new pull request, #42071: [SPARK-44209] Expose amount of shuffle data available on the node
Deependra-Patel opened a new pull request, #42071: URL: https://github.com/apache/spark/pull/42071 This will be available as an external shuffle service metric. ### What changes were proposed in this pull request? Adds three more metrics to ShuffleMetrics (exposed by the External Shuffle Service): "totalShuffleDataBytes", "numAppsWithShuffleData" and "lastNodeShuffleMetricRefreshEpochMillis". We add a new daemon that scans the disk every 30s (configurable) and sums up the total size of the shuffle data. ### Why are the changes needed? Adding these metrics would help in: 1. Deciding if we can decommission the node when no shuffle data is present. 2. Better live monitoring of customers' workloads to see if there is skewed shuffle data on the node. ### Does this PR introduce _any_ user-facing change? This documentation will need to be updated in the next release: https://spark.apache.org/docs/latest/monitoring.html#component-instance--shuffleservice ### How was this patch tested? UTs are added. Also tested manually on a YARN cluster; metrics are getting published.
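The daemon described above can be sketched roughly as follows. This is an illustrative stand-in only, not the PR's actual implementation; the class and method names are invented, and the real code wires the total into the external shuffle service's `ShuffleMetrics`.

```java
import java.io.File;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.LongConsumer;

// Illustrative shuffle-size scanner: periodically walks a directory tree,
// sums file sizes, and reports the total to a metrics callback.
public class ShuffleSizeScanner {
    // Recursively sum the sizes of all regular files under f.
    public static long dirSize(File f) {
        if (f.isFile()) return f.length();
        File[] children = f.listFiles();
        long total = 0;
        if (children != null) {
            for (File c : children) total += dirSize(c);
        }
        return total;
    }

    // Schedule the scan at a fixed interval on a single daemon thread,
    // mirroring the PR's configurable 30s default.
    public static ScheduledExecutorService start(File root, long intervalSec, LongConsumer report) {
        ScheduledExecutorService exec = Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "shuffle-size-scanner");
            t.setDaemon(true); // daemon so it never blocks JVM shutdown
            return t;
        });
        exec.scheduleWithFixedDelay(() -> report.accept(dirSize(root)), 0, intervalSec, TimeUnit.SECONDS);
        return exec;
    }
}
```

Using `scheduleWithFixedDelay` (rather than a fixed rate) avoids overlapping scans when a walk over a large shuffle directory takes longer than the interval.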
[GitHub] [spark] HyukjinKwon closed pull request #42070: [MINOR][INFRA] Update the labeler for CORE and CONNECT
HyukjinKwon closed pull request #42070: [MINOR][INFRA] Update the labeler for CORE and CONNECT URL: https://github.com/apache/spark/pull/42070
[GitHub] [spark] HyukjinKwon commented on pull request #42070: [MINOR][INFRA] Update the labeler for CORE and CONNECT
HyukjinKwon commented on PR #42070: URL: https://github.com/apache/spark/pull/42070#issuecomment-1641707945 Merged to master.
[GitHub] [spark] HyukjinKwon commented on pull request #42068: [SPARK-44361][SQL][FOLLOW-UP] Remove unused variables and fix import statements
HyukjinKwon commented on PR #42068: URL: https://github.com/apache/spark/pull/42068#issuecomment-1641706328 Merged to master and branch-3.5.
[GitHub] [spark] HyukjinKwon closed pull request #42068: [SPARK-44361][SQL][FOLLOW-UP] Remove unused variables and fix import statements
HyukjinKwon closed pull request #42068: [SPARK-44361][SQL][FOLLOW-UP] Remove unused variables and fix import statements URL: https://github.com/apache/spark/pull/42068
[GitHub] [spark] Hisoka-X commented on pull request #41949: [SPARK-44375][SQL] Use PartitionEvaluator API in DebugExec
Hisoka-X commented on PR #41949: URL: https://github.com/apache/spark/pull/41949#issuecomment-1641693941 Thanks @cloud-fan @viirya @HyukjinKwon
[GitHub] [spark] yaooqinn commented on pull request #41951: [SPARK-44367][SQL][UI] Show error message on UI for each failed query
yaooqinn commented on PR #41951: URL: https://github.com/apache/spark/pull/41951#issuecomment-1641676820 How about this? ![image](https://github.com/apache/spark/assets/8326978/7a466124-f28a-4a84-944c-6fbeb577145e) ![image](https://github.com/apache/spark/assets/8326978/ca1fca1b-9733-4bcc-a708-6b229d3c7509)
[GitHub] [spark] HyukjinKwon closed pull request #42067: [SPARK-44264][ML][PYTHON] Support Distributed Training of Functions Using Deepspeed
HyukjinKwon closed pull request #42067: [SPARK-44264][ML][PYTHON] Support Distributed Training of Functions Using Deepspeed URL: https://github.com/apache/spark/pull/42067
[GitHub] [spark] HyukjinKwon commented on pull request #42067: [SPARK-44264][ML][PYTHON] Support Distributed Training of Functions Using Deepspeed
HyukjinKwon commented on PR #42067: URL: https://github.com/apache/spark/pull/42067#issuecomment-1641654294 Merged to master and branch-3.5.
[GitHub] [spark] vicennial commented on a diff in pull request #42069: [WIP] Fix class loading problem caused by stub user classes not found on the server classpath
vicennial commented on code in PR #42069: URL: https://github.com/apache/spark/pull/42069#discussion_r1267729075 ## core/src/main/scala/org/apache/spark/executor/Executor.scala: ## @@ -860,6 +860,14 @@ private[spark] class Executor( } } + private def getIsolatedSession( + taskDescription: TaskDescription) = { +taskDescription.artifacts.state match { + case Some(jobArtifactState) => +isolatedSessionCache.get(jobArtifactState.uuid, () => newSessionState(jobArtifactState)) + case _ => defaultSessionState +} + } Review Comment: Nit: This refactoring is unnecessary here. This also modifies [OSS-synced code](https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/Executor.scala#L516-L521) so we should leave it as it is
[GitHub] [spark] vicennial commented on a diff in pull request #42069: [WIP] Fix class loading problem caused by stub user classes not found on the server classpath
vicennial commented on code in PR #42069: URL: https://github.com/apache/spark/pull/42069#discussion_r1267726085 ## core/src/main/java/org/apache/spark/util/ChildFirstURLClassLoader.java: ## @@ -40,6 +40,15 @@ public ChildFirstURLClassLoader(URL[] urls, ClassLoader parent) { this.parent = new ParentClassLoader(parent); } + /** + * Specify the realParent if there is a need to load in the order of + * `realParent -> urls (child) -> parent`. Review Comment: What does `realParent` mean here? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
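For context on the ordering being asked about: a `realParent -> urls (child) -> parent` lookup can be sketched roughly as below. This is an illustrative sketch only, not Spark's actual `ChildFirstURLClassLoader` (for brevity it omits the `findLoadedClass` check a production loader would do first).

```java
import java.net.URL;
import java.net.URLClassLoader;

// Illustrative three-level loader: consult an optional "realParent" first,
// then this loader's own URLs (child-first), then the regular parent.
public class ThreeLevelClassLoader extends URLClassLoader {
    private final ClassLoader realParent;     // tried first, may be null
    private final ClassLoader fallbackParent; // tried last

    public ThreeLevelClassLoader(URL[] urls, ClassLoader parent, ClassLoader realParent) {
        super(urls, null); // null disables the built-in parent-first delegation
        this.realParent = realParent;
        this.fallbackParent = parent;
    }

    @Override
    protected Class<?> loadClass(String name, boolean resolve) throws ClassNotFoundException {
        Class<?> cls = null;
        if (realParent != null) {
            try { cls = realParent.loadClass(name); } catch (ClassNotFoundException ignored) { }
        }
        if (cls == null) {
            // findClass searches only this loader's URLs (the "child" level).
            try { cls = findClass(name); } catch (ClassNotFoundException ignored) { }
        }
        if (cls == null) {
            cls = fallbackParent.loadClass(name); // may throw ClassNotFoundException
        }
        if (resolve) resolveClass(cls);
        return cls;
    }
}
```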
[GitHub] [spark] vicennial commented on a diff in pull request #42069: [WIP] Fix class loading problem caused by stub user classes not found on the server classpath
vicennial commented on code in PR #42069: URL: https://github.com/apache/spark/pull/42069#discussion_r1267724082 ## core/src/main/scala/org/apache/spark/executor/Executor.scala: ## @@ -543,11 +538,16 @@ private[spark] class Executor( // requires access to properties contained within (e.g. for access control). Executor.taskDeserializationProps.set(taskDescription.properties) -updateDependencies( +val updated = updateDependencies( taskDescription.artifacts.files, taskDescription.artifacts.jars, taskDescription.artifacts.archives, isolatedSession) +if (updated) { + // reset the thread class loader + val newIsolatedSession = getIsolatedSession(taskDescription) + Thread.currentThread.setContextClassLoader(newIsolatedSession.replClassLoader) +} Review Comment: `updateDependencies` is called in several places, could we integrate this logic into the method itself to prevent cases where we end up using the outdated instance? Maybe we can leave the `Thread.currentThread.setContextClassLoader(newIsolatedSession.replClassLoader)` part here
[GitHub] [spark] vicennial commented on a diff in pull request #42069: [WIP] Fix class loading problem caused by stub user classes not found on the server classpath
vicennial commented on code in PR #42069: URL: https://github.com/apache/spark/pull/42069#discussion_r1267720733 ## core/src/main/scala/org/apache/spark/executor/Executor.scala: ## @@ -1102,12 +1117,19 @@ private[spark] class Executor( val url = new File(root, localName).toURI.toURL if (!state.urlClassLoader.getURLs().contains(url)) { logInfo(s"Adding $url to class loader") -state.urlClassLoader.addURL(url) +// TODO: make use of the session cache for the class loader. +// Currently we invalidate all when adding a new url to always clear the stubbed +// classes in the current repl class loader. +// This is not always needed if the newly added jar does not contain any stubbed +// classes. +isolatedSessionCache.invalidateAll() Review Comment: Shouldn't we only invalidate the `isolatedSession` that is being updated? Each session has its own individual `URLClassLoader`
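The distinction the reviewer is drawing, invalidating one session's entry versus `invalidateAll()`, can be sketched with a plain map-backed cache. This is illustrative only; the real code uses the executor's session cache keyed by the job artifact state UUID, each entry owning its own `URLClassLoader`.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Illustrative session cache: invalidating a single key rebuilds only that
// session's state on the next get(), while invalidateAll() forces every
// session's state (and hence its class loader) to be rebuilt.
public class SessionCache<K, V> {
    private final Map<K, V> cache = new ConcurrentHashMap<>();
    private final Function<K, V> loader;

    public SessionCache(Function<K, V> loader) { this.loader = loader; }

    public V get(K key) { return cache.computeIfAbsent(key, loader); }

    public void invalidate(K key) { cache.remove(key); }   // narrow: one session

    public void invalidateAll() { cache.clear(); }         // broad: all sessions
}
```

With per-key invalidation, sessions whose jars did not change keep their warmed-up class loaders, which is exactly the cost the `invalidateAll()` in the diff pays unnecessarily.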
[GitHub] [spark] hvanhovell commented on a diff in pull request #41928: [SPARK-44475][SQL][CONNECT] Relocate DataType and Parser to sql/api
hvanhovell commented on code in PR #41928: URL: https://github.com/apache/spark/pull/41928#discussion_r1267677534 ## sql/api/src/main/scala/org/apache/spark/sql/types/Decimal.scala: ## @@ -143,7 +142,7 @@ final class Decimal extends Ordered[Decimal] with Serializable { // the scale is 2. The expected precision should be 2. this._precision = decimal.scale this._scale = decimal.scale -} else if (decimal.scale < 0 && !SQLConf.get.allowNegativeScaleOfDecimalEnabled) { +} else if (decimal.scale < 0 && !SqlApiConf.get.allowNegativeScaleOfDecimalEnabled) { Review Comment: As soon as SQLConf is loaded we connect them: https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L185
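For reference on what the `decimal.scale < 0` branch in the diff is guarding against: these are plain `java.math.BigDecimal` semantics, not Spark's `Decimal` itself, and the class name below is invented for the example.

```java
import java.math.BigDecimal;
import java.math.BigInteger;

// A negative scale means the unscaled value is multiplied by 10^(-scale):
// unscaled value 1 with scale -3 represents the integer 1000, so the value
// needs zero digits after the decimal point but a precision check still
// has to account for the trailing zeros.
public class NegativeScale {
    public static BigDecimal oneTimesTenCubed() {
        return new BigDecimal(BigInteger.ONE, -3);
    }
}
```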
[GitHub] [spark] HeartSaVioR commented on pull request #42066: [SPARK-44480][SS] Use thread pool to perform maintenance activity for hdfs/rocksdb state store providers
HeartSaVioR commented on PR #42066: URL: https://github.com/apache/spark/pull/42066#issuecomment-1641586354 CI looks failing. Could you please look into it? https://github.com/ericm-db/spark/actions/runs/5595467996/jobs/10231376237
[GitHub] [spark] juliuszsompolski commented on a diff in pull request #41964: [SPARK-44394][CONNECT][WEBUI] Add a Spark UI page for Spark Connect
juliuszsompolski commented on code in PR #41964: URL: https://github.com/apache/spark/pull/41964#discussion_r1267671107 ## connector/connect/server/src/main/scala/org/apache/spark/sql/connect/service/SparkConnectServer.scala: ## @@ -49,6 +49,7 @@ object SparkConnectServer extends Logging { SparkConnectService.server.awaitTermination() } finally { session.stop() + SparkConnectService.uiTab.foreach(_.detach()) Review Comment: yeah... it seems that in this path the server is killed by a signal (via `sbin/stop-connect-server.sh`) and then `SparkConnectService.stop()` is never called. It would be nice to call `SparkConnectService.stop()` here, but then it would try to shut down the server again... In any case, it's unrelated to this PR, so let's leave it like it is now.
[GitHub] [spark] mridulm commented on pull request #41821: [SPARK-44272][YARN] Path Inconsistency when Operating statCache within Yarn Client
mridulm commented on PR #41821: URL: https://github.com/apache/spark/pull/41821#issuecomment-1641556251 Merged to master and branch-3.5. Thanks for working on this @shuwang21 ! FYI @xuanyuanking, who is RM for 3.5
[GitHub] [spark] mridulm closed pull request #41821: [SPARK-44272][YARN] Path Inconsistency when Operating statCache within Yarn Client
mridulm closed pull request #41821: [SPARK-44272][YARN] Path Inconsistency when Operating statCache within Yarn Client URL: https://github.com/apache/spark/pull/41821