[GitHub] [spark] gengliangwang commented on a diff in pull request #38567: [SPARK-41054][UI][CORE] Support RocksDB as KVStore in live UI
gengliangwang commented on code in PR #38567: URL: https://github.com/apache/spark/pull/38567#discussion_r1028977314

## core/src/main/scala/org/apache/spark/status/KVUtils.scala:

```
@@ -80,6 +89,44 @@ private[spark] object KVUtils extends Logging {
     db
   }

+  def createKVStore(
+      storePath: Option[File],
+      diskBackend: HybridStoreDiskBackend.Value,
+      conf: SparkConf): KVStore = {
+    storePath.map { path =>
+      val dir = diskBackend match {
+        case LEVELDB => "listing.ldb"
+        case ROCKSDB => "listing.rdb"
+      }
+
+      val dbPath = Files.createDirectories(new File(path, dir).toPath()).toFile()
+      Utils.chmod700(dbPath)
+
```

Review Comment:
   Thanks, removed
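For orientation, a hypothetical call site for the new helper; the path, `SparkConf`, and backend choice below are illustrative, not taken from the PR:

```scala
import java.io.File
import org.apache.spark.SparkConf

// Selects the on-disk directory name ("listing.ldb" vs. "listing.rdb")
// from the chosen backend, per the diff above.
val store = KVUtils.createKVStore(
  storePath = Some(new File("/tmp/spark-live-ui")),
  diskBackend = HybridStoreDiskBackend.ROCKSDB,
  conf = new SparkConf())
```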
[GitHub] [spark] cloud-fan commented on a diff in pull request #38495: [SPARK-35531][SQL] Update hive table stats without unnecessary convert
cloud-fan commented on code in PR #38495: URL: https://github.com/apache/spark/pull/38495#discussion_r1028966247

## sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala:

```
@@ -105,6 +106,15 @@ private[hive] class HiveClientImpl(
   private class RawHiveTableImpl(override val rawTable: HiveTable) extends RawHiveTable {
     override lazy val toCatalogTable = convertHiveTableToCatalogTable(rawTable)
+
+    override def hiveTableProps(containsStats: Boolean): Map[String, String] = {
```

Review Comment:
   why do we need this parameter?
[GitHub] [spark] EnricoMi commented on pull request #38676: [SPARK-41162][SQL] Do not push down anti-join predicates that become ambiguous
EnricoMi commented on PR #38676: URL: https://github.com/apache/spark/pull/38676#issuecomment-1323236232

@wangyum @cloud-fan I am not sure if this is the right approach to fix `DeduplicateRelations`. Please advise. The problem is that `DeduplicateRelations` only considers duplicates between the left plan's `output` and the right plan's `output`, but this situation is caused by duplicates between the left plan's `references` and the right plan's `output`.
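A minimal sketch of the plan shape described above (column names are illustrative, not the PR's actual test case): the left plan's `output` is only `x`, but its `references` still include `id`, which collides with the right plan's `output` and so is not renamed by `DeduplicateRelations`:

```scala
import org.apache.spark.sql.functions.col

val df = spark.range(3)
// Left output: [x]; left references: [id] (inside the alias projection).
val left = df.select(col("id").alias("x"))
// Right output: [id] -- it overlaps the left side's *references*, not its
// *output*, so the anti-join condition can become ambiguous once pushed down.
val joined = left.join(df, left("x") === df("id"), "left_anti")
```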
[GitHub] [spark] LuciferYang commented on pull request #38737: [SPARK-41174][CORE][SQL] Propagate an error class to users for invalid `format` of `to_binary()`
LuciferYang commented on PR #38737: URL: https://github.com/apache/spark/pull/38737#issuecomment-1323234015

@MaxGekk rebased
[GitHub] [spark] cloud-fan commented on a diff in pull request #38734: [SPARK-41212][CONNECT][PYTHON] Implement `DataFrame.isEmpty`
cloud-fan commented on code in PR #38734: URL: https://github.com/apache/spark/pull/38734#discussion_r1028961395

## python/pyspark/sql/connect/dataframe.py:

```
@@ -122,6 +122,18 @@ def withPlan(cls, plan: plan.LogicalPlan, session: "RemoteSparkSession") -> "Dat
         new_frame._plan = plan
         return new_frame

+    def isEmpty(self) -> bool:
+        """Returns ``True`` if this :class:`DataFrame` is empty.
+
+        .. versionadded:: 3.4.0
+
+        Returns
+        -------
+        bool
+            Whether it's empty DataFrame or not.
+        """
+        return len(self.take(1)) == 0
```

Review Comment:
   This makes me think that maybe we should have some shared code between pyspark and the python connect client, to share some API definitions (api doc, function signature) and functions with default implementations. cc @HyukjinKwon
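For comparison, the JVM `Dataset.isEmpty` that this mirrors has the same observable semantics, fetching at most one row (a usage sketch):

```scala
// Only one row is ever pulled to the driver in either case.
assert(spark.range(0).isEmpty)
assert(!spark.range(10).isEmpty)
```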
[GitHub] [spark] MaxGekk closed pull request #38685: [SPARK-41206][SQL] Rename the error class `_LEGACY_ERROR_TEMP_1233` to `COLUMN_ALREADY_EXISTS`
MaxGekk closed pull request #38685: [SPARK-41206][SQL] Rename the error class `_LEGACY_ERROR_TEMP_1233` to `COLUMN_ALREADY_EXISTS`
URL: https://github.com/apache/spark/pull/38685
[GitHub] [spark] MaxGekk commented on pull request #38685: [SPARK-41206][SQL] Rename the error class `_LEGACY_ERROR_TEMP_1233` to `COLUMN_ALREADY_EXISTS`
MaxGekk commented on PR #38685: URL: https://github.com/apache/spark/pull/38685#issuecomment-1323221728

Merging to master. Thank you, @srielau and @cloud-fan for review.
[GitHub] [spark] HyukjinKwon commented on a diff in pull request #38659: [SPARK-41114][CONNECT] Support local data for LocalRelation
HyukjinKwon commented on code in PR #38659: URL: https://github.com/apache/spark/pull/38659#discussion_r1028944132

## connector/connect/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala:

```
@@ -271,8 +273,12 @@ class SparkConnectPlanner(session: SparkSession) {
   }

   private def transformLocalRelation(rel: proto.LocalRelation): LogicalPlan = {
-    val attributes = rel.getAttributesList.asScala.map(transformAttribute(_)).toSeq
-    new org.apache.spark.sql.catalyst.plans.logical.LocalRelation(attributes)
+    val (rows, structType) = ArrowConverters.fromBatchWithSchemaIterator(
+      Seq(rel.getData.toByteArray).iterator,
```

Review Comment:
   Should we use the same protobuf message you added, @zhengruifeng ?
[GitHub] [spark] HyukjinKwon commented on a diff in pull request #38659: [SPARK-41114][CONNECT] Support local data for LocalRelation
HyukjinKwon commented on code in PR #38659: URL: https://github.com/apache/spark/pull/38659#discussion_r1028943446

## sql/core/src/main/scala/org/apache/spark/sql/execution/arrow/ArrowConverters.scala:

```
@@ -253,16 +253,94 @@ private[sql] object ArrowConverters extends Logging {
         val vectorLoader = new VectorLoader(root)
         vectorLoader.load(arrowRecordBatch)
         arrowRecordBatch.close()
+        vectorSchemaRootToIter(root)
+      }
+    }
+  }
+
+  /**
+   * Maps iterator from serialized ArrowRecordBatches to InternalRows. Different from
+   * [[fromBatchIterator]], each input arrow batch starts with the schema.
+   */
+  private[sql] def fromBatchWithSchemaIterator(
```

Review Comment:
   Sorry for the late review. Can we dedup the logic, as `ArrowBatchWithSchemaIterator` is doing?
[GitHub] [spark] HyukjinKwon commented on a diff in pull request #38659: [SPARK-41114][CONNECT] Support local data for LocalRelation
HyukjinKwon commented on code in PR #38659: URL: https://github.com/apache/spark/pull/38659#discussion_r1028942760

## core/src/main/scala/org/apache/spark/util/Utils.scala:

```
@@ -3257,6 +3257,14 @@ private[spark] object Utils extends Logging {
       case _ => math.max(sortedSize(len / 2), 1)
     }
   }
+
+  def closeAll(closeables: AutoCloseable*): Unit = {
```

Review Comment:
   I think this is too much to have as a common util in the core module. It's only used twice...
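The helper's body is elided in the diff; a typical shape for such a utility would be the following sketch, not necessarily the PR's actual implementation:

```scala
// Close every resource, skipping nulls; a production version might also
// catch and log per-resource exceptions so one failure doesn't skip the rest.
def closeAll(closeables: AutoCloseable*): Unit = {
  closeables.foreach { c =>
    if (c != null) c.close()
  }
}
```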
[GitHub] [spark] HyukjinKwon commented on a diff in pull request #38659: [SPARK-41114][CONNECT] Support local data for LocalRelation
HyukjinKwon commented on code in PR #38659: URL: https://github.com/apache/spark/pull/38659#discussion_r1028942267

## connector/connect/src/test/scala/org/apache/spark/sql/connect/planner/SparkConnectProtoSuite.scala:

```
@@ -44,14 +47,18 @@ class SparkConnectProtoSuite extends PlanTest with SparkConnectPlanTest {

   lazy val connectTestRelation =
     createLocalRelationProto(
-      Seq(AttributeReference("id", IntegerType)(), AttributeReference("name", StringType)()))
+      Seq(AttributeReference("id", IntegerType)(), AttributeReference("name", StringType)()),
+      Seq())

   lazy val connectTestRelation2 =
     createLocalRelationProto(
-      Seq(AttributeReference("id", IntegerType)(), AttributeReference("name", StringType)()))
+      Seq(AttributeReference("id", IntegerType)(), AttributeReference("name", StringType)()),
+      Seq())
```

Review Comment:
   ```suggestion
         Seq.empty)
   ```
[GitHub] [spark] grundprinzip commented on a diff in pull request #38723: [SPARK-41201][CONNECT][PYTHON] Implement `DataFrame.SelectExpr` in Python client
grundprinzip commented on code in PR #38723: URL: https://github.com/apache/spark/pull/38723#discussion_r1028942189

## python/pyspark/sql/tests/connect/test_connect_basic.py:

```
@@ -220,6 +220,29 @@ def test_create_global_temp_view(self):
         with self.assertRaises(_MultiThreadedRendezvous):
```

Review Comment:
   Didn't we fix this to `grpc.RpcError`?
[GitHub] [spark] MaxGekk closed pull request #38744: [SPARK-41217][SQL] Add the error class `FAILED_FUNCTION_CALL`
MaxGekk closed pull request #38744: [SPARK-41217][SQL] Add the error class `FAILED_FUNCTION_CALL`
URL: https://github.com/apache/spark/pull/38744
[GitHub] [spark] LuciferYang commented on pull request #38744: [SPARK-41217][SQL] Add the error class `FAILED_FUNCTION_CALL`
LuciferYang commented on PR #38744: URL: https://github.com/apache/spark/pull/38744#issuecomment-1323206350

OK, I will rebase my pr
[GitHub] [spark] MaxGekk commented on pull request #38744: [SPARK-41217][SQL] Add the error class `FAILED_FUNCTION_CALL`
MaxGekk commented on PR #38744: URL: https://github.com/apache/spark/pull/38744#issuecomment-1323205514

Merging to master. Thank you, @panbingkun @LuciferYang @cloud-fan for review.
[GitHub] [spark] HyukjinKwon commented on pull request #38715: [SPARK-41197] Upgrade Kafka version to 3.3 release
HyukjinKwon commented on PR #38715: URL: https://github.com/apache/spark/pull/38715#issuecomment-1323204958

cc @dongjoon-hyun and @HeartSaVioR FYI
[GitHub] [spark] zhengruifeng commented on a diff in pull request #38742: [SPARK-41216][CONNECT][PYTHON] Make AnalyzePlan support multiple analysis tasks And implement isLocal/isStreaming/printSchema/
zhengruifeng commented on code in PR #38742: URL: https://github.com/apache/spark/pull/38742#discussion_r1028935043

## python/pyspark/sql/connect/dataframe.py:

```
@@ -736,6 +736,19 @@ def toPandas(self) -> Optional["pandas.DataFrame"]:
         query = self._plan.to_proto(self._session)
         return self._session._to_pandas(query)

+    def _basic_analyze(self) -> None:
+        # update isLocal, isStreaming, explain_string, tree_string, semantic_hash
+        if self._schema is None:
+            if self._plan is not None:
+                query = self._plan.to_proto(self._session)
+                if self._session is None:
+                    raise Exception("Cannot analyze without RemoteSparkSession.")
+                results = self._session.basic_analyze(query)
+                for k, v in results.items():
+                    self._cache[k] = v
```

Review Comment:
   got it, will update
[GitHub] [spark] LuciferYang commented on pull request #38752: [SPARK-40809][CONNECT][FOLLOW-UP] Do not use Buffer to make Scala 2.13 test pass
LuciferYang commented on PR #38752: URL: https://github.com/apache/spark/pull/38752#issuecomment-1323197061

late LGTM
[GitHub] [spark] HyukjinKwon closed pull request #38697: [SPARK-41118][SQL][3.3] `to_number`/`try_to_number` should return `null` when format is `null`
HyukjinKwon closed pull request #38697: [SPARK-41118][SQL][3.3] `to_number`/`try_to_number` should return `null` when format is `null`
URL: https://github.com/apache/spark/pull/38697
[GitHub] [spark] LuciferYang closed pull request #38753: [SPARK-40809][CONNECT][PYTHON][TESTS] Fix pyspark-connect test failed with Scala 2.13
LuciferYang closed pull request #38753: [SPARK-40809][CONNECT][PYTHON][TESTS] Fix pyspark-connect test failed with Scala 2.13
URL: https://github.com/apache/spark/pull/38753
[GitHub] [spark] LuciferYang commented on pull request #38753: [SPARK-40809][CONNECT][PYTHON][TESTS] Fix pyspark-connect test failed with Scala 2.13
LuciferYang commented on PR #38753: URL: https://github.com/apache/spark/pull/38753#issuecomment-1323196610

dup with https://github.com/apache/spark/pull/38752, close this one
[GitHub] [spark] HyukjinKwon commented on pull request #38697: [SPARK-41118][SQL][3.3] `to_number`/`try_to_number` should return `null` when format is `null`
HyukjinKwon commented on PR #38697: URL: https://github.com/apache/spark/pull/38697#issuecomment-1323196618

Merged to branch-3.3.
[GitHub] [spark] HyukjinKwon closed pull request #38752: [SPARK-40809][CONNECT][FOLLOW-UP] Do not use Buffer to make Scala 2.13 test pass
HyukjinKwon closed pull request #38752: [SPARK-40809][CONNECT][FOLLOW-UP] Do not use Buffer to make Scala 2.13 test pass
URL: https://github.com/apache/spark/pull/38752
[GitHub] [spark] HyukjinKwon commented on pull request #38752: [SPARK-40809][CONNECT][FOLLOW-UP] Do not use Buffer to make Scala 2.13 test pass
HyukjinKwon commented on PR #38752: URL: https://github.com/apache/spark/pull/38752#issuecomment-1323196020

Merged to master.
[GitHub] [spark] HyukjinKwon commented on a diff in pull request #38631: [SPARK-40809] [CONNECT] [FOLLOW] Support `alias()` in Python client
HyukjinKwon commented on code in PR #38631: URL: https://github.com/apache/spark/pull/38631#discussion_r1028933832

## python/pyspark/sql/tests/connect/test_connect_basic.py:

```
@@ -301,6 +301,20 @@ def test_simple_datasource_read(self) -> None:
         actualResult = pandasResult.values.tolist()
         self.assertEqual(len(expectResult), len(actualResult))

+    def test_alias(self) -> None:
+        """Testing supported and unsupported alias"""
+        col0 = (
+            self.connect.range(1, 10)
+            .select(col("id").alias("name", metadata={"max": 99}))
+            .schema()
+            .names[0]
+        )
+        self.assertEqual("name", col0)
+
+        with self.assertRaises(grpc.RpcError) as exc:
+            self.connect.range(1, 10).select(col("id").alias("this", "is", "not")).collect()
+        self.assertIn("Buffer(this, is, not)", str(exc.exception))
```

Review Comment:
   I made a followup at https://github.com/apache/spark/pull/38752 :-)
[GitHub] [spark] LuciferYang commented on pull request #38704: [SPARK-41193][SQL][TESTS] Ignore `collect data with single partition larger than 2GB bytes array limit` in `DatasetLargeResultCollectingS
LuciferYang commented on PR #38704: URL: https://github.com/apache/spark/pull/38704#issuecomment-1323195610

Thanks @HyukjinKwon @mridulm @liuzqt
[GitHub] [spark] HyukjinKwon closed pull request #38686: [SPARK-41169][CONNECT][PYTHON] Implement `DataFrame.drop`
HyukjinKwon closed pull request #38686: [SPARK-41169][CONNECT][PYTHON] Implement `DataFrame.drop`
URL: https://github.com/apache/spark/pull/38686
[GitHub] [spark] zhengruifeng commented on pull request #38735: [SPARK-41213][CONNECT][PYTHON] Implement `DataFrame.__repr__` and `DataFrame.dtypes`
zhengruifeng commented on PR #38735: URL: https://github.com/apache/spark/pull/38735#issuecomment-1323195553

@HyukjinKwon thanks for the reviews
[GitHub] [spark] HyukjinKwon commented on pull request #38686: [SPARK-41169][CONNECT][PYTHON] Implement `DataFrame.drop`
HyukjinKwon commented on PR #38686: URL: https://github.com/apache/spark/pull/38686#issuecomment-1323195361

Merged to master.
[GitHub] [spark] HyukjinKwon closed pull request #38704: [SPARK-41193][SQL][TESTS] Ignore `collect data with single partition larger than 2GB bytes array limit` in `DatasetLargeResultCollectingSuite`
HyukjinKwon closed pull request #38704: [SPARK-41193][SQL][TESTS] Ignore `collect data with single partition larger than 2GB bytes array limit` in `DatasetLargeResultCollectingSuite`
URL: https://github.com/apache/spark/pull/38704
[GitHub] [spark] HyukjinKwon commented on pull request #38704: [SPARK-41193][SQL][TESTS] Ignore `collect data with single partition larger than 2GB bytes array limit` in `DatasetLargeResultCollectingS
HyukjinKwon commented on PR #38704: URL: https://github.com/apache/spark/pull/38704#issuecomment-1323194208

Merged to master.
[GitHub] [spark] HyukjinKwon commented on a diff in pull request #38723: [SPARK-41201][CONNECT][PYTHON] Implement `DataFrame.SelectExpr` in Python client
HyukjinKwon commented on code in PR #38723: URL: https://github.com/apache/spark/pull/38723#discussion_r1028931691

## python/pyspark/sql/tests/connect/test_connect_basic.py:

```
@@ -220,6 +220,29 @@ def test_create_global_temp_view(self):
         with self.assertRaises(_MultiThreadedRendezvous):
             self.connect.sql("SELECT 1 AS X LIMIT 0").createGlobalTempView("view_1")

+    def test_select_expr(self):
+        self.assert_eq(
```

Review Comment:
   comment with a JIRA ID?
[GitHub] [spark] HyukjinKwon commented on a diff in pull request #38723: [SPARK-41201][CONNECT][PYTHON] Implement `DataFrame.SelectExpr` in Python client
HyukjinKwon commented on code in PR #38723: URL: https://github.com/apache/spark/pull/38723#discussion_r1028931101

## python/pyspark/sql/connect/column.py:

```
@@ -263,6 +263,22 @@ def __str__(self) -> str:
         return f"Column({self._unparsed_identifier})"

+class SQLExpression(Expression):
```

Review Comment:
   I still don't like this kind of naming, but it is at least somewhat consistent with what we have in DSv2, so I am fine.
[GitHub] [spark] HyukjinKwon commented on a diff in pull request #38723: [SPARK-41201][CONNECT][PYTHON] Implement `DataFrame.SelectExpr` in Python client
HyukjinKwon commented on code in PR #38723: URL: https://github.com/apache/spark/pull/38723#discussion_r1028930295

## python/pyspark/sql/connect/column.py:

```
@@ -263,6 +263,22 @@ def __str__(self) -> str:
         return f"Column({self._unparsed_identifier})"

+class SQLExpression(Expression):
+    """Returns Expression which contains a string which is a SQL expression
+    and server side will parse it by Catalyst
+    """
+
+    def __init__(self, expr: str) -> None:
+        super().__init__()
```

Review Comment:
   I think that's fine. One point of having type hints is to avoid `assert`s on those types too.
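For context, a usage sketch of the feature under review, written against the classic JVM Dataset API that `DataFrame.selectExpr` in the Python client mirrors; the column names are illustrative:

```scala
// Each string is parsed by Catalyst into an expression, exactly as the
// SQLExpression docstring above describes.
val df = spark.range(3).selectExpr("id", "id + 1 AS next", "id % 2 = 0 AS even")
df.show()
```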
[GitHub] [spark] HyukjinKwon closed pull request #38731: [SPARK-41209][PYTHON] Improve PySpark type inference in _merge_type method
HyukjinKwon closed pull request #38731: [SPARK-41209][PYTHON] Improve PySpark type inference in _merge_type method
URL: https://github.com/apache/spark/pull/38731
[GitHub] [spark] HyukjinKwon commented on pull request #38731: [SPARK-41209][PYTHON] Improve PySpark type inference in _merge_type method
HyukjinKwon commented on PR #38731: URL: https://github.com/apache/spark/pull/38731#issuecomment-1323189415

Merged to master.
[GitHub] [spark] HyukjinKwon closed pull request #38735: [SPARK-41213][CONNECT][PYTHON] Implement `DataFrame.__repr__` and `DataFrame.dtypes`
HyukjinKwon closed pull request #38735: [SPARK-41213][CONNECT][PYTHON] Implement `DataFrame.__repr__` and `DataFrame.dtypes`
URL: https://github.com/apache/spark/pull/38735
[GitHub] [spark] HyukjinKwon commented on pull request #38735: [SPARK-41213][CONNECT][PYTHON] Implement `DataFrame.__repr__` and `DataFrame.dtypes`
HyukjinKwon commented on PR #38735: URL: https://github.com/apache/spark/pull/38735#issuecomment-1323188640

Merged to master.
[GitHub] [spark] cloud-fan commented on a diff in pull request #38302: [SPARK-40834][SQL] Use SparkListenerSQLExecutionEnd to track final SQL status in UI
cloud-fan commented on code in PR #38302: URL: https://github.com/apache/spark/pull/38302#discussion_r1028926326

## sql/core/src/main/scala/org/apache/spark/sql/execution/ui/SQLListener.scala:

```
@@ -56,7 +56,10 @@ case class SparkListenerSQLExecutionStart(
 }

 @DeveloperApi
-case class SparkListenerSQLExecutionEnd(executionId: Long, time: Long)
+case class SparkListenerSQLExecutionEnd(
+    executionId: Long,
+    time: Long,
+    errorMessage: Option[String] = None)
```

Review Comment:
   This is not a public class. It generates a JSON string where backward compatibility matters, and this PR added tests for it.
[GitHub] [spark] LuciferYang opened a new pull request, #38753: [SPARK-40809][CONNECT][TESTS] Fix pyspark-connect test failed with Scala 2.13
LuciferYang opened a new pull request, #38753: URL: https://github.com/apache/spark/pull/38753

### What changes were proposed in this pull request?
This PR simplifies an assertion to fix the `pyspark-connect` test failure with Scala 2.13.

### Why are the changes needed?
Fix the Scala 2.13 daily test:
- https://github.com/apache/spark/actions/runs/3517345801/jobs/5895043640

```
Traceback (most recent call last):
  File "/__w/spark/spark/python/pyspark/sql/tests/connect/test_connect_basic.py", line 337, in test_alias
    self.assertIn("Buffer(this, is, not)", str(exc.exception))
AssertionError: 'Buffer(this, is, not)' not found in '<_MultiThreadedRendezvous of RPC that terminated with:\n\tstatus = StatusCode.UNKNOWN\n\tdetails = "[INTERNAL_ERROR] Found the unresolved operator: \'Project [id#24L AS List(this, is, not)]"\n\tdebug_error_string = "UNKNOWN:Error received from peer ipv4:127.0.0.1:15002 {created_time:"2022-11-21T19:25:59.902777537+00:00", grpc_status:2, grpc_message:"[INTERNAL_ERROR] Found the unresolved operator: \\\'Project [id#24L AS List(this, is, not)]"}"\n>'
```

The error message with Scala 2.13 contains `List(this, is, not)` instead of `ArrayBuffer(this, is, not)`.

### Does this PR introduce _any_ user-facing change?
No, test-only.

### How was this patch tested?
Pass GitHub Actions
[GitHub] [spark] HeartSaVioR closed pull request #38748: [SPARK-41151][SQL][3.3] Keep built-in file `_metadata` column nullable value consistent
HeartSaVioR closed pull request #38748: [SPARK-41151][SQL][3.3] Keep built-in file `_metadata` column nullable value consistent
URL: https://github.com/apache/spark/pull/38748
[GitHub] [spark] HyukjinKwon commented on a diff in pull request #38742: [SPARK-41216][CONNECT][PYTHON] Make AnalyzePlan support multiple analysis tasks And implement isLocal/isStreaming/printSchema/s
HyukjinKwon commented on code in PR #38742: URL: https://github.com/apache/spark/pull/38742#discussion_r1028915567

## python/pyspark/sql/connect/dataframe.py:

```
@@ -736,6 +736,19 @@ def toPandas(self) -> Optional["pandas.DataFrame"]:
         query = self._plan.to_proto(self._session)
         return self._session._to_pandas(query)

+    def _basic_analyze(self) -> None:
+        # update isLocal, isStreaming, explain_string, tree_string, semantic_hash
+        if self._schema is None:
+            if self._plan is not None:
+                query = self._plan.to_proto(self._session)
+                if self._session is None:
+                    raise Exception("Cannot analyze without RemoteSparkSession.")
+                results = self._session.basic_analyze(query)
+                for k, v in results.items():
+                    self._cache[k] = v
```

Review Comment:
   Let's not do caching stuff for now
[GitHub] [spark] HeartSaVioR commented on pull request #38748: [SPARK-41151][SQL][3.3] Keep built-in file `_metadata` column nullable value consistent
HeartSaVioR commented on PR #38748: URL: https://github.com/apache/spark/pull/38748#issuecomment-1323163883

Thanks! Merging to 3.3.
[GitHub] [spark] HeartSaVioR commented on pull request #38748: [SPARK-41151][SQL][3.3] Keep built-in file `_metadata` column nullable value consistent
HeartSaVioR commented on PR #38748: URL: https://github.com/apache/spark/pull/38748#issuecomment-1323163589

Looks like the GA build failed to pick up the result of the forked GA build. Here is a successful run from the forked repository: https://github.com/Yaohua628/spark/runs/9630498867
[GitHub] [spark] LuciferYang commented on a diff in pull request #38631: [SPARK-40809] [CONNECT] [FOLLOW] Support `alias()` in Python client
LuciferYang commented on code in PR #38631: URL: https://github.com/apache/spark/pull/38631#discussion_r1028908773

## python/pyspark/sql/tests/connect/test_connect_basic.py:

```
@@ -301,6 +301,20 @@ def test_simple_datasource_read(self) -> None:
         actualResult = pandasResult.values.tolist()
         self.assertEqual(len(expectResult), len(actualResult))

+    def test_alias(self) -> None:
+        """Testing supported and unsupported alias"""
+        col0 = (
+            self.connect.range(1, 10)
+            .select(col("id").alias("name", metadata={"max": 99}))
+            .schema()
+            .names[0]
+        )
+        self.assertEqual("name", col0)
+
+        with self.assertRaises(grpc.RpcError) as exc:
+            self.connect.range(1, 10).select(col("id").alias("this", "is", "not")).collect()
+        self.assertIn("Buffer(this, is, not)", str(exc.exception))
```

Review Comment:
   @grundprinzip The Scala 2.13 daily test has been failing recently, and it seems to be related to this assertion:

```
== FAIL [0.742s]: test_alias (pyspark.sql.tests.connect.test_connect_basic.SparkConnectTests)
Testing supported and unsupported alias
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/__w/spark/spark/python/pyspark/sql/tests/connect/test_connect_basic.py", line 337, in test_alias
    self.assertIn("Buffer(this, is, not)", str(exc.exception))
AssertionError: 'Buffer(this, is, not)' not found in '<_MultiThreadedRendezvous of RPC that terminated with:\n\tstatus = StatusCode.UNKNOWN\n\tdetails = "[INTERNAL_ERROR] Found the unresolved operator: \'Project [id#24L AS List(this, is, not)]"\n\tdebug_error_string = "UNKNOWN:Error received from peer ipv4:127.0.0.1:15002 {created_time:"2022-11-21T19:25:59.902777537+00:00", grpc_status:2, grpc_message:"[INTERNAL_ERROR] Found the unresolved operator: \\\'Project [id#24L AS List(this, is, not)]"}"\n>'
```

   https://github.com/apache/spark/actions/runs/3517345801/jobs/5895043640

   This seems to be caused by a Scala version difference. Can we simplify the assertion, e.g. `self.assertIn("(this, is, not)", str(exc.exception))`?
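The version-specific behavior reproduces outside Spark; note the original `assertIn` passed on 2.12 only because "ArrayBuffer(...)" contains "Buffer(...)" as a substring (a sketch):

```scala
import scala.collection.mutable

val buf = mutable.Buffer("this", "is", "not")
// Scala 2.12: toSeq returns the buffer itself, printing "ArrayBuffer(this, is, not)".
// Scala 2.13: toSeq copies into an immutable Seq, printing "List(this, is, not)".
println(buf.toSeq)
```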
[GitHub] [spark] HyukjinKwon commented on a diff in pull request #38302: [SPARK-40834][SQL] Use SparkListenerSQLExecutionEnd to track final SQL status in UI
HyukjinKwon commented on code in PR #38302: URL: https://github.com/apache/spark/pull/38302#discussion_r1028906041

## sql/core/src/main/scala/org/apache/spark/sql/execution/ui/SQLListener.scala:

```
@@ -56,7 +56,10 @@ case class SparkListenerSQLExecutionStart(
 }

 @DeveloperApi
-case class SparkListenerSQLExecutionEnd(executionId: Long, time: Long)
+case class SparkListenerSQLExecutionEnd(
+    executionId: Long,
+    time: Long,
+    errorMessage: Option[String] = None)
```

Review Comment:
   Doesn't this break binary compatibility? I think it should have an overloaded constructor.
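A sketch of the overloaded-constructor pattern being suggested (illustrative only; not what the PR actually does, and the `extends` clause is omitted):

```scala
case class SparkListenerSQLExecutionEnd(
    executionId: Long,
    time: Long,
    errorMessage: Option[String] = None) {
  // Keeps the old (Long, Long) constructor in the bytecode so callers
  // compiled against the previous signature still link.
  def this(executionId: Long, time: Long) = this(executionId, time, None)
}
```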
[GitHub] [spark] HyukjinKwon opened a new pull request, #38752: [SPARK-40809][CONNECT][FOLLOW-UP] Do not use Buffer to make Scala 2.13 test pass
HyukjinKwon opened a new pull request, #38752: URL: https://github.com/apache/spark/pull/38752

### What changes were proposed in this pull request?
This PR is a followup of https://github.com/apache/spark/pull/38631 that fixes the test to pass in Scala 2.13 by avoiding `Buffer`, which becomes `List` in Scala 2.13.

### Why are the changes needed?
To fix up the Scala 2.13 build, see https://github.com/apache/spark/actions/runs/3517345801/jobs/5895043640

### Does this PR introduce _any_ user-facing change?
No, test-only.

### How was this patch tested?
Manually tested.
[GitHub] [spark] panbingkun commented on pull request #38744: [SPARK-41217][SQL] Add the error class `FAILED_FUNCTION_CALL`
panbingkun commented on PR #38744: URL: https://github.com/apache/spark/pull/38744#issuecomment-1323142938

+1, LGTM
[GitHub] [spark] gaoyajun02 commented on pull request #38333: [SPARK-40872] Fallback to original shuffle block when a push-merged shuffle chunk is zero-size
gaoyajun02 commented on PR #38333: URL: https://github.com/apache/spark/pull/38333#issuecomment-1323136879

> Thank you, @gaoyajun02 , @mridulm , @otterc .
>
> * Do we need to backport this to branch-3.3?
> * According to the previous failure description, what happens in branch-3.3 in case of failure?

Since branch-3.3 does not contain the PR for SPARK-38987, if the merged chunk is zero-size, `throwFetchFailedException` actually throws a `SparkException`, which eventually causes the app to fail after the task fails 4 times. @dongjoon-hyun
[GitHub] [spark] gaoyajun02 commented on pull request #38333: [SPARK-40872] Fallback to original shuffle block when a push-merged shuffle chunk is zero-size
gaoyajun02 commented on PR #38333: URL: https://github.com/apache/spark/pull/38333#issuecomment-1323119482

> I was in two minds whether to fix this in 3.3 as well ... Yes, 3.3 is affected by it.
>
> But agree, a backport to branch-3.3 would be helpful. Can you give it a shot @gaoyajun02 ? Might need to fix some minor nits to get a patch

OK, can you take a look @mridulm? cc @otterc @dongjoon-hyun
[GitHub] [spark] gaoyajun02 opened a new pull request, #38751: [SPARK-40872][3.3] Fallback to original shuffle block when a push-merged shuffle chunk is zero-size
gaoyajun02 opened a new pull request, #38751: URL: https://github.com/apache/spark/pull/38751

### What changes were proposed in this pull request?
This is a backport PR of #38333.

When push-based shuffle is enabled, a zero-size buf error may occur when fetching shuffle chunks from bad nodes, especially when memory is full. In this case, we can fall back to the original shuffle blocks.

### Why are the changes needed?
When the reduce task obtains a shuffle chunk with a zero-size buf, we let it fall back to the original shuffle block. After verification, these blocks can be read successfully without a shuffle retry.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
UT
[GitHub] [spark] toujours33 commented on pull request #38711: [SPARK-41192][Core] Remove unscheduled speculative tasks when task finished to obtain better dynamic
toujours33 commented on PR #38711: URL: https://github.com/apache/spark/pull/38711#issuecomment-1323074280

> How far back should this backport?

I hope it can go back to 3.3+ (3.3 included), since version 3.3 is the one mainly used in our production environment~
[GitHub] [spark] toujours33 commented on a diff in pull request #38711: [SPARK-41192][Core] Remove unscheduled speculative tasks when task finished to obtain better dynamic
toujours33 commented on code in PR #38711: URL: https://github.com/apache/spark/pull/38711#discussion_r1028830007

## core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala:

```
@@ -749,8 +749,10 @@ private[spark] class ExecutorAllocationManager(
       stageAttemptToNumRunningTask.getOrElse(stageAttempt, 0) + 1
     // If this is the last pending task, mark the scheduler queue as empty
     if (taskStart.taskInfo.speculative) {
-      stageAttemptToSpeculativeTaskIndices.getOrElseUpdate(stageAttempt,
-        new mutable.HashSet[Int]) += taskIndex
+      stageAttemptToSpeculativeTaskIndices
+        .getOrElseUpdate(stageAttempt, new mutable.HashSet[Int]).add(taskIndex)
+      stageAttemptToUnsubmittedSpeculativeTasks
```

Review Comment:
   I think it's ok here. As the PR describes, `stageAttemptToUnsubmittedSpeculativeTasks.remove(taskIndex)` should be called only when a speculative task starts or a task finishes (whether it is speculative or not). Line 754 will do nothing if this speculative task has already been removed on task finish, which is expected and is OK.
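A standalone sketch of the bookkeeping described in the reply; names mirror the diff, but this is a simplification, not `ExecutorAllocationManager` itself:

```scala
import scala.collection.mutable

case class StageAttempt(stageId: Int, attemptId: Int)

val unsubmittedSpeculativeTasks =
  mutable.HashMap.empty[StageAttempt, mutable.HashSet[Int]]

// On SparkListenerSpeculativeTaskSubmitted: the speculative copy is pending.
def onSpeculativeTaskSubmitted(sa: StageAttempt, taskIndex: Int): Unit =
  unsubmittedSpeculativeTasks
    .getOrElseUpdate(sa, mutable.HashSet.empty).add(taskIndex)

// On task start (the copy got scheduled) *or* task end (any copy finished):
// the pending copy no longer needs an executor requested for it.
def onScheduledOrFinished(sa: StageAttempt, taskIndex: Int): Unit =
  unsubmittedSpeculativeTasks.get(sa).foreach(_.remove(taskIndex))
```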
[GitHub] [spark] LuciferYang commented on pull request #38711: [SPARK-41192][Core] Remove unscheduled speculative tasks when task finished to obtain better dynamic
LuciferYang commented on PR #38711: URL: https://github.com/apache/spark/pull/38711#issuecomment-1323065288

How far back should this backport?
[GitHub] [spark] LuciferYang commented on a diff in pull request #38711: [SPARK-41192][Core] Remove unscheduled speculative tasks when task finished to obtain better dynamic
LuciferYang commented on code in PR #38711: URL: https://github.com/apache/spark/pull/38711#discussion_r1028823929

## core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala:

```
@@ -749,8 +749,10 @@ private[spark] class ExecutorAllocationManager(
       stageAttemptToNumRunningTask.getOrElse(stageAttempt, 0) + 1
     // If this is the last pending task, mark the scheduler queue as empty
     if (taskStart.taskInfo.speculative) {
-      stageAttemptToSpeculativeTaskIndices.getOrElseUpdate(stageAttempt,
-        new mutable.HashSet[Int]) += taskIndex
+      stageAttemptToSpeculativeTaskIndices
+        .getOrElseUpdate(stageAttempt, new mutable.HashSet[Int]).add(taskIndex)
+      stageAttemptToUnsubmittedSpeculativeTasks
```

Review Comment:
   My concern is: if line 754 does nothing, should this be regarded as a wrong state? Does that wrong state cause any other issue?
[GitHub] [spark] LuciferYang commented on a diff in pull request #38711: [SPARK-41192][Core] Remove unscheduled speculative tasks when task finished to obtain better dynamic
LuciferYang commented on code in PR #38711: URL: https://github.com/apache/spark/pull/38711#discussion_r1028821337

## core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala:

```
@@ -749,8 +749,10 @@ private[spark] class ExecutorAllocationManager(
       stageAttemptToNumRunningTask.getOrElse(stageAttempt, 0) + 1
     // If this is the last pending task, mark the scheduler queue as empty
     if (taskStart.taskInfo.speculative) {
-      stageAttemptToSpeculativeTaskIndices.getOrElseUpdate(stageAttempt,
-        new mutable.HashSet[Int]) += taskIndex
+      stageAttemptToSpeculativeTaskIndices
```

Review Comment:
   Got
[GitHub] [spark] itholic commented on a diff in pull request #38650: [SPARK-41135][SQL] Rename `UNSUPPORTED_EMPTY_LOCATION` to `INVALID_EMPTY_LOCATION`
itholic commented on code in PR #38650: URL: https://github.com/apache/spark/pull/38650#discussion_r1028818844

## core/src/main/resources/error/error-classes.json:

```
@@ -656,6 +656,11 @@
     ],
     "sqlState" : "42000"
   },
+  "INVALID_EMPTY_LOCATION" : {
+    "message" : [
+      "The location name cannot be empty string or null, but `<location>` was given."
```

Review Comment:
   Good catch. I tried to add a test with NULL, but it complains [PARSE_SYNTAX_ERROR] rather than [INVALID_EMPTY_LOCATION]. Let me exclude "null" from the error message.
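For illustration, how the two outcomes in the comment surface (a sketch; the statements assume a command that validates its `LOCATION` clause):

```scala
// Empty string reaches the analyzer and hits the new error class.
spark.sql("CREATE DATABASE mydb LOCATION ''")   // expect INVALID_EMPTY_LOCATION
// NULL never gets that far: it fails in the parser, as the comment notes.
spark.sql("CREATE DATABASE mydb LOCATION NULL") // expect PARSE_SYNTAX_ERROR
```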
[GitHub] [spark] dengziming commented on pull request #38715: [SPARK-41197] Upgrade Kafka version to 3.3 release
dengziming commented on PR #38715: URL: https://github.com/apache/spark/pull/38715#issuecomment-1323037097 These failures come from [apache/kafka#12049](https://github.com/apache/kafka/pull/12049) and are described here: https://kafka.apache.org/documentation/#upgrade_33_notable The new default partitioner keeps track of how many bytes are produced per-partition and, once the amount exceeds batch.size, switches to the next partition. In Spark's Kafka tests, this results in all records being sent to a single partition in some tests. The simplest solution is to add `props.put("partitioner.class", classOf[org.apache.kafka.clients.producer.internals.DefaultPartitioner].getName)` in `KafkaTestUtils.producerConfiguration`; alternatively, we can implement our own partitioner or set a small `batch.size` config. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
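A minimal sketch of the first workaround, assuming the rough shape of `KafkaTestUtils.producerConfiguration` in Spark's Kafka connector tests (the broker address and serializers here are illustrative placeholders, not the actual method contents):

```scala
import java.util.Properties

// Hedged sketch: pin the legacy partitioner in the test producer config so that
// records spread across partitions the same way they did before Kafka 3.3.
def producerConfiguration(brokerAddress: String): Properties = {
  val props = new Properties()
  props.put("bootstrap.servers", brokerAddress)
  props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
  props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
  // Restore the pre-3.3 default partitioner (deprecated in 3.3 but still available).
  props.put(
    "partitioner.class",
    classOf[org.apache.kafka.clients.producer.internals.DefaultPartitioner].getName)
  // Alternative from the comment above: shrink batch.size to force partition switching.
  // props.put("batch.size", "16")
  props
}
```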
[GitHub] [spark] itholic commented on a diff in pull request #38664: [SPARK-41147][SQL] Assign a name to the legacy error class `_LEGACY_ERROR_TEMP_1042`
itholic commented on code in PR #38664: URL: https://github.com/apache/spark/pull/38664#discussion_r1028813303 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala: ## @@ -146,7 +147,10 @@ object FunctionRegistryBase { .filter(_.getParameterTypes.forall(_ == classOf[Expression])) .map(_.getParameterCount).distinct.sorted throw QueryCompilationErrors.invalidFunctionArgumentNumberError( -validParametersCount, name, params.length) +expressions.map(toPrettySQL(_)).mkString(","), Review Comment: Sounds good! Just addressed it. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] zhengruifeng commented on a diff in pull request #38734: [SPARK-41212][CONNECT][PYTHON] Implement `DataFrame.isEmpty`
zhengruifeng commented on code in PR #38734: URL: https://github.com/apache/spark/pull/38734#discussion_r1028812420 ## python/pyspark/sql/connect/dataframe.py: ## @@ -122,6 +122,20 @@ def withPlan(cls, plan: plan.LogicalPlan, session: "RemoteSparkSession") -> "Dat new_frame._plan = plan return new_frame +def isEmpty(self) -> bool: +"""Returns ``True`` if this :class:`DataFrame` is empty. Review Comment: This doc was copied from the PySpark and Scala APIs. I think the DataFrame itself is expected to be non-None; otherwise, this method cannot be called. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] zhengruifeng commented on a diff in pull request #38686: [SPARK-41169][CONNECT][PYTHON] Implement `DataFrame.drop`
zhengruifeng commented on code in PR #38686: URL: https://github.com/apache/spark/pull/38686#discussion_r1028811655 ## connector/connect/src/test/scala/org/apache/spark/sql/connect/planner/SparkConnectProtoSuite.scala: ## @@ -148,6 +148,23 @@ class SparkConnectProtoSuite extends PlanTest with SparkConnectPlanTest { comparePlans(connectPlan2, sparkPlan2) } + test("SPARK-41169: Test drop") { +// single column +val connectPlan = connectTestRelation.drop("id") +val sparkPlan = sparkTestRelation.drop("id") +comparePlans(connectPlan, sparkPlan) + +// all columns +val connectPlan2 = connectTestRelation.drop("id", "name") +val sparkPlan2 = sparkTestRelation.drop("id", "name") +comparePlans(connectPlan2, sparkPlan2) + +// non-existing column Review Comment: I just checked the implementation of `Dataset.drop`: it supports all kinds of expressions, and an expression is simply ignored if it doesn't `semanticEquals` any column in the current DataFrame. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
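For context, a small illustration of the `Dataset.drop` behavior described above (the column names are hypothetical):

```scala
import org.apache.spark.sql.functions.col

val df = spark.range(3).toDF("id")

df.drop("id").columns          // Array(): the matching column is removed
df.drop("nonexistent").columns // Array(id): unknown names are silently ignored
df.drop(col("id") + 1).columns // Array(id): an expression matching no column is a no-op
```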
[GitHub] [spark] toujours33 commented on a diff in pull request #38711: [SPARK-41192][Core] Remove unscheduled speculative tasks when task finished to obtain better dynamic
toujours33 commented on code in PR #38711: URL: https://github.com/apache/spark/pull/38711#discussion_r1028811606 ## core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala: ## @@ -749,8 +749,10 @@ private[spark] class ExecutorAllocationManager( stageAttemptToNumRunningTask.getOrElse(stageAttempt, 0) + 1 // If this is the last pending task, mark the scheduler queue as empty if (taskStart.taskInfo.speculative) { - stageAttemptToSpeculativeTaskIndices.getOrElseUpdate(stageAttempt, -new mutable.HashSet[Int]) += taskIndex + stageAttemptToSpeculativeTaskIndices +.getOrElseUpdate(stageAttempt, new mutable.HashSet[Int]).add(taskIndex) + stageAttemptToUnsubmittedSpeculativeTasks Review Comment: 1. How about `stageAttemptToUnsubmittedSpeculativeTasks.get(stageAttempt).foreach(_.remove(taskIndex))`? In this way, nothing will change even if `stageAttemptToUnsubmittedSpeculativeTasks` is not initialized. 2. Since `stageAttemptToUnsubmittedSpeculativeTasks` is thread-safe, `stageAttemptToUnsubmittedSpeculativeTasks.contains(stageAttempt)` would be correct anyway. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
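A hedged sketch of the difference under discussion, using a plain `mutable.HashMap` as a stand-in for the tracking structure:

```scala
import scala.collection.mutable

val tracking = mutable.HashMap.empty[Int, mutable.HashSet[Int]]
val (stageAttempt, taskIndex) = (0, 42)

// getOrElseUpdate inserts an empty set as a side effect when the key is absent...
tracking.getOrElseUpdate(stageAttempt, new mutable.HashSet[Int]).remove(taskIndex)
tracking.contains(stageAttempt) // true: an entry was created just to remove from it

tracking.clear()

// ...whereas get(...).foreach(...) leaves the map untouched for absent keys.
tracking.get(stageAttempt).foreach(_.remove(taskIndex))
tracking.contains(stageAttempt) // false: no entry was materialized
```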
[GitHub] [spark] itholic commented on a diff in pull request #38576: [SPARK-41062][SQL] Rename `UNSUPPORTED_CORRELATED_REFERENCE` to `CORRELATED_REFERENCE`
itholic commented on code in PR #38576: URL: https://github.com/apache/spark/pull/38576#discussion_r1028810170 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala: ## @@ -1059,10 +1060,16 @@ trait CheckAnalysis extends PredicateHelper with LookupCatalog with QueryErrorsB def failOnInvalidOuterReference(p: LogicalPlan): Unit = { p.expressions.foreach(checkMixedReferencesInsideAggregateExpr) if (!canHostOuter(p) && p.expressions.exists(containsOuter)) { +val exprs = new ListBuffer[String]() +for (expr <- p.expressions) { Review Comment: Thanks! Just addressed the comments -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] toujours33 commented on a diff in pull request #38711: [SPARK-41192][Core] Remove unscheduled speculative tasks when task finished to obtain better dynamic
toujours33 commented on code in PR #38711: URL: https://github.com/apache/spark/pull/38711#discussion_r1028807362 ## core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala: ## @@ -749,8 +749,10 @@ private[spark] class ExecutorAllocationManager( stageAttemptToNumRunningTask.getOrElse(stageAttempt, 0) + 1 // If this is the last pending task, mark the scheduler queue as empty if (taskStart.taskInfo.speculative) { - stageAttemptToSpeculativeTaskIndices.getOrElseUpdate(stageAttempt, -new mutable.HashSet[Int]) += taskIndex + stageAttemptToSpeculativeTaskIndices Review Comment: `allocationManager.synchronized` will be called to protect this -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
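For readers skimming the thread, a minimal sketch of the guard pattern being referenced; the class and method names are illustrative, not the actual `ExecutorAllocationManager` code:

```scala
import scala.collection.mutable

class SpeculativeTaskTracker {
  // Not thread-safe on its own; every access below goes through the same monitor.
  private val stageAttemptToSpeculativeTaskIndices =
    mutable.HashMap.empty[Int, mutable.HashSet[Int]]

  def onSpeculativeTaskStart(stageAttempt: Int, taskIndex: Int): Unit = synchronized {
    stageAttemptToSpeculativeTaskIndices
      .getOrElseUpdate(stageAttempt, new mutable.HashSet[Int])
      .add(taskIndex)
  }

  def runningSpeculativeTasks(stageAttempt: Int): Int = synchronized {
    stageAttemptToSpeculativeTaskIndices.get(stageAttempt).map(_.size).getOrElse(0)
  }
}
```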
[GitHub] [spark] LuciferYang commented on a diff in pull request #38711: [SPARK-41192][Core] Remove unscheduled speculative tasks when task finished to obtain better dynamic
LuciferYang commented on code in PR #38711: URL: https://github.com/apache/spark/pull/38711#discussion_r1028804426 ## core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala: ## @@ -749,8 +749,10 @@ private[spark] class ExecutorAllocationManager( stageAttemptToNumRunningTask.getOrElse(stageAttempt, 0) + 1 // If this is the last pending task, mark the scheduler queue as empty if (taskStart.taskInfo.speculative) { - stageAttemptToSpeculativeTaskIndices.getOrElseUpdate(stageAttempt, -new mutable.HashSet[Int]) += taskIndex + stageAttemptToSpeculativeTaskIndices +.getOrElseUpdate(stageAttempt, new mutable.HashSet[Int]).add(taskIndex) + stageAttemptToUnsubmittedSpeculativeTasks Review Comment: Or whether `stageAttemptToUnsubmittedSpeculativeTasks.contains(stageAttempt)` should always be true at this time -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] LuciferYang commented on a diff in pull request #38711: [SPARK-41192][Core] Remove unscheduled speculative tasks when task finished to obtain better dynamic
LuciferYang commented on code in PR #38711: URL: https://github.com/apache/spark/pull/38711#discussion_r1028803675 ## core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala: ## @@ -749,8 +749,10 @@ private[spark] class ExecutorAllocationManager( stageAttemptToNumRunningTask.getOrElse(stageAttempt, 0) + 1 // If this is the last pending task, mark the scheduler queue as empty if (taskStart.taskInfo.speculative) { - stageAttemptToSpeculativeTaskIndices.getOrElseUpdate(stageAttempt, -new mutable.HashSet[Int]) += taskIndex + stageAttemptToSpeculativeTaskIndices +.getOrElseUpdate(stageAttempt, new mutable.HashSet[Int]).add(taskIndex) + stageAttemptToUnsubmittedSpeculativeTasks Review Comment: If we use `new mutable.HashSet[Int]`, is it necessary to call remove? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] LuciferYang commented on a diff in pull request #38711: [SPARK-41192][Core] Remove unscheduled speculative tasks when task finished to obtain better dynamic
LuciferYang commented on code in PR #38711: URL: https://github.com/apache/spark/pull/38711#discussion_r1028802840 ## core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala: ## @@ -749,8 +749,10 @@ private[spark] class ExecutorAllocationManager( stageAttemptToNumRunningTask.getOrElse(stageAttempt, 0) + 1 // If this is the last pending task, mark the scheduler queue as empty if (taskStart.taskInfo.speculative) { - stageAttemptToSpeculativeTaskIndices.getOrElseUpdate(stageAttempt, -new mutable.HashSet[Int]) += taskIndex + stageAttemptToSpeculativeTaskIndices Review Comment: Is there a concurrent access issue? `mutable.HashMap` is not thread-safe. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on a diff in pull request #38703: [SPARK-41191] [SQL] Cache Table is not working while nested caches exist
cloud-fan commented on code in PR #38703: URL: https://github.com/apache/spark/pull/38703#discussion_r1028795081 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/subquery.scala: ## @@ -355,7 +355,7 @@ case class ListQuery( plan.canonicalized, outerAttrs.map(_.canonicalized), ExprId(0), - childOutputs.map(_.canonicalized.asInstanceOf[Attribute]), + plan.canonicalized.output, Review Comment: @allisonwang-db do you know why we have `childOutputs`? It seems to be `child.output` most of the time -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on a diff in pull request #38739: [SPARK-41207][SQL] Fix BinaryArithmetic with negative scale
cloud-fan commented on code in PR #38739: URL: https://github.com/apache/spark/pull/38739#discussion_r1028793298 ## sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala: ## @@ -3532,6 +3532,49 @@ class DataFrameSuite extends QueryTest }.isEmpty) } } + + test("SPARK-41207: Fix BinaryArithmetic with negative scale") { +withSQLConf(SQLConf.LEGACY_ALLOW_NEGATIVE_SCALE_OF_DECIMAL_ENABLED.key -> "true") { Review Comment: I'm not sure this fixes all issues. Officially supporting negative scale is really a hard problem. Do you really turn on this config in production? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
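For readers unfamiliar with the flag, a hedged illustration of what a negative-scale decimal means; the config key is `spark.sql.legacy.allowNegativeScaleOfDecimal`, while the exact behavior of downstream arithmetic is version-dependent, which is what this PR is about:

```scala
import org.apache.spark.sql.types.DecimalType

// Without the legacy flag, constructing a negative-scale decimal type is rejected.
spark.conf.set("spark.sql.legacy.allowNegativeScaleOfDecimal", "true")

// decimal(2, -9): two significant digits shifted nine places left,
// so a value like 12e9 can be stored as precision 2, scale -9.
val dt = DecimalType(2, -9)
```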
[GitHub] [spark] zhengruifeng commented on a diff in pull request #38742: [SPARK-41216][CONNECT][PYTHON] Make AnalyzePlan support multiple analysis tasks And implement isLocal/isStreaming/printSchema/
zhengruifeng commented on code in PR #38742: URL: https://github.com/apache/spark/pull/38742#discussion_r1028639472 ## connector/connect/src/main/protobuf/spark/connect/base.proto: ## @@ -100,18 +70,135 @@ message AnalyzePlanRequest { // logging purposes and will not be interpreted by the server. optional string client_type = 4; - // (Optional) Get the explain string of the plan. - Explain explain = 5; + repeated AnalysisTask tasks = 5; + + message AnalysisTask { +oneof task { Review Comment: ~~I think we can also put catalog methods like `listTables`/`getTable` in `AnalysisTask`~~ Catalog APIs don't require a `plan`; it may be better to have a separate RPC. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] amaliujia commented on a diff in pull request #38734: [SPARK-41212][CONNECT][PYTHON] Implement `DataFrame.isEmpty`
amaliujia commented on code in PR #38734: URL: https://github.com/apache/spark/pull/38734#discussion_r1028724092 ## python/pyspark/sql/connect/dataframe.py: ## @@ -122,6 +122,20 @@ def withPlan(cls, plan: plan.LogicalPlan, session: "RemoteSparkSession") -> "Dat new_frame._plan = plan return new_frame +def isEmpty(self) -> bool: +"""Returns ``True`` if this :class:`DataFrame` is empty. Review Comment: Nit: When I read this, I'm actually not sure whether it means the DataFrame has zero rows or that the DataFrame itself is None. I guess it is the size. Not sure if there is a better way to clarify this. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
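For comparison, the Scala-side `Dataset.isEmpty` is defined by row count, not by the reference being null, which is the reading the doc intends:

```scala
// isEmpty asks whether the DataFrame contains zero rows.
spark.range(0).isEmpty // true: a perfectly valid DataFrame with no rows
spark.range(1).isEmpty // false
```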
[GitHub] [spark] 19855134604 commented on a diff in pull request #38743: [SPARK-41215][BUILD][PROTOBUF] Support user configurable protoc executables when building Spark Protobuf.
19855134604 commented on code in PR #38743: URL: https://github.com/apache/spark/pull/38743#discussion_r1028790815 ## connector/protobuf/README.md: ## @@ -0,0 +1,37 @@ +# Spark Protobuf - Developer Documentation + +## Getting Started + +### Build + +```bash +./build/mvn -Phive clean package Review Comment: It doesn't seem to be necessary. Let me adjust it. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan closed pull request #38738: WIP
cloud-fan closed pull request #38738: WIP URL: https://github.com/apache/spark/pull/38738 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] zhengruifeng commented on a diff in pull request #38734: [SPARK-41212][CONNECT][PYTHON] Implement `DataFrame.isEmpty`
zhengruifeng commented on code in PR #38734: URL: https://github.com/apache/spark/pull/38734#discussion_r1028789393 ## python/pyspark/sql/connect/dataframe.py: ## @@ -122,6 +122,20 @@ def withPlan(cls, plan: plan.LogicalPlan, session: "RemoteSparkSession") -> "Dat new_frame._plan = plan return new_frame +def isEmpty(self) -> bool: +"""Returns ``True`` if this :class:`DataFrame` is empty. + +.. versionadded:: 3.4.0 + +Returns +--- +bool +Whether it's empty DataFrame or not. +""" +if "is_empty" not in self._cache: +self._cache["is_empty"] = len(self.take(1)) == 0 +return bool(self._cache["is_empty"]) Review Comment: Oh, let me update it. Maybe it is time to design how to do the caching; I will work on it. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] toujours33 commented on a diff in pull request #38711: [SPARK-41192][Core] Remove unscheduled speculative tasks when task finished to obtain better dynamic
toujours33 commented on code in PR #38711: URL: https://github.com/apache/spark/pull/38711#discussion_r1028788976 ## core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala: ## @@ -774,17 +776,16 @@ private[spark] class ExecutorAllocationManager( removeStageFromResourceProfileIfUnused(stageAttempt) } } + Review Comment: Done~ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] LuciferYang commented on a diff in pull request #38711: [SPARK-41192][Core] Remove unscheduled speculative tasks when task finished to obtain better dynamic
LuciferYang commented on code in PR #38711: URL: https://github.com/apache/spark/pull/38711#discussion_r1028783234 ## core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala: ## @@ -774,17 +776,16 @@ private[spark] class ExecutorAllocationManager( removeStageFromResourceProfileIfUnused(stageAttempt) } } + Review Comment: Please remove this empty line -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] sadikovi commented on pull request #38731: [SPARK-41209][PYTHON] Improve PySpark type inference in _merge_type method
sadikovi commented on PR #38731: URL: https://github.com/apache/spark/pull/38731#issuecomment-1322965568 @xinrong-meng I have updated the PR description to clarify the user-facing change. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on a diff in pull request #38495: [SPARK-35531][SQL] Update hive table stats without unnecessary convert
cloud-fan commented on code in PR #38495: URL: https://github.com/apache/spark/pull/38495#discussion_r1028755248 ## sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClient.scala: ## @@ -113,6 +113,9 @@ private[hive] trait HiveClient { /** Creates a table with the given metadata. */ def createTable(table: CatalogTable, ignoreIfExists: Boolean): Unit + /** Get hive table properties. */ + def hiveTableProps(rawHiveTable: RawHiveTable, containsStats: Boolean): Map[String, String] Review Comment: shall we add a method in `RawHiveTable` to do it? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
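A hedged sketch of what that suggestion could look like, with the trait shape assumed from the diff excerpts in this thread (not the actual Spark source):

```scala
import org.apache.spark.sql.catalyst.catalog.CatalogTable

// Hypothetical: move the properties accessor onto the table handle itself,
// so callers don't need to go back through HiveClient.
private[hive] trait RawHiveTable {
  def rawTable: Object
  def toCatalogTable: CatalogTable
  // Proposed addition: expose the raw Hive table properties directly.
  def hiveTableProps(): Map[String, String]
}
```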
[GitHub] [spark] cloud-fan closed pull request #38746: [SPARK-41017][SQL][FOLLOWUP] Push Filter with both deterministic and nondeterministic predicates
cloud-fan closed pull request #38746: [SPARK-41017][SQL][FOLLOWUP] Push Filter with both deterministic and nondeterministic predicates URL: https://github.com/apache/spark/pull/38746 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on pull request #38746: [SPARK-41017][SQL][FOLLOWUP] Push Filter with both deterministic and nondeterministic predicates
cloud-fan commented on PR #38746: URL: https://github.com/apache/spark/pull/38746#issuecomment-1322944376 thanks for review, merging to master! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] wankunde commented on a diff in pull request #38495: [SPARK-35531][SQL] Update hive table stats without unnecessary convert
wankunde commented on code in PR #38495: URL: https://github.com/apache/spark/pull/38495#discussion_r1028741655 ## sql/hive/src/test/scala/org/apache/spark/sql/hive/InsertSuite.scala: ## @@ -894,12 +895,14 @@ class InsertSuite extends QueryTest with TestHiveSingleton with BeforeAndAfter sql(insertString.toLowerCase(Locale.ROOT)) sql(insertString.toUpperCase(Locale.ROOT)) + spark.sessionState.catalog.alterTableStats(TableIdentifier("test1"), None) Review Comment: Remove this change -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] wankunde commented on a diff in pull request #38495: [SPARK-35531][SQL] Update hive table stats without unnecessary convert
wankunde commented on code in PR #38495: URL: https://github.com/apache/spark/pull/38495#discussion_r1028741472 ## sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala: ## @@ -721,19 +721,18 @@ private[spark] class HiveExternalCatalog(conf: SparkConf, hadoopConf: Configurat table: String, stats: Option[CatalogStatistics]): Unit = withClient { requireTableExists(db, table) -val rawTable = getRawTable(db, table) - -// convert table statistics to properties so that we can persist them through hive client -val statsProperties = +val rawHiveTable = client.getRawHiveTable(db, table) +val oldProps = + client.hiveTableProps(rawHiveTable, containsStats = false) +.filterKeys(!_.startsWith(STATISTICS_PREFIX)) +val newProps = if (stats.isDefined) { -statsToProperties(stats.get) +oldProps ++ statsToProperties(stats.get) } else { -new mutable.HashMap[String, String]() +oldProps } Review Comment: @cloud-fan * Get all old properties from the Hive table without table stats. (HiveStatisticsProperties is defined in HiveClientImpl, so do this in HiveClientImpl.) * Filter out the Spark SQL table and column statistics properties. (STATISTICS_PREFIX is defined in HiveExternalCatalog, so do this in HiveExternalCatalog.) * Add the new table stats and then save the table. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on a diff in pull request #38747: [SPARK-40834][SQL][FOLLOWUP] Take care of legacy query end events
cloud-fan commented on code in PR #38747: URL: https://github.com/apache/spark/pull/38747#discussion_r1028740897 ## sql/core/src/main/scala/org/apache/spark/sql/execution/SQLExecution.scala: ## @@ -121,7 +121,11 @@ object SQLExecution { SparkThrowableHelper.getMessage(e) } val event = SparkListenerSQLExecutionEnd( -executionId, System.currentTimeMillis(), errorMessage) +executionId, +System.currentTimeMillis(), +// Use empty string to indicate no error, as None may mean events generated by old +// versions of Spark. +errorMessage.orElse(Some(""))) Review Comment: good point -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AngersZhuuuu commented on a diff in pull request #38622: [SPARK-39601][YARN] AllocationFailure should not be treated as exitCausedByApp when driver is shutting down
AngersZhuuuu commented on code in PR #38622: URL: https://github.com/apache/spark/pull/38622#discussion_r1028739055 ## resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala: ## @@ -815,6 +815,7 @@ private[spark] class ApplicationMaster( case Shutdown(code) => exitCode = code shutdown = true +allocator.setShutdown(true) Review Comment: It's OK to send `Shutdown` in cluster mode too, not only in client mode. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] LuciferYang commented on pull request #38743: [SPARK-41215][BUILD][PROTOBUF] Support user configurable protoc executables when building Spark Protobuf.
LuciferYang commented on PR #38743: URL: https://github.com/apache/spark/pull/38743#issuecomment-1322936602 cc @HyukjinKwon FYI, a fix similar to SPARK-40593. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] ulysses-you commented on a diff in pull request #38747: [SPARK-40834][SQL][FOLLOWUP] Take care of legacy query end events
ulysses-you commented on code in PR #38747: URL: https://github.com/apache/spark/pull/38747#discussion_r1028738643 ## sql/core/src/main/scala/org/apache/spark/sql/execution/SQLExecution.scala: ## @@ -121,7 +121,11 @@ object SQLExecution { SparkThrowableHelper.getMessage(e) } val event = SparkListenerSQLExecutionEnd( -executionId, System.currentTimeMillis(), errorMessage) +executionId, +System.currentTimeMillis(), +// Use empty string to indicate no error, as None may mean events generated by old +// versions of Spark. +errorMessage.orElse(Some(""))) Review Comment: Better to add docs at `SparkListenerSQLExecutionEnd` noting that `Some("")` is not an error; otherwise developers may be confused. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
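Under the convention discussed above, a consumer of the event would read the field roughly like this (a sketch, not the actual listener code):

```scala
// errorMessage: Option[String] on SparkListenerSQLExecutionEnd
def describeOutcome(errorMessage: Option[String]): String = errorMessage match {
  case None      => "legacy event from an older Spark version: outcome unknown"
  case Some("")  => "query succeeded (empty string marks no error)"
  case Some(msg) => s"query failed: $msg"
}
```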
[GitHub] [spark] LuciferYang commented on a diff in pull request #38743: [SPARK-41215][BUILD][PROTOBUF] Support user configurable protoc executables when building Spark Protobuf.
LuciferYang commented on code in PR #38743: URL: https://github.com/apache/spark/pull/38743#discussion_r1028737655 ## connector/protobuf/README.md: ## @@ -0,0 +1,37 @@ +# Spark Protobuf - Developer Documentation + +## Getting Started + +### Build + +```bash +./build/mvn -Phive clean package Review Comment: For the `protobuf` module, do we have to compile with `-Phive`? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] amaliujia commented on pull request #38734: [SPARK-41212][CONNECT][PYTHON] Implement `DataFrame.isEmpty`
amaliujia commented on PR #38734: URL: https://github.com/apache/spark/pull/38734#issuecomment-1322927685 LGTM -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HyukjinKwon commented on a diff in pull request #38734: [SPARK-41212][CONNECT][PYTHON] Implement `DataFrame.isEmpty`
HyukjinKwon commented on code in PR #38734: URL: https://github.com/apache/spark/pull/38734#discussion_r1028721327 ## python/pyspark/sql/connect/dataframe.py: ## @@ -122,6 +122,20 @@ def withPlan(cls, plan: plan.LogicalPlan, session: "RemoteSparkSession") -> "Dat new_frame._plan = plan return new_frame +def isEmpty(self) -> bool: +"""Returns ``True`` if this :class:`DataFrame` is empty. + +.. versionadded:: 3.4.0 + +Returns +--- +bool +Whether it's empty DataFrame or not. +""" +if "is_empty" not in self._cache: +self._cache["is_empty"] = len(self.take(1)) == 0 +return bool(self._cache["is_empty"]) Review Comment: I am fine with that but we can even do the caching in the DataFrame instead of Spark Connect I guess (?) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] amaliujia commented on a diff in pull request #38735: [SPARK-41213][CONNECT][PYTHON] Implement `DataFrame.__repr__` and `DataFrame.dtypes`
amaliujia commented on code in PR #38735: URL: https://github.com/apache/spark/pull/38735#discussion_r1028720171 ## python/pyspark/sql/connect/dataframe.py: ## @@ -115,6 +115,9 @@ def __init__( self._cache: Dict[str, Any] = {} self._session: "RemoteSparkSession" = session +def __repr__(self) -> str: +return "DataFrame[%s]" % (", ".join("%s: %s" % c for c in self.dtypes)) Review Comment: thanks for the example! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] zhengruifeng commented on a diff in pull request #38735: [SPARK-41213][CONNECT][PYTHON] Implement `DataFrame.__repr__` and `DataFrame.dtypes`
zhengruifeng commented on code in PR #38735: URL: https://github.com/apache/spark/pull/38735#discussion_r1028719570 ## python/pyspark/sql/connect/dataframe.py: ## @@ -115,6 +115,9 @@ def __init__( self._cache: Dict[str, Any] = {} self._session: "RemoteSparkSession" = session +def __repr__(self) -> str: +return "DataFrame[%s]" % (", ".join("%s: %s" % c for c in self.dtypes)) Review Comment: it will be invoked here: ``` In [1]: df = spark.createDataFrame([(10, 80, "Alice"), (5, None, "Bob"), (None, 10, "Tom"), (None, None, None)], schema=["age", "height", "name"]) In [2]: df Out[2]: DataFrame[age: bigint, height: bigint, name: string] In [3]: df.__repr__() Out[3]: 'DataFrame[age: bigint, height: bigint, name: string]' ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan closed pull request #38741: [SPARK-41154][SQL][3.3] Incorrect relation caching for queries with time travel spec
cloud-fan closed pull request #38741: [SPARK-41154][SQL][3.3] Incorrect relation caching for queries with time travel spec URL: https://github.com/apache/spark/pull/38741 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] ulysses-you commented on a diff in pull request #38739: [SPARK-41207][SQL] Fix BinaryArithmetic with negative scale
ulysses-you commented on code in PR #38739: URL: https://github.com/apache/spark/pull/38739#discussion_r1028718134 ## sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala: ## @@ -3532,6 +3532,49 @@ class DataFrameSuite extends QueryTest }.isEmpty) } } + + test("SPARK-41207: Fix BinaryArithmetic with negative scale") { +withSQLConf(SQLConf.LEGACY_ALLOW_NEGATIVE_SCALE_OF_DECIMAL_ENABLED.key -> "true") { Review Comment: @cloud-fan, so far it seems the only regression is `IntegralDivide`, which is affected by https://github.com/apache/spark/pull/36698. The other issues have existed for a long time. I think the root cause of these issues is the same, and it is fixed by this PR. I'm not sure it is a new feature; it is more of a correction for legacy issues. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on pull request #38741: [SPARK-41154][SQL][3.3] Incorrect relation caching for queries with time travel spec
cloud-fan commented on PR #38741: URL: https://github.com/apache/spark/pull/38741#issuecomment-1322916287 tests all passed: https://github.com/ulysses-you/spark/runs/9613393804 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on pull request #38741: [SPARK-41154][SQL][3.3] Incorrect relation caching for queries with time travel spec
cloud-fan commented on PR #38741: URL: https://github.com/apache/spark/pull/38741#issuecomment-1322916371 thanks, merging to 3.3! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on pull request #38706: [SPARK-41005][COLLECT][FOLLOWUP] Remove JSON code path and use `RDD.collect` in Arrow code path
cloud-fan commented on PR #38706: URL: https://github.com/apache/spark/pull/38706#issuecomment-1322913672 late LGTM -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] xinrong-meng commented on pull request #38731: [SPARK-41209][PYSPARK] Improve PySpark type inference in _merge_type method
xinrong-meng commented on PR #38731: URL: https://github.com/apache/spark/pull/38731#issuecomment-1322912963 Shall we add an example to elaborate on `Does this PR introduce any user-facing change?`? The change might be included in the 3.4 release notes. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] jerrypeng commented on pull request #38517: [SPARK-39591][SS] Async Progress Tracking
jerrypeng commented on PR #38517: URL: https://github.com/apache/spark/pull/38517#issuecomment-1322908492 @HeartSaVioR Please review. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org