[GitHub] [spark] gengliangwang commented on a diff in pull request #38567: [SPARK-41054][UI][CORE] Support RocksDB as KVStore in live UI

2022-11-21 Thread GitBox


gengliangwang commented on code in PR #38567:
URL: https://github.com/apache/spark/pull/38567#discussion_r1028977314


##
core/src/main/scala/org/apache/spark/status/KVUtils.scala:
##
@@ -80,6 +89,44 @@ private[spark] object KVUtils extends Logging {
 db
   }
 
+  def createKVStore(
+  storePath: Option[File],
+  diskBackend: HybridStoreDiskBackend.Value,
+  conf: SparkConf): KVStore = {
+storePath.map { path =>
+  val dir = diskBackend match {
+case LEVELDB => "listing.ldb"
+case ROCKSDB => "listing.rdb"
+  }
+
+  val dbPath = Files.createDirectories(new File(path, 
dir).toPath()).toFile()
+  Utils.chmod700(dbPath)
+
+

Review Comment:
   Thanks, removed



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cloud-fan commented on a diff in pull request #38495: [SPARK-35531][SQL] Update hive table stats without unnecessary convert

2022-11-21 Thread GitBox


cloud-fan commented on code in PR #38495:
URL: https://github.com/apache/spark/pull/38495#discussion_r1028966247


##
sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala:
##
@@ -105,6 +106,15 @@ private[hive] class HiveClientImpl(
 
   private class RawHiveTableImpl(override val rawTable: HiveTable) extends 
RawHiveTable {
 override lazy val toCatalogTable = convertHiveTableToCatalogTable(rawTable)
+
+override def hiveTableProps(containsStats: Boolean): Map[String, String] = 
{

Review Comment:
   why do we need this parameter?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] EnricoMi commented on pull request #38676: [SPARK-41162][SQL] Do not push down anti-join predicates that become ambiguous

2022-11-21 Thread GitBox


EnricoMi commented on PR #38676:
URL: https://github.com/apache/spark/pull/38676#issuecomment-1323236232

   @wangyum @cloud-fan I am not sure if this is the right approach to fix 
`DeduplicateRelations`. Please advise.
   
   Problem is that `DeduplicateRelations` is only considering duplicates 
between left `output` and right `output`, but this situation is caused by 
duplicates between left `references` and right `output`.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] LuciferYang commented on pull request #38737: [SPARK-41174][CORE][SQL] Propagate an error class to users for invalid `format` of `to_binary()`

2022-11-21 Thread GitBox


LuciferYang commented on PR #38737:
URL: https://github.com/apache/spark/pull/38737#issuecomment-1323234015

   @MaxGekk rebased


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cloud-fan commented on a diff in pull request #38734: [SPARK-41212][CONNECT][PYTHON] Implement `DataFrame.isEmpty`

2022-11-21 Thread GitBox


cloud-fan commented on code in PR #38734:
URL: https://github.com/apache/spark/pull/38734#discussion_r1028961395


##
python/pyspark/sql/connect/dataframe.py:
##
@@ -122,6 +122,18 @@ def withPlan(cls, plan: plan.LogicalPlan, session: 
"RemoteSparkSession") -> "Dat
 new_frame._plan = plan
 return new_frame
 
+def isEmpty(self) -> bool:
+"""Returns ``True`` if this :class:`DataFrame` is empty.
+
+.. versionadded:: 3.4.0
+
+Returns
+---
+bool
+Whether it's empty DataFrame or not.
+"""
+return len(self.take(1)) == 0

Review Comment:
   This makes me think that maybe we should have some shared code between 
pyspark and python connect client, to share some API definitions (api doc, 
function signature) and functions with default implementations. cc @HyukjinKwon 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] MaxGekk closed pull request #38685: [SPARK-41206][SQL] Rename the error class `_LEGACY_ERROR_TEMP_1233` to `COLUMN_ALREADY_EXISTS`

2022-11-21 Thread GitBox


MaxGekk closed pull request #38685: [SPARK-41206][SQL] Rename the error class 
`_LEGACY_ERROR_TEMP_1233` to `COLUMN_ALREADY_EXISTS`
URL: https://github.com/apache/spark/pull/38685


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] MaxGekk commented on pull request #38685: [SPARK-41206][SQL] Rename the error class `_LEGACY_ERROR_TEMP_1233` to `COLUMN_ALREADY_EXISTS`

2022-11-21 Thread GitBox


MaxGekk commented on PR #38685:
URL: https://github.com/apache/spark/pull/38685#issuecomment-1323221728

   Merging to master. Thank you, @srielau and @cloud-fan for review.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HyukjinKwon commented on a diff in pull request #38659: [SPARK-41114][CONNECT] Support local data for LocalRelation

2022-11-21 Thread GitBox


HyukjinKwon commented on code in PR #38659:
URL: https://github.com/apache/spark/pull/38659#discussion_r1028944132


##
connector/connect/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala:
##
@@ -271,8 +273,12 @@ class SparkConnectPlanner(session: SparkSession) {
   }
 
   private def transformLocalRelation(rel: proto.LocalRelation): LogicalPlan = {
-val attributes = 
rel.getAttributesList.asScala.map(transformAttribute(_)).toSeq
-new org.apache.spark.sql.catalyst.plans.logical.LocalRelation(attributes)
+val (rows, structType) = ArrowConverters.fromBatchWithSchemaIterator(
+  Seq(rel.getData.toByteArray).iterator,

Review Comment:
   Should we use the same protobuf message you added, @zhengruifeng ?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HyukjinKwon commented on a diff in pull request #38659: [SPARK-41114][CONNECT] Support local data for LocalRelation

2022-11-21 Thread GitBox


HyukjinKwon commented on code in PR #38659:
URL: https://github.com/apache/spark/pull/38659#discussion_r1028943446


##
sql/core/src/main/scala/org/apache/spark/sql/execution/arrow/ArrowConverters.scala:
##
@@ -253,16 +253,94 @@ private[sql] object ArrowConverters extends Logging {
 val vectorLoader = new VectorLoader(root)
 vectorLoader.load(arrowRecordBatch)
 arrowRecordBatch.close()
+vectorSchemaRootToIter(root)
+  }
+}
+  }
+
+  /**
+   * Maps iterator from serialized ArrowRecordBatches to InternalRows. 
Different from
+   * [[fromBatchIterator]], each input arrow batch starts with the schema.
+   */
+  private[sql] def fromBatchWithSchemaIterator(

Review Comment:
   Sorry for late reviews. Can we dedup the logic like 
`ArrowBatchWithSchemaIterator` is doing?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HyukjinKwon commented on a diff in pull request #38659: [SPARK-41114][CONNECT] Support local data for LocalRelation

2022-11-21 Thread GitBox


HyukjinKwon commented on code in PR #38659:
URL: https://github.com/apache/spark/pull/38659#discussion_r1028942760


##
core/src/main/scala/org/apache/spark/util/Utils.scala:
##
@@ -3257,6 +3257,14 @@ private[spark] object Utils extends Logging {
   case _ => math.max(sortedSize(len / 2), 1)
 }
   }
+
+  def closeAll(closeables: AutoCloseable*): Unit = {

Review Comment:
   I think this is too much to have it as a common util at the core module. 
It's only used twice ..



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HyukjinKwon commented on a diff in pull request #38659: [SPARK-41114][CONNECT] Support local data for LocalRelation

2022-11-21 Thread GitBox


HyukjinKwon commented on code in PR #38659:
URL: https://github.com/apache/spark/pull/38659#discussion_r1028942267


##
connector/connect/src/test/scala/org/apache/spark/sql/connect/planner/SparkConnectProtoSuite.scala:
##
@@ -44,14 +47,18 @@ class SparkConnectProtoSuite extends PlanTest with 
SparkConnectPlanTest {
 
   lazy val connectTestRelation =
 createLocalRelationProto(
-  Seq(AttributeReference("id", IntegerType)(), AttributeReference("name", 
StringType)()))
+  Seq(AttributeReference("id", IntegerType)(), AttributeReference("name", 
StringType)()),
+  Seq())
 
   lazy val connectTestRelation2 =
 createLocalRelationProto(
-  Seq(AttributeReference("id", IntegerType)(), AttributeReference("name", 
StringType)()))
+  Seq(AttributeReference("id", IntegerType)(), AttributeReference("name", 
StringType)()),
+  Seq())

Review Comment:
   ```suggestion
 Seq.empty)
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] grundprinzip commented on a diff in pull request #38723: [SPARK-41201][CONNECT][PYTHON] Implement `DataFrame.SelectExpr` in Python client

2022-11-21 Thread GitBox


grundprinzip commented on code in PR #38723:
URL: https://github.com/apache/spark/pull/38723#discussion_r1028942189


##
python/pyspark/sql/tests/connect/test_connect_basic.py:
##
@@ -220,6 +220,29 @@ def test_create_global_temp_view(self):
 with self.assertRaises(_MultiThreadedRendezvous):

Review Comment:
   Didn't we fix this to grpc.RPCError?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] MaxGekk closed pull request #38744: [SPARK-41217][SQL] Add the error class `FAILED_FUNCTION_CALL`

2022-11-21 Thread GitBox


MaxGekk closed pull request #38744: [SPARK-41217][SQL] Add the error class 
`FAILED_FUNCTION_CALL`
URL: https://github.com/apache/spark/pull/38744


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] LuciferYang commented on pull request #38744: [SPARK-41217][SQL] Add the error class `FAILED_FUNCTION_CALL`

2022-11-21 Thread GitBox


LuciferYang commented on PR #38744:
URL: https://github.com/apache/spark/pull/38744#issuecomment-1323206350

   OK, I will rebase my pr


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] MaxGekk commented on pull request #38744: [SPARK-41217][SQL] Add the error class `FAILED_FUNCTION_CALL`

2022-11-21 Thread GitBox


MaxGekk commented on PR #38744:
URL: https://github.com/apache/spark/pull/38744#issuecomment-1323205514

   Merging to master. Thank you, @panbingkun @LuciferYang @cloud-fan for review.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HyukjinKwon commented on pull request #38715: [SPARK-41197] Upgrade Kafka version to 3.3 release

2022-11-21 Thread GitBox


HyukjinKwon commented on PR #38715:
URL: https://github.com/apache/spark/pull/38715#issuecomment-1323204958

   cc @dongjoon-hyun and @HeartSaVioR FYI


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] zhengruifeng commented on a diff in pull request #38742: [SPARK-41216][CONNECT][PYTHON] Make AnalyzePlan support multiple analysis tasks And implement isLocal/isStreaming/printSchema/

2022-11-21 Thread GitBox


zhengruifeng commented on code in PR #38742:
URL: https://github.com/apache/spark/pull/38742#discussion_r1028935043


##
python/pyspark/sql/connect/dataframe.py:
##
@@ -736,6 +736,19 @@ def toPandas(self) -> Optional["pandas.DataFrame"]:
 query = self._plan.to_proto(self._session)
 return self._session._to_pandas(query)
 
+def _basic_analyze(self) -> None:
+# update isLocal, isStreaming, explain_string, tree_string, 
semantic_hash
+if self._schema is None:
+if self._plan is not None:
+query = self._plan.to_proto(self._session)
+if self._session is None:
+raise Exception("Cannot analyze without 
RemoteSparkSession.")
+results = self._session.basic_analyze(query)
+for k, v in results.items():
+self._cache[k] = v

Review Comment:
   got it, will update 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] LuciferYang commented on pull request #38752: [SPARK-40809][CONNECT][FOLLOW-UP] Do not use Buffer to make Scala 2.13 test pass

2022-11-21 Thread GitBox


LuciferYang commented on PR #38752:
URL: https://github.com/apache/spark/pull/38752#issuecomment-1323197061

   late LGTM 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HyukjinKwon closed pull request #38697: [SPARK-41118][SQL][3.3] `to_number`/`try_to_number` should return `null` when format is `null`

2022-11-21 Thread GitBox


HyukjinKwon closed pull request #38697: [SPARK-41118][SQL][3.3] 
`to_number`/`try_to_number` should return `null` when format is `null`
URL: https://github.com/apache/spark/pull/38697


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] LuciferYang closed pull request #38753: [SPARK-40809][CONNECT][PYTHON][TESTS] Fix pyspark-connect test failed with Scala 2.13

2022-11-21 Thread GitBox


LuciferYang closed pull request #38753: [SPARK-40809][CONNECT][PYTHON][TESTS] 
Fix pyspark-connect test failed with Scala 2.13
URL: https://github.com/apache/spark/pull/38753


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] LuciferYang commented on pull request #38753: [SPARK-40809][CONNECT][PYTHON][TESTS] Fix pyspark-connect test failed with Scala 2.13

2022-11-21 Thread GitBox


LuciferYang commented on PR #38753:
URL: https://github.com/apache/spark/pull/38753#issuecomment-1323196610

   dup with https://github.com/apache/spark/pull/38752, close this one


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HyukjinKwon commented on pull request #38697: [SPARK-41118][SQL][3.3] `to_number`/`try_to_number` should return `null` when format is `null`

2022-11-21 Thread GitBox


HyukjinKwon commented on PR #38697:
URL: https://github.com/apache/spark/pull/38697#issuecomment-1323196618

   Merged to branch-3.3.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HyukjinKwon closed pull request #38752: [SPARK-40809][CONNECT][FOLLOW-UP] Do not use Buffer to make Scala 2.13 test pass

2022-11-21 Thread GitBox


HyukjinKwon closed pull request #38752: [SPARK-40809][CONNECT][FOLLOW-UP] Do 
not use Buffer to make Scala 2.13 test pass
URL: https://github.com/apache/spark/pull/38752


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HyukjinKwon commented on pull request #38752: [SPARK-40809][CONNECT][FOLLOW-UP] Do not use Buffer to make Scala 2.13 test pass

2022-11-21 Thread GitBox


HyukjinKwon commented on PR #38752:
URL: https://github.com/apache/spark/pull/38752#issuecomment-1323196020

   Merged to master.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HyukjinKwon commented on a diff in pull request #38631: [SPARK-40809] [CONNECT] [FOLLOW] Support `alias()` in Python client

2022-11-21 Thread GitBox


HyukjinKwon commented on code in PR #38631:
URL: https://github.com/apache/spark/pull/38631#discussion_r1028933832


##
python/pyspark/sql/tests/connect/test_connect_basic.py:
##
@@ -301,6 +301,20 @@ def test_simple_datasource_read(self) -> None:
 actualResult = pandasResult.values.tolist()
 self.assertEqual(len(expectResult), len(actualResult))
 
+def test_alias(self) -> None:
+"""Testing supported and unsupported alias"""
+col0 = (
+self.connect.range(1, 10)
+.select(col("id").alias("name", metadata={"max": 99}))
+.schema()
+.names[0]
+)
+self.assertEqual("name", col0)
+
+with self.assertRaises(grpc.RpcError) as exc:
+self.connect.range(1, 10).select(col("id").alias("this", "is", 
"not")).collect()
+self.assertIn("Buffer(this, is, not)", str(exc.exception))

Review Comment:
   I made a followup at https://github.com/apache/spark/pull/38752 :-)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] LuciferYang commented on pull request #38704: [SPARK-41193][SQL][TESTS] Ignore `collect data with single partition larger than 2GB bytes array limit` in `DatasetLargeResultCollectingS

2022-11-21 Thread GitBox


LuciferYang commented on PR #38704:
URL: https://github.com/apache/spark/pull/38704#issuecomment-1323195610

   Thanks @HyukjinKwon @mridulm @liuzqt 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HyukjinKwon closed pull request #38686: [SPARK-41169][CONNECT][PYTHON] Implement `DataFrame.drop`

2022-11-21 Thread GitBox


HyukjinKwon closed pull request #38686: [SPARK-41169][CONNECT][PYTHON] 
Implement `DataFrame.drop`
URL: https://github.com/apache/spark/pull/38686


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] zhengruifeng commented on pull request #38735: [SPARK-41213][CONNECT][PYTHON] Implement `DataFrame.__repr__` and `DataFrame.dtypes`

2022-11-21 Thread GitBox


zhengruifeng commented on PR #38735:
URL: https://github.com/apache/spark/pull/38735#issuecomment-1323195553

   @HyukjinKwon  thanks for the reviews


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HyukjinKwon commented on pull request #38686: [SPARK-41169][CONNECT][PYTHON] Implement `DataFrame.drop`

2022-11-21 Thread GitBox


HyukjinKwon commented on PR #38686:
URL: https://github.com/apache/spark/pull/38686#issuecomment-1323195361

   Merged to master.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HyukjinKwon closed pull request #38704: [SPARK-41193][SQL][TESTS] Ignore `collect data with single partition larger than 2GB bytes array limit` in `DatasetLargeResultCollectingSuite`

2022-11-21 Thread GitBox


HyukjinKwon closed pull request #38704: [SPARK-41193][SQL][TESTS] Ignore 
`collect data with single partition larger than 2GB bytes array limit` in 
`DatasetLargeResultCollectingSuite`
URL: https://github.com/apache/spark/pull/38704


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HyukjinKwon commented on pull request #38704: [SPARK-41193][SQL][TESTS] Ignore `collect data with single partition larger than 2GB bytes array limit` in `DatasetLargeResultCollectingS

2022-11-21 Thread GitBox


HyukjinKwon commented on PR #38704:
URL: https://github.com/apache/spark/pull/38704#issuecomment-1323194208

   Merged to master.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HyukjinKwon commented on a diff in pull request #38723: [SPARK-41201][CONNECT][PYTHON] Implement `DataFrame.SelectExpr` in Python client

2022-11-21 Thread GitBox


HyukjinKwon commented on code in PR #38723:
URL: https://github.com/apache/spark/pull/38723#discussion_r1028931691


##
python/pyspark/sql/tests/connect/test_connect_basic.py:
##
@@ -220,6 +220,29 @@ def test_create_global_temp_view(self):
 with self.assertRaises(_MultiThreadedRendezvous):
 self.connect.sql("SELECT 1 AS X LIMIT 
0").createGlobalTempView("view_1")
 
+def test_select_expr(self):
+self.assert_eq(

Review Comment:
   comment with a JIRA ID?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HyukjinKwon commented on a diff in pull request #38723: [SPARK-41201][CONNECT][PYTHON] Implement `DataFrame.SelectExpr` in Python client

2022-11-21 Thread GitBox


HyukjinKwon commented on code in PR #38723:
URL: https://github.com/apache/spark/pull/38723#discussion_r1028931101


##
python/pyspark/sql/connect/column.py:
##
@@ -263,6 +263,22 @@ def __str__(self) -> str:
 return f"Column({self._unparsed_identifier})"
 
 
+class SQLExpression(Expression):

Review Comment:
   I still don't like kind of naming .. but this is at least somewhat 
consistent with what we have in DSv2 so I am fine.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HyukjinKwon commented on a diff in pull request #38723: [SPARK-41201][CONNECT][PYTHON] Implement `DataFrame.SelectExpr` in Python client

2022-11-21 Thread GitBox


HyukjinKwon commented on code in PR #38723:
URL: https://github.com/apache/spark/pull/38723#discussion_r1028930295


##
python/pyspark/sql/connect/column.py:
##
@@ -263,6 +263,22 @@ def __str__(self) -> str:
 return f"Column({self._unparsed_identifier})"
 
 
+class SQLExpression(Expression):
+"""Returns Expression which contains a string which is a SQL expression
+and server side will parse it by Catalyst
+"""
+
+def __init__(self, expr: str) -> None:
+super().__init__()

Review Comment:
   I think that's fine. One point of having type hints is to avoid `assert`s on 
those types too.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HyukjinKwon closed pull request #38731: [SPARK-41209][PYTHON] Improve PySpark type inference in _merge_type method

2022-11-21 Thread GitBox


HyukjinKwon closed pull request #38731: [SPARK-41209][PYTHON] Improve PySpark 
type inference in _merge_type method
URL: https://github.com/apache/spark/pull/38731


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HyukjinKwon commented on pull request #38731: [SPARK-41209][PYTHON] Improve PySpark type inference in _merge_type method

2022-11-21 Thread GitBox


HyukjinKwon commented on PR #38731:
URL: https://github.com/apache/spark/pull/38731#issuecomment-1323189415

   Merged to master.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HyukjinKwon closed pull request #38735: [SPARK-41213][CONNECT][PYTHON] Implement `DataFrame.__repr__` and `DataFrame.dtypes`

2022-11-21 Thread GitBox


HyukjinKwon closed pull request #38735: [SPARK-41213][CONNECT][PYTHON] 
Implement `DataFrame.__repr__` and `DataFrame.dtypes`
URL: https://github.com/apache/spark/pull/38735


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HyukjinKwon commented on pull request #38735: [SPARK-41213][CONNECT][PYTHON] Implement `DataFrame.__repr__` and `DataFrame.dtypes`

2022-11-21 Thread GitBox


HyukjinKwon commented on PR #38735:
URL: https://github.com/apache/spark/pull/38735#issuecomment-1323188640

   Merged to master.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cloud-fan commented on a diff in pull request #38302: [SPARK-40834][SQL] Use SparkListenerSQLExecutionEnd to track final SQL status in UI

2022-11-21 Thread GitBox


cloud-fan commented on code in PR #38302:
URL: https://github.com/apache/spark/pull/38302#discussion_r1028926326


##
sql/core/src/main/scala/org/apache/spark/sql/execution/ui/SQLListener.scala:
##
@@ -56,7 +56,10 @@ case class SparkListenerSQLExecutionStart(
 }
 
 @DeveloperApi
-case class SparkListenerSQLExecutionEnd(executionId: Long, time: Long)
+case class SparkListenerSQLExecutionEnd(
+executionId: Long,
+time: Long,
+errorMessage: Option[String] = None)

Review Comment:
   This is not a public class. It generates JSON string where backward 
compatibility matters, and this PR added tests for it.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] LuciferYang opened a new pull request, #38753: [SPARK-40809][CONNECT][TESTS] Fix pyspark-connect test failed with Scala 2.13

2022-11-21 Thread GitBox


LuciferYang opened a new pull request, #38753:
URL: https://github.com/apache/spark/pull/38753

   ### What changes were proposed in this pull request?
   This pr simplify assertions to fix `pyspark-connect` test failed with Scala 
2.13.
   
   
   
   
   ### Why are the changes needed?
   Fix Scala 2.13 daily test:
   
   - https://github.com/apache/spark/actions/runs/3517345801/jobs/5895043640
   
   ```
   Traceback (most recent call last):
 File 
"/__w/spark/spark/python/pyspark/sql/tests/connect/test_connect_basic.py", line 
337, in test_alias
   self.assertIn("Buffer(this, is, not)", str(exc.exception))
   AssertionError: 'Buffer(this, is, not)' not found in 
'<_MultiThreadedRendezvous of RPC that terminated with:\n\tstatus = 
StatusCode.UNKNOWN\n\tdetails = "[INTERNAL_ERROR] Found the unresolved 
operator: \'Project [id#24L AS List(this, is, not)]"\n\tdebug_error_string = 
"UNKNOWN:Error received from peer ipv4:127.0.0.1:15002 
{created_time:"2022-11-21T19:25:59.902777537+00:00", grpc_status:2, 
grpc_message:"[INTERNAL_ERROR] Found the unresolved operator: \\\'Project 
[id#24L AS List(this, is, not)]"}"\n>'
   ```
   
   The error message with Scala 2.13 contains `List(this, is, not)` instead of 
`ArrayBuffer(this, is, not)`.
   
   ### Does this PR introduce _any_ user-facing change?
   No, just for test
   
   
   ### How was this patch tested?
   Pass GitHub Actions


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HeartSaVioR closed pull request #38748: [SPARK-41151][SQL][3.3] Keep built-in file `_metadata` column nullable value consistent

2022-11-21 Thread GitBox


HeartSaVioR closed pull request #38748: [SPARK-41151][SQL][3.3] Keep built-in 
file `_metadata` column nullable value consistent
URL: https://github.com/apache/spark/pull/38748


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HyukjinKwon commented on a diff in pull request #38742: [SPARK-41216][CONNECT][PYTHON] Make AnalyzePlan support multiple analysis tasks And implement isLocal/isStreaming/printSchema/s

2022-11-21 Thread GitBox


HyukjinKwon commented on code in PR #38742:
URL: https://github.com/apache/spark/pull/38742#discussion_r1028915567


##
python/pyspark/sql/connect/dataframe.py:
##
@@ -736,6 +736,19 @@ def toPandas(self) -> Optional["pandas.DataFrame"]:
 query = self._plan.to_proto(self._session)
 return self._session._to_pandas(query)
 
+def _basic_analyze(self) -> None:
+# update isLocal, isStreaming, explain_string, tree_string, 
semantic_hash
+if self._schema is None:
+if self._plan is not None:
+query = self._plan.to_proto(self._session)
+if self._session is None:
+raise Exception("Cannot analyze without 
RemoteSparkSession.")
+results = self._session.basic_analyze(query)
+for k, v in results.items():
+self._cache[k] = v

Review Comment:
   Let's don't do caching stuff for now



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HeartSaVioR commented on pull request #38748: [SPARK-41151][SQL][3.3] Keep built-in file `_metadata` column nullable value consistent

2022-11-21 Thread GitBox


HeartSaVioR commented on PR #38748:
URL: https://github.com/apache/spark/pull/38748#issuecomment-1323163883

   Thanks! Merging to 3.3.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HeartSaVioR commented on pull request #38748: [SPARK-41151][SQL][3.3] Keep built-in file `_metadata` column nullable value consistent

2022-11-21 Thread GitBox


HeartSaVioR commented on PR #38748:
URL: https://github.com/apache/spark/pull/38748#issuecomment-1323163589

   Looks like GA build failed to take the result of forked GA build. Here is a 
success run from the forked repository.
   https://github.com/Yaohua628/spark/runs/9630498867


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] LuciferYang commented on a diff in pull request #38631: [SPARK-40809] [CONNECT] [FOLLOW] Support `alias()` in Python client

2022-11-21 Thread GitBox


LuciferYang commented on code in PR #38631:
URL: https://github.com/apache/spark/pull/38631#discussion_r1028908773


##
python/pyspark/sql/tests/connect/test_connect_basic.py:
##
@@ -301,6 +301,20 @@ def test_simple_datasource_read(self) -> None:
 actualResult = pandasResult.values.tolist()
 self.assertEqual(len(expectResult), len(actualResult))
 
+def test_alias(self) -> None:
+"""Testing supported and unsupported alias"""
+col0 = (
+self.connect.range(1, 10)
+.select(col("id").alias("name", metadata={"max": 99}))
+.schema()
+.names[0]
+)
+self.assertEqual("name", col0)
+
+with self.assertRaises(grpc.RpcError) as exc:
+self.connect.range(1, 10).select(col("id").alias("this", "is", 
"not")).collect()
+self.assertIn("Buffer(this, is, not)", str(exc.exception))

Review Comment:
   @grundprinzip  The Scala 2.13 daily test has failed in recent times, which 
seems to be related to this assert:
   ```
   ==
   FAIL [0.742s]: test_alias 
(pyspark.sql.tests.connect.test_connect_basic.SparkConnectTests)
   Testing supported and unsupported alias
   --
   Traceback (most recent call last):
 File 
"/__w/spark/spark/python/pyspark/sql/tests/connect/test_connect_basic.py", line 
337, in test_alias
   self.assertIn("Buffer(this, is, not)", str(exc.exception))
   AssertionError: 'Buffer(this, is, not)' not found in 
'<_MultiThreadedRendezvous of RPC that terminated with:\n\tstatus = 
StatusCode.UNKNOWN\n\tdetails = "[INTERNAL_ERROR] Found the unresolved 
operator: \'Project [id#24L AS List(this, is, not)]"\n\tdebug_error_string = 
"UNKNOWN:Error received from peer ipv4:127.0.0.1:15002 
{created_time:"2022-11-21T19:25:59.902777537+00:00", grpc_status:2, 
grpc_message:"[INTERNAL_ERROR] Found the unresolved operator: \\\'Project 
[id#24L AS List(this, is, not)]"}"\n>'
   ```
   
   https://github.com/apache/spark/actions/runs/3517345801/jobs/5895043640
   
   This seems to be related to the difference of Scala version. Can we simplify 
the assertion? such as `self.assertIn("(this, is, not)", str(exc.exception))`
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HyukjinKwon commented on a diff in pull request #38302: [SPARK-40834][SQL] Use SparkListenerSQLExecutionEnd to track final SQL status in UI

2022-11-21 Thread GitBox


HyukjinKwon commented on code in PR #38302:
URL: https://github.com/apache/spark/pull/38302#discussion_r1028906041


##
sql/core/src/main/scala/org/apache/spark/sql/execution/ui/SQLListener.scala:
##
@@ -56,7 +56,10 @@ case class SparkListenerSQLExecutionStart(
 }
 
 @DeveloperApi
-case class SparkListenerSQLExecutionEnd(executionId: Long, time: Long)
+case class SparkListenerSQLExecutionEnd(
+executionId: Long,
+time: Long,
+errorMessage: Option[String] = None)

Review Comment:
   Doesn't this break binary compatibility? I think it should have overridden 
constructor.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HyukjinKwon opened a new pull request, #38752: [SPARK-40809][CONNECT][FOLLOW-UP] Do not use Buffer to make Scala 2.13 test pass

2022-11-21 Thread GitBox


HyukjinKwon opened a new pull request, #38752:
URL: https://github.com/apache/spark/pull/38752

   ### What changes were proposed in this pull request?
   
   This PR is a followup of https://github.com/apache/spark/pull/38631 that 
fixes the test to pass in Scala 2.13 by avoiding using `Buffer` that becomes 
`List` in Scala 2.13.
   
   
   ### Why are the changes needed?
   
   To fix up the Scala 2.13 build, see 
https://github.com/apache/spark/actions/runs/3517345801/jobs/5895043640
   
   ### Does this PR introduce _any_ user-facing change?
   
   No, test-only.
   
   ### How was this patch tested?
   Manually tested.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] panbingkun commented on pull request #38744: [SPARK-41217][SQL] Add the error class `FAILED_FUNCTION_CALL`

2022-11-21 Thread GitBox


panbingkun commented on PR #38744:
URL: https://github.com/apache/spark/pull/38744#issuecomment-1323142938

   +1, LGTM


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] gaoyajun02 commented on pull request #38333: [SPARK-40872] Fallback to original shuffle block when a push-merged shuffle chunk is zero-size

2022-11-21 Thread GitBox


gaoyajun02 commented on PR #38333:
URL: https://github.com/apache/spark/pull/38333#issuecomment-1323136879

   > Thank you, @gaoyajun02 , @mridulm , @otterc .
   > 
   > * Do we need to backport this to branch-3.3?
   > * According to the previous failure description, what happens in 
branch-3.3 in case of failure?
   
   Since the 3.3 branch does not contain the pr of SPARK-38987, if the 
mergedChunk is zero-size, throwFetchFailedException is actually a 
SparkException, which will eventually cause the app to fail due to task failure 
4 times.
   
   @dongjoon-hyun 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] gaoyajun02 commented on pull request #38333: [SPARK-40872] Fallback to original shuffle block when a push-merged shuffle chunk is zero-size

2022-11-21 Thread GitBox


gaoyajun02 commented on PR #38333:
URL: https://github.com/apache/spark/pull/38333#issuecomment-1323119482

   > I was on two minds whether to fix this in 3.3 as well ... Yes, 3.3 is 
affected by it.
   > 
   > But agree, a backport to branch-3.3 would be helpful. Can you get it a 
shot @gaoyajun02 ? Might need to fix some minor nits to get a patch
   
   ok, can you take a look @mridulm cc @otterc @dongjoon-hyun 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] gaoyajun02 opened a new pull request, #38751: [SPARK-40872][3.3] Fallback to original shuffle block when a push-merged shuffle chunk is zero-size

2022-11-21 Thread GitBox


gaoyajun02 opened a new pull request, #38751:
URL: https://github.com/apache/spark/pull/38751

   ### What changes were proposed in this pull request?
   This is a backport PR of #38333
   When push-based shuffle is enabled, a zero-size buf error may occur when 
fetching shuffle chunks from bad nodes, especially when memory is full. In this 
case, we can fall back to original shuffle blocks.
   
   ### Why are the changes needed?
   When the reduce task obtains the shuffle chunk with a zero-size buf, we let 
it fall back to original shuffle block. After verification, these blocks can be 
read successfully without shuffle retry.
   
   
   ### Does this PR introduce _any_ user-facing change?
   No.
   
   ### How was this patch tested?
   UT


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] toujours33 commented on pull request #38711: [SPARK-41192][Core] Remove unscheduled speculative tasks when task finished to obtain better dynamic

2022-11-21 Thread GitBox


toujours33 commented on PR #38711:
URL: https://github.com/apache/spark/pull/38711#issuecomment-1323074280

   > How far back should this backport?
   
   I hope it can be back to  3.3+(3.3 included). For version 3.3 is mainly used 
in our production environment~


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] toujours33 commented on a diff in pull request #38711: [SPARK-41192][Core] Remove unscheduled speculative tasks when task finished to obtain better dynamic

2022-11-21 Thread GitBox


toujours33 commented on code in PR #38711:
URL: https://github.com/apache/spark/pull/38711#discussion_r1028830007


##
core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala:
##
@@ -749,8 +749,10 @@ private[spark] class ExecutorAllocationManager(
   stageAttemptToNumRunningTask.getOrElse(stageAttempt, 0) + 1
 // If this is the last pending task, mark the scheduler queue as empty
 if (taskStart.taskInfo.speculative) {
-  stageAttemptToSpeculativeTaskIndices.getOrElseUpdate(stageAttempt,
-new mutable.HashSet[Int]) += taskIndex
+  stageAttemptToSpeculativeTaskIndices
+.getOrElseUpdate(stageAttempt, new 
mutable.HashSet[Int]).add(taskIndex)
+  stageAttemptToUnsubmittedSpeculativeTasks

Review Comment:
   I think it's ok here.
   As PR describes, 
`stageAttemptToUnsubmittedSpeculativeTasks.remove(taskIndex)` should be called 
only when speculative task start or task finished(whether it is speculative or 
not). 
   Line 754 will do nothing if this speculative task has been removed when task 
finished, which is expected and will be ok.
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] LuciferYang commented on pull request #38711: [SPARK-41192][Core] Remove unscheduled speculative tasks when task finished to obtain better dynamic

2022-11-21 Thread GitBox


LuciferYang commented on PR #38711:
URL: https://github.com/apache/spark/pull/38711#issuecomment-1323065288

   How far back should this backport?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] LuciferYang commented on a diff in pull request #38711: [SPARK-41192][Core] Remove unscheduled speculative tasks when task finished to obtain better dynamic

2022-11-21 Thread GitBox


LuciferYang commented on code in PR #38711:
URL: https://github.com/apache/spark/pull/38711#discussion_r1028823929


##
core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala:
##
@@ -749,8 +749,10 @@ private[spark] class ExecutorAllocationManager(
   stageAttemptToNumRunningTask.getOrElse(stageAttempt, 0) + 1
 // If this is the last pending task, mark the scheduler queue as empty
 if (taskStart.taskInfo.speculative) {
-  stageAttemptToSpeculativeTaskIndices.getOrElseUpdate(stageAttempt,
-new mutable.HashSet[Int]) += taskIndex
+  stageAttemptToSpeculativeTaskIndices
+.getOrElseUpdate(stageAttempt, new 
mutable.HashSet[Int]).add(taskIndex)
+  stageAttemptToUnsubmittedSpeculativeTasks

Review Comment:
   My concern is that if line754 do nothing, this should be regarded as an 
wrong state? Does this wrong state cause other issue?
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] LuciferYang commented on a diff in pull request #38711: [SPARK-41192][Core] Remove unscheduled speculative tasks when task finished to obtain better dynamic

2022-11-21 Thread GitBox


LuciferYang commented on code in PR #38711:
URL: https://github.com/apache/spark/pull/38711#discussion_r1028821337


##
core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala:
##
@@ -749,8 +749,10 @@ private[spark] class ExecutorAllocationManager(
   stageAttemptToNumRunningTask.getOrElse(stageAttempt, 0) + 1
 // If this is the last pending task, mark the scheduler queue as empty
 if (taskStart.taskInfo.speculative) {
-  stageAttemptToSpeculativeTaskIndices.getOrElseUpdate(stageAttempt,
-new mutable.HashSet[Int]) += taskIndex
+  stageAttemptToSpeculativeTaskIndices

Review Comment:
   Got 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] itholic commented on a diff in pull request #38650: [SPARK-41135][SQL] Rename `UNSUPPORTED_EMPTY_LOCATION` to `INVALID_EMPTY_LOCATION`

2022-11-21 Thread GitBox


itholic commented on code in PR #38650:
URL: https://github.com/apache/spark/pull/38650#discussion_r1028818844


##
core/src/main/resources/error/error-classes.json:
##
@@ -656,6 +656,11 @@
 ],
 "sqlState" : "42000"
   },
+  "INVALID_EMPTY_LOCATION" : {
+"message" : [
+  "The location name cannot be empty string or null, but `` was 
given."

Review Comment:
   Good catch.
   I tried to add a test with NULL, but it complains [PARSE_SYNTAX_ERROR] 
rather than [INVALID_EMPTY_LOCATION].
   Let me exclude the "null" from the error message.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] itholic commented on a diff in pull request #38650: [SPARK-41135][SQL] Rename `UNSUPPORTED_EMPTY_LOCATION` to `INVALID_EMPTY_LOCATION`

2022-11-21 Thread GitBox


itholic commented on code in PR #38650:
URL: https://github.com/apache/spark/pull/38650#discussion_r1028818844


##
core/src/main/resources/error/error-classes.json:
##
@@ -656,6 +656,11 @@
 ],
 "sqlState" : "42000"
   },
+  "INVALID_EMPTY_LOCATION" : {
+"message" : [
+  "The location name cannot be empty string or null, but `` was 
given."

Review Comment:
   Good catch.
   I tried to add a test with NULL, but it complains [PARSE_SYNTAX_ERROR] 
rather than [INVALID_EMPTY_LOCATION].
   Let me exclude the null from the error message.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] dengziming commented on pull request #38715: [SPARK-41197] Upgrade Kafka version to 3.3 release

2022-11-21 Thread GitBox


dengziming commented on PR #38715:
URL: https://github.com/apache/spark/pull/38715#issuecomment-1323037097

   These failures comes from 
[apache/kafka#12049](https://github.com/apache/kafka/pull/12049)  and is 
described here: https://kafka.apache.org/documentation/#upgrade_33_notable
   The new default partitioner keeps track of how many bytes are produced 
per-partition and once the amount exceeds batch.size, switches to the next 
partition. In spark kafka tests, this will result in records being sent to one 
partition in some tests.
   One simplest solution is add 
`props.put("partitioner.class",classOf[org.apache.kafka.clients.producer.internals.DefaultPartitioner].getName)`
 in `KafkaTestUtils.producerConfiguration`, or we can implement our own 
partitioner, or set a small`batch.size` config.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] itholic commented on a diff in pull request #38664: [SPARK-41147][SQL] Assign a name to the legacy error class `_LEGACY_ERROR_TEMP_1042`

2022-11-21 Thread GitBox


itholic commented on code in PR #38664:
URL: https://github.com/apache/spark/pull/38664#discussion_r1028813303


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala:
##
@@ -146,7 +147,10 @@ object FunctionRegistryBase {
 .filter(_.getParameterTypes.forall(_ == classOf[Expression]))
 .map(_.getParameterCount).distinct.sorted
   throw QueryCompilationErrors.invalidFunctionArgumentNumberError(
-validParametersCount, name, params.length)
+expressions.map(toPrettySQL(_)).mkString(","),

Review Comment:
   Sounds good! Just addressed it.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] zhengruifeng commented on a diff in pull request #38734: [SPARK-41212][CONNECT][PYTHON] Implement `DataFrame.isEmpty`

2022-11-21 Thread GitBox


zhengruifeng commented on code in PR #38734:
URL: https://github.com/apache/spark/pull/38734#discussion_r1028812420


##
python/pyspark/sql/connect/dataframe.py:
##
@@ -122,6 +122,20 @@ def withPlan(cls, plan: plan.LogicalPlan, session: 
"RemoteSparkSession") -> "Dat
 new_frame._plan = plan
 return new_frame
 
+def isEmpty(self) -> bool:
+"""Returns ``True`` if this :class:`DataFrame` is empty.

Review Comment:
   this doc was copied from pyspark and scala api. I think the dataframe itself 
is expected to be not None, otherwise, this method can not be called.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] zhengruifeng commented on a diff in pull request #38686: [SPARK-41169][CONNECT][PYTHON] Implement `DataFrame.drop`

2022-11-21 Thread GitBox


zhengruifeng commented on code in PR #38686:
URL: https://github.com/apache/spark/pull/38686#discussion_r1028811655


##
connector/connect/src/test/scala/org/apache/spark/sql/connect/planner/SparkConnectProtoSuite.scala:
##
@@ -148,6 +148,23 @@ class SparkConnectProtoSuite extends PlanTest with 
SparkConnectPlanTest {
 comparePlans(connectPlan2, sparkPlan2)
   }
 
+  test("SPARK-41169: Test drop") {
+// single column
+val connectPlan = connectTestRelation.drop("id")
+val sparkPlan = sparkTestRelation.drop("id")
+comparePlans(connectPlan, sparkPlan)
+
+// all columns
+val connectPlan2 = connectTestRelation.drop("id", "name")
+val sparkPlan2 = sparkTestRelation.drop("id", "name")
+comparePlans(connectPlan2, sparkPlan2)
+
+// non-existing column

Review Comment:
   I just checked the implementation of `Dataset.drop`, it supports all kinds 
of expressions, a expression will be just ignored if it doesn't semanticEquals 
the columns in current dataframe.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] toujours33 commented on a diff in pull request #38711: [SPARK-41192][Core] Remove unscheduled speculative tasks when task finished to obtain better dynamic

2022-11-21 Thread GitBox


toujours33 commented on code in PR #38711:
URL: https://github.com/apache/spark/pull/38711#discussion_r1028811606


##
core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala:
##
@@ -749,8 +749,10 @@ private[spark] class ExecutorAllocationManager(
   stageAttemptToNumRunningTask.getOrElse(stageAttempt, 0) + 1
 // If this is the last pending task, mark the scheduler queue as empty
 if (taskStart.taskInfo.speculative) {
-  stageAttemptToSpeculativeTaskIndices.getOrElseUpdate(stageAttempt,
-new mutable.HashSet[Int]) += taskIndex
+  stageAttemptToSpeculativeTaskIndices
+.getOrElseUpdate(stageAttempt, new 
mutable.HashSet[Int]).add(taskIndex)
+  stageAttemptToUnsubmittedSpeculativeTasks

Review Comment:
   1. How about `stageAttemptToUnsubmittedSpeculativeTasks 
.get(stageAttempt).foreach(_.remove(taskIndex))`.
   In this way, nothing will changed even if 
`stageAttemptToUnsubmittedSpecificativeTasks` not initialized.
   2. For `stageAttemptToUnsubmittedSpeculativeTasks` is thread-safe, 
`stageAttemptToUnsubmittedSpeculativeTasks.contains(stageAttempt)` would be 
correct anyway



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] itholic commented on a diff in pull request #38576: [SPARK-41062][SQL] Rename `UNSUPPORTED_CORRELATED_REFERENCE` to `CORRELATED_REFERENCE`

2022-11-21 Thread GitBox


itholic commented on code in PR #38576:
URL: https://github.com/apache/spark/pull/38576#discussion_r1028810170


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala:
##
@@ -1059,10 +1060,16 @@ trait CheckAnalysis extends PredicateHelper with 
LookupCatalog with QueryErrorsB
 def failOnInvalidOuterReference(p: LogicalPlan): Unit = {
   p.expressions.foreach(checkMixedReferencesInsideAggregateExpr)
   if (!canHostOuter(p) && p.expressions.exists(containsOuter)) {
+val exprs = new ListBuffer[String]()
+for (expr <- p.expressions) {

Review Comment:
   Thanks! Just addressed the comments



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] toujours33 commented on a diff in pull request #38711: [SPARK-41192][Core] Remove unscheduled speculative tasks when task finished to obtain better dynamic

2022-11-21 Thread GitBox


toujours33 commented on code in PR #38711:
URL: https://github.com/apache/spark/pull/38711#discussion_r1028807362


##
core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala:
##
@@ -749,8 +749,10 @@ private[spark] class ExecutorAllocationManager(
   stageAttemptToNumRunningTask.getOrElse(stageAttempt, 0) + 1
 // If this is the last pending task, mark the scheduler queue as empty
 if (taskStart.taskInfo.speculative) {
-  stageAttemptToSpeculativeTaskIndices.getOrElseUpdate(stageAttempt,
-new mutable.HashSet[Int]) += taskIndex
+  stageAttemptToSpeculativeTaskIndices

Review Comment:
   `allocationManager.synchronized` will be called to protect this



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] LuciferYang commented on a diff in pull request #38711: [SPARK-41192][Core] Remove unscheduled speculative tasks when task finished to obtain better dynamic

2022-11-21 Thread GitBox


LuciferYang commented on code in PR #38711:
URL: https://github.com/apache/spark/pull/38711#discussion_r1028804426


##
core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala:
##
@@ -749,8 +749,10 @@ private[spark] class ExecutorAllocationManager(
   stageAttemptToNumRunningTask.getOrElse(stageAttempt, 0) + 1
 // If this is the last pending task, mark the scheduler queue as empty
 if (taskStart.taskInfo.speculative) {
-  stageAttemptToSpeculativeTaskIndices.getOrElseUpdate(stageAttempt,
-new mutable.HashSet[Int]) += taskIndex
+  stageAttemptToSpeculativeTaskIndices
+.getOrElseUpdate(stageAttempt, new 
mutable.HashSet[Int]).add(taskIndex)
+  stageAttemptToUnsubmittedSpeculativeTasks

Review Comment:
   Or whether 
`stageAttemptToUnsubmittedSpeculativeTasks.contains(stageAttempt)` should 
always be true at this time
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] LuciferYang commented on a diff in pull request #38711: [SPARK-41192][Core] Remove unscheduled speculative tasks when task finished to obtain better dynamic

2022-11-21 Thread GitBox


LuciferYang commented on code in PR #38711:
URL: https://github.com/apache/spark/pull/38711#discussion_r1028803675


##
core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala:
##
@@ -749,8 +749,10 @@ private[spark] class ExecutorAllocationManager(
   stageAttemptToNumRunningTask.getOrElse(stageAttempt, 0) + 1
 // If this is the last pending task, mark the scheduler queue as empty
 if (taskStart.taskInfo.speculative) {
-  stageAttemptToSpeculativeTaskIndices.getOrElseUpdate(stageAttempt,
-new mutable.HashSet[Int]) += taskIndex
+  stageAttemptToSpeculativeTaskIndices
+.getOrElseUpdate(stageAttempt, new 
mutable.HashSet[Int]).add(taskIndex)
+  stageAttemptToUnsubmittedSpeculativeTasks

Review Comment:
   if use `new mutable.HashSet[Int]`, Is it necessary to call remove?
   
   
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] LuciferYang commented on a diff in pull request #38711: [SPARK-41192][Core] Remove unscheduled speculative tasks when task finished to obtain better dynamic

2022-11-21 Thread GitBox


LuciferYang commented on code in PR #38711:
URL: https://github.com/apache/spark/pull/38711#discussion_r1028802840


##
core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala:
##
@@ -749,8 +749,10 @@ private[spark] class ExecutorAllocationManager(
   stageAttemptToNumRunningTask.getOrElse(stageAttempt, 0) + 1
 // If this is the last pending task, mark the scheduler queue as empty
 if (taskStart.taskInfo.speculative) {
-  stageAttemptToSpeculativeTaskIndices.getOrElseUpdate(stageAttempt,
-new mutable.HashSet[Int]) += taskIndex
+  stageAttemptToSpeculativeTaskIndices

Review Comment:
   Is there a concurrent access issue? `mutable.HashMap` is not thread safe
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cloud-fan commented on a diff in pull request #38703: [SPARK-41191] [SQL] Cache Table is not working while nested caches exist

2022-11-21 Thread GitBox


cloud-fan commented on code in PR #38703:
URL: https://github.com/apache/spark/pull/38703#discussion_r1028795081


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/subquery.scala:
##
@@ -355,7 +355,7 @@ case class ListQuery(
   plan.canonicalized,
   outerAttrs.map(_.canonicalized),
   ExprId(0),
-  childOutputs.map(_.canonicalized.asInstanceOf[Attribute]),
+  plan.canonicalized.output,

Review Comment:
   @allisonwang-db do you know why we have `childOutputs`? It seems to be 
`child.output` most of the time



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cloud-fan commented on a diff in pull request #38739: [SPARK-41207][SQL] Fix BinaryArithmetic with negative scale

2022-11-21 Thread GitBox


cloud-fan commented on code in PR #38739:
URL: https://github.com/apache/spark/pull/38739#discussion_r1028793298


##
sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala:
##
@@ -3532,6 +3532,49 @@ class DataFrameSuite extends QueryTest
   }.isEmpty)
 }
   }
+
+  test("SPARK-41207: Fix BinaryArithmetic with negative scale") {
+withSQLConf(SQLConf.LEGACY_ALLOW_NEGATIVE_SCALE_OF_DECIMAL_ENABLED.key -> 
"true") {

Review Comment:
   I'm not sure this fixes all issues. Officially supporting negative scale is 
really a hard problem. Do you really turn on this config in production?
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] zhengruifeng commented on a diff in pull request #38742: [SPARK-41216][CONNECT][PYTHON] Make AnalyzePlan support multiple analysis tasks And implement isLocal/isStreaming/printSchema/

2022-11-21 Thread GitBox


zhengruifeng commented on code in PR #38742:
URL: https://github.com/apache/spark/pull/38742#discussion_r1028639472


##
connector/connect/src/main/protobuf/spark/connect/base.proto:
##
@@ -100,18 +70,135 @@ message AnalyzePlanRequest {
   // logging purposes and will not be interpreted by the server.
   optional string client_type = 4;
 
-  // (Optional) Get the explain string of the plan.
-  Explain explain = 5;
+  repeated AnalysisTask tasks = 5;
+
+  message AnalysisTask {
+oneof task {

Review Comment:
   ~~I think we can also put catalog methods like `listTables`/`getTable` in 
`AnalysisTask `~~
   
   catalog apis don't require a `plan`, maybe better to have a separate rpc



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] amaliujia commented on a diff in pull request #38734: [SPARK-41212][CONNECT][PYTHON] Implement `DataFrame.isEmpty`

2022-11-21 Thread GitBox


amaliujia commented on code in PR #38734:
URL: https://github.com/apache/spark/pull/38734#discussion_r1028724092


##
python/pyspark/sql/connect/dataframe.py:
##
@@ -122,6 +122,20 @@ def withPlan(cls, plan: plan.LogicalPlan, session: 
"RemoteSparkSession") -> "Dat
 new_frame._plan = plan
 return new_frame
 
+def isEmpty(self) -> bool:
+"""Returns ``True`` if this :class:`DataFrame` is empty.

Review Comment:
   Nit: 
   
   When I read this, I actually not sure if it means the size of DataFrame or 
it means DataFrame is None. I guess it is the size. Not sure if there is a 
better way to clarify this.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] 19855134604 commented on a diff in pull request #38743: [SPARK-41215][BUILD][PROTOBUF] Support user configurable protoc executables when building Spark Protobuf.

2022-11-21 Thread GitBox


19855134604 commented on code in PR #38743:
URL: https://github.com/apache/spark/pull/38743#discussion_r1028790815


##
connector/protobuf/README.md:
##
@@ -0,0 +1,37 @@
+# Spark Protobuf - Developer Documentation
+
+## Getting Started 
+
+### Build
+
+```bash
+./build/mvn -Phive clean package

Review Comment:
   It doesn't seem to be necessary. Let me adjust it.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cloud-fan closed pull request #38738: WIP

2022-11-21 Thread GitBox


cloud-fan closed pull request #38738: WIP
URL: https://github.com/apache/spark/pull/38738


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] zhengruifeng commented on a diff in pull request #38734: [SPARK-41212][CONNECT][PYTHON] Implement `DataFrame.isEmpty`

2022-11-21 Thread GitBox


zhengruifeng commented on code in PR #38734:
URL: https://github.com/apache/spark/pull/38734#discussion_r1028789393


##
python/pyspark/sql/connect/dataframe.py:
##
@@ -122,6 +122,20 @@ def withPlan(cls, plan: plan.LogicalPlan, session: 
"RemoteSparkSession") -> "Dat
 new_frame._plan = plan
 return new_frame
 
+def isEmpty(self) -> bool:
+"""Returns ``True`` if this :class:`DataFrame` is empty.
+
+.. versionadded:: 3.4.0
+
+Returns
+---
+bool
+Whether it's empty DataFrame or not.
+"""
+if "is_empty" not in self._cache:
+self._cache["is_empty"] = len(self.take(1)) == 0
+return bool(self._cache["is_empty"])

Review Comment:
   oh, let me update it. Maybe it is time to design how to do the caching, I 
will work on it.
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] toujours33 commented on a diff in pull request #38711: [SPARK-41192][Core] Remove unscheduled speculative tasks when task finished to obtain better dynamic

2022-11-21 Thread GitBox


toujours33 commented on code in PR #38711:
URL: https://github.com/apache/spark/pull/38711#discussion_r1028788976


##
core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala:
##
@@ -774,17 +776,16 @@ private[spark] class ExecutorAllocationManager(
 removeStageFromResourceProfileIfUnused(stageAttempt)
   }
 }
+

Review Comment:
   Done~



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] LuciferYang commented on a diff in pull request #38711: [SPARK-41192][Core] Remove unscheduled speculative tasks when task finished to obtain better dynamic

2022-11-21 Thread GitBox


LuciferYang commented on code in PR #38711:
URL: https://github.com/apache/spark/pull/38711#discussion_r1028783234


##
core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala:
##
@@ -774,17 +776,16 @@ private[spark] class ExecutorAllocationManager(
 removeStageFromResourceProfileIfUnused(stageAttempt)
   }
 }
+

Review Comment:
   Please remove this empty line



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] sadikovi commented on pull request #38731: [SPARK-41209][PYTHON] Improve PySpark type inference in _merge_type method

2022-11-21 Thread GitBox


sadikovi commented on PR #38731:
URL: https://github.com/apache/spark/pull/38731#issuecomment-1322965568

   @xinrong-meng I have updated the PR description to clarify the user-facing 
change.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cloud-fan commented on a diff in pull request #38495: [SPARK-35531][SQL] Update hive table stats without unnecessary convert

2022-11-21 Thread GitBox


cloud-fan commented on code in PR #38495:
URL: https://github.com/apache/spark/pull/38495#discussion_r1028755248


##
sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClient.scala:
##
@@ -113,6 +113,9 @@ private[hive] trait HiveClient {
   /** Creates a table with the given metadata. */
   def createTable(table: CatalogTable, ignoreIfExists: Boolean): Unit
 
+  /** Get hive table properties. */
+  def hiveTableProps(rawHiveTable: RawHiveTable, containsStats: Boolean): 
Map[String, String]

Review Comment:
   shall we add a method in `RawHiveTable` to do it?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cloud-fan closed pull request #38746: [SPARK-41017][SQL][FOLLOWUP] Push Filter with both deterministic and nondeterministic predicates

2022-11-21 Thread GitBox


cloud-fan closed pull request #38746: [SPARK-41017][SQL][FOLLOWUP] Push Filter 
with both deterministic and nondeterministic predicates
URL: https://github.com/apache/spark/pull/38746


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cloud-fan commented on pull request #38746: [SPARK-41017][SQL][FOLLOWUP] Push Filter with both deterministic and nondeterministic predicates

2022-11-21 Thread GitBox


cloud-fan commented on PR #38746:
URL: https://github.com/apache/spark/pull/38746#issuecomment-1322944376

   thanks for review, merging to master!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] wankunde commented on a diff in pull request #38495: [SPARK-35531][SQL] Update hive table stats without unnecessary convert

2022-11-21 Thread GitBox


wankunde commented on code in PR #38495:
URL: https://github.com/apache/spark/pull/38495#discussion_r1028741655


##
sql/hive/src/test/scala/org/apache/spark/sql/hive/InsertSuite.scala:
##
@@ -894,12 +895,14 @@ class InsertSuite extends QueryTest with 
TestHiveSingleton with BeforeAndAfter
 
   sql(insertString.toLowerCase(Locale.ROOT))
   sql(insertString.toUpperCase(Locale.ROOT))
+  spark.sessionState.catalog.alterTableStats(TableIdentifier("test1"), 
None)

Review Comment:
   Remove this change



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] wankunde commented on a diff in pull request #38495: [SPARK-35531][SQL] Update hive table stats without unnecessary convert

2022-11-21 Thread GitBox


wankunde commented on code in PR #38495:
URL: https://github.com/apache/spark/pull/38495#discussion_r1028741472


##
sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala:
##
@@ -721,19 +721,18 @@ private[spark] class HiveExternalCatalog(conf: SparkConf, 
hadoopConf: Configurat
   table: String,
   stats: Option[CatalogStatistics]): Unit = withClient {
 requireTableExists(db, table)
-val rawTable = getRawTable(db, table)
-
-// convert table statistics to properties so that we can persist them 
through hive client
-val statsProperties =
+val rawHiveTable = client.getRawHiveTable(db, table)
+val oldProps =
+  client.hiveTableProps(rawHiveTable, containsStats = false)
+.filterKeys(!_.startsWith(STATISTICS_PREFIX))
+val newProps =
   if (stats.isDefined) {
-statsToProperties(stats.get)
+oldProps ++ statsToProperties(stats.get)
   } else {
-new mutable.HashMap[String, String]()
+oldProps
   }

Review Comment:
   @cloud-fan 
   * Get all old properties from the hive table without table stats. 
(HiveStatisticsProperties is defined in HiveClientImpl, so do this in 
HiveClientImpl)
   * Filter out spark sql table and columns properties. (STATISTICS_PREFIX is 
defined in HiveExternalCatalog, so do this in HiveExternalCatalog)
   * Add the new table stats and then save the new table.
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cloud-fan commented on a diff in pull request #38747: [SPARK-40834][SQL][FOLLOWUP] Take care of legacy query end events

2022-11-21 Thread GitBox


cloud-fan commented on code in PR #38747:
URL: https://github.com/apache/spark/pull/38747#discussion_r1028740897


##
sql/core/src/main/scala/org/apache/spark/sql/execution/SQLExecution.scala:
##
@@ -121,7 +121,11 @@ object SQLExecution {
   SparkThrowableHelper.getMessage(e)
   }
   val event = SparkListenerSQLExecutionEnd(
-executionId, System.currentTimeMillis(), errorMessage)
+executionId,
+System.currentTimeMillis(),
+// Use empty string to indicate no error, as None may mean events 
generated by old
+// versions of Spark.
+errorMessage.orElse(Some("")))

Review Comment:
   good point



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AngersZhuuuu commented on a diff in pull request #38622: [SPARK-39601][YARN] AllocationFailure should not be treated as exitCausedByApp when driver is shutting down

2022-11-21 Thread GitBox


AngersZh commented on code in PR #38622:
URL: https://github.com/apache/spark/pull/38622#discussion_r1028739055


##
resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala:
##
@@ -815,6 +815,7 @@ private[spark] class ApplicationMaster(
   case Shutdown(code) =>
 exitCode = code
 shutdown = true
+allocator.setShutdown(true)

Review Comment:
   It's ok to send `Shutdown` in cluster mode too, not only for client mode.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] LuciferYang commented on pull request #38743: [SPARK-41215][BUILD][PROTOBUF] Support user configurable protoc executables when building Spark Protobuf.

2022-11-21 Thread GitBox


LuciferYang commented on PR #38743:
URL: https://github.com/apache/spark/pull/38743#issuecomment-1322936602

   cc @HyukjinKwon FYI, a similar fix as SPARK-40593


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] ulysses-you commented on a diff in pull request #38747: [SPARK-40834][SQL][FOLLOWUP] Take care of legacy query end events

2022-11-21 Thread GitBox


ulysses-you commented on code in PR #38747:
URL: https://github.com/apache/spark/pull/38747#discussion_r1028738643


##
sql/core/src/main/scala/org/apache/spark/sql/execution/SQLExecution.scala:
##
@@ -121,7 +121,11 @@ object SQLExecution {
   SparkThrowableHelper.getMessage(e)
   }
   val event = SparkListenerSQLExecutionEnd(
-executionId, System.currentTimeMillis(), errorMessage)
+executionId,
+System.currentTimeMillis(),
+// Use empty string to indicate no error, as None may mean events 
generated by old
+// versions of Spark.
+errorMessage.orElse(Some("")))

Review Comment:
   Better to add docs at `SparkListenerSQLExecutionEnd`, `Some("")` is not an 
error. Otherwise developers may confuse.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] LuciferYang commented on a diff in pull request #38743: [SPARK-41215][BUILD][PROTOBUF] Support user configurable protoc executables when building Spark Protobuf.

2022-11-21 Thread GitBox


LuciferYang commented on code in PR #38743:
URL: https://github.com/apache/spark/pull/38743#discussion_r1028737655


##
connector/protobuf/README.md:
##
@@ -0,0 +1,37 @@
+# Spark Protobuf - Developer Documentation
+
+## Getting Started 
+
+### Build
+
+```bash
+./build/mvn -Phive clean package

Review Comment:
   For `protobuf` module, do we have to compile with `-Phive`?
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] amaliujia commented on pull request #38734: [SPARK-41212][CONNECT][PYTHON] Implement `DataFrame.isEmpty`

2022-11-21 Thread GitBox


amaliujia commented on PR #38734:
URL: https://github.com/apache/spark/pull/38734#issuecomment-1322927685

   LGTM


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] amaliujia commented on a diff in pull request #38734: [SPARK-41212][CONNECT][PYTHON] Implement `DataFrame.isEmpty`

2022-11-21 Thread GitBox


amaliujia commented on code in PR #38734:
URL: https://github.com/apache/spark/pull/38734#discussion_r1028724092


##
python/pyspark/sql/connect/dataframe.py:
##
@@ -122,6 +122,20 @@ def withPlan(cls, plan: plan.LogicalPlan, session: 
"RemoteSparkSession") -> "Dat
 new_frame._plan = plan
 return new_frame
 
+def isEmpty(self) -> bool:
+"""Returns ``True`` if this :class:`DataFrame` is empty.

Review Comment:
   Nit: 
   
   When I read this, I actually not sure if it means the size of DataFrame or 
it means DataFrame is None. I guess it is the size. Not sure if there is a 
better to clarify this.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HyukjinKwon commented on a diff in pull request #38734: [SPARK-41212][CONNECT][PYTHON] Implement `DataFrame.isEmpty`

2022-11-21 Thread GitBox


HyukjinKwon commented on code in PR #38734:
URL: https://github.com/apache/spark/pull/38734#discussion_r1028721327


##
python/pyspark/sql/connect/dataframe.py:
##
@@ -122,6 +122,20 @@ def withPlan(cls, plan: plan.LogicalPlan, session: 
"RemoteSparkSession") -> "Dat
 new_frame._plan = plan
 return new_frame
 
+def isEmpty(self) -> bool:
+"""Returns ``True`` if this :class:`DataFrame` is empty.
+
+.. versionadded:: 3.4.0
+
+Returns
+---
+bool
+Whether it's empty DataFrame or not.
+"""
+if "is_empty" not in self._cache:
+self._cache["is_empty"] = len(self.take(1)) == 0
+return bool(self._cache["is_empty"])

Review Comment:
   I am fine with that but we can even do the caching in the DataFrame instead 
of Spark Connect I guess (?)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] amaliujia commented on a diff in pull request #38735: [SPARK-41213][CONNECT][PYTHON] Implement `DataFrame.__repr__` and `DataFrame.dtypes`

2022-11-21 Thread GitBox


amaliujia commented on code in PR #38735:
URL: https://github.com/apache/spark/pull/38735#discussion_r1028720171


##
python/pyspark/sql/connect/dataframe.py:
##
@@ -115,6 +115,9 @@ def __init__(
 self._cache: Dict[str, Any] = {}
 self._session: "RemoteSparkSession" = session
 
+def __repr__(self) -> str:
+return "DataFrame[%s]" % (", ".join("%s: %s" % c for c in self.dtypes))

Review Comment:
   thanks for the example!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] zhengruifeng commented on a diff in pull request #38735: [SPARK-41213][CONNECT][PYTHON] Implement `DataFrame.__repr__` and `DataFrame.dtypes`

2022-11-21 Thread GitBox


zhengruifeng commented on code in PR #38735:
URL: https://github.com/apache/spark/pull/38735#discussion_r1028719570


##
python/pyspark/sql/connect/dataframe.py:
##
@@ -115,6 +115,9 @@ def __init__(
 self._cache: Dict[str, Any] = {}
 self._session: "RemoteSparkSession" = session
 
+def __repr__(self) -> str:
+return "DataFrame[%s]" % (", ".join("%s: %s" % c for c in self.dtypes))

Review Comment:
   it will be invoked here:
   ```
   In [1]: df = spark.createDataFrame([(10, 80, "Alice"), (5, None, "Bob"), 
(None, 10, "Tom"), (None, None, None)], schema=["age", "height", "name"])
   
   In [2]: df
   Out[2]: DataFrame[age: bigint, height: bigint, name: string]
   
   In [3]: df.__repr__()
   Out[3]: 'DataFrame[age: bigint, height: bigint, name: string]'
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cloud-fan closed pull request #38741: [SPARK-41154][SQL][3.3] Incorrect relation caching for queries with time travel spec

2022-11-21 Thread GitBox


cloud-fan closed pull request #38741: [SPARK-41154][SQL][3.3] Incorrect 
relation caching for queries with time travel spec
URL: https://github.com/apache/spark/pull/38741


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] ulysses-you commented on a diff in pull request #38739: [SPARK-41207][SQL] Fix BinaryArithmetic with negative scale

2022-11-21 Thread GitBox


ulysses-you commented on code in PR #38739:
URL: https://github.com/apache/spark/pull/38739#discussion_r1028718134


##
sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala:
##
@@ -3532,6 +3532,49 @@ class DataFrameSuite extends QueryTest
   }.isEmpty)
 }
   }
+
+  test("SPARK-41207: Fix BinaryArithmetic with negative scale") {
+withSQLConf(SQLConf.LEGACY_ALLOW_NEGATIVE_SCALE_OF_DECIMAL_ENABLED.key -> 
"true") {

Review Comment:
   @cloud-fan, so far it seems the only regression is `IntegralDivide` which is 
affected by https://github.com/apache/spark/pull/36698. The other issues live 
long time. I think the root reason for these issues are same which is fixed by 
this pr. I'm not sure it is a kind of feature, but some correction for legacy 
issues.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cloud-fan commented on pull request #38741: [SPARK-41154][SQL][3.3] Incorrect relation caching for queries with time travel spec

2022-11-21 Thread GitBox


cloud-fan commented on PR #38741:
URL: https://github.com/apache/spark/pull/38741#issuecomment-1322916287

   tests all passed: https://github.com/ulysses-you/spark/runs/9613393804


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cloud-fan commented on pull request #38741: [SPARK-41154][SQL][3.3] Incorrect relation caching for queries with time travel spec

2022-11-21 Thread GitBox


cloud-fan commented on PR #38741:
URL: https://github.com/apache/spark/pull/38741#issuecomment-1322916371

   thanks, merging to 3.3!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cloud-fan commented on pull request #38706: [SPARK-41005][COLLECT][FOLLOWUP] Remove JSON code path and use `RDD.collect` in Arrow code path

2022-11-21 Thread GitBox


cloud-fan commented on PR #38706:
URL: https://github.com/apache/spark/pull/38706#issuecomment-1322913672

   late LGTM


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] xinrong-meng commented on pull request #38731: [SPARK-41209][PYSPARK] Improve PySpark type inference in _merge_type method

2022-11-21 Thread GitBox


xinrong-meng commented on PR #38731:
URL: https://github.com/apache/spark/pull/38731#issuecomment-1322912963

   Shall we add an example to elaborate `Does this PR introduce any user-facing 
change?`? The change might be in the 3.4 release note.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] jerrypeng commented on pull request #38517: [SPARK-39591][SS] Async Progress Tracking

2022-11-21 Thread GitBox


jerrypeng commented on PR #38517:
URL: https://github.com/apache/spark/pull/38517#issuecomment-1322908492

   @HeartSaVioR Please review.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



  1   2   3   >