[GitHub] [spark] HyukjinKwon commented on pull request #39947: [SPARK-40453][SPARK-41715][CONNECT] Take super class into account when throwing an exception

2023-02-08 Thread via GitHub


HyukjinKwon commented on PR #39947:
URL: https://github.com/apache/spark/pull/39947#issuecomment-1423774647

   cc @xinrong-meng @grundprinzip @ueshin @zhengruifeng PTAL





[GitHub] [spark] HyukjinKwon commented on a diff in pull request #39947: [SPARK-40453][SPARK-41715][CONNECT] Take super class into account when throwing an exception

2023-02-08 Thread via GitHub


HyukjinKwon commented on code in PR #39947:
URL: https://github.com/apache/spark/pull/39947#discussion_r1101076905


##
python/pyspark/errors/exceptions/connect.py:
##
@@ -61,41 +97,34 @@ class AnalysisException(SparkConnectGrpcException, BaseAnalysisException):
     Failed to analyze a SQL query plan from Spark Connect server.
     """
 
-    def __init__(
-        self,
-        message: Optional[str] = None,
-        error_class: Optional[str] = None,
-        message_parameters: Optional[Dict[str, str]] = None,
-        plan: Optional[str] = None,
-        reason: Optional[str] = None,
-    ) -> None:
-        self.message = message  # type: ignore[assignment]
-        if plan is not None:
-            self.message = f"{self.message}\nPlan: {plan}"

Review Comment:
   Example:
   
   ```python
   spark.range(1).select("a").show()
   ```
   
   ```
   Traceback (most recent call last):
     File "<stdin>", line 1, in <module>
     File "/.../spark/python/pyspark/sql/connect/dataframe.py", line 776, in show
       print(self._show_string(n, truncate, vertical))
     File "/.../spark/python/pyspark/sql/connect/dataframe.py", line 619, in _show_string
       pdf = DataFrame.withPlan(
     File "/.../spark/python/pyspark/sql/connect/dataframe.py", line 1325, in toPandas
       return self._session.client.to_pandas(query)
     File "/.../spark/python/pyspark/sql/connect/client.py", line 449, in to_pandas
       table, metrics = self._execute_and_fetch(req)
     File "/.../spark/python/pyspark/sql/connect/client.py", line 636, in _execute_and_fetch
       self._handle_error(rpc_error)
     File "/.../spark/python/pyspark/sql/connect/client.py", line 670, in _handle_error
       raise convert_exception(info, status.message) from None
   pyspark.errors.exceptions.connect.AnalysisException: [UNRESOLVED_COLUMN.WITH_SUGGESTION] A column or function parameter with name `a` cannot be resolved. Did you mean one of the following? [`id`].;
   'Project ['a]
   +- Range (0, 1, step=1, splits=Some(16))
   ```






[GitHub] [spark] HyukjinKwon commented on a diff in pull request #39947: [SPARK-40453][SPARK-41715][CONNECT] Take super class into account when throwing an exception

2023-02-08 Thread via GitHub


HyukjinKwon commented on code in PR #39947:
URL: https://github.com/apache/spark/pull/39947#discussion_r1101075312


##
connector/connect/server/src/main/scala/org/apache/spark/sql/connect/service/SparkConnectService.scala:
##
@@ -83,27 +107,19 @@ class SparkConnectService(debug: Boolean)
   private def handleError[V](
       opType: String,
       observer: StreamObserver[V]): PartialFunction[Throwable, Unit] = {
-    case ae: AnalysisException =>
-      logError(s"Error during: $opType", ae)
-      val status = RPCStatus
-        .newBuilder()
-        .setCode(RPCCode.INTERNAL_VALUE)
-        .addDetails(
-          ProtoAny.pack(
-            ErrorInfo
-              .newBuilder()
-              .setReason(ae.getClass.getName)
-              .setDomain("org.apache.spark")
-              .putMetadata("message", ae.getSimpleMessage)
-              .putMetadata("plan", Option(ae.plan).flatten.map(p => s"$p").getOrElse(""))
-              .build()))
-        .setMessage(ae.getLocalizedMessage)
-        .build()
-      observer.onError(StatusProto.toStatusRuntimeException(status))
+    case se: SparkException
+        if se.getCause != null && se.getCause
+          .isInstanceOf[PythonException] && se.getCause.getStackTrace
+          .exists(_.toString.contains("org.apache.spark.sql.execution.python")) =>
+      // Python UDF execution
+      logError(s"Error during: $opType", se)
+      observer.onError(
+        StatusProto.toStatusRuntimeException(buildStatusFromThrowable(se.getCause)))

Review Comment:
   Example:
   
   ```python
   from pyspark.sql.functions import udf
   @udf
   def aa(a):
   1/0
   
   spark.range(1).select(aa("id")).show()
   ```
   
   ```
   Traceback (most recent call last):
     File "<stdin>", line 1, in <module>
     File "/.../spark/python/pyspark/sql/connect/dataframe.py", line 776, in show
       print(self._show_string(n, truncate, vertical))
     File "/.../spark/python/pyspark/sql/connect/dataframe.py", line 619, in _show_string
       pdf = DataFrame.withPlan(
     File "/.../spark/python/pyspark/sql/connect/dataframe.py", line 1325, in toPandas
       return self._session.client.to_pandas(query)
     File "/.../spark/python/pyspark/sql/connect/client.py", line 449, in to_pandas
       table, metrics = self._execute_and_fetch(req)
     File "/.../spark/python/pyspark/sql/connect/client.py", line 636, in _execute_and_fetch
       self._handle_error(rpc_error)
     File "/.../spark/python/pyspark/sql/connect/client.py", line 670, in _handle_error
       raise convert_exception(info, status.message) from None
   pyspark.errors.exceptions.connect.PythonException:
     An exception was thrown from the Python worker. Please see the stack trace below.
   Traceback (most recent call last):
     File "<stdin>", line 3, in aa
   ZeroDivisionError: division by zero
   ```







[GitHub] [spark] panbingkun commented on pull request #39865: [SPARK-42052][SQL] Codegen Support for HiveSimpleUDF

2023-02-08 Thread via GitHub


panbingkun commented on PR #39865:
URL: https://github.com/apache/spark/pull/39865#issuecomment-1423716592

   > Hm, I think we should better go and figure out to resolve this by using `Invoke` so we don't have to manually implement the codegen logic. I was fine with #39555 as a one time thing but seems like there are some more to fix.
   
   As the first step, I have submitted a new PR (https://github.com/apache/spark/pull/39949) to rewrite HiveGenericUDF with Invoke.
   





[GitHub] [spark] panbingkun opened a new pull request, #39949: [SPARK-42386][SQL] Rewrite HiveGenericUDF with Invoke

2023-02-08 Thread via GitHub


panbingkun opened a new pull request, #39949:
URL: https://github.com/apache/spark/pull/39949

   
   
   ### What changes were proposed in this pull request?
   
   
   
   ### Why are the changes needed?
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   
   ### How was this patch tested?
   
   





[GitHub] [spark] itholic commented on a diff in pull request #39941: [MINOR][DOCS] Add link to Hadoop docs

2023-02-08 Thread via GitHub


itholic commented on code in PR #39941:
URL: https://github.com/apache/spark/pull/39941#discussion_r1101026968


##
docs/rdd-programming-guide.md:
##
@@ -442,7 +442,7 @@ Apart from text files, Spark's Python API also supports several other data formats:
 
 **Writable Support**
 
-PySpark SequenceFile support loads an RDD of key-value pairs within Java, converts Writables to base Java types, and pickles the
+PySpark SequenceFile support loads an RDD of key-value pairs within Java, converts [Writables](https://hadoop.apache.org/docs/current/api/org/apache/hadoop/io/Writable.html) to base Java types, and pickles the

Review Comment:
   Sounds good!






[GitHub] [spark] chaoqin-li1123 commented on pull request #39897: [WIP][SPARK-42353][SS] Cleanup orphan sst and log files in RocksDB checkpoint directory

2023-02-08 Thread via GitHub


chaoqin-li1123 commented on PR #39897:
URL: https://github.com/apache/spark/pull/39897#issuecomment-1423698887

   @rangadi 





[GitHub] [spark] chaoqin-li1123 commented on a diff in pull request #39897: [WIP][SPARK-42353][SS] Cleanup orphan sst and log files in RocksDB checkpoint directory

2023-02-08 Thread via GitHub


chaoqin-li1123 commented on code in PR #39897:
URL: https://github.com/apache/spark/pull/39897#discussion_r1101025106


##
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/RocksDBFileManager.scala:
##
@@ -269,29 +306,46 @@ class RocksDBFileManager(
         s"$numVersionsToRetain versions")
 
     // Resolve RocksDB files for all the versions and find the max version each file is used
-    val fileToMaxUsedVersion = new mutable.HashMap[RocksDBImmutableFile, Long]
+    val fileToMaxUsedVersion = new mutable.HashMap[String, Long]
     sortedVersions.foreach { version =>
       val files = Option(versionToRocksDBFiles.get(version)).getOrElse {
         val newResolvedFiles = getImmutableFilesFromVersionZip(version)
         versionToRocksDBFiles.put(version, newResolvedFiles)
         newResolvedFiles
       }
-      files.foreach(f => fileToMaxUsedVersion(f) = version)
+      files.foreach(f => fileToMaxUsedVersion(f.dfsFileName) =
+        math.max(version, fileToMaxUsedVersion.getOrElse(f.dfsFileName, version)))
     }
 
     // Best effort attempt to delete SST files that were last used in to-be-deleted versions
     val filesToDelete = fileToMaxUsedVersion.filter { case (_, v) => versionsToDelete.contains(v) }
+
+    val sstDir = new Path(dfsRootDir, RocksDBImmutableFile.SST_FILES_DFS_SUBDIR)
+    val logDir = new Path(dfsRootDir, RocksDBImmutableFile.LOG_FILES_DFS_SUBDIR)
+    val allSstFiles = if (fm.exists(sstDir)) fm.list(sstDir).toSeq else Seq.empty
+    val allLogFiles = if (fm.exists(logDir)) fm.list(logDir).toSeq else Seq.empty
+    filesToDelete ++= findOrphanFiles(fileToMaxUsedVersion.keys.toSeq, allSstFiles ++ allLogFiles)
+      .map(_ -> -1L)
     logInfo(s"Deleting ${filesToDelete.size} files not used in versions >= $minVersionToRetain")
     var failedToDelete = 0
-    filesToDelete.foreach { case (file, maxUsedVersion) =>
+    filesToDelete.foreach { case (dfsFileName, maxUsedVersion) =>
       try {
-        val dfsFile = dfsFilePath(file.dfsFileName)
+        val dfsFile = dfsFilePath(dfsFileName)
         fm.delete(dfsFile)
-        logDebug(s"Deleted file $file that was last used in version $maxUsedVersion")
+        if (maxUsedVersion == -1) {
+          logDebug(s"Deleted orphan file $dfsFileName")
+        } else {
+          logDebug(s"Deleted file $dfsFileName that was last used in version $maxUsedVersion")
+        }
+        logDebug(s"Deleted file $dfsFileName that was last used in version $maxUsedVersion")

Review Comment:
   Thanks, fixed.






[GitHub] [spark] itholic commented on pull request #39937: [SPARK-42309][SQL] Assign name to _LEGACY_ERROR_TEMP_1204

2023-02-08 Thread via GitHub


itholic commented on PR #39937:
URL: https://github.com/apache/spark/pull/39937#issuecomment-1423666280

   @MaxGekk Just updated the structure to apply sub-error classes for each `addError`.





[GitHub] [spark] db-scnakandala commented on a diff in pull request #39722: [SPARK-42162] Introduce MultiCommutativeOp expression as a memory optimization for canonicalizing large trees of commutative

2023-02-08 Thread via GitHub


db-scnakandala commented on code in PR #39722:
URL: https://github.com/apache/spark/pull/39722#discussion_r1101002739


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/arithmetic.scala:
##
@@ -632,7 +636,12 @@ case class Multiply(
 
   override lazy val canonicalized: Expression = {
     // TODO: do not reorder consecutive `Multiply`s with different `evalMode`
-    orderCommutative({ case Multiply(l, r, _) => Seq(l, r) }).reduce(Multiply(_, _, evalMode))
+    val reorderResult = buildCanonicalizedPlan(
+      { case Multiply(l, r, _) => Seq(l, r) }
+      { case (l: Expression, r: Expression) => Multiply(l, r, evalMode)},
+      Some(evalMode)
+    )
+    reorderResult

Review Comment:
   done.






[GitHub] [spark] cloud-fan commented on a diff in pull request #39722: [SPARK-42162] Introduce MultiCommutativeOp expression as a memory optimization for canonicalizing large trees of commutative expr

2023-02-08 Thread via GitHub


cloud-fan commented on code in PR #39722:
URL: https://github.com/apache/spark/pull/39722#discussion_r1101001554


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/arithmetic.scala:
##
@@ -479,8 +480,11 @@ case class Add(
 
   override lazy val canonicalized: Expression = {
     // TODO: do not reorder consecutive `Add`s with different `evalMode`
-    val reorderResult =
-      orderCommutative({ case Add(l, r, _) => Seq(l, r) }).reduce(Add(_, _, evalMode))
+    val reorderResult = buildCanonicalizedPlan(
+      { case Add(l, r, _) => Seq(l, r) },
+      { case (l: Expression, r: Expression) => Add(l, r, evalMode)},
+      Some(evalMode)
+    )
     if (resolved && reorderResult.resolved && reorderResult.dataType == dataType) {

Review Comment:
   not related to this PR, @gengliangwang do we have a followup to apply this 
fix for multiply?






[GitHub] [spark] cloud-fan commented on a diff in pull request #39722: [SPARK-42162] Introduce MultiCommutativeOp expression as a memory optimization for canonicalizing large trees of commutative expr

2023-02-08 Thread via GitHub


cloud-fan commented on code in PR #39722:
URL: https://github.com/apache/spark/pull/39722#discussion_r1101000846


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/arithmetic.scala:
##
@@ -632,7 +636,12 @@ case class Multiply(
 
   override lazy val canonicalized: Expression = {
     // TODO: do not reorder consecutive `Multiply`s with different `evalMode`
-    orderCommutative({ case Multiply(l, r, _) => Seq(l, r) }).reduce(Multiply(_, _, evalMode))
+    val reorderResult = buildCanonicalizedPlan(
+      { case Multiply(l, r, _) => Seq(l, r) }
+      { case (l: Expression, r: Expression) => Multiply(l, r, evalMode)},
+      Some(evalMode)
+    )
+    reorderResult

Review Comment:
   ```suggestion
   buildCanonicalizedPlan(
 { case Multiply(l, r, _) => Seq(l, r) }
 { case (l: Expression, r: Expression) => Multiply(l, r, evalMode)},
 Some(evalMode)
   )
   ```



##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/arithmetic.scala:
##
@@ -632,7 +636,12 @@ case class Multiply(
 
   override lazy val canonicalized: Expression = {
     // TODO: do not reorder consecutive `Multiply`s with different `evalMode`
-    orderCommutative({ case Multiply(l, r, _) => Seq(l, r) }).reduce(Multiply(_, _, evalMode))
+    val reorderResult = buildCanonicalizedPlan(
+      { case Multiply(l, r, _) => Seq(l, r) }
+      { case (l: Expression, r: Expression) => Multiply(l, r, evalMode)},
+      Some(evalMode)
+    )
+    reorderResult

Review Comment:
   same for other places.






[GitHub] [spark] LuciferYang opened a new pull request, #39948: [SPARK-42385][BUILD] Upgrade RoaringBitmap to 0.9.39

2023-02-08 Thread via GitHub


LuciferYang opened a new pull request, #39948:
URL: https://github.com/apache/spark/pull/39948

   ### What changes were proposed in this pull request?
   This PR aims to upgrade RoaringBitmap to 0.9.39.
   
   
   ### Why are the changes needed?
   This version brings a bug fix:
   
   - https://github.com/RoaringBitmap/RoaringBitmap/pull/614
   
   Other changes are as follows:
   
   https://github.com/RoaringBitmap/RoaringBitmap/compare/0.9.38...0.9.39
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   
   ### How was this patch tested?
   Pass GitHub Actions





[GitHub] [spark] HyukjinKwon commented on a diff in pull request #39947: [SPARK-40453][SPARK-41715][CONNECT] Take super class into account when throwing an exception

2023-02-08 Thread via GitHub


HyukjinKwon commented on code in PR #39947:
URL: https://github.com/apache/spark/pull/39947#discussion_r1100994819


##
python/pyspark/errors/exceptions/connect.py:
##
@@ -61,41 +97,34 @@ class AnalysisException(SparkConnectGrpcException, BaseAnalysisException):
     Failed to analyze a SQL query plan from Spark Connect server.
     """
 
-    def __init__(
-        self,
-        message: Optional[str] = None,
-        error_class: Optional[str] = None,
-        message_parameters: Optional[Dict[str, str]] = None,
-        plan: Optional[str] = None,
-        reason: Optional[str] = None,
-    ) -> None:
-        self.message = message  # type: ignore[assignment]
-        if plan is not None:
-            self.message = f"{self.message}\nPlan: {plan}"

Review Comment:
   The original `AnalysisException.getMessage` contains the string representation of the underlying plan.
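   
   A minimal Python sketch of that behavior (the message and plan strings are hypothetical; it mirrors the removed `__init__` above, not the server code):
   
   ```python
   # Hypothetical values, as the server would send them in the error metadata.
   message = "[UNRESOLVED_COLUMN.WITH_SUGGESTION] A column with name `a` cannot be resolved."
   plan = "'Project ['a]\n+- Range (0, 1, step=1, splits=Some(16))"
   
   # The removed __init__ appended the plan so that str(exception) matched the
   # JVM-side AnalysisException.getMessage, which already includes the plan.
   if plan is not None:
       message = f"{message}\nPlan: {plan}"
   print(message)
   ```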






[GitHub] [spark] HyukjinKwon commented on a diff in pull request #39947: [SPARK-40453][SPARK-41715][CONNECT] Take super class into account when throwing an exception

2023-02-08 Thread via GitHub


HyukjinKwon commented on code in PR #39947:
URL: https://github.com/apache/spark/pull/39947#discussion_r1100994134


##
connector/connect/server/src/main/scala/org/apache/spark/sql/connect/service/SparkConnectService.scala:
##
@@ -53,19 +59,37 @@ class SparkConnectService(debug: Boolean)
     extends SparkConnectServiceGrpc.SparkConnectServiceImplBase
     with Logging {
 
-  private def buildStatusFromThrowable[A <: Throwable with SparkThrowable](st: A): RPCStatus = {
-    val t = Option(st.getCause).getOrElse(st)
+  private def allClasses(cl: Class[_]): Seq[Class[_]] = {
+    val classes = ArrayBuffer.empty[Class[_]]
+    if (cl != null && cl != classOf[java.lang.Object]) {
+      classes.append(cl) // Includes itself.
+    }
+
+    @tailrec
+    def appendSuperClasses(clazz: Class[_]): Unit = {
+      if (clazz == null || clazz == classOf[java.lang.Object]) return
+      classes.append(clazz.getSuperclass)
+      appendSuperClasses(clazz.getSuperclass)
+    }
+
+    appendSuperClasses(cl)
+    classes
+  }
+
+  private def buildStatusFromThrowable(st: Throwable): RPCStatus = {
     RPCStatus
       .newBuilder()
       .setCode(RPCCode.INTERNAL_VALUE)
       .addDetails(
         ProtoAny.pack(
           ErrorInfo
             .newBuilder()
-            .setReason(t.getClass.getName)
+            .setReason(st.getClass.getName)
             .setDomain("org.apache.spark")
+            .putMetadata("message", StringUtils.abbreviate(st.getMessage, 2048))

Review Comment:
   Otherwise, it complains about the header length (8 KiB limit). It can be configured below via `NettyChannelBuilder.maxInboundMessageSize`, but I didn't change it here; see also https://stackoverflow.com/a/686243
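   
   A rough Python sketch of what that abbreviation does (the 2048 limit comes from the diff above; the helper itself is illustrative, not the Spark or commons-lang3 code):
   
   ```python
   def abbreviate(text: str, max_width: int = 2048) -> str:
       """Truncate with a trailing ellipsis, similar in spirit to StringUtils.abbreviate."""
       if text is None or len(text) <= max_width:
           return text
       return text[: max_width - 3] + "..."
   
   # Long error messages are shortened before being packed into the ErrorInfo
   # metadata, so the gRPC response headers stay under the ~8 KiB limit.
   print(len(abbreviate("x" * 10_000)))  # 2048
   ```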






[GitHub] [spark] HeartSaVioR commented on a diff in pull request #39897: [WIP][SPARK-42353][SS] Cleanup orphan sst and log files in RocksDB checkpoint directory

2023-02-08 Thread via GitHub


HeartSaVioR commented on code in PR #39897:
URL: https://github.com/apache/spark/pull/39897#discussion_r1100978171


##
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/RocksDBFileManager.scala:
##
@@ -269,29 +306,46 @@ class RocksDBFileManager(
         s"$numVersionsToRetain versions")
 
     // Resolve RocksDB files for all the versions and find the max version each file is used
-    val fileToMaxUsedVersion = new mutable.HashMap[RocksDBImmutableFile, Long]
+    val fileToMaxUsedVersion = new mutable.HashMap[String, Long]
    sortedVersions.foreach { version =>
       val files = Option(versionToRocksDBFiles.get(version)).getOrElse {
         val newResolvedFiles = getImmutableFilesFromVersionZip(version)
         versionToRocksDBFiles.put(version, newResolvedFiles)
         newResolvedFiles
       }
-      files.foreach(f => fileToMaxUsedVersion(f) = version)
+      files.foreach(f => fileToMaxUsedVersion(f.dfsFileName) =
+        math.max(version, fileToMaxUsedVersion.getOrElse(f.dfsFileName, version)))
     }
 
     // Best effort attempt to delete SST files that were last used in to-be-deleted versions
     val filesToDelete = fileToMaxUsedVersion.filter { case (_, v) => versionsToDelete.contains(v) }
+
+    val sstDir = new Path(dfsRootDir, RocksDBImmutableFile.SST_FILES_DFS_SUBDIR)
+    val logDir = new Path(dfsRootDir, RocksDBImmutableFile.LOG_FILES_DFS_SUBDIR)
+    val allSstFiles = if (fm.exists(sstDir)) fm.list(sstDir).toSeq else Seq.empty
+    val allLogFiles = if (fm.exists(logDir)) fm.list(logDir).toSeq else Seq.empty
+    filesToDelete ++= findOrphanFiles(fileToMaxUsedVersion.keys.toSeq, allSstFiles ++ allLogFiles)
+      .map(_ -> -1L)
     logInfo(s"Deleting ${filesToDelete.size} files not used in versions >= $minVersionToRetain")
     var failedToDelete = 0
-    filesToDelete.foreach { case (file, maxUsedVersion) =>
+    filesToDelete.foreach { case (dfsFileName, maxUsedVersion) =>
       try {
-        val dfsFile = dfsFilePath(file.dfsFileName)
+        val dfsFile = dfsFilePath(dfsFileName)
         fm.delete(dfsFile)
-        logDebug(s"Deleted file $file that was last used in version $maxUsedVersion")
+        if (maxUsedVersion == -1) {
+          logDebug(s"Deleted orphan file $dfsFileName")
+        } else {
+          logDebug(s"Deleted file $dfsFileName that was last used in version $maxUsedVersion")
+        }
+        logDebug(s"Deleted file $dfsFileName that was last used in version $maxUsedVersion")

Review Comment:
   nit: remove?






[GitHub] [spark] db-scnakandala commented on a diff in pull request #39722: [SPARK-42162] Introduce MultiCommutativeOp expression as a memory optimization for canonicalizing large trees of commutative

2023-02-08 Thread via GitHub


db-scnakandala commented on code in PR #39722:
URL: https://github.com/apache/spark/pull/39722#discussion_r1100974436


##
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:
##
@@ -240,6 +240,15 @@ object SQLConf {
     .intConf
     .createWithDefault(100)
 
+  val MULTI_COMMUTATIVE_OP_OPT_THRESHOLD =
+    buildConf("spark.sql.analyzer.canonicalization.multiCommutativeOpMemoryOptThreshold")
+      .internal()
+      .doc("The minimum number of consecutive non-commutative operands to" +

Review Comment:
   Changed it to `The minimum number of operands in a commutative expression tree`






[GitHub] [spark] cloud-fan commented on a diff in pull request #39722: [SPARK-42162] Introduce MultiCommutativeOp expression as a memory optimization for canonicalizing large trees of commutative expr

2023-02-08 Thread via GitHub


cloud-fan commented on code in PR #39722:
URL: https://github.com/apache/spark/pull/39722#discussion_r1100956662


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala:
##
@@ -1335,3 +1335,72 @@ trait CommutativeExpression extends Expression {
       f: PartialFunction[CommutativeExpression, Seq[Expression]]): Seq[Expression] =
     gatherCommutative(this, f).sortBy(_.hashCode())
 }
+
+/**
+ * A helper class used by the Commutative expressions during canonicalization. During
+ * canonicalization, when we have a long tree of commutative operations, we use the MultiCommutative
+ * expression to represent that tree instead of creating new commutative objects.
+ * This class is added as a memory optimization for processing large commutative operation trees
+ * without creating a large number of new intermediate objects.
+ * The MultiCommutativeOp memory optimization is applied to the following commutative
+ * expressions:
+ *  Add, Multiply, And, Or, BitwiseAnd, BitwiseOr, BitwiseXor.
+ * @param operands A sequence of operands that produces a commutative expression tree.
+ * @param opCls The class of the root operator of the expression tree.
+ * @param evalMode The optional expression evaluation mode.
+ * @param originalRoot Root operator of the commutative expression tree before canonicalization.
+ *                     This object reference is used to deduce the return dataType of Add and
+ *                     Multiply operations when the input datatype is decimal.
+ */
+case class MultiCommutativeOp(
+    operands: Seq[Expression],
+    opCls: Class[_],
+    evalMode: Option[EvalMode.Value])(originalRoot: Expression) extends Unevaluable {
+  // Helper method to deduce the data type of a single operation.
+  private def singleOpDataType(lType: DataType, rType: DataType): DataType = {
+    originalRoot match {
+      case add: Add =>
+        (lType, rType) match {
+          case (DecimalType.Fixed(p1, s1), DecimalType.Fixed(p2, s2)) =>
+            add.resultDecimalType(p1, s1, p2, s2)
+          case _ => lType
+        }
+      case multiply: Multiply =>
+        (lType, rType) match {
+          case (DecimalType.Fixed(p1, s1), DecimalType.Fixed(p2, s2)) =>
+            multiply.resultDecimalType(p1, s1, p2, s2)
+          case _ => lType
+        }
+    }
+  }
+
+  /**
+   * Returns the [[DataType]] of the result of evaluating this expression.  It is
+   * invalid to query the dataType of an unresolved expression (i.e., when `resolved` == false).
+   */
+  override def dataType: DataType = {
+    originalRoot match {
+      case _: Add | _: Multiply =>
+        operands.map(_.dataType).reduce((l, r) => singleOpDataType(l, r))
+      case other => other.dataType
+    }
+  }
+
+  /**
+   * Returns whether this node is nullable. This node is nullable if any of its children is
+   * nullable.
+   */
+  override def nullable: Boolean = operands.exists(_.nullable)
+
+  /**
+   * Returns a Seq of the children of this node.
+   * Children should not change. Immutability required for containsChild optimization
+   */
+  override def children: Seq[Expression] = operands
+

Review Comment:
   how about providing a util function in this trait so that sub-classes can do canonicalization more easily?
   ```
   def buildCanonicalizedPlan(collectOperands: PartialFunction[Expression, Seq[Expression]], buildBinaryOp: (Expression, Expression) => Expression) = ...
   ```
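   
   For illustration, a toy Python sketch of the underlying idea (not the Spark implementation; the class names and threshold are made up): canonicalization sorts the flattened operands and, past a threshold, keeps them in a single flat node instead of rebuilding a chain of nested binary nodes.
   
   ```python
   from dataclasses import dataclass
   from typing import List, Tuple
   
   @dataclass(frozen=True)
   class Add:
       left: object
       right: object
   
   @dataclass(frozen=True)
   class MultiCommutativeOp:
       operands: Tuple[object, ...]  # flat operand list replacing nested Add nodes
   
   def flatten_add(expr) -> List[object]:
       """Collect all operands of a consecutive chain of Add nodes."""
       if isinstance(expr, Add):
           return flatten_add(expr.left) + flatten_add(expr.right)
       return [expr]
   
   def canonicalize(expr, threshold: int = 3):
       """Sort operands; use one flat node for long chains instead of
       allocating len(operands) - 1 intermediate Add objects."""
       if not isinstance(expr, Add):
           return expr
       operands = sorted(flatten_add(expr), key=hash)
       if len(operands) >= threshold:
           return MultiCommutativeOp(tuple(operands))
       result = operands[0]
       for op in operands[1:]:
           result = Add(result, op)
       return result
   
   # "a" + "b" + "c" + "d" canonicalizes to one flat node rather than 3 nested Adds.
   print(canonicalize(Add(Add(Add("d", "c"), "b"), "a")))
   ```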






[GitHub] [spark] cloud-fan commented on a diff in pull request #39722: [SPARK-42162] Introduce MultiCommutativeOp expression as a memory optimization for canonicalizing large trees of commutative expr

2023-02-08 Thread via GitHub


cloud-fan commented on code in PR #39722:
URL: https://github.com/apache/spark/pull/39722#discussion_r1100955001


##
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:
##
@@ -240,6 +240,15 @@ object SQLConf {
     .intConf
     .createWithDefault(100)
 
+  val MULTI_COMMUTATIVE_OP_OPT_THRESHOLD =
+    buildConf("spark.sql.analyzer.canonicalization.multiCommutativeOpMemoryOptThreshold")
+      .internal()
+      .doc("The minimum number of consecutive non-commutative operands to" +

Review Comment:
   non-commutative?






[GitHub] [spark] cloud-fan commented on a diff in pull request #39722: [SPARK-42162] Introduce MultiCommutativeOp expression as a memory optimization for canonicalizing large trees of commutative expr

2023-02-08 Thread via GitHub


cloud-fan commented on code in PR #39722:
URL: https://github.com/apache/spark/pull/39722#discussion_r1100954235


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala:
##
@@ -1335,3 +1335,72 @@ trait CommutativeExpression extends Expression {
   f: PartialFunction[CommutativeExpression, Seq[Expression]]): 
Seq[Expression] =
 gatherCommutative(this, f).sortBy(_.hashCode())
 }
+
+/**
+ * A helper class used by the Commutative expressions during canonicalization. 
During
+ * canonicalization, when we have a long tree of commutative operations, we 
use the MultiCommutative
+ * expression to represent that tree instead of creating new commutative 
objects.
+ * This class is added as a memory optimization for processing large 
commutative operation trees
+ * without creating a large number of new intermediate objects.
+ * The MultiCommutativeOp memory optimization is applied to the following 
commutative
+ * expressions:
+ *  Add, Multiply, And, Or, BitwiseAnd, BitwiseOr, BitwiseXor.
+ * @param operands A sequence of operands that produces a commutative 
expression tree.
+ * @param opCls The class of the root operator of the expression tree.
+ * @param evalMode The optional expression evaluation mode.
+ * @param originalRoot Root operator of the commutative expression tree before 
canonicalization.
+ * This object reference is used to deduce the return 
dataType of Add and
+ * Multiply operations when the input datatype is decimal.
+ */
+case class MultiCommutativeOp(
+operands: Seq[Expression],
+opCls: Class[_],
+evalMode: Option[EvalMode.Value])(originalRoot: Expression) extends 
Unevaluable {
+  // Helper method to deduce the data type of a single operation.
+  private def singleOpDataType(lType: DataType, rType: DataType): DataType = {
+originalRoot match {
+  case add: Add =>
+(lType, rType) match {
+  case (DecimalType.Fixed(p1, s1), DecimalType.Fixed(p2, s2)) =>
+add.resultDecimalType(p1, s1, p2, s2)
+  case _ => lType
+}
+  case multiply: Multiply =>
+(lType, rType) match {
+  case (DecimalType.Fixed(p1, s1), DecimalType.Fixed(p2, s2)) =>
+multiply.resultDecimalType(p1, s1, p2, s2)
+  case _ => lType
+}
+}
+  }
+
+  /**
+   * Returns the [[DataType]] of the result of evaluating this expression.  It 
is
+   * invalid to query the dataType of an unresolved expression (i.e., when 
`resolved` == false).

Review Comment:
   nit: if you look at other sub-classes of `Expression`, we don't repeat the API doc in the override methods.






[GitHub] [spark] HyukjinKwon opened a new pull request, #39947: [SPARK-40453][SPARK-41715][CONNECT] Take super class into account when throwing an exception

2023-02-08 Thread via GitHub


HyukjinKwon opened a new pull request, #39947:
URL: https://github.com/apache/spark/pull/39947

   ### What changes were proposed in this pull request?
   
   This PR proposes to take the super classes into account when throwing an exception from the server to the Python side, by adding more metadata (classes, causes, and traceback) from the JVM.
   
   In addition, this PR matches the exceptions being thrown to the regular PySpark exceptions defined:
   
   
https://github.com/apache/spark/blob/04550edd49ee587656d215e59d6a072772d7d5ec/python/pyspark/errors/exceptions/captured.py#L108-L147
   
   
   ### Why are the changes needed?
   
   Right now, many exceptions (e.g., `NoSuchDatabaseException`, which inherits `AnalysisException`) cannot be handled on the Python side.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No to end users.
   Yes, it matches the exceptions to the regular PySpark exceptions.
   
   ### How was this patch tested?
   
   Unittests fixed.
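   
   For illustration, a hedged sketch of what this enables on the client side (the database name is made up and `spark` is assumed to be an existing Spark Connect session):
   
   ```python
   from pyspark.errors import AnalysisException
   
   try:
       # The server raises a subclass of AnalysisException (e.g. NoSuchDatabaseException).
       spark.sql("DESCRIBE DATABASE nonexistent_db")
   except AnalysisException as e:
       # With the super classes propagated in the error metadata, the
       # server-side subclass is still caught by its parent class here.
       print(type(e).__name__, str(e))
   ```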





[GitHub] [spark] beliefer commented on pull request #38799: [SPARK-37099][SQL] Introduce the group limit of Window for rank-based filter to optimize top-k computation

2023-02-08 Thread via GitHub


beliefer commented on PR #38799:
URL: https://github.com/apache/spark/pull/38799#issuecomment-1423565384

   > there are 3 tpc queries having plan change, what's their perf result?
   
   
![image](https://user-images.githubusercontent.com/8486025/217709318-ce005732-48ef-4668-baa7-14ded7faf52f.png)
   





[GitHub] [spark] LuciferYang commented on pull request #39893: [SPARK-42350][SQL][K8S][SS] Replcace `get().getOrElse` with `getOrElse`

2023-02-08 Thread via GitHub


LuciferYang commented on PR #39893:
URL: https://github.com/apache/spark/pull/39893#issuecomment-1423562100

   Thanks @srowen 





[GitHub] [spark] LuciferYang commented on pull request #39899: [SPARK-42355][BUILD] Upgrade some maven-plugins

2023-02-08 Thread via GitHub


LuciferYang commented on PR #39899:
URL: https://github.com/apache/spark/pull/39899#issuecomment-1423561970

   Thanks @srowen @HyukjinKwon 





[GitHub] [spark] itholic commented on a diff in pull request #39946: [SPARK-42310][SQL] Assign name to _LEGACY_ERROR_TEMP_1289

2023-02-08 Thread via GitHub


itholic commented on code in PR #39946:
URL: https://github.com/apache/spark/pull/39946#discussion_r1100935378


##
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceUtils.scala:
##
@@ -73,7 +73,8 @@ object DataSourceUtils extends PredicateHelper {
   def checkFieldNames(format: FileFormat, schema: StructType): Unit = {
     schema.foreach { field =>
       if (!format.supportFieldName(field.name)) {
-        throw QueryCompilationErrors.columnNameContainsInvalidCharactersError(field.name)
+        throw QueryCompilationErrors.invalidColumnNameAsPathError(
+          format.getClass.getSimpleName, field.name)

Review Comment:
   FYI: If we don't use `getSimpleName` here, it displays the format with the full package path, something like `org.apache.spark.sql.hive.execution.HiveFileFormat`.
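   
   A loose Python analogy of the difference (purely illustrative; the actual code uses Java's `getClass.getSimpleName`):
   
   ```python
   class HiveFileFormat:  # stand-in for org.apache.spark.sql.hive.execution.HiveFileFormat
       pass
   
   fmt = HiveFileFormat()
   # Fully-qualified name vs. simple name -- the error message uses the short form.
   print(f"{type(fmt).__module__}.{type(fmt).__qualname__}")  # e.g. __main__.HiveFileFormat
   print(type(fmt).__name__)                                  # HiveFileFormat
   ```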






[GitHub] [spark] itholic commented on pull request #39946: [SPARK-42310][SQL] Assign name to _LEGACY_ERROR_TEMP_1289

2023-02-08 Thread via GitHub


itholic commented on PR #39946:
URL: https://github.com/apache/spark/pull/39946#issuecomment-1423557987

   Thanks for the review, @MaxGekk !
   Just adjusted the comments.





[GitHub] [spark] beliefer commented on a diff in pull request #38799: [SPARK-37099][SQL] Introduce the group limit of Window for rank-based filter to optimize top-k computation

2023-02-08 Thread via GitHub


beliefer commented on code in PR #38799:
URL: https://github.com/apache/spark/pull/38799#discussion_r1100921709


##
sql/core/src/main/scala/org/apache/spark/sql/execution/window/WindowGroupLimitExec.scala:
##
@@ -0,0 +1,245 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.window
+
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.catalyst.expressions.{Ascending, Attribute, DenseRank, Expression, Rank, RowNumber, SortOrder, UnsafeProjection, UnsafeRow}
+import org.apache.spark.sql.catalyst.expressions.codegen.GenerateOrdering
+import org.apache.spark.sql.catalyst.plans.physical.{AllTuples, ClusteredDistribution, Distribution, Partitioning}
+import org.apache.spark.sql.execution.{SparkPlan, UnaryExecNode}
+
+sealed trait WindowGroupLimitMode
+
+case object Partial extends WindowGroupLimitMode
+
+case object Final extends WindowGroupLimitMode
+
+/**
+ * This operator is designed to filter out unnecessary rows before WindowExec
+ * for top-k computation.
+ * @param partitionSpec Should be the same as [[WindowExec#partitionSpec]].
+ * @param orderSpec Should be the same as [[WindowExec#orderSpec]].
+ * @param rankLikeFunction The function to compute row rank, should be RowNumber/Rank/DenseRank.
+ * @param limit The limit for rank value.
+ * @param mode The mode describes [[WindowGroupLimitExec]] before or after shuffle.
+ * @param child The child spark plan.
+ */
+case class WindowGroupLimitExec(
+partitionSpec: Seq[Expression],
+orderSpec: Seq[SortOrder],
+rankLikeFunction: Expression,
+limit: Int,
+mode: WindowGroupLimitMode,
+child: SparkPlan) extends UnaryExecNode {
+
+  override def output: Seq[Attribute] = child.output
+
+  override def requiredChildDistribution: Seq[Distribution] = mode match {
+case Partial => super.requiredChildDistribution
+case Final =>
+  if (partitionSpec.isEmpty) {
+AllTuples :: Nil
+  } else {
+ClusteredDistribution(partitionSpec) :: Nil
+  }
+  }
+
+  override def requiredChildOrdering: Seq[Seq[SortOrder]] =
+Seq(partitionSpec.map(SortOrder(_, Ascending)) ++ orderSpec)
+
+  override def outputOrdering: Seq[SortOrder] = child.outputOrdering
+
+  override def outputPartitioning: Partitioning = child.outputPartitioning
+
+  protected override def doExecute(): RDD[InternalRow] = rankLikeFunction match {
+case _: RowNumber =>
+  child.execute().mapPartitionsInternal(
+SimpleGroupLimitIterator(partitionSpec, output, _, limit))
+case _: Rank =>
+  child.execute().mapPartitionsInternal(
+RankGroupLimitIterator(partitionSpec, output, _, orderSpec, limit))
+case _: DenseRank =>
+  child.execute().mapPartitionsInternal(
+DenseRankGroupLimitIterator(partitionSpec, output, _, orderSpec, limit))
+  }
+
+  override protected def withNewChildInternal(newChild: SparkPlan): WindowGroupLimitExec =
+copy(child = newChild)
+}
+
+abstract class WindowIterator extends Iterator[InternalRow] {
+
+  def partitionSpec: Seq[Expression]
+
+  def output: Seq[Attribute]
+
+  def input: Iterator[InternalRow]
+
+  def limit: Int
+
+  val grouping = UnsafeProjection.create(partitionSpec, output)

Review Comment:
   Yes. I have another PR to do the work.






[GitHub] [spark] srowen commented on a diff in pull request #39941: [MINOR][DOCS] Add link to Hadoop docs

2023-02-08 Thread via GitHub


srowen commented on code in PR #39941:
URL: https://github.com/apache/spark/pull/39941#discussion_r1100921708


##
docs/rdd-programming-guide.md:
##
@@ -442,7 +442,7 @@ Apart from text files, Spark's Python API also supports several other data formats:
 
 **Writable Support**
 
-PySpark SequenceFile support loads an RDD of key-value pairs within Java, converts Writables to base Java types, and pickles the
+PySpark SequenceFile support loads an RDD of key-value pairs within Java, converts [Writables](https://hadoop.apache.org/docs/current/api/org/apache/hadoop/io/Writable.html) to base Java types, and pickles the

Review Comment:
   Yeah I might write `[Writable](..)s` but that's awkward. `[Writable](...) objects`?






[GitHub] [spark] srowen commented on pull request #39899: [SPARK-42355][BUILD] Upgrade some maven-plugins

2023-02-08 Thread via GitHub


srowen commented on PR #39899:
URL: https://github.com/apache/spark/pull/39899#issuecomment-1423539058

   Merged to master





[GitHub] [spark] srowen closed pull request #39899: [SPARK-42355][BUILD] Upgrade some maven-plugins

2023-02-08 Thread via GitHub


srowen closed pull request #39899: [SPARK-42355][BUILD] Upgrade some maven-plugins
URL: https://github.com/apache/spark/pull/39899





[GitHub] [spark] srowen closed pull request #39893: [SPARK-42350][SQL][K8S][SS] Replcace `get().getOrElse` with `getOrElse`

2023-02-08 Thread via GitHub


srowen closed pull request #39893: [SPARK-42350][SQL][K8S][SS] Replcace `get().getOrElse` with `getOrElse`
URL: https://github.com/apache/spark/pull/39893





[GitHub] [spark] srowen commented on pull request #39893: [SPARK-42350][SQL][K8S][SS] Replcace `get().getOrElse` with `getOrElse`

2023-02-08 Thread via GitHub


srowen commented on PR #39893:
URL: https://github.com/apache/spark/pull/39893#issuecomment-1423538496

   Merged to master





[GitHub] [spark] beliefer commented on a diff in pull request #38799: [SPARK-37099][SQL] Introduce the group limit of Window for rank-based filter to optimize top-k computation

2023-02-08 Thread via GitHub


beliefer commented on code in PR #38799:
URL: https://github.com/apache/spark/pull/38799#discussion_r1100920458


##
sql/core/src/test/scala/org/apache/spark/sql/DataFrameWindowFunctionsSuite.scala:
##
@@ -1265,4 +1265,168 @@ class DataFrameWindowFunctionsSuite extends QueryTest
   )
 )
   }
+
+  test("SPARK-37099: Insert window group limit node for top-k computation") {
+
+val nullStr: String = null
+val df = Seq(
+  ("a", 0, "c", 1.0),
+  ("a", 1, "x", 2.0),
+  ("a", 2, "y", 3.0),
+  ("a", 3, "z", -1.0),
+  ("a", 4, "", 2.0),
+  ("a", 4, "", 2.0),
+  ("b", 1, "h", Double.NaN),
+  ("b", 1, "n", Double.PositiveInfinity),
+  ("c", 1, "z", -2.0),
+  ("c", 1, "a", -4.0),
+  ("c", 2, nullStr, 5.0)).toDF("key", "value", "order", "value2")
+
+val window = Window.partitionBy($"key").orderBy($"order".asc_nulls_first)
+val window2 = Window.partitionBy($"key").orderBy($"order".desc_nulls_first)
+
+Seq(-1, 100).foreach { threshold =>
+  withSQLConf(SQLConf.WINDOW_GROUP_LIMIT_THRESHOLD.key -> threshold.toString) {
+Seq($"rn" === 0, $"rn" < 1, $"rn" <= 0).foreach { condition =>
+  checkAnswer(df.withColumn("rn", 
row_number().over(window)).where(condition),
+Seq.empty[Row]
+  )
+}
+
+Seq($"rn" === 1, $"rn" < 2, $"rn" <= 1).foreach { condition =>
+  checkAnswer(df.withColumn("rn", row_number().over(window)).where(condition),
+Seq(
+  Row("a", 4, "", 2.0, 1),
+  Row("b", 1, "h", Double.NaN, 1),
+  Row("c", 2, null, 5.0, 1)
+)
+  )
+
+  checkAnswer(df.withColumn("rn", rank().over(window)).where(condition),
+Seq(
+  Row("a", 4, "", 2.0, 1),
+  Row("a", 4, "", 2.0, 1),
+  Row("b", 1, "h", Double.NaN, 1),
+  Row("c", 2, null, 5.0, 1)
+)
+  )
+
+  checkAnswer(df.withColumn("rn", 
dense_rank().over(window)).where(condition),
+Seq(
+  Row("a", 4, "", 2.0, 1),
+  Row("a", 4, "", 2.0, 1),
+  Row("b", 1, "h", Double.NaN, 1),
+  Row("c", 2, null, 5.0, 1)
+)
+  )
+}
+
+Seq($"rn" < 3, $"rn" <= 2).foreach { condition =>
+  checkAnswer(df.withColumn("rn", 
row_number().over(window)).where(condition),
+Seq(
+  Row("a", 4, "", 2.0, 1),
+  Row("a", 4, "", 2.0, 2),
+  Row("b", 1, "h", Double.NaN, 1),
+  Row("b", 1, "n", Double.PositiveInfinity, 2),
+  Row("c", 1, "a", -4.0, 2),
+  Row("c", 2, null, 5.0, 1)
+)
+  )
+
+  checkAnswer(df.withColumn("rn", 
rank().over(window)).where(condition),
+Seq(
+  Row("a", 4, "", 2.0, 1),
+  Row("a", 4, "", 2.0, 1),
+  Row("b", 1, "h", Double.NaN, 1),
+  Row("b", 1, "n", Double.PositiveInfinity, 2),
+  Row("c", 1, "a", -4.0, 2),
+  Row("c", 2, null, 5.0, 1)
+)
+  )
+
+  checkAnswer(df.withColumn("rn", 
dense_rank().over(window)).where(condition),
+Seq(
+  Row("a", 0, "c", 1.0, 2),
+  Row("a", 4, "", 2.0, 1),
+  Row("a", 4, "", 2.0, 1),
+  Row("b", 1, "h", Double.NaN, 1),
+  Row("b", 1, "n", Double.PositiveInfinity, 2),
+  Row("c", 1, "a", -4.0, 2),
+  Row("c", 2, null, 5.0, 1)
+)
+  )
+}
+
+val condition = $"rn" === 2 && $"value2" > 0.5
+checkAnswer(df.withColumn("rn", 
row_number().over(window)).where(condition),
+  Seq(
+Row("a", 4, "", 2.0, 2),
+Row("b", 1, "n", Double.PositiveInfinity, 2)
+  )
+)
+
+checkAnswer(df.withColumn("rn", rank().over(window)).where(condition),
+  Seq(
+Row("b", 1, "n", Double.PositiveInfinity, 2)
+  )
+)
+
+checkAnswer(df.withColumn("rn", 
dense_rank().over(window)).where(condition),
+  Seq(
+Row("a", 0, "c", 1.0, 2),
+Row("b", 1, "n", Double.PositiveInfinity, 2)
+  )
+)
+
+val multipleRowNumbers = df
+  .withColumn("rn", row_number().over(window))
+  .withColumn("rn2", row_number().over(window))
+  .where('rn < 2 && 'rn2 < 3)
+checkAnswer(multipleRowNumbers,
+  Seq(
+Row("a", 4, "", 2.0, 1, 1),
+Row("b", 1, "h", Double.NaN, 1, 1),
+Row("c", 2, null, 5.0, 1, 1)
+  )
+)
+
+val multipleRanks = df
+  .withColumn("rn", rank().over(window))
+  .withColumn("rn2", rank().over(window))
+  .where('rn < 2 && 'rn2 < 3)
+checkAnswer(multipleRanks,
+  Seq(
+Row("a", 4, "", 2.0, 1, 1),
+Row("a", 4, "",

[GitHub] [spark] beliefer commented on a diff in pull request #38799: [SPARK-37099][SQL] Introduce the group limit of Window for rank-based filter to optimize top-k computation

2023-02-08 Thread via GitHub


beliefer commented on code in PR #38799:
URL: https://github.com/apache/spark/pull/38799#discussion_r1100919419


##
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:
##
@@ -2580,6 +2580,18 @@ object SQLConf {
   .intConf
   
.createWithDefault(SHUFFLE_SPILL_NUM_ELEMENTS_FORCE_SPILL_THRESHOLD.defaultValue.get)
 
+  val WINDOW_GROUP_LIMIT_THRESHOLD =
+buildConf("spark.sql.optimizer.windowGroupLimitThreshold")
+  .internal()
+  .doc("Threshold for filter the dataset by the window group limit before" 
+

Review Comment:
   0 means the output result is empty.
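
   For illustration, a spark-shell sketch (not from the PR) of the kind of query this threshold governs; the config key and its `-1` "disabled" value come from the diff above, everything else is made up:

   ```scala
   // Top-k per group via row_number(); the new rule can rewrite the rank filter
   // into a WindowGroupLimit so each partition only keeps its first k rows.
   import org.apache.spark.sql.expressions.Window
   import org.apache.spark.sql.functions.row_number

   val df = Seq(("a", 1), ("a", 2), ("a", 3), ("b", 1), ("b", 2)).toDF("key", "value")
   val w = Window.partitionBy($"key").orderBy($"value")

   val topK = df.withColumn("rn", row_number().over(w)).where($"rn" <= 2)
   topK.show()

   // -1 disables the rewrite entirely; a filter such as rn <= 0 returns no rows.
   spark.conf.set("spark.sql.optimizer.windowGroupLimitThreshold", "-1")
   topK.explain()
   ```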



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] LuciferYang commented on pull request #39933: [SPARK-42377][CONNECT][TESTS] Test framework for Spark Connect Scala Client

2023-02-08 Thread via GitHub


LuciferYang commented on PR #39933:
URL: https://github.com/apache/spark/pull/39933#issuecomment-1423533123

   Screenshot: https://user-images.githubusercontent.com/1475305/217703128-b0cb8d3e-a103-4595-80b1-db3e4a9da72a.png
   
   The license test failed. Do we need to add a new rule to `.rat-excludes`?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] LuciferYang commented on pull request #39933: [SPARK-42377][CONNECT][TESTS] Test framework for Spark Connect Scala Client

2023-02-08 Thread via GitHub


LuciferYang commented on PR #39933:
URL: https://github.com/apache/spark/pull/39933#issuecomment-1423528903

   @hvanhovell checked, the local Maven test passed


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] beliefer commented on a diff in pull request #38799: [SPARK-37099][SQL] Introduce the group limit of Window for rank-based filter to optimize top-k computation

2023-02-08 Thread via GitHub


beliefer commented on code in PR #38799:
URL: https://github.com/apache/spark/pull/38799#discussion_r1100909992


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/InsertWindowGroupLimit.scala:
##
@@ -0,0 +1,96 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.optimizer
+
+import org.apache.spark.sql.catalyst.expressions.{Alias, Attribute, 
CurrentRow, DenseRank, EqualTo, Expression, GreaterThan, GreaterThanOrEqual, 
IntegerLiteral, LessThan, LessThanOrEqual, NamedExpression, PredicateHelper, 
Rank, RowFrame, RowNumber, SpecifiedWindowFrame, UnboundedPreceding, 
WindowExpression, WindowSpecDefinition}
+import org.apache.spark.sql.catalyst.plans.logical.{Filter, LocalRelation, 
LogicalPlan, Window, WindowGroupLimit}
+import org.apache.spark.sql.catalyst.rules.Rule
+import org.apache.spark.sql.catalyst.trees.TreePattern.{FILTER, WINDOW}
+
+/**
+ * Optimize the filter based on rank-like window function by reduce not 
required rows.
+ * This rule optimizes the following cases:
+ * {{{
+ *   SELECT *, ROW_NUMBER() OVER(PARTITION BY k ORDER BY a) AS rn FROM Tab1 
WHERE rn = 5
+ *   SELECT *, ROW_NUMBER() OVER(PARTITION BY k ORDER BY a) AS rn FROM Tab1 
WHERE 5 = rn
+ *   SELECT *, ROW_NUMBER() OVER(PARTITION BY k ORDER BY a) AS rn FROM Tab1 
WHERE rn < 5
+ *   SELECT *, ROW_NUMBER() OVER(PARTITION BY k ORDER BY a) AS rn FROM Tab1 
WHERE 5 > rn
+ *   SELECT *, ROW_NUMBER() OVER(PARTITION BY k ORDER BY a) AS rn FROM Tab1 
WHERE rn <= 5
+ *   SELECT *, ROW_NUMBER() OVER(PARTITION BY k ORDER BY a) AS rn FROM Tab1 
WHERE 5 >= rn
+ * }}}
+ */
+object InsertWindowGroupLimit extends Rule[LogicalPlan] with PredicateHelper {
+
+  /**
+   * Extract all the limit values from predicates.
+   */
+  def extractLimits(condition: Expression, attr: Attribute): Option[Int] = {
+val limits = splitConjunctivePredicates(condition).collect {
+  case EqualTo(IntegerLiteral(limit), e) if e.semanticEquals(attr) => limit
+  case EqualTo(e, IntegerLiteral(limit)) if e.semanticEquals(attr) => limit
+  case LessThan(e, IntegerLiteral(limit)) if e.semanticEquals(attr) => 
limit - 1
+  case GreaterThan(IntegerLiteral(limit), e) if e.semanticEquals(attr) => 
limit - 1
+  case LessThanOrEqual(e, IntegerLiteral(limit)) if e.semanticEquals(attr) 
=> limit
+  case GreaterThanOrEqual(IntegerLiteral(limit), e) if 
e.semanticEquals(attr) => limit
+}
+
+if (limits.nonEmpty) Some(limits.min) else None
+  }
+
+  private def support(
+  windowExpression: NamedExpression): Boolean = windowExpression match {
+case Alias(WindowExpression(_: Rank | _: DenseRank | _: RowNumber, 
WindowSpecDefinition(_, _,
+SpecifiedWindowFrame(RowFrame, UnboundedPreceding, CurrentRow))), _) => 
true
+case _ => false
+  }
+
+  def apply(plan: LogicalPlan): LogicalPlan = {
+if (conf.windowGroupLimitThreshold == -1) return plan
+
+plan.transformWithPruning(_.containsAllPatterns(FILTER, WINDOW), ruleId) {
+  case filter @ Filter(condition,
+window @ Window(windowExpressions, partitionSpec, orderSpec, child))
+if !child.isInstanceOf[WindowGroupLimit] && 
windowExpressions.exists(support) &&
+  orderSpec.nonEmpty =>
+val limits = windowExpressions.collect {
+  case alias @ Alias(WindowExpression(rankLikeFunction, _), _) if 
support(alias) =>
+extractLimits(condition, alias.toAttribute).map((_, 
rankLikeFunction))
+}.flatten
+
+// multiple different rank-like functions unsupported.
+if (limits.isEmpty) {
+  filter
+} else {
+  limits.minBy(_._1) match {

Review Comment:
   Good idea.
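
   For readers following along, a toy REPL sketch (a simplification, not the PR's code) of the idea behind `extractLimits` above: every conjunct that bounds the rank column contributes a limit, and the smallest one wins.

   ```scala
   // Simplified stand-ins for Catalyst predicates over an integer rank column.
   sealed trait Pred
   case class EqualTo(col: String, v: Int) extends Pred
   case class LessThan(col: String, v: Int) extends Pred
   case class LessThanOrEqual(col: String, v: Int) extends Pred

   def extractLimit(conjuncts: Seq[Pred], rankCol: String): Option[Int] = {
     val limits = conjuncts.collect {
       case EqualTo(c, v) if c == rankCol         => v
       case LessThan(c, v) if c == rankCol        => v - 1
       case LessThanOrEqual(c, v) if c == rankCol => v
     }
     if (limits.nonEmpty) Some(limits.min) else None
   }

   // WHERE rn <= 5 AND rn < 3  =>  group limit 2 (the tighter bound).
   assert(extractLimit(Seq(LessThanOrEqual("rn", 5), LessThan("rn", 3)), "rn") == Some(2))
   ```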



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.or

[GitHub] [spark] LuciferYang commented on pull request #39893: [SPARK-42350][SQL][K8S][SS] Replace `get().getOrElse` with `getOrElse`

2023-02-08 Thread via GitHub


LuciferYang commented on PR #39893:
URL: https://github.com/apache/spark/pull/39893#issuecomment-1423517707

   There should be no more instances.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] zhengruifeng commented on a diff in pull request #39925: [SPARK-41812][SPARK-41823][CONNECT][SQL][PYTHON] Resolve ambiguous columns issue in `Join`

2023-02-08 Thread via GitHub


zhengruifeng commented on code in PR #39925:
URL: https://github.com/apache/spark/pull/39925#discussion_r1100882974


##
connector/connect/server/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala:
##
@@ -123,6 +123,11 @@ class SparkConnectPlanner(val session: SparkSession) {
 transformRelationPlugin(rel.getExtension)
   case _ => throw InvalidPlanInput(s"${rel.getUnknown} not supported.")
 }
+
+if (rel.hasCommon) {
+  plan.setTagValue(LogicalPlan.PLAN_ID_TAG, rel.getCommon.getPlanId)

Review Comment:
   OK, let us make it optional.
   
   ~~the Commands and Catalogs like `CreateView` and `WriteOperation` will not have a plan id~~
   
   Let me also add a plan_id for Commands; it may be useful in the future.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] zhengruifeng commented on a diff in pull request #39925: [SPARK-41812][SPARK-41823][CONNECT][SQL][PYTHON] Resolve ambiguous columns issue in `Join`

2023-02-08 Thread via GitHub


zhengruifeng commented on code in PR #39925:
URL: https://github.com/apache/spark/pull/39925#discussion_r1100886574


##
python/pyspark/sql/connect/plan.py:
##
@@ -259,6 +272,8 @@ def __init__(
 
 def plan(self, session: "SparkConnectClient") -> proto.Relation:
 plan = proto.Relation()

Review Comment:
   nice, will add a helper method



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] zhengruifeng commented on a diff in pull request #39925: [SPARK-41812][SPARK-41823][CONNECT][SQL][PYTHON] Resolve ambiguous columns issue in `Join`

2023-02-08 Thread via GitHub


zhengruifeng commented on code in PR #39925:
URL: https://github.com/apache/spark/pull/39925#discussion_r1100882974


##
connector/connect/server/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala:
##
@@ -123,6 +123,11 @@ class SparkConnectPlanner(val session: SparkSession) {
 transformRelationPlugin(rel.getExtension)
   case _ => throw InvalidPlanInput(s"${rel.getUnknown} not supported.")
 }
+
+if (rel.hasCommon) {
+  plan.setTagValue(LogicalPlan.PLAN_ID_TAG, rel.getCommon.getPlanId)

Review Comment:
   ok, let us make it optional.
   
   the Commands and Catalogs like `CreateView` and `WriteOperation` will not 
have a plan id



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] zhengruifeng commented on a diff in pull request #39925: [SPARK-41812][SPARK-41823][CONNECT][SQL][PYTHON] Resolve ambiguous columns issue in `Join`

2023-02-08 Thread via GitHub


zhengruifeng commented on code in PR #39925:
URL: https://github.com/apache/spark/pull/39925#discussion_r1100881330


##
connector/connect/common/src/main/protobuf/spark/connect/relations.proto:
##
@@ -91,8 +91,11 @@ message Unknown {}
 
 // Common metadata of all relations.
 message RelationCommon {
+  // (Required) A globally unique id for a given connect plan.
+  int64 plan_id = 1;
+
   // (Required) Shared relation metadata.
-  string source_info = 1;
+  string source_info = 2;

Review Comment:
   ok



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] zhengruifeng commented on a diff in pull request #39925: [SPARK-41812][SPARK-41823][CONNECT][SQL][PYTHON] Resolve ambiguous columns issue in `Join`

2023-02-08 Thread via GitHub


zhengruifeng commented on code in PR #39925:
URL: https://github.com/apache/spark/pull/39925#discussion_r1100881164


##
connector/connect/common/src/main/protobuf/spark/connect/relations.proto:
##
@@ -91,8 +91,11 @@ message Unknown {}
 
 // Common metadata of all relations.
 message RelationCommon {
+  // (Required) A globally unique id for a given connect plan.
+  int64 plan_id = 1;

Review Comment:
   will update



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] zhengruifeng commented on a diff in pull request #39925: [SPARK-41812][SPARK-41823][CONNECT][SQL][PYTHON] Resolve ambiguous columns issue in `Join`

2023-02-08 Thread via GitHub


zhengruifeng commented on code in PR #39925:
URL: https://github.com/apache/spark/pull/39925#discussion_r1100881055


##
python/pyspark/sql/connect/plan.py:
##
@@ -673,32 +716,34 @@ def plan(self, session: "SparkConnectClient") -> 
proto.Relation:
 from pyspark.sql.connect.functions import lit
 
 assert self._child is not None
+plan = proto.Relation()
+plan.common.plan_id = self._plan_id
 
-agg = proto.Relation()
-
-agg.aggregate.input.CopyFrom(self._child.plan(session))
+plan.aggregate.input.CopyFrom(self._child.plan(session))
 
-agg.aggregate.grouping_expressions.extend([c.to_plan(session) for c in 
self._grouping_cols])
-agg.aggregate.aggregate_expressions.extend(
+plan.aggregate.grouping_expressions.extend(
+[c.to_plan(session) for c in self._grouping_cols]
+)
+plan.aggregate.aggregate_expressions.extend(
 [c.to_plan(session) for c in self._aggregate_cols]
 )
 
 if self._group_type == "groupby":
-agg.aggregate.group_type = 
proto.Aggregate.GroupType.GROUP_TYPE_GROUPBY
+plan.aggregate.group_type = 
proto.Aggregate.GroupType.GROUP_TYPE_GROUPBY
 elif self._group_type == "rollup":
-agg.aggregate.group_type = 
proto.Aggregate.GroupType.GROUP_TYPE_ROLLUP
+plan.aggregate.group_type = 
proto.Aggregate.GroupType.GROUP_TYPE_ROLLUP
 elif self._group_type == "cube":
-agg.aggregate.group_type = 
proto.Aggregate.GroupType.GROUP_TYPE_CUBE
+plan.aggregate.group_type = 
proto.Aggregate.GroupType.GROUP_TYPE_CUBE
 elif self._group_type == "pivot":
-agg.aggregate.group_type = 
proto.Aggregate.GroupType.GROUP_TYPE_PIVOT
+plan.aggregate.group_type = 
proto.Aggregate.GroupType.GROUP_TYPE_PIVOT
 assert self._pivot_col is not None
-agg.aggregate.pivot.col.CopyFrom(self._pivot_col.to_plan(session))
+plan.aggregate.pivot.col.CopyFrom(self._pivot_col.to_plan(session))
 if self._pivot_values is not None and len(self._pivot_values) > 0:
-agg.aggregate.pivot.values.extend(
+plan.aggregate.pivot.values.extend(
 [lit(v).to_plan(session).literal for v in 
self._pivot_values]
 )
 
-return agg
+return plan
 

Review Comment:
   both `plan` and `rel` are fine to me, I just want to make the naming 
consistent.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] xinrong-meng commented on pull request #39860: [SPARK-42210][CONNECT][PYTHON] Standardize registered pickled Python UDFs

2023-02-08 Thread via GitHub


xinrong-meng commented on PR #39860:
URL: https://github.com/apache/spark/pull/39860#issuecomment-1423489976

   Force-pushed to rebase on the latest master.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] beliefer commented on pull request #39667: [SPARK-42131][SQL] Extract the function that construct the select statement for JDBC dialect.

2023-02-08 Thread via GitHub


beliefer commented on PR #39667:
URL: https://github.com/apache/spark/pull/39667#issuecomment-1423472035

   @cloud-fan @sadikovi Thank you!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HyukjinKwon closed pull request #38223: [SPARK-40770][PYTHON] Improved error messages for applyInPandas for schema mismatch

2023-02-08 Thread via GitHub


HyukjinKwon closed pull request #38223: [SPARK-40770][PYTHON] Improved error 
messages for applyInPandas for schema mismatch
URL: https://github.com/apache/spark/pull/38223


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HyukjinKwon commented on pull request #38223: [SPARK-40770][PYTHON] Improved error messages for applyInPandas for schema mismatch

2023-02-08 Thread via GitHub


HyukjinKwon commented on PR #38223:
URL: https://github.com/apache/spark/pull/38223#issuecomment-1423450435

   Merged to master.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] github-actions[bot] commented on pull request #37232: [SPARK-39821][PYTHON][PS] Fix error during using DatetimeIndex

2023-02-08 Thread via GitHub


github-actions[bot] commented on PR #37232:
URL: https://github.com/apache/spark/pull/37232#issuecomment-1423412026

   We're closing this PR because it hasn't been updated in a while. This isn't 
a judgement on the merit of the PR in any way. It's just a way of keeping the 
PR queue manageable.
   If you'd like to revive this PR, please reopen it and ask a committer to 
remove the Stale tag!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] github-actions[bot] commented on pull request #37369: [SPARK-39942][PYTHON][PS] Need to verify the input nums is integer in nsmallest func

2023-02-08 Thread via GitHub


github-actions[bot] commented on PR #37369:
URL: https://github.com/apache/spark/pull/37369#issuecomment-1423411990

   We're closing this PR because it hasn't been updated in a while. This isn't 
a judgement on the merit of the PR in any way. It's just a way of keeping the 
PR queue manageable.
   If you'd like to revive this PR, please reopen it and ask a committer to 
remove the Stale tag!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] github-actions[bot] commented on pull request #38070: [SPARK-38004][PYTHON] Mangle dupe cols documentation

2023-02-08 Thread via GitHub


github-actions[bot] commented on PR #38070:
URL: https://github.com/apache/spark/pull/38070#issuecomment-1423411943

   We're closing this PR because it hasn't been updated in a while. This isn't 
a judgement on the merit of the PR in any way. It's just a way of keeping the 
PR queue manageable.
   If you'd like to revive this PR, please reopen it and ask a committer to 
remove the Stale tag!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] github-actions[bot] closed pull request #38068: [SPARK-40409] spark-sql supports reading `ByteType` data of avro serde

2023-02-08 Thread via GitHub


github-actions[bot] closed pull request #38068: [SPARK-40409] spark-sql 
supports reading `ByteType` data of avro serde
URL: https://github.com/apache/spark/pull/38068


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] github-actions[bot] closed pull request #38233: [SPARK-40781][CORE] Explain exit code 137 as killed due to OOM

2023-02-08 Thread via GitHub


github-actions[bot] closed pull request #38233: [SPARK-40781][CORE] Explain 
exit code 137 as killed due to OOM
URL: https://github.com/apache/spark/pull/38233


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HyukjinKwon commented on pull request #39943: [SPARK-40819][SQL][FOLLOWUP] Update SqlConf version for nanosAsLong configuration

2023-02-08 Thread via GitHub


HyukjinKwon commented on PR #39943:
URL: https://github.com/apache/spark/pull/39943#issuecomment-1423397038

   Merged to master, branch-3.4, branch-3.3 and branch-3.2.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HyukjinKwon closed pull request #39943: [SPARK-40819][SQL][FOLLOWUP] Update SqlConf version for nanosAsLong configuration

2023-02-08 Thread via GitHub


HyukjinKwon closed pull request #39943: [SPARK-40819][SQL][FOLLOWUP] Update 
SqlConf version for nanosAsLong configuration
URL: https://github.com/apache/spark/pull/39943


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] holdenk commented on a diff in pull request #39825: [SPARK-42261][SPARK-42260][K8S] Log Allocation Stalls and Trigger Allocation event without blocking on snapshot

2023-02-08 Thread via GitHub


holdenk commented on code in PR #39825:
URL: https://github.com/apache/spark/pull/39825#discussion_r1100813043


##
resource-managers/kubernetes/core/src/test/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsAllocatorSuite.scala:
##
@@ -161,7 +161,7 @@ class ExecutorPodsAllocatorSuite extends SparkFunSuite with 
BeforeAndAfter {
 assert(ExecutorPodsAllocator.splitSlots(seq2, 4) === Seq(("a", 2), ("b", 
1), ("c", 1)))
   }
 
-  test("SPARK-36052: pending pod limit with multiple resource profiles") {
+  test("SPARK-42261: Allow allocations without snapshot up to min of max 
pending & alloc size.") {

Review Comment:
   Sounds good, will refactor next week.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] itholic commented on a diff in pull request #39941: [MINOR][DOCS] Add link to Hadoop docs

2023-02-08 Thread via GitHub


itholic commented on code in PR #39941:
URL: https://github.com/apache/spark/pull/39941#discussion_r1100795489


##
docs/rdd-programming-guide.md:
##
@@ -442,7 +442,7 @@ Apart from text files, Spark's Python API also supports 
several other data forma
 
 **Writable Support**
 
-PySpark SequenceFile support loads an RDD of key-value pairs within Java, 
converts Writables to base Java types, and pickles the
+PySpark SequenceFile support loads an RDD of key-value pairs within Java, 
converts 
[Writables](https://hadoop.apache.org/docs/current/api/org/apache/hadoop/io/Writable.html)
 to base Java types, and pickles the

Review Comment:
   nit: Maybe `Writables` -> `Writable`?
   `Writable` seems to be the official name.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] hvanhovell commented on pull request #39933: [SPARK-42377][CONNECT][TESTS] Test framework for Spark Connect Scala Client

2023-02-08 Thread via GitHub


hvanhovell commented on PR #39933:
URL: https://github.com/apache/spark/pull/39933#issuecomment-1423312392

   @LuciferYang can you check if Maven works now? It should. There is an 
outstanding issue with one of the UDF tests, but that is for a different PR.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HeartSaVioR closed pull request #39936: [SPARK-42379][SS] Use FileSystem.exists in FileSystemBasedCheckpointFileManager.exists

2023-02-08 Thread via GitHub


HeartSaVioR closed pull request #39936: [SPARK-42379][SS] Use FileSystem.exists 
in FileSystemBasedCheckpointFileManager.exists
URL: https://github.com/apache/spark/pull/39936


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HeartSaVioR commented on pull request #39936: [SPARK-42379][SS] Use FileSystem.exists in FileSystemBasedCheckpointFileManager.exists

2023-02-08 Thread via GitHub


HeartSaVioR commented on PR #39936:
URL: https://github.com/apache/spark/pull/39936#issuecomment-1423307538

   Thanks, merging to master.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] dongjoon-hyun commented on pull request #39906: [SPARK-41962][MINOR][SQL] Update the order of imports in class SpecificParquetRecordReaderBase

2023-02-08 Thread via GitHub


dongjoon-hyun commented on PR #39906:
URL: https://github.com/apache/spark/pull/39906#issuecomment-1423293234

   Thank you for the confirmation, @wayneguow .


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] wayneguow commented on pull request #39906: [SPARK-41962][MINOR][SQL] Update the order of imports in class SpecificParquetRecordReaderBase

2023-02-08 Thread via GitHub


wayneguow commented on PR #39906:
URL: https://github.com/apache/spark/pull/39906#issuecomment-1423286391

   > You can make a new backporting PR to branch-3.2 if needed, @wayneguow .
   
   Screenshot: https://user-images.githubusercontent.com/16032294/217658729-c709f636-1c3a-429b-b7e8-9929961eba49.png
   It seems branch-3.2 is fine without those unordered imports, so just reverting this for branch-3.2 is enough.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] hvanhovell commented on pull request #39933: [SPARK-42377][CONNECT][TESTS] Test framework for Spark Connect Scala Client

2023-02-08 Thread via GitHub


hvanhovell commented on PR #39933:
URL: https://github.com/apache/spark/pull/39933#issuecomment-1423286123

   Alright the golden files for the `ProtoToParsedPlanTestSuite` have been 
updated.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] dongjoon-hyun commented on pull request #39906: [SPARK-41962][MINOR][SQL] Update the order of imports in class SpecificParquetRecordReaderBase

2023-02-08 Thread via GitHub


dongjoon-hyun commented on PR #39906:
URL: https://github.com/apache/spark/pull/39906#issuecomment-1423275269

   Also, cc @parthchandra for branch-3.2


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] dongjoon-hyun commented on pull request #39906: [SPARK-41962][MINOR][SQL] Update the order of imports in class SpecificParquetRecordReaderBase

2023-02-08 Thread via GitHub


dongjoon-hyun commented on PR #39906:
URL: https://github.com/apache/spark/pull/39906#issuecomment-1423274262

   You can make a new backporting PR to branch-3.2 if needed, @wayneguow .


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] wayneguow commented on pull request #39906: [SPARK-41962][MINOR][SQL] Update the order of imports in class SpecificParquetRecordReaderBase

2023-02-08 Thread via GitHub


wayneguow commented on PR #39906:
URL: https://github.com/apache/spark/pull/39906#issuecomment-1423272897

   > I reverted the commit from `branch-3.2` because the `import` statement is 
unused.
   > 
   > * 
[c8b47ec](https://github.com/apache/spark/commit/c8b47ec2d0a066f539a98488502fce99efe006f0)
   > 
   > ![Screenshot 2023-02-08 at 1 34 13 PM](https://user-images.githubusercontent.com/9700541/217656121-3874fb1a-b2d9-4cc9-acf6-7461b0731b69.png)
   
   @dongjoon-hyun  Thank you for resolving it.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] dongjoon-hyun commented on pull request #39906: [SPARK-41962][MINOR][SQL] Update the order of imports in class SpecificParquetRecordReaderBase

2023-02-08 Thread via GitHub


dongjoon-hyun commented on PR #39906:
URL: https://github.com/apache/spark/pull/39906#issuecomment-1423268335

   I reverted the commit because the `import` statement is unused.
   - 
https://github.com/apache/spark/commit/c8b47ec2d0a066f539a98488502fce99efe006f0


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] dongjoon-hyun commented on pull request #39906: [SPARK-41962][MINOR][SQL] Update the order of imports in class SpecificParquetRecordReaderBase

2023-02-08 Thread via GitHub


dongjoon-hyun commented on PR #39906:
URL: https://github.com/apache/spark/pull/39906#issuecomment-1423266159

   Hi, @wayneguow and @HyukjinKwon. This broke branch-3.2.
   - https://github.com/apache/spark/commits/branch-3.2


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] db-scnakandala commented on pull request #39722: [SPARK-42162] Introduce MultiCommutativeOp expression as a memory optimization for canonicalizing large trees of commutative expressio

2023-02-08 Thread via GitHub


db-scnakandala commented on PR #39722:
URL: https://github.com/apache/spark/pull/39722#issuecomment-1423248966

   > > @peter-toth Can you elaborate more on the statement "shouldn't 
canonicalize a single expression to MultiCommutativeOp"
   > > From what I understand since the operator is a commutative expression, 
it should at least have two operands (?)
   > 
   > I would keep the canonicalization of a single commutative expression like: 
`Add(x, y)` as it is: `Add(x, y)`, but when we have at least 2 nested: 
`Add(Add(x, y), z)` then as `MultiCommutativeOp(x, y, z, Add)`
   
   If I understand it correctly, setting the threshold to `3` should achieve 
this requirement, right?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] MaxGekk commented on a diff in pull request #39946: [SPARK-42310][SQL] Assign name to _LEGACY_ERROR_TEMP_1289

2023-02-08 Thread via GitHub


MaxGekk commented on code in PR #39946:
URL: https://github.com/apache/spark/pull/39946#discussion_r1100686476


##
core/src/main/resources/error/error-classes.json:
##
@@ -774,6 +774,12 @@
   "The expected format is ByteString, but was  ()."
 ]
   },
+  "INVALID_COLUMN_NAME" : {

Review Comment:
   How about to make it more specific:
   ```suggestion
 "INVALID_COLUMN_NAME_AS_PATH" : {
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] srowen closed pull request #39878: [SPARK-42335][SQL] Pass the comment option through to univocity if users set it explicitly in CSV dataSource

2023-02-08 Thread via GitHub


srowen closed pull request #39878: [SPARK-42335][SQL] Pass the comment option 
through to univocity if users set it explicitly in CSV dataSource
URL: https://github.com/apache/spark/pull/39878


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] srowen commented on pull request #39878: [SPARK-42335][SQL] Pass the comment option through to univocity if users set it explicitly in CSV dataSource

2023-02-08 Thread via GitHub


srowen commented on PR #39878:
URL: https://github.com/apache/spark/pull/39878#issuecomment-1423245678

   Merged to master


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] MaxGekk commented on a diff in pull request #39946: [SPARK-42310][SQL] Assign name to _LEGACY_ERROR_TEMP_1289

2023-02-08 Thread via GitHub


MaxGekk commented on code in PR #39946:
URL: https://github.com/apache/spark/pull/39946#discussion_r1100680165


##
core/src/main/resources/error/error-classes.json:
##
@@ -774,6 +774,12 @@
   "The expected format is ByteString, but was  ()."
 ]
   },
+  "INVALID_COLUMN_NAME" : {
+"message" : [
+  "Column name  contains invalid character(s). Please use 
alias to rename it."

Review Comment:
   The message is slightly generic and doesn't give a clear picture. From the source code, the error is datasource specific, and is raised when the datasource doesn't support a file name that is the same as a column name. Can we rephrase it:
   ```suggestion
 "The datasource  cannot save the column  
because its name contains some characters that are not allowed in file paths. 
Please, use an alias to rename it."
   ```
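
   For context, a hypothetical spark-shell sketch of the situation this message describes and the alias workaround it recommends (column name, path and format are made up; which characters a given datasource rejects varies):

   ```scala
   // A column whose name cannot safely be embedded in a file path.
   val df = spark.range(3).select($"id".as("col name,with;odd=chars"))

   // Depending on the datasource, writing this directly may be rejected with the
   // "invalid character(s)" error discussed here:
   // df.write.parquet("/tmp/invalid-column-name-demo")

   // The recommended fix: rename the column before writing.
   df.withColumnRenamed("col name,with;odd=chars", "col_name")
     .write.mode("overwrite").parquet("/tmp/invalid-column-name-demo")
   ```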



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] peter-toth commented on pull request #39722: [SPARK-42162] Introduce MultiCommutativeOp expression as a memory optimization for canonicalizing large trees of commutative expressions

2023-02-08 Thread via GitHub


peter-toth commented on PR #39722:
URL: https://github.com/apache/spark/pull/39722#issuecomment-1423185628

   > @peter-toth Can you elaborate more on the statement "shouldn't 
canonicalize a single expression to MultiCommutativeOp"
   > 
   > From what I understand since the operator is a commutative expression, it 
should at least have two operands (?)
   
   I would keep the canonicalization of a single commutative expression like `Add(x, y)` as it is: `Add(x, y)`. But when we have at least 2 nested, e.g. `Add(Add(x, y), z)`, canonicalize it as `MultiCommutativeOp(x, y, z, Add)`.
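
   To make the threshold discussion concrete, a toy REPL sketch (a simplification, not Spark's canonicalization code): a lone `Add(x, y)` is left alone, while nested adds are flattened into a multi-operand node once enough operands accumulate.

   ```scala
   sealed trait Expr
   case class Ref(name: String) extends Expr
   case class Add(left: Expr, right: Expr) extends Expr
   case class MultiAdd(operands: Seq[Expr]) extends Expr

   def flatten(e: Expr): Seq[Expr] = e match {
     case Add(l, r) => flatten(l) ++ flatten(r)
     case other     => Seq(other)
   }

   def canonicalize(e: Expr, threshold: Int = 3): Expr = e match {
     case a: Add =>
       val ops = flatten(a)
       // Add(x, y) has 2 operands and stays as-is; Add(Add(x, y), z) has 3 and is rewritten.
       if (ops.length >= threshold) MultiAdd(ops) else a
     case other => other
   }

   assert(canonicalize(Add(Ref("x"), Ref("y"))) == Add(Ref("x"), Ref("y")))
   assert(canonicalize(Add(Add(Ref("x"), Ref("y")), Ref("z"))) == MultiAdd(Seq(Ref("x"), Ref("y"), Ref("z"))))
   ```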


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] MaxGekk closed pull request #39891: [SPARK-42318][SPARK-42319][SQL] Assign name to _LEGACY_ERROR_TEMP_(2123|2125)

2023-02-08 Thread via GitHub


MaxGekk closed pull request #39891: [SPARK-42318][SPARK-42319][SQL] Assign name 
to _LEGACY_ERROR_TEMP_(2123|2125)
URL: https://github.com/apache/spark/pull/39891


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] MaxGekk commented on pull request #39891: [SPARK-42318][SPARK-42319][SQL] Assign name to _LEGACY_ERROR_TEMP_(2123|2125)

2023-02-08 Thread via GitHub


MaxGekk commented on PR #39891:
URL: https://github.com/apache/spark/pull/39891#issuecomment-1423117120

   +1, LGTM. Merging to master/3.4.
   Thank you, @itholic and @srielau for review.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] db-scnakandala commented on pull request #39722: [SPARK-42162] Introduce MultiCommutativeOp expression as a memory optimization for canonicalizing large trees of commutative expressio

2023-02-08 Thread via GitHub


db-scnakandala commented on PR #39722:
URL: https://github.com/apache/spark/pull/39722#issuecomment-1423094159

   > > Maybe we can change the default value to 2. WDYT?
   > 
   > Now the description of the config says the `The minimum number of 
consecutive commutative expressions`, but seemingly the logic uses the number 
of non-commutative operands below those commutative expressions 
(`operands.length < SQLConf.get.getConf(MULTI_COMMUTATIVE_OP_OPT_THRESHOLD)`). 
So I would argue that the default value 2 sounds good, but the logic should 
follow that description and we should optimize only when there are at least 2 
nested expressions and shouldn't canonicalize a single expression to 
`MultiCommutativeOp`.
   
   @peter-toth Can you elaborate more on the statement "shouldn't canonicalize 
a single expression to `MultiCommutativeOp`"
   
   From what I understand since the operator is a commutative expression, it 
should at least have two operands (?)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] bersprockets commented on a diff in pull request #39945: [SPARK-42384][SQL] Check for null input in generated code for mask function

2023-02-08 Thread via GitHub


bersprockets commented on code in PR #39945:
URL: https://github.com/apache/spark/pull/39945#discussion_r1100528061


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/maskExpressions.scala:
##
@@ -223,16 +223,32 @@ case class Mask(
 val fifthGen = children(4).genCode(ctx)
 val resultCode =
   f(firstGen.value, secondGen.value, thirdGen.value, fourthGen.value, 
fifthGen.value)
-ev.copy(
-  code = code"""
+if (nullable) {
+  // this function is somewhat like a `UnaryExpression`, in that only the 
first child
+  // determines whether the result is null
+  val nullSafeEval = ctx.nullSafeExec(children(0).nullable, 
firstGen.isNull)(resultCode)
+  ev.copy(code = code"""
+${firstGen.code}
+${secondGen.code}
+${thirdGen.code}
+${fourthGen.code}
+${fifthGen.code}
+boolean ${ev.isNull} = ${firstGen.isNull};
+${CodeGenerator.javaType(dataType)} ${ev.value} = 
${CodeGenerator.defaultValue(dataType)};
+$nullSafeEval
+  """)
+} else {

Review Comment:
   Btw, the new generated code will look like this (if the first child is null, 
then the result is null):
   ```
   /* 031 */ boolean isNull_1 = i.isNullAt(0);
   /* 032 */ UTF8String value_1 = isNull_1 ?
   /* 033 */ null : (i.getUTF8String(0));
   /* 034 */
   /* 035 */
   /* 036 */
   /* 037 */
   /* 038 */ boolean isNull_0 = isNull_1;
   /* 039 */ UTF8String value_0 = null;
   /* 040 */
   /* 041 */ if (!isNull_1) {
   /* 042 */   value_0 = 
org.apache.spark.sql.catalyst.expressions.Mask.transformInput(value_1, 
((UTF8String) references[0] /* literal */), ((UTF8String) references[1] /* 
literal */), ((UTF8String) references[2] /* literal */), ((UTF8String) 
references[3] /* literal */));;
   /* 043 */ }
   /* 044 */ if (isNull_0) {
   /* 045 */   mutableStateArray_0[0].setNullAt(0);
   /* 046 */ } else {
   /* 047 */   mutableStateArray_0[0].write(0, value_0);
   /* 048 */ }
   /* 049 */ return (mutableStateArray_0[0].getRow());
   ```
   Versus the old generated code, which looks like this (it calls `Mask.transformInput` even when the input is null, and calls `UnsafeWriter.write(0, value_0)` even if `value_0` is null):
   ```
   /* 031 */ boolean isNull_1 = i.isNullAt(0);
   /* 032 */ UTF8String value_1 = isNull_1 ?
   /* 033 */ null : (i.getUTF8String(0));
   /* 034 */
   /* 035 */
   /* 036 */
   /* 037 */
   /* 038 */ UTF8String value_0 = null;
   /* 039 */ value_0 = 
org.apache.spark.sql.catalyst.expressions.Mask.transformInput(value_1, 
((UTF8String) references[0] /* literal */), ((UTF8String) references[1] /* 
literal */), ((UTF8String) references[2] /* literal */), ((UTF8String) 
references[3] /* literal */));;
   /* 040 */ if (false) {
   /* 041 */   mutableStateArray_0[0].setNullAt(0);
   /* 042 */ } else {
   /* 043 */   mutableStateArray_0[0].write(0, value_0);
   /* 044 */ }
   /* 045 */ return (mutableStateArray_0[0].getRow());
   /* 046 */   }
   ```
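
   As a user-visible sanity check of the null handling shown above, a spark-shell query (assuming a build where the `mask` SQL function is available, as in 3.4+):

   ```scala
   // With the fix, the null input short-circuits to NULL instead of reaching
   // Mask.transformInput; the non-null input is masked with the defaults
   // ('X' for upper case, 'x' for lower case, 'n' for digits).
   spark.sql("SELECT mask(CAST(NULL AS STRING)) AS masked_null, mask('AbCD123-@$#') AS masked").show()
   ```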



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] srowen commented on pull request #39893: [SPARK-42350][SQL][K8S][SS] Replace `get().getOrElse` with `getOrElse`

2023-02-08 Thread via GitHub


srowen commented on PR #39893:
URL: https://github.com/apache/spark/pull/39893#issuecomment-1423047291

   Can you look for more instances across the code base?
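
   For anyone skimming the thread, the shape of the cleanup is the standard `Map`/`Option` simplification; a minimal REPL illustration (key and default are made up):

   ```scala
   val conf = Map("spark.executor.memory" -> "4g")

   // Before: materializes an Option just to unwrap it.
   val before = conf.get("spark.executor.cores").getOrElse("1")

   // After: a single call with the same result.
   val after = conf.getOrElse("spark.executor.cores", "1")

   assert(before == after)
   ```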


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] srowen commented on pull request #39880: typo: StogeLevel -> StorageLevel

2023-02-08 Thread via GitHub


srowen commented on PR #39880:
URL: https://github.com/apache/spark/pull/39880#issuecomment-1423046235

   Just for completeness, can you try rerunning the tests? I'm sure it doesn't affect functionality.





[GitHub] [spark] itholic opened a new pull request, #39946: [SPARK-42310][SQL] Assign name to _LEGACY_ERROR_TEMP_1289

2023-02-08 Thread via GitHub


itholic opened a new pull request, #39946:
URL: https://github.com/apache/spark/pull/39946

   
   
   ### What changes were proposed in this pull request?
   
   This PR proposes to assign the name "INVALID_COLUMN_NAME" to _LEGACY_ERROR_TEMP_1289.
   
   
   ### Why are the changes needed?
   
   
   We should assign a proper name to _LEGACY_ERROR_TEMP_*.
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   No
   
   
   ### How was this patch tested?
   
   
   `./build/sbt "sql/testOnly org.apache.spark.sql.SQLQueryTestSuite*"`





[GitHub] [spark] itholic commented on a diff in pull request #39937: [SPARK-42309][SQL] Assign name to _LEGACY_ERROR_TEMP_1204

2023-02-08 Thread via GitHub


itholic commented on code in PR #39937:
URL: https://github.com/apache/spark/pull/39937#discussion_r1100501971


##
core/src/main/resources/error/error-classes.json:
##
@@ -116,6 +116,12 @@
   ""
 ]
   },
+  "CANNOT_WRITE_INCOMPATIBLE_DATA_TO_TABLE" : {
+"message" : [
+  "Cannot write incompatible data to table :",
+  "- ."

Review Comment:
   Sounds good!
   Will separate into sub-error classes.






[GitHub] [spark] MaxGekk commented on a diff in pull request #39937: [SPARK-42309][SQL] Assign name to _LEGACY_ERROR_TEMP_1204

2023-02-08 Thread via GitHub


MaxGekk commented on code in PR #39937:
URL: https://github.com/apache/spark/pull/39937#discussion_r1100495181


##
core/src/main/resources/error/error-classes.json:
##
@@ -116,6 +116,12 @@
   ""
 ]
   },
+  "CANNOT_WRITE_INCOMPATIBLE_DATA_TO_TABLE" : {
+"message" : [
+  "Cannot write incompatible data to table :",
+  "- ."

Review Comment:
   Regarding the general field ``: I would prefer to see separate error classes (sub-classes) for every `addError()` call:
   - 
https://github.com/apache/spark/blob/df36124636c600937644734f439670cbd35dbdf6/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TableOutputResolver.scala#L80
   - 
https://github.com/apache/spark/blob/df36124636c600937644734f439670cbd35dbdf6/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TableOutputResolver.scala#L83
   - 
https://github.com/apache/spark/blob/df36124636c600937644734f439670cbd35dbdf6/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TableOutputResolver.scala#L117
   - 
https://github.com/apache/spark/blob/df36124636c600937644734f439670cbd35dbdf6/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TableOutputResolver.scala#L135



##
core/src/main/resources/error/error-classes.json:
##
@@ -116,6 +116,12 @@
   ""
 ]
   },
+  "CANNOT_WRITE_INCOMPATIBLE_DATA_TO_TABLE" : {

Review Comment:
   Can you make it shorter like `INCOMPATIBLE_DATA ...` 






[GitHub] [spark] LuciferYang commented on pull request #39945: [SPARK-42384][SQL] Check for null input in generated code for mask function

2023-02-08 Thread via GitHub


LuciferYang commented on PR #39945:
URL: https://github.com/apache/spark/pull/39945#issuecomment-1423021598

   cc @gengliangwang  FYI





[GitHub] [spark] MaxGekk closed pull request #39890: [SPARK-42314][SQL] Assign name to _LEGACY_ERROR_TEMP_2127

2023-02-08 Thread via GitHub


MaxGekk closed pull request #39890: [SPARK-42314][SQL] Assign name to 
_LEGACY_ERROR_TEMP_2127
URL: https://github.com/apache/spark/pull/39890





[GitHub] [spark] MaxGekk commented on pull request #39890: [SPARK-42314][SQL] Assign name to _LEGACY_ERROR_TEMP_2127

2023-02-08 Thread via GitHub


MaxGekk commented on PR #39890:
URL: https://github.com/apache/spark/pull/39890#issuecomment-1423008240

   +1, LGTM. Merging to master/3.4.
   Thank you, @itholic and @srielau for review.





[GitHub] [spark] awdavidson commented on pull request #39943: [SPARK-40819][SQL][FOLLOWUP] Update SqlConf version for nanosAsLong configuration

2023-02-08 Thread via GitHub


awdavidson commented on PR #39943:
URL: https://github.com/apache/spark/pull/39943#issuecomment-1423004010

   > PR title should be `[SPARK-40819][SQL][FOLLOWUP]...`
   > 
   > And the pr title and description a little difficult to understand for me 
if I didn't look at the changes ...
   
   No problem - updated both the title and description. Hoping this clarifies the change.





[GitHub] [spark] LuciferYang commented on a diff in pull request #39866: [SPARK-42287][CONNECT][BUILD] Fix the client dependency jars

2023-02-08 Thread via GitHub


LuciferYang commented on code in PR #39866:
URL: https://github.com/apache/spark/pull/39866#discussion_r1100468326


##
connector/connect/client/jvm/pom.xml:
##
@@ -112,40 +124,57 @@
   false
   
 
+  com.google.android:*
+  com.google.api.grpc:*
+  com.google.code.findbugs:*
+  com.google.code.gson:*
+  com.google.errorprone:*
   com.google.guava:*
-  io.grpc:*
+  com.google.j2objc:*
   com.google.protobuf:*
+  io.grpc:*
+  io.netty:*
+  io.perfmark:*
+  org.codehaus.mojo:*
+  org.checkerframework:*
   
org.apache.spark:spark-connect-common_${scala.binary.version}
 
   
   
 
   io.grpc
-  
${spark.shade.packageName}.connect.client.grpc
+  
${spark.shade.packageName}.connect.client.io.grpc
   
 io.grpc.**
   
 
 
-  com.google.protobuf
-  
${spark.shade.packageName}.connect.protobuf
-  
-com.google.protobuf.**
-  
+  com.google
+  
${spark.shade.packageName}.connect.client.com.google
 
 
-  com.google.common
-  
${spark.shade.packageName}.connect.client.guava
-  
-com.google.common.**
-  
+  io.netty

Review Comment:
   Yes, you are right.
   
   To clarify, I mean we need to ensure that the `native-image.properties` files in `META-INF/native-image/io.netty/netty-xxx/` are rewritten correctly if we need to shade netty.






[GitHub] [spark] steveloughran commented on pull request #39124: [DON'T MERGE] Test build and test with hadoop 3.3.5-RC0

2023-02-08 Thread via GitHub


steveloughran commented on PR #39124:
URL: https://github.com/apache/spark/pull/39124#issuecomment-1422989141

   BTW, I've been testing #39185 on 3.3.5, switching to the new manifest committer added for ABFS/GCS commit performance; it works well. That change doesn't depend on this PR; it just chooses the new committer if it is found on the classpath.





[GitHub] [spark] LuciferYang commented on a diff in pull request #39866: [SPARK-42287][CONNECT][BUILD] Fix the client dependency jars

2023-02-08 Thread via GitHub


LuciferYang commented on code in PR #39866:
URL: https://github.com/apache/spark/pull/39866#discussion_r1100468326


##
connector/connect/client/jvm/pom.xml:
##
@@ -112,40 +124,57 @@
   false
   
 
+  com.google.android:*
+  com.google.api.grpc:*
+  com.google.code.findbugs:*
+  com.google.code.gson:*
+  com.google.errorprone:*
   com.google.guava:*
-  io.grpc:*
+  com.google.j2objc:*
   com.google.protobuf:*
+  io.grpc:*
+  io.netty:*
+  io.perfmark:*
+  org.codehaus.mojo:*
+  org.checkerframework:*
   
org.apache.spark:spark-connect-common_${scala.binary.version}
 
   
   
 
   io.grpc
-  
${spark.shade.packageName}.connect.client.grpc
+  
${spark.shade.packageName}.connect.client.io.grpc
   
 io.grpc.**
   
 
 
-  com.google.protobuf
-  
${spark.shade.packageName}.connect.protobuf
-  
-com.google.protobuf.**
-  
+  com.google
+  
${spark.shade.packageName}.connect.client.com.google
 
 
-  com.google.common
-  
${spark.shade.packageName}.connect.client.guava
-  
-com.google.common.**
-  
+  io.netty

Review Comment:
   Yes, you are right.
   
   To clarify, I mean we need to ensure that the `native-image.properties` files in `META-INF/native-image/io.netty/netty-codec-xxx/` are rewritten correctly if we need to shade netty.






[GitHub] [spark] zhenlineo commented on a diff in pull request #39866: [SPARK-42287][CONNECT][BUILD] Fix the client dependency jars

2023-02-08 Thread via GitHub


zhenlineo commented on code in PR #39866:
URL: https://github.com/apache/spark/pull/39866#discussion_r1100451807


##
connector/connect/client/jvm/pom.xml:
##
@@ -112,40 +124,57 @@
   false
   
 
+  com.google.android:*
+  com.google.api.grpc:*
+  com.google.code.findbugs:*
+  com.google.code.gson:*
+  com.google.errorprone:*
   com.google.guava:*
-  io.grpc:*
+  com.google.j2objc:*
   com.google.protobuf:*
+  io.grpc:*
+  io.netty:*
+  io.perfmark:*
+  org.codehaus.mojo:*
+  org.checkerframework:*
   
org.apache.spark:spark-connect-common_${scala.binary.version}
 
   
   
 
   io.grpc
-  
${spark.shade.packageName}.connect.client.grpc
+  
${spark.shade.packageName}.connect.client.io.grpc
   
 io.grpc.**
   
 
 
-  com.google.protobuf
-  
${spark.shade.packageName}.connect.protobuf
-  
-com.google.protobuf.**
-  
+  com.google
+  
${spark.shade.packageName}.connect.client.com.google
 
 
-  com.google.common
-  
${spark.shade.packageName}.connect.client.guava
-  
-com.google.common.**
-  
+  io.netty

Review Comment:
   Let me come back to you. gRPC publishes a variant with shaded Netty and one without; we used the one without Netty.






[GitHub] [spark] itholic commented on pull request #39941: [DOCS] Add link to Hadoop docs

2023-02-08 Thread via GitHub


itholic commented on PR #39941:
URL: https://github.com/apache/spark/pull/39941#issuecomment-1422967169

   Anyway, seems fine to me once the tests pass.





[GitHub] [spark] itholic commented on pull request #39941: [DOCS] Add link to Hadoop docs

2023-02-08 Thread via GitHub


itholic commented on PR #39941:
URL: https://github.com/apache/spark/pull/39941#issuecomment-1422965975

   Also, please check out https://github.com/apache/spark/pull/39941/checks?check_run_id=11190702189 to enable GitHub Actions in your forked repository.





[GitHub] [spark] itholic commented on pull request #39941: [DOCS] Add link to Hadoop docs

2023-02-08 Thread via GitHub


itholic commented on PR #39941:
URL: https://github.com/apache/spark/pull/39941#issuecomment-1422965066

   Can we create a corresponding JIRA ticket [here](https://issues.apache.org/jira/projects/SPARK/issues/SPARK-34827?filter=allopenissues) and add the JIRA number to the title?
   e.g. `[SPARK-12345][DOCS] Add link to Hadoop docs`
   Or, when the change is small enough, we can just add the `[MINOR]` tag without creating a JIRA ticket.
   e.g. `[MINOR][DOCS] Add link to Hadoop docs`





[GitHub] [spark] bersprockets opened a new pull request, #39945: [SPARK-42384][SQL] Check for null input in generated code for mask function

2023-02-08 Thread via GitHub


bersprockets opened a new pull request, #39945:
URL: https://github.com/apache/spark/pull/39945

   ### What changes were proposed in this pull request?
   
   When generating code for the mask function, call `ctx.nullSafeExec` to 
produce null safe code.
   
   This change assumes that the mask function returns null only when the input 
is null (which appears to be the case, from reading the code of 
`Mask.transformInput`).
   
   
   ### Why are the changes needed?
   
   The following query fails with a `NullPointerException`:
   ```
   create or replace temp view v1 as
   select * from values
   (null),
   ('AbCD123-@$#')
   as data(col1);
   
   cache table v1;
   
   select mask(col1) from v1;
   
   23/02/07 16:36:06 ERROR Executor: Exception in task 0.0 in stage 3.0 (TID 3)
   java.lang.NullPointerException
at 
org.apache.spark.sql.catalyst.expressions.codegen.UnsafeWriter.write(UnsafeWriter.java:110)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760)
   ```
   The generated code calls `UnsafeWriter.write(0, value_0)` regardless of 
whether `Mask.transformInput` returns null or not. The `UnsafeWriter.write` 
method for `UTF8String` does not expect a null pointer.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No.
   
   ### How was this patch tested?
   
   New unit tests.
   





[GitHub] [spark] itholic commented on pull request #39937: [SPARK-42309][SQL] Assign name to _LEGACY_ERROR_TEMP_1204

2023-02-08 Thread via GitHub


itholic commented on PR #39937:
URL: https://github.com/apache/spark/pull/39937#issuecomment-1422948702

   cc @srielau @MaxGekk @cloud-fan Could you review this error class when you 
find some time?





[GitHub] [spark] viirya commented on pull request #39936: [SPARK-42379][SS] Use FileSystem.exists in FileSystemBasedCheckpointFileManager.exists

2023-02-08 Thread via GitHub


viirya commented on PR #39936:
URL: https://github.com/apache/spark/pull/39936#issuecomment-1422929142

   Sounds reasonable to remove it.





[GitHub] [spark] ivoson commented on a diff in pull request #39459: [SPARK-41497][CORE] Fixing accumulator undercount in the case of the retry task with rdd cache

2023-02-08 Thread via GitHub


ivoson commented on code in PR #39459:
URL: https://github.com/apache/spark/pull/39459#discussion_r1100333545


##
core/src/main/scala/org/apache/spark/storage/BlockManagerMasterEndpoint.scala:
##
@@ -77,6 +77,11 @@ class BlockManagerMasterEndpoint(
   // Mapping from block id to the set of block managers that have the block.
   private val blockLocations = new JHashMap[BlockId, 
mutable.HashSet[BlockManagerId]]
 
+  // Mapping from task id to the set of rdd blocks which are generated from 
the task.
+  private val tidToRddBlockIds = new mutable.HashMap[Long, 
mutable.HashSet[RDDBlockId]]
+  // Record the visible RDD blocks which have been generated at least from one 
successful task.
+  private val visibleRDDBlocks = new mutable.HashSet[RDDBlockId]

Review Comment:
   Let me explain this further. If we track visible blocks, it's clear that we 
always know which blocks are visible.
   
   If we track invisible blocks instead, a block is only considered visible when at least one copy of it exists and it is not in the invisible list. So if the existing blocks are lost, we lose that information: the next time the cache is re-computed we have to go through the whole process again (first put the block into the invisible list, then promote it to visible by removing it from the list once a task finishes successfully), and only after that does the cache become visible again.
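   A rough sketch of the promotion flow under the visible-block approach (hypothetical helper names, not the actual code in this PR):
   ```scala
   import scala.collection.mutable
   import org.apache.spark.storage.RDDBlockId

   // Blocks produced by still-running tasks, keyed by task id.
   val tidToRddBlockIds = mutable.HashMap.empty[Long, mutable.HashSet[RDDBlockId]]
   // Blocks that at least one successful task has produced; once recorded here,
   // visibility survives even if the cached data itself is later lost.
   val visibleRDDBlocks = mutable.HashSet.empty[RDDBlockId]

   def onTaskFinished(taskId: Long, succeeded: Boolean): Unit = {
     val blocks = tidToRddBlockIds.remove(taskId).getOrElse(mutable.HashSet.empty[RDDBlockId])
     if (succeeded) visibleRDDBlocks ++= blocks  // promote to visible
   }

   def isVisible(blockId: RDDBlockId): Boolean = visibleRDDBlocks.contains(blockId)
   ```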





