[GitHub] [spark] huaxingao opened a new pull request, #38434: [SPARK-40946][SQL] Add a new DataSource V2 interface SupportsPushDownClusterKeys

2022-10-28 Thread GitBox


huaxingao opened a new pull request, #38434:
URL: https://github.com/apache/spark/pull/38434

   
   
   ### What changes were proposed in this pull request?
   ```
   /**
    * A mix-in interface for {@link ScanBuilder}. Data sources can implement this
    * interface to push down all the join or aggregate keys to the data source. A
    * return value of true indicates that the data source will return input
    * partitions (via planInputPartitions) clustered by the given keys. A false
    * return value indicates that the data source does not make such a guarantee,
    * even though it may still report a partitioning that may or may not be
    * compatible with the given clustering keys; in that case it is Spark's
    * responsibility to group the input partitions if needed.
    *
    * @since 3.4.0
    */
   @Evolving
   public interface SupportsPushDownClusterKeys extends ScanBuilder {
   ```
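
   For illustration, here is a minimal Scala sketch of how a data source might opt in. This is a hypothetical sketch only: the method name `pushClusterKeys` and the stand-in traits are assumptions, since the excerpt above shows only the interface declaration; the actual method set is defined in the PR.
   ```scala
   // Hypothetical stand-ins for the proposed API (all names are assumptions):
   trait NamedReference { def fieldNames: Array[String] }
   trait ScanBuilder
   trait SupportsPushDownClusterKeys extends ScanBuilder {
     // Returns true iff planInputPartitions will cluster rows by `keys`.
     def pushClusterKeys(keys: Array[NamedReference]): Boolean
   }

   // A source bucketed by some columns accepts the keys only when they match
   // its layout; otherwise it declines, and Spark stays responsible for
   // grouping the input partitions.
   class BucketedScanBuilder(bucketColumns: Seq[String]) extends SupportsPushDownClusterKeys {
     override def pushClusterKeys(keys: Array[NamedReference]): Boolean =
       keys.map(_.fieldNames.mkString(".")).toSeq == bucketColumns
   }
   ```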
   
   
   ### Why are the changes needed?
   Pass the join keys down to v2 data sources so that the data sources can decide how to combine the input splits according to the join keys.
   
   
   ### Does this PR introduce _any_ user-facing change?
   Yes, new interface `SupportsPushDownClusterKeys`
   
   
   ### How was this patch tested?
   New tests.
   





[GitHub] [spark] AmplabJenkins commented on pull request #38418: [SPARK-40944][SQL] Relax ordering constraint for CREATE TABLE column options

2022-10-28 Thread GitBox


AmplabJenkins commented on PR #38418:
URL: https://github.com/apache/spark/pull/38418#issuecomment-1295679598

   Can one of the admins verify this patch?





[GitHub] [spark] AmplabJenkins commented on pull request #38415: [SPARK-40938][CONNECT] Support Alias for every type of Relation

2022-10-28 Thread GitBox


AmplabJenkins commented on PR #38415:
URL: https://github.com/apache/spark/pull/38415#issuecomment-1295679630

   Can one of the admins verify this patch?





[GitHub] [spark] github-actions[bot] closed pull request #37219: [WIP][SPARK-39794][PYTHON] Introduce parametric singleton for DataType

2022-10-28 Thread GitBox


github-actions[bot] closed pull request #37219: [WIP][SPARK-39794][PYTHON] 
Introduce parametric singleton for DataType
URL: https://github.com/apache/spark/pull/37219





[GitHub] [spark] Kimahriman commented on pull request #38229: [SPARK-40775][SQL] Fix duplicate description entries for V2 file scans

2022-10-28 Thread GitBox


Kimahriman commented on PR #38229:
URL: https://github.com/apache/spark/pull/38229#issuecomment-1295630445

   cc @dongjoon-hyun @cloud-fan @maropu, you guys were on the original PR 
review for the metadata addition





[GitHub] [spark] warrenzhu25 commented on pull request #38233: [SPARK-40781][CORE] Explain exit code 137 as killed due to OOM

2022-10-28 Thread GitBox


warrenzhu25 commented on PR #38233:
URL: https://github.com/apache/spark/pull/38233#issuecomment-1295617623

   > Can you point me to the reference where yarn mode is using `137` for OOM? IIRC we simply kill the executor process in case there is an OOM - which usually results in `143` as the return code ...
   > 
   > ```
   > public class Test {
   > 
   >   public static void main(String[] args) {
   >     long[] arr = new long[1024 * 1024 * 999];
   >   }
   > }
   > 
   > $ java -XX:OnOutOfMemoryError='kill %p' -Xmx12m -cp . Test ; echo Error code: $?
   > ```
   > 
   > This is coming from `128 + SIGTERM` - and sigterm is 15 in linux. See here for more [1]
   > 
   > [1] https://tldp.org/LDP/abs/html/exitcodes.html
   
   My main goal is to make exit codes more readable in standalone mode, but this code path is shared by both standalone and YARN. I have seen YARN exit code 137 reported at 
https://aws.amazon.com/premiumsupport/knowledge-center/container-killed-on-request-137-emr/#:~:text=When%20a%20container%20(Spark%20executor,in%20narrow%20and%20wide%20transformations.
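
   For reference, the arithmetic behind both codes follows the general POSIX convention (not something this PR defines): a process killed by a signal exits with 128 plus the signal number. A small sketch:
   ```scala
   // Exit code = 128 + signal number for signal-terminated processes.
   object ExitCodes extends App {
     val SIGKILL = 9   // what the kernel OOM killer sends
     val SIGTERM = 15  // what a plain `kill` sends
     assert(128 + SIGKILL == 137) // the code this PR explains as killed due to OOM
     assert(128 + SIGTERM == 143) // the code seen when the executor is simply terminated
   }
   ```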





[GitHub] [spark] ben-zhang opened a new pull request, #38433: [SPARK-40943][SQL] Make the MSCK keyword optional in MSCK REPAIR TABLE commands

2022-10-28 Thread GitBox


ben-zhang opened a new pull request, #38433:
URL: https://github.com/apache/spark/pull/38433

   
   
   ### What changes were proposed in this pull request?
   Make the `MSCK` keyword optional in `MSCK REPAIR TABLE` commands, so that `REPAIR TABLE` alone is accepted.
   
   
   ### Why are the changes needed?
   The use of the keyword `MSCK`, meaning metastore check, is arcane and does not add value to the command. Removing it makes the meaning of `REPAIR TABLE` commands clearer.
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes. Previously it was not possible to specify `REPAIR TABLE` without `MSCK`; now it is. This PR makes no changes to the existing behaviour of the original `MSCK REPAIR TABLE` syntax. A sketch of both forms follows.
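
   A minimal sketch of both forms (run in a `spark-shell`, where `spark` is the session; the table name `events` is illustrative and assumes a partitioned table already exists):
   ```scala
   // Both statements are equivalent after this change.
   spark.sql("MSCK REPAIR TABLE events")  // original syntax, unchanged
   spark.sql("REPAIR TABLE events")       // newly allowed: MSCK is now optional
   ```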
   
   
   ### How was this patch tested?
   Unit tests.





[GitHub] [spark] amaliujia commented on a diff in pull request #38393: [SPARK-40915][CONNECT] Improve `on` in Join in Python client

2022-10-28 Thread GitBox


amaliujia commented on code in PR #38393:
URL: https://github.com/apache/spark/pull/38393#discussion_r1008513991


##
python/pyspark/sql/connect/plan.py:
##
@@ -537,7 +537,7 @@ def __init__(
 self,
 left: Optional["LogicalPlan"],
 right: "LogicalPlan",
-on: "ColumnOrString",
+on: Optional[Union[str, List[str], ColumnRef]],

Review Comment:
   Hmm, I gave it a try but failed to make it work for both mypy and the unit tests (not a Python expert).
   
   The tricky case is that `str` seems to also be considered a `Sequence` or `Iterable` (a string is a sequence of one-character strings), so it matches the overloads meant for lists of column names.
   
   






[GitHub] [spark] thejdeep commented on a diff in pull request #36165: [SPARK-36620][SHUFFLE] Add Push Based Shuffle client side metrics

2022-10-28 Thread GitBox


thejdeep commented on code in PR #36165:
URL: https://github.com/apache/spark/pull/36165#discussion_r1008470301


##
core/src/main/scala/org/apache/spark/status/storeTypes.scala:
##
@@ -233,6 +243,38 @@ private[spark] class TaskDataWrapper(
 val shuffleLocalBytesRead: Long,
 @KVIndexParam(value = TaskIndexNames.SHUFFLE_READ_RECORDS, parent = 
TaskIndexNames.STAGE)
 val shuffleRecordsRead: Long,
+@KVIndexParam(
+  value = TaskIndexNames.PUSH_BASED_SHUFFLE_CORRUPT_MERGED_BLOCK_CHUNKS,
+  parent = TaskIndexNames.STAGE)
+val shuffleCorruptMergedBlockChunks: Long,
+@KVIndexParam(value = TaskIndexNames.PUSH_BASED_SHUFFLE_FALLBACK_COUNT,
+  parent = TaskIndexNames.STAGE)
+val shuffleFallbackCount: Long,
+@KVIndexParam(
+  value = TaskIndexNames.PUSH_BASED_SHUFFLE_MERGED_REMOTE_BLOCKS, parent = 
TaskIndexNames.STAGE)
+val shuffleMergedRemoteBlocksFetched: Long,
+@KVIndexParam(
+  value = TaskIndexNames.PUSH_BASED_SHUFFLE_MERGED_LOCAL_BLOCKS, parent = 
TaskIndexNames.STAGE)
+val shuffleMergedLocalBlocksFetched: Long,
+@KVIndexParam(
+  value = TaskIndexNames.PUSH_BASED_SHUFFLE_MERGED_REMOTE_CHUNKS, parent = 
TaskIndexNames.STAGE)
+val shuffleMergedRemoteChunksFetched: Long,
+@KVIndexParam(
+  value = TaskIndexNames.PUSH_BASED_SHUFFLE_MERGED_LOCAL_CHUNKS, parent = 
TaskIndexNames.STAGE)
+val shuffleMergedLocalChunksFetched: Long,
+@KVIndexParam(
+  value = TaskIndexNames.PUSH_BASED_SHUFFLE_MERGED_REMOTE_READS, parent = 
TaskIndexNames.STAGE)
+val shuffleMergedRemoteBlocksBytesRead: Long,
+@KVIndexParam(
+  value = TaskIndexNames.PUSH_BASED_SHUFFLE_MERGED_LOCAL_READS, parent = 
TaskIndexNames.STAGE)
+val shuffleMergedLocalBlocksBytesRead: Long,
+@KVIndexParam(
+  value = TaskIndexNames.PUSH_BASED_SHUFFLE_REMOTE_REQS_DURATION, parent = 
TaskIndexNames.STAGE)
+val shuffleRemoteReqsDuration: Long,
+@KVIndexParam(
+  value = TaskIndexNames.PUSH_BASED_SHUFFLE_MERGED_REMOTE_REQS_DURATION,
+  parent = TaskIndexNames.STAGE)

Review Comment:
   Indexes for these fields is a requirement when we are computing quantiles 
and reading `TaskDataWrapper`s.






[GitHub] [spark] otterc commented on a diff in pull request #36165: [SPARK-36620][SHUFFLE] Add Push Based Shuffle client side metrics

2022-10-28 Thread GitBox


otterc commented on code in PR #36165:
URL: https://github.com/apache/spark/pull/36165#discussion_r1008467914


##
core/src/main/scala/org/apache/spark/storage/PushBasedFetchHelper.scala:
##
@@ -290,6 +301,7 @@ private class PushBasedFetchHelper(
   address: BlockManagerId): Unit = {
 assert(blockId.isInstanceOf[ShuffleMergedBlockId] || 
blockId.isInstanceOf[ShuffleBlockChunkId])
 logWarning(s"Falling back to fetch the original blocks for push-merged 
block $blockId")
+shuffleMetrics.incFallbackCount(1)

Review Comment:
   Fallback count measures the number of times fallback is initiated for a block, which is either a `shuffleMergedBlock` or a `shuffleBlockChunk`. I don't understand why we need to move it into the case you suggested.






[GitHub] [spark] SandishKumarHN commented on a diff in pull request #38344: [SPARK-40777][SQL][PROTOBUF] Protobuf import support and move error-classes.

2022-10-28 Thread GitBox


SandishKumarHN commented on code in PR #38344:
URL: https://github.com/apache/spark/pull/38344#discussion_r1008467821


##
core/src/main/resources/error/error-classes.json:
##
@@ -792,6 +887,21 @@
   "Unable to acquire  bytes of memory, got "
 ]
   },
+  "UNABLE_TO_CONVERT_TO_PROTOBUF_TYPE" : {
+"message" : [
+  "Unable to convert SQL type  to Protobuf type ."
+]
+  },
+  "UNABLE_TO_LOCATE_PROTOBUF_MESSAGE_ERROR" : {

Review Comment:
   @srielau fixed all the error class names






[GitHub] [spark] SandishKumarHN commented on a diff in pull request #38344: [SPARK-40777][SQL][PROTOBUF] Protobuf import support and move error-classes.

2022-10-28 Thread GitBox


SandishKumarHN commented on code in PR #38344:
URL: https://github.com/apache/spark/pull/38344#discussion_r1008467248


##
core/src/main/resources/error/error-classes.json:
##
@@ -706,6 +736,66 @@
 ],
 "sqlState" : "42000"
   },
+  "PROTOBUF_CLASS_LOAD_ERROR" : {
+"message" : [
+  "Could not load Protobuf class with name "
+]
+  },
+  "PROTOBUF_DEPENDENCY_ERROR" : {
+"message" : [
+  "Could not find dependency: "
+]
+  },
+  "PROTOBUF_DESCRIPTOR_ERROR" : {

Review Comment:
   @srielau Along with the message, I added the pathname of the descriptor file, and did the same for the one below. Aside from the error message, we don't have much more information at this time.






[GitHub] [spark] SandishKumarHN commented on a diff in pull request #38344: [SPARK-40777][SQL][PROTOBUF] Protobuf import support and move error-classes.

2022-10-28 Thread GitBox


SandishKumarHN commented on code in PR #38344:
URL: https://github.com/apache/spark/pull/38344#discussion_r1008463500


##
core/src/main/resources/error/error-classes.json:
##
@@ -65,6 +70,11 @@
 ],
 "sqlState" : "22005"
   },
+  "CATALYST_TYPE_TO_PROTOBUF_ENUM_TYPE_ERROR" : {
+"message" : [
+  "Cannot convert SQL  to Protobuf  because 
 cannot be written since it's not defined in ENUM "

Review Comment:
   @srielau There is no ENUM type in Spark, so we convert the Protobuf enum to a Spark string; that is why we throw an enum-specific error here.






[GitHub] [spark] otterc commented on a diff in pull request #36165: [SPARK-36620][SHUFFLE] Add Push Based Shuffle client side metrics

2022-10-28 Thread GitBox


otterc commented on code in PR #36165:
URL: https://github.com/apache/spark/pull/36165#discussion_r1008462544


##
core/src/test/scala/org/apache/spark/ui/StagePageSuite.scala:
##
@@ -78,6 +78,16 @@ class StagePageSuite extends SparkFunSuite with 
LocalSparkContext {
 shuffleLocalBytesRead = 1L,
 shuffleReadBytes = 1L,
 shuffleReadRecords = 1L,
+shuffleCorruptMergedBlockChunks = 2L,
+shuffleFallbackCount = 2L,
+shuffleMergedRemoteBlocksFetched = 1L,
+shuffleMergedLocalBlocksFetched = 1L,
+shuffleMergedRemoteChunksFetched = 1L,
+shuffleMergedLocalChunksFetched = 1L,
+shuffleMergedRemoteBytesRead = 1L,
+shuffleMergedLocalBytesRead = 1L,
+shuffleRemoteReqsDuration = 1L,
+shuffleMergedRemoteReqsDuration = 1L,

Review Comment:
   We didn't put much thought into it. But `shuffleMergerCount` can be 0 because this stage may not push, yet it can still read shuffle data generated by the previous stage.






[GitHub] [spark] allisonwang-db commented on pull request #38429: [SPARK-40800][SQL][FOLLOW-UP] Add a config to control whether to always inline one-row relation subquery

2022-10-28 Thread GitBox


allisonwang-db commented on PR #38429:
URL: https://github.com/apache/spark/pull/38429#issuecomment-1295441951

   cc @cloud-fan 





[GitHub] [spark] SandishKumarHN commented on a diff in pull request #38344: [SPARK-40777][SQL][PROTOBUF] Protobuf import support and move error-classes.

2022-10-28 Thread GitBox


SandishKumarHN commented on code in PR #38344:
URL: https://github.com/apache/spark/pull/38344#discussion_r1008417281


##
core/src/main/resources/error/error-classes.json:
##
@@ -706,6 +736,66 @@
 ],
 "sqlState" : "42000"
   },
+  "PROTOBUF_CLASS_LOAD_ERROR" : {
+"message" : [
+  "Could not load Protobuf class with name "
+]
+  },
+  "PROTOBUF_DEPENDENCY_ERROR" : {
+"message" : [
+  "Could not find dependency: "
+]
+  },
+  "PROTOBUF_DESCRIPTOR_ERROR" : {
+"message" : [
+  "Error parsing descriptor byte[] into Descriptor object"
+]
+  },
+  "PROTOBUF_DESCRIPTOR_PARSING_ERROR" : {
+"message" : [
+  "Error constructing FileDescriptor"
+]
+  },
+  "PROTOBUF_FIELD_MISSING_ERROR" : {
+"message" : [
+  "Searching for  in Protobuf schema at  gave 
 matches. Candidates: "
+]
+  },
+  "PROTOBUF_FIELD_MISSING_IN_CATALYST_SCHEMA" : {
+"message" : [
+  "Found  in Protobuf schema but there is no match in the SQL 
schema"
+]
+  },
+  "PROTOBUF_FIELD_TYPE_MISMATCH" : {
+"message" : [
+  "Type mismatch encountered for field: "
+]
+  },
+  "PROTOBUF_MESSAGE_TYPE_ERROR" : {
+"message" : [
+  " is not a Protobuf message type"
+]
+  },
+  "PROTOBUF_RECURSION_ERROR" : {
+"message" : [
+  "Found recursive reference in Protobuf schema, which can not be 
processed by Spark: "
+]
+  },
+  "PROTOBUF_TYPE_NOT_SUPPORT_ERROR" : {
+"message" : [
+  "Protobuf type not yet supported: ."
+]
+  },
+  "PROTOBUF_TYPE_TO_CATALYST_TYPE_ERROR" : {
+"message" : [
+  "Unable to convert  of Protobuf to SQL type ."
+]
+  },
+  "PROTOBUF_TYPE_TO_SQL_TYPE_ERROR" : {

Review Comment:
   This reports which Protobuf field failed to convert; it clearly describes the field name and field type.






[GitHub] [spark] SandishKumarHN commented on a diff in pull request #38344: [SPARK-40777][SQL][PROTOBUF] Protobuf import support and move error-classes.

2022-10-28 Thread GitBox


SandishKumarHN commented on code in PR #38344:
URL: https://github.com/apache/spark/pull/38344#discussion_r1008417312


##
core/src/main/resources/error/error-classes.json:
##
@@ -706,6 +736,66 @@
 ],
 "sqlState" : "42000"
   },
+  "PROTOBUF_CLASS_LOAD_ERROR" : {
+"message" : [
+  "Could not load Protobuf class with name "
+]
+  },
+  "PROTOBUF_DEPENDENCY_ERROR" : {
+"message" : [
+  "Could not find dependency: "
+]
+  },
+  "PROTOBUF_DESCRIPTOR_ERROR" : {
+"message" : [
+  "Error parsing descriptor byte[] into Descriptor object"
+]
+  },
+  "PROTOBUF_DESCRIPTOR_PARSING_ERROR" : {
+"message" : [
+  "Error constructing FileDescriptor"
+]
+  },
+  "PROTOBUF_FIELD_MISSING_ERROR" : {
+"message" : [
+  "Searching for  in Protobuf schema at  gave 
 matches. Candidates: "
+]
+  },
+  "PROTOBUF_FIELD_MISSING_IN_CATALYST_SCHEMA" : {
+"message" : [
+  "Found  in Protobuf schema but there is no match in the SQL 
schema"
+]
+  },
+  "PROTOBUF_FIELD_TYPE_MISMATCH" : {
+"message" : [
+  "Type mismatch encountered for field: "
+]
+  },
+  "PROTOBUF_MESSAGE_TYPE_ERROR" : {
+"message" : [
+  " is not a Protobuf message type"
+]
+  },
+  "PROTOBUF_RECURSION_ERROR" : {
+"message" : [
+  "Found recursive reference in Protobuf schema, which can not be 
processed by Spark: "
+]
+  },
+  "PROTOBUF_TYPE_NOT_SUPPORT_ERROR" : {
+"message" : [
+  "Protobuf type not yet supported: ."
+]
+  },
+  "PROTOBUF_TYPE_TO_CATALYST_TYPE_ERROR" : {

Review Comment:
   This reports which Protobuf message failed to convert. It would be the Protobuf root name: since a message might have a long list of fields, we keep just the root name.






[GitHub] [spark] SandishKumarHN commented on a diff in pull request #38344: [SPARK-40777][SQL][PROTOBUF] Protobuf import support and move error-classes.

2022-10-28 Thread GitBox


SandishKumarHN commented on code in PR #38344:
URL: https://github.com/apache/spark/pull/38344#discussion_r1008417281


##
core/src/main/resources/error/error-classes.json:
##
@@ -706,6 +736,66 @@
 ],
 "sqlState" : "42000"
   },
+  "PROTOBUF_CLASS_LOAD_ERROR" : {
+"message" : [
+  "Could not load Protobuf class with name "
+]
+  },
+  "PROTOBUF_DEPENDENCY_ERROR" : {
+"message" : [
+  "Could not find dependency: "
+]
+  },
+  "PROTOBUF_DESCRIPTOR_ERROR" : {
+"message" : [
+  "Error parsing descriptor byte[] into Descriptor object"
+]
+  },
+  "PROTOBUF_DESCRIPTOR_PARSING_ERROR" : {
+"message" : [
+  "Error constructing FileDescriptor"
+]
+  },
+  "PROTOBUF_FIELD_MISSING_ERROR" : {
+"message" : [
+  "Searching for  in Protobuf schema at  gave 
 matches. Candidates: "
+]
+  },
+  "PROTOBUF_FIELD_MISSING_IN_CATALYST_SCHEMA" : {
+"message" : [
+  "Found  in Protobuf schema but there is no match in the SQL 
schema"
+]
+  },
+  "PROTOBUF_FIELD_TYPE_MISMATCH" : {
+"message" : [
+  "Type mismatch encountered for field: "
+]
+  },
+  "PROTOBUF_MESSAGE_TYPE_ERROR" : {
+"message" : [
+  " is not a Protobuf message type"
+]
+  },
+  "PROTOBUF_RECURSION_ERROR" : {
+"message" : [
+  "Found recursive reference in Protobuf schema, which can not be 
processed by Spark: "
+]
+  },
+  "PROTOBUF_TYPE_NOT_SUPPORT_ERROR" : {
+"message" : [
+  "Protobuf type not yet supported: ."
+]
+  },
+  "PROTOBUF_TYPE_TO_CATALYST_TYPE_ERROR" : {
+"message" : [
+  "Unable to convert  of Protobuf to SQL type ."
+]
+  },
+  "PROTOBUF_TYPE_TO_SQL_TYPE_ERROR" : {

Review Comment:
   This reports which Protobuf field failed to convert.



##
core/src/main/resources/error/error-classes.json:
##
@@ -706,6 +736,66 @@
 ],
 "sqlState" : "42000"
   },
+  "PROTOBUF_CLASS_LOAD_ERROR" : {
+"message" : [
+  "Could not load Protobuf class with name "
+]
+  },
+  "PROTOBUF_DEPENDENCY_ERROR" : {
+"message" : [
+  "Could not find dependency: "
+]
+  },
+  "PROTOBUF_DESCRIPTOR_ERROR" : {
+"message" : [
+  "Error parsing descriptor byte[] into Descriptor object"
+]
+  },
+  "PROTOBUF_DESCRIPTOR_PARSING_ERROR" : {
+"message" : [
+  "Error constructing FileDescriptor"
+]
+  },
+  "PROTOBUF_FIELD_MISSING_ERROR" : {
+"message" : [
+  "Searching for  in Protobuf schema at  gave 
 matches. Candidates: "
+]
+  },
+  "PROTOBUF_FIELD_MISSING_IN_CATALYST_SCHEMA" : {
+"message" : [
+  "Found  in Protobuf schema but there is no match in the SQL 
schema"
+]
+  },
+  "PROTOBUF_FIELD_TYPE_MISMATCH" : {
+"message" : [
+  "Type mismatch encountered for field: "
+]
+  },
+  "PROTOBUF_MESSAGE_TYPE_ERROR" : {
+"message" : [
+  " is not a Protobuf message type"
+]
+  },
+  "PROTOBUF_RECURSION_ERROR" : {
+"message" : [
+  "Found recursive reference in Protobuf schema, which can not be 
processed by Spark: "
+]
+  },
+  "PROTOBUF_TYPE_NOT_SUPPORT_ERROR" : {
+"message" : [
+  "Protobuf type not yet supported: ."
+]
+  },
+  "PROTOBUF_TYPE_TO_CATALYST_TYPE_ERROR" : {

Review Comment:
   This reports which Protobuf message failed to convert.






[GitHub] [spark] SandishKumarHN commented on a diff in pull request #38344: [SPARK-40777][SQL][PROTOBUF] Protobuf import support and move error-classes.

2022-10-28 Thread GitBox


SandishKumarHN commented on code in PR #38344:
URL: https://github.com/apache/spark/pull/38344#discussion_r1008417312


##
core/src/main/resources/error/error-classes.json:
##
@@ -706,6 +736,66 @@
 ],
 "sqlState" : "42000"
   },
+  "PROTOBUF_CLASS_LOAD_ERROR" : {
+"message" : [
+  "Could not load Protobuf class with name "
+]
+  },
+  "PROTOBUF_DEPENDENCY_ERROR" : {
+"message" : [
+  "Could not find dependency: "
+]
+  },
+  "PROTOBUF_DESCRIPTOR_ERROR" : {
+"message" : [
+  "Error parsing descriptor byte[] into Descriptor object"
+]
+  },
+  "PROTOBUF_DESCRIPTOR_PARSING_ERROR" : {
+"message" : [
+  "Error constructing FileDescriptor"
+]
+  },
+  "PROTOBUF_FIELD_MISSING_ERROR" : {
+"message" : [
+  "Searching for  in Protobuf schema at  gave 
 matches. Candidates: "
+]
+  },
+  "PROTOBUF_FIELD_MISSING_IN_CATALYST_SCHEMA" : {
+"message" : [
+  "Found  in Protobuf schema but there is no match in the SQL 
schema"
+]
+  },
+  "PROTOBUF_FIELD_TYPE_MISMATCH" : {
+"message" : [
+  "Type mismatch encountered for field: "
+]
+  },
+  "PROTOBUF_MESSAGE_TYPE_ERROR" : {
+"message" : [
+  " is not a Protobuf message type"
+]
+  },
+  "PROTOBUF_RECURSION_ERROR" : {
+"message" : [
+  "Found recursive reference in Protobuf schema, which can not be 
processed by Spark: "
+]
+  },
+  "PROTOBUF_TYPE_NOT_SUPPORT_ERROR" : {
+"message" : [
+  "Protobuf type not yet supported: ."
+]
+  },
+  "PROTOBUF_TYPE_TO_CATALYST_TYPE_ERROR" : {

Review Comment:
   This reports which message failed to convert.






[GitHub] [spark] SandishKumarHN commented on a diff in pull request #38344: [SPARK-40777][SQL][PROTOBUF] Protobuf import support and move error-classes.

2022-10-28 Thread GitBox


SandishKumarHN commented on code in PR #38344:
URL: https://github.com/apache/spark/pull/38344#discussion_r1008417312


##
core/src/main/resources/error/error-classes.json:
##
@@ -706,6 +736,66 @@
 ],
 "sqlState" : "42000"
   },
+  "PROTOBUF_CLASS_LOAD_ERROR" : {
+"message" : [
+  "Could not load Protobuf class with name "
+]
+  },
+  "PROTOBUF_DEPENDENCY_ERROR" : {
+"message" : [
+  "Could not find dependency: "
+]
+  },
+  "PROTOBUF_DESCRIPTOR_ERROR" : {
+"message" : [
+  "Error parsing descriptor byte[] into Descriptor object"
+]
+  },
+  "PROTOBUF_DESCRIPTOR_PARSING_ERROR" : {
+"message" : [
+  "Error constructing FileDescriptor"
+]
+  },
+  "PROTOBUF_FIELD_MISSING_ERROR" : {
+"message" : [
+  "Searching for  in Protobuf schema at  gave 
 matches. Candidates: "
+]
+  },
+  "PROTOBUF_FIELD_MISSING_IN_CATALYST_SCHEMA" : {
+"message" : [
+  "Found  in Protobuf schema but there is no match in the SQL 
schema"
+]
+  },
+  "PROTOBUF_FIELD_TYPE_MISMATCH" : {
+"message" : [
+  "Type mismatch encountered for field: "
+]
+  },
+  "PROTOBUF_MESSAGE_TYPE_ERROR" : {
+"message" : [
+  " is not a Protobuf message type"
+]
+  },
+  "PROTOBUF_RECURSION_ERROR" : {
+"message" : [
+  "Found recursive reference in Protobuf schema, which can not be 
processed by Spark: "
+]
+  },
+  "PROTOBUF_TYPE_NOT_SUPPORT_ERROR" : {
+"message" : [
+  "Protobuf type not yet supported: ."
+]
+  },
+  "PROTOBUF_TYPE_TO_CATALYST_TYPE_ERROR" : {

Review Comment:
   This is thrown at the Protobuf message level.



##
core/src/main/resources/error/error-classes.json:
##
@@ -706,6 +736,66 @@
 ],
 "sqlState" : "42000"
   },
+  "PROTOBUF_CLASS_LOAD_ERROR" : {
+"message" : [
+  "Could not load Protobuf class with name "
+]
+  },
+  "PROTOBUF_DEPENDENCY_ERROR" : {
+"message" : [
+  "Could not find dependency: "
+]
+  },
+  "PROTOBUF_DESCRIPTOR_ERROR" : {
+"message" : [
+  "Error parsing descriptor byte[] into Descriptor object"
+]
+  },
+  "PROTOBUF_DESCRIPTOR_PARSING_ERROR" : {
+"message" : [
+  "Error constructing FileDescriptor"
+]
+  },
+  "PROTOBUF_FIELD_MISSING_ERROR" : {
+"message" : [
+  "Searching for  in Protobuf schema at  gave 
 matches. Candidates: "
+]
+  },
+  "PROTOBUF_FIELD_MISSING_IN_CATALYST_SCHEMA" : {
+"message" : [
+  "Found  in Protobuf schema but there is no match in the SQL 
schema"
+]
+  },
+  "PROTOBUF_FIELD_TYPE_MISMATCH" : {
+"message" : [
+  "Type mismatch encountered for field: "
+]
+  },
+  "PROTOBUF_MESSAGE_TYPE_ERROR" : {
+"message" : [
+  " is not a Protobuf message type"
+]
+  },
+  "PROTOBUF_RECURSION_ERROR" : {
+"message" : [
+  "Found recursive reference in Protobuf schema, which can not be 
processed by Spark: "
+]
+  },
+  "PROTOBUF_TYPE_NOT_SUPPORT_ERROR" : {
+"message" : [
+  "Protobuf type not yet supported: ."
+]
+  },
+  "PROTOBUF_TYPE_TO_CATALYST_TYPE_ERROR" : {
+"message" : [
+  "Unable to convert  of Protobuf to SQL type ."
+]
+  },
+  "PROTOBUF_TYPE_TO_SQL_TYPE_ERROR" : {

Review Comment:
   This is thrown at the Protobuf field level.






[GitHub] [spark] SandishKumarHN commented on a diff in pull request #38344: [SPARK-40777][SQL][PROTOBUF] Protobuf import support and move error-classes.

2022-10-28 Thread GitBox


SandishKumarHN commented on code in PR #38344:
URL: https://github.com/apache/spark/pull/38344#discussion_r1008417281


##
core/src/main/resources/error/error-classes.json:
##
@@ -706,6 +736,66 @@
 ],
 "sqlState" : "42000"
   },
+  "PROTOBUF_CLASS_LOAD_ERROR" : {
+"message" : [
+  "Could not load Protobuf class with name "
+]
+  },
+  "PROTOBUF_DEPENDENCY_ERROR" : {
+"message" : [
+  "Could not find dependency: "
+]
+  },
+  "PROTOBUF_DESCRIPTOR_ERROR" : {
+"message" : [
+  "Error parsing descriptor byte[] into Descriptor object"
+]
+  },
+  "PROTOBUF_DESCRIPTOR_PARSING_ERROR" : {
+"message" : [
+  "Error constructing FileDescriptor"
+]
+  },
+  "PROTOBUF_FIELD_MISSING_ERROR" : {
+"message" : [
+  "Searching for  in Protobuf schema at  gave 
 matches. Candidates: "
+]
+  },
+  "PROTOBUF_FIELD_MISSING_IN_CATALYST_SCHEMA" : {
+"message" : [
+  "Found  in Protobuf schema but there is no match in the SQL 
schema"
+]
+  },
+  "PROTOBUF_FIELD_TYPE_MISMATCH" : {
+"message" : [
+  "Type mismatch encountered for field: "
+]
+  },
+  "PROTOBUF_MESSAGE_TYPE_ERROR" : {
+"message" : [
+  " is not a Protobuf message type"
+]
+  },
+  "PROTOBUF_RECURSION_ERROR" : {
+"message" : [
+  "Found recursive reference in Protobuf schema, which can not be 
processed by Spark: "
+]
+  },
+  "PROTOBUF_TYPE_NOT_SUPPORT_ERROR" : {
+"message" : [
+  "Protobuf type not yet supported: ."
+]
+  },
+  "PROTOBUF_TYPE_TO_CATALYST_TYPE_ERROR" : {
+"message" : [
+  "Unable to convert  of Protobuf to SQL type ."
+]
+  },
+  "PROTOBUF_TYPE_TO_SQL_TYPE_ERROR" : {

Review Comment:
   This is thrown when a Protobuf field fails to convert to a SQL type.



##
core/src/main/resources/error/error-classes.json:
##
@@ -706,6 +736,66 @@
 ],
 "sqlState" : "42000"
   },
+  "PROTOBUF_CLASS_LOAD_ERROR" : {
+"message" : [
+  "Could not load Protobuf class with name "
+]
+  },
+  "PROTOBUF_DEPENDENCY_ERROR" : {
+"message" : [
+  "Could not find dependency: "
+]
+  },
+  "PROTOBUF_DESCRIPTOR_ERROR" : {
+"message" : [
+  "Error parsing descriptor byte[] into Descriptor object"
+]
+  },
+  "PROTOBUF_DESCRIPTOR_PARSING_ERROR" : {
+"message" : [
+  "Error constructing FileDescriptor"
+]
+  },
+  "PROTOBUF_FIELD_MISSING_ERROR" : {
+"message" : [
+  "Searching for  in Protobuf schema at  gave 
 matches. Candidates: "
+]
+  },
+  "PROTOBUF_FIELD_MISSING_IN_CATALYST_SCHEMA" : {
+"message" : [
+  "Found  in Protobuf schema but there is no match in the SQL 
schema"
+]
+  },
+  "PROTOBUF_FIELD_TYPE_MISMATCH" : {
+"message" : [
+  "Type mismatch encountered for field: "
+]
+  },
+  "PROTOBUF_MESSAGE_TYPE_ERROR" : {
+"message" : [
+  " is not a Protobuf message type"
+]
+  },
+  "PROTOBUF_RECURSION_ERROR" : {
+"message" : [
+  "Found recursive reference in Protobuf schema, which can not be 
processed by Spark: "
+]
+  },
+  "PROTOBUF_TYPE_NOT_SUPPORT_ERROR" : {
+"message" : [
+  "Protobuf type not yet supported: ."
+]
+  },
+  "PROTOBUF_TYPE_TO_CATALYST_TYPE_ERROR" : {

Review Comment:
   This is thrown when the Protobuf message fails to convert to a SQL struct type.






[GitHub] [spark] jchen5 commented on pull request #38432: [SPARK-36124][SQL][WIP] Support subqueries with correlation through set operators

2022-10-28 Thread GitBox


jchen5 commented on PR #38432:
URL: https://github.com/apache/spark/pull/38432#issuecomment-1295378945

   @allisonwang-db @cloud-fan 





[GitHub] [spark] jchen5 opened a new pull request, #38432: [SPARK-36124][SQL][WIP] Support subqueries with correlation through set operators

2022-10-28 Thread GitBox


jchen5 opened a new pull request, #38432:
URL: https://github.com/apache/spark/pull/38432

   ## What changes were proposed in this pull request?
   
   Adds support for subquery decorrelation with UNION, INTERSECT, and EXCEPT on 
the correlation paths. For example:
   ```
   SELECT t1a, (
 SELECT sum(b) FROM (
   SELECT t2b as b FROM t2 WHERE  t2a = t1.t1a
   INTERSECT
   SELECT t3b as b FROM t3 WHERE  t3a = t1.t1a
   ))
   FROM t1
   ```
   
   ### Why are the changes needed?
   To improve subquery support in Spark.
   
   ### Does this PR introduce _any_ user-facing change?
   Before this change, queries like this would return an error like: 
`Decorrelate inner query through Union is not supported.`
   
   After this PR, this query can run successfully.
   
   ### How was this patch tested?
   Unit tests and SQL query tests.





[GitHub] [spark] amaliujia commented on pull request #38426: [SPARK-40951][PYSPARK][TESTS] `pyspark-connect` tests should be skipped if `pandas` doesn't exist

2022-10-28 Thread GitBox


amaliujia commented on PR #38426:
URL: https://github.com/apache/spark/pull/38426#issuecomment-1295352747

   Late LGTM. 
   
   Thanks for this improvement!





[GitHub] [spark] srielau commented on a diff in pull request #38344: [SPARK-40777][SQL][PROTOBUF] Protobuf import support and move error-classes.

2022-10-28 Thread GitBox


srielau commented on code in PR #38344:
URL: https://github.com/apache/spark/pull/38344#discussion_r1008327776


##
core/src/main/resources/error/error-classes.json:
##
@@ -23,6 +23,11 @@
 ],
 "sqlState" : "42000"
   },
+  "CANNOT_FIND_PROTOBUF_DESCRIPTOR_FILE_ERROR" : {

Review Comment:
   Definitely remove _ERROR.
   Would be nice if we could gravitate towards a *_NOT_FOUND and *_ALREADY_EXISTS style naming convention:
   PROTOBUF_DESCRIPTOR_FILE_NOT_FOUND


##
core/src/main/resources/error/error-classes.json:
##
@@ -65,6 +70,11 @@
 ],
 "sqlState" : "22005"
   },
+  "CATALYST_TYPE_TO_PROTOBUF_ENUM_TYPE_ERROR" : {

Review Comment:
   I don't think we call out Catalyst anywhere else. How is that relevant? Can 
we be generic?



##
core/src/main/resources/error/error-classes.json:
##
@@ -65,6 +70,11 @@
 ],
 "sqlState" : "22005"
   },
+  "CATALYST_TYPE_TO_PROTOBUF_ENUM_TYPE_ERROR" : {
+"message" : [
+  "Cannot convert SQL  to Protobuf  because 
 cannot be written since it's not defined in ENUM "

Review Comment:
   That's a "because" squared... I don't have enough context to propose a rephrase.
   Also, if we have SQL here, maybe we can do CATALYST -> SQL in the name?



##
core/src/main/resources/error/error-classes.json:
##
@@ -579,6 +594,11 @@
   }
 }
   },
+  "MALFORMED_PROTOBUF_MESSAGE_ERROR" : {

Review Comment:
   Remove _ERROR



##
core/src/main/resources/error/error-classes.json:
##
@@ -629,11 +649,21 @@
 ],
 "sqlState" : "42000"
   },
+  "NO_CATALYST_TYPE_IN_PROTOBUF_SCHEMA" : {
+"message" : [
+  "Cannot find  in Protobuf schema"
+]
+  },
   "NO_HANDLER_FOR_UDAF" : {
 "message" : [
   "No handler for UDAF ''. Use 
sparkSession.udf.register(...) instead."
 ]
   },
+  "NO_PROTOBUF_MESSAGE_TYPE_ERROR" : {

Review Comment:
   Remove _ERROR



##
core/src/main/resources/error/error-classes.json:
##
@@ -706,6 +736,66 @@
 ],
 "sqlState" : "42000"
   },
+  "PROTOBUF_CLASS_LOAD_ERROR" : {
+"message" : [
+  "Could not load Protobuf class with name "
+]
+  },
+  "PROTOBUF_DEPENDENCY_ERROR" : {
+"message" : [
+  "Could not find dependency: "
+]
+  },
+  "PROTOBUF_DESCRIPTOR_ERROR" : {

Review Comment:
   ```suggestion
 "CANNOT_PARSE_PROTOBUF_DESCRIPTOR" : {
   ```
   
   Any hope we know why and can tell? This just says: "Bad Kitty"



##
core/src/main/resources/error/error-classes.json:
##
@@ -706,6 +736,66 @@
 ],
 "sqlState" : "42000"
   },
+  "PROTOBUF_CLASS_LOAD_ERROR" : {
+"message" : [
+  "Could not load Protobuf class with name "
+]
+  },
+  "PROTOBUF_DEPENDENCY_ERROR" : {
+"message" : [
+  "Could not find dependency: "
+]
+  },
+  "PROTOBUF_DESCRIPTOR_ERROR" : {
+"message" : [
+  "Error parsing descriptor byte[] into Descriptor object"
+]
+  },
+  "PROTOBUF_DESCRIPTOR_PARSING_ERROR" : {

Review Comment:
   ```suggestion
 "CANNOT_CONSTRUCT_PROTOBUF_DESCRIPTOR" : {
   ```
   Same as above. Not sure how helpful this message is.



##
core/src/main/resources/error/error-classes.json:
##
@@ -706,6 +736,66 @@
 ],
 "sqlState" : "42000"
   },
+  "PROTOBUF_CLASS_LOAD_ERROR" : {
+"message" : [
+  "Could not load Protobuf class with name "
+]
+  },
+  "PROTOBUF_DEPENDENCY_ERROR" : {
+"message" : [
+  "Could not find dependency: "
+]
+  },
+  "PROTOBUF_DESCRIPTOR_ERROR" : {
+"message" : [
+  "Error parsing descriptor byte[] into Descriptor object"
+]
+  },
+  "PROTOBUF_DESCRIPTOR_PARSING_ERROR" : {
+"message" : [
+  "Error constructing FileDescriptor"
+]
+  },
+  "PROTOBUF_FIELD_MISSING_ERROR" : {
+"message" : [
+  "Searching for  in Protobuf schema at  gave 
 matches. Candidates: "
+]
+  },
+  "PROTOBUF_FIELD_MISSING_IN_CATALYST_SCHEMA" : {
+"message" : [
+  "Found  in Protobuf schema but there is no match in the SQL 
schema"
+]
+  },
+  "PROTOBUF_FIELD_TYPE_MISMATCH" : {
+"message" : [
+  "Type mismatch encountered for field: "
+]
+  },
+  "PROTOBUF_MESSAGE_TYPE_ERROR" : {
+"message" : [
+  " is not a Protobuf message type"
+]
+  },
+  "PROTOBUF_RECURSION_ERROR" : {

Review Comment:
   ```suggestion
 "RECURSIVE_PROTOBUF_SCHEMA" : {
   ```



##
core/src/main/resources/error/error-classes.json:
##
@@ -760,6 +850,11 @@
 ],
 "sqlState" : "22023"
   },
+  "SQL_TYPE_TO_PROTOBUF_TYPE_ERROR" : {

Review Comment:
   ```suggestion
 "CANNOT_CONVERT_SQL_TYPE_TO_PROTOBUF_TYPE" : {
   ```



##
core/src/main/resources/error/error-classes.json:
##
@@ -792,6 +887,21 @@
   "Unable to acquire  bytes of memory, got "
 ]
   },
+  "UNABLE_TO_CONVERT_TO_PROTOBUF_TYPE" : {
+"message" : [
+  "Unable to convert SQL type  to Protobuf type ."
+]
+  },
+  

[GitHub] [spark] rahulsmahadev commented on a diff in pull request #38404: [SPARK-40956] SQL Equivalent for Dataframe overwrite command

2022-10-28 Thread GitBox


rahulsmahadev commented on code in PR #38404:
URL: https://github.com/apache/spark/pull/38404#discussion_r1008360452


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala:
##
@@ -1276,6 +1276,24 @@ class Analyzer(override val catalogManager: 
CatalogManager)
   AppendData.byPosition(r, query)
 } else if (conf.partitionOverwriteMode == 
PartitionOverwriteMode.DYNAMIC) {
   OverwritePartitionsDynamic.byPosition(r, query)
+} else if (i.replacePredicates.nonEmpty) {
+  def findAttrInRelation(name: String): Attribute = {

Review Comment:
   add comments here since this is not obvious.






[GitHub] [spark] rahulsmahadev commented on a diff in pull request #38404: [SPARK-40956] SQL Equivalent for Dataframe overwrite command

2022-10-28 Thread GitBox


rahulsmahadev commented on code in PR #38404:
URL: https://github.com/apache/spark/pull/38404#discussion_r1008360097


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala:
##
@@ -1259,8 +1259,8 @@ class Analyzer(override val catalogManager: 
CatalogManager)
   object ResolveInsertInto extends Rule[LogicalPlan] {
 override def apply(plan: LogicalPlan): LogicalPlan = 
plan.resolveOperatorsWithPruning(
   AlwaysProcess.fn, ruleId) {
-  case i @ InsertIntoStatement(r: DataSourceV2Relation, _, _, _, _, _)
-  if i.query.resolved && i.userSpecifiedCols.isEmpty =>
+  case i @ InsertIntoStatement(r: DataSourceV2Relation, _, _, _, _, _, _)
+if i.query.resolved && i.userSpecifiedCols.isEmpty =>

Review Comment:
   nit: undo space






[GitHub] [spark] itholic commented on pull request #38420: [SPARK-40947][PS][INFRA] Upgrade pandas to 1.5.1

2022-10-28 Thread GitBox


itholic commented on PR #38420:
URL: https://github.com/apache/spark/pull/38420#issuecomment-1295286024

   Ahh! Let me take a look





[GitHub] [spark] jerrypeng commented on pull request #38430: [SPARK-40957] Add in memory cache in HDFSMetadataLog

2022-10-28 Thread GitBox


jerrypeng commented on PR #38430:
URL: https://github.com/apache/spark/pull/38430#issuecomment-1295283632

   @HeartSaVioR @zsxwing please review





[GitHub] [spark] juliuszsompolski commented on pull request #38397: [SPARK-40918][SQL] Mismatch between FileSourceScanExec and Orc and ParquetFileFormat on producing columnar output

2022-10-28 Thread GitBox


juliuszsompolski commented on PR #38397:
URL: https://github.com/apache/spark/pull/38397#issuecomment-1295275605

   3.3 PR: https://github.com/apache/spark/pull/38431





[GitHub] [spark] juliuszsompolski commented on pull request #38431: [SPARK-40918][SQL] Mismatch between FileSourceScanExec and Orc and ParquetFileFormat on producing columnar output

2022-10-28 Thread GitBox


juliuszsompolski commented on PR #38431:
URL: https://github.com/apache/spark/pull/38431#issuecomment-1295275295

   @cloud-fan 





[GitHub] [spark] juliuszsompolski opened a new pull request, #38431: [SPARK-40918][SQL] Mismatch between FileSourceScanExec and Orc and ParquetFileFormat on producing columnar output

2022-10-28 Thread GitBox


juliuszsompolski opened a new pull request, #38431:
URL: https://github.com/apache/spark/pull/38431

   ### What changes were proposed in this pull request?
   
   We move the WSCG-based decision about producing columnar output one level up, from ParquetFileFormat / OrcFileFormat to FileSourceScanExec, and pass it back down as a new required option to ParquetFileFormat / OrcFileFormat. The semantics are now as follows (see the sketch after this list):
   * `ParquetFileFormat.supportsBatch` and `OrcFileFormat.supportsBatch` return whether the format **can**, not necessarily **will**, return columnar output.
   * To return columnar output, the option `FileFormat.OPTION_RETURNING_BATCH` needs to be passed to `buildReaderWithPartitionValues` in these two file formats. It should only be set to `true` if `supportsBatch` is also `true`, but it can be set to `false` if we don't want columnar output regardless. This way, `FileSourceScanExec` can set it to `false` when there are more than 100 columns for WSCG, and `ParquetFileFormat` / `OrcFileFormat` don't have to concern themselves with WSCG limits.
   * To avoid forgetting to pass the option by accident, it is made required. Making it required means updating the few places that use this API, but the error resulting from a mismatch is very obscure; it's better to fail early and explicitly here.
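
   To make the contract concrete, here is a self-contained toy model in Scala (a sketch only: the stand-in class, the `numFields` heuristic, and the option key's string value are assumptions for illustration; the names mirror the description above):
   ```scala
   object FileFormatContractDemo extends App {
     // Mirrors FileFormat.OPTION_RETURNING_BATCH; the string value is assumed.
     val OPTION_RETURNING_BATCH = "returning_batch"

     class ToyFileFormat {
       // "Can return batches", not "will": the caller makes the final call.
       def supportsBatch(numFields: Int): Boolean = numFields <= 100

       def buildReader(options: Map[String, String]): String = {
         // The option is required: failing early here beats the obscure
         // ClassCastException described below.
         val returningBatch = options.getOrElse(OPTION_RETURNING_BATCH,
           sys.error(s"$OPTION_RETURNING_BATCH is required")).toBoolean
         if (returningBatch) "columnar batches" else "rows"
       }
     }

     // The scan decides using the full output schema, *including* metadata
     // columns such as file_path, so both sides now agree.
     val format = new ToyFileFormat
     val fullOutputFields = 101 // e.g. 100 data columns + file_path
     val wantBatch = format.supportsBatch(fullOutputFields) // false: too many fields
     println(format.buildReader(Map(OPTION_RETURNING_BATCH -> wantBatch.toString)))
   }
   ```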
   
   ### Why are the changes needed?
   
   This explains it for `ParquetFileFormat`. `OrcFileFormat` had exactly the 
same issue.
   
   `java.lang.ClassCastException: org.apache.spark.sql.vectorized.ColumnarBatch 
cannot be cast to org.apache.spark.sql.catalyst.InternalRow` was being thrown 
because ParquetReader was outputting columnar batches, while FileSourceScanExec 
expected row output.
   
   The mismatch comes from the fact that `ParquetFileFormat.supportBatch` 
depends on `WholeStageCodegenExec.isTooManyFields(conf, schema)`, where the 
threshold is 100 fields.
   
   When this is used in `FileSourceScanExec`:
   ```
   override lazy val supportsColumnar: Boolean = {
     relation.fileFormat.supportBatch(relation.sparkSession, schema)
   }
   ```
   the `schema` comes from the output attributes, which include extra metadata attributes.
   
   However, inside `ParquetFileFormat.buildReaderWithPartitionValues` it was 
calculated again as
   ```
   relation.fileFormat.buildReaderWithPartitionValues(
     sparkSession = relation.sparkSession,
     dataSchema = relation.dataSchema,
     partitionSchema = relation.partitionSchema,
     requiredSchema = requiredSchema,
     filters = pushedDownFilters,
     options = options,
     hadoopConf = hadoopConf
   ...
   val resultSchema = StructType(requiredSchema.fields ++ partitionSchema.fields)
   ...
   val returningBatch = supportBatch(sparkSession, resultSchema)
   ```
   
   Where `requiredSchema` and `partitionSchema` wouldn't include the metadata 
columns:
   ```
   FileSourceScanExec: output: List(c1#4608L, c2#4609L, ..., c100#4707L, 
file_path#6388)
   FileSourceScanExec: dataSchema: 
StructType(StructField(c1,LongType,true),StructField(c2,LongType,true),...,StructField(c100,LongType,true))
   FileSourceScanExec: partitionSchema: StructType()
   FileSourceScanExec: requiredSchema: 
StructType(StructField(c1,LongType,true),StructField(c2,LongType,true),...,StructField(c100,LongType,true))
   ```
   
   Columns like `file_path#6388` are added by the scan and contain metadata produced by the scan, not by the file reader, which concerns itself only with what is inside the file.
   
   ### Does this PR introduce _any_ user-facing change?
   
   Not a public API change, but it is now required to pass 
`FileFormat.OPTION_RETURNING_BATCH` in `options` to 
`ParquetFileFormat.buildReaderWithPartitionValues`. The only user of this API 
in Apache Spark is `FileSourceScanExec`.
   
   ### How was this patch tested?
   
   Tests added
   
   Closes #38397 from juliuszsompolski/SPARK-40918.
   
   Authored-by: Juliusz Sompolski 
   Signed-off-by: Wenchen Fan 
   
   
   





[GitHub] [spark] jerrypeng opened a new pull request, #38430: [SPARK-40957] Add in memory cache in HDFSMetadataLog

2022-10-28 Thread GitBox


jerrypeng opened a new pull request, #38430:
URL: https://github.com/apache/spark/pull/38430

   
   
   ### What changes were proposed in this pull request?
   
   
   
   ### Why are the changes needed?
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   
   ### How was this patch tested?
   
   





[GitHub] [spark] dongjoon-hyun closed pull request #38426: [SPARK-40951][PYSPARK][TESTS] `pyspark-connect` tests should be skipped if `pandas` doesn't exist

2022-10-28 Thread GitBox


dongjoon-hyun closed pull request #38426: [SPARK-40951][PYSPARK][TESTS] 
`pyspark-connect` tests should be skipped if `pandas` doesn't exist
URL: https://github.com/apache/spark/pull/38426





[GitHub] [spark] dtenedor commented on a diff in pull request #38418: [SPARK-40944][SQL] Relax ordering constraint for CREATE TABLE column options

2022-10-28 Thread GitBox


dtenedor commented on code in PR #38418:
URL: https://github.com/apache/spark/pull/38418#discussion_r1008275592


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala:
##
@@ -2813,13 +2813,41 @@ class AstBuilder extends 
SqlBaseParserBaseVisitor[AnyRef] with SQLConfHelper wit
   ctx: CreateOrReplaceTableColTypeContext): StructField = withOrigin(ctx) {
 import ctx._
 
+// Check that no duplicates exist among any CREATE TABLE column options 
specified.
+var isNotNull: Option[Boolean] = None

Review Comment:
   That's a good point. We can just name this `nullable`, initialize it to 
`true`, and change it to `false` if the SQL syntax includes `NOT NULL`. 
Updated accordingly.
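   
   A minimal sketch of that approach (the parser-context accessor names are 
hypothetical):
   ```scala
   // Default to nullable; flip to false only when the column definition
   // carries an explicit NOT NULL option.
   var nullable = true
   ctx.colDefinitionOption.asScala.foreach { option =>
     if (option.NOT() != null) nullable = false
   }
   val field = StructField(colName.getText, dataType, nullable)
   ```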



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] johanl-db commented on a diff in pull request #38400: [SPARK-40921][SQL] Add WHEN NOT MATCHED BY SOURCE clause to MERGE INTO

2022-10-28 Thread GitBox


johanl-db commented on code in PR #38400:
URL: https://github.com/apache/spark/pull/38400#discussion_r1008270419


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala:
##
@@ -451,7 +451,22 @@ class AstBuilder extends SqlBaseParserBaseVisitor[AnyRef] 
with SQLConfHelper wit
 }
   }
 }
-if (matchedActions.isEmpty && notMatchedActions.isEmpty) {
+val notMatchedBySourceActions = ctx.notMatchedBySourceClause().asScala.map 
{
+  clause => {
+val notMatchedBySourceAction = clause.notMatchedBySourceAction()
+if (notMatchedBySourceAction.DELETE() != null) {
+  DeleteAction(Option(clause.notMatchedBySourceCond).map(expression))
+} else if (notMatchedBySourceAction.UPDATE() != null) {
+  val condition = Option(clause.notMatchedBySourceCond).map(expression)
+  UpdateAction(condition,
+
withAssignments(clause.notMatchedBySourceAction().assignmentList()))
+} else {
+  // It should not be here.
+  throw 
QueryParsingErrors.unrecognizedNotMatchedBySourceActionError(clause)

Review Comment:
   Parsing will fail if a user tries to use an invalid action. We don't cover 
this error in tests since it can't surface unless there's a bug. I replaced it 
(and the similar errors for matched / not matched) with internal errors.
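   
   For context, a hedged example of the new clause as proposed in this PR 
(table and column names invented):
   ```scala
   // WHEN NOT MATCHED BY SOURCE acts on target rows that have no matching
   // source row, complementing the existing MATCHED / NOT MATCHED clauses.
   spark.sql("""
     MERGE INTO target t
     USING source s
     ON t.id = s.id
     WHEN MATCHED THEN UPDATE SET t.value = s.value
     WHEN NOT MATCHED THEN INSERT (id, value) VALUES (s.id, s.value)
     WHEN NOT MATCHED BY SOURCE THEN DELETE
   """)
   ```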



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] dongjoon-hyun commented on pull request #38426: [SPARK-40951][PYSPARK][TESTS] `pyspark-connect` tests should be skipped if `pandas` doesn't exist

2022-10-28 Thread GitBox


dongjoon-hyun commented on PR #38426:
URL: https://github.com/apache/spark/pull/38426#issuecomment-1295238073

   The last commit only changes the annotation and I verified it once more 
manually. Merged to master for Apache Spark 3.4.
   ```
   % PYTHON_EXECUTABLE=python3.9 python/run-tests --modules pyspark-connect
   Running PySpark tests. Output is in 
/Users/dongjoon/APACHE/spark-merge/python/unit-tests.log
   Will test against the following Python executables: ['python3.9']
   Will test the following Python modules: ['pyspark-connect']
   python3.9 python_implementation is CPython
   python3.9 version is: Python 3.9.15
   Starting test(python3.9): 
pyspark.sql.tests.connect.test_connect_column_expressions (temp output: 
/Users/dongjoon/APACHE/spark-merge/python/target/11141642-6c91-4d03-9039-7e4ca2a946d9/python3.9__pyspark.sql.tests.connect.test_connect_column_expressions__ufodn8um.log)
   Starting test(python3.9): pyspark.sql.tests.connect.test_connect_basic (temp 
output: 
/Users/dongjoon/APACHE/spark-merge/python/target/4701bdaf-a321-4175-8dfb-dea06c91b45a/python3.9__pyspark.sql.tests.connect.test_connect_basic__daej1unt.log)
   Starting test(python3.9): pyspark.sql.tests.connect.test_connect_select_ops 
(temp output: 
/Users/dongjoon/APACHE/spark-merge/python/target/b84efc53-9fba-4073-a088-09dad28fb5a0/python3.9__pyspark.sql.tests.connect.test_connect_select_ops__34nud2k0.log)
   Starting test(python3.9): pyspark.sql.tests.connect.test_connect_plan_only 
(temp output: 
/Users/dongjoon/APACHE/spark-merge/python/target/66f0-5751-49e7-8bb1-72633c30d7b1/python3.9__pyspark.sql.tests.connect.test_connect_plan_only__inix46uc.log)
   Finished test(python3.9): pyspark.sql.tests.connect.test_connect_basic (2s) 
... 6 tests were skipped
   Finished test(python3.9): pyspark.sql.tests.connect.test_connect_plan_only 
(2s) ... 10 tests were skipped
   Finished test(python3.9): pyspark.sql.tests.connect.test_connect_select_ops 
(2s) ... 2 tests were skipped
   Finished test(python3.9): 
pyspark.sql.tests.connect.test_connect_column_expressions (2s) ... 2 tests were 
skipped
   Tests passed in 2 seconds
   
   Skipped tests in pyspark.sql.tests.connect.test_connect_basic with python3.9:
 test_limit_offset 
(pyspark.sql.tests.connect.test_connect_basic.SparkConnectTests) ... skip 
(0.003s)
 test_schema 
(pyspark.sql.tests.connect.test_connect_basic.SparkConnectTests) ... skip 
(0.000s)
 test_simple_datasource_read 
(pyspark.sql.tests.connect.test_connect_basic.SparkConnectTests) ... skip 
(0.000s)
 test_simple_explain_string 
(pyspark.sql.tests.connect.test_connect_basic.SparkConnectTests) ... skip 
(0.000s)
 test_simple_read 
(pyspark.sql.tests.connect.test_connect_basic.SparkConnectTests) ... skip 
(0.000s)
 test_simple_udf 
(pyspark.sql.tests.connect.test_connect_basic.SparkConnectTests) ... skip 
(0.000s)
   
   Skipped tests in pyspark.sql.tests.connect.test_connect_column_expressions 
with python3.9:
 test_column_literals 
(pyspark.sql.tests.connect.test_connect_column_expressions.SparkConnectColumnExpressionSuite)
 ... skip (0.000s)
 test_simple_column_expressions 
(pyspark.sql.tests.connect.test_connect_column_expressions.SparkConnectColumnExpressionSuite)
 ... skip (0.000s)
   
   Skipped tests in pyspark.sql.tests.connect.test_connect_plan_only with 
python3.9:
 test_all_the_plans 
(pyspark.sql.tests.connect.test_connect_plan_only.SparkConnectTestsPlanOnly) 
... skip (0.003s)
 test_datasource_read 
(pyspark.sql.tests.connect.test_connect_plan_only.SparkConnectTestsPlanOnly) 
... skip (0.000s)
 test_deduplicate 
(pyspark.sql.tests.connect.test_connect_plan_only.SparkConnectTestsPlanOnly) 
... skip (0.001s)
 test_filter 
(pyspark.sql.tests.connect.test_connect_plan_only.SparkConnectTestsPlanOnly) 
... skip (0.000s)
 test_limit 
(pyspark.sql.tests.connect.test_connect_plan_only.SparkConnectTestsPlanOnly) 
... skip (0.000s)
 test_offset 
(pyspark.sql.tests.connect.test_connect_plan_only.SparkConnectTestsPlanOnly) 
... skip (0.000s)
 test_relation_alias 
(pyspark.sql.tests.connect.test_connect_plan_only.SparkConnectTestsPlanOnly) 
... skip (0.000s)
 test_sample 
(pyspark.sql.tests.connect.test_connect_plan_only.SparkConnectTestsPlanOnly) 
... skip (0.001s)
 test_simple_project 
(pyspark.sql.tests.connect.test_connect_plan_only.SparkConnectTestsPlanOnly) 
... skip (0.000s)
 test_simple_udf 
(pyspark.sql.tests.connect.test_connect_plan_only.SparkConnectTestsPlanOnly) 
... skip (0.000s)
   
   Skipped tests in pyspark.sql.tests.connect.test_connect_select_ops with 
python3.9:
 test_join_with_join_type 
(pyspark.sql.tests.connect.test_connect_select_ops.SparkConnectToProtoSuite) 
... skip (0.003s)
 test_select_with_columns_and_strings 
(pyspark.sql.tests.connect.test_connect_select_ops.SparkConnectToProtoSuite) 
... skip (0.000s)
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] dtenedor commented on a diff in pull request #38263: [SPARK-40692][SQL] Support data masking built-in function 'mask_hash'

2022-10-28 Thread GitBox


dtenedor commented on code in PR #38263:
URL: https://github.com/apache/spark/pull/38263#discussion_r1008272185


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/maskExpressions.scala:
##
@@ -0,0 +1,99 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.expressions
+
+import org.apache.commons.codec.digest.DigestUtils
+
+import org.apache.spark.sql.catalyst.expressions.codegen.{CodegenContext, 
ExprCode}
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.types._
+import org.apache.spark.unsafe.types.UTF8String
+
+// scalastyle:off line.size.limit
+@ExpressionDescription(
+  usage = "_FUNC_(str) - Returns hash of the given value. The hash is 
consistent and can be used to join masked values together across tables",

Review Comment:
   This description probably needs a bit more information. For example, can 
this mention:
   
   * This can be useful for creating copies of tables with sensitive 
information removed, but retaining the same schema.
   * Error behavior: there are no error cases for this expression, it always 
returns a result string for every input string.
   * This looks similar to the existing `md5` function [1]. How is this 
different?
   
   [1] 
https://github.com/apache/spark/blob/d2bdd6595ec3495672c60e225948c99bfaeff04a/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L2415



##
sql/core/src/test/resources/sql-tests/inputs/string-functions.sql:
##
@@ -58,6 +58,14 @@ SELECT substring('Spark SQL' from 5);
 SELECT substring('Spark SQL' from -3);
 SELECT substring('Spark SQL' from 5 for 1);
 
+-- mask function
+select mask_hash('TestString-123'),

Review Comment:
   also add a case where the input is NULL?
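   
   A hedged sketch of the suggested extra coverage (`mask_hash` is the 
function proposed in this PR, so its exact behavior is not assumed here):
   ```scala
   // NULL input should propagate to a NULL result.
   spark.sql("SELECT mask_hash(CAST(NULL AS STRING))").show()
   ```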



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] dtenedor commented on a diff in pull request #38146: [SPARK-40687][SQL] Support data masking built-in function 'mask'

2022-10-28 Thread GitBox


dtenedor commented on code in PR #38146:
URL: https://github.com/apache/spark/pull/38146#discussion_r1008264498


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/maskExpressions.scala:
##
@@ -0,0 +1,222 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.expressions
+
+import org.apache.spark.sql.catalyst.expressions.codegen._
+import org.apache.spark.sql.types.{AbstractDataType, DataType, StringType}
+import org.apache.spark.unsafe.types.UTF8String
+
+// scalastyle:off line.size.limit
+@ExpressionDescription(
+  usage =
+"""_FUNC_(input[, upperChar, lowerChar, digitChar, otherChar]) - masks the 
given string value""",
+  arguments = """
+Arguments:
+  * input  - string value to mask. Supported types: STRING, VARCHAR, 
CHAR
+  * upperChar  - character to replace upper-case characters with. Specify 
-1 to retain original character. Default value: 'X'
+  * lowerChar  - character to replace lower-case characters with. Specify 
-1 to retain original character. Default value: 'x'
+  * digitChar  - character to replace digit characters with. Specify -1 to 
retain original character. Default value: 'n'
+  * otherChar  - character to replace all other characters with. Specify 
-1 to retain original character. Default value: -1
+  """,
+  examples = """
+Examples:
+  > SELECT _FUNC_('abcd-EFGH-8765-4321');
+---
+  > SELECT _FUNC_('abcd-EFGH-8765-4321', 'Q');
+---
+  > SELECT _FUNC_('AbCD123-@$#', 'Q','q');
+QqQQnnn-@$#
+  > SELECT _FUNC_('AbCD123-@$#');
+XxXXnnn-@$#
+  > SELECT _FUNC_('AbCD123-@$#', 'Q');
+QxQQnnn-@$#
+  > SELECT _FUNC_('AbCD123-@$#', 'Q','q');
+QqQQnnn-@$#
+  > SELECT _FUNC_('AbCD123-@$#', 'Q','q', 'd');
+QqQQddd-@$#
+  > SELECT _FUNC_('AbCD123-@$#', 'Q','q', 'd', 'o');
+QqQQddd
+  > SELECT _FUNC_('AbCD123-@$#', -1, 'q', 'd', 'o');
+AqCDddd
+  > SELECT _FUNC_('AbCD123-@$#', -1,-1, 'd', 'o');
+AbCDddd
+  > SELECT _FUNC_('AbCD123-@$#', -1,-1, -1, 'o');
+AbCD123
+  > SELECT _FUNC_(NULL, -1,-1, -1, 'o');
+NULL
+  > SELECT _FUNC_(NULL);
+NULL
+  > SELECT _FUNC_('AbCD123-@$#', -1, -1, -1, -1);
+AbCD123-@$#
+  """,
+  since = "3.4.0",
+  group = "string_funcs")
+// scalastyle:on line.size.limit
+case class Mask(
+input: Expression,
+upperChar: Expression,
+lowerChar: Expression,
+digitChar: Expression,
+otherChar: Expression)
+extends QuinaryExpression
+with ImplicitCastInputTypes
+with NullIntolerant {
+
+  def this(input: Expression) =
+this(
+  input,
+  Literal(Mask.MASKED_UPPERCASE),
+  Literal(Mask.MASKED_LOWERCASE),
+  Literal(Mask.MASKED_DIGIT),
+  Literal(Mask.MASKED_IGNORE))
+
+  def this(input: Expression, upperChar: Expression) =
+this(
+  input,
+  upperChar,
+  Literal(Mask.MASKED_LOWERCASE),
+  Literal(Mask.MASKED_DIGIT),
+  Literal(Mask.MASKED_IGNORE))
+
+  def this(input: Expression, upperChar: Expression, lowerChar: Expression) =
+this(input, upperChar, lowerChar, Literal(Mask.MASKED_DIGIT), 
Literal(Mask.MASKED_IGNORE))
+
+  def this(
+  input: Expression,
+  upperChar: Expression,
+  lowerChar: Expression,
+  digitChar: Expression) =
+this(input, upperChar, lowerChar, digitChar, Literal(Mask.MASKED_IGNORE))
+
+  /**
+   * Expected input types from child expressions. The i-th position in the 
returned seq indicates
+   * the type requirement for the i-th child.
+   *
+   * The possible values at each position are:
+   *   1. a specific data type, e.g. LongType, StringType. 2. a non-leaf 
abstract data type, e.g.
+   *  NumericType, IntegralType, FractionalType.
+   */
+  override def inputTypes: Seq[AbstractDataType] =
+Seq(StringType, StringType, StringType, StringType, StringType)
+
+  /**
+   * Called by default [[eval]] implementation. If subclass of 
QuinaryExpression keep the default
+   * nullability, they can override this method to save null-check code. If we 
need full control
+   * 

[GitHub] [spark] johanl-db commented on a diff in pull request #38400: [SPARK-40921][SQL] Add WHEN NOT MATCHED BY SOURCE clause to MERGE INTO

2022-10-28 Thread GitBox


johanl-db commented on code in PR #38400:
URL: https://github.com/apache/spark/pull/38400#discussion_r1008270419


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala:
##
@@ -451,7 +451,22 @@ class AstBuilder extends SqlBaseParserBaseVisitor[AnyRef] 
with SQLConfHelper wit
 }
   }
 }
-if (matchedActions.isEmpty && notMatchedActions.isEmpty) {
+val notMatchedBySourceActions = ctx.notMatchedBySourceClause().asScala.map 
{
+  clause => {
+val notMatchedBySourceAction = clause.notMatchedBySourceAction()
+if (notMatchedBySourceAction.DELETE() != null) {
+  DeleteAction(Option(clause.notMatchedBySourceCond).map(expression))
+} else if (notMatchedBySourceAction.UPDATE() != null) {
+  val condition = Option(clause.notMatchedBySourceCond).map(expression)
+  UpdateAction(condition,
+
withAssignments(clause.notMatchedBySourceAction().assignmentList()))
+} else {
+  // It should not be here.
+  throw 
QueryParsingErrors.unrecognizedNotMatchedBySourceActionError(clause)

Review Comment:
   Parsing will fail if a user tries to use an invalid action. We don't cover 
this error in tests since it can't surface unless there's a bug. I replaced it 
(and the similar errors for matched / not matched) with internal errors.



##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala:
##
@@ -1563,13 +1563,29 @@ class Analyzer(override val catalogManager: 
CatalogManager)
 }
 InsertAction(
   resolvedInsertCondition,
-  resolveAssignments(assignments, m, 
resolveValuesWithSourceOnly = true))
+  resolveAssignments(assignments, m, resolveValuesFrom = 
sourceTable))
   case o => o
 }
+val newNotMatchedBySourceActions = m.notMatchedBySourceActions.map 
{
+  case DeleteAction(deleteCondition) =>
+val resolvedDeleteCondition = deleteCondition.map(
+  resolveExpressionByPlanChildren(_, Project(Nil, 
targetTable)))

Review Comment:
   Updated.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] rangadi commented on a diff in pull request #38286: [SPARK-40657] Add support for Java classes in Protobuf functions

2022-10-28 Thread GitBox


rangadi commented on code in PR #38286:
URL: https://github.com/apache/spark/pull/38286#discussion_r1008242208


##
connector/protobuf/src/test/scala/org/apache/spark/sql/protobuf/ProtobufCatalystDataConversionSuite.scala:
##
@@ -24,6 +24,7 @@ import org.apache.spark.sql.{RandomDataGenerator, Row}
 import org.apache.spark.sql.catalyst.{CatalystTypeConverters, InternalRow, 
NoopFilters, OrderedFilters, StructFilters}
 import org.apache.spark.sql.catalyst.expressions.{ExpressionEvalHelper, 
GenericInternalRow, Literal}
 import org.apache.spark.sql.catalyst.util.{ArrayBasedMapData, 
GenericArrayData, MapData}
+import org.apache.spark.sql.protobuf.protos.CatalystTypes.BytesMsg

Review Comment:
   @MaxGekk did you import it as a Maven or an sbt project? Both should work, 
though. With Maven, running 'Generate Sources' should fix that (in the Maven 
tool window). I am also seeing a similar issue; IntelliJ does not seem to run 
the `generate-test-sources` phase.
   
   I tried just now, and running `../../build/mvn test` in the protobuf 
directory fixed this issue.
   
   (screenshot: https://user-images.githubusercontent.com/502522/198683965-67b5c95f-f1e6-4a9b-9f96-1607026946a0.png)
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] dtenedor commented on pull request #38065: [SPARK-40625] Add MASK_CCN and TRY_MASK_CCN functions to redact credit card string values

2022-10-28 Thread GitBox


dtenedor commented on PR #38065:
URL: https://github.com/apache/spark/pull/38065#issuecomment-1295190482

   Going to close this in favor of the other work by @vinodkc instead.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] dtenedor commented on pull request #38146: [SPARK-40687][SQL] Support data masking built-in function 'mask'

2022-10-28 Thread GitBox


dtenedor commented on PR #38146:
URL: https://github.com/apache/spark/pull/38146#issuecomment-1295191010

   @vinodkc sounds good! Thanks for working on this  


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] dtenedor closed pull request #38065: [SPARK-40625] Add MASK_CCN and TRY_MASK_CCN functions to redact credit card string values

2022-10-28 Thread GitBox


dtenedor closed pull request #38065: [SPARK-40625] Add MASK_CCN and 
TRY_MASK_CCN functions to redact credit card string values
URL: https://github.com/apache/spark/pull/38065


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] dtenedor commented on pull request #38101: [WIP][SPARK-40652][SQL] Add MASK_PHONE and TRY_MASK_PHONE functions to redact phone number string values

2022-10-28 Thread GitBox


dtenedor commented on PR #38101:
URL: https://github.com/apache/spark/pull/38101#issuecomment-1295190096

   Going to close this in favor of the other effort by @vinodkc instead


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] dtenedor closed pull request #38101: [WIP][SPARK-40652][SQL] Add MASK_PHONE and TRY_MASK_PHONE functions to redact phone number string values

2022-10-28 Thread GitBox


dtenedor closed pull request #38101: [WIP][SPARK-40652][SQL] Add MASK_PHONE and 
TRY_MASK_PHONE functions to redact phone number string values
URL: https://github.com/apache/spark/pull/38101


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] antonipp commented on a diff in pull request #38376: [SPARK-40817] [Kubernetes] Do not discard remote user-specified files when launching Spark jobs on Kubernetes

2022-10-28 Thread GitBox


antonipp commented on code in PR #38376:
URL: https://github.com/apache/spark/pull/38376#discussion_r1008235962


##
core/src/test/scala/org/apache/spark/deploy/SparkSubmitSuite.scala:
##
@@ -1609,6 +1609,16 @@ class TestFileSystem extends 
org.apache.hadoop.fs.LocalFileSystem {
   }
 
   override def open(path: Path): FSDataInputStream = super.open(local(path))
+
+  // No-op methods
+
+  override def copyFromLocalFile(

Review Comment:
   Fixed the indentation in a new commit.
   
   I also spent some time trying to make the integration tests work on my 
machine but encountered a blocking issue with my local setup; I opened a 
ticket here: https://issues.apache.org/jira/browse/SPARK-40954 
   Could we merge this PR as it is? I can spend more time on fixing my 
integration test setup and writing a new test some time later, but I think the 
unit test I added already gives us pretty good coverage.
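   
   For reference, a minimal sketch of one such no-op override (the signature 
follows Hadoop's `FileSystem`; the behavior is assumed from the discussion):
   ```scala
   import org.apache.hadoop.fs.Path

   // The test filesystem resolves every path locally, so a remote copy has
   // nothing to do and can safely be swallowed.
   override def copyFromLocalFile(
       delSrc: Boolean,
       overwrite: Boolean,
       src: Path,
       dst: Path): Unit = {
     // no-op
   }
   ```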



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] dongjoon-hyun commented on pull request #38426: [SPARK-40951][PYSPARK][TESTS] `pyspark-connect` tests should be skipped if `pandas` doesn't exist

2022-10-28 Thread GitBox


dongjoon-hyun commented on PR #38426:
URL: https://github.com/apache/spark/pull/38426#issuecomment-1295178713

   Thank you, @zhengruifeng. I'll fix the `mypy` style check failure in the linter.
   ```
   starting mypy annotations test...
   annotations failed mypy checks:
   python/pyspark/sql/tests/connect/test_connect_select_ops.py:29: error: 
Argument 2 to "skipIf" has incompatible type "Optional[str]"; expected "str"  
[arg-type]
   python/pyspark/sql/tests/connect/test_connect_plan_only.py:29: error: 
Argument 2 to "skipIf" has incompatible type "Optional[str]"; expected "str"  
[arg-type]
   python/pyspark/sql/tests/connect/test_connect_column_expressions.py:30: 
error: Argument 2 to "skipIf" has incompatible type "Optional[str]"; expected 
"str"  [arg-type]
   Found 3 errors in 3 files (checked 369 source files)
   1
   Error: Process completed with exit code 1.
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SandishKumarHN commented on a diff in pull request #38344: [SPARK-40777][SQL][PROTOBUF] Protobuf import support and move error-classes.

2022-10-28 Thread GitBox


SandishKumarHN commented on code in PR #38344:
URL: https://github.com/apache/spark/pull/38344#discussion_r1008228337


##
sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryCompilationErrors.scala:
##
@@ -3210,4 +3211,190 @@ private[sql] object QueryCompilationErrors extends 
QueryErrorsBase {
   messageParameters = Map("expression" -> toSQLExpr(expression))
 )
   }
+
+  def cannotConvertProtobufTypeToSqlTypeError(
+  protobufColumn: String,
+  sqlColumn: String,
+  protobufType: String,
+  sqlType: DataType): Throwable = {
+new AnalysisException(
+  errorClass = "PROTOBUF_TYPE_TO_SQL_TYPE_ERROR",
+  messageParameters = Map(
+"protobufColumn" -> protobufColumn,
+"sqlColumn" -> toSQLId(sqlColumn),
+"protobufType" -> protobufType,
+"sqlType" -> toSQLType(sqlType)))
+  }
+
+  def cannotConvertCatalystTypeToProtobufTypeError(
+  sqlColumn: String,
+  protobufColumn: String,
+  sqlType: DataType,
+  protobufType: String): Throwable = {
+new AnalysisException(
+  errorClass = "SQL_TYPE_TO_PROTOBUF_TYPE_ERROR",
+  messageParameters = Map(
+"sqlColumn" -> toSQLId(sqlColumn),
+"protobufColumn" -> protobufColumn,
+"sqlType" -> toSQLType(sqlType),
+"protobufType" -> protobufType))
+  }
+
+  def cannotConvertCatalystTypeToProtobufEnumTypeError(
+  sqlColumn: String,
+  protobufColumn: String,
+  data: String,
+  enumString: String): Throwable = {
+new AnalysisException(
+  errorClass = "CATALYST_TYPE_TO_PROTOBUF_ENUM_TYPE_ERROR",
+  messageParameters = Map(
+"sqlColumn" -> sqlColumn,
+"protobufColumn" -> protobufColumn,
+"data" -> data,
+"enumString" -> enumString))
+  }
+
+  def cannotConvertProtobufTypeToCatalystTypeError(
+  protobufType: String,
+  sqlType: DataType,
+  cause: Throwable): Throwable = {
+new AnalysisException(
+  errorClass = "PROTOBUF_TYPE_TO_CATALYST_TYPE_ERROR",
+  messageParameters = Map(
+"protobufType" -> protobufType,
+"toType" -> toSQLType(sqlType)),
+  cause = Some(cause.getCause))
+  }
+
+  def cannotConvertSqlTypeToProtobufError(
+  protobufType: String,
+  sqlType: DataType,
+  cause: Throwable): Throwable = {
+new AnalysisException(
+  errorClass = "UNABLE_TO_CONVERT_TO_PROTOBUF_TYPE",
+  messageParameters = Map(
+"protobufType" -> protobufType,
+"toType" -> toSQLType(sqlType)),
+  cause = Some(cause.getCause))

Review Comment:
   Fixed.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SandishKumarHN commented on a diff in pull request #38344: [SPARK-40777][SQL][PROTOBUF] Protobuf import support and move error-classes.

2022-10-28 Thread GitBox


SandishKumarHN commented on code in PR #38344:
URL: https://github.com/apache/spark/pull/38344#discussion_r1008228857


##
sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryCompilationErrors.scala:
##
@@ -3210,4 +3211,190 @@ private[sql] object QueryCompilationErrors extends 
QueryErrorsBase {
   messageParameters = Map("expression" -> toSQLExpr(expression))
 )
   }
+
+  def cannotConvertProtobufTypeToSqlTypeError(
+  protobufColumn: String,
+  sqlColumn: String,
+  protobufType: String,
+  sqlType: DataType): Throwable = {
+new AnalysisException(
+  errorClass = "PROTOBUF_TYPE_TO_SQL_TYPE_ERROR",
+  messageParameters = Map(
+"protobufColumn" -> protobufColumn,
+"sqlColumn" -> toSQLId(sqlColumn),
+"protobufType" -> protobufType,
+"sqlType" -> toSQLType(sqlType)))
+  }
+
+  def cannotConvertCatalystTypeToProtobufTypeError(
+  sqlColumn: String,
+  protobufColumn: String,
+  sqlType: DataType,
+  protobufType: String): Throwable = {
+new AnalysisException(
+  errorClass = "SQL_TYPE_TO_PROTOBUF_TYPE_ERROR",
+  messageParameters = Map(
+"sqlColumn" -> toSQLId(sqlColumn),
+"protobufColumn" -> protobufColumn,
+"sqlType" -> toSQLType(sqlType),
+"protobufType" -> protobufType))
+  }
+
+  def cannotConvertCatalystTypeToProtobufEnumTypeError(
+  sqlColumn: String,
+  protobufColumn: String,
+  data: String,
+  enumString: String): Throwable = {
+new AnalysisException(
+  errorClass = "CATALYST_TYPE_TO_PROTOBUF_ENUM_TYPE_ERROR",
+  messageParameters = Map(
+"sqlColumn" -> sqlColumn,
+"protobufColumn" -> protobufColumn,
+"data" -> data,
+"enumString" -> enumString))
+  }
+
+  def cannotConvertProtobufTypeToCatalystTypeError(
+  protobufType: String,
+  sqlType: DataType,
+  cause: Throwable): Throwable = {
+new AnalysisException(
+  errorClass = "PROTOBUF_TYPE_TO_CATALYST_TYPE_ERROR",
+  messageParameters = Map(
+"protobufType" -> protobufType,
+"toType" -> toSQLType(sqlType)),
+  cause = Some(cause.getCause))
+  }
+
+  def cannotConvertSqlTypeToProtobufError(
+  protobufType: String,
+  sqlType: DataType,
+  cause: Throwable): Throwable = {
+new AnalysisException(
+  errorClass = "UNABLE_TO_CONVERT_TO_PROTOBUF_TYPE",
+  messageParameters = Map(
+"protobufType" -> protobufType,
+"toType" -> toSQLType(sqlType)),
+  cause = Some(cause.getCause))
+  }
+
+  def protobufTypeUnsupportedYetError(protobufType: String): Throwable = {
+new AnalysisException(
+  errorClass = "PROTOBUF_TYPE_NOT_SUPPORT_ERROR",
+  messageParameters = Map("protobufType" -> protobufType))
+  }
+
+  def unknownProtobufMessageTypeError(
+  descriptorName: String,
+  containingType: String): Throwable = {
+new AnalysisException(
+  errorClass = "UNKNOWN_PROTOBUF_MESSAGE_TYPE",
+  messageParameters = Map(
+"descriptorName" -> descriptorName,
+"containingType" -> containingType))
+  }
+
+  def cannotFindCatalystTypeInProtobufSchemaError(catalystFieldPath: String): 
Throwable = {
+new AnalysisException(
+  errorClass = "NO_CATALYST_TYPE_IN_PROTOBUF_SCHEMA",
+  messageParameters = Map("catalystFieldPath" -> catalystFieldPath))
+  }
+
+  def cannotFindProtobufFieldInCatalystError(field: String): Throwable = {
+new AnalysisException(
+  errorClass = "PROTOBUF_FIELD_MISSING_IN_CATALYST_SCHEMA",
+  messageParameters = Map("field" -> field))
+  }
+
+  def protobufFieldMatchError(field: String,
+  protobufSchema: String,
+  matchSize: String,
+  matches: String): Throwable = {
+new AnalysisException(
+  errorClass = "PROTOBUF_FIELD_MISSING_ERROR",
+  messageParameters = Map(
+"field" -> field,
+"protobufSchema" -> protobufSchema,
+"matchSize" -> matchSize,
+"matches" -> matches))
+  }
+
+  def unableToLocateProtobufMessageError(messageName: String): Throwable = {
+new AnalysisException(
+  errorClass = "UNABLE_TO_LOCATE_PROTOBUF_MESSAGE_ERROR",
+  messageParameters = Map("messageName" -> messageName))
+  }
+
+  def descrioptorParseError(cause: Throwable): Throwable = {
+new AnalysisException(
+  errorClass = "PROTOBUF_DESCRIPTOR_ERROR",
+  messageParameters = Map.empty(),
+  cause = Some(cause.getCause))
+  }
+
+  def cannotFindDescriptorFileError(filePath: String, cause: Throwable): 
Throwable = {
+new AnalysisException(
+  errorClass = "CANNOT_FIND_PROTOBUF_DESCRIPTOR_FILE_ERROR",
+  messageParameters = Map("filePath" -> filePath),
+  cause = Some(cause.getCause))
+  }
+
+  def noProtobufMessageTypeReturnError(descriptorName: String): Throwable = {
+new AnalysisException(
+  errorClass = "NO_PROTOBUF_MESSAGE_TYPE_ERROR",
+  messageParameters = Map("descriptorName" -> 

[GitHub] [spark] SandishKumarHN commented on a diff in pull request #38344: [SPARK-40777][SQL][PROTOBUF] Protobuf import support and move error-classes.

2022-10-28 Thread GitBox


SandishKumarHN commented on code in PR #38344:
URL: https://github.com/apache/spark/pull/38344#discussion_r1008228200


##
connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/ProtobufSerializer.scala:
##
@@ -215,10 +217,12 @@ private[sql] class ProtobufSerializer(
   duration.build()
 
   case _ =>
-throw new IncompatibleSchemaException(
-  errorPrefix +
-s"schema is incompatible (sqlType = ${catalystType.sql}, " +
-s"protoType = ${fieldDescriptor.getJavaType})")
+throw 
QueryCompilationErrors.cannotConvertCatalystTypeToProtobufTypeError(
+  toFieldStr(catalystPath),

Review Comment:
   Okay, I will call `toSQLType` inside the error class and avoid using 
`toFieldStr` for the sqlType.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] dongjoon-hyun commented on pull request #38425: [SPARK-40229][PS][TEST][FOLLOWUP] Add `openpyxl` to `requirements.txt`

2022-10-28 Thread GitBox


dongjoon-hyun commented on PR #38425:
URL: https://github.com/apache/spark/pull/38425#issuecomment-1295175461

   Thank you, @Yikun .


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] allisonwang-db opened a new pull request, #38429: [SPARK-40800][SQL][FOLLOW-UP] Add a config to control whether to always inline one-row relation subquery

2022-10-28 Thread GitBox


allisonwang-db opened a new pull request, #38429:
URL: https://github.com/apache/spark/pull/38429

   
   
   ### What changes were proposed in this pull request?
   
   This PR is a follow-up for SPARK-40800. It introduces a new Spark config 
that controls whether to always inline one-row relation subqueries: 
`spark.sql.optimizer.optimizeOneRowRelationSubquery.alwaysInline` (default: 
true).
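   
   A hedged usage sketch (a runtime SQL conf, per the description above):
   ```scala
   // Opt out of always inlining one-row relation subqueries.
   spark.conf.set(
     "spark.sql.optimizer.optimizeOneRowRelationSubquery.alwaysInline", "false")
   ```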
   
   ### Why are the changes needed?
   
   To give users more flexibility in controlling correlated subquery 
performance.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No
   
   ### How was this patch tested?
   
   Edited existing unit tests.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] panbingkun commented on a diff in pull request #38350: [SPARK-40752][SQL] Migrate type check failures of misc expressions onto error classes

2022-10-28 Thread GitBox


panbingkun commented on code in PR #38350:
URL: https://github.com/apache/spark/pull/38350#discussion_r1008152723


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/SortOrder.scala:
##
@@ -72,7 +74,11 @@ case class SortOrder(
 if (RowOrdering.isOrderable(dataType)) {
   TypeCheckResult.TypeCheckSuccess
 } else {
-  TypeCheckResult.TypeCheckFailure(s"cannot sort data type 
${dataType.catalogString}")
+  DataTypeMismatch(
+errorSubClass = "UNSUPPORTED_INPUT_TYPE",

Review Comment:
   Done



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cloud-fan commented on pull request #38263: [SPARK-40692][SQL] Support data masking built-in function 'mask_hash'

2022-10-28 Thread GitBox


cloud-fan commented on PR #38263:
URL: https://github.com/apache/spark/pull/38263#issuecomment-1295085134

   one more reference: 
https://docs.snowflake.com/en/sql-reference/functions/sha2.html
   
   We can probably find more references and then decide on the function name 
and signature for consistent hashing.
   
   cc @srielau 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cloud-fan commented on a diff in pull request #38400: [SPARK-40921][SQL] Add WHEN NOT MATCHED BY SOURCE clause to MERGE INTO

2022-10-28 Thread GitBox


cloud-fan commented on code in PR #38400:
URL: https://github.com/apache/spark/pull/38400#discussion_r1008135830


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala:
##
@@ -451,7 +451,22 @@ class AstBuilder extends SqlBaseParserBaseVisitor[AnyRef] 
with SQLConfHelper wit
 }
   }
 }
-if (matchedActions.isEmpty && notMatchedActions.isEmpty) {
+val notMatchedBySourceActions = ctx.notMatchedBySourceClause().asScala.map 
{
+  clause => {
+val notMatchedBySourceAction = clause.notMatchedBySourceAction()
+if (notMatchedBySourceAction.DELETE() != null) {
+  DeleteAction(Option(clause.notMatchedBySourceCond).map(expression))
+} else if (notMatchedBySourceAction.UPDATE() != null) {
+  val condition = Option(clause.notMatchedBySourceCond).map(expression)
+  UpdateAction(condition,
+
withAssignments(clause.notMatchedBySourceAction().assignmentList()))
+} else {
+  // It should not be here.
+  throw 
QueryParsingErrors.unrecognizedNotMatchedBySourceActionError(clause)

Review Comment:
   If this means a bug, we should throw `SparkException.internalError`
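   
   A sketch of that suggestion (the message text is hypothetical):
   ```scala
   throw SparkException.internalError(
     s"Unrecognized matched action in MERGE INTO: ${clause.getText}")
   ```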



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] vinodkc commented on a diff in pull request #38263: [SPARK-40692][SQL] Support data masking built-in function 'mask_hash'

2022-10-28 Thread GitBox


vinodkc commented on code in PR #38263:
URL: https://github.com/apache/spark/pull/38263#discussion_r1008128017


##
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:
##
@@ -3950,6 +3950,14 @@ object SQLConf {
 .checkValues(ErrorMessageFormat.values.map(_.toString))
 .createWithDefault(ErrorMessageFormat.PRETTY.toString)
 
+  val SPARK_MASKING_ALGO = buildConf("spark.sql.masking.algo")

Review Comment:
   Yes, we can add a second boolean parameter to control the fips mode. I'll 
update the code



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cloud-fan commented on a diff in pull request #38400: [SPARK-40921][SQL] Add WHEN NOT MATCHED BY SOURCE clause to MERGE INTO

2022-10-28 Thread GitBox


cloud-fan commented on code in PR #38400:
URL: https://github.com/apache/spark/pull/38400#discussion_r1008130147


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala:
##
@@ -1563,13 +1563,29 @@ class Analyzer(override val catalogManager: 
CatalogManager)
 }
 InsertAction(
   resolvedInsertCondition,
-  resolveAssignments(assignments, m, 
resolveValuesWithSourceOnly = true))
+  resolveAssignments(assignments, m, resolveValuesFrom = 
sourceTable))
   case o => o
 }
+val newNotMatchedBySourceActions = m.notMatchedBySourceActions.map 
{
+  case DeleteAction(deleteCondition) =>
+val resolvedDeleteCondition = deleteCondition.map(
+  resolveExpressionByPlanChildren(_, Project(Nil, 
targetTable)))

Review Comment:
   nit: we can call `resolveExpressionByPlanOutput`, then we don't need to wrap 
the plan with `Project(Nil, ...)`
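   
   A sketch of the nit (assuming the existing `resolveExpressionByPlanOutput` 
helper):
   ```scala
   val resolvedDeleteCondition = deleteCondition.map(
     resolveExpressionByPlanOutput(_, targetTable))
   ```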



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] vinodkc commented on a diff in pull request #38263: [SPARK-40692][SQL] Support data masking built-in function 'mask_hash'

2022-10-28 Thread GitBox


vinodkc commented on code in PR #38263:
URL: https://github.com/apache/spark/pull/38263#discussion_r1008128017


##
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:
##
@@ -3950,6 +3950,14 @@ object SQLConf {
 .checkValues(ErrorMessageFormat.values.map(_.toString))
 .createWithDefault(ErrorMessageFormat.PRETTY.toString)
 
+  val SPARK_MASKING_ALGO = buildConf("spark.sql.masking.algo")

Review Comment:
   Yes, we can add a third boolean parameter to control the FIPS mode. I'll 
update the code.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cloud-fan commented on a diff in pull request #38400: [SPARK-40921][SQL] Add WHEN NOT MATCHED BY SOURCE clause to MERGE INTO

2022-10-28 Thread GitBox


cloud-fan commented on code in PR #38400:
URL: https://github.com/apache/spark/pull/38400#discussion_r1008124090


##
core/src/main/resources/error/error-classes.json:
##
@@ -801,6 +819,21 @@
 ],
 "sqlState" : "42000"
   },
+  "UNRECOGNIZED_MATCHED_ACTION" : {
+"message" : [
+  "Unrecognized matched action: ."
+]
+  },
+  "UNRECOGNIZED_NOT_MATCHED_ACTION" : {

Review Comment:
   ditto, `UNRECOGNIZED_NOT_MATCHED_BY_TARGET_ACTION`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cloud-fan commented on a diff in pull request #38400: [SPARK-40921][SQL] Add WHEN NOT MATCHED BY SOURCE clause to MERGE INTO

2022-10-28 Thread GitBox


cloud-fan commented on code in PR #38400:
URL: https://github.com/apache/spark/pull/38400#discussion_r1008123172


##
core/src/main/resources/error/error-classes.json:
##
@@ -602,6 +602,24 @@
 ],
 "sqlState" : "42000"
   },
+  "NON_LAST_MATCHED_CLAUSE_OMIT_CONDITION" : {
+"message" : [
+  "When there are more than one MATCHED clauses in a MERGE statement, only 
the last MATCHED clause can omit the condition."
+],
+"sqlState" : "42000"
+  },
+  "NON_LAST_NOT_MATCHED_BY_SOURCE_CLAUSE_OMIT_CONDITION" : {
+"message" : [
+  "When there are more than one NOT MATCHED BY SOURCE clauses in a MERGE 
statement, only the last NOT MATCHED BY SOURCE clause can omit the condition."
+],
+"sqlState" : "42000"
+  },
+  "NON_LAST_NOT_MATCHED_CLAUSE_OMIT_CONDITION" : {

Review Comment:
   Is it more accurate to name it 
`NON_LAST_NOT_MATCHED_BY_TARGET_CLAUSE_OMIT_CONDITION`?



##
core/src/main/resources/error/error-classes.json:
##
@@ -602,6 +602,24 @@
 ],
 "sqlState" : "42000"
   },
+  "NON_LAST_MATCHED_CLAUSE_OMIT_CONDITION" : {
+"message" : [
+  "When there are more than one MATCHED clauses in a MERGE statement, only 
the last MATCHED clause can omit the condition."
+],
+"sqlState" : "42000"
+  },
+  "NON_LAST_NOT_MATCHED_BY_SOURCE_CLAUSE_OMIT_CONDITION" : {
+"message" : [
+  "When there are more than one NOT MATCHED BY SOURCE clauses in a MERGE 
statement, only the last NOT MATCHED BY SOURCE clause can omit the condition."
+],
+"sqlState" : "42000"
+  },
+  "NON_LAST_NOT_MATCHED_CLAUSE_OMIT_CONDITION" : {

Review Comment:
   Is it more accurate to name it 
`NON_LAST_NOT_MATCHED_BY_TARGET_CLAUSE_OMIT_CONDITION`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] MaxGekk commented on a diff in pull request #38286: [SPARK-40657] Add support for Java classes in Protobuf functions

2022-10-28 Thread GitBox


MaxGekk commented on code in PR #38286:
URL: https://github.com/apache/spark/pull/38286#discussion_r1008102188


##
connector/protobuf/src/test/scala/org/apache/spark/sql/protobuf/ProtobufCatalystDataConversionSuite.scala:
##
@@ -24,6 +24,7 @@ import org.apache.spark.sql.{RandomDataGenerator, Row}
 import org.apache.spark.sql.catalyst.{CatalystTypeConverters, InternalRow, 
NoopFilters, OrderedFilters, StructFilters}
 import org.apache.spark.sql.catalyst.expressions.{ExpressionEvalHelper, 
GenericInternalRow, Literal}
 import org.apache.spark.sql.catalyst.util.{ArrayBasedMapData, 
GenericArrayData, MapData}
+import org.apache.spark.sql.protobuf.protos.CatalystTypes.BytesMsg

Review Comment:
   I am struggling to generate this file in IDEA:
   (screenshot: https://user-images.githubusercontent.com/1580697/198630233-33341e69-276f-4505-9d52-874ddd1e6bc1.png)
   
   Is there any way to generate all the required dependencies?
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins commented on pull request #38427: [SPARK-40950][CORE] Fix isRemoteAddressMaxedOut performance overhead on scala 2.13

2022-10-28 Thread GitBox


AmplabJenkins commented on PR #38427:
URL: https://github.com/apache/spark/pull/38427#issuecomment-1295039735

   Can one of the admins verify this patch?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins commented on pull request #38428: [SPARK-40912][CORE][WIP] Overhead of Exceptions in DeserializationStream

2022-10-28 Thread GitBox


AmplabJenkins commented on PR #38428:
URL: https://github.com/apache/spark/pull/38428#issuecomment-1295039656

   Can one of the admins verify this patch?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] MaxGekk commented on pull request #38350: [SPARK-40752][SQL] Migrate type check failures of misc expressions onto error classes

2022-10-28 Thread GitBox


MaxGekk commented on PR #38350:
URL: https://github.com/apache/spark/pull/38350#issuecomment-1295031180

   @panbingkun Please remove the duplicated list markers in the PR's description:
   ```
   - 1.Add new UT
   - 2.Update existed UT
   - 3.Pass GA.
   ```
   to
   ```
   1. Add new UT
   2. Update existed UT
   3. Pass GA.
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] MaxGekk commented on a diff in pull request #38350: [SPARK-40752][SQL] Migrate type check failures of misc expressions onto error classes

2022-10-28 Thread GitBox


MaxGekk commented on code in PR #38350:
URL: https://github.com/apache/spark/pull/38350#discussion_r1008092467


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/SortOrder.scala:
##
@@ -72,7 +74,11 @@ case class SortOrder(
 if (RowOrdering.isOrderable(dataType)) {
   TypeCheckResult.TypeCheckSuccess
 } else {
-  TypeCheckResult.TypeCheckFailure(s"cannot sort data type 
${dataType.catalogString}")
+  DataTypeMismatch(
+errorSubClass = "UNSUPPORTED_INPUT_TYPE",

Review Comment:
   Could you re-use the existing error class `INVALID_ORDERING_TYPE`. Just call 
`TypeUtils.checkForOrderingExpr(...)` here.
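   
   A sketch of that reuse (assuming `checkForOrderingExpr` keeps its current 
`(dataType, caller)` shape):
   ```scala
   override def checkInputDataTypes(): TypeCheckResult =
     TypeUtils.checkForOrderingExpr(dataType, prettyName)
   ```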



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] MaxGekk commented on pull request #38350: [SPARK-40752][SQL] Migrate type check failures of misc expressions onto error classes

2022-10-28 Thread GitBox


MaxGekk commented on PR #38350:
URL: https://github.com/apache/spark/pull/38350#issuecomment-1295009404

   @panbingkun
   This pr **replace** TypeCheckFailure -> This pr **replaces** TypeCheckFailure


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] derhagen commented on a diff in pull request #38389: [MINOR][DOCS][PYTHON] Fix the truncation of API reference in several DataTypes

2022-10-28 Thread GitBox


derhagen commented on code in PR #38389:
URL: https://github.com/apache/spark/pull/38389#discussion_r1008071665


##
python/pyspark/sql/types.py:
##
@@ -350,21 +350,21 @@ class FloatType(FractionalType, 
metaclass=DataTypeSingleton):
 
 
 class ByteType(IntegralType):
-"""Byte data type, i.e. a signed integer in a single byte."""
+"""Byte data type, i.e.\  a signed integer in a single byte."""

Review Comment:
   Starting the docstrings with an 'r' string prefix might be the right thing 
to do here, but I'm travelling right now. I'll test it next week if I find the 
time.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] srowen commented on pull request #38352: [SPARK-40801][BUILD][3.2] Upgrade `Apache commons-text` to 1.10

2022-10-28 Thread GitBox


srowen commented on PR #38352:
URL: https://github.com/apache/spark/pull/38352#issuecomment-1295001164

   The latter.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] sithmein commented on pull request #38352: [SPARK-40801][BUILD][3.2] Upgrade `Apache commons-text` to 1.10

2022-10-28 Thread GitBox


sithmein commented on PR #38352:
URL: https://github.com/apache/spark/pull/38352#issuecomment-1294996843

   Is Spark actually affected by the problem in the `StringLookup` class, or is this dependency update only to get rid of all the alarms raised by code scanners?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cloud-fan commented on a diff in pull request #38395: [SPARK-40917][SQL] Add a dedicated logical plan for `Summary`

2022-10-28 Thread GitBox


cloud-fan commented on code in PR #38395:
URL: https://github.com/apache/spark/pull/38395#discussion_r1008046680


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala:
##
@@ -2100,3 +2100,53 @@ object AsOfJoin {
 }
   }
 }
+
+
+/**
+ * A logical plan for summary.
+ */
+case class Summary(

Review Comment:
   They are not functions, in my opinion, but more like dataframe operations; e.g., we won't call `df.filter` a DataFrame function.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cloud-fan closed pull request #38410: [SPARK-40932][CORE] Fix issue messages for allGather are overridden

2022-10-28 Thread GitBox


cloud-fan closed pull request #38410: [SPARK-40932][CORE] Fix issue messages 
for allGather are overridden
URL: https://github.com/apache/spark/pull/38410


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cloud-fan commented on pull request #38410: [SPARK-40932][CORE] Fix issue messages for allGather are overridden

2022-10-28 Thread GitBox


cloud-fan commented on PR #38410:
URL: https://github.com/apache/spark/pull/38410#issuecomment-1294978376

   thanks, merging to master/3.3!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] MaxGekk commented on a diff in pull request #38344: [SPARK-40777][SQL][PROTOBUF] Protobuf import support and move error-classes.

2022-10-28 Thread GitBox


MaxGekk commented on code in PR #38344:
URL: https://github.com/apache/spark/pull/38344#discussion_r1008034311


##
sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryCompilationErrors.scala:
##
@@ -3210,4 +3211,190 @@ private[sql] object QueryCompilationErrors extends QueryErrorsBase {
   messageParameters = Map("expression" -> toSQLExpr(expression))
 )
   }
+
+  def cannotConvertProtobufTypeToSqlTypeError(
+  protobufColumn: String,
+  sqlColumn: String,
+  protobufType: String,
+  sqlType: DataType): Throwable = {
+new AnalysisException(
+  errorClass = "PROTOBUF_TYPE_TO_SQL_TYPE_ERROR",
+  messageParameters = Map(
+"protobufColumn" -> protobufColumn,
+"sqlColumn" -> toSQLId(sqlColumn),
+"protobufType" -> protobufType,
+"sqlType" -> toSQLType(sqlType)))
+  }
+
+  def cannotConvertCatalystTypeToProtobufTypeError(
+  sqlColumn: String,
+  protobufColumn: String,
+  sqlType: DataType,
+  protobufType: String): Throwable = {
+new AnalysisException(
+  errorClass = "SQL_TYPE_TO_PROTOBUF_TYPE_ERROR",
+  messageParameters = Map(
+"sqlColumn" -> toSQLId(sqlColumn),
+"protobufColumn" -> protobufColumn,
+"sqlType" -> toSQLType(sqlType),
+"protobufType" -> protobufType))
+  }
+
+  def cannotConvertCatalystTypeToProtobufEnumTypeError(
+  sqlColumn: String,
+  protobufColumn: String,
+  data: String,
+  enumString: String): Throwable = {
+new AnalysisException(
+  errorClass = "CATALYST_TYPE_TO_PROTOBUF_ENUM_TYPE_ERROR",
+  messageParameters = Map(
+"sqlColumn" -> sqlColumn,
+"protobufColumn" -> protobufColumn,
+"data" -> data,
+"enumString" -> enumString))
+  }
+
+  def cannotConvertProtobufTypeToCatalystTypeError(
+  protobufType: String,
+  sqlType: DataType,
+  cause: Throwable): Throwable = {
+new AnalysisException(
+  errorClass = "PROTOBUF_TYPE_TO_CATALYST_TYPE_ERROR",
+  messageParameters = Map(
+"protobufType" -> protobufType,
+"toType" -> toSQLType(sqlType)),
+  cause = Some(cause.getCause))
+  }
+
+  def cannotConvertSqlTypeToProtobufError(
+  protobufType: String,
+  sqlType: DataType,
+  cause: Throwable): Throwable = {
+new AnalysisException(
+  errorClass = "UNABLE_TO_CONVERT_TO_PROTOBUF_TYPE",
+  messageParameters = Map(
+"protobufType" -> protobufType,
+"toType" -> toSQLType(sqlType)),
+  cause = Some(cause.getCause))
+  }
+
+  def protobufTypeUnsupportedYetError(protobufType: String): Throwable = {
+new AnalysisException(
+  errorClass = "PROTOBUF_TYPE_NOT_SUPPORT_ERROR",
+  messageParameters = Map("protobufType" -> protobufType))
+  }
+
+  def unknownProtobufMessageTypeError(
+  descriptorName: String,
+  containingType: String): Throwable = {
+new AnalysisException(
+  errorClass = "UNKNOWN_PROTOBUF_MESSAGE_TYPE",
+  messageParameters = Map(
+"descriptorName" -> descriptorName,
+"containingType" -> containingType))
+  }
+
+  def cannotFindCatalystTypeInProtobufSchemaError(catalystFieldPath: String): Throwable = {
+new AnalysisException(
+  errorClass = "NO_CATALYST_TYPE_IN_PROTOBUF_SCHEMA",
+  messageParameters = Map("catalystFieldPath" -> catalystFieldPath))
+  }
+
+  def cannotFindProtobufFieldInCatalystError(field: String): Throwable = {
+new AnalysisException(
+  errorClass = "PROTOBUF_FIELD_MISSING_IN_CATALYST_SCHEMA",
+  messageParameters = Map("field" -> field))
+  }
+
+  def protobufFieldMatchError(field: String,
+  protobufSchema: String,
+  matchSize: String,
+  matches: String): Throwable = {
+new AnalysisException(
+  errorClass = "PROTOBUF_FIELD_MISSING_ERROR",
+  messageParameters = Map(
+"field" -> field,
+"protobufSchema" -> protobufSchema,
+"matchSize" -> matchSize,
+"matches" -> matches))
+  }
+
+  def unableToLocateProtobufMessageError(messageName: String): Throwable = {
+new AnalysisException(
+  errorClass = "UNABLE_TO_LOCATE_PROTOBUF_MESSAGE_ERROR",
+  messageParameters = Map("messageName" -> messageName))
+  }
+
+  def descrioptorParseError(cause: Throwable): Throwable = {
+new AnalysisException(
+  errorClass = "PROTOBUF_DESCRIPTOR_ERROR",
+  messageParameters = Map.empty(),
+  cause = Some(cause.getCause))
+  }
+
+  def cannotFindDescriptorFileError(filePath: String, cause: Throwable): Throwable = {
+new AnalysisException(
+  errorClass = "CANNOT_FIND_PROTOBUF_DESCRIPTOR_FILE_ERROR",
+  messageParameters = Map("filePath" -> filePath),
+  cause = Some(cause.getCause))
+  }
+
+  def noProtobufMessageTypeReturnError(descriptorName: String): Throwable = {
+new AnalysisException(
+  errorClass = "NO_PROTOBUF_MESSAGE_TYPE_ERROR",
+  messageParameters = Map("descriptorName" -> descriptorName))
+  

[GitHub] [spark] eejbyfeldt opened a new pull request, #38428: [SPARK-40912][CORE][WIP] Overhead of Exceptions in DeserializationStream

2022-10-28 Thread GitBox


eejbyfeldt opened a new pull request, #38428:
URL: https://github.com/apache/spark/pull/38428

   
   
   ### What changes were proposed in this pull request?
   This PR avoids exceptions in the implementation of KryoDeserializationStream.
   
   
   
   ### Why are the changes needed?
   Using exceptions for end of stream is slow, especially for small streams. It is also problematic because the exception caught in the KryoDeserializationStream could equally be caused by corrupt data, which would just be ignored in the current implementation.
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   Yes, some methods on KryoDeserializationStream no longer raise EOFException.
   
   
   
   
   ### How was this patch tested?
   Existing tests.
   
   
   This PR only changes KryoDeserializationStream as a proof of concept. If this is the direction we want to go, we should probably change DeserializationStream instead so that the interface is consistent.
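   To make the alternative concrete, here is a hypothetical exception-free read shape (hedged: `OptionReadStream` and `readNext` are illustrative names, not this PR's API, and the availability of `eof()` on Kryo's `Input` is an assumption):

   ```scala
   import com.esotericsoftware.kryo.Kryo
   import com.esotericsoftware.kryo.io.Input

   // Sketch only: report end of stream as None instead of throwing
   // EOFException, so a catch-all on EOF can no longer swallow errors
   // that were actually caused by corrupt data.
   class OptionReadStream(kryo: Kryo, input: Input) {
     def readNext[T](): Option[T] =
       if (input.eof()) None
       else Some(kryo.readClassAndObject(input).asInstanceOf[T])
   }
   ```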
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cloud-fan commented on pull request #38397: [SPARK-40918][SQL] Mismatch between FileSourceScanExec and Orc and ParquetFileFormat on producing columnar output

2022-10-28 Thread GitBox


cloud-fan commented on PR #38397:
URL: https://github.com/apache/spark/pull/38397#issuecomment-1294971438

   thanks, merging to master!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cloud-fan commented on pull request #38397: [SPARK-40918][SQL] Mismatch between FileSourceScanExec and Orc and ParquetFileFormat on producing columnar output

2022-10-28 Thread GitBox


cloud-fan commented on PR #38397:
URL: https://github.com/apache/spark/pull/38397#issuecomment-1294971845

   Unfortunately it conflicts with 3.3; @juliuszsompolski, could you open a backport PR? Thanks!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cloud-fan closed pull request #38397: [SPARK-40918][SQL] Mismatch between FileSourceScanExec and Orc and ParquetFileFormat on producing columnar output

2022-10-28 Thread GitBox


cloud-fan closed pull request #38397: [SPARK-40918][SQL] Mismatch between 
FileSourceScanExec and Orc and ParquetFileFormat on producing columnar output
URL: https://github.com/apache/spark/pull/38397


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] zhengruifeng commented on pull request #38426: [SPARK-40951][PYSPARK][TESTS] `pyspark-connect` tests should be skipped if `pandas` doesn't exist

2022-10-28 Thread GitBox


zhengruifeng commented on PR #38426:
URL: https://github.com/apache/spark/pull/38426#issuecomment-1294931880

   LGTM


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] zhengruifeng commented on pull request #38424: [SPARK-40953][CONNECT][PYTHON] Fix `Dataframe.head`

2022-10-28 Thread GitBox


zhengruifeng commented on PR #38424:
URL: https://github.com/apache/spark/pull/38424#issuecomment-1294926943

   @dongjoon-hyun thanks for the reminder


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] MaxGekk commented on pull request #38170: [WIP][SPARK-40663][SQL] Migrate execution errors onto error classes: _LEGACY_ERROR_TEMP_2201-2225

2022-10-28 Thread GitBox


MaxGekk commented on PR #38170:
URL: https://github.com/apache/spark/pull/38170#issuecomment-1294926445

   @itholic Please, resolve conflicts and fix tests.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] zhengruifeng commented on pull request #38420: [SPARK-40947][PS][INFRA] Upgrade pandas to 1.5.1

2022-10-28 Thread GitBox


zhengruifeng commented on PR #38420:
URL: https://github.com/apache/spark/pull/38420#issuecomment-1294924307

   seems some tests failed again...


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] MaxGekk commented on a diff in pull request #38422: [SPARK-40948][SQL] Introduce new error class: DATA_PATH_NOT_EXIST

2022-10-28 Thread GitBox


MaxGekk commented on code in PR #38422:
URL: https://github.com/apache/spark/pull/38422#discussion_r1007996419


##
core/src/main/resources/error/error-classes.json:
##
@@ -98,6 +98,11 @@
   "The value  () cannot be converted to  because it 
is malformed. Correct the value as per the syntax, or change its format. Use 
 to tolerate malformed input and return NULL instead."
 ]
   },
+  "DATA_PATH_NOT_EXIST" : {

Review Comment:
   How about `NON_EXISTENT_PATH`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] MaxGekk closed pull request #38421: [SPARK-40889][SQL][TESTS] Check error classes in PlanResolutionSuite

2022-10-28 Thread GitBox


MaxGekk closed pull request #38421: [SPARK-40889][SQL][TESTS] Check error 
classes in PlanResolutionSuite
URL: https://github.com/apache/spark/pull/38421


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] MaxGekk commented on pull request #38421: [SPARK-40889][SQL][TESTS] Check error classes in PlanResolutionSuite

2022-10-28 Thread GitBox


MaxGekk commented on PR #38421:
URL: https://github.com/apache/spark/pull/38421#issuecomment-1294918032

   +1, LGTM. Merging to master.
   Thank you, @panbingkun and @itholic for review.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] MaxGekk closed pull request #38413: [SPARK-40936][SQL][TESTS] Refactor `AnalysisTest#assertAnalysisErrorClass` by reusing the `SparkFunSuite#checkError`

2022-10-28 Thread GitBox


MaxGekk closed pull request #38413: [SPARK-40936][SQL][TESTS] Refactor 
`AnalysisTest#assertAnalysisErrorClass` by reusing the 
`SparkFunSuite#checkError`
URL: https://github.com/apache/spark/pull/38413


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] MaxGekk commented on pull request #38413: [SPARK-40936][SQL][TESTS] Refactor `AnalysisTest#assertAnalysisErrorClass` by reusing the `SparkFunSuite#checkError`

2022-10-28 Thread GitBox


MaxGekk commented on PR #38413:
URL: https://github.com/apache/spark/pull/38413#issuecomment-1294915767

   +1, LGTM. Merging to master.
   Thank you, @LuciferYang.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] eejbyfeldt opened a new pull request, #38427: [SPARK-40950]

2022-10-28 Thread GitBox


eejbyfeldt opened a new pull request, #38427:
URL: https://github.com/apache/spark/pull/38427

   
   
   ### What changes were proposed in this pull request?
   In FetchBlockRequest we currently store a `Seq[FetchBlockInfo]`, and as part of the function `isRemoteAddressMaxedOut` (probably other places as well, but this is the function that showed up in my profiling) we use the length of this Seq. In Scala 2.12 `Seq` is an alias for `scala.collection.Seq`, but in 2.13 it is an alias for `scala.collection.immutable.Seq`. This means that when, for example, we call `toSeq` on an `ArrayBuffer`, in 2.12 we do nothing and the `blocks` in the `FetchRequest` will be backed by something with a cheap `length`, but in 2.13 we end up copying the data to a `List`.
   
   This PR solves this by changing the `Seq` to an `IndexedSeq`, making the expectation of a cheap length function explicit. This means that in some places we will do an extra copy in Scala 2.13 compared to 2.12 (this was also the case before this PR). If we wanted to avoid this copy, we should instead change it to use `scala.collection.IndexedSeq`, so we would have the same types in both 2.13 and 2.12.
   
   
   
   ### Why are the changes needed?
   The performance of ShuffleBlockFetcherIterator is much worse on Scala 2.13 than on 2.12. I have seen cases where the overhead of repeatedly calculating the length is as much as 20% of CPU time (and it could probably be even worse for larger shuffles).
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   No. I think the interface changes are only on private classes.
   
   
   
   ### How was this patch tested?
   Existing specs. 
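   As a standalone illustration of the 2.12/2.13 difference described above (hedged: demo code, not part of this PR):

   ```scala
   import scala.collection.mutable.ArrayBuffer

   object SeqLengthDemo extends App {
     val buf = ArrayBuffer.tabulate(1000)(identity)

     // Scala 2.12: Seq = scala.collection.Seq, so buf.toSeq is a no-op and
     // keeps ArrayBuffer's O(1) length.
     // Scala 2.13: Seq = scala.collection.immutable.Seq, so buf.toSeq copies
     // the elements into a List, whose length is O(n) -- costly when queried
     // repeatedly, as in isRemoteAddressMaxedOut.
     val blocks: Seq[Int] = buf.toSeq

     // Storing an IndexedSeq makes the cheap-length expectation explicit on
     // both Scala versions (at the price of one copy on 2.13):
     val indexed: IndexedSeq[Int] = buf.toIndexedSeq
     println(s"${blocks.length} / ${indexed.length}")
   }
   ```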
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] MaxGekk commented on pull request #38422: [SPARK-40948][SQL] Introduce new error class: DATA_PATH_NOT_EXIST

2022-10-28 Thread GitBox


MaxGekk commented on PR #38422:
URL: https://github.com/apache/spark/pull/38422#issuecomment-1294872638

   @itholic Could you add a test for this new error class, please?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] Yikun commented on pull request #38425: [SPARK-40229][PS][TEST][FOLLOWUP] Add `openpyxl` to `requirements.txt`

2022-10-28 Thread GitBox


Yikun commented on PR #38425:
URL: https://github.com/apache/spark/pull/38425#issuecomment-1294863094

   @dongjoon-hyun @itholic Thanks, merged to master.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] Yikun closed pull request #38425: [SPARK-40229][PS][TEST][FOLLOWUP] Add `openpyxl` to `requirements.txt`

2022-10-28 Thread GitBox


Yikun closed pull request #38425: [SPARK-40229][PS][TEST][FOLLOWUP] Add 
`openpyxl` to `requirements.txt`
URL: https://github.com/apache/spark/pull/38425


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] juliuszsompolski commented on pull request #38397: [SPARK-40918][SQL] Mismatch between FileSourceScanExec and Orc and ParquetFileFormat on producing columnar output

2022-10-28 Thread GitBox


juliuszsompolski commented on PR #38397:
URL: https://github.com/apache/spark/pull/38397#issuecomment-1294826483

   Thanks. I updated the title after fixing the Orc side, but forgot to update the description, which was still describing only Parquet.
   This bug exists in branch-3.3 as well, btw; if it doesn't merge cleanly I can open a separate PR later.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] shrprasa commented on pull request #37880: [SPARK-39399] [CORE] [K8S]: Fix proxy-user authentication for Spark on k8s in cluster deploy mode

2022-10-28 Thread GitBox


shrprasa commented on PR #37880:
URL: https://github.com/apache/spark/pull/37880#issuecomment-1294815181

   ping @gaborgsomogyi @dongjoon-hyun @HyukjinKwon @squito 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] dongjoon-hyun commented on pull request #38426: [SPARK-40951][PYSPARK][TESTS] `pyspark-connect` tests should be skipped if `pandas` doesn't exist

2022-10-28 Thread GitBox


dongjoon-hyun commented on PR #38426:
URL: https://github.com/apache/spark/pull/38426#issuecomment-1294811494

   Could you review this, @HyukjinKwon ?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] dongjoon-hyun opened a new pull request, #38426: [SPARK-40951][PYSPARK][TESTS] pyspark-connect tests should be skipped if pandas doesn't exist

2022-10-28 Thread GitBox


dongjoon-hyun opened a new pull request, #38426:
URL: https://github.com/apache/spark/pull/38426

   … 
   
   
   
   ### What changes were proposed in this pull request?
   
   
   
   ### Why are the changes needed?
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   
   ### How was this patch tested?
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] dongjoon-hyun commented on pull request #38425: [SPARK-40229][PS][TEST][FOLLOWUP] Add `openpyxl` to `requirements.txt`

2022-10-28 Thread GitBox


dongjoon-hyun commented on PR #38425:
URL: https://github.com/apache/spark/pull/38425#issuecomment-1294793919

   Thank you, @itholic .


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] itholic commented on pull request #38425: [SPARK-40229][PS][TEST][FOLLOWUP] Add `openpyxl` to `requirements.txt`

2022-10-28 Thread GitBox


itholic commented on PR #38425:
URL: https://github.com/apache/spark/pull/38425#issuecomment-1294763957

   LGTM when the tests pass. Thanks!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] mcdull-zhang commented on pull request #36301: [SPARK-21697][SQL] NPE & ExceptionInInitializerError trying to load UDF from HDFS

2022-10-28 Thread GitBox


mcdull-zhang commented on PR #36301:
URL: https://github.com/apache/spark/pull/36301#issuecomment-1294759394

   > How to reproduce this problem?
   > 
   > 1. Make sure `commons-logging` is in front of the classpath.
   >You can use `SPARK_PRINT_LAUNCH_COMMAND` to get the startup command and 
manually modify the classpath.
   > 2. Add two hdfs jars with different names.
   >`ADD JAR hdfs://`
   > 3. Create a temporary function, the class of udf does not matter whether 
it exists or not.
   >`CREATE temporary FUNCTION  AS 'xx' `
   > 4. An error occurred.
   
   Why do you need to add two jars in step 2?
   I would expect the problem to occur even when adding just one jar.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] juliuszsompolski commented on pull request #38397: [SPARK-40918][SQL] Mismatch between FileSourceScanExec and Orc and ParquetFileFormat on producing columnar output

2022-10-28 Thread GitBox


juliuszsompolski commented on PR #38397:
URL: https://github.com/apache/spark/pull/38397#issuecomment-1294754916

   > Given that OrcFileFormat has no issue like _metadata columns
   
   @dongjoon-hyun I think OrcFileFormat has exactly the same issue as ParquetFileFormat, as @cloud-fan pointed out. When a column like `_metadata.file_path` was requested for OrcFileFormat, it would also count that column in FileSourceScanExec.supportsBatch, but not count it in OrcFileFormat.supportsBatch during buildReaderWithPartitionValues. The code changes I made to OrcFileFormat exactly mirror what I did to ParquetFileFormat.
   I updated the description to say "`ParquetFileFormat` or `OrcFileFormat`" in various places, but the issue seems exactly the same.
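   To illustrate the mismatch in the abstract (hedged: simplified demo predicates and names, not Spark's actual classes or signatures):

   ```scala
   object SupportsBatchMismatchDemo extends App {
     // Hypothetical per-column predicate standing in for the real checks.
     def columnar(col: String): Boolean = !col.startsWith("_metadata")

     // Both sides must answer "columnar batches?" over the same effective
     // schema, including metadata columns such as _metadata.file_path.
     def scanExecSupportsBatch(data: Seq[String], meta: Seq[String]): Boolean =
       (data ++ meta).forall(columnar)

     def fileFormatSupportsBatch(data: Seq[String]): Boolean =
       data.forall(columnar) // the bug's shape: metadata columns not counted

     val data = Seq("id", "value")
     val meta = Seq("_metadata.file_path")
     // Disagreement: the plan expects rows while the reader emits batches.
     println(scanExecSupportsBatch(data, meta)) // false
     println(fileFormatSupportsBatch(data))     // true
   }
   ```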


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] dongjoon-hyun commented on pull request #38425: [SPARK-40229][PS][TEST][FOLLOWUP] Add `openpyxl` to `requirements.txt`

2022-10-28 Thread GitBox


dongjoon-hyun commented on PR #38425:
URL: https://github.com/apache/spark/pull/38425#issuecomment-1294754038

   cc @itholic , @HyukjinKwon , @Yikun 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org


