[GitHub] [spark] huaxingao opened a new pull request, #38434: [SPARK-40946][SQL] Add a new DataSource V2 interface SupportsPushDownClusterKeys
huaxingao opened a new pull request, #38434: URL: https://github.com/apache/spark/pull/38434

### What changes were proposed in this pull request?

```
/**
 * A mix-in interface for {@link ScanBuilder}. Data sources can implement this interface to
 * push down all the join or aggregate keys to data sources. A return value of true indicates
 * that the data source will return input partitions (via planInputPartitions) following the
 * clustering keys. Otherwise, a false return value indicates the data source doesn't make
 * such a guarantee, even though it may still report a partitioning that may or may not
 * be compatible with the given clustering keys, and it's Spark's responsibility to group
 * the input partitions where applicable.
 *
 * @since 3.4.0
 */
@Evolving
public interface SupportsPushDownClusterKeys extends ScanBuilder {
```

### Why are the changes needed?

Pass down the information of join keys to v2 data sources so the data sources can decide how to combine the input splits according to the join keys.

### Does this PR introduce _any_ user-facing change?

Yes, a new interface `SupportsPushDownClusterKeys`.

### How was this patch tested?

New tests.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
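As a rough illustration of the guarantee described above (hypothetical code, not Spark's actual implementation): when the source returns false, Spark itself must group the input splits by the clustering keys before the partitioning can be relied on. The `group_partitions_by_keys` helper and the split dictionaries below are invented for illustration only.

```python
from collections import defaultdict

def group_partitions_by_keys(partitions, key_fn):
    """Group input partitions by a clustering-key function (illustrative sketch)."""
    grouped = defaultdict(list)
    for p in partitions:
        grouped[key_fn(p)].append(p)
    return dict(grouped)

# Hypothetical input splits reported by a data source, keyed by a join bucket.
splits = [
    {"bucket": 0, "file": "a"},
    {"bucket": 1, "file": "b"},
    {"bucket": 0, "file": "c"},
]

by_bucket = group_partitions_by_keys(splits, lambda p: p["bucket"])
assert by_bucket[0] == [{"bucket": 0, "file": "a"}, {"bucket": 0, "file": "c"}]
```

If the source instead returns true, this grouping step is unnecessary because the partitions already arrive clustered on the keys.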
[GitHub] [spark] AmplabJenkins commented on pull request #38418: [SPARK-40944][SQL] Relax ordering constraint for CREATE TABLE column options
AmplabJenkins commented on PR #38418: URL: https://github.com/apache/spark/pull/38418#issuecomment-1295679598

Can one of the admins verify this patch?
[GitHub] [spark] AmplabJenkins commented on pull request #38415: [SPARK-40938][CONNECT] Support Alias for every type of Relation
AmplabJenkins commented on PR #38415: URL: https://github.com/apache/spark/pull/38415#issuecomment-1295679630

Can one of the admins verify this patch?
[GitHub] [spark] github-actions[bot] closed pull request #37219: [WIP][SPARK-39794][PYTHON] Introduce parametric singleton for DataType
github-actions[bot] closed pull request #37219: [WIP][SPARK-39794][PYTHON] Introduce parametric singleton for DataType URL: https://github.com/apache/spark/pull/37219
[GitHub] [spark] Kimahriman commented on pull request #38229: [SPARK-40775][SQL] Fix duplicate description entries for V2 file scans
Kimahriman commented on PR #38229: URL: https://github.com/apache/spark/pull/38229#issuecomment-1295630445

cc @dongjoon-hyun @cloud-fan @maropu, you were on the original PR review for the metadata addition.
[GitHub] [spark] warrenzhu25 commented on pull request #38233: [SPARK-40781][CORE] Explain exit code 137 as killed due to OOM
warrenzhu25 commented on PR #38233: URL: https://github.com/apache/spark/pull/38233#issuecomment-1295617623

> Can you point me to the reference where yarn mode is using `137` for OOM? IIRC we simply kill the executor process in case there is an OOM - which usually results in `143` as the return code ...
>
> ```
> public class Test {
>
>     public static void main(String[] args) {
>         long[] arr = new long[1024 * 1024 * 999];
>     }
> }
>
> $ java -XX:OnOutOfMemoryError='kill %p' -Xmx12m -cp . Test ; echo Error code: $?
> ```
>
> This is coming from `128 + SIGTERM` - and SIGTERM is 15 on Linux. See here for more [1]
>
> [1] https://tldp.org/LDP/abs/html/exitcodes.html

My main goal is to make the exit code more readable in standalone mode, but this code is shared by both standalone and YARN. I see the YARN exit code 137 documented at https://aws.amazon.com/premiumsupport/knowledge-center/container-killed-on-request-137-emr/#:~:text=When%20a%20container%20(Spark%20executor,in%20narrow%20and%20wide%20transformations.
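For context on the 137-vs-143 distinction discussed above: shells report a child killed by signal N with exit code 128 + N, so SIGKILL (9, what a forced kill such as the Linux OOM killer sends) yields 137, while SIGTERM (15, a polite shutdown request) yields 143. A quick demonstration:

```shell
# A shell reports a child killed by signal N with exit code 128 + N.
sh -c 'kill -KILL $$'        # child kills itself with SIGKILL (9)
echo "SIGKILL exit code: $?" # prints 137 (128 + 9)
sh -c 'kill -TERM $$'        # child kills itself with SIGTERM (15)
echo "SIGTERM exit code: $?" # prints 143 (128 + 15)
```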
[GitHub] [spark] ben-zhang opened a new pull request, #38433: [SPARK-40943][SQL] Make the MSCK keyword optional in MSCK REPAIR TABLE commands
ben-zhang opened a new pull request, #38433: URL: https://github.com/apache/spark/pull/38433

### What changes were proposed in this pull request?

Make the `MSCK` keyword optional in `MSCK REPAIR TABLE` commands so that it can be omitted.

### Why are the changes needed?

The use of the keyword `MSCK`, meaning "metastore check", is arcane and does not add value to the command. Removing it makes the meaning of `REPAIR TABLE` commands clearer.

### Does this PR introduce _any_ user-facing change?

Yes. Previously, it was not possible to specify only `REPAIR TABLE` without `MSCK`; now it is possible. This PR makes no changes to the existing behaviour of the original `MSCK REPAIR TABLE` syntax.

### How was this patch tested?

Unit tests.
[GitHub] [spark] amaliujia commented on a diff in pull request #38393: [SPARK-40915][CONNECT] Improve `on` in Join in Python client
amaliujia commented on code in PR #38393: URL: https://github.com/apache/spark/pull/38393#discussion_r1008513991

## python/pyspark/sql/connect/plan.py:
@@ -537,7 +537,7 @@ def __init__(
     self,
     left: Optional["LogicalPlan"],
     right: "LogicalPlan",
-    on: "ColumnOrString",
+    on: Optional[Union[str, List[str], ColumnRef]],

Review Comment: Hmm, I gave it a try but failed to make it work for both mypy and the UTs together (not a Python expert): I found the tricky case is that `str` also seems to be considered a `Sequence` or `Iterable`.
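The ambiguity mentioned in the review comment is easy to reproduce: at runtime a plain `str` registers as both a `Sequence` and an `Iterable`, so any code path that accepts an iterable of column names will also happily accept a bare string. A minimal sketch of the usual workaround (the `normalize_on` helper below is hypothetical, not the actual Spark Connect code):

```python
from collections.abc import Iterable, Sequence

# A plain str satisfies both abstract base classes, which is what makes
# typing an argument as Union[str, List[str], ...] tricky for mypy.
assert isinstance("key", Sequence)
assert isinstance("key", Iterable)

def normalize_on(on):
    """Normalize a join condition to a list of column names (hypothetical helper)."""
    if on is None:
        return None
    if isinstance(on, str):  # must check str BEFORE the generic sequence case
        return [on]
    return list(on)

assert normalize_on("id") == ["id"]
assert normalize_on(["a", "b"]) == ["a", "b"]
```

Checking `isinstance(on, str)` before treating the argument as a sequence is the standard way to disambiguate the two cases at runtime.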
[GitHub] [spark] thejdeep commented on a diff in pull request #36165: [SPARK-36620][SHUFFLE] Add Push Based Shuffle client side metrics
thejdeep commented on code in PR #36165: URL: https://github.com/apache/spark/pull/36165#discussion_r1008470301 ## core/src/main/scala/org/apache/spark/status/storeTypes.scala: ## @@ -233,6 +243,38 @@ private[spark] class TaskDataWrapper( val shuffleLocalBytesRead: Long, @KVIndexParam(value = TaskIndexNames.SHUFFLE_READ_RECORDS, parent = TaskIndexNames.STAGE) val shuffleRecordsRead: Long, +@KVIndexParam( + value = TaskIndexNames.PUSH_BASED_SHUFFLE_CORRUPT_MERGED_BLOCK_CHUNKS, + parent = TaskIndexNames.STAGE) +val shuffleCorruptMergedBlockChunks: Long, +@KVIndexParam(value = TaskIndexNames.PUSH_BASED_SHUFFLE_FALLBACK_COUNT, + parent = TaskIndexNames.STAGE) +val shuffleFallbackCount: Long, +@KVIndexParam( + value = TaskIndexNames.PUSH_BASED_SHUFFLE_MERGED_REMOTE_BLOCKS, parent = TaskIndexNames.STAGE) +val shuffleMergedRemoteBlocksFetched: Long, +@KVIndexParam( + value = TaskIndexNames.PUSH_BASED_SHUFFLE_MERGED_LOCAL_BLOCKS, parent = TaskIndexNames.STAGE) +val shuffleMergedLocalBlocksFetched: Long, +@KVIndexParam( + value = TaskIndexNames.PUSH_BASED_SHUFFLE_MERGED_REMOTE_CHUNKS, parent = TaskIndexNames.STAGE) +val shuffleMergedRemoteChunksFetched: Long, +@KVIndexParam( + value = TaskIndexNames.PUSH_BASED_SHUFFLE_MERGED_LOCAL_CHUNKS, parent = TaskIndexNames.STAGE) +val shuffleMergedLocalChunksFetched: Long, +@KVIndexParam( + value = TaskIndexNames.PUSH_BASED_SHUFFLE_MERGED_REMOTE_READS, parent = TaskIndexNames.STAGE) +val shuffleMergedRemoteBlocksBytesRead: Long, +@KVIndexParam( + value = TaskIndexNames.PUSH_BASED_SHUFFLE_MERGED_LOCAL_READS, parent = TaskIndexNames.STAGE) +val shuffleMergedLocalBlocksBytesRead: Long, +@KVIndexParam( + value = TaskIndexNames.PUSH_BASED_SHUFFLE_REMOTE_REQS_DURATION, parent = TaskIndexNames.STAGE) +val shuffleRemoteReqsDuration: Long, +@KVIndexParam( + value = TaskIndexNames.PUSH_BASED_SHUFFLE_MERGED_REMOTE_REQS_DURATION, + parent = TaskIndexNames.STAGE) Review Comment: Indexes for these fields are a requirement when we are computing quantiles and reading `TaskDataWrapper`s.
[GitHub] [spark] otterc commented on a diff in pull request #36165: [SPARK-36620][SHUFFLE] Add Push Based Shuffle client side metrics
otterc commented on code in PR #36165: URL: https://github.com/apache/spark/pull/36165#discussion_r1008467914 ## core/src/main/scala/org/apache/spark/storage/PushBasedFetchHelper.scala: ## @@ -290,6 +301,7 @@ private class PushBasedFetchHelper( address: BlockManagerId): Unit = { assert(blockId.isInstanceOf[ShuffleMergedBlockId] || blockId.isInstanceOf[ShuffleBlockChunkId]) logWarning(s"Falling back to fetch the original blocks for push-merged block $blockId") +shuffleMetrics.incFallbackCount(1) Review Comment: The fallback count measures the number of times fallback is initiated for a block, which is either a `shuffleMergedBlock` or a `shuffleBlockChunk`. I don't understand why we need to move it into the case that you suggested.
[GitHub] [spark] SandishKumarHN commented on a diff in pull request #38344: [SPARK-40777][SQL][PROTOBUF] Protobuf import support and move error-classes.
SandishKumarHN commented on code in PR #38344: URL: https://github.com/apache/spark/pull/38344#discussion_r1008467821 ## core/src/main/resources/error/error-classes.json: ## @@ -792,6 +887,21 @@ "Unable to acquire bytes of memory, got " ] }, + "UNABLE_TO_CONVERT_TO_PROTOBUF_TYPE" : { +"message" : [ + "Unable to convert SQL type to Protobuf type ." +] + }, + "UNABLE_TO_LOCATE_PROTOBUF_MESSAGE_ERROR" : { Review Comment: @srielau fixed all the error class names.
[GitHub] [spark] SandishKumarHN commented on a diff in pull request #38344: [SPARK-40777][SQL][PROTOBUF] Protobuf import support and move error-classes.
SandishKumarHN commented on code in PR #38344: URL: https://github.com/apache/spark/pull/38344#discussion_r1008467248 ## core/src/main/resources/error/error-classes.json: ## @@ -706,6 +736,66 @@ ], "sqlState" : "42000" }, + "PROTOBUF_CLASS_LOAD_ERROR" : { +"message" : [ + "Could not load Protobuf class with name " +] + }, + "PROTOBUF_DEPENDENCY_ERROR" : { +"message" : [ + "Could not find dependency: " +] + }, + "PROTOBUF_DESCRIPTOR_ERROR" : { Review Comment: @srielau Along with the message, I added the pathname of the descriptor file, and the same for the one below. Aside from the error message, we don't have much information at this time.
[GitHub] [spark] SandishKumarHN commented on a diff in pull request #38344: [SPARK-40777][SQL][PROTOBUF] Protobuf import support and move error-classes.
SandishKumarHN commented on code in PR #38344: URL: https://github.com/apache/spark/pull/38344#discussion_r1008463500 ## core/src/main/resources/error/error-classes.json: ## @@ -65,6 +70,11 @@ ], "sqlState" : "22005" }, + "CATALYST_TYPE_TO_PROTOBUF_ENUM_TYPE_ERROR" : { +"message" : [ + "Cannot convert SQL to Protobuf because cannot be written since it's not defined in ENUM " Review Comment: @srielau There is no ENUM type in Spark, so we convert the Protobuf enum to a Spark string. That is why we throw an enum-specific error here.
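The check being discussed can be sketched in a few lines (hypothetical illustration, not the actual Spark Protobuf connector code): because the enum round-trips through a plain string column, writing back to Protobuf must reject any value not defined in the `.proto` enum. The enum values and helper name below are invented for the example.

```python
# Hypothetical enum definition, standing in for values parsed from a .proto file.
ALLOWED_ENUM_VALUES = {"RED", "GREEN", "BLUE"}

def to_proto_enum(value: str) -> str:
    """Validate that a SQL string maps onto a defined Protobuf enum value."""
    if value not in ALLOWED_ENUM_VALUES:
        # Mirrors the spirit of CATALYST_TYPE_TO_PROTOBUF_ENUM_TYPE_ERROR above.
        raise ValueError(
            f"Cannot convert SQL value '{value}': "
            f"not defined in ENUM {sorted(ALLOWED_ENUM_VALUES)}"
        )
    return value

assert to_proto_enum("RED") == "RED"
```

Raising on the undefined value, rather than silently writing it, is what makes the error enum-specific.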
[GitHub] [spark] otterc commented on a diff in pull request #36165: [SPARK-36620][SHUFFLE] Add Push Based Shuffle client side metrics
otterc commented on code in PR #36165: URL: https://github.com/apache/spark/pull/36165#discussion_r1008462544 ## core/src/test/scala/org/apache/spark/ui/StagePageSuite.scala: ## @@ -78,6 +78,16 @@ class StagePageSuite extends SparkFunSuite with LocalSparkContext { shuffleLocalBytesRead = 1L, shuffleReadBytes = 1L, shuffleReadRecords = 1L, +shuffleCorruptMergedBlockChunks = 2L, +shuffleFallbackCount = 2L, +shuffleMergedRemoteBlocksFetched = 1L, +shuffleMergedLocalBlocksFetched = 1L, +shuffleMergedRemoteChunksFetched = 1L, +shuffleMergedLocalChunksFetched = 1L, +shuffleMergedRemoteBytesRead = 1L, +shuffleMergedLocalBytesRead = 1L, +shuffleRemoteReqsDuration = 1L, +shuffleMergedRemoteReqsDuration = 1L, Review Comment: We didn't put much thought into it. But `shuffleMergerCount` can be 0 because this stage may not push, while it can still read shuffle data generated by the previous stage.
[GitHub] [spark] allisonwang-db commented on pull request #38429: [SPARK-40800][SQL][FOLLOW-UP] Add a config to control whether to always inline one-row relation subquery
allisonwang-db commented on PR #38429: URL: https://github.com/apache/spark/pull/38429#issuecomment-1295441951

cc @cloud-fan
[GitHub] [spark] SandishKumarHN commented on a diff in pull request #38344: [SPARK-40777][SQL][PROTOBUF] Protobuf import support and move error-classes.
SandishKumarHN commented on code in PR #38344: URL: https://github.com/apache/spark/pull/38344#discussion_r1008417281 ## core/src/main/resources/error/error-classes.json: ## @@ -706,6 +736,66 @@ ], "sqlState" : "42000" }, + "PROTOBUF_CLASS_LOAD_ERROR" : { +"message" : [ + "Could not load Protobuf class with name " +] + }, + "PROTOBUF_DEPENDENCY_ERROR" : { +"message" : [ + "Could not find dependency: " +] + }, + "PROTOBUF_DESCRIPTOR_ERROR" : { +"message" : [ + "Error parsing descriptor byte[] into Descriptor object" +] + }, + "PROTOBUF_DESCRIPTOR_PARSING_ERROR" : { +"message" : [ + "Error constructing FileDescriptor" +] + }, + "PROTOBUF_FIELD_MISSING_ERROR" : { +"message" : [ + "Searching for in Protobuf schema at gave matches. Candidates: " +] + }, + "PROTOBUF_FIELD_MISSING_IN_CATALYST_SCHEMA" : { +"message" : [ + "Found in Protobuf schema but there is no match in the SQL schema" +] + }, + "PROTOBUF_FIELD_TYPE_MISMATCH" : { +"message" : [ + "Type mismatch encountered for field: " +] + }, + "PROTOBUF_MESSAGE_TYPE_ERROR" : { +"message" : [ + " is not a Protobuf message type" +] + }, + "PROTOBUF_RECURSION_ERROR" : { +"message" : [ + "Found recursive reference in Protobuf schema, which can not be processed by Spark: " +] + }, + "PROTOBUF_TYPE_NOT_SUPPORT_ERROR" : { +"message" : [ + "Protobuf type not yet supported: ." +] + }, + "PROTOBUF_TYPE_TO_CATALYST_TYPE_ERROR" : { +"message" : [ + "Unable to convert of Protobuf to SQL type ." +] + }, + "PROTOBUF_TYPE_TO_SQL_TYPE_ERROR" : { Review Comment: This indicates which Protobuf field failed to convert; it clearly describes the field name and field type.
[GitHub] [spark] SandishKumarHN commented on a diff in pull request #38344: [SPARK-40777][SQL][PROTOBUF] Protobuf import support and move error-classes.
SandishKumarHN commented on code in PR #38344: URL: https://github.com/apache/spark/pull/38344#discussion_r1008417312 ## core/src/main/resources/error/error-classes.json: ## @@ -706,6 +736,66 @@ ], "sqlState" : "42000" }, + "PROTOBUF_CLASS_LOAD_ERROR" : { +"message" : [ + "Could not load Protobuf class with name " +] + }, + "PROTOBUF_DEPENDENCY_ERROR" : { +"message" : [ + "Could not find dependency: " +] + }, + "PROTOBUF_DESCRIPTOR_ERROR" : { +"message" : [ + "Error parsing descriptor byte[] into Descriptor object" +] + }, + "PROTOBUF_DESCRIPTOR_PARSING_ERROR" : { +"message" : [ + "Error constructing FileDescriptor" +] + }, + "PROTOBUF_FIELD_MISSING_ERROR" : { +"message" : [ + "Searching for in Protobuf schema at gave matches. Candidates: " +] + }, + "PROTOBUF_FIELD_MISSING_IN_CATALYST_SCHEMA" : { +"message" : [ + "Found in Protobuf schema but there is no match in the SQL schema" +] + }, + "PROTOBUF_FIELD_TYPE_MISMATCH" : { +"message" : [ + "Type mismatch encountered for field: " +] + }, + "PROTOBUF_MESSAGE_TYPE_ERROR" : { +"message" : [ + " is not a Protobuf message type" +] + }, + "PROTOBUF_RECURSION_ERROR" : { +"message" : [ + "Found recursive reference in Protobuf schema, which can not be processed by Spark: " +] + }, + "PROTOBUF_TYPE_NOT_SUPPORT_ERROR" : { +"message" : [ + "Protobuf type not yet supported: ." +] + }, + "PROTOBUF_TYPE_TO_CATALYST_TYPE_ERROR" : { Review Comment: This indicates which Protobuf message failed to convert. It reports the Protobuf root name: since the message might have a long list of fields, we keep just the root name.
[GitHub] [spark] SandishKumarHN commented on a diff in pull request #38344: [SPARK-40777][SQL][PROTOBUF] Protobuf import support and move error-classes.
SandishKumarHN commented on code in PR #38344: URL: https://github.com/apache/spark/pull/38344#discussion_r1008417281 ## core/src/main/resources/error/error-classes.json: ## @@ -706,6 +736,66 @@ ], "sqlState" : "42000" }, + "PROTOBUF_CLASS_LOAD_ERROR" : { +"message" : [ + "Could not load Protobuf class with name " +] + }, + "PROTOBUF_DEPENDENCY_ERROR" : { +"message" : [ + "Could not find dependency: " +] + }, + "PROTOBUF_DESCRIPTOR_ERROR" : { +"message" : [ + "Error parsing descriptor byte[] into Descriptor object" +] + }, + "PROTOBUF_DESCRIPTOR_PARSING_ERROR" : { +"message" : [ + "Error constructing FileDescriptor" +] + }, + "PROTOBUF_FIELD_MISSING_ERROR" : { +"message" : [ + "Searching for in Protobuf schema at gave matches. Candidates: " +] + }, + "PROTOBUF_FIELD_MISSING_IN_CATALYST_SCHEMA" : { +"message" : [ + "Found in Protobuf schema but there is no match in the SQL schema" +] + }, + "PROTOBUF_FIELD_TYPE_MISMATCH" : { +"message" : [ + "Type mismatch encountered for field: " +] + }, + "PROTOBUF_MESSAGE_TYPE_ERROR" : { +"message" : [ + " is not a Protobuf message type" +] + }, + "PROTOBUF_RECURSION_ERROR" : { +"message" : [ + "Found recursive reference in Protobuf schema, which can not be processed by Spark: " +] + }, + "PROTOBUF_TYPE_NOT_SUPPORT_ERROR" : { +"message" : [ + "Protobuf type not yet supported: ." +] + }, + "PROTOBUF_TYPE_TO_CATALYST_TYPE_ERROR" : { +"message" : [ + "Unable to convert of Protobuf to SQL type ." 
+] + }, + "PROTOBUF_TYPE_TO_SQL_TYPE_ERROR" : { Review Comment: this gives which protobuf field had failed to convert ## core/src/main/resources/error/error-classes.json: ## @@ -706,6 +736,66 @@ ], "sqlState" : "42000" }, + "PROTOBUF_CLASS_LOAD_ERROR" : { +"message" : [ + "Could not load Protobuf class with name " +] + }, + "PROTOBUF_DEPENDENCY_ERROR" : { +"message" : [ + "Could not find dependency: " +] + }, + "PROTOBUF_DESCRIPTOR_ERROR" : { +"message" : [ + "Error parsing descriptor byte[] into Descriptor object" +] + }, + "PROTOBUF_DESCRIPTOR_PARSING_ERROR" : { +"message" : [ + "Error constructing FileDescriptor" +] + }, + "PROTOBUF_FIELD_MISSING_ERROR" : { +"message" : [ + "Searching for in Protobuf schema at gave matches. Candidates: " +] + }, + "PROTOBUF_FIELD_MISSING_IN_CATALYST_SCHEMA" : { +"message" : [ + "Found in Protobuf schema but there is no match in the SQL schema" +] + }, + "PROTOBUF_FIELD_TYPE_MISMATCH" : { +"message" : [ + "Type mismatch encountered for field: " +] + }, + "PROTOBUF_MESSAGE_TYPE_ERROR" : { +"message" : [ + " is not a Protobuf message type" +] + }, + "PROTOBUF_RECURSION_ERROR" : { +"message" : [ + "Found recursive reference in Protobuf schema, which can not be processed by Spark: " +] + }, + "PROTOBUF_TYPE_NOT_SUPPORT_ERROR" : { +"message" : [ + "Protobuf type not yet supported: ." +] + }, + "PROTOBUF_TYPE_TO_CATALYST_TYPE_ERROR" : { Review Comment: this gives which protobuf message had failed to convert -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SandishKumarHN commented on a diff in pull request #38344: [SPARK-40777][SQL][PROTOBUF] Protobuf import support and move error-classes.
SandishKumarHN commented on code in PR #38344: URL: https://github.com/apache/spark/pull/38344#discussion_r1008417312 ## core/src/main/resources/error/error-classes.json: ## @@ -706,6 +736,66 @@ ], "sqlState" : "42000" }, + "PROTOBUF_CLASS_LOAD_ERROR" : { +"message" : [ + "Could not load Protobuf class with name " +] + }, + "PROTOBUF_DEPENDENCY_ERROR" : { +"message" : [ + "Could not find dependency: " +] + }, + "PROTOBUF_DESCRIPTOR_ERROR" : { +"message" : [ + "Error parsing descriptor byte[] into Descriptor object" +] + }, + "PROTOBUF_DESCRIPTOR_PARSING_ERROR" : { +"message" : [ + "Error constructing FileDescriptor" +] + }, + "PROTOBUF_FIELD_MISSING_ERROR" : { +"message" : [ + "Searching for in Protobuf schema at gave matches. Candidates: " +] + }, + "PROTOBUF_FIELD_MISSING_IN_CATALYST_SCHEMA" : { +"message" : [ + "Found in Protobuf schema but there is no match in the SQL schema" +] + }, + "PROTOBUF_FIELD_TYPE_MISMATCH" : { +"message" : [ + "Type mismatch encountered for field: " +] + }, + "PROTOBUF_MESSAGE_TYPE_ERROR" : { +"message" : [ + " is not a Protobuf message type" +] + }, + "PROTOBUF_RECURSION_ERROR" : { +"message" : [ + "Found recursive reference in Protobuf schema, which can not be processed by Spark: " +] + }, + "PROTOBUF_TYPE_NOT_SUPPORT_ERROR" : { +"message" : [ + "Protobuf type not yet supported: ." +] + }, + "PROTOBUF_TYPE_TO_CATALYST_TYPE_ERROR" : { Review Comment: this is gives which message had failed -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SandishKumarHN commented on a diff in pull request #38344: [SPARK-40777][SQL][PROTOBUF] Protobuf import support and move error-classes.
SandishKumarHN commented on code in PR #38344: URL: https://github.com/apache/spark/pull/38344#discussion_r1008417312 ## core/src/main/resources/error/error-classes.json: ## @@ -706,6 +736,66 @@ ], "sqlState" : "42000" }, + "PROTOBUF_CLASS_LOAD_ERROR" : { +"message" : [ + "Could not load Protobuf class with name " +] + }, + "PROTOBUF_DEPENDENCY_ERROR" : { +"message" : [ + "Could not find dependency: " +] + }, + "PROTOBUF_DESCRIPTOR_ERROR" : { +"message" : [ + "Error parsing descriptor byte[] into Descriptor object" +] + }, + "PROTOBUF_DESCRIPTOR_PARSING_ERROR" : { +"message" : [ + "Error constructing FileDescriptor" +] + }, + "PROTOBUF_FIELD_MISSING_ERROR" : { +"message" : [ + "Searching for in Protobuf schema at gave matches. Candidates: " +] + }, + "PROTOBUF_FIELD_MISSING_IN_CATALYST_SCHEMA" : { +"message" : [ + "Found in Protobuf schema but there is no match in the SQL schema" +] + }, + "PROTOBUF_FIELD_TYPE_MISMATCH" : { +"message" : [ + "Type mismatch encountered for field: " +] + }, + "PROTOBUF_MESSAGE_TYPE_ERROR" : { +"message" : [ + " is not a Protobuf message type" +] + }, + "PROTOBUF_RECURSION_ERROR" : { +"message" : [ + "Found recursive reference in Protobuf schema, which can not be processed by Spark: " +] + }, + "PROTOBUF_TYPE_NOT_SUPPORT_ERROR" : { +"message" : [ + "Protobuf type not yet supported: ." 
+] + }, + "PROTOBUF_TYPE_TO_CATALYST_TYPE_ERROR" : { Review Comment: this is thrown at protobuf message level ## core/src/main/resources/error/error-classes.json: ## @@ -706,6 +736,66 @@ ], "sqlState" : "42000" }, + "PROTOBUF_CLASS_LOAD_ERROR" : { +"message" : [ + "Could not load Protobuf class with name " +] + }, + "PROTOBUF_DEPENDENCY_ERROR" : { +"message" : [ + "Could not find dependency: " +] + }, + "PROTOBUF_DESCRIPTOR_ERROR" : { +"message" : [ + "Error parsing descriptor byte[] into Descriptor object" +] + }, + "PROTOBUF_DESCRIPTOR_PARSING_ERROR" : { +"message" : [ + "Error constructing FileDescriptor" +] + }, + "PROTOBUF_FIELD_MISSING_ERROR" : { +"message" : [ + "Searching for in Protobuf schema at gave matches. Candidates: " +] + }, + "PROTOBUF_FIELD_MISSING_IN_CATALYST_SCHEMA" : { +"message" : [ + "Found in Protobuf schema but there is no match in the SQL schema" +] + }, + "PROTOBUF_FIELD_TYPE_MISMATCH" : { +"message" : [ + "Type mismatch encountered for field: " +] + }, + "PROTOBUF_MESSAGE_TYPE_ERROR" : { +"message" : [ + " is not a Protobuf message type" +] + }, + "PROTOBUF_RECURSION_ERROR" : { +"message" : [ + "Found recursive reference in Protobuf schema, which can not be processed by Spark: " +] + }, + "PROTOBUF_TYPE_NOT_SUPPORT_ERROR" : { +"message" : [ + "Protobuf type not yet supported: ." +] + }, + "PROTOBUF_TYPE_TO_CATALYST_TYPE_ERROR" : { +"message" : [ + "Unable to convert of Protobuf to SQL type ." +] + }, + "PROTOBUF_TYPE_TO_SQL_TYPE_ERROR" : { Review Comment: this is thrown at protobuf field level -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SandishKumarHN commented on a diff in pull request #38344: [SPARK-40777][SQL][PROTOBUF] Protobuf import support and move error-classes.
SandishKumarHN commented on code in PR #38344: URL: https://github.com/apache/spark/pull/38344#discussion_r1008417281 ## core/src/main/resources/error/error-classes.json: ## @@ -706,6 +736,66 @@ ], "sqlState" : "42000" }, + "PROTOBUF_CLASS_LOAD_ERROR" : { +"message" : [ + "Could not load Protobuf class with name " +] + }, + "PROTOBUF_DEPENDENCY_ERROR" : { +"message" : [ + "Could not find dependency: " +] + }, + "PROTOBUF_DESCRIPTOR_ERROR" : { +"message" : [ + "Error parsing descriptor byte[] into Descriptor object" +] + }, + "PROTOBUF_DESCRIPTOR_PARSING_ERROR" : { +"message" : [ + "Error constructing FileDescriptor" +] + }, + "PROTOBUF_FIELD_MISSING_ERROR" : { +"message" : [ + "Searching for in Protobuf schema at gave matches. Candidates: " +] + }, + "PROTOBUF_FIELD_MISSING_IN_CATALYST_SCHEMA" : { +"message" : [ + "Found in Protobuf schema but there is no match in the SQL schema" +] + }, + "PROTOBUF_FIELD_TYPE_MISMATCH" : { +"message" : [ + "Type mismatch encountered for field: " +] + }, + "PROTOBUF_MESSAGE_TYPE_ERROR" : { +"message" : [ + " is not a Protobuf message type" +] + }, + "PROTOBUF_RECURSION_ERROR" : { +"message" : [ + "Found recursive reference in Protobuf schema, which can not be processed by Spark: " +] + }, + "PROTOBUF_TYPE_NOT_SUPPORT_ERROR" : { +"message" : [ + "Protobuf type not yet supported: ." +] + }, + "PROTOBUF_TYPE_TO_CATALYST_TYPE_ERROR" : { +"message" : [ + "Unable to convert of Protobuf to SQL type ." +] + }, + "PROTOBUF_TYPE_TO_SQL_TYPE_ERROR" : { Review Comment: this is thrown when failing to convert protobuf fields into SQL type. 
## core/src/main/resources/error/error-classes.json: ## @@ -706,6 +736,66 @@ ], "sqlState" : "42000" }, + "PROTOBUF_CLASS_LOAD_ERROR" : { +"message" : [ + "Could not load Protobuf class with name " +] + }, + "PROTOBUF_DEPENDENCY_ERROR" : { +"message" : [ + "Could not find dependency: " +] + }, + "PROTOBUF_DESCRIPTOR_ERROR" : { +"message" : [ + "Error parsing descriptor byte[] into Descriptor object" +] + }, + "PROTOBUF_DESCRIPTOR_PARSING_ERROR" : { +"message" : [ + "Error constructing FileDescriptor" +] + }, + "PROTOBUF_FIELD_MISSING_ERROR" : { +"message" : [ + "Searching for in Protobuf schema at gave matches. Candidates: " +] + }, + "PROTOBUF_FIELD_MISSING_IN_CATALYST_SCHEMA" : { +"message" : [ + "Found in Protobuf schema but there is no match in the SQL schema" +] + }, + "PROTOBUF_FIELD_TYPE_MISMATCH" : { +"message" : [ + "Type mismatch encountered for field: " +] + }, + "PROTOBUF_MESSAGE_TYPE_ERROR" : { +"message" : [ + " is not a Protobuf message type" +] + }, + "PROTOBUF_RECURSION_ERROR" : { +"message" : [ + "Found recursive reference in Protobuf schema, which can not be processed by Spark: " +] + }, + "PROTOBUF_TYPE_NOT_SUPPORT_ERROR" : { +"message" : [ + "Protobuf type not yet supported: ." +] + }, + "PROTOBUF_TYPE_TO_CATALYST_TYPE_ERROR" : { Review Comment: this is thrown when failing to convert the protobuf message to SQL struct type. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
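The review comments above describe when each error class fires, but the mechanics of the `error-classes.json` entries may be easier to see with a small sketch: each `message` is an array of template lines that gets joined, with `<placeholder>` tokens substituted from message parameters. Note the placeholders themselves were stripped by the mail rendering above (e.g. "Could not load Protobuf class with name " originally ends in a parameter). The helper below is a hypothetical illustration of that substitution, not Spark's actual error-framework code:

```python
import re

def format_error(template_lines, params):
    """Join a message-template array and fill <placeholder> tokens,
    mimicking how an error-class template is rendered (illustrative only)."""
    template = "".join(template_lines)
    return re.sub(r"<(\w+)>", lambda m: str(params[m.group(1)]), template)

# Hypothetical entry modeled on PROTOBUF_CLASS_LOAD_ERROR; the real JSON
# carries a placeholder that the mail rendering above stripped out.
entry = {"message": ["Could not load Protobuf class with name <protobufClassName>"]}
print(format_error(entry["message"], {"protobufClassName": "org.example.Person"}))
# prints: Could not load Protobuf class with name org.example.Person
```

This also shows why reviewers care about naming and message wording: the class name and template are the only things a user sees when the error fires.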
[GitHub] [spark] jchen5 commented on pull request #38432: [SPARK-36124][SQL][WIP] Support subqueries with correlation through set operators
jchen5 commented on PR #38432: URL: https://github.com/apache/spark/pull/38432#issuecomment-1295378945 @allisonwang-db @cloud-fan

[GitHub] [spark] jchen5 opened a new pull request, #38432: [SPARK-36124][SQL][WIP] Support subqueries with correlation through set operators
jchen5 opened a new pull request, #38432: URL: https://github.com/apache/spark/pull/38432 ## What changes were proposed in this pull request? Adds support for subquery decorrelation with UNION, INTERSECT, and EXCEPT on the correlation paths. For example: ``` SELECT t1a, ( SELECT sum(b) FROM ( SELECT t2b as b FROM t2 WHERE t2a = t1.t1a INTERSECT SELECT t3b as b FROM t3 WHERE t3a = t1.t1a )) FROM t1 ``` ### Why are the changes needed? To improve subquery support in Spark. ### Does this PR introduce _any_ user-facing change? Before this change, queries like this would return an error like: `Decorrelate inner query through Union is not supported.` After this PR, this query can run successfully. ### How was this patch tested? Unit tests and SQL query tests.
[GitHub] [spark] amaliujia commented on pull request #38426: [SPARK-40951][PYSPARK][TESTS] `pyspark-connect` tests should be skipped if `pandas` doesn't exist
amaliujia commented on PR #38426: URL: https://github.com/apache/spark/pull/38426#issuecomment-1295352747 Late LGTM. Thanks for this improvement!
[GitHub] [spark] srielau commented on a diff in pull request #38344: [SPARK-40777][SQL][PROTOBUF] Protobuf import support and move error-classes.
srielau commented on code in PR #38344: URL: https://github.com/apache/spark/pull/38344#discussion_r1008327776 ## core/src/main/resources/error/error-classes.json: ## @@ -23,6 +23,11 @@ ], "sqlState" : "42000" }, + "CANNOT_FIND_PROTOBUF_DESCRIPTOR_FILE_ERROR" : { Review Comment: Definitely remove _ERROR. Would be nice if we could gravitate towards *_NOT_FOUND and *_ALREADY_EXIST style naming convention: PROTOBUF_DESCRIPTOR_FILE_NOT_FOUND ## core/src/main/resources/error/error-classes.json: ## @@ -65,6 +70,11 @@ ], "sqlState" : "22005" }, + "CATALYST_TYPE_TO_PROTOBUF_ENUM_TYPE_ERROR" : { Review Comment: I don't think we call out Catalyst anywhere else. How is that relevant? Can we be generic? ## core/src/main/resources/error/error-classes.json: ## @@ -65,6 +70,11 @@ ], "sqlState" : "22005" }, + "CATALYST_TYPE_TO_PROTOBUF_ENUM_TYPE_ERROR" : { +"message" : [ + "Cannot convert SQL to Protobuf because cannot be written since it's not defined in ENUM " Review Comment: That's a because "squared"... I don't have enough context to propose a rephrase.. Also If we have SQL here, maybe we can do CATALYST -> SQL for the name? ## core/src/main/resources/error/error-classes.json: ## @@ -579,6 +594,11 @@ } } }, + "MALFORMED_PROTOBUF_MESSAGE_ERROR" : { Review Comment: Remove _ERROR ## core/src/main/resources/error/error-classes.json: ## @@ -629,11 +649,21 @@ ], "sqlState" : "42000" }, + "NO_CATALYST_TYPE_IN_PROTOBUF_SCHEMA" : { +"message" : [ + "Cannot find in Protobuf schema" +] + }, "NO_HANDLER_FOR_UDAF" : { "message" : [ "No handler for UDAF ''. Use sparkSession.udf.register(...) instead." 
] }, + "NO_PROTOBUF_MESSAGE_TYPE_ERROR" : { Review Comment: Remove _ERROR ## core/src/main/resources/error/error-classes.json: ## @@ -706,6 +736,66 @@ ], "sqlState" : "42000" }, + "PROTOBUF_CLASS_LOAD_ERROR" : { +"message" : [ + "Could not load Protobuf class with name " +] + }, + "PROTOBUF_DEPENDENCY_ERROR" : { +"message" : [ + "Could not find dependency: " +] + }, + "PROTOBUF_DESCRIPTOR_ERROR" : { Review Comment: ```suggestion "CANNOT_PARSE_PROTOBUF_DESCRIPTOR" : { ``` Any hope we know why and can tell? This just says: "Bad Kitty" ## core/src/main/resources/error/error-classes.json: ## @@ -706,6 +736,66 @@ ], "sqlState" : "42000" }, + "PROTOBUF_CLASS_LOAD_ERROR" : { +"message" : [ + "Could not load Protobuf class with name " +] + }, + "PROTOBUF_DEPENDENCY_ERROR" : { +"message" : [ + "Could not find dependency: " +] + }, + "PROTOBUF_DESCRIPTOR_ERROR" : { +"message" : [ + "Error parsing descriptor byte[] into Descriptor object" +] + }, + "PROTOBUF_DESCRIPTOR_PARSING_ERROR" : { Review Comment: ```suggestion "CANNOT_CONSTRUCT_PROTOBUF_DESCRIPTOR" : { ``` Same as above.. Not sure how helpful this message is. ## core/src/main/resources/error/error-classes.json: ## @@ -706,6 +736,66 @@ ], "sqlState" : "42000" }, + "PROTOBUF_CLASS_LOAD_ERROR" : { +"message" : [ + "Could not load Protobuf class with name " +] + }, + "PROTOBUF_DEPENDENCY_ERROR" : { +"message" : [ + "Could not find dependency: " +] + }, + "PROTOBUF_DESCRIPTOR_ERROR" : { +"message" : [ + "Error parsing descriptor byte[] into Descriptor object" +] + }, + "PROTOBUF_DESCRIPTOR_PARSING_ERROR" : { +"message" : [ + "Error constructing FileDescriptor" +] + }, + "PROTOBUF_FIELD_MISSING_ERROR" : { +"message" : [ + "Searching for in Protobuf schema at gave matches. 
Candidates: " +] + }, + "PROTOBUF_FIELD_MISSING_IN_CATALYST_SCHEMA" : { +"message" : [ + "Found in Protobuf schema but there is no match in the SQL schema" +] + }, + "PROTOBUF_FIELD_TYPE_MISMATCH" : { +"message" : [ + "Type mismatch encountered for field: " +] + }, + "PROTOBUF_MESSAGE_TYPE_ERROR" : { +"message" : [ + " is not a Protobuf message type" +] + }, + "PROTOBUF_RECURSION_ERROR" : { Review Comment: ```suggestion "RECURSIVE_PROTOBUF_SCHEMA" : { ``` ## core/src/main/resources/error/error-classes.json: ## @@ -760,6 +850,11 @@ ], "sqlState" : "22023" }, + "SQL_TYPE_TO_PROTOBUF_TYPE_ERROR" : { Review Comment: ```suggestion "CANNOT_CONVERT_SQL_TYPE_TO_PROTOBUF_TYPE" : { ``` ## core/src/main/resources/error/error-classes.json: ## @@ -792,6 +887,21 @@ "Unable to acquire bytes of memory, got " ] }, + "UNABLE_TO_CONVERT_TO_PROTOBUF_TYPE" : { +"message" : [ + "Unable to convert SQL type to Protobuf type ." +] + }, +
[GitHub] [spark] rahulsmahadev commented on a diff in pull request #38404: [SPARK-40956] SQL Equivalent for Dataframe overwrite command
rahulsmahadev commented on code in PR #38404: URL: https://github.com/apache/spark/pull/38404#discussion_r1008360452 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala: ## @@ -1276,6 +1276,24 @@ class Analyzer(override val catalogManager: CatalogManager) AppendData.byPosition(r, query) } else if (conf.partitionOverwriteMode == PartitionOverwriteMode.DYNAMIC) { OverwritePartitionsDynamic.byPosition(r, query) +} else if (i.replacePredicates.nonEmpty) { + def findAttrInRelation(name: String): Attribute = { Review Comment: add comments here since this is not obvious.
[GitHub] [spark] rahulsmahadev commented on a diff in pull request #38404: [SPARK-40956] SQL Equivalent for Dataframe overwrite command
rahulsmahadev commented on code in PR #38404: URL: https://github.com/apache/spark/pull/38404#discussion_r1008360097 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala: ## @@ -1259,8 +1259,8 @@ class Analyzer(override val catalogManager: CatalogManager) object ResolveInsertInto extends Rule[LogicalPlan] { override def apply(plan: LogicalPlan): LogicalPlan = plan.resolveOperatorsWithPruning( AlwaysProcess.fn, ruleId) { - case i @ InsertIntoStatement(r: DataSourceV2Relation, _, _, _, _, _) - if i.query.resolved && i.userSpecifiedCols.isEmpty => + case i @ InsertIntoStatement(r: DataSourceV2Relation, _, _, _, _, _, _) +if i.query.resolved && i.userSpecifiedCols.isEmpty => Review Comment: nit: undo space
[GitHub] [spark] itholic commented on pull request #38420: [SPARK-40947][PS][INFRA] Upgrade pandas to 1.5.1
itholic commented on PR #38420: URL: https://github.com/apache/spark/pull/38420#issuecomment-1295286024 Ahh! Let me take a look
[GitHub] [spark] jerrypeng commented on pull request #38430: [SPARK-40957] Add in memory cache in HDFSMetadataLog
jerrypeng commented on PR #38430: URL: https://github.com/apache/spark/pull/38430#issuecomment-1295283632 @HeartSaVioR @zsxwing please review
[GitHub] [spark] juliuszsompolski commented on pull request #38397: [SPARK-40918][SQL] Mismatch between FileSourceScanExec and Orc and ParquetFileFormat on producing columnar output
juliuszsompolski commented on PR #38397: URL: https://github.com/apache/spark/pull/38397#issuecomment-1295275605 3.3 PR: https://github.com/apache/spark/pull/38431
[GitHub] [spark] juliuszsompolski commented on pull request #38431: [SPARK-40918][SQL] Mismatch between FileSourceScanExec and Orc and ParquetFileFormat on producing columnar output
juliuszsompolski commented on PR #38431: URL: https://github.com/apache/spark/pull/38431#issuecomment-1295275295 @cloud-fan
[GitHub] [spark] juliuszsompolski opened a new pull request, #38431: [SPARK-40918][SQL] Mismatch between FileSourceScanExec and Orc and ParquetFileFormat on producing columnar output
juliuszsompolski opened a new pull request, #38431: URL: https://github.com/apache/spark/pull/38431 ### What changes were proposed in this pull request? We move the decision about supporting columnar output based on WSCG one level up, from ParquetFileFormat / OrcFileFormat to FileSourceScanExec, and pass it as a new required option to ParquetFileFormat / OrcFileFormat. The semantics are now as follows: * `ParquetFileFormat.supportsBatch` and `OrcFileFormat.supportsBatch` return whether the format **can**, not necessarily **will**, return columnar output. * To return columnar output, an option `FileFormat.OPTION_RETURNING_BATCH` needs to be passed to `buildReaderWithPartitionValues` in these two file formats. It should only be set to `true` if `supportsBatch` is also `true`, but it can be set to `false` if we don't want columnar output anyway - this way, `FileSourceScanExec` can set it to false when there are more than 100 columns for WSCG, and `ParquetFileFormat` / `OrcFileFormat` don't have to concern themselves with WSCG limits. * To avoid the option not being passed by accident, it is made required. Making it required means updating a few places that use the API, but the error resulting from omitting it is very obscure, so it's better to fail early and explicitly here. ### Why are the changes needed? The following explains it for `ParquetFileFormat`; `OrcFileFormat` had exactly the same issue. `java.lang.ClassCastException: org.apache.spark.sql.vectorized.ColumnarBatch cannot be cast to org.apache.spark.sql.catalyst.InternalRow` was being thrown because the Parquet reader was outputting columnar batches, while FileSourceScanExec expected row output. The mismatch comes from the fact that `ParquetFileFormat.supportBatch` depends on `WholeStageCodegenExec.isTooManyFields(conf, schema)`, where the threshold is 100 fields.
When this is used in `FileSourceScanExec`: ``` override lazy val supportsColumnar: Boolean = { relation.fileFormat.supportBatch(relation.sparkSession, schema) } ``` the `schema` comes from the output attributes, which include extra metadata attributes. However, inside `ParquetFileFormat.buildReaderWithPartitionValues` it was calculated again as ``` relation.fileFormat.buildReaderWithPartitionValues( sparkSession = relation.sparkSession, dataSchema = relation.dataSchema, partitionSchema = relation.partitionSchema, requiredSchema = requiredSchema, filters = pushedDownFilters, options = options, hadoopConf = hadoopConf ... val resultSchema = StructType(requiredSchema.fields ++ partitionSchema.fields) ... val returningBatch = supportBatch(sparkSession, resultSchema) ``` where `requiredSchema` and `partitionSchema` wouldn't include the metadata columns: ``` FileSourceScanExec: output: List(c1#4608L, c2#4609L, ..., c100#4707L, file_path#6388) FileSourceScanExec: dataSchema: StructType(StructField(c1,LongType,true),StructField(c2,LongType,true),...,StructField(c100,LongType,true)) FileSourceScanExec: partitionSchema: StructType() FileSourceScanExec: requiredSchema: StructType(StructField(c1,LongType,true),StructField(c2,LongType,true),...,StructField(c100,LongType,true)) ``` Columns like `file_path#6388` are added by the scan and contain metadata added by the scan, not by the file reader, which concerns itself with what is within the file. ### Does this PR introduce _any_ user-facing change? Not a public API change, but it is now required to pass `FileFormat.OPTION_RETURNING_BATCH` in `options` to `ParquetFileFormat.buildReaderWithPartitionValues`. The only user of this API in Apache Spark is `FileSourceScanExec`. ### How was this patch tested? Tests added Closes #38397 from juliuszsompolski/SPARK-40918. Authored-by: Juliusz Sompolski Signed-off-by: Wenchen Fan
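The split the PR description makes — `supportsBatch` saying whether the reader *can* produce columnar batches, and the required `OPTION_RETURNING_BATCH` saying whether it *should* — can be sketched in miniature. This is illustrative Python, not Spark code; the 100-field threshold and the fail-early behavior come from the description above, while the function and option names here are stand-ins:

```python
WSCG_MAX_FIELDS = 100  # threshold from WholeStageCodegenExec.isTooManyFields

def supports_batch(num_fields: int) -> bool:
    """Whether columnar output is viable for this many output fields.
    (The real check also inspects data types; elided in this sketch.)"""
    return num_fields <= WSCG_MAX_FIELDS

def build_reader(options: dict) -> str:
    # The option is required: fail early and explicitly instead of
    # surfacing an obscure ClassCastException later in the query.
    if "returning_batch" not in options:
        raise ValueError("OPTION_RETURNING_BATCH must be set explicitly")
    return "columnar" if options["returning_batch"] else "rows"

# The scan decides once, over its FULL output schema (including metadata
# columns such as file_path), and passes the decision down to the reader,
# so both sides agree on the row-vs-batch contract:
output_fields = 101  # 100 data columns + one metadata column
opts = {"returning_batch": supports_batch(output_fields)}
print(build_reader(opts))  # prints: rows
```

The bug being fixed is exactly the case where the two sides computed this decision over different schemas (with and without metadata columns) and disagreed.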
[GitHub] [spark] jerrypeng opened a new pull request, #38430: [SPARK-40957] Add in memory cache in HDFSMetadataLog
jerrypeng opened a new pull request, #38430: URL: https://github.com/apache/spark/pull/38430 ### What changes were proposed in this pull request? ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ### How was this patch tested?
[GitHub] [spark] dongjoon-hyun closed pull request #38426: [SPARK-40951][PYSPARK][TESTS] `pyspark-connect` tests should be skipped if `pandas` doesn't exist
dongjoon-hyun closed pull request #38426: [SPARK-40951][PYSPARK][TESTS] `pyspark-connect` tests should be skipped if `pandas` doesn't exist URL: https://github.com/apache/spark/pull/38426
[GitHub] [spark] dtenedor commented on a diff in pull request #38418: [SPARK-40944][SQL] Relax ordering constraint for CREATE TABLE column options
dtenedor commented on code in PR #38418: URL: https://github.com/apache/spark/pull/38418#discussion_r1008275592 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala: ## @@ -2813,13 +2813,41 @@ class AstBuilder extends SqlBaseParserBaseVisitor[AnyRef] with SQLConfHelper wit ctx: CreateOrReplaceTableColTypeContext): StructField = withOrigin(ctx) { import ctx._ +// Check that no duplicates exist among any CREATE TABLE column options specified. +var isNotNull: Option[Boolean] = None Review Comment: That's a good point. We can just name this `nullable` and set it initially to `true`, and change it to `false` if the SQL syntax includes `NOT NULL`. Updated accordingly.
[GitHub] [spark] johanl-db commented on a diff in pull request #38400: [SPARK-40921][SQL] Add WHEN NOT MATCHED BY SOURCE clause to MERGE INTO
johanl-db commented on code in PR #38400: URL: https://github.com/apache/spark/pull/38400#discussion_r1008270419 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala: ## @@ -451,7 +451,22 @@ class AstBuilder extends SqlBaseParserBaseVisitor[AnyRef] with SQLConfHelper wit } } } -if (matchedActions.isEmpty && notMatchedActions.isEmpty) { +val notMatchedBySourceActions = ctx.notMatchedBySourceClause().asScala.map { + clause => { +val notMatchedBySourceAction = clause.notMatchedBySourceAction() +if (notMatchedBySourceAction.DELETE() != null) { + DeleteAction(Option(clause.notMatchedBySourceCond).map(expression)) +} else if (notMatchedBySourceAction.UPDATE() != null) { + val condition = Option(clause.notMatchedBySourceCond).map(expression) + UpdateAction(condition, + withAssignments(clause.notMatchedBySourceAction().assignmentList())) +} else { + // It should not be here. + throw QueryParsingErrors.unrecognizedNotMatchedBySourceActionError(clause) Review Comment: Parsing will fail if a user tries to use an invalid action. We don't cover this error in tests since it can't surface unless there's a bug. I replaced it (and similar error for matched / not matched) with internal errors. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun commented on pull request #38426: [SPARK-40951][PYSPARK][TESTS] `pyspark-connect` tests should be skipped if `pandas` doesn't exist
dongjoon-hyun commented on PR #38426: URL: https://github.com/apache/spark/pull/38426#issuecomment-1295238073 The last commit only changes the annotation and I verified it once more manually. Merged to master for Apache Spark 3.4. ``` % PYTHON_EXECUTABLE=python3.9 python/run-tests --modules pyspark-connect Running PySpark tests. Output is in /Users/dongjoon/APACHE/spark-merge/python/unit-tests.log Will test against the following Python executables: ['python3.9'] Will test the following Python modules: ['pyspark-connect'] python3.9 python_implementation is CPython python3.9 version is: Python 3.9.15 Starting test(python3.9): pyspark.sql.tests.connect.test_connect_column_expressions (temp output: /Users/dongjoon/APACHE/spark-merge/python/target/11141642-6c91-4d03-9039-7e4ca2a946d9/python3.9__pyspark.sql.tests.connect.test_connect_column_expressions__ufodn8um.log) Starting test(python3.9): pyspark.sql.tests.connect.test_connect_basic (temp output: /Users/dongjoon/APACHE/spark-merge/python/target/4701bdaf-a321-4175-8dfb-dea06c91b45a/python3.9__pyspark.sql.tests.connect.test_connect_basic__daej1unt.log) Starting test(python3.9): pyspark.sql.tests.connect.test_connect_select_ops (temp output: /Users/dongjoon/APACHE/spark-merge/python/target/b84efc53-9fba-4073-a088-09dad28fb5a0/python3.9__pyspark.sql.tests.connect.test_connect_select_ops__34nud2k0.log) Starting test(python3.9): pyspark.sql.tests.connect.test_connect_plan_only (temp output: /Users/dongjoon/APACHE/spark-merge/python/target/66f0-5751-49e7-8bb1-72633c30d7b1/python3.9__pyspark.sql.tests.connect.test_connect_plan_only__inix46uc.log) Finished test(python3.9): pyspark.sql.tests.connect.test_connect_basic (2s) ... 6 tests were skipped Finished test(python3.9): pyspark.sql.tests.connect.test_connect_plan_only (2s) ... 10 tests were skipped Finished test(python3.9): pyspark.sql.tests.connect.test_connect_select_ops (2s) ... 
2 tests were skipped Finished test(python3.9): pyspark.sql.tests.connect.test_connect_column_expressions (2s) ... 2 tests were skipped Tests passed in 2 seconds Skipped tests in pyspark.sql.tests.connect.test_connect_basic with python3.9: test_limit_offset (pyspark.sql.tests.connect.test_connect_basic.SparkConnectTests) ... skip (0.003s) test_schema (pyspark.sql.tests.connect.test_connect_basic.SparkConnectTests) ... skip (0.000s) test_simple_datasource_read (pyspark.sql.tests.connect.test_connect_basic.SparkConnectTests) ... skip (0.000s) test_simple_explain_string (pyspark.sql.tests.connect.test_connect_basic.SparkConnectTests) ... skip (0.000s) test_simple_read (pyspark.sql.tests.connect.test_connect_basic.SparkConnectTests) ... skip (0.000s) test_simple_udf (pyspark.sql.tests.connect.test_connect_basic.SparkConnectTests) ... skip (0.000s) Skipped tests in pyspark.sql.tests.connect.test_connect_column_expressions with python3.9: test_column_literals (pyspark.sql.tests.connect.test_connect_column_expressions.SparkConnectColumnExpressionSuite) ... skip (0.000s) test_simple_column_expressions (pyspark.sql.tests.connect.test_connect_column_expressions.SparkConnectColumnExpressionSuite) ... skip (0.000s) Skipped tests in pyspark.sql.tests.connect.test_connect_plan_only with python3.9: test_all_the_plans (pyspark.sql.tests.connect.test_connect_plan_only.SparkConnectTestsPlanOnly) ... skip (0.003s) test_datasource_read (pyspark.sql.tests.connect.test_connect_plan_only.SparkConnectTestsPlanOnly) ... skip (0.000s) test_deduplicate (pyspark.sql.tests.connect.test_connect_plan_only.SparkConnectTestsPlanOnly) ... skip (0.001s) test_filter (pyspark.sql.tests.connect.test_connect_plan_only.SparkConnectTestsPlanOnly) ... skip (0.000s) test_limit (pyspark.sql.tests.connect.test_connect_plan_only.SparkConnectTestsPlanOnly) ... skip (0.000s) test_offset (pyspark.sql.tests.connect.test_connect_plan_only.SparkConnectTestsPlanOnly) ... 
skip (0.000s) test_relation_alias (pyspark.sql.tests.connect.test_connect_plan_only.SparkConnectTestsPlanOnly) ... skip (0.000s) test_sample (pyspark.sql.tests.connect.test_connect_plan_only.SparkConnectTestsPlanOnly) ... skip (0.001s) test_simple_project (pyspark.sql.tests.connect.test_connect_plan_only.SparkConnectTestsPlanOnly) ... skip (0.000s) test_simple_udf (pyspark.sql.tests.connect.test_connect_plan_only.SparkConnectTestsPlanOnly) ... skip (0.000s) Skipped tests in pyspark.sql.tests.connect.test_connect_select_ops with python3.9: test_join_with_join_type (pyspark.sql.tests.connect.test_connect_select_ops.SparkConnectToProtoSuite) ... skip (0.003s) test_select_with_columns_and_strings (pyspark.sql.tests.connect.test_connect_select_ops.SparkConnectToProtoSuite) ... skip (0.000s) ``` -- This
[GitHub] [spark] dtenedor commented on a diff in pull request #38263: [SPARK-40692][SQL] Support data masking built-in function 'mask_hash'
dtenedor commented on code in PR #38263: URL: https://github.com/apache/spark/pull/38263#discussion_r1008272185 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/maskExpressions.scala: ## @@ -0,0 +1,99 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.catalyst.expressions + +import org.apache.commons.codec.digest.DigestUtils + +import org.apache.spark.sql.catalyst.expressions.codegen.{CodegenContext, ExprCode} +import org.apache.spark.sql.internal.SQLConf +import org.apache.spark.sql.types._ +import org.apache.spark.unsafe.types.UTF8String + +// scalastyle:off line.size.limit +@ExpressionDescription( + usage = "_FUNC_(str) - Returns hash of the given value. The hash is consistent and can be used to join masked values together across tables", Review Comment: This description probably needs a bit more information. For example, can this mention: * This can be useful for creating copies of tables with sensitive information removed, but retaining the same schema. * Error behavior: there are no error cases for this expression, it always returns a result string for every input string. * this looks similar to the existing `md5` function [1]. How is this different? 
[1] https://github.com/apache/spark/blob/d2bdd6595ec3495672c60e225948c99bfaeff04a/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L2415 ## sql/core/src/test/resources/sql-tests/inputs/string-functions.sql: ## @@ -58,6 +58,14 @@ SELECT substring('Spark SQL' from 5); SELECT substring('Spark SQL' from -3); SELECT substring('Spark SQL' from 5 for 1); +-- mask function +select mask_hash('TestString-123'), Review Comment: also add a case where the input is NULL? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
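The property the `mask_hash` description relies on — the hash is deterministic, so equal inputs produce equal masked values and joins across tables still line up — can be illustrated with a stand-in. This uses Python's `hashlib.sha256` purely as a placeholder; the actual digest algorithm the expression uses (via `DigestUtils`) is a detail of the PR, not asserted here:

```python
import hashlib

def mask_hash(s):
    """Deterministically mask a string; None stays None (SQL NULL semantics).
    The choice of sha256 here is an illustrative assumption."""
    if s is None:
        return None
    return hashlib.sha256(s.encode("utf-8")).hexdigest()

# Equal values hash equally, so a join on the masked column still matches
# the same rows it would have matched on the raw column:
left = {"TestString-123", "alice@example.com"}
right = {"TestString-123", "bob@example.com"}
joined = {v for v in left if mask_hash(v) in {mask_hash(w) for w in right}}
print(joined)  # prints: {'TestString-123'}
```

This is also the answer to the reviewer's `md5` question in spirit: the differentiator is the intended use (consistent masking for joins), not the mere existence of a digest function.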
[GitHub] [spark] dtenedor commented on a diff in pull request #38146: [SPARK-40687][SQL] Support data masking built-in function 'mask'
dtenedor commented on code in PR #38146: URL: https://github.com/apache/spark/pull/38146#discussion_r1008264498 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/maskExpressions.scala: ## @@ -0,0 +1,222 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.catalyst.expressions + +import org.apache.spark.sql.catalyst.expressions.codegen._ +import org.apache.spark.sql.types.{AbstractDataType, DataType, StringType} +import org.apache.spark.unsafe.types.UTF8String + +// scalastyle:off line.size.limit +@ExpressionDescription( + usage = +"""_FUNC_(input[, upperChar, lowerChar, digitChar, otherChar]) - masks the given string value""", + arguments = """ +Arguments: + * input - string value to mask. Supported types: STRING, VARCHAR, CHAR + * upperChar - character to replace upper-case characters with. Specify -1 to retain original character. Default value: 'X' + * lowerChar - character to replace lower-case characters with. Specify -1 to retain original character. Default value: 'x' + * digitChar - character to replace digit characters with. Specify -1 to retain original character. Default value: 'n' + * otherChar - character to replace all other characters with. 
Specify -1 to retain original character. Default value: -1 + """, + examples = """ +Examples: + > SELECT _FUNC_('abcd-EFGH-8765-4321'); +xxxx-XXXX-nnnn-nnnn + > SELECT _FUNC_('abcd-EFGH-8765-4321', 'Q'); +xxxx-QQQQ-nnnn-nnnn + > SELECT _FUNC_('AbCD123-@$#'); +XxXXnnn-@$# + > SELECT _FUNC_('AbCD123-@$#', 'Q'); +QxQQnnn-@$# + > SELECT _FUNC_('AbCD123-@$#', 'Q','q'); +QqQQnnn-@$# + > SELECT _FUNC_('AbCD123-@$#', 'Q','q', 'd'); +QqQQddd-@$# + > SELECT _FUNC_('AbCD123-@$#', 'Q','q', 'd', 'o'); +QqQQdddoooo + > SELECT _FUNC_('AbCD123-@$#', -1, 'q', 'd', 'o'); +AqCDdddoooo + > SELECT _FUNC_('AbCD123-@$#', -1,-1, 'd', 'o'); +AbCDdddoooo + > SELECT _FUNC_('AbCD123-@$#', -1,-1, -1, 'o'); +AbCD123oooo + > SELECT _FUNC_(NULL, -1,-1, -1, 'o'); +NULL + > SELECT _FUNC_(NULL); +NULL + > SELECT _FUNC_('AbCD123-@$#', -1, -1, -1, -1); +AbCD123-@$# + """, + since = "3.4.0", + group = "string_funcs") +// scalastyle:on line.size.limit +case class Mask( +input: Expression, +upperChar: Expression, +lowerChar: Expression, +digitChar: Expression, +otherChar: Expression) +extends QuinaryExpression +with ImplicitCastInputTypes +with NullIntolerant { + + def this(input: Expression) = +this( + input, + Literal(Mask.MASKED_UPPERCASE), + Literal(Mask.MASKED_LOWERCASE), + Literal(Mask.MASKED_DIGIT), + Literal(Mask.MASKED_IGNORE)) + + def this(input: Expression, upperChar: Expression) = +this( + input, + upperChar, + Literal(Mask.MASKED_LOWERCASE), + Literal(Mask.MASKED_DIGIT), + Literal(Mask.MASKED_IGNORE)) + + def this(input: Expression, upperChar: Expression, lowerChar: Expression) = +this(input, upperChar, lowerChar, Literal(Mask.MASKED_DIGIT), Literal(Mask.MASKED_IGNORE)) + + def this( + input: Expression, + upperChar: Expression, + lowerChar: Expression, + digitChar: Expression) = +this(input, upperChar, lowerChar, digitChar, Literal(Mask.MASKED_IGNORE)) + + /** + * Expected input types from child expressions.
The i-th position in the returned seq indicates + * the type requirement for the i-th child. + * + * The possible values at each position are: + * 1. a specific data type, e.g. LongType, StringType. 2. a non-leaf abstract data type, e.g. + * NumericType, IntegralType, FractionalType. + */ + override def inputTypes: Seq[AbstractDataType] = +Seq(StringType, StringType, StringType, StringType, StringType) + + /** + * Called by default [[eval]] implementation. If subclasses of QuinaryExpression keep the default + * nullability, they can override this method to save null-check code. If we need full control + *
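The replacement rules described in the quoted documentation can be illustrated outside Spark. The following Python sketch mirrors the documented defaults ('X', 'x', 'n', and -1 for "retain the original character"); it is an illustration of the documented semantics, not the actual Catalyst implementation:

```python
# Defaults taken from the quoted docs; -1 means "retain the original character".
MASKED_UPPERCASE = "X"
MASKED_LOWERCASE = "x"
MASKED_DIGIT = "n"
MASKED_IGNORE = -1

def mask(s, upper=MASKED_UPPERCASE, lower=MASKED_LOWERCASE,
         digit=MASKED_DIGIT, other=MASKED_IGNORE):
    """Replace each character according to its class (upper, lower, digit,
    other), or keep it unchanged when that class's replacement is -1."""
    if s is None:
        return None
    out = []
    for ch in s:
        if ch.isupper():
            repl = upper
        elif ch.islower():
            repl = lower
        elif ch.isdigit():
            repl = digit
        else:
            repl = other
        out.append(ch if repl == -1 else repl)
    return "".join(out)

print(mask("AbCD123-@$#"))             # XxXXnnn-@$#
print(mask("AbCD123-@$#", "Q", "q"))   # QqQQnnn-@$#
```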
[GitHub] [spark] johanl-db commented on a diff in pull request #38400: [SPARK-40921][SQL] Add WHEN NOT MATCHED BY SOURCE clause to MERGE INTO
johanl-db commented on code in PR #38400: URL: https://github.com/apache/spark/pull/38400#discussion_r1008270419 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala: ## @@ -451,7 +451,22 @@ class AstBuilder extends SqlBaseParserBaseVisitor[AnyRef] with SQLConfHelper wit } } } -if (matchedActions.isEmpty && notMatchedActions.isEmpty) { +val notMatchedBySourceActions = ctx.notMatchedBySourceClause().asScala.map { + clause => { +val notMatchedBySourceAction = clause.notMatchedBySourceAction() +if (notMatchedBySourceAction.DELETE() != null) { + DeleteAction(Option(clause.notMatchedBySourceCond).map(expression)) +} else if (notMatchedBySourceAction.UPDATE() != null) { + val condition = Option(clause.notMatchedBySourceCond).map(expression) + UpdateAction(condition, + withAssignments(clause.notMatchedBySourceAction().assignmentList())) +} else { + // It should not be here. + throw QueryParsingErrors.unrecognizedNotMatchedBySourceActionError(clause) Review Comment: Parsing will fail if a user tries to use an invalid action. We don't cover this error in tests since it can't surface unless there's a bug. I replaced it (and the similar errors for matched / not matched) with internal errors. ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala: ## @@ -1563,13 +1563,29 @@ class Analyzer(override val catalogManager: CatalogManager) } InsertAction( resolvedInsertCondition, - resolveAssignments(assignments, m, resolveValuesWithSourceOnly = true)) + resolveAssignments(assignments, m, resolveValuesFrom = sourceTable)) case o => o } +val newNotMatchedBySourceActions = m.notMatchedBySourceActions.map { + case DeleteAction(deleteCondition) => +val resolvedDeleteCondition = deleteCondition.map( + resolveExpressionByPlanChildren(_, Project(Nil, targetTable))) Review Comment: Updated. -- This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
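For readers following the semantics under discussion: WHEN MATCHED acts on rows present in both tables, WHEN NOT MATCHED inserts source rows absent from the target, and the new WHEN NOT MATCHED BY SOURCE clause acts on target rows with no source match. A small Python simulation of the three clause families, simplified to one action per family (illustrative only, not Spark's implementation):

```python
def merge(target, source, key, matched=None, not_matched=None,
          not_matched_by_source=None):
    """Simulate MERGE INTO with one action per clause family.

    Each action is a pair (verb, fn): ("UPDATE", fn), ("DELETE", None)
    or ("INSERT", None). A missing clause leaves rows untouched."""
    src = {row[key]: row for row in source}
    tgt_keys = {row[key] for row in target}
    result = []
    for row in target:
        if row[key] in src:                      # WHEN MATCHED
            verb, fn = matched or (None, None)
            if verb == "UPDATE":
                result.append(fn(row, src[row[key]]))
            elif verb != "DELETE":
                result.append(row)               # no clause: keep as-is
        else:                                    # WHEN NOT MATCHED BY SOURCE
            verb, fn = not_matched_by_source or (None, None)
            if verb == "UPDATE":
                result.append(fn(row))
            elif verb != "DELETE":
                result.append(row)               # no clause: keep as-is
    for k, row in src.items():                   # WHEN NOT MATCHED (by target)
        if k not in tgt_keys and not_matched == ("INSERT", None):
            result.append(row)
    return result

target = [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}]
source = [{"id": 1, "v": "A"}, {"id": 3, "v": "C"}]
merged = merge(target, source, "id",
               matched=("UPDATE", lambda t, s: {**t, "v": s["v"]}),
               not_matched=("INSERT", None),
               not_matched_by_source=("DELETE", None))
# id=1 updated from source, id=3 inserted, id=2 (no source match) deleted
```

Without the new clause, the id=2 row could only be left untouched; NOT MATCHED BY SOURCE is what lets MERGE update or delete it.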
[GitHub] [spark] rangadi commented on a diff in pull request #38286: [SPARK-40657] Add support for Java classes in Protobuf functions
rangadi commented on code in PR #38286: URL: https://github.com/apache/spark/pull/38286#discussion_r1008242208 ## connector/protobuf/src/test/scala/org/apache/spark/sql/protobuf/ProtobufCatalystDataConversionSuite.scala: ## @@ -24,6 +24,7 @@ import org.apache.spark.sql.{RandomDataGenerator, Row} import org.apache.spark.sql.catalyst.{CatalystTypeConverters, InternalRow, NoopFilters, OrderedFilters, StructFilters} import org.apache.spark.sql.catalyst.expressions.{ExpressionEvalHelper, GenericInternalRow, Literal} import org.apache.spark.sql.catalyst.util.{ArrayBasedMapData, GenericArrayData, MapData} +import org.apache.spark.sql.protobuf.protos.CatalystTypes.BytesMsg Review Comment: @MaxGekk did you import it as a Maven or sbt project? Both should work, though. With Maven, 'generating sources' should fix that (in the Maven 'Tool Window'). I am also seeing a similar issue: IntelliJ does not seem to run the `generate-test-sources` phase. I tried just now, and running `../../build/mvn test` in the protobuf directory fixed this issue. (screenshot: https://user-images.githubusercontent.com/502522/198683965-67b5c95f-f1e6-4a9b-9f96-1607026946a0.png)
[GitHub] [spark] dtenedor commented on pull request #38065: [SPARK-40625] Add MASK_CCN and TRY_MASK_CCN functions to redact credit card string values
dtenedor commented on PR #38065: URL: https://github.com/apache/spark/pull/38065#issuecomment-1295190482 Going to close this in favor of the other work by @vinodkc instead.
[GitHub] [spark] dtenedor commented on pull request #38146: [SPARK-40687][SQL] Support data masking built-in function 'mask'
dtenedor commented on PR #38146: URL: https://github.com/apache/spark/pull/38146#issuecomment-1295191010 @vinodkc sounds good! Thanks for working on this
[GitHub] [spark] dtenedor closed pull request #38065: [SPARK-40625] Add MASK_CCN and TRY_MASK_CCN functions to redact credit card string values
dtenedor closed pull request #38065: [SPARK-40625] Add MASK_CCN and TRY_MASK_CCN functions to redact credit card string values URL: https://github.com/apache/spark/pull/38065
[GitHub] [spark] dtenedor commented on pull request #38101: [WIP][SPARK-40652][SQL] Add MASK_PHONE and TRY_MASK_PHONE functions to redact phone number string values
dtenedor commented on PR #38101: URL: https://github.com/apache/spark/pull/38101#issuecomment-1295190096 Going to close this in favor of the other effort by @vinodkc instead
[GitHub] [spark] dtenedor closed pull request #38101: [WIP][SPARK-40652][SQL] Add MASK_PHONE and TRY_MASK_PHONE functions to redact phone number string values
dtenedor closed pull request #38101: [WIP][SPARK-40652][SQL] Add MASK_PHONE and TRY_MASK_PHONE functions to redact phone number string values URL: https://github.com/apache/spark/pull/38101
[GitHub] [spark] antonipp commented on a diff in pull request #38376: [SPARK-40817] [Kubernetes] Do not discard remote user-specified files when launching Spark jobs on Kubernetes
antonipp commented on code in PR #38376: URL: https://github.com/apache/spark/pull/38376#discussion_r1008235962 ## core/src/test/scala/org/apache/spark/deploy/SparkSubmitSuite.scala: ## @@ -1609,6 +1609,16 @@ class TestFileSystem extends org.apache.hadoop.fs.LocalFileSystem { } override def open(path: Path): FSDataInputStream = super.open(local(path)) + + // No-op methods + + override def copyFromLocalFile( Review Comment: Fixed the indentation in a new commit. I also spent some time trying to make the integration tests work on my machine but encountered a blocking issue with my local setup; I opened a ticket here: https://issues.apache.org/jira/browse/SPARK-40954 Could we merge this PR as it is? I can spend more time on fixing my integration test setup and writing a new test some time later, but I think the unit test I added already gives us pretty good coverage.
[GitHub] [spark] dongjoon-hyun commented on pull request #38426: [SPARK-40951][PYSPARK][TESTS] `pyspark-connect` tests should be skipped if `pandas` doesn't exist
dongjoon-hyun commented on PR #38426: URL: https://github.com/apache/spark/pull/38426#issuecomment-1295178713 Thank you, @zhengruifeng. I'll fix the `mypy` style check failure at the linter stage.
```
starting mypy annotations test...
annotations failed mypy checks:
python/pyspark/sql/tests/connect/test_connect_select_ops.py:29: error: Argument 2 to "skipIf" has incompatible type "Optional[str]"; expected "str" [arg-type]
python/pyspark/sql/tests/connect/test_connect_plan_only.py:29: error: Argument 2 to "skipIf" has incompatible type "Optional[str]"; expected "str" [arg-type]
python/pyspark/sql/tests/connect/test_connect_column_expressions.py:30: error: Argument 2 to "skipIf" has incompatible type "Optional[str]"; expected "str" [arg-type]
Found 3 errors in 3 files (checked 369 source files)
Error: Process completed with exit code 1.
```
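The mypy errors above come from passing an `Optional[str]` skip reason to `unittest.skipIf`, whose stub requires `str`. A minimal reproduction with one typical fix, coalescing the optional message to a plain string (the variable and test names here are made up, not the actual pyspark-connect test code):

```python
import unittest
from typing import Optional

# e.g. set to a message like "pandas is required" when a dependency is
# missing; None means the tests should run (mirrors the failing pattern).
connect_requirement_message: Optional[str] = None

class ConnectTests(unittest.TestCase):
    # mypy rejects skipIf(cond, connect_requirement_message) because the
    # reason may be None while the stub expects str; `or ""` coalesces it.
    @unittest.skipIf(connect_requirement_message is not None,
                     connect_requirement_message or "")
    def test_select(self) -> None:
        self.assertEqual(1 + 1, 2)

suite = unittest.defaultTestLoader.loadTestsFromTestCase(ConnectTests)
result = unittest.TestResult()
suite.run(result)
```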
[GitHub] [spark] SandishKumarHN commented on a diff in pull request #38344: [SPARK-40777][SQL][PROTOBUF] Protobuf import support and move error-classes.
SandishKumarHN commented on code in PR #38344: URL: https://github.com/apache/spark/pull/38344#discussion_r1008228337 ## sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryCompilationErrors.scala: ## @@ -3210,4 +3211,190 @@ private[sql] object QueryCompilationErrors extends QueryErrorsBase { messageParameters = Map("expression" -> toSQLExpr(expression)) ) } + + def cannotConvertProtobufTypeToSqlTypeError( + protobufColumn: String, + sqlColumn: String, + protobufType: String, + sqlType: DataType): Throwable = { +new AnalysisException( + errorClass = "PROTOBUF_TYPE_TO_SQL_TYPE_ERROR", + messageParameters = Map( +"protobufColumn" -> protobufColumn, +"sqlColumn" -> toSQLId(sqlColumn), +"protobufType" -> protobufType, +"sqlType" -> toSQLType(sqlType))) + } + + def cannotConvertCatalystTypeToProtobufTypeError( + sqlColumn: String, + protobufColumn: String, + sqlType: DataType, + protobufType: String): Throwable = { +new AnalysisException( + errorClass = "SQL_TYPE_TO_PROTOBUF_TYPE_ERROR", + messageParameters = Map( +"sqlColumn" -> toSQLId(sqlColumn), +"protobufColumn" -> protobufColumn, +"sqlType" -> toSQLType(sqlType), +"protobufType" -> protobufType)) + } + + def cannotConvertCatalystTypeToProtobufEnumTypeError( + sqlColumn: String, + protobufColumn: String, + data: String, + enumString: String): Throwable = { +new AnalysisException( + errorClass = "CATALYST_TYPE_TO_PROTOBUF_ENUM_TYPE_ERROR", + messageParameters = Map( +"sqlColumn" -> sqlColumn, +"protobufColumn" -> protobufColumn, +"data" -> data, +"enumString" -> enumString)) + } + + def cannotConvertProtobufTypeToCatalystTypeError( + protobufType: String, + sqlType: DataType, + cause: Throwable): Throwable = { +new AnalysisException( + errorClass = "PROTOBUF_TYPE_TO_CATALYST_TYPE_ERROR", + messageParameters = Map( +"protobufType" -> protobufType, +"toType" -> toSQLType(sqlType)), + cause = Some(cause.getCause)) + } + + def cannotConvertSqlTypeToProtobufError( + protobufType: String, + sqlType: DataType, + 
cause: Throwable): Throwable = { +new AnalysisException( + errorClass = "UNABLE_TO_CONVERT_TO_PROTOBUF_TYPE", + messageParameters = Map( +"protobufType" -> protobufType, +"toType" -> toSQLType(sqlType)), + cause = Some(cause.getCause)) Review Comment: fixed
[GitHub] [spark] SandishKumarHN commented on a diff in pull request #38344: [SPARK-40777][SQL][PROTOBUF] Protobuf import support and move error-classes.
SandishKumarHN commented on code in PR #38344: URL: https://github.com/apache/spark/pull/38344#discussion_r1008228857 ## sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryCompilationErrors.scala: ## @@ -3210,4 +3211,190 @@ private[sql] object QueryCompilationErrors extends QueryErrorsBase { messageParameters = Map("expression" -> toSQLExpr(expression)) ) } + + def cannotConvertProtobufTypeToSqlTypeError( + protobufColumn: String, + sqlColumn: String, + protobufType: String, + sqlType: DataType): Throwable = { +new AnalysisException( + errorClass = "PROTOBUF_TYPE_TO_SQL_TYPE_ERROR", + messageParameters = Map( +"protobufColumn" -> protobufColumn, +"sqlColumn" -> toSQLId(sqlColumn), +"protobufType" -> protobufType, +"sqlType" -> toSQLType(sqlType))) + } + + def cannotConvertCatalystTypeToProtobufTypeError( + sqlColumn: String, + protobufColumn: String, + sqlType: DataType, + protobufType: String): Throwable = { +new AnalysisException( + errorClass = "SQL_TYPE_TO_PROTOBUF_TYPE_ERROR", + messageParameters = Map( +"sqlColumn" -> toSQLId(sqlColumn), +"protobufColumn" -> protobufColumn, +"sqlType" -> toSQLType(sqlType), +"protobufType" -> protobufType)) + } + + def cannotConvertCatalystTypeToProtobufEnumTypeError( + sqlColumn: String, + protobufColumn: String, + data: String, + enumString: String): Throwable = { +new AnalysisException( + errorClass = "CATALYST_TYPE_TO_PROTOBUF_ENUM_TYPE_ERROR", + messageParameters = Map( +"sqlColumn" -> sqlColumn, +"protobufColumn" -> protobufColumn, +"data" -> data, +"enumString" -> enumString)) + } + + def cannotConvertProtobufTypeToCatalystTypeError( + protobufType: String, + sqlType: DataType, + cause: Throwable): Throwable = { +new AnalysisException( + errorClass = "PROTOBUF_TYPE_TO_CATALYST_TYPE_ERROR", + messageParameters = Map( +"protobufType" -> protobufType, +"toType" -> toSQLType(sqlType)), + cause = Some(cause.getCause)) + } + + def cannotConvertSqlTypeToProtobufError( + protobufType: String, + sqlType: DataType, + 
cause: Throwable): Throwable = { +new AnalysisException( + errorClass = "UNABLE_TO_CONVERT_TO_PROTOBUF_TYPE", + messageParameters = Map( +"protobufType" -> protobufType, +"toType" -> toSQLType(sqlType)), + cause = Some(cause.getCause)) + } + + def protobufTypeUnsupportedYetError(protobufType: String): Throwable = { +new AnalysisException( + errorClass = "PROTOBUF_TYPE_NOT_SUPPORT_ERROR", + messageParameters = Map("protobufType" -> protobufType)) + } + + def unknownProtobufMessageTypeError( + descriptorName: String, + containingType: String): Throwable = { +new AnalysisException( + errorClass = "UNKNOWN_PROTOBUF_MESSAGE_TYPE", + messageParameters = Map( +"descriptorName" -> descriptorName, +"containingType" -> containingType)) + } + + def cannotFindCatalystTypeInProtobufSchemaError(catalystFieldPath: String): Throwable = { +new AnalysisException( + errorClass = "NO_CATALYST_TYPE_IN_PROTOBUF_SCHEMA", + messageParameters = Map("catalystFieldPath" -> catalystFieldPath)) + } + + def cannotFindProtobufFieldInCatalystError(field: String): Throwable = { +new AnalysisException( + errorClass = "PROTOBUF_FIELD_MISSING_IN_CATALYST_SCHEMA", + messageParameters = Map("field" -> field)) + } + + def protobufFieldMatchError(field: String, + protobufSchema: String, + matchSize: String, + matches: String): Throwable = { +new AnalysisException( + errorClass = "PROTOBUF_FIELD_MISSING_ERROR", + messageParameters = Map( +"field" -> field, +"protobufSchema" -> protobufSchema, +"matchSize" -> matchSize, +"matches" -> matches)) + } + + def unableToLocateProtobufMessageError(messageName: String): Throwable = { +new AnalysisException( + errorClass = "UNABLE_TO_LOCATE_PROTOBUF_MESSAGE_ERROR", + messageParameters = Map("messageName" -> messageName)) + } + + def descrioptorParseError(cause: Throwable): Throwable = { +new AnalysisException( + errorClass = "PROTOBUF_DESCRIPTOR_ERROR", + messageParameters = Map.empty(), + cause = Some(cause.getCause)) + } + + def 
cannotFindDescriptorFileError(filePath: String, cause: Throwable): Throwable = { +new AnalysisException( + errorClass = "CANNOT_FIND_PROTOBUF_DESCRIPTOR_FILE_ERROR", + messageParameters = Map("filePath" -> filePath), + cause = Some(cause.getCause)) + } + + def noProtobufMessageTypeReturnError(descriptorName: String): Throwable = { +new AnalysisException( + errorClass = "NO_PROTOBUF_MESSAGE_TYPE_ERROR", + messageParameters = Map("descriptorName" ->
[GitHub] [spark] SandishKumarHN commented on a diff in pull request #38344: [SPARK-40777][SQL][PROTOBUF] Protobuf import support and move error-classes.
SandishKumarHN commented on code in PR #38344: URL: https://github.com/apache/spark/pull/38344#discussion_r1008228200 ## connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/ProtobufSerializer.scala: ## @@ -215,10 +217,12 @@ private[sql] class ProtobufSerializer( duration.build() case _ => -throw new IncompatibleSchemaException( - errorPrefix + -s"schema is incompatible (sqlType = ${catalystType.sql}, " + -s"protoType = ${fieldDescriptor.getJavaType})") +throw QueryCompilationErrors.cannotConvertCatalystTypeToProtobufTypeError( + toFieldStr(catalystPath), Review Comment: Okay, I will call toSQLType inside the error class and avoid using toFieldStr for sqlType.
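The migration pattern in this PR replaces hand-concatenated message strings with a named error class plus a map of message parameters. The lookup-and-substitute mechanics can be sketched as follows (the registry content, placeholder syntax, and the `ValueError` stand-in for `AnalysisException` are illustrative assumptions; Spark's real templates live in error-classes.json with their own syntax):

```python
import string

# Hypothetical registry entry, loosely modeled on the error classes
# added in this PR -- the wording is an assumption, not Spark's template.
ERROR_CLASSES = {
    "SQL_TYPE_TO_PROTOBUF_TYPE_ERROR":
        "Cannot convert SQL ${sqlColumn} to Protobuf ${protobufColumn} "
        "because schema is incompatible (sqlType = ${sqlType}, "
        "protobufType = ${protobufType}).",
}

def compilation_error(error_class: str, parameters: dict) -> ValueError:
    """Build an exception from an error class plus named parameters,
    instead of concatenating an ad-hoc message string at the call site."""
    template = string.Template(ERROR_CLASSES[error_class])
    return ValueError(template.substitute(parameters))

err = compilation_error("SQL_TYPE_TO_PROTOBUF_TYPE_ERROR", {
    "sqlColumn": "`col`", "protobufColumn": "field 'col'",
    "sqlType": '"STRING"', "protobufType": "MESSAGE",
})
```

Centralizing templates this way is what makes the review feedback above cheap to apply: formatting helpers like toSQLType can be called in one place rather than at every throw site.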
[GitHub] [spark] dongjoon-hyun commented on pull request #38425: [SPARK-40229][PS][TEST][FOLLOWUP] Add `openpyxl` to `requirements.txt`
dongjoon-hyun commented on PR #38425: URL: https://github.com/apache/spark/pull/38425#issuecomment-1295175461 Thank you, @Yikun.
[GitHub] [spark] allisonwang-db opened a new pull request, #38429: [SPARK-40800][SQL][FOLLOW-UP] Add a config to control whether to always inline one-row relation subquery
allisonwang-db opened a new pull request, #38429: URL: https://github.com/apache/spark/pull/38429 ### What changes were proposed in this pull request? This PR is a follow-up for SPARK-40800. It introduces a new Spark config to control whether to always inline one-row relation subqueries: `spark.sql.optimizer.optimizeOneRowRelationSubquery.alwaysInline` (default: true). ### Why are the changes needed? To give users more flexibility to control correlated subquery performance. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Edited existing unit tests.
[GitHub] [spark] panbingkun commented on a diff in pull request #38350: [SPARK-40752][SQL] Migrate type check failures of misc expressions onto error classes
panbingkun commented on code in PR #38350: URL: https://github.com/apache/spark/pull/38350#discussion_r1008152723 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/SortOrder.scala: ## @@ -72,7 +74,11 @@ case class SortOrder( if (RowOrdering.isOrderable(dataType)) { TypeCheckResult.TypeCheckSuccess } else { - TypeCheckResult.TypeCheckFailure(s"cannot sort data type ${dataType.catalogString}") + DataTypeMismatch( +errorSubClass = "UNSUPPORTED_INPUT_TYPE", Review Comment: Done
[GitHub] [spark] cloud-fan commented on pull request #38263: [SPARK-40692][SQL] Support data masking built-in function 'mask_hash'
cloud-fan commented on PR #38263: URL: https://github.com/apache/spark/pull/38263#issuecomment-1295085134 One more reference: https://docs.snowflake.com/en/sql-reference/functions/sha2.html We can probably find more references and then decide the function name and signature for consistent hashing. cc @srielau
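Consistent hash masking of the kind mask_hash provides replaces a value with a deterministic digest, so equal inputs stay equal (and joinable) after masking. A sketch using Python's hashlib with the SHA-2 family from the sha2 reference above; the function name, signature, and bit-length parameter are assumptions, since deciding the real signature is exactly what this thread is about:

```python
import hashlib
from typing import Optional

def mask_hash(value: Optional[str], bit_length: int = 256) -> Optional[str]:
    """Consistently mask a value with a SHA-2 digest: equal inputs yield
    equal outputs, so masked columns remain usable as join keys."""
    if value is None:
        return None
    algos = {224: hashlib.sha224, 256: hashlib.sha256,
             384: hashlib.sha384, 512: hashlib.sha512}
    return algos[bit_length](value.encode("utf-8")).hexdigest()

masked = mask_hash("alice@example.com")  # 64 hex chars for SHA-256
```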
[GitHub] [spark] cloud-fan commented on a diff in pull request #38400: [SPARK-40921][SQL] Add WHEN NOT MATCHED BY SOURCE clause to MERGE INTO
cloud-fan commented on code in PR #38400: URL: https://github.com/apache/spark/pull/38400#discussion_r1008135830 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala: ## @@ -451,7 +451,22 @@ class AstBuilder extends SqlBaseParserBaseVisitor[AnyRef] with SQLConfHelper wit } } } -if (matchedActions.isEmpty && notMatchedActions.isEmpty) { +val notMatchedBySourceActions = ctx.notMatchedBySourceClause().asScala.map { + clause => { +val notMatchedBySourceAction = clause.notMatchedBySourceAction() +if (notMatchedBySourceAction.DELETE() != null) { + DeleteAction(Option(clause.notMatchedBySourceCond).map(expression)) +} else if (notMatchedBySourceAction.UPDATE() != null) { + val condition = Option(clause.notMatchedBySourceCond).map(expression) + UpdateAction(condition, + withAssignments(clause.notMatchedBySourceAction().assignmentList())) +} else { + // It should not be here. + throw QueryParsingErrors.unrecognizedNotMatchedBySourceActionError(clause) Review Comment: If this means a bug, we should throw `SparkException.internalError`
[GitHub] [spark] vinodkc commented on a diff in pull request #38263: [SPARK-40692][SQL] Support data masking built-in function 'mask_hash'
vinodkc commented on code in PR #38263: URL: https://github.com/apache/spark/pull/38263#discussion_r1008128017 ## sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala: ## @@ -3950,6 +3950,14 @@ object SQLConf { .checkValues(ErrorMessageFormat.values.map(_.toString)) .createWithDefault(ErrorMessageFormat.PRETTY.toString) + val SPARK_MASKING_ALGO = buildConf("spark.sql.masking.algo") Review Comment: Yes, we can add a second boolean parameter to control the fips mode. I'll update the code
[GitHub] [spark] cloud-fan commented on a diff in pull request #38400: [SPARK-40921][SQL] Add WHEN NOT MATCHED BY SOURCE clause to MERGE INTO
cloud-fan commented on code in PR #38400: URL: https://github.com/apache/spark/pull/38400#discussion_r1008130147 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala: ## @@ -1563,13 +1563,29 @@ class Analyzer(override val catalogManager: CatalogManager) } InsertAction( resolvedInsertCondition, - resolveAssignments(assignments, m, resolveValuesWithSourceOnly = true)) + resolveAssignments(assignments, m, resolveValuesFrom = sourceTable)) case o => o } +val newNotMatchedBySourceActions = m.notMatchedBySourceActions.map { + case DeleteAction(deleteCondition) => +val resolvedDeleteCondition = deleteCondition.map( + resolveExpressionByPlanChildren(_, Project(Nil, targetTable))) Review Comment: nit: we can call `resolveExpressionByPlanOutput`, then we don't need to wrap the plan with `Project(Nil, ...)`
[GitHub] [spark] cloud-fan commented on a diff in pull request #38400: [SPARK-40921][SQL] Add WHEN NOT MATCHED BY SOURCE clause to MERGE INTO
cloud-fan commented on code in PR #38400: URL: https://github.com/apache/spark/pull/38400#discussion_r1008124090 ## core/src/main/resources/error/error-classes.json: ## @@ -801,6 +819,21 @@ ], "sqlState" : "42000" }, + "UNRECOGNIZED_MATCHED_ACTION" : { +"message" : [ + "Unrecognized matched action: ." +] + }, + "UNRECOGNIZED_NOT_MATCHED_ACTION" : { Review Comment: ditto, `UNRECOGNIZED_NOT_MATCHED_BY_TARGET_ACTION`?
[GitHub] [spark] cloud-fan commented on a diff in pull request #38400: [SPARK-40921][SQL] Add WHEN NOT MATCHED BY SOURCE clause to MERGE INTO
cloud-fan commented on code in PR #38400: URL: https://github.com/apache/spark/pull/38400#discussion_r1008123172

## core/src/main/resources/error/error-classes.json:
@@ -602,6 +602,24 @@
     ],
     "sqlState" : "42000"
   },
+  "NON_LAST_MATCHED_CLAUSE_OMIT_CONDITION" : {
+    "message" : [
+      "When there are more than one MATCHED clauses in a MERGE statement, only the last MATCHED clause can omit the condition."
+    ],
+    "sqlState" : "42000"
+  },
+  "NON_LAST_NOT_MATCHED_BY_SOURCE_CLAUSE_OMIT_CONDITION" : {
+    "message" : [
+      "When there are more than one NOT MATCHED BY SOURCE clauses in a MERGE statement, only the last NOT MATCHED BY SOURCE clause can omit the condition."
+    ],
+    "sqlState" : "42000"
+  },
+  "NON_LAST_NOT_MATCHED_CLAUSE_OMIT_CONDITION" : {

Review Comment: is it more accurate to name it `NON_LAST_NOT_MATCHED_BY_TARGET_CLAUSE_OMIT_CONDITION`?
[GitHub] [spark] MaxGekk commented on a diff in pull request #38286: [SPARK-40657] Add support for Java classes in Protobuf functions
MaxGekk commented on code in PR #38286: URL: https://github.com/apache/spark/pull/38286#discussion_r1008102188

## connector/protobuf/src/test/scala/org/apache/spark/sql/protobuf/ProtobufCatalystDataConversionSuite.scala:
@@ -24,6 +24,7 @@
 import org.apache.spark.sql.{RandomDataGenerator, Row}
 import org.apache.spark.sql.catalyst.{CatalystTypeConverters, InternalRow, NoopFilters, OrderedFilters, StructFilters}
 import org.apache.spark.sql.catalyst.expressions.{ExpressionEvalHelper, GenericInternalRow, Literal}
 import org.apache.spark.sql.catalyst.util.{ArrayBasedMapData, GenericArrayData, MapData}
+import org.apache.spark.sql.protobuf.protos.CatalystTypes.BytesMsg

Review Comment: I am struggling to generate this file in IDEA (screenshot: https://user-images.githubusercontent.com/1580697/198630233-33341e69-276f-4505-9d52-874ddd1e6bc1.png). Is there any way to generate all the required dependencies?
[GitHub] [spark] AmplabJenkins commented on pull request #38427: [SPARK-40950][CORE] Fix isRemoteAddressMaxedOut performance overhead on scala 2.13
AmplabJenkins commented on PR #38427: URL: https://github.com/apache/spark/pull/38427#issuecomment-1295039735 Can one of the admins verify this patch?
[GitHub] [spark] AmplabJenkins commented on pull request #38428: [SPARK-40912][CORE][WIP] Overhead of Exceptions in DeserializationStream
AmplabJenkins commented on PR #38428: URL: https://github.com/apache/spark/pull/38428#issuecomment-1295039656 Can one of the admins verify this patch?
[GitHub] [spark] MaxGekk commented on pull request #38350: [SPARK-40752][SQL] Migrate type check failures of misc expressions onto error classes
MaxGekk commented on PR #38350: URL: https://github.com/apache/spark/pull/38350#issuecomment-1295031180

@panbingkun Please, remove duplicate items in PR's description:
```
- 1.Add new UT
- 2.Update existed UT
- 3.Pass GA.
```
to
```
1. Add new UT
2. Update existed UT
3. Pass GA.
```
[GitHub] [spark] MaxGekk commented on a diff in pull request #38350: [SPARK-40752][SQL] Migrate type check failures of misc expressions onto error classes
MaxGekk commented on code in PR #38350: URL: https://github.com/apache/spark/pull/38350#discussion_r1008092467

## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/SortOrder.scala:
@@ -72,7 +74,11 @@ case class SortOrder(
     if (RowOrdering.isOrderable(dataType)) {
       TypeCheckResult.TypeCheckSuccess
     } else {
-      TypeCheckResult.TypeCheckFailure(s"cannot sort data type ${dataType.catalogString}")
+      DataTypeMismatch(
+        errorSubClass = "UNSUPPORTED_INPUT_TYPE",

Review Comment: Could you re-use the existing error class `INVALID_ORDERING_TYPE`. Just call `TypeUtils.checkForOrderingExpr(...)` here.
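The pattern being suggested above can be modeled in a few lines of standalone Scala. This is a simplified sketch, not Spark's actual catalyst API: the types and the `orderable` set are invented stand-ins; the point is only that routing every orderability check through one shared helper makes all call sites produce the same `INVALID_ORDERING_TYPE` error class.

```scala
// Simplified standalone model of a shared type-check helper.
// Types and names are illustrative, not Spark's catalyst classes.
sealed trait TypeCheckResult
case object TypeCheckSuccess extends TypeCheckResult
final case class DataTypeMismatch(
    errorSubClass: String,
    messageParameters: Map[String, String]) extends TypeCheckResult

object TypeUtilsSketch {
  // Stand-in for RowOrdering.isOrderable.
  private val orderable: Set[String] = Set("int", "bigint", "string", "double", "date")

  // Every caller that needs an orderable input funnels through here,
  // so the error class is produced in exactly one place.
  def checkForOrderingExpr(dataType: String, caller: String): TypeCheckResult =
    if (orderable(dataType)) TypeCheckSuccess
    else DataTypeMismatch(
      "INVALID_ORDERING_TYPE",
      Map("functionName" -> caller, "dataType" -> dataType))
}
```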
[GitHub] [spark] MaxGekk commented on pull request #38350: [SPARK-40752][SQL] Migrate type check failures of misc expressions onto error classes
MaxGekk commented on PR #38350: URL: https://github.com/apache/spark/pull/38350#issuecomment-1295009404 @panbingkun This pr **replace** TypeCheckFailure -> This pr **replaces** TypeCheckFailure
[GitHub] [spark] derhagen commented on a diff in pull request #38389: [MINOR][DOCS][PYTHON] Fix the truncation of API reference in several DataTypes
derhagen commented on code in PR #38389: URL: https://github.com/apache/spark/pull/38389#discussion_r1008071665

## python/pyspark/sql/types.py:
@@ -350,21 +350,21 @@ class FloatType(FractionalType, metaclass=DataTypeSingleton):

 class ByteType(IntegralType):
-    """Byte data type, i.e. a signed integer in a single byte."""
+    """Byte data type, i.e.\ a signed integer in a single byte."""

Review Comment: starting the docstrings with an 'r' string prefix might be the right thing to do here, but I'm travelling right now. I'll test it next week, if I find the time.
[GitHub] [spark] srowen commented on pull request #38352: [SPARK-40801][BUILD][3.2] Upgrade `Apache commons-text` to 1.10
srowen commented on PR #38352: URL: https://github.com/apache/spark/pull/38352#issuecomment-1295001164 The latter.
[GitHub] [spark] sithmein commented on pull request #38352: [SPARK-40801][BUILD][3.2] Upgrade `Apache commons-text` to 1.10
sithmein commented on PR #38352: URL: https://github.com/apache/spark/pull/38352#issuecomment-1294996843 Is Spark actually affected by the problem in the `StringLookup` class or is this dependency update only to get rid of all the alarms raised by code scanners?
[GitHub] [spark] cloud-fan commented on a diff in pull request #38395: [SPARK-40917][SQL] Add a dedicated logical plan for `Summary`
cloud-fan commented on code in PR #38395: URL: https://github.com/apache/spark/pull/38395#discussion_r1008046680

## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala:
@@ -2100,3 +2100,53 @@ object AsOfJoin {
     }
   }
 }
+
+
+/**
+ * A logical plan for summary.
+ */
+case class Summary(

Review Comment: they are not a function in my opinion, but more like a dataframe operation. e.g. we won't call `df.filter` a DataFrame function.
[GitHub] [spark] cloud-fan closed pull request #38410: [SPARK-40932][CORE] Fix issue messages for allGather are overridden
cloud-fan closed pull request #38410: [SPARK-40932][CORE] Fix issue messages for allGather are overridden URL: https://github.com/apache/spark/pull/38410
[GitHub] [spark] cloud-fan commented on pull request #38410: [SPARK-40932][CORE] Fix issue messages for allGather are overridden
cloud-fan commented on PR #38410: URL: https://github.com/apache/spark/pull/38410#issuecomment-1294978376 thanks, merging to master/3.3!
[GitHub] [spark] MaxGekk commented on a diff in pull request #38344: [SPARK-40777][SQL][PROTOBUF] Protobuf import support and move error-classes.
MaxGekk commented on code in PR #38344: URL: https://github.com/apache/spark/pull/38344#discussion_r1008034311

## sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryCompilationErrors.scala:
@@ -3210,4 +3211,190 @@ private[sql] object QueryCompilationErrors extends QueryErrorsBase {
      messageParameters = Map("expression" -> toSQLExpr(expression))
    )
  }
+
+  def cannotConvertProtobufTypeToSqlTypeError(
+      protobufColumn: String,
+      sqlColumn: String,
+      protobufType: String,
+      sqlType: DataType): Throwable = {
+    new AnalysisException(
+      errorClass = "PROTOBUF_TYPE_TO_SQL_TYPE_ERROR",
+      messageParameters = Map(
+        "protobufColumn" -> protobufColumn,
+        "sqlColumn" -> toSQLId(sqlColumn),
+        "protobufType" -> protobufType,
+        "sqlType" -> toSQLType(sqlType)))
+  }
+
+  def cannotConvertCatalystTypeToProtobufTypeError(
+      sqlColumn: String,
+      protobufColumn: String,
+      sqlType: DataType,
+      protobufType: String): Throwable = {
+    new AnalysisException(
+      errorClass = "SQL_TYPE_TO_PROTOBUF_TYPE_ERROR",
+      messageParameters = Map(
+        "sqlColumn" -> toSQLId(sqlColumn),
+        "protobufColumn" -> protobufColumn,
+        "sqlType" -> toSQLType(sqlType),
+        "protobufType" -> protobufType))
+  }
+
+  def cannotConvertCatalystTypeToProtobufEnumTypeError(
+      sqlColumn: String,
+      protobufColumn: String,
+      data: String,
+      enumString: String): Throwable = {
+    new AnalysisException(
+      errorClass = "CATALYST_TYPE_TO_PROTOBUF_ENUM_TYPE_ERROR",
+      messageParameters = Map(
+        "sqlColumn" -> sqlColumn,
+        "protobufColumn" -> protobufColumn,
+        "data" -> data,
+        "enumString" -> enumString))
+  }
+
+  def cannotConvertProtobufTypeToCatalystTypeError(
+      protobufType: String,
+      sqlType: DataType,
+      cause: Throwable): Throwable = {
+    new AnalysisException(
+      errorClass = "PROTOBUF_TYPE_TO_CATALYST_TYPE_ERROR",
+      messageParameters = Map(
+        "protobufType" -> protobufType,
+        "toType" -> toSQLType(sqlType)),
+      cause = Some(cause.getCause))
+  }
+
+  def cannotConvertSqlTypeToProtobufError(
+      protobufType: String,
+      sqlType: DataType,
+      cause: Throwable): Throwable = {
+    new AnalysisException(
+      errorClass = "UNABLE_TO_CONVERT_TO_PROTOBUF_TYPE",
+      messageParameters = Map(
+        "protobufType" -> protobufType,
+        "toType" -> toSQLType(sqlType)),
+      cause = Some(cause.getCause))
+  }
+
+  def protobufTypeUnsupportedYetError(protobufType: String): Throwable = {
+    new AnalysisException(
+      errorClass = "PROTOBUF_TYPE_NOT_SUPPORT_ERROR",
+      messageParameters = Map("protobufType" -> protobufType))
+  }
+
+  def unknownProtobufMessageTypeError(
+      descriptorName: String,
+      containingType: String): Throwable = {
+    new AnalysisException(
+      errorClass = "UNKNOWN_PROTOBUF_MESSAGE_TYPE",
+      messageParameters = Map(
+        "descriptorName" -> descriptorName,
+        "containingType" -> containingType))
+  }
+
+  def cannotFindCatalystTypeInProtobufSchemaError(catalystFieldPath: String): Throwable = {
+    new AnalysisException(
+      errorClass = "NO_CATALYST_TYPE_IN_PROTOBUF_SCHEMA",
+      messageParameters = Map("catalystFieldPath" -> catalystFieldPath))
+  }
+
+  def cannotFindProtobufFieldInCatalystError(field: String): Throwable = {
+    new AnalysisException(
+      errorClass = "PROTOBUF_FIELD_MISSING_IN_CATALYST_SCHEMA",
+      messageParameters = Map("field" -> field))
+  }
+
+  def protobufFieldMatchError(field: String,
+      protobufSchema: String,
+      matchSize: String,
+      matches: String): Throwable = {
+    new AnalysisException(
+      errorClass = "PROTOBUF_FIELD_MISSING_ERROR",
+      messageParameters = Map(
+        "field" -> field,
+        "protobufSchema" -> protobufSchema,
+        "matchSize" -> matchSize,
+        "matches" -> matches))
+  }
+
+  def unableToLocateProtobufMessageError(messageName: String): Throwable = {
+    new AnalysisException(
+      errorClass = "UNABLE_TO_LOCATE_PROTOBUF_MESSAGE_ERROR",
+      messageParameters = Map("messageName" -> messageName))
+  }
+
+  def descrioptorParseError(cause: Throwable): Throwable = {
+    new AnalysisException(
+      errorClass = "PROTOBUF_DESCRIPTOR_ERROR",
+      messageParameters = Map.empty(),
+      cause = Some(cause.getCause))
+  }
+
+  def cannotFindDescriptorFileError(filePath: String, cause: Throwable): Throwable = {
+    new AnalysisException(
+      errorClass = "CANNOT_FIND_PROTOBUF_DESCRIPTOR_FILE_ERROR",
+      messageParameters = Map("filePath" -> filePath),
+      cause = Some(cause.getCause))
+  }
+
+  def noProtobufMessageTypeReturnError(descriptorName: String): Throwable = {
+    new AnalysisException(
+      errorClass = "NO_PROTOBUF_MESSAGE_TYPE_ERROR",
+      messageParameters = Map("descriptorName" -> descriptorName))
+
[GitHub] [spark] eejbyfeldt opened a new pull request, #38428: [SPARK-40912][CORE][WIP] Overhead of Exceptions in DeserializationStream
eejbyfeldt opened a new pull request, #38428: URL: https://github.com/apache/spark/pull/38428

### What changes were proposed in this pull request?
This PR avoids exceptions in the implementation of KryoDeserializationStream.

### Why are the changes needed?
Using an exception for end of stream is slow, especially for small streams. It is also problematic because the exception caught in KryoDeserializationStream could equally be caused by corrupt data, which is simply ignored in the current implementation.

### Does this PR introduce _any_ user-facing change?
Yes, some methods on KryoDeserializationStream no longer raise EOFException.

### How was this patch tested?
Existing tests. This PR only changes KryoDeserializationStream as a proof of concept. If this is the direction we want to go, we should probably change DeserializationStream instead so that the interface is consistent.
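The direction described above can be sketched in a few lines of standalone Scala. This is not the actual KryoDeserializationStream API; the class and method names are assumed for illustration. The idea is that the read method returns `Option[T]`, so end-of-stream becomes an ordinary value rather than a caught `EOFException`, and genuine corruption (e.g. a truncated record) can still surface as an exception instead of being swallowed.

```scala
// Sketch under assumed names: end-of-stream as Option.None instead of EOFException.
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, DataInputStream, DataOutputStream}

final class IntDeserializationStream(in: DataInputStream) {
  // None signals a clean end of stream; a truncated/corrupt payload
  // mid-record would still throw instead of being silently dropped.
  // (available() is reliable here because the backing stream is in-memory.)
  def readNext(): Option[Int] =
    if (in.available() >= 4) Some(in.readInt()) else None

  def asIterator: Iterator[Int] =
    Iterator.continually(readNext()).takeWhile(_.isDefined).map(_.get)
}
```

A caller drains the stream with `asIterator.toList` and never needs a `try`/`catch` on `EOFException` for control flow.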
[GitHub] [spark] cloud-fan commented on pull request #38397: [SPARK-40918][SQL] Mismatch between FileSourceScanExec and Orc and ParquetFileFormat on producing columnar output
cloud-fan commented on PR #38397: URL: https://github.com/apache/spark/pull/38397#issuecomment-1294971438 thanks, merging to master!
[GitHub] [spark] cloud-fan commented on pull request #38397: [SPARK-40918][SQL] Mismatch between FileSourceScanExec and Orc and ParquetFileFormat on producing columnar output
cloud-fan commented on PR #38397: URL: https://github.com/apache/spark/pull/38397#issuecomment-1294971845 unfortunately it conflicts with 3.3, @juliuszsompolski could you open a backport PR? thanks!
[GitHub] [spark] cloud-fan closed pull request #38397: [SPARK-40918][SQL] Mismatch between FileSourceScanExec and Orc and ParquetFileFormat on producing columnar output
cloud-fan closed pull request #38397: [SPARK-40918][SQL] Mismatch between FileSourceScanExec and Orc and ParquetFileFormat on producing columnar output URL: https://github.com/apache/spark/pull/38397
[GitHub] [spark] zhengruifeng commented on pull request #38426: [SPARK-40951][PYSPARK][TESTS] `pyspark-connect` tests should be skipped if `pandas` doesn't exist
zhengruifeng commented on PR #38426: URL: https://github.com/apache/spark/pull/38426#issuecomment-1294931880 LGTM
[GitHub] [spark] zhengruifeng commented on pull request #38424: [SPARK-40953][CONNECT][PYTHON] Fix `Dataframe.head`
zhengruifeng commented on PR #38424: URL: https://github.com/apache/spark/pull/38424#issuecomment-1294926943 @dongjoon-hyun thanks for the reminder
[GitHub] [spark] MaxGekk commented on pull request #38170: [WIP][SPARK-40663][SQL] Migrate execution errors onto error classes: _LEGACY_ERROR_TEMP_2201-2225
MaxGekk commented on PR #38170: URL: https://github.com/apache/spark/pull/38170#issuecomment-1294926445 @itholic Please, resolve conflicts and fix tests.
[GitHub] [spark] zhengruifeng commented on pull request #38420: [SPARK-40947][PS][INFRA] Upgrade pandas to 1.5.1
zhengruifeng commented on PR #38420: URL: https://github.com/apache/spark/pull/38420#issuecomment-1294924307 seems some tests failed again...
[GitHub] [spark] MaxGekk commented on a diff in pull request #38422: [SPARK-40948][SQL] Introduce new error class: DATA_PATH_NOT_EXIST
MaxGekk commented on code in PR #38422: URL: https://github.com/apache/spark/pull/38422#discussion_r1007996419

## core/src/main/resources/error/error-classes.json:
@@ -98,6 +98,11 @@
     "The value () cannot be converted to because it is malformed. Correct the value as per the syntax, or change its format. Use to tolerate malformed input and return NULL instead."
   ]
 },
+  "DATA_PATH_NOT_EXIST" : {

Review Comment: How about `NON_EXISTENT_PATH`.
[GitHub] [spark] MaxGekk closed pull request #38421: [SPARK-40889][SQL][TESTS] Check error classes in PlanResolutionSuite
MaxGekk closed pull request #38421: [SPARK-40889][SQL][TESTS] Check error classes in PlanResolutionSuite URL: https://github.com/apache/spark/pull/38421
[GitHub] [spark] MaxGekk commented on pull request #38421: [SPARK-40889][SQL][TESTS] Check error classes in PlanResolutionSuite
MaxGekk commented on PR #38421: URL: https://github.com/apache/spark/pull/38421#issuecomment-1294918032 +1, LGTM. Merging to master. Thank you, @panbingkun and @itholic for review.
[GitHub] [spark] MaxGekk closed pull request #38413: [SPARK-40936][SQL][TESTS] Refactor `AnalysisTest#assertAnalysisErrorClass` by reusing the `SparkFunSuite#checkError`
MaxGekk closed pull request #38413: [SPARK-40936][SQL][TESTS] Refactor `AnalysisTest#assertAnalysisErrorClass` by reusing the `SparkFunSuite#checkError` URL: https://github.com/apache/spark/pull/38413
[GitHub] [spark] MaxGekk commented on pull request #38413: [SPARK-40936][SQL][TESTS] Refactor `AnalysisTest#assertAnalysisErrorClass` by reusing the `SparkFunSuite#checkError`
MaxGekk commented on PR #38413: URL: https://github.com/apache/spark/pull/38413#issuecomment-1294915767 +1, LGTM. Merging to master. Thank you, @LuciferYang.
[GitHub] [spark] eejbyfeldt opened a new pull request, #38427: [SPARK-40950][CORE] Fix isRemoteAddressMaxedOut performance overhead on scala 2.13
eejbyfeldt opened a new pull request, #38427: URL: https://github.com/apache/spark/pull/38427

### What changes were proposed in this pull request?
In FetchBlockRequest we currently store a `Seq[FetchBlockInfo]`, and as part of the function `isRemoteAddressMaxedOut` (probably other places as well, but this is the function that showed up in my profiling) we use the length of this Seq. In Scala 2.12 `Seq` is an alias for `scala.collection.Seq`, but in 2.13 it is an alias for `scala.collection.immutable.Seq`. This means that when, for example, we call `toSeq` on an `ArrayBuffer`, in 2.12 we do nothing and the `blocks` in the `FetchRequest` will be backed by something with a cheap `length`, but in 2.13 we end up copying the data to a `List`. This PR solves this by changing the `Seq` to an `IndexedSeq`, thereby making the expectation of a cheap `length` function explicit. This means that in some places we will do an extra copy in Scala 2.13 compared to 2.12 (which was also the case before this PR). If we wanted to avoid this copy we should instead change it to use `scala.collection.IndexedSeq`, so that we would have the same types in both 2.13 and 2.12.

### Why are the changes needed?
The performance of ShuffleBlockFetcherIterator is much worse on Scala 2.13 than on 2.12. I have seen cases where the overhead of repeatedly calculating the length is as much as 20% of CPU time (and it could probably be even worse for larger shuffles).

### Does this PR introduce _any_ user-facing change?
No. I think the interface changes are only on private classes.

### How was this patch tested?
Existing specs.
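The 2.12 vs 2.13 difference described above can be illustrated with a minimal sketch. This is illustrative only, not the Spark code: on Scala 2.13 the default `Seq` alias is `scala.collection.immutable.Seq`, so `ArrayBuffer.toSeq` must copy into an immutable collection, whereas asking for an `IndexedSeq` makes the O(1) `length` expectation explicit in the type.

```scala
// Minimal illustration of the Seq-alias difference, not Spark code.
import scala.collection.mutable.ArrayBuffer

object SeqLengthSketch {
  def demo(): (Seq[Int], IndexedSeq[Int]) = {
    val buf = ArrayBuffer(1, 2, 3)
    val asSeq = buf.toSeq            // on 2.13 this copies into an immutable Seq
    val asIndexed = buf.toIndexedSeq // immutable IndexedSeq: `length` is O(1)
    (asSeq, asIndexed)
  }
}
```

Declaring `blocks: IndexedSeq[FetchBlockInfo]` instead of `Seq[FetchBlockInfo]` is the same move: callers can no longer hand in a `List`, whose `length` walks every element.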
[GitHub] [spark] MaxGekk commented on pull request #38422: [SPARK-40948][SQL] Introduce new error class: DATA_PATH_NOT_EXIST
MaxGekk commented on PR #38422: URL: https://github.com/apache/spark/pull/38422#issuecomment-1294872638 @itholic Could you add a test for this new error class, please.
[GitHub] [spark] Yikun commented on pull request #38425: [SPARK-40229][PS][TEST][FOLLOWUP] Add `openpyxl` to `requirements.txt`
Yikun commented on PR #38425: URL: https://github.com/apache/spark/pull/38425#issuecomment-1294863094 @dongjoon-hyun @itholic Thanks, merged to master.
[GitHub] [spark] Yikun closed pull request #38425: [SPARK-40229][PS][TEST][FOLLOWUP] Add `openpyxl` to `requirements.txt`
Yikun closed pull request #38425: [SPARK-40229][PS][TEST][FOLLOWUP] Add `openpyxl` to `requirements.txt` URL: https://github.com/apache/spark/pull/38425
[GitHub] [spark] juliuszsompolski commented on pull request #38397: [SPARK-40918][SQL] Mismatch between FileSourceScanExec and Orc and ParquetFileFormat on producing columnar output
juliuszsompolski commented on PR #38397: URL: https://github.com/apache/spark/pull/38397#issuecomment-1294826483 Thanks. I updated the title after fixing the Orc side, but forgot to update the description, which still described only Parquet. This bug exists in branch-3.3 as well, btw; if it doesn't merge cleanly I can open a separate PR later.
[GitHub] [spark] shrprasa commented on pull request #37880: [SPARK-39399] [CORE] [K8S]: Fix proxy-user authentication for Spark on k8s in cluster deploy mode
shrprasa commented on PR #37880: URL: https://github.com/apache/spark/pull/37880#issuecomment-1294815181 ping @gaborgsomogyi @dongjoon-hyun @HyukjinKwon @squito
[GitHub] [spark] dongjoon-hyun commented on pull request #38426: [SPARK-40951][PYSPARK][TESTS] `pyspark-connect` tests should be skipped if `pandas` doesn't exist
dongjoon-hyun commented on PR #38426: URL: https://github.com/apache/spark/pull/38426#issuecomment-1294811494 Could you review this, @HyukjinKwon ?
[GitHub] [spark] dongjoon-hyun opened a new pull request, #38426: [SPARK-40951][PYSPARK][TESTS] pyspark-connect tests should be skipped if pandas doesn't exist
dongjoon-hyun opened a new pull request, #38426: URL: https://github.com/apache/spark/pull/38426 ### What changes were proposed in this pull request? ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ### How was this patch tested?
[GitHub] [spark] dongjoon-hyun commented on pull request #38425: [SPARK-40229][PS][TEST][FOLLOWUP] Add `openpyxl` to `requirements.txt`
dongjoon-hyun commented on PR #38425: URL: https://github.com/apache/spark/pull/38425#issuecomment-1294793919 Thank you, @itholic .
[GitHub] [spark] itholic commented on pull request #38425: [SPARK-40229][PS][TEST][FOLLOWUP] Add `openpyxl` to `requirements.txt`
itholic commented on PR #38425: URL: https://github.com/apache/spark/pull/38425#issuecomment-1294763957 LGTM once the tests pass. Thanks!
[GitHub] [spark] mcdull-zhang commented on pull request #36301: [SPARK-21697][SQL] NPE & ExceptionInInitializerError trying to load UDF from HDFS
mcdull-zhang commented on PR #36301: URL: https://github.com/apache/spark/pull/36301#issuecomment-1294759394 > How to reproduce this problem? > > 1. Make sure `commons-logging` is in front of the classpath. >You can use `SPARK_PRINT_LAUNCH_COMMAND` to get the startup command and manually modify the classpath. > 2. Add two hdfs jars with different names. >`ADD JAR hdfs://` > 3. Create a temporary function, the class of udf does not matter whether it exists or not. >`CREATE temporary FUNCTION AS 'xx' ` > 4. An error occurred. Why do you need to add two jars in step 2? I would expect the problem to occur even when adding a single jar.
[GitHub] [spark] juliuszsompolski commented on pull request #38397: [SPARK-40918][SQL] Mismatch between FileSourceScanExec and Orc and ParquetFileFormat on producing columnar output
juliuszsompolski commented on PR #38397: URL: https://github.com/apache/spark/pull/38397#issuecomment-1294754916 > Given that OrcFileFormat has no issue like _metadata columns @dongjoon-hyun I think OrcFileFormat has exactly the same issue as ParquetFileFormat, as @cloud-fan pointed out. When a column like `_metadata.file_path` was requested for OrcFileFormat, it would also count that column in FileSourceScanExec.supportsBatch, but not count it in OrcFileFormat.supportsBatch during buildReaderWithPartitionValues. The code changes I made to OrcFileFormat exactly mirror what I did to ParquetFileFormat. I updated the description to say "`ParquetFileFormat` or `OrcFileFormat`" in various places, but the issue seems exactly the same.
[GitHub] [spark] dongjoon-hyun commented on pull request #38425: [SPARK-40229][PS][TEST][FOLLOWUP] Add `openpyxl` to `requirements.txt`
dongjoon-hyun commented on PR #38425: URL: https://github.com/apache/spark/pull/38425#issuecomment-1294754038 cc @itholic , @HyukjinKwon , @Yikun