[GitHub] [spark] yaooqinn opened a new pull request, #40602: [SPARK-42978][SQL] Derby&PG: RENAME cannot qualify a new-table-Name with a schema-Name

2023-03-30 Thread via GitHub
yaooqinn opened a new pull request, #40602: URL: https://github.com/apache/spark/pull/40602 ### What changes were proposed in this pull request? Fix `rename a table` for Derby and PG, where the schema name is not allowed to qualify the new table name ### Why are
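
A minimal sketch of the restriction this PR works around, assuming a reachable PostgreSQL instance and a hypothetical table `s1.old_tbl`; Derby rejects a schema-qualified new name in the same way:

```scala
import java.sql.DriverManager

// Connection details are placeholders for illustration only.
val conn = DriverManager.getConnection("jdbc:postgresql://localhost:5432/db", "user", "pass")
val stmt = conn.createStatement()
// Rejected by PG/Derby: the new name must not be qualified with a schema.
// stmt.executeUpdate("ALTER TABLE s1.old_tbl RENAME TO s1.new_tbl")
// Accepted: the renamed table stays in the source table's schema.
stmt.executeUpdate("ALTER TABLE s1.old_tbl RENAME TO new_tbl")
conn.close()
```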

[GitHub] [spark] yaooqinn commented on a diff in pull request #40602: [SPARK-42978][SQL] Derby&PG: RENAME cannot qualify a new-table-Name with a schema-Name

2023-03-30 Thread via GitHub
yaooqinn commented on code in PR #40602: URL: https://github.com/apache/spark/pull/40602#discussion_r1152824490 ## core/src/main/resources/error/error-classes.json: ## @@ -129,6 +129,12 @@ ], "sqlState" : "429BB" }, + "CANNOT_RENAME_ACROSS_SCHEMA" : { +"message

[GitHub] [spark] ScrapCodes commented on pull request #40553: [SPARK-39722] [SQL] getString API for Dataset

2023-03-30 Thread via GitHub
ScrapCodes commented on PR #40553: URL: https://github.com/apache/spark/pull/40553#issuecomment-1489811022 Hi @VindhyaG, this might be useful - maybe we can benefit from the use case you have for this. Is it just for logging? Not sure what others think, it might be good to limit the API

[GitHub] [spark] HeartSaVioR commented on a diff in pull request #40561: [SPARK-42931][SS] Introduce dropDuplicatesWithinWatermark

2023-03-30 Thread via GitHub
HeartSaVioR commented on code in PR #40561: URL: https://github.com/apache/spark/pull/40561#discussion_r1152828935 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala: ## @@ -679,6 +679,8 @@ object RemoveNoopUnion extends Rule[LogicalPlan] {

[GitHub] [spark] grundprinzip commented on a diff in pull request #40586: [SPARK-42939][SS][CONNECT] Core streaming Python API for Spark Connect

2023-03-30 Thread via GitHub
grundprinzip commented on code in PR #40586: URL: https://github.com/apache/spark/pull/40586#discussion_r1152826039 ## connector/connect/common/src/main/protobuf/spark/connect/commands.proto: ## @@ -177,3 +179,97 @@ message WriteOperationV2 { // (Optional) A condition for ove

[GitHub] [spark] MaxGekk commented on a diff in pull request #40593: [WIP][SQL] Define typed literal constructors as keywords

2023-03-30 Thread via GitHub
MaxGekk commented on code in PR #40593: URL: https://github.com/apache/spark/pull/40593#discussion_r1152878072 ## sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBaseParser.g4: ## @@ -928,11 +928,19 @@ primaryExpression (FILTER LEFT_PAREN WHERE wher

[GitHub] [spark] yaooqinn commented on pull request #40601: [SPARK-42975][SQL] Cast result type to timestamp type for string +/- interval

2023-03-30 Thread via GitHub
yaooqinn commented on PR #40601: URL: https://github.com/apache/spark/pull/40601#issuecomment-1489866478 This change makes sense to me. Since this is a breaking change, shall we add a migration guide for it?
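
An illustration of the behavior under discussion, assuming an active SparkSession named `spark`; per the PR title, a string operand +/- an interval should resolve to a TIMESTAMP result type:

```scala
// With the change, `ts` should be resolved as TIMESTAMP rather than STRING.
val df = spark.sql("SELECT '2023-03-30 00:00:00' + INTERVAL '1' DAY AS ts")
df.printSchema()
```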

[GitHub] [spark] grundprinzip opened a new pull request, #40603: [MINOR][CONNECT] Adding Proto Debug String to Job Description.

2023-03-30 Thread via GitHub
grundprinzip opened a new pull request, #40603: URL: https://github.com/apache/spark/pull/40603 ### What changes were proposed in this pull request? Instead of just showing the Scala call site, show the abbreviated version of the proto message in the Spark UI. ### Why are the changes

[GitHub] [spark] cloud-fan commented on pull request #40437: [SPARK-41259][SQL] SparkSQLDriver Output schema and result string should be consistent

2023-03-30 Thread via GitHub
cloud-fan commented on PR #40437: URL: https://github.com/apache/spark/pull/40437#issuecomment-1489876875 @yaooqinn this is a good point. If we are sure this is only for CLI display, not thriftserver protocol, I agree we don't need to follow Hive.

[GitHub] [spark] cloud-fan commented on a diff in pull request #40593: [WIP][SQL] Define typed literal constructors as keywords

2023-03-30 Thread via GitHub
cloud-fan commented on code in PR #40593: URL: https://github.com/apache/spark/pull/40593#discussion_r1152908374 ## sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBaseParser.g4: ## @@ -928,11 +928,19 @@ primaryExpression (FILTER LEFT_PAREN WHERE wh

[GitHub] [spark] cloud-fan commented on a diff in pull request #40593: [WIP][SQL] Define typed literal constructors as keywords

2023-03-30 Thread via GitHub
cloud-fan commented on code in PR #40593: URL: https://github.com/apache/spark/pull/40593#discussion_r1152908724 ## sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBaseParser.g4: ## @@ -928,11 +928,19 @@ primaryExpression (FILTER LEFT_PAREN WHERE wh

[GitHub] [spark] cloud-fan commented on a diff in pull request #40593: [WIP][SQL] Define typed literal constructors as keywords

2023-03-30 Thread via GitHub
cloud-fan commented on code in PR #40593: URL: https://github.com/apache/spark/pull/40593#discussion_r1152910161 ## sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBaseParser.g4: ## @@ -928,11 +928,19 @@ primaryExpression (FILTER LEFT_PAREN WHERE wh

[GitHub] [spark] yaooqinn commented on pull request #40437: [SPARK-41259][SQL] SparkSQLDriver Output schema and result string should be consistent

2023-03-30 Thread via GitHub
yaooqinn commented on PR #40437: URL: https://github.com/apache/spark/pull/40437#issuecomment-1489909241 > If we are sure this is only for CLI display, Yes. hiveResultString is only used in spark-sql CLI. The thrift server-side always uses command output schema. Maybe this is the inco

[GitHub] [spark] cloud-fan commented on a diff in pull request #40437: [SPARK-41259][SQL] SparkSQLDriver Output schema and result string should be consistent

2023-03-30 Thread via GitHub
cloud-fan commented on code in PR #40437: URL: https://github.com/apache/spark/pull/40437#discussion_r1152923808 ## sql/core/src/main/scala/org/apache/spark/sql/execution/HiveResult.scala: ## @@ -59,18 +59,6 @@ object HiveResult { formatDescribeTableOutput(executedPlan.

[GitHub] [spark] cloud-fan opened a new pull request, #40604: Revert "[SPARK-41765][SQL] Pull out v1 write metrics to WriteFiles"

2023-03-30 Thread via GitHub
cloud-fan opened a new pull request, #40604: URL: https://github.com/apache/spark/pull/40604 This reverts commit a111a02de1a814c5f335e0bcac4cffb0515557dc. ### What changes were proposed in this pull request? SQLMetrics is not only used in the UI, but is also a programmin

[GitHub] [spark] cloud-fan commented on pull request #40604: Revert "[SPARK-41765][SQL] Pull out v1 write metrics to WriteFiles"

2023-03-30 Thread via GitHub
cloud-fan commented on PR #40604: URL: https://github.com/apache/spark/pull/40604#issuecomment-1489923055 cc @ulysses-you

[GitHub] [spark] cloud-fan commented on pull request #40604: Revert "[SPARK-41765][SQL] Pull out v1 write metrics to WriteFiles"

2023-03-30 Thread via GitHub
cloud-fan commented on PR #40604: URL: https://github.com/apache/spark/pull/40604#issuecomment-1489923733 also cc @xinrong-meng, this is not a blocker but it's better if we can make it into 3.4.0.

[GitHub] [spark] Yikf commented on pull request #40437: [SPARK-41259][SQL] SparkSQLDriver Output schema and result string should be consistent

2023-03-30 Thread via GitHub
Yikf commented on PR #40437: URL: https://github.com/apache/spark/pull/40437#issuecomment-1489933393 Yes. `hiveResultString` is added to ensure compatibility with Hive output. `hiveResultString` is only used by the spark-sql CLI. It is used only as the CLI display. `thriftServe

[GitHub] [spark] LuciferYang commented on a diff in pull request #40598: [SPARK-42974][CORE] Restore `Utils#createTempDir` use `ShutdownHookManager#registerShutdownDeleteDir` to cleanup tempDir

2023-03-30 Thread via GitHub
LuciferYang commented on code in PR #40598: URL: https://github.com/apache/spark/pull/40598#discussion_r1152946827 ## common/network-common/src/main/java/org/apache/spark/network/util/JavaUtils.java: ## @@ -373,18 +373,22 @@ public static byte[] bufferToArray(ByteBuffer buffer)

[GitHub] [spark] yaooqinn commented on a diff in pull request #38732: [SPARK-41210][K8S] Window based executor failure tracking mechanism

2023-03-30 Thread via GitHub
yaooqinn commented on code in PR #38732: URL: https://github.com/apache/spark/pull/38732#discussion_r1152951801 ## resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsAllocator.scala: ## @@ -494,10 +525,46 @@ class ExecutorPodsAllo

[GitHub] [spark] yaooqinn commented on a diff in pull request #38732: [SPARK-41210][K8S] Window based executor failure tracking mechanism

2023-03-30 Thread via GitHub
yaooqinn commented on code in PR #38732: URL: https://github.com/apache/spark/pull/38732#discussion_r1152954378 ## resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsAllocator.scala: ## @@ -520,10 +552,46 @@ class ExecutorPodsAllo

[GitHub] [spark] yaooqinn commented on a diff in pull request #38732: [SPARK-41210][K8S] Window based executor failure tracking mechanism

2023-03-30 Thread via GitHub
yaooqinn commented on code in PR #38732: URL: https://github.com/apache/spark/pull/38732#discussion_r1152957287 ## resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/Config.scala: ## @@ -750,6 +750,26 @@ private[spark] object Config extends Logging {

[GitHub] [spark] pan3793 opened a new pull request, #38732: [SPARK-41210][K8S] Window based executor failure tracking mechanism

2023-03-30 Thread via GitHub
pan3793 opened a new pull request, #38732: URL: https://github.com/apache/spark/pull/38732 ### What changes were proposed in this pull request? Fail the Spark application when the number of executor failures reaches a threshold. ### Why are the changes needed? Sometimes,
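
A configuration sketch only: the exact K8s config keys are what this PR is introducing and were still under review in this thread, so the keys below are placeholders modeled on the long-standing YARN equivalents (`spark.yarn.max.executor.failures` and `spark.yarn.executor.failuresValidityInterval`):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("k8s://https://k8s-apiserver:6443")          // hypothetical cluster endpoint
  .config("spark.executor.maxNumFailures", "10")            // placeholder key
  .config("spark.executor.failuresValidityInterval", "1h")  // placeholder key
  .getOrCreate()
```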

[GitHub] [spark] yaooqinn commented on a diff in pull request #38732: [SPARK-41210][K8S] Window based executor failure tracking mechanism

2023-03-30 Thread via GitHub
yaooqinn commented on code in PR #38732: URL: https://github.com/apache/spark/pull/38732#discussion_r1152960657 ## resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsAllocator.scala: ## @@ -117,6 +120,12 @@ class ExecutorPodsAlloc

[GitHub] [spark] yaooqinn commented on a diff in pull request #38732: [SPARK-41210][K8S] Window based executor failure tracking mechanism

2023-03-30 Thread via GitHub
yaooqinn commented on code in PR #38732: URL: https://github.com/apache/spark/pull/38732#discussion_r1152961738 ## resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsAllocator.scala: ## @@ -148,6 +163,10 @@ class ExecutorPodsAlloc

[GitHub] [spark] LuciferYang opened a new pull request, #40605: [SPARK-42958][CONNECT] Refactor `CheckConnectJvmClientCompatibility` to compare client and avro module

2023-03-30 Thread via GitHub
LuciferYang opened a new pull request, #40605: URL: https://github.com/apache/spark/pull/40605 ### What changes were proposed in this pull request? ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ### How

[GitHub] [spark] cloud-fan commented on pull request #40300: [SPARK-42683] Automatically rename conflicting metadata columns

2023-03-30 Thread via GitHub
cloud-fan commented on PR #40300: URL: https://github.com/apache/spark/pull/40300#issuecomment-1489959075 thanks, merging to master!

[GitHub] [spark] pan3793 commented on a diff in pull request #38732: [SPARK-41210][K8S] Window based executor failure tracking mechanism

2023-03-30 Thread via GitHub
pan3793 commented on code in PR #38732: URL: https://github.com/apache/spark/pull/38732#discussion_r1152964045 ## resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsAllocator.scala: ## @@ -148,6 +163,10 @@ class ExecutorPodsAlloca

[GitHub] [spark] cloud-fan closed pull request #40300: [SPARK-42683] Automatically rename conflicting metadata columns

2023-03-30 Thread via GitHub
cloud-fan closed pull request #40300: [SPARK-42683] Automatically rename conflicting metadata columns URL: https://github.com/apache/spark/pull/40300

[GitHub] [spark] cloud-fan commented on pull request #40437: [SPARK-41259][SQL] SparkSQLDriver Output schema and result string should be consistent

2023-03-30 Thread via GitHub
cloud-fan commented on PR #40437: URL: https://github.com/apache/spark/pull/40437#issuecomment-1489962469 > I'm not sure why spark-sql CLI has to be compatible with Hive output; personally, I don't think it's necessary. Maybe we should display Spark's schema as is, just like thriftServer?

[GitHub] [spark] pan3793 commented on a diff in pull request #38732: [SPARK-41210][K8S] Window based executor failure tracking mechanism

2023-03-30 Thread via GitHub
pan3793 commented on code in PR #38732: URL: https://github.com/apache/spark/pull/38732#discussion_r1152967630 ## resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/Config.scala: ## @@ -750,6 +750,26 @@ private[spark] object Config extends Logging {

[GitHub] [spark] pan3793 commented on a diff in pull request #38732: [SPARK-41210][K8S] Window based executor failure tracking mechanism

2023-03-30 Thread via GitHub
pan3793 commented on code in PR #38732: URL: https://github.com/apache/spark/pull/38732#discussion_r1038973719 ## resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsAllocator.scala: ## @@ -136,6 +151,10 @@ class ExecutorPodsAlloca

[GitHub] [spark] lyy-pineapple commented on pull request #38171: [SPARK-9213] [SQL] Improve regular expression performance (via joni)

2023-03-30 Thread via GitHub
lyy-pineapple commented on PR #38171: URL: https://github.com/apache/spark/pull/38171#issuecomment-1489985307 > `joni` seems to be used in the HBase client only, not in HBase server or HBase common. > > * https://mvnrepository.com/artifact/org.apache.hbase/hbase-client/2.5.3 > >

[GitHub] [spark] lyy-pineapple commented on pull request #38171: [SPARK-9213] [SQL] Improve regular expression performance (via joni)

2023-03-30 Thread via GitHub
lyy-pineapple commented on PR #38171: URL: https://github.com/apache/spark/pull/38171#issuecomment-1489987177 > https://user-images.githubusercontent.com/8748814/204439049-53f0bd4f-9ea0-4289-8268-d16aef5b4334.png > > @lyy-pineapple Would you share the test sql pattern? I test some c

[GitHub] [spark] grundprinzip opened a new pull request, #40606: Debugging is awesome

2023-03-30 Thread via GitHub
grundprinzip opened a new pull request, #40606: URL: https://github.com/apache/spark/pull/40606 ### What changes were proposed in this pull request? ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ### Ho

[GitHub] [spark] LuciferYang commented on a diff in pull request #40605: [SPARK-42958][CONNECT] Refactor `connect-jvm-client-mima-check` to support mima check with avro module

2023-03-30 Thread via GitHub
LuciferYang commented on code in PR #40605: URL: https://github.com/apache/spark/pull/40605#discussion_r1153013745 ## dev/connect-jvm-client-mima-check: ## @@ -34,20 +34,18 @@ fi rm -f .connect-mima-check-result -echo "Build sql module, connect-client-jvm module and connect

[GitHub] [spark] huangxiaopingRD commented on a diff in pull request #40232: [SPARK-42629][DOCS] Update the description of default data source in the document

2023-03-30 Thread via GitHub
huangxiaopingRD commented on code in PR #40232: URL: https://github.com/apache/spark/pull/40232#discussion_r1153014538 ## docs/sql-ref-syntax-ddl-create-table-datasource.md: ## @@ -118,7 +118,7 @@ CREATE TABLE student (id INT, name STRING, age INT) USING CSV; CREATE TABLE stud

[GitHub] [spark] yaooqinn commented on a diff in pull request #38732: [SPARK-41210][K8S] Window based executor failure tracking mechanism

2023-03-30 Thread via GitHub
yaooqinn commented on code in PR #38732: URL: https://github.com/apache/spark/pull/38732#discussion_r1153042664 ## resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/Config.scala: ## @@ -750,6 +750,26 @@ private[spark] object Config extends Logging {

[GitHub] [spark] zhengruifeng opened a new pull request, #40607: [WIP][ML] Make Torch Distributor support Spark Connect

2023-03-30 Thread via GitHub
zhengruifeng opened a new pull request, #40607: URL: https://github.com/apache/spark/pull/40607 ### What changes were proposed in this pull request? ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ### How was this p

[GitHub] [spark] yaooqinn commented on pull request #38732: [SPARK-41210][K8S] Window based executor failure tracking mechanism

2023-03-30 Thread via GitHub
yaooqinn commented on PR #38732: URL: https://github.com/apache/spark/pull/38732#issuecomment-1490050958 Does Kubernetes support other mechanisms to add a timeout during pod/container/app initialization? If not, we shall bring this feature in at the Spark layer. Also cc @Yikun

[GitHub] [spark] WeichenXu123 commented on a diff in pull request #40607: [WIP][ML] Make Torch Distributor support Spark Connect

2023-03-30 Thread via GitHub
WeichenXu123 commented on code in PR #40607: URL: https://github.com/apache/spark/pull/40607#discussion_r1153066202 ## python/pyspark/ml/torch/distributor.py: ## @@ -581,11 +593,11 @@ def _run_distributed_training( f"Started distributed training with {self.num_proce

[GitHub] [spark] WeichenXu123 commented on a diff in pull request #40607: [WIP][ML] Make Torch Distributor support Spark Connect

2023-03-30 Thread via GitHub
WeichenXu123 commented on code in PR #40607: URL: https://github.com/apache/spark/pull/40607#discussion_r1153067103 ## python/pyspark/ml/torch/distributor.py: ## @@ -330,6 +340,7 @@ def __init__( num_processes: int = 1, local_mode: bool = True, use_gpu

[GitHub] [spark] WeichenXu123 commented on a diff in pull request #40607: [WIP][ML] Make Torch Distributor support Spark Connect

2023-03-30 Thread via GitHub
WeichenXu123 commented on code in PR #40607: URL: https://github.com/apache/spark/pull/40607#discussion_r1153067929 ## python/pyspark/ml/torch/distributor.py: ## @@ -144,15 +145,21 @@ def __init__( num_processes: int = 1, local_mode: bool = True, use_g

[GitHub] [spark] zhengruifeng commented on a diff in pull request #40607: [WIP][ML] Make Torch Distributor support Spark Connect

2023-03-30 Thread via GitHub
zhengruifeng commented on code in PR #40607: URL: https://github.com/apache/spark/pull/40607#discussion_r1153069439 ## python/pyspark/ml/torch/distributor.py: ## @@ -330,6 +340,7 @@ def __init__( num_processes: int = 1, local_mode: bool = True, use_gpu

[GitHub] [spark] WeichenXu123 commented on a diff in pull request #40607: [WIP][ML] Make Torch Distributor support Spark Connect

2023-03-30 Thread via GitHub
WeichenXu123 commented on code in PR #40607: URL: https://github.com/apache/spark/pull/40607#discussion_r1153069493 ## python/pyspark/ml/tests/connect/test_parity_torch_distributor.py: ## @@ -0,0 +1,511 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more

[GitHub] [spark] zhengruifeng commented on a diff in pull request #40607: [WIP][ML] Make Torch Distributor support Spark Connect

2023-03-30 Thread via GitHub
zhengruifeng commented on code in PR #40607: URL: https://github.com/apache/spark/pull/40607#discussion_r1153071026 ## python/pyspark/ml/tests/connect/test_parity_torch_distributor.py: ## @@ -0,0 +1,511 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more

[GitHub] [spark] zhengruifeng commented on a diff in pull request #40607: [WIP][ML] Make Torch Distributor support Spark Connect

2023-03-30 Thread via GitHub
zhengruifeng commented on code in PR #40607: URL: https://github.com/apache/spark/pull/40607#discussion_r1153072179 ## python/pyspark/ml/torch/distributor.py: ## @@ -581,11 +593,11 @@ def _run_distributed_training( f"Started distributed training with {self.num_proce

[GitHub] [spark] infoankitp commented on a diff in pull request #40563: [SPARK-41232][SPARK-41233][FOLLOWUP] Refactor `array_append` and `array_prepend` with `RuntimeReplaceable`

2023-03-30 Thread via GitHub
infoankitp commented on code in PR #40563: URL: https://github.com/apache/spark/pull/40563#discussion_r1153083574 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala: ## @@ -5056,128 +4950,45 @@ case class ArrayCompact(child: Expre

[GitHub] [spark] infoankitp commented on a diff in pull request #40563: [SPARK-41232][SPARK-41233][FOLLOWUP] Refactor `array_append` and `array_prepend` with `RuntimeReplaceable`

2023-03-30 Thread via GitHub
infoankitp commented on code in PR #40563: URL: https://github.com/apache/spark/pull/40563#discussion_r1153083910 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala: ## @@ -1400,120 +1400,24 @@ case class ArrayContains(left: Expre

[GitHub] [spark] HeartSaVioR commented on a diff in pull request #40561: [SPARK-42931][SS] Introduce dropDuplicatesWithinWatermark

2023-03-30 Thread via GitHub
HeartSaVioR commented on code in PR #40561: URL: https://github.com/apache/spark/pull/40561#discussion_r1153114879 ## sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/statefulOperators.scala: ## @@ -980,3 +1022,65 @@ object StreamingDeduplicateExec { private v
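
A usage sketch for the API under review in this thread (SPARK-42931); semantics were still being discussed at the time, and a streaming Dataset `events` with an event-time column is assumed:

```scala
// Deduplicate events whose event times fall within the watermark delay,
// requiring a watermark to bound the state kept by the operator.
val deduped = events
  .withWatermark("eventTime", "10 minutes")
  .dropDuplicatesWithinWatermark("eventId")
```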

[GitHub] [spark] HeartSaVioR commented on a diff in pull request #40561: [SPARK-42931][SS] Introduce dropDuplicatesWithinWatermark

2023-03-30 Thread via GitHub
HeartSaVioR commented on code in PR #40561: URL: https://github.com/apache/spark/pull/40561#discussion_r1153116775 ## sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/statefulOperators.scala: ## @@ -980,3 +1022,65 @@ object StreamingDeduplicateExec { private v

[GitHub] [spark] cloud-fan commented on pull request #40604: Revert "[SPARK-41765][SQL] Pull out v1 write metrics to WriteFiles"

2023-03-30 Thread via GitHub
cloud-fan commented on PR #40604: URL: https://github.com/apache/spark/pull/40604#issuecomment-1490156737 thanks for review, merging to master/3.4!

[GitHub] [spark] cloud-fan closed pull request #40604: Revert "[SPARK-41765][SQL] Pull out v1 write metrics to WriteFiles"

2023-03-30 Thread via GitHub
cloud-fan closed pull request #40604: Revert "[SPARK-41765][SQL] Pull out v1 write metrics to WriteFiles" URL: https://github.com/apache/spark/pull/40604

[GitHub] [spark] VindhyaG commented on a diff in pull request #40553: [SPARK-39722] [SQL] getString API for Dataset

2023-03-30 Thread via GitHub
VindhyaG commented on code in PR #40553: URL: https://github.com/apache/spark/pull/40553#discussion_r1153193529 ## connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/Dataset.scala: ## @@ -535,6 +535,159 @@ class Dataset[T] private[sql] ( } } + /** + *

[GitHub] [spark] VindhyaG commented on pull request #40553: [SPARK-39722] [SQL] getString API for Dataset

2023-03-30 Thread via GitHub
VindhyaG commented on PR #40553: URL: https://github.com/apache/spark/pull/40553#issuecomment-1490227613 > Hi @VindhyaG, this might be useful - maybe we can benefit from the use case you have for this. Is it just for logging? Not sure what others think, it might be good to limit the API sur

[GitHub] [spark] martin-kokos closed pull request #39941: [MINOR][DOCS] Add link to Hadoop docs

2023-03-30 Thread via GitHub
martin-kokos closed pull request #39941: [MINOR][DOCS] Add link to Hadoop docs URL: https://github.com/apache/spark/pull/39941

[GitHub] [spark] martin-kokos commented on pull request #39941: [MINOR][DOCS] Add link to Hadoop docs

2023-03-30 Thread via GitHub
martin-kokos commented on PR #39941: URL: https://github.com/apache/spark/pull/39941#issuecomment-1490231287 Fixed by https://github.com/apache/spark/commit/c9c3880e3ad6f57a359f1de05b7e772c06660d0b

[GitHub] [spark] HeartSaVioR commented on pull request #40600: [SPARK-42968][SS] Add option to skip commit coordinator as part of StreamingWrite API for DSv2 sources/sinks

2023-03-30 Thread via GitHub
HeartSaVioR commented on PR #40600: URL: https://github.com/apache/spark/pull/40600#issuecomment-1490244143 Thanks! Merging to master.

[GitHub] [spark] HeartSaVioR closed pull request #40600: [SPARK-42968][SS] Add option to skip commit coordinator as part of StreamingWrite API for DSv2 sources/sinks

2023-03-30 Thread via GitHub
HeartSaVioR closed pull request #40600: [SPARK-42968][SS] Add option to skip commit coordinator as part of StreamingWrite API for DSv2 sources/sinks URL: https://github.com/apache/spark/pull/40600

[GitHub] [spark] jaceklaskowski commented on a diff in pull request #40567: [SPARK-42935] [SQL] Add union required distribution push down

2023-03-30 Thread via GitHub
jaceklaskowski commented on code in PR #40567: URL: https://github.com/apache/spark/pull/40567#discussion_r1153247899 ## sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala: ## @@ -4195,6 +4195,15 @@ object SQLConf { .booleanConf .createWithDefa

[GitHub] [spark] MaxGekk commented on a diff in pull request #40126: [SPARK-40822][SQL] Stable derived column aliases

2023-03-30 Thread via GitHub
MaxGekk commented on code in PR #40126: URL: https://github.com/apache/spark/pull/40126#discussion_r1153316438 ## sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/ResolveAliasesSuite.scala: ## @@ -88,4 +94,46 @@ class ResolveAliasesSuite extends AnalysisTest {

[GitHub] [spark] juanvisoler opened a new pull request, #40608: SPARK-35198

2023-03-30 Thread via GitHub
juanvisoler opened a new pull request, #40608: URL: https://github.com/apache/spark/pull/40608 Add support for calling debugCodegen from Python & Java ### What changes were proposed in this pull request? ### Why are the changes needed? ### Does thi
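
For context, the existing Scala-side hook that this PR proposes to expose to Python and Java, assuming an active SparkSession named `spark`:

```scala
import org.apache.spark.sql.execution.debug._

val df = spark.range(10).selectExpr("id * 2 AS doubled")
df.debugCodegen()  // prints the generated code for each whole-stage-codegen subtree
```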

[GitHub] [spark] dongjoon-hyun commented on pull request #40604: Revert "[SPARK-41765][SQL] Pull out v1 write metrics to WriteFiles"

2023-03-30 Thread via GitHub
dongjoon-hyun commented on PR #40604: URL: https://github.com/apache/spark/pull/40604#issuecomment-1490405810 +1 for the reverting decision. Thank you, @cloud-fan and all.

[GitHub] [spark] VindhyaG commented on a diff in pull request #40553: [SPARK-39722] [SQL] getString API for Dataset

2023-03-30 Thread via GitHub
VindhyaG commented on code in PR #40553: URL: https://github.com/apache/spark/pull/40553#discussion_r1151950076 ## sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala: ## @@ -883,6 +883,129 @@ class Dataset[T] private[sql]( println(showString(numRows, truncate, verti

[GitHub] [spark] MaxGekk commented on pull request #40593: [SPARK-42979][SQL] Define literal constructors as keywords

2023-03-30 Thread via GitHub
MaxGekk commented on PR #40593: URL: https://github.com/apache/spark/pull/40593#issuecomment-1490430902 Merging to master. Thank you, @cloud-fan for review.

[GitHub] [spark] MaxGekk closed pull request #40593: [SPARK-42979][SQL] Define literal constructors as keywords

2023-03-30 Thread via GitHub
MaxGekk closed pull request #40593: [SPARK-42979][SQL] Define literal constructors as keywords URL: https://github.com/apache/spark/pull/40593

[GitHub] [spark] yabola commented on a diff in pull request #39950: [SPARK-42388][SQL] Avoid parquet footer reads twice when no filters in vectorized reader

2023-03-30 Thread via GitHub
yabola commented on code in PR #39950: URL: https://github.com/apache/spark/pull/39950#discussion_r1153375489 ## sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/ParquetFooterReader.java: ## @@ -17,23 +17,53 @@ package org.apache.spark.sql.execution.d
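
A sketch of the idea under review, using Parquet's public Java API rather than Spark's internal `ParquetFooterReader` (the path is hypothetical): open the file once, keep the footer, and reuse it downstream instead of re-reading it for row-group filtering and again for the record reader:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.hadoop.ParquetFileReader
import org.apache.parquet.hadoop.util.HadoopInputFile

val conf = new Configuration()
val input = HadoopInputFile.fromPath(new Path("/tmp/data.parquet"), conf)
val reader = ParquetFileReader.open(input)
val footer = reader.getFooter                  // read the footer once...
val schema = footer.getFileMetaData.getSchema  // ...reuse it for the schema
val rowGroups = footer.getBlocks               // ...and for row-group pruning
reader.close()
```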

[GitHub] [spark] yabola commented on a diff in pull request #39950: [SPARK-42388][SQL] Avoid parquet footer reads twice when no filters in vectorized reader

2023-03-30 Thread via GitHub
yabola commented on code in PR #39950: URL: https://github.com/apache/spark/pull/39950#discussion_r1153376111 ## sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/ParquetFooterReader.java: ## @@ -17,23 +17,53 @@ package org.apache.spark.sql.execution.d

[GitHub] [spark] yabola commented on a diff in pull request #39950: [SPARK-42388][SQL] Avoid parquet footer reads twice when no filters in vectorized reader

2023-03-30 Thread via GitHub
yabola commented on code in PR #39950: URL: https://github.com/apache/spark/pull/39950#discussion_r1153376539 ## sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/ParquetFooterReader.java: ## @@ -17,23 +17,53 @@ package org.apache.spark.sql.execution.d

[GitHub] [spark] yabola commented on a diff in pull request #39950: [SPARK-42388][SQL] Avoid parquet footer reads twice when no filters in vectorized reader

2023-03-30 Thread via GitHub
yabola commented on code in PR #39950: URL: https://github.com/apache/spark/pull/39950#discussion_r1153377375 ## sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/ParquetFooterReader.java: ## @@ -17,23 +17,53 @@ package org.apache.spark.sql.execution.d

[GitHub] [spark] ScrapCodes commented on pull request #40553: [SPARK-39722] [SQL] getString API for Dataset

2023-03-30 Thread via GitHub
ScrapCodes commented on PR #40553: URL: https://github.com/apache/spark/pull/40553#issuecomment-1490453383 I see this as a developer-facing API, so just having

```
def getString(numRows: Int, truncate: Int): String =
  getString(numRows, truncate, vertical = false)
```

would
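
A hypothetical call site for the API proposed in this PR: capture the tabular preview as a string for logging instead of printing it with `show()`. `df` and `logger` are assumed to exist:

```scala
val preview: String = df.getString(numRows = 20, truncate = 20)
logger.info(s"Result preview:\n$preview")
```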

[GitHub] [spark] ScrapCodes commented on pull request #40553: [SPARK-39722] [SQL] getString API for Dataset

2023-03-30 Thread via GitHub
ScrapCodes commented on PR #40553: URL: https://github.com/apache/spark/pull/40553#issuecomment-1490457386 Do you think a more interesting way could be returning a JSON representation?

[GitHub] [spark] juanvisoler commented on pull request #40608: [SPARK-35198][CORE][PYTHON][SQL] Add support for calling debugCodegen from Python & Java

2023-03-30 Thread via GitHub
juanvisoler commented on PR #40608: URL: https://github.com/apache/spark/pull/40608#issuecomment-1490470454 @holdenk @MaxGekk

[GitHub] [spark] Hisoka-X opened a new pull request, #40609: [SPARK-42316][SQL] Assign name to _LEGACY_ERROR_TEMP_2044

2023-03-30 Thread via GitHub
Hisoka-X opened a new pull request, #40609: URL: https://github.com/apache/spark/pull/40609 ### What changes were proposed in this pull request? This PR proposes to assign name to _LEGACY_ERROR_TEMP_2044, "BINARY_ARITHMETIC_CAUSE_OVERFLOW". ### Why are the changes n
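
A hedged repro sketch, assuming an active SparkSession named `spark`: under ANSI mode, overflowing binary arithmetic raises an overflow error; whether a given operator maps to this exact renamed error class is defined by the PR itself:

```scala
spark.conf.set("spark.sql.ansi.enabled", "true")
spark.sql("SELECT 9223372036854775807 + 1").collect()  // overflows Long.MaxValue
```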

[GitHub] [spark] VindhyaG commented on a diff in pull request #40553: [SPARK-39722] [SQL] getString API for Dataset

2023-03-30 Thread via GitHub
VindhyaG commented on code in PR #40553: URL: https://github.com/apache/spark/pull/40553#discussion_r1153437032 ## connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/Dataset.scala: ## @@ -535,6 +535,159 @@ class Dataset[T] private[sql] ( } } + /** + *

[GitHub] [spark] ivoson opened a new pull request, #40610: [SPARK-42626][CONNECT] Add Destructive Iterator for SparkResult

2023-03-30 Thread via GitHub
ivoson opened a new pull request, #40610: URL: https://github.com/apache/spark/pull/40610 ### What changes were proposed in this pull request? Add a destructive iterator to SparkResult and change `Dataset.toLocalIterator` to use the destructive iterator. With the destructive iterator
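
A minimal sketch of the concept (not the PR's actual implementation): an iterator that drops its reference to each buffered element as soon as it is handed out, so fully-consumed results can be garbage-collected incrementally:

```scala
import scala.collection.mutable.ArrayBuffer

class DestructiveIterator[T >: Null <: AnyRef](buffer: ArrayBuffer[T]) extends Iterator[T] {
  private var i = 0
  override def hasNext: Boolean = i < buffer.length
  override def next(): T = {
    val elem = buffer(i)
    buffer(i) = null  // release the reference immediately after consumption
    i += 1
    elem
  }
}
```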

[GitHub] [spark] yabola commented on a diff in pull request #39950: [SPARK-42388][SQL] Avoid parquet footer reads twice when no filters in vectorized reader

2023-03-30 Thread via GitHub
yabola commented on code in PR #39950: URL: https://github.com/apache/spark/pull/39950#discussion_r1153492439 ## sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java: ## @@ -89,17 +90,28 @@ @Override public void ini

[GitHub] [spark] rangadi commented on a diff in pull request #40561: [SPARK-42931][SS] Introduce dropDuplicatesWithinWatermark

2023-03-30 Thread via GitHub
rangadi commented on code in PR #40561: URL: https://github.com/apache/spark/pull/40561#discussion_r1153505179 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala: ## @@ -679,6 +679,8 @@ object RemoveNoopUnion extends Rule[LogicalPlan] {

[GitHub] [spark] dongjoon-hyun commented on pull request #40587: [SPARK-42957][INFRA][FOLLOWUP] Use 'cyclonedx' instead of file extensions

2023-03-30 Thread via GitHub
dongjoon-hyun commented on PR #40587: URL: https://github.com/apache/spark/pull/40587#issuecomment-1490660361 I verified that Apache Spark 3.4.0 RC5 successfully has SBOM artifacts. - https://repository.apache.org/content/repositories/orgapachespark-1439/org/apache/spark/spark-core_2.12/3

[GitHub] [spark] arturobernalg commented on pull request #40608: [SPARK-35198][CORE][PYTHON][SQL] Add support for calling debugCodegen from Python & Java

2023-03-30 Thread via GitHub
arturobernalg commented on PR #40608: URL: https://github.com/apache/spark/pull/40608#issuecomment-1490659794 LGTM +1

[GitHub] [spark] hvanhovell opened a new pull request, #40611: [SPARK-42981][CONNECT] Add direct arrow serialization

2023-03-30 Thread via GitHub
hvanhovell opened a new pull request, #40611: URL: https://github.com/apache/spark/pull/40611 ### What changes were proposed in this pull request? This PR adds direct serialization from user domain objects to Arrow batches. This removes the need to go through Catalyst. ### Why are
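
A conceptual sketch of writing values straight into an Arrow vector, the kind of direct path the PR takes for user objects; this uses Arrow's public Java API, while the PR's own encoder plumbing is considerably more involved:

```scala
import org.apache.arrow.memory.RootAllocator
import org.apache.arrow.vector.BigIntVector

val allocator = new RootAllocator(Long.MaxValue)
val vector = new BigIntVector("value", allocator)
vector.allocateNew(3)
Seq(1L, 2L, 3L).zipWithIndex.foreach { case (v, i) => vector.setSafe(i, v) }
vector.setValueCount(3)
// ...hand the populated vector to an Arrow batch writer here...
vector.close()
allocator.close()
```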

[GitHub] [spark] hvanhovell commented on a diff in pull request #40611: [SPARK-42981][CONNECT] Add direct arrow serialization

2023-03-30 Thread via GitHub
hvanhovell commented on code in PR #40611: URL: https://github.com/apache/spark/pull/40611#discussion_r1153602634 ## connector/connect/client/jvm/pom.xml: ## @@ -120,6 +120,19 @@ + Review Comment: Needed for a couple of classes used during tests.

[GitHub] [spark] amaliujia commented on a diff in pull request #40611: [SPARK-42981][CONNECT] Add direct arrow serialization

2023-03-30 Thread via GitHub
amaliujia commented on code in PR #40611: URL: https://github.com/apache/spark/pull/40611#discussion_r1153614619 ## connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/connect/client/arrow/ArrowSerializer.scala: ## @@ -0,0 +1,529 @@ +/* + * Licensed to the Apache So

[GitHub] [spark] amaliujia commented on a diff in pull request #40581: [SPARK-42953][Connect] Typed filter, map, flatMap, mapPartitions

2023-03-30 Thread via GitHub
amaliujia commented on code in PR #40581: URL: https://github.com/apache/spark/pull/40581#discussion_r1152606377 ## connector/connect/server/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala: ## @@ -482,27 +482,66 @@ class SparkConnectPlanner(val sess

[GitHub] [spark] MaxGekk commented on a diff in pull request #40609: [SPARK-42316][SQL] Assign name to _LEGACY_ERROR_TEMP_2044

2023-03-30 Thread via GitHub
MaxGekk commented on code in PR #40609: URL: https://github.com/apache/spark/pull/40609#discussion_r1153634383 ## sql/core/src/test/scala/org/apache/spark/sql/errors/QueryExecutionErrorsSuite.scala: ## @@ -625,6 +625,20 @@ class QueryExecutionErrorsSuite } } + test("B

[GitHub] [spark] viirya commented on pull request #40587: [SPARK-42957][INFRA][FOLLOWUP] Use 'cyclonedx' instead of file extensions

2023-03-30 Thread via GitHub
viirya commented on PR #40587: URL: https://github.com/apache/spark/pull/40587#issuecomment-1490740514 Cool. Thanks @dongjoon-hyun

[GitHub] [spark] MaxGekk commented on a diff in pull request #40609: [SPARK-42316][SQL] Assign name to _LEGACY_ERROR_TEMP_2044

2023-03-30 Thread via GitHub
MaxGekk commented on code in PR #40609: URL: https://github.com/apache/spark/pull/40609#discussion_r1153638873 ## sql/core/src/test/scala/org/apache/spark/sql/errors/QueryExecutionErrorsSuite.scala: ## @@ -625,6 +625,20 @@ class QueryExecutionErrorsSuite } } + test("B

[GitHub] [spark] amaliujia commented on a diff in pull request #40611: [SPARK-42981][CONNECT] Add direct arrow serialization

2023-03-30 Thread via GitHub
amaliujia commented on code in PR #40611: URL: https://github.com/apache/spark/pull/40611#discussion_r1153615770 ## connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/connect/client/arrow/ArrowSerializer.scala: ## @@ -0,0 +1,529 @@ +/* + * Licensed to the Apache So

[GitHub] [spark] rangadi commented on a diff in pull request #40561: [SPARK-42931][SS] Introduce dropDuplicatesWithinWatermark

2023-03-30 Thread via GitHub
rangadi commented on code in PR #40561: URL: https://github.com/apache/spark/pull/40561#discussion_r1153675952 ## sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala: ## @@ -1742,6 +1742,8 @@ class DataFrameSuite extends QueryTest Seq(Row(2, 1, 2), Row(1, 2,

[GitHub] [spark] VindhyaG commented on pull request #40553: [SPARK-39722] [SQL] getString API for Dataset

2023-03-30 Thread via GitHub
VindhyaG commented on PR #40553: URL: https://github.com/apache/spark/pull/40553#issuecomment-1490821588

> I see this as a developer-facing API, so just having
>
> ```
> def getString(numRows: Int, truncate: Int): String =
>   getString(numRows, truncate, vertical = false)
> ```
>

[GitHub] [spark] VindhyaG commented on pull request #40553: [SPARK-39722] [SQL] getString API for Dataset

2023-03-30 Thread via GitHub
VindhyaG commented on PR #40553: URL: https://github.com/apache/spark/pull/40553#issuecomment-1490823634 > Do you think a more interesting way could be returning a JSON representation? For a REST API, yes, JSON would make more sense, but for logging I suppose a string in tabular form is more useful

[GitHub] [spark] shrprasa commented on pull request #40363: [SPARK_42744] delete uploaded file when job finish for k8s

2023-03-30 Thread via GitHub
shrprasa commented on PR #40363: URL: https://github.com/apache/spark/pull/40363#issuecomment-1490831779 @thousandhu @dongjoon-hyun @holdenk The approach in this PR only handles the cleanup on the driver side. It won't clean up the files if files were uploaded during job submission but then
