[GitHub] [spark] HyukjinKwon commented on a diff in pull request #40608: [SPARK-35198][CORE][PYTHON][SQL] Add support for calling debugCodegen from Python & Java

2023-03-30 Thread via GitHub
HyukjinKwon commented on code in PR #40608: URL: https://github.com/apache/spark/pull/40608#discussion_r1153890797 ## python/pyspark/sql/dataframe.py: ## @@ -706,6 +706,25 @@ def explain( assert self._sc._jvm is not None

[GitHub] [spark] HeartSaVioR commented on a diff in pull request #40561: [SPARK-42931][SS] Introduce dropDuplicatesWithinWatermark

2023-03-30 Thread via GitHub
HeartSaVioR commented on code in PR #40561: URL: https://github.com/apache/spark/pull/40561#discussion_r1153889706 ## sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala: ## @@ -1742,6 +1742,8 @@ class DataFrameSuite extends QueryTest Seq(Row(2, 1, 2),

[GitHub] [spark] HyukjinKwon closed pull request #40595: [SPARK-42970][CONNECT][PYTHON][TESTS][3.4] Reuse pyspark.sql.tests.test_arrow test cases

2023-03-30 Thread via GitHub
HyukjinKwon closed pull request #40595: [SPARK-42970][CONNECT][PYTHON][TESTS][3.4] Reuse pyspark.sql.tests.test_arrow test cases URL: https://github.com/apache/spark/pull/40595 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

[GitHub] [spark] HyukjinKwon closed pull request #40612: [SPARK-42969][CONNECT][TESTS] Fix the comparison the result with Arrow optimization enabled/disabled

2023-03-30 Thread via GitHub
HyukjinKwon closed pull request #40612: [SPARK-42969][CONNECT][TESTS] Fix the comparison the result with Arrow optimization enabled/disabled URL: https://github.com/apache/spark/pull/40612

[GitHub] [spark] HyukjinKwon commented on pull request #40595: [SPARK-42970][CONNECT][PYTHON][TESTS][3.4] Reuse pyspark.sql.tests.test_arrow test cases

2023-03-30 Thread via GitHub
HyukjinKwon commented on PR #40595: URL: https://github.com/apache/spark/pull/40595#issuecomment-1491102080 Merged to branch-3.4.

[GitHub] [spark] HyukjinKwon commented on pull request #40612: [SPARK-42969][CONNECT][TESTS] Fix the comparison the result with Arrow optimization enabled/disabled

2023-03-30 Thread via GitHub
HyukjinKwon commented on PR #40612: URL: https://github.com/apache/spark/pull/40612#issuecomment-1491101671 Merged to master and branch-3.4.

[GitHub] [spark] gengliangwang closed pull request #40592: [SPARK-42967][CORE][3.2][3.3][3.4] Fix SparkListenerTaskStart.stageAttemptId when a task is started after the stage is cancelled

2023-03-30 Thread via GitHub
gengliangwang closed pull request #40592: [SPARK-42967][CORE][3.2][3.3][3.4] Fix SparkListenerTaskStart.stageAttemptId when a task is started after the stage is cancelled URL: https://github.com/apache/spark/pull/40592

[GitHub] [spark] gengliangwang commented on pull request #40592: [SPARK-42967][CORE][3.2][3.3][3.4] Fix SparkListenerTaskStart.stageAttemptId when a task is started after the stage is cancelled

2023-03-30 Thread via GitHub
gengliangwang commented on PR #40592: URL: https://github.com/apache/spark/pull/40592#issuecomment-1491064274 Merging to master/3.4/3.3/3.2

[GitHub] [spark] rangadi commented on a diff in pull request #40586: [SPARK-42939][SS][CONNECT] Core streaming Python API for Spark Connect

2023-03-30 Thread via GitHub
rangadi commented on code in PR #40586: URL: https://github.com/apache/spark/pull/40586#discussion_r1153813344 ## connector/connect/server/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala: ## @@ -1969,6 +2014,136 @@ class SparkConnectPlanner(val

[GitHub] [spark] rangadi commented on a diff in pull request #40586: [SPARK-42939][SS][CONNECT] Core streaming Python API for Spark Connect

2023-03-30 Thread via GitHub
rangadi commented on code in PR #40586: URL: https://github.com/apache/spark/pull/40586#discussion_r1153629770 ## connector/connect/common/src/main/protobuf/spark/connect/commands.proto: ## @@ -177,3 +179,97 @@ message WriteOperationV2 { // (Optional) A condition for

[GitHub] [spark] ueshin opened a new pull request, #40612: [SPARK-42969][CONNECT][TESTS] Fix the comparison the result with Arrow optimization enabled/disabled

2023-03-30 Thread via GitHub
ueshin opened a new pull request, #40612: URL: https://github.com/apache/spark/pull/40612 ### What changes were proposed in this pull request? Fixes the comparison of the result with Arrow optimization enabled/disabled. ### Why are the changes needed? in `test_arrow`, there

[GitHub] [spark] srowen commented on a diff in pull request #36529: [SPARK-39102][CORE][SQL][DSTREAM] Add checkstyle rules to disabled use of Guava's `Files.createTempDir()`

2023-03-30 Thread via GitHub
srowen commented on code in PR #36529: URL: https://github.com/apache/spark/pull/36529#discussion_r1153736049 ## common/network-common/src/main/java/org/apache/spark/network/util/JavaUtils.java: ## @@ -362,6 +364,60 @@ public static byte[] bufferToArray(ByteBuffer buffer) {

[GitHub] [spark] shrprasa commented on pull request #40363: [SPARK_42744] delete uploaded file when job finish for k8s

2023-03-30 Thread via GitHub
shrprasa commented on PR #40363: URL: https://github.com/apache/spark/pull/40363#issuecomment-1490831779 @thousandhu @dongjoon-hyun @holdenk The approach in this PR only handles the cleanup on driver side. It won't clean up the files if files were uploaded during job submission but

[GitHub] [spark] VindhyaG commented on pull request #40553: [SPARK-39722] [SQL] getString API for Dataset

2023-03-30 Thread via GitHub
VindhyaG commented on PR #40553: URL: https://github.com/apache/spark/pull/40553#issuecomment-1490823634 > Do you think, a more interesting way can be returning a JSON representation? For rest api yes JSON would make more sense but for logging i suppose string in tabular form is more

[GitHub] [spark] VindhyaG commented on pull request #40553: [SPARK-39722] [SQL] getString API for Dataset

2023-03-30 Thread via GitHub
VindhyaG commented on PR #40553: URL: https://github.com/apache/spark/pull/40553#issuecomment-1490821588 > I see this as developer facing API, So just having > > ``` > def getString(numRows: Int, truncate: Int): String = > getString(numRows, truncate, vertical = false) >

[GitHub] [spark] rangadi commented on a diff in pull request #40561: [SPARK-42931][SS] Introduce dropDuplicatesWithinWatermark

2023-03-30 Thread via GitHub
rangadi commented on code in PR #40561: URL: https://github.com/apache/spark/pull/40561#discussion_r1153675952 ## sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala: ## @@ -1742,6 +1742,8 @@ class DataFrameSuite extends QueryTest Seq(Row(2, 1, 2), Row(1,

[GitHub] [spark] amaliujia commented on a diff in pull request #40611: [SPARK-42981][CONNECT] Add direct arrow serialization

2023-03-30 Thread via GitHub
amaliujia commented on code in PR #40611: URL: https://github.com/apache/spark/pull/40611#discussion_r1153615770 ## connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/connect/client/arrow/ArrowSerializer.scala: ## @@ -0,0 +1,529 @@ +/* + * Licensed to the Apache

[GitHub] [spark] MaxGekk commented on a diff in pull request #40609: [SPARK-42316][SQL] Assign name to _LEGACY_ERROR_TEMP_2044

2023-03-30 Thread via GitHub
MaxGekk commented on code in PR #40609: URL: https://github.com/apache/spark/pull/40609#discussion_r1153638873 ## sql/core/src/test/scala/org/apache/spark/sql/errors/QueryExecutionErrorsSuite.scala: ## @@ -625,6 +625,20 @@ class QueryExecutionErrorsSuite } } +

[GitHub] [spark] viirya commented on pull request #40587: [SPARK-42957][INFRA][FOLLOWUP] Use 'cyclonedx' instead of file extensions

2023-03-30 Thread via GitHub
viirya commented on PR #40587: URL: https://github.com/apache/spark/pull/40587#issuecomment-1490740514 Cool. Thanks @dongjoon-hyun

[GitHub] [spark] MaxGekk commented on a diff in pull request #40609: [SPARK-42316][SQL] Assign name to _LEGACY_ERROR_TEMP_2044

2023-03-30 Thread via GitHub
MaxGekk commented on code in PR #40609: URL: https://github.com/apache/spark/pull/40609#discussion_r1153634383 ## sql/core/src/test/scala/org/apache/spark/sql/errors/QueryExecutionErrorsSuite.scala: ## @@ -625,6 +625,20 @@ class QueryExecutionErrorsSuite } } +

[GitHub] [spark] amaliujia commented on a diff in pull request #40581: [SPARK-42953][Connect] Typed filter, map, flatMap, mapPartitions

2023-03-30 Thread via GitHub
amaliujia commented on code in PR #40581: URL: https://github.com/apache/spark/pull/40581#discussion_r1152606377 ## connector/connect/server/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala: ## @@ -482,27 +482,66 @@ class SparkConnectPlanner(val

[GitHub] [spark] amaliujia commented on a diff in pull request #40611: [SPARK-42981][CONNECT] Add direct arrow serialization

2023-03-30 Thread via GitHub
amaliujia commented on code in PR #40611: URL: https://github.com/apache/spark/pull/40611#discussion_r1153614619 ## connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/connect/client/arrow/ArrowSerializer.scala: ## @@ -0,0 +1,529 @@ +/* + * Licensed to the Apache

[GitHub] [spark] hvanhovell commented on a diff in pull request #40611: [SPARK-42981][CONNECT] Add direct arrow serialization

2023-03-30 Thread via GitHub
hvanhovell commented on code in PR #40611: URL: https://github.com/apache/spark/pull/40611#discussion_r1153602634 ## connector/connect/client/jvm/pom.xml: ## @@ -120,6 +120,19 @@ + Review Comment: Needed for a couple of classes used during tests.

[GitHub] [spark] hvanhovell opened a new pull request, #40611: [SPARK-42981][CONNECT] Add direct arrow serialization

2023-03-30 Thread via GitHub
hvanhovell opened a new pull request, #40611: URL: https://github.com/apache/spark/pull/40611 ### What changes were proposed in this pull request? This PR adds direct serialization from user domain objects to arrow batches. This removes the need to go through catalyst. ### Why are

[GitHub] [spark] arturobernalg commented on pull request #40608: [SPARK-35198][CORE][PYTHON][SQL] Add support for calling debugCodegen from Python & Java

2023-03-30 Thread via GitHub
arturobernalg commented on PR #40608: URL: https://github.com/apache/spark/pull/40608#issuecomment-1490659794 LGTM +1

[GitHub] [spark] dongjoon-hyun commented on pull request #40587: [SPARK-42957][INFRA][FOLLOWUP] Use 'cyclonedx' instead of file extensions

2023-03-30 Thread via GitHub
dongjoon-hyun commented on PR #40587: URL: https://github.com/apache/spark/pull/40587#issuecomment-1490660361 I verified that Apache Spark 3.4.0 RC5 successfully has SBOM artifacts.

[GitHub] [spark] rangadi commented on a diff in pull request #40561: [SPARK-42931][SS] Introduce dropDuplicatesWithinWatermark

2023-03-30 Thread via GitHub
rangadi commented on code in PR #40561: URL: https://github.com/apache/spark/pull/40561#discussion_r1153505179 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala: ## @@ -679,6 +679,8 @@ object RemoveNoopUnion extends Rule[LogicalPlan] {

[GitHub] [spark] yabola commented on a diff in pull request #39950: [SPARK-42388][SQL] Avoid parquet footer reads twice when no filters in vectorized reader

2023-03-30 Thread via GitHub
yabola commented on code in PR #39950: URL: https://github.com/apache/spark/pull/39950#discussion_r1153492439 ## sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java: ## @@ -89,17 +90,28 @@ @Override public void

[GitHub] [spark] ivoson opened a new pull request, #40610: [SPARK-42626][CONNECT] Add Destructive Iterator for SparkResult

2023-03-30 Thread via GitHub
ivoson opened a new pull request, #40610: URL: https://github.com/apache/spark/pull/40610 ### What changes were proposed in this pull request? Add a destructive iterator to SparkResult and change `Dataset.toLocalIterator` to use the destructive iterator. With the destructive
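The idea behind a destructive iterator, as described in the PR summary above, is that each result batch is released as soon as it has been handed to the consumer, so already-consumed data can be garbage-collected. A minimal standalone Python sketch of the concept (illustrative only; the class and attribute names are hypothetical, not Spark's actual `SparkResult` implementation):

```python
class DestructiveIterator:
    """Sketch of a destructive iterator: drops its own reference to each
    batch as soon as the batch is yielded, so consumed batches become
    eligible for garbage collection while iteration continues."""

    def __init__(self, batches):
        self._batches = list(batches)
        self._pos = 0

    def __iter__(self):
        return self

    def __next__(self):
        if self._pos >= len(self._batches):
            raise StopIteration
        batch = self._batches[self._pos]
        self._batches[self._pos] = None  # release the reference immediately
        self._pos += 1
        return batch
```

The trade-off is that the result can only be traversed once, which is why such an iterator suits `toLocalIterator`-style, consume-once access patterns.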

[GitHub] [spark] VindhyaG commented on a diff in pull request #40553: [SPARK-39722] [SQL] getString API for Dataset

2023-03-30 Thread via GitHub
VindhyaG commented on code in PR #40553: URL: https://github.com/apache/spark/pull/40553#discussion_r1153437032 ## connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/Dataset.scala: ## @@ -535,6 +535,159 @@ class Dataset[T] private[sql] ( } } + /** + *

[GitHub] [spark] Hisoka-X opened a new pull request, #40609: [SPARK-42316][SQL] Assign name to _LEGACY_ERROR_TEMP_2044

2023-03-30 Thread via GitHub
Hisoka-X opened a new pull request, #40609: URL: https://github.com/apache/spark/pull/40609 ### What changes were proposed in this pull request? This PR proposes to assign name to _LEGACY_ERROR_TEMP_2044, "BINARY_ARITHMETIC_CAUSE_OVERFLOW". ### Why are the changes

[GitHub] [spark] juanvisoler commented on pull request #40608: [SPARK-35198][CORE][PYTHON][SQL] Add support for calling debugCodegen from Python & Java

2023-03-30 Thread via GitHub
juanvisoler commented on PR #40608: URL: https://github.com/apache/spark/pull/40608#issuecomment-1490470454 @holdenk @MaxGekk

[GitHub] [spark] ScrapCodes commented on pull request #40553: [SPARK-39722] [SQL] getString API for Dataset

2023-03-30 Thread via GitHub
ScrapCodes commented on PR #40553: URL: https://github.com/apache/spark/pull/40553#issuecomment-1490457386 Do you think, a more interesting way can be returning a JSON representation?

[GitHub] [spark] ScrapCodes commented on pull request #40553: [SPARK-39722] [SQL] getString API for Dataset

2023-03-30 Thread via GitHub
ScrapCodes commented on PR #40553: URL: https://github.com/apache/spark/pull/40553#issuecomment-1490453383 I see this as developer facing API, So just having ``` def getString(numRows: Int, truncate: Int): String = getString(numRows, truncate, vertical = false) ``` would
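The `getString` API under discussion would return the same ASCII table that `Dataset.show()` prints, as a string suitable for logging. A small self-contained Python sketch of that kind of renderer (the function name and parameters are hypothetical, simplified from the proposed Scala signature `getString(numRows, truncate, vertical)`):

```python
def get_string(rows, headers, num_rows=20, truncate=0):
    """Render rows as a show()-style ASCII table and return it as a string.

    rows     -- sequence of row tuples
    headers  -- column names
    num_rows -- cap on how many rows are rendered
    truncate -- if > 0, shorten cells longer than this many characters
    """
    cells = [[str(v) for v in row] for row in rows[:num_rows]]
    if truncate > 0:
        cells = [[c[:truncate - 3] + "..." if len(c) > truncate else c
                  for c in row] for row in cells]
    # Column width = widest of the header and every cell in that column.
    widths = [max([len(h)] + [len(r[i]) for r in cells])
              for i, h in enumerate(headers)]
    sep = "+" + "+".join("-" * w for w in widths) + "+"

    def line(vals):
        return "|" + "|".join(v.ljust(w) for v, w in zip(vals, widths)) + "|"

    return "\n".join([sep, line(headers), sep] + [line(r) for r in cells] + [sep])
```

For example, `get_string([(1, "Alice"), (2, "Bob")], ["id", "name"])` yields a bordered two-row table, which is the tabular-for-logging use case VindhyaG describes above.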

[GitHub] [spark] yabola commented on a diff in pull request #39950: [SPARK-42388][SQL] Avoid parquet footer reads twice when no filters in vectorized reader

2023-03-30 Thread via GitHub
yabola commented on code in PR #39950: URL: https://github.com/apache/spark/pull/39950#discussion_r1153375489 ## sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/ParquetFooterReader.java: ## @@ -17,23 +17,53 @@ package

[GitHub] [spark] yabola commented on a diff in pull request #39950: [SPARK-42388][SQL] Avoid parquet footer reads twice when no filters in vectorized reader

2023-03-30 Thread via GitHub
yabola commented on code in PR #39950: URL: https://github.com/apache/spark/pull/39950#discussion_r1153377375 ## sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/ParquetFooterReader.java: ## @@ -17,23 +17,53 @@ package

[GitHub] [spark] yabola commented on a diff in pull request #39950: [SPARK-42388][SQL] Avoid parquet footer reads twice when no filters in vectorized reader

2023-03-30 Thread via GitHub
yabola commented on code in PR #39950: URL: https://github.com/apache/spark/pull/39950#discussion_r1153376539 ## sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/ParquetFooterReader.java: ## @@ -17,23 +17,53 @@ package

[GitHub] [spark] yabola commented on a diff in pull request #39950: [SPARK-42388][SQL] Avoid parquet footer reads twice when no filters in vectorized reader

2023-03-30 Thread via GitHub
yabola commented on code in PR #39950: URL: https://github.com/apache/spark/pull/39950#discussion_r1153376111 ## sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/ParquetFooterReader.java: ## @@ -17,23 +17,53 @@ package

[GitHub] [spark] MaxGekk closed pull request #40593: [SPARK-42979][SQL] Define literal constructors as keywords

2023-03-30 Thread via GitHub
MaxGekk closed pull request #40593: [SPARK-42979][SQL] Define literal constructors as keywords URL: https://github.com/apache/spark/pull/40593

[GitHub] [spark] MaxGekk commented on pull request #40593: [SPARK-42979][SQL] Define literal constructors as keywords

2023-03-30 Thread via GitHub
MaxGekk commented on PR #40593: URL: https://github.com/apache/spark/pull/40593#issuecomment-1490430902 Merging to master. Thank you, @cloud-fan for review.

[GitHub] [spark] VindhyaG commented on a diff in pull request #40553: [SPARK-39722] [SQL] getString API for Dataset

2023-03-30 Thread via GitHub
VindhyaG commented on code in PR #40553: URL: https://github.com/apache/spark/pull/40553#discussion_r1151950076 ## sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala: ## @@ -883,6 +883,129 @@ class Dataset[T] private[sql]( println(showString(numRows, truncate,

[GitHub] [spark] dongjoon-hyun commented on pull request #40604: Revert "[SPARK-41765][SQL] Pull out v1 write metrics to WriteFiles"

2023-03-30 Thread via GitHub
dongjoon-hyun commented on PR #40604: URL: https://github.com/apache/spark/pull/40604#issuecomment-1490405810 +1 for reverting decision. Thank you, @cloud-fan and all.

[GitHub] [spark] juanvisoler opened a new pull request, #40608: SPARK-35198

2023-03-30 Thread via GitHub
juanvisoler opened a new pull request, #40608: URL: https://github.com/apache/spark/pull/40608 Add support for calling debugCodegen from Python & Java

[GitHub] [spark] MaxGekk commented on a diff in pull request #40126: [SPARK-40822][SQL] Stable derived column aliases

2023-03-30 Thread via GitHub
MaxGekk commented on code in PR #40126: URL: https://github.com/apache/spark/pull/40126#discussion_r1153316438 ## sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/ResolveAliasesSuite.scala: ## @@ -88,4 +94,46 @@ class ResolveAliasesSuite extends AnalysisTest {

[GitHub] [spark] jaceklaskowski commented on a diff in pull request #40567: [SPARK-42935] [SQL] Add union required distribution push down

2023-03-30 Thread via GitHub
jaceklaskowski commented on code in PR #40567: URL: https://github.com/apache/spark/pull/40567#discussion_r1153247899 ## sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala: ## @@ -4195,6 +4195,15 @@ object SQLConf { .booleanConf

[GitHub] [spark] HeartSaVioR closed pull request #40600: [SPARK-42968][SS] Add option to skip commit coordinator as part of StreamingWrite API for DSv2 sources/sinks

2023-03-30 Thread via GitHub
HeartSaVioR closed pull request #40600: [SPARK-42968][SS] Add option to skip commit coordinator as part of StreamingWrite API for DSv2 sources/sinks URL: https://github.com/apache/spark/pull/40600

[GitHub] [spark] HeartSaVioR commented on pull request #40600: [SPARK-42968][SS] Add option to skip commit coordinator as part of StreamingWrite API for DSv2 sources/sinks

2023-03-30 Thread via GitHub
HeartSaVioR commented on PR #40600: URL: https://github.com/apache/spark/pull/40600#issuecomment-1490244143 Thanks! Merging to master.

[GitHub] [spark] martin-kokos closed pull request #39941: [MINOR][DOCS] Add link to Hadoop docs

2023-03-30 Thread via GitHub
martin-kokos closed pull request #39941: [MINOR][DOCS] Add link to Hadoop docs URL: https://github.com/apache/spark/pull/39941

[GitHub] [spark] martin-kokos commented on pull request #39941: [MINOR][DOCS] Add link to Hadoop docs

2023-03-30 Thread via GitHub
martin-kokos commented on PR #39941: URL: https://github.com/apache/spark/pull/39941#issuecomment-1490231287 Fixed by https://github.com/apache/spark/commit/c9c3880e3ad6f57a359f1de05b7e772c06660d0b

[GitHub] [spark] VindhyaG commented on pull request #40553: [SPARK-39722] [SQL] getString API for Dataset

2023-03-30 Thread via GitHub
VindhyaG commented on PR #40553: URL: https://github.com/apache/spark/pull/40553#issuecomment-1490227613 > Hi @VindhyaG, this might be useful - may be we can benefit from the usecase you have for this. Is it just for logging? Not sure what others think, it might be good to limit the API

[GitHub] [spark] VindhyaG commented on a diff in pull request #40553: [SPARK-39722] [SQL] getString API for Dataset

2023-03-30 Thread via GitHub
VindhyaG commented on code in PR #40553: URL: https://github.com/apache/spark/pull/40553#discussion_r1153193529 ## connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/Dataset.scala: ## @@ -535,6 +535,159 @@ class Dataset[T] private[sql] ( } } + /** + *

[GitHub] [spark] cloud-fan closed pull request #40604: Revert "[SPARK-41765][SQL] Pull out v1 write metrics to WriteFiles"

2023-03-30 Thread via GitHub
cloud-fan closed pull request #40604: Revert "[SPARK-41765][SQL] Pull out v1 write metrics to WriteFiles" URL: https://github.com/apache/spark/pull/40604

[GitHub] [spark] cloud-fan commented on pull request #40604: Revert "[SPARK-41765][SQL] Pull out v1 write metrics to WriteFiles"

2023-03-30 Thread via GitHub
cloud-fan commented on PR #40604: URL: https://github.com/apache/spark/pull/40604#issuecomment-1490156737 thanks for review, merging to master/3.4!

[GitHub] [spark] HeartSaVioR commented on a diff in pull request #40561: [SPARK-42931][SS] Introduce dropDuplicatesWithinWatermark

2023-03-30 Thread via GitHub
HeartSaVioR commented on code in PR #40561: URL: https://github.com/apache/spark/pull/40561#discussion_r1153116775 ## sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/statefulOperators.scala: ## @@ -980,3 +1022,65 @@ object StreamingDeduplicateExec { private

[GitHub] [spark] HeartSaVioR commented on a diff in pull request #40561: [SPARK-42931][SS] Introduce dropDuplicatesWithinWatermark

2023-03-30 Thread via GitHub
HeartSaVioR commented on code in PR #40561: URL: https://github.com/apache/spark/pull/40561#discussion_r1153114879 ## sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/statefulOperators.scala: ## @@ -980,3 +1022,65 @@ object StreamingDeduplicateExec { private

[GitHub] [spark] infoankitp commented on a diff in pull request #40563: [SPARK-41232][SPARK-41233][FOLLOWUP] Refactor `array_append` and `array_prepend` with `RuntimeReplaceable`

2023-03-30 Thread via GitHub
infoankitp commented on code in PR #40563: URL: https://github.com/apache/spark/pull/40563#discussion_r1153083910 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala: ## @@ -1400,120 +1400,24 @@ case class ArrayContains(left:

[GitHub] [spark] infoankitp commented on a diff in pull request #40563: [SPARK-41232][SPARK-41233][FOLLOWUP] Refactor `array_append` and `array_prepend` with `RuntimeReplaceable`

2023-03-30 Thread via GitHub
infoankitp commented on code in PR #40563: URL: https://github.com/apache/spark/pull/40563#discussion_r1153083574 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala: ## @@ -5056,128 +4950,45 @@ case class ArrayCompact(child:

[GitHub] [spark] zhengruifeng commented on a diff in pull request #40607: [WIP][ML] Make Torch Distributor support Spark Connect

2023-03-30 Thread via GitHub
zhengruifeng commented on code in PR #40607: URL: https://github.com/apache/spark/pull/40607#discussion_r1153072179 ## python/pyspark/ml/torch/distributor.py: ## @@ -581,11 +593,11 @@ def _run_distributed_training( f"Started distributed training with

[GitHub] [spark] zhengruifeng commented on a diff in pull request #40607: [WIP][ML] Make Torch Distributor support Spark Connect

2023-03-30 Thread via GitHub
zhengruifeng commented on code in PR #40607: URL: https://github.com/apache/spark/pull/40607#discussion_r1153071026 ## python/pyspark/ml/tests/connect/test_parity_torch_distributor.py: ## @@ -0,0 +1,511 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more

[GitHub] [spark] WeichenXu123 commented on a diff in pull request #40607: [WIP][ML] Make Torch Distributor support Spark Connect

2023-03-30 Thread via GitHub
WeichenXu123 commented on code in PR #40607: URL: https://github.com/apache/spark/pull/40607#discussion_r1153069493 ## python/pyspark/ml/tests/connect/test_parity_torch_distributor.py: ## @@ -0,0 +1,511 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more

[GitHub] [spark] zhengruifeng commented on a diff in pull request #40607: [WIP][ML] Make Torch Distributor support Spark Connect

2023-03-30 Thread via GitHub
zhengruifeng commented on code in PR #40607: URL: https://github.com/apache/spark/pull/40607#discussion_r1153069439 ## python/pyspark/ml/torch/distributor.py: ## @@ -330,6 +340,7 @@ def __init__( num_processes: int = 1, local_mode: bool = True,

[GitHub] [spark] WeichenXu123 commented on a diff in pull request #40607: [WIP][ML] Make Torch Distributor support Spark Connect

2023-03-30 Thread via GitHub
WeichenXu123 commented on code in PR #40607: URL: https://github.com/apache/spark/pull/40607#discussion_r1153067929 ## python/pyspark/ml/torch/distributor.py: ## @@ -144,15 +145,21 @@ def __init__( num_processes: int = 1, local_mode: bool = True,

[GitHub] [spark] WeichenXu123 commented on a diff in pull request #40607: [WIP][ML] Make Torch Distributor support Spark Connect

2023-03-30 Thread via GitHub
WeichenXu123 commented on code in PR #40607: URL: https://github.com/apache/spark/pull/40607#discussion_r1153067103 ## python/pyspark/ml/torch/distributor.py: ## @@ -330,6 +340,7 @@ def __init__( num_processes: int = 1, local_mode: bool = True,

[GitHub] [spark] WeichenXu123 commented on a diff in pull request #40607: [WIP][ML] Make Torch Distributor support Spark Connect

2023-03-30 Thread via GitHub
WeichenXu123 commented on code in PR #40607: URL: https://github.com/apache/spark/pull/40607#discussion_r1153066202 ## python/pyspark/ml/torch/distributor.py: ## @@ -581,11 +593,11 @@ def _run_distributed_training( f"Started distributed training with

[GitHub] [spark] yaooqinn commented on pull request #38732: [SPARK-41210][K8S] Window based executor failure tracking mechanism

2023-03-30 Thread via GitHub
yaooqinn commented on PR #38732: URL: https://github.com/apache/spark/pull/38732#issuecomment-1490050958 Does Kubernetes support other mechanisms to add a timeout during pod/container/app initialization? If not, we shall bring this feature in at the spark layer. Also cc @Yikun

[GitHub] [spark] zhengruifeng opened a new pull request, #40607: [WIP][ML] Make Torch Distributor support Spark Connect

2023-03-30 Thread via GitHub
zhengruifeng opened a new pull request, #40607: URL: https://github.com/apache/spark/pull/40607

[GitHub] [spark] yaooqinn commented on a diff in pull request #38732: [SPARK-41210][K8S] Window based executor failure tracking mechanism

2023-03-30 Thread via GitHub
yaooqinn commented on code in PR #38732: URL: https://github.com/apache/spark/pull/38732#discussion_r1153042664 ## resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/Config.scala: ## @@ -750,6 +750,26 @@ private[spark] object Config extends Logging {

[GitHub] [spark] huangxiaopingRD commented on a diff in pull request #40232: [SPARK-42629][DOCS] Update the description of default data source in the document

2023-03-30 Thread via GitHub
huangxiaopingRD commented on code in PR #40232: URL: https://github.com/apache/spark/pull/40232#discussion_r1153014538 ## docs/sql-ref-syntax-ddl-create-table-datasource.md: ## @@ -118,7 +118,7 @@ CREATE TABLE student (id INT, name STRING, age INT) USING CSV; CREATE TABLE

[GitHub] [spark] LuciferYang commented on a diff in pull request #40605: [SPARK-42958][CONNECT] Refactor `connect-jvm-client-mima-check` to support mima check with avro module

2023-03-30 Thread via GitHub
LuciferYang commented on code in PR #40605: URL: https://github.com/apache/spark/pull/40605#discussion_r1153013745 ## dev/connect-jvm-client-mima-check: ## @@ -34,20 +34,18 @@ fi rm -f .connect-mima-check-result -echo "Build sql module, connect-client-jvm module and

[GitHub] [spark] grundprinzip opened a new pull request, #40606: Debugging is awesome

2023-03-30 Thread via GitHub
grundprinzip opened a new pull request, #40606: URL: https://github.com/apache/spark/pull/40606 ### What changes were proposed in this pull request? ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ###

[GitHub] [spark] lyy-pineapple commented on pull request #38171: [SPARK-9213] [SQL] Improve regular expression performance (via joni)

2023-03-30 Thread via GitHub
lyy-pineapple commented on PR #38171: URL: https://github.com/apache/spark/pull/38171#issuecomment-1489987177 > https://user-images.githubusercontent.com/8748814/204439049-53f0bd4f-9ea0-4289-8268-d16aef5b4334.png > > @lyy-pineapple Would you share the test sql pattern? I test some

[GitHub] [spark] lyy-pineapple commented on pull request #38171: [SPARK-9213] [SQL] Improve regular expression performance (via joni)

2023-03-30 Thread via GitHub
lyy-pineapple commented on PR #38171: URL: https://github.com/apache/spark/pull/38171#issuecomment-1489985307 > `joni` seems to be used in Hbase client only instead of Hbase server or Hbase common. > > * https://mvnrepository.com/artifact/org.apache.hbase/hbase-client/2.5.3 >

[GitHub] [spark] pan3793 commented on a diff in pull request #38732: [SPARK-41210][K8S] Window based executor failure tracking mechanism

2023-03-30 Thread via GitHub
pan3793 commented on code in PR #38732: URL: https://github.com/apache/spark/pull/38732#discussion_r1038973719 ## resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsAllocator.scala: ## @@ -136,6 +151,10 @@ class

[GitHub] [spark] pan3793 commented on a diff in pull request #38732: [SPARK-41210][K8S] Window based executor failure tracking mechanism

2023-03-30 Thread via GitHub
pan3793 commented on code in PR #38732: URL: https://github.com/apache/spark/pull/38732#discussion_r1152967630 ## resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/Config.scala: ## @@ -750,6 +750,26 @@ private[spark] object Config extends Logging {

[GitHub] [spark] cloud-fan commented on pull request #40437: [SPARK-41259][SQL] SparkSQLDriver Output schema and result string should be consistent

2023-03-30 Thread via GitHub
cloud-fan commented on PR #40437: URL: https://github.com/apache/spark/pull/40437#issuecomment-1489962469 > I'm not sure why spark-sql CLI has to be compatible with hive output, personally, I don't think it's necessary. Maybe we should display spark's schema as is, just like thriftSever?

[GitHub] [spark] cloud-fan closed pull request #40300: [SPARK-42683] Automatically rename conflicting metadata columns

2023-03-30 Thread via GitHub
cloud-fan closed pull request #40300: [SPARK-42683] Automatically rename conflicting metadata columns URL: https://github.com/apache/spark/pull/40300 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to

[GitHub] [spark] pan3793 commented on a diff in pull request #38732: [SPARK-41210][K8S] Window based executor failure tracking mechanism

2023-03-30 Thread via GitHub
pan3793 commented on code in PR #38732: URL: https://github.com/apache/spark/pull/38732#discussion_r1152964045 ## resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsAllocator.scala: ## @@ -148,6 +163,10 @@ class

[GitHub] [spark] cloud-fan commented on pull request #40300: [SPARK-42683] Automatically rename conflicting metadata columns

2023-03-30 Thread via GitHub
cloud-fan commented on PR #40300: URL: https://github.com/apache/spark/pull/40300#issuecomment-1489959075 thanks, merging to master! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific

[GitHub] [spark] LuciferYang opened a new pull request, #40605: [SPARK-42958][CONNECT] Refactor `CheckConnectJvmClientCompatibility` to compare client and avro module

2023-03-30 Thread via GitHub
LuciferYang opened a new pull request, #40605: URL: https://github.com/apache/spark/pull/40605 ### What changes were proposed in this pull request? ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ###

[GitHub] [spark] yaooqinn commented on a diff in pull request #38732: [SPARK-41210][K8S] Window based executor failure tracking mechanism

2023-03-30 Thread via GitHub
yaooqinn commented on code in PR #38732: URL: https://github.com/apache/spark/pull/38732#discussion_r1152961738 ## resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsAllocator.scala: ## @@ -148,6 +163,10 @@ class

[GitHub] [spark] yaooqinn commented on a diff in pull request #38732: [SPARK-41210][K8S] Window based executor failure tracking mechanism

2023-03-30 Thread via GitHub
yaooqinn commented on code in PR #38732: URL: https://github.com/apache/spark/pull/38732#discussion_r1152960657 ## resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsAllocator.scala: ## @@ -117,6 +120,12 @@ class

[GitHub] [spark] pan3793 opened a new pull request, #38732: [SPARK-41210][K8S] Window based executor failure tracking mechanism

2023-03-30 Thread via GitHub
pan3793 opened a new pull request, #38732: URL: https://github.com/apache/spark/pull/38732 ### What changes were proposed in this pull request? Fail Spark Application when number of executor failures reach threshold. ### Why are the changes needed? Sometimes,
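The mechanism described in SPARK-41210 can be illustrated with a small sketch: record the timestamp of each executor failure, evict timestamps that fall outside a sliding time window, and signal application failure once the in-window count reaches a threshold. This is a hypothetical stand-alone model of the idea, not Spark's actual `ExecutorPodsAllocator` code; the class and method names here are invented for illustration.

```python
import time
from collections import deque


class WindowedFailureTracker:
    """Illustrative sketch (not Spark's implementation) of window-based
    executor failure tracking: keep failure timestamps, drop those older
    than the window, and fail the app when the in-window count reaches
    the configured maximum."""

    def __init__(self, max_failures, window_seconds, clock=time.monotonic):
        self.max_failures = max_failures
        self.window = window_seconds
        self.clock = clock          # injectable clock, handy for testing
        self.failures = deque()     # timestamps of recorded failures

    def record_failure(self):
        now = self.clock()
        self.failures.append(now)
        # Evict failures that have aged out of the sliding window.
        while self.failures and now - self.failures[0] > self.window:
            self.failures.popleft()

    def should_fail_application(self):
        # True once the number of failures inside the window hits the cap.
        return len(self.failures) >= self.max_failures
```

The key design point of a *windowed* (rather than cumulative) counter is that a long-running application is not killed by failures accumulated over days; only a burst of failures within the configured window triggers shutdown.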

[GitHub] [spark] yaooqinn commented on a diff in pull request #38732: [SPARK-41210][K8S] Window based executor failure tracking mechanism

2023-03-30 Thread via GitHub
yaooqinn commented on code in PR #38732: URL: https://github.com/apache/spark/pull/38732#discussion_r1152957287 ## resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/Config.scala: ## @@ -750,6 +750,26 @@ private[spark] object Config extends Logging {

[GitHub] [spark] yaooqinn commented on a diff in pull request #38732: [SPARK-41210][K8S] Window based executor failure tracking mechanism

2023-03-30 Thread via GitHub
yaooqinn commented on code in PR #38732: URL: https://github.com/apache/spark/pull/38732#discussion_r1152954378 ## resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsAllocator.scala: ## @@ -520,10 +552,46 @@ class

[GitHub] [spark] yaooqinn commented on a diff in pull request #38732: [SPARK-41210][K8S] Window based executor failure tracking mechanism

2023-03-30 Thread via GitHub
yaooqinn commented on code in PR #38732: URL: https://github.com/apache/spark/pull/38732#discussion_r1152951801 ## resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsAllocator.scala: ## @@ -494,10 +525,46 @@ class

[GitHub] [spark] LuciferYang commented on a diff in pull request #40598: [SPARK-42974][CORE] Restore `Utils#createTempDir` use `ShutdownHookManager#registerShutdownDeleteDir` to cleanup tempDir

2023-03-30 Thread via GitHub
LuciferYang commented on code in PR #40598: URL: https://github.com/apache/spark/pull/40598#discussion_r1152946827 ## common/network-common/src/main/java/org/apache/spark/network/util/JavaUtils.java: ## @@ -373,18 +373,22 @@ public static byte[] bufferToArray(ByteBuffer buffer)

[GitHub] [spark] Yikf commented on pull request #40437: [SPARK-41259][SQL] SparkSQLDriver Output schema and result string should be consistent

2023-03-30 Thread via GitHub
Yikf commented on PR #40437: URL: https://github.com/apache/spark/pull/40437#issuecomment-1489933393 Yes. `hiveResultString` is added to ensure compatibility with hive output. `hiveResultString` is only used by the spark-sql CLI. It is used only as the CLI display.

[GitHub] [spark] cloud-fan commented on pull request #40604: Revert "[SPARK-41765][SQL] Pull out v1 write metrics to WriteFiles"

2023-03-30 Thread via GitHub
cloud-fan commented on PR #40604: URL: https://github.com/apache/spark/pull/40604#issuecomment-1489923733 also cc @xinrong-meng , this is not a blocker but it's better if we can make it into 3.4.0. -- This is an automated message from the Apache Git Service. To respond to the message,

[GitHub] [spark] cloud-fan commented on pull request #40604: Revert "[SPARK-41765][SQL] Pull out v1 write metrics to WriteFiles"

2023-03-30 Thread via GitHub
cloud-fan commented on PR #40604: URL: https://github.com/apache/spark/pull/40604#issuecomment-1489923055 cc @ulysses-you -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

[GitHub] [spark] cloud-fan opened a new pull request, #40604: Revert "[SPARK-41765][SQL] Pull out v1 write metrics to WriteFiles"

2023-03-30 Thread via GitHub
cloud-fan opened a new pull request, #40604: URL: https://github.com/apache/spark/pull/40604 This reverts commit a111a02de1a814c5f335e0bcac4cffb0515557dc. ### What changes were proposed in this pull request? SQLMetrics is not only used in the UI, but is also a

[GitHub] [spark] cloud-fan commented on a diff in pull request #40437: [SPARK-41259][SQL] SparkSQLDriver Output schema and result string should be consistent

2023-03-30 Thread via GitHub
cloud-fan commented on code in PR #40437: URL: https://github.com/apache/spark/pull/40437#discussion_r1152923808 ## sql/core/src/main/scala/org/apache/spark/sql/execution/HiveResult.scala: ## @@ -59,18 +59,6 @@ object HiveResult {

[GitHub] [spark] yaooqinn commented on pull request #40437: [SPARK-41259][SQL] SparkSQLDriver Output schema and result string should be consistent

2023-03-30 Thread via GitHub
yaooqinn commented on PR #40437: URL: https://github.com/apache/spark/pull/40437#issuecomment-1489909241 > If we are sure this is only for CLI display, Yes. hiveResultString is only used in spark-sql CLI. The thrift server-side always uses command output schema. Maybe this is the

[GitHub] [spark] cloud-fan commented on a diff in pull request #40593: [WIP][SQL] Define typed literal constructors as keywords

2023-03-30 Thread via GitHub
cloud-fan commented on code in PR #40593: URL: https://github.com/apache/spark/pull/40593#discussion_r1152910161 ## sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBaseParser.g4: ## @@ -928,11 +928,19 @@ primaryExpression (FILTER LEFT_PAREN WHERE
