[GitHub] [spark] pan3793 opened a new pull request, #40920: [SPARK-43248][SQL] Unnecessary serialize/deserialize of Path on parallel gather partition stats

2023-04-24 Thread via GitHub
pan3793 opened a new pull request, #40920: URL: https://github.com/apache/spark/pull/40920 ### What changes were proposed in this pull request? Remove unnecessary serialize/deserialize of `Path` on parallel gather partition stats. ### Why are the changes needed?

[GitHub] [spark] LuciferYang commented on a diff in pull request #40898: [SPARK-43230][CONNECT] Simplify `DataFrameNaFunctions.fillna`

2023-04-24 Thread via GitHub
LuciferYang commented on code in PR #40898: URL: https://github.com/apache/spark/pull/40898#discussion_r1174911075 ## connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/connect/client/CheckConnectJvmClientCompatibility.scala: ## @@ -145,6 +145,7 @@ object CheckConn

[GitHub] [spark] vicennial commented on a diff in pull request #40675: [SPARK-42657][CONNECT] Support to find and transfer client-side REPL classfiles to server as artifacts

2023-04-24 Thread via GitHub
vicennial commented on code in PR #40675: URL: https://github.com/apache/spark/pull/40675#discussion_r1174917132 ## connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/application/ReplE2ESuite.scala: ## @@ -0,0 +1,128 @@ +/* + * Licensed to the Apache Software Found

[GitHub] [spark] CavemanIV opened a new pull request, #40921: [SPARK-43242] fix throw 'Unexpected type of BlockId' in diagnose when…

2023-04-24 Thread via GitHub
CavemanIV opened a new pull request, #40921: URL: https://github.com/apache/spark/pull/40921 ### What changes were proposed in this pull request? A minor bugfix in `ShuffleBlockFetcherIterator.diagnose`, which does not handle type `ShuffleBlockBatchId` properly ### Why are the changes

[GitHub] [spark] zhengruifeng closed pull request #40862: [SPARK-43169][INFRA][FOLLOWUP] Add more memory for mima check

2023-04-24 Thread via GitHub
zhengruifeng closed pull request #40862: [SPARK-43169][INFRA][FOLLOWUP] Add more memory for mima check URL: https://github.com/apache/spark/pull/40862 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to

[GitHub] [spark] zhengruifeng commented on pull request #40862: [SPARK-43169][INFRA][FOLLOWUP] Add more memory for mima check

2023-04-24 Thread via GitHub
zhengruifeng commented on PR #40862: URL: https://github.com/apache/spark/pull/40862#issuecomment-1519595951 mima test OOM again https://github.com/apache/spark/actions/runs/4783257117/jobs/8503361722 -- This is an automated message from the Apache Git Service. To respond to the message,

[GitHub] [spark] zhengruifeng commented on pull request #40862: [SPARK-43169][INFRA][FOLLOWUP] Add more memory for mima check

2023-04-24 Thread via GitHub
zhengruifeng commented on PR #40862: URL: https://github.com/apache/spark/pull/40862#issuecomment-1519597372 merged to master -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

[GitHub] [spark] LuciferYang commented on pull request #40862: [SPARK-43169][INFRA][FOLLOWUP] Add more memory for mima check

2023-04-24 Thread via GitHub
LuciferYang commented on PR #40862: URL: https://github.com/apache/spark/pull/40862#issuecomment-1519601176 Let's merge this one to avoid oom :) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to th

[GitHub] [spark] LuciferYang commented on pull request #40862: [SPARK-43169][INFRA][FOLLOWUP] Add more memory for mima check

2023-04-24 Thread via GitHub
LuciferYang commented on PR #40862: URL: https://github.com/apache/spark/pull/40862#issuecomment-1519602330 Thanks @zhengruifeng @HyukjinKwon @pan3793 @Hisoka-X -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL

[GitHub] [spark] kori73 commented on pull request #40810: [SPARK-42317][SQL] Assign name to _LEGACY_ERROR_TEMP_2247: CANNOT_MERGE_SCHEMAS

2023-04-24 Thread via GitHub
kori73 commented on PR #40810: URL: https://github.com/apache/spark/pull/40810#issuecomment-1519612722 > @kori73 Could you update the example (output) according to the recent commit, please. updated the example according to the recent commit -- This is an automated message from the

[GitHub] [spark] cloud-fan opened a new pull request, #40922: [SPARK-43063][SQL][FOLLOWUP] Add ToPrettyString expression for Dataset.show

2023-04-24 Thread via GitHub
cloud-fan opened a new pull request, #40922: URL: https://github.com/apache/spark/pull/40922 ### What changes were proposed in this pull request? This is a followup of https://github.com/apache/spark/pull/40699 to avoid changing the Cast behavior. It pulls out the cast-to-stri

[GitHub] [spark] cloud-fan commented on pull request #40922: [SPARK-43063][SQL][FOLLOWUP] Add ToPrettyString expression for Dataset.show

2023-04-24 Thread via GitHub
cloud-fan commented on PR #40922: URL: https://github.com/apache/spark/pull/40922#issuecomment-1519616957 cc @Yikf @sadikovi @gengliangwang -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the sp

[GitHub] [spark] MaxGekk commented on pull request #40810: [SPARK-42317][SQL] Assign name to _LEGACY_ERROR_TEMP_2247: CANNOT_MERGE_SCHEMAS

2023-04-24 Thread via GitHub
MaxGekk commented on PR #40810: URL: https://github.com/apache/spark/pull/40810#issuecomment-1519620141 +1, LGTM. Merging to master. Thank you, @kori73. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above

[GitHub] [spark] MaxGekk closed pull request #40810: [SPARK-42317][SQL] Assign name to _LEGACY_ERROR_TEMP_2247: CANNOT_MERGE_SCHEMAS

2023-04-24 Thread via GitHub
MaxGekk closed pull request #40810: [SPARK-42317][SQL] Assign name to _LEGACY_ERROR_TEMP_2247: CANNOT_MERGE_SCHEMAS URL: https://github.com/apache/spark/pull/40810 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL a

[GitHub] [spark] MaxGekk commented on pull request #40810: [SPARK-42317][SQL] Assign name to _LEGACY_ERROR_TEMP_2247: CANNOT_MERGE_SCHEMAS

2023-04-24 Thread via GitHub
MaxGekk commented on PR #40810: URL: https://github.com/apache/spark/pull/40810#issuecomment-1519623179 @kori73 Congratulations on your first contribution to Apache Spark! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and

[GitHub] [spark] zhengruifeng closed pull request #40899: [SPARK-43249][CONNECT] Fix missing stats for SQL Command

2023-04-24 Thread via GitHub
zhengruifeng closed pull request #40899: [SPARK-43249][CONNECT] Fix missing stats for SQL Command URL: https://github.com/apache/spark/pull/40899 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the

[GitHub] [spark] zhengruifeng commented on pull request #40899: [SPARK-43249][CONNECT] Fix missing stats for SQL Command

2023-04-24 Thread via GitHub
zhengruifeng commented on PR #40899: URL: https://github.com/apache/spark/pull/40899#issuecomment-1519628355 merged to master and branch-3.4, thanks -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go t

[GitHub] [spark] lyy-pineapple commented on a diff in pull request #38171: [SPARK-9213] [SQL] Improve regular expression performance (via joni)

2023-04-24 Thread via GitHub
lyy-pineapple commented on code in PR #38171: URL: https://github.com/apache/spark/pull/38171#discussion_r1174987899 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressionsJoni.scala: ## @@ -0,0 +1,481 @@ +/* + * Licensed to the Apache Software

[GitHub] [spark] lyy-pineapple commented on a diff in pull request #38171: [SPARK-9213] [SQL] Improve regular expression performance (via joni)

2023-04-24 Thread via GitHub
lyy-pineapple commented on code in PR #38171: URL: https://github.com/apache/spark/pull/38171#discussion_r1174997469 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressionsJoni.scala: ## @@ -0,0 +1,481 @@ +/* + * Licensed to the Apache Software

[GitHub] [spark] bjornjorgensen commented on pull request #40658: [WIP][SPARK-43024][PS] Upgrade pandas to 2.0.0

2023-04-24 Thread via GitHub
bjornjorgensen commented on PR #40658: URL: https://github.com/apache/spark/pull/40658#issuecomment-1519710374 https://pandas.pydata.org/pandas-docs/version/2.0.1/whatsnew/v2.0.1.html -- This is an automated message from the Apache Git Service. To respond to the message, please log on to

[GitHub] [spark] lyy-pineapple commented on a diff in pull request #38171: [SPARK-9213] [SQL] Improve regular expression performance (via joni)

2023-04-24 Thread via GitHub
lyy-pineapple commented on code in PR #38171: URL: https://github.com/apache/spark/pull/38171#discussion_r1175016808 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressionsJoni.scala: ## @@ -0,0 +1,481 @@ +/* + * Licensed to the Apache Software

[GitHub] [spark] bogao007 opened a new pull request, #40923: [Draft] State API (FlatMapGroupsWithState) in Scala for Spark Connect

2023-04-24 Thread via GitHub
bogao007 opened a new pull request, #40923: URL: https://github.com/apache/spark/pull/40923 ### What changes were proposed in this pull request? ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ### How wa

[GitHub] [spark] cloud-fan commented on a diff in pull request #40885: [SPARK-43226] Define extractors for file-constant metadata

2023-04-24 Thread via GitHub
cloud-fan commented on code in PR #40885: URL: https://github.com/apache/spark/pull/40885#discussion_r1175031042 ## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormat.scala: ## @@ -203,6 +203,21 @@ trait FileFormat { * method. Technically, a file f

[GitHub] [spark] cloud-fan commented on a diff in pull request #40885: [SPARK-43226] Define extractors for file-constant metadata

2023-04-24 Thread via GitHub
cloud-fan commented on code in PR #40885: URL: https://github.com/apache/spark/pull/40885#discussion_r1175034679 ## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormat.scala: ## @@ -203,6 +203,21 @@ trait FileFormat { * method. Technically, a file f

[GitHub] [spark] bogao007 commented on a diff in pull request #40923: [Draft] State API (FlatMapGroupsWithState) in Scala for Spark Connect

2023-04-24 Thread via GitHub
bogao007 commented on code in PR #40923: URL: https://github.com/apache/spark/pull/40923#discussion_r1175032518 ## connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/Dataset.scala: ## @@ -1275,6 +1276,24 @@ class Dataset[T] private[sql] ( proto.Aggregate.Gro

[GitHub] [spark] bogao007 commented on a diff in pull request #40923: [Draft] State API (FlatMapGroupsWithState) in Scala for Spark Connect

2023-04-24 Thread via GitHub
bogao007 commented on code in PR #40923: URL: https://github.com/apache/spark/pull/40923#discussion_r1175036078 ## connector/connect/server/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala: ## @@ -616,6 +618,38 @@ class SparkConnectPlanner(val sessio

[GitHub] [spark] rshkv commented on a diff in pull request #40902: [SPARK-43142] Fix DSL expressions on attributes with special characters

2023-04-24 Thread via GitHub
rshkv commented on code in PR #40902: URL: https://github.com/apache/spark/pull/40902#discussion_r1175069446 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/dsl/package.scala: ## @@ -272,7 +272,7 @@ package object dsl { def attr: UnresolvedAttribute = analysi

[GitHub] [spark] rshkv commented on a diff in pull request #40902: [SPARK-43142] Fix DSL expressions on attributes with special characters

2023-04-24 Thread via GitHub
rshkv commented on code in PR #40902: URL: https://github.com/apache/spark/pull/40902#discussion_r1175068610 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/dsl/package.scala: ## @@ -281,89 +281,108 @@ package object dsl { def attr: UnresolvedAttribute = anal

[GitHub] [spark] zhengruifeng closed pull request #40866: [SPARK-43178][CONNECT][PYTHON] Migrate UDF errors into PySpark error framework

2023-04-24 Thread via GitHub
zhengruifeng closed pull request #40866: [SPARK-43178][CONNECT][PYTHON] Migrate UDF errors into PySpark error framework URL: https://github.com/apache/spark/pull/40866 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the U

[GitHub] [spark] zhengruifeng commented on pull request #40866: [SPARK-43178][CONNECT][PYTHON] Migrate UDF errors into PySpark error framework

2023-04-24 Thread via GitHub
zhengruifeng commented on PR #40866: URL: https://github.com/apache/spark/pull/40866#issuecomment-1519852696 merged to master -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

[GitHub] [spark] cloud-fan commented on pull request #40563: [SPARK-41233][FOLLOWUP] Refactor `array_prepend` with `RuntimeReplaceable`

2023-04-24 Thread via GitHub
cloud-fan commented on PR #40563: URL: https://github.com/apache/spark/pull/40563#issuecomment-1519867307 thanks, merging to master! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific c

[GitHub] [spark] cloud-fan closed pull request #40563: [SPARK-41233][FOLLOWUP] Refactor `array_prepend` with `RuntimeReplaceable`

2023-04-24 Thread via GitHub
cloud-fan closed pull request #40563: [SPARK-41233][FOLLOWUP] Refactor `array_prepend` with `RuntimeReplaceable` URL: https://github.com/apache/spark/pull/40563 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL abov

[GitHub] [spark] cloud-fan commented on a diff in pull request #40915: [SPARK-43232][SQL] Improve ObjectHashAggregateExec performance for high cardinality

2023-04-24 Thread via GitHub
cloud-fan commented on code in PR #40915: URL: https://github.com/apache/spark/pull/40915#discussion_r1175103099 ## sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/ObjectAggregationIterator.scala: ## @@ -111,25 +111,17 @@ class ObjectAggregationIterator( }

[GitHub] [spark] cloud-fan commented on a diff in pull request #40915: [SPARK-43232][SQL] Improve ObjectHashAggregateExec performance for high cardinality

2023-04-24 Thread via GitHub
cloud-fan commented on code in PR #40915: URL: https://github.com/apache/spark/pull/40915#discussion_r1175104259 ## sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/ObjectAggregationIterator.scala: ## @@ -231,11 +224,15 @@ class SortBasedAggregator( grouping

[GitHub] [spark] cloud-fan commented on a diff in pull request #40915: [SPARK-43232][SQL] Improve ObjectHashAggregateExec performance for high cardinality

2023-04-24 Thread via GitHub
cloud-fan commented on code in PR #40915: URL: https://github.com/apache/spark/pull/40915#discussion_r1175105524 ## sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/ObjectAggregationIterator.scala: ## @@ -252,6 +249,7 @@ class SortBasedAggregator( var hasN

[GitHub] [spark] cloud-fan commented on pull request #40875: [SPARK-43214][SQL] Post driver-side metrics for LocalTableScanExec/CommandResultExec

2023-04-24 Thread via GitHub
cloud-fan commented on PR #40875: URL: https://github.com/apache/spark/pull/40875#issuecomment-1519891829 thanks, merging to master! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific c

[GitHub] [spark] cloud-fan closed pull request #40875: [SPARK-43214][SQL] Post driver-side metrics for LocalTableScanExec/CommandResultExec

2023-04-24 Thread via GitHub
cloud-fan closed pull request #40875: [SPARK-43214][SQL] Post driver-side metrics for LocalTableScanExec/CommandResultExec URL: https://github.com/apache/spark/pull/40875 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use th

[GitHub] [spark] itholic commented on pull request #40658: [WIP][SPARK-43024][PS] Upgrade pandas to 2.0.0

2023-04-24 Thread via GitHub
itholic commented on PR #40658: URL: https://github.com/apache/spark/pull/40658#issuecomment-1519900926 Thanks, @bjornjorgensen ! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comm

[GitHub] [spark] itholic opened a new pull request, #40924: [SPARK-43260][PYTHON] Migrate the Spark SQL pandas arrow type errors into error class.

2023-04-24 Thread via GitHub
itholic opened a new pull request, #40924: URL: https://github.com/apache/spark/pull/40924 ### What changes were proposed in this pull request? This PR proposes to migrate the Spark SQL pandas arrow type errors into error class. ### Why are the changes needed? Leveraging

[GitHub] [spark] LuciferYang opened a new pull request, #40925: [SPARK-43246][BUILD] Ignore `privateClasses` and `privateMembers` from connect mima check as default

2023-04-24 Thread via GitHub
LuciferYang opened a new pull request, #40925: URL: https://github.com/apache/spark/pull/40925 ### What changes were proposed in this pull request? ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ### How

[GitHub] [spark] ulysses-you commented on a diff in pull request #40915: [SPARK-43232][SQL] Improve ObjectHashAggregateExec performance for high cardinality

2023-04-24 Thread via GitHub
ulysses-you commented on code in PR #40915: URL: https://github.com/apache/spark/pull/40915#discussion_r1175192930 ## sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/ObjectAggregationIterator.scala: ## @@ -111,25 +111,17 @@ class ObjectAggregationIterator(

[GitHub] [spark] LuciferYang commented on a diff in pull request #40898: [SPARK-43230][CONNECT] Simplify `DataFrameNaFunctions.fillna`

2023-04-24 Thread via GitHub
LuciferYang commented on code in PR #40898: URL: https://github.com/apache/spark/pull/40898#discussion_r1175199390 ## connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/connect/client/CheckConnectJvmClientCompatibility.scala: ## @@ -145,6 +145,7 @@ object CheckConn

[GitHub] [spark] ulysses-you commented on a diff in pull request #40915: [SPARK-43232][SQL] Improve ObjectHashAggregateExec performance for high cardinality

2023-04-24 Thread via GitHub
ulysses-you commented on code in PR #40915: URL: https://github.com/apache/spark/pull/40915#discussion_r1175202831 ## sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/ObjectAggregationIterator.scala: ## @@ -252,6 +249,7 @@ class SortBasedAggregator( var ha

[GitHub] [spark] itholic opened a new pull request, #40926: [SPARK-43261][PYTHON] Migrate `TypeError` from Spark SQL types into error class.

2023-04-24 Thread via GitHub
itholic opened a new pull request, #40926: URL: https://github.com/apache/spark/pull/40926 ### What changes were proposed in this pull request? This PR proposes to migrate `TypeError` from Spark SQL types into error class. ### Why are the changes needed? To improve P

[GitHub] [spark] justaparth commented on a diff in pull request #40686: [SPARK-43051][PROTOBUF] Add option to materialize zero values when deserializing protobufs

2023-04-24 Thread via GitHub
justaparth commented on code in PR #40686: URL: https://github.com/apache/spark/pull/40686#discussion_r1175212228 ## connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/ProtobufDeserializer.scala: ## @@ -288,7 +289,21 @@ private[sql] class ProtobufDeserializer(

[GitHub] [spark] itholic opened a new pull request, #40927: [SPARK-42419][FOLLOWUP][CONNECT][PYTHON] Remove unused exception

2023-04-24 Thread via GitHub
itholic opened a new pull request, #40927: URL: https://github.com/apache/spark/pull/40927 ### What changes were proposed in this pull request? This is a follow-up for https://github.com/apache/spark/pull/39991 to remove an unused exception. ### Why are the changes neede

[GitHub] [spark] itholic opened a new pull request, #40928: [SPARK-43262][CONNECT][SS][PYTHON] Migrate Spark Connect Structured Streaming errors into error class

2023-04-24 Thread via GitHub
itholic opened a new pull request, #40928: URL: https://github.com/apache/spark/pull/40928 ### What changes were proposed in this pull request? This PR proposes to migrate built-in `TypeError` and `ValueError` from Spark Connect Structured Streaming into PySpark error framework.

[GitHub] [spark] ryan-johnson-databricks commented on a diff in pull request #40885: [SPARK-43226] Define extractors for file-constant metadata

2023-04-24 Thread via GitHub
ryan-johnson-databricks commented on code in PR #40885: URL: https://github.com/apache/spark/pull/40885#discussion_r1175271997 ## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormat.scala: ## @@ -203,6 +203,21 @@ trait FileFormat { * method. Technic

[GitHub] [spark] ryan-johnson-databricks commented on a diff in pull request #40885: [SPARK-43226] Define extractors for file-constant metadata

2023-04-24 Thread via GitHub
ryan-johnson-databricks commented on code in PR #40885: URL: https://github.com/apache/spark/pull/40885#discussion_r1175286323 ## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormat.scala: ## @@ -203,6 +203,21 @@ trait FileFormat { * method. Technic

[GitHub] [spark] ryan-johnson-databricks commented on a diff in pull request #40885: [SPARK-43226] Define extractors for file-constant metadata

2023-04-24 Thread via GitHub
ryan-johnson-databricks commented on code in PR #40885: URL: https://github.com/apache/spark/pull/40885#discussion_r1175275233 ## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormat.scala: ## @@ -203,6 +203,21 @@ trait FileFormat { * method. Technic

[GitHub] [spark] LuciferYang commented on pull request #40847: [SPARK-43185][BUILD] Inline `hadoop-client` related properties in `pom.xml`

2023-04-24 Thread via GitHub
LuciferYang commented on PR #40847: URL: https://github.com/apache/spark/pull/40847#issuecomment-1520218583 @xkrogen @sunchao @pan3793 Synchronize my experimental results 1. Before building, we need to add the following content to `resource-managers/yarn/pom.xml` refer to https://git

[GitHub] [spark] ryan-johnson-databricks commented on a diff in pull request #40885: [SPARK-43226] Define extractors for file-constant metadata

2023-04-24 Thread via GitHub
ryan-johnson-databricks commented on code in PR #40885: URL: https://github.com/apache/spark/pull/40885#discussion_r1175364533 ## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormat.scala: ## @@ -203,6 +203,21 @@ trait FileFormat { * method. Technic

[GitHub] [spark] majdyz opened a new pull request, #40929: Avoid allocation of unwritten ColumnVector in VectorizedReader

2023-04-24 Thread via GitHub
majdyz opened a new pull request, #40929: URL: https://github.com/apache/spark/pull/40929 ### What changes were proposed in this pull request? This PR adds lazy allocation support for the backing array of ColumnVector used in Spark VectorizedReader. This is added as a memory o

[GitHub] [spark] LuciferYang commented on pull request #40929: [SPARK-43264][CORE] Avoid allocation of unwritten ColumnVector in Spark Vectorized Reader

2023-04-24 Thread via GitHub
LuciferYang commented on PR #40929: URL: https://github.com/apache/spark/pull/40929#issuecomment-1520312702 @majdyz Can you enable GA first refer to https://user-images.githubusercontent.com/1475305/234031906-ad7fa49e-209b-4369-888a-e81a1299943d.png https://github.com/apache/spark/p

[GitHub] [spark] cloud-fan commented on a diff in pull request #40885: [SPARK-43226] Define extractors for file-constant metadata

2023-04-24 Thread via GitHub
cloud-fan commented on code in PR #40885: URL: https://github.com/apache/spark/pull/40885#discussion_r1175418393 ## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormat.scala: ## @@ -203,6 +203,21 @@ trait FileFormat { * method. Technically, a file f

[GitHub] [spark] LuciferYang commented on a diff in pull request #40920: [SPARK-43248][SQL] Unnecessary serialize/deserialize of Path on parallel gather partition stats

2023-04-24 Thread via GitHub
LuciferYang commented on code in PR #40920: URL: https://github.com/apache/spark/pull/40920#discussion_r1175420930 ## sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala: ## @@ -789,22 +789,22 @@ case class RepairTableCommand( if (partitionSpecsAndLocs.

[GitHub] [spark] LuciferYang commented on a diff in pull request #40920: [SPARK-43248][SQL] Unnecessary serialize/deserialize of Path on parallel gather partition stats

2023-04-24 Thread via GitHub
LuciferYang commented on code in PR #40920: URL: https://github.com/apache/spark/pull/40920#discussion_r1175422370 ## sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala: ## @@ -789,22 +789,22 @@ case class RepairTableCommand( if (partitionSpecsAndLocs.

[GitHub] [spark] LuciferYang commented on a diff in pull request #40920: [SPARK-43248][SQL] Unnecessary serialize/deserialize of Path on parallel gather partition stats

2023-04-24 Thread via GitHub
LuciferYang commented on code in PR #40920: URL: https://github.com/apache/spark/pull/40920#discussion_r1175425968 ## sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala: ## @@ -789,22 +789,22 @@ case class RepairTableCommand( if (partitionSpecsAndLocs.

[GitHub] [spark] ryan-johnson-databricks opened a new pull request, #40930: [DO NOT MERGE] File constant metadata extractors split

2023-04-24 Thread via GitHub
ryan-johnson-databricks opened a new pull request, #40930: URL: https://github.com/apache/spark/pull/40930 ### What changes were proposed in this pull request? Experimental PR in response to https://github.com/apache/spark/pull/40885#discussion_r1174277575, so that reviewers

[GitHub] [spark] LuciferYang commented on pull request #40847: [SPARK-43185][BUILD] Inline `hadoop-client` related properties in `pom.xml`

2023-04-24 Thread via GitHub
LuciferYang commented on PR #40847: URL: https://github.com/apache/spark/pull/40847#issuecomment-1520363712 More 1. The conclusion using hadoop 3.0.x and hadoop 3.1.x is the same 2. Using hadoop 3.2.x can't build the `hadoop-cloud` module either 3. Currently, only hadoop 3.3.x can build all

[GitHub] [spark] ryan-johnson-databricks commented on a diff in pull request #40885: [SPARK-43226] Define extractors for file-constant metadata

2023-04-24 Thread via GitHub
ryan-johnson-databricks commented on code in PR #40885: URL: https://github.com/apache/spark/pull/40885#discussion_r1175437133 ## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormat.scala: ## @@ -203,6 +203,21 @@ trait FileFormat { * method. Technic

[GitHub] [spark] majdyz closed pull request #40929: [SPARK-43264][SQL] Avoid allocation of unwritten ColumnVector in Spark Vectorized Reader

2023-04-24 Thread via GitHub
majdyz closed pull request #40929: [SPARK-43264][SQL] Avoid allocation of unwritten ColumnVector in Spark Vectorized Reader URL: https://github.com/apache/spark/pull/40929 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use t

[GitHub] [spark] majdyz commented on pull request #40929: [SPARK-43264][SQL] Avoid allocation of unwritten ColumnVector in Spark Vectorized Reader

2023-04-24 Thread via GitHub
majdyz commented on PR #40929: URL: https://github.com/apache/spark/pull/40929#issuecomment-1520371824 @LuciferYang Thanks, I think it's already been enabled now

[GitHub] [spark] pan3793 commented on a diff in pull request #40920: [SPARK-43248][SQL] Unnecessary serialize/deserialize of Path on parallel gather partition stats

2023-04-24 Thread via GitHub
pan3793 commented on code in PR #40920: URL: https://github.com/apache/spark/pull/40920#discussion_r1175442719 ## sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala: ## @@ -789,22 +789,22 @@ case class RepairTableCommand( if (partitionSpecsAndLocs.leng

[GitHub] [spark] pan3793 commented on a diff in pull request #40920: [SPARK-43248][SQL] Unnecessary serialize/deserialize of Path on parallel gather partition stats

2023-04-24 Thread via GitHub
pan3793 commented on code in PR #40920: URL: https://github.com/apache/spark/pull/40920#discussion_r1175443729 ## sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala: ## @@ -789,22 +789,22 @@ case class RepairTableCommand( if (partitionSpecsAndLocs.leng

[GitHub] [spark] pan3793 commented on a diff in pull request #40920: [SPARK-43248][SQL] Unnecessary serialize/deserialize of Path on parallel gather partition stats

2023-04-24 Thread via GitHub
pan3793 commented on code in PR #40920: URL: https://github.com/apache/spark/pull/40920#discussion_r1175444671 ## sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala: ## @@ -789,22 +789,22 @@ case class RepairTableCommand( if (partitionSpecsAndLocs.leng

[GitHub] [spark] ryan-johnson-databricks commented on a diff in pull request #40885: [SPARK-43226] Define extractors for file-constant metadata

2023-04-24 Thread via GitHub
ryan-johnson-databricks commented on code in PR #40885: URL: https://github.com/apache/spark/pull/40885#discussion_r1175449282 ## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormat.scala: ## @@ -203,6 +203,21 @@ trait FileFormat { * method. Technic

[GitHub] [spark] ryan-johnson-databricks commented on a diff in pull request #40885: [SPARK-43226] Define extractors for file-constant metadata

2023-04-24 Thread via GitHub
ryan-johnson-databricks commented on code in PR #40885: URL: https://github.com/apache/spark/pull/40885#discussion_r1175452917 ## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormat.scala: ## @@ -203,6 +203,21 @@ trait FileFormat { * method. Technic

[GitHub] [spark] amaliujia opened a new pull request, #40931: [SPARK-43265] Move Error framework to a common utils module

2023-04-24 Thread via GitHub
amaliujia opened a new pull request, #40931: URL: https://github.com/apache/spark/pull/40931 ### What changes were proposed in this pull request? Move Error framework to a common utils module so that we can share it between Spark and Spark Connect without introducing heavy dep

[GitHub] [spark] amaliujia commented on pull request #40931: [SPARK-43265] Move Error framework to a common utils module

2023-04-24 Thread via GitHub
amaliujia commented on PR #40931: URL: https://github.com/apache/spark/pull/40931#issuecomment-1520403075 @cloud-fan @hvanhovell

[GitHub] [spark] cloud-fan commented on pull request #40879: [SPARK-43217] Correctly recurse in nested maps/arrays in findNestedField

2023-04-24 Thread via GitHub
cloud-fan commented on PR #40879: URL: https://github.com/apache/spark/pull/40879#issuecomment-1520408175 thanks, merging to master!

[GitHub] [spark] cloud-fan closed pull request #40879: [SPARK-43217] Correctly recurse in nested maps/arrays in findNestedField

2023-04-24 Thread via GitHub
cloud-fan closed pull request #40879: [SPARK-43217] Correctly recurse in nested maps/arrays in findNestedField URL: https://github.com/apache/spark/pull/40879

[GitHub] [spark] peter-toth opened a new pull request, #40932: [SPARK-43266][SQL] Move MergeScalarSubqueries to spark-sql

2023-04-24 Thread via GitHub
peter-toth opened a new pull request, #40932: URL: https://github.com/apache/spark/pull/40932 ### What changes were proposed in this pull request? This PR moves `MergeScalarSubqueries` from `catalyst` to `spark-sql` ### Why are the changes needed? Make SPARK-40193 / https://gith

[GitHub] [spark] peter-toth commented on pull request #37630: [SPARK-40193][SQL] Merge subquery plans with different filters

2023-04-24 Thread via GitHub
peter-toth commented on PR #37630: URL: https://github.com/apache/spark/pull/37630#issuecomment-1520435320 I extracted the first commit of this PR, which just moves `MergeScalarSubqueries` from `spark-catalyst` to `spark-sql`, to https://github.com/apache/spark/pull/40932 to make the actual c

[GitHub] [spark] hvanhovell commented on a diff in pull request #40931: [SPARK-43265] Move Error framework to a common utils module

2023-04-24 Thread via GitHub
hvanhovell commented on code in PR #40931: URL: https://github.com/apache/spark/pull/40931#discussion_r1175496791 ## common/utils/src/main/scala/org/apache/spark/SparkThrowableHelper.scala: ## @@ -34,7 +33,7 @@ private[spark] object ErrorMessageFormat extends Enumeration { */

[GitHub] [spark] hvanhovell commented on a diff in pull request #40931: [SPARK-43265] Move Error framework to a common utils module

2023-04-24 Thread via GitHub
hvanhovell commented on code in PR #40931: URL: https://github.com/apache/spark/pull/40931#discussion_r1175497831 ## common/utils/src/main/scala/org/apache/spark/ErrorClassesJSONReader.scala: ## @@ -30,6 +29,7 @@ import org.apache.commons.text.StringSubstitutor import org.apa

[GitHub] [spark] pan3793 commented on pull request #40920: [SPARK-43248][SQL] Unnecessary serialize/deserialize of Path on parallel gather partition stats

2023-04-24 Thread via GitHub
pan3793 commented on PR #40920: URL: https://github.com/apache/spark/pull/40920#issuecomment-1520446235 cc @sunchao

[GitHub] [spark] amaliujia commented on a diff in pull request #40931: [SPARK-43265] Move Error framework to a common utils module

2023-04-24 Thread via GitHub
amaliujia commented on code in PR #40931: URL: https://github.com/apache/spark/pull/40931#discussion_r1175520012 ## common/utils/src/main/scala/org/apache/spark/SparkThrowableHelper.scala: ## @@ -34,7 +33,7 @@ private[spark] object ErrorMessageFormat extends Enumeration { */

[GitHub] [spark] RyanBerti commented on a diff in pull request #40615: [SPARK-16484][SQL] Add support for Datasketches HllSketch

2023-04-24 Thread via GitHub
RyanBerti commented on code in PR #40615: URL: https://github.com/apache/spark/pull/40615#discussion_r1175541542 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/datasketchesAggregates.scala: ## @@ -0,0 +1,336 @@ +/* + * Licensed to the Apache S

[GitHub] [spark] aokolnychyi commented on pull request #40919: [SPARK-43204][SQL] Align MERGE assignments with table attributes

2023-04-24 Thread via GitHub
aokolnychyi commented on PR #40919: URL: https://github.com/apache/spark/pull/40919#issuecomment-1520498405 @cloud-fan @sunchao @viirya @huaxingao @dongjoon-hyun @gengliangwang, this is a follow-up to PR #40308.

[GitHub] [spark] RyanBerti commented on a diff in pull request #40615: [SPARK-16484][SQL] Add support for Datasketches HllSketch

2023-04-24 Thread via GitHub
RyanBerti commented on code in PR #40615: URL: https://github.com/apache/spark/pull/40615#discussion_r1175544253 ## sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/aggregate/DatasketchesHllSketchSuite.scala: ## @@ -0,0 +1,111 @@ +/* + * Licensed to the Apac

[GitHub] [spark] RyanBerti commented on pull request #40615: [SPARK-16484][SQL] Add support for Datasketches HllSketch

2023-04-24 Thread via GitHub
RyanBerti commented on PR #40615: URL: https://github.com/apache/spark/pull/40615#issuecomment-1520502797 >about adding a third boolean argument, with the default value being false

[GitHub] [spark] dzhigimont commented on a diff in pull request #40370: [SPARK-42620][PS] Add `inclusive` parameter for (DataFrame|Series).between_time

2023-04-24 Thread via GitHub
dzhigimont commented on code in PR #40370: URL: https://github.com/apache/spark/pull/40370#discussion_r1175551612 ## python/pyspark/pandas/frame.py: ## @@ -3519,16 +3516,8 @@ def between_time( Initial time as a time filter limit. end_time : datetime.time or

[GitHub] [spark] dzhigimont commented on a diff in pull request #40370: [SPARK-42620][PS] Add `inclusive` parameter for (DataFrame|Series).between_time

2023-04-24 Thread via GitHub
dzhigimont commented on code in PR #40370: URL: https://github.com/apache/spark/pull/40370#discussion_r1175552369 ## python/pyspark/pandas/frame.py: ## @@ -3582,14 +3571,18 @@ def between_time( if not isinstance(self.index, ps.DatetimeIndex): raise TypeErro

[GitHub] [spark] DerekTBrown commented on pull request #40798: SPARK-43166: name docker users

2023-04-24 Thread via GitHub
DerekTBrown commented on PR #40798: URL: https://github.com/apache/spark/pull/40798#issuecomment-1520512047 Looks good. Closing in favor of #40831

[GitHub] [spark] DerekTBrown closed pull request #40798: SPARK-43166: name docker users

2023-04-24 Thread via GitHub
DerekTBrown closed pull request #40798: SPARK-43166: name docker users URL: https://github.com/apache/spark/pull/40798

[GitHub] [spark] dzhigimont commented on a diff in pull request #40665: [SPARK-42621][PS] Add inclusive parameter for pd.date_range

2023-04-24 Thread via GitHub
dzhigimont commented on code in PR #40665: URL: https://github.com/apache/spark/pull/40665#discussion_r1175559912 ## python/pyspark/pandas/namespace.py: ## @@ -1782,12 +1780,8 @@ def date_range( Normalize start/end dates to midnight before generating date range. na

[GitHub] [spark] sunchao commented on pull request #39950: [SPARK-42388][SQL] Avoid parquet footer reads twice in vectorized reader

2023-04-24 Thread via GitHub
sunchao commented on PR #39950: URL: https://github.com/apache/spark/pull/39950#issuecomment-1520563772 Yea @yabola is correct, if we have 100 row groups in a file and there are 100 tasks to read them, each task will only be assigned a range (e.g., a single row group) in the file to read, s

[GitHub] [spark] sunchao commented on pull request #40893: [SPARK-43225][BUILD][SQL] Remove jackson-core-asl and jackson-mapper-asl from pre-built distribution

2023-04-24 Thread via GitHub
sunchao commented on PR #40893: URL: https://github.com/apache/spark/pull/40893#issuecomment-1520573329 @pan3793 AFAIK the development efforts in Hive community are only in Hive 3.x/4.x at the moment, and the 2.x branch is barely maintained. I can try to start a conversation in the Hive com

[GitHub] [spark] amaliujia commented on pull request #40899: [SPARK-43249][CONNECT] Fix missing stats for SQL Command

2023-04-24 Thread via GitHub
amaliujia commented on PR #40899: URL: https://github.com/apache/spark/pull/40899#issuecomment-1520596794 Thanks for adding the JIRA!
