[GitHub] [spark] LuciferYang commented on a diff in pull request #40352: [SPARK-42664][CONNECT] Support `bloomFilter` function for `DataFrameStatFunctions`

2023-03-22 Thread via GitHub
LuciferYang commented on code in PR #40352: URL: https://github.com/apache/spark/pull/40352#discussion_r1145683804 ## connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala: ## @@ -584,6 +585,97 @@ final class DataFrameStatFunctions

[GitHub] [spark] yaooqinn opened a new pull request, #40531: [SPARK-42904][SQL] Char/Varchar Support for JDBC Catalog

2023-03-22 Thread via GitHub
yaooqinn opened a new pull request, #40531: URL: https://github.com/apache/spark/pull/40531 ### What changes were proposed in this pull request? Add type mapping for spark char/varchar to jdbc types. ### Why are the changes needed? The STANDARD JDBC 1.0 and
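The kind of type mapping this PR describes can be illustrated with a small standalone sketch. This is plain Python, not Spark's actual JDBC dialect code; the function name and the exact target type strings are hypothetical and real dialects differ per database:

```python
from typing import Optional

# Hypothetical sketch of mapping Spark char/varchar type names to JDBC DDL
# type strings. Real JDBC dialects in Spark resolve this per database.
def spark_type_to_jdbc(type_name: str, length: Optional[int] = None) -> str:
    """Map a Spark SQL string-like type name to a JDBC DDL type string."""
    if type_name == "char":
        return f"CHAR({length})"
    if type_name == "varchar":
        return f"VARCHAR({length})"
    if type_name == "string":
        # No length bound: fall back to an unbounded text type.
        return "TEXT"
    raise ValueError(f"unsupported type: {type_name}")
```

Without such a mapping, char/varchar columns would previously be written using the catch-all string mapping, losing the length constraint.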

[GitHub] [spark] LuciferYang commented on a diff in pull request #40438: [SPARK-42806][CONNECT] Add `Catalog` support

2023-03-22 Thread via GitHub
LuciferYang commented on code in PR #40438: URL: https://github.com/apache/spark/pull/40438#discussion_r1145682342 ## connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/connect/client/CheckConnectJvmClientCompatibility.scala: ## @@ -129,6 +130,9 @@ object

[GitHub] [spark] grundprinzip commented on a diff in pull request #39947: [SPARK-40453][SPARK-41715][CONNECT] Take super class into account when throwing an exception

2023-03-22 Thread via GitHub
grundprinzip commented on code in PR #39947: URL: https://github.com/apache/spark/pull/39947#discussion_r1145679097 ## connector/connect/server/src/main/scala/org/apache/spark/sql/connect/service/SparkConnectService.scala: ## @@ -53,19 +59,37 @@ class SparkConnectService(debug:

[GitHub] [spark] shrprasa commented on pull request #40258: [SPARK-42655][SQL] Incorrect ambiguous column reference error

2023-03-22 Thread via GitHub
shrprasa commented on PR #40258: URL: https://github.com/apache/spark/pull/40258#issuecomment-1480595916 > @shrprasa do you know how the case 1 works? yes. It works because the resolved column has just one matching attribute: Vector(id#17), but for the second case, the match

[GitHub] [spark] LuciferYang commented on pull request #40518: [SPARK-42901][CONNECT][PYTHON] Move `StorageLevel` into a separate file to avoid potential `file recursively imports`

2023-03-22 Thread via GitHub
LuciferYang commented on PR #40518: URL: https://github.com/apache/spark/pull/40518#issuecomment-1480591952 Thanks @HyukjinKwon @dongjoon-hyun @ueshin -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

[GitHub] [spark] HyukjinKwon opened a new pull request, #40530: [SPARK-42903][PYTHON][DOCS] Avoid documenting None as a return value in docstring

2023-03-22 Thread via GitHub
HyukjinKwon opened a new pull request, #40530: URL: https://github.com/apache/spark/pull/40530 ### What changes were proposed in this pull request? This PR proposes to remove None as a return value in docstring. ### Why are the changes needed? To be consistent with

[GitHub] [spark] cloud-fan commented on pull request #40258: [SPARK-42655][SQL] Incorrect ambiguous column reference error

2023-03-22 Thread via GitHub
cloud-fan commented on PR #40258: URL: https://github.com/apache/spark/pull/40258#issuecomment-1480589254 @shrprasa do you know how the case 1 works?

[GitHub] [spark] yaooqinn commented on pull request #40258: [SPARK-42655][SQL] Incorrect ambiguous column reference error

2023-03-22 Thread via GitHub
yaooqinn commented on PR #40258: URL: https://github.com/apache/spark/pull/40258#issuecomment-1480574647 @shrprasa At the dataset definition phase, especially for intermediate datasets, Spark is lenient/lazy with case sensitivity. This is because the checks happen during SQL analysis,

[GitHub] [spark] shrprasa commented on pull request #40258: [SPARK-42655][SQL] Incorrect ambiguous column reference error

2023-03-22 Thread via GitHub
shrprasa commented on PR #40258: URL: https://github.com/apache/spark/pull/40258#issuecomment-1480574606 > df3.select("id").show() @cloud-fan The example you have shared will behave the same even after this fix. It will give an ambiguous-column error. The use case which the fix is trying

[GitHub] [spark] zhengruifeng commented on pull request #40355: [SPARK-42604][CONNECT] Implement functions.typedlit

2023-03-22 Thread via GitHub
zhengruifeng commented on PR #40355: URL: https://github.com/apache/spark/pull/40355#issuecomment-1480570911 ping @hvanhovell @zhenlineo

[GitHub] [spark] HyukjinKwon closed pull request #40518: [SPARK-42901][CONNECT][PYTHON] Move `StorageLevel` into a separate file to avoid potential `file recursively imports`

2023-03-22 Thread via GitHub
HyukjinKwon closed pull request #40518: [SPARK-42901][CONNECT][PYTHON] Move `StorageLevel` into a separate file to avoid potential `file recursively imports` URL: https://github.com/apache/spark/pull/40518

[GitHub] [spark] HyukjinKwon commented on pull request #40518: [SPARK-42901][CONNECT][PYTHON] Move `StorageLevel` into a separate file to avoid potential `file recursively imports`

2023-03-22 Thread via GitHub
HyukjinKwon commented on PR #40518: URL: https://github.com/apache/spark/pull/40518#issuecomment-1480558008 Merged to master and branch-3.4.

[GitHub] [spark] LuciferYang commented on pull request #40518: [SPARK-42901][CONNECT][PYTHON] Move `StorageLevel` into a separate file to avoid potential `file recursively imports`

2023-03-22 Thread via GitHub
LuciferYang commented on PR #40518: URL: https://github.com/apache/spark/pull/40518#issuecomment-1480556858 GA passed ~

[GitHub] [spark] HyukjinKwon commented on pull request #40487: [SPARK-42891][CONNECT][PYTHON] Implement CoGrouped Map API

2023-03-22 Thread via GitHub
HyukjinKwon commented on PR #40487: URL: https://github.com/apache/spark/pull/40487#issuecomment-1480556026 It has a conflict w/ branch-3.4. mind creating a backport PR please?

[GitHub] [spark] HyukjinKwon closed pull request #40487: [SPARK-42891][CONNECT][PYTHON] Implement CoGrouped Map API

2023-03-22 Thread via GitHub
HyukjinKwon closed pull request #40487: [SPARK-42891][CONNECT][PYTHON] Implement CoGrouped Map API URL: https://github.com/apache/spark/pull/40487

[GitHub] [spark] HyukjinKwon commented on pull request #40487: [SPARK-42891][CONNECT][PYTHON] Implement CoGrouped Map API

2023-03-22 Thread via GitHub
HyukjinKwon commented on PR #40487: URL: https://github.com/apache/spark/pull/40487#issuecomment-1480555708 Merged to master and branch-3.4.

[GitHub] [spark] HyukjinKwon commented on a diff in pull request #39947: [SPARK-40453][SPARK-41715][CONNECT] Take super class into account when throwing an exception

2023-03-22 Thread via GitHub
HyukjinKwon commented on code in PR #39947: URL: https://github.com/apache/spark/pull/39947#discussion_r1145632881 ## connector/connect/server/src/main/scala/org/apache/spark/sql/connect/service/SparkConnectService.scala: ## @@ -53,19 +59,37 @@ class SparkConnectService(debug:

[GitHub] [spark] HyukjinKwon commented on a diff in pull request #40520: [SPARK-42896][SQL][PYSPARK] Make `mapInPandas` / `mapInArrow` support barrier mode execution

2023-03-22 Thread via GitHub
HyukjinKwon commented on code in PR #40520: URL: https://github.com/apache/spark/pull/40520#discussion_r1145632491 ## python/pyspark/sql/pandas/map_ops.py: ## @@ -60,6 +62,7 @@ def mapInPandas( schema : :class:`pyspark.sql.types.DataType` or str the return

[GitHub] [spark] HyukjinKwon commented on a diff in pull request #40520: [SPARK-42896][SQL][PYSPARK] Make `mapInPandas` / `mapInArrow` support barrier mode execution

2023-03-22 Thread via GitHub
HyukjinKwon commented on code in PR #40520: URL: https://github.com/apache/spark/pull/40520#discussion_r1145632294 ## python/pyspark/sql/pandas/map_ops.py: ## @@ -32,7 +32,9 @@ class PandasMapOpsMixin: """ def mapInPandas( -self, func:

[GitHub] [spark] yliou opened a new pull request, #40529: [SPARK-42890] [UI] add repeat identifier on SQL UI

2023-03-22 Thread via GitHub
yliou opened a new pull request, #40529: URL: https://github.com/apache/spark/pull/40529 ### What changes were proposed in this pull request? On the SQL page in the Web UI, this PR aims to add a repeat identifier to distinguish which InMemoryTableScan is being used at a certain

[GitHub] [spark] cloud-fan commented on pull request #40258: [SPARK-42655][SQL] Incorrect ambiguous column reference error

2023-03-22 Thread via GitHub
cloud-fan commented on PR #40258: URL: https://github.com/apache/spark/pull/40258#issuecomment-1480540953 I think column resolution should only look at one level, to make the behavior simple and predictable. I tried it on pgsql and it fails as well: ``` create table t(i int);
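The single-level resolution behavior discussed above can be illustrated with a standalone sketch. This is plain Python with hypothetical names, not Spark's actual resolver; it shows how a case-insensitive lookup over one level of output attributes reports ambiguity when two attributes differ only by case:

```python
# Hypothetical sketch: resolve a column name against one level of output
# attributes. Case-insensitive matching can turn distinct attributes
# (e.g. "id" and "ID") into an ambiguous reference.
def resolve(name, attributes, case_sensitive=False):
    """Return the single attribute matching `name`, or raise if ambiguous."""
    key = name if case_sensitive else name.lower()
    matches = [a for a in attributes
               if (a if case_sensitive else a.lower()) == key]
    if not matches:
        raise KeyError(f"column not found: {name}")
    if len(matches) > 1:
        raise ValueError(f"ambiguous column reference: {name}")
    return matches[0]
```

For example, `resolve("id", ["id", "ID"])` raises an ambiguity error, while the same lookup with `case_sensitive=True` resolves cleanly, which mirrors the case being debated in this thread.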

[GitHub] [spark] chong0929 commented on pull request #40521: [MINOR][DOCS][PYTHON] Update some urls about deprecated repository pyspark.pandas

2023-03-22 Thread via GitHub
chong0929 commented on PR #40521: URL: https://github.com/apache/spark/pull/40521#issuecomment-1480537486 Thanks for your review. I think some examples cannot be linked to the correct places, which can be confusing, and the original pointers provide some clear references.

[GitHub] [spark] cloud-fan commented on pull request #40526: [SPARK-42899][SQL] Fix DataFrame.to(schema) to handle the case where there is a non-nullable nested field in a nullable field

2023-03-22 Thread via GitHub
cloud-fan commented on PR #40526: URL: https://github.com/apache/spark/pull/40526#issuecomment-1480536077 late LGTM

[GitHub] [spark] cloud-fan commented on pull request #40520: [SPARK-42896][SQL][PYSPARK] Make `mapInPandas` / `mapInArrow` support barrier mode execution

2023-03-22 Thread via GitHub
cloud-fan commented on PR #40520: URL: https://github.com/apache/spark/pull/40520#issuecomment-1480533198 From a SQL engine's point of view, running all tasks at once or batch by batch doesn't matter. It doesn't change the semantics of the SQL operator, and the optimizer doesn't care about

[GitHub] [spark] shrprasa commented on pull request #40258: [SPARK-42655][SQL] Incorrect ambiguous column reference error

2023-03-22 Thread via GitHub
shrprasa commented on PR #40258: URL: https://github.com/apache/spark/pull/40258#issuecomment-1480532873 > I second @srowen ‘s view. cc @cloud-fan Thanks @yaooqinn for replying. Can you please explain why you think it's not the right fix? The fix only proposes to remove

[GitHub] [spark] cxzl25 commented on pull request #40439: [SPARK-42807][CORE] Apply custom log URL pattern for yarn-client AM log URL in SHS

2023-03-22 Thread via GitHub
cxzl25 commented on PR #40439: URL: https://github.com/apache/spark/pull/40439#issuecomment-1480532626 @HeartSaVioR Please help review this PR, Thanks.

[GitHub] [spark] beliefer commented on a diff in pull request #40528: [WIP][SPARK-42584][CONNECT] Improve output of Column.explain

2023-03-22 Thread via GitHub
beliefer commented on code in PR #40528: URL: https://github.com/apache/spark/pull/40528#discussion_r1145613769 ## connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/Column.scala: ## @@ -1213,11 +1213,8 @@ class Column private[sql] (private[sql] val expr:

[GitHub] [spark] zhengruifeng commented on pull request #40521: [MINOR][DOCS][PYTHON] Update some urls about deprecated repository pyspark.pandas

2023-03-22 Thread via GitHub
zhengruifeng commented on PR #40521: URL: https://github.com/apache/spark/pull/40521#issuecomment-1480513582 I think it is fine if we don't have an available ticket link. It seems that those links point to issues from before moving Koalas to Apache Spark.

[GitHub] [spark] WeichenXu123 commented on pull request #40520: [SPARK-42896][SQL][PYSPARK] Make `mapInPandas` / `mapInArrow` support barrier mode execution

2023-03-22 Thread via GitHub
WeichenXu123 commented on PR #40520: URL: https://github.com/apache/spark/pull/40520#issuecomment-1480510947 > I am saying that real power of Catalyst optimizer is to optimize/reorder these logical plans, and I believe that's the reason why barrier execution wasn't introduced in SQL. The

[GitHub] [spark] beliefer commented on pull request #40528: [WIP][SPARK-42584][CONNECT] Improve output of Column.explain

2023-03-22 Thread via GitHub
beliefer commented on PR #40528: URL: https://github.com/apache/spark/pull/40528#issuecomment-1480509383 This PR has not been fully implemented yet. @hvanhovell Could you take a look? Does this one satisfy your expectations?

[GitHub] [spark] beliefer opened a new pull request, #40528: [WIP][SPARK-42584][CONNECT] Improve output of Column.explain

2023-03-22 Thread via GitHub
beliefer opened a new pull request, #40528: URL: https://github.com/apache/spark/pull/40528 ### What changes were proposed in this pull request? Currently, Connect displays the structure of the proto in both the regular and extended versions of explain. We should display a more compact

[GitHub] [spark] shrprasa commented on pull request #40128: [SPARK-42466][K8S]: Cleanup k8s upload directory when job terminates

2023-03-22 Thread via GitHub
shrprasa commented on PR #40128: URL: https://github.com/apache/spark/pull/40128#issuecomment-1480495550 @dongjoon-hyun Thanks for the clarification. But the unreliability of the shutdown hook is common to all other shutdown tasks as well. That doesn't mean we haven't implemented them. So, why

[GitHub] [spark] zhengruifeng commented on pull request #40520: [SPARK-42896][SQL][PYSPARK] Make `mapInPandas` / `mapInArrow` support barrier mode execution

2023-03-22 Thread via GitHub
zhengruifeng commented on PR #40520: URL: https://github.com/apache/spark/pull/40520#issuecomment-1480490580 > Barrier mode is only used in specific ML case, i.e. in model training routine, we will only use it in one pattern: > > dataset.mapInPandas(..., is_barrier=True).collect()

[GitHub] [spark] HyukjinKwon closed pull request #40526: [SPARK-42899][SQL] Fix DataFrame.to(schema) to handle the case where there is a non-nullable nested field in a nullable field

2023-03-22 Thread via GitHub
HyukjinKwon closed pull request #40526: [SPARK-42899][SQL] Fix DataFrame.to(schema) to handle the case where there is a non-nullable nested field in a nullable field URL: https://github.com/apache/spark/pull/40526

[GitHub] [spark] HyukjinKwon commented on pull request #40526: [SPARK-42899][SQL] Fix DataFrame.to(schema) to handle the case where there is a non-nullable nested field in a nullable field

2023-03-22 Thread via GitHub
HyukjinKwon commented on PR #40526: URL: https://github.com/apache/spark/pull/40526#issuecomment-1480485696 Merged to master and branch-3.4.

[GitHub] [spark] zhengruifeng commented on pull request #40527: [SPARK-42900][CONNECT][PYTHON] Fix createDataFrame to respect inference and column names

2023-03-22 Thread via GitHub
zhengruifeng commented on PR #40527: URL: https://github.com/apache/spark/pull/40527#issuecomment-1480472824 also cc @WeichenXu123

[GitHub] [spark] zhengruifeng commented on pull request #40519: [SPARK-42864][ML] Make `IsotonicRegression.PointsAccumulator` private

2023-03-22 Thread via GitHub
zhengruifeng commented on PR #40519: URL: https://github.com/apache/spark/pull/40519#issuecomment-1480471132 thanks for the reviews

[GitHub] [spark] yaooqinn commented on pull request #40258: [SPARK-42655][SQL] Incorrect ambiguous column reference error

2023-03-22 Thread via GitHub
yaooqinn commented on PR #40258: URL: https://github.com/apache/spark/pull/40258#issuecomment-1480469070 I second @srowen 's view. cc @cloud-fan

[GitHub] [spark] ulysses-you commented on pull request #40522: [SPARK-42101][SQL][FOLLOWUP] Make QueryStageExec.resultOption and isMaterialized consistent

2023-03-22 Thread via GitHub
ulysses-you commented on PR #40522: URL: https://github.com/apache/spark/pull/40522#issuecomment-1480466232 lgtm

[GitHub] [spark] zhenlineo commented on a diff in pull request #39947: [SPARK-40453][SPARK-41715][CONNECT] Take super class into account when throwing an exception

2023-03-22 Thread via GitHub
zhenlineo commented on code in PR #39947: URL: https://github.com/apache/spark/pull/39947#discussion_r1145575350 ## connector/connect/server/src/main/scala/org/apache/spark/sql/connect/service/SparkConnectService.scala: ## @@ -53,19 +59,37 @@ class SparkConnectService(debug:

[GitHub] [spark] HyukjinKwon commented on pull request #40520: [SPARK-42896][SQL][PYSPARK] Make `mapInPandas` / `mapInArrow` support barrier mode execution

2023-03-22 Thread via GitHub
HyukjinKwon commented on PR #40520: URL: https://github.com/apache/spark/pull/40520#issuecomment-1480444902 I am saying that real power of Catalyst optimizer is to optimize/reorder these logical plans, and I believe that's the reason why barrier execution wasn't introduced in SQL. But

[GitHub] [spark] HyukjinKwon commented on pull request #40520: [SPARK-42896][SQL][PYSPARK] Make `mapInPandas` / `mapInArrow` support barrier mode execution

2023-03-22 Thread via GitHub
HyukjinKwon commented on PR #40520: URL: https://github.com/apache/spark/pull/40520#issuecomment-1480443966 Predicate pushdown is just an example. e.g., you might want to combine adjacent `MapInPandas`s but it would need a special handling if `is_barrier` flag is added.

[GitHub] [spark] amaliujia commented on a diff in pull request #40498: [SPARK-42878][CONNECT] The table API in DataFrameReader could also accept options

2023-03-22 Thread via GitHub
amaliujia commented on code in PR #40498: URL: https://github.com/apache/spark/pull/40498#discussion_r1145563490 ## connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/DataFrameReader.scala: ## @@ -458,7 +458,9 @@ class DataFrameReader private[sql] (sparkSession:

[GitHub] [spark] gerashegalov commented on a diff in pull request #40515: [SPARK-42884][CONNECT] Add Ammonite REPL integration

2023-03-22 Thread via GitHub
gerashegalov commented on code in PR #40515: URL: https://github.com/apache/spark/pull/40515#discussion_r1145546706 ## connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/connect/client/SparkConnectClientBuilderParseTestSuite.scala: ## @@ -0,0 +1,131 @@ +/* + *

[GitHub] [spark] LuciferYang commented on pull request #40518: [SPARK-42901][CONNECT][PYTHON] Move `StorageLevel` into a separate file to avoid potential `file recursively imports`

2023-03-22 Thread via GitHub
LuciferYang commented on PR #40518: URL: https://github.com/apache/spark/pull/40518#issuecomment-1480438169 > @LuciferYang . This looks worthy of having a new JIRA. Please create a new JIRA for this PR and use it. This PR is a good contribution of yours.  @dongjoon-hyun Thanks for

[GitHub] [spark] LuciferYang commented on pull request #40518: [SPARK-42901][CONNECT][PYTHON] Move `StorageLevel` into a separate file to avoid potential `file recursively imports`

2023-03-22 Thread via GitHub
LuciferYang commented on PR #40518: URL: https://github.com/apache/spark/pull/40518#issuecomment-1480437248 > @LuciferYang nit: you need to update the PR description. There is an old file name `storage_level.proto`. Thanks ~ fixed

[GitHub] [spark] WeichenXu123 commented on pull request #40520: [SPARK-42896][SQL][PYSPARK] Make `mapInPandas` / `mapInArrow` support barrier mode execution

2023-03-22 Thread via GitHub
WeichenXu123 commented on PR #40520: URL: https://github.com/apache/spark/pull/40520#issuecomment-1480433501 > hmmm why do we need to care about the optimizer? The optimizer is not sensitive to the physical execution engine, e.g. Presto, Spark, Flink have many similar SQL optimizations.

[GitHub] [spark] cloud-fan commented on pull request #40520: [SPARK-42896][SQL][PYSPARK] Make `mapInPandas` / `mapInArrow` support barrier mode execution

2023-03-22 Thread via GitHub
cloud-fan commented on PR #40520: URL: https://github.com/apache/spark/pull/40520#issuecomment-1480433026 hmmm why do we need to care about the optimizer? The optimizer is not sensitive to the physical execution engine, e.g. Presto, Spark, Flink have many similar SQL optimizations.

[GitHub] [spark] WeichenXu123 commented on pull request #40520: [SPARK-42896][SQL][PYSPARK] Make `mapInPandas` / `mapInArrow` support barrier mode execution

2023-03-22 Thread via GitHub
WeichenXu123 commented on PR #40520: URL: https://github.com/apache/spark/pull/40520#issuecomment-1480428307 To address @HyukjinKwon 's concern about the optimizer, can we add an `is_barrier` attribute to `UnaryExecNode`, and if the optimizer finds a node marking `is_barrier` as True, then
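The proposal above (optimizer rules skipping barrier-flagged nodes) can be sketched abstractly. This is plain Python over a toy plan tree, not Spark's Catalyst; the node and rule names are hypothetical, and "fusing adjacent map nodes" stands in for the kind of rewrite HyukjinKwon mentioned:

```python
# Hypothetical sketch: a bottom-up rule that fuses a map node with its map
# child, but never fuses across a node flagged as barrier, preserving the
# barrier-stage boundary.
class MapNode:
    def __init__(self, name, child=None, is_barrier=False):
        self.name = name
        self.child = child
        self.is_barrier = is_barrier

def fuse_adjacent_maps(node):
    """Recursively fuse MapNode chains, skipping barrier nodes."""
    if node is None:
        return None
    node.child = fuse_adjacent_maps(node.child)
    child = node.child
    if (isinstance(child, MapNode)
            and not node.is_barrier and not child.is_barrier):
        # Safe to combine: neither side requires barrier semantics.
        return MapNode(f"{child.name}+{node.name}", child.child)
    return node
```

Under this scheme the rule stays oblivious to *how* barrier tasks are scheduled; it only has to treat the flag as an optimization fence.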

[GitHub] [spark] WeichenXu123 commented on a diff in pull request #40520: [SPARK-42896][SQL][PYSPARK] Make `mapInPandas` / `mapInArrow` support barrier mode execution

2023-03-22 Thread via GitHub
WeichenXu123 commented on code in PR #40520: URL: https://github.com/apache/spark/pull/40520#discussion_r1145545786 ## sql/core/src/main/scala/org/apache/spark/sql/execution/python/MapInPandasExec.scala: ## @@ -28,7 +28,8 @@ import org.apache.spark.sql.execution.SparkPlan case

[GitHub] [spark] github-actions[bot] commented on pull request #38781: [SPARK-41246][core] Solve the problem of RddId negative

2023-03-22 Thread via GitHub
github-actions[bot] commented on PR #38781: URL: https://github.com/apache/spark/pull/38781#issuecomment-1480416870 We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.

[GitHub] [spark] github-actions[bot] commented on pull request #39023: [SPARK-41459][SQL][3.3] fix thrift server operation log output is empty

2023-03-22 Thread via GitHub
github-actions[bot] commented on PR #39023: URL: https://github.com/apache/spark/pull/39023#issuecomment-1480416829 We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.

[GitHub] [spark] github-actions[bot] commented on pull request #38965: [SPARK-41386][SQL] Improve small partition factor for rebalance

2023-03-22 Thread via GitHub
github-actions[bot] commented on PR #38965: URL: https://github.com/apache/spark/pull/38965#issuecomment-1480416849 We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.

[GitHub] [spark] github-actions[bot] commented on pull request #38756: [SPARK-41220][SQL] Range partitioner sample supports column pruning

2023-03-22 Thread via GitHub
github-actions[bot] commented on PR #38756: URL: https://github.com/apache/spark/pull/38756#issuecomment-1480416904 We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.

[GitHub] [spark] HyukjinKwon commented on pull request #40521: [MINOR][DOCS][PYTHON] Update some urls about deprecated repository pyspark.pandas

2023-03-22 Thread via GitHub
HyukjinKwon commented on PR #40521: URL: https://github.com/apache/spark/pull/40521#issuecomment-1480411932 Those are actually not real JIRAs or TODOs. These are the pointers to the original fix or ticket (that contains examples or code change). So I guess it's fine as is.

[GitHub] [spark] hvanhovell closed pull request #40368: [SPARK-42748][CONNECT] Server-side Artifact Management

2023-03-22 Thread via GitHub
hvanhovell closed pull request #40368: [SPARK-42748][CONNECT] Server-side Artifact Management URL: https://github.com/apache/spark/pull/40368

[GitHub] [spark] hvanhovell commented on pull request #40368: [SPARK-42748][CONNECT] Server-side Artifact Management

2023-03-22 Thread via GitHub
hvanhovell commented on PR #40368: URL: https://github.com/apache/spark/pull/40368#issuecomment-1480408251 Merging this one.

[GitHub] [spark] ueshin commented on pull request #40518: [SPARK-42889][CONNECT][PYTHON][FOLLOWUP] Move `StorageLevel` into a separate file to avoid potential file recursively imports

2023-03-22 Thread via GitHub
ueshin commented on PR #40518: URL: https://github.com/apache/spark/pull/40518#issuecomment-1480408180 @LuciferYang nit: you need to update the PR description. There is an old file name `storage_level.proto`.

[GitHub] [spark] LuciferYang commented on pull request #40516: [SPARK-42894][CONNECT] Support `cache`/`persist`/`unpersist`/`storageLevel` for Spark connect jvm client

2023-03-22 Thread via GitHub
LuciferYang commented on PR #40516: URL: https://github.com/apache/spark/pull/40516#issuecomment-1480402878 Thanks @dongjoon-hyun @HyukjinKwon

[GitHub] [spark] dongjoon-hyun commented on pull request #40128: [SPARK-42466][K8S]: Cleanup k8s upload directory when job terminates

2023-03-22 Thread via GitHub
dongjoon-hyun commented on PR #40128: URL: https://github.com/apache/spark/pull/40128#issuecomment-1480402673 @shrprasa . 1. It seems that you have an assumption that the shutdown hook is magically reliable. However, shutdown hooks have a well-known limitation where the JVM can be destroyed

[GitHub] [spark] LuciferYang commented on pull request #40518: [SPARK-42889][CONNECT][PYTHON][FOLLOWUP] Move `StorageLevel` into a separate file to avoid potential file recursively imports

2023-03-22 Thread via GitHub
LuciferYang commented on PR #40518: URL: https://github.com/apache/spark/pull/40518#issuecomment-1480402574 rebased because https://github.com/apache/spark/pull/40516 was merged

[GitHub] [spark] zhenlineo commented on a diff in pull request #40498: [SPARK-42878][CONNECT] The table API in DataFrameReader could also accept options

2023-03-22 Thread via GitHub
zhenlineo commented on code in PR #40498: URL: https://github.com/apache/spark/pull/40498#discussion_r1145531822 ## connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/DataFrameReader.scala: ## @@ -458,7 +458,9 @@ class DataFrameReader private[sql] (sparkSession:

[GitHub] [spark] hvanhovell commented on pull request #40515: [SPARK-42884][CONNECT] Add Ammonite REPL integration

2023-03-22 Thread via GitHub
hvanhovell commented on PR #40515: URL: https://github.com/apache/spark/pull/40515#issuecomment-1480397842 @dongjoon-hyun I sent the email.

[GitHub] [spark] dongjoon-hyun commented on pull request #40515: [SPARK-42884][CONNECT] Add Ammonite REPL integration

2023-03-22 Thread via GitHub
dongjoon-hyun commented on PR #40515: URL: https://github.com/apache/spark/pull/40515#issuecomment-1480398301 Thank you so much, @hvanhovell .

[GitHub] [spark] amaliujia commented on a diff in pull request #40498: [SPARK-42878][CONNECT] The table API in DataFrameReader could also accept options

2023-03-22 Thread via GitHub
amaliujia commented on code in PR #40498: URL: https://github.com/apache/spark/pull/40498#discussion_r1145530854 ## python/pyspark/sql/connect/plan.py: ## @@ -302,13 +302,16 @@ def plan(self, session: "SparkConnectClient") -> proto.Relation: class Read(LogicalPlan): -

[GitHub] [spark] xinrong-meng commented on pull request #40487: [SPARK-42891][CONNECT][PYTHON] Implement CoGrouped Map API

2023-03-22 Thread via GitHub
xinrong-meng commented on PR #40487: URL: https://github.com/apache/spark/pull/40487#issuecomment-1480372625 cc [LuciferYang](https://github.com/LuciferYang) thanks!

[GitHub] [spark] xinrong-meng commented on pull request #40487: [SPARK-42891][CONNECT][PYTHON] Implement CoGrouped Map API

2023-03-22 Thread via GitHub
xinrong-meng commented on PR #40487: URL: https://github.com/apache/spark/pull/40487#issuecomment-1480372425 May I get a review please @zhengruifeng @HyukjinKwon ?

[GitHub] [spark] ueshin commented on pull request #40402: [SPARK-42020][CONNECT][PYTHON] Support UserDefinedType in Spark Connect

2023-03-22 Thread via GitHub
ueshin commented on PR #40402: URL: https://github.com/apache/spark/pull/40402#issuecomment-1480349942 @zhengruifeng I submitted two PRs: #40526 and #40527.

[GitHub] [spark] ueshin opened a new pull request, #40527: [SPARK-42900][CONNECT][PYTHON] Fix createDataFrame to respect inference and column names

2023-03-22 Thread via GitHub
ueshin opened a new pull request, #40527: URL: https://github.com/apache/spark/pull/40527

### What changes were proposed in this pull request?

Fixes `createDataFrame` to respect inference and column names.

### Why are the changes needed?

Currently when a column name list
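The fix above concerns how a caller-supplied column-name list interacts with schema inference. As a hedged, purely illustrative sketch (hypothetical helper, not Spark's actual `createDataFrame` implementation), applying a name list on top of positionally inferred rows looks like:

```python
def apply_column_names(rows, names):
    # Hypothetical sketch: zip a caller-supplied name list onto
    # positionally inferred row values.
    return [dict(zip(names, row)) for row in rows]

renamed = apply_column_names([(1, "a"), (2, "b")], ["id", "value"])
assert renamed == [{"id": 1, "value": "a"}, {"id": 2, "value": "b"}]
```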

[GitHub] [spark] ueshin opened a new pull request, #40526: [SPARK-42899][SQL] Fix DataFrame.to(schema) to handle the case where there is a non-nullable nested field in a nullable field

2023-03-22 Thread via GitHub
ueshin opened a new pull request, #40526: URL: https://github.com/apache/spark/pull/40526

### What changes were proposed in this pull request?

Fixes `DataFrame.to(schema)` to handle the case where there is a non-nullable nested field in a nullable field.

### Why are the
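The rule being fixed — a non-nullable nested field may legitimately sit inside a nullable parent field — can be sketched as a recursive nullability-compatibility check. This is a hedged illustration using a hypothetical dict-based schema representation, not Spark's `DataType` API:

```python
def nullable_compatible(src, dst):
    # A value shaped like `src` may be stored in `dst` if `dst` is at
    # least as permissive about nulls, checked recursively through
    # nested struct fields (hypothetical representation).
    if src["nullable"] and not dst["nullable"]:
        return False
    src_children = src.get("fields", {})
    dst_children = dst.get("fields", {})
    return all(
        name in dst_children and nullable_compatible(child, dst_children[name])
        for name, child in src_children.items()
    )

# A nullable struct containing a non-nullable nested field is fine
# when the target has the same shape.
src = {"nullable": True, "fields": {"x": {"nullable": False}}}
dst = {"nullable": True, "fields": {"x": {"nullable": False}}}
assert nullable_compatible(src, dst)
# Storing a nullable value into a non-nullable target must be rejected.
assert not nullable_compatible({"nullable": True}, {"nullable": False})
```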

[GitHub] [spark] hvanhovell closed pull request #40512: [SPARK-42892][SQL] Move sameType and relevant methods out of DataType

2023-03-22 Thread via GitHub
hvanhovell closed pull request #40512: [SPARK-42892][SQL] Move sameType and relevant methods out of DataType URL: https://github.com/apache/spark/pull/40512

[GitHub] [spark] shrprasa commented on pull request #40258: [SPARK-42655][SQL] Incorrect ambiguous column reference error

2023-03-22 Thread via GitHub
shrprasa commented on PR #40258: URL: https://github.com/apache/spark/pull/40258#issuecomment-1480217186 Gentle ping @dongjoon-hyun @mridulm @HyukjinKwon @yaooqinn Can you please review this PR or direct it to someone who can review this PR.

[GitHub] [spark] shrprasa commented on pull request #40128: [SPARK-42466][K8S]: Cleanup k8s upload directory when job terminates

2023-03-22 Thread via GitHub
shrprasa commented on PR #40128: URL: https://github.com/apache/spark/pull/40128#issuecomment-1480215990 Hi @dongjoon-hyun The change to clean up the upload directory is not specific to HDFS. The reason we should do cleanup is because if the spark job is creating new directories/files,

[GitHub] [spark] itholic commented on pull request #40525: [WIP][SPARK-42859][CONNECT][PS] Basic support for pandas API on Spark Connect

2023-03-22 Thread via GitHub
itholic commented on PR #40525: URL: https://github.com/apache/spark/pull/40525#issuecomment-1480119115 The remaining task at hand is to address numerous mypy annotation issues. If you have any good ideas for resolving the linter issues, please feel free to let me know at any time :-)

[GitHub] [spark] itholic opened a new pull request, #40525: [WIP][SPARK-42859][CONNECT][PS] Basic support for pandas API on Spark Connect

2023-03-22 Thread via GitHub
itholic opened a new pull request, #40525: URL: https://github.com/apache/spark/pull/40525 ### What changes were proposed in this pull request? This PR proposes to support pandas API on Spark for Spark Connect. This PR includes minimal changes to support basic functionality of the

[GitHub] [spark] cnauroth commented on pull request #40511: [SPARK-42888][BUILD] Upgrade `gcs-connector` to 2.2.11

2023-03-22 Thread via GitHub
cnauroth commented on PR #40511: URL: https://github.com/apache/spark/pull/40511#issuecomment-1480084498 @dongjoon-hyun and @sunchao , thank you for the commit and the warm welcome!

[GitHub] [spark] gerashegalov commented on pull request #40524: [SPARK-42898][SQL] Mark that string/date casts do not need time zone id

2023-03-22 Thread via GitHub
gerashegalov commented on PR #40524: URL: https://github.com/apache/spark/pull/40524#issuecomment-1480055444 LGTM, I would just add a unit test to CastSuite to prevent regressions

[GitHub] [spark] dongjoon-hyun commented on pull request #40515: [SPARK-42884][CONNECT] Add Ammonite REPL integration

2023-03-22 Thread via GitHub
dongjoon-hyun commented on PR #40515: URL: https://github.com/apache/spark/pull/40515#issuecomment-1480034164 Ya, I'm not against this nice improvement. Just shoot an email to the dev mailing list to give a heads-up. That's what I think we need.

[GitHub] [spark] hvanhovell commented on pull request #40515: [SPARK-42884][CONNECT] Add Ammonite REPL integration

2023-03-22 Thread via GitHub
hvanhovell commented on PR #40515: URL: https://github.com/apache/spark/pull/40515#issuecomment-1480012982 @dongjoon-hyun officially is a bit of a broad term. As far as I am concerned, ammonite is just a way to use the connect JVM client; it is not meant as a change for all of Spark (although

[GitHub] [spark] ueshin commented on a diff in pull request #40518: [SPARK-42889][CONNECT][PYTHON][FOLLOWUP] Move `StorageLevel` into a separate file to avoid potential file recursively imports

2023-03-22 Thread via GitHub
ueshin commented on code in PR #40518: URL: https://github.com/apache/spark/pull/40518#discussion_r1145188951 ## connector/connect/common/src/main/protobuf/spark/connect/storage_level.proto: ## @@ -0,0 +1,37 @@ +/* Review Comment: `common.proto` sounds good to me.

[GitHub] [spark] dongjoon-hyun commented on pull request #40462: [SPARK-42832][SQL] Remove repartition if it is the child of LocalLimit

2023-03-22 Thread via GitHub
dongjoon-hyun commented on PR #40462: URL: https://github.com/apache/spark/pull/40462#issuecomment-1479987729 Merged to master for Apache Spark 3.5.

[GitHub] [spark] dongjoon-hyun closed pull request #40462: [SPARK-42832][SQL] Remove repartition if it is the child of LocalLimit

2023-03-22 Thread via GitHub
dongjoon-hyun closed pull request #40462: [SPARK-42832][SQL] Remove repartition if it is the child of LocalLimit URL: https://github.com/apache/spark/pull/40462

[GitHub] [spark] dongjoon-hyun commented on pull request #40519: [SPARK-42864][ML] Make `IsotonicRegression.PointsAccumulator` private

2023-03-22 Thread via GitHub
dongjoon-hyun commented on PR #40519: URL: https://github.com/apache/spark/pull/40519#issuecomment-1479940012 branch-3.4 was handled via https://github.com/apache/spark/pull/40500 yesterday.

[GitHub] [spark] dongjoon-hyun commented on pull request #40519: [SPARK-42864][ML] Make `IsotonicRegression.PointsAccumulator` private

2023-03-22 Thread via GitHub
dongjoon-hyun commented on PR #40519: URL: https://github.com/apache/spark/pull/40519#issuecomment-1479933229 Merged to master.

[GitHub] [spark] dongjoon-hyun closed pull request #40519: [SPARK-42864][ML] Make `IsotonicRegression.PointsAccumulator` private

2023-03-22 Thread via GitHub
dongjoon-hyun closed pull request #40519: [SPARK-42864][ML] Make `IsotonicRegression.PointsAccumulator` private URL: https://github.com/apache/spark/pull/40519

[GitHub] [spark] dongjoon-hyun closed pull request #40516: [SPARK-42894][CONNECT] Support `cache`/`persist`/`unpersist`/`storageLevel` for Spark connect jvm client

2023-03-22 Thread via GitHub
dongjoon-hyun closed pull request #40516: [SPARK-42894][CONNECT] Support `cache`/`persist`/`unpersist`/`storageLevel` for Spark connect jvm client URL: https://github.com/apache/spark/pull/40516

[GitHub] [spark] dongjoon-hyun commented on pull request #40516: [SPARK-42894][CONNECT] Support `cache`/`persist`/`unpersist`/`storageLevel` for Spark connect jvm client

2023-03-22 Thread via GitHub
dongjoon-hyun commented on PR #40516: URL: https://github.com/apache/spark/pull/40516#issuecomment-1479928027 Merged to master/3.4. Thank you, @LuciferYang and @HyukjinKwon .

[GitHub] [spark] VindhyaG commented on pull request #40462: [SPARK-42832][SQL] Remove repartition if it is the child of LocalLimit

2023-03-22 Thread via GitHub
VindhyaG commented on PR #40462: URL: https://github.com/apache/spark/pull/40462#issuecomment-1479847214 > > Can you please explain more on scenarios when rebalancepartitions becomes child of locallimit? i tried SELECT * FROM t WHERE id > 1 LIMIT 5; with spark 2.4.4 version and

[GitHub] [spark] revans2 opened a new pull request, #40524: [SPARK-42898][SQL] Mark that string/date casts do not need time zone id

2023-03-22 Thread via GitHub
revans2 opened a new pull request, #40524: URL: https://github.com/apache/spark/pull/40524

### What changes were proposed in this pull request?

This removes the need for a time zone id when casting from StringType -> DateType and DateType -> StringType.

### Why are the changes
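The change above rests on the observation that a calendar date has no time-of-day component, so converting between a date and its string form never needs a time zone. A hedged stdlib illustration of that property (an analogy, not Spark's cast implementation):

```python
from datetime import date, datetime

s = "2023-03-22"
d = datetime.strptime(s, "%Y-%m-%d").date()  # string -> date, no time zone consulted
assert d == date(2023, 3, 22)
assert d.isoformat() == s                    # date -> string, no time zone consulted
```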

[GitHub] [spark] wankunde opened a new pull request, #40523: [SPARK-42897][SQL] Avoid evaluate more than once for the variables from the left side in the FullOuter SMJ condition

2023-03-22 Thread via GitHub
wankunde opened a new pull request, #40523: URL: https://github.com/apache/spark/pull/40523

### What changes were proposed in this pull request?

For example:

```
val df1 = spark.range(5).select($"id".as("k1"))
val df2 = spark.range(10).select($"id".as("k2"))
```
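The problem described — left-side variables re-evaluated inside a full-outer sort-merge-join condition — is an instance of duplicated evaluation of an already-computed value. A hedged sketch of the general fix, with hypothetical names and plain Python standing in for Spark codegen:

```python
def matching_rows(left_row, right_rows, cond):
    # Evaluate the (notionally expensive) left-side expression once,
    # then reuse it while the condition is checked against every
    # buffered right-side row.
    left_val = left_row["k1"] % 3  # computed once, not per right row
    return [r for r in right_rows if cond(left_val, r["k2"] % 3)]

matches = matching_rows({"k1": 4}, [{"k2": 1}, {"k2": 2}], lambda a, b: a == b)
assert matches == [{"k2": 1}]
```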

[GitHub] [spark] cloud-fan closed pull request #40446: [SPARK-42815][SQL] Subexpression elimination support shortcut expression

2023-03-22 Thread via GitHub
cloud-fan closed pull request #40446: [SPARK-42815][SQL] Subexpression elimination support shortcut expression URL: https://github.com/apache/spark/pull/40446

[GitHub] [spark] cloud-fan commented on pull request #40446: [SPARK-42815][SQL] Subexpression elimination support shortcut expression

2023-03-22 Thread via GitHub
cloud-fan commented on PR #40446: URL: https://github.com/apache/spark/pull/40446#issuecomment-1479648779 thanks, merging to master!

[GitHub] [spark] panbingkun commented on pull request #40506: [SPARK-42881][SQL] Codegen Support for get_json_object

2023-03-22 Thread via GitHub
panbingkun commented on PR #40506: URL: https://github.com/apache/spark/pull/40506#issuecomment-1479624519 > hmm... I think we should refactor `JsonBenchmark` to make get_json_object run w/ and w/o code gen in one OK, let me do it.

[GitHub] [spark] cloud-fan commented on pull request #40522: [SPARK-42101][SQL][FOLLOWUP] Make QueryStageExec.resultOption and isMeterialized consistent

2023-03-22 Thread via GitHub
cloud-fan commented on PR #40522: URL: https://github.com/apache/spark/pull/40522#issuecomment-1479609724 cc @ulysses-you

[GitHub] [spark] WeichenXu123 commented on a diff in pull request #40520: [SPARK-42896][SQL][PYSPARK] Make `mapInPandas` / `mapInArrow` support barrier mode execution

2023-03-22 Thread via GitHub
WeichenXu123 commented on code in PR #40520: URL: https://github.com/apache/spark/pull/40520#discussion_r1144852369 ## sql/core/src/main/scala/org/apache/spark/sql/execution/python/MapInPandasExec.scala: ## @@ -28,7 +28,8 @@ import org.apache.spark.sql.execution.SparkPlan case

[GitHub] [spark] cloud-fan commented on a diff in pull request #40522: [SPARK-42101][SQL][FOLLOWUP] Make QueryStageExec.resultOption and isMeterialized consistent

2023-03-22 Thread via GitHub
cloud-fan commented on code in PR #40522: URL: https://github.com/apache/spark/pull/40522#discussion_r1144851788 ## sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/AdaptiveSparkPlanExec.scala: ## @@ -561,34 +562,30 @@ case class AdaptiveSparkPlanExec( }

[GitHub] [spark] cloud-fan opened a new pull request, #40522: [SPARK-42101][SQL][FOLLOWUP] Make QueryStageExec.resultOption and isMeterialized consistent

2023-03-22 Thread via GitHub
cloud-fan opened a new pull request, #40522: URL: https://github.com/apache/spark/pull/40522

### What changes were proposed in this pull request?

This is a followup of https://github.com/apache/spark/pull/39624. `QueryStageExec.isMeterialized` should only return true if
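The invariant described — a stage should report itself as materialized only when its result is actually present — can be sketched with a hypothetical class in which the flag is derived from the result holder, so the two can never disagree (illustrative only, not `QueryStageExec` itself):

```python
from typing import Optional

class Stage:
    # Hypothetical sketch: is_materialized is computed from the result
    # option rather than tracked as a separate, possibly-stale flag.
    def __init__(self) -> None:
        self._result: Optional[str] = None

    @property
    def is_materialized(self) -> bool:
        return self._result is not None

    def materialize(self, result: str) -> None:
        self._result = result

stage = Stage()
assert not stage.is_materialized
stage.materialize("rows")
assert stage.is_materialized
```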

[GitHub] [spark] HyukjinKwon commented on a diff in pull request #40520: [SPARK-42896][SQL][PYSPARK] Make `mapInPandas` / `mapInArrow` support barrier mode execution

2023-03-22 Thread via GitHub
HyukjinKwon commented on code in PR #40520: URL: https://github.com/apache/spark/pull/40520#discussion_r1144849964 ## sql/core/src/main/scala/org/apache/spark/sql/execution/python/MapInPandasExec.scala: ## @@ -28,7 +28,8 @@ import org.apache.spark.sql.execution.SparkPlan case
