[jira] [Updated] (SPARK-45179) Increase Numpy minimum version to 1.21
[ https://issues.apache.org/jira/browse/SPARK-45179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated SPARK-45179:
    Labels: pull-request-available (was: )

> Increase Numpy minimum version to 1.21
>
> Key: SPARK-45179
> URL: https://issues.apache.org/jira/browse/SPARK-45179
> Project: Spark
> Issue Type: Improvement
> Components: PySpark
> Affects Versions: 4.0.0
> Reporter: Ruifeng Zheng
> Priority: Major
> Labels: pull-request-available

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
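Bumping a dependency floor like this usually comes with an import-time gate. The sketch below is a minimal, hypothetical version of such a check (the helper names and error message are illustrative, not PySpark's actual code); it compares dotted version strings numerically, so `"1.9"` correctly sorts below `"1.21"`.

```python
def meets_minimum(installed: str, minimum: str) -> bool:
    """Compare dotted version strings numerically, e.g. '1.22.4' >= '1.21'.

    Note: purely numeric components only; pre-release suffixes like
    '1.21rc1' would need a real version parser.
    """
    parse = lambda v: tuple(int(p) for p in v.split("."))
    return parse(installed) >= parse(minimum)


def require_numpy(installed_version: str, minimum: str = "1.21") -> None:
    # Hypothetical gate mirroring what a minimum-version bump implies.
    if not meets_minimum(installed_version, minimum):
        raise ImportError(
            f"NumPy >= {minimum} must be installed; found {installed_version}"
        )


print(meets_minimum("1.22.4", "1.21"))  # True: above the floor
print(meets_minimum("1.20.3", "1.21"))  # False: 20 < 21 in the second slot
```

Tuple comparison is what makes `1.9 < 1.21` come out right; naive string comparison would invert it.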
[jira] [Updated] (SPARK-45175) download krb5.conf from remote storage in spark-submit on k8s
[ https://issues.apache.org/jira/browse/SPARK-45175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated SPARK-45175:
    Labels: pull-request-available (was: )

> download krb5.conf from remote storage in spark-submit on k8s
>
> Key: SPARK-45175
> URL: https://issues.apache.org/jira/browse/SPARK-45175
> Project: Spark
> Issue Type: Improvement
> Components: Kubernetes
> Affects Versions: 3.4.1
> Reporter: Qian Sun
> Priority: Minor
> Labels: pull-request-available
>
> krb5.conf currently only supports the local file format. Tenants would like to save this file on their own servers and download it during the spark-submit phase, to better support multi-tenant scenarios. The proposed solution is to use the *downloadFile* function [1], similar to the configuration of *spark.kubernetes.driver/executor.podTemplateFile*.
>
> [1] https://github.com/apache/spark/blob/822f58f0d26b7d760469151a65eaf9ee863a07a1/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/PodTemplateConfigMapStep.scala#L82C24-L82C24
[jira] [Updated] (SPARK-44376) Build using Maven is broken with Scala 2.13 and Java 11 and Java 17
[ https://issues.apache.org/jira/browse/SPARK-44376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated SPARK-44376:
    Labels: pull-request-available (was: )

> Build using Maven is broken with Scala 2.13 and Java 11 and Java 17
>
> Key: SPARK-44376
> URL: https://issues.apache.org/jira/browse/SPARK-44376
> Project: Spark
> Issue Type: Bug
> Components: Build
> Affects Versions: 3.5.0
> Reporter: Emil Ejbyfeldt
> Priority: Major
> Labels: pull-request-available
> Fix For: 3.5.0, 4.0.0
>
> Fails with
> ```
> $ ./build/mvn compile -Pscala-2.13 -Djava.version=11 -X
> ...
> [WARNING] [Warn] : [deprecation @ | origin= | version=] -target is deprecated: Use -release instead to compile against the correct platform API.
> [ERROR] [Error] : target platform version 8 is older than the release version 11
> [WARNING] one warning found
> [ERROR] one error found
> ...
> ```
> if setting the `java.version` property, or
> ```
> $ ./build/mvn compile -Pscala-2.13
> ...
> [WARNING] [Warn] : [deprecation @ | origin= | version=] -target is deprecated: Use -release instead to compile against the correct platform API.
> [ERROR] [Error] /home/eejbyfeldt/dev/apache/spark/core/src/main/scala/org/apache/spark/serializer/SerializationDebugger.scala:71: not found: value sun
> [ERROR] [Error] /home/eejbyfeldt/dev/apache/spark/core/src/main/scala/org/apache/spark/storage/StorageUtils.scala:26: not found: object sun
> [ERROR] [Error] /home/eejbyfeldt/dev/apache/spark/core/src/main/scala/org/apache/spark/storage/StorageUtils.scala:27: not found: object sun
> [ERROR] [Error] /home/eejbyfeldt/dev/apache/spark/core/src/main/scala/org/apache/spark/storage/StorageUtils.scala:206: not found: type DirectBuffer
> [ERROR] [Error] /home/eejbyfeldt/dev/apache/spark/core/src/main/scala/org/apache/spark/storage/StorageUtils.scala:210: not found: type Unsafe
> [ERROR] [Error] /home/eejbyfeldt/dev/apache/spark/core/src/main/scala/org/apache/spark/storage/StorageUtils.scala:212: not found: type Unsafe
> [ERROR] [Error] /home/eejbyfeldt/dev/apache/spark/core/src/main/scala/org/apache/spark/storage/StorageUtils.scala:213: not found: type DirectBuffer
> [ERROR] [Error] /home/eejbyfeldt/dev/apache/spark/core/src/main/scala/org/apache/spark/storage/StorageUtils.scala:216: not found: type DirectBuffer
> [ERROR] [Error] /home/eejbyfeldt/dev/apache/spark/core/src/main/scala/org/apache/spark/storage/StorageUtils.scala:236: not found: type DirectBuffer
> [ERROR] [Error] /home/eejbyfeldt/dev/apache/spark/core/src/main/scala/org/apache/spark/storage/StorageUtils.scala:26: Unused import
> [ERROR] [Error] /home/eejbyfeldt/dev/apache/spark/core/src/main/scala/org/apache/spark/storage/StorageUtils.scala:27: Unused import
> [ERROR] [Error] /home/eejbyfeldt/dev/apache/spark/core/src/main/scala/org/apache/spark/util/ClosureCleaner.scala:452: not found: value sun
> [ERROR] [Error] /home/eejbyfeldt/dev/apache/spark/core/src/main/scala/org/apache/spark/util/SignalUtils.scala:26: not found: object sun
> [ERROR] [Error] /home/eejbyfeldt/dev/apache/spark/core/src/main/scala/org/apache/spark/util/SignalUtils.scala:99: not found: type SignalHandler
> [ERROR] [Error] /home/eejbyfeldt/dev/apache/spark/core/src/main/scala/org/apache/spark/util/SignalUtils.scala:99: not found: type Signal
> [ERROR] [Error] /home/eejbyfeldt/dev/apache/spark/core/src/main/scala/org/apache/spark/util/SignalUtils.scala:83: not found: type Signal
> [ERROR] [Error] /home/eejbyfeldt/dev/apache/spark/core/src/main/scala/org/apache/spark/util/SignalUtils.scala:108: not found: type SignalHandler
> [ERROR] [Error] /home/eejbyfeldt/dev/apache/spark/core/src/main/scala/org/apache/spark/util/SignalUtils.scala:108: not found: value Signal
> [ERROR] [Error] /home/eejbyfeldt/dev/apache/spark/core/src/main/scala/org/apache/spark/util/SignalUtils.scala:114: not found: type Signal
> [ERROR] [Error] /home/eejbyfeldt/dev/apache/spark/core/src/main/scala/org/apache/spark/util/SignalUtils.scala:116: not found: value Signal
> [ERROR] [Error] /home/eejbyfeldt/dev/apache/spark/core/src/main/scala/org/apache/spark/util/SignalUtils.scala:128: not found: value Signal
> [ERROR] [Error] /home/eejbyfeldt/dev/apache/spark/core/src/main/scala/org/apache/spark/util/SignalUtils.scala:26: Unused import
> [ERROR] [Error] /home/eejbyfeldt/dev/apache/spark/core/src/main/scala/org/apache/spark/util/SignalUtils.scala:26: Unused import
> [WARNING] one warning found
> [ERROR] 23 errors found
> ...
> ```
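The two failure modes in the report share one cause: scalac is told `-target 8` while the release API is newer. A hedged sketch of the commands involved (the `-Djava.version` property is taken from the report itself; whether switching the build to `-release` resolves the second failure depends on how the Maven scala profile wires the compiler flags):

```shell
# Reproduce, as in the report: Scala 2.13 profile with an explicit java.version
./build/mvn compile -Pscala-2.13 -Djava.version=11 -X

# The scalac warning points at the likely fix: pass -release instead of
# -target so the bytecode target and the platform API agree. In the build
# file that would look roughly like (illustrative, not the actual pom change):
#   <arg>-release</arg><arg>11</arg>    instead of    <arg>-target:8</arg>
```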
[jira] [Created] (SPARK-45179) Increase Numpy minimum version to 1.21
Ruifeng Zheng created SPARK-45179:
    Summary: Increase Numpy minimum version to 1.21
    Key: SPARK-45179
    URL: https://issues.apache.org/jira/browse/SPARK-45179
    Project: Spark
    Issue Type: Improvement
    Components: PySpark
    Affects Versions: 4.0.0
    Reporter: Ruifeng Zheng
[jira] [Updated] (SPARK-26365) spark-submit for k8s cluster doesn't propagate exit code
[ https://issues.apache.org/jira/browse/SPARK-26365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated SPARK-26365:
    Labels: pull-request-available (was: )

> spark-submit for k8s cluster doesn't propagate exit code
>
> Key: SPARK-26365
> URL: https://issues.apache.org/jira/browse/SPARK-26365
> Project: Spark
> Issue Type: Bug
> Components: Kubernetes, Spark Core, Spark Submit
> Affects Versions: 2.3.2, 2.4.0, 3.0.0, 3.1.0
> Reporter: Oscar Bonilla
> Priority: Major
> Labels: pull-request-available
> Attachments: spark-2.4.5-raise-exception-k8s-failure.patch, spark-3.0.0-raise-exception-k8s-failure.patch
>
> When launching apps using spark-submit in a Kubernetes cluster, if the Spark application fails (returns exit code 1, for example), spark-submit will still exit gracefully and return exit code 0.
> This is problematic, since there's no way to know whether there's been a problem with the Spark application.
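The attached patches change spark-submit itself, but the failure mode is easy to see in miniature: a launcher that ignores its child's exit status always reports success. A minimal stdlib sketch (the function name is illustrative, and a real fix would read the driver pod's terminated exit code via the Kubernetes API rather than a local child process) of the propagation behavior the ticket asks for:

```python
import subprocess
import sys


def launch(cmd: list) -> int:
    """Run a child process and hand its exit code back to the caller,
    instead of unconditionally reporting 0 the way the ticket describes
    spark-submit behaving in k8s cluster mode."""
    completed = subprocess.run(cmd)
    return completed.returncode


if __name__ == "__main__":
    # A child that fails with exit code 1, standing in for a failed Spark app.
    code = launch([sys.executable, "-c", "import sys; sys.exit(1)"])
    print(code)  # 1, not 0: the failure is now visible to the caller
```

A wrapper like this only helps if the caller then exits with the propagated code (e.g. `sys.exit(code)`), which is exactly the step the reported spark-submit path skips.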
[jira] [Updated] (SPARK-43874) Enable GroupByTests for pandas 2.0.0.
[ https://issues.apache.org/jira/browse/SPARK-43874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated SPARK-43874:
    Labels: pull-request-available (was: )

> Enable GroupByTests for pandas 2.0.0.
>
> Key: SPARK-43874
> URL: https://issues.apache.org/jira/browse/SPARK-43874
> Project: Spark
> Issue Type: Sub-task
> Components: Pandas API on Spark, PySpark
> Affects Versions: 4.0.0
> Reporter: Haejoon Lee
> Assignee: Haejoon Lee
> Priority: Major
> Labels: pull-request-available
> Fix For: 4.0.0
>
> test list:
> * test_prod
> * test_nth
> * test_mad
> * test_basic_stat_funcs
> * test_groupby_multiindex_columns
> * test_apply_without_shortcut
> * test_mean
> * test_apply
[jira] [Resolved] (SPARK-43811) Enable DataFrameTests.test_reindex for pandas 2.0.0.
[ https://issues.apache.org/jira/browse/SPARK-43811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Haejoon Lee resolved SPARK-43811.
    Resolution: Fixed

> Enable DataFrameTests.test_reindex for pandas 2.0.0.
>
> Key: SPARK-43811
> URL: https://issues.apache.org/jira/browse/SPARK-43811
> Project: Spark
> Issue Type: Sub-task
> Components: Pandas API on Spark
> Affects Versions: 4.0.0
> Reporter: Haejoon Lee
> Priority: Major
[jira] [Resolved] (SPARK-44276) Match behavior with pandas for `SeriesStringTests.test_string_replace`
[ https://issues.apache.org/jira/browse/SPARK-44276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Haejoon Lee resolved SPARK-44276.
    Resolution: Fixed

> Match behavior with pandas for `SeriesStringTests.test_string_replace`
>
> Key: SPARK-44276
> URL: https://issues.apache.org/jira/browse/SPARK-44276
> Project: Spark
> Issue Type: Sub-task
> Components: Pandas API on Spark
> Affects Versions: 4.0.0
> Reporter: Haejoon Lee
> Priority: Major
>
> See https://github.com/apache/spark/pull/41823/files
[jira] [Resolved] (SPARK-43644) Enable DatetimeIndexTests.test_indexer_between_time for pandas 2.0.0.
[ https://issues.apache.org/jira/browse/SPARK-43644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Haejoon Lee resolved SPARK-43644.
    Resolution: Fixed

> Enable DatetimeIndexTests.test_indexer_between_time for pandas 2.0.0.
>
> Key: SPARK-43644
> URL: https://issues.apache.org/jira/browse/SPARK-43644
> Project: Spark
> Issue Type: Sub-task
> Components: Pandas API on Spark
> Affects Versions: 4.0.0
> Reporter: Haejoon Lee
> Priority: Major
>
> Enable DatetimeIndexTests.test_indexer_between_time for pandas 2.0.0.
[jira] [Resolved] (SPARK-43433) Match `GroupBy.nth` behavior with new pandas behavior
[ https://issues.apache.org/jira/browse/SPARK-43433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Haejoon Lee resolved SPARK-43433.
    Resolution: Fixed

> Match `GroupBy.nth` behavior with new pandas behavior
>
> Key: SPARK-43433
> URL: https://issues.apache.org/jira/browse/SPARK-43433
> Project: Spark
> Issue Type: Sub-task
> Components: Pandas API on Spark
> Affects Versions: 4.0.0
> Reporter: Haejoon Lee
> Priority: Major
>
> Match behavior with https://pandas.pydata.org/docs/dev/whatsnew/v2.0.0.html#dataframegroupby-nth-and-seriesgroupby-nth-now-behave-as-filtrations
[jira] [Resolved] (SPARK-43291) Generate proper warning on different behavior with numeric_only
[ https://issues.apache.org/jira/browse/SPARK-43291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Haejoon Lee resolved SPARK-43291.
    Resolution: Won't Fix

We should match the behavior with pandas instead of warning.

> Generate proper warning on different behavior with numeric_only
>
> Key: SPARK-43291
> URL: https://issues.apache.org/jira/browse/SPARK-43291
> Project: Spark
> Issue Type: Sub-task
> Components: Pandas API on Spark
> Affects Versions: 4.0.0
> Reporter: Haejoon Lee
> Priority: Major
> Labels: pull-request-available
>
> Should enable the test below:
> {code:python}
> pdf = pd.DataFrame([("1", "2"), ("0", "3"), ("2", "0"), ("1", "1")], columns=["a", "b"])
> psdf = ps.from_pandas(pdf)
> self.assert_eq(pdf.cov(), psdf.cov())
> {code}
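The test in the ticket exercises `DataFrame.cov` over string-typed columns, which is where the `numeric_only` divergence shows up. A small pandas-only sketch (assuming pandas is installed; `ps.from_pandas` from the ticket is left out since it needs a Spark session) of what "matching the behavior" means when the comparison is made well defined by an explicit numeric cast:

```python
import pandas as pd

# Same data as in the ticket: string columns that happen to hold digits.
pdf = pd.DataFrame([("1", "2"), ("0", "3"), ("2", "0"), ("1", "1")],
                   columns=["a", "b"])

# String columns are not numeric, so covariance over them is only meaningful
# after an explicit cast; casting first gives both libraries the same,
# unambiguous input to agree on.
cov = pdf.astype(float).cov()
print(cov.shape)  # a 2x2 covariance matrix over columns a and b
```

The resolution ("match pandas instead of warning") amounts to making the Pandas API on Spark produce whatever `pdf.cov()` itself produces for the installed pandas version, rather than papering over the difference with a warning.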
[jira] [Resolved] (SPARK-43282) Investigate DataFrame.sort_values with pandas behavior.
[ https://issues.apache.org/jira/browse/SPARK-43282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Haejoon Lee resolved SPARK-43282.
    Resolution: Won't Fix

> Investigate DataFrame.sort_values with pandas behavior.
>
> Key: SPARK-43282
> URL: https://issues.apache.org/jira/browse/SPARK-43282
> Project: Spark
> Issue Type: Sub-task
> Components: Pandas API on Spark
> Affects Versions: 4.0.0
> Reporter: Haejoon Lee
> Priority: Major
>
> {code:python}
> import pandas as pd
> pdf = pd.DataFrame(
>     {
>         "a": pd.Categorical([1, 2, 3, 1, 2, 3]),
>         "b": pd.Categorical(
>             ["b", "a", "c", "c", "b", "a"], categories=["c", "b", "d", "a"]
>         ),
>     },
> )
> pdf.groupby("a").apply(lambda x: x).sort_values(["a"])
> Traceback (most recent call last):
> ...
> ValueError: 'a' is both an index level and a column label, which is ambiguous.
> {code}
> We should investigate whether this is intended behavior or just a bug in pandas.
[jira] [Resolved] (SPARK-43271) Match behavior with DataFrame.reindex with specifying `index`.
[ https://issues.apache.org/jira/browse/SPARK-43271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Haejoon Lee resolved SPARK-43271.
    Resolution: Fixed

> Match behavior with DataFrame.reindex with specifying `index`.
>
> Key: SPARK-43271
> URL: https://issues.apache.org/jira/browse/SPARK-43271
> Project: Spark
> Issue Type: Sub-task
> Components: Pandas API on Spark
> Affects Versions: 4.0.0
> Reporter: Haejoon Lee
> Priority: Major
>
> Re-enable the pandas 2.0.0 test in DataFrameTests.test_reindex in a proper way.
[jira] [Resolved] (SPARK-45168) Increase Pandas minimum version to 1.4.4
[ https://issues.apache.org/jira/browse/SPARK-45168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ruifeng Zheng resolved SPARK-45168.
    Fix Version/s: 4.0.0
    Resolution: Fixed

Issue resolved by pull request 42930
https://github.com/apache/spark/pull/42930

> Increase Pandas minimum version to 1.4.4
>
> Key: SPARK-45168
> URL: https://issues.apache.org/jira/browse/SPARK-45168
> Project: Spark
> Issue Type: Improvement
> Components: PySpark
> Affects Versions: 4.0.0
> Reporter: Ruifeng Zheng
> Assignee: Ruifeng Zheng
> Priority: Major
> Labels: pull-request-available
> Fix For: 4.0.0
[jira] [Assigned] (SPARK-45168) Increase Pandas minimum version to 1.4.4
[ https://issues.apache.org/jira/browse/SPARK-45168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ruifeng Zheng reassigned SPARK-45168:
    Assignee: Ruifeng Zheng

> Increase Pandas minimum version to 1.4.4
>
> Key: SPARK-45168
> URL: https://issues.apache.org/jira/browse/SPARK-45168
> Project: Spark
> Issue Type: Improvement
> Components: PySpark
> Affects Versions: 4.0.0
> Reporter: Ruifeng Zheng
> Assignee: Ruifeng Zheng
> Priority: Major
> Labels: pull-request-available
[jira] [Updated] (SPARK-45178) Fallback to use single batch executor for Trigger.AvailableNow with unsupported sources rather than using wrapper
[ https://issues.apache.org/jira/browse/SPARK-45178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated SPARK-45178:
    Labels: pull-request-available (was: )

> Fallback to use single batch executor for Trigger.AvailableNow with unsupported sources rather than using wrapper
>
> Key: SPARK-45178
> URL: https://issues.apache.org/jira/browse/SPARK-45178
> Project: Spark
> Issue Type: Bug
> Components: Structured Streaming
> Affects Versions: 4.0.0
> Reporter: Jungtaek Lim
> Priority: Major
> Labels: pull-request-available
>
> We have observed a case where the wrapper implementation of Trigger.AvailableNow (AvailableNowDataStreamWrapper and subclasses) is not fully compatible with a 3rd-party data source and brought up a correctness issue.
>
> While we could persuade a 3rd-party data source to support Trigger.AvailableNow, pushing all 3rd parties to do this is too aggressive and challenging a goal for us to ever achieve. It may also not be possible to come up with a wrapper implementation that has zero issues with any arbitrary source.
>
> As a mitigation, we want to make a slight behavioral change for such cases, falling back to single-batch execution (a.k.a. Trigger.Once) rather than using the wrapper implementation. The exact behaviors of Trigger.AvailableNow and Trigger.Once differ, so this is technically a behavioral change, but it's probably a lot less surprising than failing the query.
>
> For the extreme case where users are confident that there will be no issue at all with using the wrapper, we will come up with a flag to provide the previous behavior.
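The proposed change is a three-way decision: native Trigger.AvailableNow support, the wrapper behind an opt-in flag, or a single-batch fallback. A hypothetical sketch of that decision logic (the function name, return values, and the flag are illustrative; the ticket does not name the actual config):

```python
def resolve_trigger(source_supports_available_now: bool,
                    use_wrapper_flag: bool) -> str:
    """Pick the execution mode for Trigger.AvailableNow per the ticket:
    fall back to single-batch (Trigger.Once semantics) for sources without
    native support, unless the user opts back into the wrapper."""
    if source_supports_available_now:
        return "available-now"   # native support: use it directly
    if use_wrapper_flag:
        return "wrapper"         # previous behavior, kept behind a flag
    return "single-batch"        # proposed new default fallback


print(resolve_trigger(True, False))   # available-now
print(resolve_trigger(False, False))  # single-batch
print(resolve_trigger(False, True))   # wrapper
```

The flag only matters for sources lacking native support, which is why the native-support check comes first.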
[jira] [Updated] (SPARK-43254) Assign a name to the error class _LEGACY_ERROR_TEMP_2018
[ https://issues.apache.org/jira/browse/SPARK-43254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated SPARK-43254:
    Labels: pull-request-available starter (was: starter)

> Assign a name to the error class _LEGACY_ERROR_TEMP_2018
>
> Key: SPARK-43254
> URL: https://issues.apache.org/jira/browse/SPARK-43254
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 3.5.0
> Reporter: Max Gekk
> Priority: Minor
> Labels: pull-request-available, starter
>
> Choose a proper name for the error class *_LEGACY_ERROR_TEMP_2018* defined in {*}core/src/main/resources/error/error-classes.json{*}. The name should be short but complete (look at the examples in error-classes.json).
> Add a test which triggers the error from user code, if such a test doesn't already exist. Check exception fields by using {*}checkError(){*}. That function checks only the valuable error fields and avoids depending on the error text message, so tech editors can modify the error format in error-classes.json without worrying about Spark's internal tests. Migrate other tests that might trigger the error onto checkError().
> If you cannot reproduce the error from user space (using a SQL query), replace the error with an internal error; see {*}SparkException.internalError(){*}.
> Improve the error message format in error-classes.json if the current one is not clear. Propose a solution to users for how to avoid and fix such kinds of errors.
> Please look at the PRs below as examples:
> * https://github.com/apache/spark/pull/38685
> * https://github.com/apache/spark/pull/38656
> * https://github.com/apache/spark/pull/38490
[jira] [Commented] (SPARK-45178) Fallback to use single batch executor for Trigger.AvailableNow with unsupported sources rather than using wrapper
[ https://issues.apache.org/jira/browse/SPARK-45178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17765451#comment-17765451 ]

Jungtaek Lim commented on SPARK-45178:
A PR will be available soon.

> Fallback to use single batch executor for Trigger.AvailableNow with unsupported sources rather than using wrapper
>
> Key: SPARK-45178
> URL: https://issues.apache.org/jira/browse/SPARK-45178
> Project: Spark
> Issue Type: Bug
> Components: Structured Streaming
> Affects Versions: 4.0.0
> Reporter: Jungtaek Lim
> Priority: Major
>
> We have observed a case where the wrapper implementation of Trigger.AvailableNow (AvailableNowDataStreamWrapper and subclasses) is not fully compatible with a 3rd-party data source and brought up a correctness issue.
>
> While we could persuade a 3rd-party data source to support Trigger.AvailableNow, pushing all 3rd parties to do this is too aggressive and challenging a goal for us to ever achieve. It may also not be possible to come up with a wrapper implementation that has zero issues with any arbitrary source.
>
> As a mitigation, we want to make a slight behavioral change for such cases, falling back to single-batch execution (a.k.a. Trigger.Once) rather than using the wrapper implementation. The exact behaviors of Trigger.AvailableNow and Trigger.Once differ, so this is technically a behavioral change, but it's probably a lot less surprising than failing the query.
>
> For the extreme case where users are confident that there will be no issue at all with using the wrapper, we will come up with a flag to provide the previous behavior.
[jira] [Updated] (SPARK-44788) XML: Add pyspark.sql.functions
[ https://issues.apache.org/jira/browse/SPARK-44788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated SPARK-44788:
    Labels: pull-request-available (was: )

> XML: Add pyspark.sql.functions
>
> Key: SPARK-44788
> URL: https://issues.apache.org/jira/browse/SPARK-44788
> Project: Spark
> Issue Type: Sub-task
> Components: PySpark
> Affects Versions: 4.0.0
> Reporter: Sandip Agarwala
> Priority: Major
> Labels: pull-request-available
[jira] [Created] (SPARK-45178) Fallback to use single batch executor for Trigger.AvailableNow with unsupported sources rather than using wrapper
Jungtaek Lim created SPARK-45178:
    Summary: Fallback to use single batch executor for Trigger.AvailableNow with unsupported sources rather than using wrapper
    Key: SPARK-45178
    URL: https://issues.apache.org/jira/browse/SPARK-45178
    Project: Spark
    Issue Type: Bug
    Components: Structured Streaming
    Affects Versions: 4.0.0
    Reporter: Jungtaek Lim

We have observed a case where the wrapper implementation of Trigger.AvailableNow (AvailableNowDataStreamWrapper and subclasses) is not fully compatible with a 3rd-party data source and brought up a correctness issue.

While we could persuade a 3rd-party data source to support Trigger.AvailableNow, pushing all 3rd parties to do this is too aggressive and challenging a goal for us to ever achieve. It may also not be possible to come up with a wrapper implementation that has zero issues with any arbitrary source.

As a mitigation, we want to make a slight behavioral change for such cases, falling back to single-batch execution (a.k.a. Trigger.Once) rather than using the wrapper implementation. The exact behaviors of Trigger.AvailableNow and Trigger.Once differ, so this is technically a behavioral change, but it's probably a lot less surprising than failing the query.

For the extreme case where users are confident that there will be no issue at all with using the wrapper, we will come up with a flag to provide the previous behavior.
[jira] [Updated] (SPARK-45177) Remove `col_space` parameter from `DataFrame.to_latex`
[ https://issues.apache.org/jira/browse/SPARK-45177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated SPARK-45177:
    Labels: pull-request-available (was: )

> Remove `col_space` parameter from `DataFrame.to_latex`
>
> Key: SPARK-45177
> URL: https://issues.apache.org/jira/browse/SPARK-45177
> Project: Spark
> Issue Type: Sub-task
> Components: Pandas API on Spark
> Affects Versions: 4.0.0
> Reporter: Haejoon Lee
> Priority: Major
> Labels: pull-request-available
[jira] [Created] (SPARK-45177) Remove `col_space` parameter from `DataFrame.to_latex`
Haejoon Lee created SPARK-45177:
    Summary: Remove `col_space` parameter from `DataFrame.to_latex`
    Key: SPARK-45177
    URL: https://issues.apache.org/jira/browse/SPARK-45177
    Project: Spark
    Issue Type: Sub-task
    Components: Pandas API on Spark
    Affects Versions: 4.0.0
    Reporter: Haejoon Lee
[jira] [Updated] (SPARK-45143) Make PySpark compatible with PyArrow 13.0.0
[ https://issues.apache.org/jira/browse/SPARK-45143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-45143:
    Parent: SPARK-43831
    Issue Type: Sub-task (was: Improvement)

> Make PySpark compatible with PyArrow 13.0.0
>
> Key: SPARK-45143
> URL: https://issues.apache.org/jira/browse/SPARK-45143
> Project: Spark
> Issue Type: Sub-task
> Components: Connect, PySpark
> Affects Versions: 4.0.0
> Reporter: Ruifeng Zheng
> Assignee: Ruifeng Zheng
> Priority: Major
> Labels: pull-request-available
> Fix For: 4.0.0
>
> https://github.com/apache/spark/actions/runs/6167186123/job/16738683872
>
> {code}
> FAIL [0.095s]: test_from_to_pandas (pyspark.pandas.tests.data_type_ops.test_datetime_ops.DatetimeOpsTests)
> Traceback (most recent call last):
>   File "/__w/spark/spark/python/pyspark/testing/pandasutils.py", line 122, in _assert_pandas_equal
>     assert_series_equal(
>   File "/usr/local/lib/python3.9/dist-packages/pandas/_testing/asserters.py", line 931, in assert_series_equal
>     assert_attr_equal("dtype", left, right, obj=f"Attributes of {obj}")
>   File "/usr/local/lib/python3.9/dist-packages/pandas/_testing/asserters.py", line 415, in assert_attr_equal
>     raise_assert_detail(obj, msg, left_attr, right_attr)
>   File "/usr/local/lib/python3.9/dist-packages/pandas/_testing/asserters.py", line 599, in raise_assert_detail
>     raise AssertionError(msg)
> AssertionError: Attributes of Series are different
> Attribute "dtype" are different
> [left]:  datetime64[ns]
> [right]: datetime64[us]
> {code}
[jira] [Updated] (SPARK-44434) Add more tests for Scala foreachBatch and streaming listeners
[ https://issues.apache.org/jira/browse/SPARK-44434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon updated SPARK-44434:
    Fix Version/s: (was: 3.5.0)

> Add more tests for Scala foreachBatch and streaming listeners
>
> Key: SPARK-44434
> URL: https://issues.apache.org/jira/browse/SPARK-44434
> Project: Spark
> Issue Type: Task
> Components: Connect, Structured Streaming
> Affects Versions: 3.4.1
> Reporter: Raghu Angadi
> Priority: Major
>
> Currently there are very few tests for Scala foreachBatch. Consider adding more tests and covering more test scenarios (multiple queries, etc.).
[jira] [Resolved] (SPARK-45143) Make PySpark compatible with PyArrow 13.0.0
[ https://issues.apache.org/jira/browse/SPARK-45143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun resolved SPARK-45143.
    Fix Version/s: 4.0.0
    Resolution: Fixed

Issue resolved by pull request 42920
https://github.com/apache/spark/pull/42920

> Make PySpark compatible with PyArrow 13.0.0
>
> Key: SPARK-45143
> URL: https://issues.apache.org/jira/browse/SPARK-45143
> Project: Spark
> Issue Type: Improvement
> Components: Connect, PySpark
> Affects Versions: 4.0.0
> Reporter: Ruifeng Zheng
> Assignee: Ruifeng Zheng
> Priority: Major
> Labels: pull-request-available
> Fix For: 4.0.0
>
> https://github.com/apache/spark/actions/runs/6167186123/job/16738683872
>
> {code}
> FAIL [0.095s]: test_from_to_pandas (pyspark.pandas.tests.data_type_ops.test_datetime_ops.DatetimeOpsTests)
> Traceback (most recent call last):
>   File "/__w/spark/spark/python/pyspark/testing/pandasutils.py", line 122, in _assert_pandas_equal
>     assert_series_equal(
>   File "/usr/local/lib/python3.9/dist-packages/pandas/_testing/asserters.py", line 931, in assert_series_equal
>     assert_attr_equal("dtype", left, right, obj=f"Attributes of {obj}")
>   File "/usr/local/lib/python3.9/dist-packages/pandas/_testing/asserters.py", line 415, in assert_attr_equal
>     raise_assert_detail(obj, msg, left_attr, right_attr)
>   File "/usr/local/lib/python3.9/dist-packages/pandas/_testing/asserters.py", line 599, in raise_assert_detail
>     raise AssertionError(msg)
> AssertionError: Attributes of Series are different
> Attribute "dtype" are different
> [left]:  datetime64[ns]
> [right]: datetime64[us]
> {code}
[jira] [Assigned] (SPARK-45143) Make PySpark compatible with PyArrow 13.0.0
[ https://issues.apache.org/jira/browse/SPARK-45143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun reassigned SPARK-45143:
    Assignee: Ruifeng Zheng

> Make PySpark compatible with PyArrow 13.0.0
>
> Key: SPARK-45143
> URL: https://issues.apache.org/jira/browse/SPARK-45143
> Project: Spark
> Issue Type: Improvement
> Components: Connect, PySpark
> Affects Versions: 4.0.0
> Reporter: Ruifeng Zheng
> Assignee: Ruifeng Zheng
> Priority: Major
> Labels: pull-request-available
>
> https://github.com/apache/spark/actions/runs/6167186123/job/16738683872
>
> {code}
> FAIL [0.095s]: test_from_to_pandas (pyspark.pandas.tests.data_type_ops.test_datetime_ops.DatetimeOpsTests)
> Traceback (most recent call last):
>   File "/__w/spark/spark/python/pyspark/testing/pandasutils.py", line 122, in _assert_pandas_equal
>     assert_series_equal(
>   File "/usr/local/lib/python3.9/dist-packages/pandas/_testing/asserters.py", line 931, in assert_series_equal
>     assert_attr_equal("dtype", left, right, obj=f"Attributes of {obj}")
>   File "/usr/local/lib/python3.9/dist-packages/pandas/_testing/asserters.py", line 415, in assert_attr_equal
>     raise_assert_detail(obj, msg, left_attr, right_attr)
>   File "/usr/local/lib/python3.9/dist-packages/pandas/_testing/asserters.py", line 599, in raise_assert_detail
>     raise AssertionError(msg)
> AssertionError: Attributes of Series are different
> Attribute "dtype" are different
> [left]:  datetime64[ns]
> [right]: datetime64[us]
> {code}
[jira] [Updated] (SPARK-44699) Add logging for complete write events to file in EventLogFileWriter.closeWriter
[ https://issues.apache.org/jira/browse/SPARK-44699?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-44699: - Fix Version/s: (was: 3.5.0) > Add logging for complete write events to file in > EventLogFileWriter.closeWriter > --- > > Key: SPARK-44699 > URL: https://issues.apache.org/jira/browse/SPARK-44699 > Project: Spark > Issue Type: Task > Components: Spark Core >Affects Versions: 3.4.1 >Reporter: shuyouZZ >Priority: Major > > Sometimes we want to know when logging of events to the eventLog file has finished; > we should add a log message to make this clearer. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42252) Deprecate spark.shuffle.unsafe.file.output.buffer and add a new config
[ https://issues.apache.org/jira/browse/SPARK-42252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-42252: - Target Version/s: (was: 3.5.0) > Deprecate spark.shuffle.unsafe.file.output.buffer and add a new config > -- > > Key: SPARK-42252 > URL: https://issues.apache.org/jira/browse/SPARK-42252 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0, 3.1.0, 3.2.0, 3.3.0, 3.4.0 >Reporter: Wei Guo >Priority: Minor > > After Jira SPARK-28209 and PR > [25007|https://github.com/apache/spark/pull/25007], a new shuffle writer > API was proposed. All shuffle writers (BypassMergeSortShuffleWriter, > SortShuffleWriter, UnsafeShuffleWriter) are based on > LocalDiskShuffleMapOutputWriter to write local disk shuffle files. The config > spark.shuffle.unsafe.file.output.buffer used in > LocalDiskShuffleMapOutputWriter was previously used only in UnsafeShuffleWriter. > > It's better to rename it to something more suitable. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44307) Bloom filter is not added for left outer join if the left side table is smaller than broadcast threshold.
[ https://issues.apache.org/jira/browse/SPARK-44307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-44307: - Fix Version/s: (was: 3.5.0) > Bloom filter is not added for left outer join if the left side table is > smaller than broadcast threshold. > - > > Key: SPARK-44307 > URL: https://issues.apache.org/jira/browse/SPARK-44307 > Project: Spark > Issue Type: Bug > Components: Optimizer >Affects Versions: 3.4.1 >Reporter: mahesh kumar behera >Priority: Major > > In case of left outer join, even if the left side table is small enough to be > broadcast, shuffle join is used. This is because of the property of the > left outer join: if the left side is broadcast in a left outer join, the > result generated will be wrong. But this is not taken care of in the bloom > filter logic. While injecting the bloom filter, if the left side is smaller than > the broadcast threshold, the bloom filter is not added. It assumes that the left side > will be broadcast and there is no need for a bloom filter. This causes the bloom > filter optimization to be missed in case of a left outer join with a small left > side and a huge right-side table. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38945) simply KEYTAB and PRINCIPAL in KerberosConfDriverFeatureStep
[ https://issues.apache.org/jira/browse/SPARK-38945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-38945: - Fix Version/s: (was: 3.5.0) > simply KEYTAB and PRINCIPAL in KerberosConfDriverFeatureStep > > > Key: SPARK-38945 > URL: https://issues.apache.org/jira/browse/SPARK-38945 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.2.1 >Reporter: Qian Sun >Priority: Minor > > Simplify KEYTAB and PRINCIPAL in KerberosConfDriverFeatureStep, because they are already > imported -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43155) DataSourceV2 is hard to be implemented without following V1
[ https://issues.apache.org/jira/browse/SPARK-43155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-43155: - Target Version/s: (was: 3.5.0) > DataSourceV2 is hard to be implemented without following V1 > --- > > Key: SPARK-43155 > URL: https://issues.apache.org/jira/browse/SPARK-43155 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: PEIYUAN SUN >Priority: Major > Labels: features > Original Estimate: 672h > Remaining Estimate: 672h > > h1. Description > The current DataSourceV2 interface is considerably more complicated than in the > Spark 2.x versions. To implement a source under DataSourceV2, a user needs to learn > not only the V2 APIs and interfaces but also DataSourceV1 (as it is the > fallback version). > h2. Interface Gaps > There is no easy way, and there are no clear examples, of how to implement both for a > new data source. For example, the sources in the standard Spark repo such as orc, > parquet, and json have a FileFormat interface for V1, but these cannot be followed > because the SPI is hard-coded as `DefaultSource` instead > of dynamically loading a user-provided class from outside the Spark repo. > Different data sources do not strictly follow the same pattern in V1 and are not > well decoupled from the customized logic within them. > > h2. Loss of a simple layer over different kinds of data source > With the original V1 API, a user can easily implement a new wrapper on top of > orc/parquet with the Relation interface. DataSourceV2 here again becomes too low > level and hard to use in this case. > > h2. No explicit guidance > The functionality interfaces are not well organized, which forces the reader > to spend a lot of time understanding the commit history and existing patterns. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-41259) Spark-sql cli query results should correspond to schema
[ https://issues.apache.org/jira/browse/SPARK-41259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-41259: - Fix Version/s: (was: 3.5.0) > Spark-sql cli query results should correspond to schema > --- > > Key: SPARK-41259 > URL: https://issues.apache.org/jira/browse/SPARK-41259 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: yikaifei >Priority: Minor > > When using the spark-sql cli, Spark outputs only one column in the `show > tables` and `show views` commands to be compatible with Hive output, but the > output schema is still the three columns of Spark -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42252) Deprecate spark.shuffle.unsafe.file.output.buffer and add a new config
[ https://issues.apache.org/jira/browse/SPARK-42252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-42252: - Fix Version/s: (was: 3.5.0) > Deprecate spark.shuffle.unsafe.file.output.buffer and add a new config > -- > > Key: SPARK-42252 > URL: https://issues.apache.org/jira/browse/SPARK-42252 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0, 3.1.0, 3.2.0, 3.3.0, 3.4.0 >Reporter: Wei Guo >Priority: Minor > > After Jira SPARK-28209 and PR > [25007|https://github.com/apache/spark/pull/25007], a new shuffle writer > API was proposed. All shuffle writers (BypassMergeSortShuffleWriter, > SortShuffleWriter, UnsafeShuffleWriter) are based on > LocalDiskShuffleMapOutputWriter to write local disk shuffle files. The config > spark.shuffle.unsafe.file.output.buffer used in > LocalDiskShuffleMapOutputWriter was previously used only in UnsafeShuffleWriter. > > It's better to rename it to something more suitable. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37935) Migrate onto error classes
[ https://issues.apache.org/jira/browse/SPARK-37935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-37935: - Fix Version/s: (was: 3.5.0) > Migrate onto error classes > -- > > Key: SPARK-37935 > URL: https://issues.apache.org/jira/browse/SPARK-37935 > Project: Spark > Issue Type: Umbrella > Components: Spark Core, SQL >Affects Versions: 3.3.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > > The PR https://github.com/apache/spark/pull/32850 introduced error classes as > part of the error messages framework > (https://issues.apache.org/jira/browse/SPARK-33539). We need to migrate all > exceptions from QueryExecutionErrors, QueryCompilationErrors and > QueryParsingErrors onto the error classes using instances of SparkThrowable, > and carefully test every error class by writing tests in dedicated test > suites: > * QueryExecutionErrorsSuite for errors that occur during query > execution > * QueryCompilationErrorsSuite ... query compilation or eagerly executing > commands > * QueryParsingErrorsSuite ... parsing errors > Here is an example https://github.com/apache/spark/pull/35157 of how an > existing Java exception can be replaced and the related error > classes tested. At the end, we should migrate all exceptions from the files > Query.*Errors.scala and cover all error classes from the error-classes.json > file with tests. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
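As a rough illustration of the mechanism described above (error classes kept in a JSON-style registry, exceptions raised by class name plus message parameters rather than ad-hoc strings), here is a hypothetical Python miniature. The names `ERROR_CLASSES` and `SparkThrowableSketch` are illustrative only, not Spark's actual API:

```python
# A miniature registry standing in for error-classes.json: each error class
# maps to a parameterized message template.
ERROR_CLASSES = {
    "DIVIDE_BY_ZERO": "Division by zero.",
    "CAST_INVALID_INPUT": "The value {value} cannot be cast to {type}.",
}

class SparkThrowableSketch(Exception):
    """Illustrative stand-in for an exception carrying an error class."""
    def __init__(self, error_class, **params):
        self.error_class = error_class
        # Render the message from the registry template and parameters.
        message = ERROR_CLASSES[error_class].format(**params)
        super().__init__(message)

# Raising by class name keeps messages consistent and testable by class.
err = SparkThrowableSketch("CAST_INVALID_INPUT", value="'abc'", type="INT")
assert err.error_class == "CAST_INVALID_INPUT"
assert "'abc'" in str(err)
```

Testing per error class (rather than per message string) is what the dedicated suites mentioned above enable: a test asserts on the class and parameters, so message wording can evolve centrally.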
[jira] [Updated] (SPARK-39136) JDBCTable support properties
[ https://issues.apache.org/jira/browse/SPARK-39136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-39136: - Fix Version/s: (was: 3.5.0) > JDBCTable support properties > > > Key: SPARK-39136 > URL: https://issues.apache.org/jira/browse/SPARK-39136 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 3.3.0 >Reporter: angerszhu >Priority: Major > > {code:java} > > > > desc formatted jdbc.test.people; > NAME string > IDint > # Partitioning > Not partitioned > # Detailed Table Information > Name test.people > Table Properties [] > Time taken: 0.048 seconds, Fetched 9 row(s) > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-39892) Use ArrowType.Decimal(precision, scale, bitWidth) instead of ArrowType.Decimal(precision, scale)
[ https://issues.apache.org/jira/browse/SPARK-39892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-39892: - Fix Version/s: (was: 3.5.0) > Use ArrowType.Decimal(precision, scale, bitWidth) instead of > ArrowType.Decimal(precision, scale) > > > Key: SPARK-39892 > URL: https://issues.apache.org/jira/browse/SPARK-39892 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: BingKun Pan >Priority: Minor > > [warn] > /home/runner/work/spark/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/util/ArrowUtils.scala:48:49: > [deprecation @ org.apache.spark.sql.util.ArrowUtils.toArrowType | > origin=org.apache.arrow.vector.types.pojo.ArrowType.Decimal. | > version=] constructor Decimal in class Decimal is deprecated -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43155) DataSourceV2 is hard to be implemented without following V1
[ https://issues.apache.org/jira/browse/SPARK-43155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-43155: - Fix Version/s: (was: 3.5.0) > DataSourceV2 is hard to be implemented without following V1 > --- > > Key: SPARK-43155 > URL: https://issues.apache.org/jira/browse/SPARK-43155 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: PEIYUAN SUN >Priority: Major > Labels: features > Original Estimate: 672h > Remaining Estimate: 672h > > h1. Description > The current DataSourceV2 interface is considerably more complicated than in the > Spark 2.x versions. To implement a source under DataSourceV2, a user needs to learn > not only the V2 APIs and interfaces but also DataSourceV1 (as it is the > fallback version). > h2. Interface Gaps > There is no easy way, and there are no clear examples, of how to implement both for a > new data source. For example, the sources in the standard Spark repo such as orc, > parquet, and json have a FileFormat interface for V1, but these cannot be followed > because the SPI is hard-coded as `DefaultSource` instead > of dynamically loading a user-provided class from outside the Spark repo. > Different data sources do not strictly follow the same pattern in V1 and are not > well decoupled from the customized logic within them. > > h2. Loss of a simple layer over different kinds of data source > With the original V1 API, a user can easily implement a new wrapper on top of > orc/parquet with the Relation interface. DataSourceV2 here again becomes too low > level and hard to use in this case. > > h2. No explicit guidance > The functionality interfaces are not well organized, which forces the reader > to spend a lot of time understanding the commit history and existing patterns. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-39814) Use AmazonKinesisClientBuilder.withCredentials instead of new AmazonKinesisClient(credentials)
[ https://issues.apache.org/jira/browse/SPARK-39814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-39814: - Fix Version/s: (was: 3.5.0) > Use AmazonKinesisClientBuilder.withCredentials instead of new > AmazonKinesisClient(credentials) > -- > > Key: SPARK-39814 > URL: https://issues.apache.org/jira/browse/SPARK-39814 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.4.0 >Reporter: BingKun Pan >Priority: Minor > > [warn] > /home/runner/work/spark/spark/connector/kinesis-asl/src/main/scala/org/apache/spark/examples/streaming/KinesisWordCountASL.scala:108:25: > [deprecation @ > org.apache.spark.examples.streaming.KinesisWordCountASL.main.kinesisClient | > origin=com.amazonaws.services.kinesis.AmazonKinesisClient. | version=] > constructor AmazonKinesisClient in class AmazonKinesisClient is deprecated > [warn] > /home/runner/work/spark/spark/connector/kinesis-asl/src/main/scala/org/apache/spark/examples/streaming/KinesisWordCountASL.scala:224:25: > [deprecation @ > org.apache.spark.examples.streaming.KinesisWordProducerASL.generate.kinesisClient > | origin=com.amazonaws.services.kinesis.AmazonKinesisClient. | > version=] constructor AmazonKinesisClient in class AmazonKinesisClient is > deprecated > [warn] > /home/runner/work/spark/spark/connector/kinesis-asl/src/main/scala/org/apache/spark/streaming/kinesis/KinesisBackedBlockRDD.scala:142:24: > [deprecation @ > org.apache.spark.streaming.kinesis.KinesisSequenceRangeIterator.client | > origin=com.amazonaws.services.kinesis.AmazonKinesisClient. | version=] > constructor AmazonKinesisClient in class AmazonKinesisClient is deprecated > [warn] > /home/runner/work/spark/spark/connector/kinesis-asl/src/main/scala/org/apache/spark/streaming/kinesis/KinesisTestUtils.scala:58:18: > [deprecation @ > org.apache.spark.streaming.kinesis.KinesisTestUtils.kinesisClient.client | > origin=com.amazonaws.services.kinesis.AmazonKinesisClient. 
| version=] > constructor AmazonKinesisClient in class AmazonKinesisClient is deprecated -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44307) Bloom filter is not added for left outer join if the left side table is smaller than broadcast threshold.
[ https://issues.apache.org/jira/browse/SPARK-44307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-44307: - Target Version/s: (was: 3.4.1) > Bloom filter is not added for left outer join if the left side table is > smaller than broadcast threshold. > - > > Key: SPARK-44307 > URL: https://issues.apache.org/jira/browse/SPARK-44307 > Project: Spark > Issue Type: Bug > Components: Optimizer >Affects Versions: 3.4.1 >Reporter: mahesh kumar behera >Priority: Major > > In case of left outer join, even if the left side table is small enough to be > broadcast, shuffle join is used. This is because of the property of the > left outer join: if the left side is broadcast in a left outer join, the > result generated will be wrong. But this is not taken care of in the bloom > filter logic. While injecting the bloom filter, if the left side is smaller than > the broadcast threshold, the bloom filter is not added. It assumes that the left side > will be broadcast and there is no need for a bloom filter. This causes the bloom > filter optimization to be missed in case of a left outer join with a small left > side and a huge right-side table. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
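The flawed check described in this issue can be modeled in a few lines. This is an illustrative Python sketch of the decision logic, not Spark's actual optimizer code; `BROADCAST_THRESHOLD` and `should_inject_bloom_filter` are hypothetical names:

```python
BROADCAST_THRESHOLD = 10 * 1024 * 1024  # illustrative 10 MiB threshold

def should_inject_bloom_filter(creation_side_bytes, join_type):
    """Decide whether a runtime bloom filter is worth injecting.

    For join types where a small creation side would be broadcast anyway,
    a bloom filter adds nothing. For a LEFT OUTER join the left side can
    never be broadcast, so skipping the filter just because the left side
    is small (the bug described above) misses the optimization.
    """
    if join_type == "left_outer":
        # Left side cannot be broadcast here, so even a small left side
        # is worth building a filter from.
        return True
    # Where broadcast is possible, a side under the threshold makes the
    # bloom filter redundant.
    return creation_side_bytes > BROADCAST_THRESHOLD

# Small left side in a left outer join: the filter should still be injected.
assert should_inject_bloom_filter(1024, "left_outer")
# Small side in an inner join: a broadcast join is expected, no filter needed.
assert not should_inject_bloom_filter(1024, "inner")
```

The sketch shows only the size-vs-join-type interaction; the real optimizer also weighs selectivity and scan costs before injecting a filter.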
[jira] [Updated] (SPARK-43318) spark reader csv and json support wholetext parameters
[ https://issues.apache.org/jira/browse/SPARK-43318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-43318: - Fix Version/s: (was: 3.5.0) > spark reader csv and json support wholetext parameters > -- > > Key: SPARK-43318 > URL: https://issues.apache.org/jira/browse/SPARK-43318 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.5.0 >Reporter: melin >Priority: Major > > FTPInputStream used by Hadoop's FTPFileSystem does not support seek, so Spark's > HadoopFileLinesReader fails to read the file. > Support reading the entire file and then splitting it into lines, to avoid the read failure > > [https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/ftp/FTPInputStream.java] > > [~cloud_fan] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-45172) Upgrade commons-compress.version from 1.23.0 to 1.24.0
[ https://issues.apache.org/jira/browse/SPARK-45172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-45172. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 42934 [https://github.com/apache/spark/pull/42934] > Upgrade commons-compress.version from 1.23.0 to 1.24.0 > -- > > Key: SPARK-45172 > URL: https://issues.apache.org/jira/browse/SPARK-45172 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 4.0.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Minor > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-45172) Upgrade commons-compress.version from 1.23.0 to 1.24.0
[ https://issues.apache.org/jira/browse/SPARK-45172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-45172: - Assignee: Hyukjin Kwon > Upgrade commons-compress.version from 1.23.0 to 1.24.0 > -- > > Key: SPARK-45172 > URL: https://issues.apache.org/jira/browse/SPARK-45172 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 4.0.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Minor > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-45171) GenerateExec fails to initialize non-deterministic expressions before use
[ https://issues.apache.org/jira/browse/SPARK-45171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-45171: Assignee: Bruce Robbins > GenerateExec fails to initialize non-deterministic expressions before use > - > > Key: SPARK-45171 > URL: https://issues.apache.org/jira/browse/SPARK-45171 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.5.0 >Reporter: Bruce Robbins >Assignee: Bruce Robbins >Priority: Major > Labels: pull-request-available > > The following query fails: > {noformat} > select * > from explode( > transform(sequence(0, cast(rand()*1000 as int) + 1), x -> x * 22) > ); > {noformat} > The error is: > {noformat} > 23/09/14 09:27:25 ERROR Executor: Exception in task 0.0 in stage 3.0 (TID 3) > java.lang.IllegalArgumentException: requirement failed: Nondeterministic > expression org.apache.spark.sql.catalyst.expressions.Rand should be > initialized before eval. > at scala.Predef$.require(Predef.scala:281) > at > org.apache.spark.sql.catalyst.expressions.Nondeterministic.eval(Expression.scala:497) > at > org.apache.spark.sql.catalyst.expressions.Nondeterministic.eval$(Expression.scala:495) > at > org.apache.spark.sql.catalyst.expressions.RDG.eval(randomExpressions.scala:35) > at > org.apache.spark.sql.catalyst.expressions.BinaryArithmetic.eval(arithmetic.scala:384) > at > org.apache.spark.sql.catalyst.expressions.UnaryExpression.eval(Expression.scala:543) > at > org.apache.spark.sql.catalyst.expressions.BinaryArithmetic.eval(arithmetic.scala:384) > at > org.apache.spark.sql.catalyst.expressions.Sequence.eval(collectionOperations.scala:3062) > at > org.apache.spark.sql.catalyst.expressions.SimpleHigherOrderFunction.eval(higherOrderFunctions.scala:275) > at > org.apache.spark.sql.catalyst.expressions.SimpleHigherOrderFunction.eval$(higherOrderFunctions.scala:274) > at > org.apache.spark.sql.catalyst.expressions.ArrayTransform.eval(higherOrderFunctions.scala:308) > at > 
org.apache.spark.sql.catalyst.expressions.ExplodeBase.eval(generators.scala:375) > at > org.apache.spark.sql.execution.GenerateExec.$anonfun$doExecute$8(GenerateExec.scala:108) > ... > {noformat} > However, this query succeeds: > {noformat} > select * > from explode( > sequence(0, cast(rand()*1000 as int) + 1) > ); > {noformat} > The difference is that {{transform}} turns off whole-stage codegen, which > exposes a bug in {{GenerateExec}} where the non-deterministic expression > passed to the generator function is not initialized before being used. > An even simpler repro case is: > {noformat} > set spark.sql.codegen.wholeStage=false; > select explode(array(rand())); > {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-45171) GenerateExec fails to initialize non-deterministic expressions before use
[ https://issues.apache.org/jira/browse/SPARK-45171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-45171. -- Fix Version/s: 3.5.1 4.0.0 Resolution: Fixed Issue resolved by pull request 42933 [https://github.com/apache/spark/pull/42933] > GenerateExec fails to initialize non-deterministic expressions before use > - > > Key: SPARK-45171 > URL: https://issues.apache.org/jira/browse/SPARK-45171 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.5.0 >Reporter: Bruce Robbins >Assignee: Bruce Robbins >Priority: Major > Labels: pull-request-available > Fix For: 3.5.1, 4.0.0 > > > The following query fails: > {noformat} > select * > from explode( > transform(sequence(0, cast(rand()*1000 as int) + 1), x -> x * 22) > ); > {noformat} > The error is: > {noformat} > 23/09/14 09:27:25 ERROR Executor: Exception in task 0.0 in stage 3.0 (TID 3) > java.lang.IllegalArgumentException: requirement failed: Nondeterministic > expression org.apache.spark.sql.catalyst.expressions.Rand should be > initialized before eval. 
> at scala.Predef$.require(Predef.scala:281) > at > org.apache.spark.sql.catalyst.expressions.Nondeterministic.eval(Expression.scala:497) > at > org.apache.spark.sql.catalyst.expressions.Nondeterministic.eval$(Expression.scala:495) > at > org.apache.spark.sql.catalyst.expressions.RDG.eval(randomExpressions.scala:35) > at > org.apache.spark.sql.catalyst.expressions.BinaryArithmetic.eval(arithmetic.scala:384) > at > org.apache.spark.sql.catalyst.expressions.UnaryExpression.eval(Expression.scala:543) > at > org.apache.spark.sql.catalyst.expressions.BinaryArithmetic.eval(arithmetic.scala:384) > at > org.apache.spark.sql.catalyst.expressions.Sequence.eval(collectionOperations.scala:3062) > at > org.apache.spark.sql.catalyst.expressions.SimpleHigherOrderFunction.eval(higherOrderFunctions.scala:275) > at > org.apache.spark.sql.catalyst.expressions.SimpleHigherOrderFunction.eval$(higherOrderFunctions.scala:274) > at > org.apache.spark.sql.catalyst.expressions.ArrayTransform.eval(higherOrderFunctions.scala:308) > at > org.apache.spark.sql.catalyst.expressions.ExplodeBase.eval(generators.scala:375) > at > org.apache.spark.sql.execution.GenerateExec.$anonfun$doExecute$8(GenerateExec.scala:108) > ... > {noformat} > However, this query succeeds: > {noformat} > select * > from explode( > sequence(0, cast(rand()*1000 as int) + 1) > ); > {noformat} > The difference is that {{transform}} turns off whole-stage codegen, which > exposes a bug in {{GenerateExec}} where the non-deterministic expression > passed to the generator function is not initialized before being used. > An even simpler repro case is: > {noformat} > set spark.sql.codegen.wholeStage=false; > select explode(array(rand())); > {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
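The contract that the `require` at the top of the stack trace enforces (a non-deterministic expression must be initialized with a partition index before `eval`) can be modeled in a few lines. This is an illustrative Python sketch, loosely modeled on the names in the trace, showing why evaluating before initializing fails and why initializing first resolves it:

```python
import random

class NondeterministicSketch:
    """Illustrative model of an expression like rand(): its eval() is only
    valid after per-partition initialization, mirroring the Spark contract
    'Nondeterministic expression ... should be initialized before eval'."""

    def __init__(self):
        self._initialized = False

    def initialize(self, partition_index):
        # Seed per partition so a retried task reproduces the same stream.
        self._rng = random.Random(partition_index)
        self._initialized = True

    def eval(self):
        if not self._initialized:
            raise ValueError(
                "Nondeterministic expression should be initialized before eval."
            )
        return self._rng.random()

expr = NondeterministicSketch()
# Mirrors the bug: the interpreted path evaluated the expression without
# ever calling initialize(), so the guard fires.
try:
    expr.eval()
    raised = False
except ValueError:
    raised = True
assert raised

# The fix is to initialize before use, as codegen paths already did.
expr.initialize(partition_index=0)
assert 0.0 <= expr.eval() < 1.0
```

This is why the whole-stage-codegen path succeeded while the interpreted `GenerateExec` path failed: only the former performed the initialization step.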
[jira] [Resolved] (SPARK-45174) Support spark.deploy.maxDrivers
[ https://issues.apache.org/jira/browse/SPARK-45174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-45174. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 42936 [https://github.com/apache/spark/pull/42936] > Support spark.deploy.maxDrivers > --- > > Key: SPARK-45174 > URL: https://issues.apache.org/jira/browse/SPARK-45174 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > Like `spark.mesos.maxDrivers`, this issue aims to add > `spark.deploy.maxDrivers`. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-45174) Support spark.deploy.maxDrivers
[ https://issues.apache.org/jira/browse/SPARK-45174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-45174: - Assignee: Dongjoon Hyun > Support spark.deploy.maxDrivers > --- > > Key: SPARK-45174 > URL: https://issues.apache.org/jira/browse/SPARK-45174 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Labels: pull-request-available > > Like `spark.mesos.maxDrivers`, this issue aims to add > `spark.deploy.maxDrivers`. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-45165) Remove `inplace` parameter from `Categorical` APIs
[ https://issues.apache.org/jira/browse/SPARK-45165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-45165. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 42927 [https://github.com/apache/spark/pull/42927] > Remove `inplace` parameter from `Categorical` APIs > -- > > Key: SPARK-45165 > URL: https://issues.apache.org/jira/browse/SPARK-45165 > Project: Spark > Issue Type: Sub-task > Components: Pandas API on Spark >Affects Versions: 4.0.0 >Reporter: Haejoon Lee >Assignee: Haejoon Lee >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > `inplace` should be removed from CategoricalIndex APIs to match the pandas > behavior -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-45165) Remove `inplace` parameter from `Categorical` APIs
[ https://issues.apache.org/jira/browse/SPARK-45165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-45165: - Assignee: Haejoon Lee > Remove `inplace` parameter from `Categorical` APIs > -- > > Key: SPARK-45165 > URL: https://issues.apache.org/jira/browse/SPARK-45165 > Project: Spark > Issue Type: Sub-task > Components: Pandas API on Spark >Affects Versions: 4.0.0 >Reporter: Haejoon Lee >Assignee: Haejoon Lee >Priority: Major > Labels: pull-request-available > > `inplace` should be removed from CategoricalIndex APIs to match the pandas > behavior -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45176) AggregatingAccumulator with TypedImperativeAggregate throwing ClassCastException
[ https://issues.apache.org/jira/browse/SPARK-45176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Huw updated SPARK-45176: Description: Probably related to SPARK-39044. But potentially also this comment in Executor.scala. {quote}// TODO: do not serialize value twice val directResult = new DirectTaskResult(valueByteBuffer, accumUpdates, metricPeaks) {quote} The class cast exception I'm seeing is {quote} java.lang.ClassCastException: class [B cannot be cast to class org.apache.spark.sql.catalyst.expressions.aggregate.Reservoir {quote} But I've seen it with other aggregation buffers like QuantileSummaries as well. It's my belief that withBufferSerialized() for the AggregatingAccumulator is being called twice, leading to serializeAggregateBufferInPlace(buffer) also being called twice for an imperative aggregate; the second time round, the buffer is already a byte array and the asInstanceOf[T] in getBufferObject throws. This doesn't appear to happen on all runs, and it might only occur when there's a transient exception. I have a further suspicion that the cause might originate with {quote} SerializationDebugger.improveException {quote} which traverses the task and forces writeExternal to be called. Setting spark.serializer.extraDebugInfo to false seems to make things a bit more reliable (I haven't seen the error with that setting), and points strongly in that direction. Stack trace: {quote} Job aborted due to stage failure: Authorized committer (attemptNumber=0, stage=15, partition=10) failed; but task commit success, data duplication may happen.
reason=ExceptionFailure(java.io.IOException,java.lang.ClassCastException: class [B cannot be cast to class org.apache.spark.sql.catalyst.expressions.aggregate.Reservoir ([B is in module java.base of loader 'bootstrap'; org.apache.spark.sql.catalyst.expressions.aggregate.Reservoir is in unnamed module of loader 'app'),[Ljava.lang.StackTraceElement;@7fe2f462,java.io.IOException: java.lang.ClassCastException: class [B cannot be cast to class org.apache.spark.sql.catalyst.expressions.aggregate.Reservoir ([B is in module java.base of loader 'bootstrap'; org.apache.spark.sql.catalyst.expressions.aggregate.Reservoir is in unnamed module of loader 'app') at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1502) at org.apache.spark.scheduler.DirectTaskResult.writeExternal(TaskResult.scala:59) at java.base/java.io.ObjectOutputStream.writeExternalData(Unknown Source) at java.base/java.io.ObjectOutputStream.writeOrdinaryObject(Unknown Source) at java.base/java.io.ObjectOutputStream.writeObject0(Unknown Source) at java.base/java.io.ObjectOutputStream.writeObject(Unknown Source) at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46) at org.apache.spark.serializer.SerializerHelper$.serializeToChunkedBuffer(SerializerHelper.scala:42) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:643) at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) at java.base/java.lang.Thread.run(Unknown Source) Caused by: java.lang.ClassCastException: class [B cannot be cast to class org.apache.spark.sql.catalyst.expressions.aggregate.Reservoir ([B is in module java.base of loader 'bootstrap'; org.apache.spark.sql.catalyst.expressions.aggregate.Reservoir is in unnamed module of loader 'app') at org.apache.spark.sql.catalyst.expressions.aggregate.ReservoirSample.serialize(ReservoirSample.scala:33) at 
org.apache.spark.sql.catalyst.expressions.aggregate.TypedImperativeAggregate.serializeAggregateBufferInPlace(interfaces.scala:624) at org.apache.spark.sql.execution.AggregatingAccumulator.withBufferSerialized(AggregatingAccumulator.scala:206) at org.apache.spark.sql.execution.AggregatingAccumulator.withBufferSerialized(AggregatingAccumulator.scala:33) at org.apache.spark.util.AccumulatorV2.writeReplace(AccumulatorV2.scala:186) at jdk.internal.reflect.GeneratedMethodAccessor62.invoke(Unknown Source) at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source) at java.base/java.lang.reflect.Method.invoke(Unknown Source) at java.base/java.io.ObjectStreamClass.invokeWriteReplace(Unknown Source) at java.base/java.io.ObjectOutputStream.writeObject0(Unknown Source) at java.base/java.io.ObjectOutputStream.writeObject(Unknown Source) at org.apache.spark.scheduler.DirectTaskResult.$anonfun$writeExternal$2(TaskResult.scala:62) at org.apache.spark.scheduler.DirectTaskResult.$anonfun$writeExternal$2$adapted(TaskResult.scala:62) at scala.collection.immutable.Vector.foreach(Vector.scala:1856) at org.apache.spark.scheduler.DirectTaskResult.$anonfun$writeExternal$1(TaskResult.scala:62) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.scala:18) at org.apache.spark.util.Utils$.tryOrIO
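The failure mode described above (in-place buffer serialization running twice, the second pass "casting" a byte array back to the typed buffer) can be modeled with a minimal sketch. This is a Python stand-in with illustrative names, not Spark's real Scala API:

```python
import pickle


class AggregatingAccumulatorSketch:
    """Toy model of the reported pattern: with_buffer_serialized replaces
    the typed aggregation buffer with its serialized bytes *in place*.
    If the task result is serialized twice, the second pass finds bytes
    where it expects the typed object, mirroring
    'class [B cannot be cast to class ...Reservoir'."""

    def __init__(self):
        self.buffer = {"reservoir": [1, 2, 3]}  # typed buffer object

    def with_buffer_serialized(self):
        if isinstance(self.buffer, bytes):
            # Stand-in for asInstanceOf[T] failing on a byte array.
            raise TypeError("buffer is already serialized bytes")
        self.buffer = pickle.dumps(self.buffer)
        return self


acc = AggregatingAccumulatorSketch()
acc.with_buffer_serialized()           # first serialization succeeds
try:
    acc.with_buffer_serialized()       # second pass hits the "cast" failure
    double_serialize_failed = False
except TypeError:
    double_serialize_failed = True
```

An idempotence guard (returning early when the buffer is already bytes) would make the second call harmless, which is one plausible shape of a fix, assuming the double invocation itself is hard to eliminate.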
[jira] [Created] (SPARK-45176) AggregatingAccumulator with TypedImperativeAggregate throwing ClassCastException
Huw created SPARK-45176: --- Summary: AggregatingAccumulator with TypedImperativeAggregate throwing ClassCastException Key: SPARK-45176 URL: https://issues.apache.org/jira/browse/SPARK-45176 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.4.1, 3.4.0 Reporter: Huw Probably related to SPARK-39044. But potentially also this comment in Executor.scala. // TODO: do not serialize value twice val directResult = new DirectTaskResult(valueByteBuffer, accumUpdates, metricPeaks) The class cast exception I'm seeing is java.lang.ClassCastException: class [B cannot be cast to class org.apache.spark.sql.catalyst.expressions.aggregate.Reservoir But I've seen it with other aggregation buffers like QuantileSummaries as well. It's my belief that withBufferSerialized() for the AggregatingAccumulator is being called twice, leading to serializeAggregateBufferInPlace(buffer) also being called twice for an Imperative aggregate; the second time round, the buffer is already a byte array and the asInstanceOf[T] in getBufferObject is throwing. This doesn't appear to happen on all runs, and it may only be occurring when there's a transient exception. I have a further suspicion that the cause might originate with SerializationDebugger.improveException which is traversing the task and forcing writeExternal to be called. Setting |spark.serializer.extraDebugInfo|false| seems to make things a bit more reliable (I haven't seen the error while this setting is on), and points strongly in that direction. Stack trace: Job aborted due to stage failure: Authorized committer (attemptNumber=0, stage=15, partition=10) failed; but task commit success, data duplication may happen. 
reason=ExceptionFailure(java.io.IOException,java.lang.ClassCastException: class [B cannot be cast to class org.apache.spark.sql.catalyst.expressions.aggregate.Reservoir ([B is in module java.base of loader 'bootstrap'; org.apache.spark.sql.catalyst.expressions.aggregate.Reservoir is in unnamed module of loader 'app'),[Ljava.lang.StackTraceElement;@7fe2f462,java.io.IOException: java.lang.ClassCastException: class [B cannot be cast to class org.apache.spark.sql.catalyst.expressions.aggregate.Reservoir ([B is in module java.base of loader 'bootstrap'; org.apache.spark.sql.catalyst.expressions.aggregate.Reservoir is in unnamed module of loader 'app') at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1502) at org.apache.spark.scheduler.DirectTaskResult.writeExternal(TaskResult.scala:59) at java.base/java.io.ObjectOutputStream.writeExternalData(Unknown Source) at java.base/java.io.ObjectOutputStream.writeOrdinaryObject(Unknown Source) at java.base/java.io.ObjectOutputStream.writeObject0(Unknown Source) at java.base/java.io.ObjectOutputStream.writeObject(Unknown Source) at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46) at org.apache.spark.serializer.SerializerHelper$.serializeToChunkedBuffer(SerializerHelper.scala:42) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:643) at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) at java.base/java.lang.Thread.run(Unknown Source) Caused by: java.lang.ClassCastException: class [B cannot be cast to class org.apache.spark.sql.catalyst.expressions.aggregate.Reservoir ([B is in module java.base of loader 'bootstrap'; org.apache.spark.sql.catalyst.expressions.aggregate.Reservoir is in unnamed module of loader 'app') at org.apache.spark.sql.catalyst.expressions.aggregate.ReservoirSample.serialize(ReservoirSample.scala:33) at 
org.apache.spark.sql.catalyst.expressions.aggregate.TypedImperativeAggregate.serializeAggregateBufferInPlace(interfaces.scala:624) at org.apache.spark.sql.execution.AggregatingAccumulator.withBufferSerialized(AggregatingAccumulator.scala:206) at org.apache.spark.sql.execution.AggregatingAccumulator.withBufferSerialized(AggregatingAccumulator.scala:33) at org.apache.spark.util.AccumulatorV2.writeReplace(AccumulatorV2.scala:186) at jdk.internal.reflect.GeneratedMethodAccessor62.invoke(Unknown Source) at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source) at java.base/java.lang.reflect.Method.invoke(Unknown Source) at java.base/java.io.ObjectStreamClass.invokeWriteReplace(Unknown Source) at java.base/java.io.ObjectOutputStream.writeObject0(Unknown Source) at java.base/java.io.ObjectOutputStream.writeObject(Unknown Source) at org.apache.spark.scheduler.DirectTaskResult.$anonfun$writeExternal$2(TaskResult.scala:62) at org.apa
[jira] [Created] (SPARK-45175) download krb5.conf from remote storage in spark-submit on k8s
Qian Sun created SPARK-45175: Summary: download krb5.conf from remote storage in spark-submit on k8s Key: SPARK-45175 URL: https://issues.apache.org/jira/browse/SPARK-45175 Project: Spark Issue Type: Improvement Components: Kubernetes Affects Versions: 3.4.1 Reporter: Qian Sun krb5.conf currently supports only local files. Tenants would like to store this file on their own servers and download it during the spark-submit phase, to better support multi-tenant scenarios. The proposed solution is to use the *downloadFile* function [1], similar to the handling of *spark.kubernetes.driver/executor.podTemplateFile* [1]https://github.com/apache/spark/blob/822f58f0d26b7d760469151a65eaf9ee863a07a1/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/PodTemplateConfigMapStep.scala#L82C24-L82C24 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
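The proposed flow (fetch a remote krb5.conf to a local path during submission, so the existing local-file code path can consume it) can be sketched as follows. This is a hypothetical Python helper for illustration only; the actual proposal would reuse Spark's Scala `downloadFile`/`KubernetesUtils` machinery. The demo uses a `file://` URL as a stand-in for a tenant's remote storage:

```python
import os
import tempfile
import urllib.request


def download_remote_conf(url: str, dest_dir: str, name: str = "krb5.conf") -> str:
    """Hypothetical helper: fetch a krb5.conf from remote storage to a
    local path during the spark-submit phase, returning the local path
    that would then be passed on as if it were a local file."""
    dest = os.path.join(dest_dir, name)
    urllib.request.urlretrieve(url, dest)
    return dest


# Demo: a file:// URL stands in for the tenant's remote storage server.
src = tempfile.NamedTemporaryFile(mode="w", suffix=".conf", delete=False)
src.write("[libdefaults]\n default_realm = EXAMPLE.COM\n")
src.close()
local_path = download_remote_conf("file://" + src.name, tempfile.mkdtemp())
```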
[jira] [Created] (SPARK-45174) Support spark.deploy.maxDrivers
Dongjoon Hyun created SPARK-45174: - Summary: Support spark.deploy.maxDrivers Key: SPARK-45174 URL: https://issues.apache.org/jira/browse/SPARK-45174 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 4.0.0 Reporter: Dongjoon Hyun Like `spark.mesos.maxDrivers`, this issue aims to add `spark.deploy.maxDrivers`. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45174) Support spark.deploy.maxDrivers
[ https://issues.apache.org/jira/browse/SPARK-45174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-45174: --- Labels: pull-request-available (was: ) > Support spark.deploy.maxDrivers > --- > > Key: SPARK-45174 > URL: https://issues.apache.org/jira/browse/SPARK-45174 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Priority: Major > Labels: pull-request-available > > Like `spark.mesos.maxDrivers`, this issue aims to add > `spark.deploy.maxDrivers`. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45173) Remove some unnecessary sourceMapping files in UI
[ https://issues.apache.org/jira/browse/SPARK-45173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-45173: --- Labels: pull-request-available (was: ) > Remove some unnecessary sourceMapping files in UI > - > > Key: SPARK-45173 > URL: https://issues.apache.org/jira/browse/SPARK-45173 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 4.0.0 >Reporter: Kent Yao >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-45173) Remove some unnecessary sourceMapping files in UI
Kent Yao created SPARK-45173: Summary: Remove some unnecessary sourceMapping files in UI Key: SPARK-45173 URL: https://issues.apache.org/jira/browse/SPARK-45173 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 4.0.0 Reporter: Kent Yao -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-45159) Handle named arguments only when necessary
[ https://issues.apache.org/jira/browse/SPARK-45159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-45159. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 42915 [https://github.com/apache/spark/pull/42915] > Handle named arguments only when necessary > -- > > Key: SPARK-45159 > URL: https://issues.apache.org/jira/browse/SPARK-45159 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 4.0.0 >Reporter: Takuya Ueshin >Assignee: Takuya Ueshin >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-45159) Handle named arguments only when necessary
[ https://issues.apache.org/jira/browse/SPARK-45159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-45159: Assignee: Takuya Ueshin > Handle named arguments only when necessary > -- > > Key: SPARK-45159 > URL: https://issues.apache.org/jira/browse/SPARK-45159 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 4.0.0 >Reporter: Takuya Ueshin >Assignee: Takuya Ueshin >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-44752) XML: Update Spark Docs
[ https://issues.apache.org/jira/browse/SPARK-44752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17765419#comment-17765419 ] tangjiafu commented on SPARK-44752: --- I have used Spark XML in my project before, and I think I can do some testing and complete this PR. Can you assign this PR to me? This is my 'good first issue' for Spark > XML: Update Spark Docs > -- > > Key: SPARK-44752 > URL: https://issues.apache.org/jira/browse/SPARK-44752 > Project: Spark > Issue Type: Sub-task > Components: Documentation >Affects Versions: 4.0.0 >Reporter: Sandip Agarwala >Priority: Major > > [https://spark.apache.org/docs/latest/sql-data-sources.html] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-45084) ProgressReport should include an accurate effective shuffle partition number
[ https://issues.apache.org/jira/browse/SPARK-45084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim resolved SPARK-45084. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 42822 [https://github.com/apache/spark/pull/42822] > ProgressReport should include an accurate effective shuffle partition number > > > Key: SPARK-45084 > URL: https://issues.apache.org/jira/browse/SPARK-45084 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.4.2 >Reporter: Siying Dong >Assignee: Siying Dong >Priority: Minor > Labels: pull-request-available > Fix For: 4.0.0 > > > Currently, there is a numShufflePartitions "metric" reported in the > StateOperatorProgress part of the progress report. However, the number is > reported by aggregating over executors, so in the case of a task retry or a speculative > executor, the metric is higher than the number of shuffle partitions for the > query plan. The number of shuffle partitions can be useful for reporting purposes, > so having an accurate metric is helpful. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
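The over-counting described above is easy to see numerically. A minimal sketch (illustrative numbers, not real Spark metrics): when per-executor partition counts are summed, a retried task's partitions are counted twice, so the reported total exceeds the plan's shuffle partition number:

```python
# Per-executor numShufflePartitions reports for one stage. "exec1-retry"
# reprocessed 20 partitions that exec1 already reported (illustrative data).
reports = {"exec1": 100, "exec2": 100, "exec1-retry": 20}

aggregated = sum(reports.values())   # naive sum over executors: 220
plan_partitions = 200                # the query plan's shuffle partition number

# The summed metric is inflated relative to the plan whenever retries or
# speculative executors double-count work.
inflated = aggregated > plan_partitions
```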
[jira] [Assigned] (SPARK-45084) ProgressReport should include an accurate effective shuffle partition number
[ https://issues.apache.org/jira/browse/SPARK-45084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim reassigned SPARK-45084: Assignee: Siying Dong > ProgressReport should include an accurate effective shuffle partition number > > > Key: SPARK-45084 > URL: https://issues.apache.org/jira/browse/SPARK-45084 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.4.2 >Reporter: Siying Dong >Assignee: Siying Dong >Priority: Minor > Labels: pull-request-available > > Currently, there is a numShufflePartitions "metric" reported in the > StateOperatorProgress part of the progress report. However, the number is > reported by aggregating over executors, so in the case of a task retry or a speculative > executor, the metric is higher than the number of shuffle partitions for the > query plan. The number of shuffle partitions can be useful for reporting purposes, > so having an accurate metric is helpful. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-43406) enable spark sql to drop multiple partitions in one call
[ https://issues.apache.org/jira/browse/SPARK-43406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang resolved SPARK-43406. - Resolution: Duplicate > enable spark sql to drop multiple partitions in one call > > > Key: SPARK-43406 > URL: https://issues.apache.org/jira/browse/SPARK-43406 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.1, 3.3.2, 3.4.0 >Reporter: chenruotao >Priority: Major > > Currently Spark SQL cannot drop multiple partitions in one call; this patch fixes that. > With this patch we can drop multiple partitions like this: > alter table test.table_partition drop partition(dt<='2023-04-02', > dt>='2023-03-31') -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43406) enable spark sql to drop multiple partitions in one call
[ https://issues.apache.org/jira/browse/SPARK-43406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-43406: Target Version/s: (was: 4.0.0) > enable spark sql to drop multiple partitions in one call > > > Key: SPARK-43406 > URL: https://issues.apache.org/jira/browse/SPARK-43406 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.1, 3.3.2, 3.4.0 >Reporter: chenruotao >Priority: Major > > Currently Spark SQL cannot drop multiple partitions in one call; this patch fixes that. > With this patch we can drop multiple partitions like this: > alter table test.table_partition drop partition(dt<='2023-04-02', > dt>='2023-03-31') -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43406) enable spark sql to drop multiple partitions in one call
[ https://issues.apache.org/jira/browse/SPARK-43406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-43406: Fix Version/s: (was: 3.5.0) > enable spark sql to drop multiple partitions in one call > > > Key: SPARK-43406 > URL: https://issues.apache.org/jira/browse/SPARK-43406 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.1, 3.3.2, 3.4.0 >Reporter: chenruotao >Priority: Major > > Currently Spark SQL cannot drop multiple partitions in one call; this patch fixes that. > With this patch we can drop multiple partitions like this: > alter table test.table_partition drop partition(dt<='2023-04-02', > dt>='2023-03-31') -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43406) enable spark sql to drop multiple partitions in one call
[ https://issues.apache.org/jira/browse/SPARK-43406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-43406: Target Version/s: 4.0.0 > enable spark sql to drop multiple partitions in one call > > > Key: SPARK-43406 > URL: https://issues.apache.org/jira/browse/SPARK-43406 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.1, 3.3.2, 3.4.0 >Reporter: chenruotao >Priority: Major > > Currently Spark SQL cannot drop multiple partitions in one call; this patch fixes that. > With this patch we can drop multiple partitions like this: > alter table test.table_partition drop partition(dt<='2023-04-02', > dt>='2023-03-31') -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] (SPARK-37487) CollectMetrics is executed twice if it is followed by a sort
[ https://issues.apache.org/jira/browse/SPARK-37487 ] Huw deleted comment on SPARK-37487: - was (Author: JIRAUSER288917): I think I've seen crashes because of this in production. I can't reproduce locally, but I believe that Imperative aggregates are having their `serializeAggregateBufferInPlace` function called twice, the second time it's doing an unsafe coerce onto a byte buffer. {quote}Caused by: java.lang.ClassCastException: class [B cannot be cast to class org.apache.spark.sql.catalyst.expressions.aggregate.ApproximatePercentile$PercentileDigest ([B is in module java.base of loader 'bootstrap'; org.apache.spark.sql.catalyst.expressions.aggregate.ApproximatePercentile$PercentileDigest is in unnamed module of loader 'app') at org.apache.spark.sql.catalyst.expressions.aggregate.ApproxQuantiles.serialize(ApproxQuantiles.scala:19) at org.apache.spark.sql.catalyst.expressions.aggregate.TypedImperativeAggregate.serializeAggregateBufferInPlace(interfaces.scala:624) at org.apache.spark.sql.execution.AggregatingAccumulator.withBufferSerialized(AggregatingAccumulator.scala:206) at org.apache.spark.sql.execution.AggregatingAccumulator.withBufferSerialized(AggregatingAccumulator.scala:33){quote} > CollectMetrics is executed twice if it is followed by a sort > > > Key: SPARK-37487 > URL: https://issues.apache.org/jira/browse/SPARK-37487 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0 >Reporter: Tanel Kiis >Priority: Major > Labels: correctness > > It is best exemplified by this new UT in DataFrameCallbackSuite: > {code} > test("SPARK-37487: get observable metrics with sort by callback") { > val df = spark.range(100) > .observe( > name = "my_event", > min($"id").as("min_val"), > max($"id").as("max_val"), > // Test unresolved alias > sum($"id"), > count(when($"id" % 2 === 0, 1)).as("num_even")) > .observe( > name = "other_event", > avg($"id").cast("int").as("avg_val")) > .sort($"id".desc) > validateObservedMetrics(df) > } > {code} > The count and 
sum aggregates report twice the number of rows: > {code} > [info] - SPARK-37487: get observable metrics with sort by callback *** FAILED > *** (169 milliseconds) > [info] [0,99,9900,100] did not equal [0,99,4950,50] > (DataFrameCallbackSuite.scala:342) > [info] org.scalatest.exceptions.TestFailedException: > [info] at > org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472) > [info] at > org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:471) > [info] at > org.scalatest.Assertions$.newAssertionFailedException(Assertions.scala:1231) > [info] at > org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:1295) > [info] at > org.apache.spark.sql.util.DataFrameCallbackSuite.checkMetrics$1(DataFrameCallbackSuite.scala:342) > [info] at > org.apache.spark.sql.util.DataFrameCallbackSuite.validateObservedMetrics(DataFrameCallbackSuite.scala:350) > [info] at > org.apache.spark.sql.util.DataFrameCallbackSuite.$anonfun$new$21(DataFrameCallbackSuite.scala:324) > [info] at > scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) > [info] at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85) > [info] at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83) > [info] at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > [info] at org.scalatest.Transformer.apply(Transformer.scala:22) > [info] at org.scalatest.Transformer.apply(Transformer.scala:20) > {code} > I could not figure out how this happens. Hopefully the UT can help with > debugging -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42466) spark.kubernetes.file.upload.path not deleting files under HDFS after job completes
[ https://issues.apache.org/jira/browse/SPARK-42466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-42466: --- Labels: pull-request-available (was: ) > spark.kubernetes.file.upload.path not deleting files under HDFS after job > completes > --- > > Key: SPARK-42466 > URL: https://issues.apache.org/jira/browse/SPARK-42466 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 3.2.0, 3.3.2 >Reporter: Jagadeeswara Rao >Priority: Major > Labels: pull-request-available > > In cluster mode, after uploading files to the HDFS location given by the > spark.kubernetes.file.upload.path property, the files are not getting cleared. > The file is successfully uploaded to an HDFS location of the form > spark-upload-[randomUUID] when {{KubernetesUtils}} is requested to > uploadFileUri. > [https://github.com/apache/spark/blob/76a134ade60a9f354aca01eaca0b2e2477c6bd43/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/KubernetesUtils.scala#L310] > The following is the driver log; the driver completed successfully but the shutdown hook > did not clear the HDFS files. > {code:java} > 23/02/16 18:06:56 INFO KubernetesClusterSchedulerBackend: Shutting down all > executors > 23/02/16 18:06:56 INFO > KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Asking each > executor to shut down > 23/02/16 18:06:56 WARN ExecutorPodsWatchSnapshotSource: Kubernetes client has > been closed. > 23/02/16 18:06:57 INFO MapOutputTrackerMasterEndpoint: > MapOutputTrackerMasterEndpoint stopped! > 23/02/16 18:06:57 INFO MemoryStore: MemoryStore cleared > 23/02/16 18:06:57 INFO BlockManager: BlockManager stopped > 23/02/16 18:06:57 INFO BlockManagerMaster: BlockManagerMaster stopped > 23/02/16 18:06:57 INFO > OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: > OutputCommitCoordinator stopped! 
> 23/02/16 18:06:57 INFO SparkContext: Successfully stopped SparkContext > 23/02/16 18:06:57 INFO ShutdownHookManager: Shutdown hook called > 23/02/16 18:06:57 INFO ShutdownHookManager: Deleting directory > /tmp/spark-efb8f725-4ead-4729-a8e0-f478280121b7 > 23/02/16 18:06:57 INFO ShutdownHookManager: Deleting directory > /spark-local2/spark-66dbf7e6-fe7e-4655-8724-69d76d93fc1f > 23/02/16 18:06:57 INFO ShutdownHookManager: Deleting directory > /spark-local1/spark-53aefaee-58a5-4fce-b5b0-5e29f42e337f{code} > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
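The missing behavior (the shutdown hook deletes local temp dirs but leaves the remote spark-upload-* staging directories) can be sketched as a small cleanup routine. This is a hypothetical Python illustration against a local directory standing in for the HDFS upload path; a real fix would go through Hadoop's FileSystem API from Scala:

```python
import os
import shutil
import tempfile


def cleanup_upload_dirs(base_dir: str, prefix: str = "spark-upload-") -> list:
    """Delete staging directories matching the spark-upload-[randomUUID]
    naming pattern under base_dir; a sketch of the cleanup the issue
    reports the shutdown hook does not perform for HDFS uploads."""
    removed = []
    for name in os.listdir(base_dir):
        if name.startswith(prefix):
            shutil.rmtree(os.path.join(base_dir, name))
            removed.append(name)
    return removed


# Demo: a local temp dir stands in for spark.kubernetes.file.upload.path.
base = tempfile.mkdtemp()
os.makedirs(os.path.join(base, "spark-upload-1234"))
os.makedirs(os.path.join(base, "keep-me"))          # unrelated dir survives
removed = cleanup_upload_dirs(base)
```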
[jira] [Updated] (SPARK-16484) Incremental Cardinality estimation operations with Hyperloglog
[ https://issues.apache.org/jira/browse/SPARK-16484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-16484: --- Labels: bulk-closed pull-request-available (was: bulk-closed) > Incremental Cardinality estimation operations with Hyperloglog > -- > > Key: SPARK-16484 > URL: https://issues.apache.org/jira/browse/SPARK-16484 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Yongjia Wang >Assignee: Ryan Berti >Priority: Major > Labels: bulk-closed, pull-request-available > Fix For: 3.5.0 > > > Efficient cardinality estimation is very important, and SparkSQL has had > approxCountDistinct based on Hyperloglog for quite some time. However, there > isn't a way to do incremental estimation. For example, if we want to get > updated distinct counts of the last 90 days, we need to do the aggregation > for the entire window over and over again. The more efficient way involves > serializing the counter for smaller time windows (such as hourly) so the > counts can be efficiently updated in an incremental fashion for any time > window. > With the support of custom UDAF, Binary DataType and the HyperloglogPlusPlus > implementation in the current Spark version, it's easy enough to extend the > functionality to include incremental counting, and even other general set > operations such as intersection and set difference. Spark API is already as > elegant as it can be, but it still takes quite some effort to do a custom > implementation of the aforementioned operations which are supposed to be in > high demand. I have been searching but failed to find a usable existing > solution or any ongoing effort for this. The closest I got is the following > but it does not work with Spark 1.6 due to API changes. > https://github.com/collectivemedia/spark-hyperloglog/blob/master/src/main/scala/org/apache/spark/sql/hyperloglog/aggregates.scala > I wonder if it is worth integrating such operations into SparkSQL. 
The only > problem I see is that it depends on the serialization of a specific HLL implementation > and introduces compatibility issues. But as long as the user is aware of such > issues, it should be fine. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
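The incremental scheme described above (serialize a per-hour counter, then merge stored counters for any window without rescanning raw data) can be sketched in Python. Exact sets stand in for HLL sketches here, purely to show the merge/cardinality shape; a real implementation would serialize a fixed-size HLL++ register array, whose merge is likewise a pointwise combine:

```python
import pickle


def make_sketch(values):
    # Stand-in for an HLL++ sketch: an exact set. Real sketches are
    # approximate and fixed-size, but expose the same make/merge/count API.
    return pickle.dumps(set(values))


def merge_sketches(*blobs):
    # Set union corresponds to HLL merge: a window's count comes from
    # stored hourly sketches, not from re-aggregating the raw events.
    merged = set()
    for blob in blobs:
        merged |= pickle.loads(blob)
    return pickle.dumps(merged)


def cardinality(blob):
    return len(pickle.loads(blob))


# Hourly sketches rolled up into a larger window incrementally.
hour1 = make_sketch(["u1", "u2"])
hour2 = make_sketch(["u2", "u3"])
window = merge_sketches(hour1, hour2)
```

Note that union merges are lossless for HLL; the intersection and set-difference operations the reporter mentions are only obtainable indirectly (e.g. via inclusion-exclusion) and carry larger error.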
[jira] [Updated] (SPARK-45172) Upgrade commons-compress.version from 1.23.0 to 1.24.0
[ https://issues.apache.org/jira/browse/SPARK-45172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-45172: --- Labels: pull-request-available (was: ) > Upgrade commons-compress.version from 1.23.0 to 1.24.0 > -- > > Key: SPARK-45172 > URL: https://issues.apache.org/jira/browse/SPARK-45172 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 4.0.0 >Reporter: Hyukjin Kwon >Priority: Minor > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45172) Upgrade commons-compress.version from 1.23.0 to 1.24.0
[ https://issues.apache.org/jira/browse/SPARK-45172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-45172: - Summary: Upgrade commons-compress.version from 1.23.0 to 1.24.0 (was: Upgrade commons-compress.version from 1.23.0 to .124.0) > Upgrade commons-compress.version from 1.23.0 to 1.24.0 > -- > > Key: SPARK-45172 > URL: https://issues.apache.org/jira/browse/SPARK-45172 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 4.0.0 >Reporter: Hyukjin Kwon >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-45172) Upgrade commons-compress.version from 1.23.0 to .124.0
Hyukjin Kwon created SPARK-45172: Summary: Upgrade commons-compress.version from 1.23.0 to .124.0 Key: SPARK-45172 URL: https://issues.apache.org/jira/browse/SPARK-45172 Project: Spark Issue Type: Bug Components: Build Affects Versions: 4.0.0 Reporter: Hyukjin Kwon -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45172) Upgrade commons-compress.version from 1.23.0 to .124.0
[ https://issues.apache.org/jira/browse/SPARK-45172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-45172: - Issue Type: Improvement (was: Bug) > Upgrade commons-compress.version from 1.23.0 to .124.0 > -- > > Key: SPARK-45172 > URL: https://issues.apache.org/jira/browse/SPARK-45172 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 4.0.0 >Reporter: Hyukjin Kwon >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45171) GenerateExec fails to initialize non-deterministic expressions before use
[ https://issues.apache.org/jira/browse/SPARK-45171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-45171: --- Labels: pull-request-available (was: ) > GenerateExec fails to initialize non-deterministic expressions before use > - > > Key: SPARK-45171 > URL: https://issues.apache.org/jira/browse/SPARK-45171 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.5.0 >Reporter: Bruce Robbins >Priority: Major > Labels: pull-request-available > > The following query fails: > {noformat} > select * > from explode( > transform(sequence(0, cast(rand()*1000 as int) + 1), x -> x * 22) > ); > {noformat} > The error is: > {noformat} > 23/09/14 09:27:25 ERROR Executor: Exception in task 0.0 in stage 3.0 (TID 3) > java.lang.IllegalArgumentException: requirement failed: Nondeterministic > expression org.apache.spark.sql.catalyst.expressions.Rand should be > initialized before eval. > at scala.Predef$.require(Predef.scala:281) > at > org.apache.spark.sql.catalyst.expressions.Nondeterministic.eval(Expression.scala:497) > at > org.apache.spark.sql.catalyst.expressions.Nondeterministic.eval$(Expression.scala:495) > at > org.apache.spark.sql.catalyst.expressions.RDG.eval(randomExpressions.scala:35) > at > org.apache.spark.sql.catalyst.expressions.BinaryArithmetic.eval(arithmetic.scala:384) > at > org.apache.spark.sql.catalyst.expressions.UnaryExpression.eval(Expression.scala:543) > at > org.apache.spark.sql.catalyst.expressions.BinaryArithmetic.eval(arithmetic.scala:384) > at > org.apache.spark.sql.catalyst.expressions.Sequence.eval(collectionOperations.scala:3062) > at > org.apache.spark.sql.catalyst.expressions.SimpleHigherOrderFunction.eval(higherOrderFunctions.scala:275) > at > org.apache.spark.sql.catalyst.expressions.SimpleHigherOrderFunction.eval$(higherOrderFunctions.scala:274) > at > org.apache.spark.sql.catalyst.expressions.ArrayTransform.eval(higherOrderFunctions.scala:308) > at > 
org.apache.spark.sql.catalyst.expressions.ExplodeBase.eval(generators.scala:375) > at > org.apache.spark.sql.execution.GenerateExec.$anonfun$doExecute$8(GenerateExec.scala:108) > ... > {noformat} > However, this query succeeds: > {noformat} > select * > from explode( > sequence(0, cast(rand()*1000 as int) + 1) > ); > {noformat} > The difference is that {{transform}} turns off whole-stage codegen, which > exposes a bug in {{GenerateExec}} where the non-deterministic expression > passed to the generator function is not initialized before being used. > An even simpler repro case is: > {noformat} > set spark.sql.codegen.wholeStage=false; > select explode(array(rand())); > {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-45161) Bump `previousSparkVersion` to 3.5.0
[ https://issues.apache.org/jira/browse/SPARK-45161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-45161: - Assignee: Yang Jie > Bump `previousSparkVersion` to 3.5.0 > > > Key: SPARK-45161 > URL: https://issues.apache.org/jira/browse/SPARK-45161 > Project: Spark > Issue Type: Task > Components: Project Infra >Affects Versions: 4.0.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-45161) Bump `previousSparkVersion` to 3.5.0
[ https://issues.apache.org/jira/browse/SPARK-45161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-45161. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 42921 [https://github.com/apache/spark/pull/42921] > Bump `previousSparkVersion` to 3.5.0 > > > Key: SPARK-45161 > URL: https://issues.apache.org/jira/browse/SPARK-45161 > Project: Spark > Issue Type: Task > Components: Project Infra >Affects Versions: 4.0.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45137) Unsupported map and array constructors by `sql()` in connect clients
[ https://issues.apache.org/jira/browse/SPARK-45137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-45137: --- Labels: pull-request-available (was: ) > Unsupported map and array constructors by `sql()` in connect clients > > > Key: SPARK-45137 > URL: https://issues.apache.org/jira/browse/SPARK-45137 > Project: Spark > Issue Type: Bug > Components: Connect >Affects Versions: 4.0.0 >Reporter: Max Gekk >Priority: Major > Labels: pull-request-available > > The code below demonstrates the issue: > {code:scala} > spark.sql("select element_at(?, 1)", Array(array(lit(1)))).collect() > {code} > It fails with the error: > {code:java} > [info] java.lang.UnsupportedOperationException: literal unresolved_function > { > [info] function_name: "array" > [info] arguments { > [info] literal { > [info] integer: 1 > [info] } > [info] } > [info] } > [info] not supported (yet). > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-45118) Refactor converters for complex types to short cut when the element types don't need converters
[ https://issues.apache.org/jira/browse/SPARK-45118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takuya Ueshin resolved SPARK-45118. --- Fix Version/s: 4.0.0 Assignee: Takuya Ueshin Resolution: Fixed Issue resolved by pull request 42874 https://github.com/apache/spark/pull/42874 > Refactor converters for complex types to short cut when the element types > don't need converters > --- > > Key: SPARK-45118 > URL: https://issues.apache.org/jira/browse/SPARK-45118 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 4.0.0 >Reporter: Takuya Ueshin >Assignee: Takuya Ueshin >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
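The refactor's idea can be sketched in plain Python. This is a hypothetical sketch, not PySpark's actual internals: a converter for a container type is only built when the element type actually needs one; otherwise `None` signals "no conversion needed" and the container is passed through untouched, skipping a per-element Python loop.

```python
def make_array_converter(element_converter):
    """Build a converter for array values, or None when none is needed."""
    # Shortcut: if the elements need no conversion, neither does the array.
    if element_converter is None:
        return None

    def convert(values):
        if values is None:
            return None
        return [element_converter(v) for v in values]

    return convert


def apply_converter(converter, value):
    """Apply a converter, treating None as the identity conversion."""
    return value if converter is None else converter(value)
```

The same shortcut composes recursively: a nested array type only pays for conversion along the paths where some leaf type genuinely requires it.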
[jira] [Created] (SPARK-45171) GenerateExec fails to initialize non-deterministic expressions before use
Bruce Robbins created SPARK-45171: - Summary: GenerateExec fails to initialize non-deterministic expressions before use Key: SPARK-45171 URL: https://issues.apache.org/jira/browse/SPARK-45171 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.5.0 Reporter: Bruce Robbins The following query fails: {noformat} select * from explode( transform(sequence(0, cast(rand()*1000 as int) + 1), x -> x * 22) ); {noformat} The error is: {noformat} 23/09/14 09:27:25 ERROR Executor: Exception in task 0.0 in stage 3.0 (TID 3) java.lang.IllegalArgumentException: requirement failed: Nondeterministic expression org.apache.spark.sql.catalyst.expressions.Rand should be initialized before eval. at scala.Predef$.require(Predef.scala:281) at org.apache.spark.sql.catalyst.expressions.Nondeterministic.eval(Expression.scala:497) at org.apache.spark.sql.catalyst.expressions.Nondeterministic.eval$(Expression.scala:495) at org.apache.spark.sql.catalyst.expressions.RDG.eval(randomExpressions.scala:35) at org.apache.spark.sql.catalyst.expressions.BinaryArithmetic.eval(arithmetic.scala:384) at org.apache.spark.sql.catalyst.expressions.UnaryExpression.eval(Expression.scala:543) at org.apache.spark.sql.catalyst.expressions.BinaryArithmetic.eval(arithmetic.scala:384) at org.apache.spark.sql.catalyst.expressions.Sequence.eval(collectionOperations.scala:3062) at org.apache.spark.sql.catalyst.expressions.SimpleHigherOrderFunction.eval(higherOrderFunctions.scala:275) at org.apache.spark.sql.catalyst.expressions.SimpleHigherOrderFunction.eval$(higherOrderFunctions.scala:274) at org.apache.spark.sql.catalyst.expressions.ArrayTransform.eval(higherOrderFunctions.scala:308) at org.apache.spark.sql.catalyst.expressions.ExplodeBase.eval(generators.scala:375) at org.apache.spark.sql.execution.GenerateExec.$anonfun$doExecute$8(GenerateExec.scala:108) ... 
{noformat} However, this query succeeds: {noformat} select * from explode( sequence(0, cast(rand()*1000 as int) + 1) ); {noformat} The difference is that {{transform}} turns off whole-stage codegen, which exposes a bug in {{GenerateExec}} where the non-deterministic expression passed to the generator function is not initialized before being used. An even simpler repro case is: {noformat} set spark.sql.codegen.wholeStage=false; select explode(array(rand())); {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
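For illustration, the contract the error message enforces can be sketched in plain Python (hypothetical class name, not Spark's actual API): a non-deterministic expression must receive its partition index via initialize() before eval() may be called, which is the step the interpreted GenerateExec path skips.

```python
import random


class NondeterministicExpr:
    """Sketch of the initialize-before-eval contract for expressions
    like rand(): unusable until a partition index seeds the stream."""

    def __init__(self, seed=0):
        self.seed = seed
        self._rng = None  # no random stream until initialize() runs

    def initialize(self, partition_index):
        # Folding the partition index into the seed gives each task
        # an independent, reproducible random stream.
        self._rng = random.Random(self.seed + partition_index)

    def eval(self):
        if self._rng is None:
            # mirrors the IllegalArgumentException in the stack trace
            raise ValueError(
                "Nondeterministic expression should be initialized before eval.")
        return self._rng.random()
```

Whole-stage codegen happens to emit the initialization step, which is consistent with the report: only the interpreted path (forced here by `transform` or by disabling codegen) hits the error.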
[jira] [Updated] (SPARK-43966) Support non-deterministic Python UDTFs
[ https://issues.apache.org/jira/browse/SPARK-43966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-43966: --- Labels: pull-request-available (was: ) > Support non-deterministic Python UDTFs > -- > > Key: SPARK-43966 > URL: https://issues.apache.org/jira/browse/SPARK-43966 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.5.0 >Reporter: Allison Wang >Assignee: Allison Wang >Priority: Major > Labels: pull-request-available > Fix For: 3.5.0 > > > Support Python UDTFs with non-deterministic function body and inputs. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-45170) Scala-specific improvements in Dataset[T] API
Danila Goloshchapov created SPARK-45170: --- Summary: Scala-specific improvements in Dataset[T] API Key: SPARK-45170 URL: https://issues.apache.org/jira/browse/SPARK-45170 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 3.4.1 Reporter: Danila Goloshchapov *Q1.* What are you trying to do? The main idea is to use the power of Scala's macros to give developers a more convenient and typesafe API for join conditions. *Q2.* What problem is this proposal NOT designed to solve? The R/Java/Python/DataFrame API is out of scope. The solution does not affect plan generation either. *Q3.* How is it done today, and what are the limits of current practice? Currently the join condition is specified via strings, which can lead to silly mistakes (typos, incompatible column types, etc.) and is sometimes hard to read (when several joins are made and the final type is a tuple of tuples of tuples...) *Q4.* What is new in your approach and why do you think it will be successful? Scala macros can be used to extract the column name directly from a lambda (extractor). As a side effect, it is possible to check the column type and prohibit building inconsistent join expressions (like a boolean-timestamp comparison). *Q5.* Who cares? If you are successful, what difference will it make? Mainly Scala developers who prefer typesafe code - they would get a cleaner API that makes the codebase clearer, especially when several chained joins are used. *Q6.* What are the risks? Overuse of macros may slow down compilation. In addition, macros are hard to maintain. *Q7.* How long will it take? 
The approach is already implemented as a separate [lib|https://github.com/Salamahin/joinwiz] that does a bit more than just provide an alternative API (for example, it abstracts Dataset[T] to F[T], which allows running some Spark-specific code without a Spark session for testing purposes). Adapting it won't be a hard job - a matter of several weeks. *Q8.* What are the mid-term and final “exams” to check for success? API convenience is very hard to estimate, as it is more or less a question of taste. *Appendix A* You may find examples of such a 'cleaner' API [here|https://github.com/Salamahin/joinwiz/blob/master/joinwiz_core/src/test/scala/joinwiz/ComputationEngineTest.scala] Note that backward and forward compatibility is achieved by introducing a brand-new API without modifying the old one -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
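The compile-time checks the proposal describes can be illustrated at runtime with a plain Python sketch (a hypothetical helper, not part of the proposed Scala API): join keys are validated for existence and type compatibility before any string condition is built, which is the class of mistakes the macro approach would catch at compile time.

```python
from dataclasses import dataclass, fields


def typed_join_key(left_cls, right_cls, left_col, right_col):
    """Refuse to build a join condition when a column is missing or the
    key types are incompatible, instead of failing inside the engine."""
    left_types = {f.name: f.type for f in fields(left_cls)}
    right_types = {f.name: f.type for f in fields(right_cls)}
    if left_col not in left_types:
        raise AttributeError(f"{left_cls.__name__} has no column {left_col!r}")
    if right_col not in right_types:
        raise AttributeError(f"{right_cls.__name__} has no column {right_col!r}")
    if left_types[left_col] != right_types[right_col]:
        raise TypeError(
            f"incompatible join key types: {left_types[left_col]} vs "
            f"{right_types[right_col]}")
    return f"{left_col} = {right_col}"


@dataclass
class User:
    user_id: int
    name: str


@dataclass
class Order:
    order_id: int
    user_id: int
```

Scala macros move these checks from runtime to compilation, so a typo'd column name or a boolean-vs-timestamp comparison simply fails to compile.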
[jira] [Updated] (SPARK-44141) Remove need to preinstall the buf compiler
[ https://issues.apache.org/jira/browse/SPARK-44141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-44141: --- Labels: pull-request-available (was: ) > Remove need to preinstall the buf compiler > -- > > Key: SPARK-44141 > URL: https://issues.apache.org/jira/browse/SPARK-44141 > Project: Spark > Issue Type: Improvement > Components: Connect >Affects Versions: 3.4.0 >Reporter: Arnar Pall >Priority: Minor > Labels: pull-request-available > > In order to lower the barrier of entry even further for this project we can > remove need to have {{buf}} preinstalled and just use {{go run}} > This also ensures that the tool chain remains consistent and there is less > works on my machine issues to be had. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-45169) Add official image Dockerfile for Apache Spark 3.5.0
[ https://issues.apache.org/jira/browse/SPARK-45169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yikun Jiang resolved SPARK-45169. - Fix Version/s: 3.5.0 Resolution: Fixed Issue resolved by pull request 55 [https://github.com/apache/spark-docker/pull/55] > Add official image Dockerfile for Apache Spark 3.5.0 > > > Key: SPARK-45169 > URL: https://issues.apache.org/jira/browse/SPARK-45169 > Project: Spark > Issue Type: Sub-task > Components: Spark Docker >Affects Versions: 3.5.0 >Reporter: Yikun Jiang >Priority: Major > Labels: pull-request-available > Fix For: 3.5.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45169) Add official image Dockerfile for Apache Spark 3.5.0
[ https://issues.apache.org/jira/browse/SPARK-45169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-45169: --- Labels: pull-request-available (was: ) > Add official image Dockerfile for Apache Spark 3.5.0 > > > Key: SPARK-45169 > URL: https://issues.apache.org/jira/browse/SPARK-45169 > Project: Spark > Issue Type: Sub-task > Components: Spark Docker >Affects Versions: 3.5.0 >Reporter: Yikun Jiang >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-45169) Add official image Dockerfile for Apache Spark 3.5.0
Yikun Jiang created SPARK-45169: --- Summary: Add official image Dockerfile for Apache Spark 3.5.0 Key: SPARK-45169 URL: https://issues.apache.org/jira/browse/SPARK-45169 Project: Spark Issue Type: Sub-task Components: Spark Docker Affects Versions: 3.5.0 Reporter: Yikun Jiang -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45168) Increate Pandas minimum version to 1.4.4
[ https://issues.apache.org/jira/browse/SPARK-45168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-45168: --- Labels: pull-request-available (was: ) > Increate Pandas minimum version to 1.4.4 > > > Key: SPARK-45168 > URL: https://issues.apache.org/jira/browse/SPARK-45168 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 4.0.0 >Reporter: Ruifeng Zheng >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-45168) Increate Pandas minimum version to 1.4.4
Ruifeng Zheng created SPARK-45168: - Summary: Increate Pandas minimum version to 1.4.4 Key: SPARK-45168 URL: https://issues.apache.org/jira/browse/SPARK-45168 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 4.0.0 Reporter: Ruifeng Zheng -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45167) Python Spark Connect client does not call `releaseAll`
[ https://issues.apache.org/jira/browse/SPARK-45167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-45167: --- Labels: pull-request-available (was: ) > Python Spark Connect client does not call `releaseAll` > -- > > Key: SPARK-45167 > URL: https://issues.apache.org/jira/browse/SPARK-45167 > Project: Spark > Issue Type: Bug > Components: Connect >Affects Versions: 3.5.0 >Reporter: Martin Grund >Priority: Major > Labels: pull-request-available > > The Python client does not call `releaseAll` to release previous responses on > the server and thus does not properly close the queries. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45167) Python Spark Connect client does not call `releaseAll`
[ https://issues.apache.org/jira/browse/SPARK-45167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Juliusz Sompolski updated SPARK-45167: -- Epic Link: SPARK-43754 (was: SPARK-39375) > Python Spark Connect client does not call `releaseAll` > -- > > Key: SPARK-45167 > URL: https://issues.apache.org/jira/browse/SPARK-45167 > Project: Spark > Issue Type: Bug > Components: Connect >Affects Versions: 3.5.0 >Reporter: Martin Grund >Priority: Major > > The Python client does not call `releaseAll` to release previous responses on > the server and thus does not properly close the queries. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45167) Python Spark Connect client does not call `releaseAll`
[ https://issues.apache.org/jira/browse/SPARK-45167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Martin Grund updated SPARK-45167: - Issue Type: Bug (was: Improvement) > Python Spark Connect client does not call `releaseAll` > -- > > Key: SPARK-45167 > URL: https://issues.apache.org/jira/browse/SPARK-45167 > Project: Spark > Issue Type: Bug > Components: Connect >Affects Versions: 3.5.0 >Reporter: Martin Grund >Priority: Major > > The Python client does not call `releaseAll` to release previous responses on > the server and thus does not properly close the queries. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-45167) Python Spark Connect client does not call `releaseAll`
Martin Grund created SPARK-45167: Summary: Python Spark Connect client does not call `releaseAll` Key: SPARK-45167 URL: https://issues.apache.org/jira/browse/SPARK-45167 Project: Spark Issue Type: Improvement Components: Connect Affects Versions: 3.5.0 Reporter: Martin Grund The Python client does not call `releaseAll` to release previous responses on the server and thus does not properly close the queries. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
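Why the missing call matters can be sketched with hypothetical names (not the actual Spark Connect protocol classes): the server buffers responses so a client can reattach after a transient failure, and that buffer is only freed when the client signals it has consumed the stream.

```python
class FakeExecutionServer:
    """Toy stand-in for a server that buffers responses for re-attach."""

    def __init__(self, responses):
        self.buffer = list(responses)  # retained until the client releases
        self.released = False

    def release_all(self):
        # Frees the server-side state held for this query.
        self.buffer.clear()
        self.released = True


def consume_stream(server):
    """Consume all buffered responses and release them afterwards."""
    results = list(server.buffer)
    # The reported bug is effectively forgetting this call, which leaves
    # the server-side buffer (and so the query) alive after consumption.
    server.release_all()
    return results
```

A client that iterates the responses but never releases them keeps `server.buffer` populated indefinitely, which is the leak the issue describes.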
[jira] [Updated] (SPARK-45166) Clean up unused code paths for pyarrow<4
[ https://issues.apache.org/jira/browse/SPARK-45166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-45166: --- Labels: pull-request-available (was: ) > Clean up unused code paths for pyarrow<4 > > > Key: SPARK-45166 > URL: https://issues.apache.org/jira/browse/SPARK-45166 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 4.0.0 >Reporter: Ruifeng Zheng >Priority: Minor > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-45166) Clean up unused code paths for pyarrow<4
Ruifeng Zheng created SPARK-45166: - Summary: Clean up unused code paths for pyarrow<4 Key: SPARK-45166 URL: https://issues.apache.org/jira/browse/SPARK-45166 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 4.0.0 Reporter: Ruifeng Zheng -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31177) DataFrameReader.csv incorrectly reads gzip encoded CSV from S3 when it has non-".gz" extension
[ https://issues.apache.org/jira/browse/SPARK-31177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17765139#comment-17765139 ] Avi minsky commented on SPARK-31177: [~markwaddle], [~maropu] how was this resolved? > DataFrameReader.csv incorrectly reads gzip encoded CSV from S3 when it has > non-".gz" extension > -- > > Key: SPARK-31177 > URL: https://issues.apache.org/jira/browse/SPARK-31177 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 2.4.4 >Reporter: Mark Waddle >Priority: Major > Labels: bulk-closed > > I have large CSV files that are gzipped and uploaded to S3 with > Content-Encoding=gzip. The files have file extension ".csv", as most web > clients will automatically decompress the file based on the Content-Encoding > header. Using pyspark to read these CSV files does not mimic this behavior. > Works as expected: > {code:java} > df = spark.read.csv('s3://bucket/large.csv.gz', header=True) > {code} > Does not decompress and tries to load the entire contents of the file as the > first row: > {code:java} > df = spark.read.csv('s3://bucket/large.csv', header=True) > {code} > It looks like it's relying on the file extension to determine whether the > file is gzip compressed. It would be great if S3 resources, and any other > http-based resources, could consult the Content-Encoding response header as > well. I tried to find the code that determines this, but I'm not familiar > with the code base. Any pointers would be helpful, and I can look into fixing > it. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
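The direction the reporter suggests can be sketched as follows - a hypothetical helper, not Spark's actual input code path: decide on decompression from the Content-Encoding response header when the store provides one (as S3 does), or from the gzip magic bytes, rather than from the file extension.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def maybe_decompress(raw: bytes, content_encoding=None) -> bytes:
    """Return the decoded payload, trusting the header or magic bytes
    instead of the file extension."""
    if content_encoding == "gzip" or raw[:2] == GZIP_MAGIC:
        return gzip.decompress(raw)
    return raw
```

With this approach a gzipped object named `large.csv` decodes correctly, and a genuinely plain-text `large.csv` passes through untouched.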
[jira] [Assigned] (SPARK-45119) Refine docstring of `inline`
[ https://issues.apache.org/jira/browse/SPARK-45119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng reassigned SPARK-45119: - Assignee: Allison Wang > Refine docstring of `inline` > > > Key: SPARK-45119 > URL: https://issues.apache.org/jira/browse/SPARK-45119 > Project: Spark > Issue Type: Sub-task > Components: Documentation, PySpark >Affects Versions: 4.0.0 >Reporter: Allison Wang >Assignee: Allison Wang >Priority: Major > Labels: pull-request-available > > Refine docstring of the `inline` function -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-45119) Refine docstring of `inline`
[ https://issues.apache.org/jira/browse/SPARK-45119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng resolved SPARK-45119. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 42875 [https://github.com/apache/spark/pull/42875] > Refine docstring of `inline` > > > Key: SPARK-45119 > URL: https://issues.apache.org/jira/browse/SPARK-45119 > Project: Spark > Issue Type: Sub-task > Components: Documentation, PySpark >Affects Versions: 4.0.0 >Reporter: Allison Wang >Assignee: Allison Wang >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > Refine docstring of the `inline` function -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45154) Pyspark DecisionTreeClassifier: results and tree structure in spark3 very different from that of the spark2 version on the same data and with the same hyperparameters.
[ https://issues.apache.org/jira/browse/SPARK-45154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Oumar Nour updated SPARK-45154: --- Priority: Critical (was: Major) > Pyspark DecisionTreeClassifier: results and tree structure in spark3 very > different from that of the spark2 version on the same data and with the same > hyperparameters. > --- > > Key: SPARK-45154 > URL: https://issues.apache.org/jira/browse/SPARK-45154 > Project: Spark > Issue Type: Bug > Components: ML, MLlib, PySpark >Affects Versions: 3.0.0, 3.3.1, 3.2.4, 3.3.3, 3.3.2, 3.4.0, 3.4.1 >Reporter: Oumar Nour >Priority: Critical > Labels: decisiontree, pyspark3, spark2, spark3 > > Hello, > I have an engine running on spark2 using a DecisionTreeClassifier model with > the CrossValidator. > > {code:java} > dt = DecisionTreeClassifier(maxBins=1, seed=0) > cv_dt_evaluator = BinaryClassificationEvaluator( > metricName="", > rawPredictionCol="probability") > # Create param grid and cross validator for model selection > dt_grid = ParamGridBuilder()\ > .addGrid( > dt.minInstancesPerNode, [100] > )\ > .addGrid( > dt.maxDepth, [10] > )\ > .build() > cv = CrossValidator( > estimator=dt, estimatorParamMaps=dt_grid, > evaluator=cv_dt_evaluator, > parallelism=4, > numFolds=4 > ){code} > > I want to {*}migrate from spark2 to spark3{*}. I've run > *DecisionTreeClassifier* on the same data with the same parameter values. But > unfortunately my results are {*}completely different, especially in terms of > tree structure{*}. I have trees with less depth and fewer splits on spark3. > I've tried to read the documentation but I haven't found an answer to my > question. 
> I read somewhere that the behavior of the *minInstancesPerNode* parameter has > changed and that in Spark 3, {*}minInstancesPerNode{*} (it now controls the > minimum number of instances per data partition in the node to create a child > node) no longer applies to the total number of instances in a node but rather > to the number of instances per partition. This change may have an impact on > the way the decision tree is built, particularly when working with unevenly > partitioned data. *IS THIS TRUE?* > Can you help me find a solution to this problem? > Thanks in advance for your help > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
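For readers unfamiliar with the quoted setup, what CrossValidator does with a param grid can be sketched in plain Python (hypothetical callback name, not PySpark's implementation): for each parameter combination, average an evaluation metric over k folds and keep the best combination.

```python
from statistics import mean


def cross_validate(train_and_eval, folds, param_grid):
    """`train_and_eval(params, train, test)` is a hypothetical callback
    that fits a model on `train` and returns its metric on `test`
    (higher is better). Returns the winning params and their mean score."""
    best_params, best_score = None, float("-inf")
    for params in param_grid:
        scores = []
        for i, test in enumerate(folds):
            # Train on the union of every fold except the held-out one.
            train = [row for j, fold in enumerate(folds) if j != i
                     for row in fold]
            scores.append(train_and_eval(params, train, test))
        score = mean(scores)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score
```

Note this selection loop is identical across Spark versions; if the per-version tree structures differ, the cause lies in the estimator's splitting behavior, not in the cross-validation itself.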
[jira] [Updated] (SPARK-45163) Merge TABLE_OPERATION & _LEGACY_ERROR_TEMP_1113 into UNSUPPORTED_TABLE_OPERATION and refactor some logic
[ https://issues.apache.org/jira/browse/SPARK-45163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-45163: --- Labels: pull-request-available (was: ) > Merge TABLE_OPERATION & _LEGACY_ERROR_TEMP_1113 into > UNSUPPORTED_TABLE_OPERATION and refactor some logic > > > Key: SPARK-45163 > URL: https://issues.apache.org/jira/browse/SPARK-45163 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: BingKun Pan >Priority: Minor > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45163) Merge TABLE_OPERATION & _LEGACY_ERROR_TEMP_1113 into UNSUPPORTED_TABLE_OPERATION and refactor some logic
[ https://issues.apache.org/jira/browse/SPARK-45163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] BingKun Pan updated SPARK-45163: Summary: Merge TABLE_OPERATION & _LEGACY_ERROR_TEMP_1113 into UNSUPPORTED_TABLE_OPERATION and refactor some logic (was: Merge UNSUPPORTED_FEATURE.TABLE_OPERATION into UNSUPPORTED_TABLE_OPERATION and refactor some logic) > Merge TABLE_OPERATION & _LEGACY_ERROR_TEMP_1113 into > UNSUPPORTED_TABLE_OPERATION and refactor some logic > > > Key: SPARK-45163 > URL: https://issues.apache.org/jira/browse/SPARK-45163 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: BingKun Pan >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-45088) Make `getitem` work with duplicated columns
[ https://issues.apache.org/jira/browse/SPARK-45088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng reassigned SPARK-45088: - Assignee: Ruifeng Zheng > Make `getitem` work with duplicated columns > --- > > Key: SPARK-45088 > URL: https://issues.apache.org/jira/browse/SPARK-45088 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 4.0.0 >Reporter: Ruifeng Zheng >Assignee: Ruifeng Zheng >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-45088) Make `getitem` work with duplicated columns
[ https://issues.apache.org/jira/browse/SPARK-45088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng resolved SPARK-45088. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 42828 [https://github.com/apache/spark/pull/42828] > Make `getitem` work with duplicated columns > --- > > Key: SPARK-45088 > URL: https://issues.apache.org/jira/browse/SPARK-45088 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 4.0.0 >Reporter: Ruifeng Zheng >Assignee: Ruifeng Zheng >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45165) Remove `inplace` parameter from `Categorical` APIs
[ https://issues.apache.org/jira/browse/SPARK-45165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-45165: --- Labels: pull-request-available (was: ) > Remove `inplace` parameter from `Categorical` APIs > -- > > Key: SPARK-45165 > URL: https://issues.apache.org/jira/browse/SPARK-45165 > Project: Spark > Issue Type: Sub-task > Components: Pandas API on Spark >Affects Versions: 4.0.0 >Reporter: Haejoon Lee >Priority: Major > Labels: pull-request-available > > `inplace` should be removed from CategoricalIndex APIs to match the pandas > behavior -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org