[jira] [Updated] (SPARK-45179) Increase Numpy minimum version to 1.21
[ https://issues.apache.org/jira/browse/SPARK-45179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated SPARK-45179:
    Labels: pull-request-available (was: )

> Increase Numpy minimum version to 1.21
>
> Key: SPARK-45179
> URL: https://issues.apache.org/jira/browse/SPARK-45179
> Project: Spark
> Issue Type: Improvement
> Components: PySpark
> Affects Versions: 4.0.0
> Reporter: Ruifeng Zheng
> Priority: Major
> Labels: pull-request-available

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
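Bumping a dependency floor like this usually comes with an import-time gate. The sketch below is a minimal, hypothetical version of such a check (the helper names and error message are illustrative, not PySpark's actual code); it compares dotted version strings numerically, so `"1.9"` correctly sorts below `"1.21"`.

```python
def meets_minimum(installed: str, minimum: str) -> bool:
    """Compare dotted version strings numerically, e.g. '1.22.4' >= '1.21'.

    Note: purely numeric components only; pre-release suffixes like
    '1.21rc1' would need a real version parser.
    """
    parse = lambda v: tuple(int(p) for p in v.split("."))
    return parse(installed) >= parse(minimum)


def require_numpy(installed_version: str, minimum: str = "1.21") -> None:
    # Hypothetical gate mirroring what a minimum-version bump implies.
    if not meets_minimum(installed_version, minimum):
        raise ImportError(
            f"NumPy >= {minimum} must be installed; found {installed_version}"
        )


print(meets_minimum("1.22.4", "1.21"))  # True: above the floor
print(meets_minimum("1.20.3", "1.21"))  # False: 20 < 21 in the second slot
```

Tuple comparison is what makes `1.9 < 1.21` come out right; naive string comparison would invert it.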
[jira] [Updated] (SPARK-45175) download krb5.conf from remote storage in spark-submit on k8s
[ https://issues.apache.org/jira/browse/SPARK-45175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated SPARK-45175:
    Labels: pull-request-available (was: )

> download krb5.conf from remote storage in spark-submit on k8s
>
> Key: SPARK-45175
> URL: https://issues.apache.org/jira/browse/SPARK-45175
> Project: Spark
> Issue Type: Improvement
> Components: Kubernetes
> Affects Versions: 3.4.1
> Reporter: Qian Sun
> Priority: Minor
> Labels: pull-request-available
>
> krb5.conf currently only supports the local file format. Tenants would like to save this file on their own servers and download it during the spark-submit phase, to better support multi-tenant scenarios. The proposed solution is to use the *downloadFile* function [1], similar to the configuration of *spark.kubernetes.driver/executor.podTemplateFile*.
>
> [1] https://github.com/apache/spark/blob/822f58f0d26b7d760469151a65eaf9ee863a07a1/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/PodTemplateConfigMapStep.scala#L82C24-L82C24
[jira] [Updated] (SPARK-44376) Build using Maven is broken with Scala 2.13 and Java 11 and Java 17
[ https://issues.apache.org/jira/browse/SPARK-44376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated SPARK-44376:
    Labels: pull-request-available (was: )

> Build using Maven is broken with Scala 2.13 and Java 11 and Java 17
>
> Key: SPARK-44376
> URL: https://issues.apache.org/jira/browse/SPARK-44376
> Project: Spark
> Issue Type: Bug
> Components: Build
> Affects Versions: 3.5.0
> Reporter: Emil Ejbyfeldt
> Priority: Major
> Labels: pull-request-available
> Fix For: 3.5.0, 4.0.0
>
> Fails with
> ```
> $ ./build/mvn compile -Pscala-2.13 -Djava.version=11 -X
> ...
> [WARNING] [Warn] : [deprecation @ | origin= | version=] -target is deprecated: Use -release instead to compile against the correct platform API.
> [ERROR] [Error] : target platform version 8 is older than the release version 11
> [WARNING] one warning found
> [ERROR] one error found
> ...
> ```
> if setting the `java.version` property, or
> ```
> $ ./build/mvn compile -Pscala-2.13
> ...
> [WARNING] [Warn] : [deprecation @ | origin= | version=] -target is deprecated: Use -release instead to compile against the correct platform API.
> [ERROR] [Error] /home/eejbyfeldt/dev/apache/spark/core/src/main/scala/org/apache/spark/serializer/SerializationDebugger.scala:71: not found: value sun
> [ERROR] [Error] /home/eejbyfeldt/dev/apache/spark/core/src/main/scala/org/apache/spark/storage/StorageUtils.scala:26: not found: object sun
> [ERROR] [Error] /home/eejbyfeldt/dev/apache/spark/core/src/main/scala/org/apache/spark/storage/StorageUtils.scala:27: not found: object sun
> [ERROR] [Error] /home/eejbyfeldt/dev/apache/spark/core/src/main/scala/org/apache/spark/storage/StorageUtils.scala:206: not found: type DirectBuffer
> [ERROR] [Error] /home/eejbyfeldt/dev/apache/spark/core/src/main/scala/org/apache/spark/storage/StorageUtils.scala:210: not found: type Unsafe
> [ERROR] [Error] /home/eejbyfeldt/dev/apache/spark/core/src/main/scala/org/apache/spark/storage/StorageUtils.scala:212: not found: type Unsafe
> [ERROR] [Error] /home/eejbyfeldt/dev/apache/spark/core/src/main/scala/org/apache/spark/storage/StorageUtils.scala:213: not found: type DirectBuffer
> [ERROR] [Error] /home/eejbyfeldt/dev/apache/spark/core/src/main/scala/org/apache/spark/storage/StorageUtils.scala:216: not found: type DirectBuffer
> [ERROR] [Error] /home/eejbyfeldt/dev/apache/spark/core/src/main/scala/org/apache/spark/storage/StorageUtils.scala:236: not found: type DirectBuffer
> [ERROR] [Error] /home/eejbyfeldt/dev/apache/spark/core/src/main/scala/org/apache/spark/storage/StorageUtils.scala:26: Unused import
> [ERROR] [Error] /home/eejbyfeldt/dev/apache/spark/core/src/main/scala/org/apache/spark/storage/StorageUtils.scala:27: Unused import
> [ERROR] [Error] /home/eejbyfeldt/dev/apache/spark/core/src/main/scala/org/apache/spark/util/ClosureCleaner.scala:452: not found: value sun
> [ERROR] [Error] /home/eejbyfeldt/dev/apache/spark/core/src/main/scala/org/apache/spark/util/SignalUtils.scala:26: not found: object sun
> [ERROR] [Error] /home/eejbyfeldt/dev/apache/spark/core/src/main/scala/org/apache/spark/util/SignalUtils.scala:99: not found: type SignalHandler
> [ERROR] [Error] /home/eejbyfeldt/dev/apache/spark/core/src/main/scala/org/apache/spark/util/SignalUtils.scala:99: not found: type Signal
> [ERROR] [Error] /home/eejbyfeldt/dev/apache/spark/core/src/main/scala/org/apache/spark/util/SignalUtils.scala:83: not found: type Signal
> [ERROR] [Error] /home/eejbyfeldt/dev/apache/spark/core/src/main/scala/org/apache/spark/util/SignalUtils.scala:108: not found: type SignalHandler
> [ERROR] [Error] /home/eejbyfeldt/dev/apache/spark/core/src/main/scala/org/apache/spark/util/SignalUtils.scala:108: not found: value Signal
> [ERROR] [Error] /home/eejbyfeldt/dev/apache/spark/core/src/main/scala/org/apache/spark/util/SignalUtils.scala:114: not found: type Signal
> [ERROR] [Error] /home/eejbyfeldt/dev/apache/spark/core/src/main/scala/org/apache/spark/util/SignalUtils.scala:116: not found: value Signal
> [ERROR] [Error] /home/eejbyfeldt/dev/apache/spark/core/src/main/scala/org/apache/spark/util/SignalUtils.scala:128: not found: value Signal
> [ERROR] [Error] /home/eejbyfeldt/dev/apache/spark/core/src/main/scala/org/apache/spark/util/SignalUtils.scala:26: Unused import
> [ERROR] [Error] /home/eejbyfeldt/dev/apache/spark/core/src/main/scala/org/apache/spark/util/SignalUtils.scala:26: Unused import
> [WARNING] one warning found
> [ERROR] 23 errors found
> ...
> ```
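The two failure modes in the report share one cause: scalac is told `-target 8` while the release API is newer. A hedged sketch of the commands involved (the `-Djava.version` property is taken from the report itself; whether switching the build to `-release` resolves the second failure depends on how the Maven scala profile wires the compiler flags):

```shell
# Reproduce, as in the report: Scala 2.13 profile with an explicit java.version
./build/mvn compile -Pscala-2.13 -Djava.version=11 -X

# The scalac warning points at the likely fix: pass -release instead of
# -target so the bytecode target and the platform API agree. In the build
# file that would look roughly like (illustrative, not the actual pom change):
#   <arg>-release</arg><arg>11</arg>    instead of    <arg>-target:8</arg>
```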
[jira] [Created] (SPARK-45179) Increase Numpy minimum version to 1.21
Ruifeng Zheng created SPARK-45179:
    Summary: Increase Numpy minimum version to 1.21
    Key: SPARK-45179
    URL: https://issues.apache.org/jira/browse/SPARK-45179
    Project: Spark
    Issue Type: Improvement
    Components: PySpark
    Affects Versions: 4.0.0
    Reporter: Ruifeng Zheng
[jira] [Updated] (SPARK-26365) spark-submit for k8s cluster doesn't propagate exit code
[ https://issues.apache.org/jira/browse/SPARK-26365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated SPARK-26365:
    Labels: pull-request-available (was: )

> spark-submit for k8s cluster doesn't propagate exit code
>
> Key: SPARK-26365
> URL: https://issues.apache.org/jira/browse/SPARK-26365
> Project: Spark
> Issue Type: Bug
> Components: Kubernetes, Spark Core, Spark Submit
> Affects Versions: 2.3.2, 2.4.0, 3.0.0, 3.1.0
> Reporter: Oscar Bonilla
> Priority: Major
> Labels: pull-request-available
> Attachments: spark-2.4.5-raise-exception-k8s-failure.patch, spark-3.0.0-raise-exception-k8s-failure.patch
>
> When launching apps using spark-submit in a Kubernetes cluster, if the Spark application fails (returns exit code 1, for example), spark-submit will still exit gracefully and return exit code 0.
> This is problematic, since there's no way to know whether there's been a problem with the Spark application.
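The attached patches change spark-submit itself, but the failure mode is easy to see in miniature: a launcher that ignores its child's exit status always reports success. A minimal stdlib sketch (the function name is illustrative, and a real fix would read the driver pod's terminated exit code via the Kubernetes API rather than a local child process) of the propagation behavior the ticket asks for:

```python
import subprocess
import sys


def launch(cmd: list) -> int:
    """Run a child process and hand its exit code back to the caller,
    instead of unconditionally reporting 0 the way the ticket describes
    spark-submit behaving in k8s cluster mode."""
    completed = subprocess.run(cmd)
    return completed.returncode


if __name__ == "__main__":
    # A child that fails with exit code 1, standing in for a failed Spark app.
    code = launch([sys.executable, "-c", "import sys; sys.exit(1)"])
    print(code)  # 1, not 0: the failure is now visible to the caller
```

A wrapper like this only helps if the caller then exits with the propagated code (e.g. `sys.exit(code)`), which is exactly the step the reported spark-submit path skips.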
[jira] [Updated] (SPARK-43874) Enable GroupByTests for pandas 2.0.0.
[ https://issues.apache.org/jira/browse/SPARK-43874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated SPARK-43874:
    Labels: pull-request-available (was: )

> Enable GroupByTests for pandas 2.0.0.
>
> Key: SPARK-43874
> URL: https://issues.apache.org/jira/browse/SPARK-43874
> Project: Spark
> Issue Type: Sub-task
> Components: Pandas API on Spark, PySpark
> Affects Versions: 4.0.0
> Reporter: Haejoon Lee
> Assignee: Haejoon Lee
> Priority: Major
> Labels: pull-request-available
> Fix For: 4.0.0
>
> test list:
> * test_prod
> * test_nth
> * test_mad
> * test_basic_stat_funcs
> * test_groupby_multiindex_columns
> * test_apply_without_shortcut
> * test_mean
> * test_apply
[jira] [Resolved] (SPARK-43811) Enable DataFrameTests.test_reindex for pandas 2.0.0.
[ https://issues.apache.org/jira/browse/SPARK-43811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Haejoon Lee resolved SPARK-43811.
    Resolution: Fixed

> Enable DataFrameTests.test_reindex for pandas 2.0.0.
>
> Key: SPARK-43811
> URL: https://issues.apache.org/jira/browse/SPARK-43811
> Project: Spark
> Issue Type: Sub-task
> Components: Pandas API on Spark
> Affects Versions: 4.0.0
> Reporter: Haejoon Lee
> Priority: Major
[jira] [Resolved] (SPARK-44276) Match behavior with pandas for `SeriesStringTests.test_string_replace`
[ https://issues.apache.org/jira/browse/SPARK-44276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Haejoon Lee resolved SPARK-44276.
    Resolution: Fixed

> Match behavior with pandas for `SeriesStringTests.test_string_replace`
>
> Key: SPARK-44276
> URL: https://issues.apache.org/jira/browse/SPARK-44276
> Project: Spark
> Issue Type: Sub-task
> Components: Pandas API on Spark
> Affects Versions: 4.0.0
> Reporter: Haejoon Lee
> Priority: Major
>
> See https://github.com/apache/spark/pull/41823/files
[jira] [Resolved] (SPARK-43644) Enable DatetimeIndexTests.test_indexer_between_time for pandas 2.0.0.
[ https://issues.apache.org/jira/browse/SPARK-43644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Haejoon Lee resolved SPARK-43644.
    Resolution: Fixed

> Enable DatetimeIndexTests.test_indexer_between_time for pandas 2.0.0.
>
> Key: SPARK-43644
> URL: https://issues.apache.org/jira/browse/SPARK-43644
> Project: Spark
> Issue Type: Sub-task
> Components: Pandas API on Spark
> Affects Versions: 4.0.0
> Reporter: Haejoon Lee
> Priority: Major
>
> Enable DatetimeIndexTests.test_indexer_between_time for pandas 2.0.0.
[jira] [Resolved] (SPARK-43433) Match `GroupBy.nth` behavior with new pandas behavior
[ https://issues.apache.org/jira/browse/SPARK-43433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Haejoon Lee resolved SPARK-43433.
    Resolution: Fixed

> Match `GroupBy.nth` behavior with new pandas behavior
>
> Key: SPARK-43433
> URL: https://issues.apache.org/jira/browse/SPARK-43433
> Project: Spark
> Issue Type: Sub-task
> Components: Pandas API on Spark
> Affects Versions: 4.0.0
> Reporter: Haejoon Lee
> Priority: Major
>
> Match behavior with https://pandas.pydata.org/docs/dev/whatsnew/v2.0.0.html#dataframegroupby-nth-and-seriesgroupby-nth-now-behave-as-filtrations
[jira] [Resolved] (SPARK-43291) Generate proper warning on different behavior with numeric_only
[ https://issues.apache.org/jira/browse/SPARK-43291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Haejoon Lee resolved SPARK-43291.
    Resolution: Won't Fix

We should match the behavior with pandas instead of warning.

> Generate proper warning on different behavior with numeric_only
>
> Key: SPARK-43291
> URL: https://issues.apache.org/jira/browse/SPARK-43291
> Project: Spark
> Issue Type: Sub-task
> Components: Pandas API on Spark
> Affects Versions: 4.0.0
> Reporter: Haejoon Lee
> Priority: Major
> Labels: pull-request-available
>
> Should enable the test below:
> {code:python}
> pdf = pd.DataFrame([("1", "2"), ("0", "3"), ("2", "0"), ("1", "1")], columns=["a", "b"])
> psdf = ps.from_pandas(pdf)
> self.assert_eq(pdf.cov(), psdf.cov())
> {code}
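The test in the ticket exercises `DataFrame.cov` over string-typed columns, which is where the `numeric_only` divergence shows up. A small pandas-only sketch (assuming pandas is installed; `ps.from_pandas` from the ticket is left out since it needs a Spark session) of what "matching the behavior" means when the comparison is made well defined by an explicit numeric cast:

```python
import pandas as pd

# Same data as in the ticket: string columns that happen to hold digits.
pdf = pd.DataFrame([("1", "2"), ("0", "3"), ("2", "0"), ("1", "1")],
                   columns=["a", "b"])

# String columns are not numeric, so covariance over them is only meaningful
# after an explicit cast; casting first gives both libraries the same,
# unambiguous input to agree on.
cov = pdf.astype(float).cov()
print(cov.shape)  # a 2x2 covariance matrix over columns a and b
```

The resolution ("match pandas instead of warning") amounts to making the Pandas API on Spark produce whatever `pdf.cov()` itself produces for the installed pandas version, rather than papering over the difference with a warning.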
[jira] [Resolved] (SPARK-43282) Investigate DataFrame.sort_values with pandas behavior.
[ https://issues.apache.org/jira/browse/SPARK-43282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Haejoon Lee resolved SPARK-43282.
    Resolution: Won't Fix

> Investigate DataFrame.sort_values with pandas behavior.
>
> Key: SPARK-43282
> URL: https://issues.apache.org/jira/browse/SPARK-43282
> Project: Spark
> Issue Type: Sub-task
> Components: Pandas API on Spark
> Affects Versions: 4.0.0
> Reporter: Haejoon Lee
> Priority: Major
>
> {code:python}
> import pandas as pd
> pdf = pd.DataFrame(
>     {
>         "a": pd.Categorical([1, 2, 3, 1, 2, 3]),
>         "b": pd.Categorical(
>             ["b", "a", "c", "c", "b", "a"], categories=["c", "b", "d", "a"]
>         ),
>     },
> )
> pdf.groupby("a").apply(lambda x: x).sort_values(["a"])
> Traceback (most recent call last):
> ...
> ValueError: 'a' is both an index level and a column label, which is ambiguous.
> {code}
> We should investigate whether this is intended behavior or just a bug in pandas.
[jira] [Resolved] (SPARK-43271) Match behavior with DataFrame.reindex with specifying `index`.
[ https://issues.apache.org/jira/browse/SPARK-43271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Haejoon Lee resolved SPARK-43271.
    Resolution: Fixed

> Match behavior with DataFrame.reindex with specifying `index`.
>
> Key: SPARK-43271
> URL: https://issues.apache.org/jira/browse/SPARK-43271
> Project: Spark
> Issue Type: Sub-task
> Components: Pandas API on Spark
> Affects Versions: 4.0.0
> Reporter: Haejoon Lee
> Priority: Major
>
> Re-enable the pandas 2.0.0 test in DataFrameTests.test_reindex in a proper way.
[jira] [Resolved] (SPARK-45168) Increase Pandas minimum version to 1.4.4
[ https://issues.apache.org/jira/browse/SPARK-45168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ruifeng Zheng resolved SPARK-45168.
    Fix Version/s: 4.0.0
    Resolution: Fixed

Issue resolved by pull request 42930
https://github.com/apache/spark/pull/42930

> Increase Pandas minimum version to 1.4.4
>
> Key: SPARK-45168
> URL: https://issues.apache.org/jira/browse/SPARK-45168
> Project: Spark
> Issue Type: Improvement
> Components: PySpark
> Affects Versions: 4.0.0
> Reporter: Ruifeng Zheng
> Assignee: Ruifeng Zheng
> Priority: Major
> Labels: pull-request-available
> Fix For: 4.0.0
[jira] [Assigned] (SPARK-45168) Increase Pandas minimum version to 1.4.4
[ https://issues.apache.org/jira/browse/SPARK-45168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ruifeng Zheng reassigned SPARK-45168:
    Assignee: Ruifeng Zheng

> Increase Pandas minimum version to 1.4.4
>
> Key: SPARK-45168
> URL: https://issues.apache.org/jira/browse/SPARK-45168
> Project: Spark
> Issue Type: Improvement
> Components: PySpark
> Affects Versions: 4.0.0
> Reporter: Ruifeng Zheng
> Assignee: Ruifeng Zheng
> Priority: Major
> Labels: pull-request-available
[jira] [Updated] (SPARK-45178) Fallback to use single batch executor for Trigger.AvailableNow with unsupported sources rather than using wrapper
[ https://issues.apache.org/jira/browse/SPARK-45178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated SPARK-45178:
    Labels: pull-request-available (was: )

> Fallback to use single batch executor for Trigger.AvailableNow with unsupported sources rather than using wrapper
>
> Key: SPARK-45178
> URL: https://issues.apache.org/jira/browse/SPARK-45178
> Project: Spark
> Issue Type: Bug
> Components: Structured Streaming
> Affects Versions: 4.0.0
> Reporter: Jungtaek Lim
> Priority: Major
> Labels: pull-request-available
>
> We have observed a case where the wrapper implementation of Trigger.AvailableNow (AvailableNowDataStreamWrapper and subclasses) is not fully compatible with a 3rd-party data source and brought up a correctness issue.
>
> While we could persuade a 3rd-party data source to support Trigger.AvailableNow, pushing all 3rd parties to do this is too aggressive and challenging a goal for us to ever achieve. It may also not be possible to come up with a wrapper implementation that has zero issues with any arbitrary source.
>
> As a mitigation, we want to make a slight behavioral change for such cases, falling back to single-batch execution (a.k.a. Trigger.Once) rather than using the wrapper implementation. The exact behaviors of Trigger.AvailableNow and Trigger.Once differ, so this is technically a behavioral change, but it's probably a lot less surprising than failing the query.
>
> For the extreme case where users are confident that there will be no issue at all with using the wrapper, we will come up with a flag to provide the previous behavior.
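The proposed change is a three-way decision: native Trigger.AvailableNow support, the wrapper behind an opt-in flag, or a single-batch fallback. A hypothetical sketch of that decision logic (the function name, return values, and the flag are illustrative; the ticket does not name the actual config):

```python
def resolve_trigger(source_supports_available_now: bool,
                    use_wrapper_flag: bool) -> str:
    """Pick the execution mode for Trigger.AvailableNow per the ticket:
    fall back to single-batch (Trigger.Once semantics) for sources without
    native support, unless the user opts back into the wrapper."""
    if source_supports_available_now:
        return "available-now"   # native support: use it directly
    if use_wrapper_flag:
        return "wrapper"         # previous behavior, kept behind a flag
    return "single-batch"        # proposed new default fallback


print(resolve_trigger(True, False))   # available-now
print(resolve_trigger(False, False))  # single-batch
print(resolve_trigger(False, True))   # wrapper
```

The flag only matters for sources lacking native support, which is why the native-support check comes first.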
[jira] [Updated] (SPARK-43254) Assign a name to the error class _LEGACY_ERROR_TEMP_2018
[ https://issues.apache.org/jira/browse/SPARK-43254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated SPARK-43254:
    Labels: pull-request-available starter (was: starter)

> Assign a name to the error class _LEGACY_ERROR_TEMP_2018
>
> Key: SPARK-43254
> URL: https://issues.apache.org/jira/browse/SPARK-43254
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 3.5.0
> Reporter: Max Gekk
> Priority: Minor
> Labels: pull-request-available, starter
>
> Choose a proper name for the error class *_LEGACY_ERROR_TEMP_2018* defined in {*}core/src/main/resources/error/error-classes.json{*}. The name should be short but complete (look at the examples in error-classes.json).
> Add a test which triggers the error from user code, if such a test doesn't already exist. Check exception fields by using {*}checkError(){*}. That function checks only the valuable error fields and avoids depending on the error text message, so tech editors can modify the error format in error-classes.json without worrying about Spark's internal tests. Migrate other tests that might trigger the error onto checkError().
> If you cannot reproduce the error from user space (using a SQL query), replace the error with an internal error; see {*}SparkException.internalError(){*}.
> Improve the error message format in error-classes.json if the current one is not clear. Propose a solution to users for how to avoid and fix such kinds of errors.
> Please look at the PRs below as examples:
> * https://github.com/apache/spark/pull/38685
> * https://github.com/apache/spark/pull/38656
> * https://github.com/apache/spark/pull/38490
[jira] [Commented] (SPARK-45178) Fallback to use single batch executor for Trigger.AvailableNow with unsupported sources rather than using wrapper
[ https://issues.apache.org/jira/browse/SPARK-45178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17765451#comment-17765451 ]

Jungtaek Lim commented on SPARK-45178:
A PR will be available soon.

> Fallback to use single batch executor for Trigger.AvailableNow with unsupported sources rather than using wrapper
>
> Key: SPARK-45178
> URL: https://issues.apache.org/jira/browse/SPARK-45178
> Project: Spark
> Issue Type: Bug
> Components: Structured Streaming
> Affects Versions: 4.0.0
> Reporter: Jungtaek Lim
> Priority: Major
>
> We have observed a case where the wrapper implementation of Trigger.AvailableNow (AvailableNowDataStreamWrapper and subclasses) is not fully compatible with a 3rd-party data source and brought up a correctness issue.
>
> While we could persuade a 3rd-party data source to support Trigger.AvailableNow, pushing all 3rd parties to do this is too aggressive and challenging a goal for us to ever achieve. It may also not be possible to come up with a wrapper implementation that has zero issues with any arbitrary source.
>
> As a mitigation, we want to make a slight behavioral change for such cases, falling back to single-batch execution (a.k.a. Trigger.Once) rather than using the wrapper implementation. The exact behaviors of Trigger.AvailableNow and Trigger.Once differ, so this is technically a behavioral change, but it's probably a lot less surprising than failing the query.
>
> For the extreme case where users are confident that there will be no issue at all with using the wrapper, we will come up with a flag to provide the previous behavior.
[jira] [Updated] (SPARK-44788) XML: Add pyspark.sql.functions
[ https://issues.apache.org/jira/browse/SPARK-44788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated SPARK-44788:
    Labels: pull-request-available (was: )

> XML: Add pyspark.sql.functions
>
> Key: SPARK-44788
> URL: https://issues.apache.org/jira/browse/SPARK-44788
> Project: Spark
> Issue Type: Sub-task
> Components: PySpark
> Affects Versions: 4.0.0
> Reporter: Sandip Agarwala
> Priority: Major
> Labels: pull-request-available
[jira] [Created] (SPARK-45178) Fallback to use single batch executor for Trigger.AvailableNow with unsupported sources rather than using wrapper
Jungtaek Lim created SPARK-45178:
    Summary: Fallback to use single batch executor for Trigger.AvailableNow with unsupported sources rather than using wrapper
    Key: SPARK-45178
    URL: https://issues.apache.org/jira/browse/SPARK-45178
    Project: Spark
    Issue Type: Bug
    Components: Structured Streaming
    Affects Versions: 4.0.0
    Reporter: Jungtaek Lim

We have observed a case where the wrapper implementation of Trigger.AvailableNow (AvailableNowDataStreamWrapper and subclasses) is not fully compatible with a 3rd-party data source and brought up a correctness issue.

While we could persuade a 3rd-party data source to support Trigger.AvailableNow, pushing all 3rd parties to do this is too aggressive and challenging a goal for us to ever achieve. It may also not be possible to come up with a wrapper implementation that has zero issues with any arbitrary source.

As a mitigation, we want to make a slight behavioral change for such cases, falling back to single-batch execution (a.k.a. Trigger.Once) rather than using the wrapper implementation. The exact behaviors of Trigger.AvailableNow and Trigger.Once differ, so this is technically a behavioral change, but it's probably a lot less surprising than failing the query.

For the extreme case where users are confident that there will be no issue at all with using the wrapper, we will come up with a flag to provide the previous behavior.
[jira] [Updated] (SPARK-45177) Remove `col_space` parameter from `DataFrame.to_latex`
[ https://issues.apache.org/jira/browse/SPARK-45177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated SPARK-45177:
    Labels: pull-request-available (was: )

> Remove `col_space` parameter from `DataFrame.to_latex`
>
> Key: SPARK-45177
> URL: https://issues.apache.org/jira/browse/SPARK-45177
> Project: Spark
> Issue Type: Sub-task
> Components: Pandas API on Spark
> Affects Versions: 4.0.0
> Reporter: Haejoon Lee
> Priority: Major
> Labels: pull-request-available
[jira] [Created] (SPARK-45177) Remove `col_space` parameter from `DataFrame.to_latex`
Haejoon Lee created SPARK-45177:
    Summary: Remove `col_space` parameter from `DataFrame.to_latex`
    Key: SPARK-45177
    URL: https://issues.apache.org/jira/browse/SPARK-45177
    Project: Spark
    Issue Type: Sub-task
    Components: Pandas API on Spark
    Affects Versions: 4.0.0
    Reporter: Haejoon Lee
[jira] [Updated] (SPARK-45143) Make PySpark compatible with PyArrow 13.0.0
[ https://issues.apache.org/jira/browse/SPARK-45143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-45143:
    Parent: SPARK-43831
    Issue Type: Sub-task (was: Improvement)

> Make PySpark compatible with PyArrow 13.0.0
>
> Key: SPARK-45143
> URL: https://issues.apache.org/jira/browse/SPARK-45143
> Project: Spark
> Issue Type: Sub-task
> Components: Connect, PySpark
> Affects Versions: 4.0.0
> Reporter: Ruifeng Zheng
> Assignee: Ruifeng Zheng
> Priority: Major
> Labels: pull-request-available
> Fix For: 4.0.0
>
> https://github.com/apache/spark/actions/runs/6167186123/job/16738683872
>
> {code}
> FAIL [0.095s]: test_from_to_pandas (pyspark.pandas.tests.data_type_ops.test_datetime_ops.DatetimeOpsTests)
> Traceback (most recent call last):
>   File "/__w/spark/spark/python/pyspark/testing/pandasutils.py", line 122, in _assert_pandas_equal
>     assert_series_equal(
>   File "/usr/local/lib/python3.9/dist-packages/pandas/_testing/asserters.py", line 931, in assert_series_equal
>     assert_attr_equal("dtype", left, right, obj=f"Attributes of {obj}")
>   File "/usr/local/lib/python3.9/dist-packages/pandas/_testing/asserters.py", line 415, in assert_attr_equal
>     raise_assert_detail(obj, msg, left_attr, right_attr)
>   File "/usr/local/lib/python3.9/dist-packages/pandas/_testing/asserters.py", line 599, in raise_assert_detail
>     raise AssertionError(msg)
> AssertionError: Attributes of Series are different
> Attribute "dtype" are different
> [left]:  datetime64[ns]
> [right]: datetime64[us]
> {code}
[jira] [Updated] (SPARK-44434) Add more tests for Scala foreachBatch and streaming listeners
[ https://issues.apache.org/jira/browse/SPARK-44434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon updated SPARK-44434:
    Fix Version/s: (was: 3.5.0)

> Add more tests for Scala foreachBatch and streaming listeners
>
> Key: SPARK-44434
> URL: https://issues.apache.org/jira/browse/SPARK-44434
> Project: Spark
> Issue Type: Task
> Components: Connect, Structured Streaming
> Affects Versions: 3.4.1
> Reporter: Raghu Angadi
> Priority: Major
>
> Currently there are very few tests for Scala foreachBatch. Consider adding more tests and covering more test scenarios (multiple queries, etc.).
[jira] [Resolved] (SPARK-45143) Make PySpark compatible with PyArrow 13.0.0
[ https://issues.apache.org/jira/browse/SPARK-45143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun resolved SPARK-45143.
    Fix Version/s: 4.0.0
    Resolution: Fixed

Issue resolved by pull request 42920
https://github.com/apache/spark/pull/42920

> Make PySpark compatible with PyArrow 13.0.0
>
> Key: SPARK-45143
> URL: https://issues.apache.org/jira/browse/SPARK-45143
> Project: Spark
> Issue Type: Improvement
> Components: Connect, PySpark
> Affects Versions: 4.0.0
> Reporter: Ruifeng Zheng
> Assignee: Ruifeng Zheng
> Priority: Major
> Labels: pull-request-available
> Fix For: 4.0.0
>
> https://github.com/apache/spark/actions/runs/6167186123/job/16738683872
>
> {code}
> FAIL [0.095s]: test_from_to_pandas (pyspark.pandas.tests.data_type_ops.test_datetime_ops.DatetimeOpsTests)
> Traceback (most recent call last):
>   File "/__w/spark/spark/python/pyspark/testing/pandasutils.py", line 122, in _assert_pandas_equal
>     assert_series_equal(
>   File "/usr/local/lib/python3.9/dist-packages/pandas/_testing/asserters.py", line 931, in assert_series_equal
>     assert_attr_equal("dtype", left, right, obj=f"Attributes of {obj}")
>   File "/usr/local/lib/python3.9/dist-packages/pandas/_testing/asserters.py", line 415, in assert_attr_equal
>     raise_assert_detail(obj, msg, left_attr, right_attr)
>   File "/usr/local/lib/python3.9/dist-packages/pandas/_testing/asserters.py", line 599, in raise_assert_detail
>     raise AssertionError(msg)
> AssertionError: Attributes of Series are different
> Attribute "dtype" are different
> [left]:  datetime64[ns]
> [right]: datetime64[us]
> {code}
[jira] [Assigned] (SPARK-45143) Make PySpark compatible with PyArrow 13.0.0
[ https://issues.apache.org/jira/browse/SPARK-45143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun reassigned SPARK-45143:
    Assignee: Ruifeng Zheng

> Make PySpark compatible with PyArrow 13.0.0
>
> Key: SPARK-45143
> URL: https://issues.apache.org/jira/browse/SPARK-45143
> Project: Spark
> Issue Type: Improvement
> Components: Connect, PySpark
> Affects Versions: 4.0.0
> Reporter: Ruifeng Zheng
> Assignee: Ruifeng Zheng
> Priority: Major
> Labels: pull-request-available
>
> https://github.com/apache/spark/actions/runs/6167186123/job/16738683872
>
> {code}
> FAIL [0.095s]: test_from_to_pandas (pyspark.pandas.tests.data_type_ops.test_datetime_ops.DatetimeOpsTests)
> Traceback (most recent call last):
>   File "/__w/spark/spark/python/pyspark/testing/pandasutils.py", line 122, in _assert_pandas_equal
>     assert_series_equal(
>   File "/usr/local/lib/python3.9/dist-packages/pandas/_testing/asserters.py", line 931, in assert_series_equal
>     assert_attr_equal("dtype", left, right, obj=f"Attributes of {obj}")
>   File "/usr/local/lib/python3.9/dist-packages/pandas/_testing/asserters.py", line 415, in assert_attr_equal
>     raise_assert_detail(obj, msg, left_attr, right_attr)
>   File "/usr/local/lib/python3.9/dist-packages/pandas/_testing/asserters.py", line 599, in raise_assert_detail
>     raise AssertionError(msg)
> AssertionError: Attributes of Series are different
> Attribute "dtype" are different
> [left]:  datetime64[ns]
> [right]: datetime64[us]
> {code}
[jira] [Updated] (SPARK-44699) Add logging for complete write events to file in EventLogFileWriter.closeWriter
[ https://issues.apache.org/jira/browse/SPARK-44699?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-44699: - Fix Version/s: (was: 3.5.0) > Add logging for complete write events to file in > EventLogFileWriter.closeWriter > --- > > Key: SPARK-44699 > URL: https://issues.apache.org/jira/browse/SPARK-44699 > Project: Spark > Issue Type: Task > Components: Spark Core >Affects Versions: 3.4.1 >Reporter: shuyouZZ >Priority: Major > > Sometimes we want to know when logging of events to the eventLog file has finished; > we should add a log message to make this clearer. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42252) Deprecate spark.shuffle.unsafe.file.output.buffer and add a new config
[ https://issues.apache.org/jira/browse/SPARK-42252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-42252: - Target Version/s: (was: 3.5.0) > Deprecate spark.shuffle.unsafe.file.output.buffer and add a new config > -- > > Key: SPARK-42252 > URL: https://issues.apache.org/jira/browse/SPARK-42252 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0, 3.1.0, 3.2.0, 3.3.0, 3.4.0 >Reporter: Wei Guo >Priority: Minor > > After Jira SPARK-28209 and PR > [25007|https://github.com/apache/spark/pull/25007], a new shuffle writer > API was proposed. All shuffle writers (BypassMergeSortShuffleWriter, > SortShuffleWriter, UnsafeShuffleWriter) are based on > LocalDiskShuffleMapOutputWriter to write local disk shuffle files. The config > spark.shuffle.unsafe.file.output.buffer used in > LocalDiskShuffleMapOutputWriter was previously used only in UnsafeShuffleWriter. > > It's better to rename it to something more suitable. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44307) Bloom filter is not added for left outer join if the left side table is smaller than broadcast threshold.
[ https://issues.apache.org/jira/browse/SPARK-44307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-44307: - Fix Version/s: (was: 3.5.0) > Bloom filter is not added for left outer join if the left side table is > smaller than broadcast threshold. > - > > Key: SPARK-44307 > URL: https://issues.apache.org/jira/browse/SPARK-44307 > Project: Spark > Issue Type: Bug > Components: Optimizer >Affects Versions: 3.4.1 >Reporter: mahesh kumar behera >Priority: Major > > In case of left outer join, even if the left side table is small enough to be > broadcast, shuffle join is used. This is because of the property of the > left outer join: if the left side is broadcast in a left outer join, the > result generated will be wrong. But this is not taken care of in the bloom > filter logic. While injecting the bloom filter, if the left side is smaller than > the broadcast threshold, the bloom filter is not added. It assumes that the left side > will be broadcast and there is no need for a bloom filter. This causes the bloom > filter optimization to be missed in case of a left outer join with a small left > side and a huge right-side table. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38945) simply KEYTAB and PRINCIPAL in KerberosConfDriverFeatureStep
[ https://issues.apache.org/jira/browse/SPARK-38945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-38945: - Fix Version/s: (was: 3.5.0) > simply KEYTAB and PRINCIPAL in KerberosConfDriverFeatureStep > > > Key: SPARK-38945 > URL: https://issues.apache.org/jira/browse/SPARK-38945 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.2.1 >Reporter: Qian Sun >Priority: Minor > > Simplify KEYTAB and PRINCIPAL in KerberosConfDriverFeatureStep, because they are already > imported -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43155) DataSourceV2 is hard to be implemented without following V1
[ https://issues.apache.org/jira/browse/SPARK-43155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-43155: - Target Version/s: (was: 3.5.0) > DataSourceV2 is hard to be implemented without following V1 > --- > > Key: SPARK-43155 > URL: https://issues.apache.org/jira/browse/SPARK-43155 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: PEIYUAN SUN >Priority: Major > Labels: features > Original Estimate: 672h > Remaining Estimate: 672h > > h1. Description > The current DataSourceV2 interface is considerably more complicated than in the > Spark 2.x versions. To implement a source under DataSourceV2, a user needs to learn > not only the V2 APIs and interfaces but also DataSourceV1 (as it is the > fallback version). > h2. Interface Gaps > There is no easy way, and there are no clear examples, of how to implement both for a > new data source. For example, the sources in the standard Spark repo such as orc, > parquet, and json have a FileFormat interface for V1, but these cannot be followed > because the SPI is hard-coded as `DefaultSource` instead > of dynamically loading a user-provided class from outside the Spark repo. > Different data sources do not strictly follow the same pattern in V1 and are not > well decoupled from the customized logic within them. > > h2. Loss of a simple layer over different kinds of data source > With the original V1 API, a user can easily implement a new wrapper on top of > orc/parquet with the Relation interface. DataSourceV2 here again becomes too low > level and hard to use in this case. > > h2. No explicit guidance > The functionality interfaces are not well organized, which forces the reader > to spend a lot of time understanding the commit history and existing patterns. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-41259) Spark-sql cli query results should correspond to schema
[ https://issues.apache.org/jira/browse/SPARK-41259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-41259: - Fix Version/s: (was: 3.5.0) > Spark-sql cli query results should correspond to schema > --- > > Key: SPARK-41259 > URL: https://issues.apache.org/jira/browse/SPARK-41259 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: yikaifei >Priority: Minor > > When using the spark-sql cli, Spark outputs only one column in the `show > tables` and `show views` commands to be compatible with Hive output, but the > output schema is still the three columns of Spark -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42252) Deprecate spark.shuffle.unsafe.file.output.buffer and add a new config
[ https://issues.apache.org/jira/browse/SPARK-42252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-42252: - Fix Version/s: (was: 3.5.0) > Deprecate spark.shuffle.unsafe.file.output.buffer and add a new config > -- > > Key: SPARK-42252 > URL: https://issues.apache.org/jira/browse/SPARK-42252 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0, 3.1.0, 3.2.0, 3.3.0, 3.4.0 >Reporter: Wei Guo >Priority: Minor > > After Jira SPARK-28209 and PR > [25007|https://github.com/apache/spark/pull/25007], a new shuffle writer > API was proposed. All shuffle writers (BypassMergeSortShuffleWriter, > SortShuffleWriter, UnsafeShuffleWriter) are based on > LocalDiskShuffleMapOutputWriter to write local disk shuffle files. The config > spark.shuffle.unsafe.file.output.buffer used in > LocalDiskShuffleMapOutputWriter was previously used only in UnsafeShuffleWriter. > > It's better to rename it to something more suitable. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37935) Migrate onto error classes
[ https://issues.apache.org/jira/browse/SPARK-37935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-37935: - Fix Version/s: (was: 3.5.0) > Migrate onto error classes > -- > > Key: SPARK-37935 > URL: https://issues.apache.org/jira/browse/SPARK-37935 > Project: Spark > Issue Type: Umbrella > Components: Spark Core, SQL >Affects Versions: 3.3.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > > The PR https://github.com/apache/spark/pull/32850 introduced error classes as > part of the error messages framework > (https://issues.apache.org/jira/browse/SPARK-33539). We need to migrate all > exceptions from QueryExecutionErrors, QueryCompilationErrors and > QueryParsingErrors onto the error classes using instances of SparkThrowable, > and carefully test every error class by writing tests in dedicated test > suites: > * QueryExecutionErrorsSuite for errors that occur during query > execution > * QueryCompilationErrorsSuite ... query compilation or eagerly executing > commands > * QueryParsingErrorsSuite ... parsing errors > Here is an example https://github.com/apache/spark/pull/35157 of how an > existing Java exception can be replaced and the related error > classes tested. At the end, we should migrate all exceptions from the files > Query.*Errors.scala and cover all error classes from the error-classes.json > file with tests. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
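As a rough illustration of the mechanism described above (error classes kept in a JSON-style registry, exceptions raised by class name plus message parameters rather than ad-hoc strings), here is a hypothetical Python miniature. The names `ERROR_CLASSES` and `SparkThrowableSketch` are illustrative only, not Spark's actual API:

```python
# A miniature registry standing in for error-classes.json: each error class
# maps to a parameterized message template.
ERROR_CLASSES = {
    "DIVIDE_BY_ZERO": "Division by zero.",
    "CAST_INVALID_INPUT": "The value {value} cannot be cast to {type}.",
}

class SparkThrowableSketch(Exception):
    """Illustrative stand-in for an exception carrying an error class."""
    def __init__(self, error_class, **params):
        self.error_class = error_class
        # Render the message from the registry template and parameters.
        message = ERROR_CLASSES[error_class].format(**params)
        super().__init__(message)

# Raising by class name keeps messages consistent and testable by class.
err = SparkThrowableSketch("CAST_INVALID_INPUT", value="'abc'", type="INT")
assert err.error_class == "CAST_INVALID_INPUT"
assert "'abc'" in str(err)
```

Testing per error class (rather than per message string) is what the dedicated suites mentioned above enable: a test asserts on the class and parameters, so message wording can evolve centrally.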
[jira] [Updated] (SPARK-39136) JDBCTable support properties
[ https://issues.apache.org/jira/browse/SPARK-39136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-39136: - Fix Version/s: (was: 3.5.0) > JDBCTable support properties > > > Key: SPARK-39136 > URL: https://issues.apache.org/jira/browse/SPARK-39136 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 3.3.0 >Reporter: angerszhu >Priority: Major > > {code:java} > > > > desc formatted jdbc.test.people; > NAME string > IDint > # Partitioning > Not partitioned > # Detailed Table Information > Name test.people > Table Properties [] > Time taken: 0.048 seconds, Fetched 9 row(s) > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-39892) Use ArrowType.Decimal(precision, scale, bitWidth) instead of ArrowType.Decimal(precision, scale)
[ https://issues.apache.org/jira/browse/SPARK-39892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-39892: - Fix Version/s: (was: 3.5.0) > Use ArrowType.Decimal(precision, scale, bitWidth) instead of > ArrowType.Decimal(precision, scale) > > > Key: SPARK-39892 > URL: https://issues.apache.org/jira/browse/SPARK-39892 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: BingKun Pan >Priority: Minor > > [warn] > /home/runner/work/spark/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/util/ArrowUtils.scala:48:49: > [deprecation @ org.apache.spark.sql.util.ArrowUtils.toArrowType | > origin=org.apache.arrow.vector.types.pojo.ArrowType.Decimal. | > version=] constructor Decimal in class Decimal is deprecated -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43155) DataSourceV2 is hard to be implemented without following V1
[ https://issues.apache.org/jira/browse/SPARK-43155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-43155: - Fix Version/s: (was: 3.5.0) > DataSourceV2 is hard to be implemented without following V1 > --- > > Key: SPARK-43155 > URL: https://issues.apache.org/jira/browse/SPARK-43155 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: PEIYUAN SUN >Priority: Major > Labels: features > Original Estimate: 672h > Remaining Estimate: 672h > > h1. Description > The current DataSourceV2 interface is considerably more complicated than in the > Spark 2.x versions. To implement a source under DataSourceV2, a user needs to learn > not only the V2 APIs and interfaces but also DataSourceV1 (as it is the > fallback version). > h2. Interface Gaps > There is no easy way, and there are no clear examples, of how to implement both for a > new data source. For example, the sources in the standard Spark repo such as orc, > parquet, and json have a FileFormat interface for V1, but these cannot be followed > because the SPI is hard-coded as `DefaultSource` instead > of dynamically loading a user-provided class from outside the Spark repo. > Different data sources do not strictly follow the same pattern in V1 and are not > well decoupled from the customized logic within them. > > h2. Loss of a simple layer over different kinds of data source > With the original V1 API, a user can easily implement a new wrapper on top of > orc/parquet with the Relation interface. DataSourceV2 here again becomes too low > level and hard to use in this case. > > h2. No explicit guidance > The functionality interfaces are not well organized, which forces the reader > to spend a lot of time understanding the commit history and existing patterns. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-39814) Use AmazonKinesisClientBuilder.withCredentials instead of new AmazonKinesisClient(credentials)
[ https://issues.apache.org/jira/browse/SPARK-39814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-39814: - Fix Version/s: (was: 3.5.0) > Use AmazonKinesisClientBuilder.withCredentials instead of new > AmazonKinesisClient(credentials) > -- > > Key: SPARK-39814 > URL: https://issues.apache.org/jira/browse/SPARK-39814 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.4.0 >Reporter: BingKun Pan >Priority: Minor > > [warn] > /home/runner/work/spark/spark/connector/kinesis-asl/src/main/scala/org/apache/spark/examples/streaming/KinesisWordCountASL.scala:108:25: > [deprecation @ > org.apache.spark.examples.streaming.KinesisWordCountASL.main.kinesisClient | > origin=com.amazonaws.services.kinesis.AmazonKinesisClient. | version=] > constructor AmazonKinesisClient in class AmazonKinesisClient is deprecated > [warn] > /home/runner/work/spark/spark/connector/kinesis-asl/src/main/scala/org/apache/spark/examples/streaming/KinesisWordCountASL.scala:224:25: > [deprecation @ > org.apache.spark.examples.streaming.KinesisWordProducerASL.generate.kinesisClient > | origin=com.amazonaws.services.kinesis.AmazonKinesisClient. | > version=] constructor AmazonKinesisClient in class AmazonKinesisClient is > deprecated > [warn] > /home/runner/work/spark/spark/connector/kinesis-asl/src/main/scala/org/apache/spark/streaming/kinesis/KinesisBackedBlockRDD.scala:142:24: > [deprecation @ > org.apache.spark.streaming.kinesis.KinesisSequenceRangeIterator.client | > origin=com.amazonaws.services.kinesis.AmazonKinesisClient. | version=] > constructor AmazonKinesisClient in class AmazonKinesisClient is deprecated > [warn] > /home/runner/work/spark/spark/connector/kinesis-asl/src/main/scala/org/apache/spark/streaming/kinesis/KinesisTestUtils.scala:58:18: > [deprecation @ > org.apache.spark.streaming.kinesis.KinesisTestUtils.kinesisClient.client | > origin=com.amazonaws.services.kinesis.AmazonKinesisClient. 
| version=] > constructor AmazonKinesisClient in class AmazonKinesisClient is deprecated -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44307) Bloom filter is not added for left outer join if the left side table is smaller than broadcast threshold.
[ https://issues.apache.org/jira/browse/SPARK-44307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-44307: - Target Version/s: (was: 3.4.1) > Bloom filter is not added for left outer join if the left side table is > smaller than broadcast threshold. > - > > Key: SPARK-44307 > URL: https://issues.apache.org/jira/browse/SPARK-44307 > Project: Spark > Issue Type: Bug > Components: Optimizer >Affects Versions: 3.4.1 >Reporter: mahesh kumar behera >Priority: Major > > In case of left outer join, even if the left side table is small enough to be > broadcast, shuffle join is used. This is because of the property of the > left outer join: if the left side is broadcast in a left outer join, the > result generated will be wrong. But this is not taken care of in the bloom > filter logic. While injecting the bloom filter, if the left side is smaller than > the broadcast threshold, the bloom filter is not added. It assumes that the left side > will be broadcast and there is no need for a bloom filter. This causes the bloom > filter optimization to be missed in case of a left outer join with a small left > side and a huge right-side table. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
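The flawed check described in this issue can be modeled in a few lines. This is an illustrative Python sketch of the decision logic, not Spark's actual optimizer code; `BROADCAST_THRESHOLD` and `should_inject_bloom_filter` are hypothetical names:

```python
BROADCAST_THRESHOLD = 10 * 1024 * 1024  # illustrative 10 MiB threshold

def should_inject_bloom_filter(creation_side_bytes, join_type):
    """Decide whether a runtime bloom filter is worth injecting.

    For join types where a small creation side would be broadcast anyway,
    a bloom filter adds nothing. For a LEFT OUTER join the left side can
    never be broadcast, so skipping the filter just because the left side
    is small (the bug described above) misses the optimization.
    """
    if join_type == "left_outer":
        # Left side cannot be broadcast here, so even a small left side
        # is worth building a filter from.
        return True
    # Where broadcast is possible, a side under the threshold makes the
    # bloom filter redundant.
    return creation_side_bytes > BROADCAST_THRESHOLD

# Small left side in a left outer join: the filter should still be injected.
assert should_inject_bloom_filter(1024, "left_outer")
# Small side in an inner join: a broadcast join is expected, no filter needed.
assert not should_inject_bloom_filter(1024, "inner")
```

The sketch shows only the size-vs-join-type interaction; the real optimizer also weighs selectivity and scan costs before injecting a filter.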
[jira] [Updated] (SPARK-43318) spark reader csv and json support wholetext parameters
[ https://issues.apache.org/jira/browse/SPARK-43318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-43318: - Fix Version/s: (was: 3.5.0) > spark reader csv and json support wholetext parameters > -- > > Key: SPARK-43318 > URL: https://issues.apache.org/jira/browse/SPARK-43318 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.5.0 >Reporter: melin >Priority: Major > > FTPInputStream used by Hadoop's FTPFileSystem does not support seek, so Spark's > HadoopFileLinesReader fails to read the file. > Support reading the entire file and then splitting it into lines, to avoid the read failure > > [https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/ftp/FTPInputStream.java] > > [~cloud_fan] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-45172) Upgrade commons-compress.version from 1.23.0 to 1.24.0
[ https://issues.apache.org/jira/browse/SPARK-45172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-45172. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 42934 [https://github.com/apache/spark/pull/42934] > Upgrade commons-compress.version from 1.23.0 to 1.24.0 > -- > > Key: SPARK-45172 > URL: https://issues.apache.org/jira/browse/SPARK-45172 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 4.0.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Minor > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-45172) Upgrade commons-compress.version from 1.23.0 to 1.24.0
[ https://issues.apache.org/jira/browse/SPARK-45172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-45172: - Assignee: Hyukjin Kwon > Upgrade commons-compress.version from 1.23.0 to 1.24.0 > -- > > Key: SPARK-45172 > URL: https://issues.apache.org/jira/browse/SPARK-45172 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 4.0.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Minor > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-45171) GenerateExec fails to initialize non-deterministic expressions before use
[ https://issues.apache.org/jira/browse/SPARK-45171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-45171: Assignee: Bruce Robbins > GenerateExec fails to initialize non-deterministic expressions before use > - > > Key: SPARK-45171 > URL: https://issues.apache.org/jira/browse/SPARK-45171 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.5.0 >Reporter: Bruce Robbins >Assignee: Bruce Robbins >Priority: Major > Labels: pull-request-available > > The following query fails: > {noformat} > select * > from explode( > transform(sequence(0, cast(rand()*1000 as int) + 1), x -> x * 22) > ); > {noformat} > The error is: > {noformat} > 23/09/14 09:27:25 ERROR Executor: Exception in task 0.0 in stage 3.0 (TID 3) > java.lang.IllegalArgumentException: requirement failed: Nondeterministic > expression org.apache.spark.sql.catalyst.expressions.Rand should be > initialized before eval. > at scala.Predef$.require(Predef.scala:281) > at > org.apache.spark.sql.catalyst.expressions.Nondeterministic.eval(Expression.scala:497) > at > org.apache.spark.sql.catalyst.expressions.Nondeterministic.eval$(Expression.scala:495) > at > org.apache.spark.sql.catalyst.expressions.RDG.eval(randomExpressions.scala:35) > at > org.apache.spark.sql.catalyst.expressions.BinaryArithmetic.eval(arithmetic.scala:384) > at > org.apache.spark.sql.catalyst.expressions.UnaryExpression.eval(Expression.scala:543) > at > org.apache.spark.sql.catalyst.expressions.BinaryArithmetic.eval(arithmetic.scala:384) > at > org.apache.spark.sql.catalyst.expressions.Sequence.eval(collectionOperations.scala:3062) > at > org.apache.spark.sql.catalyst.expressions.SimpleHigherOrderFunction.eval(higherOrderFunctions.scala:275) > at > org.apache.spark.sql.catalyst.expressions.SimpleHigherOrderFunction.eval$(higherOrderFunctions.scala:274) > at > org.apache.spark.sql.catalyst.expressions.ArrayTransform.eval(higherOrderFunctions.scala:308) > at > 
org.apache.spark.sql.catalyst.expressions.ExplodeBase.eval(generators.scala:375) > at > org.apache.spark.sql.execution.GenerateExec.$anonfun$doExecute$8(GenerateExec.scala:108) > ... > {noformat} > However, this query succeeds: > {noformat} > select * > from explode( > sequence(0, cast(rand()*1000 as int) + 1) > ); > {noformat} > The difference is that {{transform}} turns off whole-stage codegen, which > exposes a bug in {{GenerateExec}} where the non-deterministic expression > passed to the generator function is not initialized before being used. > An even simpler repro case is: > {noformat} > set spark.sql.codegen.wholeStage=false; > select explode(array(rand())); > {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-45171) GenerateExec fails to initialize non-deterministic expressions before use
[ https://issues.apache.org/jira/browse/SPARK-45171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-45171. -- Fix Version/s: 3.5.1 4.0.0 Resolution: Fixed Issue resolved by pull request 42933 [https://github.com/apache/spark/pull/42933] > GenerateExec fails to initialize non-deterministic expressions before use > - > > Key: SPARK-45171 > URL: https://issues.apache.org/jira/browse/SPARK-45171 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.5.0 >Reporter: Bruce Robbins >Assignee: Bruce Robbins >Priority: Major > Labels: pull-request-available > Fix For: 3.5.1, 4.0.0 > > > The following query fails: > {noformat} > select * > from explode( > transform(sequence(0, cast(rand()*1000 as int) + 1), x -> x * 22) > ); > {noformat} > The error is: > {noformat} > 23/09/14 09:27:25 ERROR Executor: Exception in task 0.0 in stage 3.0 (TID 3) > java.lang.IllegalArgumentException: requirement failed: Nondeterministic > expression org.apache.spark.sql.catalyst.expressions.Rand should be > initialized before eval. 
> at scala.Predef$.require(Predef.scala:281) > at > org.apache.spark.sql.catalyst.expressions.Nondeterministic.eval(Expression.scala:497) > at > org.apache.spark.sql.catalyst.expressions.Nondeterministic.eval$(Expression.scala:495) > at > org.apache.spark.sql.catalyst.expressions.RDG.eval(randomExpressions.scala:35) > at > org.apache.spark.sql.catalyst.expressions.BinaryArithmetic.eval(arithmetic.scala:384) > at > org.apache.spark.sql.catalyst.expressions.UnaryExpression.eval(Expression.scala:543) > at > org.apache.spark.sql.catalyst.expressions.BinaryArithmetic.eval(arithmetic.scala:384) > at > org.apache.spark.sql.catalyst.expressions.Sequence.eval(collectionOperations.scala:3062) > at > org.apache.spark.sql.catalyst.expressions.SimpleHigherOrderFunction.eval(higherOrderFunctions.scala:275) > at > org.apache.spark.sql.catalyst.expressions.SimpleHigherOrderFunction.eval$(higherOrderFunctions.scala:274) > at > org.apache.spark.sql.catalyst.expressions.ArrayTransform.eval(higherOrderFunctions.scala:308) > at > org.apache.spark.sql.catalyst.expressions.ExplodeBase.eval(generators.scala:375) > at > org.apache.spark.sql.execution.GenerateExec.$anonfun$doExecute$8(GenerateExec.scala:108) > ... > {noformat} > However, this query succeeds: > {noformat} > select * > from explode( > sequence(0, cast(rand()*1000 as int) + 1) > ); > {noformat} > The difference is that {{transform}} turns off whole-stage codegen, which > exposes a bug in {{GenerateExec}} where the non-deterministic expression > passed to the generator function is not initialized before being used. > An even simpler repro case is: > {noformat} > set spark.sql.codegen.wholeStage=false; > select explode(array(rand())); > {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
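The contract that the `require` at the top of the stack trace enforces (a non-deterministic expression must be initialized with a partition index before `eval`) can be modeled in a few lines. This is an illustrative Python sketch, loosely modeled on the names in the trace, showing why evaluating before initializing fails and why initializing first resolves it:

```python
import random

class NondeterministicSketch:
    """Illustrative model of an expression like rand(): its eval() is only
    valid after per-partition initialization, mirroring the Spark contract
    'Nondeterministic expression ... should be initialized before eval'."""

    def __init__(self):
        self._initialized = False

    def initialize(self, partition_index):
        # Seed per partition so a retried task reproduces the same stream.
        self._rng = random.Random(partition_index)
        self._initialized = True

    def eval(self):
        if not self._initialized:
            raise ValueError(
                "Nondeterministic expression should be initialized before eval."
            )
        return self._rng.random()

expr = NondeterministicSketch()
# Mirrors the bug: the interpreted path evaluated the expression without
# ever calling initialize(), so the guard fires.
try:
    expr.eval()
    raised = False
except ValueError:
    raised = True
assert raised

# The fix is to initialize before use, as codegen paths already did.
expr.initialize(partition_index=0)
assert 0.0 <= expr.eval() < 1.0
```

This is why the whole-stage-codegen path succeeded while the interpreted `GenerateExec` path failed: only the former performed the initialization step.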
[jira] [Resolved] (SPARK-45174) Support spark.deploy.maxDrivers
[ https://issues.apache.org/jira/browse/SPARK-45174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-45174. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 42936 [https://github.com/apache/spark/pull/42936] > Support spark.deploy.maxDrivers > --- > > Key: SPARK-45174 > URL: https://issues.apache.org/jira/browse/SPARK-45174 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > Like `spark.mesos.maxDrivers`, this issue aims to add > `spark.deploy.maxDrivers`. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-45174) Support spark.deploy.maxDrivers
[ https://issues.apache.org/jira/browse/SPARK-45174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-45174: - Assignee: Dongjoon Hyun > Support spark.deploy.maxDrivers > --- > > Key: SPARK-45174 > URL: https://issues.apache.org/jira/browse/SPARK-45174 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Labels: pull-request-available > > Like `spark.mesos.maxDrivers`, this issue aims to add > `spark.deploy.maxDrivers`. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-45165) Remove `inplace` parameter from `Categorical` APIs
[ https://issues.apache.org/jira/browse/SPARK-45165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-45165. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 42927 [https://github.com/apache/spark/pull/42927] > Remove `inplace` parameter from `Categorical` APIs > -- > > Key: SPARK-45165 > URL: https://issues.apache.org/jira/browse/SPARK-45165 > Project: Spark > Issue Type: Sub-task > Components: Pandas API on Spark >Affects Versions: 4.0.0 >Reporter: Haejoon Lee >Assignee: Haejoon Lee >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > `inplace` should be removed from CategoricalIndex APIs to match the pandas > behavior -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-45165) Remove `inplace` parameter from `Categorical` APIs
[ https://issues.apache.org/jira/browse/SPARK-45165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-45165: - Assignee: Haejoon Lee > Remove `inplace` parameter from `Categorical` APIs > -- > > Key: SPARK-45165 > URL: https://issues.apache.org/jira/browse/SPARK-45165 > Project: Spark > Issue Type: Sub-task > Components: Pandas API on Spark >Affects Versions: 4.0.0 >Reporter: Haejoon Lee >Assignee: Haejoon Lee >Priority: Major > Labels: pull-request-available > > `inplace` should be removed from CategoricalIndex APIs to match the pandas > behavior -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45176) AggregatingAccumulator with TypedImperativeAggregate throwing ClassCastException
[ https://issues.apache.org/jira/browse/SPARK-45176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Huw updated SPARK-45176: Description: Probably related to SPARK-39044. But potentially also this comment in Executor.scala. {quote}// TODO: do not serialize value twice val directResult = new DirectTaskResult(valueByteBuffer, accumUpdates, metricPeaks) {quote} The class cast exception I'm seeing is {quote} java.lang.ClassCastException: class [B cannot be cast to class org.apache.spark.sql.catalyst.expressions.aggregate.Reservoir {quote} But I've seen it with other aggregation buffers like QuantileSummaries as well. It's my belief that withBufferSerialized() for the AggregatingAccumulator is being called twice, leading to serializeAggregateBufferInPlace(buffer) also being called twice for an imperative aggregate; the second time round, the buffer is already a byte array and the asInstanceOf[T] in getBufferObject throws. This doesn't appear to happen on all runs, and it might only occur when there's a transient exception. I have a further suspicion that the cause might originate with {quote} SerializationDebugger.improveException {quote} which traverses the task and forces writeExternal to be called. Setting spark.serializer.extraDebugInfo to false seems to make things a bit more reliable (I haven't seen the error with that setting), and points strongly in that direction. Stack trace: {quote} Job aborted due to stage failure: Authorized committer (attemptNumber=0, stage=15, partition=10) failed; but task commit success, data duplication may happen.
reason=ExceptionFailure(java.io.IOException,java.lang.ClassCastException: class [B cannot be cast to class org.apache.spark.sql.catalyst.expressions.aggregate.Reservoir ([B is in module java.base of loader 'bootstrap'; org.apache.spark.sql.catalyst.expressions.aggregate.Reservoir is in unnamed module of loader 'app'),[Ljava.lang.StackTraceElement;@7fe2f462,java.io.IOException: java.lang.ClassCastException: class [B cannot be cast to class org.apache.spark.sql.catalyst.expressions.aggregate.Reservoir ([B is in module java.base of loader 'bootstrap'; org.apache.spark.sql.catalyst.expressions.aggregate.Reservoir is in unnamed module of loader 'app') at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1502) at org.apache.spark.scheduler.DirectTaskResult.writeExternal(TaskResult.scala:59) at java.base/java.io.ObjectOutputStream.writeExternalData(Unknown Source) at java.base/java.io.ObjectOutputStream.writeOrdinaryObject(Unknown Source) at java.base/java.io.ObjectOutputStream.writeObject0(Unknown Source) at java.base/java.io.ObjectOutputStream.writeObject(Unknown Source) at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46) at org.apache.spark.serializer.SerializerHelper$.serializeToChunkedBuffer(SerializerHelper.scala:42) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:643) at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) at java.base/java.lang.Thread.run(Unknown Source) Caused by: java.lang.ClassCastException: class [B cannot be cast to class org.apache.spark.sql.catalyst.expressions.aggregate.Reservoir ([B is in module java.base of loader 'bootstrap'; org.apache.spark.sql.catalyst.expressions.aggregate.Reservoir is in unnamed module of loader 'app') at org.apache.spark.sql.catalyst.expressions.aggregate.ReservoirSample.serialize(ReservoirSample.scala:33) at 
org.apache.spark.sql.catalyst.expressions.aggregate.TypedImperativeAggregate.serializeAggregateBufferInPlace(interfaces.scala:624) at org.apache.spark.sql.execution.AggregatingAccumulator.withBufferSerialized(AggregatingAccumulator.scala:206) at org.apache.spark.sql.execution.AggregatingAccumulator.withBufferSerialized(AggregatingAccumulator.scala:33) at org.apache.spark.util.AccumulatorV2.writeReplace(AccumulatorV2.scala:186) at jdk.internal.reflect.GeneratedMethodAccessor62.invoke(Unknown Source) at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source) at java.base/java.lang.reflect.Method.invoke(Unknown Source) at java.base/java.io.ObjectStreamClass.invokeWriteReplace(Unknown Source) at java.base/java.io.ObjectOutputStream.writeObject0(Unknown Source) at java.base/java.io.ObjectOutputStream.writeObject(Unknown Source) at org.apache.spark.scheduler.DirectTaskResult.$anonfun$writeExternal$2(TaskResult.scala:62) at org.apache.spark.scheduler.DirectTaskResult.$anonfun$writeExternal$2$adapted(TaskResult.scala:62) at scala.collection.immutable.Vector.foreach(Vector.scala:1856) at org.apache.spark.scheduler.DirectTaskResult.$anonfun$writeExternal$1(TaskResult.scala:62) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.scala:18) at org.apache.spark.util.Utils$.tryOrIO
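The failure mode described above (in-place buffer serialization running twice, the second pass "casting" a byte array back to the typed buffer) can be modeled with a minimal sketch. This is a Python stand-in with illustrative names, not Spark's real Scala API:

```python
import pickle


class AggregatingAccumulatorSketch:
    """Toy model of the reported pattern: with_buffer_serialized replaces
    the typed aggregation buffer with its serialized bytes *in place*.
    If the task result is serialized twice, the second pass finds bytes
    where it expects the typed object, mirroring
    'class [B cannot be cast to class ...Reservoir'."""

    def __init__(self):
        self.buffer = {"reservoir": [1, 2, 3]}  # typed buffer object

    def with_buffer_serialized(self):
        if isinstance(self.buffer, bytes):
            # Stand-in for asInstanceOf[T] failing on a byte array.
            raise TypeError("buffer is already serialized bytes")
        self.buffer = pickle.dumps(self.buffer)
        return self


acc = AggregatingAccumulatorSketch()
acc.with_buffer_serialized()           # first serialization succeeds
try:
    acc.with_buffer_serialized()       # second pass hits the "cast" failure
    double_serialize_failed = False
except TypeError:
    double_serialize_failed = True
```

An idempotence guard (returning early when the buffer is already bytes) would make the second call harmless, which is one plausible shape of a fix, assuming the double invocation itself is hard to eliminate.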
[jira] [Created] (SPARK-45176) AggregatingAccumulator with TypedImperativeAggregate throwing ClassCastException
Huw created SPARK-45176: --- Summary: AggregatingAccumulator with TypedImperativeAggregate throwing ClassCastException Key: SPARK-45176 URL: https://issues.apache.org/jira/browse/SPARK-45176 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.4.1, 3.4.0 Reporter: Huw Probably related to SPARK-39044. But potentially also this comment in Executor.scala. // TODO: do not serialize value twice val directResult = new DirectTaskResult(valueByteBuffer, accumUpdates, metricPeaks) The class cast exception I'm seeing is java.lang.ClassCastException: class [B cannot be cast to class org.apache.spark.sql.catalyst.expressions.aggregate.Reservoir But I've seen it with other aggregation buffers like QuantileSummaries as well. It's my belief that withBufferSerialized() for the AggregatingAccumulator is being called twice, leading to serializeAggregateBufferInPlace(buffer) also being called twice for an Imperative aggregate; the second time round, the buffer is already a byte array and the asInstanceOf[T] in getBufferObject is throwing. This doesn't appear to happen on all runs, and it may only be occurring when there's a transient exception. I have a further suspicion that the cause might originate with SerializationDebugger.improveException which is traversing the task and forcing writeExternal to be called. Setting |spark.serializer.extraDebugInfo|false| seems to make things a bit more reliable (I haven't seen the error while this setting is on), and points strongly in that direction. Stack trace: Job aborted due to stage failure: Authorized committer (attemptNumber=0, stage=15, partition=10) failed; but task commit success, data duplication may happen. 
reason=ExceptionFailure(java.io.IOException,java.lang.ClassCastException: class [B cannot be cast to class org.apache.spark.sql.catalyst.expressions.aggregate.Reservoir ([B is in module java.base of loader 'bootstrap'; org.apache.spark.sql.catalyst.expressions.aggregate.Reservoir is in unnamed module of loader 'app'),[Ljava.lang.StackTraceElement;@7fe2f462,java.io.IOException: java.lang.ClassCastException: class [B cannot be cast to class org.apache.spark.sql.catalyst.expressions.aggregate.Reservoir ([B is in module java.base of loader 'bootstrap'; org.apache.spark.sql.catalyst.expressions.aggregate.Reservoir is in unnamed module of loader 'app') at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1502) at org.apache.spark.scheduler.DirectTaskResult.writeExternal(TaskResult.scala:59) at java.base/java.io.ObjectOutputStream.writeExternalData(Unknown Source) at java.base/java.io.ObjectOutputStream.writeOrdinaryObject(Unknown Source) at java.base/java.io.ObjectOutputStream.writeObject0(Unknown Source) at java.base/java.io.ObjectOutputStream.writeObject(Unknown Source) at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46) at org.apache.spark.serializer.SerializerHelper$.serializeToChunkedBuffer(SerializerHelper.scala:42) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:643) at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) at java.base/java.lang.Thread.run(Unknown Source) Caused by: java.lang.ClassCastException: class [B cannot be cast to class org.apache.spark.sql.catalyst.expressions.aggregate.Reservoir ([B is in module java.base of loader 'bootstrap'; org.apache.spark.sql.catalyst.expressions.aggregate.Reservoir is in unnamed module of loader 'app') at org.apache.spark.sql.catalyst.expressions.aggregate.ReservoirSample.serialize(ReservoirSample.scala:33) at 
org.apache.spark.sql.catalyst.expressions.aggregate.TypedImperativeAggregate.serializeAggregateBufferInPlace(interfaces.scala:624) at org.apache.spark.sql.execution.AggregatingAccumulator.withBufferSerialized(AggregatingAccumulator.scala:206) at org.apache.spark.sql.execution.AggregatingAccumulator.withBufferSerialized(AggregatingAccumulator.scala:33) at org.apache.spark.util.AccumulatorV2.writeReplace(AccumulatorV2.scala:186) at jdk.internal.reflect.GeneratedMethodAccessor62.invoke(Unknown Source) at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source) at java.base/java.lang.reflect.Method.invoke(Unknown Source) at java.base/java.io.ObjectStreamClass.invokeWriteReplace(Unknown Source) at java.base/java.io.ObjectOutputStream.writeObject0(Unknown Source) at java.base/java.io.ObjectOutputStream.writeObject(Unknown Source) at org.apache.spark.scheduler.DirectTaskResult.$anonfun$writeExternal$2(TaskResult.scala:62) at org.apa
[jira] [Created] (SPARK-45175) download krb5.conf from remote storage in spark-submit on k8s
Qian Sun created SPARK-45175: Summary: download krb5.conf from remote storage in spark-submit on k8s Key: SPARK-45175 URL: https://issues.apache.org/jira/browse/SPARK-45175 Project: Spark Issue Type: Improvement Components: Kubernetes Affects Versions: 3.4.1 Reporter: Qian Sun krb5.conf currently supports only local files. Tenants would like to store this file on their own servers and download it during the spark-submit phase, to better support multi-tenant scenarios. The proposed solution is to use the *downloadFile* function [1], similar to the handling of *spark.kubernetes.driver/executor.podTemplateFile* [1]https://github.com/apache/spark/blob/822f58f0d26b7d760469151a65eaf9ee863a07a1/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/PodTemplateConfigMapStep.scala#L82C24-L82C24 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
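The proposed flow (fetch a remote krb5.conf to a local path during submission, so the existing local-file code path can consume it) can be sketched as follows. This is a hypothetical Python helper for illustration only; the actual proposal would reuse Spark's Scala `downloadFile`/`KubernetesUtils` machinery. The demo uses a `file://` URL as a stand-in for a tenant's remote storage:

```python
import os
import tempfile
import urllib.request


def download_remote_conf(url: str, dest_dir: str, name: str = "krb5.conf") -> str:
    """Hypothetical helper: fetch a krb5.conf from remote storage to a
    local path during the spark-submit phase, returning the local path
    that would then be passed on as if it were a local file."""
    dest = os.path.join(dest_dir, name)
    urllib.request.urlretrieve(url, dest)
    return dest


# Demo: a file:// URL stands in for the tenant's remote storage server.
src = tempfile.NamedTemporaryFile(mode="w", suffix=".conf", delete=False)
src.write("[libdefaults]\n default_realm = EXAMPLE.COM\n")
src.close()
local_path = download_remote_conf("file://" + src.name, tempfile.mkdtemp())
```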
[jira] [Created] (SPARK-45174) Support spark.deploy.maxDrivers
Dongjoon Hyun created SPARK-45174: - Summary: Support spark.deploy.maxDrivers Key: SPARK-45174 URL: https://issues.apache.org/jira/browse/SPARK-45174 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 4.0.0 Reporter: Dongjoon Hyun Like `spark.mesos.maxDrivers`, this issue aims to add `spark.deploy.maxDrivers`. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45174) Support spark.deploy.maxDrivers
[ https://issues.apache.org/jira/browse/SPARK-45174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-45174: --- Labels: pull-request-available (was: ) > Support spark.deploy.maxDrivers > --- > > Key: SPARK-45174 > URL: https://issues.apache.org/jira/browse/SPARK-45174 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Priority: Major > Labels: pull-request-available > > Like `spark.mesos.maxDrivers`, this issue aims to add > `spark.deploy.maxDrivers`. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45173) Remove some unnecessary sourceMapping files in UI
[ https://issues.apache.org/jira/browse/SPARK-45173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-45173: --- Labels: pull-request-available (was: ) > Remove some unnecessary sourceMapping files in UI > - > > Key: SPARK-45173 > URL: https://issues.apache.org/jira/browse/SPARK-45173 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 4.0.0 >Reporter: Kent Yao >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-45173) Remove some unnecessary sourceMapping files in UI
Kent Yao created SPARK-45173: Summary: Remove some unnecessary sourceMapping files in UI Key: SPARK-45173 URL: https://issues.apache.org/jira/browse/SPARK-45173 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 4.0.0 Reporter: Kent Yao -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-45159) Handle named arguments only when necessary
[ https://issues.apache.org/jira/browse/SPARK-45159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-45159. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 42915 [https://github.com/apache/spark/pull/42915] > Handle named arguments only when necessary > -- > > Key: SPARK-45159 > URL: https://issues.apache.org/jira/browse/SPARK-45159 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 4.0.0 >Reporter: Takuya Ueshin >Assignee: Takuya Ueshin >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-45159) Handle named arguments only when necessary
[ https://issues.apache.org/jira/browse/SPARK-45159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-45159: Assignee: Takuya Ueshin > Handle named arguments only when necessary > -- > > Key: SPARK-45159 > URL: https://issues.apache.org/jira/browse/SPARK-45159 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 4.0.0 >Reporter: Takuya Ueshin >Assignee: Takuya Ueshin >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-44752) XML: Update Spark Docs
[ https://issues.apache.org/jira/browse/SPARK-44752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17765419#comment-17765419 ] tangjiafu commented on SPARK-44752: --- I have used Spark XML in my project before, and I think I can do some testing and complete this PR. Can you assign this PR to me? This is my 'good first issue' for Spark > XML: Update Spark Docs > -- > > Key: SPARK-44752 > URL: https://issues.apache.org/jira/browse/SPARK-44752 > Project: Spark > Issue Type: Sub-task > Components: Documentation >Affects Versions: 4.0.0 >Reporter: Sandip Agarwala >Priority: Major > > [https://spark.apache.org/docs/latest/sql-data-sources.html] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-45084) ProgressReport should include an accurate effective shuffle partition number
[ https://issues.apache.org/jira/browse/SPARK-45084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim resolved SPARK-45084. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 42822 [https://github.com/apache/spark/pull/42822] > ProgressReport should include an accurate effective shuffle partition number > > > Key: SPARK-45084 > URL: https://issues.apache.org/jira/browse/SPARK-45084 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.4.2 >Reporter: Siying Dong >Assignee: Siying Dong >Priority: Minor > Labels: pull-request-available > Fix For: 4.0.0 > > > Currently, there is a numShufflePartitions "metric" reported in the > StateOperatorProgress part of the progress report. However, the number is > reported by aggregating over executors, so in the case of a task retry or a speculative > executor, the metric is higher than the number of shuffle partitions for the > query plan. The number of shuffle partitions can be useful for reporting purposes, > so having an accurate metric is helpful. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
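The over-counting described above is easy to see numerically. A minimal sketch (illustrative numbers, not real Spark metrics): when per-executor partition counts are summed, a retried task's partitions are counted twice, so the reported total exceeds the plan's shuffle partition number:

```python
# Per-executor numShufflePartitions reports for one stage. "exec1-retry"
# reprocessed 20 partitions that exec1 already reported (illustrative data).
reports = {"exec1": 100, "exec2": 100, "exec1-retry": 20}

aggregated = sum(reports.values())   # naive sum over executors: 220
plan_partitions = 200                # the query plan's shuffle partition number

# The summed metric is inflated relative to the plan whenever retries or
# speculative executors double-count work.
inflated = aggregated > plan_partitions
```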
[jira] [Assigned] (SPARK-45084) ProgressReport should include an accurate effective shuffle partition number
[ https://issues.apache.org/jira/browse/SPARK-45084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim reassigned SPARK-45084: Assignee: Siying Dong > ProgressReport should include an accurate effective shuffle partition number > > > Key: SPARK-45084 > URL: https://issues.apache.org/jira/browse/SPARK-45084 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.4.2 >Reporter: Siying Dong >Assignee: Siying Dong >Priority: Minor > Labels: pull-request-available > > Currently, there is a numShufflePartitions "metric" reported in the > StateOperatorProgress part of the progress report. However, the number is > reported by aggregating over executors, so in the case of a task retry or a speculative > executor, the metric is higher than the number of shuffle partitions for the > query plan. The number of shuffle partitions can be useful for reporting purposes, > so having an accurate metric is helpful. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-43406) enable spark sql to drop multiple partitions in one call
[ https://issues.apache.org/jira/browse/SPARK-43406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang resolved SPARK-43406. - Resolution: Duplicate > enable spark sql to drop multiple partitions in one call > > > Key: SPARK-43406 > URL: https://issues.apache.org/jira/browse/SPARK-43406 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.1, 3.3.2, 3.4.0 >Reporter: chenruotao >Priority: Major > > Currently Spark SQL cannot drop multiple partitions in one call; this patch fixes that. > With this patch we can drop multiple partitions like this: > alter table test.table_partition drop partition(dt<='2023-04-02', > dt>='2023-03-31') -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43406) enable spark sql to drop multiple partitions in one call
[ https://issues.apache.org/jira/browse/SPARK-43406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-43406: Target Version/s: (was: 4.0.0) > enable spark sql to drop multiple partitions in one call > > > Key: SPARK-43406 > URL: https://issues.apache.org/jira/browse/SPARK-43406 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.1, 3.3.2, 3.4.0 >Reporter: chenruotao >Priority: Major > > Currently Spark SQL cannot drop multiple partitions in one call; this patch fixes that. > With this patch we can drop multiple partitions like this: > alter table test.table_partition drop partition(dt<='2023-04-02', > dt>='2023-03-31') -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43406) enable spark sql to drop multiple partitions in one call
[ https://issues.apache.org/jira/browse/SPARK-43406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-43406: Fix Version/s: (was: 3.5.0) > enable spark sql to drop multiple partitions in one call > > > Key: SPARK-43406 > URL: https://issues.apache.org/jira/browse/SPARK-43406 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.1, 3.3.2, 3.4.0 >Reporter: chenruotao >Priority: Major > > Currently Spark SQL cannot drop multiple partitions in one call; this patch fixes that. > With this patch we can drop multiple partitions like this: > alter table test.table_partition drop partition(dt<='2023-04-02', > dt>='2023-03-31') -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43406) enable spark sql to drop multiple partitions in one call
[ https://issues.apache.org/jira/browse/SPARK-43406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-43406: Target Version/s: 4.0.0 > enable spark sql to drop multiple partitions in one call > > > Key: SPARK-43406 > URL: https://issues.apache.org/jira/browse/SPARK-43406 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.1, 3.3.2, 3.4.0 >Reporter: chenruotao >Priority: Major > > Currently Spark SQL cannot drop multiple partitions in one call; this patch fixes that. > With this patch we can drop multiple partitions like this: > alter table test.table_partition drop partition(dt<='2023-04-02', > dt>='2023-03-31') -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] (SPARK-37487) CollectMetrics is executed twice if it is followed by a sort
[ https://issues.apache.org/jira/browse/SPARK-37487 ] Huw deleted comment on SPARK-37487: - was (Author: JIRAUSER288917): I think I've seen crashes because of this in production. I can't reproduce locally, but I believe that Imperative aggregates are having their `serializeAggregateBufferInPlace` function called twice, the second time it's doing an unsafe coerce onto a byte buffer. {quote}Caused by: java.lang.ClassCastException: class [B cannot be cast to class org.apache.spark.sql.catalyst.expressions.aggregate.ApproximatePercentile$PercentileDigest ([B is in module java.base of loader 'bootstrap'; org.apache.spark.sql.catalyst.expressions.aggregate.ApproximatePercentile$PercentileDigest is in unnamed module of loader 'app') at org.apache.spark.sql.catalyst.expressions.aggregate.ApproxQuantiles.serialize(ApproxQuantiles.scala:19) at org.apache.spark.sql.catalyst.expressions.aggregate.TypedImperativeAggregate.serializeAggregateBufferInPlace(interfaces.scala:624) at org.apache.spark.sql.execution.AggregatingAccumulator.withBufferSerialized(AggregatingAccumulator.scala:206) at org.apache.spark.sql.execution.AggregatingAccumulator.withBufferSerialized(AggregatingAccumulator.scala:33){quote} > CollectMetrics is executed twice if it is followed by a sort > > > Key: SPARK-37487 > URL: https://issues.apache.org/jira/browse/SPARK-37487 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0 >Reporter: Tanel Kiis >Priority: Major > Labels: correctness > > It is best exemplified by this new UT in DataFrameCallbackSuite: > {code} > test("SPARK-37487: get observable metrics with sort by callback") { > val df = spark.range(100) > .observe( > name = "my_event", > min($"id").as("min_val"), > max($"id").as("max_val"), > // Test unresolved alias > sum($"id"), > count(when($"id" % 2 === 0, 1)).as("num_even")) > .observe( > name = "other_event", > avg($"id").cast("int").as("avg_val")) > .sort($"id".desc) > validateObservedMetrics(df) > } > {code} > The count and 
sum aggregates report twice the number of rows: > {code} > [info] - SPARK-37487: get observable metrics with sort by callback *** FAILED > *** (169 milliseconds) > [info] [0,99,9900,100] did not equal [0,99,4950,50] > (DataFrameCallbackSuite.scala:342) > [info] org.scalatest.exceptions.TestFailedException: > [info] at > org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472) > [info] at > org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:471) > [info] at > org.scalatest.Assertions$.newAssertionFailedException(Assertions.scala:1231) > [info] at > org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:1295) > [info] at > org.apache.spark.sql.util.DataFrameCallbackSuite.checkMetrics$1(DataFrameCallbackSuite.scala:342) > [info] at > org.apache.spark.sql.util.DataFrameCallbackSuite.validateObservedMetrics(DataFrameCallbackSuite.scala:350) > [info] at > org.apache.spark.sql.util.DataFrameCallbackSuite.$anonfun$new$21(DataFrameCallbackSuite.scala:324) > [info] at > scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) > [info] at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85) > [info] at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83) > [info] at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > [info] at org.scalatest.Transformer.apply(Transformer.scala:22) > [info] at org.scalatest.Transformer.apply(Transformer.scala:20) > {code} > I could not figure out how this happens. Hopefully the UT can help with > debugging -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42466) spark.kubernetes.file.upload.path not deleting files under HDFS after job completes
[ https://issues.apache.org/jira/browse/SPARK-42466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-42466: --- Labels: pull-request-available (was: ) > spark.kubernetes.file.upload.path not deleting files under HDFS after job > completes > --- > > Key: SPARK-42466 > URL: https://issues.apache.org/jira/browse/SPARK-42466 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 3.2.0, 3.3.2 >Reporter: Jagadeeswara Rao >Priority: Major > Labels: pull-request-available > > In cluster mode, after uploading files to the HDFS location given by the > spark.kubernetes.file.upload.path property, the files are not getting cleared. > The file is successfully uploaded to an HDFS location of the form > spark-upload-[randomUUID] when {{KubernetesUtils}} is requested to > uploadFileUri. > [https://github.com/apache/spark/blob/76a134ade60a9f354aca01eaca0b2e2477c6bd43/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/KubernetesUtils.scala#L310] > The following is the driver log; the driver completed successfully but the shutdown hook > did not clear the HDFS files. > {code:java} > 23/02/16 18:06:56 INFO KubernetesClusterSchedulerBackend: Shutting down all > executors > 23/02/16 18:06:56 INFO > KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Asking each > executor to shut down > 23/02/16 18:06:56 WARN ExecutorPodsWatchSnapshotSource: Kubernetes client has > been closed. > 23/02/16 18:06:57 INFO MapOutputTrackerMasterEndpoint: > MapOutputTrackerMasterEndpoint stopped! > 23/02/16 18:06:57 INFO MemoryStore: MemoryStore cleared > 23/02/16 18:06:57 INFO BlockManager: BlockManager stopped > 23/02/16 18:06:57 INFO BlockManagerMaster: BlockManagerMaster stopped > 23/02/16 18:06:57 INFO > OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: > OutputCommitCoordinator stopped! 
> 23/02/16 18:06:57 INFO SparkContext: Successfully stopped SparkContext > 23/02/16 18:06:57 INFO ShutdownHookManager: Shutdown hook called > 23/02/16 18:06:57 INFO ShutdownHookManager: Deleting directory > /tmp/spark-efb8f725-4ead-4729-a8e0-f478280121b7 > 23/02/16 18:06:57 INFO ShutdownHookManager: Deleting directory > /spark-local2/spark-66dbf7e6-fe7e-4655-8724-69d76d93fc1f > 23/02/16 18:06:57 INFO ShutdownHookManager: Deleting directory > /spark-local1/spark-53aefaee-58a5-4fce-b5b0-5e29f42e337f{code} > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
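The missing behavior (the shutdown hook deletes local temp dirs but leaves the remote spark-upload-* staging directories) can be sketched as a small cleanup routine. This is a hypothetical Python illustration against a local directory standing in for the HDFS upload path; a real fix would go through Hadoop's FileSystem API from Scala:

```python
import os
import shutil
import tempfile


def cleanup_upload_dirs(base_dir: str, prefix: str = "spark-upload-") -> list:
    """Delete staging directories matching the spark-upload-[randomUUID]
    naming pattern under base_dir; a sketch of the cleanup the issue
    reports the shutdown hook does not perform for HDFS uploads."""
    removed = []
    for name in os.listdir(base_dir):
        if name.startswith(prefix):
            shutil.rmtree(os.path.join(base_dir, name))
            removed.append(name)
    return removed


# Demo: a local temp dir stands in for spark.kubernetes.file.upload.path.
base = tempfile.mkdtemp()
os.makedirs(os.path.join(base, "spark-upload-1234"))
os.makedirs(os.path.join(base, "keep-me"))          # unrelated dir survives
removed = cleanup_upload_dirs(base)
```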
[jira] [Updated] (SPARK-16484) Incremental Cardinality estimation operations with Hyperloglog
[ https://issues.apache.org/jira/browse/SPARK-16484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-16484: --- Labels: bulk-closed pull-request-available (was: bulk-closed) > Incremental Cardinality estimation operations with Hyperloglog > -- > > Key: SPARK-16484 > URL: https://issues.apache.org/jira/browse/SPARK-16484 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Yongjia Wang >Assignee: Ryan Berti >Priority: Major > Labels: bulk-closed, pull-request-available > Fix For: 3.5.0 > > > Efficient cardinality estimation is very important, and SparkSQL has had > approxCountDistinct based on Hyperloglog for quite some time. However, there > isn't a way to do incremental estimation. For example, if we want to get > updated distinct counts of the last 90 days, we need to do the aggregation > for the entire window over and over again. The more efficient way involves > serializing the counter for smaller time windows (such as hourly) so the > counts can be efficiently updated in an incremental fashion for any time > window. > With the support of custom UDAF, Binary DataType and the HyperloglogPlusPlus > implementation in the current Spark version, it's easy enough to extend the > functionality to include incremental counting, and even other general set > operations such as intersection and set difference. Spark API is already as > elegant as it can be, but it still takes quite some effort to do a custom > implementation of the aforementioned operations which are supposed to be in > high demand. I have been searching but failed to find a usable existing > solution or any ongoing effort for this. The closest I got is the following > but it does not work with Spark 1.6 due to API changes. > https://github.com/collectivemedia/spark-hyperloglog/blob/master/src/main/scala/org/apache/spark/sql/hyperloglog/aggregates.scala > I wonder if it is worth integrating such operations into SparkSQL. 
The only > problem I see is that it depends on the serialization of a specific HLL implementation > and introduces compatibility issues. But as long as the user is aware of such > issues, it should be fine. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
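The incremental scheme described above (serialize a per-hour counter, then merge stored counters for any window without rescanning raw data) can be sketched in Python. Exact sets stand in for HLL sketches here, purely to show the merge/cardinality shape; a real implementation would serialize a fixed-size HLL++ register array, whose merge is likewise a pointwise combine:

```python
import pickle


def make_sketch(values):
    # Stand-in for an HLL++ sketch: an exact set. Real sketches are
    # approximate and fixed-size, but expose the same make/merge/count API.
    return pickle.dumps(set(values))


def merge_sketches(*blobs):
    # Set union corresponds to HLL merge: a window's count comes from
    # stored hourly sketches, not from re-aggregating the raw events.
    merged = set()
    for blob in blobs:
        merged |= pickle.loads(blob)
    return pickle.dumps(merged)


def cardinality(blob):
    return len(pickle.loads(blob))


# Hourly sketches rolled up into a larger window incrementally.
hour1 = make_sketch(["u1", "u2"])
hour2 = make_sketch(["u2", "u3"])
window = merge_sketches(hour1, hour2)
```

Note that union merges are lossless for HLL; the intersection and set-difference operations the reporter mentions are only obtainable indirectly (e.g. via inclusion-exclusion) and carry larger error.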
[jira] [Updated] (SPARK-45172) Upgrade commons-compress.version from 1.23.0 to 1.24.0
[ https://issues.apache.org/jira/browse/SPARK-45172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-45172: --- Labels: pull-request-available (was: ) > Upgrade commons-compress.version from 1.23.0 to 1.24.0 > -- > > Key: SPARK-45172 > URL: https://issues.apache.org/jira/browse/SPARK-45172 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 4.0.0 >Reporter: Hyukjin Kwon >Priority: Minor > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45172) Upgrade commons-compress.version from 1.23.0 to 1.24.0
[ https://issues.apache.org/jira/browse/SPARK-45172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-45172: - Summary: Upgrade commons-compress.version from 1.23.0 to 1.24.0 (was: Upgrade commons-compress.version from 1.23.0 to .124.0) > Upgrade commons-compress.version from 1.23.0 to 1.24.0 > -- > > Key: SPARK-45172 > URL: https://issues.apache.org/jira/browse/SPARK-45172 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 4.0.0 >Reporter: Hyukjin Kwon >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-45172) Upgrade commons-compress.version from 1.23.0 to .124.0
Hyukjin Kwon created SPARK-45172: Summary: Upgrade commons-compress.version from 1.23.0 to .124.0 Key: SPARK-45172 URL: https://issues.apache.org/jira/browse/SPARK-45172 Project: Spark Issue Type: Bug Components: Build Affects Versions: 4.0.0 Reporter: Hyukjin Kwon -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45172) Upgrade commons-compress.version from 1.23.0 to .124.0
[ https://issues.apache.org/jira/browse/SPARK-45172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-45172: - Issue Type: Improvement (was: Bug) > Upgrade commons-compress.version from 1.23.0 to .124.0 > -- > > Key: SPARK-45172 > URL: https://issues.apache.org/jira/browse/SPARK-45172 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 4.0.0 >Reporter: Hyukjin Kwon >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45171) GenerateExec fails to initialize non-deterministic expressions before use
[ https://issues.apache.org/jira/browse/SPARK-45171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-45171: --- Labels: pull-request-available (was: ) > GenerateExec fails to initialize non-deterministic expressions before use > - > > Key: SPARK-45171 > URL: https://issues.apache.org/jira/browse/SPARK-45171 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.5.0 >Reporter: Bruce Robbins >Priority: Major > Labels: pull-request-available > > The following query fails: > {noformat} > select * > from explode( > transform(sequence(0, cast(rand()*1000 as int) + 1), x -> x * 22) > ); > {noformat} > The error is: > {noformat} > 23/09/14 09:27:25 ERROR Executor: Exception in task 0.0 in stage 3.0 (TID 3) > java.lang.IllegalArgumentException: requirement failed: Nondeterministic > expression org.apache.spark.sql.catalyst.expressions.Rand should be > initialized before eval. > at scala.Predef$.require(Predef.scala:281) > at > org.apache.spark.sql.catalyst.expressions.Nondeterministic.eval(Expression.scala:497) > at > org.apache.spark.sql.catalyst.expressions.Nondeterministic.eval$(Expression.scala:495) > at > org.apache.spark.sql.catalyst.expressions.RDG.eval(randomExpressions.scala:35) > at > org.apache.spark.sql.catalyst.expressions.BinaryArithmetic.eval(arithmetic.scala:384) > at > org.apache.spark.sql.catalyst.expressions.UnaryExpression.eval(Expression.scala:543) > at > org.apache.spark.sql.catalyst.expressions.BinaryArithmetic.eval(arithmetic.scala:384) > at > org.apache.spark.sql.catalyst.expressions.Sequence.eval(collectionOperations.scala:3062) > at > org.apache.spark.sql.catalyst.expressions.SimpleHigherOrderFunction.eval(higherOrderFunctions.scala:275) > at > org.apache.spark.sql.catalyst.expressions.SimpleHigherOrderFunction.eval$(higherOrderFunctions.scala:274) > at > org.apache.spark.sql.catalyst.expressions.ArrayTransform.eval(higherOrderFunctions.scala:308) > at > 
org.apache.spark.sql.catalyst.expressions.ExplodeBase.eval(generators.scala:375) > at > org.apache.spark.sql.execution.GenerateExec.$anonfun$doExecute$8(GenerateExec.scala:108) > ... > {noformat} > However, this query succeeds: > {noformat} > select * > from explode( > sequence(0, cast(rand()*1000 as int) + 1) > ); > {noformat} > The difference is that {{transform}} turns off whole-stage codegen, which > exposes a bug in {{GenerateExec}} where the non-deterministic expression > passed to the generator function is not initialized before being used. > An even simpler repro case is: > {noformat} > set spark.sql.codegen.wholeStage=false; > select explode(array(rand())); > {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-45161) Bump `previousSparkVersion` to 3.5.0
[ https://issues.apache.org/jira/browse/SPARK-45161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-45161: - Assignee: Yang Jie > Bump `previousSparkVersion` to 3.5.0 > > > Key: SPARK-45161 > URL: https://issues.apache.org/jira/browse/SPARK-45161 > Project: Spark > Issue Type: Task > Components: Project Infra >Affects Versions: 4.0.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-45161) Bump `previousSparkVersion` to 3.5.0
[ https://issues.apache.org/jira/browse/SPARK-45161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-45161. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 42921 [https://github.com/apache/spark/pull/42921] > Bump `previousSparkVersion` to 3.5.0 > > > Key: SPARK-45161 > URL: https://issues.apache.org/jira/browse/SPARK-45161 > Project: Spark > Issue Type: Task > Components: Project Infra >Affects Versions: 4.0.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45137) Unsupported map and array constructors by `sql()` in connect clients
[ https://issues.apache.org/jira/browse/SPARK-45137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-45137: --- Labels: pull-request-available (was: ) > Unsupported map and array constructors by `sql()` in connect clients > > > Key: SPARK-45137 > URL: https://issues.apache.org/jira/browse/SPARK-45137 > Project: Spark > Issue Type: Bug > Components: Connect >Affects Versions: 4.0.0 >Reporter: Max Gekk >Priority: Major > Labels: pull-request-available > > The code below demonstrates the issue: > {code:scala} > spark.sql("select element_at(?, 1)", Array(array(lit(1)))).collect() > {code} > It fails with the error: > {code:java} > [info] java.lang.UnsupportedOperationException: literal unresolved_function > { > [info] function_name: "array" > [info] arguments { > [info] literal { > [info] integer: 1 > [info] } > [info] } > [info] } > [info] not supported (yet). > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-45118) Refactor converters for complex types to short cut when the element types don't need converters
[ https://issues.apache.org/jira/browse/SPARK-45118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takuya Ueshin resolved SPARK-45118. --- Fix Version/s: 4.0.0 Assignee: Takuya Ueshin Resolution: Fixed Issue resolved by pull request 42874 https://github.com/apache/spark/pull/42874 > Refactor converters for complex types to short cut when the element types > don't need converters > --- > > Key: SPARK-45118 > URL: https://issues.apache.org/jira/browse/SPARK-45118 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 4.0.0 >Reporter: Takuya Ueshin >Assignee: Takuya Ueshin >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
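The refactor's idea can be sketched in plain Python. This is a hypothetical sketch, not PySpark's actual internals: a converter for a container type is only built when the element type actually needs one; otherwise `None` signals "no conversion needed" and the container is passed through untouched, skipping a per-element Python loop.

```python
def make_array_converter(element_converter):
    """Build a converter for array values, or None when none is needed."""
    # Shortcut: if the elements need no conversion, neither does the array.
    if element_converter is None:
        return None

    def convert(values):
        if values is None:
            return None
        return [element_converter(v) for v in values]

    return convert


def apply_converter(converter, value):
    """Apply a converter, treating None as the identity conversion."""
    return value if converter is None else converter(value)
```

The same shortcut composes recursively: a nested array type only pays for conversion along the paths where some leaf type genuinely requires it.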
[jira] [Created] (SPARK-45171) GenerateExec fails to initialize non-deterministic expressions before use
Bruce Robbins created SPARK-45171: - Summary: GenerateExec fails to initialize non-deterministic expressions before use Key: SPARK-45171 URL: https://issues.apache.org/jira/browse/SPARK-45171 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.5.0 Reporter: Bruce Robbins The following query fails: {noformat} select * from explode( transform(sequence(0, cast(rand()*1000 as int) + 1), x -> x * 22) ); {noformat} The error is: {noformat} 23/09/14 09:27:25 ERROR Executor: Exception in task 0.0 in stage 3.0 (TID 3) java.lang.IllegalArgumentException: requirement failed: Nondeterministic expression org.apache.spark.sql.catalyst.expressions.Rand should be initialized before eval. at scala.Predef$.require(Predef.scala:281) at org.apache.spark.sql.catalyst.expressions.Nondeterministic.eval(Expression.scala:497) at org.apache.spark.sql.catalyst.expressions.Nondeterministic.eval$(Expression.scala:495) at org.apache.spark.sql.catalyst.expressions.RDG.eval(randomExpressions.scala:35) at org.apache.spark.sql.catalyst.expressions.BinaryArithmetic.eval(arithmetic.scala:384) at org.apache.spark.sql.catalyst.expressions.UnaryExpression.eval(Expression.scala:543) at org.apache.spark.sql.catalyst.expressions.BinaryArithmetic.eval(arithmetic.scala:384) at org.apache.spark.sql.catalyst.expressions.Sequence.eval(collectionOperations.scala:3062) at org.apache.spark.sql.catalyst.expressions.SimpleHigherOrderFunction.eval(higherOrderFunctions.scala:275) at org.apache.spark.sql.catalyst.expressions.SimpleHigherOrderFunction.eval$(higherOrderFunctions.scala:274) at org.apache.spark.sql.catalyst.expressions.ArrayTransform.eval(higherOrderFunctions.scala:308) at org.apache.spark.sql.catalyst.expressions.ExplodeBase.eval(generators.scala:375) at org.apache.spark.sql.execution.GenerateExec.$anonfun$doExecute$8(GenerateExec.scala:108) ... 
{noformat} However, this query succeeds: {noformat} select * from explode( sequence(0, cast(rand()*1000 as int) + 1) ); {noformat} The difference is that {{transform}} turns off whole-stage codegen, which exposes a bug in {{GenerateExec}} where the non-deterministic expression passed to the generator function is not initialized before being used. An even simpler repro case is: {noformat} set spark.sql.codegen.wholeStage=false; select explode(array(rand())); {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
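For illustration, the contract the error message enforces can be sketched in plain Python (hypothetical class name, not Spark's actual API): a non-deterministic expression must receive its partition index via initialize() before eval() may be called, which is the step the interpreted GenerateExec path skips.

```python
import random


class NondeterministicExpr:
    """Sketch of the initialize-before-eval contract for expressions
    like rand(): unusable until a partition index seeds the stream."""

    def __init__(self, seed=0):
        self.seed = seed
        self._rng = None  # no random stream until initialize() runs

    def initialize(self, partition_index):
        # Folding the partition index into the seed gives each task
        # an independent, reproducible random stream.
        self._rng = random.Random(self.seed + partition_index)

    def eval(self):
        if self._rng is None:
            # mirrors the IllegalArgumentException in the stack trace
            raise ValueError(
                "Nondeterministic expression should be initialized before eval.")
        return self._rng.random()
```

Whole-stage codegen happens to emit the initialization step, which is consistent with the report: only the interpreted path (forced here by `transform` or by disabling codegen) hits the error.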
[jira] [Updated] (SPARK-43966) Support non-deterministic Python UDTFs
[ https://issues.apache.org/jira/browse/SPARK-43966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-43966: --- Labels: pull-request-available (was: ) > Support non-deterministic Python UDTFs > -- > > Key: SPARK-43966 > URL: https://issues.apache.org/jira/browse/SPARK-43966 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.5.0 >Reporter: Allison Wang >Assignee: Allison Wang >Priority: Major > Labels: pull-request-available > Fix For: 3.5.0 > > > Support Python UDTFs with non-deterministic function body and inputs. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-45170) Scala-specific improvements in Dataset[T] API
Danila Goloshchapov created SPARK-45170: --- Summary: Scala-specific improvements in Dataset[T] API Key: SPARK-45170 URL: https://issues.apache.org/jira/browse/SPARK-45170 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 3.4.1 Reporter: Danila Goloshchapov *Q1.* What are you trying to do? The main idea is to use the power of Scala's macros to give developers a more convenient and typesafe API for join conditions. *Q2.* What problem is this proposal NOT designed to solve? The R/Java/Python/DataFrame API is out of scope. The solution does not affect plan generation either. *Q3.* How is it done today, and what are the limits of current practice? Currently the join condition is specified via strings, which can lead to silly mistakes (typos, incompatible column types, etc.) and is sometimes hard to read (when several joins are made and the final type is a tuple of tuples of tuples...) *Q4.* What is new in your approach and why do you think it will be successful? Scala macros can be used to extract the column name directly from a lambda (extractor). As a side effect, it is possible to check the column type and prohibit building inconsistent join expressions (like a boolean-timestamp comparison). *Q5.* Who cares? If you are successful, what difference will it make? Mainly Scala developers who prefer typesafe code - they would get a cleaner API that makes the codebase clearer, especially when several chained joins are used. *Q6.* What are the risks? Overuse of macros may slow down compilation. In addition, macros are hard to maintain. *Q7.* How long will it take? 
The approach is already implemented as a separate [lib|https://github.com/Salamahin/joinwiz] that does a bit more than just provide an alternative API (for example, it abstracts Dataset[T] to F[T], which allows running some Spark-specific code without a Spark session for testing purposes). Adapting it won't be a hard job - a matter of several weeks. *Q8.* What are the mid-term and final “exams” to check for success? API convenience is very hard to estimate, as it is more or less a question of taste. *Appendix A* You may find examples of such a 'cleaner' API [here|https://github.com/Salamahin/joinwiz/blob/master/joinwiz_core/src/test/scala/joinwiz/ComputationEngineTest.scala] Note that backward and forward compatibility is achieved by introducing a brand-new API without modifying the old one -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
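The compile-time checks the proposal describes can be illustrated at runtime with a plain Python sketch (a hypothetical helper, not part of the proposed Scala API): join keys are validated for existence and type compatibility before any string condition is built, which is the class of mistakes the macro approach would catch at compile time.

```python
from dataclasses import dataclass, fields


def typed_join_key(left_cls, right_cls, left_col, right_col):
    """Refuse to build a join condition when a column is missing or the
    key types are incompatible, instead of failing inside the engine."""
    left_types = {f.name: f.type for f in fields(left_cls)}
    right_types = {f.name: f.type for f in fields(right_cls)}
    if left_col not in left_types:
        raise AttributeError(f"{left_cls.__name__} has no column {left_col!r}")
    if right_col not in right_types:
        raise AttributeError(f"{right_cls.__name__} has no column {right_col!r}")
    if left_types[left_col] != right_types[right_col]:
        raise TypeError(
            f"incompatible join key types: {left_types[left_col]} vs "
            f"{right_types[right_col]}")
    return f"{left_col} = {right_col}"


@dataclass
class User:
    user_id: int
    name: str


@dataclass
class Order:
    order_id: int
    user_id: int
```

Scala macros move these checks from runtime to compilation, so a typo'd column name or a boolean-vs-timestamp comparison simply fails to compile.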
[jira] [Updated] (SPARK-44141) Remove need to preinstall the buf compiler
[ https://issues.apache.org/jira/browse/SPARK-44141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-44141: --- Labels: pull-request-available (was: ) > Remove need to preinstall the buf compiler > -- > > Key: SPARK-44141 > URL: https://issues.apache.org/jira/browse/SPARK-44141 > Project: Spark > Issue Type: Improvement > Components: Connect >Affects Versions: 3.4.0 >Reporter: Arnar Pall >Priority: Minor > Labels: pull-request-available > > In order to lower the barrier of entry even further for this project we can > remove need to have {{buf}} preinstalled and just use {{go run}} > This also ensures that the tool chain remains consistent and there is less > works on my machine issues to be had. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-45169) Add official image Dockerfile for Apache Spark 3.5.0
[ https://issues.apache.org/jira/browse/SPARK-45169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yikun Jiang resolved SPARK-45169. - Fix Version/s: 3.5.0 Resolution: Fixed Issue resolved by pull request 55 [https://github.com/apache/spark-docker/pull/55] > Add official image Dockerfile for Apache Spark 3.5.0 > > > Key: SPARK-45169 > URL: https://issues.apache.org/jira/browse/SPARK-45169 > Project: Spark > Issue Type: Sub-task > Components: Spark Docker >Affects Versions: 3.5.0 >Reporter: Yikun Jiang >Priority: Major > Labels: pull-request-available > Fix For: 3.5.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45169) Add official image Dockerfile for Apache Spark 3.5.0
[ https://issues.apache.org/jira/browse/SPARK-45169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-45169: --- Labels: pull-request-available (was: ) > Add official image Dockerfile for Apache Spark 3.5.0 > > > Key: SPARK-45169 > URL: https://issues.apache.org/jira/browse/SPARK-45169 > Project: Spark > Issue Type: Sub-task > Components: Spark Docker >Affects Versions: 3.5.0 >Reporter: Yikun Jiang >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-45169) Add official image Dockerfile for Apache Spark 3.5.0
Yikun Jiang created SPARK-45169: --- Summary: Add official image Dockerfile for Apache Spark 3.5.0 Key: SPARK-45169 URL: https://issues.apache.org/jira/browse/SPARK-45169 Project: Spark Issue Type: Sub-task Components: Spark Docker Affects Versions: 3.5.0 Reporter: Yikun Jiang -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45168) Increate Pandas minimum version to 1.4.4
[ https://issues.apache.org/jira/browse/SPARK-45168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-45168: --- Labels: pull-request-available (was: ) > Increate Pandas minimum version to 1.4.4 > > > Key: SPARK-45168 > URL: https://issues.apache.org/jira/browse/SPARK-45168 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 4.0.0 >Reporter: Ruifeng Zheng >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-45168) Increate Pandas minimum version to 1.4.4
Ruifeng Zheng created SPARK-45168: - Summary: Increate Pandas minimum version to 1.4.4 Key: SPARK-45168 URL: https://issues.apache.org/jira/browse/SPARK-45168 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 4.0.0 Reporter: Ruifeng Zheng -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45167) Python Spark Connect client does not call `releaseAll`
[ https://issues.apache.org/jira/browse/SPARK-45167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-45167: --- Labels: pull-request-available (was: ) > Python Spark Connect client does not call `releaseAll` > -- > > Key: SPARK-45167 > URL: https://issues.apache.org/jira/browse/SPARK-45167 > Project: Spark > Issue Type: Bug > Components: Connect >Affects Versions: 3.5.0 >Reporter: Martin Grund >Priority: Major > Labels: pull-request-available > > The Python client does not call `releaseAll` to release previous responses on > the server and thus does not properly close the queries. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45167) Python Spark Connect client does not call `releaseAll`
[ https://issues.apache.org/jira/browse/SPARK-45167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Juliusz Sompolski updated SPARK-45167: -- Epic Link: SPARK-43754 (was: SPARK-39375) > Python Spark Connect client does not call `releaseAll` > -- > > Key: SPARK-45167 > URL: https://issues.apache.org/jira/browse/SPARK-45167 > Project: Spark > Issue Type: Bug > Components: Connect >Affects Versions: 3.5.0 >Reporter: Martin Grund >Priority: Major > > The Python client does not call `releaseAll` to release previous responses on > the server and thus does not properly close the queries. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45167) Python Spark Connect client does not call `releaseAll`
[ https://issues.apache.org/jira/browse/SPARK-45167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Martin Grund updated SPARK-45167: - Issue Type: Bug (was: Improvement) > Python Spark Connect client does not call `releaseAll` > -- > > Key: SPARK-45167 > URL: https://issues.apache.org/jira/browse/SPARK-45167 > Project: Spark > Issue Type: Bug > Components: Connect >Affects Versions: 3.5.0 >Reporter: Martin Grund >Priority: Major > > The Python client does not call `releaseAll` to release previous responses on > the server and thus does not properly close the queries. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-45167) Python Spark Connect client does not call `releaseAll`
Martin Grund created SPARK-45167: Summary: Python Spark Connect client does not call `releaseAll` Key: SPARK-45167 URL: https://issues.apache.org/jira/browse/SPARK-45167 Project: Spark Issue Type: Improvement Components: Connect Affects Versions: 3.5.0 Reporter: Martin Grund The Python client does not call `releaseAll` to release previous responses on the server and thus does not properly close the queries. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
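Why the missing call matters can be sketched with hypothetical names (not the actual Spark Connect protocol classes): the server buffers responses so a client can reattach after a transient failure, and that buffer is only freed when the client signals it has consumed the stream.

```python
class FakeExecutionServer:
    """Toy stand-in for a server that buffers responses for re-attach."""

    def __init__(self, responses):
        self.buffer = list(responses)  # retained until the client releases
        self.released = False

    def release_all(self):
        # Frees the server-side state held for this query.
        self.buffer.clear()
        self.released = True


def consume_stream(server):
    """Consume all buffered responses and release them afterwards."""
    results = list(server.buffer)
    # The reported bug is effectively forgetting this call, which leaves
    # the server-side buffer (and so the query) alive after consumption.
    server.release_all()
    return results
```

A client that iterates the responses but never releases them keeps `server.buffer` populated indefinitely, which is the leak the issue describes.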
[jira] [Updated] (SPARK-45166) Clean up unused code paths for pyarrow<4
[ https://issues.apache.org/jira/browse/SPARK-45166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-45166: --- Labels: pull-request-available (was: ) > Clean up unused code paths for pyarrow<4 > > > Key: SPARK-45166 > URL: https://issues.apache.org/jira/browse/SPARK-45166 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 4.0.0 >Reporter: Ruifeng Zheng >Priority: Minor > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-45166) Clean up unused code paths for pyarrow<4
Ruifeng Zheng created SPARK-45166: - Summary: Clean up unused code paths for pyarrow<4 Key: SPARK-45166 URL: https://issues.apache.org/jira/browse/SPARK-45166 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 4.0.0 Reporter: Ruifeng Zheng -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31177) DataFrameReader.csv incorrectly reads gzip encoded CSV from S3 when it has non-".gz" extension
[ https://issues.apache.org/jira/browse/SPARK-31177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17765139#comment-17765139 ] Avi minsky commented on SPARK-31177: [~markwaddle], [~maropu] how was this resolved? > DataFrameReader.csv incorrectly reads gzip encoded CSV from S3 when it has > non-".gz" extension > -- > > Key: SPARK-31177 > URL: https://issues.apache.org/jira/browse/SPARK-31177 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 2.4.4 >Reporter: Mark Waddle >Priority: Major > Labels: bulk-closed > > I have large CSV files that are gzipped and uploaded to S3 with > Content-Encoding=gzip. The files have file extension ".csv", as most web > clients will automatically decompress the file based on the Content-Encoding > header. Using pyspark to read these CSV files does not mimic this behavior. > Works as expected: > {code:java} > df = spark.read.csv('s3://bucket/large.csv.gz', header=True) > {code} > Does not decompress and tries to load the entire contents of the file as the > first row: > {code:java} > df = spark.read.csv('s3://bucket/large.csv', header=True) > {code} > It looks like it's relying on the file extension to determine whether the > file is gzip compressed. It would be great if S3 resources, and any other > http-based resources, could consult the Content-Encoding response header as > well. I tried to find the code that determines this, but I'm not familiar > with the code base. Any pointers would be helpful, and I can look into fixing > it. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
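The direction the reporter suggests can be sketched as follows - a hypothetical helper, not Spark's actual input code path: decide on decompression from the Content-Encoding response header when the store provides one (as S3 does), or from the gzip magic bytes, rather than from the file extension.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def maybe_decompress(raw: bytes, content_encoding=None) -> bytes:
    """Return the decoded payload, trusting the header or magic bytes
    instead of the file extension."""
    if content_encoding == "gzip" or raw[:2] == GZIP_MAGIC:
        return gzip.decompress(raw)
    return raw
```

With this approach a gzipped object named `large.csv` decodes correctly, and a genuinely plain-text `large.csv` passes through untouched.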
[jira] [Assigned] (SPARK-45119) Refine docstring of `inline`
[ https://issues.apache.org/jira/browse/SPARK-45119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng reassigned SPARK-45119: - Assignee: Allison Wang > Refine docstring of `inline` > > > Key: SPARK-45119 > URL: https://issues.apache.org/jira/browse/SPARK-45119 > Project: Spark > Issue Type: Sub-task > Components: Documentation, PySpark >Affects Versions: 4.0.0 >Reporter: Allison Wang >Assignee: Allison Wang >Priority: Major > Labels: pull-request-available > > Refine docstring of the `inline` function -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-45119) Refine docstring of `inline`
[ https://issues.apache.org/jira/browse/SPARK-45119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng resolved SPARK-45119. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 42875 [https://github.com/apache/spark/pull/42875] > Refine docstring of `inline` > > > Key: SPARK-45119 > URL: https://issues.apache.org/jira/browse/SPARK-45119 > Project: Spark > Issue Type: Sub-task > Components: Documentation, PySpark >Affects Versions: 4.0.0 >Reporter: Allison Wang >Assignee: Allison Wang >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > Refine docstring of the `inline` function -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45154) Pyspark DecisionTreeClassifier: results and tree structure in spark3 very different from that of the spark2 version on the same data and with the same hyperparameters.
[ https://issues.apache.org/jira/browse/SPARK-45154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Oumar Nour updated SPARK-45154: --- Priority: Critical (was: Major) > Pyspark DecisionTreeClassifier: results and tree structure in spark3 very > different from that of the spark2 version on the same data and with the same > hyperparameters. > --- > > Key: SPARK-45154 > URL: https://issues.apache.org/jira/browse/SPARK-45154 > Project: Spark > Issue Type: Bug > Components: ML, MLlib, PySpark >Affects Versions: 3.0.0, 3.3.1, 3.2.4, 3.3.3, 3.3.2, 3.4.0, 3.4.1 >Reporter: Oumar Nour >Priority: Critical > Labels: decisiontree, pyspark3, spark2, spark3 > > Hello, > I have an engine running on spark2 using a DecisionTreeClassifier model with > the CrossValidator. > > {code:java} > dt = DecisionTreeClassifier(maxBins=1, seed=0) > cv_dt_evaluator = BinaryClassificationEvaluator( > metricName="", > rawPredictionCol="probability") > # Create param grid and cross validator for model selection > dt_grid = ParamGridBuilder()\ > .addGrid( > dt.minInstancesPerNode, [100] > )\ > .addGrid( > dt.maxDepth, [10] > )\ > .build() > cv = CrossValidator( > estimator=dt, estimatorParamMaps=dt_grid, > evaluator=cv_dt_evaluator, > parallelism=4, > numFolds=4 > ){code} > > I want to {*}migrate from spark2 to spark3{*}. I've run > *DecisionTreeClassifier* on the same data with the same parameter values. But > unfortunately my results are {*}completely different, especially in terms of > tree structure{*}. I have trees with less depth and fewer splits on spark3. > I've tried to read the documentation but I haven't found an answer to my > question. 
> I read somewhere that the behavior of the *minInstancesPerNode* parameter has > changed and that in Spark 3, {*}minInstancesPerNode{*} (it now controls the > minimum number of instances per data partition in the node to create a child > node) no longer applies to the total number of instances in a node but rather > to the number of instances per partition. This change may have an impact on > the way the decision tree is built, particularly when working with unevenly > partitioned data. *IS THIS TRUE?* > Can you help me find a solution to this problem? > Thanks in advance for your help > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
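For readers unfamiliar with the quoted setup, what CrossValidator does with a param grid can be sketched in plain Python (hypothetical callback name, not PySpark's implementation): for each parameter combination, average an evaluation metric over k folds and keep the best combination.

```python
from statistics import mean


def cross_validate(train_and_eval, folds, param_grid):
    """`train_and_eval(params, train, test)` is a hypothetical callback
    that fits a model on `train` and returns its metric on `test`
    (higher is better). Returns the winning params and their mean score."""
    best_params, best_score = None, float("-inf")
    for params in param_grid:
        scores = []
        for i, test in enumerate(folds):
            # Train on the union of every fold except the held-out one.
            train = [row for j, fold in enumerate(folds) if j != i
                     for row in fold]
            scores.append(train_and_eval(params, train, test))
        score = mean(scores)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score
```

Note this selection loop is identical across Spark versions; if the per-version tree structures differ, the cause lies in the estimator's splitting behavior, not in the cross-validation itself.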
[jira] [Updated] (SPARK-45163) Merge TABLE_OPERATION & _LEGACY_ERROR_TEMP_1113 into UNSUPPORTED_TABLE_OPERATION and refactor some logic
[ https://issues.apache.org/jira/browse/SPARK-45163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-45163: --- Labels: pull-request-available (was: ) > Merge TABLE_OPERATION & _LEGACY_ERROR_TEMP_1113 into > UNSUPPORTED_TABLE_OPERATION and refactor some logic > > > Key: SPARK-45163 > URL: https://issues.apache.org/jira/browse/SPARK-45163 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: BingKun Pan >Priority: Minor > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45163) Merge TABLE_OPERATION & _LEGACY_ERROR_TEMP_1113 into UNSUPPORTED_TABLE_OPERATION and refactor some logic
[ https://issues.apache.org/jira/browse/SPARK-45163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] BingKun Pan updated SPARK-45163: Summary: Merge TABLE_OPERATION & _LEGACY_ERROR_TEMP_1113 into UNSUPPORTED_TABLE_OPERATION and refactor some logic (was: Merge UNSUPPORTED_FEATURE.TABLE_OPERATION into UNSUPPORTED_TABLE_OPERATION and refactor some logic) > Merge TABLE_OPERATION & _LEGACY_ERROR_TEMP_1113 into > UNSUPPORTED_TABLE_OPERATION and refactor some logic > > > Key: SPARK-45163 > URL: https://issues.apache.org/jira/browse/SPARK-45163 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: BingKun Pan >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-45088) Make `getitem` work with duplicated columns
[ https://issues.apache.org/jira/browse/SPARK-45088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng reassigned SPARK-45088: - Assignee: Ruifeng Zheng > Make `getitem` work with duplicated columns > --- > > Key: SPARK-45088 > URL: https://issues.apache.org/jira/browse/SPARK-45088 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 4.0.0 >Reporter: Ruifeng Zheng >Assignee: Ruifeng Zheng >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-45088) Make `getitem` work with duplicated columns
[ https://issues.apache.org/jira/browse/SPARK-45088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng resolved SPARK-45088. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 42828 [https://github.com/apache/spark/pull/42828] > Make `getitem` work with duplicated columns > --- > > Key: SPARK-45088 > URL: https://issues.apache.org/jira/browse/SPARK-45088 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 4.0.0 >Reporter: Ruifeng Zheng >Assignee: Ruifeng Zheng >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45165) Remove `inplace` parameter from `Categorical` APIs
[ https://issues.apache.org/jira/browse/SPARK-45165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-45165: --- Labels: pull-request-available (was: ) > Remove `inplace` parameter from `Categorical` APIs > -- > > Key: SPARK-45165 > URL: https://issues.apache.org/jira/browse/SPARK-45165 > Project: Spark > Issue Type: Sub-task > Components: Pandas API on Spark >Affects Versions: 4.0.0 >Reporter: Haejoon Lee >Priority: Major > Labels: pull-request-available > > `inplace` should be removed from CategoricalIndex APIs to match the pandas > behavior -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org