[jira] [Updated] (SPARK-48068) Fix `mypy` failure in Python 3.10 and 3.11
[ https://issues.apache.org/jira/browse/SPARK-48068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-48068:
----------------------------------
    Description:

We assumed that `PYTHON_EXECUTABLE` was used by `dev/lint-python`, as in the following workflow step. That is not true, so we need to use `mypy`'s own parameter to make sure of it.

https://github.com/apache/spark/blob/ff401dde50343c9bbc1c49a0294272f2da7d01e2/.github/workflows/build_and_test.yml#L705

  was:

{code}
$ python3 --version
Python 3.10.13

$ dev/lint-python --mypy
starting mypy annotations test...
annotations failed mypy checks:
python/pyspark/sql/pandas/conversion.py:450: error: Unused "type: ignore" comment [unused-ignore]
Found 1 error in 1 file (checked 1013 source files)
1
{code}

{code}
$ python3 --version
Python 3.11.8

$ dev/lint-python --mypy
starting mypy annotations test...
annotations failed mypy checks:
python/pyspark/sql/pandas/conversion.py:450: error: Unused "type: ignore" comment [unused-ignore]
Found 1 error in 1 file (checked 1013 source files)
1
{code}

> Fix `mypy` failure in Python 3.10 and 3.11
> ------------------------------------------
>
> Key: SPARK-48068
> URL: https://issues.apache.org/jira/browse/SPARK-48068
> Project: Spark
> Issue Type: Sub-task
> Components: PySpark
> Affects Versions: 3.3.0, 4.0.0, 3.5.1, 3.3.4, 3.4.3
> Reporter: Dongjoon Hyun
> Assignee: Dongjoon Hyun
> Priority: Major
> Labels: pull-request-available
>
> We assumed that `PYTHON_EXECUTABLE` was used by `dev/lint-python`, as in the following workflow step. That is not true, so we need to use `mypy`'s own parameter to make sure of it.
> https://github.com/apache/spark/blob/ff401dde50343c9bbc1c49a0294272f2da7d01e2/.github/workflows/build_and_test.yml#L705

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
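The direction described above can be sketched in plain Python. This is a hypothetical illustration only: it assumes mypy's standard `--python-executable` flag is used to pin the interpreter explicitly, and the `mypy_command` helper plus the config and target paths are illustrative, not the actual contents of `dev/lint-python`.

```python
import shlex


def mypy_command(python_executable: str) -> list[str]:
    # Instead of relying on the PYTHON_EXECUTABLE environment variable
    # (which, per the report, dev/lint-python does not actually honor for
    # mypy), pin the interpreter with mypy's own command-line flag.
    return [
        "mypy",
        f"--python-executable={python_executable}",
        "--config-file", "python/mypy.ini",
        "python/pyspark",
    ]


print(shlex.join(mypy_command("python3.10")))
```

With this shape, the type-check result depends on the interpreter passed as an argument rather than on whichever `python3` happens to be first on `PATH`.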
[jira] [Updated] (SPARK-48068) Fix `mypy` failure in Python 3.10 and 3.11
[ https://issues.apache.org/jira/browse/SPARK-48068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-48068:
----------------------------------
    Affects Version/s: 3.4.3
                       3.3.4
                       3.5.1
                       3.3.0

> Fix `mypy` failure in Python 3.10 and 3.11
> ------------------------------------------
>
> Key: SPARK-48068
> URL: https://issues.apache.org/jira/browse/SPARK-48068
> Project: Spark
> Issue Type: Sub-task
> Components: PySpark
> Affects Versions: 3.3.0, 4.0.0, 3.5.1, 3.3.4, 3.4.3
> Reporter: Dongjoon Hyun
> Assignee: Dongjoon Hyun
> Priority: Major
> Labels: pull-request-available
>
> (mypy reproduction output for Python 3.10.13 and 3.11.8, identical to the {code} blocks quoted above)
[jira] [Updated] (SPARK-48071) Use Python 3.10 in `Python Linter` step
[ https://issues.apache.org/jira/browse/SPARK-48071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-48071:
----------------------------------
    Issue Type: Improvement  (was: Bug)

> Use Python 3.10 in `Python Linter` step
> ---------------------------------------
>
> Key: SPARK-48071
> URL: https://issues.apache.org/jira/browse/SPARK-48071
> Project: Spark
> Issue Type: Improvement
> Components: Project Infra
> Affects Versions: 4.0.0
> Reporter: Dongjoon Hyun
> Priority: Major
>
[jira] [Resolved] (SPARK-48069) Handle PEP-632 by checking `ModuleNotFoundError` on `setuptools` in Python 3.12
[ https://issues.apache.org/jira/browse/SPARK-48069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun resolved SPARK-48069.
-----------------------------------
    Fix Version/s: 4.0.0
       Resolution: Fixed

Issue resolved by pull request 46315
[https://github.com/apache/spark/pull/46315]

> Handle PEP-632 by checking `ModuleNotFoundError` on `setuptools` in Python 3.12
> -------------------------------------------------------------------------------
>
> Key: SPARK-48069
> URL: https://issues.apache.org/jira/browse/SPARK-48069
> Project: Spark
> Issue Type: Sub-task
> Components: Project Infra
> Affects Versions: 4.0.0
> Reporter: Dongjoon Hyun
> Assignee: Dongjoon Hyun
> Priority: Major
> Labels: pull-request-available
> Fix For: 4.0.0
>
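The kind of guard this issue's title describes can be sketched in plain Python. PEP 632 removed `distutils` from the standard library in Python 3.12, and `setuptools` must now be installed separately, so importing it can raise `ModuleNotFoundError`. The `has_setuptools` helper below is a hypothetical illustration, not Spark's actual code.

```python
def has_setuptools() -> bool:
    # On Python 3.12+, setuptools is no longer guaranteed to be present
    # (PEP 632 removed distutils, and setuptools ships separately), so
    # the import is guarded instead of assumed to succeed.
    try:
        import setuptools  # noqa: F401
    except ModuleNotFoundError:
        return False
    return True


print(has_setuptools())
```

Callers can then branch on the result (or skip packaging-related checks) instead of crashing at import time on a minimal Python 3.12 environment.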
[jira] [Updated] (SPARK-48069) Handle PEP-632 by checking `ModuleNotFoundError` on `setuptools` in Python 3.12
[ https://issues.apache.org/jira/browse/SPARK-48069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-48069:
----------------------------------
    Summary: Handle PEP-632 by checking `ModuleNotFoundError` on `setuptools` in Python 3.12  (was: Handle PEP-632 by checking `ModuleNotFoundError` on `setuptools`)

> Handle PEP-632 by checking `ModuleNotFoundError` on `setuptools` in Python 3.12
> -------------------------------------------------------------------------------
>
> Key: SPARK-48069
> URL: https://issues.apache.org/jira/browse/SPARK-48069
> Project: Spark
> Issue Type: Sub-task
> Components: Project Infra
> Affects Versions: 4.0.0
> Reporter: Dongjoon Hyun
> Assignee: Dongjoon Hyun
> Priority: Major
> Labels: pull-request-available
>
[jira] [Updated] (SPARK-48068) Fix `mypy` failure in Python 3.10 and 3.11
[ https://issues.apache.org/jira/browse/SPARK-48068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-48068:
----------------------------------
    Summary: Fix `mypy` failure in Python 3.10 and 3.11  (was: Fix `mypy` failure in Python 3.10+)

> Fix `mypy` failure in Python 3.10 and 3.11
> ------------------------------------------
>
> Key: SPARK-48068
> URL: https://issues.apache.org/jira/browse/SPARK-48068
> Project: Spark
> Issue Type: Sub-task
> Components: PySpark
> Affects Versions: 4.0.0
> Reporter: Dongjoon Hyun
> Assignee: Dongjoon Hyun
> Priority: Major
> Labels: pull-request-available
>
> (mypy reproduction output for Python 3.10.13 and 3.11.8, identical to the {code} blocks quoted above)
[jira] [Created] (SPARK-48069) Handle PEP-632 by checking `ModuleNotFoundError` on `setuptools`
Dongjoon Hyun created SPARK-48069:
----------------------------------

             Summary: Handle PEP-632 by checking `ModuleNotFoundError` on `setuptools`
                 Key: SPARK-48069
                 URL: https://issues.apache.org/jira/browse/SPARK-48069
             Project: Spark
          Issue Type: Sub-task
          Components: Project Infra
    Affects Versions: 4.0.0
            Reporter: Dongjoon Hyun
[jira] [Assigned] (SPARK-48069) Handle PEP-632 by checking `ModuleNotFoundError` on `setuptools`
[ https://issues.apache.org/jira/browse/SPARK-48069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun reassigned SPARK-48069:
-------------------------------------
    Assignee: Dongjoon Hyun

> Handle PEP-632 by checking `ModuleNotFoundError` on `setuptools`
> ----------------------------------------------------------------
>
> Key: SPARK-48069
> URL: https://issues.apache.org/jira/browse/SPARK-48069
> Project: Spark
> Issue Type: Sub-task
> Components: Project Infra
> Affects Versions: 4.0.0
> Reporter: Dongjoon Hyun
> Assignee: Dongjoon Hyun
> Priority: Major
>
[jira] [Assigned] (SPARK-48068) Fix `mypy` failure in Python 3.10+
[ https://issues.apache.org/jira/browse/SPARK-48068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun reassigned SPARK-48068:
-------------------------------------
    Assignee: Dongjoon Hyun

> Fix `mypy` failure in Python 3.10+
> ----------------------------------
>
> Key: SPARK-48068
> URL: https://issues.apache.org/jira/browse/SPARK-48068
> Project: Spark
> Issue Type: Sub-task
> Components: PySpark
> Affects Versions: 4.0.0
> Reporter: Dongjoon Hyun
> Assignee: Dongjoon Hyun
> Priority: Major
> Labels: pull-request-available
>
> (mypy reproduction output for Python 3.10.13 and 3.11.8, identical to the {code} blocks quoted above)
[jira] [Updated] (SPARK-48068) Fix `mypy` failure in Python 3.10+
[ https://issues.apache.org/jira/browse/SPARK-48068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-48068:
----------------------------------
        Parent: SPARK-44111
    Issue Type: Sub-task  (was: Bug)

> Fix `mypy` failure in Python 3.10+
> ----------------------------------
>
> Key: SPARK-48068
> URL: https://issues.apache.org/jira/browse/SPARK-48068
> Project: Spark
> Issue Type: Sub-task
> Components: PySpark
> Affects Versions: 4.0.0
> Reporter: Dongjoon Hyun
> Priority: Major
> Labels: pull-request-available
>
> (mypy reproduction output for Python 3.10.13 and 3.11.8, identical to the {code} blocks quoted above)
[jira] [Updated] (SPARK-48068) Fix `mypy` failure in Python 3.10+
[ https://issues.apache.org/jira/browse/SPARK-48068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-48068:
----------------------------------
    Description:
(mypy reproduction output for Python 3.10.13 and 3.11.8, identical to the {code} blocks quoted above)

  was:
(the Python 3.10.13 reproduction output only)

> Fix `mypy` failure in Python 3.10+
> ----------------------------------
>
> Key: SPARK-48068
> URL: https://issues.apache.org/jira/browse/SPARK-48068
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 4.0.0
> Reporter: Dongjoon Hyun
> Priority: Major
> Labels: pull-request-available
>
> (mypy reproduction output for Python 3.10.13 and 3.11.8, identical to the {code} blocks quoted above)
[jira] [Updated] (SPARK-48068) Fix `mypy` failure in Python 3.10+
[ https://issues.apache.org/jira/browse/SPARK-48068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-48068:
----------------------------------
    Summary: Fix `mypy` failure in Python 3.10+  (was: Fix `mypy` failure in Python 3.10)

> Fix `mypy` failure in Python 3.10+
> ----------------------------------
>
> Key: SPARK-48068
> URL: https://issues.apache.org/jira/browse/SPARK-48068
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 4.0.0
> Reporter: Dongjoon Hyun
> Priority: Major
> Labels: pull-request-available
>
> (mypy reproduction output for Python 3.10.13, identical to the first {code} block quoted above)
[jira] [Updated] (SPARK-48068) Fix `mypy` failure in Python 3.10
[ https://issues.apache.org/jira/browse/SPARK-48068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-48068:
----------------------------------
    Description:
(mypy reproduction output for Python 3.10.13, identical to the first {code} block quoted above)

  was:

{code}
$ dev/lint-python --mypy
starting mypy annotations test...
annotations failed mypy checks:
python/pyspark/sql/pandas/conversion.py:450: error: Unused "type: ignore" comment [unused-ignore]
Found 1 error in 1 file (checked 1013 source files)
1
{code}

> Fix `mypy` failure in Python 3.10
> ---------------------------------
>
> Key: SPARK-48068
> URL: https://issues.apache.org/jira/browse/SPARK-48068
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 4.0.0
> Reporter: Dongjoon Hyun
> Priority: Major
>
> (mypy reproduction output for Python 3.10.13, identical to the first {code} block quoted above)
[jira] [Created] (SPARK-48068) Fix `mypy` failure in Python 3.10
Dongjoon Hyun created SPARK-48068:
----------------------------------

             Summary: Fix `mypy` failure in Python 3.10
                 Key: SPARK-48068
                 URL: https://issues.apache.org/jira/browse/SPARK-48068
             Project: Spark
          Issue Type: Bug
          Components: PySpark
    Affects Versions: 4.0.0
            Reporter: Dongjoon Hyun

{code}
$ dev/lint-python --mypy
starting mypy annotations test...
annotations failed mypy checks:
python/pyspark/sql/pandas/conversion.py:450: error: Unused "type: ignore" comment [unused-ignore]
Found 1 error in 1 file (checked 1013 source files)
1
{code}
[jira] [Resolved] (SPARK-48028) Regenerate benchmark results after turning ANSI on
[ https://issues.apache.org/jira/browse/SPARK-48028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun resolved SPARK-48028.
-----------------------------------
    Fix Version/s: 4.0.0
       Resolution: Fixed

Issue resolved by pull request 46266
[https://github.com/apache/spark/pull/46266]

> Regenerate benchmark results after turning ANSI on
> --------------------------------------------------
>
> Key: SPARK-48028
> URL: https://issues.apache.org/jira/browse/SPARK-48028
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 4.0.0
> Reporter: Kent Yao
> Assignee: Kent Yao
> Priority: Major
> Labels: pull-request-available
> Fix For: 4.0.0
>
[jira] [Resolved] (SPARK-48063) Enable `spark.stage.ignoreDecommissionFetchFailure` by default
[ https://issues.apache.org/jira/browse/SPARK-48063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun resolved SPARK-48063.
-----------------------------------
    Fix Version/s: 4.0.0
       Resolution: Fixed

Issue resolved by pull request 46308
[https://github.com/apache/spark/pull/46308]

> Enable `spark.stage.ignoreDecommissionFetchFailure` by default
> --------------------------------------------------------------
>
> Key: SPARK-48063
> URL: https://issues.apache.org/jira/browse/SPARK-48063
> Project: Spark
> Issue Type: Sub-task
> Components: Spark Core
> Affects Versions: 4.0.0
> Reporter: Dongjoon Hyun
> Assignee: Dongjoon Hyun
> Priority: Major
> Labels: pull-request-available
> Fix For: 4.0.0
>
[jira] [Assigned] (SPARK-48063) Enable `spark.stage.ignoreDecommissionFetchFailure` by default
[ https://issues.apache.org/jira/browse/SPARK-48063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun reassigned SPARK-48063:
-------------------------------------
    Assignee: Dongjoon Hyun

> Enable `spark.stage.ignoreDecommissionFetchFailure` by default
> --------------------------------------------------------------
>
> Key: SPARK-48063
> URL: https://issues.apache.org/jira/browse/SPARK-48063
> Project: Spark
> Issue Type: Sub-task
> Components: Spark Core
> Affects Versions: 4.0.0
> Reporter: Dongjoon Hyun
> Assignee: Dongjoon Hyun
> Priority: Major
> Labels: pull-request-available
>
[jira] [Created] (SPARK-48063) Enable `spark.stage.ignoreDecommissionFetchFailure` by default
Dongjoon Hyun created SPARK-48063:
----------------------------------

             Summary: Enable `spark.stage.ignoreDecommissionFetchFailure` by default
                 Key: SPARK-48063
                 URL: https://issues.apache.org/jira/browse/SPARK-48063
             Project: Spark
          Issue Type: Sub-task
          Components: Spark Core
    Affects Versions: 4.0.0
            Reporter: Dongjoon Hyun
[jira] [Updated] (SPARK-48060) Fix StreamingQueryHashPartitionVerifySuite to update golden files correctly
[ https://issues.apache.org/jira/browse/SPARK-48060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-48060:
----------------------------------
        Parent: SPARK-44111
    Issue Type: Sub-task  (was: Test)

> Fix StreamingQueryHashPartitionVerifySuite to update golden files correctly
> ---------------------------------------------------------------------------
>
> Key: SPARK-48060
> URL: https://issues.apache.org/jira/browse/SPARK-48060
> Project: Spark
> Issue Type: Sub-task
> Components: Structured Streaming, Tests
> Affects Versions: 4.0.0
> Reporter: Dongjoon Hyun
> Assignee: Dongjoon Hyun
> Priority: Major
> Labels: pull-request-available
> Fix For: 4.0.0
>
[jira] [Resolved] (SPARK-48060) Fix StreamingQueryHashPartitionVerifySuite to update golden files correctly
[ https://issues.apache.org/jira/browse/SPARK-48060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun resolved SPARK-48060.
-----------------------------------
    Fix Version/s: 4.0.0
       Resolution: Fixed

Issue resolved by pull request 46304
[https://github.com/apache/spark/pull/46304]

> Fix StreamingQueryHashPartitionVerifySuite to update golden files correctly
> ---------------------------------------------------------------------------
>
> Key: SPARK-48060
> URL: https://issues.apache.org/jira/browse/SPARK-48060
> Project: Spark
> Issue Type: Test
> Components: Structured Streaming, Tests
> Affects Versions: 4.0.0
> Reporter: Dongjoon Hyun
> Assignee: Dongjoon Hyun
> Priority: Major
> Labels: pull-request-available
> Fix For: 4.0.0
>
[jira] [Resolved] (SPARK-48057) Enable `GroupedApplyInPandasTests. test_grouped_with_empty_partition`
[ https://issues.apache.org/jira/browse/SPARK-48057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun resolved SPARK-48057.
-----------------------------------
    Fix Version/s: 4.0.0
       Resolution: Fixed

Issue resolved by pull request 46299
[https://github.com/apache/spark/pull/46299]

> Enable `GroupedApplyInPandasTests. test_grouped_with_empty_partition`
> ---------------------------------------------------------------------
>
> Key: SPARK-48057
> URL: https://issues.apache.org/jira/browse/SPARK-48057
> Project: Spark
> Issue Type: Sub-task
> Components: Connect, PySpark, Tests
> Affects Versions: 4.0.0
> Reporter: Ruifeng Zheng
> Assignee: Ruifeng Zheng
> Priority: Major
> Labels: pull-request-available
> Fix For: 4.0.0
>
[jira] [Resolved] (SPARK-48061) Parameterize max limits of `spark.sql.test.randomDataGenerator`
[ https://issues.apache.org/jira/browse/SPARK-48061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun resolved SPARK-48061.
-----------------------------------
    Fix Version/s: 4.0.0
       Resolution: Fixed

Issue resolved by pull request 46305
[https://github.com/apache/spark/pull/46305]

> Parameterize max limits of `spark.sql.test.randomDataGenerator`
> ---------------------------------------------------------------
>
> Key: SPARK-48061
> URL: https://issues.apache.org/jira/browse/SPARK-48061
> Project: Spark
> Issue Type: Sub-task
> Components: SQL, Tests
> Affects Versions: 4.0.0
> Reporter: Dongjoon Hyun
> Assignee: Dongjoon Hyun
> Priority: Minor
> Labels: pull-request-available
> Fix For: 4.0.0
>
[jira] [Assigned] (SPARK-48061) Parameterize max limits of `spark.sql.test.randomDataGenerator`
[ https://issues.apache.org/jira/browse/SPARK-48061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun reassigned SPARK-48061:
-------------------------------------
    Assignee: Dongjoon Hyun

> Parameterize max limits of `spark.sql.test.randomDataGenerator`
> ---------------------------------------------------------------
>
> Key: SPARK-48061
> URL: https://issues.apache.org/jira/browse/SPARK-48061
> Project: Spark
> Issue Type: Sub-task
> Components: SQL, Tests
> Affects Versions: 4.0.0
> Reporter: Dongjoon Hyun
> Assignee: Dongjoon Hyun
> Priority: Minor
> Labels: pull-request-available
>
[jira] [Updated] (SPARK-48061) Parameterize max limits of `spark.sql.test.randomDataGenerator`
[ https://issues.apache.org/jira/browse/SPARK-48061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-48061:
----------------------------------
        Parent: SPARK-44111
    Issue Type: Sub-task  (was: Test)

> Parameterize max limits of `spark.sql.test.randomDataGenerator`
> ---------------------------------------------------------------
>
> Key: SPARK-48061
> URL: https://issues.apache.org/jira/browse/SPARK-48061
> Project: Spark
> Issue Type: Sub-task
> Components: SQL, Tests
> Affects Versions: 4.0.0
> Reporter: Dongjoon Hyun
> Priority: Minor
> Labels: pull-request-available
>
[jira] [Created] (SPARK-48061) Parameterize max limits of `spark.sql.test.randomDataGenerator`
Dongjoon Hyun created SPARK-48061:
----------------------------------

             Summary: Parameterize max limits of `spark.sql.test.randomDataGenerator`
                 Key: SPARK-48061
                 URL: https://issues.apache.org/jira/browse/SPARK-48061
             Project: Spark
          Issue Type: Test
          Components: SQL, Tests
    Affects Versions: 4.0.0
            Reporter: Dongjoon Hyun
[jira] [Created] (SPARK-48060) Fix StreamingQueryHashPartitionVerifySuite to update golden files correctly
Dongjoon Hyun created SPARK-48060:
----------------------------------

             Summary: Fix StreamingQueryHashPartitionVerifySuite to update golden files correctly
                 Key: SPARK-48060
                 URL: https://issues.apache.org/jira/browse/SPARK-48060
             Project: Spark
          Issue Type: Test
          Components: Structured Streaming, Tests
    Affects Versions: 4.0.0
            Reporter: Dongjoon Hyun
[jira] [Resolved] (SPARK-46122) Set `spark.sql.legacy.createHiveTableByDefault` to `false` by default
[ https://issues.apache.org/jira/browse/SPARK-46122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun resolved SPARK-46122.
-----------------------------------
    Fix Version/s: 4.0.0
       Resolution: Fixed

Issue resolved by pull request 46207
[https://github.com/apache/spark/pull/46207]

> Set `spark.sql.legacy.createHiveTableByDefault` to `false` by default
> ---------------------------------------------------------------------
>
> Key: SPARK-46122
> URL: https://issues.apache.org/jira/browse/SPARK-46122
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 4.0.0
> Reporter: Yuming Wang
> Assignee: Dongjoon Hyun
> Priority: Major
> Labels: pull-request-available
> Fix For: 4.0.0
>
[VOTE][RESULT] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false
The vote passes with 11 +1s (6 binding +1s) and one -1. Thanks to all who helped with the vote!

(* = binding)

+1:
- Dongjoon Hyun *
- Gengliang Wang *
- Liang-Chi Hsieh *
- Holden Karau *
- Zhou Jiang
- Cheng Pan
- Hyukjin Kwon *
- DB Tsai *
- Ye Xianjin
- XiDuo You
- Nimrod Ofek

+0: None

-1:
- Mich Talebzadeh
[jira] [Commented] (SPARK-48016) Fix a bug in try_divide function when with decimals
[ https://issues.apache.org/jira/browse/SPARK-48016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17842200#comment-17842200 ]

Dongjoon Hyun commented on SPARK-48016:
---------------------------------------

Thank you so much!

> Fix a bug in try_divide function when with decimals
> ---------------------------------------------------
>
> Key: SPARK-48016
> URL: https://issues.apache.org/jira/browse/SPARK-48016
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 4.0.0, 3.5.2
> Reporter: Gengliang Wang
> Assignee: Gengliang Wang
> Priority: Major
> Labels: pull-request-available
>
> Binary Arithmetic operators should include the evalMode during makeCopy.
> Otherwise, the following query will throw a DIVIDE_BY_ZERO error instead of returning null:
>
> {code:java}
> SELECT try_divide(1, decimal(0));
> {code}
>
> This is caused by the rule DecimalPrecision:
>
> {code:java}
> case b @ BinaryOperator(left, right) if left.dataType != right.dataType =>
>   (left, right) match {
>     ...
>     case (l: Literal, r) if r.dataType.isInstanceOf[DecimalType] &&
>         l.dataType.isInstanceOf[IntegralType] &&
>         literalPickMinimumPrecision =>
>       b.makeCopy(Array(Cast(l, DataTypeUtils.fromLiteral(l)), r))
> {code}
>
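The expected `try_divide` behavior from the report can be modeled in plain Python: errors during the division should become NULL (here, `None`) rather than raising, which is exactly what the `makeCopy` bug broke for decimal operands. This is a sketch of the semantics only; the `try_divide` function below is a hypothetical stand-in, not PySpark's API or the Catalyst fix itself.

```python
from decimal import Decimal, DivisionByZero, InvalidOperation


def try_divide(a, b):
    # Model of ANSI try_divide semantics: a division error yields NULL
    # (None) instead of surfacing a DIVIDE_BY_ZERO error to the query.
    try:
        return Decimal(a) / Decimal(b)
    except (DivisionByZero, InvalidOperation, ZeroDivisionError):
        return None


print(try_divide(1, 0))
print(try_divide(6, 3))
```

Under these semantics, the reported query `SELECT try_divide(1, decimal(0))` should produce NULL; the bug made it raise instead because the copied operator lost its error-handling mode.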
[jira] [Commented] (SPARK-48016) Fix a bug in try_divide function when with decimals
[ https://issues.apache.org/jira/browse/SPARK-48016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17842191#comment-17842191 ]

Dongjoon Hyun commented on SPARK-48016:
---------------------------------------

Hi, [~Gengliang.Wang].
- I updated the JIRA title according to the commit title.
- The umbrella Jira issue was done at Apache Spark 3.4.0. To give this more visibility, shall we move it to SPARK-44111, since recent ANSI JIRA issues are there?

> Fix a bug in try_divide function when with decimals
> ---------------------------------------------------
>
> Key: SPARK-48016
> URL: https://issues.apache.org/jira/browse/SPARK-48016
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 4.0.0, 3.5.2
> Reporter: Gengliang Wang
> Assignee: Gengliang Wang
> Priority: Major
> Labels: pull-request-available
>
> (description and {code:java} snippets identical to those quoted above)
[jira] [Updated] (SPARK-48016) Fix a bug in try_divide function when with decimals
[ https://issues.apache.org/jira/browse/SPARK-48016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-48016:
----------------------------------
    Summary: Fix a bug in try_divide function when with decimals  (was: Binary Arithmetic operators should include the evalMode when makeCopy)

> Fix a bug in try_divide function when with decimals
> ---------------------------------------------------
>
> Key: SPARK-48016
> URL: https://issues.apache.org/jira/browse/SPARK-48016
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 4.0.0, 3.5.2
> Reporter: Gengliang Wang
> Assignee: Gengliang Wang
> Priority: Major
> Labels: pull-request-available
>
> (description and {code:java} snippets identical to those quoted above)
Re: [DISCUSS] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false
I'm not sure why you think in that direction. What I wrote was the following. - You voted +1 for SPARK-4 on April 14th (https://lists.apache.org/thread/tp92yzf8y4yjfk6r3dkqjtlb060g82sy) - You voted -1 for SPARK-46122 on April 26th. (https://lists.apache.org/thread/2ybq1jb19j0c52rgo43zfd9br1yhtfj8) You showed a double standard for the same kind of SQL votes in two weeks. We always count all votes from all contributors in order to keep a comprehensive record of all feedback. Dongjoon. On 2024/04/29 17:49:36 Mich Talebzadeh wrote: > Your point > > ".. It's a surprise to me to see that someone has different positions in a > very short period of time in the community" > > Well, I have been with Spark since 2015 and this is an article on > Medium dated February 7, 2016 with regard to both Hive and Spark and also > presented at a Hortonworks meet-up. > > Hive on Spark Engine Versus Spark Using Hive Metastore > <https://www.linkedin.com/pulse/hive-spark-engine-versus-using-metastore-mich-talebzadeh-ph-d-/> > > With regard to why I cast a +1 vote for one and -1 for the other, I think > it is my prerogative how I vote and we leave it at that. > > Mich Talebzadeh, > Technologist | Architect | Data Engineer | Generative AI | FinCrime > London > United Kingdom > > >view my Linkedin profile > <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/> > > > https://en.everybodywiki.com/Mich_Talebzadeh > > > > *Disclaimer:* The information provided is correct to the best of my > knowledge but of course cannot be guaranteed. It is essential to note > that, as with any advice, quote "one test result is worth one-thousand > expert opinions (Werner <https://en.wikipedia.org/wiki/Wernher_von_Braun>Von > Braun <https://en.wikipedia.org/wiki/Wernher_von_Braun>)". > > > On Mon, 29 Apr 2024 at 17:32, Dongjoon Hyun wrote: > > > It's a surprise to me to see that someone has different positions > > in a very short period of time in the community. 
> > > > Mitch cast +1 for SPARK-4 and -1 for SPARK-46122. > > - https://lists.apache.org/thread/4cbkpvc3vr3b6k0wp6lgsw37spdpnqrc > > - https://lists.apache.org/thread/x09gynt90v3hh5sql1gt9dlcn6m6699p > > > > To Mitch, what I'm interested in is the following specifically. > > > 2. Compatibility: Changing the default behavior could potentially > > > break existing workflows or pipelines that rely on the current behavior. > > > > May I ask you the following questions? > > A. What is the purpose of the migration guide in the ASF projects? > > > > B. Do you claim that there is incompatibility when you have > > spark.sql.legacy.createHiveTableByDefault=true which is described > > in the migration guide? > > > > C. Do you know that ANSI SQL has new RUNTIME exceptions > > which are harder than SPARK-46122? > > > > D. Or, did you cast +1 for SPARK-4 because > > you think there is no breaking change by default? > > > > I guess there is some misunderstanding of the proposal. > > > > Thanks, > > Dongjoon. > > > > > > On Fri, Apr 26, 2024 at 12:05 PM Mich Talebzadeh < > > mich.talebza...@gmail.com> wrote: > > > >> Hi, > >> > >> I would like to add a side note regarding the discussion process and the > >> current title of the proposal. The title '[DISCUSS] SPARK-46122: Set > >> spark.sql.legacy.createHiveTableByDefault to false' focuses on a specific > >> configuration parameter, which might lead some participants to overlook its > >> broader implications (as was raised by myself and others). I believe that a > >> more descriptive title, encompassing the broader discussion on default > >> behaviours for creating Hive tables in Spark SQL, could enable greater > >> engagement within the community. This is an important topic that deserves > >> thorough consideration. 
> >> > >> HTH > >> > >> Mich Talebzadeh, > >> Technologist | Architect | Data Engineer | Generative AI | FinCrime > >> London > >> United Kingdom > >> > >> > >>view my Linkedin profile > >> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/> > >> > >> > >> https://en.everybodywiki.com/Mich_Talebzadeh > >> > >> > >> > >> *Disclaimer:* The information provided is correct to the best of my > >> knowledge but of course cannot be guaranteed . It is essential to note > >> that, as with any advice, quote &quo
[jira] [Resolved] (SPARK-48042) Don't use a copy of timestamp formatter with a new override zone for each value
[ https://issues.apache.org/jira/browse/SPARK-48042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-48042. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 46282 [https://github.com/apache/spark/pull/46282] > Don't use a copy of timestamp formatter with a new override zone for each > value > --- > > Key: SPARK-48042 > URL: https://issues.apache.org/jira/browse/SPARK-48042 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-48044) Cache `DataFrame.isStreaming`
[ https://issues.apache.org/jira/browse/SPARK-48044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-48044. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 46281 [https://github.com/apache/spark/pull/46281] > Cache `DataFrame.isStreaming` > - > > Key: SPARK-48044 > URL: https://issues.apache.org/jira/browse/SPARK-48044 > Project: Spark > Issue Type: Improvement > Components: Connect, PySpark >Affects Versions: 4.0.0 >Reporter: Ruifeng Zheng >Assignee: Ruifeng Zheng >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-48046) Remove `clock` parameter from `DriverServiceFeatureStep`
[ https://issues.apache.org/jira/browse/SPARK-48046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-48046. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 46284 [https://github.com/apache/spark/pull/46284] > Remove `clock` parameter from `DriverServiceFeatureStep` > > > Key: SPARK-48046 > URL: https://issues.apache.org/jira/browse/SPARK-48046 > Project: Spark > Issue Type: Task > Components: Kubernetes >Affects Versions: 4.0.0 > Reporter: Dongjoon Hyun > Assignee: Dongjoon Hyun >Priority: Minor > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-48046) Remove `clock` parameter from `DriverServiceFeatureStep`
[ https://issues.apache.org/jira/browse/SPARK-48046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-48046: - Assignee: Dongjoon Hyun > Remove `clock` parameter from `DriverServiceFeatureStep` > > > Key: SPARK-48046 > URL: https://issues.apache.org/jira/browse/SPARK-48046 > Project: Spark > Issue Type: Task > Components: Kubernetes >Affects Versions: 4.0.0 > Reporter: Dongjoon Hyun > Assignee: Dongjoon Hyun >Priority: Minor > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
Re: [DISCUSS] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false
>>>>>>> Thanks! >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> On Thu, 25 Apr 2024 at 16:14, Mich Talebzadeh < >>>>>>>>>> mich.talebza...@gmail.com> wrote: >>>>>>>>>> >>>>>>>>>>> My take regarding your question is that your mileage varies so >>>>>>>>>>> to speak. >>>>>>>>>>> >>>>>>>>>>> 1) Hive provides a more mature and widely adopted catalog >>>>>>>>>>> solution that integrates well with other components in the Hadoop >>>>>>>>>>> ecosystem, such as HDFS, HBase, and YARN. If you are Hadoop >>>>>>>>>>> centric (say >>>>>>>>>>> on-premise), using Hive may offer better compatibility and >>>>>>>>>>> interoperability. >>>>>>>>>>> 2) Hive provides a SQL-like interface that is familiar to users >>>>>>>>>>> who are accustomed to traditional RDBMSs. If your use case involves >>>>>>>>>>> complex >>>>>>>>>>> SQL queries or existing SQL-based workflows, using Hive may be >>>>>>>>>>> advantageous. >>>>>>>>>>> 3) If you are looking for performance, Spark's native catalog >>>>>>>>>>> tends to offer better performance for certain workloads, >>>>>>>>>>> particularly those >>>>>>>>>>> that involve iterative processing or complex data >>>>>>>>>>> transformations (my >>>>>>>>>>> understanding). Spark's in-memory processing capabilities and >>>>>>>>>>> optimizations >>>>>>>>>>> make it well-suited for interactive analytics and machine learning >>>>>>>>>>> tasks (my favourite). >>>>>>>>>>> 4) Integration with Spark Workflows: If you primarily use Spark >>>>>>>>>>> for data processing and analytics, using Spark's native catalog may >>>>>>>>>>> simplify workflow management and reduce overhead; Spark's tight >>>>>>>>>>> integration with its catalog allows for seamless interaction with >>>>>>>>>>> Spark >>>>>>>>>>> applications and libraries. >>>>>>>>>>> 5) There seems to be some similarity with the Spark catalog and >>>>>>>>>>> Databricks Unity Catalog, so that may favour the choice. 
>>>>>>>>>>> >>>>>>>>>>> HTH >>>>>>>>>>> >>>>>>>>>>> Mich Talebzadeh, >>>>>>>>>>> Technologist | Architect | Data Engineer | Generative AI | >>>>>>>>>>> FinCrime >>>>>>>>>>> London >>>>>>>>>>> United Kingdom >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>>view my Linkedin profile >>>>>>>>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> https://en.everybodywiki.com/Mich_Talebzadeh >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> *Disclaimer:* The information provided is correct to the best >>>>>>>>>>> of my knowledge but of course cannot be guaranteed. It is >>>>>>>>>>> essential to >>>>>>>>>>> note that, as with any advice, quote "one test result is worth >>>>>>>>>>> one-thousand expert opinions (Werner >>>>>>>>>>> <https://en.wikipedia.org/wiki/Wernher_von_Braun>Von Braun >>>>>>>>>>> <https://en.wikipedia.org/wiki/Wernher_von_Braun>)". >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> On Thu, 25 Apr 2024 at 12:30, Nimrod Ofek >>>>>>>>>>> wrote: >>>>>>>>>>> >>>>>>>>>>>> I will also appreciate some material that describes the >>>>>>>>>>>> differences between Spark native tables vs Hive tables and why >>>>>>>>>>>> each should >>>>>>>>>>>> be used... >>>>>>>>>>>> >>>>>>>>>>>> Thanks >>>>>>>>>>>> Nimrod >>>>>>>>>>>> >>>>>>>>>>>> On Thu, 25 Apr 2024 at 14:27, Mich Talebzadeh < >>>>>>>>>>>> mich.talebza...@gmail.com> wrote: >>>>>>>>>>>> >>>>>>>>>>>>> I see a statement made as below and I quote >>>>>>>>>>>>> >>>>>>>>>>>>> "The proposal of SPARK-46122 is to switch the default value of >>>>>>>>>>>>> this >>>>>>>>>>>>> configuration from `true` to `false` to use Spark native >>>>>>>>>>>>> tables because >>>>>>>>>>>>> we support better." >>>>>>>>>>>>> >>>>>>>>>>>>> Can you please elaborate on the above specifically with regard >>>>>>>>>>>>> to the phrase ".. because >>>>>>>>>>>>> we support better." >>>>>>>>>>>>> >>>>>>>>>>>>> Are you referring to the performance of Spark catalog (I >>>>>>>>>>>>> believe it is internal) or integration with Spark? 
>>>>>>>>>>>>> >>>>>>>>>>>>> HTH >>>>>>>>>>>>> >>>>>>>>>>>>> Mich Talebzadeh, >>>>>>>>>>>>> Technologist | Architect | Data Engineer | Generative AI | >>>>>>>>>>>>> FinCrime >>>>>>>>>>>>> London >>>>>>>>>>>>> United Kingdom >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>>view my Linkedin profile >>>>>>>>>>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/> >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> https://en.everybodywiki.com/Mich_Talebzadeh >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> *Disclaimer:* The information provided is correct to the best >>>>>>>>>>>>> of my knowledge but of course cannot be guaranteed. It is >>>>>>>>>>>>> essential to >>>>>>>>>>>>> note that, as with any advice, quote "one test result is >>>>>>>>>>>>> worth one-thousand expert opinions (Werner >>>>>>>>>>>>> <https://en.wikipedia.org/wiki/Wernher_von_Braun>Von Braun >>>>>>>>>>>>> <https://en.wikipedia.org/wiki/Wernher_von_Braun>)". >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> On Thu, 25 Apr 2024 at 11:17, Wenchen Fan >>>>>>>>>>>>> wrote: >>>>>>>>>>>>> >>>>>>>>>>>>>> +1 >>>>>>>>>>>>>> >>>>>>>>>>>>>> On Thu, Apr 25, 2024 at 2:46 PM Kent Yao >>>>>>>>>>>>>> wrote: >>>>>>>>>>>>>> >>>>>>>>>>>>>>> +1 >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Nit: the umbrella ticket is SPARK-44111, not SPARK-4. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Thanks, >>>>>>>>>>>>>>> Kent Yao >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> On Thu, 25 Apr 2024 at 14:39, Dongjoon Hyun >>>>>>>>>>>>>>> wrote: >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> > Hi, All. >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> > It's great to see community activities to polish 4.0.0 >>>>>>>>>>>>>>> more and more. >>>>>>>>>>>>>>> > Thank you all. 
>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> > I'd like to bring SPARK-46122 (another SQL topic) to you >>>>>>>>>>>>>>> from the subtasks >>>>>>>>>>>>>>> > of SPARK-4 (Prepare Apache Spark 4.0.0), >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> > - https://issues.apache.org/jira/browse/SPARK-46122 >>>>>>>>>>>>>>> >Set `spark.sql.legacy.createHiveTableByDefault` to >>>>>>>>>>>>>>> `false` by default >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> > This legacy configuration is about `CREATE TABLE` SQL >>>>>>>>>>>>>>> syntax without >>>>>>>>>>>>>>> > `USING` and `STORED AS`, which is currently mapped to >>>>>>>>>>>>>>> `Hive` table. >>>>>>>>>>>>>>> > The proposal of SPARK-46122 is to switch the default value >>>>>>>>>>>>>>> of this >>>>>>>>>>>>>>> > configuration from `true` to `false` to use Spark native >>>>>>>>>>>>>>> tables because >>>>>>>>>>>>>>> > we support better. >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> > In other words, Spark will use the value of >>>>>>>>>>>>>>> `spark.sql.sources.default` >>>>>>>>>>>>>>> > as the table provider instead of `Hive` like the other >>>>>>>>>>>>>>> Spark APIs. Of course, >>>>>>>>>>>>>>> > the users can get all the legacy behavior by setting back >>>>>>>>>>>>>>> to `true`. >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> > Historically, this behavior change was merged once at >>>>>>>>>>>>>>> Apache Spark 3.0.0 >>>>>>>>>>>>>>> > preparation via SPARK-30098 already, but reverted during >>>>>>>>>>>>>>> the 3.0.0 RC period. >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> > 2019-12-06: SPARK-30098 Use default datasource as provider >>>>>>>>>>>>>>> for CREATE TABLE >>>>>>>>>>>>>>> > 2020-05-16: SPARK-31707 Revert SPARK-30098 Use default >>>>>>>>>>>>>>> datasource as >>>>>>>>>>>>>>> > provider for CREATE TABLE command >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> > At Apache Spark 3.1.0, we had another discussion about >>>>>>>>>>>>>>> this and defined it >>>>>>>>>>>>>>> > as one of legacy behavior via this configuration via >>>>>>>>>>>>>>> reused ID, SPARK-30098. 
>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> > 2020-12-01: >>>>>>>>>>>>>>> https://lists.apache.org/thread/8c8k1jk61pzlcosz3mxo4rkj5l23r204 >>>>>>>>>>>>>>> > 2020-12-03: SPARK-30098 Add a configuration to use default >>>>>>>>>>>>>>> datasource as >>>>>>>>>>>>>>> > provider for CREATE TABLE command >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> > Last year, we received two additional requests to >>>>>>>>>>>>>>> switch this because >>>>>>>>>>>>>>> > Apache Spark 4.0.0 is a good time to make a decision for >>>>>>>>>>>>>>> the future direction. >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> > 2023-02-27: SPARK-42603 as an independent idea. >>>>>>>>>>>>>>> > 2023-11-27: SPARK-46122 as a part of the Apache Spark 4.0.0 >>>>>>>>>>>>>>> idea >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> > WDYT? The technical scope is defined in the following PR >>>>>>>>>>>>>>> which is one line of main >>>>>>>>>>>>>>> > code, one line of migration guide, and a few lines of test >>>>>>>>>>>>>>> code. >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> > - https://github.com/apache/spark/pull/46207 >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> > Dongjoon. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> - >>>>>>>>>>>>>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>
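The proposal discussed in the thread above changes which table provider a plain `CREATE TABLE` (no `USING`, no `STORED AS`) resolves to. A minimal Python sketch of that resolution logic, under stated assumptions: `resolve_provider` and its parameters are hypothetical illustration, not Spark's actual code path, and the `"parquet"` default stands in for the value of `spark.sql.sources.default`:

```python
# Hypothetical sketch of the provider-resolution rule that SPARK-46122 flips:
# with the legacy flag true, a bare CREATE TABLE becomes a Hive table; with it
# false, the table provider follows spark.sql.sources.default instead.
def resolve_provider(explicit_provider=None,
                     create_hive_table_by_default=True,
                     default_source="parquet"):
    """Return the table provider chosen for a CREATE TABLE statement."""
    if explicit_provider is not None:
        # A USING <provider> clause always wins, regardless of the flag.
        return explicit_provider
    if create_hive_table_by_default:
        # Legacy behavior: spark.sql.legacy.createHiveTableByDefault=true.
        return "hive"
    # Proposed default: fall back to spark.sql.sources.default.
    return default_source


# Before SPARK-46122: a bare CREATE TABLE makes a Hive table.
assert resolve_provider() == "hive"
# After: it follows spark.sql.sources.default (here assumed to be parquet).
assert resolve_provider(create_hive_table_by_default=False) == "parquet"
# An explicit USING clause is unaffected either way.
assert resolve_provider(explicit_provider="orc") == "orc"
```

As the thread notes, users can restore the legacy mapping by setting `spark.sql.legacy.createHiveTableByDefault` back to `true`; the sketch only illustrates why the change is a one-line default flip rather than a removal of the Hive path.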
[jira] [Created] (SPARK-48046) Remove `clock` parameter from `DriverServiceFeatureStep`
Dongjoon Hyun created SPARK-48046: - Summary: Remove `clock` parameter from `DriverServiceFeatureStep` Key: SPARK-48046 URL: https://issues.apache.org/jira/browse/SPARK-48046 Project: Spark Issue Type: Task Components: Kubernetes Affects Versions: 4.0.0 Reporter: Dongjoon Hyun -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-48038) Promote driverServiceName to KubernetesDriverConf
[ https://issues.apache.org/jira/browse/SPARK-48038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-48038: - Assignee: Cheng Pan > Promote driverServiceName to KubernetesDriverConf > - > > Key: SPARK-48038 > URL: https://issues.apache.org/jira/browse/SPARK-48038 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 4.0.0 >Reporter: Cheng Pan >Assignee: Cheng Pan >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-48038) Promote driverServiceName to KubernetesDriverConf
[ https://issues.apache.org/jira/browse/SPARK-48038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-48038. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 46276 [https://github.com/apache/spark/pull/46276] > Promote driverServiceName to KubernetesDriverConf > - > > Key: SPARK-48038 > URL: https://issues.apache.org/jira/browse/SPARK-48038 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 4.0.0 >Reporter: Cheng Pan >Assignee: Cheng Pan >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-48036) Update `sql-ref-ansi-compliance.md` and `sql-ref-identifier.md`
[ https://issues.apache.org/jira/browse/SPARK-48036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-48036. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 46271 [https://github.com/apache/spark/pull/46271] > Update `sql-ref-ansi-compliance.md` and `sql-ref-identifier.md` > --- > > Key: SPARK-48036 > URL: https://issues.apache.org/jira/browse/SPARK-48036 > Project: Spark > Issue Type: Sub-task > Components: Documentation >Affects Versions: 4.0.0 > Reporter: Dongjoon Hyun > Assignee: Dongjoon Hyun >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-48029) Update the packages name removed in building the spark docker image
[ https://issues.apache.org/jira/browse/SPARK-48029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-48029. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 46258 [https://github.com/apache/spark/pull/46258] > Update the packages name removed in building the spark docker image > --- > > Key: SPARK-48029 > URL: https://issues.apache.org/jira/browse/SPARK-48029 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 4.0.0 >Reporter: BingKun Pan >Assignee: BingKun Pan >Priority: Minor > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-48029) Update the packages name removed in building the spark docker image
[ https://issues.apache.org/jira/browse/SPARK-48029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-48029: - Assignee: BingKun Pan > Update the packages name removed in building the spark docker image > --- > > Key: SPARK-48029 > URL: https://issues.apache.org/jira/browse/SPARK-48029 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 4.0.0 >Reporter: BingKun Pan >Assignee: BingKun Pan >Priority: Minor > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-48036) Update `sql-ref-ansi-compliance.md` and `sql-ref-identifier.md`
Dongjoon Hyun created SPARK-48036: - Summary: Update `sql-ref-ansi-compliance.md` and `sql-ref-identifier.md` Key: SPARK-48036 URL: https://issues.apache.org/jira/browse/SPARK-48036 Project: Spark Issue Type: Sub-task Components: Documentation Affects Versions: 4.0.0 Reporter: Dongjoon Hyun -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-48032) Upgrade `commons-codec` to 1.17.0
[ https://issues.apache.org/jira/browse/SPARK-48032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-48032. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 46268 [https://github.com/apache/spark/pull/46268] > Upgrade `commons-codec` to 1.17.0 > - > > Key: SPARK-48032 > URL: https://issues.apache.org/jira/browse/SPARK-48032 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 4.0.0 >Reporter: BingKun Pan >Assignee: BingKun Pan >Priority: Minor > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-47730) Support APP_ID and EXECUTOR_ID placeholder in labels
[ https://issues.apache.org/jira/browse/SPARK-47730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-47730: -- Parent: SPARK-44111 Issue Type: Sub-task (was: Improvement) > Support APP_ID and EXECUTOR_ID placeholder in labels > > > Key: SPARK-47730 > URL: https://issues.apache.org/jira/browse/SPARK-47730 > Project: Spark > Issue Type: Sub-task > Components: k8s >Affects Versions: 3.5.1 >Reporter: Xi Chen >Assignee: Xi Chen >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-47730) Support APP_ID and EXECUTOR_ID placeholder in labels
[ https://issues.apache.org/jira/browse/SPARK-47730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-47730. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 46149 [https://github.com/apache/spark/pull/46149] > Support APP_ID and EXECUTOR_ID placeholder in labels > > > Key: SPARK-47730 > URL: https://issues.apache.org/jira/browse/SPARK-47730 > Project: Spark > Issue Type: Improvement > Components: k8s >Affects Versions: 3.5.1 >Reporter: Xi Chen >Assignee: Xi Chen >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-47730) Support APP_ID and EXECUTOR_ID placeholder in labels
[ https://issues.apache.org/jira/browse/SPARK-47730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-47730: - Assignee: Xi Chen > Support APP_ID and EXECUTOR_ID placeholder in labels > > > Key: SPARK-47730 > URL: https://issues.apache.org/jira/browse/SPARK-47730 > Project: Spark > Issue Type: Improvement > Components: k8s >Affects Versions: 3.5.1 >Reporter: Xi Chen >Assignee: Xi Chen >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-48021) Add `--add-modules=jdk.incubator.vector` to `JavaModuleOptions`
[ https://issues.apache.org/jira/browse/SPARK-48021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-48021: - Assignee: BingKun Pan > Add `--add-modules=jdk.incubator.vector` to `JavaModuleOptions` > --- > > Key: SPARK-48021 > URL: https://issues.apache.org/jira/browse/SPARK-48021 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 4.0.0 >Reporter: BingKun Pan >Assignee: BingKun Pan >Priority: Minor > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-48021) Add `--add-modules=jdk.incubator.vector` to `JavaModuleOptions`
[ https://issues.apache.org/jira/browse/SPARK-48021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-48021. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 46246 [https://github.com/apache/spark/pull/46246] > Add `--add-modules=jdk.incubator.vector` to `JavaModuleOptions` > --- > > Key: SPARK-48021 > URL: https://issues.apache.org/jira/browse/SPARK-48021 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 4.0.0 >Reporter: BingKun Pan >Assignee: BingKun Pan >Priority: Minor > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-47408) Fix mathExpressions that use StringType
[ https://issues.apache.org/jira/browse/SPARK-47408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-47408. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 46227 [https://github.com/apache/spark/pull/46227] > Fix mathExpressions that use StringType > --- > > Key: SPARK-47408 > URL: https://issues.apache.org/jira/browse/SPARK-47408 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Uroš Bojanić >Assignee: Uroš Bojanić >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-47943) Add Operator CI Task for Java Build and Test
[ https://issues.apache.org/jira/browse/SPARK-47943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-47943: -- Fix Version/s: kubernetes-operator-0.1.0 (was: 4.0.0) > Add Operator CI Task for Java Build and Test > > > Key: SPARK-47943 > URL: https://issues.apache.org/jira/browse/SPARK-47943 > Project: Spark > Issue Type: Sub-task > Components: k8s >Affects Versions: kubernetes-operator-0.1.0 >Reporter: Zhou JIANG >Assignee: Zhou JIANG >Priority: Major > Labels: pull-request-available > Fix For: kubernetes-operator-0.1.0 > > > We need to add CI task to build and test Java code for upcoming operator pull > requests. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-48015) Update `build.gradle` to fix deprecation warnings
[ https://issues.apache.org/jira/browse/SPARK-48015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-48015: - Assignee: Dongjoon Hyun > Update `build.gradle` to fix deprecation warnings > - > > Key: SPARK-48015 > URL: https://issues.apache.org/jira/browse/SPARK-48015 > Project: Spark > Issue Type: Sub-task > Components: Build, Kubernetes >Affects Versions: kubernetes-operator-0.1.0 > Reporter: Dongjoon Hyun > Assignee: Dongjoon Hyun >Priority: Trivial > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-48015) Update `build.gradle` to fix deprecation warnings
[ https://issues.apache.org/jira/browse/SPARK-48015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-48015. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 9 [https://github.com/apache/spark-kubernetes-operator/pull/9] > Update `build.gradle` to fix deprecation warnings > - > > Key: SPARK-48015 > URL: https://issues.apache.org/jira/browse/SPARK-48015 > Project: Spark > Issue Type: Sub-task > Components: Build, Kubernetes >Affects Versions: kubernetes-operator-0.1.0 > Reporter: Dongjoon Hyun > Assignee: Dongjoon Hyun >Priority: Trivial > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-47929) Setup Static Analysis for Operator
[ https://issues.apache.org/jira/browse/SPARK-47929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-47929: -- Fix Version/s: kubernetes-operator-0.1.0 (was: 4.0.0) > Setup Static Analysis for Operator > -- > > Key: SPARK-47929 > URL: https://issues.apache.org/jira/browse/SPARK-47929 > Project: Spark > Issue Type: Sub-task > Components: k8s >Affects Versions: kubernetes-operator-0.1.0 >Reporter: Zhou JIANG >Assignee: Zhou JIANG >Priority: Major > Labels: pull-request-available > Fix For: kubernetes-operator-0.1.0 > > > Add common analysis tasks including checkstyle, spotbugs, jacoco. Also > include spotless for style fix. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-47950) Add Java API Module for Spark Operator
[ https://issues.apache.org/jira/browse/SPARK-47950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-47950: -- Fix Version/s: kubernetes-operator-0.1.0 (was: 4.0.0) > Add Java API Module for Spark Operator > -- > > Key: SPARK-47950 > URL: https://issues.apache.org/jira/browse/SPARK-47950 > Project: Spark > Issue Type: Sub-task > Components: k8s >Affects Versions: kubernetes-operator-0.1.0 >Reporter: Zhou JIANG >Assignee: Zhou JIANG >Priority: Major > Labels: pull-request-available > Fix For: kubernetes-operator-0.1.0 > > > Spark Operator API refers to the > [CustomResourceDefinition|https://kubernetes.io/docs/tasks/extend-kubernetes/custom-resources/custom-resource-definitions/] > __ that represents the spec for Spark Application in k8s. > This aims to add Java API library for Spark Operator, with the ability to > generate yaml spec. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-48015) Update `build.gradle` to fix deprecation warnings
[ https://issues.apache.org/jira/browse/SPARK-48015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-48015: -- Fix Version/s: kubernetes-operator-0.1.0 (was: 4.0.0) > Update `build.gradle` to fix deprecation warnings > - > > Key: SPARK-48015 > URL: https://issues.apache.org/jira/browse/SPARK-48015 > Project: Spark > Issue Type: Sub-task > Components: Build, Kubernetes >Affects Versions: kubernetes-operator-0.1.0 > Reporter: Dongjoon Hyun > Assignee: Dongjoon Hyun >Priority: Trivial > Labels: pull-request-available > Fix For: kubernetes-operator-0.1.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-48015) Update `build.gradle` to fix deprecation warnings
Dongjoon Hyun created SPARK-48015: - Summary: Update `build.gradle` to fix deprecation warnings Key: SPARK-48015 URL: https://issues.apache.org/jira/browse/SPARK-48015 Project: Spark Issue Type: Sub-task Components: Build, Kubernetes Affects Versions: kubernetes-operator-0.1.0 Reporter: Dongjoon Hyun -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-47950) Add Java API Module for Spark Operator
[ https://issues.apache.org/jira/browse/SPARK-47950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-47950: - Assignee: Zhou JIANG > Add Java API Module for Spark Operator > -- > > Key: SPARK-47950 > URL: https://issues.apache.org/jira/browse/SPARK-47950 > Project: Spark > Issue Type: Sub-task > Components: k8s >Affects Versions: kubernetes-operator-0.1.0 >Reporter: Zhou JIANG >Assignee: Zhou JIANG >Priority: Major > Labels: pull-request-available > > The Spark Operator API refers to the > [CustomResourceDefinition|https://kubernetes.io/docs/tasks/extend-kubernetes/custom-resources/custom-resource-definitions/] > that represents the spec for Spark applications in k8s. > This aims to add a Java API library for Spark Operator, with the ability to > generate a YAML spec. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-47950) Add Java API Module for Spark Operator
[ https://issues.apache.org/jira/browse/SPARK-47950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-47950. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 8 [https://github.com/apache/spark-kubernetes-operator/pull/8] > Add Java API Module for Spark Operator > -- > > Key: SPARK-47950 > URL: https://issues.apache.org/jira/browse/SPARK-47950 > Project: Spark > Issue Type: Sub-task > Components: k8s >Affects Versions: kubernetes-operator-0.1.0 >Reporter: Zhou JIANG >Assignee: Zhou JIANG >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > The Spark Operator API refers to the > [CustomResourceDefinition|https://kubernetes.io/docs/tasks/extend-kubernetes/custom-resources/custom-resource-definitions/] > that represents the spec for Spark applications in k8s. > This aims to add a Java API library for Spark Operator, with the ability to > generate a YAML spec. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-48010) Avoid repeated calls to conf.resolver in resolveExpression
[ https://issues.apache.org/jira/browse/SPARK-48010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-48010. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 46248 [https://github.com/apache/spark/pull/46248] > Avoid repeated calls to conf.resolver in resolveExpression > -- > > Key: SPARK-48010 > URL: https://issues.apache.org/jira/browse/SPARK-48010 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.3 >Reporter: Nikhil Sheoran >Assignee: Nikhil Sheoran >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > Consider a view with a large number of columns (~1000s). When resolving this > view, looking at the flamegraph, we observed repeated initializations of `conf` > to obtain the `resolver` for each column of the view. > This can easily be optimized to reuse the same resolver (obtained once) for > the various calls to `innerResolve` in `resolveExpression`. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
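The fix pattern described above can be sketched in plain Python (a hypothetical illustration, not Spark's actual Catalyst code; the `SQLConf` class and column names here are stand-ins): hoist the repeatedly constructed resolver out of the per-column loop.

```python
# Hypothetical plain-Python illustration of the SPARK-48010 fix; `SQLConf`
# is a stand-in that counts how often its `resolver` property is built.
class SQLConf:
    def __init__(self, case_sensitive=False):
        self.case_sensitive = case_sensitive
        self.resolver_inits = 0

    @property
    def resolver(self):
        # Like Spark's `conf.resolver`, each access re-reads the config
        # and builds a fresh comparison function.
        self.resolver_inits += 1
        if self.case_sensitive:
            return lambda a, b: a == b
        return lambda a, b: a.lower() == b.lower()


def resolve_columns_naive(conf, columns, name):
    # One `conf.resolver` access per column: O(n) initializations.
    return [c for c in columns if conf.resolver(c, name)]


def resolve_columns_cached(conf, columns, name):
    # The fix: obtain the resolver once and reuse it for every column.
    resolver = conf.resolver
    return [c for c in columns if resolver(c, name)]


conf = SQLConf()
columns = [f"col{i}" for i in range(1000)]
resolve_columns_naive(conf, columns, "COL42")
naive_inits = conf.resolver_inits      # 1000: once per column

conf = SQLConf()
resolve_columns_cached(conf, columns, "COL42")
cached_inits = conf.resolver_inits     # 1: obtained once
```

For a ~1000-column view this turns ~1000 config reads into one, which is exactly the shape of the regression the flamegraph exposed.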
[jira] [Assigned] (SPARK-46122) Set `spark.sql.legacy.createHiveTableByDefault` to `false` by default
[ https://issues.apache.org/jira/browse/SPARK-46122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-46122: - Assignee: Dongjoon Hyun > Set `spark.sql.legacy.createHiveTableByDefault` to `false` by default > - > > Key: SPARK-46122 > URL: https://issues.apache.org/jira/browse/SPARK-46122 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Yuming Wang > Assignee: Dongjoon Hyun >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-48005) Enable `DefaultIndexParityTests. test_index_distributed_sequence_cleanup`
[ https://issues.apache.org/jira/browse/SPARK-48005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-48005. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 46242 [https://github.com/apache/spark/pull/46242] > Enable `DefaultIndexParityTests. test_index_distributed_sequence_cleanup` > - > > Key: SPARK-48005 > URL: https://issues.apache.org/jira/browse/SPARK-48005 > Project: Spark > Issue Type: Sub-task > Components: Connect, PS >Affects Versions: 4.0.0 >Reporter: Ruifeng Zheng >Assignee: Ruifeng Zheng >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
Re: [VOTE] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false
I'll start with my +1. Dongjoon. On 2024/04/26 16:45:51 Dongjoon Hyun wrote: > Please vote on SPARK-46122 to set spark.sql.legacy.createHiveTableByDefault > to `false` by default. The technical scope is defined in the following PR. > > - DISCUSSION: > https://lists.apache.org/thread/ylk96fg4lvn6klxhj6t6yh42lyqb8wmd > - JIRA: https://issues.apache.org/jira/browse/SPARK-46122 > - PR: https://github.com/apache/spark/pull/46207 > > The vote is open until April 30th 1AM (PST) and passes > if a majority +1 PMC votes are cast, with a minimum of 3 +1 votes. > > [ ] +1 Set spark.sql.legacy.createHiveTableByDefault to false by default > [ ] -1 Do not change spark.sql.legacy.createHiveTableByDefault because ... > > Thank you in advance. > > Dongjoon > - To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
[VOTE] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false
Please vote on SPARK-46122 to set spark.sql.legacy.createHiveTableByDefault to `false` by default. The technical scope is defined in the following PR. - DISCUSSION: https://lists.apache.org/thread/ylk96fg4lvn6klxhj6t6yh42lyqb8wmd - JIRA: https://issues.apache.org/jira/browse/SPARK-46122 - PR: https://github.com/apache/spark/pull/46207 The vote is open until April 30th 1AM (PST) and passes if a majority +1 PMC votes are cast, with a minimum of 3 +1 votes. [ ] +1 Set spark.sql.legacy.createHiveTableByDefault to false by default [ ] -1 Do not change spark.sql.legacy.createHiveTableByDefault because ... Thank you in advance. Dongjoon
Re: [DISCUSS] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false
com/in/mich-talebzadeh-ph-d-5205b2/> >>>>>>>> >>>>>>>> >>>>>>>> https://en.everybodywiki.com/Mich_Talebzadeh >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> *Disclaimer:* The information provided is correct to the best of >>>>>>>> my knowledge but of course cannot be guaranteed . It is essential to >>>>>>>> note >>>>>>>> that, as with any advice, quote "one test result is worth one-thousand >>>>>>>> expert opinions (Werner >>>>>>>> <https://en.wikipedia.org/wiki/Wernher_von_Braun>Von Braun >>>>>>>> <https://en.wikipedia.org/wiki/Wernher_von_Braun>)". >>>>>>>> >>>>>>>> >>>>>>>> On Thu, 25 Apr 2024 at 14:38, Nimrod Ofek >>>>>>>> wrote: >>>>>>>> >>>>>>>>> Thanks for the detailed answer. >>>>>>>>> The thing I'm missing is this: let's say that the output format I >>>>>>>>> choose is delta lake or iceberg or whatever format that uses parquet. >>>>>>>>> Where >>>>>>>>> does the catalog implementation (which holds metadata afaik, same >>>>>>>>> metadata >>>>>>>>> that iceberg and delta lake save for their tables about their columns) >>>>>>>>> comes into play and why should it affect performance? >>>>>>>>> Another thing is that if I understand correctly, and I might be >>>>>>>>> totally wrong here, the internal spark catalog is a local >>>>>>>>> installation of >>>>>>>>> hive metastore anyway, so I'm not sure what the catalog has to do with >>>>>>>>> anything. >>>>>>>>> >>>>>>>>> Thanks! >>>>>>>>> >>>>>>>>> >>>>>>>>> בתאריך יום ה׳, 25 באפר׳ 2024, 16:14, מאת Mich Talebzadeh < >>>>>>>>> mich.talebza...@gmail.com>: >>>>>>>>> >>>>>>>>>> My take regarding your question is that your mileage varies so to >>>>>>>>>> speak. >>>>>>>>>> >>>>>>>>>> 1) Hive provides a more mature and widely adopted catalog >>>>>>>>>> solution that integrates well with other components in the Hadoop >>>>>>>>>> ecosystem, such as HDFS, HBase, and YARN. IIf you are Hadoop centric >>>>>>>>>> S(say >>>>>>>>>> on-premise), using Hive may offer better compatibility and >>>>>>>>>> interoperability. 
>>>>>>>>>> 2) Hive provides a SQL-like interface that is familiar to users >>>>>>>>>> who are accustomed to traditional RDBMs. If your use case involves >>>>>>>>>> complex >>>>>>>>>> SQL queries or existing SQL-based workflows, using Hive may be >>>>>>>>>> advantageous. >>>>>>>>>> 3) If you are looking for performance, spark's native catalog >>>>>>>>>> tends to offer better performance for certain workloads, >>>>>>>>>> particularly those >>>>>>>>>> that involve iterative processing or complex data transformations.(my >>>>>>>>>> understanding). Spark's in-memory processing capabilities and >>>>>>>>>> optimizations >>>>>>>>>> make it well-suited for interactive analytics and machine learning >>>>>>>>>> tasks.(my favourite) >>>>>>>>>> 4) Integration with Spark Workflows: If you primarily use Spark >>>>>>>>>> for data processing and analytics, using Spark's native catalog may >>>>>>>>>> simplify workflow management and reduce overhead, Spark's tight >>>>>>>>>> integration with its catalog allows for seamless interaction with >>>>>>>>>> Spark >>>>>>>>>> applications and libraries. >>>>>>>>>> 5) There seems to be some similari
[jira] [Assigned] (ORC-1705) Upgrade `zstd-jni` to 1.5.6-3
[ https://issues.apache.org/jira/browse/ORC-1705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned ORC-1705: -- Assignee: dzcxzl > Upgrade `zstd-jni` to 1.5.6-3 > - > > Key: ORC-1705 > URL: https://issues.apache.org/jira/browse/ORC-1705 > Project: ORC > Issue Type: Improvement > Components: Java >Reporter: dzcxzl >Assignee: dzcxzl >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (ORC-1705) Upgrade `zstd-jni` to 1.5.6-3
[ https://issues.apache.org/jira/browse/ORC-1705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved ORC-1705. Fix Version/s: 2.0.1 2.1.0 Resolution: Fixed Issue resolved by pull request 1914 [https://github.com/apache/orc/pull/1914] > Upgrade `zstd-jni` to 1.5.6-3 > - > > Key: ORC-1705 > URL: https://issues.apache.org/jira/browse/ORC-1705 > Project: ORC > Issue Type: Improvement > Components: Java >Reporter: dzcxzl >Assignee: dzcxzl >Priority: Major > Fix For: 2.0.1, 2.1.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (SPARK-48007) MsSQLServer: upgrade mssql.jdbc.version to 12.6.1.jre11
[ https://issues.apache.org/jira/browse/SPARK-48007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-48007. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 46244 [https://github.com/apache/spark/pull/46244] > MsSQLServer: upgrade mssql.jdbc.version to 12.6.1.jre11 > --- > > Key: SPARK-48007 > URL: https://issues.apache.org/jira/browse/SPARK-48007 > Project: Spark > Issue Type: Sub-task > Components: Build, Tests >Affects Versions: 4.0.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-47991) Arrange the test cases for window frames and window functions.
[ https://issues.apache.org/jira/browse/SPARK-47991?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-47991. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 46226 [https://github.com/apache/spark/pull/46226] > Arrange the test cases for window frames and window functions. > -- > > Key: SPARK-47991 > URL: https://issues.apache.org/jira/browse/SPARK-47991 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 4.0.0 >Reporter: Jiaan Geng >Assignee: Jiaan Geng >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22231) Support of map, filter, withField, dropFields in nested list of structures
[ https://issues.apache.org/jira/browse/SPARK-22231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17840934#comment-17840934 ] Dongjoon Hyun commented on SPARK-22231: --- I removed the outdated target version from this issue. > Support of map, filter, withField, dropFields in nested list of structures > -- > > Key: SPARK-22231 > URL: https://issues.apache.org/jira/browse/SPARK-22231 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.2.0 >Reporter: DB Tsai >Priority: Major > > At Netflix's algorithm team, we work on ranking problems to find the great > content to fulfill the unique tastes of our members. Before building > recommendation algorithms, we need to prepare the training, testing, and > validation datasets in Apache Spark. Due to the nature of ranking problems, > we have a nested list of items to be ranked in one column, and the top level > is the contexts describing the setting where a model is to be used (e.g. > profiles, country, time, device, etc.) Here is a blog post describing the > details, [Distributed Time Travel for Feature > Generation|https://medium.com/netflix-techblog/distributed-time-travel-for-feature-generation-389cccdd3907]. > > To be more concrete, for the ranks of videos for a given profile_id in a > given country, our data schema can look like this, > {code:java} > root > |-- profile_id: long (nullable = true) > |-- country_iso_code: string (nullable = true) > |-- items: array (nullable = false) > ||-- element: struct (containsNull = false) > |||-- title_id: integer (nullable = true) > |||-- scores: double (nullable = true) > ... > {code} > We oftentimes need to work on the nested list of structs by applying some > functions on them. Sometimes, we're dropping or adding new columns in the > nested list of structs. 
Currently, there is no easy solution in open source > Apache Spark to perform those operations using SQL primitives; many people > just convert the data into RDD to work on the nested level of data, and then > reconstruct the new dataframe as a workaround. This is extremely inefficient > because all the optimizations like predicate pushdown in SQL cannot be > performed, we cannot leverage the columnar format, and the serialization > and deserialization cost becomes really huge even when we just want to add a new > column in the nested level. > We built a solution internally at Netflix which we're very happy with. We > plan to make it open source in Spark upstream. We would like to socialize the > API design to see if we missed any use-case. > The first API we added is *mapItems* on dataframe which takes a function from > *Column* to *Column*, and then applies the function on the nested dataframe. Here > is an example, > {code:java} > case class Data(foo: Int, bar: Double, items: Seq[Double]) > val df: Dataset[Data] = spark.createDataset(Seq( > Data(10, 10.0, Seq(10.1, 10.2, 10.3, 10.4)), > Data(20, 20.0, Seq(20.1, 20.2, 20.3, 20.4)) > )) > val result = df.mapItems("items") { > item => item * 2.0 > } > result.printSchema() > // root > // |-- foo: integer (nullable = false) > // |-- bar: double (nullable = false) > // |-- items: array (nullable = true) > // ||-- element: double (containsNull = true) > result.show() > // +---+++ > // |foo| bar| items| > // +---+++ > // | 10|10.0|[20.2, 20.4, 20.6...| > // | 20|20.0|[40.2, 40.4, 40.6...| > // +---+++ > {code} > Now, with the ability to apply a function in the nested dataframe, we can > add a new function, *withColumn* in *Column* to add or replace the existing > column that has the same name in the nested list of structs. 
Here are two > examples demonstrating the API together with *mapItems*; the first one > replaces the existing column, > {code:java} > case class Item(a: Int, b: Double) > case class Data(foo: Int, bar: Double, items: Seq[Item]) > val df: Dataset[Data] = spark.createDataset(Seq( > Data(10, 10.0, Seq(Item(10, 10.0), Item(11, 11.0))), > Data(20, 20.0, Seq(Item(20, 20.0), Item(21, 21.0))) > )) > val result = df.mapItems("items") { > item => item.withColumn(item("b") + 1 as "b") > } > result.printSchema > root > // |-- foo: integer (nullable = false) > // |-- bar: double (nullable = false) > // |-- items: array (nullable = true) > // ||-- element: struct (
[jira] [Updated] (SPARK-22231) Support of map, filter, withField, dropFields in nested list of structures
[ https://issues.apache.org/jira/browse/SPARK-22231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-22231: -- Target Version/s: (was: 3.2.0) > Support of map, filter, withField, dropFields in nested list of structures > -- > > Key: SPARK-22231 > URL: https://issues.apache.org/jira/browse/SPARK-22231 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.2.0 >Reporter: DB Tsai >Priority: Major > > At Netflix's algorithm team, we work on ranking problems to find the great > content to fulfill the unique tastes of our members. Before building > recommendation algorithms, we need to prepare the training, testing, and > validation datasets in Apache Spark. Due to the nature of ranking problems, > we have a nested list of items to be ranked in one column, and the top level > is the contexts describing the setting where a model is to be used (e.g. > profiles, country, time, device, etc.) Here is a blog post describing the > details, [Distributed Time Travel for Feature > Generation|https://medium.com/netflix-techblog/distributed-time-travel-for-feature-generation-389cccdd3907]. > > To be more concrete, for the ranks of videos for a given profile_id in a > given country, our data schema can look like this, > {code:java} > root > |-- profile_id: long (nullable = true) > |-- country_iso_code: string (nullable = true) > |-- items: array (nullable = false) > ||-- element: struct (containsNull = false) > |||-- title_id: integer (nullable = true) > |||-- scores: double (nullable = true) > ... > {code} > We oftentimes need to work on the nested list of structs by applying some > functions on them. Sometimes, we're dropping or adding new columns in the > nested list of structs. 
Currently, there is no easy solution in open source > Apache Spark to perform those operations using SQL primitives; many people > just convert the data into RDD to work on the nested level of data, and then > reconstruct the new dataframe as a workaround. This is extremely inefficient > because all the optimizations like predicate pushdown in SQL cannot be > performed, we cannot leverage the columnar format, and the serialization > and deserialization cost becomes really huge even when we just want to add a new > column in the nested level. > We built a solution internally at Netflix which we're very happy with. We > plan to make it open source in Spark upstream. We would like to socialize the > API design to see if we missed any use-case. > The first API we added is *mapItems* on dataframe which takes a function from > *Column* to *Column*, and then applies the function on the nested dataframe. Here > is an example, > {code:java} > case class Data(foo: Int, bar: Double, items: Seq[Double]) > val df: Dataset[Data] = spark.createDataset(Seq( > Data(10, 10.0, Seq(10.1, 10.2, 10.3, 10.4)), > Data(20, 20.0, Seq(20.1, 20.2, 20.3, 20.4)) > )) > val result = df.mapItems("items") { > item => item * 2.0 > } > result.printSchema() > // root > // |-- foo: integer (nullable = false) > // |-- bar: double (nullable = false) > // |-- items: array (nullable = true) > // ||-- element: double (containsNull = true) > result.show() > // +---+++ > // |foo| bar| items| > // +---+++ > // | 10|10.0|[20.2, 20.4, 20.6...| > // | 20|20.0|[40.2, 40.4, 40.6...| > // +---+++ > {code} > Now, with the ability to apply a function in the nested dataframe, we can > add a new function, *withColumn* in *Column* to add or replace the existing > column that has the same name in the nested list of structs. 
Here are two > examples demonstrating the API together with *mapItems*; the first one > replaces the existing column, > {code:java} > case class Item(a: Int, b: Double) > case class Data(foo: Int, bar: Double, items: Seq[Item]) > val df: Dataset[Data] = spark.createDataset(Seq( > Data(10, 10.0, Seq(Item(10, 10.0), Item(11, 11.0))), > Data(20, 20.0, Seq(Item(20, 20.0), Item(21, 21.0))) > )) > val result = df.mapItems("items") { > item => item.withColumn(item("b") + 1 as "b") > } > result.printSchema > root > // |-- foo: integer (nullable = false) > // |-- bar: double (nullable = false) > // |-- items: array (nullable = true) > // ||-- element: struct (containsNull = true) > // |||-- a: integer (nullable = tr
[jira] [Updated] (SPARK-24941) Add RDDBarrier.coalesce() function
[ https://issues.apache.org/jira/browse/SPARK-24941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-24941: -- Target Version/s: (was: 3.2.0) > Add RDDBarrier.coalesce() function > -- > > Key: SPARK-24941 > URL: https://issues.apache.org/jira/browse/SPARK-24941 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.1.0 >Reporter: Xingbo Jiang >Priority: Major > > https://github.com/apache/spark/pull/21758#discussion_r204917245 > The number of partitions from the input data can be unexpectedly large, eg. > if you do > {code} > sc.textFile(...).barrier().mapPartitions() > {code} > The number of input partitions is based on the hdfs input splits. We shall > provide a way in RDDBarrier to enable users to specify the number of tasks in > a barrier stage. Maybe something like RDDBarrier.coalesce(numPartitions: Int) > . -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
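What the proposed `RDDBarrier.coalesce(numPartitions)` would need to do can be pictured with a small hypothetical sketch (plain Python, not Spark code; a real coalescer would also weigh data locality): merge the HDFS-split-derived input partitions down to the requested number of barrier tasks without losing records.

```python
def coalesce_partitions(partitions, num_partitions):
    """Merge `partitions` (lists of records) into at most `num_partitions`
    contiguous groups, preserving every record and its order."""
    if num_partitions <= 0:
        raise ValueError("num_partitions must be positive")
    num_out = min(num_partitions, len(partitions))
    merged = [[] for _ in range(num_out)]
    for i, part in enumerate(partitions):
        # Contiguous assignment: input partition i goes to output bucket
        # floor(i * num_out / len(partitions)).
        merged[i * num_out // len(partitions)].extend(part)
    return merged
```

For example, ten input splits coalesced to three barrier tasks yields groups of sizes 4, 3, and 3, so the barrier stage runs exactly three tasks regardless of how many HDFS splits the input produced.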
[jira] [Updated] (SPARK-25383) Image data source supports sample pushdown
[ https://issues.apache.org/jira/browse/SPARK-25383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-25383: -- Target Version/s: (was: 3.2.0) > Image data source supports sample pushdown > -- > > Key: SPARK-25383 > URL: https://issues.apache.org/jira/browse/SPARK-25383 > Project: Spark > Issue Type: New Feature > Components: ML, SQL >Affects Versions: 3.1.0 >Reporter: Xiangrui Meng >Priority: Major > > After SPARK-25349, we should update image data source to support sampling. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25752) Add trait to easily whitelist logical operators that produce named output from CleanupAliases
[ https://issues.apache.org/jira/browse/SPARK-25752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-25752: -- Target Version/s: (was: 3.2.0) > Add trait to easily whitelist logical operators that produce named output > from CleanupAliases > - > > Key: SPARK-25752 > URL: https://issues.apache.org/jira/browse/SPARK-25752 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Tathagata Das >Assignee: Tathagata Das >Priority: Minor > > The rule `CleanupAliases` cleans up aliases from logical operators that do > not match a whitelist. This whitelist is hardcoded inside the rule which is > cumbersome. This PR is to clean that up by making a trait `HasNamedOutput` > that will be ignored by `CleanupAliases` and other ops that require aliases > to be preserved in the operator should extend it. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28629) Capture the missing rules in HiveSessionStateBuilder
[ https://issues.apache.org/jira/browse/SPARK-28629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17840928#comment-17840928 ] Dongjoon Hyun commented on SPARK-28629: --- I removed the outdated target version from this issue. > Capture the missing rules in HiveSessionStateBuilder > > > Key: SPARK-28629 > URL: https://issues.apache.org/jira/browse/SPARK-28629 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Xiao Li >Priority: Major > > A general mistake for new contributors is to forget adding the corresponding > rules into extendedResolutionRules, postHocResolutionRules, and > extendedCheckRules in HiveSessionStateBuilder. We need a way to avoid missing these > rules, or to detect when they are missing. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27780) Shuffle server & client should be versioned to enable smoother upgrade
[ https://issues.apache.org/jira/browse/SPARK-27780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17840930#comment-17840930 ] Dongjoon Hyun commented on SPARK-27780: --- I removed the outdated target version from this issue. > Shuffle server & client should be versioned to enable smoother upgrade > -- > > Key: SPARK-27780 > URL: https://issues.apache.org/jira/browse/SPARK-27780 > Project: Spark > Issue Type: New Feature > Components: Shuffle, Spark Core >Affects Versions: 3.1.0 >Reporter: Imran Rashid >Priority: Major > > The external shuffle service is often upgraded at a different time than spark > itself. However, this causes problems when the protocol changes between the > shuffle service and the spark runtime -- this forces users to upgrade > everything simultaneously. > We should add versioning to the shuffle client & server, so they know what > messages the other will support. This would allow better handling of mixed > versions, from better error msgs to allowing some mismatched versions (with > reduced capabilities). > This originally came up in a discussion here: > https://github.com/apache/spark/pull/24565#issuecomment-493496466 > There are a few ways we could do the versioning which we still need to > discuss: > 1) Version specified by config. This allows for mixed versions across the > cluster and rolling upgrades. It also will let a spark 3.0 client talk to a > 2.4 shuffle service. But, may be a nuisance for users to get this right. > 2) Auto-detection during registration with local shuffle service. This makes > the versioning easy for the end user, and can even handle a 2.4 shuffle > service though it does not support the new versioning. However, it will not > handle a rolling upgrade correctly -- if the local shuffle service has been > upgraded, but other nodes in the cluster have not, it will get the version > wrong. > 3) Exchange versions per-connection. 
When a connection is opened, the server > & client could first exchange messages with their versions, so they know how > to continue communication after that. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
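Option 3 above (exchanging versions when the connection opens) can be sketched as a simple negotiation. This is a hypothetical plain-Python illustration of the idea, not the shuffle service's actual wire protocol:

```python
def negotiate_version(client_versions, server_versions):
    """Option 3 sketch: each side sends the protocol versions it supports
    when the connection is opened; both then use the highest version they
    have in common."""
    common = set(client_versions) & set(server_versions)
    if not common:
        # Failing with a clear message is the "better error msgs" benefit
        # the issue describes for mismatched deployments.
        raise ConnectionError("no common shuffle protocol version")
    return max(common)
```

Under this scheme a rolling upgrade works naturally: an upgraded client supporting versions {1, 2, 3} talking to a not-yet-upgraded server supporting {1, 2} simply negotiates down to 2, with reduced capabilities.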
[jira] [Updated] (SPARK-28629) Capture the missing rules in HiveSessionStateBuilder
[ https://issues.apache.org/jira/browse/SPARK-28629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-28629: -- Target Version/s: (was: 3.2.0) > Capture the missing rules in HiveSessionStateBuilder > > > Key: SPARK-28629 > URL: https://issues.apache.org/jira/browse/SPARK-28629 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Xiao Li >Priority: Major > > A general mistake for new contributors is to forget adding the corresponding > rules into extendedResolutionRules, postHocResolutionRules, and > extendedCheckRules in HiveSessionStateBuilder. We need a way to avoid missing these > rules, or to detect when they are missing. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27780) Shuffle server & client should be versioned to enable smoother upgrade
[ https://issues.apache.org/jira/browse/SPARK-27780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-27780: -- Target Version/s: (was: 3.2.0) > Shuffle server & client should be versioned to enable smoother upgrade > -- > > Key: SPARK-27780 > URL: https://issues.apache.org/jira/browse/SPARK-27780 > Project: Spark > Issue Type: New Feature > Components: Shuffle, Spark Core >Affects Versions: 3.1.0 >Reporter: Imran Rashid >Priority: Major > > The external shuffle service is often upgraded at a different time than spark > itself. However, this causes problems when the protocol changes between the > shuffle service and the spark runtime -- this forces users to upgrade > everything simultaneously. > We should add versioning to the shuffle client & server, so they know what > messages the other will support. This would allow better handling of mixed > versions, from better error msgs to allowing some mismatched versions (with > reduced capabilities). > This originally came up in a discussion here: > https://github.com/apache/spark/pull/24565#issuecomment-493496466 > There are a few ways we could do the versioning which we still need to > discuss: > 1) Version specified by config. This allows for mixed versions across the > cluster and rolling upgrades. It also will let a spark 3.0 client talk to a > 2.4 shuffle service. But, may be a nuisance for users to get this right. > 2) Auto-detection during registration with local shuffle service. This makes > the versioning easy for the end user, and can even handle a 2.4 shuffle > service though it does not support the new versioning. However, it will not > handle a rolling upgrade correctly -- if the local shuffle service has been > upgraded, but other nodes in the cluster have not, it will get the version > wrong. > 3) Exchange versions per-connection. 
When a connection is opened, the server > & client could first exchange messages with their versions, so they know how > to continue communication after that. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
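The third option above (a per-connection version exchange) can be sketched as a simple handshake. This is an illustrative model only, not Spark's actual shuffle protocol; the function name and version lists are hypothetical.

```python
def negotiate_version(client_versions, server_versions):
    """Pick the highest protocol version both sides support.

    Hypothetical sketch of option 3: when a connection is opened,
    client and server exchange their supported version lists, then
    both continue with the highest common version, or fail with a
    clear error message instead of silently miscommunicating.
    """
    common = set(client_versions) & set(server_versions)
    if not common:
        raise RuntimeError(
            f"No common shuffle protocol version: "
            f"client={client_versions}, server={server_versions}")
    return max(common)
```

Under this model, a newer client talking to an older shuffle service would fall back to the highest version both understand, which is what makes rolling upgrades and mismatched versions (with reduced capabilities) tractable.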
[jira] [Commented] (SPARK-30324) Simplify API for JSON access in DataFrames/SQL
[ https://issues.apache.org/jira/browse/SPARK-30324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17840927#comment-17840927 ] Dongjoon Hyun commented on SPARK-30324: --- I removed the outdated target version from this issue. > Simplify API for JSON access in DataFrames/SQL > -- > > Key: SPARK-30324 > URL: https://issues.apache.org/jira/browse/SPARK-30324 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.4.4 >Reporter: Burak Yavuz >Priority: Major > > get_json_object() is a UDF to parse JSON fields. It is verbose and hard to > use, e.g. I wasn't expecting the path to a field to have to start with "$.". > We can simplify all of this when a column is of StringType, and a nested > field is requested. This API sugar will in the query planner be rewritten as > get_json_object. > This nested access can then be extended in the future to other > semi-structured formats. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30324) Simplify API for JSON access in DataFrames/SQL
[ https://issues.apache.org/jira/browse/SPARK-30324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-30324: -- Target Version/s: (was: 3.2.0) > Simplify API for JSON access in DataFrames/SQL > -- > > Key: SPARK-30324 > URL: https://issues.apache.org/jira/browse/SPARK-30324 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.4.4 >Reporter: Burak Yavuz >Priority: Major > > get_json_object() is a UDF to parse JSON fields. It is verbose and hard to > use, e.g. I wasn't expecting the path to a field to have to start with "$.". > We can simplify all of this when a column is of StringType, and a nested > field is requested. This API sugar will in the query planner be rewritten as > get_json_object. > This nested access can then be extended in the future to other > semi-structured formats. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
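As a rough illustration of what the proposed sugar would desugar to, here is a minimal Python model of `get_json_object` semantics for simple `$.field` paths. This is an assumption-laden sketch, not Spark's implementation (which accepts richer JSONPath expressions).

```python
import json

def get_json_object(col_value, path):
    """Minimal model of get_json_object for paths like "$.a.b".

    Sketch only: walks dot-separated object fields and returns None
    when a field is absent; the "$." prefix requirement mirrors the
    verbosity complaint in the issue description.
    """
    if not path.startswith("$."):
        raise ValueError("path must start with '$.'")
    obj = json.loads(col_value)
    for key in path[2:].split("."):
        if not isinstance(obj, dict) or key not in obj:
            return None
        obj = obj[key]
    return obj
```

The proposal is that `raw.field` on a StringType column could be rewritten by the query planner into `get_json_object(raw, "$.field")`, so users never write the `$.`-prefixed path themselves.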
[jira] [Updated] (SPARK-30334) Add metadata around semi-structured columns to Spark
[ https://issues.apache.org/jira/browse/SPARK-30334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-30334: -- Target Version/s: (was: 3.2.0) > Add metadata around semi-structured columns to Spark > > > Key: SPARK-30334 > URL: https://issues.apache.org/jira/browse/SPARK-30334 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.4.4 >Reporter: Burak Yavuz >Priority: Major > > Semi-structured data is used widely in the data industry for reporting events > in a wide variety of formats. Click events in product analytics can be stored > as json. Some application logs can be in the form of delimited key=value > text. Some data may be in xml. > The goal of this project is to be able to signal Spark that such a column > exists. This will then enable Spark to "auto-parse" these columns on the fly. > The proposal is to store this information as part of the column metadata, in > the fields: > - format: The format of the semi-structured column, e.g. json, xml, avro > - options: Options for parsing these columns > Then imagine having the following data: > {code:java} > ++---++ > | ts | event |raw | > ++---++ > | 2019-10-12 | click | {"field":"value"} | > ++---++ {code} > SELECT raw.field FROM data > will return "value" > or the following data > {code:java} > ++---+--+ > | ts | event | raw | > ++---+--+ > | 2019-10-12 | click | field1=v1|field2=v2 | > ++---+--+ {code} > SELECT raw.field1 FROM data > will return v1. > > As a first step, we will introduce the function "as_json", which accomplishes > this for JSON columns. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30334) Add metadata around semi-structured columns to Spark
[ https://issues.apache.org/jira/browse/SPARK-30334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17840926#comment-17840926 ] Dongjoon Hyun commented on SPARK-30334: --- I removed the outdated target version from this issue. > Add metadata around semi-structured columns to Spark > > > Key: SPARK-30334 > URL: https://issues.apache.org/jira/browse/SPARK-30334 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.4.4 >Reporter: Burak Yavuz >Priority: Major > > Semi-structured data is used widely in the data industry for reporting events > in a wide variety of formats. Click events in product analytics can be stored > as json. Some application logs can be in the form of delimited key=value > text. Some data may be in xml. > The goal of this project is to be able to signal Spark that such a column > exists. This will then enable Spark to "auto-parse" these columns on the fly. > The proposal is to store this information as part of the column metadata, in > the fields: > - format: The format of the semi-structured column, e.g. json, xml, avro > - options: Options for parsing these columns > Then imagine having the following data: > {code:java} > ++---++ > | ts | event |raw | > ++---++ > | 2019-10-12 | click | {"field":"value"} | > ++---++ {code} > SELECT raw.field FROM data > will return "value" > or the following data > {code:java} > ++---+--+ > | ts | event | raw | > ++---+--+ > | 2019-10-12 | click | field1=v1|field2=v2 | > ++---+--+ {code} > SELECT raw.field1 FROM data > will return v1. > > As a first step, we will introduce the function "as_json", which accomplishes > this for JSON columns. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
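For the second example above (delimited key=value text), the intended "auto-parse" driven by column metadata could be modeled as follows. The `format`/`options` field names come from the proposal itself, but the parser is a hypothetical sketch, not an actual Spark API.

```python
def parse_semi_structured(raw, fmt, options=None):
    """Sketch: parse one semi-structured cell per its column metadata.

    Only the delimited key=value format from the example is modeled;
    a real implementation would also dispatch on json/xml/avro, as
    listed in the proposed `format` metadata field.
    """
    options = options or {}
    if fmt == "delimited":
        sep = options.get("delimiter", "|")
        return dict(pair.split("=", 1) for pair in raw.split(sep))
    raise NotImplementedError(f"format not modeled: {fmt}")

row = parse_semi_structured("field1=v1|field2=v2", "delimited")
```

With this model, `SELECT raw.field1 FROM data` would resolve to `row["field1"]`, i.e. "v1", matching the example in the issue.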
[jira] [Commented] (SPARK-24942) Improve cluster resource management with jobs containing barrier stage
[ https://issues.apache.org/jira/browse/SPARK-24942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17840913#comment-17840913 ] Dongjoon Hyun commented on SPARK-24942: --- I removed the outdated target version, `3.2.0`, from this Jira. For now, Apache Spark community has no target version for this issue. > Improve cluster resource management with jobs containing barrier stage > -- > > Key: SPARK-24942 > URL: https://issues.apache.org/jira/browse/SPARK-24942 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.1.0 >Reporter: Xingbo Jiang >Priority: Major > > https://github.com/apache/spark/pull/21758#discussion_r205652317 > We shall improve cluster resource management to address the following issues: > - With dynamic resource allocation enabled, it may happen that we acquire > some executors (but not enough to launch all the tasks in a barrier stage) > and later release them due to executor idle time expire, and then acquire > again. > - There can be deadlock with two concurrent applications. Each application > may acquire some resources, but not enough to launch all the tasks in a > barrier stage. And after hitting the idle timeout and releasing them, they > may acquire resources again, but just continually trade resources between > each other. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24942) Improve cluster resource management with jobs containing barrier stage
[ https://issues.apache.org/jira/browse/SPARK-24942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-24942: -- Target Version/s: (was: 3.2.0) > Improve cluster resource management with jobs containing barrier stage > -- > > Key: SPARK-24942 > URL: https://issues.apache.org/jira/browse/SPARK-24942 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.1.0 >Reporter: Xingbo Jiang >Priority: Major > > https://github.com/apache/spark/pull/21758#discussion_r205652317 > We shall improve cluster resource management to address the following issues: > - With dynamic resource allocation enabled, it may happen that we acquire > some executors (but not enough to launch all the tasks in a barrier stage) > and later release them due to executor idle time expire, and then acquire > again. > - There can be deadlock with two concurrent applications. Each application > may acquire some resources, but not enough to launch all the tasks in a > barrier stage. And after hitting the idle timeout and releasing them, they > may acquire resources again, but just continually trade resources between > each other. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-44111) Prepare Apache Spark 4.0.0
[ https://issues.apache.org/jira/browse/SPARK-44111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17840853#comment-17840853 ] Dongjoon Hyun commented on SPARK-44111: --- Yes, we will provide `4.0.0-preview` in advance, [~fbiville] . Here is the discussion thread on Apache Spark dev mailing list. * [https://lists.apache.org/thread/nxmvz2j7kp96otzlnl3kd277knlb6qgb] [~cloud_fan] is the release manager who is leading Apache Spark 4.0.0 release (including preview). > Prepare Apache Spark 4.0.0 > -- > > Key: SPARK-44111 > URL: https://issues.apache.org/jira/browse/SPARK-44111 > Project: Spark > Issue Type: Umbrella > Components: Build >Affects Versions: 4.0.0 > Reporter: Dongjoon Hyun >Priority: Critical > Labels: pull-request-available > > For now, this issue aims to collect ideas for planning Apache Spark 4.0.0. > We will add more items which will be excluded from Apache Spark 3.5.0 > (Feature Freeze: July 16th, 2023). > {code} > Spark 1: 2014.05 (1.0.0) ~ 2016.11 (1.6.3) > Spark 2: 2016.07 (2.0.0) ~ 2021.05 (2.4.8) > Spark 3: 2020.06 (3.0.0) ~ 2026.xx (3.5.x) > Spark 4: 2024.06 (4.0.0, NEW) > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[FYI] SPARK-47993: Drop Python 3.8
FYI, there is a proposal to drop Python 3.8 because its EOL is October 2024. https://github.com/apache/spark/pull/46228 [SPARK-47993][PYTHON] Drop Python 3.8 Since Python 3.8 is still alive and there will be an overlap between its lifecycle and Apache Spark 4.0.0, please give us your feedback on the PR if you have any concerns. From my side, I agree with this decision. Thanks, Dongjoon.
[jira] [Resolved] (SPARK-47987) Enable `ArrowParityTests.test_createDataFrame_empty_partition`
[ https://issues.apache.org/jira/browse/SPARK-47987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-47987. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 46220 [https://github.com/apache/spark/pull/46220] > Enable `ArrowParityTests.test_createDataFrame_empty_partition` > -- > > Key: SPARK-47987 > URL: https://issues.apache.org/jira/browse/SPARK-47987 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark, Tests >Affects Versions: 4.0.0 >Reporter: Ruifeng Zheng >Assignee: Ruifeng Zheng >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-47990) Upgrade `zstd-jni` to 1.5.6-3
[ https://issues.apache.org/jira/browse/SPARK-47990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-47990. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 46225 [https://github.com/apache/spark/pull/46225] > Upgrade `zstd-jni` to 1.5.6-3 > - > > Key: SPARK-47990 > URL: https://issues.apache.org/jira/browse/SPARK-47990 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 4.0.0 >Reporter: BingKun Pan >Assignee: BingKun Pan >Priority: Minor > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-46122) Set `spark.sql.legacy.createHiveTableByDefault` to `false` by default
[ https://issues.apache.org/jira/browse/SPARK-46122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17840644#comment-17840644 ] Dongjoon Hyun commented on SPARK-46122: --- I sent the discussion thread for this issue. - [https://lists.apache.org/thread/ylk96fg4lvn6klxhj6t6yh42lyqb8wmd] > Set `spark.sql.legacy.createHiveTableByDefault` to `false` by default > - > > Key: SPARK-46122 > URL: https://issues.apache.org/jira/browse/SPARK-46122 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Yuming Wang >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (ORC-1704) Migration to Scala 2.13 of Apache Spark 3.5.1 at SparkBenchmark
[ https://issues.apache.org/jira/browse/ORC-1704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved ORC-1704. Fix Version/s: 2.0.1 2.1.0 Resolution: Fixed Issue resolved by pull request 1912 [https://github.com/apache/orc/pull/1912] > Migration to Scala 2.13 of Apache Spark 3.5.1 at SparkBenchmark > --- > > Key: ORC-1704 > URL: https://issues.apache.org/jira/browse/ORC-1704 > Project: ORC > Issue Type: Improvement >Reporter: dzcxzl >Assignee: dzcxzl >Priority: Minor > Fix For: 2.0.1, 2.1.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (ORC-1704) Migration to Scala 2.13 of Apache Spark 3.5.1 at SparkBenchmark
[ https://issues.apache.org/jira/browse/ORC-1704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned ORC-1704: -- Assignee: dzcxzl > Migration to Scala 2.13 of Apache Spark 3.5.1 at SparkBenchmark > --- > > Key: ORC-1704 > URL: https://issues.apache.org/jira/browse/ORC-1704 > Project: ORC > Issue Type: Improvement >Reporter: dzcxzl >Assignee: dzcxzl >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[DISCUSS] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false
Hi, All. It's great to see community activities to polish 4.0.0 more and more. Thank you all. I'd like to bring SPARK-46122 (another SQL topic) to you from the subtasks of SPARK-44111 (Prepare Apache Spark 4.0.0).
- https://issues.apache.org/jira/browse/SPARK-46122 Set `spark.sql.legacy.createHiveTableByDefault` to `false` by default
This legacy configuration governs `CREATE TABLE` SQL syntax without `USING` and `STORED AS`, which is currently mapped to a `Hive` table. The proposal of SPARK-46122 is to switch the default value of this configuration from `true` to `false` so that Spark native tables are used, because Spark supports them better. In other words, Spark will use the value of `spark.sql.sources.default` as the table provider instead of `Hive`, like the other Spark APIs. Of course, users can get all the legacy behavior back by setting it to `true`.
Historically, this behavior change was merged once during Apache Spark 3.0.0 preparation via SPARK-30098, but reverted during the 3.0.0 RC period.
2019-12-06: SPARK-30098 Use default datasource as provider for CREATE TABLE
2020-05-16: SPARK-31707 Revert SPARK-30098 Use default datasource as provider for CREATE TABLE command
At Apache Spark 3.1.0, we had another discussion about this and defined it as a legacy behavior controlled by this configuration, reusing the ID SPARK-30098.
2020-12-01: https://lists.apache.org/thread/8c8k1jk61pzlcosz3mxo4rkj5l23r204
2020-12-03: SPARK-30098 Add a configuration to use default datasource as provider for CREATE TABLE command
Last year, we received two additional requests to switch this, because Apache Spark 4.0.0 is a good time to make a decision for the future direction.
2023-02-27: SPARK-42603 as an independent idea.
2023-11-27: SPARK-46122 as a part of the Apache Spark 4.0.0 idea
WDYT? The technical scope is defined in the following PR, which is one line of main code, one line of migration guide, and a few lines of test code.
- https://github.com/apache/spark/pull/46207
Dongjoon.
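The behavioral flip being discussed can be summarized in a simplified model of provider resolution. All names below are illustrative, not Spark's internal API; they only encode the rule described in the proposal.

```python
def resolve_provider(has_using_or_stored_as,
                     create_hive_table_by_default,
                     default_source="parquet"):
    """Which table provider a CREATE TABLE statement gets.

    Simplified model: when the statement names no provider (no USING,
    no STORED AS), the legacy flag decides between Hive and the value
    of spark.sql.sources.default (parquet by default).
    """
    if has_using_or_stored_as:
        return "as specified in the statement"
    return "hive" if create_hive_table_by_default else default_source

# Spark 3.x default (flag true):   resolve_provider(False, True)   -> "hive"
# Proposed 4.0 default (false):    resolve_provider(False, False)  -> "parquet"
```

Users who want the old behavior would simply set `spark.sql.legacy.createHiveTableByDefault=true`, which in this model restores the `"hive"` branch.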
[jira] [Updated] (SPARK-46122) Set `spark.sql.legacy.createHiveTableByDefault` to `false` by default
[ https://issues.apache.org/jira/browse/SPARK-46122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-46122: -- Summary: Set `spark.sql.legacy.createHiveTableByDefault` to `false` by default (was: Set `spark.sql.legacy.createHiveTableByDefault` to false by default) > Set `spark.sql.legacy.createHiveTableByDefault` to `false` by default > - > > Key: SPARK-46122 > URL: https://issues.apache.org/jira/browse/SPARK-46122 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Yuming Wang >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-46122) Set `spark.sql.legacy.createHiveTableByDefault` to false by default
[ https://issues.apache.org/jira/browse/SPARK-46122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-46122: -- Summary: Set `spark.sql.legacy.createHiveTableByDefault` to false by default (was: Disable spark.sql.legacy.createHiveTableByDefault by default) > Set `spark.sql.legacy.createHiveTableByDefault` to false by default > --- > > Key: SPARK-46122 > URL: https://issues.apache.org/jira/browse/SPARK-46122 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Yuming Wang >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-47979) Use Hive tables explicitly for Hive table capability tests
[ https://issues.apache.org/jira/browse/SPARK-47979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-47979. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 46211 [https://github.com/apache/spark/pull/46211] > Use Hive tables explicitly for Hive table capability tests > -- > > Key: SPARK-47979 > URL: https://issues.apache.org/jira/browse/SPARK-47979 > Project: Spark > Issue Type: Sub-task > Components: SQL, Tests >Affects Versions: 4.0.0 > Reporter: Dongjoon Hyun > Assignee: Dongjoon Hyun >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-47979) Use Hive table explicitly for Hive table capability tests
Dongjoon Hyun created SPARK-47979: - Summary: Use Hive table explicitly for Hive table capability tests Key: SPARK-47979 URL: https://issues.apache.org/jira/browse/SPARK-47979 Project: Spark Issue Type: Test Components: SQL, Tests Affects Versions: 4.0.0 Reporter: Dongjoon Hyun -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-47979) Use Hive tables explicitly for Hive table capability tests
[ https://issues.apache.org/jira/browse/SPARK-47979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-47979: -- Summary: Use Hive tables explicitly for Hive table capability tests (was: Use Hive table explicitly for Hive table capability tests) > Use Hive tables explicitly for Hive table capability tests > -- > > Key: SPARK-47979 > URL: https://issues.apache.org/jira/browse/SPARK-47979 > Project: Spark > Issue Type: Test > Components: SQL, Tests >Affects Versions: 4.0.0 > Reporter: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-45265) Support Hive 4.0 metastore
[ https://issues.apache.org/jira/browse/SPARK-45265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-45265: - Assignee: (was: Attila Zsolt Piros) > Support Hive 4.0 metastore > -- > > Key: SPARK-45265 > URL: https://issues.apache.org/jira/browse/SPARK-45265 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Attila Zsolt Piros >Priority: Major > Labels: pull-request-available > > Although Hive 4.0.0 is still beta, I would like to work on this as Hive 4.0.0 > will support the pushdown of partition column filters with > VARCHAR/CHAR types. > For details please see HIVE-26661: Support partition filter for char and > varchar types on Hive metastore -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44677) Drop legacy Hive-based ORC file format
[ https://issues.apache.org/jira/browse/SPARK-44677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-44677: -- Parent: (was: SPARK-44111) Issue Type: Task (was: Sub-task) > Drop legacy Hive-based ORC file format > -- > > Key: SPARK-44677 > URL: https://issues.apache.org/jira/browse/SPARK-44677 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Cheng Pan >Priority: Major > > Currently, Spark allows using spark.sql.orc.impl=native/hive to switch the > ORC FileFormat implementation. > SPARK-23456 (2.4) switched the default value of spark.sql.orc.impl from "hive" > to "native", and prepared to drop the "hive" implementation in the future. > > ... eventually, Apache Spark will drop old Hive-based ORC code. > The native implementation has worked well throughout the Spark 3.x period, so > it's a good time to consider dropping the "hive" one in Spark 4.0. > Also, we should take care of backward compatibility during the change. > > BTW, IIRC, there was a difference in the Hive ORC CHAR implementation before. > > So, we couldn't remove it for backward-compatibility reasons. Since Spark > > implements many CHAR features, we need to re-verify that {{native}} > > implementation has all legacy Hive-based ORC features -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org