[jira] [Updated] (SPARK-42596) [YARN] OMP_NUM_THREADS not set to number of executor cores by default
[ https://issues.apache.org/jira/browse/SPARK-42596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

John Zhuge updated SPARK-42596:
-------------------------------
Description:

Run this PySpark script with `spark.executor.cores=1`:

{code:python}
import os
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf

spark = SparkSession.builder.getOrCreate()
var_name = 'OMP_NUM_THREADS'

def get_env_var():
    return os.getenv(var_name)

udf_get_env_var = udf(get_env_var)
spark.range(1).toDF("id").withColumn(f"env_{var_name}", udf_get_env_var()).show(truncate=False)
{code}

Output with release `3.3.2`:

{noformat}
+---+-------------------+
|id |env_OMP_NUM_THREADS|
+---+-------------------+
|0  |null               |
+---+-------------------+
{noformat}

Output with release `3.3.0`:

{noformat}
+---+-------------------+
|id |env_OMP_NUM_THREADS|
+---+-------------------+
|0  |1                  |
+---+-------------------+
{noformat}

> [YARN] OMP_NUM_THREADS not set to number of executor cores by default
> ---------------------------------------------------------------------
>
>                 Key: SPARK-42596
>                 URL: https://issues.apache.org/jira/browse/SPARK-42596
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, YARN
>    Affects Versions: 3.3.2
>            Reporter: John Zhuge
>            Priority: Major

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42596) [YARN] OMP_NUM_THREADS not set to number of executor cores by default
[ https://issues.apache.org/jira/browse/SPARK-42596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17693837#comment-17693837 ]

John Zhuge commented on SPARK-42596:
------------------------------------

This looks like a regression from SPARK-41188, which removed the code in PythonRunner that sets the default OMP_NUM_THREADS. Its PR assumed the code could be moved to SparkContext; unfortunately, `SparkContext#executorEnvs` is only consumed by StandaloneSchedulerBackend for Spark's standalone cluster manager, so the PR broke YARN (as shown in the test case above), and probably Mesos as well, though I have no way to test that.
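The defaulting rule that PythonRunner applied before SPARK-41188 can be sketched in plain Python (a toy model only; the real code is Scala inside Spark, and the function name here is illustrative):

```python
import os


def default_omp_num_threads(env: dict, executor_cores: str) -> dict:
    """Sketch of the behavior removed by SPARK-41188: if the user has not
    set OMP_NUM_THREADS themselves, default it to the executor core count.
    This is an illustrative model, not Spark's actual code path."""
    if "OMP_NUM_THREADS" not in env:
        env = {**env, "OMP_NUM_THREADS": executor_cores}
    return env


print(default_omp_num_threads({}, "1"))
# -> {'OMP_NUM_THREADS': '1'}
print(default_omp_num_threads({"OMP_NUM_THREADS": "4"}, "1"))
# -> {'OMP_NUM_THREADS': '4'}  (a user-set value is never overridden)
```

Until the regression is fixed, a practical mitigation (assuming it fits the deployment) is to pin the variable explicitly through Spark's documented per-executor environment configuration, e.g. `--conf spark.executorEnv.OMP_NUM_THREADS=1`.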
[jira] [Commented] (SPARK-42572) Logic error for StateStore.validateStateRowFormat
[ https://issues.apache.org/jira/browse/SPARK-42572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17693835#comment-17693835 ]

Apache Spark commented on SPARK-42572:
--------------------------------------

User 'WweiL' has created a pull request for this issue:
https://github.com/apache/spark/pull/40187

> Logic error for StateStore.validateStateRowFormat
> -------------------------------------------------
>
>                 Key: SPARK-42572
>                 URL: https://issues.apache.org/jira/browse/SPARK-42572
>             Project: Spark
>          Issue Type: Bug
>          Components: Structured Streaming
>    Affects Versions: 3.4.0
>            Reporter: Wei Liu
>            Priority: Major
>
> SPARK-42484 changed the logic of whether to check the state store format in
> StateStore.validateStateRowFormat. Revert it and add a unit test to make sure
> this won't happen again.
[jira] [Assigned] (SPARK-42572) Logic error for StateStore.validateStateRowFormat
[ https://issues.apache.org/jira/browse/SPARK-42572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-42572:
------------------------------------

    Assignee: Apache Spark
[jira] [Assigned] (SPARK-42572) Logic error for StateStore.validateStateRowFormat
[ https://issues.apache.org/jira/browse/SPARK-42572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-42572:
------------------------------------

    Assignee: (was: Apache Spark)
[jira] [Commented] (SPARK-42572) Logic error for StateStore.validateStateRowFormat
[ https://issues.apache.org/jira/browse/SPARK-42572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17693834#comment-17693834 ]

Wei Liu commented on SPARK-42572:
---------------------------------

I'm not sure what the correct process is here. We should still keep some of the changes from #40073 (especially the logging part). I've created a PR for the fix: https://github.com/apache/spark/pull/40187. But I could also revert it and combine the two PRs if that's the correct flow.
[jira] [Created] (SPARK-42596) [YARN] OMP_NUM_THREADS not set to number of executor cores by default
John Zhuge created SPARK-42596:
----------------------------------
             Summary: [YARN] OMP_NUM_THREADS not set to number of executor cores by default
                 Key: SPARK-42596
                 URL: https://issues.apache.org/jira/browse/SPARK-42596
             Project: Spark
          Issue Type: Bug
          Components: PySpark, YARN
    Affects Versions: 3.3.2
            Reporter: John Zhuge

Run this PySpark script with `spark.executor.cores=1`:

{code:python}
import os
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf

spark = SparkSession.builder.getOrCreate()
var_name = 'OMP_NUM_THREADS'

def get_env_var():
    return os.getenv(var_name)

udf_get_env_var = udf(get_env_var)
spark.range(1).toDF("id").withColumn(f"env_{var_name}", udf_get_env_var()).show(truncate=False)
{code}

Output with release `3.3.2`:

{noformat}
+---+-------------------+
|id |env_OMP_NUM_THREADS|
+---+-------------------+
|0  |null               |
+---+-------------------+
{noformat}

Output with release `3.3.0`:

{noformat}
+---+-------------------+
|id |env_OMP_NUM_THREADS|
+---+-------------------+
|0  |1                  |
+---+-------------------+
{noformat}
[jira] [Created] (SPARK-42595) Support query inserted partitions after insert data into table when hive.exec.dynamic.partition=true
zhang haoyan created SPARK-42595:
------------------------------------
             Summary: Support query inserted partitions after insert data into table when hive.exec.dynamic.partition=true
                 Key: SPARK-42595
                 URL: https://issues.apache.org/jira/browse/SPARK-42595
             Project: Spark
          Issue Type: New Feature
          Components: SQL
    Affects Versions: 3.5.0
            Reporter: zhang haoyan

When hive.exec.dynamic.partition=true and hive.exec.dynamic.partition.mode=nonstrict, we can insert into a table with SQL like 'insert overwrite table aaa partition(dt) select ...'. The partitions written can of course be inferred from the SQL itself, but for common tooling we need a generic way to retrieve the inserted partitions, for example:

{noformat}
spark.sql("insert overwrite table aaa partition(dt) select ...")  // insert table
val partitions = getInsertedPartitions()  // need some way to get inserted partitions
monitorInsertedPartitions(partitions)     // do something for common use
{noformat}

Since an insert statement should not return any data, this ticket proposes to introduce:

spark.hive.exec.dynamic.partition.savePartitions=true (default false)
spark.hive.exec.dynamic.partition.savePartitions.tableNamePrefix=hive_dynamic_inserted_partitions

When spark.hive.exec.dynamic.partition.savePartitions=true, we save the partitions to the temporary view $spark.hive.exec.dynamic.partition.savePartitions.tableNamePrefix_$dbName_$tableName. This would allow the user to do:

{noformat}
scala> spark.conf.set("hive.exec.dynamic.partition", true)
scala> spark.conf.set("hive.exec.dynamic.partition.mode", "nonstrict")
scala> spark.conf.set("spark.hive.exec.dynamic.partition.savePartitions", true)
scala> spark.sql("insert overwrite table db1.test_partition_table partition (dt) select 1, '2023-02-22'").show(false)
++
||
++
++

scala> spark.sql("select * from hive_dynamic_inserted_partitions_db1_test_partition_table").show(false)
+----------+
|dt        |
+----------+
|2023-02-22|
+----------+
{noformat}
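Until such a feature exists, a caller can approximate the inserted partitions by diffing the output of two `SHOW PARTITIONS` calls taken before and after the insert. The sketch below shows only the set logic in plain Python (names are illustrative); note that an overwrite of an already-existing partition would not appear in this diff, which is part of the motivation for the proposed config:

```python
def inserted_partitions(before: set, after: set) -> set:
    """Approximate the partitions touched by a dynamic-partition insert:
    those present after the insert but not before. Overwritten existing
    partitions are missed by this diff, hence the ticket's proposal."""
    return after - before


before = {"dt=2023-02-21"}                    # SHOW PARTITIONS before insert
after = {"dt=2023-02-21", "dt=2023-02-22"}    # SHOW PARTITIONS after insert
print(inserted_partitions(before, after))
# -> {'dt=2023-02-22'}
```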
[jira] [Resolved] (SPARK-42594) Spark can not read the latest view SQL when `create or replace view` is run by Hive

[ https://issues.apache.org/jira/browse/SPARK-42594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yuming Wang resolved SPARK-42594.
---------------------------------
    Resolution: Not A Bug

> Spark can not read the latest view SQL when `create or replace view` is run by Hive
> -----------------------------------------------------------------------------------
>
>                 Key: SPARK-42594
>                 URL: https://issues.apache.org/jira/browse/SPARK-42594
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.3.2
>            Reporter: ming95
>            Priority: Major
>         Attachments: image-2023-02-27-13-31-20-420.png
>
> 1. Spark saves the view schema as a table property.
> 2. Spark uses that table property as the output schema when selecting the view.
> 3. Hive does not update the table property when running `create or replace view` to update the view.
>
> !image-2023-02-27-13-31-20-420.png!
>
> So when Hive and Spark are mixed and the view is updated, Spark may ignore some columns.
>
> To reproduce this issue:
>
> 1. Running in Spark:
> ```
> create table test_spark (id string);
> create view test_spark_view as select id from test_spark;
> ```
> 2. Running in Hive:
> ```
> create or replace view test_spark_view as select id, "test" as new_id from test_spark;
> ```
> 3. Spark will ignore `test_spark_view#new_id` when selecting test_spark_view, but Hive can read it.
>
> I'm not sure if this is a feature of Spark.
[jira] [Reopened] (SPARK-42594) Spark can not read the latest view SQL when `create or replace view` is run by Hive

[ https://issues.apache.org/jira/browse/SPARK-42594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yuming Wang reopened SPARK-42594:
---------------------------------
[jira] [Resolved] (SPARK-42594) Spark can not read the latest view SQL when `create or replace view` is run by Hive

[ https://issues.apache.org/jira/browse/SPARK-42594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ming95 resolved SPARK-42594.
----------------------------
    Resolution: Fixed
[jira] [Commented] (SPARK-42594) Spark can not read the latest view SQL when `create or replace view` is run by Hive

[ https://issues.apache.org/jira/browse/SPARK-42594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17693810#comment-17693810 ]

ming95 commented on SPARK-42594:
--------------------------------

OK, thanks~ [~yumwang]
[jira] [Resolved] (SPARK-42528) Optimize PercentileHeap
[ https://issues.apache.org/jira/browse/SPARK-42528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenchen Fan resolved SPARK-42528.
---------------------------------
       Fix Version/s: 3.5.0
                      (was: 3.4.0)
          Resolution: Fixed

Issue resolved by pull request 40121
https://github.com/apache/spark/pull/40121

> Optimize PercentileHeap
> -----------------------
>
>                 Key: SPARK-42528
>                 URL: https://issues.apache.org/jira/browse/SPARK-42528
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 3.4.0
>            Reporter: Alkis Evlogimenos
>            Assignee: Alkis Evlogimenos
>            Priority: Major
>             Fix For: 3.5.0
>
> It is not fast enough when used inside the scheduler for estimations, which
> slows down the scheduling rate and, as a result, query execution time.
[jira] [Commented] (SPARK-42594) Spark can not read the latest view SQL when `create or replace view` is run by Hive

[ https://issues.apache.org/jira/browse/SPARK-42594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17693799#comment-17693799 ]

Yuming Wang commented on SPARK-42594:
-------------------------------------

Spark saves this information to table properties, and Hive does not update it. Please avoid updating the view definition through Hive.
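The mechanism described in the comment can be sketched in toy form (purely illustrative, not Spark's actual code path): Spark answers from the schema it stored in table properties, so a column Hive later added to the view definition never reaches the output:

```python
def select_view_output(stored_schema: list, view_sql_cols: list) -> list:
    """Toy model of the mismatch: Spark trusts the schema it saved in
    table properties (stored_schema), so columns that only exist in the
    view SQL Hive rewrote (view_sql_cols) are dropped from the output.
    Function and variable names are illustrative."""
    return [c for c in view_sql_cols if c in stored_schema]


# Spark created the view with one column; Hive later replaced it,
# adding new_id, without touching the table properties.
print(select_view_output(["id"], ["id", "new_id"]))
# -> ['id']   (new_id is silently lost on the Spark side)
```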
[jira] [Commented] (SPARK-42594) Spark can not read the latest view SQL when `create or replace view` is run by Hive

[ https://issues.apache.org/jira/browse/SPARK-42594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17693795#comment-17693795 ]

ming95 commented on SPARK-42594:
--------------------------------

[~yumwang] [~gurwls223] gentle ping~
[jira] [Updated] (SPARK-42594) Spark can not read the latest view SQL when `create or replace view` is run by Hive

[ https://issues.apache.org/jira/browse/SPARK-42594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ming95 updated SPARK-42594:
---------------------------
Description:

1. Spark saves the view schema as a table property.
2. Spark uses that table property as the output schema when selecting the view.
3. Hive does not update the table property when running `create or replace view` to update the view.

!image-2023-02-27-13-31-20-420.png!

So when Hive and Spark are mixed and the view is updated, Spark may ignore some columns.

To reproduce this issue:

1. Running in Spark:
```
create table test_spark (id string);
create view test_spark_view as select id from test_spark;
```
2. Running in Hive:
```
create or replace view test_spark_view as select id, "test" as new_id from test_spark;
```
3. Spark will ignore `test_spark_view#new_id` when selecting test_spark_view, but Hive can read it.

I'm not sure if this is a feature of Spark.

was:

1. Spark saves the view schema as a table property.
2. Spark uses that table property as the output schema when selecting the view.
3. Hive does not update the table property when running `create or replace view` to update the view.

So when Hive and Spark are mixed and the view is updated, Spark may ignore some strings.

To reproduce this issue:

1. Running in Spark:
```
create table test_spark (id string);
create view test_spark_view as select id from test_spark;
```
2. Running in Hive:
```
create or replace view test_spark_view as select id, "test" as new_id from test_spark;
```
3. Spark will ignore `test_spark_view#new_id` when selecting test_spark_view, but Hive can read it.
[jira] [Updated] (SPARK-42594) Spark can not read the latest view SQL when `create or replace view` is run by Hive

[ https://issues.apache.org/jira/browse/SPARK-42594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ming95 updated SPARK-42594:
---------------------------
    Attachment: image-2023-02-27-13-31-20-420.png
[jira] [Created] (SPARK-42594) Spark can not read the latest view SQL when `create or replace view` is run by Hive

ming95 created SPARK-42594:
------------------------------
             Summary: Spark can not read the latest view SQL when `create or replace view` is run by Hive
                 Key: SPARK-42594
                 URL: https://issues.apache.org/jira/browse/SPARK-42594
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 3.3.2
            Reporter: ming95

1. Spark saves the view schema as a table property.
2. Spark uses that table property as the output schema when selecting the view.
3. Hive does not update the table property when running `create or replace view` to update the view.

So when Hive and Spark are mixed and the view is updated, Spark may ignore some columns.

To reproduce this issue:

1. Running in Spark:
```
create table test_spark (id string);
create view test_spark_view as select id from test_spark;
```
2. Running in Hive:
```
create or replace view test_spark_view as select id, "test" as new_id from test_spark;
```
3. Spark will ignore `test_spark_view#new_id` when selecting test_spark_view, but Hive can read it.
[jira] [Updated] (SPARK-42593) Deprecate the APIs that will be removed in pandas 2.0.
[ https://issues.apache.org/jira/browse/SPARK-42593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Haejoon Lee updated SPARK-42593:
--------------------------------
Description:

pandas is preparing to release 2.0, which includes a bunch of API changes
([https://pandas.pydata.org/pandas-docs/version/2.0/whatsnew/v2.0.0.html#removal-of-prior-version-deprecations-changes]).
We should also deprecate these APIs so that we can remove them in the next release.

was:

pandas is preparing to release 2.0, which includes a bunch of API changes.
We should also deprecate these APIs so that we can remove them in the next release.
[jira] [Commented] (SPARK-42593) Deprecate the APIs that will be removed in pandas 2.0.
[ https://issues.apache.org/jira/browse/SPARK-42593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17693788#comment-17693788 ] Haejoon Lee commented on SPARK-42593: - I'm taking a look at this one. Will submit a PR soon. > Deprecate the APIs that will be removed in pandas 2.0. > -- > > Key: SPARK-42593 > URL: https://issues.apache.org/jira/browse/SPARK-42593 > Project: Spark > Issue Type: New Feature > Components: Pandas API on Spark >Affects Versions: 3.4.0 >Reporter: Haejoon Lee >Priority: Major > > pandas is preparing to release 2.0, which includes a bunch of API changes. > We should also deprecate these APIs so that we can remove them in the next > release. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-42593) Deprecate the APIs that will be removed in pandas 2.0.
Haejoon Lee created SPARK-42593: --- Summary: Deprecate the APIs that will be removed in pandas 2.0. Key: SPARK-42593 URL: https://issues.apache.org/jira/browse/SPARK-42593 Project: Spark Issue Type: New Feature Components: Pandas API on Spark Affects Versions: 3.4.0 Reporter: Haejoon Lee pandas is preparing to release 2.0, which includes a bunch of API changes. We should also deprecate these APIs so that we can remove them in the next release. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42592) Document SS guide doc for supporting multiple stateful operators (especially chained aggregations)
[ https://issues.apache.org/jira/browse/SPARK-42592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17693786#comment-17693786 ] Jungtaek Lim commented on SPARK-42592: -- Ideally this should be a part of Spark 3.4.0... Since RC already happened, I'll try to see whether I can add the doc before RC2. > Document SS guide doc for supporting multiple stateful operators (especially > chained aggregations) > -- > > Key: SPARK-42592 > URL: https://issues.apache.org/jira/browse/SPARK-42592 > Project: Spark > Issue Type: Documentation > Components: Structured Streaming >Affects Versions: 3.5.0 >Reporter: Jungtaek Lim >Priority: Major > > We made a change on the guide doc for SPARK-40925 via SPARK-42105, but from > SPARK-42105 we only removed the section of "limitation of global watermark". > That said, we haven't provided any example of new functionality, especially > that users need to know about the change of SQL function (window) in chained > time window aggregations. > In this ticket, we will add the example of chained time window aggregations, > with introducing new functionality of SQL function. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-42592) Document SS guide doc for supporting multiple stateful operators (especially chained aggregations)
Jungtaek Lim created SPARK-42592: Summary: Document SS guide doc for supporting multiple stateful operators (especially chained aggregations) Key: SPARK-42592 URL: https://issues.apache.org/jira/browse/SPARK-42592 Project: Spark Issue Type: Documentation Components: Structured Streaming Affects Versions: 3.5.0 Reporter: Jungtaek Lim We made a change to the guide doc for SPARK-40925 via SPARK-42105, but in SPARK-42105 we only removed the section on the "limitation of global watermark". That said, we haven't provided any example of the new functionality; in particular, users need to know about the change to the SQL function (window) in chained time window aggregations. In this ticket, we will add an example of chained time window aggregations, introducing the new functionality of the SQL function. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-42591) Document SS guide doc for introducing watermark propagation among operators
Jungtaek Lim created SPARK-42591: Summary: Document SS guide doc for introducing watermark propagation among operators Key: SPARK-42591 URL: https://issues.apache.org/jira/browse/SPARK-42591 Project: Spark Issue Type: Documentation Components: Structured Streaming Affects Versions: 3.5.0 Reporter: Jungtaek Lim Once SPARK-42376 has been merged, we also want to provide an example of using a stream-stream time-interval join followed by a streaming aggregation. Adding the feature without proper documentation may lead to a situation where no one even knows it is supported. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42581) Add SparkSession implicits
[ https://issues.apache.org/jira/browse/SPARK-42581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17693782#comment-17693782 ] Apache Spark commented on SPARK-42581: -- User 'hvanhovell' has created a pull request for this issue: https://github.com/apache/spark/pull/40186 > Add SparkSession implicits > -- > > Key: SPARK-42581 > URL: https://issues.apache.org/jira/browse/SPARK-42581 > Project: Spark > Issue Type: New Feature > Components: Connect >Affects Versions: 3.4.0 >Reporter: Herman van Hövell >Assignee: Herman van Hövell >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42362) Upgrade kubernetes-client from 6.4.0 to 6.4.1
[ https://issues.apache.org/jira/browse/SPARK-42362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-42362: -- Affects Version/s: 3.4.0 (was: 3.5.0) > Upgrade kubernetes-client from 6.4.0 to 6.4.1 > - > > Key: SPARK-42362 > URL: https://issues.apache.org/jira/browse/SPARK-42362 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.4.0 >Reporter: Bjørn Jørgensen >Assignee: Bjørn Jørgensen >Priority: Minor > Fix For: 3.4.0 > > > New version of kubernetes client > Release notes > https://github.com/fabric8io/kubernetes-client/releases/tag/v6.4.1 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42586) Implement RuntimeConf
[ https://issues.apache.org/jira/browse/SPARK-42586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17693777#comment-17693777 ] Apache Spark commented on SPARK-42586: -- User 'hvanhovell' has created a pull request for this issue: https://github.com/apache/spark/pull/40185 > Implement RuntimeConf > - > > Key: SPARK-42586 > URL: https://issues.apache.org/jira/browse/SPARK-42586 > Project: Spark > Issue Type: New Feature > Components: Connect >Affects Versions: 3.4.0 >Reporter: Herman van Hövell >Assignee: Herman van Hövell >Priority: Major > > Implement RuntimeConf for the Scala Client -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42497) Support of pandas API on Spark for Spark Connect.
[ https://issues.apache.org/jira/browse/SPARK-42497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haejoon Lee updated SPARK-42497: Affects Version/s: 3.5.0 (was: 3.4.0) > Support of pandas API on Spark for Spark Connect. > - > > Key: SPARK-42497 > URL: https://issues.apache.org/jira/browse/SPARK-42497 > Project: Spark > Issue Type: Umbrella > Components: Connect >Affects Versions: 3.5.0 >Reporter: Haejoon Lee >Priority: Major > > We should enable `pandas API on Spark` on Spark Connect. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42569) Throw unsupported exceptions for non-supported API
[ https://issues.apache.org/jira/browse/SPARK-42569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17693776#comment-17693776 ] Apache Spark commented on SPARK-42569: -- User 'amaliujia' has created a pull request for this issue: https://github.com/apache/spark/pull/40184 > Throw unsupported exceptions for non-supported API > -- > > Key: SPARK-42569 > URL: https://issues.apache.org/jira/browse/SPARK-42569 > Project: Spark > Issue Type: Task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Rui Wang >Assignee: Rui Wang >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42569) Throw unsupported exceptions for non-supported API
[ https://issues.apache.org/jira/browse/SPARK-42569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17693775#comment-17693775 ] Apache Spark commented on SPARK-42569: -- User 'amaliujia' has created a pull request for this issue: https://github.com/apache/spark/pull/40184 > Throw unsupported exceptions for non-supported API > -- > > Key: SPARK-42569 > URL: https://issues.apache.org/jira/browse/SPARK-42569 > Project: Spark > Issue Type: Task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Rui Wang >Assignee: Rui Wang >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-42589) Exclude `RelationalGroupedDataset.apply` from CompatibilitySuite
[ https://issues.apache.org/jira/browse/SPARK-42589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-42589. --- Resolution: Cannot Reproduce > Exclude `RelationalGroupedDataset.apply` from CompatibilitySuite > > > Key: SPARK-42589 > URL: https://issues.apache.org/jira/browse/SPARK-42589 > Project: Spark > Issue Type: Test > Components: Connect, Tests >Affects Versions: 3.4.0 >Reporter: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40821) Introduce window_time function to extract event time from the window column
[ https://issues.apache.org/jira/browse/SPARK-40821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim updated SPARK-40821: - Summary: Introduce window_time function to extract event time from the window column (was: Fix late record filtering to support chaining of stateful operators) > Introduce window_time function to extract event time from the window column > --- > > Key: SPARK-40821 > URL: https://issues.apache.org/jira/browse/SPARK-40821 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.4.0 >Reporter: Alex Balikov >Assignee: Alex Balikov >Priority: Major > Fix For: 3.4.0 > > > Currently chaining of stateful operators in Spark Structured Streaming is not > supported for various reasons and is blocked by the unsupported operations > check (spark.sql.streaming.unsupportedOperationCheck flag). We propose to fix > this, as chaining of stateful operators is a common streaming scenario - e.g. > stream-stream join -> windowed aggregation > window aggregation -> window aggregation > etc > What is broken: > # every stateful operator performs late record filtering against the global > watermark. When chaining stateful operators (e.g. window aggregations) the > output produced by the first stateful operator is effectively late against > the watermark and thus filtered out by the next operator's late record > filtering (technically the next operator should not do late record filtering, > but it can be changed to assert for correctness detection, etc.) > # when chaining window aggregations, the first window aggregating operator > produces records with schema \{ window: { start: Timestamp, end: Timestamp }, > agg: Long } - there is no explicit event time in the schema to be used by > the next stateful operator (the correct event time should be window.end - 1 ) > # stream-stream time-interval join can produce late records by semantics, > e.g.
if the join condition is: > left.eventTime BETWEEN right.eventTime - INTERVAL 1 HOUR AND right.eventTime + > INTERVAL 1 HOUR > the produced records can be delayed by 1 hr relative to the > watermark. > Proposed fixes: > 1. No. 1 can be fixed by performing late record filtering against the previous > microbatch watermark instead of the current microbatch watermark. > 2. No. 2 can be fixed by allowing the window and session_window functions to work > on the window column directly and compute the correct event time > transparently to the user. Also, introduce a window_time SQL function to > compute the correct event time from the window column. > 3. No. 3 can be fixed by adding support for per-operator watermarks instead of a > single global watermark. In the example of a stream-stream time-interval join > followed by a stateful operator, the join operator will 'delay' the > downstream operator watermarks by a correct value to handle the delayed > records. Only stream-stream time-interval joins will delay the > watermark; other operators will not delay downstream watermarks. > > *This ticket handles no. 2 of the proposal.* Others will be handled in > separate tickets. > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
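A plain-Python sketch of proposed fixes no. 1 and no. 2 above (illustrative only, not Spark internals): the event time recovered from a window column is window.end minus one unit of timestamp precision, and late-record filtering compares against the previous microbatch's watermark rather than the current one.

```python
from datetime import datetime, timedelta

def window_time(window):
    # Rows in a window carry event times in [window.start, window.end), so a
    # representative event time is window.end minus one timestamp unit
    # (Spark timestamps have microsecond precision).
    return window["end"] - timedelta(microseconds=1)

def is_late(event_time, watermark):
    # A record is late if its event time is behind the watermark it is
    # checked against.
    return event_time < watermark

window = {"start": datetime(2023, 1, 1, 0, 0), "end": datetime(2023, 1, 1, 0, 10)}
prev_watermark = datetime(2023, 1, 1, 0, 5)   # watermark of the previous batch
curr_watermark = datetime(2023, 1, 1, 0, 10)  # has advanced past window.end

et = window_time(window)
# Fix no. 1: filtering against the previous watermark keeps the upstream
# aggregation's output; the old behavior (current watermark) drops it.
assert not is_late(et, prev_watermark)
assert is_late(et, curr_watermark)
```

This shows why the two fixes compose: no. 2 gives the downstream operator a usable event time, and no. 1 keeps that event time from being discarded as late.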
[jira] [Commented] (SPARK-42587) Use wrapper versions for SBT and Maven in `connect` module tests
[ https://issues.apache.org/jira/browse/SPARK-42587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17693772#comment-17693772 ] Apache Spark commented on SPARK-42587: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/40183 > Use wrapper versions for SBT and Maven in `connect` module tests > > > Key: SPARK-42587 > URL: https://issues.apache.org/jira/browse/SPARK-42587 > Project: Spark > Issue Type: Test > Components: Connect, Tests >Affects Versions: 3.4.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Minor > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42538) `functions#lit` support more types
[ https://issues.apache.org/jira/browse/SPARK-42538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17693770#comment-17693770 ] Yang Jie commented on SPARK-42538: -- Got it > `functions#lit` support more types > --- > > Key: SPARK-42538 > URL: https://issues.apache.org/jira/browse/SPARK-42538 > Project: Spark > Issue Type: Improvement > Components: Connect >Affects Versions: 3.4.0, 3.5.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Major > Fix For: 3.4.1 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-42590) Introduce Decimal128 as the physical type for DecimalType
jiaan.geng created SPARK-42590: -- Summary: Introduce Decimal128 as the physical type for DecimalType Key: SPARK-42590 URL: https://issues.apache.org/jira/browse/SPARK-42590 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 3.5.0 Reporter: jiaan.geng -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42538) `functions#lit` support more types
[ https://issues.apache.org/jira/browse/SPARK-42538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17693768#comment-17693768 ] Herman van Hövell commented on SPARK-42538: --- It technically could be released without it. Retargeting closed issues should be a part of the RC process. > `functions#lit` support more types > --- > > Key: SPARK-42538 > URL: https://issues.apache.org/jira/browse/SPARK-42538 > Project: Spark > Issue Type: Improvement > Components: Connect >Affects Versions: 3.4.0, 3.5.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Major > Fix For: 3.4.1 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42588) collapse two adjacent windows with the equivalent partition/order expression
[ https://issues.apache.org/jira/browse/SPARK-42588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17693766#comment-17693766 ] Apache Spark commented on SPARK-42588: -- User 'zml1206' has created a pull request for this issue: https://github.com/apache/spark/pull/40182 > collapse two adjacent windows with the equivalent partition/order expression > > > Key: SPARK-42588 > URL: https://issues.apache.org/jira/browse/SPARK-42588 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.3, 3.3.2 >Reporter: zhuml >Priority: Major > > Extend the CollapseWindow rule to collapse Window nodes with the equivalent > partition/order expressions > {code:java} > Seq((1, 1), (2, 2)).toDF("a", "b") > .withColumn("max_b", expr("max(b) OVER (PARTITION BY abs(a))")) > .withColumn("min_b", expr("min(b) OVER (PARTITION BY abs(a))")) > == Optimized Logical Plan == > before > Project [a#7, b#8, max_b#11, min_b#17] > +- Window [min(b#8) windowspecdefinition(_w0#19, > specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) > AS min_b#17], [_w0#19] >+- Project [a#7, b#8, max_b#11, abs(a#7) AS _w0#19] > +- Window [max(b#8) windowspecdefinition(_w0#13, > specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) > AS max_b#11], [_w0#13] > +- Project [_1#2 AS a#7, _2#3 AS b#8, abs(_1#2) AS _w0#13] > +- LocalRelation [_1#2, _2#3] > after > Project [a#7, b#8, max_b#11, min_b#17] > +- Window [max(b#8) windowspecdefinition(_w0#13, > specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) > AS max_b#11, min(b#8) windowspecdefinition(_w0#13, > specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) > AS min_b#17], [_w0#13] >+- Project [_1#2 AS a#7, _2#3 AS b#8, abs(_1#2) AS _w0#13] > +- LocalRelation [_1#2, _2#3] > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, 
e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42588) collapse two adjacent windows with the equivalent partition/order expression
[ https://issues.apache.org/jira/browse/SPARK-42588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42588: Assignee: (was: Apache Spark) > collapse two adjacent windows with the equivalent partition/order expression > > > Key: SPARK-42588 > URL: https://issues.apache.org/jira/browse/SPARK-42588 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.3, 3.3.2 >Reporter: zhuml >Priority: Major > > Extend the CollapseWindow rule to collapse Window nodes with the equivalent > partition/order expressions > {code:java} > Seq((1, 1), (2, 2)).toDF("a", "b") > .withColumn("max_b", expr("max(b) OVER (PARTITION BY abs(a))")) > .withColumn("min_b", expr("min(b) OVER (PARTITION BY abs(a))")) > == Optimized Logical Plan == > before > Project [a#7, b#8, max_b#11, min_b#17] > +- Window [min(b#8) windowspecdefinition(_w0#19, > specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) > AS min_b#17], [_w0#19] >+- Project [a#7, b#8, max_b#11, abs(a#7) AS _w0#19] > +- Window [max(b#8) windowspecdefinition(_w0#13, > specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) > AS max_b#11], [_w0#13] > +- Project [_1#2 AS a#7, _2#3 AS b#8, abs(_1#2) AS _w0#13] > +- LocalRelation [_1#2, _2#3] > after > Project [a#7, b#8, max_b#11, min_b#17] > +- Window [max(b#8) windowspecdefinition(_w0#13, > specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) > AS max_b#11, min(b#8) windowspecdefinition(_w0#13, > specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) > AS min_b#17], [_w0#13] >+- Project [_1#2 AS a#7, _2#3 AS b#8, abs(_1#2) AS _w0#13] > +- LocalRelation [_1#2, _2#3] > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42588) collapse two adjacent windows with the equivalent partition/order expression
[ https://issues.apache.org/jira/browse/SPARK-42588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42588: Assignee: Apache Spark > collapse two adjacent windows with the equivalent partition/order expression > > > Key: SPARK-42588 > URL: https://issues.apache.org/jira/browse/SPARK-42588 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.3, 3.3.2 >Reporter: zhuml >Assignee: Apache Spark >Priority: Major > > Extend the CollapseWindow rule to collapse Window nodes with the equivalent > partition/order expressions > {code:java} > Seq((1, 1), (2, 2)).toDF("a", "b") > .withColumn("max_b", expr("max(b) OVER (PARTITION BY abs(a))")) > .withColumn("min_b", expr("min(b) OVER (PARTITION BY abs(a))")) > == Optimized Logical Plan == > before > Project [a#7, b#8, max_b#11, min_b#17] > +- Window [min(b#8) windowspecdefinition(_w0#19, > specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) > AS min_b#17], [_w0#19] >+- Project [a#7, b#8, max_b#11, abs(a#7) AS _w0#19] > +- Window [max(b#8) windowspecdefinition(_w0#13, > specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) > AS max_b#11], [_w0#13] > +- Project [_1#2 AS a#7, _2#3 AS b#8, abs(_1#2) AS _w0#13] > +- LocalRelation [_1#2, _2#3] > after > Project [a#7, b#8, max_b#11, min_b#17] > +- Window [max(b#8) windowspecdefinition(_w0#13, > specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) > AS max_b#11, min(b#8) windowspecdefinition(_w0#13, > specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) > AS min_b#17], [_w0#13] >+- Project [_1#2 AS a#7, _2#3 AS b#8, abs(_1#2) AS _w0#13] > +- LocalRelation [_1#2, _2#3] > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42589) Exclude `RelationalGroupedDataset.apply` from CompatibilitySuite
[ https://issues.apache.org/jira/browse/SPARK-42589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17693764#comment-17693764 ] Apache Spark commented on SPARK-42589: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/40181 > Exclude `RelationalGroupedDataset.apply` from CompatibilitySuite > > > Key: SPARK-42589 > URL: https://issues.apache.org/jira/browse/SPARK-42589 > Project: Spark > Issue Type: Test > Components: Connect, Tests >Affects Versions: 3.4.0 >Reporter: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42589) Exclude `RelationalGroupedDataset.apply` from CompatibilitySuite
[ https://issues.apache.org/jira/browse/SPARK-42589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42589: Assignee: (was: Apache Spark) > Exclude `RelationalGroupedDataset.apply` from CompatibilitySuite > > > Key: SPARK-42589 > URL: https://issues.apache.org/jira/browse/SPARK-42589 > Project: Spark > Issue Type: Test > Components: Connect, Tests >Affects Versions: 3.4.0 >Reporter: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42588) collapse two adjacent windows with the equivalent partition/order expression
[ https://issues.apache.org/jira/browse/SPARK-42588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhuml updated SPARK-42588: -- Description: Extend the CollapseWindow rule to collapse Window nodes with the equivalent partition/order expressions {code:java} Seq((1, 1), (2, 2)).toDF("a", "b") .withColumn("max_b", expr("max(b) OVER (PARTITION BY abs(a))")) .withColumn("min_b", expr("min(b) OVER (PARTITION BY abs(a))")) == Optimized Logical Plan == before Project [a#7, b#8, max_b#11, min_b#17] +- Window [min(b#8) windowspecdefinition(_w0#19, specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS min_b#17], [_w0#19] +- Project [a#7, b#8, max_b#11, abs(a#7) AS _w0#19] +- Window [max(b#8) windowspecdefinition(_w0#13, specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS max_b#11], [_w0#13] +- Project [_1#2 AS a#7, _2#3 AS b#8, abs(_1#2) AS _w0#13] +- LocalRelation [_1#2, _2#3] after Project [a#7, b#8, max_b#11, min_b#17] +- Window [max(b#8) windowspecdefinition(_w0#13, specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS max_b#11, min(b#8) windowspecdefinition(_w0#13, specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS min_b#17], [_w0#13] +- Project [_1#2 AS a#7, _2#3 AS b#8, abs(_1#2) AS _w0#13] +- LocalRelation [_1#2, _2#3] {code} > collapse two adjacent windows with the equivalent partition/order expression > > > Key: SPARK-42588 > URL: https://issues.apache.org/jira/browse/SPARK-42588 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.3, 3.3.2 >Reporter: zhuml >Priority: Major > > Extend the CollapseWindow rule to collapse Window nodes with the equivalent > partition/order expressions > {code:java} > Seq((1, 1), (2, 2)).toDF("a", "b") > .withColumn("max_b", expr("max(b) OVER (PARTITION BY abs(a))")) > .withColumn("min_b", expr("min(b) OVER (PARTITION BY abs(a))")) > == Optimized Logical Plan == > before > 
Project [a#7, b#8, max_b#11, min_b#17] > +- Window [min(b#8) windowspecdefinition(_w0#19, > specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) > AS min_b#17], [_w0#19] >+- Project [a#7, b#8, max_b#11, abs(a#7) AS _w0#19] > +- Window [max(b#8) windowspecdefinition(_w0#13, > specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) > AS max_b#11], [_w0#13] > +- Project [_1#2 AS a#7, _2#3 AS b#8, abs(_1#2) AS _w0#13] > +- LocalRelation [_1#2, _2#3] > after > Project [a#7, b#8, max_b#11, min_b#17] > +- Window [max(b#8) windowspecdefinition(_w0#13, > specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) > AS max_b#11, min(b#8) windowspecdefinition(_w0#13, > specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) > AS min_b#17], [_w0#13] >+- Project [_1#2 AS a#7, _2#3 AS b#8, abs(_1#2) AS _w0#13] > +- LocalRelation [_1#2, _2#3] > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
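Conceptually, the extended CollapseWindow rule described above can be sketched in plain Python over a toy plan model (illustrative only, not Catalyst; node shapes and the min/max ordering are simplifications):

```python
def collapse_windows(plan):
    # Merge adjacent Window nodes whose partition/order expressions are
    # equivalent (e.g. both partition by abs(a)), folding the second node's
    # window expressions into the first.
    out = []
    for node in plan:
        prev = out[-1] if out else None
        if (prev is not None
                and node["op"] == "Window" and prev["op"] == "Window"
                and node["partition"] == prev["partition"]
                and node["order"] == prev["order"]):
            prev["exprs"] = prev["exprs"] + node["exprs"]  # fold into previous
        else:
            out.append(dict(node))
    return out

# Toy plan for the example in the description: two windows over
# PARTITION BY abs(a), one computing min(b) and one computing max(b).
plan = [
    {"op": "Window", "partition": ("abs(a)",), "order": (), "exprs": ["min(b)"]},
    {"op": "Window", "partition": ("abs(a)",), "order": (), "exprs": ["max(b)"]},
    {"op": "Project", "partition": None, "order": None, "exprs": ["a", "b"]},
]
collapsed = collapse_windows(plan)
assert len(collapsed) == 2                            # two windows became one
assert collapsed[0]["exprs"] == ["min(b)", "max(b)"]  # both aggregates in one node
```

One merged Window node means one shuffle/sort for both aggregates instead of two, which is the benefit the optimized plan in the description shows.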
[jira] [Commented] (SPARK-42589) Exclude `RelationalGroupedDataset.apply` from CompatibilitySuite
[ https://issues.apache.org/jira/browse/SPARK-42589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17693763#comment-17693763 ] Apache Spark commented on SPARK-42589: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/40181 > Exclude `RelationalGroupedDataset.apply` from CompatibilitySuite > > > Key: SPARK-42589 > URL: https://issues.apache.org/jira/browse/SPARK-42589 > Project: Spark > Issue Type: Test > Components: Connect, Tests >Affects Versions: 3.4.0 >Reporter: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42589) Exclude `RelationalGroupedDataset.apply` from CompatibilitySuite
[ https://issues.apache.org/jira/browse/SPARK-42589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42589: Assignee: Apache Spark > Exclude `RelationalGroupedDataset.apply` from CompatibilitySuite > > > Key: SPARK-42589 > URL: https://issues.apache.org/jira/browse/SPARK-42589 > Project: Spark > Issue Type: Test > Components: Connect, Tests >Affects Versions: 3.4.0 >Reporter: Dongjoon Hyun >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-42589) Exclude `RelationalGroupedDataset.apply` from CompatibilitySuite
Dongjoon Hyun created SPARK-42589: - Summary: Exclude `RelationalGroupedDataset.apply` from CompatibilitySuite Key: SPARK-42589 URL: https://issues.apache.org/jira/browse/SPARK-42589 Project: Spark Issue Type: Test Components: Connect, Tests Affects Versions: 3.4.0 Reporter: Dongjoon Hyun -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42586) Implement RuntimeConf
[ https://issues.apache.org/jira/browse/SPARK-42586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hövell reassigned SPARK-42586: - Assignee: Herman van Hövell > Implement RuntimeConf > - > > Key: SPARK-42586 > URL: https://issues.apache.org/jira/browse/SPARK-42586 > Project: Spark > Issue Type: New Feature > Components: Connect >Affects Versions: 3.4.0 >Reporter: Herman van Hövell >Assignee: Herman van Hövell >Priority: Major > > Implement RuntimeConf for the Scala Client -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-42560) Implement ColumnName
[ https://issues.apache.org/jira/browse/SPARK-42560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hövell resolved SPARK-42560. --- Fix Version/s: 3.4.1 Resolution: Fixed > Implement ColumnName > > > Key: SPARK-42560 > URL: https://issues.apache.org/jira/browse/SPARK-42560 > Project: Spark > Issue Type: New Feature > Components: Connect >Affects Versions: 3.4.0 >Reporter: Herman van Hövell >Assignee: Herman van Hövell >Priority: Major > Fix For: 3.4.1 > > > Implement ColumnName class for connect. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42587) Use wrapper versions for SBT and Maven in `connect` module tests
[ https://issues.apache.org/jira/browse/SPARK-42587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-42587: - Assignee: Dongjoon Hyun > Use wrapper versions for SBT and Maven in `connect` module tests > > > Key: SPARK-42587 > URL: https://issues.apache.org/jira/browse/SPARK-42587 > Project: Spark > Issue Type: Test > Components: Connect, Tests >Affects Versions: 3.4.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-42587) Use wrapper versions for SBT and Maven in `connect` module tests
[ https://issues.apache.org/jira/browse/SPARK-42587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-42587. --- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 40180 [https://github.com/apache/spark/pull/40180] > Use wrapper versions for SBT and Maven in `connect` module tests > > > Key: SPARK-42587 > URL: https://issues.apache.org/jira/browse/SPARK-42587 > Project: Spark > Issue Type: Test > Components: Connect, Tests >Affects Versions: 3.4.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Minor > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42538) `functions#lit` support more types
[ https://issues.apache.org/jira/browse/SPARK-42538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17693757#comment-17693757 ] Yang Jie commented on SPARK-42538: -- [~hvanhovell] Should fix version be 3.4.0? It hasn't been officially released yet > `functions#lit` support more types > --- > > Key: SPARK-42538 > URL: https://issues.apache.org/jira/browse/SPARK-42538 > Project: Spark > Issue Type: Improvement > Components: Connect >Affects Versions: 3.4.0, 3.5.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Major > Fix For: 3.4.1 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-42588) collapse two adjacent windows with the equivalent partition/order expression
zhuml created SPARK-42588: - Summary: collapse two adjacent windows with the equivalent partition/order expression Key: SPARK-42588 URL: https://issues.apache.org/jira/browse/SPARK-42588 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.3.2, 3.2.3 Reporter: zhuml -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42577) A large stage could run indefinitely due to executor lost
[ https://issues.apache.org/jira/browse/SPARK-42577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17693753#comment-17693753 ] Tengfei Huang commented on SPARK-42577: --- I am working on this. Thanks. [~Ngone51] > A large stage could run indefinitely due to executor lost > - > > Key: SPARK-42577 > URL: https://issues.apache.org/jira/browse/SPARK-42577 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.3, 3.1.3, 3.2.3, 3.3.2 >Reporter: wuyi >Priority: Major > > When a stage is extremely large and Spark runs on spot instances or > problematic clusters with frequent worker/executor loss, the stage could run > indefinitely because tasks are rerun after each executor loss. This happens > when the external shuffle service is on and the large stage takes hours to > complete: when Spark tries to submit a child stage, it finds that the parent > stage (the large one) is missing some partitions, so the large stage has to > rerun. When it completes again, it finds new missing partitions for the > same reason. > We should add an attempt limit for this kind of scenario. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
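The attempt limit suggested above can be sketched in plain Python. This is a minimal, hypothetical illustration (the names `run_stage_with_attempt_limit`, `partitions_ok`, and `StageAbortedError` are invented for the sketch and are not Spark scheduler APIs): re-run the stage while its shuffle partitions are found missing, and abort once a configured number of attempts is exhausted.

```python
# Hypothetical sketch, not Spark's real scheduler code: cap how many
# times a stage may be re-submitted after its shuffle outputs are lost.
class StageAbortedError(Exception):
    pass

def run_stage_with_attempt_limit(run_stage, partitions_ok, max_attempts=4):
    """Run `run_stage` until `partitions_ok()` reports no missing
    partitions; abort after `max_attempts` submissions."""
    for attempt in range(1, max_attempts + 1):
        run_stage()
        if partitions_ok():
            return attempt  # number of attempts actually used
    raise StageAbortedError(f"stage aborted after {max_attempts} attempts")

# Simulate shuffle output that survives only from the third run onward,
# e.g. because the first two executors were lost on spot instances.
state = {"runs": 0}
def run_stage():
    state["runs"] += 1

print(run_stage_with_attempt_limit(run_stage, lambda: state["runs"] >= 3))  # 3
```

Without the limit, the loop above would spin forever whenever `partitions_ok` keeps reporting new losses, which is exactly the indefinite-rerun scenario described in the issue.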
[jira] [Updated] (SPARK-42587) Use wrapper versions for SBT and Maven in `connect` module tests
[ https://issues.apache.org/jira/browse/SPARK-42587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-42587: -- Priority: Minor (was: Major) > Use wrapper versions for SBT and Maven in `connect` module tests > > > Key: SPARK-42587 > URL: https://issues.apache.org/jira/browse/SPARK-42587 > Project: Spark > Issue Type: Test > Components: Connect, Tests >Affects Versions: 3.4.0 >Reporter: Dongjoon Hyun >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42587) Use wrapper versions for SBT and Maven in `connect` module tests
[ https://issues.apache.org/jira/browse/SPARK-42587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17693749#comment-17693749 ] Apache Spark commented on SPARK-42587: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/40180 > Use wrapper versions for SBT and Maven in `connect` module tests > > > Key: SPARK-42587 > URL: https://issues.apache.org/jira/browse/SPARK-42587 > Project: Spark > Issue Type: Test > Components: Connect, Tests >Affects Versions: 3.4.0 >Reporter: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42587) Use wrapper versions for SBT and Maven in `connect` module tests
[ https://issues.apache.org/jira/browse/SPARK-42587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42587: Assignee: (was: Apache Spark) > Use wrapper versions for SBT and Maven in `connect` module tests > > > Key: SPARK-42587 > URL: https://issues.apache.org/jira/browse/SPARK-42587 > Project: Spark > Issue Type: Test > Components: Connect, Tests >Affects Versions: 3.4.0 >Reporter: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42587) Use wrapper versions for SBT and Maven in `connect` module tests
[ https://issues.apache.org/jira/browse/SPARK-42587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42587: Assignee: Apache Spark > Use wrapper versions for SBT and Maven in `connect` module tests > > > Key: SPARK-42587 > URL: https://issues.apache.org/jira/browse/SPARK-42587 > Project: Spark > Issue Type: Test > Components: Connect, Tests >Affects Versions: 3.4.0 >Reporter: Dongjoon Hyun >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-42587) Use wrapper versions for SBT and Maven in `connect` module tests
Dongjoon Hyun created SPARK-42587: - Summary: Use wrapper versions for SBT and Maven in `connect` module tests Key: SPARK-42587 URL: https://issues.apache.org/jira/browse/SPARK-42587 Project: Spark Issue Type: Test Components: Connect, Tests Affects Versions: 3.4.0 Reporter: Dongjoon Hyun -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42485) SPIP: Shutting down spark structured streaming when the streaming process completed current process
[ https://issues.apache.org/jira/browse/SPARK-42485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17693745#comment-17693745 ] Hyukjin Kwon commented on SPARK-42485: -- For a proper SPIP, you should read https://spark.apache.org/improvement-proposals.html and answer the questions posted there. From a cursory look, I think this won't need an SPIP though. > SPIP: Shutting down spark structured streaming when the streaming process > completed current process > --- > > Key: SPARK-42485 > URL: https://issues.apache.org/jira/browse/SPARK-42485 > Project: Spark > Issue Type: New Feature > Components: Structured Streaming >Affects Versions: 3.2.2 >Reporter: Mich Talebzadeh >Priority: Major > Labels: SPIP > > Spark Structured Streaming is a very useful tool in dealing with Event Driven > Architecture. In an Event Driven Architecture, there is generally a main loop > that listens for events and then triggers a call-back function when one of > those events is detected. In a streaming application the application waits to > receive the source messages in a set interval or whenever they happen and > reacts accordingly. > There are occasions when you may want to stop the Spark program gracefully. > Gracefully meaning that the Spark application handles the last streaming message > completely and terminates the application. This is different from invoking > interrupts such as CTRL-C. > Of course one can terminate the process based on the following: > # query.awaitTermination() # Waits for the termination of this query, with > stop() or with error > # query.awaitTermination(timeoutMs) # Returns true if this query is > terminated within the timeout in milliseconds. > The first one above waits until an interrupt signal is received. The > second one counts down the timeout and exits when the timeout in milliseconds > is reached. > The issue is that one needs to predict how long the streaming job needs to > run. 
Clearly, any interrupt at the terminal or OS level (killing the process) may > leave the processing terminated without proper completion of the streaming > process. > I have devised a method that allows one to terminate the Spark application > internally after processing the last received message. Within, say, 2 seconds > of the confirmation of shutdown, the process will invoke a graceful shutdown. > This new feature proposes a solution that gracefully finishes the > work for the message currently being processed, waits for it to complete, and > shuts down the streaming process for a given topic without loss of data or > orphaned transactions. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
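The graceful-shutdown idea described above can be sketched in plain Python (this illustrates the control flow only, not the proposed Spark API): the loop always finishes the message in flight, and only consults the shutdown flag between messages, so a requested stop never cuts a message off halfway.

```python
# Plain-Python illustration of graceful shutdown between messages.
processed = []
flag = {"stop": False}

def handle(msg):
    processed.append(msg)
    if msg == "m2":
        flag["stop"] = True  # operator requests shutdown mid-stream

for msg in ["m1", "m2", "m3"]:
    handle(msg)          # always complete the message in flight
    if flag["stop"]:
        break            # graceful stop: checked only between messages

print(processed)  # ['m1', 'm2']
```

Contrast this with a CTRL-C style interrupt, which could fire inside `handle` and leave "m2" half-processed.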
[jira] [Commented] (SPARK-42517) Add documentation for Protobuf connector
[ https://issues.apache.org/jira/browse/SPARK-42517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17693744#comment-17693744 ] Hyukjin Kwon commented on SPARK-42517: -- Duplicate of SPARK-40776? > Add documentation for Protobuf connector > > > Key: SPARK-42517 > URL: https://issues.apache.org/jira/browse/SPARK-42517 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Gengliang Wang >Priority: Major > > Similar to [https://spark.apache.org/docs/latest/sql-data-sources-avro.html,] > we should add documentation for Protobuf connector -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42572) Logic error for StateStore.validateStateRowFormat
[ https://issues.apache.org/jira/browse/SPARK-42572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17693742#comment-17693742 ] Hyukjin Kwon commented on SPARK-42572: -- [~WweiL] are you saying that we should revert https://github.com/apache/spark/pull/40073? It won't need a new jira for that > Logic error for StateStore.validateStateRowFormat > - > > Key: SPARK-42572 > URL: https://issues.apache.org/jira/browse/SPARK-42572 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 3.4.0 >Reporter: Wei Liu >Priority: Major > > SPARK-42484 Changed the logic of whether to check state store format in > StateStore.validateStateRowFormat. Revert it and add unit test to make sure > this won't happen again -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-42499) Support for Runtime SQL configuration
[ https://issues.apache.org/jira/browse/SPARK-42499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-42499. -- Resolution: Duplicate > Support for Runtime SQL configuration > - > > Key: SPARK-42499 > URL: https://issues.apache.org/jira/browse/SPARK-42499 > Project: Spark > Issue Type: Umbrella > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42559) Implement DataFrameNaFunctions
[ https://issues.apache.org/jira/browse/SPARK-42559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hövell reassigned SPARK-42559: - Assignee: BingKun Pan > Implement DataFrameNaFunctions > -- > > Key: SPARK-42559 > URL: https://issues.apache.org/jira/browse/SPARK-42559 > Project: Spark > Issue Type: New Feature > Components: Connect >Affects Versions: 3.4.0 >Reporter: Herman van Hövell >Assignee: BingKun Pan >Priority: Major > > Implement DataFrameNaFunctions for connect and hook it up to Dataset. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-42586) Implement RuntimeConf
Herman van Hövell created SPARK-42586: - Summary: Implement RuntimeConf Key: SPARK-42586 URL: https://issues.apache.org/jira/browse/SPARK-42586 Project: Spark Issue Type: New Feature Components: Connect Affects Versions: 3.4.0 Reporter: Herman van Hövell Implement RuntimeConf for the Scala Client -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42585) Streaming createDataFrame implementation
[ https://issues.apache.org/jira/browse/SPARK-42585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-42585: - Description: createDataFrame in Spark Connect is now one protobuf message which doesn't allow creating a large local DataFrame. We should make it streaming. > Streaming createDataFrame implementation > > > Key: SPARK-42585 > URL: https://issues.apache.org/jira/browse/SPARK-42585 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Priority: Major > > createDataFrame in Spark Connect is now one protobuf message which doesn't > allow creating a large local DataFrame. We should make it streaming. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
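A minimal sketch of the streaming approach, in plain Python with invented names (Spark Connect's actual protobuf definitions are not shown here): the serialized local data is split into bounded chunks that can be sent as a stream of messages and reassembled on the server, instead of one oversized message.

```python
# Hypothetical sketch (invented names, not Spark Connect's real proto):
# split a serialized local DataFrame into bounded chunks so it can be
# sent as a stream of messages instead of one oversized message.
def to_chunks(payload: bytes, max_chunk: int):
    """Yield successive chunks of at most `max_chunk` bytes."""
    for start in range(0, len(payload), max_chunk):
        yield payload[start:start + max_chunk]

data = bytes(range(10)) * 100            # a 1000-byte "local DataFrame"
chunks = list(to_chunks(data, 256))
assert b"".join(chunks) == data          # receiver reassembles losslessly
print(len(chunks))  # 4
```

The chunk size bounds every individual message, so arbitrarily large local data no longer has to fit in a single protobuf message.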
[jira] [Created] (SPARK-42585) Streaming createDataFrame implementation
Hyukjin Kwon created SPARK-42585: Summary: Streaming createDataFrame implementation Key: SPARK-42585 URL: https://issues.apache.org/jira/browse/SPARK-42585 Project: Spark Issue Type: Sub-task Components: Connect Affects Versions: 3.4.0 Reporter: Hyukjin Kwon -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42581) Add SparkSession implicits
[ https://issues.apache.org/jira/browse/SPARK-42581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17693736#comment-17693736 ] Herman van Hövell commented on SPARK-42581: --- Waiting for SPARK-42560 > Add SparkSession implicits > -- > > Key: SPARK-42581 > URL: https://issues.apache.org/jira/browse/SPARK-42581 > Project: Spark > Issue Type: New Feature > Components: Connect >Affects Versions: 3.4.0 >Reporter: Herman van Hövell >Assignee: Herman van Hövell >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-42564) Implement Dataset.version and Dataset.time
[ https://issues.apache.org/jira/browse/SPARK-42564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hövell resolved SPARK-42564. --- Fix Version/s: 3.4.1 Resolution: Fixed > Implement Dataset.version and Dataset.time > -- > > Key: SPARK-42564 > URL: https://issues.apache.org/jira/browse/SPARK-42564 > Project: Spark > Issue Type: New Feature > Components: Connect >Affects Versions: 3.4.0 >Reporter: Herman van Hövell >Assignee: BingKun Pan >Priority: Major > Fix For: 3.4.1 > > > Implement Dataset.version and Dataset.time > {code:java} > /** > * The version of Spark on which this application is running. > * > * @since 2.0.0 > */ > def version: String = SPARK_VERSION > /** > * Executes some code block and prints to stdout the time taken to execute > the block. This is > * available in Scala only and is used primarily for interactive testing and > debugging. > * > * @since 2.1.0 > */ > def time[T](f: => T): T = { > val start = System.nanoTime() > val ret = f > val end = System.nanoTime() > // scalastyle:off println > println(s"Time taken: ${NANOSECONDS.toMillis(end - start)} ms") > // scalastyle:on println > ret > } {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-42419) Migrate `TypeError` into error framework for Spark Connect column API.
[ https://issues.apache.org/jira/browse/SPARK-42419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-42419. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 39991 [https://github.com/apache/spark/pull/39991] > Migrate `TypeError` into error framework for Spark Connect column API. > -- > > Key: SPARK-42419 > URL: https://issues.apache.org/jira/browse/SPARK-42419 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Haejoon Lee >Assignee: Haejoon Lee >Priority: Major > Fix For: 3.4.0 > > > We should migrate all errors into PySpark error framework. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42419) Migrate `TypeError` into error framework for Spark Connect column API.
[ https://issues.apache.org/jira/browse/SPARK-42419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-42419: Assignee: Haejoon Lee > Migrate `TypeError` into error framework for Spark Connect column API. > -- > > Key: SPARK-42419 > URL: https://issues.apache.org/jira/browse/SPARK-42419 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Haejoon Lee >Assignee: Haejoon Lee >Priority: Major > > We should migrate all errors into PySpark error framework. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-42569) Throw unsupported exceptions for non-supported API
[ https://issues.apache.org/jira/browse/SPARK-42569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-42569. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 40172 [https://github.com/apache/spark/pull/40172] > Throw unsupported exceptions for non-supported API > -- > > Key: SPARK-42569 > URL: https://issues.apache.org/jira/browse/SPARK-42569 > Project: Spark > Issue Type: Task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Rui Wang >Assignee: Rui Wang >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-42574) DataFrame.toPandas should handle duplicated column names
[ https://issues.apache.org/jira/browse/SPARK-42574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-42574. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 40170 [https://github.com/apache/spark/pull/40170] > DataFrame.toPandas should handle duplicated column names > > > Key: SPARK-42574 > URL: https://issues.apache.org/jira/browse/SPARK-42574 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Takuya Ueshin >Assignee: Takuya Ueshin >Priority: Major > Fix For: 3.4.0 > > > {code:python} > spark.sql("select 1 v, 1 v").toPandas() > {code} > should work. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
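One way to see why duplicated names are tricky: any conversion keyed by column name collapses the two `v` columns into one. A minimal plain-Python sketch (the `rows_to_columns` helper is hypothetical, not the actual PySpark fix) converts by position instead, so both columns survive.

```python
# Hypothetical helper, plain Python: convert rows to columns by position
# so duplicated column names are not silently collapsed, as a dict keyed
# by column name would do.
def rows_to_columns(names, rows):
    """Return (name, values) pairs, one per column position."""
    cols = [[] for _ in names]
    for row in rows:
        for i, value in enumerate(row):
            cols[i].append(value)
    return list(zip(names, cols))

# Same shape as `select 1 v, 1 v`: two columns that share a name.
result = rows_to_columns(["v", "v"], [(1, 1)])
print(result)  # [('v', [1]), ('v', [1])] -- both columns preserved
```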
[jira] [Assigned] (SPARK-42574) DataFrame.toPandas should handle duplicated column names
[ https://issues.apache.org/jira/browse/SPARK-42574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-42574: Assignee: Takuya Ueshin > DataFrame.toPandas should handle duplicated column names > > > Key: SPARK-42574 > URL: https://issues.apache.org/jira/browse/SPARK-42574 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Takuya Ueshin >Assignee: Takuya Ueshin >Priority: Major > > {code:python} > spark.sql("select 1 v, 1 v").toPandas() > {code} > should work. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42560) Implement ColumnName
[ https://issues.apache.org/jira/browse/SPARK-42560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17693730#comment-17693730 ] Apache Spark commented on SPARK-42560: -- User 'hvanhovell' has created a pull request for this issue: https://github.com/apache/spark/pull/40179 > Implement ColumnName > > > Key: SPARK-42560 > URL: https://issues.apache.org/jira/browse/SPARK-42560 > Project: Spark > Issue Type: New Feature > Components: Connect >Affects Versions: 3.4.0 >Reporter: Herman van Hövell >Assignee: Herman van Hövell >Priority: Major > > Implement ColumnName class for connect. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-42584) Improve output of Column.explain
Herman van Hövell created SPARK-42584: - Summary: Improve output of Column.explain Key: SPARK-42584 URL: https://issues.apache.org/jira/browse/SPARK-42584 Project: Spark Issue Type: New Feature Components: Connect Affects Versions: 3.4.0 Reporter: Herman van Hövell We currently display the structure of the proto in both the regular and extended versions of explain. We should display a more compact SQL-like string for the regular version. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42407) `with as` executed again
[ https://issues.apache.org/jira/browse/SPARK-42407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17693723#comment-17693723 ] Pablo Langa Blanco commented on SPARK-42407: In my opinion, "WITH AS" syntax is intended to simplify SQL queries, not to act at the execution level. To get what you want, you can use "CACHE TABLE" combined with "WITH AS". > `with as` executed again > > > Key: SPARK-42407 > URL: https://issues.apache.org/jira/browse/SPARK-42407 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.3 >Reporter: yiku123 >Priority: Major > > When 'with as' is used multiple times, it will be executed again each time > without saving the results of 'with as', resulting in low efficiency. > Will you consider improving the behavior of 'with as'? > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-40525) Floating-point value with an INT/BYTE/SHORT/LONG type errors out in DataFrame but evaluates to a rounded value in SparkSQL
[ https://issues.apache.org/jira/browse/SPARK-40525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17693721#comment-17693721 ] Pablo Langa Blanco edited comment on SPARK-40525 at 2/26/23 10:57 PM: -- Hi [~x/sys] , When you are working with Spark Sql interface you can configure the behavior and you have 3 policies for type coercion rules. ([https://spark.apache.org/docs/latest/sql-ref-ansi-compliance.html)] If you set "strict" in spark.sql.storeAssignmentPolicy it's going to happen what you expect, but it's not the policy by default. I hope it help you. was (Author: planga82): Hi [~x/sys] , When you are working with Spark Sql interface you can configure the behavior and you have 3 policies for type coercion rules. ([https://spark.apache.org/docs/latest/sql-ref-ansi-compliance.html)] If you set "strict" in spark.sql.storeAssignmentPolicy it's going to happen what you expected, but it's not the policy by default. I hope it help you. > Floating-point value with an INT/BYTE/SHORT/LONG type errors out in DataFrame > but evaluates to a rounded value in SparkSQL > -- > > Key: SPARK-40525 > URL: https://issues.apache.org/jira/browse/SPARK-40525 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.1 >Reporter: xsys >Priority: Major > > h3. Describe the bug > Storing an invalid INT value {{1.1}} using DataFrames via {{spark-shell}} > expectedly errors out. However, it is evaluated to a rounded value {{1}} if > the value is inserted into the table via {{{}spark-sql{}}}. > h3. Steps to reproduce: > On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{{}spark-sql{}}}: > {code:java} > $SPARK_HOME/bin/spark-sql {code} > Execute the following: > {code:java} > spark-sql> create table int_floating_point_vals(c1 INT) stored as ORC; > 22/09/19 16:49:11 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, > since hive.security.authorization.manager is set to instance of > HiveAuthorizerFactory. 
> Time taken: 0.216 seconds > spark-sql> insert into int_floating_point_vals select 1.1; > Time taken: 1.747 seconds > spark-sql> select * from int_floating_point_vals; > 1 > Time taken: 0.518 seconds, Fetched 1 row(s){code} > h3. Expected behavior > We expect the two Spark interfaces ({{{}spark-sql{}}} & {{{}spark-shell{}}}) > to behave consistently for the same data type & input combination > ({{{}INT{}}} and {{{}1.1{}}}). > h4. Here is a simplified example in {{{}spark-shell{}}}, where insertion of > the aforementioned value correctly raises an exception: > On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{{}spark-shell{}}}: > {code:java} > $SPARK_HOME/bin/spark-shell{code} > Execute the following: > {code:java} > import org.apache.spark.sql.{Row, SparkSession} > import org.apache.spark.sql.types._ > val rdd = sc.parallelize(Seq(Row(1.1))) > val schema = new StructType().add(StructField("c1", IntegerType, true)) > val df = spark.createDataFrame(rdd, schema) > df.write.mode("overwrite").format("orc").saveAsTable("int_floating_point_vals") > {code} > The following exception is raised: > {code:java} > java.lang.RuntimeException: Error while encoding: java.lang.RuntimeException: > java.lang.Double is not a valid external type for schema of int{code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40525) Floating-point value with an INT/BYTE/SHORT/LONG type errors out in DataFrame but evaluates to a rounded value in SparkSQL
[ https://issues.apache.org/jira/browse/SPARK-40525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17693721#comment-17693721 ] Pablo Langa Blanco commented on SPARK-40525: Hi [~x/sys], when you are working with the Spark SQL interface you can configure this behavior: there are 3 policies for type coercion rules ([https://spark.apache.org/docs/latest/sql-ref-ansi-compliance.html]). If you set spark.sql.storeAssignmentPolicy to "strict", you will get the behavior you expected, but that is not the default policy. I hope this helps. > Floating-point value with an INT/BYTE/SHORT/LONG type errors out in DataFrame > but evaluates to a rounded value in SparkSQL > -- > > Key: SPARK-40525 > URL: https://issues.apache.org/jira/browse/SPARK-40525 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.1 >Reporter: xsys >Priority: Major > > h3. Describe the bug > Storing an invalid INT value {{1.1}} using DataFrames via {{spark-shell}} > errors out as expected. However, it is evaluated to a rounded value {{1}} if > the value is inserted into the table via {{{}spark-sql{}}}. > h3. Steps to reproduce: > On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{{}spark-sql{}}}: > {code:java} > $SPARK_HOME/bin/spark-sql {code} > Execute the following: > {code:java} > spark-sql> create table int_floating_point_vals(c1 INT) stored as ORC; > 22/09/19 16:49:11 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, > since hive.security.authorization.manager is set to instance of > HiveAuthorizerFactory. > Time taken: 0.216 seconds > spark-sql> insert into int_floating_point_vals select 1.1; > Time taken: 1.747 seconds > spark-sql> select * from int_floating_point_vals; > 1 > Time taken: 0.518 seconds, Fetched 1 row(s){code} > h3. Expected behavior > We expect the two Spark interfaces ({{{}spark-sql{}}} & {{{}spark-shell{}}}) > to behave consistently for the same data type & input combination > ({{{}INT{}}} and {{{}1.1{}}}). > h4. 
Here is a simplified example in {{{}spark-shell{}}}, where insertion of > the aforementioned value correctly raises an exception: > On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{{}spark-shell{}}}: > {code:java} > $SPARK_HOME/bin/spark-shell{code} > Execute the following: > {code:java} > import org.apache.spark.sql.{Row, SparkSession} > import org.apache.spark.sql.types._ > val rdd = sc.parallelize(Seq(Row(1.1))) > val schema = new StructType().add(StructField("c1", IntegerType, true)) > val df = spark.createDataFrame(rdd, schema) > df.write.mode("overwrite").format("orc").saveAsTable("int_floating_point_vals") > {code} > The following exception is raised: > {code:java} > java.lang.RuntimeException: Error while encoding: java.lang.RuntimeException: > java.lang.Double is not a valid external type for schema of int{code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
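Building on the comment above, here is a rough sketch of how the STRICT store-assignment policy changes the spark-sql behavior. The table name is reused from the report; the claim that the insert fails under STRICT follows from the linked ANSI-compliance docs, and the exact error text is not shown because it varies by version.

{code:sql}
-- With the default policy the fractional part of 1.1 is silently truncated
-- to 1 on insert. Switching the policy to STRICT should reject the statement
-- instead, matching the DataFrame-side behavior.
SET spark.sql.storeAssignmentPolicy=STRICT;

CREATE TABLE int_floating_point_vals(c1 INT) STORED AS ORC;

-- Expected to fail analysis under STRICT, because DECIMAL(2,1) cannot be
-- stored into an INT column without possible loss of precision.
INSERT INTO int_floating_point_vals SELECT 1.1;
{code}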
[jira] [Updated] (SPARK-42583) Remove outer join if all aggregate functions are distinct
[ https://issues.apache.org/jira/browse/SPARK-42583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-42583: Description: To support more cases: https://github.com/pingcap/tidb/blob/master/planner/core/rule_join_elimination.go#L159 > Remove outer join if all aggregate functions are distinct > - > > Key: SPARK-42583 > URL: https://issues.apache.org/jira/browse/SPARK-42583 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Yuming Wang >Priority: Major > > To support more cases: > https://github.com/pingcap/tidb/blob/master/planner/core/rule_join_elimination.go#L159 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
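To illustrate the kind of rewrite this improvement targets, here is a hypothetical example (tables {{t1}}/{{t2}} are made up; the precise applicability conditions are those in the linked TiDB rule):

{code:sql}
-- The left outer join below can be eliminated: the only aggregate is
-- DISTINCT and references columns from the preserved (left) side only, so
-- duplicate rows produced by multiple t2 matches cannot change the result.
SELECT COUNT(DISTINCT t1.a)
FROM t1 LEFT JOIN t2 ON t1.id = t2.id;

-- Semantically equivalent after removing the join:
SELECT COUNT(DISTINCT t1.a) FROM t1;
{code}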
[jira] [Commented] (SPARK-42583) Remove outer join if all aggregate functions are distinct
[ https://issues.apache.org/jira/browse/SPARK-42583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17693649#comment-17693649 ] Apache Spark commented on SPARK-42583: -- User 'wangyum' has created a pull request for this issue: https://github.com/apache/spark/pull/40177 > Remove outer join if all aggregate functions are distinct > - > > Key: SPARK-42583 > URL: https://issues.apache.org/jira/browse/SPARK-42583 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Yuming Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42583) Remove outer join if all aggregate functions are distinct
[ https://issues.apache.org/jira/browse/SPARK-42583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42583: Assignee: (was: Apache Spark) > Remove outer join if all aggregate functions are distinct > - > > Key: SPARK-42583 > URL: https://issues.apache.org/jira/browse/SPARK-42583 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Yuming Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42583) Remove outer join if all aggregate functions are distinct
[ https://issues.apache.org/jira/browse/SPARK-42583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17693648#comment-17693648 ] Apache Spark commented on SPARK-42583: -- User 'wangyum' has created a pull request for this issue: https://github.com/apache/spark/pull/40177 > Remove outer join if all aggregate functions are distinct > - > > Key: SPARK-42583 > URL: https://issues.apache.org/jira/browse/SPARK-42583 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Yuming Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42583) Remove outer join if all aggregate functions are distinct
[ https://issues.apache.org/jira/browse/SPARK-42583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42583: Assignee: Apache Spark > Remove outer join if all aggregate functions are distinct > - > > Key: SPARK-42583 > URL: https://issues.apache.org/jira/browse/SPARK-42583 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Yuming Wang >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-42583) Remove outer join if all aggregate functions are distinct
Yuming Wang created SPARK-42583: --- Summary: Remove outer join if all aggregate functions are distinct Key: SPARK-42583 URL: https://issues.apache.org/jira/browse/SPARK-42583 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.4.0 Reporter: Yuming Wang -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42564) Implement Dataset.version and Dataset.time
[ https://issues.apache.org/jira/browse/SPARK-42564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17693643#comment-17693643 ] Apache Spark commented on SPARK-42564: -- User 'panbingkun' has created a pull request for this issue: https://github.com/apache/spark/pull/40176 > Implement Dataset.version and Dataset.time > -- > > Key: SPARK-42564 > URL: https://issues.apache.org/jira/browse/SPARK-42564 > Project: Spark > Issue Type: New Feature > Components: Connect >Affects Versions: 3.4.0 >Reporter: Herman van Hövell >Assignee: BingKun Pan >Priority: Major > > Implement Dataset.version and Dataset.time > {code:java} > /** > * The version of Spark on which this application is running. > * > * @since 2.0.0 > */ > def version: String = SPARK_VERSION > /** > * Executes some code block and prints to stdout the time taken to execute > the block. This is > * available in Scala only and is used primarily for interactive testing and > debugging. > * > * @since 2.1.0 > */ > def time[T](f: => T): T = { > val start = System.nanoTime() > val ret = f > val end = System.nanoTime() > // scalastyle:off println > println(s"Time taken: ${NANOSECONDS.toMillis(end - start)} ms") > // scalastyle:on println > ret > } {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
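As a rough usage sketch of the two methods quoted above, in a spark-shell session where {{spark}} is the active SparkSession (the printed timing line is illustrative, not a real measurement):

{code:java}
// Report the Spark version this application runs on.
println(spark.version)  // e.g. "3.4.0"

// Time an action: runs the block, prints the elapsed wall-clock time to
// stdout, and returns the block's result.
val cnt = spark.time(spark.range(1000000).count())
// prints a line like: Time taken: 87 ms
{code}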
[jira] [Assigned] (SPARK-42564) Implement Dataset.version and Dataset.time
[ https://issues.apache.org/jira/browse/SPARK-42564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42564: Assignee: BingKun Pan (was: Apache Spark) > Implement Dataset.version and Dataset.time > -- > > Key: SPARK-42564 > URL: https://issues.apache.org/jira/browse/SPARK-42564 > Project: Spark > Issue Type: New Feature > Components: Connect >Affects Versions: 3.4.0 >Reporter: Herman van Hövell >Assignee: BingKun Pan >Priority: Major > > Implement Dataset.version and Dataset.time > {code:java} > /** > * The version of Spark on which this application is running. > * > * @since 2.0.0 > */ > def version: String = SPARK_VERSION > /** > * Executes some code block and prints to stdout the time taken to execute > the block. This is > * available in Scala only and is used primarily for interactive testing and > debugging. > * > * @since 2.1.0 > */ > def time[T](f: => T): T = { > val start = System.nanoTime() > val ret = f > val end = System.nanoTime() > // scalastyle:off println > println(s"Time taken: ${NANOSECONDS.toMillis(end - start)} ms") > // scalastyle:on println > ret > } {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42564) Implement Dataset.version and Dataset.time
[ https://issues.apache.org/jira/browse/SPARK-42564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42564: Assignee: Apache Spark (was: BingKun Pan) > Implement Dataset.version and Dataset.time > -- > > Key: SPARK-42564 > URL: https://issues.apache.org/jira/browse/SPARK-42564 > Project: Spark > Issue Type: New Feature > Components: Connect >Affects Versions: 3.4.0 >Reporter: Herman van Hövell >Assignee: Apache Spark >Priority: Major > > Implement Dataset.version and Dataset.time > {code:java} > /** > * The version of Spark on which this application is running. > * > * @since 2.0.0 > */ > def version: String = SPARK_VERSION > /** > * Executes some code block and prints to stdout the time taken to execute > the block. This is > * available in Scala only and is used primarily for interactive testing and > debugging. > * > * @since 2.1.0 > */ > def time[T](f: => T): T = { > val start = System.nanoTime() > val ret = f > val end = System.nanoTime() > // scalastyle:off println > println(s"Time taken: ${NANOSECONDS.toMillis(end - start)} ms") > // scalastyle:on println > ret > } {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42564) Implement Dataset.version and Dataset.time
[ https://issues.apache.org/jira/browse/SPARK-42564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17693642#comment-17693642 ] Apache Spark commented on SPARK-42564: -- User 'panbingkun' has created a pull request for this issue: https://github.com/apache/spark/pull/40176 > Implement Dataset.version and Dataset.time > -- > > Key: SPARK-42564 > URL: https://issues.apache.org/jira/browse/SPARK-42564 > Project: Spark > Issue Type: New Feature > Components: Connect >Affects Versions: 3.4.0 >Reporter: Herman van Hövell >Assignee: BingKun Pan >Priority: Major > > Implement Dataset.version and Dataset.time > {code:java} > /** > * The version of Spark on which this application is running. > * > * @since 2.0.0 > */ > def version: String = SPARK_VERSION > /** > * Executes some code block and prints to stdout the time taken to execute > the block. This is > * available in Scala only and is used primarily for interactive testing and > debugging. > * > * @since 2.1.0 > */ > def time[T](f: => T): T = { > val start = System.nanoTime() > val ret = f > val end = System.nanoTime() > // scalastyle:off println > println(s"Time taken: ${NANOSECONDS.toMillis(end - start)} ms") > // scalastyle:on println > ret > } {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42559) Implement DataFrameNaFunctions
[ https://issues.apache.org/jira/browse/SPARK-42559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17693641#comment-17693641 ] BingKun Pan commented on SPARK-42559: - I will work on it. > Implement DataFrameNaFunctions > -- > > Key: SPARK-42559 > URL: https://issues.apache.org/jira/browse/SPARK-42559 > Project: Spark > Issue Type: New Feature > Components: Connect >Affects Versions: 3.4.0 >Reporter: Herman van Hövell >Priority: Major > > Implement DataFrameNaFunctions for connect and hook it up to Dataset. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org