[jira] [Created] (SPARK-41961) Support table-valued functions with LATERAL
Allison Wang created SPARK-41961: Summary: Support table-valued functions with LATERAL Key: SPARK-41961 URL: https://issues.apache.org/jira/browse/SPARK-41961 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.4.0 Reporter: Allison Wang Support table-valued functions with the LATERAL subquery. For example: {{select * from t, lateral explode(array(t.c1, t.c2))}} Currently, this query throws a parse exception. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
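For context, a sketch of the gap in SQL, assuming a table `t(c1, c2)` as in the ticket's example (hypothetical table and column names): the same generator can already be applied per input row via LATERAL VIEW, while the lateral table-valued-function form this ticket proposes currently fails to parse.

```sql
-- Works today: generator applied per input row via LATERAL VIEW
SELECT * FROM t LATERAL VIEW explode(array(t.c1, t.c2)) AS col;

-- Proposed by this ticket; currently throws a parse exception
SELECT * FROM t, LATERAL explode(array(t.c1, t.c2));
```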
[jira] [Comment Edited] (SPARK-41953) Shuffle output location refetch during shuffle migration in decommission
[ https://issues.apache.org/jira/browse/SPARK-41953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17656480#comment-17656480 ] Mridul Muralidharan edited comment on SPARK-41953 at 1/10/23 7:48 AM: -- A few things: * Looking at SPARK-27637, we should revisit it - the {{IsExecutorAlive}} request does not make sense in the case of dynamic resource allocation (DRA) with an external shuffle service (ESS) enabled: we should not be making that call. Thoughts, [~Ngone51]? +CC [~csingh] This also means that relying on ExecutorDeadException when DRA is enabled and ESS is configured won't be useful. For the rest of the proposal ... For (2), I am not sure about 'Make MapOutputTracker support fetch latest output without epoch provided.' - this could have nontrivial interactions with other things, and I will need to think through it. Not sure if we can model node decommission - where we have a block moved from host A to host B without any other change - as not requiring an epoch update (or rather, flagging the epochs as 'compatible' if there are no interleaving updates); this requires analysis. Assuming we sort out how to get updated state, (3) looks like a reasonable approach. was (Author: mridulm80): A few things: * Looking at SPARK-27637, we should revisit it - the {{IsExecutorAlive}} request does not make sense in the case of dynamic resource allocation (DRA) with an external shuffle service (ESS) enabled: we should not be making that call. Thoughts, [~Ngone51]? +CC [~csingh] This also means that relying on ExecutorDeadException when DRA is enabled and ESS is configured won't be useful. For the rest of the proposal ... For (2), I am not sure about 'Make MapOutputTracker support fetch latest output without epoch provided.' - this could have nontrivial interactions with other things, and I will need to think through it.
Not sure if we can model node decommission - where we have a block moved from host A to host B without any other change - as not requiring an epoch update; this requires analysis. Assuming we sort out how to get updated state, (3) looks like a reasonable approach. > Shuffle output location refetch during shuffle migration in decommission > > > Key: SPARK-41953 > URL: https://issues.apache.org/jira/browse/SPARK-41953 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.3.1 >Reporter: Zhongwei Zhu >Priority: Major > > When shuffle migration is enabled during Spark decommission, shuffle data will > be migrated to live executors, which then update the latest locations in > MapOutputTracker. This has some issues: > # Executors only fetch map output locations at the beginning of the reduce > stage, so any shuffle output location change in the middle of the reduce will > cause FetchFailed as reducers fetch from the old location. Even though stage retries > can solve this, they still cause lots of resource waste, as all shuffle read > and compute that happened before the FetchFailed is wasted. > # During stage retries, fewer running tasks cause more executors to be > decommissioned, and shuffle data locations keep changing. In the worst case, a > stage could need lots of retries, further breaking the SLA. > So I propose to support refetching map output locations during the reduce phase if > shuffle migration is enabled and FetchFailed is caused by a decommissioned > dead executor. The detailed steps are as below: > # When `BlockTransferService` fails to fetch blocks from a decommissioned dead > executor, ExecutorDeadException(isDecommission as true) will be thrown. > # Make MapOutputTracker support fetching the latest output without an epoch provided. > # `ShuffleBlockFetcherIterator` will refetch the latest output from > MapOutputTrackerMaster. For all the shuffle blocks on this decommissioned executor, > there should be a new location on another executor. If not, throw an exception > as currently.
If yes, create new local and remote requests to fetch these > migrated shuffle blocks. The flow will be similar to the failback fetch when a push-merged fetch fails. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
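The numbered steps in the proposal amount to a fetch-retry loop. A minimal Python sketch under loudly hypothetical names (`ExecutorDeadError`, `fetch_block`, and the dict-based tracker are all illustrative toys, not Spark APIs): on a decommission-caused fetch failure, re-ask the tracker for the latest location and retry; if the mapping has not changed, fail as today.

```python
class ExecutorDeadError(Exception):
    """Toy stand-in for ExecutorDeadException in step (1)."""
    def __init__(self, is_decommissioned):
        super().__init__("executor dead")
        self.is_decommissioned = is_decommissioned


def fetch_block(block_id, location, blocks):
    # Toy block store: a location absent from `blocks` means that
    # executor is gone (e.g. decommissioned).
    if location not in blocks:
        raise ExecutorDeadError(is_decommissioned=True)
    return blocks[location][block_id]


def fetch_with_migration_retry(block_id, cached_location, tracker, blocks):
    """Steps (1)-(3): on a decommission-caused fetch failure, refetch the
    latest location from the tracker and retry once; otherwise propagate."""
    try:
        return fetch_block(block_id, cached_location, blocks)
    except ExecutorDeadError as e:
        if not e.is_decommissioned:
            raise
        latest = tracker[block_id]  # step (2): latest output location
        if latest == cached_location:
            raise  # step (3): no migrated copy known, fail as today
        return fetch_block(block_id, latest, blocks)


# The reducer cached "execA" at stage start, but the block migrated to
# "execB" during decommission; the retry succeeds without a stage retry.
tracker = {"shuffle_0_0_0": "execB"}  # the tracker's latest view
blocks = {"execB": {"shuffle_0_0_0": b"payload"}}
data = fetch_with_migration_retry("shuffle_0_0_0", "execA", tracker, blocks)
```

The real flow would build new local and remote fetch requests instead of returning bytes directly, but the control flow is the same shape.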
[jira] [Commented] (SPARK-41953) Shuffle output location refetch during shuffle migration in decommission
[ https://issues.apache.org/jira/browse/SPARK-41953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17656480#comment-17656480 ] Mridul Muralidharan commented on SPARK-41953: - A few things: * Looking at SPARK-27637, we should revisit it - the {{IsExecutorAlive}} request does not make sense in the case of dynamic resource allocation (DRA) with an external shuffle service (ESS) enabled: we should not be making that call. Thoughts, [~Ngone51]? +CC [~csingh] This also means that relying on ExecutorDeadException when DRA is enabled and ESS is configured won't be useful. For the rest of the proposal ... For (2), I am not sure about 'Make MapOutputTracker support fetch latest output without epoch provided.' - this could have nontrivial interactions with other things, and I will need to think through it. Not sure if we can model node decommission - where we have a block moved from host A to host B without any other change - as not requiring an epoch update; this requires analysis. Assuming we sort out how to get updated state, (3) looks like a reasonable approach. > Shuffle output location refetch during shuffle migration in decommission > > > Key: SPARK-41953 > URL: https://issues.apache.org/jira/browse/SPARK-41953 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.3.1 >Reporter: Zhongwei Zhu >Priority: Major > > When shuffle migration is enabled during Spark decommission, shuffle data will > be migrated to live executors, which then update the latest locations in > MapOutputTracker. This has some issues: > # Executors only fetch map output locations at the beginning of the reduce > stage, so any shuffle output location change in the middle of the reduce will > cause FetchFailed as reducers fetch from the old location. Even though stage retries > can solve this, they still cause lots of resource waste, as all shuffle read > and compute that happened before the FetchFailed is wasted.
> # During stage retries, fewer running tasks cause more executors to be > decommissioned, and shuffle data locations keep changing. In the worst case, a > stage could need lots of retries, further breaking the SLA. > So I propose to support refetching map output locations during the reduce phase if > shuffle migration is enabled and FetchFailed is caused by a decommissioned > dead executor. The detailed steps are as below: > # When `BlockTransferService` fails to fetch blocks from a decommissioned dead > executor, ExecutorDeadException(isDecommission as true) will be thrown. > # Make MapOutputTracker support fetching the latest output without an epoch provided. > # `ShuffleBlockFetcherIterator` will refetch the latest output from > MapOutputTrackerMaster. For all the shuffle blocks on this decommissioned executor, > there should be a new location on another executor. If not, throw an exception > as currently. If yes, create new local and remote requests to fetch these > migrated shuffle blocks. The flow will be similar to the failback fetch when a push-merged fetch fails. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41934) Add the unsupported function list for `session`
[ https://issues.apache.org/jira/browse/SPARK-41934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17656465#comment-17656465 ] Apache Spark commented on SPARK-41934: -- User 'zhengruifeng' has created a pull request for this issue: https://github.com/apache/spark/pull/39478 > Add the unsupported function list for `session` > --- > > Key: SPARK-41934 > URL: https://issues.apache.org/jira/browse/SPARK-41934 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Assignee: Ruifeng Zheng >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-40885) Spark will filter out data field sorting when dynamic partitions and data fields are sorted at the same time
[ https://issues.apache.org/jira/browse/SPARK-40885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17656459#comment-17656459 ] Enrico Minack edited comment on SPARK-40885 at 1/10/23 7:02 AM: This should be fixed by SPARK-41959 / [#39475|https://github.com/apache/spark/pull/39475]. was (Author: enricomi): This should be fixed by SPARK-41959. > Spark will filter out data field sorting when dynamic partitions and data > fields are sorted at the same time > > > Key: SPARK-40885 > URL: https://issues.apache.org/jira/browse/SPARK-40885 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.2, 3.3.0, 3.2.2 >Reporter: ming95 >Priority: Major > Attachments: 1666494504884.jpg > > > When writing data with dynamic partitions and sorting by both partition and data > fields, Spark will filter out the sorting of the data fields. > > SQL to reproduce: > {code:java} > CREATE TABLE `sort_table`( > `id` int, > `name` string > ) > PARTITIONED BY ( > `dt` string) > stored as textfile > LOCATION 'sort_table'; > CREATE TABLE `test_table`( > `id` int, > `name` string) > PARTITIONED BY ( > `dt` string) > stored as textfile > LOCATION 'test_table'; > -- gen test data > insert into test_table partition(dt=20221011) select 10,"15" union all select > 1,"10" union all select 5,"50" union all select 20,"2" union all select > 30,"14" ; > set spark.hadoop.hive.exec.dynamic.partition=true; > set spark.hadoop.hive.exec.dynamic.partition.mode=nonstrict; > -- this sql sorts by the partition field (`dt`) and a data field (`name`), > -- but the sort by `name` does not work > insert overwrite table sort_table partition(dt) select id,name,dt from > test_table order by name,dt; > {code} > > The Sort operator in the DAG has only one sort field, but there are actually two > in the SQL. (See the attached drawing.) > > It relates to this issue: https://issues.apache.org/jira/browse/SPARK-40588 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail:
issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-40885) Spark will filter out data field sorting when dynamic partitions and data fields are sorted at the same time
[ https://issues.apache.org/jira/browse/SPARK-40885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17656459#comment-17656459 ] Enrico Minack edited comment on SPARK-40885 at 1/10/23 7:00 AM: This should be fixed by SPARK-41959. was (Author: enricomi): This should be fixed by SPARK-40885. > Spark will filter out data field sorting when dynamic partitions and data > fields are sorted at the same time > > > Key: SPARK-40885 > URL: https://issues.apache.org/jira/browse/SPARK-40885 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.2, 3.3.0, 3.2.2 >Reporter: ming95 >Priority: Major > Attachments: 1666494504884.jpg > > > When writing data with dynamic partitions and sorting by both partition and data > fields, Spark will filter out the sorting of the data fields. > > SQL to reproduce: > {code:java} > CREATE TABLE `sort_table`( > `id` int, > `name` string > ) > PARTITIONED BY ( > `dt` string) > stored as textfile > LOCATION 'sort_table'; > CREATE TABLE `test_table`( > `id` int, > `name` string) > PARTITIONED BY ( > `dt` string) > stored as textfile > LOCATION 'test_table'; > -- gen test data > insert into test_table partition(dt=20221011) select 10,"15" union all select > 1,"10" union all select 5,"50" union all select 20,"2" union all select > 30,"14" ; > set spark.hadoop.hive.exec.dynamic.partition=true; > set spark.hadoop.hive.exec.dynamic.partition.mode=nonstrict; > -- this sql sorts by the partition field (`dt`) and a data field (`name`), > -- but the sort by `name` does not work > insert overwrite table sort_table partition(dt) select id,name,dt from > test_table order by name,dt; > {code} > > The Sort operator in the DAG has only one sort field, but there are actually two > in the SQL. (See the attached drawing.) > > It relates to this issue: https://issues.apache.org/jira/browse/SPARK-40588 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail:
issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40885) Spark will filter out data field sorting when dynamic partitions and data fields are sorted at the same time
[ https://issues.apache.org/jira/browse/SPARK-40885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17656459#comment-17656459 ] Enrico Minack commented on SPARK-40885: --- This should be fixed by SPARK-40885. > Spark will filter out data field sorting when dynamic partitions and data > fields are sorted at the same time > > > Key: SPARK-40885 > URL: https://issues.apache.org/jira/browse/SPARK-40885 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.2, 3.3.0, 3.2.2 >Reporter: ming95 >Priority: Major > Attachments: 1666494504884.jpg > > > When writing data with dynamic partitions and sorting by both partition and data > fields, Spark will filter out the sorting of the data fields. > > SQL to reproduce: > {code:java} > CREATE TABLE `sort_table`( > `id` int, > `name` string > ) > PARTITIONED BY ( > `dt` string) > stored as textfile > LOCATION 'sort_table'; > CREATE TABLE `test_table`( > `id` int, > `name` string) > PARTITIONED BY ( > `dt` string) > stored as textfile > LOCATION 'test_table'; > -- gen test data > insert into test_table partition(dt=20221011) select 10,"15" union all select > 1,"10" union all select 5,"50" union all select 20,"2" union all select > 30,"14" ; > set spark.hadoop.hive.exec.dynamic.partition=true; > set spark.hadoop.hive.exec.dynamic.partition.mode=nonstrict; > -- this sql sorts by the partition field (`dt`) and a data field (`name`), > -- but the sort by `name` does not work > insert overwrite table sort_table partition(dt) select id,name,dt from > test_table order by name,dt; > {code} > > The Sort operator in the DAG has only one sort field, but there are actually two > in the SQL. (See the attached drawing.) > > It relates to this issue: https://issues.apache.org/jira/browse/SPARK-40588 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-41960) Assign name to _LEGACY_ERROR_TEMP_1056
Haejoon Lee created SPARK-41960: --- Summary: Assign name to _LEGACY_ERROR_TEMP_1056 Key: SPARK-41960 URL: https://issues.apache.org/jira/browse/SPARK-41960 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.4.0 Reporter: Haejoon Lee Assign name to _LEGACY_ERROR_TEMP_1056 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-41595) Support generator function explode/explode_outer in the FROM clause
[ https://issues.apache.org/jira/browse/SPARK-41595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-41595. - Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 39133 [https://github.com/apache/spark/pull/39133] > Support generator function explode/explode_outer in the FROM clause > --- > > Key: SPARK-41595 > URL: https://issues.apache.org/jira/browse/SPARK-41595 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Allison Wang >Assignee: Allison Wang >Priority: Major > Fix For: 3.4.0 > > > Currently, the table-valued generator function explode/explode_outer can only > be used in the SELECT clause of a query: > SELECT explode(array(1, 2)) > This task is to allow table-valued functions to be used in the FROM clause of > a query: > SELECT * FROM explode(array(1, 2)) -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41595) Support generator function explode/explode_outer in the FROM clause
[ https://issues.apache.org/jira/browse/SPARK-41595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-41595: --- Assignee: Allison Wang > Support generator function explode/explode_outer in the FROM clause > --- > > Key: SPARK-41595 > URL: https://issues.apache.org/jira/browse/SPARK-41595 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Allison Wang >Assignee: Allison Wang >Priority: Major > > Currently, the table-valued generator function explode/explode_outer can only > be used in the SELECT clause of a query: > SELECT explode(array(1, 2)) > This task is to allow table-valued functions to be used in the FROM clause of > a query: > SELECT * FROM explode(array(1, 2)) -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41872) Fix DataFrame createDataframe handling of None
[ https://issues.apache.org/jira/browse/SPARK-41872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17656437#comment-17656437 ] Apache Spark commented on SPARK-41872: -- User 'zhengruifeng' has created a pull request for this issue: https://github.com/apache/spark/pull/39477 > Fix DataFrame createDataframe handling of None > -- > > Key: SPARK-41872 > URL: https://issues.apache.org/jira/browse/SPARK-41872 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Sandeep Singh >Priority: Major > > {code:java} > row = self.spark.createDataFrame([("Alice", None, None, None)], > schema).fillna(True).first() > self.assertEqual(row.age, None){code} > {code:java} > Traceback (most recent call last): > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_dataframe.py", > line 231, in test_fillna > self.assertEqual(row.age, None) > AssertionError: nan != None{code} > > {code:java} > row = ( > self.spark.createDataFrame([("Alice", 10, None)], schema) > .replace(10, 20, subset=["name", "height"]) > .first() > ) > self.assertEqual(row.name, "Alice") > self.assertEqual(row.age, 10) > self.assertEqual(row.height, None) {code} > {code:java} > Traceback (most recent call last): File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_dataframe.py", > line 372, in test_replace self.assertEqual(row.height, None) > AssertionError: nan != None > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
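One plausible mechanism, offered here as an assumption rather than taken from the ticket: if the Connect path materializes nullable values in a float-typed column (e.g. via a pandas/Arrow conversion), `None` comes back as `nan`, and `nan` does not compare equal to `None`, which is exactly what the assertions report.

```python
import math

# A missing value that travels through a float-typed column becomes nan.
age = float("nan")

assert age != None       # the parity break the failing tests observe
assert math.isnan(age)   # nan is a float value, not a missing-value marker
assert not (age == age)  # nan does not even compare equal to itself
```

A fix would have to map such floats back to `None` (or avoid the float coercion) before building rows.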
[jira] [Assigned] (SPARK-41872) Fix DataFrame createDataframe handling of None
[ https://issues.apache.org/jira/browse/SPARK-41872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41872: Assignee: Apache Spark > Fix DataFrame createDataframe handling of None > -- > > Key: SPARK-41872 > URL: https://issues.apache.org/jira/browse/SPARK-41872 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Sandeep Singh >Assignee: Apache Spark >Priority: Major > > {code:java} > row = self.spark.createDataFrame([("Alice", None, None, None)], > schema).fillna(True).first() > self.assertEqual(row.age, None){code} > {code:java} > Traceback (most recent call last): > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_dataframe.py", > line 231, in test_fillna > self.assertEqual(row.age, None) > AssertionError: nan != None{code} > > {code:java} > row = ( > self.spark.createDataFrame([("Alice", 10, None)], schema) > .replace(10, 20, subset=["name", "height"]) > .first() > ) > self.assertEqual(row.name, "Alice") > self.assertEqual(row.age, 10) > self.assertEqual(row.height, None) {code} > {code:java} > Traceback (most recent call last): File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_dataframe.py", > line 372, in test_replace self.assertEqual(row.height, None) > AssertionError: nan != None > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41872) Fix DataFrame createDataframe handling of None
[ https://issues.apache.org/jira/browse/SPARK-41872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41872: Assignee: (was: Apache Spark) > Fix DataFrame createDataframe handling of None > -- > > Key: SPARK-41872 > URL: https://issues.apache.org/jira/browse/SPARK-41872 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Sandeep Singh >Priority: Major > > {code:java} > row = self.spark.createDataFrame([("Alice", None, None, None)], > schema).fillna(True).first() > self.assertEqual(row.age, None){code} > {code:java} > Traceback (most recent call last): > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_dataframe.py", > line 231, in test_fillna > self.assertEqual(row.age, None) > AssertionError: nan != None{code} > > {code:java} > row = ( > self.spark.createDataFrame([("Alice", 10, None)], schema) > .replace(10, 20, subset=["name", "height"]) > .first() > ) > self.assertEqual(row.name, "Alice") > self.assertEqual(row.age, 10) > self.assertEqual(row.height, None) {code} > {code:java} > Traceback (most recent call last): File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_dataframe.py", > line 372, in test_replace self.assertEqual(row.height, None) > AssertionError: nan != None > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41957) Enable the doctest for `DataFrame.hint`
[ https://issues.apache.org/jira/browse/SPARK-41957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng reassigned SPARK-41957: - Assignee: Ruifeng Zheng > Enable the doctest for `DataFrame.hint` > --- > > Key: SPARK-41957 > URL: https://issues.apache.org/jira/browse/SPARK-41957 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Assignee: Ruifeng Zheng >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-41957) Enable the doctest for `DataFrame.hint`
[ https://issues.apache.org/jira/browse/SPARK-41957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng resolved SPARK-41957. --- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 39472 [https://github.com/apache/spark/pull/39472] > Enable the doctest for `DataFrame.hint` > --- > > Key: SPARK-41957 > URL: https://issues.apache.org/jira/browse/SPARK-41957 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Assignee: Ruifeng Zheng >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41907) Function `sampleby` return parity
[ https://issues.apache.org/jira/browse/SPARK-41907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17656431#comment-17656431 ] Apache Spark commented on SPARK-41907: -- User 'beliefer' has created a pull request for this issue: https://github.com/apache/spark/pull/39476 > Function `sampleby` return parity > - > > Key: SPARK-41907 > URL: https://issues.apache.org/jira/browse/SPARK-41907 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Sandeep Singh >Priority: Major > > {code:java} > df = self.spark.createDataFrame([Row(a=i, b=(i % 3)) for i in range(100)]) > sampled = df.stat.sampleBy("b", fractions={0: 0.5, 1: 0.5}, seed=0) > self.assertTrue(sampled.count() == 35){code} > {code:java} > Traceback (most recent call last): > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_functions.py", > line 202, in test_sampleby > self.assertTrue(sampled.count() == 35) > AssertionError: False is not true {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41907) Function `sampleby` return parity
[ https://issues.apache.org/jira/browse/SPARK-41907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41907: Assignee: (was: Apache Spark) > Function `sampleby` return parity > - > > Key: SPARK-41907 > URL: https://issues.apache.org/jira/browse/SPARK-41907 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Sandeep Singh >Priority: Major > > {code:java} > df = self.spark.createDataFrame([Row(a=i, b=(i % 3)) for i in range(100)]) > sampled = df.stat.sampleBy("b", fractions={0: 0.5, 1: 0.5}, seed=0) > self.assertTrue(sampled.count() == 35){code} > {code:java} > Traceback (most recent call last): > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_functions.py", > line 202, in test_sampleby > self.assertTrue(sampled.count() == 35) > AssertionError: False is not true {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41907) Function `sampleby` return parity
[ https://issues.apache.org/jira/browse/SPARK-41907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17656430#comment-17656430 ] Apache Spark commented on SPARK-41907: -- User 'beliefer' has created a pull request for this issue: https://github.com/apache/spark/pull/39476 > Function `sampleby` return parity > - > > Key: SPARK-41907 > URL: https://issues.apache.org/jira/browse/SPARK-41907 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Sandeep Singh >Priority: Major > > {code:java} > df = self.spark.createDataFrame([Row(a=i, b=(i % 3)) for i in range(100)]) > sampled = df.stat.sampleBy("b", fractions={0: 0.5, 1: 0.5}, seed=0) > self.assertTrue(sampled.count() == 35){code} > {code:java} > Traceback (most recent call last): > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_functions.py", > line 202, in test_sampleby > self.assertTrue(sampled.count() == 35) > AssertionError: False is not true {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41907) Function `sampleby` return parity
[ https://issues.apache.org/jira/browse/SPARK-41907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41907: Assignee: Apache Spark > Function `sampleby` return parity > - > > Key: SPARK-41907 > URL: https://issues.apache.org/jira/browse/SPARK-41907 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Sandeep Singh >Assignee: Apache Spark >Priority: Major > > {code:java} > df = self.spark.createDataFrame([Row(a=i, b=(i % 3)) for i in range(100)]) > sampled = df.stat.sampleBy("b", fractions={0: 0.5, 1: 0.5}, seed=0) > self.assertTrue(sampled.count() == 35){code} > {code:java} > Traceback (most recent call last): > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_functions.py", > line 202, in test_sampleby > self.assertTrue(sampled.count() == 35) > AssertionError: False is not true {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] (SPARK-41907) Function `sampleby` return parity
[ https://issues.apache.org/jira/browse/SPARK-41907 ] jiaan.geng deleted comment on SPARK-41907: was (Author: beliefer): I want take a look. > Function `sampleby` return parity > - > > Key: SPARK-41907 > URL: https://issues.apache.org/jira/browse/SPARK-41907 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Sandeep Singh >Priority: Major > > {code:java} > df = self.spark.createDataFrame([Row(a=i, b=(i % 3)) for i in range(100)]) > sampled = df.stat.sampleBy("b", fractions={0: 0.5, 1: 0.5}, seed=0) > self.assertTrue(sampled.count() == 35){code} > {code:java} > Traceback (most recent call last): > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_functions.py", > line 202, in test_sampleby > self.assertTrue(sampled.count() == 35) > AssertionError: False is not true {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41907) Function `sampleby` return parity
[ https://issues.apache.org/jira/browse/SPARK-41907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17656429#comment-17656429 ] jiaan.geng commented on SPARK-41907: The root cause is that the physical plan is different from PySpark's, so the result is not deterministic. The plan from PySpark: {code:java} == Physical Plan == * Filter (2) +- * Scan ExistingRDD (1) (1) Scan ExistingRDD [codegen id : 1] Output [2]: [a#4L, b#5L] Arguments: [a#4L, b#5L], MapPartitionsRDD[9] at applySchemaToPythonRDD at NativeMethodAccessorImpl.java:0, ExistingRDD, UnknownPartitioning(0) (2) Filter [codegen id : 1] Input [2]: [a#4L, b#5L] Condition : UDF(b#5L, rand(0)) {code} The plan from Connect: {code:java} == Physical Plan == LocalTableScan (1) (1) LocalTableScan Output [2]: [a#5L, b#6L] Arguments: [a#5L, b#6L] {code} > Function `sampleby` return parity > - > > Key: SPARK-41907 > URL: https://issues.apache.org/jira/browse/SPARK-41907 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Sandeep Singh >Priority: Major > > {code:java} > df = self.spark.createDataFrame([Row(a=i, b=(i % 3)) for i in range(100)]) > sampled = df.stat.sampleBy("b", fractions={0: 0.5, 1: 0.5}, seed=0) > self.assertTrue(sampled.count() == 35){code} > {code:java} > Traceback (most recent call last): > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_functions.py", > line 202, in test_sampleby > self.assertTrue(sampled.count() == 35) > AssertionError: False is not true {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
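To make the plan-dependence concrete, a toy model in plain Python (an assumption about the mechanism, not Spark code: `sampleBy` is modeled as one Bernoulli draw per row from a single seeded stream, so which rows are kept depends on the order in which rows are visited):

```python
import random

def sample_by(rows, col, fractions, seed):
    # One draw per row from a seeded stream: the i-th draw goes to whichever
    # row is visited i-th, so the visit order decides the sample.
    rng = random.Random(seed)
    return [r for r in rows if rng.random() < fractions.get(r[col], 0.0)]

rows = [{"a": i, "b": i % 3} for i in range(100)]
forward = sample_by(rows, "b", {0: 0.5, 1: 0.5}, seed=0)
backward = sample_by(list(reversed(rows)), "b", {0: 0.5, 1: 0.5}, seed=0)
# Same data, same seed, same fractions - but a different visit order
# (e.g. a LocalTableScan vs. a filtered RDD scan) generally yields a
# different sample, so an exact-count assertion like `count() == 35`
# only holds for one particular plan.
```

This is why the test's hard-coded count passes under one physical plan and fails under the other, even though both are valid samples.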
[jira] [Resolved] (SPARK-41914) Sorting issue with partitioned-writing and planned write optimization disabled
[ https://issues.apache.org/jira/browse/SPARK-41914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-41914. - Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 39431 [https://github.com/apache/spark/pull/39431] > Sorting issue with partitioned-writing and planned write optimization disabled > -- > > Key: SPARK-41914 > URL: https://issues.apache.org/jira/browse/SPARK-41914 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Enrico Minack >Assignee: Enrico Minack >Priority: Major > Fix For: 3.4.0 > > > Spark 3.4.0 introduced option > {{{}spark.sql.optimizer.plannedWrite.enabled{}}}, which is enabled by > default. When disabled, partitioned writing loses in-partition order when > spilling occurs. > This is related to SPARK-40885 where setting option > {{spark.sql.optimizer.plannedWrite.enabled}} to {{true}} will remove the > existing sort (for {{day}} and {{{}id{}}}) entirely. > Run this with 512m memory and one executor, e.g.: > {code} > spark-shell --driver-memory 512m --master "local[1]" > {code} > {code:scala} > import org.apache.spark.sql.SaveMode > spark.conf.set("spark.sql.optimizer.plannedWrite.enabled", false) > val ids = 200 > val days = 2 > val parts = 2 > val ds = spark.range(0, days, 1, parts).withColumnRenamed("id", > "day").join(spark.range(0, ids, 1, parts)) > ds.repartition($"day") > .sortWithinPartitions($"day", $"id") > .write > .partitionBy("day") > .mode(SaveMode.Overwrite) > .csv("interleaved.csv") > {code} > Check the written files are sorted (states OK when file is sorted): > {code:bash} > for file in interleaved.csv/day\=*/part-* > do > echo "$(sort -n "$file" | md5sum | cut -d " " -f 1) $file" > done | md5sum -c > {code} > Files should look like this > {code} > 0 > 1 > 2 > ... > 1048576 > 1048577 > 1048578 > ... > {code} > But they look like > {code} > 0 > 1048576 > 1 > 1048577 > 2 > 1048578 > ... 
> {code}
> The root cause is the same as in SPARK-40588. A sort (for {{{}day{}}}) is
> added on top of the existing sort (for {{day}} and {{{}id{}}}). Spilling
> interleaves the sorted spill files.
> {code}
> Sort [input[0, bigint, false] ASC NULLS FIRST], false, 0
> +- AdaptiveSparkPlan isFinalPlan=false
>    +- Sort [day#2L ASC NULLS FIRST, id#4L ASC NULLS FIRST], false, 0
>       +- Exchange hashpartitioning(day#2L, 200), REPARTITION_BY_COL, [plan_id=30]
>          +- BroadcastNestedLoopJoin BuildLeft, Inner
>             :- BroadcastExchange IdentityBroadcastMode, [plan_id=28]
>             :  +- Project [id#0L AS day#2L]
>             :     +- Range (0, 2, step=1, splits=2)
>             +- Range (0, 200, step=1, splits=2)
> {code}
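A hedged stand-alone sketch (plain Python, not Spark's actual spill-merge code) of the failure mode described above: the outer sort compares only `day`, so when the spill files are merged, rows whose `day` values tie may be taken from either run, interleaving the `id` order the inner sort had already established.

```python
# Two spill files, each already sorted by (day, id) by the inner sort.
spill_1 = [(0, 0), (0, 1), (0, 2)]
spill_2 = [(0, 1048576), (0, 1048577), (0, 1048578)]

def merge_runs(run_a, run_b, key):
    # A merge that is correct w.r.t. `key` but, like a spill merge that only
    # knows about `day`, makes no promise about tie order: ties alternate here.
    out, i, j, take_a = [], 0, 0, True
    while i < len(run_a) and j < len(run_b):
        ka, kb = key(run_a[i]), key(run_b[j])
        if ka < kb:
            out.append(run_a[i]); i += 1
        elif kb < ka:
            out.append(run_b[j]); j += 1
        else:  # tie on `day`: either side is a valid next row
            if take_a:
                out.append(run_a[i]); i += 1
            else:
                out.append(run_b[j]); j += 1
            take_a = not take_a
    out.extend(run_a[i:])
    out.extend(run_b[j:])
    return out

merged = merge_runs(spill_1, spill_2, key=lambda r: r[0])
ids = [r[1] for r in merged]
print(ids)  # [0, 1048576, 1, 1048577, 2, 1048578] -- in-partition id order lost
```

The output matches the interleaved file contents reported in the issue: the merge is still sorted by `day`, but no longer by `id` within each `day`.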
[jira] [Assigned] (SPARK-41914) Sorting issue with partitioned-writing and planned write optimization disabled
[ https://issues.apache.org/jira/browse/SPARK-41914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-41914: --- Assignee: Enrico Minack > Sorting issue with partitioned-writing and planned write optimization disabled > -- > > Key: SPARK-41914 > URL: https://issues.apache.org/jira/browse/SPARK-41914 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Enrico Minack >Assignee: Enrico Minack >Priority: Major > > Spark 3.4.0 introduced option > {{{}spark.sql.optimizer.plannedWrite.enabled{}}}, which is enabled by > default. When disabled, partitioned writing loses in-partition order when > spilling occurs. > This is related to SPARK-40885 where setting option > {{spark.sql.optimizer.plannedWrite.enabled}} to {{true}} will remove the > existing sort (for {{day}} and {{{}id{}}}) entirely. > Run this with 512m memory and one executor, e.g.: > {code} > spark-shell --driver-memory 512m --master "local[1]" > {code} > {code:scala} > import org.apache.spark.sql.SaveMode > spark.conf.set("spark.sql.optimizer.plannedWrite.enabled", false) > val ids = 200 > val days = 2 > val parts = 2 > val ds = spark.range(0, days, 1, parts).withColumnRenamed("id", > "day").join(spark.range(0, ids, 1, parts)) > ds.repartition($"day") > .sortWithinPartitions($"day", $"id") > .write > .partitionBy("day") > .mode(SaveMode.Overwrite) > .csv("interleaved.csv") > {code} > Check the written files are sorted (states OK when file is sorted): > {code:bash} > for file in interleaved.csv/day\=*/part-* > do > echo "$(sort -n "$file" | md5sum | cut -d " " -f 1) $file" > done | md5sum -c > {code} > Files should look like this > {code} > 0 > 1 > 2 > ... > 1048576 > 1048577 > 1048578 > ... > {code} > But they look like > {code} > 0 > 1048576 > 1 > 1048577 > 2 > 1048578 > ... > {code} > The cause issue is the same as in SPARK-40588. A sort (for {{{}day{}}}) is > added on top of the existing sort (for {{day}} and {{{}id{}}}). 
Spilling > interleaves the sorted spill files. > {code} > Sort [input[0, bigint, false] ASC NULLS FIRST], false, 0 > +- AdaptiveSparkPlan isFinalPlan=false >+- Sort [day#2L ASC NULLS FIRST, id#4L ASC NULLS FIRST], false, 0 > +- Exchange hashpartitioning(day#2L, 200), REPARTITION_BY_COL, > [plan_id=30] > +- BroadcastNestedLoopJoin BuildLeft, Inner > :- BroadcastExchange IdentityBroadcastMode, [plan_id=28] > : +- Project [id#0L AS day#2L] > : +- Range (0, 2, step=1, splits=2) > +- Range (0, 200, step=1, splits=2) > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41959) Improve v1 writes with empty2null
[ https://issues.apache.org/jira/browse/SPARK-41959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17656416#comment-17656416 ] Apache Spark commented on SPARK-41959: -- User 'ulysses-you' has created a pull request for this issue: https://github.com/apache/spark/pull/39475 > Improve v1 writes with empty2null > - > > Key: SPARK-41959 > URL: https://issues.apache.org/jira/browse/SPARK-41959 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: XiDuo You >Priority: Major >
[jira] [Assigned] (SPARK-41959) Improve v1 writes with empty2null
[ https://issues.apache.org/jira/browse/SPARK-41959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41959: Assignee: (was: Apache Spark) > Improve v1 writes with empty2null > - > > Key: SPARK-41959 > URL: https://issues.apache.org/jira/browse/SPARK-41959 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: XiDuo You >Priority: Major >
[jira] [Assigned] (SPARK-41959) Improve v1 writes with empty2null
[ https://issues.apache.org/jira/browse/SPARK-41959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41959: Assignee: Apache Spark > Improve v1 writes with empty2null > - > > Key: SPARK-41959 > URL: https://issues.apache.org/jira/browse/SPARK-41959 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: XiDuo You >Assignee: Apache Spark >Priority: Major >
[jira] [Commented] (SPARK-41959) Improve v1 writes with empty2null
[ https://issues.apache.org/jira/browse/SPARK-41959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17656415#comment-17656415 ] Apache Spark commented on SPARK-41959: -- User 'ulysses-you' has created a pull request for this issue: https://github.com/apache/spark/pull/39475 > Improve v1 writes with empty2null > - > > Key: SPARK-41959 > URL: https://issues.apache.org/jira/browse/SPARK-41959 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: XiDuo You >Priority: Major >
[jira] [Created] (SPARK-41959) Improve v1 writes with empty2null
XiDuo You created SPARK-41959: - Summary: Improve v1 writes with empty2null Key: SPARK-41959 URL: https://issues.apache.org/jira/browse/SPARK-41959 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.4.0 Reporter: XiDuo You
[jira] [Commented] (SPARK-41887) Support DataFrame hint parameter to be list
[ https://issues.apache.org/jira/browse/SPARK-41887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17656407#comment-17656407 ] Ruifeng Zheng commented on SPARK-41887: --- I am going to work on this one > Support DataFrame hint parameter to be list > --- > > Key: SPARK-41887 > URL: https://issues.apache.org/jira/browse/SPARK-41887 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Sandeep Singh >Priority: Major > > {code:java} > df = self.spark.range(10e10).toDF("id") > such_a_nice_list = ["itworks1", "itworks2", "itworks3"] > hinted_df = df.hint("my awesome hint", 1.2345, "what", such_a_nice_list){code}
[jira] [Commented] (SPARK-41958) Disallow arbitrary custom classpath with proxy user in cluster mode
[ https://issues.apache.org/jira/browse/SPARK-41958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17656397#comment-17656397 ] Apache Spark commented on SPARK-41958: -- User 'Ngone51' has created a pull request for this issue: https://github.com/apache/spark/pull/39474 > Disallow arbitrary custom classpath with proxy user in cluster mode > --- > > Key: SPARK-41958 > URL: https://issues.apache.org/jira/browse/SPARK-41958 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.8, 3.0.3, 3.1.3, 3.3.1, 3.2.3 >Reporter: wuyi >Priority: Major > > To avoid arbitrary classpath in spark cluster.
[jira] [Commented] (SPARK-41958) Disallow arbitrary custom classpath with proxy user in cluster mode
[ https://issues.apache.org/jira/browse/SPARK-41958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17656396#comment-17656396 ] Apache Spark commented on SPARK-41958: -- User 'Ngone51' has created a pull request for this issue: https://github.com/apache/spark/pull/39474 > Disallow arbitrary custom classpath with proxy user in cluster mode > --- > > Key: SPARK-41958 > URL: https://issues.apache.org/jira/browse/SPARK-41958 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.8, 3.0.3, 3.1.3, 3.3.1, 3.2.3 >Reporter: wuyi >Priority: Major > > To avoid arbitrary classpath in spark cluster.
[jira] [Assigned] (SPARK-41958) Disallow arbitrary custom classpath with proxy user in cluster mode
[ https://issues.apache.org/jira/browse/SPARK-41958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41958: Assignee: Apache Spark > Disallow arbitrary custom classpath with proxy user in cluster mode > --- > > Key: SPARK-41958 > URL: https://issues.apache.org/jira/browse/SPARK-41958 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.8, 3.0.3, 3.1.3, 3.3.1, 3.2.3 >Reporter: wuyi >Assignee: Apache Spark >Priority: Major > > To avoid arbitrary classpath in spark cluster.
[jira] [Assigned] (SPARK-41958) Disallow arbitrary custom classpath with proxy user in cluster mode
[ https://issues.apache.org/jira/browse/SPARK-41958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41958: Assignee: (was: Apache Spark) > Disallow arbitrary custom classpath with proxy user in cluster mode > --- > > Key: SPARK-41958 > URL: https://issues.apache.org/jira/browse/SPARK-41958 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.8, 3.0.3, 3.1.3, 3.3.1, 3.2.3 >Reporter: wuyi >Priority: Major > > To avoid arbitrary classpath in spark cluster.
[jira] [Created] (SPARK-41958) Disallow arbitrary custom classpath with proxy user in cluster mode
wuyi created SPARK-41958: Summary: Disallow arbitrary custom classpath with proxy user in cluster mode Key: SPARK-41958 URL: https://issues.apache.org/jira/browse/SPARK-41958 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 3.2.3, 3.3.1, 3.1.3, 3.0.3, 2.4.8 Reporter: wuyi To avoid arbitrary classpath in spark cluster.
[jira] [Commented] (SPARK-41919) Unify the schema or datatype in protos
[ https://issues.apache.org/jira/browse/SPARK-41919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17656387#comment-17656387 ] Ruifeng Zheng commented on SPARK-41919: --- Actually, JSON is the standard serialization format for DataType; both client and server already support conversion between JSON and DataType. > Unify the schema or datatype in protos > -- > > Key: SPARK-41919 > URL: https://issues.apache.org/jira/browse/SPARK-41919 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Priority: Major > > this ticket only focuses on the protos sent from client to server. > we normally use > {code:java} > oneof schema { > DataType datatype = 2; > // Server will use Catalyst parser to parse this string to DataType. > string datatype_str = 3; > } > {code} > to represent a schema or datatype. > actually, we can simplify it with just a string. In the server, we can easily > parse a DDL-formatted schema or a JSON-formatted one. > {code:java} > // (Optional) The schema of local data. > // It should be either a DDL-formatted type string or a JSON string. > // > // The server side will update the column names and data types according to > this schema. > // If the 'data' is not provided, then this schema will be required. > optional string schema = 2; > {code}
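A hedged sketch of the single-string proposal above, in plain Python rather than the actual Connect server code: the server can distinguish a JSON-formatted schema from a DDL-formatted one by attempting a JSON parse first and falling back to the DDL parser. `parse_ddl` here is a toy stand-in for Catalyst's parser, which is not available in this sketch; real code would use `DataType.fromJson` / `DataType.fromDDL`.

```python
import json

def parse_ddl(s):
    # Toy stand-in for Catalyst's DDL parser: split "name type" pairs
    # out of a string like "a bigint, b string".
    return [tuple(field.split()) for field in s.split(",")]

def parse_schema(schema_str):
    """Accept either a JSON-formatted or a DDL-formatted schema string."""
    try:
        return ("json", json.loads(schema_str))
    except ValueError:
        # Not valid JSON -- treat it as a DDL-formatted type string.
        return ("ddl", parse_ddl(schema_str))

kind, parsed = parse_schema('{"type": "struct", "fields": []}')
kind2, parsed2 = parse_schema("a bigint, b string")
```

This mirrors why a single `optional string schema` field suffices: the receiving side can dispatch on the content rather than needing a `oneof` in the proto.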
[jira] [Commented] (SPARK-38173) Quoted column cannot be recognized correctly when quotedRegexColumnNames is true
[ https://issues.apache.org/jira/browse/SPARK-38173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17656380#comment-17656380 ] Apache Spark commented on SPARK-38173: -- User 'huaxingao' has created a pull request for this issue: https://github.com/apache/spark/pull/39473 > Quoted column cannot be recognized correctly when quotedRegexColumnNames is > true > > > Key: SPARK-38173 > URL: https://issues.apache.org/jira/browse/SPARK-38173 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.2, 3.2.0 >Reporter: Tongwei >Assignee: Tongwei >Priority: Major > Fix For: 3.3.0 > > > When spark.sql.parser.quotedRegexColumnNames=true > {code:java} > SELECT `(C3)?+.+`,`C1` * C2 FROM (SELECT 3 AS C1,2 AS C2,1 AS C3) T;{code} > The above query will throw an exception > {code:java} > Error: org.apache.hive.service.cli.HiveSQLException: Error running query: > org.apache.spark.sql.AnalysisException: Invalid usage of '*' in expression > 'multiply' > at > org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.org$apache$spark$sql$hive$thriftserver$SparkExecuteStatementOperation$$execute(SparkExecuteStatementOperation.scala:370) > at > org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2$$anon$3.$anonfun$run$2(SparkExecuteStatementOperation.scala:266) > at > scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) > at > org.apache.spark.sql.hive.thriftserver.SparkOperation.withLocalProperties(SparkOperation.scala:78) > at > org.apache.spark.sql.hive.thriftserver.SparkOperation.withLocalProperties$(SparkOperation.scala:62) > at > org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.withLocalProperties(SparkExecuteStatementOperation.scala:44) > at > org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2$$anon$3.run(SparkExecuteStatementOperation.scala:266) > at > 
org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2$$anon$3.run(SparkExecuteStatementOperation.scala:261) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729) > at > org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2.run(SparkExecuteStatementOperation.scala:275) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > Caused by: org.apache.spark.sql.AnalysisException: Invalid usage of '*' in > expression 'multiply' > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis.failAnalysis(CheckAnalysis.scala:50) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis.failAnalysis$(CheckAnalysis.scala:49) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:155) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$expandStarExpression$1.applyOrElse(Analyzer.scala:1700) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$expandStarExpression$1.applyOrElse(Analyzer.scala:1671) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUp$2(TreeNode.scala:342) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:74) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:342) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUp$1(TreeNode.scala:339) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:408) > at > 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:244) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:406) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:359) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:339) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$.expandStarExpression(Analyzer.scala:1671) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$.$anonfun$buildExpandedProjectList$1(Analyzer.scala:1656) >
[jira] [Resolved] (SPARK-41943) Use java api to create files and grant permissions is DiskBlockManager
[ https://issues.apache.org/jira/browse/SPARK-41943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-41943. - Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 39448 [https://github.com/apache/spark/pull/39448] > Use java api to create files and grant permissions is DiskBlockManager > -- > > Key: SPARK-41943 > URL: https://issues.apache.org/jira/browse/SPARK-41943 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.4.0 >Reporter: jingxiong zhong >Priority: Minor > Fix For: 3.4.0 > > > For method {{{}createDirWithPermission770{}}}, using java api to create files > and grant permissions instead of calling shell commands. -- This message was sent by Atlassian Jira (v8.20.10#820010)
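SPARK-41943 replaces shelled-out `mkdir`/`chmod` commands with in-process file APIs. A hedged Python analogue of that idea (the real fix lives in `DiskBlockManager` and uses Java file APIs; the function name below simply echoes `createDirWithPermission770` for illustration):

```python
import os
import stat
import tempfile

def create_dir_with_permission_770(path):
    # Create the directory (and any missing parents) via the in-process
    # API instead of spawning `mkdir -p` + `chmod 770` shell commands,
    # then set rwxrwx--- (0o770) explicitly on the final directory.
    os.makedirs(path, exist_ok=True)
    os.chmod(path, stat.S_IRWXU | stat.S_IRWXG)  # 0o770, unaffected by umask

base = tempfile.mkdtemp()
target = os.path.join(base, "blockmgr-example", "0c")
create_dir_with_permission_770(target)
```

Avoiding the subprocess removes fork/exec overhead and shell-dependent failure modes, which is the motivation stated in the issue.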
[jira] [Commented] (SPARK-38173) Quoted column cannot be recognized correctly when quotedRegexColumnNames is true
[ https://issues.apache.org/jira/browse/SPARK-38173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17656381#comment-17656381 ] Apache Spark commented on SPARK-38173: -- User 'huaxingao' has created a pull request for this issue: https://github.com/apache/spark/pull/39473 > Quoted column cannot be recognized correctly when quotedRegexColumnNames is > true > > > Key: SPARK-38173 > URL: https://issues.apache.org/jira/browse/SPARK-38173 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.2, 3.2.0 >Reporter: Tongwei >Assignee: Tongwei >Priority: Major > Fix For: 3.3.0 > > > When spark.sql.parser.quotedRegexColumnNames=true > {code:java} > SELECT `(C3)?+.+`,`C1` * C2 FROM (SELECT 3 AS C1,2 AS C2,1 AS C3) T;{code} > The above query will throw an exception > {code:java} > Error: org.apache.hive.service.cli.HiveSQLException: Error running query: > org.apache.spark.sql.AnalysisException: Invalid usage of '*' in expression > 'multiply' > at > org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.org$apache$spark$sql$hive$thriftserver$SparkExecuteStatementOperation$$execute(SparkExecuteStatementOperation.scala:370) > at > org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2$$anon$3.$anonfun$run$2(SparkExecuteStatementOperation.scala:266) > at > scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) > at > org.apache.spark.sql.hive.thriftserver.SparkOperation.withLocalProperties(SparkOperation.scala:78) > at > org.apache.spark.sql.hive.thriftserver.SparkOperation.withLocalProperties$(SparkOperation.scala:62) > at > org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.withLocalProperties(SparkExecuteStatementOperation.scala:44) > at > org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2$$anon$3.run(SparkExecuteStatementOperation.scala:266) > at > 
org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2$$anon$3.run(SparkExecuteStatementOperation.scala:261) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729) > at > org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2.run(SparkExecuteStatementOperation.scala:275) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > Caused by: org.apache.spark.sql.AnalysisException: Invalid usage of '*' in > expression 'multiply' > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis.failAnalysis(CheckAnalysis.scala:50) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis.failAnalysis$(CheckAnalysis.scala:49) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:155) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$expandStarExpression$1.applyOrElse(Analyzer.scala:1700) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$expandStarExpression$1.applyOrElse(Analyzer.scala:1671) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUp$2(TreeNode.scala:342) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:74) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:342) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUp$1(TreeNode.scala:339) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:408) > at > 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:244) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:406) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:359) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:339) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$.expandStarExpression(Analyzer.scala:1671) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$.$anonfun$buildExpandedProjectList$1(Analyzer.scala:1656) >
[jira] [Commented] (SPARK-41957) Enable the doctest for `DataFrame.hint`
[ https://issues.apache.org/jira/browse/SPARK-41957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17656378#comment-17656378 ] Apache Spark commented on SPARK-41957: -- User 'zhengruifeng' has created a pull request for this issue: https://github.com/apache/spark/pull/39472 > Enable the doctest for `DataFrame.hint` > --- > > Key: SPARK-41957 > URL: https://issues.apache.org/jira/browse/SPARK-41957 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Priority: Major >
[jira] [Assigned] (SPARK-41957) Enable the doctest for `DataFrame.hint`
[ https://issues.apache.org/jira/browse/SPARK-41957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41957: Assignee: (was: Apache Spark) > Enable the doctest for `DataFrame.hint` > --- > > Key: SPARK-41957 > URL: https://issues.apache.org/jira/browse/SPARK-41957 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Priority: Major >
[jira] [Assigned] (SPARK-41957) Enable the doctest for `DataFrame.hint`
[ https://issues.apache.org/jira/browse/SPARK-41957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41957: Assignee: Apache Spark > Enable the doctest for `DataFrame.hint` > --- > > Key: SPARK-41957 > URL: https://issues.apache.org/jira/browse/SPARK-41957 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Assignee: Apache Spark >Priority: Major >
[jira] [Commented] (SPARK-41907) Function `sampleby` return parity
[ https://issues.apache.org/jira/browse/SPARK-41907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17656376#comment-17656376 ] jiaan.geng commented on SPARK-41907: I want to take a look. > Function `sampleby` return parity > - > > Key: SPARK-41907 > URL: https://issues.apache.org/jira/browse/SPARK-41907 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Sandeep Singh >Priority: Major > > {code:java} > df = self.spark.createDataFrame([Row(a=i, b=(i % 3)) for i in range(100)]) > sampled = df.stat.sampleBy("b", fractions={0: 0.5, 1: 0.5}, seed=0) > self.assertTrue(sampled.count() == 35){code} > {code:java} > Traceback (most recent call last): > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_functions.py", > line 202, in test_sampleby > self.assertTrue(sampled.count() == 35) > AssertionError: False is not true {code}
[jira] [Created] (SPARK-41957) Enable the doctest for `DataFrame.hint`
Ruifeng Zheng created SPARK-41957: - Summary: Enable the doctest for `DataFrame.hint` Key: SPARK-41957 URL: https://issues.apache.org/jira/browse/SPARK-41957 Project: Spark Issue Type: Sub-task Components: Connect, PySpark Affects Versions: 3.4.0 Reporter: Ruifeng Zheng
[jira] [Resolved] (SPARK-41904) Fix Function `nth_value` functions output
[ https://issues.apache.org/jira/browse/SPARK-41904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jiaan.geng resolved SPARK-41904. Resolution: Won't Fix > Fix Function `nth_value` functions output > - > > Key: SPARK-41904 > URL: https://issues.apache.org/jira/browse/SPARK-41904 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Sandeep Singh >Priority: Major > > {code:java} > from pyspark.sql import Window > from pyspark.sql.functions import nth_value > df = self.spark.createDataFrame( > [ > ("a", 0, None), > ("a", 1, "x"), > ("a", 2, "y"), > ("a", 3, "z"), > ("a", 4, None), > ("b", 1, None), > ("b", 2, None), > ], > schema=("key", "order", "value"), > ) > w = Window.partitionBy("key").orderBy("order") > rs = df.select( > df.key, > df.order, > nth_value("value", 2).over(w), > nth_value("value", 2, False).over(w), > nth_value("value", 2, True).over(w), > ).collect() > expected = [ > ("a", 0, None, None, None), > ("a", 1, "x", "x", None), > ("a", 2, "x", "x", "y"), > ("a", 3, "x", "x", "y"), > ("a", 4, "x", "x", "y"), > ("b", 1, None, None, None), > ("b", 2, None, None, None), > ] > for r, ex in zip(sorted(rs), sorted(expected)): > self.assertEqual(tuple(r), ex[: len(r)]){code} > {code:java} > Traceback (most recent call last): > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_functions.py", > line 755, in test_nth_value > self.assertEqual(tuple(r), ex[: len(r)]) > AssertionError: Tuples differ: ('a', 1, 'x', None) != ('a', 1, 'x', 'x') > First differing element 3: > None > 'x' > - ('a', 1, 'x', None) > ? > + ('a', 1, 'x', 'x') > ? ^^^ > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41904) Fix Function `nth_value` functions output
[ https://issues.apache.org/jira/browse/SPARK-41904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17656374#comment-17656374 ] jiaan.geng commented on SPARK-41904: The root cause is https://issues.apache.org/jira/browse/SPARK-41945, which has been fixed. This issue is fixed as well. > Fix Function `nth_value` functions output > - > > Key: SPARK-41904 > URL: https://issues.apache.org/jira/browse/SPARK-41904 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Sandeep Singh >Priority: Major > > {code:java} > from pyspark.sql import Window > from pyspark.sql.functions import nth_value > df = self.spark.createDataFrame( > [ > ("a", 0, None), > ("a", 1, "x"), > ("a", 2, "y"), > ("a", 3, "z"), > ("a", 4, None), > ("b", 1, None), > ("b", 2, None), > ], > schema=("key", "order", "value"), > ) > w = Window.partitionBy("key").orderBy("order") > rs = df.select( > df.key, > df.order, > nth_value("value", 2).over(w), > nth_value("value", 2, False).over(w), > nth_value("value", 2, True).over(w), > ).collect() > expected = [ > ("a", 0, None, None, None), > ("a", 1, "x", "x", None), > ("a", 2, "x", "x", "y"), > ("a", 3, "x", "x", "y"), > ("a", 4, "x", "x", "y"), > ("b", 1, None, None, None), > ("b", 2, None, None, None), > ] > for r, ex in zip(sorted(rs), sorted(expected)): > self.assertEqual(tuple(r), ex[: len(r)]){code} > {code:java} > Traceback (most recent call last): > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_functions.py", > line 755, in test_nth_value > self.assertEqual(tuple(r), ex[: len(r)]) > AssertionError: Tuples differ: ('a', 1, 'x', None) != ('a', 1, 'x', 'x') > First differing element 3: > None > 'x' > - ('a', 1, 'x', None) > ? > + ('a', 1, 'x', 'x') > ? ^^^ > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
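For reference, the semantics the test above expects can be modeled in plain Python. This is a minimal sketch of a 1-based `nth_value` over a window frame that grows from the partition start to the current row; `nth_value_model` is an illustrative helper, not PySpark's implementation.

```python
def nth_value_model(values, n, ignore_nulls=False):
    """For each position i, return the n-th value (1-based) among
    values[0..i]; with ignore_nulls=True, skip Nones when counting.
    Returns None when the frame holds fewer than n candidate values."""
    out = []
    for i in range(len(values)):
        frame = values[: i + 1]
        if ignore_nulls:
            frame = [v for v in frame if v is not None]
        out.append(frame[n - 1] if len(frame) >= n else None)
    return out

# Partition "a" from the test case, ordered by "order":
vals = [None, "x", "y", "z", None]
print(nth_value_model(vals, 2))        # [None, 'x', 'x', 'x', 'x']
print(nth_value_model(vals, 2, True))  # [None, None, 'y', 'y', 'y']
```

These two outputs match the third and fifth columns of the `expected` rows for key "a" in the test, which is the behavior restored by the SPARK-41945 fix.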
[jira] [Created] (SPARK-41956) Shuffle output location refetch in ShuffleBlockFetcherIterator
Zhongwei Zhu created SPARK-41956: Summary: Shuffle output location refetch in ShuffleBlockFetcherIterator Key: SPARK-41956 URL: https://issues.apache.org/jira/browse/SPARK-41956 Project: Spark Issue Type: Sub-task Components: Spark Core Affects Versions: 3.3.1 Reporter: Zhongwei Zhu -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-41955) Support fetch latest map output from worker
Zhongwei Zhu created SPARK-41955: Summary: Support fetch latest map output from worker Key: SPARK-41955 URL: https://issues.apache.org/jira/browse/SPARK-41955 Project: Spark Issue Type: Sub-task Components: Spark Core Affects Versions: 3.3.1 Reporter: Zhongwei Zhu -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-41954) Add isDecommissioned in ExecutorDeadException
Zhongwei Zhu created SPARK-41954: Summary: Add isDecommissioned in ExecutorDeadException Key: SPARK-41954 URL: https://issues.apache.org/jira/browse/SPARK-41954 Project: Spark Issue Type: Sub-task Components: Spark Core Affects Versions: 3.3.1 Reporter: Zhongwei Zhu -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41953) Shuffle output location refetch during shuffle migration in decommission
[ https://issues.apache.org/jira/browse/SPARK-41953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17656365#comment-17656365 ] Zhongwei Zhu commented on SPARK-41953: -- [~dongjoon] [~mridulm80] [~Ngone51] Any comments on this? > Shuffle output location refetch during shuffle migration in decommission > > > Key: SPARK-41953 > URL: https://issues.apache.org/jira/browse/SPARK-41953 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.3.1 >Reporter: Zhongwei Zhu >Priority: Major > > When shuffle migration is enabled during Spark decommission, shuffle data will > be migrated onto live executors, and the latest locations are then updated in > MapOutputTracker. This has some issues: > # Executors only fetch map output locations at the beginning of the reduce > stage, so any shuffle output location change in the middle of the reduce will > cause FetchFailed as reducers fetch from the old location. Even though stage retries > could solve this, it still causes lots of resource waste, as all shuffle read > and compute that happened before the FetchFailed partition will be wasted. > # During stage retries, fewer running tasks cause more executors to be > decommissioned, and shuffle data locations keep changing. In the worst case, a > stage could need lots of retries, further breaking SLA. > So I propose to support refetching map output locations during the reduce phase if > shuffle migration is enabled and the FetchFailed is caused by a decommissioned > dead executor. The detailed steps are as below: > # When `BlockTransferService` fails to fetch blocks from a decommissioned dead > executor, ExecutorDeadException(isDecommission as true) will be thrown. > # Make MapOutputTracker support fetching the latest output without an epoch provided. > # `ShuffleBlockFetcherIterator` will refetch the latest output from > MapOutputTrackerMaster. For all the shuffle blocks on this decommissioned executor, > there should be a new location on another executor. If not, throw an exception > as we do currently.
If yes, create new local and remote requests to fetch these > migrated shuffle blocks. The flow will be similar to the fallback fetch when a > push-merged fetch fails. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
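The three proposed steps can be sketched as a toy model in Python. All class and method names below are hypothetical stand-ins for illustration only, not Spark's actual `MapOutputTracker`, `BlockTransferService`, or `ShuffleBlockFetcherIterator` APIs.

```python
class ExecutorDeadException(Exception):
    """Step 1: a failed fetch from a decommissioned dead executor
    surfaces an isDecommissioned flag (hypothetical shape)."""
    def __init__(self, is_decommissioned):
        self.is_decommissioned = is_decommissioned

class MapOutputTrackerModel:
    """Toy stand-in for the tracker: block_id -> executor_id."""
    def __init__(self, locations):
        self.locations = dict(locations)

    def migrate(self, block_id, new_executor):
        # Shuffle migration moves a block to a live executor.
        self.locations[block_id] = new_executor

    def latest_location(self, block_id):
        # Step 2: serve the latest output location, no epoch required.
        return self.locations[block_id]

def fetch_block(live_executors, block_id, location):
    if location not in live_executors:
        raise ExecutorDeadException(is_decommissioned=True)
    return f"data:{block_id}@{location}"

def fetch_with_refetch(tracker, live_executors, block_id, location):
    """Step 3: on a decommission-caused failure, refetch the migrated
    location and retry instead of failing the whole stage."""
    try:
        return fetch_block(live_executors, block_id, location)
    except ExecutorDeadException as e:
        if not e.is_decommissioned:
            raise
        new_loc = tracker.latest_location(block_id)
        return fetch_block(live_executors, block_id, new_loc)

tracker = MapOutputTrackerModel({"shuffle_0_0": "exec-1"})
tracker.migrate("shuffle_0_0", "exec-2")  # exec-1 decommissioned
print(fetch_with_refetch(tracker, {"exec-2"}, "shuffle_0_0", "exec-1"))
# data:shuffle_0_0@exec-2
```

If the tracker had no new location (the block was truly lost), `fetch_block` would raise again, mirroring the "throw an exception as we do currently" branch of the proposal.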
[jira] [Updated] (SPARK-41953) Shuffle output location refetch during shuffle migration in decommission
[ https://issues.apache.org/jira/browse/SPARK-41953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhongwei Zhu updated SPARK-41953: - Description: When shuffle migration is enabled during Spark decommission, shuffle data will be migrated onto live executors, and the latest locations are then updated in MapOutputTracker. This has some issues: # Executors only fetch map output locations at the beginning of the reduce stage, so any shuffle output location change in the middle of the reduce will cause FetchFailed as reducers fetch from the old location. Even though stage retries could solve this, it still causes lots of resource waste, as all shuffle read and compute that happened before the FetchFailed partition will be wasted. # During stage retries, fewer running tasks cause more executors to be decommissioned, and shuffle data locations keep changing. In the worst case, a stage could need lots of retries, further breaking SLA. So I propose to support refetching map output locations during the reduce phase if shuffle migration is enabled and the FetchFailed is caused by a decommissioned dead executor. The detailed steps are as below: # When `BlockTransferService` fails to fetch blocks from a decommissioned dead executor, ExecutorDeadException(isDecommission as true) will be thrown. # Make MapOutputTracker support fetching the latest output without an epoch provided. # `ShuffleBlockFetcherIterator` will refetch the latest output from MapOutputTrackerMaster. For all the shuffle blocks on this decommissioned executor, there should be a new location on another executor. If not, throw an exception as we do currently. If yes, create new local and remote requests to fetch these migrated shuffle blocks. The flow will be similar to the fallback fetch when a push-merged fetch fails. was: When shuffle migration is enabled during Spark decommission, shuffle data will be migrated onto live executors, and the latest locations are then updated in MapOutputTracker.
It has some issues: # Executors only fetch map output locations at the beginning of the reduce stage, so any shuffle output location change in the middle of the reduce will cause FetchFailed as reducers fetch from the old location. Even though stage retries could solve this, it still causes lots of resource waste, as all shuffle read and compute that happened before the FetchFailed partition will be wasted. # During stage retries, fewer running tasks cause more executors to be decommissioned, and shuffle data locations keep changing. In the worst case, a stage could need lots of retries, further breaking SLA. So I propose to support refetching map output locations during the reduce phase if shuffle migration is enabled and the FetchFailed is caused by a decommissioned dead executor. The detailed steps as > Shuffle output location refetch during shuffle migration in decommission > > > Key: SPARK-41953 > URL: https://issues.apache.org/jira/browse/SPARK-41953 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.3.1 >Reporter: Zhongwei Zhu >Priority: Major > > When shuffle migration is enabled during Spark decommission, shuffle data will > be migrated onto live executors, and the latest locations are then updated in > MapOutputTracker. This has some issues: > # Executors only fetch map output locations at the beginning of the reduce > stage, so any shuffle output location change in the middle of the reduce will > cause FetchFailed as reducers fetch from the old location. Even though stage retries > could solve this, it still causes lots of resource waste, as all shuffle read > and compute that happened before the FetchFailed partition will be wasted. > # During stage retries, fewer running tasks cause more executors to be > decommissioned, and shuffle data locations keep changing. In the worst case, a > stage could need lots of retries, further breaking SLA.
> So I propose to support refetching map output locations during the reduce phase if > shuffle migration is enabled and the FetchFailed is caused by a decommissioned > dead executor. The detailed steps are as below: > # When `BlockTransferService` fails to fetch blocks from a decommissioned dead > executor, ExecutorDeadException(isDecommission as true) will be thrown. > # Make MapOutputTracker support fetching the latest output without an epoch provided. > # `ShuffleBlockFetcherIterator` will refetch the latest output from > MapOutputTrackerMaster. For all the shuffle blocks on this decommissioned executor, > there should be a new location on another executor. If not, throw an exception > as we do currently. If yes, create new local and remote requests to fetch these > migrated shuffle blocks. The flow will be similar to the fallback fetch when a > push-merged fetch fails. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail:
[jira] [Assigned] (SPARK-41232) High-order function: array_append
[ https://issues.apache.org/jira/browse/SPARK-41232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-41232: Assignee: Ankit Prakash Gupta (was: Senthil Kumar) > High-order function: array_append > - > > Key: SPARK-41232 > URL: https://issues.apache.org/jira/browse/SPARK-41232 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Assignee: Ankit Prakash Gupta >Priority: Major > Fix For: 3.4.0 > > > refer to > https://docs.snowflake.com/en/developer-guide/snowpark/reference/python/api/snowflake.snowpark.functions.array_append.html > 1, about the data type validation: > In Snowflake’s array_append, array_prepend and array_insert functions, the > element data type does not need to match the data type of the existing > elements in the array. > While in Spark, we want to leverage the same data type validation as > array_remove. > 2, about the NULL handling > Currently, SparkSQL, SnowSQL and PostgreSQL deal with NULL values in > different ways. > Existing functions array_contains, array_position and array_remove in > SparkSQL handle NULL in this way: if the input array and/or element is NULL, > return NULL. However, we should break from this behavior here. > We should implement the NULL handling in array_append in this way: > 2.1, if the array is NULL, return NULL; > 2.2, if the array is not NULL and the element is NULL, append the NULL value to > the array -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
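The proposed NULL handling (points 2.1 and 2.2 above) can be sketched as a plain-Python model, with `None` standing in for SQL NULL. This is an illustrative model of the rules described in the ticket, not Spark's actual `array_append` implementation.

```python
def array_append_model(arr, elem):
    """Model of the proposed NULL handling for array_append:
    - 2.1: if the array is NULL (None), return NULL;
    - 2.2: if the array is not NULL, append the element, even
      when the element itself is NULL."""
    if arr is None:
        return None
    return arr + [elem]

print(array_append_model(None, 1))       # None
print(array_append_model([1, 2], None))  # [1, 2, None]
print(array_append_model([1, 2], 3))     # [1, 2, 3]
```

Note how this differs from `array_contains`/`array_position`/`array_remove` as described above: a NULL element no longer forces a NULL result, only a NULL array does.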
[jira] [Created] (SPARK-41953) Shuffle output location refetch during shuffle migration in decommission
Zhongwei Zhu created SPARK-41953: Summary: Shuffle output location refetch during shuffle migration in decommission Key: SPARK-41953 URL: https://issues.apache.org/jira/browse/SPARK-41953 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 3.3.1 Reporter: Zhongwei Zhu When shuffle migration is enabled during Spark decommission, shuffle data will be migrated onto live executors, and the latest locations are then updated in MapOutputTracker. This has some issues: # Executors only fetch map output locations at the beginning of the reduce stage, so any shuffle output location change in the middle of the reduce will cause FetchFailed as reducers fetch from the old location. Even though stage retries could solve this, it still causes lots of resource waste, as all shuffle read and compute that happened before the FetchFailed partition will be wasted. # During stage retries, fewer running tasks cause more executors to be decommissioned, and shuffle data locations keep changing. In the worst case, a stage could need lots of retries, further breaking SLA. So I propose to support refetching map output locations during the reduce phase if shuffle migration is enabled and the FetchFailed is caused by a decommissioned dead executor. The detailed steps as -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-41232) High-order function: array_append
[ https://issues.apache.org/jira/browse/SPARK-41232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-41232. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 38865 [https://github.com/apache/spark/pull/38865] > High-order function: array_append > - > > Key: SPARK-41232 > URL: https://issues.apache.org/jira/browse/SPARK-41232 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Assignee: Senthil Kumar >Priority: Major > Fix For: 3.4.0 > > > refer to > https://docs.snowflake.com/en/developer-guide/snowpark/reference/python/api/snowflake.snowpark.functions.array_append.html > 1, about the data type validation: > In Snowflake’s array_append, array_prepend and array_insert functions, the > element data type does not need to match the data type of the existing > elements in the array. > While in Spark, we want to leverage the same data type validation as > array_remove. > 2, about the NULL handling > Currently, SparkSQL, SnowSQL and PostgreSQL deal with NULL values in > different ways. > Existing functions array_contains, array_position and array_remove in > SparkSQL handle NULL in this way: if the input array and/or element is NULL, > return NULL. However, we should break from this behavior here. > We should implement the NULL handling in array_append in this way: > 2.1, if the array is NULL, return NULL; > 2.2, if the array is not NULL and the element is NULL, append the NULL value to > the array -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41232) High-order function: array_append
[ https://issues.apache.org/jira/browse/SPARK-41232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-41232: Assignee: Senthil Kumar > High-order function: array_append > - > > Key: SPARK-41232 > URL: https://issues.apache.org/jira/browse/SPARK-41232 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Assignee: Senthil Kumar >Priority: Major > > refer to > https://docs.snowflake.com/en/developer-guide/snowpark/reference/python/api/snowflake.snowpark.functions.array_append.html > 1, about the data type validation: > In Snowflake’s array_append, array_prepend and array_insert functions, the > element data type does not need to match the data type of the existing > elements in the array. > While in Spark, we want to leverage the same data type validation as > array_remove. > 2, about the NULL handling > Currently, SparkSQL, SnowSQL and PostgreSQL deal with NULL values in > different ways. > Existing functions array_contains, array_position and array_remove in > SparkSQL handle NULL in this way: if the input array and/or element is NULL, > return NULL. However, we should break from this behavior here. > We should implement the NULL handling in array_append in this way: > 2.1, if the array is NULL, return NULL; > 2.2, if the array is not NULL and the element is NULL, append the NULL value to > the array -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41432) Protobuf serializer for SparkPlanGraphWrapper
[ https://issues.apache.org/jira/browse/SPARK-41432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17656340#comment-17656340 ] Apache Spark commented on SPARK-41432: -- User 'gengliangwang' has created a pull request for this issue: https://github.com/apache/spark/pull/39471 > Protobuf serializer for SparkPlanGraphWrapper > - > > Key: SPARK-41432 > URL: https://issues.apache.org/jira/browse/SPARK-41432 > Project: Spark > Issue Type: Sub-task > Components: Web UI >Affects Versions: 3.4.0 >Reporter: Gengliang Wang >Assignee: Sandeep Singh >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41432) Protobuf serializer for SparkPlanGraphWrapper
[ https://issues.apache.org/jira/browse/SPARK-41432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17656339#comment-17656339 ] Apache Spark commented on SPARK-41432: -- User 'gengliangwang' has created a pull request for this issue: https://github.com/apache/spark/pull/39471 > Protobuf serializer for SparkPlanGraphWrapper > - > > Key: SPARK-41432 > URL: https://issues.apache.org/jira/browse/SPARK-41432 > Project: Spark > Issue Type: Sub-task > Components: Web UI >Affects Versions: 3.4.0 >Reporter: Gengliang Wang >Assignee: Sandeep Singh >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-41952) Upgrade Parquet to fix off-heap memory leaks in Zstd codec
[ https://issues.apache.org/jira/browse/SPARK-41952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Kudinkin updated SPARK-41952: Shepherd: Dongjoon Hyun > Upgrade Parquet to fix off-heap memory leaks in Zstd codec > -- > > Key: SPARK-41952 > URL: https://issues.apache.org/jira/browse/SPARK-41952 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 3.1.3, 3.3.1, 3.2.3 >Reporter: Alexey Kudinkin >Priority: Critical > > Recently, a native memory leak was discovered in Parquet, in conjunction > with its use of the Zstd decompressor from the luben/zstd-jni library (PARQUET-2160). > This is very problematic, to the point where we can't use Parquet w/ Zstd due to > pervasive OOMs taking down our executors and disrupting our jobs. > Luckily, a fix addressing this has already landed in Parquet: > [https://github.com/apache/parquet-mr/pull/982] > > Now, we just need to make sure that > # An updated version of Parquet is released in a timely manner > # Spark is upgraded onto this new version in the upcoming release > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-41952) Upgrade Parquet to fix off-heap memory leaks in Zstd codec
Alexey Kudinkin created SPARK-41952: --- Summary: Upgrade Parquet to fix off-heap memory leaks in Zstd codec Key: SPARK-41952 URL: https://issues.apache.org/jira/browse/SPARK-41952 Project: Spark Issue Type: Bug Components: Input/Output Affects Versions: 3.2.3, 3.3.1, 3.1.3 Reporter: Alexey Kudinkin Recently, a native memory leak was discovered in Parquet, in conjunction with its use of the Zstd decompressor from the luben/zstd-jni library (PARQUET-2160). This is very problematic, to the point where we can't use Parquet w/ Zstd due to pervasive OOMs taking down our executors and disrupting our jobs. Luckily, a fix addressing this has already landed in Parquet: [https://github.com/apache/parquet-mr/pull/982] Now, we just need to make sure that # An updated version of Parquet is released in a timely manner # Spark is upgraded onto this new version in the upcoming release -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-41951) Update SQL migration guide and documentations
[ https://issues.apache.org/jira/browse/SPARK-41951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-41951. --- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 39470 [https://github.com/apache/spark/pull/39470] > Update SQL migration guide and documentations > - > > Key: SPARK-41951 > URL: https://issues.apache.org/jira/browse/SPARK-41951 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 3.4.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41951) Update SQL migration guide and documentations
[ https://issues.apache.org/jira/browse/SPARK-41951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-41951: - Assignee: Dongjoon Hyun > Update SQL migration guide and documentations > - > > Key: SPARK-41951 > URL: https://issues.apache.org/jira/browse/SPARK-41951 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 3.4.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41951) Update SQL migration guide and documentations
[ https://issues.apache.org/jira/browse/SPARK-41951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41951: Assignee: Apache Spark > Update SQL migration guide and documentations > - > > Key: SPARK-41951 > URL: https://issues.apache.org/jira/browse/SPARK-41951 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 3.4.0 >Reporter: Dongjoon Hyun >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41951) Update SQL migration guide and documentations
[ https://issues.apache.org/jira/browse/SPARK-41951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17656321#comment-17656321 ] Apache Spark commented on SPARK-41951: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/39470 > Update SQL migration guide and documentations > - > > Key: SPARK-41951 > URL: https://issues.apache.org/jira/browse/SPARK-41951 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 3.4.0 >Reporter: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41951) Update SQL migration guide and documentations
[ https://issues.apache.org/jira/browse/SPARK-41951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41951: Assignee: (was: Apache Spark) > Update SQL migration guide and documentations > - > > Key: SPARK-41951 > URL: https://issues.apache.org/jira/browse/SPARK-41951 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 3.4.0 >Reporter: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-41951) Update SQL migration guide and documentations
Dongjoon Hyun created SPARK-41951: - Summary: Update SQL migration guide and documentations Key: SPARK-41951 URL: https://issues.apache.org/jira/browse/SPARK-41951 Project: Spark Issue Type: Documentation Components: Documentation Affects Versions: 3.4.0 Reporter: Dongjoon Hyun -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41947) Update the contents of error class guidelines
[ https://issues.apache.org/jira/browse/SPARK-41947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk reassigned SPARK-41947: Assignee: Haejoon Lee > Update the contents of error class guidelines > - > > Key: SPARK-41947 > URL: https://issues.apache.org/jira/browse/SPARK-41947 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Haejoon Lee >Assignee: Haejoon Lee >Priority: Major > > The error class guidelines in `core/src/main/resources/error/README.md` are > out of date; we should update the guidelines to match the current behavior. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-41947) Update the contents of error class guidelines
[ https://issues.apache.org/jira/browse/SPARK-41947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk resolved SPARK-41947. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 39464 [https://github.com/apache/spark/pull/39464] > Update the contents of error class guidelines > - > > Key: SPARK-41947 > URL: https://issues.apache.org/jira/browse/SPARK-41947 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Haejoon Lee >Assignee: Haejoon Lee >Priority: Major > Fix For: 3.4.0 > > > The error class guidelines in `core/src/main/resources/error/README.md` are > out of date; we should update the guidelines to match the current behavior. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-41848) Tasks are over-scheduled with TaskResourceProfile
[ https://issues.apache.org/jira/browse/SPARK-41848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-41848. --- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 39410 [https://github.com/apache/spark/pull/39410] > Tasks are over-scheduled with TaskResourceProfile > - > > Key: SPARK-41848 > URL: https://issues.apache.org/jira/browse/SPARK-41848 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.4.0 >Reporter: wuyi >Assignee: huangtengfei >Priority: Blocker > Fix For: 3.4.0 > > > {code:java} > test("SPARK-XXX") { > val conf = new > SparkConf().setAppName("test").setMaster("local-cluster[1,4,1024]") > sc = new SparkContext(conf) > val req = new TaskResourceRequests().cpus(3) > val rp = new ResourceProfileBuilder().require(req).build() > val res = sc.parallelize(Seq(0, 1), 2).withResources(rp).map { x => > Thread.sleep(5000) > x * 2 > }.collect() > assert(res === Array(0, 2)) > } {code} > In this test, tasks are supposed to be scheduled in order since each task > requires 3 cores but the executor only has 4 cores. However, we noticed 2 > tasks are launched concurrently from the logs. > It turns out that we used the TaskResourceProfile (taskCpus=3) of the taskset > for task scheduling: > {code:java} > val rpId = taskSet.taskSet.resourceProfileId > val taskSetProf = sc.resourceProfileManager.resourceProfileFromId(rpId) > val taskCpus = ResourceProfile.getTaskCpusOrDefaultForProfile(taskSetProf, > conf) {code} > but the ResourceProfile (taskCpus=1) of the executor for updating the free > cores in ExecutorData: > {code:java} > val rpId = executorData.resourceProfileId > val prof = scheduler.sc.resourceProfileManager.resourceProfileFromId(rpId) > val taskCpus = ResourceProfile.getTaskCpusOrDefaultForProfile(prof, conf) > executorData.freeCores -= taskCpus {code} > which results in the inconsistency of the available cores. 
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
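The accounting mismatch described in SPARK-41848 can be illustrated with a toy cores ledger. This is a hypothetical sketch of the bug's mechanics, not Spark's scheduler code: availability is checked with the task set's TaskResourceProfile (taskCpus=3), but the free-cores ledger is decremented with the executor's default profile (taskCpus=1).

```python
def launched_tasks(executor_cores, taskset_task_cpus, executor_task_cpus, num_tasks):
    """Count how many tasks get launched at once when the scheduling
    check and the free-cores bookkeeping use different taskCpus values."""
    free = executor_cores
    launched = 0
    for _ in range(num_tasks):
        if free >= taskset_task_cpus:   # check uses the task set's taskCpus
            free -= executor_task_cpus  # ledger uses the executor's taskCpus
            launched += 1
    return launched

# 4-core executor, 2 tasks that each require 3 cores:
print(launched_tasks(4, 3, 1, 2))  # 2 -> both launch concurrently (the bug)
print(launched_tasks(4, 3, 3, 2))  # 1 -> consistent accounting serializes them
```

With consistent accounting (second call), the first task leaves only 1 free core, so the second task must wait, which is the behavior the test in the report expects.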
[jira] [Assigned] (SPARK-41848) Tasks are over-scheduled with TaskResourceProfile
[ https://issues.apache.org/jira/browse/SPARK-41848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-41848: - Assignee: huangtengfei > Tasks are over-scheduled with TaskResourceProfile > - > > Key: SPARK-41848 > URL: https://issues.apache.org/jira/browse/SPARK-41848 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.4.0 >Reporter: wuyi >Assignee: huangtengfei >Priority: Blocker > > {code:java} > test("SPARK-XXX") { > val conf = new > SparkConf().setAppName("test").setMaster("local-cluster[1,4,1024]") > sc = new SparkContext(conf) > val req = new TaskResourceRequests().cpus(3) > val rp = new ResourceProfileBuilder().require(req).build() > val res = sc.parallelize(Seq(0, 1), 2).withResources(rp).map { x => > Thread.sleep(5000) > x * 2 > }.collect() > assert(res === Array(0, 2)) > } {code} > In this test, tasks are supposed to be scheduled in order since each task > requires 3 cores but the executor only has 4 cores. However, we noticed 2 > tasks are launched concurrently from the logs. > It turns out that we used the TaskResourceProfile (taskCpus=3) of the taskset > for task scheduling: > {code:java} > val rpId = taskSet.taskSet.resourceProfileId > val taskSetProf = sc.resourceProfileManager.resourceProfileFromId(rpId) > val taskCpus = ResourceProfile.getTaskCpusOrDefaultForProfile(taskSetProf, > conf) {code} > but the ResourceProfile (taskCpus=1) of the executor for updating the free > cores in ExecutorData: > {code:java} > val rpId = executorData.resourceProfileId > val prof = scheduler.sc.resourceProfileManager.resourceProfileFromId(rpId) > val taskCpus = ResourceProfile.getTaskCpusOrDefaultForProfile(prof, conf) > executorData.freeCores -= taskCpus {code} > which results in the inconsistency of the available cores. 
[jira] [Resolved] (SPARK-41860) Make AvroScanBuilder and JsonScanBuilder case classes
[ https://issues.apache.org/jira/browse/SPARK-41860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-41860. --- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 39366 [https://github.com/apache/spark/pull/39366] > Make AvroScanBuilder and JsonScanBuilder case classes > - > > Key: SPARK-41860 > URL: https://issues.apache.org/jira/browse/SPARK-41860 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.3.1 >Reporter: Lorenzo Martini >Assignee: Lorenzo Martini >Priority: Trivial > Fix For: 3.4.0 > > > Most of the existing `ScanBuilder`s inheriting from `FileScanBuilder` are > case classes which is very nice and useful. However the Json and the Avro > ones aren't, which makes it quite inconvenient when working with those. I > don't see any particular reason why these two would not be case classes as > well given they are very similar to the other ones and very simple and > stateless -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41860) Make AvroScanBuilder and JsonScanBuilder case classes
[ https://issues.apache.org/jira/browse/SPARK-41860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-41860: - Assignee: Lorenzo Martini > Make AvroScanBuilder and JsonScanBuilder case classes > - > > Key: SPARK-41860 > URL: https://issues.apache.org/jira/browse/SPARK-41860 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.3.1 >Reporter: Lorenzo Martini >Assignee: Lorenzo Martini >Priority: Trivial > > Most of the existing `ScanBuilder`s inheriting from `FileScanBuilder` are > case classes which is very nice and useful. However the Json and the Avro > ones aren't, which makes it quite inconvenient when working with those. I > don't see any particular reason why these two would not be case classes as > well given they are very similar to the other ones and very simple and > stateless -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
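The convenience being requested is what Scala case classes provide out of the box: structural equality, a `copy` method, and a readable `toString`. As an analogy only (Python, not Spark's Scala code), frozen dataclasses behave like case classes while plain classes compare by identity, which is what makes non-case builders awkward in tests:

```python
# Analogy in Python (not Spark code): Scala case classes give structural
# equality and copy-with-change for free, much like frozen dataclasses.
from dataclasses import dataclass, replace

class PlainScanBuilder:              # like a non-case class
    def __init__(self, path):
        self.path = path

@dataclass(frozen=True)
class CaseScanBuilder:               # like a case class
    path: str

print(PlainScanBuilder("a") == PlainScanBuilder("a"))  # False: identity only
print(CaseScanBuilder("a") == CaseScanBuilder("a"))    # True: structural
print(replace(CaseScanBuilder("a"), path="b"))         # copy with one field changed
```

Since the Json and Avro builders are stateless, opting into case-class semantics changes no behavior beyond adding these conveniences.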
[jira] [Resolved] (SPARK-41147) Assign a name to the legacy error class _LEGACY_ERROR_TEMP_1042
[ https://issues.apache.org/jira/browse/SPARK-41147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haejoon Lee resolved SPARK-41147. - Resolution: Duplicate > Assign a name to the legacy error class _LEGACY_ERROR_TEMP_1042 > --- > > Key: SPARK-41147 > URL: https://issues.apache.org/jira/browse/SPARK-41147 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Haejoon Lee >Priority: Major > > We should rename all LEGACY_ERROR_TEMP error classes to proper names. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41708) Pull v1write information to WriteFiles
[ https://issues.apache.org/jira/browse/SPARK-41708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17656105#comment-17656105 ] Apache Spark commented on SPARK-41708: -- User 'ulysses-you' has created a pull request for this issue: https://github.com/apache/spark/pull/39468 > Pull v1write information to WriteFiles > -- > > Key: SPARK-41708 > URL: https://issues.apache.org/jira/browse/SPARK-41708 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: XiDuo You >Assignee: XiDuo You >Priority: Major > Fix For: 3.4.0 > > > Make WriteFiles hold v1 write information -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41708) Pull v1write information to WriteFiles
[ https://issues.apache.org/jira/browse/SPARK-41708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17656104#comment-17656104 ] Apache Spark commented on SPARK-41708: -- User 'ulysses-you' has created a pull request for this issue: https://github.com/apache/spark/pull/39468 > Pull v1write information to WriteFiles > -- > > Key: SPARK-41708 > URL: https://issues.apache.org/jira/browse/SPARK-41708 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: XiDuo You >Assignee: XiDuo You >Priority: Major > Fix For: 3.4.0 > > > Make WriteFiles hold v1 write information -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41950) mlflow doctest fails for pandas API on Spark
[ https://issues.apache.org/jira/browse/SPARK-41950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41950: Assignee: Apache Spark > mlflow doctest fails for pandas API on SPark > > > Key: SPARK-41950 > URL: https://issues.apache.org/jira/browse/SPARK-41950 > Project: Spark > Issue Type: Test > Components: Pandas API on Spark >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Assignee: Apache Spark >Priority: Major > > {code} > File "/__w/spark/spark/python/pyspark/pandas/mlflow.py", line 172, in > pyspark.pandas.mlflow.load_model > Failed example: > prediction_df > Exception raised: > Traceback (most recent call last): > File "/usr/lib/python3.9/doctest.py", line 1336, in __run > exec(compile(example.source, filename, "single", > File "", line 1, in > > prediction_df > File "/__w/spark/spark/python/pyspark/pandas/frame.py", line 13322, in > __repr__ > pdf = cast("DataFrame", > self._get_or_create_repr_pandas_cache(max_display_count)) > File "/__w/spark/spark/python/pyspark/pandas/frame.py", line 13313, in > _get_or_create_repr_pandas_cache > self, "_repr_pandas_cache", {n: self.head(n + > 1)._to_internal_pandas()} > File "/__w/spark/spark/python/pyspark/pandas/frame.py", line 13308, in > _to_internal_pandas > return self._internal.to_pandas_frame > File "/__w/spark/spark/python/pyspark/pandas/utils.py", line 588, in > wrapped_lazy_property > setattr(self, attr_name, fn(self)) > File "/__w/spark/spark/python/pyspark/pandas/internal.py", line 1056, > in to_pandas_frame > pdf = sdf.toPandas() > File "/__w/spark/spark/python/pyspark/sql/pandas/conversion.py", line > 208, in toPandas > pdf = pd.DataFrame.from_records(self.collect(), columns=self.columns) > File "/__w/spark/spark/python/pyspark/sql/dataframe.py", line 1197, in > collect > sock_info = self._jdf.collectToPython() > File > "/__w/spark/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py", > line 1322, in __call__ > return_value = get_return_value( > 
File "/__w/spark/spark/python/pyspark/sql/utils.py", line 209, in deco > raise converted from None > pyspark.sql.utils.PythonException: > An exception was thrown from the Python worker. Please see the stack > trace below. > Traceback (most recent call last): > File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line > 829, in main > process() > File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line > 821, in process > serializer.dump_stream(out_iter, outfile) > File > "/__w/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", > line 345, in dump_stream > return ArrowStreamSerializer.dump_stream(self, > init_stream_yield_batches(), stream) > File > "/__w/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", > line 86, in dump_stream > for batch in iterator: > File > "/__w/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", > line 338, in init_stream_yield_batches > for series in iterator: > File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line > 519, in func > for result_batch, result_type in result_iter: > File > "/usr/local/lib/python3.9/dist-packages/mlflow/pyfunc/__init__.py", line > 1253, in udf > yield _predict_row_batch(batch_predict_fn, row_batch_args) > File > "/usr/local/lib/python3.9/dist-packages/mlflow/pyfunc/__init__.py", line > 1057, in _predict_row_batch > result = predict_fn(pdf) > File > "/usr/local/lib/python3.9/dist-packages/mlflow/pyfunc/__init__.py", line > 1237, in batch_predict_fn > return loaded_model.predict(pdf) > File > "/usr/local/lib/python3.9/dist-packages/mlflow/pyfunc/__init__.py", line 413, > in predict > return self._predict_fn(data) > File > "/usr/local/lib/python3.9/dist-packages/sklearn/linear_model/_base.py", line > 355, in predict > return self._decision_function(X) > File > "/usr/local/lib/python3.9/dist-packages/sklearn/linear_model/_base.py", line > 338, in _decision_function > X = self._validate_data(X, 
accept_sparse=["csr", "csc", "coo"], > reset=False) > File "/usr/local/lib/python3.9/dist-packages/sklearn/base.py", line > 518, in _validate_data > self._check_feature_names(X, reset=reset) > File "/usr/local/lib/python3.9/dist-packages/sklearn/base.py", line > 451, in _check_feature_names > raise ValueError(message) > ValueError: The feature names should match those that were passed during > fit. >
[jira] [Commented] (SPARK-41950) mlflow doctest fails for pandas API on Spark
[ https://issues.apache.org/jira/browse/SPARK-41950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17656097#comment-17656097 ] Apache Spark commented on SPARK-41950: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/39467 > mlflow doctest fails for pandas API on SPark > > > Key: SPARK-41950 > URL: https://issues.apache.org/jira/browse/SPARK-41950 > Project: Spark > Issue Type: Test > Components: Pandas API on Spark >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Priority: Major > > {code} > File "/__w/spark/spark/python/pyspark/pandas/mlflow.py", line 172, in > pyspark.pandas.mlflow.load_model > Failed example: > prediction_df > Exception raised: > Traceback (most recent call last): > File "/usr/lib/python3.9/doctest.py", line 1336, in __run > exec(compile(example.source, filename, "single", > File "", line 1, in > > prediction_df > File "/__w/spark/spark/python/pyspark/pandas/frame.py", line 13322, in > __repr__ > pdf = cast("DataFrame", > self._get_or_create_repr_pandas_cache(max_display_count)) > File "/__w/spark/spark/python/pyspark/pandas/frame.py", line 13313, in > _get_or_create_repr_pandas_cache > self, "_repr_pandas_cache", {n: self.head(n + > 1)._to_internal_pandas()} > File "/__w/spark/spark/python/pyspark/pandas/frame.py", line 13308, in > _to_internal_pandas > return self._internal.to_pandas_frame > File "/__w/spark/spark/python/pyspark/pandas/utils.py", line 588, in > wrapped_lazy_property > setattr(self, attr_name, fn(self)) > File "/__w/spark/spark/python/pyspark/pandas/internal.py", line 1056, > in to_pandas_frame > pdf = sdf.toPandas() > File "/__w/spark/spark/python/pyspark/sql/pandas/conversion.py", line > 208, in toPandas > pdf = pd.DataFrame.from_records(self.collect(), columns=self.columns) > File "/__w/spark/spark/python/pyspark/sql/dataframe.py", line 1197, in > collect > sock_info = self._jdf.collectToPython() > File > 
"/__w/spark/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py", > line 1322, in __call__ > return_value = get_return_value( > File "/__w/spark/spark/python/pyspark/sql/utils.py", line 209, in deco > raise converted from None > pyspark.sql.utils.PythonException: > An exception was thrown from the Python worker. Please see the stack > trace below. > Traceback (most recent call last): > File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line > 829, in main > process() > File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line > 821, in process > serializer.dump_stream(out_iter, outfile) > File > "/__w/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", > line 345, in dump_stream > return ArrowStreamSerializer.dump_stream(self, > init_stream_yield_batches(), stream) > File > "/__w/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", > line 86, in dump_stream > for batch in iterator: > File > "/__w/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", > line 338, in init_stream_yield_batches > for series in iterator: > File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line > 519, in func > for result_batch, result_type in result_iter: > File > "/usr/local/lib/python3.9/dist-packages/mlflow/pyfunc/__init__.py", line > 1253, in udf > yield _predict_row_batch(batch_predict_fn, row_batch_args) > File > "/usr/local/lib/python3.9/dist-packages/mlflow/pyfunc/__init__.py", line > 1057, in _predict_row_batch > result = predict_fn(pdf) > File > "/usr/local/lib/python3.9/dist-packages/mlflow/pyfunc/__init__.py", line > 1237, in batch_predict_fn > return loaded_model.predict(pdf) > File > "/usr/local/lib/python3.9/dist-packages/mlflow/pyfunc/__init__.py", line 413, > in predict > return self._predict_fn(data) > File > "/usr/local/lib/python3.9/dist-packages/sklearn/linear_model/_base.py", line > 355, in predict > return self._decision_function(X) > File > 
"/usr/local/lib/python3.9/dist-packages/sklearn/linear_model/_base.py", line > 338, in _decision_function > X = self._validate_data(X, accept_sparse=["csr", "csc", "coo"], > reset=False) > File "/usr/local/lib/python3.9/dist-packages/sklearn/base.py", line > 518, in _validate_data > self._check_feature_names(X, reset=reset) > File "/usr/local/lib/python3.9/dist-packages/sklearn/base.py", line > 451, in _check_feature_names > raise ValueError(message) >
[jira] [Assigned] (SPARK-41950) mlflow doctest fails for pandas API on Spark
[ https://issues.apache.org/jira/browse/SPARK-41950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41950: Assignee: (was: Apache Spark) > mlflow doctest fails for pandas API on SPark > > > Key: SPARK-41950 > URL: https://issues.apache.org/jira/browse/SPARK-41950 > Project: Spark > Issue Type: Test > Components: Pandas API on Spark >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Priority: Major > > {code} > File "/__w/spark/spark/python/pyspark/pandas/mlflow.py", line 172, in > pyspark.pandas.mlflow.load_model > Failed example: > prediction_df > Exception raised: > Traceback (most recent call last): > File "/usr/lib/python3.9/doctest.py", line 1336, in __run > exec(compile(example.source, filename, "single", > File "", line 1, in > > prediction_df > File "/__w/spark/spark/python/pyspark/pandas/frame.py", line 13322, in > __repr__ > pdf = cast("DataFrame", > self._get_or_create_repr_pandas_cache(max_display_count)) > File "/__w/spark/spark/python/pyspark/pandas/frame.py", line 13313, in > _get_or_create_repr_pandas_cache > self, "_repr_pandas_cache", {n: self.head(n + > 1)._to_internal_pandas()} > File "/__w/spark/spark/python/pyspark/pandas/frame.py", line 13308, in > _to_internal_pandas > return self._internal.to_pandas_frame > File "/__w/spark/spark/python/pyspark/pandas/utils.py", line 588, in > wrapped_lazy_property > setattr(self, attr_name, fn(self)) > File "/__w/spark/spark/python/pyspark/pandas/internal.py", line 1056, > in to_pandas_frame > pdf = sdf.toPandas() > File "/__w/spark/spark/python/pyspark/sql/pandas/conversion.py", line > 208, in toPandas > pdf = pd.DataFrame.from_records(self.collect(), columns=self.columns) > File "/__w/spark/spark/python/pyspark/sql/dataframe.py", line 1197, in > collect > sock_info = self._jdf.collectToPython() > File > "/__w/spark/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py", > line 1322, in __call__ > return_value = get_return_value( > File 
"/__w/spark/spark/python/pyspark/sql/utils.py", line 209, in deco > raise converted from None > pyspark.sql.utils.PythonException: > An exception was thrown from the Python worker. Please see the stack > trace below. > Traceback (most recent call last): > File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line > 829, in main > process() > File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line > 821, in process > serializer.dump_stream(out_iter, outfile) > File > "/__w/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", > line 345, in dump_stream > return ArrowStreamSerializer.dump_stream(self, > init_stream_yield_batches(), stream) > File > "/__w/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", > line 86, in dump_stream > for batch in iterator: > File > "/__w/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", > line 338, in init_stream_yield_batches > for series in iterator: > File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line > 519, in func > for result_batch, result_type in result_iter: > File > "/usr/local/lib/python3.9/dist-packages/mlflow/pyfunc/__init__.py", line > 1253, in udf > yield _predict_row_batch(batch_predict_fn, row_batch_args) > File > "/usr/local/lib/python3.9/dist-packages/mlflow/pyfunc/__init__.py", line > 1057, in _predict_row_batch > result = predict_fn(pdf) > File > "/usr/local/lib/python3.9/dist-packages/mlflow/pyfunc/__init__.py", line > 1237, in batch_predict_fn > return loaded_model.predict(pdf) > File > "/usr/local/lib/python3.9/dist-packages/mlflow/pyfunc/__init__.py", line 413, > in predict > return self._predict_fn(data) > File > "/usr/local/lib/python3.9/dist-packages/sklearn/linear_model/_base.py", line > 355, in predict > return self._decision_function(X) > File > "/usr/local/lib/python3.9/dist-packages/sklearn/linear_model/_base.py", line > 338, in _decision_function > X = self._validate_data(X, accept_sparse=["csr", 
"csc", "coo"], > reset=False) > File "/usr/local/lib/python3.9/dist-packages/sklearn/base.py", line > 518, in _validate_data > self._check_feature_names(X, reset=reset) > File "/usr/local/lib/python3.9/dist-packages/sklearn/base.py", line > 451, in _check_feature_names > raise ValueError(message) > ValueError: The feature names should match those that were passed during > fit. > Feature names unseen
[jira] [Updated] (SPARK-41950) mlflow doctest fails for pandas API on Spark
[ https://issues.apache.org/jira/browse/SPARK-41950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-41950: - Issue Type: Test (was: Bug) > mlflow doctest fails for pandas API on SPark > > > Key: SPARK-41950 > URL: https://issues.apache.org/jira/browse/SPARK-41950 > Project: Spark > Issue Type: Test > Components: Pandas API on Spark >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Priority: Major > > {code} > File "/__w/spark/spark/python/pyspark/pandas/mlflow.py", line 172, in > pyspark.pandas.mlflow.load_model > Failed example: > prediction_df > Exception raised: > Traceback (most recent call last): > File "/usr/lib/python3.9/doctest.py", line 1336, in __run > exec(compile(example.source, filename, "single", > File "", line 1, in > > prediction_df > File "/__w/spark/spark/python/pyspark/pandas/frame.py", line 13322, in > __repr__ > pdf = cast("DataFrame", > self._get_or_create_repr_pandas_cache(max_display_count)) > File "/__w/spark/spark/python/pyspark/pandas/frame.py", line 13313, in > _get_or_create_repr_pandas_cache > self, "_repr_pandas_cache", {n: self.head(n + > 1)._to_internal_pandas()} > File "/__w/spark/spark/python/pyspark/pandas/frame.py", line 13308, in > _to_internal_pandas > return self._internal.to_pandas_frame > File "/__w/spark/spark/python/pyspark/pandas/utils.py", line 588, in > wrapped_lazy_property > setattr(self, attr_name, fn(self)) > File "/__w/spark/spark/python/pyspark/pandas/internal.py", line 1056, > in to_pandas_frame > pdf = sdf.toPandas() > File "/__w/spark/spark/python/pyspark/sql/pandas/conversion.py", line > 208, in toPandas > pdf = pd.DataFrame.from_records(self.collect(), columns=self.columns) > File "/__w/spark/spark/python/pyspark/sql/dataframe.py", line 1197, in > collect > sock_info = self._jdf.collectToPython() > File > "/__w/spark/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py", > line 1322, in __call__ > return_value = get_return_value( > File 
"/__w/spark/spark/python/pyspark/sql/utils.py", line 209, in deco > raise converted from None > pyspark.sql.utils.PythonException: > An exception was thrown from the Python worker. Please see the stack > trace below. > Traceback (most recent call last): > File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line > 829, in main > process() > File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line > 821, in process > serializer.dump_stream(out_iter, outfile) > File > "/__w/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", > line 345, in dump_stream > return ArrowStreamSerializer.dump_stream(self, > init_stream_yield_batches(), stream) > File > "/__w/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", > line 86, in dump_stream > for batch in iterator: > File > "/__w/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", > line 338, in init_stream_yield_batches > for series in iterator: > File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line > 519, in func > for result_batch, result_type in result_iter: > File > "/usr/local/lib/python3.9/dist-packages/mlflow/pyfunc/__init__.py", line > 1253, in udf > yield _predict_row_batch(batch_predict_fn, row_batch_args) > File > "/usr/local/lib/python3.9/dist-packages/mlflow/pyfunc/__init__.py", line > 1057, in _predict_row_batch > result = predict_fn(pdf) > File > "/usr/local/lib/python3.9/dist-packages/mlflow/pyfunc/__init__.py", line > 1237, in batch_predict_fn > return loaded_model.predict(pdf) > File > "/usr/local/lib/python3.9/dist-packages/mlflow/pyfunc/__init__.py", line 413, > in predict > return self._predict_fn(data) > File > "/usr/local/lib/python3.9/dist-packages/sklearn/linear_model/_base.py", line > 355, in predict > return self._decision_function(X) > File > "/usr/local/lib/python3.9/dist-packages/sklearn/linear_model/_base.py", line > 338, in _decision_function > X = self._validate_data(X, accept_sparse=["csr", 
"csc", "coo"], > reset=False) > File "/usr/local/lib/python3.9/dist-packages/sklearn/base.py", line > 518, in _validate_data > self._check_feature_names(X, reset=reset) > File "/usr/local/lib/python3.9/dist-packages/sklearn/base.py", line > 451, in _check_feature_names > raise ValueError(message) > ValueError: The feature names should match those that were passed during > fit. > Feature names unseen at fit time:
[jira] [Created] (SPARK-41950) mlflow doctest fails for pandas API on Spark
Hyukjin Kwon created SPARK-41950: Summary: mlflow doctest fails for pandas API on SPark Key: SPARK-41950 URL: https://issues.apache.org/jira/browse/SPARK-41950 Project: Spark Issue Type: Bug Components: Pandas API on Spark Affects Versions: 3.4.0 Reporter: Hyukjin Kwon {code} File "/__w/spark/spark/python/pyspark/pandas/mlflow.py", line 172, in pyspark.pandas.mlflow.load_model Failed example: prediction_df Exception raised: Traceback (most recent call last): File "/usr/lib/python3.9/doctest.py", line 1336, in __run exec(compile(example.source, filename, "single", File "", line 1, in prediction_df File "/__w/spark/spark/python/pyspark/pandas/frame.py", line 13322, in __repr__ pdf = cast("DataFrame", self._get_or_create_repr_pandas_cache(max_display_count)) File "/__w/spark/spark/python/pyspark/pandas/frame.py", line 13313, in _get_or_create_repr_pandas_cache self, "_repr_pandas_cache", {n: self.head(n + 1)._to_internal_pandas()} File "/__w/spark/spark/python/pyspark/pandas/frame.py", line 13308, in _to_internal_pandas return self._internal.to_pandas_frame File "/__w/spark/spark/python/pyspark/pandas/utils.py", line 588, in wrapped_lazy_property setattr(self, attr_name, fn(self)) File "/__w/spark/spark/python/pyspark/pandas/internal.py", line 1056, in to_pandas_frame pdf = sdf.toPandas() File "/__w/spark/spark/python/pyspark/sql/pandas/conversion.py", line 208, in toPandas pdf = pd.DataFrame.from_records(self.collect(), columns=self.columns) File "/__w/spark/spark/python/pyspark/sql/dataframe.py", line 1197, in collect sock_info = self._jdf.collectToPython() File "/__w/spark/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py", line 1322, in __call__ return_value = get_return_value( File "/__w/spark/spark/python/pyspark/sql/utils.py", line 209, in deco raise converted from None pyspark.sql.utils.PythonException: An exception was thrown from the Python worker. Please see the stack trace below. 
Traceback (most recent call last): File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 829, in main process() File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 821, in process serializer.dump_stream(out_iter, outfile) File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", line 345, in dump_stream return ArrowStreamSerializer.dump_stream(self, init_stream_yield_batches(), stream) File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", line 86, in dump_stream for batch in iterator: File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", line 338, in init_stream_yield_batches for series in iterator: File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 519, in func for result_batch, result_type in result_iter: File "/usr/local/lib/python3.9/dist-packages/mlflow/pyfunc/__init__.py", line 1253, in udf yield _predict_row_batch(batch_predict_fn, row_batch_args) File "/usr/local/lib/python3.9/dist-packages/mlflow/pyfunc/__init__.py", line 1057, in _predict_row_batch result = predict_fn(pdf) File "/usr/local/lib/python3.9/dist-packages/mlflow/pyfunc/__init__.py", line 1237, in batch_predict_fn return loaded_model.predict(pdf) File "/usr/local/lib/python3.9/dist-packages/mlflow/pyfunc/__init__.py", line 413, in predict return self._predict_fn(data) File "/usr/local/lib/python3.9/dist-packages/sklearn/linear_model/_base.py", line 355, in predict return self._decision_function(X) File "/usr/local/lib/python3.9/dist-packages/sklearn/linear_model/_base.py", line 338, in _decision_function X = self._validate_data(X, accept_sparse=["csr", "csc", "coo"], reset=False) File "/usr/local/lib/python3.9/dist-packages/sklearn/base.py", line 518, in _validate_data self._check_feature_names(X, reset=reset) File "/usr/local/lib/python3.9/dist-packages/sklearn/base.py", line 451, in _check_feature_names raise ValueError(message) ValueError: The feature names 
should match those that were passed during fit. Feature names unseen at fit time: - 0 - 1 Feature names seen at fit time, yet now missing: - x1 - x2 {code} https://github.com/apache/spark/actions/runs/3871715040/jobs/6600578830 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
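The traceback above bottoms out in scikit-learn's feature-name validation: the model was fit on a frame with columns `x1` and `x2`, but the mlflow UDF handed it a frame whose columns were `0` and `1`. A plain-Python sketch of that check (not sklearn's actual implementation) reproduces the shape of the failure:

```python
# Plain-Python sketch (not sklearn's real code) of the feature-name check
# that raises above: fit-time names ["x1", "x2"] vs predict-time names [0, 1].
def check_feature_names(fit_names, predict_names):
    if list(fit_names) != list(predict_names):
        unseen = [c for c in predict_names if c not in fit_names]
        missing = [c for c in fit_names if c not in predict_names]
        raise ValueError(
            "The feature names should match those that were passed during fit. "
            f"Feature names unseen at fit time: {unseen}. "
            f"Feature names seen at fit time, yet now missing: {missing}."
        )

try:
    check_feature_names(["x1", "x2"], [0, 1])
except ValueError as e:
    print(e)  # mirrors the doctest failure message
```

The fix is on the doctest side: ensure the prediction frame carries the same column names the estimator saw at fit time.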
[jira] [Assigned] (SPARK-41944) Pass configurations when local remote mode is on
[ https://issues.apache.org/jira/browse/SPARK-41944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-41944: Assignee: Hyukjin Kwon > Pass configurations when local remote mode is on > > > Key: SPARK-41944 > URL: https://issues.apache.org/jira/browse/SPARK-41944 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > > The local remote mode does not currently pass the configurations properly to the > server if they are specified. We should pass them properly. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-41944) Pass configurations when local remote mode is on
[ https://issues.apache.org/jira/browse/SPARK-41944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-41944. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 39463 [https://github.com/apache/spark/pull/39463] > Pass configurations when local remote mode is on > > > Key: SPARK-41944 > URL: https://issues.apache.org/jira/browse/SPARK-41944 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 3.4.0 > > > The local remote mode does not currently pass the configurations properly to the > server if they are specified. We should pass them properly. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41945) Python: connect client lost column data with pyarrow.Table.to_pylist
[ https://issues.apache.org/jira/browse/SPARK-41945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng reassigned SPARK-41945: - Assignee: jiaan.geng > Python: connect client lost column data with pyarrow.Table.to_pylist > > > Key: SPARK-41945 > URL: https://issues.apache.org/jira/browse/SPARK-41945 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: jiaan.geng >Assignee: jiaan.geng >Priority: Major > > Python: the connect client should not use pyarrow.Table.to_pylist to transform > fetched data. > For example, > the data in the pyarrow.Table is shown below. > {code:java} > pyarrow.Table > key: string > order: int64 > nth_value(value, 2) OVER (PARTITION BY key ORDER BY order ASC NULLS FIRST > RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW): string > nth_value(value, 2) OVER (PARTITION BY key ORDER BY order ASC NULLS FIRST > RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW): string > nth_value(value, 2) ignore nulls OVER (PARTITION BY key ORDER BY order ASC > NULLS FIRST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW): string > > key: [["a","a","a","a","a","b","b"]] > order: [[0,1,2,3,4,1,2]] > nth_value(value, 2) OVER (PARTITION BY key ORDER BY order ASC NULLS FIRST > RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW): > [[null,"x","x","x","x",null,null]] > nth_value(value, 2) OVER (PARTITION BY key ORDER BY order ASC NULLS FIRST > RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW): > [[null,"x","x","x","x",null,null]] > nth_value(value, 2) ignore nulls OVER (PARTITION BY key ORDER BY order ASC > NULLS FIRST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW): > [[null,null,"y","y","y",null,null]] > {code} > The table has five columns, as shown above. > But the data after calling pyarrow.Table.to_pylist() is shown below. 
> {code:java} > [{ > 'key': 'a', > 'order': 0, > 'nth_value(value, 2) OVER (PARTITION BY key ORDER BY order ASC NULLS > FIRST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)': None, > 'nth_value(value, 2) ignore nulls OVER (PARTITION BY key ORDER BY order > ASC NULLS FIRST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)': None > }, { > 'key': 'a', > 'order': 1, > 'nth_value(value, 2) OVER (PARTITION BY key ORDER BY order ASC NULLS > FIRST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)': 'x', > 'nth_value(value, 2) ignore nulls OVER (PARTITION BY key ORDER BY order > ASC NULLS FIRST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)': None > }, { > 'key': 'a', > 'order': 2, > 'nth_value(value, 2) OVER (PARTITION BY key ORDER BY order ASC NULLS > FIRST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)': 'x', > 'nth_value(value, 2) ignore nulls OVER (PARTITION BY key ORDER BY order > ASC NULLS FIRST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)': 'y' > }, { > 'key': 'a', > 'order': 3, > 'nth_value(value, 2) OVER (PARTITION BY key ORDER BY order ASC NULLS > FIRST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)': 'x', > 'nth_value(value, 2) ignore nulls OVER (PARTITION BY key ORDER BY order > ASC NULLS FIRST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)': 'y' > }, { > 'key': 'a', > 'order': 4, > 'nth_value(value, 2) OVER (PARTITION BY key ORDER BY order ASC NULLS > FIRST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)': 'x', > 'nth_value(value, 2) ignore nulls OVER (PARTITION BY key ORDER BY order > ASC NULLS FIRST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)': 'y' > }, { > 'key': 'b', > 'order': 1, > 'nth_value(value, 2) OVER (PARTITION BY key ORDER BY order ASC NULLS > FIRST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)': None, > 'nth_value(value, 2) ignore nulls OVER (PARTITION BY key ORDER BY order > ASC NULLS FIRST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)': None > }, { > 'key': 'b', > 'order': 2, > 'nth_value(value, 
2) OVER (PARTITION BY key ORDER BY order ASC NULLS > FIRST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)': None, > 'nth_value(value, 2) ignore nulls OVER (PARTITION BY key ORDER BY order > ASC NULLS FIRST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)': None > }] > {code} > There are only four columns left. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
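[Editor's note] The root cause is that pyarrow.Table.to_pylist materializes each row as a Python dict keyed by column name, so the two identically named `nth_value(value, 2) OVER (...)` columns collide on one dict key and a column is lost. A pure-Python sketch of the mechanism (the helper name and row data are illustrative, not from pyarrow):

```python
# Each row is a list of (column_name, value) cells; two cells share a name,
# mirroring the duplicate nth_value(...) columns in the report above.
rows = [
    [("key", "a"), ("nth_value", None), ("nth_value", "x"), ("order", 0)],
]

def to_pylist_like(rows):
    # Mimics per-row dict construction: a later duplicate key silently
    # overwrites the earlier one, so one column's data disappears.
    return [dict(cells) for cells in rows]

result = to_pylist_like(rows)
print(len(result[0]))  # 3 keys survive from 4 columns
```

This is why a column-name-keyed transformation cannot round-trip an Arrow table whose schema allows duplicate field names.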
[jira] [Resolved] (SPARK-41945) Python: connect client lost column data with pyarrow.Table.to_pylist
[ https://issues.apache.org/jira/browse/SPARK-41945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng resolved SPARK-41945. --- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 39461 [https://github.com/apache/spark/pull/39461] > Python: connect client lost column data with pyarrow.Table.to_pylist > > > Key: SPARK-41945 > URL: https://issues.apache.org/jira/browse/SPARK-41945 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: jiaan.geng >Assignee: jiaan.geng >Priority: Major > Fix For: 3.4.0 > > > Python: connect client should not use pyarrow.Table.to_pylist to transform > fetched data. > For example: > the data in pyarrow.Table show below. > {code:java} > pyarrow.Table > key: string > order: int64 > nth_value(value, 2) OVER (PARTITION BY key ORDER BY order ASC NULLS FIRST > RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW): string > nth_value(value, 2) OVER (PARTITION BY key ORDER BY order ASC NULLS FIRST > RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW): string > nth_value(value, 2) ignore nulls OVER (PARTITION BY key ORDER BY order ASC > NULLS FIRST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW): string > > key: [["a","a","a","a","a","b","b"]] > order: [[0,1,2,3,4,1,2]] > nth_value(value, 2) OVER (PARTITION BY key ORDER BY order ASC NULLS FIRST > RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW): > [[null,"x","x","x","x",null,null]] > nth_value(value, 2) OVER (PARTITION BY key ORDER BY order ASC NULLS FIRST > RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW): > [[null,"x","x","x","x",null,null]] > nth_value(value, 2) ignore nulls OVER (PARTITION BY key ORDER BY order ASC > NULLS FIRST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW): > [[null,null,"y","y","y",null,null]] > {code} > The table have five columns show above. > But the data after call pyarrow.Table.to_pylist() show below. 
> {code:java} > [{ > 'key': 'a', > 'order': 0, > 'nth_value(value, 2) OVER (PARTITION BY key ORDER BY order ASC NULLS > FIRST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)': None, > 'nth_value(value, 2) ignore nulls OVER (PARTITION BY key ORDER BY order > ASC NULLS FIRST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)': None > }, { > 'key': 'a', > 'order': 1, > 'nth_value(value, 2) OVER (PARTITION BY key ORDER BY order ASC NULLS > FIRST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)': 'x', > 'nth_value(value, 2) ignore nulls OVER (PARTITION BY key ORDER BY order > ASC NULLS FIRST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)': None > }, { > 'key': 'a', > 'order': 2, > 'nth_value(value, 2) OVER (PARTITION BY key ORDER BY order ASC NULLS > FIRST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)': 'x', > 'nth_value(value, 2) ignore nulls OVER (PARTITION BY key ORDER BY order > ASC NULLS FIRST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)': 'y' > }, { > 'key': 'a', > 'order': 3, > 'nth_value(value, 2) OVER (PARTITION BY key ORDER BY order ASC NULLS > FIRST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)': 'x', > 'nth_value(value, 2) ignore nulls OVER (PARTITION BY key ORDER BY order > ASC NULLS FIRST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)': 'y' > }, { > 'key': 'a', > 'order': 4, > 'nth_value(value, 2) OVER (PARTITION BY key ORDER BY order ASC NULLS > FIRST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)': 'x', > 'nth_value(value, 2) ignore nulls OVER (PARTITION BY key ORDER BY order > ASC NULLS FIRST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)': 'y' > }, { > 'key': 'b', > 'order': 1, > 'nth_value(value, 2) OVER (PARTITION BY key ORDER BY order ASC NULLS > FIRST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)': None, > 'nth_value(value, 2) ignore nulls OVER (PARTITION BY key ORDER BY order > ASC NULLS FIRST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)': None > }, { > 'key': 'b', > 'order': 2, > 'nth_value(value, 
2) OVER (PARTITION BY key ORDER BY order ASC NULLS > FIRST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)': None, > 'nth_value(value, 2) ignore nulls OVER (PARTITION BY key ORDER BY order > ASC NULLS FIRST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)': None > }] > {code} > There are only four columns left.
[jira] [Updated] (SPARK-41842) Support data type Timestamp(NANOSECOND, null)
[ https://issues.apache.org/jira/browse/SPARK-41842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-41842: - Fix Version/s: (was: 3.4.0) > Support data type Timestamp(NANOSECOND, null) > - > > Key: SPARK-41842 > URL: https://issues.apache.org/jira/browse/SPARK-41842 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Sandeep Singh >Assignee: Ruifeng Zheng >Priority: Major > > {code:java} > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", > line 1966, in pyspark.sql.connect.functions.hour > Failed example: > df.select(hour('ts').alias('hour')).collect() > Exception raised: > Traceback (most recent call last): > File > "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", > line 1350, in __run > exec(compile(example.source, filename, "single", > File "", line 1, in > > df.select(hour('ts').alias('hour')).collect() > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", > line 1017, in collect > pdf = self.toPandas() > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", > line 1031, in toPandas > return self._session.client.to_pandas(query) > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", > line 413, in to_pandas > return self._execute_and_fetch(req) > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", > line 573, in _execute_and_fetch > self._handle_error(rpc_error) > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", > line 623, in _handle_error > raise SparkConnectException(status.message, info.reason) from None > pyspark.sql.connect.client.SparkConnectException: > (org.apache.spark.SparkUnsupportedOperationException) Unsupported data type: > Timestamp(NANOSECOND, null){code} -- This message was sent by Atlassian Jira 
(v8.20.10#820010)
[jira] [Reopened] (SPARK-41842) Support data type Timestamp(NANOSECOND, null)
[ https://issues.apache.org/jira/browse/SPARK-41842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reopened SPARK-41842: -- Let me reopen this since https://github.com/apache/spark/blob/56c7cf33929d7d42b7d299c0bb7e895963241214/python/pyspark/sql/tests/connect/test_parity_dataframe.py#L40-L48 doesn't pass yet. > Support data type Timestamp(NANOSECOND, null) > - > > Key: SPARK-41842 > URL: https://issues.apache.org/jira/browse/SPARK-41842 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Sandeep Singh >Assignee: Ruifeng Zheng >Priority: Major > Fix For: 3.4.0 > > > {code:java} > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", > line 1966, in pyspark.sql.connect.functions.hour > Failed example: > df.select(hour('ts').alias('hour')).collect() > Exception raised: > Traceback (most recent call last): > File > "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", > line 1350, in __run > exec(compile(example.source, filename, "single", > File "", line 1, in > > df.select(hour('ts').alias('hour')).collect() > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", > line 1017, in collect > pdf = self.toPandas() > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", > line 1031, in toPandas > return self._session.client.to_pandas(query) > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", > line 413, in to_pandas > return self._execute_and_fetch(req) > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", > line 573, in _execute_and_fetch > self._handle_error(rpc_error) > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", > line 623, in _handle_error > raise SparkConnectException(status.message, info.reason) from None > 
pyspark.sql.connect.client.SparkConnectException: > (org.apache.spark.SparkUnsupportedOperationException) Unsupported data type: > Timestamp(NANOSECOND, null){code}
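[Editor's note] Arrow can carry nanosecond-precision timestamps, while Spark's TimestampType (like Python's stdlib datetime) resolves only to microseconds, so a Timestamp(NANOSECOND) value must be truncated or rejected on conversion. A stdlib-only sketch of the precision gap; the conversion shown is illustrative, not Spark Connect's actual code path:

```python
from datetime import datetime, timezone

# Nanoseconds since the epoch, as an Arrow Timestamp(NANOSECOND) stores them.
ns = 1_673_240_873_123_456_789

# Split into microseconds plus the sub-microsecond remainder that a
# microsecond-precision timestamp type simply cannot represent.
us, lost_ns = divmod(ns, 1_000)
sec, micro = divmod(us, 1_000_000)
dt = datetime.fromtimestamp(sec, tz=timezone.utc).replace(microsecond=micro)
print(dt.isoformat(), "- lost", lost_ns, "ns")
```

Supporting the type therefore means deciding where that remainder goes (truncate, round, or error), which is what makes this more than a plumbing change.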
[jira] [Commented] (SPARK-41741) [SQL] ParquetFilters StringStartsWith push down matching string do not use UTF-8
[ https://issues.apache.org/jira/browse/SPARK-41741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17656023#comment-17656023 ] Jiale He commented on SPARK-41741: -- [~yumwang] Do you mean this? !image-2023-01-09-18-27-53-479.png|width=967,height=519! > [SQL] ParquetFilters StringStartsWith push down matching string do not use > UTF-8 > > > Key: SPARK-41741 > URL: https://issues.apache.org/jira/browse/SPARK-41741 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Jiale He >Priority: Major > Attachments: image-2022-12-28-18-00-00-861.png, > image-2022-12-28-18-00-21-586.png, image-2023-01-09-11-10-31-262.png, > image-2023-01-09-18-27-53-479.png, > part-0-30432312-7cdb-43ef-befe-93bcfd174878-c000.snappy.parquet > > > Hello ~ > > I found a problem, but there are two ways to solve it. > > The parquet filter is pushed down. When using the like '***%' statement to > query, if the system default encoding is not UTF-8, it may cause an error. > > There are two ways to bypass this problem as far as I know > 1. spark.executor.extraJavaOptions="-Dfile.encoding=UTF-8" > 2. spark.sql.parquet.filterPushdown.string.startsWith=false > > The following is the information to reproduce this problem > The parquet sample file is in the attachment > {code:java} > spark.read.parquet("file:///home/kylin/hjldir/part-0-30432312-7cdb-43ef-befe-93bcfd174878-c000.snappy.parquet").createTempView("tmp”) > spark.sql("select * from tmp where `1` like '啦啦乐乐%'").show(false) {code} > !image-2022-12-28-18-00-00-861.png|width=879,height=430! > > !image-2022-12-28-18-00-21-586.png|width=799,height=731! > > I think the correct code should be: > {code:java} > private val strToBinary = > Binary.fromReusedByteArray(v.getBytes(StandardCharsets.UTF_8)) {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-41741) [SQL] ParquetFilters StringStartsWith push down matching string do not use UTF-8
[ https://issues.apache.org/jira/browse/SPARK-41741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jiale He updated SPARK-41741: - Attachment: image-2023-01-09-18-27-53-479.png > [SQL] ParquetFilters StringStartsWith push down matching string do not use > UTF-8 > > > Key: SPARK-41741 > URL: https://issues.apache.org/jira/browse/SPARK-41741 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Jiale He >Priority: Major > Attachments: image-2022-12-28-18-00-00-861.png, > image-2022-12-28-18-00-21-586.png, image-2023-01-09-11-10-31-262.png, > image-2023-01-09-18-27-53-479.png, > part-0-30432312-7cdb-43ef-befe-93bcfd174878-c000.snappy.parquet > > > Hello ~ > > I found a problem; there are two ways to work around it. > > When a Parquet filter for a like '***%' query is pushed down and the system default > encoding is not UTF-8, the query may fail. > > As far as I know, there are two ways to bypass this problem: > 1. spark.executor.extraJavaOptions="-Dfile.encoding=UTF-8" > 2. spark.sql.parquet.filterPushdown.string.startsWith=false > > The following is the information to reproduce this problem. > The Parquet sample file is in the attachment. > {code:java} > spark.read.parquet("file:///home/kylin/hjldir/part-0-30432312-7cdb-43ef-befe-93bcfd174878-c000.snappy.parquet").createTempView("tmp") > spark.sql("select * from tmp where `1` like '啦啦乐乐%'").show(false) {code} > !image-2022-12-28-18-00-00-861.png|width=879,height=430! > > !image-2022-12-28-18-00-21-586.png|width=799,height=731! > > I think the correct code should be: > {code:java} > private val strToBinary = > Binary.fromReusedByteArray(v.getBytes(StandardCharsets.UTF_8)) {code}
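[Editor's note] The proposed fix passes StandardCharsets.UTF_8 explicitly because Parquet stores strings as UTF-8 bytes, while String.getBytes() with no argument uses the JVM's platform default charset. A pure-Python illustration of why a wrong-charset prefix never matches (GBK stands in for a non-UTF-8 platform default; this mirrors the byte comparison, not Spark's actual ParquetFilters code):

```python
# The pushdown prefix from the query: like '啦啦乐乐%'
prefix = "啦啦乐乐"
utf8_bytes = prefix.encode("utf-8")
gbk_bytes = prefix.encode("gbk")  # what a GBK platform default would produce

# What the Parquet file actually holds: UTF-8 bytes of a matching value.
stored = "啦啦乐乐真棒".encode("utf-8")

print(stored.startswith(utf8_bytes))  # True: prefix encoded correctly
print(stored.startswith(gbk_bytes))   # False: wrong charset, filter drops the row
```

Since the comparison happens on raw bytes inside the row-group filter, the mismatch silently prunes data rather than producing an obvious encoding error.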
[jira] [Assigned] (SPARK-41949) Make stage scheduling support local-cluster mode
[ https://issues.apache.org/jira/browse/SPARK-41949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41949: Assignee: (was: Apache Spark) > Make stage scheduling support local-cluster mode > > > Key: SPARK-41949 > URL: https://issues.apache.org/jira/browse/SPARK-41949 > Project: Spark > Issue Type: Improvement > Components: PySpark, Spark Core >Affects Versions: 3.4.0 >Reporter: Weichen Xu >Priority: Major > > Make stage scheduling support local-cluster mode. > This is useful in testing, especially for test code of third-party python > libraries that depends on pyspark, many tests are written with pytest, but > pytest is hard to integrate with a standalone spark cluster. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41949) Make stage scheduling support local-cluster mode
[ https://issues.apache.org/jira/browse/SPARK-41949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17655989#comment-17655989 ] Apache Spark commented on SPARK-41949: -- User 'WeichenXu123' has created a pull request for this issue: https://github.com/apache/spark/pull/39424 > Make stage scheduling support local-cluster mode > > > Key: SPARK-41949 > URL: https://issues.apache.org/jira/browse/SPARK-41949 > Project: Spark > Issue Type: Improvement > Components: PySpark, Spark Core >Affects Versions: 3.4.0 >Reporter: Weichen Xu >Priority: Major > > Make stage scheduling support local-cluster mode. > This is useful in testing, especially for test code of third-party python > libraries that depends on pyspark, many tests are written with pytest, but > pytest is hard to integrate with a standalone spark cluster. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41949) Make stage scheduling support local-cluster mode
[ https://issues.apache.org/jira/browse/SPARK-41949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41949: Assignee: Apache Spark > Make stage scheduling support local-cluster mode > > > Key: SPARK-41949 > URL: https://issues.apache.org/jira/browse/SPARK-41949 > Project: Spark > Issue Type: Improvement > Components: PySpark, Spark Core >Affects Versions: 3.4.0 >Reporter: Weichen Xu >Assignee: Apache Spark >Priority: Major > > Make stage scheduling support local-cluster mode. > This is useful in testing, especially for test code of third-party python > libraries that depends on pyspark, many tests are written with pytest, but > pytest is hard to integrate with a standalone spark cluster. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-41949) Make stage scheduling support local-cluster mode
Weichen Xu created SPARK-41949: -- Summary: Make stage scheduling support local-cluster mode Key: SPARK-41949 URL: https://issues.apache.org/jira/browse/SPARK-41949 Project: Spark Issue Type: Improvement Components: PySpark, Spark Core Affects Versions: 3.4.0 Reporter: Weichen Xu Make stage scheduling support local-cluster mode. This is useful in testing, especially for third-party Python libraries that depend on pyspark: many of their tests are written with pytest, and pytest is hard to integrate with a standalone Spark cluster.
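[Editor's note] Spark's local-cluster test mode uses the documented master URL form "local-cluster[numWorkers, coresPerWorker, memoryPerWorkerMB]". A sketch of wiring it into a pytest fixture, the setup this issue targets; the helper and fixture names are illustrative, and the commented part assumes pyspark is installed:

```python
def local_cluster_master(workers: int, cores: int, mem_mb: int) -> str:
    # Builds Spark's local-cluster master URL, e.g. "local-cluster[2,1,1024]".
    return f"local-cluster[{workers},{cores},{mem_mb}]"

# A pyspark test suite might then use it like this:
#
# import pytest
# from pyspark.sql import SparkSession
#
# @pytest.fixture(scope="session")
# def spark():
#     session = (SparkSession.builder
#                .master(local_cluster_master(2, 1, 1024))
#                .getOrCreate())
#     yield session
#     session.stop()

print(local_cluster_master(2, 1, 1024))
```

Unlike plain local[N] mode, local-cluster launches real worker JVMs, so multi-executor stage scheduling behavior can be exercised without standing up a standalone cluster.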
[jira] [Commented] (SPARK-41948) Fix NPE for error classes: CANNOT_PARSE_JSON_FIELD
[ https://issues.apache.org/jira/browse/SPARK-41948?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17655982#comment-17655982 ] Apache Spark commented on SPARK-41948: -- User 'panbingkun' has created a pull request for this issue: https://github.com/apache/spark/pull/39466 > Fix NPE for error classes: CANNOT_PARSE_JSON_FIELD > -- > > Key: SPARK-41948 > URL: https://issues.apache.org/jira/browse/SPARK-41948 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: BingKun Pan >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41948) Fix NPE for error classes: CANNOT_PARSE_JSON_FIELD
[ https://issues.apache.org/jira/browse/SPARK-41948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41948: Assignee: (was: Apache Spark) > Fix NPE for error classes: CANNOT_PARSE_JSON_FIELD > -- > > Key: SPARK-41948 > URL: https://issues.apache.org/jira/browse/SPARK-41948 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: BingKun Pan >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41948) Fix NPE for error classes: CANNOT_PARSE_JSON_FIELD
[ https://issues.apache.org/jira/browse/SPARK-41948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41948: Assignee: Apache Spark > Fix NPE for error classes: CANNOT_PARSE_JSON_FIELD > -- > > Key: SPARK-41948 > URL: https://issues.apache.org/jira/browse/SPARK-41948 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: BingKun Pan >Assignee: Apache Spark >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41780) `regexp_replace('', '[a\\\\d]{0, 2}', 'x')` causes an internal error
[ https://issues.apache.org/jira/browse/SPARK-41780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk reassigned SPARK-41780: Assignee: BingKun Pan > `regexp_replace('', '[ad]{0, 2}', 'x')` causes an internal error > > > Key: SPARK-41780 > URL: https://issues.apache.org/jira/browse/SPARK-41780 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0 > Environment: Spark3.3.0 local mode >Reporter: Remzi Yang >Assignee: BingKun Pan >Priority: Major > Attachments: image-2023-01-04-14-12-26-126.png > > > {code:scala} > scala> spark.sql("select regexp_replace('', '[ad]{0,2}', 'x')").show > +--+ > |regexp_replace(, [a\d]\{0,2}, x, 1)| > +--+ > | x| > +--+ > > > scala> spark.sql("select regexp_replace('', '[ad]{0, 2}', 'x')").show > org.apache.spark.SparkException: The Spark SQL phase optimization failed with > an internal error. Please, fill a bug report in, and provide the full stack > trace. > {code} > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-41780) `regexp_replace('', '[a\\\\d]{0, 2}', 'x')` causes an internal error
[ https://issues.apache.org/jira/browse/SPARK-41780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk resolved SPARK-41780. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 39383 [https://github.com/apache/spark/pull/39383] > `regexp_replace('', '[ad]{0, 2}', 'x')` causes an internal error > > > Key: SPARK-41780 > URL: https://issues.apache.org/jira/browse/SPARK-41780 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0 > Environment: Spark3.3.0 local mode >Reporter: Remzi Yang >Assignee: BingKun Pan >Priority: Major > Fix For: 3.4.0 > > Attachments: image-2023-01-04-14-12-26-126.png > > > {code:scala} > scala> spark.sql("select regexp_replace('', '[ad]{0,2}', 'x')").show > +--+ > |regexp_replace(, [a\d]\{0,2}, x, 1)| > +--+ > | x| > +--+ > > > scala> spark.sql("select regexp_replace('', '[ad]{0, 2}', 'x')").show > org.apache.spark.SparkException: The Spark SQL phase optimization failed with > an internal error. Please, fill a bug report in, and provide the full stack > trace. > {code} > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
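[Editor's note] The trigger is the space inside `{0, 2}`: with the space, the braces no longer form a valid bounded quantifier. Python's re module shows the analogous behavior by falling back to matching the malformed quantifier literally (Java's regex handling differs, which is where Spark hit the internal error; this is an analogy, not Spark's code path):

```python
import re

# Valid quantifier: a{0,2} matches zero to two 'a's, so it matches the
# empty string once and the substitution produces one replacement.
well_formed = re.sub(r"a{0,2}", "x", "")

# Malformed quantifier: the space makes '{0, 2}' invalid, and Python's re
# treats the braces as literal characters, so nothing matches '' at all.
malformed = re.sub(r"a{0, 2}", "x", "")

print(repr(well_formed), repr(malformed))
```

The fixed behavior in Spark should surface a proper user-facing error (or a defined fallback) for the malformed pattern rather than an optimizer-phase internal error.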
[jira] [Created] (SPARK-41948) Fix NPE for error classes: CANNOT_PARSE_JSON_FIELD
BingKun Pan created SPARK-41948: --- Summary: Fix NPE for error classes: CANNOT_PARSE_JSON_FIELD Key: SPARK-41948 URL: https://issues.apache.org/jira/browse/SPARK-41948 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.4.0 Reporter: BingKun Pan
[jira] (SPARK-41945) Python: connect client lost column data with pyarrow.Table.to_pylist
[ https://issues.apache.org/jira/browse/SPARK-41945 ] jiaan.geng deleted comment on SPARK-41945: was (Author: beliefer): I'm working on. > Python: connect client lost column data with pyarrow.Table.to_pylist > > > Key: SPARK-41945 > URL: https://issues.apache.org/jira/browse/SPARK-41945 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: jiaan.geng >Priority: Major > > Python: connect client should not use pyarrow.Table.to_pylist to transform > fetched data. > For example: > the data in pyarrow.Table show below. > {code:java} > pyarrow.Table > key: string > order: int64 > nth_value(value, 2) OVER (PARTITION BY key ORDER BY order ASC NULLS FIRST > RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW): string > nth_value(value, 2) OVER (PARTITION BY key ORDER BY order ASC NULLS FIRST > RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW): string > nth_value(value, 2) ignore nulls OVER (PARTITION BY key ORDER BY order ASC > NULLS FIRST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW): string > > key: [["a","a","a","a","a","b","b"]] > order: [[0,1,2,3,4,1,2]] > nth_value(value, 2) OVER (PARTITION BY key ORDER BY order ASC NULLS FIRST > RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW): > [[null,"x","x","x","x",null,null]] > nth_value(value, 2) OVER (PARTITION BY key ORDER BY order ASC NULLS FIRST > RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW): > [[null,"x","x","x","x",null,null]] > nth_value(value, 2) ignore nulls OVER (PARTITION BY key ORDER BY order ASC > NULLS FIRST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW): > [[null,null,"y","y","y",null,null]] > {code} > The table have five columns show above. > But the data after call pyarrow.Table.to_pylist() show below. 
> {code:java} > [{ > 'key': 'a', > 'order': 0, > 'nth_value(value, 2) OVER (PARTITION BY key ORDER BY order ASC NULLS > FIRST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)': None, > 'nth_value(value, 2) ignore nulls OVER (PARTITION BY key ORDER BY order > ASC NULLS FIRST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)': None > }, { > 'key': 'a', > 'order': 1, > 'nth_value(value, 2) OVER (PARTITION BY key ORDER BY order ASC NULLS > FIRST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)': 'x', > 'nth_value(value, 2) ignore nulls OVER (PARTITION BY key ORDER BY order > ASC NULLS FIRST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)': None > }, { > 'key': 'a', > 'order': 2, > 'nth_value(value, 2) OVER (PARTITION BY key ORDER BY order ASC NULLS > FIRST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)': 'x', > 'nth_value(value, 2) ignore nulls OVER (PARTITION BY key ORDER BY order > ASC NULLS FIRST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)': 'y' > }, { > 'key': 'a', > 'order': 3, > 'nth_value(value, 2) OVER (PARTITION BY key ORDER BY order ASC NULLS > FIRST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)': 'x', > 'nth_value(value, 2) ignore nulls OVER (PARTITION BY key ORDER BY order > ASC NULLS FIRST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)': 'y' > }, { > 'key': 'a', > 'order': 4, > 'nth_value(value, 2) OVER (PARTITION BY key ORDER BY order ASC NULLS > FIRST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)': 'x', > 'nth_value(value, 2) ignore nulls OVER (PARTITION BY key ORDER BY order > ASC NULLS FIRST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)': 'y' > }, { > 'key': 'b', > 'order': 1, > 'nth_value(value, 2) OVER (PARTITION BY key ORDER BY order ASC NULLS > FIRST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)': None, > 'nth_value(value, 2) ignore nulls OVER (PARTITION BY key ORDER BY order > ASC NULLS FIRST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)': None > }, { > 'key': 'b', > 'order': 2, > 'nth_value(value, 
2) OVER (PARTITION BY key ORDER BY order ASC NULLS > FIRST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)': None, > 'nth_value(value, 2) ignore nulls OVER (PARTITION BY key ORDER BY order > ASC NULLS FIRST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)': None > }] > {code} > There are only four columns left.
[jira] [Updated] (SPARK-41945) Python: connect client lost column data with pyarrow.Table.to_pylist
[ https://issues.apache.org/jira/browse/SPARK-41945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jiaan.geng updated SPARK-41945: --- Description: Python: connect client should not use pyarrow.Table.to_pylist to transform fetched data. For example: the data in pyarrow.Table show below. {code:java} pyarrow.Table key: string order: int64 nth_value(value, 2) OVER (PARTITION BY key ORDER BY order ASC NULLS FIRST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW): string nth_value(value, 2) OVER (PARTITION BY key ORDER BY order ASC NULLS FIRST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW): string nth_value(value, 2) ignore nulls OVER (PARTITION BY key ORDER BY order ASC NULLS FIRST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW): string key: [["a","a","a","a","a","b","b"]] order: [[0,1,2,3,4,1,2]] nth_value(value, 2) OVER (PARTITION BY key ORDER BY order ASC NULLS FIRST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW): [[null,"x","x","x","x",null,null]] nth_value(value, 2) OVER (PARTITION BY key ORDER BY order ASC NULLS FIRST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW): [[null,"x","x","x","x",null,null]] nth_value(value, 2) ignore nulls OVER (PARTITION BY key ORDER BY order ASC NULLS FIRST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW): [[null,null,"y","y","y",null,null]] {code} The table have five columns show above. But the data after call pyarrow.Table.to_pylist() show below. 
{code:java} [{ 'key': 'a', 'order': 0, 'nth_value(value, 2) OVER (PARTITION BY key ORDER BY order ASC NULLS FIRST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)': None, 'nth_value(value, 2) ignore nulls OVER (PARTITION BY key ORDER BY order ASC NULLS FIRST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)': None }, { 'key': 'a', 'order': 1, 'nth_value(value, 2) OVER (PARTITION BY key ORDER BY order ASC NULLS FIRST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)': 'x', 'nth_value(value, 2) ignore nulls OVER (PARTITION BY key ORDER BY order ASC NULLS FIRST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)': None }, { 'key': 'a', 'order': 2, 'nth_value(value, 2) OVER (PARTITION BY key ORDER BY order ASC NULLS FIRST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)': 'x', 'nth_value(value, 2) ignore nulls OVER (PARTITION BY key ORDER BY order ASC NULLS FIRST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)': 'y' }, { 'key': 'a', 'order': 3, 'nth_value(value, 2) OVER (PARTITION BY key ORDER BY order ASC NULLS FIRST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)': 'x', 'nth_value(value, 2) ignore nulls OVER (PARTITION BY key ORDER BY order ASC NULLS FIRST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)': 'y' }, { 'key': 'a', 'order': 4, 'nth_value(value, 2) OVER (PARTITION BY key ORDER BY order ASC NULLS FIRST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)': 'x', 'nth_value(value, 2) ignore nulls OVER (PARTITION BY key ORDER BY order ASC NULLS FIRST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)': 'y' }, { 'key': 'b', 'order': 1, 'nth_value(value, 2) OVER (PARTITION BY key ORDER BY order ASC NULLS FIRST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)': None, 'nth_value(value, 2) ignore nulls OVER (PARTITION BY key ORDER BY order ASC NULLS FIRST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)': None }, { 'key': 'b', 'order': 2, 'nth_value(value, 2) OVER (PARTITION BY key ORDER BY order ASC NULLS FIRST RANGE BETWEEN UNBOUNDED PRECEDING AND 
CURRENT ROW)': None, 'nth_value(value, 2) ignore nulls OVER (PARTITION BY key ORDER BY order ASC NULLS FIRST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)': None }] {code} There are only four columns left. was: Python: connect client should not use pyarrow.Table.to_pylist to transform fetched data. For example: the data in pyarrow.Table show below. {code:java} pyarrow.Table key: string order: int64 nth_value(value, 2) OVER (PARTITION BY key ORDER BY order ASC NULLS FIRST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW): string nth_value(value, 2) OVER (PARTITION BY key ORDER BY order ASC NULLS FIRST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW): string nth_value(value, 2) ignore nulls OVER (PARTITION BY key ORDER BY order ASC NULLS FIRST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW): string key: [["a","a","a","a","a","b","b"]] order: [[0,1,2,3,4,1,2]] nth_value(value, 2) OVER (PARTITION BY key ORDER BY order ASC NULLS FIRST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW): [[null,"x","x","x","x",null,null]] nth_value(value, 2) OVER (PARTITION BY key ORDER BY order ASC NULLS FIRST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW): [[null,"x","x","x","x",null,null]] nth_value(value, 2) ignore nulls OVER