[jira] [Resolved] (SPARK-44920) Use await() instead of awaitUninterruptibly() in TransportClientFactory.createClient()
[ https://issues.apache.org/jira/browse/SPARK-44920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao resolved SPARK-44920. -- Fix Version/s: 3.3.4 3.5.0 4.0.0 3.4.2 Resolution: Fixed Issue resolved by pull request 42619 [https://github.com/apache/spark/pull/42619] > Use await() instead of awaitUninterruptibly() in > TransportClientFactory.createClient() > --- > > Key: SPARK-44920 > URL: https://issues.apache.org/jira/browse/SPARK-44920 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.3.3, 3.4.2, 3.5.0 >Reporter: Josh Rosen >Assignee: Josh Rosen >Priority: Major > Fix For: 3.3.4, 3.5.0, 4.0.0, 3.4.2 > > > This is a follow-up to SPARK-44241: that change added an `awaitUninterruptibly()` call, which I think should be a plain `await()` instead. This will prevent issues when cancelling tasks with hanging network connections. > This issue is similar to SPARK-19529
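[Editor's note] A minimal sketch of the distinction, written against Netty's ChannelFuture API; only await()/awaitUninterruptibly() are Netty's real methods, the helper and its host/port/timeout parameters are illustrative, and this is not the actual Spark patch:
{code:scala}
import io.netty.bootstrap.Bootstrap
import java.io.IOException

// Hedged sketch: why await() matters when a connection hangs and the
// calling task gets cancelled.
def connect(bootstrap: Bootstrap, host: String, port: Int, timeoutMs: Long): Unit = {
  val cf = bootstrap.connect(host, port)
  // awaitUninterruptibly(timeoutMs) swallows Thread.interrupt(), so a
  // cancelled task stays blocked until the full timeout elapses.
  // await(timeoutMs) throws InterruptedException on interrupt instead,
  // letting task cancellation unblock the thread promptly.
  if (!cf.await(timeoutMs)) {
    throw new IOException(s"Connecting to $host:$port timed out ($timeoutMs ms)")
  }
}
{code}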
[jira] [Assigned] (SPARK-44925) K8s default service token file should not be materialized into token
[ https://issues.apache.org/jira/browse/SPARK-44925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-44925: - Assignee: Dongjoon Hyun > K8s default service token file should not be materialized into token > > > Key: SPARK-44925 > URL: https://issues.apache.org/jira/browse/SPARK-44925 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.4.8, 3.0.3, 3.1.3, 3.2.4, 3.3.2, 3.4.1, 3.5.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major
[jira] [Resolved] (SPARK-44925) K8s default service token file should not be materialized into token
[ https://issues.apache.org/jira/browse/SPARK-44925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-44925. --- Fix Version/s: 3.3.4 3.5.0 4.0.0 3.4.2 Resolution: Fixed Issue resolved by pull request 42624 [https://github.com/apache/spark/pull/42624] > K8s default service token file should not be materialized into token > > > Key: SPARK-44925 > URL: https://issues.apache.org/jira/browse/SPARK-44925 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.4.8, 3.0.3, 3.1.3, 3.2.4, 3.3.2, 3.4.1, 3.5.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 3.3.4, 3.5.0, 4.0.0, 3.4.2
[jira] [Resolved] (SPARK-44922) Disable o.a.p.h.InternalParquetRecordWriter logs for tests to reduce the log volume
[ https://issues.apache.org/jira/browse/SPARK-44922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao resolved SPARK-44922. -- Fix Version/s: 3.4.2 3.5.0 Assignee: Kent Yao Resolution: Fixed Issue resolved by https://github.com/apache/spark/pull/42614 > Disable o.a.p.h.InternalParquetRecordWriter logs for tests to reduce the log > volume > --- > > Key: SPARK-44922 > URL: https://issues.apache.org/jira/browse/SPARK-44922 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 4.0.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > Fix For: 3.4.2, 3.5.0
[jira] [Resolved] (SPARK-44905) NullPointerException on stateful expression evaluation
[ https://issues.apache.org/jira/browse/SPARK-44905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao resolved SPARK-44905. -- Fix Version/s: 3.5.0 Resolution: Fixed Issue resolved by https://github.com/apache/spark/pull/42601 > NullPointerException on stateful expression evaluation > -- > > Key: SPARK-44905 > URL: https://issues.apache.org/jira/browse/SPARK-44905 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.1, 3.5.0, 4.0.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > Fix For: 3.5.0
[jira] [Updated] (SPARK-44840) array_insert() gives wrong results for negative index
[ https://issues.apache.org/jira/browse/SPARK-44840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-44840: -- Fix Version/s: 3.5.1 > array_insert() gives wrong results for negative index > --- > > Key: SPARK-44840 > URL: https://issues.apache.org/jira/browse/SPARK-44840 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.4.0 >Reporter: Serge Rielau >Assignee: Max Gekk >Priority: Major > Fix For: 4.0.0, 3.5.1 > > > Unlike in Snowflake, we decided that array_insert() is 1-based. > This means 1 is the first element in an array and -1 is the last. > This matches the behavior of functions such as substr() and element_at().
> {code:java}
> > SELECT array_insert(array('a', 'b', 'c'), 1, 'z');
> ["z","a","b","c"]
> > SELECT array_insert(array('a', 'b', 'c'), 0, 'z');
> Error
> > SELECT array_insert(array('a', 'b', 'c'), -1, 'z');
> ["a","b","c","z"]
> > SELECT array_insert(array('a', 'b', 'c'), 5, 'z');
> ["a","b","c",NULL,"z"]
> > SELECT array_insert(array('a', 'b', 'c'), -5, 'z');
> ["z",NULL,"a","b","c"]
> > SELECT array_insert(array('a', 'b', 'c'), 2, cast(NULL AS STRING));
> ["a",NULL,"b","c"]
> {code}
[jira] [Commented] (SPARK-44878) Address LRU cache insertion failure for RocksDB with strict limit
[ https://issues.apache.org/jira/browse/SPARK-44878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17757801#comment-17757801 ] Snoot.io commented on SPARK-44878: -- User 'anishshri-db' has created a pull request for this issue: https://github.com/apache/spark/pull/42567 > Address LRU cache insertion failure for RocksDB with strict limit > - > > Key: SPARK-44878 > URL: https://issues.apache.org/jira/browse/SPARK-44878 > Project: Spark > Issue Type: Task > Components: Structured Streaming >Affects Versions: 3.5.1 >Reporter: Anish Shrigondekar >Assignee: Anish Shrigondekar >Priority: Major > Fix For: 4.0.0 > > > Address LRU cache insertion failure for RocksDB with strict limit
[jira] [Resolved] (SPARK-44878) Address LRU cache insertion failure for RocksDB with strict limit
[ https://issues.apache.org/jira/browse/SPARK-44878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim resolved SPARK-44878. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 42567 [https://github.com/apache/spark/pull/42567] > Address LRU cache insertion failure for RocksDB with strict limit > - > > Key: SPARK-44878 > URL: https://issues.apache.org/jira/browse/SPARK-44878 > Project: Spark > Issue Type: Task > Components: Structured Streaming >Affects Versions: 3.5.1 >Reporter: Anish Shrigondekar >Assignee: Anish Shrigondekar >Priority: Major > Fix For: 4.0.0 > > > Address LRU cache insertion failure for RocksDB with strict limit
[jira] [Assigned] (SPARK-44878) Address LRU cache insertion failure for RocksDB with strict limit
[ https://issues.apache.org/jira/browse/SPARK-44878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim reassigned SPARK-44878: Assignee: Anish Shrigondekar > Address LRU cache insertion failure for RocksDB with strict limit > - > > Key: SPARK-44878 > URL: https://issues.apache.org/jira/browse/SPARK-44878 > Project: Spark > Issue Type: Task > Components: Structured Streaming >Affects Versions: 3.5.1 >Reporter: Anish Shrigondekar >Assignee: Anish Shrigondekar >Priority: Major > > Address LRU cache insertion failure for RocksDB with strict limit
[jira] [Commented] (SPARK-44750) SparkSession.Builder should respect the options
[ https://issues.apache.org/jira/browse/SPARK-44750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17757797#comment-17757797 ] Snoot.io commented on SPARK-44750: -- User 'michaelzhan-db' has created a pull request for this issue: https://github.com/apache/spark/pull/42548 > SparkSession.Builder should respect the options > --- > > Key: SPARK-44750 > URL: https://issues.apache.org/jira/browse/SPARK-44750 > Project: Spark > Issue Type: Improvement > Components: Connect, PySpark >Affects Versions: 3.5.0, 4.0.0 >Reporter: Ruifeng Zheng >Assignee: Michael Zhang >Priority: Major > > In the Connect session builder, we use the {{config}} method to set options. However, the options are actually ignored when we create a new session.
> {code}
> def create(self) -> "SparkSession":
>     has_channel_builder = self._channel_builder is not None
>     has_spark_remote = "spark.remote" in self._options
>     if has_channel_builder and has_spark_remote:
>         raise ValueError(
>             "Only one of connection string or channelBuilder "
>             "can be used to create a new SparkSession."
>         )
>     if not has_channel_builder and not has_spark_remote:
>         raise ValueError(
>             "Needs either connection string or channelBuilder to create a new SparkSession."
>         )
>     if has_channel_builder:
>         assert self._channel_builder is not None
>         session = SparkSession(connection=self._channel_builder)
>     else:
>         spark_remote = to_str(self._options.get("spark.remote"))
>         assert spark_remote is not None
>         session = SparkSession(connection=spark_remote)
>     SparkSession._set_default_and_active_session(session)
>     return session
> {code}
> We should respect the options by invoking {{session.conf.set}} after creation.
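[Editor's note] A minimal sketch of the suggested behavior; the actual fix targets the Python Connect client quoted above, so this is only the same pattern expressed in Scala against the public SparkSession API, with a made-up helper name:
{code:scala}
import org.apache.spark.sql.SparkSession

// After creating the session, replay the options accumulated by the
// builder onto it so they are not silently dropped.
def applyOptions(session: SparkSession, options: Map[String, String]): SparkSession = {
  options.foreach { case (key, value) => session.conf.set(key, value) }
  session
}
{code}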
[jira] [Commented] (SPARK-42017) df["bad_key"] does not raise AnalysisException
[ https://issues.apache.org/jira/browse/SPARK-42017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17757794#comment-17757794 ] Snoot.io commented on SPARK-42017: -- User 'zhengruifeng' has created a pull request for this issue: https://github.com/apache/spark/pull/42608 > df["bad_key"] does not raise AnalysisException > -- > > Key: SPARK-42017 > URL: https://issues.apache.org/jira/browse/SPARK-42017 > Project: Spark > Issue Type: Sub-task > Components: Connect, Tests >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Priority: Major > > e.g.)
> {code}
> 23/01/12 14:33:43 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
> FAILED [ 8%]
> pyspark/sql/tests/test_column.py:105 (ColumnParityTests.test_access_column)
> self = <ColumnParityTests testMethod=test_access_column>
> def test_access_column(self):
>     df = self.df
>     self.assertTrue(isinstance(df.key, Column))
>     self.assertTrue(isinstance(df["key"], Column))
>     self.assertTrue(isinstance(df[0], Column))
>     self.assertRaises(IndexError, lambda: df[2])
> >   self.assertRaises(AnalysisException, lambda: df["bad_key"])
> E   AssertionError: AnalysisException not raised by <lambda>
> ../test_column.py:112: AssertionError
> {code}
[jira] [Commented] (SPARK-44903) Refine docstring of `approx_count_distinct`
[ https://issues.apache.org/jira/browse/SPARK-44903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17757790#comment-17757790 ] Snoot.io commented on SPARK-44903: -- User 'LuciferYang' has created a pull request for this issue: https://github.com/apache/spark/pull/42596 > Refine docstring of `approx_count_distinct` > --- > > Key: SPARK-44903 > URL: https://issues.apache.org/jira/browse/SPARK-44903 > Project: Spark > Issue Type: Sub-task > Components: Documentation, PySpark >Affects Versions: 3.5.0, 4.0.0 >Reporter: Yang Jie >Priority: Major
[jira] [Commented] (SPARK-44860) Implement SESSION_USER function
[ https://issues.apache.org/jira/browse/SPARK-44860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17757788#comment-17757788 ] Snoot.io commented on SPARK-44860: -- User 'vitaliili-db' has created a pull request for this issue: https://github.com/apache/spark/pull/42549 > Implement SESSION_USER function > --- > > Key: SPARK-44860 > URL: https://issues.apache.org/jira/browse/SPARK-44860 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Vitalii Li >Priority: Major > > According to the SQL standard, SESSION_USER and CURRENT_USER behavior differs for routines: > - CURRENT_USER inside a routine should return the security definer of the routine, e.g. the owner identity. > - SESSION_USER inside a routine should return the connected user.
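[Editor's note] Illustratively, current_user() already exists in Spark SQL, while session_user() is what this ticket adds, so the second query below only works once SPARK-44860 lands; outside a routine both return the connected user:
{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("demo").master("local[*]").getOrCreate()
// At the top level of a session this returns the connected user.
spark.sql("SELECT current_user()").show()
// Per the SQL standard, session_user() must keep returning the connected
// user even inside an owner's-rights routine, where current_user() would
// switch to the routine's definer.
spark.sql("SELECT session_user()").show()
{code}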
[jira] [Commented] (SPARK-44913) DS V2 supports push down V2 UDF that has magic method
[ https://issues.apache.org/jira/browse/SPARK-44913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17757787#comment-17757787 ] Snoot.io commented on SPARK-44913: -- User 'ConeyLiu' has created a pull request for this issue: https://github.com/apache/spark/pull/42612 > DS V2 supports push down V2 UDF that has magic method > - > > Key: SPARK-44913 > URL: https://issues.apache.org/jira/browse/SPARK-44913 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.1 >Reporter: Xianyang Liu >Priority: Major > > Right now we only support pushing down a V2 UDF that does not have a magic method, because such a UDF is analyzed into `ApplyFunctionExpression`, which can be translated and pushed down. A V2 UDF that has a magic method is analyzed into `StaticInvoke` or `Invoke`, which cannot be translated into a V2 expression and therefore cannot be pushed down to the data source. Since the magic method is the recommended approach, this PR adds support for pushing down V2 UDFs that have a magic method.
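[Editor's note] For reference, a hedged sketch of what a V2 UDF with a magic method looks like; the ScalarFunction contract is Spark's public API, but the PlusOne class itself is illustrative and not from the PR:
{code:scala}
import org.apache.spark.sql.connector.catalog.functions.ScalarFunction
import org.apache.spark.sql.types.{DataType, IntegerType}

class PlusOne extends ScalarFunction[Integer] {
  override def inputTypes(): Array[DataType] = Array(IntegerType)
  override def resultType(): DataType = IntegerType
  override def name(): String = "plus_one"
  // The "magic method": resolved by signature at analysis time and compiled
  // into StaticInvoke/Invoke instead of ApplyFunctionExpression, which is
  // why such UDFs previously missed the V2 pushdown path.
  def invoke(value: Int): Int = value + 1
}
{code}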
[jira] [Commented] (SPARK-44925) K8s default service token file should not be materialized into token
[ https://issues.apache.org/jira/browse/SPARK-44925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17757786#comment-17757786 ] Snoot.io commented on SPARK-44925: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/42624 > K8s default service token file should not be materialized into token > > > Key: SPARK-44925 > URL: https://issues.apache.org/jira/browse/SPARK-44925 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.4.8, 3.0.3, 3.1.3, 3.2.4, 3.3.2, 3.4.1, 3.5.0 >Reporter: Dongjoon Hyun >Priority: Major
[jira] [Updated] (SPARK-44923) Some directories should be cleared when regenerating files
[ https://issues.apache.org/jira/browse/SPARK-44923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] BingKun Pan updated SPARK-44923: Component/s: Build > Some directories should be cleared when regenerating files > -- > > Key: SPARK-44923 > URL: https://issues.apache.org/jira/browse/SPARK-44923 > Project: Spark > Issue Type: Sub-task > Components: Build, Documentation, PySpark >Affects Versions: 4.0.0 >Reporter: BingKun Pan >Priority: Minor
[jira] [Created] (SPARK-44925) K8s default service token file should not be materialized into token
Dongjoon Hyun created SPARK-44925: - Summary: K8s default service token file should not be materialized into token Key: SPARK-44925 URL: https://issues.apache.org/jira/browse/SPARK-44925 Project: Spark Issue Type: Bug Components: Kubernetes Affects Versions: 3.4.1, 3.3.2, 3.2.4, 3.1.3, 3.0.3, 2.4.8, 3.5.0 Reporter: Dongjoon Hyun
[jira] [Resolved] (SPARK-44742) Add Spark version drop down to the PySpark doc site
[ https://issues.apache.org/jira/browse/SPARK-44742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-44742. -- Fix Version/s: 4.0.0 3.5.0 Resolution: Fixed Issue resolved by pull request 42428 [https://github.com/apache/spark/pull/42428] > Add Spark version drop down to the PySpark doc site > --- > > Key: SPARK-44742 > URL: https://issues.apache.org/jira/browse/SPARK-44742 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 4.0.0 >Reporter: Allison Wang >Assignee: BingKun Pan >Priority: Major > Fix For: 3.5.0, 4.0.0 > > > Currently, PySpark documentation does not have a version dropdown. While by default we want people to land on the latest version, it will be helpful and easier for people to find docs if we have this version dropdown. > Other libraries such as numpy have such a version dropdown. > !image-2023-08-09-09-38-00-805.png|width=214,height=189!
[jira] [Assigned] (SPARK-44742) Add Spark version drop down to the PySpark doc site
[ https://issues.apache.org/jira/browse/SPARK-44742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-44742: Assignee: BingKun Pan > Add Spark version drop down to the PySpark doc site > --- > > Key: SPARK-44742 > URL: https://issues.apache.org/jira/browse/SPARK-44742 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 4.0.0 >Reporter: Allison Wang >Assignee: BingKun Pan >Priority: Major > > Currently, PySpark documentation does not have a version dropdown. While by default we want people to land on the latest version, it will be helpful and easier for people to find docs if we have this version dropdown. > Other libraries such as numpy have such a version dropdown. > !image-2023-08-09-09-38-00-805.png|width=214,height=189!
[jira] [Created] (SPARK-44924) Add configurations for FileStreamSource cached files
kevin nacios created SPARK-44924: Summary: Add configurations for FileStreamSource cached files Key: SPARK-44924 URL: https://issues.apache.org/jira/browse/SPARK-44924 Project: Spark Issue Type: Improvement Components: Structured Streaming Affects Versions: 3.1.0 Reporter: kevin nacios With https://issues.apache.org/jira/browse/SPARK-30866, caching of listed files was added for Structured Streaming to reduce the cost of relisting from the filesystem each batch. The settings that drive this are currently hardcoded and there is no way to change them. This impacts some of our workloads where we process large datasets and it's unknown how "heavy" some files are, so a single batch can take a long period of time. When we set maxFilesPerTrigger to 100k files, a subsequent batch that uses the cached maximum of 10k files causes the job to take longer, since the cluster is capable of handling the 100k files but is stuck doing 10% of the workload. The benefit of the caching doesn't outweigh the performance cost on the rest of the job. With config settings available for this, we could either absorb some increased driver memory usage for caching the next 100k files, or opt to disable caching entirely and just relist files each batch by setting the cache amount to 0.
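[Editor's note] To make the request concrete, a hypothetical sketch of the knob being asked for; the configuration key below does not exist in any Spark release and merely stands in for whatever name an implementation would choose:
{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("demo").master("local[*]").getOrCreate()
// HYPOTHETICAL key: 0 would disable the FileStreamSource listing cache
// entirely (relist every batch); a large value would trade driver memory
// for keeping more pre-listed files ready for the next batch.
spark.conf.set("spark.sql.streaming.fileStreamSource.maxCachedFiles", "0")
{code}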
[jira] [Created] (SPARK-44923) Some directories should be cleared when regenerating files
BingKun Pan created SPARK-44923: --- Summary: Some directories should be cleared when regenerating files Key: SPARK-44923 URL: https://issues.apache.org/jira/browse/SPARK-44923 Project: Spark Issue Type: Sub-task Components: Documentation, PySpark Affects Versions: 4.0.0 Reporter: BingKun Pan
[jira] [Created] (SPARK-44922) Disable o.a.p.h.InternalParquetRecordWriter logs for tests to reduce the log volume
Kent Yao created SPARK-44922: Summary: Disable o.a.p.h.InternalParquetRecordWriter logs for tests to reduce the log volume Key: SPARK-44922 URL: https://issues.apache.org/jira/browse/SPARK-44922 Project: Spark Issue Type: Bug Components: Tests Affects Versions: 4.0.0 Reporter: Kent Yao
[jira] [Created] (SPARK-44921) Remove SqlBaseLexer.tokens from codebase
Rui Wang created SPARK-44921: Summary: Remove SqlBaseLexer.tokens from codebase Key: SPARK-44921 URL: https://issues.apache.org/jira/browse/SPARK-44921 Project: Spark Issue Type: Task Components: SQL Affects Versions: 3.5.0 Reporter: Rui Wang Assignee: Rui Wang
[jira] [Assigned] (SPARK-44920) Use await() instead of awaitUninterruptibly() in TransportClientFactory.createClient()
[ https://issues.apache.org/jira/browse/SPARK-44920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen reassigned SPARK-44920: -- Assignee: Josh Rosen > Use await() instead of awaitUninterruptibly() in > TransportClientFactory.createClient() > --- > > Key: SPARK-44920 > URL: https://issues.apache.org/jira/browse/SPARK-44920 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.3.3, 3.4.2, 3.5.0 >Reporter: Josh Rosen >Assignee: Josh Rosen >Priority: Major > > This is a follow-up to SPARK-44241: that change added an `awaitUninterruptibly()` call, which I think should be a plain `await()` instead. This will prevent issues when cancelling tasks with hanging network connections. > This issue is similar to SPARK-19529
[jira] [Created] (SPARK-44920) Use await() instead of awaitUninterruptibly() in TransportClientFactory.createClient()
Josh Rosen created SPARK-44920: -- Summary: Use await() instead of awaitUninterruptibly() in TransportClientFactory.createClient() Key: SPARK-44920 URL: https://issues.apache.org/jira/browse/SPARK-44920 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 3.3.3, 3.4.2, 3.5.0 Reporter: Josh Rosen This is a follow-up to SPARK-44241: that change added an `awaitUninterruptibly()` call, which I think should be a plain `await()` instead. This will prevent issues when cancelling tasks with hanging network connections. This issue is similar to SPARK-19529
[jira] [Assigned] (SPARK-44907) `DataFrame.join` should throw IllegalArgumentException for invalid join types
[ https://issues.apache.org/jira/browse/SPARK-44907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng reassigned SPARK-44907: - Assignee: Ruifeng Zheng > `DataFrame.join` should throw IllegalArgumentException for invalid join types > - > > Key: SPARK-44907 > URL: https://issues.apache.org/jira/browse/SPARK-44907 > Project: Spark > Issue Type: Bug > Components: Connect, PySpark >Affects Versions: 3.5.0, 4.0.0 >Reporter: Ruifeng Zheng >Assignee: Ruifeng Zheng >Priority: Minor
[jira] [Resolved] (SPARK-44907) `DataFrame.join` should throw IllegalArgumentException for invalid join types
[ https://issues.apache.org/jira/browse/SPARK-44907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng resolved SPARK-44907. --- Fix Version/s: 3.5.0 4.0.0 Resolution: Fixed Issue resolved by pull request 42603 [https://github.com/apache/spark/pull/42603] > `DataFrame.join` should throw IllegalArgumentException for invalid join types > - > > Key: SPARK-44907 > URL: https://issues.apache.org/jira/browse/SPARK-44907 > Project: Spark > Issue Type: Bug > Components: Connect, PySpark >Affects Versions: 3.5.0, 4.0.0 >Reporter: Ruifeng Zheng >Assignee: Ruifeng Zheng >Priority: Minor > Fix For: 3.5.0, 4.0.0
[jira] [Created] (SPARK-44919) Avro connector: convert a union of a single primitive type to a StructType
Tianhan Hu created SPARK-44919: -- Summary: Avro connector: convert a union of a single primitive type to a StructType Key: SPARK-44919 URL: https://issues.apache.org/jira/browse/SPARK-44919 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 3.4.1 Reporter: Tianhan Hu The Spark Avro data source schema converter currently converts a union of a single primitive type to a Spark primitive type instead of a StructType, while for more complex unions consisting of multiple primitive types it translates them into StructTypes. For example:
import scala.collection.JavaConverters._
import org.apache.avro._
import org.apache.spark.sql.avro._

// ["string", "null"]
SchemaConverters.toSqlType(
  Schema.createUnion(Seq(Schema.create(Schema.Type.STRING), Schema.create(Schema.Type.NULL)).asJava)
).dataType

// ["string", "int", "null"]
SchemaConverters.toSqlType(
  Schema.createUnion(Seq(Schema.create(Schema.Type.STRING), Schema.create(Schema.Type.INT), Schema.create(Schema.Type.NULL)).asJava)
).dataType

The first one would return StringType, the second would return StructType(StringType, IntegerType). We hope to add a new configuration to control the conversion behavior. The default behavior would stay the same; when the config is altered, a union with a single primitive type would be translated into a StructType.
[jira] [Commented] (SPARK-44857) Fix getBaseURI error in Spark Worker LogPage UI buttons
[ https://issues.apache.org/jira/browse/SPARK-44857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17757691#comment-17757691 ] Dongjoon Hyun commented on SPARK-44857: --- Thank you for fixing the `Fix Version`, [~yumwang]. > Fix getBaseURI error in Spark Worker LogPage UI buttons > --- > > Key: SPARK-44857 > URL: https://issues.apache.org/jira/browse/SPARK-44857 > Project: Spark > Issue Type: Bug > Components: Spark Core, Web UI >Affects Versions: 3.2.0, 3.2.4, 3.3.2, 3.4.1, 3.5.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 3.4.2, 3.5.0, 4.0.0, 3.3.4 > > Attachments: Screenshot 2023-08-17 at 2.38.45 PM.png
[jira] [Created] (SPARK-44918) Add named argument support for scalar Python/Pandas UDFs
Takuya Ueshin created SPARK-44918: - Summary: Add named argument support for scalar Python/Pandas UDFs Key: SPARK-44918 URL: https://issues.apache.org/jira/browse/SPARK-44918 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 4.0.0 Reporter: Takuya Ueshin
[jira] [Resolved] (SPARK-44916) Document Spark Driver Live Log UI
[ https://issues.apache.org/jira/browse/SPARK-44916?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-44916. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 42615 [https://github.com/apache/spark/pull/42615] > Document Spark Driver Live Log UI > - > > Key: SPARK-44916 > URL: https://issues.apache.org/jira/browse/SPARK-44916 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 4.0.0
[jira] [Assigned] (SPARK-44916) Document Spark Driver Live Log UI
[ https://issues.apache.org/jira/browse/SPARK-44916?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-44916: - Assignee: Dongjoon Hyun > Document Spark Driver Live Log UI > - > > Key: SPARK-44916 > URL: https://issues.apache.org/jira/browse/SPARK-44916 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major
[jira] [Resolved] (SPARK-44917) PySpark Streaming DataStreamWriter table API
[ https://issues.apache.org/jira/browse/SPARK-44917?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Liu resolved SPARK-44917. - Resolution: Not A Problem > PySpark Streaming DataStreamWriter table API > > > Key: SPARK-44917 > URL: https://issues.apache.org/jira/browse/SPARK-44917 > Project: Spark > Issue Type: New Feature > Components: Structured Streaming >Affects Versions: 4.0.0 >Reporter: Wei Liu >Priority: Major
[jira] [Updated] (SPARK-44914) Upgrade Apache ivy to 2.5.2
[ https://issues.apache.org/jira/browse/SPARK-44914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bjørn Jørgensen updated SPARK-44914: Affects Version/s: 3.5.0 > Upgrade Apache ivy to 2.5.2 > > > Key: SPARK-44914 > URL: https://issues.apache.org/jira/browse/SPARK-44914 > Project: Spark > Issue Type: Dependency upgrade > Components: Build >Affects Versions: 3.5.0, 4.0.0 >Reporter: Bjørn Jørgensen >Priority: Major > > [CVE-2022-46751|https://www.cve.org/CVERecord?id=CVE-2022-46751]
[jira] [Created] (SPARK-44917) PySpark Streaming DataStreamWriter table API
Wei Liu created SPARK-44917: --- Summary: PySpark Streaming DataStreamWriter table API Key: SPARK-44917 URL: https://issues.apache.org/jira/browse/SPARK-44917 Project: Spark Issue Type: New Feature Components: Structured Streaming Affects Versions: 4.0.0 Reporter: Wei Liu
[jira] [Commented] (SPARK-44914) Upgrade Apache ivy to 2.5.2
[ https://issues.apache.org/jira/browse/SPARK-44914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17757647#comment-17757647 ] Hudson commented on SPARK-44914: User 'bjornjorgensen' has created a pull request for this issue: https://github.com/apache/spark/pull/42613 > Upgrade Apache ivy to 2.5.2 > > > Key: SPARK-44914 > URL: https://issues.apache.org/jira/browse/SPARK-44914 > Project: Spark > Issue Type: Dependency upgrade > Components: Build >Affects Versions: 4.0.0 >Reporter: Bjørn Jørgensen >Priority: Major > > [CVE-2022-46751|https://www.cve.org/CVERecord?id=CVE-2022-46751]
[jira] [Resolved] (SPARK-44840) array_insert() gives wrong results for negative index
[ https://issues.apache.org/jira/browse/SPARK-44840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk resolved SPARK-44840. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 42564 [https://github.com/apache/spark/pull/42564] > array_insert() gives wrong results for negative index > --- > > Key: SPARK-44840 > URL: https://issues.apache.org/jira/browse/SPARK-44840 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.4.0 >Reporter: Serge Rielau >Assignee: Max Gekk >Priority: Major > Fix For: 4.0.0 > > > Unlike in Snowflake, we decided that array_insert() is 1-based. > This means 1 is the first element in an array and -1 is the last. > This matches the behavior of functions such as substr() and element_at().
> {code:java}
> > SELECT array_insert(array('a', 'b', 'c'), 1, 'z');
> ["z","a","b","c"]
> > SELECT array_insert(array('a', 'b', 'c'), 0, 'z');
> Error
> > SELECT array_insert(array('a', 'b', 'c'), -1, 'z');
> ["a","b","c","z"]
> > SELECT array_insert(array('a', 'b', 'c'), 5, 'z');
> ["a","b","c",NULL,"z"]
> > SELECT array_insert(array('a', 'b', 'c'), -5, 'z');
> ["z",NULL,"a","b","c"]
> > SELECT array_insert(array('a', 'b', 'c'), 2, cast(NULL AS STRING));
> ["a",NULL,"b","c"]
> {code}
[jira] [Created] (SPARK-44916) Document Spark Driver Live Log UI
Dongjoon Hyun created SPARK-44916: - Summary: Document Spark Driver Live Log UI Key: SPARK-44916 URL: https://issues.apache.org/jira/browse/SPARK-44916 Project: Spark Issue Type: Documentation Components: Documentation Affects Versions: 4.0.0 Reporter: Dongjoon Hyun
[jira] [Updated] (SPARK-44915) Validate checksum of remounted PVC's shuffle data before recovery
[ https://issues.apache.org/jira/browse/SPARK-44915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-44915: -- Component/s: (was: Spark Core) > Validate checksum of remounted PVC's shuffle data before recovery > - > > Key: SPARK-44915 > URL: https://issues.apache.org/jira/browse/SPARK-44915 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Priority: Major
[jira] [Created] (SPARK-44915) Validate checksum of remounted PVC's shuffle data before recovery
Dongjoon Hyun created SPARK-44915: - Summary: Validate checksum of remounted PVC's shuffle data before recovery Key: SPARK-44915 URL: https://issues.apache.org/jira/browse/SPARK-44915 Project: Spark Issue Type: Improvement Components: Kubernetes, Spark Core Affects Versions: 4.0.0 Reporter: Dongjoon Hyun
[jira] [Assigned] (SPARK-44871) Fix PERCENTILE_DISC behaviour
[ https://issues.apache.org/jira/browse/SPARK-44871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk reassigned SPARK-44871: Assignee: Peter Toth
> Fix PERCENTILE_DISC behaviour
> -
>
> Key: SPARK-44871
> URL: https://issues.apache.org/jira/browse/SPARK-44871
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.3.0, 3.3.1, 3.3.3, 3.3.2, 3.4.0, 3.4.1
> Reporter: Peter Toth
> Assignee: Peter Toth
> Priority: Critical
> Fix For: 3.5.0, 4.0.0
>
> Currently {{percentile_disc()}} returns incorrect results in some cases:
> E.g.:
> {code:java}
> SELECT
> percentile_disc(0.0) WITHIN GROUP (ORDER BY a) as p0,
> percentile_disc(0.1) WITHIN GROUP (ORDER BY a) as p1,
> percentile_disc(0.2) WITHIN GROUP (ORDER BY a) as p2,
> percentile_disc(0.3) WITHIN GROUP (ORDER BY a) as p3,
> percentile_disc(0.4) WITHIN GROUP (ORDER BY a) as p4,
> percentile_disc(0.5) WITHIN GROUP (ORDER BY a) as p5,
> percentile_disc(0.6) WITHIN GROUP (ORDER BY a) as p6,
> percentile_disc(0.7) WITHIN GROUP (ORDER BY a) as p7,
> percentile_disc(0.8) WITHIN GROUP (ORDER BY a) as p8,
> percentile_disc(0.9) WITHIN GROUP (ORDER BY a) as p9,
> percentile_disc(1.0) WITHIN GROUP (ORDER BY a) as p10
> FROM VALUES (0), (1), (2), (3), (4) AS v(a)
> {code}
> returns:
> {code:java}
> +---+---+---+---+---+---+---+---+---+---+---+
> | p0| p1| p2| p3| p4| p5| p6| p7| p8| p9|p10|
> +---+---+---+---+---+---+---+---+---+---+---+
> |0.0|0.0|0.0|1.0|1.0|2.0|2.0|2.0|3.0|3.0|4.0|
> +---+---+---+---+---+---+---+---+---+---+---+
> {code}
> but it should return:
> {noformat}
> +---+---+---+---+---+---+---+---+---+---+---+
> | p0| p1| p2| p3| p4| p5| p6| p7| p8| p9|p10|
> +---+---+---+---+---+---+---+---+---+---+---+
> |0.0|0.0|0.0|1.0|1.0|2.0|2.0|3.0|3.0|4.0|4.0|
> +---+---+---+---+---+---+---+---+---+---+---+
> {noformat}
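[Editor's note] A quick check of the disputed p = 0.7 column (my arithmetic, not from the ticket). PERCENTILE_DISC(p) returns the smallest ordered value whose cumulative distribution reaches p:
{noformat}
percentile_disc(p) = min { v : cume_dist(v) >= p }

Over the five rows {0, 1, 2, 3, 4}:
  cume_dist(2) = 3/5 = 0.6 <  0.7
  cume_dist(3) = 4/5 = 0.8 >= 0.7   =>  p7 = 3.0, not the 2.0 Spark returned
{noformat}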
[jira] [Resolved] (SPARK-44871) Fix PERCENTILE_DISC behaviour
[ https://issues.apache.org/jira/browse/SPARK-44871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk resolved SPARK-44871. -- Fix Version/s: 3.4.2 (was: 3.5.0) (was: 4.0.0) Resolution: Fixed Issue resolved by pull request 42610 [https://github.com/apache/spark/pull/42610]
> Fix PERCENTILE_DISC behaviour
> -
>
> Key: SPARK-44871
> URL: https://issues.apache.org/jira/browse/SPARK-44871
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.3.0, 3.3.1, 3.3.3, 3.3.2, 3.4.0, 3.4.1
> Reporter: Peter Toth
> Assignee: Peter Toth
> Priority: Critical
> Fix For: 3.4.2
>
> Currently {{percentile_disc()}} returns incorrect results in some cases:
> E.g.:
> {code:java}
> SELECT
> percentile_disc(0.0) WITHIN GROUP (ORDER BY a) as p0,
> percentile_disc(0.1) WITHIN GROUP (ORDER BY a) as p1,
> percentile_disc(0.2) WITHIN GROUP (ORDER BY a) as p2,
> percentile_disc(0.3) WITHIN GROUP (ORDER BY a) as p3,
> percentile_disc(0.4) WITHIN GROUP (ORDER BY a) as p4,
> percentile_disc(0.5) WITHIN GROUP (ORDER BY a) as p5,
> percentile_disc(0.6) WITHIN GROUP (ORDER BY a) as p6,
> percentile_disc(0.7) WITHIN GROUP (ORDER BY a) as p7,
> percentile_disc(0.8) WITHIN GROUP (ORDER BY a) as p8,
> percentile_disc(0.9) WITHIN GROUP (ORDER BY a) as p9,
> percentile_disc(1.0) WITHIN GROUP (ORDER BY a) as p10
> FROM VALUES (0), (1), (2), (3), (4) AS v(a)
> {code}
> returns:
> {code:java}
> +---+---+---+---+---+---+---+---+---+---+---+
> | p0| p1| p2| p3| p4| p5| p6| p7| p8| p9|p10|
> +---+---+---+---+---+---+---+---+---+---+---+
> |0.0|0.0|0.0|1.0|1.0|2.0|2.0|2.0|3.0|3.0|4.0|
> +---+---+---+---+---+---+---+---+---+---+---+
> {code}
> but it should return:
> {noformat}
> +---+---+---+---+---+---+---+---+---+---+---+
> | p0| p1| p2| p3| p4| p5| p6| p7| p8| p9|p10|
> +---+---+---+---+---+---+---+---+---+---+---+
> |0.0|0.0|0.0|1.0|1.0|2.0|2.0|3.0|3.0|4.0|4.0|
> +---+---+---+---+---+---+---+---+---+---+---+
> {noformat}
[jira] [Created] (SPARK-44914) Upgrade Apache ivy to 2.5.2
Bjørn Jørgensen created SPARK-44914: --- Summary: Upgrade Apache ivy to 2.5.2 Key: SPARK-44914 URL: https://issues.apache.org/jira/browse/SPARK-44914 Project: Spark Issue Type: Dependency upgrade Components: Build Affects Versions: 4.0.0 Reporter: Bjørn Jørgensen [CVE-2022-46751|https://www.cve.org/CVERecord?id=CVE-2022-46751]
[jira] [Commented] (SPARK-44884) Spark doesn't create SUCCESS file when external path is passed
[ https://issues.apache.org/jira/browse/SPARK-44884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17757550#comment-17757550 ] Dipayan Dev commented on SPARK-44884: - [~ste...@apache.org], I am running on Dataproc, but I am able to replicate the same from my local machine as well.
The _SUCCESS file is created by:
* Spark 2.x (I am using 2.4.0) with .saveAsTable(), with or without an external path.
* Spark 3.3.0 with .saveAsTable() without an external path.
The _SUCCESS file is not created by:
* Spark 3.3.0 with .saveAsTable() with an external path.
As mentioned, I have set the following config, but it did not help: spark.conf.set("spark.hadoop.mapreduce.fileoutputcommitter.marksuccessfuljobs", true)
Are you able to replicate the issue with the snippet I have shared, or is the _SUCCESS file generated at your end when an external path is passed?
> Spark doesn't create SUCCESS file when external path is passed
> --
>
> Key: SPARK-44884
> URL: https://issues.apache.org/jira/browse/SPARK-44884
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 3.3.0
> Reporter: Dipayan Dev
> Priority: Critical
> Attachments: image-2023-08-20-18-08-38-531.png, image-2023-08-20-18-46-53-342.png
>
> The issue is not happening in Spark 2.x (I am using 2.4.0), but only in 3.3.0.
> Code to reproduce the issue:
>
> {code:java}
> scala> spark.conf.set("spark.sql.orc.char.enabled", true)
> scala> val DF = Seq(("test1", 123)).toDF("name", "num")
> scala> DF.write.option("path", "gs://test_dd123/").mode(SaveMode.Overwrite).partitionBy("num").format("orc").saveAsTable("test_schema.table_name")
> 23/08/20 12:31:43 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, since hive.security.authorization.manager is set to instance of HiveAuthorizerFactory.
> {code}
> The above code succeeds and creates the external Hive table, but {*}there is no SUCCESS file generated{*}. The same code, when run on Spark 2.4.0, generates a SUCCESS file.
> Adding the content of the bucket after table creation:
>
> !image-2023-08-20-18-08-38-531.png|width=453,height=162!
>
> But when I don't pass the external path, as follows, the SUCCESS file is generated:
> {code:java}
> scala> DF.write.mode(SaveMode.Overwrite).partitionBy("num").format("orc").saveAsTable("us_wm_supply_chain_rcv_pre_prod.test_tb1")
> {code}
> !image-2023-08-20-18-46-53-342.png|width=465,height=166!
[jira] [Commented] (SPARK-44884) Spark doesn't create SUCCESS file when external path is passed
[ https://issues.apache.org/jira/browse/SPARK-44884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17757547#comment-17757547 ] Steve Loughran commented on SPARK-44884: [~dipayandev] I don't think anyone has disabled the option; it doesn't surface in my test setup (manifest and s3a committers). Afraid you are going to have to debug it yourself, as it is your env which has the problem. Does everything work if you use .saveAs() rather than .saveAsTable()?
> Spark doesn't create SUCCESS file when external path is passed
> --
>
> Key: SPARK-44884
> URL: https://issues.apache.org/jira/browse/SPARK-44884
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 3.3.0
> Reporter: Dipayan Dev
> Priority: Critical
> Attachments: image-2023-08-20-18-08-38-531.png, image-2023-08-20-18-46-53-342.png
>
> The issue is not happening in Spark 2.x (I am using 2.4.0), but only in 3.3.0.
> Code to reproduce the issue:
>
> {code:java}
> scala> spark.conf.set("spark.sql.orc.char.enabled", true)
> scala> val DF = Seq(("test1", 123)).toDF("name", "num")
> scala> DF.write.option("path", "gs://test_dd123/").mode(SaveMode.Overwrite).partitionBy("num").format("orc").saveAsTable("test_schema.table_name")
> 23/08/20 12:31:43 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, since hive.security.authorization.manager is set to instance of HiveAuthorizerFactory.
> {code}
> The above code succeeds and creates the external Hive table, but {*}there is no SUCCESS file generated{*}. The same code, when run on Spark 2.4.0, generates a SUCCESS file.
> Adding the content of the bucket after table creation:
>
> !image-2023-08-20-18-08-38-531.png|width=453,height=162!
>
> But when I don't pass the external path, as follows, the SUCCESS file is generated:
> {code:java}
> scala> DF.write.mode(SaveMode.Overwrite).partitionBy("num").format("orc").saveAsTable("us_wm_supply_chain_rcv_pre_prod.test_tb1")
> {code}
> !image-2023-08-20-18-46-53-342.png|width=465,height=166!
[jira] [Commented] (SPARK-44817) Incremental Stats Collection
[ https://issues.apache.org/jira/browse/SPARK-44817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17757539#comment-17757539 ] Rakesh Raushan commented on SPARK-44817: Sure. I will try to come up with a SPIP by this weekend. > Incremental Stats Collection > > > Key: SPARK-44817 > URL: https://issues.apache.org/jira/browse/SPARK-44817 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.0, 4.0.0 >Reporter: Rakesh Raushan >Priority: Major > > Spark's Cost Based Optimizer depends on table and column statistics. > After every execution of a DML query, table and column stats are invalidated if auto-update of stats collection is not turned on. To keep stats updated we need to run the `ANALYZE TABLE COMPUTE STATISTICS` command, which is very expensive. It is not feasible to run this command after every DML query. > Instead, we can incrementally update the stats during each DML query run itself. This way our table and column stats would be fresh at all times and CBO benefits can be applied. Initially, we can update only table-level stats and gradually start updating column-level stats as well. > *Pros:* > 1. Optimizes queries over tables which are updated frequently. > 2. Saves compute cycles by removing the dependency on `ANALYZE TABLE COMPUTE STATISTICS` for updating stats.
[jira] [Updated] (SPARK-44871) Fix PERCENTILE_DISC behaviour
[ https://issues.apache.org/jira/browse/SPARK-44871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Toth updated SPARK-44871: --- Affects Version/s: 3.4.0 3.3.2 3.3.1 3.3.0
> Fix PERCENTILE_DISC behaviour
> -
>
> Key: SPARK-44871
> URL: https://issues.apache.org/jira/browse/SPARK-44871
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.3.0, 3.3.1, 3.3.3, 3.3.2, 3.4.0, 3.4.1
> Reporter: Peter Toth
> Priority: Critical
> Fix For: 3.5.0, 4.0.0
>
> Currently {{percentile_disc()}} returns incorrect results in some cases:
> E.g.:
> {code:java}
> SELECT
> percentile_disc(0.0) WITHIN GROUP (ORDER BY a) as p0,
> percentile_disc(0.1) WITHIN GROUP (ORDER BY a) as p1,
> percentile_disc(0.2) WITHIN GROUP (ORDER BY a) as p2,
> percentile_disc(0.3) WITHIN GROUP (ORDER BY a) as p3,
> percentile_disc(0.4) WITHIN GROUP (ORDER BY a) as p4,
> percentile_disc(0.5) WITHIN GROUP (ORDER BY a) as p5,
> percentile_disc(0.6) WITHIN GROUP (ORDER BY a) as p6,
> percentile_disc(0.7) WITHIN GROUP (ORDER BY a) as p7,
> percentile_disc(0.8) WITHIN GROUP (ORDER BY a) as p8,
> percentile_disc(0.9) WITHIN GROUP (ORDER BY a) as p9,
> percentile_disc(1.0) WITHIN GROUP (ORDER BY a) as p10
> FROM VALUES (0), (1), (2), (3), (4) AS v(a)
> {code}
> returns:
> {code:java}
> +---+---+---+---+---+---+---+---+---+---+---+
> | p0| p1| p2| p3| p4| p5| p6| p7| p8| p9|p10|
> +---+---+---+---+---+---+---+---+---+---+---+
> |0.0|0.0|0.0|1.0|1.0|2.0|2.0|2.0|3.0|3.0|4.0|
> +---+---+---+---+---+---+---+---+---+---+---+
> {code}
> but it should return:
> {noformat}
> +---+---+---+---+---+---+---+---+---+---+---+
> | p0| p1| p2| p3| p4| p5| p6| p7| p8| p9|p10|
> +---+---+---+---+---+---+---+---+---+---+---+
> |0.0|0.0|0.0|1.0|1.0|2.0|2.0|3.0|3.0|4.0|4.0|
> +---+---+---+---+---+---+---+---+---+---+---+
> {noformat}
[jira] [Updated] (SPARK-44871) Fix PERCENTILE_DISC behaviour
[ https://issues.apache.org/jira/browse/SPARK-44871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Toth updated SPARK-44871: --- Affects Version/s: 3.4.1 3.3.3 (was: 3.3.0) (was: 3.4.0) (was: 3.5.0) (was: 4.0.0)
> Fix PERCENTILE_DISC behaviour
> -
>
> Key: SPARK-44871
> URL: https://issues.apache.org/jira/browse/SPARK-44871
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.3.3, 3.4.1
> Reporter: Peter Toth
> Priority: Critical
> Fix For: 3.5.0, 4.0.0
>
> Currently {{percentile_disc()}} returns incorrect results in some cases:
> E.g.:
> {code:java}
> SELECT
> percentile_disc(0.0) WITHIN GROUP (ORDER BY a) as p0,
> percentile_disc(0.1) WITHIN GROUP (ORDER BY a) as p1,
> percentile_disc(0.2) WITHIN GROUP (ORDER BY a) as p2,
> percentile_disc(0.3) WITHIN GROUP (ORDER BY a) as p3,
> percentile_disc(0.4) WITHIN GROUP (ORDER BY a) as p4,
> percentile_disc(0.5) WITHIN GROUP (ORDER BY a) as p5,
> percentile_disc(0.6) WITHIN GROUP (ORDER BY a) as p6,
> percentile_disc(0.7) WITHIN GROUP (ORDER BY a) as p7,
> percentile_disc(0.8) WITHIN GROUP (ORDER BY a) as p8,
> percentile_disc(0.9) WITHIN GROUP (ORDER BY a) as p9,
> percentile_disc(1.0) WITHIN GROUP (ORDER BY a) as p10
> FROM VALUES (0), (1), (2), (3), (4) AS v(a)
> {code}
> returns:
> {code:java}
> +---+---+---+---+---+---+---+---+---+---+---+
> | p0| p1| p2| p3| p4| p5| p6| p7| p8| p9|p10|
> +---+---+---+---+---+---+---+---+---+---+---+
> |0.0|0.0|0.0|1.0|1.0|2.0|2.0|2.0|3.0|3.0|4.0|
> +---+---+---+---+---+---+---+---+---+---+---+
> {code}
> but it should return:
> {noformat}
> +---+---+---+---+---+---+---+---+---+---+---+
> | p0| p1| p2| p3| p4| p5| p6| p7| p8| p9|p10|
> +---+---+---+---+---+---+---+---+---+---+---+
> |0.0|0.0|0.0|1.0|1.0|2.0|2.0|3.0|3.0|4.0|4.0|
> +---+---+---+---+---+---+---+---+---+---+---+
> {noformat}
[jira] [Updated] (SPARK-44871) Fix PERCENTILE_DISC behaviour
[ https://issues.apache.org/jira/browse/SPARK-44871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Toth updated SPARK-44871: --- Fix Version/s: 3.5.0 4.0.0
> Fix PERCENTILE_DISC behaviour
> -
>
> Key: SPARK-44871
> URL: https://issues.apache.org/jira/browse/SPARK-44871
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.3.0, 3.4.0, 3.5.0, 4.0.0
> Reporter: Peter Toth
> Priority: Critical
> Fix For: 3.5.0, 4.0.0
>
> Currently {{percentile_disc()}} returns incorrect results in some cases:
> E.g.:
> {code:java}
> SELECT
> percentile_disc(0.0) WITHIN GROUP (ORDER BY a) as p0,
> percentile_disc(0.1) WITHIN GROUP (ORDER BY a) as p1,
> percentile_disc(0.2) WITHIN GROUP (ORDER BY a) as p2,
> percentile_disc(0.3) WITHIN GROUP (ORDER BY a) as p3,
> percentile_disc(0.4) WITHIN GROUP (ORDER BY a) as p4,
> percentile_disc(0.5) WITHIN GROUP (ORDER BY a) as p5,
> percentile_disc(0.6) WITHIN GROUP (ORDER BY a) as p6,
> percentile_disc(0.7) WITHIN GROUP (ORDER BY a) as p7,
> percentile_disc(0.8) WITHIN GROUP (ORDER BY a) as p8,
> percentile_disc(0.9) WITHIN GROUP (ORDER BY a) as p9,
> percentile_disc(1.0) WITHIN GROUP (ORDER BY a) as p10
> FROM VALUES (0), (1), (2), (3), (4) AS v(a)
> {code}
> returns:
> {code:java}
> +---+---+---+---+---+---+---+---+---+---+---+
> | p0| p1| p2| p3| p4| p5| p6| p7| p8| p9|p10|
> +---+---+---+---+---+---+---+---+---+---+---+
> |0.0|0.0|0.0|1.0|1.0|2.0|2.0|2.0|3.0|3.0|4.0|
> +---+---+---+---+---+---+---+---+---+---+---+
> {code}
> but it should return:
> {noformat}
> +---+---+---+---+---+---+---+---+---+---+---+
> | p0| p1| p2| p3| p4| p5| p6| p7| p8| p9|p10|
> +---+---+---+---+---+---+---+---+---+---+---+
> |0.0|0.0|0.0|1.0|1.0|2.0|2.0|3.0|3.0|4.0|4.0|
> +---+---+---+---+---+---+---+---+---+---+---+
> {noformat}
[jira] [Created] (SPARK-44913) DS V2 supports push down V2 UDF that has magic method
Xianyang Liu created SPARK-44913: Summary: DS V2 supports push down V2 UDF that has magic method Key: SPARK-44913 URL: https://issues.apache.org/jira/browse/SPARK-44913 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.4.1 Reporter: Xianyang Liu Right now we only support pushing down a V2 UDF that does not have a magic method, because such a UDF is analyzed into an `ApplyFunctionExpression`, which can be translated and pushed down. However, a V2 UDF that has a magic method is analyzed into `StaticInvoke` or `Invoke`, which cannot be translated into a V2 expression and therefore cannot be pushed down to the data source. Since the magic method is the recommended approach, this PR adds support for pushing down V2 UDFs that have a magic method. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
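For context, a minimal sketch of what a V2 UDF with a magic method looks like under the DataSource V2 FunctionCatalog API; the class and function names here are illustrative, not taken from the patch. A ScalarFunction implementation may declare a method named "invoke" (ScalarFunction.MAGIC_METHOD_NAME) whose signature matches its declared input types, and the analyzer then binds calls to it via StaticInvoke/Invoke rather than ApplyFunctionExpression:
{code:java}
import org.apache.spark.sql.catalyst.InternalRow;
import org.apache.spark.sql.connector.catalog.functions.ScalarFunction;
import org.apache.spark.sql.types.DataType;
import org.apache.spark.sql.types.DataTypes;

// Illustrative V2 scalar UDF with a "magic method".
public class PlusOne implements ScalarFunction<Integer> {
    @Override
    public DataType[] inputTypes() { return new DataType[] { DataTypes.IntegerType }; }

    @Override
    public DataType resultType() { return DataTypes.IntegerType; }

    @Override
    public String name() { return "plus_one"; }

    // Magic method: named "invoke" with a signature matching inputTypes(),
    // so the analyzer compiles calls to it via StaticInvoke/Invoke instead
    // of the row-based produceResult path.
    public int invoke(int value) { return value + 1; }

    // Row-based fallback, used only when no magic method is found.
    @Override
    public Integer produceResult(InternalRow input) { return input.getInt(0) + 1; }
}
{code}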
[jira] [Created] (SPARK-44912) Spark 3.4 multi-column sum slows with many columns
Brady Bickel created SPARK-44912: Summary: Spark 3.4 multi-column sum slows with many columns Key: SPARK-44912 URL: https://issues.apache.org/jira/browse/SPARK-44912 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 3.4.1, 3.4.0 Reporter: Brady Bickel The code below is a minimal reproducible example of an issue I discovered with PySpark 3.4.x. I want to sum the values of multiple columns and put the sum of those columns (per row) into a new column. This code works and returns in a reasonable amount of time in PySpark 3.3.x, but is extremely slow in PySpark 3.4.x when the number of columns grows. See below for execution timing summary as N varies. {code:python} import pyspark.sql.functions as F import random import string from functools import reduce from operator import add from pyspark.sql import SparkSession spark = SparkSession.builder.getOrCreate() # generate a dataframe N columns by M rows with random 8 character column # names and random integers in [-5,10] N = 30 M = 100 columns = [''.join(random.choices(string.ascii_uppercase + string.digits, k=8)) for _ in range(N)] data = [tuple([random.randint(-5,10) for _ in range(N)]) for _ in range(M)] df = spark.sparkContext.parallelize(data).toDF(columns) # 3 ways to add a sum column, all of them slow for high N in spark 3.4 df = df.withColumn("col_sum1", sum(df[col] for col in columns)) df = df.withColumn("col_sum2", reduce(add, [F.col(col) for col in columns])) df = df.withColumn("col_sum3", F.expr("+".join(columns))) {code} Timing results for Spark 3.3: ||N||Exe Time (s)|| |5|0.514| |10|0.248| |15|0.327| |20|0.403| |25|0.279| |30|0.322| |50|0.430| Timing results for Spark 3.4: ||N||Exe Time (s)|| |5|0.379| |10|0.318| |15|0.405| |20|1.32| |25|28.8| |30|448| |50|>1 (did not finish)| -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-44911) create hive table with invalid column should return error class
ming95 created SPARK-44911: -- Summary: create hive table with invalid column should return error class Key: SPARK-44911 URL: https://issues.apache.org/jira/browse/SPARK-44911 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.4.1 Reporter: ming95 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-40156) url_decode() exposes a Java error
[ https://issues.apache.org/jira/browse/SPARK-40156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ming95 resolved SPARK-40156. Fix Version/s: 3.4.0 Resolution: Fixed > url_decode() exposes a Java error > - > > Key: SPARK-40156 > URL: https://issues.apache.org/jira/browse/SPARK-40156 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.4.0 >Reporter: Serge Rielau >Priority: Major > Fix For: 3.4.0 > > > Given a badly encoded string, Spark returns a Java error. > It should instead return an ERROR_CLASS: > spark-sql> SELECT url_decode('http%3A%2F%2spark.apache.org'); > 22/08/20 17:17:20 ERROR SparkSQLDriver: Failed in [SELECT > url_decode('http%3A%2F%2spark.apache.org')] > java.lang.IllegalArgumentException: URLDecoder: Illegal hex characters in > escape (%) pattern - Error at index 1 in: "2s" > at java.base/java.net.URLDecoder.decode(URLDecoder.java:232) > at java.base/java.net.URLDecoder.decode(URLDecoder.java:142) > at > org.apache.spark.sql.catalyst.expressions.UrlCodec$.decode(urlExpressions.scala:113) > at > org.apache.spark.sql.catalyst.expressions.UrlCodec.decode(urlExpressions.scala) -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
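A minimal sketch of the kind of guard this asks for: catch the JDK exception and re-raise with an error-class-style message. The wrapper below is illustrative only (it uses a plain IllegalArgumentException and a hypothetical CANNOT_DECODE_URL label); the real fix would raise a proper Spark error-class exception:
{code:java}
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;

public class UrlDecodeGuard {
    public static String decode(String url) {
        try {
            return URLDecoder.decode(url, StandardCharsets.UTF_8);
        } catch (IllegalArgumentException e) {
            // Hypothetical error-class name, shown only to illustrate the shape.
            throw new IllegalArgumentException(
                "[CANNOT_DECODE_URL] The provided URL cannot be decoded: " + url, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(decode("http%3A%2F%2Fspark.apache.org")); // decodes fine
        decode("http%3A%2F%2spark.apache.org");                      // raises the error class
    }
}
{code}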
[jira] [Created] (SPARK-44910) Encoders.bean does not support superclasses with generic type arguments
Giambattista Bloisi created SPARK-44910: --- Summary: Encoders.bean does not support superclasses with generic type arguments Key: SPARK-44910 URL: https://issues.apache.org/jira/browse/SPARK-44910 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.4.1, 3.5.0, 4.0.0 Reporter: Giambattista Bloisi As per SPARK-44634, another unsupported feature of the bean encoder is a bean whose superclass has generic type arguments. For example: {code:java} class JavaBeanWithGenericsA<T> { public T getPropertyA() { return null; } public void setPropertyA(T a) { } } class JavaBeanWithGenericBase extends JavaBeanWithGenericsA<String> { } Encoders.bean(JavaBeanWithGenericBase.class); // Exception {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-44885) NullPointerException is thrown when column with ROWID type contains NULL values
[ https://issues.apache.org/jira/browse/SPARK-44885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-44885: Assignee: Tim Nieradzik > NullPointerException is thrown when column with ROWID type contains NULL > values > --- > > Key: SPARK-44885 > URL: https://issues.apache.org/jira/browse/SPARK-44885 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 3.4.1 >Reporter: Tim Nieradzik >Assignee: Tim Nieradzik >Priority: Major > Fix For: 3.5.0, 4.0.0 > > > A row ID may be NULL in an Oracle table. When this is the case, the following > exception is thrown: > {{[info] Cause: java.lang.NullPointerException:}} > {{[info] at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$makeGetter$12(JdbcUtils.scala:452)}} > {{[info] at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$makeGetter$12$adapted(JdbcUtils.scala:451)}} > {{[info] at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anon$1.getNext(JdbcUtils.scala:361)}} > {{[info] at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anon$1.getNext(JdbcUtils.scala:343)}} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-44885) NullPointerException is thrown when column with ROWID type contains NULL values
[ https://issues.apache.org/jira/browse/SPARK-44885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-44885. -- Fix Version/s: 3.5.0 4.0.0 Resolution: Fixed Issue resolved by pull request 42576 [https://github.com/apache/spark/pull/42576] > NullPointerException is thrown when column with ROWID type contains NULL > values > --- > > Key: SPARK-44885 > URL: https://issues.apache.org/jira/browse/SPARK-44885 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 3.4.1 >Reporter: Tim Nieradzik >Priority: Major > Fix For: 3.5.0, 4.0.0 > > > A row ID may be NULL in an Oracle table. When this is the case, the following > exception is thrown: > {{[info] Cause: java.lang.NullPointerException:}} > {{[info] at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$makeGetter$12(JdbcUtils.scala:452)}} > {{[info] at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$makeGetter$12$adapted(JdbcUtils.scala:451)}} > {{[info] at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anon$1.getNext(JdbcUtils.scala:361)}} > {{[info] at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anon$1.getNext(JdbcUtils.scala:343)}} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
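The NPE comes from calling toString() on a null java.sql.RowId inside the JDBC value getter. A rough sketch of the null guard such a fix needs; the actual getter lives in Scala in JdbcUtils.makeGetter, so this standalone Java version only shows the shape:
{code:java}
import java.sql.ResultSet;
import java.sql.RowId;
import java.sql.SQLException;

public class RowIdGetterSketch {
    // Read a ROWID column null-safely: a NULL row ID must map to SQL NULL
    // instead of failing on rowId.toString().
    public static String readRowId(ResultSet rs, int pos) throws SQLException {
        RowId rowId = rs.getRowId(pos);
        return (rowId == null) ? null : rowId.toString();
    }
}
{code}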
[jira] [Updated] (SPARK-44786) Convert common Spark exceptions
[ https://issues.apache.org/jira/browse/SPARK-44786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-44786: - Fix Version/s: 3.5.0 > Convert common Spark exceptions > --- > > Key: SPARK-44786 > URL: https://issues.apache.org/jira/browse/SPARK-44786 > Project: Spark > Issue Type: New Feature > Components: Connect >Affects Versions: 3.5.0 >Reporter: Yihong He >Assignee: Yihong He >Priority: Major > Fix For: 3.5.0, 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-44786) Convert common Spark exceptions
[ https://issues.apache.org/jira/browse/SPARK-44786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-44786. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 42472 [https://github.com/apache/spark/pull/42472] > Convert common Spark exceptions > --- > > Key: SPARK-44786 > URL: https://issues.apache.org/jira/browse/SPARK-44786 > Project: Spark > Issue Type: New Feature > Components: Connect >Affects Versions: 3.5.0 >Reporter: Yihong He >Assignee: Yihong He >Priority: Major > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-44786) Convert common Spark exceptions
[ https://issues.apache.org/jira/browse/SPARK-44786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-44786: Assignee: Yihong He > Convert common Spark exceptions > --- > > Key: SPARK-44786 > URL: https://issues.apache.org/jira/browse/SPARK-44786 > Project: Spark > Issue Type: New Feature > Components: Connect >Affects Versions: 3.5.0 >Reporter: Yihong He >Assignee: Yihong He >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38958) Override S3 Client in Spark Write/Read calls
[ https://issues.apache.org/jira/browse/SPARK-38958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17757364#comment-17757364 ] Steve Loughran commented on SPARK-38958: [~hershalb] We are about to merge the v2 SDK feature set; it'd be good for you to see if your changes work there. As for static headers, I could imagine something like what we added in HADOOP-17833 for adding headers to created files: # Define a well-known prefix, e.g. {{fs.s3a.request.headers.}} # Every key which matches fs.s3a.request.headers.* becomes a header; the value becomes the header value. The alternative is, as done for custom signers, a list of key=value pairs separated by commas. > Override S3 Client in Spark Write/Read calls > > > Key: SPARK-38958 > URL: https://issues.apache.org/jira/browse/SPARK-38958 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 3.2.1 >Reporter: Hershal >Priority: Major > > Hello, > I have been working on using Spark to read and write data to S3. Unfortunately, > there are a few S3 headers that I need to add to my Spark read/write calls. > After much looking, I have not found a way to replace the S3 client that > Spark uses to make the read/write calls. I also have not found a > configuration that allows me to pass in S3 headers. Here is an example of > some common S3 request headers > ([https://docs.aws.amazon.com/AmazonS3/latest/API/RESTCommonRequestHeaders.html]). > Does functionality already exist to add S3 headers to Spark read/write > calls, or to pass in a custom client that would send these headers on every > read/write request? Appreciate the help and feedback > > Thanks, -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
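To make the proposed scheme concrete, a hypothetical illustration: none of these keys exist today, and the header names are examples only; they just sketch the two options described in the comment above:
{noformat}
# Option 1 (hypothetical): every key under the prefix becomes one request header
spark.hadoop.fs.s3a.request.headers.x-amz-expected-bucket-owner=123456789012
spark.hadoop.fs.s3a.request.headers.Cache-Control=no-store

# Option 2 (hypothetical): one comma-separated list of key=value pairs
spark.hadoop.fs.s3a.request.headers=x-amz-expected-bucket-owner=123456789012,Cache-Control=no-store
{noformat}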
[jira] [Created] (SPARK-44909) Skip starting torch distributor log streaming server when it is not available
Weichen Xu created SPARK-44909: -- Summary: Skip starting torch distributor log streaming server when it is not available Key: SPARK-44909 URL: https://issues.apache.org/jira/browse/SPARK-44909 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 3.5 Reporter: Weichen Xu Skip starting the torch distributor log streaming server when it is not available. In some cases, e.g. in a Databricks Connect cluster, a network limitation causes the log streaming server to fail to start, but this does not need to break the torch distributor training routine. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-44905) NullPointerException on stateful expression evaluation
[ https://issues.apache.org/jira/browse/SPARK-44905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17757297#comment-17757297 ] ASF GitHub Bot commented on SPARK-44905: User 'yaooqinn' has created a pull request for this issue: https://github.com/apache/spark/pull/42601 > NullPointerException on stateful expression evaluation > -- > > Key: SPARK-44905 > URL: https://issues.apache.org/jira/browse/SPARK-44905 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.1, 3.5.0, 4.0.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26398) Support building GPU docker images
[ https://issues.apache.org/jira/browse/SPARK-26398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17757293#comment-17757293 ] comet commented on SPARK-26398: --- I see #23347 was closed without being merged. Does that mean GPU support is not available in Spark and we need to build the Docker image ourselves? Is there any guide available? > Support building GPU docker images > -- > > Key: SPARK-26398 > URL: https://issues.apache.org/jira/browse/SPARK-26398 > Project: Spark > Issue Type: Improvement > Components: Kubernetes, Spark Core >Affects Versions: 2.4.0 >Reporter: Rong Ou >Priority: Minor > > To run Spark on Kubernetes, a user first needs to build docker images using > the `bin/docker-image-tool.sh` script. However, this script only supports > building images for running on CPUs. As parts of Spark and related libraries > (e.g. XGBoost) get accelerated on GPUs, it's desirable to build base images > that can take advantage of GPU acceleration. > This issue only addresses building docker images with CUDA support. Actually > accelerating Spark on GPUs is outside the scope, as is supporting other types > of GPUs. > Today if anyone wants to experiment with running Spark on Kubernetes with GPU > support, they have to write their own custom `Dockerfile`. By providing an > "official" way to build GPU-enabled docker images, we can make it easier to > get started. > For now probably not that many people care about this, but it's a necessary > first step towards GPU acceleration for Spark on Kubernetes. > The risks are minimal as we only need to make minor changes to > `bin/docker-image-tool.sh`. The PR is already done and will be attached. > Success means anyone can easily build Spark docker images with GPU support. > Proposed API changes: add an optional `-g` flag to > `bin/docker-image-tool.sh` for building GPU versions of the JVM/Python/R > docker images. When the `-g` flag is omitted, existing behavior is preserved. > Design sketch: when the `-g` flag is specified, we append `-gpu` to the > docker image names, and switch to dockerfiles based on the official CUDA > images. Since the CUDA images are based on Ubuntu while the Spark dockerfiles > are based on Alpine, steps for setting up additional packages are different, > so there is a parallel set of `Dockerfile.gpu` files. > Alternative: if we are willing to forego Alpine and switch to Ubuntu for the > CPU-only images, the two sets of dockerfiles can be unified, and we can just > pass in a different base image depending on whether the `-g` flag is present > or not. > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
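To answer the usage question concretely: the `-g` flag only ever existed in the unmerged PR referenced above, so the released `bin/docker-image-tool.sh` builds CPU-only images. A sketch of the intended (hypothetical) invocation next to today's real one:
{noformat}
# hypothetical usage of the proposed, unmerged -g flag
./bin/docker-image-tool.sh -r <repo> -t <tag> -g build

# today's CPU-only behavior, without the flag
./bin/docker-image-tool.sh -r <repo> -t <tag> build
{noformat}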
[jira] [Commented] (SPARK-44908) Fix spark connect ML crossvalidator "foldCol" param
[ https://issues.apache.org/jira/browse/SPARK-44908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17757292#comment-17757292 ] ASF GitHub Bot commented on SPARK-44908: User 'WeichenXu123' has created a pull request for this issue: https://github.com/apache/spark/pull/42605 > Fix spark connect ML crossvalidator "foldCol" param > --- > > Key: SPARK-44908 > URL: https://issues.apache.org/jira/browse/SPARK-44908 > Project: Spark > Issue Type: Bug > Components: Connect, ML >Affects Versions: 3.5.0 >Reporter: Weichen Xu >Assignee: Weichen Xu >Priority: Minor > > Fix spark connect ML crossvalidator "foldCol" param. > > Currently it calls `df.rdd` APIs, which are not supported in Spark Connect. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-44908) Fix spark connect ML crossvalidator "foldCol" param
Weichen Xu created SPARK-44908: -- Summary: Fix spark connect ML crossvalidator "foldCol" param Key: SPARK-44908 URL: https://issues.apache.org/jira/browse/SPARK-44908 Project: Spark Issue Type: Bug Components: Connect, ML Affects Versions: 3.5.0 Reporter: Weichen Xu Fix spark connect ML crossvalidator "foldCol" param. Currently it calls `df.rdd` APIs, which are not supported in Spark Connect. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-44908) Fix spark connect ML crossvalidator "foldCol" param
[ https://issues.apache.org/jira/browse/SPARK-44908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weichen Xu reassigned SPARK-44908: -- Assignee: Weichen Xu > Fix spark connect ML crossvalidator "foldCol" param > --- > > Key: SPARK-44908 > URL: https://issues.apache.org/jira/browse/SPARK-44908 > Project: Spark > Issue Type: Bug > Components: Connect, ML >Affects Versions: 3.5.0 >Reporter: Weichen Xu >Assignee: Weichen Xu >Priority: Minor > > Fix spark connect ML crossvalidator "foldCol" param. > > Currently it calls `df.rdd` APIs, which are not supported in Spark Connect. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-44906) Move substituteAppNExecIds logic into kubernetesConf.annotations method
[ https://issues.apache.org/jira/browse/SPARK-44906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17757238#comment-17757238 ] GridGain Integration commented on SPARK-44906: -- User 'zwangsheng' has created a pull request for this issue: https://github.com/apache/spark/pull/42600 > Move substituteAppNExecIds logic into kubernetesConf.annotations method > > > Key: SPARK-44906 > URL: https://issues.apache.org/jira/browse/SPARK-44906 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.4.1 >Reporter: Binjie Yang >Priority: Major > > Move the Utils.substituteAppNExecIds logic into KubernetesConf.annotations as > the default behavior, so that users can easily reuse it rather than having to > re-implement the same logic. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-42768) Enable cached plan apply AQE by default
[ https://issues.apache.org/jira/browse/SPARK-42768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-42768. - Fix Version/s: 3.5.0 Resolution: Fixed Issue resolved by pull request 40390 [https://github.com/apache/spark/pull/40390] > Enable cached plan apply AQE by default > --- > > Key: SPARK-42768 > URL: https://issues.apache.org/jira/browse/SPARK-42768 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.0 >Reporter: XiDuo You >Assignee: XiDuo You >Priority: Major > Fix For: 3.5.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42768) Enable cached plan apply AQE by default
[ https://issues.apache.org/jira/browse/SPARK-42768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-42768: --- Assignee: XiDuo You > Enable cached plan apply AQE by default > --- > > Key: SPARK-42768 > URL: https://issues.apache.org/jira/browse/SPARK-42768 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.0 >Reporter: XiDuo You >Assignee: XiDuo You >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-44905) NullPointerException on stateful expression evaluation
[ https://issues.apache.org/jira/browse/SPARK-44905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao reassigned SPARK-44905: Assignee: Kent Yao > NullPointerException on stateful expression evaluation > -- > > Key: SPARK-44905 > URL: https://issues.apache.org/jira/browse/SPARK-44905 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.1, 3.5.0, 4.0.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44907) `DataFrame.join` should throw IllegalArgumentException for invalid join types
[ https://issues.apache.org/jira/browse/SPARK-44907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng updated SPARK-44907: -- Component/s: PySpark (was: Tests) > `DataFrame.join` should throw IllegalArgumentException for invalid join types > - > > Key: SPARK-44907 > URL: https://issues.apache.org/jira/browse/SPARK-44907 > Project: Spark > Issue Type: Bug > Components: Connect, PySpark >Affects Versions: 3.5.0, 4.0.0 >Reporter: Ruifeng Zheng >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-44907) `DataFrame.join` should throw IllegalArgumentException for invalid join types
Ruifeng Zheng created SPARK-44907: - Summary: `DataFrame.join` should throw IllegalArgumentException for invalid join types Key: SPARK-44907 URL: https://issues.apache.org/jira/browse/SPARK-44907 Project: Spark Issue Type: Bug Components: Connect, Tests Affects Versions: 3.5.0, 4.0.0 Reporter: Ruifeng Zheng -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44892) Add official image Dockerfile for Spark 3.3.3
[ https://issues.apache.org/jira/browse/SPARK-44892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-44892: Fix Version/s: (was: 4.0.0) > Add official image Dockerfile for Spark 3.3.3 > - > > Key: SPARK-44892 > URL: https://issues.apache.org/jira/browse/SPARK-44892 > Project: Spark > Issue Type: Sub-task > Components: Spark Docker >Affects Versions: 3.3.3 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-44892) Add official image Dockerfile for Spark 3.3.3
[ https://issues.apache.org/jira/browse/SPARK-44892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang resolved SPARK-44892. - Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 54 [https://github.com/apache/spark-docker/pull/54] > Add official image Dockerfile for Spark 3.3.3 > - > > Key: SPARK-44892 > URL: https://issues.apache.org/jira/browse/SPARK-44892 > Project: Spark > Issue Type: Sub-task > Components: Spark Docker >Affects Versions: 3.3.3 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-44892) Add official image Dockerfile for Spark 3.3.3
[ https://issues.apache.org/jira/browse/SPARK-44892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang reassigned SPARK-44892: --- Assignee: Yuming Wang > Add official image Dockerfile for Spark 3.3.3 > - > > Key: SPARK-44892 > URL: https://issues.apache.org/jira/browse/SPARK-44892 > Project: Spark > Issue Type: Sub-task > Components: Spark Docker >Affects Versions: 3.3.3 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-44890) Miswritten remarks in pom file
[ https://issues.apache.org/jira/browse/SPARK-44890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17756746#comment-17756746 ] chenyu edited comment on SPARK-44890 at 8/22/23 7:11 AM: - I have submitted a patch: [https://github.com/apache/spark/pull/42598|https://github.com/apache/spark/pull/42583] was (Author: JIRAUSER299988): I have submitted a patch: https://github.com/apache/spark/pull/42583 > Miswritten remarks in pom file > -- > > Key: SPARK-44890 > URL: https://issues.apache.org/jira/browse/SPARK-44890 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 3.4.1 >Reporter: chenyu >Priority: Minor > Attachments: screenshot-1.png > > > Spelling issues in pom files affect understanding; one comment uses 'dont update'. > It should maintain the same writing style as the other places. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-44728) Improve PySpark documentations
[ https://issues.apache.org/jira/browse/SPARK-44728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17757209#comment-17757209 ] Ruifeng Zheng edited comment on SPARK-44728 at 8/22/23 6:20 AM: - A docstring should contain the following sections: # Brief Description: A concise summary explaining the function's purpose. # Version Annotations: Annotations like versionadded and versionchanged to signify the addition or modifications of the function in different versions of the software. # Parameters: This section should list and describe all input parameters. If the function doesn't accept any parameters, this section can be omitted. # Returns: Detail what the function returns. If the function doesn't return anything, this section can be omitted. # See Also: A list of related API functions or methods. This section can be omitted if no related APIs exist. # Notes: Include additional information or warnings about the function's usage here. # Examples: ## A docstring contains 3~5 examples if possible. ## Every example should begin with a brief description, followed by the example code, and conclude with the expected output. ## Every example should be copy-pasteable; ## Any necessary import statements should be included at the beginning of each example. was (Author: podongfeng): A docstring should contain the following sections: # Brief Description: A concise summary explaining the function's purpose. # Version Annotations: Annotations like versionadded and versionchanged to signify the addition or modifications of the function in different versions of the software. # Parameters: This section should list and describe all input parameters. If the function doesn't accept any parameters, this section can be omitted. # Returns: Detail what the function returns. If the function doesn't return anything, this section can be omitted. # See Also: A list of related API functions or methods. This section can be omitted if no related APIs exist. # Notes: Include additional information or warnings about the function's usage here. # Examples: A docstring contains 3~5 examples if possible. ## Every example should begin with a brief description, followed by the example code, and conclude with the expected output. ## Every example should be copy-pasteable; ## Any necessary import statements should be included at the beginning of each example. > Improve PySpark documentations > -- > > Key: SPARK-44728 > URL: https://issues.apache.org/jira/browse/SPARK-44728 > Project: Spark > Issue Type: Umbrella > Components: PySpark >Affects Versions: 3.5.0, 4.0.0 >Reporter: Allison Wang >Priority: Major > > An umbrella Jira ticket to improve the PySpark documentation. > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-44728) Improve PySpark documentations
[ https://issues.apache.org/jira/browse/SPARK-44728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17757209#comment-17757209 ] Ruifeng Zheng edited comment on SPARK-44728 at 8/22/23 6:10 AM: A docstring should contain the following sections: # Brief Description: A concise summary explaining the function's purpose. # Version Annotations: Annotations like versionadded and versionchanged to signify the addition or modifications of the function in different versions of the software. # Parameters: This section should list and describe all input parameters. If the function doesn't accept any parameters, this section can be omitted. # Returns: Detail what the function returns. If the function doesn't return anything, this section can be omitted. # See Also: A list of related API functions or methods. This section can be omitted if no related APIs exist. # Notes: Include additional information or warnings about the function's usage here. # Examples: A docstring contains 3~5 examples if possible. ## Every example should begin with a brief description, followed by the example code, and conclude with the expected output. ## Every example should be copy-pasteable; ## Any necessary import statements should be included at the beginning of each example. was (Author: podongfeng): A good docstring should contain the following sections: # Brief Description: A concise summary explaining the function's purpose. # Version Annotations: Annotations like versionadded and versionchanged to signify the addition or modifications of the function in different versions of the software. # Parameters: This section should list and describe all input parameters. If the function doesn't accept any parameters, this section can be omitted. # Returns: Detail what the function returns. If the function doesn't return anything, this section can be omitted. # See Also: A list of related API functions or methods. This section can be omitted if no related APIs exist. # Notes: Include additional information or warnings about the function's usage here. # Examples: A docstring contains 3~5 examples if possible. ## Every example should begin with a brief description, followed by the example code, and conclude with the expected output. ## Every example should be copy-pasteable; ## Any necessary import statements should be included at the beginning of each example. > Improve PySpark documentations > -- > > Key: SPARK-44728 > URL: https://issues.apache.org/jira/browse/SPARK-44728 > Project: Spark > Issue Type: Umbrella > Components: PySpark >Affects Versions: 3.5.0, 4.0.0 >Reporter: Allison Wang >Priority: Major > > An umbrella Jira ticket to improve the PySpark documentation. > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-44904) Correct the ‘versionadded’ of `sql.functions.approx_percentile` to 3.5.0.
[ https://issues.apache.org/jira/browse/SPARK-44904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-44904: Assignee: Yang Jie > Correct the ‘versionadded’ of `sql.functions.approx_percentile` to 3.5.0. > - > > Key: SPARK-44904 > URL: https://issues.apache.org/jira/browse/SPARK-44904 > Project: Spark > Issue Type: Improvement > Components: Documentation, PySpark >Affects Versions: 3.5.0, 4.0.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-44904) Correct the ‘versionadded’ of `sql.functions.approx_percentile` to 3.5.0.
[ https://issues.apache.org/jira/browse/SPARK-44904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-44904. -- Fix Version/s: 3.5.0 4.0.0 Resolution: Fixed Issue resolved by pull request 42597 [https://github.com/apache/spark/pull/42597] > Correct the ‘versionadded’ of `sql.functions.approx_percentile` to 3.5.0. > - > > Key: SPARK-44904 > URL: https://issues.apache.org/jira/browse/SPARK-44904 > Project: Spark > Issue Type: Improvement > Components: Documentation, PySpark >Affects Versions: 3.5.0, 4.0.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Minor > Fix For: 3.5.0, 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-44728) Improve PySpark documentations
[ https://issues.apache.org/jira/browse/SPARK-44728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17757209#comment-17757209 ] Ruifeng Zheng edited comment on SPARK-44728 at 8/22/23 6:01 AM: A good docstring should contain the following sections: # Brief Description: A concise summary explaining the function's purpose. # Version Annotations: Annotations like versionadded and versionchanged to signify the addition or modifications of the function in different versions of the software. # Parameters: This section should list and describe all input parameters. If the function doesn't accept any parameters, this section can be omitted. # Returns: Detail what the function returns. If the function doesn't return anything, this section can be omitted. # See Also: A list of related API functions or methods. This section can be omitted if no related APIs exist. # Notes: Include additional information or warnings about the function's usage here. # Examples: A docstring contains 3~5 examples if possible. ## Every example should begin with a brief description, followed by the example code, and conclude with the expected output. ## Every example should be copy-pasteable; ## Any necessary import statements should be included at the beginning of each example. was (Author: podongfeng): A good docstring should contain the following sections: # Brief Description: A concise summary explaining the function's purpose. # xVersion Annotations: Annotations like versionadded and versionchanged to signify the addition or modifications of the function in different versions of the software. # Parameters: This section should list and describe all input parameters. If the function doesn't accept any parameters, this section can be omitted. # Returns: Detail what the function returns. If the function doesn't return anything, this section can be omitted. # See Also: A list of related API functions or methods. This section can be omitted if no related APIs exist. # Notes: Include additional information or warnings about the function's usage here. # Examples: A docstring contains 3~5 examples if possible. ## Every example should begin with a brief description, followed by the example code, and conclude with the expected output. ## Every example should be copy-pasteable; ## Any necessary import statements should be included at the beginning of each example. > Improve PySpark documentations > -- > > Key: SPARK-44728 > URL: https://issues.apache.org/jira/browse/SPARK-44728 > Project: Spark > Issue Type: Umbrella > Components: PySpark >Affects Versions: 3.5.0, 4.0.0 >Reporter: Allison Wang >Priority: Major > > An umbrella Jira ticket to improve the PySpark documentation. > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-44728) Improve PySpark documentations
[ https://issues.apache.org/jira/browse/SPARK-44728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17757209#comment-17757209 ] Ruifeng Zheng edited comment on SPARK-44728 at 8/22/23 5:59 AM: A good docstring should contain the following sections: # Brief Description: A concise summary explaining the function's purpose. # xVersion Annotations: Annotations like versionadded and versionchanged to signify the addition or modifications of the function in different versions of the software. # Parameters: This section should list and describe all input parameters. If the function doesn't accept any parameters, this section can be omitted. # Returns: Detail what the function returns. If the function doesn't return anything, this section can be omitted. # See Also: A list of related API functions or methods. This section can be omitted if no related APIs exist. # Notes: Include additional information or warnings about the function's usage here. # Examples: A docstring contains 3~5 examples if possible. ## Every example should begin with a brief description, followed by the example code, and conclude with the expected output. ## An example should be copy-pasteable; ## Any necessary import statements should be included at the beginning of each example. was (Author: podongfeng): A good docstring should contain the following sections: # Brief Description: A concise summary explaining the function's purpose. # xVersion Annotations: Annotations like versionadded and versionchanged to signify the addition or modifications of the function in different versions of the software. # Parameters: This section should list and describe all input parameters. If the function doesn't accept any parameters, this section can be omitted. # Returns: Detail what the function returns. If the function doesn't return anything, this section can be omitted. # See Also: A list of related API functions or methods. This section can be omitted if no related APIs exist. # Notes: Include additional information or warnings about the function's usage here. # Examples: A docstring contains 3~5 examples if possible. Every example should begin with a brief description, followed by the example code, and conclude with the expected output. ## An example should be copy-pasteable; ## Any necessary import statements should be included at the beginning of each example. > Improve PySpark documentations > -- > > Key: SPARK-44728 > URL: https://issues.apache.org/jira/browse/SPARK-44728 > Project: Spark > Issue Type: Umbrella > Components: PySpark >Affects Versions: 3.5.0, 4.0.0 >Reporter: Allison Wang >Priority: Major > > An umbrella Jira ticket to improve the PySpark documentation. > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-44728) Improve PySpark documentations
[ https://issues.apache.org/jira/browse/SPARK-44728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17757209#comment-17757209 ] Ruifeng Zheng edited comment on SPARK-44728 at 8/22/23 5:59 AM: A good docstring should contain the following sections: # Brief Description: A concise summary explaining the function's purpose. # xVersion Annotations: Annotations like versionadded and versionchanged to signify the addition or modifications of the function in different versions of the software. # Parameters: This section should list and describe all input parameters. If the function doesn't accept any parameters, this section can be omitted. # Returns: Detail what the function returns. If the function doesn't return anything, this section can be omitted. # See Also: A list of related API functions or methods. This section can be omitted if no related APIs exist. # Notes: Include additional information or warnings about the function's usage here. # Examples: A docstring contains 3~5 examples if possible. Every example should begin with a brief description, followed by the example code, and conclude with the expected output. ## An example should be copy-pasteable; ## Any necessary import statements should be included at the beginning of each example. was (Author: podongfeng): A good docstring should contain the following sections: # Brief Description: A concise summary explaining the function's purpose. # xVersion Annotations: Annotations like versionadded and versionchanged to signify the addition or modifications of the function in different versions of the software. # Parameters: This section should list and describe all input parameters. If the function doesn't accept any parameters, this section can be omitted. # Returns: Detail what the function returns. If the function doesn't return anything, this section can be omitted. # See Also: A list of related API functions or methods. This section can be omitted if no related APIs exist. # Notes: Include additional information or warnings about the function's usage here. # Examples: Every example should begin with a brief description, followed by the example code, and conclude with the expected output. Any necessary import statements should be included at the beginning of each example. > Improve PySpark documentations > -- > > Key: SPARK-44728 > URL: https://issues.apache.org/jira/browse/SPARK-44728 > Project: Spark > Issue Type: Umbrella > Components: PySpark >Affects Versions: 3.5.0, 4.0.0 >Reporter: Allison Wang >Priority: Major > > An umbrella Jira ticket to improve the PySpark documentation. > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org