[jira] [Created] (SPARK-43247) Add apache/spark docker image overview

2023-04-23 Thread Yikun Jiang (Jira)
Yikun Jiang created SPARK-43247:
---

 Summary: Add apache/spark docker image overview
 Key: SPARK-43247
 URL: https://issues.apache.org/jira/browse/SPARK-43247
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation, Spark Docker
Affects Versions: 3.5.0
Reporter: Yikun Jiang









[jira] [Resolved] (SPARK-43234) Migrate ValueError from Connect DataFrame into error class

2023-04-23 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng resolved SPARK-43234.
---
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 40910
[https://github.com/apache/spark/pull/40910]

> Migrate ValueError from Connect DataFrame into error class
> --
>
> Key: SPARK-43234
> URL: https://issues.apache.org/jira/browse/SPARK-43234
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.5.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
> Fix For: 3.5.0
>
>
> Migrate ValueError from Connect DataFrame into error class
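
For context, a minimal sketch of what "migrating into an error class" looks like on the
PySpark side; the error class name and message parameters below are illustrative
assumptions, not necessarily the ones used in this ticket:

{code:python}
from pyspark.errors import PySparkValueError

def require_columns(columns):
    # Instead of a plain ValueError, raise the PySpark error with an error class
    # and message parameters (both names here are illustrative).
    if not columns:
        raise PySparkValueError(
            error_class="CANNOT_BE_EMPTY",
            message_parameters={"item": "columns"},
        )
{code}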






[jira] [Assigned] (SPARK-43234) Migrate ValueError from Connect DataFrame into error class

2023-04-23 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng reassigned SPARK-43234:
-

Assignee: Haejoon Lee

> Migrate ValueError from Connect DataFrame into error class
> --
>
> Key: SPARK-43234
> URL: https://issues.apache.org/jira/browse/SPARK-43234
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.5.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
>
> Migrate ValueError from Connect DataFrame into error class






[jira] [Resolved] (SPARK-43230) Simplify `DataFrameNaFunctions.fillna`

2023-04-23 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng resolved SPARK-43230.
---
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 40898
[https://github.com/apache/spark/pull/40898]

> Simplify `DataFrameNaFunctions.fillna`
> --
>
> Key: SPARK-43230
> URL: https://issues.apache.org/jira/browse/SPARK-43230
> Project: Spark
>  Issue Type: New Feature
>  Components: Connect
>Affects Versions: 3.5.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Minor
> Fix For: 3.5.0
>
>







[jira] [Assigned] (SPARK-43230) Simplify `DataFrameNaFunctions.fillna`

2023-04-23 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng reassigned SPARK-43230:
-

Assignee: Ruifeng Zheng

> Simplify `DataFrameNaFunctions.fillna`
> --
>
> Key: SPARK-43230
> URL: https://issues.apache.org/jira/browse/SPARK-43230
> Project: Spark
>  Issue Type: New Feature
>  Components: Connect
>Affects Versions: 3.5.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Minor
>







[jira] [Comment Edited] (SPARK-43246) Make `CheckConnectJvmClientCompatibility` filter the check on `private[packageName]` scope member as default

2023-04-23 Thread Yang Jie (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17715592#comment-17715592
 ] 

Yang Jie edited comment on SPARK-43246 at 4/24/23 6:12 AM:
---

As [~ruifengz] mentioned in 
[https://github.com/apache/spark/pull/40898/files#r1174782564]


was (Author: luciferyang):
As [https://github.com/apache/spark/pull/40898/files#r1174782564]  mentioned

> Make `CheckConnectJvmClientCompatibility`  filter the check on 
> `private[packageName]` scope member as default
> -
>
> Key: SPARK-43246
> URL: https://issues.apache.org/jira/browse/SPARK-43246
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect
>Affects Versions: 3.5.0
>Reporter: Yang Jie
>Priority: Major
>







[jira] [Commented] (SPARK-43246) Make `CheckConnectJvmClientCompatibility` filter the check on `private[packageName]` scope member as default

2023-04-23 Thread Yang Jie (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17715592#comment-17715592
 ] 

Yang Jie commented on SPARK-43246:
--

As [https://github.com/apache/spark/pull/40898/files#r1174782564]  mentioned

> Make `CheckConnectJvmClientCompatibility`  filter the check on 
> `private[packageName]` scope member as default
> -
>
> Key: SPARK-43246
> URL: https://issues.apache.org/jira/browse/SPARK-43246
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect
>Affects Versions: 3.5.0
>Reporter: Yang Jie
>Priority: Major
>







[jira] [Updated] (SPARK-43246) Make `CheckConnectJvmClientCompatibility` filter the check on `private[packageName]` scope member as default

2023-04-23 Thread Yang Jie (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Jie updated SPARK-43246:
-
Epic Link: SPARK-42554

> Make `CheckConnectJvmClientCompatibility`  filter the check on 
> `private[packageName]` scope member as default
> -
>
> Key: SPARK-43246
> URL: https://issues.apache.org/jira/browse/SPARK-43246
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect
>Affects Versions: 3.5.0
>Reporter: Yang Jie
>Priority: Major
>







[jira] [Created] (SPARK-43246) Make `CheckConnectJvmClientCompatibility` filter the check on `private[packageName]` scope member as default

2023-04-23 Thread Yang Jie (Jira)
Yang Jie created SPARK-43246:


 Summary: Make `CheckConnectJvmClientCompatibility`  filter the 
check on `private[packageName]` scope member as default
 Key: SPARK-43246
 URL: https://issues.apache.org/jira/browse/SPARK-43246
 Project: Spark
  Issue Type: Improvement
  Components: Connect
Affects Versions: 3.5.0
Reporter: Yang Jie









[jira] [Resolved] (SPARK-43209) Migrate Expression errors into error class

2023-04-23 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng resolved SPARK-43209.
---
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 40869
[https://github.com/apache/spark/pull/40869]

> Migrate Expression errors into error class
> --
>
> Key: SPARK-43209
> URL: https://issues.apache.org/jira/browse/SPARK-43209
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.5.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
> Fix For: 3.5.0
>
>
> Migrate Expression errors into error class






[jira] [Assigned] (SPARK-43209) Migrate Expression errors into error class

2023-04-23 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng reassigned SPARK-43209:
-

Assignee: Haejoon Lee

> Migrate Expression errors into error class
> --
>
> Key: SPARK-43209
> URL: https://issues.apache.org/jira/browse/SPARK-43209
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.5.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
>
> Migrate Expression errors into error class






[jira] [Updated] (SPARK-43240) df.describe() method may return wrong result if the last RDD is RDD[UnsafeRow]

2023-04-23 Thread Ke Jia (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ke Jia updated SPARK-43240:
---
Affects Version/s: 3.3.2
   (was: 3.2.2)

> df.describe() method may return wrong result if the last RDD is 
> RDD[UnsafeRow]
> ---
>
> Key: SPARK-43240
> URL: https://issues.apache.org/jira/browse/SPARK-43240
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.3.2
>Reporter: Ke Jia
>Priority: Major
>
> When calling the df.describe() method, the result may be wrong when the last 
> RDD is RDD[UnsafeRow]. This is because the UnsafeRow will be released after the 
> row is used. 






[jira] [Created] (SPARK-43245) Fix DatetimeIndex.microsecond to return 'int32' instead of 'int64' type of Index.

2023-04-23 Thread Haejoon Lee (Jira)
Haejoon Lee created SPARK-43245:
---

 Summary: Fix DatetimeIndex.microsecond to return 'int32' instead 
of 'int64' type of Index.
 Key: SPARK-43245
 URL: https://issues.apache.org/jira/browse/SPARK-43245
 Project: Spark
  Issue Type: Sub-task
  Components: Pandas API on Spark
Affects Versions: 3.5.0
Reporter: Haejoon Lee


https://pandas.pydata.org/docs/dev/whatsnew/v2.0.0.html#index-can-now-hold-numpy-numeric-dtypes
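
For reference, a quick sketch of the pandas 2.x behavior being matched here; the dtype
shown assumes pandas >= 2.0:

{code:python}
import pandas as pd

idx = pd.DatetimeIndex(["2023-04-23 12:00:00.000123"])
# pandas 2.x returns datetime field accessors as an Index with a 32-bit dtype.
print(idx.microsecond.dtype)  # int32
{code}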






[jira] [Assigned] (SPARK-43210) Introduce PySparkAssertionError

2023-04-23 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng reassigned SPARK-43210:
-

Assignee: Haejoon Lee

> Introduce PySparkAssertionError
> ---
>
> Key: SPARK-43210
> URL: https://issues.apache.org/jira/browse/SPARK-43210
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.5.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
>
> Introduce PySparkAssertionError






[jira] [Resolved] (SPARK-43210) Introduce PySparkAssertionError

2023-04-23 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng resolved SPARK-43210.
---
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 40868
[https://github.com/apache/spark/pull/40868]

> Introduce PySparkAssertionError
> ---
>
> Key: SPARK-43210
> URL: https://issues.apache.org/jira/browse/SPARK-43210
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.5.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
> Fix For: 3.5.0
>
>
> Introduce PySparkAssertionError






[jira] [Updated] (SPARK-43113) Codegen error when full outer join's bound condition has multiple references to the same stream-side column

2023-04-23 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-43113:
-
Fix Version/s: 3.3.3

> Codegen error when full outer join's bound condition has multiple references 
> to the same stream-side column
> ---
>
> Key: SPARK-43113
> URL: https://issues.apache.org/jira/browse/SPARK-43113
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.2, 3.4.0, 3.5.0
>Reporter: Bruce Robbins
>Assignee: Bruce Robbins
>Priority: Major
> Fix For: 3.3.3, 3.4.1, 3.5.0
>
>
> Example # 1 (sort merge join):
> {noformat}
> create or replace temp view v1 as
> select * from values
> (1, 1),
> (2, 2),
> (3, 1)
> as v1(key, value);
> create or replace temp view v2 as
> select * from values
> (1, 22, 22),
> (3, -1, -1),
> (7, null, null)
> as v2(a, b, c);
> select *
> from v1
> full outer join v2
> on key = a
> and value > b
> and value > c;
> {noformat}
> The join's generated code causes the following compilation error:
> {noformat}
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 277, Column 9: Redefinition of local variable "smj_isNull_7"
> {noformat}
> Example #2 (shuffle hash join):
> {noformat}
> select /*+ SHUFFLE_HASH(v2) */ *
> from v1
> full outer join v2
> on key = a
> and value > b
> and value > c;
> {noformat}
> The shuffle hash join's generated code causes the following compilation error:
> {noformat}
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 174, Column 5: Redefinition of local variable "shj_value_1" 
> {noformat}
> With default configuration, both queries end up succeeding, since Spark falls 
> back to running each query with whole-stage codegen disabled.
> The issue happens only when the join's bound condition refers to the same 
> stream-side column more than once.
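
To surface the compilation error instead of the silent fallback when reproducing this,
one option is to disable the codegen fallback; a minimal sketch, assuming an active
SparkSession in `spark` and the internal `spark.sql.codegen.fallback` setting:

{code:python}
# Fail instead of falling back to interpreted execution on codegen errors.
spark.conf.set("spark.sql.codegen.fallback", "false")
spark.sql("""
  select * from v1 full outer join v2
  on key = a and value > b and value > c
""").show()
{code}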






[jira] [Resolved] (SPARK-42945) Support PYSPARK_JVM_STACKTRACE_ENABLED in Spark Connect

2023-04-23 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng resolved SPARK-42945.
---
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 40575
[https://github.com/apache/spark/pull/40575]

> Support PYSPARK_JVM_STACKTRACE_ENABLED in Spark Connect
> ---
>
> Key: SPARK-42945
> URL: https://issues.apache.org/jira/browse/SPARK-42945
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect
>Affects Versions: 3.5.0
>Reporter: Allison Wang
>Assignee: Allison Wang
>Priority: Major
> Fix For: 3.5.0
>
>
> Make the PySpark setting PYSPARK_JVM_STACKTRACE_ENABLED work with Spark 
> Connect.
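
For context, the setting is exposed as a SQL conf; a minimal usage sketch, assuming a
Spark Connect session in `spark`:

{code:python}
from pyspark.errors import AnalysisException

# Include the JVM stack trace alongside the Python-facing error message.
spark.conf.set("spark.sql.pyspark.jvmStacktrace.enabled", "true")
try:
    spark.sql("select nonexistent_column").collect()
except AnalysisException as e:
    print(e)  # with the flag enabled, the message carries the JVM stack trace
{code}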






[jira] [Resolved] (SPARK-43212) Migrate Structured Streaming errors into error class

2023-04-23 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-43212.
--
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 40880
[https://github.com/apache/spark/pull/40880]

> Migrate Structured Streaming errors into error class
> 
>
> Key: SPARK-43212
> URL: https://issues.apache.org/jira/browse/SPARK-43212
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, Structured Streaming
>Affects Versions: 3.5.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
> Fix For: 3.5.0
>
>
> Migrate Structured Streaming errors into error class






[jira] [Assigned] (SPARK-43212) Migrate Structured Streaming errors into error class

2023-04-23 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-43212:


Assignee: Haejoon Lee

> Migrate Structured Streaming errors into error class
> 
>
> Key: SPARK-43212
> URL: https://issues.apache.org/jira/browse/SPARK-43212
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, Structured Streaming
>Affects Versions: 3.5.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
>
> Migrate Structured Streaming errors into error class






[jira] [Assigned] (SPARK-43239) Remove null_counts from info()

2023-04-23 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-43239:


Assignee: Bjørn Jørgensen

> Remove null_counts from info()
> --
>
> Key: SPARK-43239
> URL: https://issues.apache.org/jira/browse/SPARK-43239
> Project: Spark
>  Issue Type: Bug
>  Components: Pandas API on Spark
>Affects Versions: 3.4.0, 3.5.0
>Reporter: Bjørn Jørgensen
>Assignee: Bjørn Jørgensen
>Priority: Major
>
> df.info() is broken now. 
> It prints 
> TypeError Traceback (most recent call last)
> Cell In[12], line 1
> > 1 F05.info()
> File /opt/spark/python/pyspark/pandas/frame.py:12167, in DataFrame.info(self, 
> verbose, buf, max_cols, null_counts)
>   12163 count_func = self.count
>   12164 self.count = (  # type: ignore[assignment]
>   12165 lambda: count_func()._to_pandas()  # type: ignore[assignment, 
> misc, union-attr]
>   12166 )
> > 12167 return pd.DataFrame.info(
>   12168 self,  # type: ignore[arg-type]
>   12169 verbose=verbose,
>   12170 buf=buf,
>   12171 max_cols=max_cols,
>   12172 memory_usage=False,
>   12173 null_counts=null_counts,
>   12174 )
>   12175 finally:
>   12176 del self._data
> TypeError: DataFrame.info() got an unexpected keyword argument 'null_counts'
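
For reference, pandas removed the null_counts keyword in 2.0 in favour of show_counts
(deprecated since 1.2); a minimal sketch of the compatible call, with illustrative
arguments:

{code:python}
import pandas as pd

pdf = pd.DataFrame({"a": [1, None, 3]})
# pandas >= 2.0: pass show_counts; the old null_counts keyword no longer exists.
pdf.info(verbose=True, memory_usage=False, show_counts=True)
{code}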






[jira] [Resolved] (SPARK-43239) Remove null_counts from info()

2023-04-23 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-43239.
--
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 40913
[https://github.com/apache/spark/pull/40913]

> Remove null_counts from info()
> --
>
> Key: SPARK-43239
> URL: https://issues.apache.org/jira/browse/SPARK-43239
> Project: Spark
>  Issue Type: Bug
>  Components: Pandas API on Spark
>Affects Versions: 3.4.0, 3.5.0
>Reporter: Bjørn Jørgensen
>Assignee: Bjørn Jørgensen
>Priority: Major
> Fix For: 3.5.0
>
>
> df.info() is broken now. 
> It prints 
> TypeError Traceback (most recent call last)
> Cell In[12], line 1
> > 1 F05.info()
> File /opt/spark/python/pyspark/pandas/frame.py:12167, in DataFrame.info(self, 
> verbose, buf, max_cols, null_counts)
>   12163 count_func = self.count
>   12164 self.count = (  # type: ignore[assignment]
>   12165 lambda: count_func()._to_pandas()  # type: ignore[assignment, 
> misc, union-attr]
>   12166 )
> > 12167 return pd.DataFrame.info(
>   12168 self,  # type: ignore[arg-type]
>   12169 verbose=verbose,
>   12170 buf=buf,
>   12171 max_cols=max_cols,
>   12172 memory_usage=False,
>   12173 null_counts=null_counts,
>   12174 )
>   12175 finally:
>   12176 del self._data
> TypeError: DataFrame.info() got an unexpected keyword argument 'null_counts'






[jira] [Commented] (SPARK-43244) RocksDB State Store can accumulate unbounded native memory

2023-04-23 Thread Adam Binford (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17715525#comment-17715525
 ] 

Adam Binford commented on SPARK-43244:
--

Read through [https://github.com/apache/spark/pull/30344] for all the comments 
about close/abort. Adding a new method (like close) in addition to abort seems 
like the only real option here?

> RocksDB State Store can accumulate unbounded native memory
> --
>
> Key: SPARK-43244
> URL: https://issues.apache.org/jira/browse/SPARK-43244
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Affects Versions: 3.3.2
>Reporter: Adam Binford
>Priority: Major
>
> We noticed in one of our production stateful streaming jobs using RocksDB 
> that an executor with 20g of heap was using around 40g of resident memory. I 
> noticed a single RocksDB instance was using around 150 MiB of memory, and 
> only 5 MiB or so of this was from the write batch (which is now cleared after 
> committing).
> After reading about RocksDB memory usage (this link was helpful: 
> [https://github.com/EighteenZi/rocksdb_wiki/blob/master/Memory-usage-in-RocksDB.md])
>  I realized a lot of this was likely the "Index and Filters" memory usage. 
> This job is doing streaming deduplication with a lot of unique keys, so it makes 
> sense these block usages would be high. The problem is that, as it is 
> now, the underlying RocksDB instance stays open on an executor as long as that 
> executor is assigned that stateful partition (to be reused across batches). 
> So a single executor can accumulate a large number of RocksDB instances open 
> at once, each using a certain amount of native memory. In the worst case, a 
> single executor could need to keep every single partition's RocksDB 
> instance open at once. 
> There are a couple of ways you can control the amount of memory used, such as 
> limiting the max open files, or adding the option to use the block cache for 
> the indices and filters, but neither of these solves the underlying problem of 
> accumulating native memory from multiple partitions on an executor.
> The real fix needs to be a mechanism and option to close the underlying 
> RocksDB instance at the end of each task, so you have the option to only ever 
> have one RocksDB instance open at a time, thus having predictable memory 
> usage no matter the size of your data or number of shuffle partitions. 
> We are running this on Spark 3.3, but just kicked off a test to see if things 
> are any different in Spark 3.4.






[jira] [Updated] (SPARK-43244) RocksDB State Store can accumulate unbounded native memory

2023-04-23 Thread Adam Binford (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam Binford updated SPARK-43244:
-
Description: 
We noticed in one of our production stateful streaming jobs using RocksDB that 
an executor with 20g of heap was using around 40g of resident memory. I noticed 
a single RocksDB instance was using around 150 MiB of memory, and only 5 MiB or 
so of this was from the write batch (which is now cleared after committing).

After reading about RocksDB memory usage (this link was helpful: 
[https://github.com/EighteenZi/rocksdb_wiki/blob/master/Memory-usage-in-RocksDB.md])
 I realized a lot of this was likely the "Index and Filters" memory usage. This 
job is doing streaming deduplication with a lot of unique keys, so it makes sense 
these block usages would be high. The problem is that, as it is now, the 
underlying RocksDB instance stays open on an executor as long as that executor 
is assigned that stateful partition (to be reused across batches). So a single 
executor can accumulate a large number of RocksDB instances open at once, each 
using a certain amount of native memory. In the worst case, a single executor 
could need to keep every single partition's RocksDB instance open at once. 

There are a couple of ways you can control the amount of memory used, such as 
limiting the max open files, or adding the option to use the block cache for 
the indices and filters, but neither of these solves the underlying problem of 
accumulating native memory from multiple partitions on an executor.

The real fix needs to be a mechanism and option to close the underlying RocksDB 
instance at the end of each task, so you have the option to only ever have one 
RocksDB instance open at a time, thus having predictable memory usage no matter 
the size of your data or number of shuffle partitions. 

We are running this on Spark 3.3, but just kicked off a test to see if things 
are any different in Spark 3.4.

  was:
We noticed in one of our production stateful streaming jobs using RocksDB that 
an executor with 20g of heap was using around 40g of resident memory. I noticed 
a single RocksDB instance was using around 150 MiB of memory, and only 5 MiB or 
so of this was from the write batch (which is now cleared after committing).

After reading about RocksDB memory usage (this link was helpful: 
[https://github.com/EighteenZi/rocksdb_wiki/blob/master/Memory-usage-in-RocksDB.md])
 I realized a lot of this was likely the "Index and Filters" memory usage. This 
job is doing a streaming duplicate with a lot of unique values so it makes 
sense these block usages would be high. The problem is that, because as it is 
now the underlying RocksDB instance stays open on an executor as long as that 
executor is assigned that stateful partition (to be reused across batches). So 
a single executor can accumulate a large number of RocksDB instances open at 
once, each using a certain amount of native memory. In the worst case, a single 
executor could need to keep open every single partitions' RocksDB instance at 
once. 

There are a couple ways you can control the amount of memory used, such as 
limiting the max open files, or adding the option to use the block cache for 
the indices and filters, but neither of these solve the underlying problem of 
accumulating native memory from multiple partitions on an executor.

The real fix needs to be a mechanism and option to close the underlying RocksDB 
instance at the end of each task, so you have the option to only ever have one 
RocksDB instance open at a time, thus having predictable memory usage no matter 
the size of your data or number of shuffle partitions. 

We are running this on Spark 3.3, but just kicked off a test to see if things 
are any different in Spark 3.4.


> RocksDB State Store can accumulate unbounded native memory
> --
>
> Key: SPARK-43244
> URL: https://issues.apache.org/jira/browse/SPARK-43244
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Affects Versions: 3.3.2
>Reporter: Adam Binford
>Priority: Major
>
> We noticed in one of our production stateful streaming jobs using RocksDB 
> that an executor with 20g of heap was using around 40g of resident memory. I 
> noticed a single RocksDB instance was using around 150 MiB of memory, and 
> only 5 MiB or so of this was from the write batch (which is now cleared after 
> committing).
> After reading about RocksDB memory usage (this link was helpful: 
> [https://github.com/EighteenZi/rocksdb_wiki/blob/master/Memory-usage-in-RocksDB.md])
>  I realized a lot of this was likely the "Index and Filters" memory usage. 
> This job is doing a streaming duplicate with a lot of unique keys so it makes 
> sense these block usages would be high. The pr

[jira] [Updated] (SPARK-43244) RocksDB State Store can accumulate unbounded native memory

2023-04-23 Thread Adam Binford (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam Binford updated SPARK-43244:
-
Description: 
We noticed in one of our production stateful streaming jobs using RocksDB that 
an executor with 20g of heap was using around 40g of resident memory. I noticed 
a single RocksDB instance was using around 150 MiB of memory, and only 5 MiB or 
so of this was from the write batch (which is now cleared after committing).

After reading about RocksDB memory usage (this link was helpful: 
[https://github.com/EighteenZi/rocksdb_wiki/blob/master/Memory-usage-in-RocksDB.md])
 I realized a lot of this was likely the "Index and Filters" memory usage. This 
job is doing streaming deduplication with a lot of unique values, so it makes 
sense these block usages would be high. The problem is that, as it is 
now, the underlying RocksDB instance stays open on an executor as long as that 
executor is assigned that stateful partition (to be reused across batches). So 
a single executor can accumulate a large number of RocksDB instances open at 
once, each using a certain amount of native memory. In the worst case, a single 
executor could need to keep every single partition's RocksDB instance open at 
once. 

There are a couple of ways you can control the amount of memory used, such as 
limiting the max open files, or adding the option to use the block cache for 
the indices and filters, but neither of these solves the underlying problem of 
accumulating native memory from multiple partitions on an executor.

The real fix needs to be a mechanism and option to close the underlying RocksDB 
instance at the end of each task, so you have the option to only ever have one 
RocksDB instance open at a time, thus having predictable memory usage no matter 
the size of your data or number of shuffle partitions. 

We are running this on Spark 3.3, but just kicked off a test to see if things 
are any different in Spark 3.4.

  was:
We noticed in one of our production stateful streaming jobs using RocksDB that 
an executor with 20g of heap was using around 40g of resident memory. I noticed 
a single RocksDB instance was using around 150 MiB of memory, and only 5 MiB or 
so of this was from the write batch (which is now cleared after committing).

After reading about RocksDB memory usage ([this link was 
helpful]([https://github.com/EighteenZi/rocksdb_wiki/blob/master/Memory-usage-in-RocksDB.md])
 I realized a lot of this was likely the "Index and Filters" memory usage. This 
job is doing a streaming duplicate with a lot of unique values so it makes 
sense these block usages would be high. The problem is that, because as it is 
now the underlying RocksDB instance stays open on an executor as long as that 
executor is assigned that stateful partition (to be reused across batches). So 
a single executor can accumulate a large number of RocksDB instances open at 
once, each using a certain amount of native memory. In the worst case, a single 
executor could need to keep open every single partitions' RocksDB instance at 
once. 

There are a couple ways you can control the amount of memory used, such as 
limiting the max open files, or adding the option to use the block cache for 
the indices and filters, but neither of these solve the underlying problem of 
accumulating native memory from multiple partitions on an executor.

The real fix needs to be a mechanism and option to close the underlying RocksDB 
instance at the end of each task, so you have the option to only ever have one 
RocksDB instance open at a time, thus having predictable memory usage no matter 
the size of your data or number of shuffle partitions. 

We are running this on Spark 3.3, but just kicked off a test to see if things 
are any different in Spark 3.4.


> RocksDB State Store can accumulate unbounded native memory
> --
>
> Key: SPARK-43244
> URL: https://issues.apache.org/jira/browse/SPARK-43244
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Affects Versions: 3.3.2
>Reporter: Adam Binford
>Priority: Major
>
> We noticed in one of our production stateful streaming jobs using RocksDB 
> that an executor with 20g of heap was using around 40g of resident memory. I 
> noticed a single RocksDB instance was using around 150 MiB of memory, and 
> only 5 MiB or so of this was from the write batch (which is now cleared after 
> committing).
> After reading about RocksDB memory usage (this link was helpful: 
> [https://github.com/EighteenZi/rocksdb_wiki/blob/master/Memory-usage-in-RocksDB.md])
>  I realized a lot of this was likely the "Index and Filters" memory usage. 
> This job is doing a streaming duplicate with a lot of unique values so it 
> makes sense these block usages would be high. 

[jira] [Updated] (SPARK-43244) RocksDB State Store can accumulate unbounded native memory

2023-04-23 Thread Adam Binford (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam Binford updated SPARK-43244:
-
Description: 
We noticed in one of our production stateful streaming jobs using RocksDB that 
an executor with 20g of heap was using around 40g of resident memory. I noticed 
a single RocksDB instance was using around 150 MiB of memory, and only 5 MiB or 
so of this was from the write batch (which is now cleared after committing).

After reading about RocksDB memory usage ([this link was 
helpful]([https://github.com/EighteenZi/rocksdb_wiki/blob/master/Memory-usage-in-RocksDB.md])
 I realized a lot of this was likely the "Index and Filters" memory usage. This 
job is doing streaming deduplication with a lot of unique values, so it makes 
sense these block usages would be high. The problem is that, as it is 
now, the underlying RocksDB instance stays open on an executor as long as that 
executor is assigned that stateful partition (to be reused across batches). So 
a single executor can accumulate a large number of RocksDB instances open at 
once, each using a certain amount of native memory. In the worst case, a single 
executor could need to keep every single partition's RocksDB instance open at 
once. 

There are a couple of ways you can control the amount of memory used, such as 
limiting the max open files, or adding the option to use the block cache for 
the indices and filters, but neither of these solves the underlying problem of 
accumulating native memory from multiple partitions on an executor.

The real fix needs to be a mechanism and option to close the underlying RocksDB 
instance at the end of each task, so you have the option to only ever have one 
RocksDB instance open at a time, thus having predictable memory usage no matter 
the size of your data or number of shuffle partitions. 

We are running this on Spark 3.3, but just kicked off a test to see if things 
are any different in Spark 3.4.

  was:
We noticed in one of our production stateful streaming jobs using RocksDB that 
an executor with 20g of heap was using around 40g of resident memory. I noticed 
a single RocksDB instance was using around 150 MiB of memory, and only 5 MiB or 
so of this was from the write batch (which is now cleared after committing).

After reading about RocksDB memory usage ([this link was 
helpful|[https://github.com/EighteenZi/rocksdb_wiki/blob/master/Memory-usage-in-RocksDB.md]),]
 I realized a lot of this was likely the "Index and Filters" memory usage. This 
job is doing a streaming duplicate with a lot of unique values so it makes 
sense these block usages would be high. The problem is that, because as it is 
now the underlying RocksDB instance stays open on an executor as long as that 
executor is assigned that stateful partition (to be reused across batches). So 
a single executor can accumulate a large number of RocksDB instances open at 
once, each using a certain amount of native memory. In the worst case, a single 
executor could need to keep open every single partitions' RocksDB instance at 
once. 

There are a couple ways you can control the amount of memory used, such as 
limiting the max open files, or adding the option to use the block cache for 
the indices and filters, but neither of these solve the underlying problem of 
accumulating native memory from multiple partitions on an executor.

The real fix needs to be a mechanism and option to close the underlying RocksDB 
instance at the end of each task, so you have the option to only ever have one 
RocksDB instance open at a time, thus having predictable memory usage no matter 
the size of your data or number of shuffle partitions. 

We are running this on Spark 3.3, but just kicked off a test to see if things 
are any different in Spark 3.4.


> RocksDB State Store can accumulate unbounded native memory
> --
>
> Key: SPARK-43244
> URL: https://issues.apache.org/jira/browse/SPARK-43244
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Affects Versions: 3.3.2
>Reporter: Adam Binford
>Priority: Major
>
> We noticed in one of our production stateful streaming jobs using RocksDB 
> that an executor with 20g of heap was using around 40g of resident memory. I 
> noticed a single RocksDB instance was using around 150 MiB of memory, and 
> only 5 MiB or so of this was from the write batch (which is now cleared after 
> committing).
> After reading about RocksDB memory usage ([this link was 
> helpful]([https://github.com/EighteenZi/rocksdb_wiki/blob/master/Memory-usage-in-RocksDB.md])
>  I realized a lot of this was likely the "Index and Filters" memory usage. 
> This job is doing a streaming duplicate with a lot of unique values so it 
> makes sense these block usages would be hig

[jira] [Commented] (SPARK-43244) RocksDB State Store can accumulate unbounded native memory

2023-04-23 Thread Adam Binford (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17715495#comment-17715495
 ] 

Adam Binford commented on SPARK-43244:
--

[~kabhwan] curious about your thoughts on this. It seems somewhat straightforward to 
add an option to close the RocksDB instance after a task completes. Things 
like maintenance wouldn't be affected, because maintenance doesn't use the RocksDB 
instance to clean up old files. During the next batch I think it can reuse a 
lot of the files that have already been downloaded; it will just need to 
re-create the underlying RocksDB instance.

The main question is how to actually do the closing. It can't be done at the 
end of a commit, because some things iterate over the items after committing, 
and abort is only called in certain situations. So the options seem to be either: 
 # Do it via the abort mechanism, but change abort to be called every time, even 
for write stores. This seems like it would be fine for the two existing state 
stores, but could it cause issues for any custom state stores people have?
 # Add a new function to the StateStore classes, like "finished" or "complete", 
that says the task is done and does any cleanup you want that is 
not a full unload.

> RocksDB State Store can accumulate unbounded native memory
> --
>
> Key: SPARK-43244
> URL: https://issues.apache.org/jira/browse/SPARK-43244
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Affects Versions: 3.3.2
>Reporter: Adam Binford
>Priority: Major
>
> We noticed in one of our production stateful streaming jobs using RocksDB 
> that an executor with 20g of heap was using around 40g of resident memory. I 
> noticed a single RocksDB instance was using around 150 MiB of memory, and 
> only 5 MiB or so of this was from the write batch (which is now cleared after 
> committing).
> After reading about RocksDB memory usage ([this link was 
> helpful|[https://github.com/EighteenZi/rocksdb_wiki/blob/master/Memory-usage-in-RocksDB.md]),]
>  I realized a lot of this was likely the "Index and Filters" memory usage. 
> This job is doing streaming deduplication with a lot of unique values, so it 
> makes sense these block usages would be high. The problem is that, as it is 
> now, the underlying RocksDB instance stays open on an executor as long 
> as that executor is assigned that stateful partition (to be reused across 
> batches). So a single executor can accumulate a large number of RocksDB 
> instances open at once, each using a certain amount of native memory. In the 
> worst case, a single executor could need to keep every single 
> partition's RocksDB instance open at once. 
> There are a couple of ways you can control the amount of memory used, such as 
> limiting the max open files, or adding the option to use the block cache for 
> the indices and filters, but neither of these solves the underlying problem of 
> accumulating native memory from multiple partitions on an executor.
> The real fix needs to be a mechanism and option to close the underlying 
> RocksDB instance at the end of each task, so you have the option to only ever 
> have one RocksDB instance open at a time, thus having predictable memory 
> usage no matter the size of your data or number of shuffle partitions. 
> We are running this on Spark 3.3, but just kicked off a test to see if things 
> are any different in Spark 3.4.






[jira] [Created] (SPARK-43244) RocksDB State Store can accumulate unbounded native memory

2023-04-23 Thread Adam Binford (Jira)
Adam Binford created SPARK-43244:


 Summary: RocksDB State Store can accumulate unbounded native memory
 Key: SPARK-43244
 URL: https://issues.apache.org/jira/browse/SPARK-43244
 Project: Spark
  Issue Type: New Feature
  Components: Structured Streaming
Affects Versions: 3.3.2
Reporter: Adam Binford


We noticed in one of our production stateful streaming jobs using RocksDB that 
an executor with 20g of heap was using around 40g of resident memory. I noticed 
a single RocksDB instance was using around 150 MiB of memory, and only 5 MiB or 
so of this was from the write batch (which is now cleared after committing).

After reading about RocksDB memory usage ([this link was 
helpful|[https://github.com/EighteenZi/rocksdb_wiki/blob/master/Memory-usage-in-RocksDB.md]),]
 I realized a lot of this was likely the "Index and Filters" memory usage. This 
job is doing streaming deduplication with a lot of unique values, so it makes 
sense these block usages would be high. The problem is that, as it is 
now, the underlying RocksDB instance stays open on an executor as long as that 
executor is assigned that stateful partition (to be reused across batches). So 
a single executor can accumulate a large number of RocksDB instances open at 
once, each using a certain amount of native memory. In the worst case, a single 
executor could need to keep every single partition's RocksDB instance open at 
once. 

There are a couple of ways you can control the amount of memory used, such as 
limiting the max open files, or adding the option to use the block cache for 
the indices and filters, but neither of these solves the underlying problem of 
accumulating native memory from multiple partitions on an executor.
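
For reference, the knobs mentioned above are exposed through the RocksDB state store
SQL confs; a minimal sketch with an illustrative cache size (check the conf names
against the structured streaming guide for your Spark version), assuming an active
SparkSession in `spark`:

{code:python}
# Use the RocksDB state store provider and bound its block cache per instance.
spark.conf.set(
    "spark.sql.streaming.stateStore.providerClass",
    "org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider",
)
# Illustrative value; this caps block cache memory per RocksDB instance, not the
# total across all instances an executor keeps open.
spark.conf.set("spark.sql.streaming.stateStore.rocksdb.blockCacheSizeMB", "64")
{code}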

The real fix needs to be a mechanism and option to close the underlying RocksDB 
instance at the end of each task, so you have the option to only ever have one 
RocksDB instance open at a time, thus having predictable memory usage no matter 
the size of your data or number of shuffle partitions. 

We are running this on Spark 3.3, but just kicked off a test to see if things 
are any different in Spark 3.4.






[jira] [Created] (SPARK-43243) Add Level param to df.printSchema for Python API

2023-04-23 Thread Khalid Mammadov (Jira)
Khalid Mammadov created SPARK-43243:
---

 Summary: Add Level param to df.printSchema for Python API
 Key: SPARK-43243
 URL: https://issues.apache.org/jira/browse/SPARK-43243
 Project: Spark
  Issue Type: Improvement
  Components: Connect, PySpark
Affects Versions: 3.5.0
Reporter: Khalid Mammadov


The Python printSchema method in the DataFrame API is missing the level parameter 
that is available in the Scala API. This ticket adds it.
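
For illustration, a sketch of the intended Python usage, assuming the new parameter
mirrors Scala's printSchema(level: Int):

{code:python}
df = spark.createDataFrame([(1, (2, 3))], "a int, b struct<x:int, y:int>")
df.printSchema()    # existing behaviour: prints the full nested schema
df.printSchema(1)   # proposed: print only the first level of nesting
{code}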






[jira] [Updated] (SPARK-43008) Upgrade jodatime to 2.12.5

2023-04-23 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen updated SPARK-43008:
-
Priority: Minor  (was: Major)

> Upgrade jodatime to 2.12.5
> --
>
> Key: SPARK-43008
> URL: https://issues.apache.org/jira/browse/SPARK-43008
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.5.0
>Reporter: Yang Jie
>Priority: Minor
>
> https://www.joda.org/joda-time/changes-report.html#a2.12.5






[jira] [Resolved] (SPARK-43008) Upgrade jodatime to 2.12.5

2023-04-23 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-43008.
--
Fix Version/s: 3.5.0
 Assignee: Yang Jie
   Resolution: Fixed

Resolved by https://github.com/apache/spark/pull/40640

> Upgrade jodatime to 2.12.5
> --
>
> Key: SPARK-43008
> URL: https://issues.apache.org/jira/browse/SPARK-43008
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.5.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
> Fix For: 3.5.0
>
>
> https://www.joda.org/joda-time/changes-report.html#a2.12.5






[jira] [Updated] (SPARK-43242) diagnoseCorruption should not throw Unexpected type of BlockId for ShuffleBlockBatchId

2023-04-23 Thread Zhang Liang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhang Liang updated SPARK-43242:

Summary: diagnoseCorruption should not throw Unexpected type of BlockId for 
ShuffleBlockBatchId  (was: shuffle diagnoseCorruption should not throw 
Unexpected type of BlockId for ShuffleBlockBatchId)

> diagnoseCorruption should not throw Unexpected type of BlockId for 
> ShuffleBlockBatchId
> --
>
> Key: SPARK-43242
> URL: https://issues.apache.org/jira/browse/SPARK-43242
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.2.4
>Reporter: Zhang Liang
>Priority: Minor
>
> Some of our Spark applications throw an "Unexpected type of BlockId" exception, as 
> shown below.
> According to BlockId.scala, a block ID such as 
> *shuffle_12_5868_518_523* is a `ShuffleBlockBatchId`, which is not 
> handled properly in `ShuffleBlockFetcherIterator.diagnoseCorruption`.
>  
> Moreover, the new exception thrown in `diagnoseCorruption` swallows the real 
> exception in certain cases. Since diagnoseCorruption is only ever invoked as a 
> supplementary step during exception handling, it shouldn't throw an exception at all.
>  
> {code:java}
> 23/03/07 03:01:24,485 [task-result-getter-1] WARN TaskSetManager: Lost task 
> 104.0 in stage 36.0 (TID 6169): java.lang.IllegalArgumentException: 
> Unexpected type of BlockId, shuffle_12_5868_518_523 at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.diagnoseCorruption(ShuffleBlockFetcherIterator.scala:1079)at
>  
> org.apache.spark.storage.BufferReleasingInputStream.$anonfun$tryOrFetchFailedException$1(ShuffleBlockFetcherIterator.scala:1314)
>  at scala.Option.map(Option.scala:230)at 
> org.apache.spark.storage.BufferReleasingInputStream.tryOrFetchFailedException(ShuffleBlockFetcherIterator.scala:1313)
>  at 
> org.apache.spark.storage.BufferReleasingInputStream.read(ShuffleBlockFetcherIterator.scala:1299)
>  at java.io.BufferedInputStream.fill(BufferedInputStream.java:246) at 
> java.io.BufferedInputStream.read1(BufferedInputStream.java:286) at 
> java.io.BufferedInputStream.read(BufferedInputStream.java:345) at 
> java.io.DataInputStream.read(DataInputStream.java:149) at 
> org.sparkproject.guava.io.ByteStreams.read(ByteStreams.java:899) at 
> org.sparkproject.guava.io.ByteStreams.readFully(ByteStreams.java:733) at 
> org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$2$$anon$3.next(UnsafeRowSerializer.scala:127)
>  at 
> org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$2$$anon$3.next(UnsafeRowSerializer.scala:110)
>  at scala.collection.Iterator$$anon$11.next(Iterator.scala:496) at 
> scala.collection.Iterator$$anon$10.next(Iterator.scala:461) at 
> org.apache.spark.util.CompletionIterator.next(CompletionIterator.scala:29) at 
> org.apache.spark.InterruptibleIterator.next(InterruptibleIterator.scala:40) 
> at scala.collection.Iterator$$anon$10.next(Iterator.scala:461) at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage7.sort_addToSorter_0$(Unknown
>  Source) at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage7.processNext(Unknown
>  Source) at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>  at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759)
>  at 
> org.apache.spark.sql.execution.RowIteratorFromScala.advanceNext(RowIterator.scala:82)
>  at 
> org.apache.spark.sql.execution.joins.SortMergeJoinScanner.advancedStreamed(SortMergeJoinExec.scala:1065)
>  at 
> org.apache.spark.sql.execution.joins.SortMergeJoinScanner.findNextOuterJoinRows(SortMergeJoinExec.scala:1024)
>  at 
> org.apache.spark.sql.execution.joins.OneSideOuterIterator.advanceStream(SortMergeJoinExec.scala:1201)
>  at 
> org.apache.spark.sql.execution.joins.OneSideOuterIterator.advanceNext(SortMergeJoinExec.scala:1240)
>  at 
> org.apache.spark.sql.execution.RowIteratorToScala.hasNext(RowIterator.scala:67)
>  at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage9.processNext(Unknown
>  Source) at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>  at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759)
>  at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460) at 
> org.apache.spark.sql.execution.UnsafeExternalRowSorter.sort(UnsafeExternalRowSorter.java:225)
>  at 
> org.apache.spark.sql.execution.SortExec.$anonfun$doExecute$1(SortExec.scala:119)
>  at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898) 
> at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsIn

[jira] [Updated] (SPARK-43242) shuffle diagnoseCorruption should not throw Unexpected type of BlockId for ShuffleBlockBatchId

2023-04-23 Thread Zhang Liang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhang Liang updated SPARK-43242:

Description: 
Some of our Spark applications throw an "Unexpected type of BlockId" exception, as 
shown below.

According to BlockId.scala, a block ID such as 
*shuffle_12_5868_518_523* is a `ShuffleBlockBatchId`, which is not 
handled properly in `ShuffleBlockFetcherIterator.diagnoseCorruption`.

Moreover, the new exception thrown in `diagnoseCorruption` swallows the real exception 
in certain cases. Since diagnoseCorruption is only ever invoked as a supplementary 
step during exception handling, it shouldn't throw an exception at all.

 
{code:java}
23/03/07 03:01:24,485 [task-result-getter-1] WARN TaskSetManager: Lost task 104.0 in stage 36.0 (TID 6169): java.lang.IllegalArgumentException: Unexpected type of BlockId, shuffle_12_5868_518_523
    at org.apache.spark.storage.ShuffleBlockFetcherIterator.diagnoseCorruption(ShuffleBlockFetcherIterator.scala:1079)
    at org.apache.spark.storage.BufferReleasingInputStream.$anonfun$tryOrFetchFailedException$1(ShuffleBlockFetcherIterator.scala:1314)
    at scala.Option.map(Option.scala:230)
    at org.apache.spark.storage.BufferReleasingInputStream.tryOrFetchFailedException(ShuffleBlockFetcherIterator.scala:1313)
    at org.apache.spark.storage.BufferReleasingInputStream.read(ShuffleBlockFetcherIterator.scala:1299)
    at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
    at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
    at java.io.DataInputStream.read(DataInputStream.java:149)
    at org.sparkproject.guava.io.ByteStreams.read(ByteStreams.java:899)
    at org.sparkproject.guava.io.ByteStreams.readFully(ByteStreams.java:733)
    at org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$2$$anon$3.next(UnsafeRowSerializer.scala:127)
    at org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$2$$anon$3.next(UnsafeRowSerializer.scala:110)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:496)
    at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
    at org.apache.spark.util.CompletionIterator.next(CompletionIterator.scala:29)
    at org.apache.spark.InterruptibleIterator.next(InterruptibleIterator.scala:40)
    at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage7.sort_addToSorter_0$(Unknown Source)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage7.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759)
    at org.apache.spark.sql.execution.RowIteratorFromScala.advanceNext(RowIterator.scala:82)
    at org.apache.spark.sql.execution.joins.SortMergeJoinScanner.advancedStreamed(SortMergeJoinExec.scala:1065)
    at org.apache.spark.sql.execution.joins.SortMergeJoinScanner.findNextOuterJoinRows(SortMergeJoinExec.scala:1024)
    at org.apache.spark.sql.execution.joins.OneSideOuterIterator.advanceStream(SortMergeJoinExec.scala:1201)
    at org.apache.spark.sql.execution.joins.OneSideOuterIterator.advanceNext(SortMergeJoinExec.scala:1240)
    at org.apache.spark.sql.execution.RowIteratorToScala.hasNext(RowIterator.scala:67)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage9.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
    at org.apache.spark.sql.execution.UnsafeExternalRowSorter.sort(UnsafeExternalRowSorter.java:225)
    at org.apache.spark.sql.execution.SortExec.$anonfun$doExecute$1(SortExec.scala:119)
    at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898)
    at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
    at org.apache.spark.scheduler.Task.run(Task.scala:137)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1510)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745

[jira] [Updated] (SPARK-43242) shuffle diagnoseCorruption should not throw Unexpected type of BlockId for ShuffleBlockBatchId

2023-04-23 Thread Zhang Liang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhang Liang updated SPARK-43242:

Description: 
Some of our Spark apps throw an "Unexpected type of BlockId" exception, as shown 
below.

According to BlockId.scala, a block id such as *shuffle_12_5868_518_523* is a 
`ShuffleBlockBatchId`, which is not handled properly in 
`ShuffleBlockFetcherIterator.diagnoseCorruption`.

 

Moreover, the new exception thrown in `diagnose` swallows the real exception in 
certain cases. Since `diagnoseCorruption` is only ever called as a secondary, 
best-effort step during exception handling, I think it shouldn't throw an 
exception at all.
 
{code:java}
23/03/07 03:01:24,485 [task-result-getter-1] WARN TaskSetManager: Lost task 104.0 in stage 36.0 (TID 6169): java.lang.IllegalArgumentException: Unexpected type of BlockId, shuffle_12_5868_518_523
    at org.apache.spark.storage.ShuffleBlockFetcherIterator.diagnoseCorruption(ShuffleBlockFetcherIterator.scala:1079)
    at org.apache.spark.storage.BufferReleasingInputStream.$anonfun$tryOrFetchFailedException$1(ShuffleBlockFetcherIterator.scala:1314)
    at scala.Option.map(Option.scala:230)
    at org.apache.spark.storage.BufferReleasingInputStream.tryOrFetchFailedException(ShuffleBlockFetcherIterator.scala:1313)
    at org.apache.spark.storage.BufferReleasingInputStream.read(ShuffleBlockFetcherIterator.scala:1299)
    at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
    at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
    at java.io.DataInputStream.read(DataInputStream.java:149)
    at org.sparkproject.guava.io.ByteStreams.read(ByteStreams.java:899)
    at org.sparkproject.guava.io.ByteStreams.readFully(ByteStreams.java:733)
    at org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$2$$anon$3.next(UnsafeRowSerializer.scala:127)
    at org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$2$$anon$3.next(UnsafeRowSerializer.scala:110)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:496)
    at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
    at org.apache.spark.util.CompletionIterator.next(CompletionIterator.scala:29)
    at org.apache.spark.InterruptibleIterator.next(InterruptibleIterator.scala:40)
    at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage7.sort_addToSorter_0$(Unknown Source)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage7.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759)
    at org.apache.spark.sql.execution.RowIteratorFromScala.advanceNext(RowIterator.scala:82)
    at org.apache.spark.sql.execution.joins.SortMergeJoinScanner.advancedStreamed(SortMergeJoinExec.scala:1065)
    at org.apache.spark.sql.execution.joins.SortMergeJoinScanner.findNextOuterJoinRows(SortMergeJoinExec.scala:1024)
    at org.apache.spark.sql.execution.joins.OneSideOuterIterator.advanceStream(SortMergeJoinExec.scala:1201)
    at org.apache.spark.sql.execution.joins.OneSideOuterIterator.advanceNext(SortMergeJoinExec.scala:1240)
    at org.apache.spark.sql.execution.RowIteratorToScala.hasNext(RowIterator.scala:67)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage9.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
    at org.apache.spark.sql.execution.UnsafeExternalRowSorter.sort(UnsafeExternalRowSorter.java:225)
    at org.apache.spark.sql.execution.SortExec.$anonfun$doExecute$1(SortExec.scala:119)
    at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898)
    at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
    at org.apache.spark.scheduler.Task.run(Task.scala:137)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1510)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.T

[jira] [Created] (SPARK-43242) shuffle diagnoseCorruption should not throw Unexpected type of BlockId for ShuffleBlockBatchId

2023-04-23 Thread Zhang Liang (Jira)
Zhang Liang created SPARK-43242:
---

 Summary: shuffle diagnoseCorruption should not throw Unexpected 
type of BlockId for ShuffleBlockBatchId
 Key: SPARK-43242
 URL: https://issues.apache.org/jira/browse/SPARK-43242
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.2.4
Reporter: Zhang Liang


Some of our Spark apps throw an "Unexpected type of BlockId" exception, as shown 
below.

According to BlockId.scala, a block id such as *shuffle_12_5868_518_523* is a 
`ShuffleBlockBatchId`, which is not handled properly in 
`ShuffleBlockFetcherIterator.diagnoseCorruption`.

 

Moreover, the new exception thrown in `diagnose` swallows the real exception in 
certain cases. Since `diagnoseCorruption` is only ever called as a secondary, 
best-effort step during exception handling, I think it shouldn't throw an 
exception at all.



 
23/03/07 03:01:24,485 [task-result-getter-1] WARN TaskSetManager: Lost task 104.0 in stage 36.0 (TID 6169): java.lang.IllegalArgumentException: Unexpected type of BlockId, shuffle_12_5868_518_523
    at org.apache.spark.storage.ShuffleBlockFetcherIterator.diagnoseCorruption(ShuffleBlockFetcherIterator.scala:1079)
    at org.apache.spark.storage.BufferReleasingInputStream.$anonfun$tryOrFetchFailedException$1(ShuffleBlockFetcherIterator.scala:1314)
    at scala.Option.map(Option.scala:230)
    at org.apache.spark.storage.BufferReleasingInputStream.tryOrFetchFailedException(ShuffleBlockFetcherIterator.scala:1313)
    at org.apache.spark.storage.BufferReleasingInputStream.read(ShuffleBlockFetcherIterator.scala:1299)
    at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
    at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
    at java.io.DataInputStream.read(DataInputStream.java:149)
    at org.sparkproject.guava.io.ByteStreams.read(ByteStreams.java:899)
    at org.sparkproject.guava.io.ByteStreams.readFully(ByteStreams.java:733)
    at org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$2$$anon$3.next(UnsafeRowSerializer.scala:127)
    at org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$2$$anon$3.next(UnsafeRowSerializer.scala:110)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:496)
    at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
    at org.apache.spark.util.CompletionIterator.next(CompletionIterator.scala:29)
    at org.apache.spark.InterruptibleIterator.next(InterruptibleIterator.scala:40)
    at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage7.sort_addToSorter_0$(Unknown Source)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage7.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759)
    at org.apache.spark.sql.execution.RowIteratorFromScala.advanceNext(RowIterator.scala:82)
    at org.apache.spark.sql.execution.joins.SortMergeJoinScanner.advancedStreamed(SortMergeJoinExec.scala:1065)
    at org.apache.spark.sql.execution.joins.SortMergeJoinScanner.findNextOuterJoinRows(SortMergeJoinExec.scala:1024)
    at org.apache.spark.sql.execution.joins.OneSideOuterIterator.advanceStream(SortMergeJoinExec.scala:1201)
    at org.apache.spark.sql.execution.joins.OneSideOuterIterator.advanceNext(SortMergeJoinExec.scala:1240)
    at org.apache.spark.sql.execution.RowIteratorToScala.hasNext(RowIterator.scala:67)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage9.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
    at org.apache.spark.sql.execution.UnsafeExternalRowSorter.sort(UnsafeExternalRowSorter.java:225)
    at org.apache.spark.sql.execution.SortExec.$anonfun$doExecute$1(SortExec.scala:119)
    at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898)
    at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
    at org.apache.spark.scheduler.Task.run(Task.scala:137)
    at org.apache.spark.executor.Execu
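
For illustration only, here is a minimal, self-contained Scala sketch of the two 
changes suggested above: handle `ShuffleBlockBatchId` explicitly when matching on 
the block id, and keep the diagnosis best-effort so it never masks the original 
error. The simplified `BlockId` hierarchy and the `describe`/`diagnoseSafely` 
names are hypothetical stand-ins, not the actual Spark code.

{code:scala}
// Hypothetical, simplified stand-ins for illustration; the names mirror Spark's
// BlockId hierarchy, but this is NOT the real implementation.
sealed trait BlockId
case class ShuffleBlockId(shuffleId: Int, mapId: Long, reduceId: Int) extends BlockId
case class ShuffleBlockBatchId(shuffleId: Int, mapId: Long,
                               startReduceId: Int, endReduceId: Int) extends BlockId

object DiagnoseSketch {
  // Cover the batch block id explicitly instead of falling through to an
  // IllegalArgumentException, and degrade gracefully for anything unknown.
  def describe(blockId: BlockId): String = blockId match {
    case ShuffleBlockId(s, m, r)           => s"shuffle block $s/$m/$r"
    case ShuffleBlockBatchId(s, m, r0, r1) => s"shuffle block batch $s/$m/[$r0, $r1]"
    case other                             => s"unrecognized block id: $other"
  }

  // Diagnosis is best-effort: catch its own failures so the original
  // fetch/corruption exception is the one that propagates to the caller.
  def diagnoseSafely(blockId: BlockId): String =
    try describe(blockId)
    catch { case e: Exception => s"corruption diagnosis failed: ${e.getMessage}" }

  def main(args: Array[String]): Unit =
    println(diagnoseSafely(ShuffleBlockBatchId(12, 5868L, 518, 523)))
}
{code}

The point is simply that an unrecognized block id should degrade to a generic 
message rather than raise a new exception from inside the diagnostic path.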

[jira] [Commented] (SPARK-43232) Improve ObjectHashAggregateExec performance for high cardinality

2023-04-23 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17715417#comment-17715417
 ] 

ASF GitHub Bot commented on SPARK-43232:


User 'ulysses-you' has created a pull request for this issue:
https://github.com/apache/spark/pull/40915

> Improve ObjectHashAggregateExec performance for high cardinality
> 
>
> Key: SPARK-43232
> URL: https://issues.apache.org/jira/browse/SPARK-43232
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: XiDuo You
>Priority: Major
>
> The `ObjectHashAggregateExec` has three performance issues:
>  - heavy overhead of Scala sugar in `createNewAggregationBuffer`
>  - unnecessary grouping key comparison after the fallback to the sort-based 
> aggregator
>  - the aggregation buffer in the sort-based aggregator is not reused for the 
> remaining rows
>  
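
Purely as an illustration of the first bullet above (not the change in the linked 
pull request): on a per-row hot path, Scala collection sugar allocates a closure 
and wrapper objects on every call, whereas a pre-sized array filled by a while 
loop does not. `AggFunc`, `newBufferWithSugar` and `newBufferImperative` are 
hypothetical names used only for this sketch.

{code:scala}
// Illustration only: a stand-in "aggregate function" interface, not Spark's
// TypedImperativeAggregate API.
trait AggFunc { def initialValue: Long }

object BufferInitSketch {
  val funcs: Array[AggFunc] = Array(
    new AggFunc { def initialValue: Long = 0L },
    new AggFunc { def initialValue: Long = 1L })

  // Concise but allocation-heavy if called per row: allocates a closure and
  // wrapper objects on every call.
  def newBufferWithSugar(): Array[Long] = funcs.map(_.initialValue)

  // Hot-path friendly: pre-size the array and fill it with a while loop.
  def newBufferImperative(): Array[Long] = {
    val buf = new Array[Long](funcs.length)
    var i = 0
    while (i < funcs.length) {
      buf(i) = funcs(i).initialValue
      i += 1
    }
    buf
  }

  def main(args: Array[String]): Unit =
    println(newBufferImperative().mkString(","))
}
{code}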



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43232) Improve ObjectHashAggregateExec performance for high cardinality

2023-04-23 Thread XiDuo You (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

XiDuo You updated SPARK-43232:
--
Description: 
The `ObjectHashAggregateExec` has three performance issues:
 - heavy overhead of Scala sugar in `createNewAggregationBuffer`
 - unnecessary grouping key comparison after the fallback to the sort-based aggregator
 - the aggregation buffer in the sort-based aggregator is not reused for the 
remaining rows

 

  was:
The `ObjectHashAggregateExec` has three performance issues:
 - heavy overhead of Scala sugar in `createNewAggregationBuffer`
 - unnecessary grouping key comparison after the fallback to the sort-based aggregator
 - the aggregation buffer in the sort-based aggregator is not actually reused

 


> Improve ObjectHashAggregateExec performance for high cardinality
> 
>
> Key: SPARK-43232
> URL: https://issues.apache.org/jira/browse/SPARK-43232
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: XiDuo You
>Priority: Major
>
> The `ObjectHashAggregateExec` has three performance issues:
>  - heavy overhead of Scala sugar in `createNewAggregationBuffer`
>  - unnecessary grouping key comparison after the fallback to the sort-based 
> aggregator
>  - the aggregation buffer in the sort-based aggregator is not reused for the 
> remaining rows
>  
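
As a purely illustrative sketch of the third bullet above (simplified types, not 
Spark's AggregationIterator): allocate one mutable aggregation buffer and reset 
it at group boundaries, instead of creating a fresh buffer for every remaining 
row after the sort-based fallback. All names in the sketch are hypothetical.

{code:scala}
// Illustration only; a running sum per key over already-sorted input stands in
// for the sort-based aggregation path.
object BufferReuseSketch {
  val sortedRows = Seq(("a", 1L), ("a", 2L), ("b", 5L), ("b", 7L), ("c", 3L))

  def aggregate(rows: Seq[(String, Long)]): Seq[(String, Long)] = {
    val out = scala.collection.mutable.ArrayBuffer.empty[(String, Long)]
    val buffer = Array(0L)          // one reusable aggregation buffer (here: a sum)
    var currentKey: String = null

    for ((key, value) <- rows) {
      if (key != currentKey) {
        if (currentKey != null) out += ((currentKey, buffer(0)))
        currentKey = key
        buffer(0) = 0L              // reset in place instead of reallocating
      }
      buffer(0) += value            // update the shared buffer
    }
    if (currentKey != null) out += ((currentKey, buffer(0)))
    out.toSeq
  }

  def main(args: Array[String]): Unit =
    println(aggregate(sortedRows))  // (a,3), (b,12), (c,3)
}
{code}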



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43232) Improve ObjectHashAggregateExec performance for high cardinality

2023-04-23 Thread XiDuo You (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

XiDuo You updated SPARK-43232:
--
Summary: Improve ObjectHashAggregateExec performance for high cardinality  
(was: Improve ObjectHashAggregateExec performance with high cardinality)

> Improve ObjectHashAggregateExec performance for high cardinality
> 
>
> Key: SPARK-43232
> URL: https://issues.apache.org/jira/browse/SPARK-43232
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: XiDuo You
>Priority: Major
>
> The `ObjectHashAggregateExec` has three performance issues:
>  - heavy overhead of Scala sugar in `createNewAggregationBuffer`
>  - unnecessary grouping key comparison after the fallback to the sort-based 
> aggregator
>  - the aggregation buffer in the sort-based aggregator is not actually reused
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43232) Improve ObjectHashAggregateExec performance with high cardinality

2023-04-23 Thread XiDuo You (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

XiDuo You updated SPARK-43232:
--
Summary: Improve ObjectHashAggregateExec performance with high cardinality  
(was: Improve ObjectHashAggregateExec performance)

> Improve ObjectHashAggregateExec performance with high cardinality
> -
>
> Key: SPARK-43232
> URL: https://issues.apache.org/jira/browse/SPARK-43232
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: XiDuo You
>Priority: Major
>
> The `ObjectHashAggregateExec` has three performance issues:
>  - heavy overhead of Scala sugar in `createNewAggregationBuffer`
>  - unnecessary grouping key comparison after the fallback to the sort-based 
> aggregator
>  - the aggregation buffer in the sort-based aggregator is not actually reused
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org