[jira] [Created] (SPARK-44687) Fix Scala 2.13 mima check
Yang Jie created SPARK-44687: Summary: Fix Scala 2.13 mima check Key: SPARK-44687 URL: https://issues.apache.org/jira/browse/SPARK-44687 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 4.0.0 Reporter: Yang Jie [https://github.com/apache/spark/actions/runs/5695413124/job/15438535023]
{code:java}
[error] spark-core: Failed binary compatibility check against org.apache.spark:spark-core_2.13:3.4.0! Found 1 potential problems (filtered 4013)
[error] * the type hierarchy of object org.apache.spark.scheduler.cluster.CoarseGrainedClusterMessages#SparkAppConfig is different in current version. Missing types {scala.runtime.AbstractFunction4}
[error] filter with: ProblemFilters.exclude[MissingTypesProblem]("org.apache.spark.scheduler.cluster.CoarseGrainedClusterMessages$SparkAppConfig$")
[error] java.lang.RuntimeException: Failed binary compatibility check against org.apache.spark:spark-core_2.13:3.4.0! Found 1 potential problems (filtered 4013)
[error] at scala.sys.package$.error(package.scala:30)
[error] at com.typesafe.tools.mima.plugin.SbtMima$.reportModuleErrors(SbtMima.scala:89)
[error] at com.typesafe.tools.mima.plugin.MimaPlugin$.$anonfun$projectSettings$2(MimaPlugin.scala:36)
[error] at com.typesafe.tools.mima.plugin.MimaPlugin$.$anonfun$projectSettings$2$adapted(MimaPlugin.scala:26)
[error] at scala.collection.Iterator.foreach(Iterator.scala:943)
[error] at scala.collection.Iterator.foreach$(Iterator.scala:943)
[error] at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
[error] at com.typesafe.tools.mima.plugin.MimaPlugin$.$anonfun$projectSettings$1(MimaPlugin.scala:26)
[error] at com.typesafe.tools.mima.plugin.MimaPlugin$.$anonfun$projectSettings$1$adapted(MimaPlugin.scala:25)
[error] at scala.Function1.$anonfun$compose$1(Function1.scala:49)
[error] at sbt.internal.util.$tilde$greater.$anonfun$$u2219$1(TypeFunctions.scala:63)
[error] at sbt.std.Transform$$anon$4.work(Transform.scala:69)
[error] at sbt.Execute.$anonfun$submit$2(Execute.scala:283)
[error] at sbt.internal.util.ErrorHandling$.wideConvert(ErrorHandling.scala:24)
[error] at sbt.Execute.work(Execute.scala:292)
[error] at sbt.Execute.$anonfun$submit$1(Execute.scala:283)
[error] at sbt.ConcurrentRestrictions$$anon$4.$anonfun$submitValid$1(ConcurrentRestrictions.scala:265)
[error] at sbt.CompletionService$$anon$2.call(CompletionService.scala:65)
[error] at java.util.concurrent.FutureTask.run(FutureTask.java:266)
[error] at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
[error] at java.util.concurrent.FutureTask.run(FutureTask.java:266)
[error] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
[error] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
[error] at java.lang.Thread.run(Thread.java:750)
[error] (core / mimaReportBinaryIssues) Failed binary compatibility check against org.apache.spark:spark-core_2.13:3.4.0! Found 1 potential problems (filtered 4013)
[error] Total time: 172 s (02:52), completed Jul 28, 2023 7:26:06 PM
{code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
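If the hierarchy change turns out to be intentional rather than a regression, the failure message above already spells out the filter to register. A minimal sketch, assuming the rule would be added to Spark's project/MimaExcludes.scala (the exclusion-list name below is an assumption):
{code:scala}
// Hedged sketch: the filter is copied from the failure message; the surrounding
// object/val names mirror project/MimaExcludes.scala but are assumptions here.
import com.typesafe.tools.mima.core._

object MimaExcludesSketch {
  // Accept the changed type hierarchy of SparkAppConfig reported by MiMa.
  lazy val v40excludes: Seq[ProblemFilter] = Seq(
    ProblemFilters.exclude[MissingTypesProblem](
      "org.apache.spark.scheduler.cluster.CoarseGrainedClusterMessages$SparkAppConfig$")
  )
}
{code}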
[jira] [Commented] (SPARK-44629) Publish PySpark Test Guidelines webpage
[ https://issues.apache.org/jira/browse/SPARK-44629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17751260#comment-17751260 ] Snoot.io commented on SPARK-44629: -- User 'asl3' has created a pull request for this issue: https://github.com/apache/spark/pull/42284 > Publish PySpark Test Guidelines webpage > --- > > Key: SPARK-44629 > URL: https://issues.apache.org/jira/browse/SPARK-44629 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.5.0 >Reporter: Amanda Liu >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-44665) Add support for pandas DataFrame assertDataFrameEqual
[ https://issues.apache.org/jira/browse/SPARK-44665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17751257#comment-17751257 ] Snoot.io commented on SPARK-44665: -- User 'asl3' has created a pull request for this issue: https://github.com/apache/spark/pull/42332 > Add support for pandas DataFrame assertDataFrameEqual > - > > Key: SPARK-44665 > URL: https://issues.apache.org/jira/browse/SPARK-44665 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.5.0 >Reporter: Amanda Liu >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-44665) Add support for pandas DataFrame assertDataFrameEqual
[ https://issues.apache.org/jira/browse/SPARK-44665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17751256#comment-17751256 ] Snoot.io commented on SPARK-44665: -- User 'asl3' has created a pull request for this issue: https://github.com/apache/spark/pull/42332 > Add support for pandas DataFrame assertDataFrameEqual > - > > Key: SPARK-44665 > URL: https://issues.apache.org/jira/browse/SPARK-44665 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.5.0 >Reporter: Amanda Liu >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-44306) Group FileStatus with few RPC calls within Yarn Client
[ https://issues.apache.org/jira/browse/SPARK-44306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17751255#comment-17751255 ] Snoot.io commented on SPARK-44306: -- User 'shuwang21' has created a pull request for this issue: https://github.com/apache/spark/pull/42357 > Group FileStatus with few RPC calls within Yarn Client > -- > > Key: SPARK-44306 > URL: https://issues.apache.org/jira/browse/SPARK-44306 > Project: Spark > Issue Type: New Feature > Components: Spark Submit >Affects Versions: 0.9.2, 2.3.0, 3.5.0 >Reporter: SHU WANG >Priority: Major > > It's inefficient to obtain *FileStatus* for each resource [one by > one|https://github.com/apache/spark/blob/531ec8bddc8dd22ca39486dbdd31e62e989ddc15/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ClientDistributedCacheManager.scala#L71C1]. > In our company setting, we are running Spark with Hadoop Yarn and HDFS. We > noticed the current behavior has two major drawbacks: > # Since each *getFileStatus* call involves network delays, the overall delay > can be *large* and add *uncertainty* to the overall Spark job runtime. > Specifically, we quantify this overhead within our cluster. We see the p50 > overhead is around 10s, p80 is 1 min, and p100 is up to 15 mins. When HDFS is > overloaded, the delays become more severe. > # In our cluster, we have nearly 100 million *getFileStatus* call to HDFS > daily. We noticed that in our cluster, most resources come from the same HDFS > directory for each user (See our [engineer blog > post|https://engineering.linkedin.com/blog/2023/reducing-apache-spark-application-dependencies-upload-by-99-] > about why we took this approach). Therefore, we can greatly reduce nearly > 100 million *getFileStatus* call to 0.1 million *listStatus* calls daily. > This will further reduce overhead from the HDFS side. > All in all, a more efficient way to fetch the *FileStatus* for each resource > is highly needed. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
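A minimal sketch of the grouping idea described above, using only Hadoop's public FileSystem API; the object and method names are illustrative and not the actual ClientDistributedCacheManager change:
{code:scala}
// Hedged sketch: one listStatus RPC per parent directory instead of one
// getFileStatus RPC per resource. Names are illustrative, not Spark's code.
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileStatus, Path}

object GroupedStatusLookup {
  def statusesByPath(resources: Seq[Path], conf: Configuration): Map[Path, FileStatus] = {
    resources.groupBy(_.getParent).toSeq.flatMap { case (parent, children) =>
      val fs = parent.getFileSystem(conf)
      val wanted = children.map(p => fs.makeQualified(p)).toSet
      // A single directory listing covers every resource under this parent.
      fs.listStatus(parent)
        .filter(status => wanted.contains(status.getPath))
        .map(status => status.getPath -> status)
    }.toMap
  }
}
{code}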
[jira] [Created] (SPARK-44686) Add option to create RowEncoder in Encoders helper class.
Herman van Hövell created SPARK-44686: - Summary: Add option to create RowEncoder in Encoders helper class. Key: SPARK-44686 URL: https://issues.apache.org/jira/browse/SPARK-44686 Project: Spark Issue Type: New Feature Components: Connect, SQL Affects Versions: 3.5.0 Reporter: Herman van Hövell Assignee: Herman van Hövell -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44662) SPIP: Improving performance of BroadcastHashJoin queries with stream side join key on non partition columns
[ https://issues.apache.org/jira/browse/SPARK-44662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-44662: Target Version/s: (was: 3.3.3) > SPIP: Improving performance of BroadcastHashJoin queries with stream side > join key on non partition columns > --- > > Key: SPARK-44662 > URL: https://issues.apache.org/jira/browse/SPARK-44662 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.3 >Reporter: Asif >Priority: Major > > h2. *Q1. What are you trying to do? Articulate your objectives using > absolutely no jargon.* > On the lines of DPP which helps DataSourceV2 relations when the joining key > is a partition column, the same concept can be extended over to the case > where joining key is not a partition column. > The Keys of BroadcastHashJoin are already available before actual evaluation > of the stream iterator. These keys can be pushed down to the DataSource as a > SortedSet. > For non partition columns, the DataSources like iceberg have max/min stats on > column available at manifest level, and for formats like parquet , they have > max/min stats at various storage level. The passed SortedSet can be used to > prune using ranges at both driver level ( manifests files) as well as > executor level ( while actually going through chunks , row groups etc at > parquet level) > If the data is stored as Columnar Batch format , then it would not be > possible to filter out individual row at DataSource level, even though we > have keys. > But at the scan level, ( ColumnToRowExec) it is still possible to filter out > as many rows as possible , if the query involves nested joins. Thus reducing > the number of rows to join at the higher join levels. > Will be adding more details.. > h2. *Q2. What problem is this proposal NOT designed to solve?* > This can only help in BroadcastHashJoin's performance if the join is Inner or > Left Semi. > This will also not work if there are nodes like Expand, Generator , Aggregate > (without group by on keys not part of joining column etc) below the > BroadcastHashJoin node being targeted. > h2. *Q3. How is it done today, and what are the limits of current practice?* > Currently this sort of pruning at DataSource level is being done using DPP > (Dynamic Partition Pruning ) and IFF one of the join key column is a > Partitioning column ( so that cost of DPP query is justified and way less > than amount of data it will be filtering by skipping partitions). > The limitation is that DPP type approach is not implemented ( intentionally I > believe), if the join column is a non partition column ( because of cost of > "DPP type" query would most likely be way high as compared to any possible > pruning ( especially if the column is not stored in a sorted manner). > h2. *Q4. What is new in your approach and why do you think it will be > successful?* > 1) This allows pruning on non partition column based joins. > 2) Because it piggy backs on Broadcasted Keys, there is no extra cost of "DPP > type" query. > 3) The Data can be used by DataSource to prune at driver (possibly) and also > at executor level ( as in case of parquet which has max/min at various > structure levels) > 4) The big benefit should be seen in multilevel nested join queries. In the > current code base, if I am correct, only one join's pruning filter would get > pushed at scan level. Since it is on partition key may be that is sufficient. 
> But if it is a nested Join query , and may be involving different columns on > streaming side for join, each such filter push could do significant pruning. > This requires some handling in case of AQE, as the stream side iterator ( & > hence stage evaluation needs to be delayed, till all the available join > filters in the nested tree are pushed at their respective target > BatchScanExec). > h4. *Single Row Filteration* > 5) In case of nested broadcasted joins, if the datasource is column vector > oriented , then what spark would get is a ColumnarBatch. But because scans > have Filters from multiple joins, they can be retrieved and can be applied in > code generated at ColumnToRowExec level, using a new "containsKey" method on > HashedRelation. Thus only those rows which satisfy all the > BroadcastedHashJoins ( whose keys have been pushed) , will be used for join > evaluation. > The code is already there , will be opening a PR. For non partition table > TPCDS run on laptop with TPCDS data size of ( scale factor 4), I am seeing > 15% gain. > For partition table TPCDS, there is improvement in 4 - 5 queries to the tune > of 10% to 37%. > h2. *Q5. Who cares? If you are successful, what difference will it make?* > If use cases involve
[jira] [Updated] (SPARK-44662) SPIP: Improving performance of BroadcastHashJoin queries with stream side join key on non partition columns
[ https://issues.apache.org/jira/browse/SPARK-44662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-44662: Fix Version/s: (was: 3.3.3) > SPIP: Improving performance of BroadcastHashJoin queries with stream side > join key on non partition columns > --- > > Key: SPARK-44662 > URL: https://issues.apache.org/jira/browse/SPARK-44662 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.3 >Reporter: Asif >Priority: Major > > h2. *Q1. What are you trying to do? Articulate your objectives using > absolutely no jargon.* > On the lines of DPP which helps DataSourceV2 relations when the joining key > is a partition column, the same concept can be extended over to the case > where joining key is not a partition column. > The Keys of BroadcastHashJoin are already available before actual evaluation > of the stream iterator. These keys can be pushed down to the DataSource as a > SortedSet. > For non partition columns, the DataSources like iceberg have max/min stats on > column available at manifest level, and for formats like parquet , they have > max/min stats at various storage level. The passed SortedSet can be used to > prune using ranges at both driver level ( manifests files) as well as > executor level ( while actually going through chunks , row groups etc at > parquet level) > If the data is stored as Columnar Batch format , then it would not be > possible to filter out individual row at DataSource level, even though we > have keys. > But at the scan level, ( ColumnToRowExec) it is still possible to filter out > as many rows as possible , if the query involves nested joins. Thus reducing > the number of rows to join at the higher join levels. > Will be adding more details.. > h2. *Q2. What problem is this proposal NOT designed to solve?* > This can only help in BroadcastHashJoin's performance if the join is Inner or > Left Semi. > This will also not work if there are nodes like Expand, Generator , Aggregate > (without group by on keys not part of joining column etc) below the > BroadcastHashJoin node being targeted. > h2. *Q3. How is it done today, and what are the limits of current practice?* > Currently this sort of pruning at DataSource level is being done using DPP > (Dynamic Partition Pruning ) and IFF one of the join key column is a > Partitioning column ( so that cost of DPP query is justified and way less > than amount of data it will be filtering by skipping partitions). > The limitation is that DPP type approach is not implemented ( intentionally I > believe), if the join column is a non partition column ( because of cost of > "DPP type" query would most likely be way high as compared to any possible > pruning ( especially if the column is not stored in a sorted manner). > h2. *Q4. What is new in your approach and why do you think it will be > successful?* > 1) This allows pruning on non partition column based joins. > 2) Because it piggy backs on Broadcasted Keys, there is no extra cost of "DPP > type" query. > 3) The Data can be used by DataSource to prune at driver (possibly) and also > at executor level ( as in case of parquet which has max/min at various > structure levels) > 4) The big benefit should be seen in multilevel nested join queries. In the > current code base, if I am correct, only one join's pruning filter would get > pushed at scan level. Since it is on partition key may be that is sufficient. 
> But if it is a nested Join query , and may be involving different columns on > streaming side for join, each such filter push could do significant pruning. > This requires some handling in case of AQE, as the stream side iterator ( & > hence stage evaluation needs to be delayed, till all the available join > filters in the nested tree are pushed at their respective target > BatchScanExec). > h4. *Single Row Filteration* > 5) In case of nested broadcasted joins, if the datasource is column vector > oriented , then what spark would get is a ColumnarBatch. But because scans > have Filters from multiple joins, they can be retrieved and can be applied in > code generated at ColumnToRowExec level, using a new "containsKey" method on > HashedRelation. Thus only those rows which satisfy all the > BroadcastedHashJoins ( whose keys have been pushed) , will be used for join > evaluation. > The code is already there , will be opening a PR. For non partition table > TPCDS run on laptop with TPCDS data size of ( scale factor 4), I am seeing > 15% gain. > For partition table TPCDS, there is improvement in 4 - 5 queries to the tune > of 10% to 37%. > h2. *Q5. Who cares? If you are successful, what difference will it make?* > If use cases involve m
[jira] [Created] (SPARK-44685) Remove deprecated Catalog#createExternalTable
Jia Fan created SPARK-44685: --- Summary: Remove deprecated Catalog#createExternalTable Key: SPARK-44685 URL: https://issues.apache.org/jira/browse/SPARK-44685 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.5.0 Reporter: Jia Fan Catalog#createExternalTable should be removed because it has been deprecated since 2.2.0. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
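For reference, the usual migration path for callers is Catalog#createTable, which has covered the external-table case since the 2.2.0 deprecation; a hedged sketch (table name and path are illustrative):
{code:scala}
// Hedged sketch of the migration; table/path names are illustrative only.
import org.apache.spark.sql.SparkSession

object CreateExternalTableMigration {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("createExternalTable-migration").getOrCreate()

    // Deprecated since 2.2.0:
    // spark.catalog.createExternalTable("events", "/data/events", "parquet")

    // Replacement: providing a path makes the table external (unmanaged).
    spark.catalog.createTable("events", "/data/events", "parquet")
    spark.stop()
  }
}
{code}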
[jira] [Resolved] (SPARK-44433) Implement termination of Python process for foreachBatch & streaming listener
[ https://issues.apache.org/jira/browse/SPARK-44433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takuya Ueshin resolved SPARK-44433. --- Assignee: Wei Liu Resolution: Fixed Issue resolved by pull request 42283 https://github.com/apache/spark/pull/42283 > Implement termination of Python process for foreachBatch & streaming listener > - > > Key: SPARK-44433 > URL: https://issues.apache.org/jira/browse/SPARK-44433 > Project: Spark > Issue Type: Task > Components: Connect, Structured Streaming >Affects Versions: 3.4.1 >Reporter: Raghu Angadi >Assignee: Wei Liu >Priority: Major > Fix For: 3.5.0 > > > In the first implementation of Python support for foreachBatch, the python > process termination is not handled correctly. > > See the long TODO in > [https://github.com/apache/spark/blob/master/connector/connect/server/src/main/scala/org/apache/spark/sql/connect/planner/StreamingForeachBatchHelper.scala] > > about an outline of the feature to terminate the runners by registering > StreamingQueryListners. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44684) Runtime min-max filter
[ https://issues.apache.org/jira/browse/SPARK-44684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] GANHONGNAN updated SPARK-44684: --- Remaining Estimate: 2,016h (was: 0.05h) Original Estimate: 2,016h (was: 0.05h) > Runtime min-max filter > -- > > Key: SPARK-44684 > URL: https://issues.apache.org/jira/browse/SPARK-44684 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.2.1 >Reporter: GANHONGNAN >Priority: Major > Labels: performance > Original Estimate: 2,016h > Remaining Estimate: 2,016h > > We can infer min-max index when building bloom filter and push it down to > datasource. > # Min-max index can skip part of data before loading them into memory. > # building min-max index can be done along with bloom filter building, so > aggregation for bloom filter building can be reused. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-44684) Runtime min-max filter
[ https://issues.apache.org/jira/browse/SPARK-44684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17751250#comment-17751250 ] GANHONGNAN commented on SPARK-44684: [~cloud_fan] could you pls review this Jira and assign it to me? > Runtime min-max filter > -- > > Key: SPARK-44684 > URL: https://issues.apache.org/jira/browse/SPARK-44684 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.2.1 >Reporter: GANHONGNAN >Priority: Major > Labels: performance > Original Estimate: 0.05h > Remaining Estimate: 0.05h > > We can infer min-max index when building bloom filter and push it down to > datasource. > # Min-max index can skip part of data before loading them into memory. > # building min-max index can be done along with bloom filter building, so > aggregation for bloom filter building can be reused. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-44684) Runtime min-max filter
GANHONGNAN created SPARK-44684: -- Summary: Runtime min-max filter Key: SPARK-44684 URL: https://issues.apache.org/jira/browse/SPARK-44684 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 3.2.1 Reporter: GANHONGNAN We can infer min-max index when building bloom filter and push it down to datasource. # Min-max index can skip part of data before loading them into memory. # building min-max index can be done along with bloom filter building, so aggregation for bloom filter building can be reused. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
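A DataFrame-level sketch of the idea only; the real change would sit next to the bloom-filter runtime filter in the optimizer, and the column names here are illustrative:
{code:scala}
// Hedged illustration: the same build-side pass that feeds the bloom-filter
// aggregate can also collect min/max bounds to push to the probe-side scan.
// The optimizer integration point is not shown.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{max, min}

object RuntimeMinMaxSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("runtime-minmax-sketch").getOrCreate()
    val buildSide = spark.range(0L, 1000000L).toDF("join_key")

    // One aggregation pass yields the range bounds alongside the bloom filter.
    val bounds = buildSide.agg(min("join_key").as("lo"), max("join_key").as("hi")).head()
    val (lo, hi) = (bounds.getLong(0), bounds.getLong(1))

    // lo/hi would be turned into join_key >= lo AND join_key <= hi data source
    // filters, letting Parquet/Iceberg readers skip files or row groups.
    println(s"range filter: join_key BETWEEN $lo AND $hi")
    spark.stop()
  }
}
{code}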
[jira] [Updated] (SPARK-44683) Logging level isn't passed to RocksDB state store provider correctly
[ https://issues.apache.org/jira/browse/SPARK-44683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siying Dong updated SPARK-44683: Description: We pass log4j's log level to RocksDB so that RocksDB debug log can go to log4j. However, we pass in the log level after we create the logger, so the way it is set isn't effective. This has two impacts: (1) setting DEBUG level doesn't make RocksDB generate DEBUG level logs; (2) setting WARN or ERROR level does prevent INFO level logging, but RocksDB still makes JNI calls to Scala, which is an unnecessary overhead. (was: We pass log4j's log level to RocksDB so that RocksDB debug log can go to log4j. However, we pass in log level after we create the logger. However, RocksDB only takes log level when a logger is created, so it never changes. This has two impacts: (1) setting DEBUG level don't make RocksDB generate DEBUG level logs; (2) setting WARN or ERROR level does prevent INFO level logging, but RocksDB still makes JNI calls to Scala, which is an unnecessary overhead.) > Logging level isn't passed to RocksDB state store provider correctly > > > Key: SPARK-44683 > URL: https://issues.apache.org/jira/browse/SPARK-44683 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 3.4.1 >Reporter: Siying Dong >Priority: Minor > > We pass log4j's log level to RocksDB so that RocksDB debug log can go to > log4j. However, we pass in the log level after we create the logger, so the > way it is set isn't effective. This has two impacts: (1) setting DEBUG level > doesn't make RocksDB generate DEBUG level logs; (2) setting WARN or ERROR level > does prevent INFO level logging, but RocksDB still makes JNI calls to Scala, > which is an unnecessary overhead. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44005) Improve error messages for regular Python UDTFs that return non-tuple values
[ https://issues.apache.org/jira/browse/SPARK-44005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allison Wang updated SPARK-44005: - Summary: Improve error messages for regular Python UDTFs that return non-tuple values (was: Improve error messages when regular Python UDTFs that return non-tuple values) > Improve error messages for regular Python UDTFs that return non-tuple values > > > Key: SPARK-44005 > URL: https://issues.apache.org/jira/browse/SPARK-44005 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.5.0 >Reporter: Allison Wang >Priority: Major > > Currently, if you have a UDTF like this: > {code:java} > class TestUDTF: > def eval(self, a: int): > yield a {code} > and run the UDTF, it will fail with a confusing error message like > {code:java} > Unexpected tuple 1 with StructType {code} > Note this works when arrow is enabled. We should improve error messages for > regular UDTFs. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44005) Improve error messages when regular Python UDTFs that return non-tuple values
[ https://issues.apache.org/jira/browse/SPARK-44005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allison Wang updated SPARK-44005: - Summary: Improve error messages when regular Python UDTFs that return non-tuple values (was: Support returning non-tuple values for regular Python UDTFs) > Improve error messages when regular Python UDTFs that return non-tuple values > - > > Key: SPARK-44005 > URL: https://issues.apache.org/jira/browse/SPARK-44005 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.5.0 >Reporter: Allison Wang >Priority: Major > > Currently, if you have a UDTF like this: > {code:java} > class TestUDTF: > def eval(self, a: int): > yield a {code} > and run the UDTF, it will fail with a confusing error message like > {code:java} > Unexpected tuple 1 with StructType {code} > Note this works when arrow is enabled. We should improve error messages for > regular UDTFs. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44005) Support returning non-tuple values for regular Python UDTFs
[ https://issues.apache.org/jira/browse/SPARK-44005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allison Wang updated SPARK-44005: - Description: Currently, if you have a UDTF like this: {code:java} class TestUDTF: def eval(self, a: int): yield a {code} and run the UDTF, it will fail with a confusing error message like {code:java} Unexpected tuple 1 with StructType {code} Note this works when arrow is enabled. We should improve error messages for regular UDTFs. was: Currently, if you have a UDTF like this: {code:java} class TestUDTF: def eval(self, a: int): yield a {code} and run the UDTF, it will fail with a confusing error message like {code:java} Unexpected tuple 1 with StructType {code} Note this works when arrow is enabled. We should support this use case for regular UDTFs. > Support returning non-tuple values for regular Python UDTFs > --- > > Key: SPARK-44005 > URL: https://issues.apache.org/jira/browse/SPARK-44005 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.5.0 >Reporter: Allison Wang >Priority: Major > > Currently, if you have a UDTF like this: > {code:java} > class TestUDTF: > def eval(self, a: int): > yield a {code} > and run the UDTF, it will fail with a confusing error message like > {code:java} > Unexpected tuple 1 with StructType {code} > Note this works when arrow is enabled. We should improve error messages for > regular UDTFs. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-44667) Uninstall large ML libraries for non-ML jobs
[ https://issues.apache.org/jira/browse/SPARK-44667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng reassigned SPARK-44667: - Assignee: Ruifeng Zheng > Uninstall large ML libraries for non-ML jobs > > > Key: SPARK-44667 > URL: https://issues.apache.org/jira/browse/SPARK-44667 > Project: Spark > Issue Type: Sub-task > Components: Project Infra >Affects Versions: 4.0.0 >Reporter: Ruifeng Zheng >Assignee: Ruifeng Zheng >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-44667) Uninstall large ML libraries for non-ML jobs
[ https://issues.apache.org/jira/browse/SPARK-44667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng resolved SPARK-44667. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 42334 [https://github.com/apache/spark/pull/42334] > Uninstall large ML libraries for non-ML jobs > > > Key: SPARK-44667 > URL: https://issues.apache.org/jira/browse/SPARK-44667 > Project: Spark > Issue Type: Sub-task > Components: Project Infra >Affects Versions: 4.0.0 >Reporter: Ruifeng Zheng >Assignee: Ruifeng Zheng >Priority: Major > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-44683) Logging level isn't passed to RocksDB state store provider correctly
Siying Dong created SPARK-44683: --- Summary: Logging level isn't passed to RocksDB state store provider correctly Key: SPARK-44683 URL: https://issues.apache.org/jira/browse/SPARK-44683 Project: Spark Issue Type: Bug Components: Structured Streaming Affects Versions: 3.4.1 Reporter: Siying Dong We pass log4j's log level to RocksDB so that RocksDB debug log can go to log4j. However, we pass in the log level after we create the logger, while RocksDB only takes the log level when a logger is created, so it never changes. This has two impacts: (1) setting DEBUG level doesn't make RocksDB generate DEBUG level logs; (2) setting WARN or ERROR level does prevent INFO level logging, but RocksDB still makes JNI calls to Scala, which is an unnecessary overhead. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
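A minimal sketch of the ordering a fix needs, written against the plain RocksDB JNI API rather than Spark's actual state store provider code; the helper name and the println-based forwarding are stand-ins:
{code:scala}
// Hedged sketch: set the level on the options *before* the Logger is constructed,
// so DEBUG takes effect and filtered levels never cross the JNI boundary.
import org.rocksdb.{InfoLogLevel, Logger, Options}

object RocksDbLoggerSketch {
  def optionsWithLogger(level: InfoLogLevel): Options = {
    val options = new Options()
    options.setInfoLogLevel(level) // must happen before the Logger below is created
    val logger = new Logger(options) {
      override protected def log(infoLogLevel: InfoLogLevel, logMsg: String): Unit = {
        // Forward native RocksDB log lines to the JVM-side logging framework (stubbed here).
        println(s"[rocksdb][$infoLogLevel] $logMsg")
      }
    }
    options.setLogger(logger)
    options
  }
}
{code}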
[jira] [Resolved] (SPARK-44663) Disable arrow optimization by default for Python UDTFs
[ https://issues.apache.org/jira/browse/SPARK-44663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takuya Ueshin resolved SPARK-44663. --- Fix Version/s: 3.5.0 Assignee: Allison Wang Resolution: Fixed Issue resolved by pull request 42329 https://github.com/apache/spark/pull/42329 > Disable arrow optimization by default for Python UDTFs > -- > > Key: SPARK-44663 > URL: https://issues.apache.org/jira/browse/SPARK-44663 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.5.0 >Reporter: Allison Wang >Assignee: Allison Wang >Priority: Major > Fix For: 3.5.0 > > > Disable arrow optimization to make Python UDTFs consistent with Python UDFs. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24815) Structured Streaming should support dynamic allocation
[ https://issues.apache.org/jira/browse/SPARK-24815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17751237#comment-17751237 ] Pavan Kotikalapudi commented on SPARK-24815: Here is the draft PR with initial implementation [https://github.com/apache/spark/pull/42352] and the design doc: [https://docs.google.com/document/d/1_YmfCsQQb9XhRdKh0ijbc-j8JKGtGBxYsk_30NVSTWo/edit?usp=sharing |https://docs.google.com/document/d/1_YmfCsQQb9XhRdKh0ijbc-j8JKGtGBxYsk_30NVSTWo/edit?usp=sharing]. Thanks for the review :) > Structured Streaming should support dynamic allocation > -- > > Key: SPARK-24815 > URL: https://issues.apache.org/jira/browse/SPARK-24815 > Project: Spark > Issue Type: Improvement > Components: Scheduler, Spark Core, Structured Streaming >Affects Versions: 2.3.1 >Reporter: Karthik Palaniappan >Priority: Minor > > For batch jobs, dynamic allocation is very useful for adding and removing > containers to match the actual workload. On multi-tenant clusters, it ensures > that a Spark job is taking no more resources than necessary. In cloud > environments, it enables autoscaling. > However, if you set spark.dynamicAllocation.enabled=true and run a structured > streaming job, the batch dynamic allocation algorithm kicks in. It requests > more executors if the task backlog is a certain size, and removes executors > if they idle for a certain period of time. > Quick thoughts: > 1) Dynamic allocation should be pluggable, rather than hardcoded to a > particular implementation in SparkContext.scala (this should be a separate > JIRA). > 2) We should make a structured streaming algorithm that's separate from the > batch algorithm. Eventually, continuous processing might need its own > algorithm. > 3) Spark should print a warning if you run a structured streaming job when > Core's dynamic allocation is enabled -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
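For context, a minimal sketch of the current behaviour the ticket describes: with the existing configs below, a structured streaming query simply runs under the batch dynamic-allocation heuristics (the proposed streaming-aware policy from the PR is not shown here):
{code:scala}
// Hedged illustration only: executors scale on task backlog/idleness, not on
// streaming batch durations, when today's batch-oriented DRA is enabled.
import org.apache.spark.sql.SparkSession

object StreamingDraIllustration {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("streaming-with-batch-dra")
      .config("spark.dynamicAllocation.enabled", "true")
      .config("spark.dynamicAllocation.shuffleTracking.enabled", "true")
      .getOrCreate()

    val counts = spark.readStream.format("rate").load()
      .groupBy("value").count()

    val query = counts.writeStream
      .format("console")
      .outputMode("complete")
      .start()
    query.awaitTermination()
  }
}
{code}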
[jira] [Updated] (SPARK-43797) Python User-defined Table Functions
[ https://issues.apache.org/jira/browse/SPARK-43797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allison Wang updated SPARK-43797: - Affects Version/s: 4.0.0 > Python User-defined Table Functions > --- > > Key: SPARK-43797 > URL: https://issues.apache.org/jira/browse/SPARK-43797 > Project: Spark > Issue Type: Umbrella > Components: PySpark >Affects Versions: 3.5.0, 4.0.0 >Reporter: Allison Wang >Priority: Major > > This is an umbrella ticket to support Python user-defined table functions. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-44678) Downgrade Hadoop to 3.3.4
[ https://issues.apache.org/jira/browse/SPARK-44678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-44678. --- Fix Version/s: 3.5.0 Resolution: Fixed Issue resolved by pull request 42345 [https://github.com/apache/spark/pull/42345] > Downgrade Hadoop to 3.3.4 > - > > Key: SPARK-44678 > URL: https://issues.apache.org/jira/browse/SPARK-44678 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 3.5.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Critical > Fix For: 3.5.0 > > > There is a community report on S3A committer performance regression. Although > it's one liner fix, there is no available Hadoop release with that fix at > this time. > HADOOP-18757: Bump corePoolSize of HadoopThreadPoolExecutor in s3a committer -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-44678) Downgrade Hadoop to 3.3.4
[ https://issues.apache.org/jira/browse/SPARK-44678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-44678: - Assignee: Dongjoon Hyun > Downgrade Hadoop to 3.3.4 > - > > Key: SPARK-44678 > URL: https://issues.apache.org/jira/browse/SPARK-44678 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 3.5.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Critical > > There is a community report on S3A committer performance regression. Although > it's one liner fix, there is no available Hadoop release with that fix at > this time. > HADOOP-18757: Bump corePoolSize of HadoopThreadPoolExecutor in s3a committer -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-44644) Improve error messages for creating Python UDTFs with pickling errors
[ https://issues.apache.org/jira/browse/SPARK-44644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takuya Ueshin resolved SPARK-44644. --- Fix Version/s: 4.0.0 Target Version/s: 3.5.0 Assignee: Allison Wang Resolution: Fixed Issue resolved by pull request 42309 https://github.com/apache/spark/pull/42309 > Improve error messages for creating Python UDTFs with pickling errors > - > > Key: SPARK-44644 > URL: https://issues.apache.org/jira/browse/SPARK-44644 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.5.0 >Reporter: Allison Wang >Assignee: Allison Wang >Priority: Major > Fix For: 4.0.0 > > > Currently, when users create a Python UDTF with a non-pickleable object, it > throws this error: > _pickle.PicklingError: Cannot pickle files that are not opened for reading: w > > We should make this more user-friendly -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-44258) Move Metadata to sql/api
[ https://issues.apache.org/jira/browse/SPARK-44258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rui Wang resolved SPARK-44258. -- Resolution: Fixed > Move Metadata to sql/api > > > Key: SPARK-44258 > URL: https://issues.apache.org/jira/browse/SPARK-44258 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.5.0 >Reporter: Rui Wang >Assignee: Rui Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-42887) Simple DataType interface
[ https://issues.apache.org/jira/browse/SPARK-42887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rui Wang resolved SPARK-42887. -- Resolution: Fixed > Simple DataType interface > - > > Key: SPARK-42887 > URL: https://issues.apache.org/jira/browse/SPARK-42887 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 3.5.0 >Reporter: Rui Wang >Assignee: Rui Wang >Priority: Major > > This JIRA proposes to move non public API from existing DataType class to > make DataType become a simple interface. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-44682) Make pandas error class message_parameters strings
Amanda Liu created SPARK-44682: -- Summary: Make pandas error class message_parameters strings Key: SPARK-44682 URL: https://issues.apache.org/jira/browse/SPARK-44682 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 3.5.0 Reporter: Amanda Liu -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-44681) Solve issue referencing github.com/apache/spark-connect-go as Go library
BoYang created SPARK-44681: -- Summary: Solve issue referencing github.com/apache/spark-connect-go as Go library Key: SPARK-44681 URL: https://issues.apache.org/jira/browse/SPARK-44681 Project: Spark Issue Type: Sub-task Components: Connect Contrib Affects Versions: 3.5.0 Reporter: BoYang -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-43351) Support Golang in Spark Connect
[ https://issues.apache.org/jira/browse/SPARK-43351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17751186#comment-17751186 ] BoYang commented on SPARK-43351: Thanks! We can keep it as 3.5.0 now. > Support Golang in Spark Connect > --- > > Key: SPARK-43351 > URL: https://issues.apache.org/jira/browse/SPARK-43351 > Project: Spark > Issue Type: New Feature > Components: Connect >Affects Versions: 3.5.0 >Reporter: BoYang >Assignee: BoYang >Priority: Major > Fix For: 3.5.0 > > > Support Spark Connect client side in Go programming language -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-44648) Set up memory limits for analyze in Python.
[ https://issues.apache.org/jira/browse/SPARK-44648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takuya Ueshin resolved SPARK-44648. --- Assignee: Takuya Ueshin Resolution: Fixed Issue resolved by pull request 42328 https://github.com/apache/spark/pull/42328 > Set up memory limits for analyze in Python. > --- > > Key: SPARK-44648 > URL: https://issues.apache.org/jira/browse/SPARK-44648 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 4.0.0 >Reporter: Takuya Ueshin >Assignee: Takuya Ueshin >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44679) java.lang.OutOfMemoryError: Requested array size exceeds VM limit
[ https://issues.apache.org/jira/browse/SPARK-44679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haitham Eltaweel updated SPARK-44679: - Attachment: code_sample.txt > java.lang.OutOfMemoryError: Requested array size exceeds VM limit > - > > Key: SPARK-44679 > URL: https://issues.apache.org/jira/browse/SPARK-44679 > Project: Spark > Issue Type: Bug > Components: EC2, PySpark >Affects Versions: 3.2.1 > Environment: We use Amazon EMR to run Pyspark jobs. > Amazon EMR version : emr-6.7.0 > Installed applications : > Tez 0.9.2, Spark 3.2.1, Hive 3.1.3, Sqoop 1.4.7, Hadoop 3.2.1, Zookeeper > 3.5.7, HCatalog 3.1.3, Livy 0.7.1 >Reporter: Haitham Eltaweel >Priority: Major > Attachments: code_sample.txt > > > We get the following error from our Pyspark application in Production env: > _java.lang.OutOfMemoryError: Requested array size exceeds VM limit_ > I simplified the code we used and shared it below so you can easily > investigate the issue. > We use Pyspark to read 900 MB text file which has one record. We use foreach > function to iterate over the Datafreme and apply some high order function. > The error occurs once foreach action is triggered. I think the issue is > related to the integer data type of the bytes array used to hold the > serialized dataframe. Since the dataframe record was too big, it seems the > serialized record exceeded the max integer value, hence the error occurred. > Note that the same error happens when using foreachBatch function with > writeStream. > Our prod data has many records larger than 100 MB. Appreciate your help to > provide a fix or a solution to that issue. > > *Find below the code snippet:* > from pyspark.sql import SparkSession,functions as f > > def check_file_name(row): > print("check_file_name called") > > def main(): > spark=SparkSession.builder.enableHiveSupport().getOrCreate() > inputPath = "s3://bucket-name/common/source/" > inputDF = spark.read.text(inputPath, wholetext=True) > inputDF = inputDF.select(f.date_format(f.current_timestamp(), > 'MMddHH').astype('string').alias('insert_hr'), > f.col("value").alias("raw_data"), > f.input_file_name().alias("input_file_name")) > inputDF.foreach(check_file_name) > > if __name__ == "__main__": > main() > *Find below spark-submit command used:* > spark-submit --master yarn --conf > spark.serializer=org.apache.spark.serializer.KryoSerializer --num-executors > 15 --executor-cores 4 --executor-memory 20g --driver-memory 20g --name > haitham_job --deploy-mode cluster big_file_process.py -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-44680) parameter markers are not blocked from DEFAULT (and other places)
Serge Rielau created SPARK-44680: Summary: parameter markers are not blocked from DEFAULT (and other places) Key: SPARK-44680 URL: https://issues.apache.org/jira/browse/SPARK-44680 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 3.4.0 Reporter: Serge Rielau scala> spark.sql("CREATE TABLE t11(c1 int default :parm)", args = Map("parm" -> 5)).show() -> success scala> spark.sql("describe t11"); [INVALID_DEFAULT_VALUE.UNRESOLVED_EXPRESSION] Failed to execute EXISTS_DEFAULT command because the destination table column `c1` has a DEFAULT value :parm, which fails to resolve as a valid expression. This likely extends to other DDL-y places. I can only find protection against placement in the body of a CREATE VIEW. I see two ways out of this: * Raise an error (as we do for CREATE VIEW v1(c1) AS SELECT ? ) * Improve the way we persist queries/expressions to substitute the at-DDL-time bound parameter value (it's not a bug, it's a feature) -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44678) Downgrade Hadoop to 3.3.4
[ https://issues.apache.org/jira/browse/SPARK-44678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-44678: -- Description: There is a community report on S3A committer performance regression. Although it's one liner fix, there is no available Hadoop release with that fix at this time. HADOOP-18757: Bump corePoolSize of HadoopThreadPoolExecutor in s3a committer > Downgrade Hadoop to 3.3.4 > - > > Key: SPARK-44678 > URL: https://issues.apache.org/jira/browse/SPARK-44678 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 3.5.0 >Reporter: Dongjoon Hyun >Priority: Critical > > There is a community report on S3A committer performance regression. Although > it's one liner fix, there is no available Hadoop release with that fix at > this time. > HADOOP-18757: Bump corePoolSize of HadoopThreadPoolExecutor in s3a committer -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44679) java.lang.OutOfMemoryError: Requested array size exceeds VM limit
[ https://issues.apache.org/jira/browse/SPARK-44679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haitham Eltaweel updated SPARK-44679: - Language: (was: Python) > java.lang.OutOfMemoryError: Requested array size exceeds VM limit > - > > Key: SPARK-44679 > URL: https://issues.apache.org/jira/browse/SPARK-44679 > Project: Spark > Issue Type: Bug > Components: EC2, PySpark >Affects Versions: 3.2.1 > Environment: We use Amazon EMR to run Pyspark jobs. > Amazon EMR version : emr-6.7.0 > Installed applications : > Tez 0.9.2, Spark 3.2.1, Hive 3.1.3, Sqoop 1.4.7, Hadoop 3.2.1, Zookeeper > 3.5.7, HCatalog 3.1.3, Livy 0.7.1 >Reporter: Haitham Eltaweel >Priority: Major > > We get the following error from our Pyspark application in Production env: > _java.lang.OutOfMemoryError: Requested array size exceeds VM limit_ > I simplified the code we used and shared it below so you can easily > investigate the issue. > We use Pyspark to read 900 MB text file which has one record. We use foreach > function to iterate over the Datafreme and apply some high order function. > The error occurs once foreach action is triggered. I think the issue is > related to the integer data type of the bytes array used to hold the > serialized dataframe. Since the dataframe record was too big, it seems the > serialized record exceeded the max integer value, hence the error occurred. > Note that the same error happens when using foreachBatch function with > writeStream. > Our prod data has many records larger than 100 MB. Appreciate your help to > provide a fix or a solution to that issue. > > *Find below the code snippet:* > from pyspark.sql import SparkSession,functions as f > > def check_file_name(row): > print("check_file_name called") > > def main(): > spark=SparkSession.builder.enableHiveSupport().getOrCreate() > inputPath = "s3://bucket-name/common/source/" > inputDF = spark.read.text(inputPath, wholetext=True) > inputDF = inputDF.select(f.date_format(f.current_timestamp(), > 'MMddHH').astype('string').alias('insert_hr'), > f.col("value").alias("raw_data"), > f.input_file_name().alias("input_file_name")) > inputDF.foreach(check_file_name) > > if __name__ == "__main__": > main() > *Find below spark-submit command used:* > spark-submit --master yarn --conf > spark.serializer=org.apache.spark.serializer.KryoSerializer --num-executors > 15 --executor-cores 4 --executor-memory 20g --driver-memory 20g --name > haitham_job --deploy-mode cluster big_file_process.py -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-44679) java.lang.OutOfMemoryError: Requested array size exceeds VM limit
Haitham Eltaweel created SPARK-44679: Summary: java.lang.OutOfMemoryError: Requested array size exceeds VM limit Key: SPARK-44679 URL: https://issues.apache.org/jira/browse/SPARK-44679 Project: Spark Issue Type: Bug Components: EC2, PySpark Affects Versions: 3.2.1 Environment: We use Amazon EMR to run Pyspark jobs. Amazon EMR version : emr-6.7.0 Installed applications : Tez 0.9.2, Spark 3.2.1, Hive 3.1.3, Sqoop 1.4.7, Hadoop 3.2.1, Zookeeper 3.5.7, HCatalog 3.1.3, Livy 0.7.1 Reporter: Haitham Eltaweel We get the following error from our Pyspark application in Production env: _java.lang.OutOfMemoryError: Requested array size exceeds VM limit_ I simplified the code we used and shared it below so you can easily investigate the issue. We use Pyspark to read a 900 MB text file which has one record. We use the foreach function to iterate over the DataFrame and apply a higher-order function. The error occurs once the foreach action is triggered. I think the issue is related to the integer data type of the byte array used to hold the serialized dataframe. Since the dataframe record was too big, it seems the serialized record exceeded the max integer value, hence the error occurred. Note that the same error happens when using the foreachBatch function with writeStream. Our prod data has many records larger than 100 MB. Appreciate your help to provide a fix or a solution to that issue.
*Find below the code snippet:*
from pyspark.sql import SparkSession, functions as f

def check_file_name(row):
    print("check_file_name called")

def main():
    spark = SparkSession.builder.enableHiveSupport().getOrCreate()
    inputPath = "s3://bucket-name/common/source/"
    inputDF = spark.read.text(inputPath, wholetext=True)
    inputDF = inputDF.select(f.date_format(f.current_timestamp(), 'MMddHH').astype('string').alias('insert_hr'),
                             f.col("value").alias("raw_data"),
                             f.input_file_name().alias("input_file_name"))
    inputDF.foreach(check_file_name)

if __name__ == "__main__":
    main()
*Find below spark-submit command used:* spark-submit --master yarn --conf spark.serializer=org.apache.spark.serializer.KryoSerializer --num-executors 15 --executor-cores 4 --executor-memory 20g --driver-memory 20g --name haitham_job --deploy-mode cluster big_file_process.py -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-44678) Downgrade Hadoop to 3.3.4
Dongjoon Hyun created SPARK-44678: - Summary: Downgrade Hadoop to 3.3.4 Key: SPARK-44678 URL: https://issues.apache.org/jira/browse/SPARK-44678 Project: Spark Issue Type: Bug Components: Build Affects Versions: 3.5.0 Reporter: Dongjoon Hyun -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-44671) Retry ExecutePlan in case initial request didn't reach server in Python client
[ https://issues.apache.org/jira/browse/SPARK-44671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-44671: Assignee: Hyukjin Kwon > Retry ExecutePlan in case initial request didn't reach server in Python client > -- > > Key: SPARK-44671 > URL: https://issues.apache.org/jira/browse/SPARK-44671 > Project: Spark > Issue Type: Task > Components: Connect, PySpark >Affects Versions: 3.5.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > > SPARK-44624 for Python -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-44671) Retry ExecutePlan in case initial request didn't reach server in Python client
[ https://issues.apache.org/jira/browse/SPARK-44671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-44671. -- Fix Version/s: 3.5.0 4.0.0 Resolution: Fixed Issue resolved by pull request 42338 [https://github.com/apache/spark/pull/42338] > Retry ExecutePlan in case initial request didn't reach server in Python client > -- > > Key: SPARK-44671 > URL: https://issues.apache.org/jira/browse/SPARK-44671 > Project: Spark > Issue Type: Task > Components: Connect, PySpark >Affects Versions: 3.5.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 3.5.0, 4.0.0 > > > SPARK-44624 for Python -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-44677) Drop legacy Hive-based ORC file format
[ https://issues.apache.org/jira/browse/SPARK-44677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17751148#comment-17751148 ] Dongjoon Hyun commented on SPARK-44677: --- +1. Thank you, [~chengpan]. > Drop legacy Hive-based ORC file format > -- > > Key: SPARK-44677 > URL: https://issues.apache.org/jira/browse/SPARK-44677 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Cheng Pan >Priority: Major > > Currently, Spark allows to use spark.sql.orc.impl=native/hive to switch the > ORC FileFormat implementation. > SPARK-23456(2.4) switched the default value of spark.sql.orc.impl from "hive" > to "native". and prepared to drop the "hive" implementation in the future. > > ... eventually, Apache Spark will drop old Hive-based ORC code. > The native implementation works well during the whole Spark 3.x period, so > it's a good time to consider dropping the "hive" one in Spark 4.0. > Also, we should take care about the backward-compatibility during change. > > BTW, IIRC, there was a different at Hive ORC CHAR implementation before. > > So, we couldn't remove it for backward-compatibility issues. Since Spark > > implements many CHAR features, we need to re-verify that {{native}} > > implementation has all legacy Hive-based ORC features -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-44677) Drop legacy Hive-based ORC file format
Cheng Pan created SPARK-44677: - Summary: Drop legacy Hive-based ORC file format Key: SPARK-44677 URL: https://issues.apache.org/jira/browse/SPARK-44677 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 4.0.0 Reporter: Cheng Pan Currently, Spark allows using spark.sql.orc.impl=native/hive to switch the ORC FileFormat implementation. SPARK-23456 (2.4) switched the default value of spark.sql.orc.impl from "hive" to "native" and prepared to drop the "hive" implementation in the future. > ... eventually, Apache Spark will drop old Hive-based ORC code. The native implementation has worked well throughout the Spark 3.x period, so it's a good time to consider dropping the "hive" one in Spark 4.0. We should also take care of backward compatibility during the change. > BTW, IIRC, there was a difference in the Hive ORC CHAR implementation before. So, > we couldn't remove it for backward-compatibility reasons. Since Spark > implements many CHAR features, we need to re-verify that the {{native}} > implementation has all legacy Hive-based ORC features. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
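For reference, a minimal sketch of switching between the two ORC implementations that SPARK-44677 discusses. The session setup, object name, and /tmp path below are illustrative only and are not taken from the ticket; spark.sql.orc.impl and its "native"/"hive" values are the configuration named above.
{code:scala}
import org.apache.spark.sql.SparkSession

object OrcImplSwitch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").getOrCreate()

    // "native": the default since SPARK-23456 (Spark 2.4), implemented in Spark itself.
    spark.conf.set("spark.sql.orc.impl", "native")
    spark.range(10).write.mode("overwrite").orc("/tmp/orc-impl-demo")

    // "hive": the legacy Hive-based reader/writer that SPARK-44677 proposes to drop.
    spark.conf.set("spark.sql.orc.impl", "hive")
    spark.read.orc("/tmp/orc-impl-demo").show()

    spark.stop()
  }
}
{code}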
[jira] [Assigned] (SPARK-44672) Fix git ignore rules related to Antlr
[ https://issues.apache.org/jira/browse/SPARK-44672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-44672: - Assignee: Yang Jie > Fix git ignore rules related to Antlr > - > > Key: SPARK-44672 > URL: https://issues.apache.org/jira/browse/SPARK-44672 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Affects Versions: 3.5.0, 4.0.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Minor > > Running `git status` after SPARK-44475 was merged shows the following untracked directories: > > {code:java} > sql/api/gen/ > sql/api/src/main/antlr4/org/apache/spark/sql/catalyst/parser/gen/ {code} > The .gitignore rules should be updated so that these paths are ignored. > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-44672) Fix git ignore rules related to Antlr
[ https://issues.apache.org/jira/browse/SPARK-44672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-44672. --- Fix Version/s: 3.5.0 4.0.0 Resolution: Fixed Issue resolved by pull request 42342 [https://github.com/apache/spark/pull/42342] > Fix git ignore rules related to Antlr > - > > Key: SPARK-44672 > URL: https://issues.apache.org/jira/browse/SPARK-44672 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Affects Versions: 3.5.0, 4.0.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Minor > Fix For: 3.5.0, 4.0.0 > > > Running `git status` after SPARK-44475 was merged shows the following untracked directories: > > {code:java} > sql/api/gen/ > sql/api/src/main/antlr4/org/apache/spark/sql/catalyst/parser/gen/ {code} > The .gitignore rules should be updated so that these paths are ignored. > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44634) Encoders.bean does no longer support nested beans with type arguments
[ https://issues.apache.org/jira/browse/SPARK-44634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Giambattista Bloisi updated SPARK-44634: Description: Hi, while upgrading a project from spark 2.4.0 to 3.4.1 version, I have encountered the same problem described in [java - Encoders.bean attempts to check the validity of a return type considering its generic type and not its concrete class, with Spark 3.4.0 - Stack Overflow|https://stackoverflow.com/questions/76045255/encoders-bean-attempts-to-check-the-validity-of-a-return-type-considering-its-ge]. Put it short, starting from Spark 3.4.x Encoders.bean throws an exception when the passed class contains a field whose type is a nested bean with type arguments: {code:java} class A { T value; // value getter and setter } class B { A stringHolder; // stringHolder getter and setter } Encoders.bean(B.class); // throws "SparkUnsupportedOperationException: [ENCODER_NOT_FOUND]..."{code} It looks like this is a regression introduced with [SPARK-42093 SQL Move JavaTypeInference to AgnosticEncoders|https://github.com/apache/spark/commit/18672003513d5a4aa610b6b94dbbc15c33185d3#diff-1191737b908340a2f4c22b71b1c40ebaa0da9d8b40c958089c346a3bda26943b] while getting rid of TypeToken, that somehow managed that case. was: Hi, while upgrading a project from spark 2.4.0 to 3.4.1 version, I have encountered the same problem described in [java - Encoders.bean attempts to check the validity of a return type considering its generic type and not its concrete class, with Spark 3.4.0 - Stack Overflow|https://stackoverflow.com/questions/76045255/encoders-bean-attempts-to-check-the-validity-of-a-return-type-considering-its-ge]. Put it short, starting from Spark 3.4.x Encoders.bean throws an exception when the passed class contains a field whose type is a nested bean with type arguments: {code:java} class A { T value; // value getter and setter } class B { A stringHolder; // stringHolder getter and setter } Encoders.bean(B.class); // throws "SparkUnsupportedOperationException: [ENCODER_NOT_FOUND]..."{code} It looks like this is a regression introduced with [SPARK-42093 SQL Move JavaTypeInference to AgnosticEncoders|https://github.com/apache/spark/commit/18672003513d5a4aa610b6b94dbbc15c33185d3#diff-1191737b908340a2f4c22b71b1c40ebaa0da9d8b40c958089c346a3bda26943b] while getting rid of TypeToken, that somehow managed that case. I'm going to submit a PR to re-enable this functionality. > Encoders.bean does no longer support nested beans with type arguments > - > > Key: SPARK-44634 > URL: https://issues.apache.org/jira/browse/SPARK-44634 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.1, 3.5.0, 4.0.0 >Reporter: Giambattista Bloisi >Priority: Major > > Hi, > while upgrading a project from spark 2.4.0 to 3.4.1 version, I have > encountered the same problem described in [java - Encoders.bean attempts to > check the validity of a return type considering its generic type and not its > concrete class, with Spark 3.4.0 - Stack > Overflow|https://stackoverflow.com/questions/76045255/encoders-bean-attempts-to-check-the-validity-of-a-return-type-considering-its-ge]. 
> Put it short, starting from Spark 3.4.x Encoders.bean throws an exception > when the passed class contains a field whose type is a nested bean with type > arguments: > > {code:java} > class A { >T value; >// value getter and setter > } > class B { >A stringHolder; >// stringHolder getter and setter > } > Encoders.bean(B.class); // throws "SparkUnsupportedOperationException: > [ENCODER_NOT_FOUND]..."{code} > > > It looks like this is a regression introduced with [SPARK-42093 SQL Move > JavaTypeInference to > AgnosticEncoders|https://github.com/apache/spark/commit/18672003513d5a4aa610b6b94dbbc15c33185d3#diff-1191737b908340a2f4c22b71b1c40ebaa0da9d8b40c958089c346a3bda26943b] > while getting rid of TypeToken, that somehow managed that case. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
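The type parameters in the code quoted in SPARK-44634 above appear to have been lost in formatting; the reported shape is presumably a generic holder class A<T> with an A<String> field inside B. The sketch below is a Scala approximation of that reproduction using @BeanProperty to generate the bean accessors; the original report uses plain Java beans, and the class and object names here are illustrative.
{code:scala}
import scala.beans.BeanProperty
import org.apache.spark.sql.Encoders

// A generic holder bean, nested inside another bean with a concrete type argument.
class A[T] {
  @BeanProperty var value: T = _
}

class B {
  @BeanProperty var stringHolder: A[String] = _
}

object EncodersBeanRepro {
  def main(args: Array[String]): Unit = {
    // Per the report, this worked on Spark 2.4.x but fails on 3.4.x with
    // SparkUnsupportedOperationException: [ENCODER_NOT_FOUND].
    val encoder = Encoders.bean(classOf[B])
    println(encoder.schema)
  }
}
{code}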
[jira] [Resolved] (SPARK-44674) Remove `BytecodeUtils` from `graphx` module
[ https://issues.apache.org/jira/browse/SPARK-44674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-44674. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 42343 [https://github.com/apache/spark/pull/42343] > Remove `BytecodeUtils` from `graphx` module > --- > > Key: SPARK-44674 > URL: https://issues.apache.org/jira/browse/SPARK-44674 > Project: Spark > Issue Type: Improvement > Components: GraphX >Affects Versions: 4.0.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Minor > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-44674) Remove `BytecodeUtils` from `graphx` module
[ https://issues.apache.org/jira/browse/SPARK-44674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-44674: - Assignee: Yang Jie > Remove `BytecodeUtils` from `graphx` module > --- > > Key: SPARK-44674 > URL: https://issues.apache.org/jira/browse/SPARK-44674 > Project: Spark > Issue Type: Improvement > Components: GraphX >Affects Versions: 4.0.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-44676) Ensure Spark Connect Scala client CloseableIterator is closed in all cases where an exception could be thrown
Juliusz Sompolski created SPARK-44676: - Summary: Ensure Spark Connect Scala client CloseableIterator is closed in all cases where an exception could be thrown Key: SPARK-44676 URL: https://issues.apache.org/jira/browse/SPARK-44676 Project: Spark Issue Type: Improvement Components: Connect Affects Versions: 4.0.0 Reporter: Juliusz Sompolski We already ensure that the CloseableIterator is consumed in all places, and in ExecutePlanResponseReattachableIterator we ensure that the iterator is closed in case of a GRPC error. We should also ensure that every place in the client that uses a CloseableIterator closes it gracefully, even when another exception is thrown, including InterruptedException. Some try { } finally { iterator.close() } blocks may be needed for that. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
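A minimal sketch of the try/finally pattern SPARK-44676 suggests. The CloseableIterator trait below is a hypothetical stand-in for the client's internal type, and consumeAndClose is an illustrative helper, not an existing API.
{code:scala}
// Hypothetical stand-in for the Spark Connect Scala client's internal CloseableIterator.
trait CloseableIterator[E] extends Iterator[E] {
  def close(): Unit
}

object CloseOnAnyException {
  // Consume the iterator and guarantee close() even if hasNext/next or the handler
  // throws (including InterruptedException), as the ticket proposes.
  def consumeAndClose[E](iter: CloseableIterator[E])(handle: E => Unit): Unit = {
    try {
      while (iter.hasNext) {
        handle(iter.next())
      }
    } finally {
      iter.close()
    }
  }
}
{code}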
[jira] [Assigned] (SPARK-44656) Close dangling iterators in SparkResult too (Spark Connect Scala)
[ https://issues.apache.org/jira/browse/SPARK-44656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hövell reassigned SPARK-44656: - Assignee: Alice Sayutina > Close dangling iterators in SparkResult too (Spark Connect Scala) > - > > Key: SPARK-44656 > URL: https://issues.apache.org/jira/browse/SPARK-44656 > Project: Spark > Issue Type: Improvement > Components: Connect >Affects Versions: 3.5.0 >Reporter: Alice Sayutina >Assignee: Alice Sayutina >Priority: Major > > SPARK-44636 followup. We didn't address iterators grabbed in SparkResult > there. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-44656) Close dangling iterators in SparkResult too (Spark Connect Scala)
[ https://issues.apache.org/jira/browse/SPARK-44656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hövell resolved SPARK-44656. --- Fix Version/s: 3.5.0 Resolution: Fixed > Close dangling iterators in SparkResult too (Spark Connect Scala) > - > > Key: SPARK-44656 > URL: https://issues.apache.org/jira/browse/SPARK-44656 > Project: Spark > Issue Type: Improvement > Components: Connect >Affects Versions: 3.5.0 >Reporter: Alice Sayutina >Assignee: Alice Sayutina >Priority: Major > Fix For: 3.5.0 > > > SPARK-44636 followup. We didn't address iterators grabbed in SparkResult > there. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-44675) Increase ReservedCodeCacheSize for release build
[ https://issues.apache.org/jira/browse/SPARK-44675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang reassigned SPARK-44675: --- Assignee: Yuming Wang > Increase ReservedCodeCacheSize for release build > > > Key: SPARK-44675 > URL: https://issues.apache.org/jira/browse/SPARK-44675 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Affects Versions: 4.0.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-44675) Increase ReservedCodeCacheSize for release build
[ https://issues.apache.org/jira/browse/SPARK-44675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang resolved SPARK-44675. - Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 42344 [https://github.com/apache/spark/pull/42344] > Increase ReservedCodeCacheSize for release build > > > Key: SPARK-44675 > URL: https://issues.apache.org/jira/browse/SPARK-44675 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Affects Versions: 4.0.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-44675) Increase ReservedCodeCacheSize for release build
Yuming Wang created SPARK-44675: --- Summary: Increase ReservedCodeCacheSize for release build Key: SPARK-44675 URL: https://issues.apache.org/jira/browse/SPARK-44675 Project: Spark Issue Type: Improvement Components: Project Infra Affects Versions: 4.0.0 Reporter: Yuming Wang -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-44674) Remove `BytecodeUtils` from `graphx` module
Yang Jie created SPARK-44674: Summary: Remove `BytecodeUtils` from `graphx` module Key: SPARK-44674 URL: https://issues.apache.org/jira/browse/SPARK-44674 Project: Spark Issue Type: Improvement Components: GraphX Affects Versions: 4.0.0 Reporter: Yang Jie -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-44673) Raise an ArrowIOError with empty messages
Deng An created SPARK-44673: --- Summary: Raise an ArrowIOError with empty messages Key: SPARK-44673 URL: https://issues.apache.org/jira/browse/SPARK-44673 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 2.4.4 Reporter: Deng An We encountered a problem while using PySpark 2.4.4, where the Python Runner of a certain task threw a pyarrow.lib.ArrowIOError without any message, which is very confusing. The task was scheduled multiple times, and all of its attempts failed with the same pyarrow.lib.ArrowIOError. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-44600) Make `repl` module daily test pass
[ https://issues.apache.org/jira/browse/SPARK-44600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Jie resolved SPARK-44600. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 42291 [https://github.com/apache/spark/pull/42291] > Make `repl` module daily test pass > -- > > Key: SPARK-44600 > URL: https://issues.apache.org/jira/browse/SPARK-44600 > Project: Spark > Issue Type: Sub-task > Components: Tests >Affects Versions: 4.0.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Major > Fix For: 4.0.0 > > > [https://github.com/apache/spark/actions/runs/5727123477/job/15518895421] > > {code:java} > - SPARK-15236: use Hive catalog *** FAILED *** > 18137 isContain was true Interpreter output contained 'Exception': > 18138 Welcome to > 18139 __ > 18140 / __/__ ___ _/ /__ > 18141 _\ \/ _ \/ _ `/ __/ '_/ > 18142 /___/ .__/\_,_/_/ /_/\_\ version 4.0.0-SNAPSHOT > 18143/_/ > 18144 > 18145 Using Scala version 2.12.18 (OpenJDK 64-Bit Server VM, Java 1.8.0_372) > 18146 Type in expressions to have them evaluated. > 18147 Type :help for more information. > 18148 > 18149 scala> > 18150 scala> java.lang.NoClassDefFoundError: > org/sparkproject/guava/cache/CacheBuilder > 18151at > org.apache.spark.sql.catalyst.catalog.SessionCatalog.(SessionCatalog.scala:197) > 18152at > org.apache.spark.sql.internal.BaseSessionStateBuilder.catalog$lzycompute(BaseSessionStateBuilder.scala:153) > 18153at > org.apache.spark.sql.internal.BaseSessionStateBuilder.catalog(BaseSessionStateBuilder.scala:152) > 18154at > org.apache.spark.sql.internal.BaseSessionStateBuilder.v2SessionCatalog$lzycompute(BaseSessionStateBuilder.scala:166) > 18155at > org.apache.spark.sql.internal.BaseSessionStateBuilder.v2SessionCatalog(BaseSessionStateBuilder.scala:166) > 18156at > org.apache.spark.sql.internal.BaseSessionStateBuilder.catalogManager$lzycompute(BaseSessionStateBuilder.scala:168) > 18157at > org.apache.spark.sql.internal.BaseSessionStateBuilder.catalogManager(BaseSessionStateBuilder.scala:168) > 18158at > org.apache.spark.sql.internal.BaseSessionStateBuilder$$anon$1.(BaseSessionStateBuilder.scala:185) > 18159at > org.apache.spark.sql.internal.BaseSessionStateBuilder.analyzer(BaseSessionStateBuilder.scala:185) > 18160at > org.apache.spark.sql.internal.BaseSessionStateBuilder.$anonfun$build$2(BaseSessionStateBuilder.scala:374) > 18161at > org.apache.spark.sql.internal.SessionState.analyzer$lzycompute(SessionState.scala:92) > 18162at > org.apache.spark.sql.internal.SessionState.analyzer(SessionState.scala:92) > 18163at > org.apache.spark.sql.execution.QueryExecution.$anonfun$analyzed$1(QueryExecution.scala:77) > 18164at > org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:138) > 18165at > org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$2(QueryExecution.scala:219) > 18166at > org.apache.spark.sql.execution.QueryExecution$.withInternalError(QueryExecution.scala:546) > 18167at > org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$1(QueryExecution.scala:219) > 18168at > org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:900) > 18169at > org.apache.spark.sql.execution.QueryExecution.executePhase(QueryExecution.scala:218) > 18170at > org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:77) > 18171at > org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:74) > 18172at > org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:66) > 
18173at org.apache.spark.sql.Dataset$.$anonfun$ofRows$2(Dataset.scala:100) > 18174at > org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:900) > 18175at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:98) > 18176at > org.apache.spark.sql.SparkSession.$anonfun$sql$4(SparkSession.scala:691) > 18177at > org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:900) > 18178at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:682) > 18179at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:713) > 18180at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:744) > 18181... 100 elided > 18182 Caused by: java.lang.ClassNotFoundException: > org.sparkproject.guava.cache.CacheBuilder > 18183at java.net.URLClassLoader.findClass(URLClassLoader.java:387) > 18184at java.lang.ClassLoader.loadClass(ClassLoader.java:418) > 18185at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352) > 18186at java.lang.ClassLoader.loadClass(ClassLoader.java:351) > 18187
[jira] [Assigned] (SPARK-44600) Make `repl` module daily test pass
[ https://issues.apache.org/jira/browse/SPARK-44600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Jie reassigned SPARK-44600: Assignee: Yang Jie > Make `repl` module daily test pass > -- > > Key: SPARK-44600 > URL: https://issues.apache.org/jira/browse/SPARK-44600 > Project: Spark > Issue Type: Sub-task > Components: Tests >Affects Versions: 4.0.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Major > > [https://github.com/apache/spark/actions/runs/5727123477/job/15518895421] > > {code:java} > - SPARK-15236: use Hive catalog *** FAILED *** > 18137 isContain was true Interpreter output contained 'Exception': > 18138 Welcome to > 18139 __ > 18140 / __/__ ___ _/ /__ > 18141 _\ \/ _ \/ _ `/ __/ '_/ > 18142 /___/ .__/\_,_/_/ /_/\_\ version 4.0.0-SNAPSHOT > 18143/_/ > 18144 > 18145 Using Scala version 2.12.18 (OpenJDK 64-Bit Server VM, Java 1.8.0_372) > 18146 Type in expressions to have them evaluated. > 18147 Type :help for more information. > 18148 > 18149 scala> > 18150 scala> java.lang.NoClassDefFoundError: > org/sparkproject/guava/cache/CacheBuilder > 18151at > org.apache.spark.sql.catalyst.catalog.SessionCatalog.(SessionCatalog.scala:197) > 18152at > org.apache.spark.sql.internal.BaseSessionStateBuilder.catalog$lzycompute(BaseSessionStateBuilder.scala:153) > 18153at > org.apache.spark.sql.internal.BaseSessionStateBuilder.catalog(BaseSessionStateBuilder.scala:152) > 18154at > org.apache.spark.sql.internal.BaseSessionStateBuilder.v2SessionCatalog$lzycompute(BaseSessionStateBuilder.scala:166) > 18155at > org.apache.spark.sql.internal.BaseSessionStateBuilder.v2SessionCatalog(BaseSessionStateBuilder.scala:166) > 18156at > org.apache.spark.sql.internal.BaseSessionStateBuilder.catalogManager$lzycompute(BaseSessionStateBuilder.scala:168) > 18157at > org.apache.spark.sql.internal.BaseSessionStateBuilder.catalogManager(BaseSessionStateBuilder.scala:168) > 18158at > org.apache.spark.sql.internal.BaseSessionStateBuilder$$anon$1.(BaseSessionStateBuilder.scala:185) > 18159at > org.apache.spark.sql.internal.BaseSessionStateBuilder.analyzer(BaseSessionStateBuilder.scala:185) > 18160at > org.apache.spark.sql.internal.BaseSessionStateBuilder.$anonfun$build$2(BaseSessionStateBuilder.scala:374) > 18161at > org.apache.spark.sql.internal.SessionState.analyzer$lzycompute(SessionState.scala:92) > 18162at > org.apache.spark.sql.internal.SessionState.analyzer(SessionState.scala:92) > 18163at > org.apache.spark.sql.execution.QueryExecution.$anonfun$analyzed$1(QueryExecution.scala:77) > 18164at > org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:138) > 18165at > org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$2(QueryExecution.scala:219) > 18166at > org.apache.spark.sql.execution.QueryExecution$.withInternalError(QueryExecution.scala:546) > 18167at > org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$1(QueryExecution.scala:219) > 18168at > org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:900) > 18169at > org.apache.spark.sql.execution.QueryExecution.executePhase(QueryExecution.scala:218) > 18170at > org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:77) > 18171at > org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:74) > 18172at > org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:66) > 18173at org.apache.spark.sql.Dataset$.$anonfun$ofRows$2(Dataset.scala:100) > 18174at > 
org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:900) > 18175at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:98) > 18176at > org.apache.spark.sql.SparkSession.$anonfun$sql$4(SparkSession.scala:691) > 18177at > org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:900) > 18178at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:682) > 18179at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:713) > 18180at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:744) > 18181... 100 elided > 18182 Caused by: java.lang.ClassNotFoundException: > org.sparkproject.guava.cache.CacheBuilder > 18183at java.net.URLClassLoader.findClass(URLClassLoader.java:387) > 18184at java.lang.ClassLoader.loadClass(ClassLoader.java:418) > 18185at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352) > 18186at java.lang.ClassLoader.loadClass(ClassLoader.java:351) > 18187... 130 more > 18188 > 18189 scala> | > 18190 scala> :quit (ReplSuite.scala:83) {code} -- This message was sent by Atlass
[jira] [Created] (SPARK-44672) Fix git ignore rules related to Antlr
Yang Jie created SPARK-44672: Summary: Fix git ignore rules related to Antlr Key: SPARK-44672 URL: https://issues.apache.org/jira/browse/SPARK-44672 Project: Spark Issue Type: Improvement Components: Project Infra Affects Versions: 3.5.0, 4.0.0 Reporter: Yang Jie Running `git status` after SPARK-44475 was merged shows the following untracked directories: {code:java} sql/api/gen/ sql/api/src/main/antlr4/org/apache/spark/sql/catalyst/parser/gen/ {code} The .gitignore rules should be updated so that these paths are ignored. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
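Based on the untracked directories listed in SPARK-44672, the fix presumably adds rules along these lines to .gitignore; the exact patterns in the merged change may differ.
{code:java}
# Generated ANTLR output introduced by SPARK-44475
sql/api/gen/
sql/api/src/main/antlr4/org/apache/spark/sql/catalyst/parser/gen/
{code}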
[jira] [Resolved] (SPARK-44666) Uninstall CodeQL/Go/Node in non-container jobs
[ https://issues.apache.org/jira/browse/SPARK-44666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng resolved SPARK-44666. --- Assignee: Ruifeng Zheng Resolution: Resolved > Uninstall CodeQL/Go/Node in non-container jobs > -- > > Key: SPARK-44666 > URL: https://issues.apache.org/jira/browse/SPARK-44666 > Project: Spark > Issue Type: Sub-task > Components: Project Infra >Affects Versions: 4.0.0 >Reporter: Ruifeng Zheng >Assignee: Ruifeng Zheng >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-44666) Uninstall CodeQL/Go/Node in non-container jobs
[ https://issues.apache.org/jira/browse/SPARK-44666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17750998#comment-17750998 ] Ruifeng Zheng commented on SPARK-44666: --- resolved in https://github.com/apache/spark/pull/42333 > Uninstall CodeQL/Go/Node in non-container jobs > -- > > Key: SPARK-44666 > URL: https://issues.apache.org/jira/browse/SPARK-44666 > Project: Spark > Issue Type: Sub-task > Components: Project Infra >Affects Versions: 4.0.0 >Reporter: Ruifeng Zheng >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-44070) Bump snappy-java 1.1.10.1
[ https://issues.apache.org/jira/browse/SPARK-44070?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17750985#comment-17750985 ] Anil Poriya commented on SPARK-44070: - [~chengpan] Hi, is there any plan to backport this fix to version 3.3.2, or is there an issue tracking the bump of snappy-java in Spark 3.3.2? > Bump snappy-java 1.1.10.1 > - > > Key: SPARK-44070 > URL: https://issues.apache.org/jira/browse/SPARK-44070 > Project: Spark > Issue Type: Dependency upgrade > Components: Build >Affects Versions: 3.5.0 >Reporter: Cheng Pan >Assignee: Cheng Pan >Priority: Major > Fix For: 3.4.1, 3.5.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org