[jira] [Created] (SPARK-44687) Fix Scala 2.13 mima check
Yang Jie created SPARK-44687: Summary: Fix Scala 2.13 mima check Key: SPARK-44687 URL: https://issues.apache.org/jira/browse/SPARK-44687 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 4.0.0 Reporter: Yang Jie [https://github.com/apache/spark/actions/runs/5695413124/job/15438535023]
{code:java}
[error] spark-core: Failed binary compatibility check against org.apache.spark:spark-core_2.13:3.4.0! Found 1 potential problems (filtered 4013)
[error] * the type hierarchy of object org.apache.spark.scheduler.cluster.CoarseGrainedClusterMessages#SparkAppConfig is different in current version. Missing types {scala.runtime.AbstractFunction4}
[error] filter with: ProblemFilters.exclude[MissingTypesProblem]("org.apache.spark.scheduler.cluster.CoarseGrainedClusterMessages$SparkAppConfig$")
[error] java.lang.RuntimeException: Failed binary compatibility check against org.apache.spark:spark-core_2.13:3.4.0! Found 1 potential problems (filtered 4013)
[error] at scala.sys.package$.error(package.scala:30)
[error] at com.typesafe.tools.mima.plugin.SbtMima$.reportModuleErrors(SbtMima.scala:89)
[error] at com.typesafe.tools.mima.plugin.MimaPlugin$.$anonfun$projectSettings$2(MimaPlugin.scala:36)
[error] at com.typesafe.tools.mima.plugin.MimaPlugin$.$anonfun$projectSettings$2$adapted(MimaPlugin.scala:26)
[error] at scala.collection.Iterator.foreach(Iterator.scala:943)
[error] at scala.collection.Iterator.foreach$(Iterator.scala:943)
[error] at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
[error] at com.typesafe.tools.mima.plugin.MimaPlugin$.$anonfun$projectSettings$1(MimaPlugin.scala:26)
[error] at com.typesafe.tools.mima.plugin.MimaPlugin$.$anonfun$projectSettings$1$adapted(MimaPlugin.scala:25)
[error] at scala.Function1.$anonfun$compose$1(Function1.scala:49)
[error] at sbt.internal.util.$tilde$greater.$anonfun$$u2219$1(TypeFunctions.scala:63)
[error] at sbt.std.Transform$$anon$4.work(Transform.scala:69)
[error] at sbt.Execute.$anonfun$submit$2(Execute.scala:283)
[error] at sbt.internal.util.ErrorHandling$.wideConvert(ErrorHandling.scala:24)
[error] at sbt.Execute.work(Execute.scala:292)
[error] at sbt.Execute.$anonfun$submit$1(Execute.scala:283)
[error] at sbt.ConcurrentRestrictions$$anon$4.$anonfun$submitValid$1(ConcurrentRestrictions.scala:265)
[error] at sbt.CompletionService$$anon$2.call(CompletionService.scala:65)
[error] at java.util.concurrent.FutureTask.run(FutureTask.java:266)
[error] at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
[error] at java.util.concurrent.FutureTask.run(FutureTask.java:266)
[error] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
[error] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
[error] at java.lang.Thread.run(Thread.java:750)
[error] (core / mimaReportBinaryIssues) Failed binary compatibility check against org.apache.spark:spark-core_2.13:3.4.0! Found 1 potential problems (filtered 4013)
[error] Total time: 172 s (02:52), completed Jul 28, 2023 7:26:06 PM
{code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
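If the hierarchy change turns out to be intentional rather than a regression, the failure message above already spells out the filter to register. A minimal sketch, assuming the rule would be added to Spark's project/MimaExcludes.scala (the exclusion-list name below is an assumption):
{code:scala}
// Hedged sketch: the filter is copied from the failure message; the surrounding
// object/val names mirror project/MimaExcludes.scala but are assumptions here.
import com.typesafe.tools.mima.core._

object MimaExcludesSketch {
  // Accept the changed type hierarchy of SparkAppConfig reported by MiMa.
  lazy val v40excludes: Seq[ProblemFilter] = Seq(
    ProblemFilters.exclude[MissingTypesProblem](
      "org.apache.spark.scheduler.cluster.CoarseGrainedClusterMessages$SparkAppConfig$")
  )
}
{code}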
[jira] [Commented] (SPARK-44629) Publish PySpark Test Guidelines webpage
[ https://issues.apache.org/jira/browse/SPARK-44629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17751260#comment-17751260 ] Snoot.io commented on SPARK-44629: -- User 'asl3' has created a pull request for this issue: https://github.com/apache/spark/pull/42284 > Publish PySpark Test Guidelines webpage > --- > > Key: SPARK-44629 > URL: https://issues.apache.org/jira/browse/SPARK-44629 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.5.0 >Reporter: Amanda Liu >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-44665) Add support for pandas DataFrame assertDataFrameEqual
[ https://issues.apache.org/jira/browse/SPARK-44665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17751257#comment-17751257 ] Snoot.io commented on SPARK-44665: -- User 'asl3' has created a pull request for this issue: https://github.com/apache/spark/pull/42332 > Add support for pandas DataFrame assertDataFrameEqual > - > > Key: SPARK-44665 > URL: https://issues.apache.org/jira/browse/SPARK-44665 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.5.0 >Reporter: Amanda Liu >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-44665) Add support for pandas DataFrame assertDataFrameEqual
[ https://issues.apache.org/jira/browse/SPARK-44665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17751256#comment-17751256 ] Snoot.io commented on SPARK-44665: -- User 'asl3' has created a pull request for this issue: https://github.com/apache/spark/pull/42332 > Add support for pandas DataFrame assertDataFrameEqual > - > > Key: SPARK-44665 > URL: https://issues.apache.org/jira/browse/SPARK-44665 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.5.0 >Reporter: Amanda Liu >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-44306) Group FileStatus with few RPC calls within Yarn Client
[ https://issues.apache.org/jira/browse/SPARK-44306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17751255#comment-17751255 ] Snoot.io commented on SPARK-44306: -- User 'shuwang21' has created a pull request for this issue: https://github.com/apache/spark/pull/42357 > Group FileStatus with few RPC calls within Yarn Client > -- > > Key: SPARK-44306 > URL: https://issues.apache.org/jira/browse/SPARK-44306 > Project: Spark > Issue Type: New Feature > Components: Spark Submit >Affects Versions: 0.9.2, 2.3.0, 3.5.0 >Reporter: SHU WANG >Priority: Major > > It's inefficient to obtain *FileStatus* for each resource [one by > one|https://github.com/apache/spark/blob/531ec8bddc8dd22ca39486dbdd31e62e989ddc15/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ClientDistributedCacheManager.scala#L71C1]. > In our company setting, we are running Spark with Hadoop Yarn and HDFS. We > noticed the current behavior has two major drawbacks: > # Since each *getFileStatus* call involves network delays, the overall delay > can be *large* and add *uncertainty* to the overall Spark job runtime. > Specifically, we quantify this overhead within our cluster. We see the p50 > overhead is around 10s, p80 is 1 min, and p100 is up to 15 mins. When HDFS is > overloaded, the delays become more severe. > # In our cluster, we have nearly 100 million *getFileStatus* call to HDFS > daily. We noticed that in our cluster, most resources come from the same HDFS > directory for each user (See our [engineer blog > post|https://engineering.linkedin.com/blog/2023/reducing-apache-spark-application-dependencies-upload-by-99-] > about why we took this approach). Therefore, we can greatly reduce nearly > 100 million *getFileStatus* call to 0.1 million *listStatus* calls daily. > This will further reduce overhead from the HDFS side. > All in all, a more efficient way to fetch the *FileStatus* for each resource > is highly needed. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
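A minimal sketch of the grouping idea described above, using only Hadoop's public FileSystem API; the object and method names are illustrative and not the actual ClientDistributedCacheManager change:
{code:scala}
// Hedged sketch: one listStatus RPC per parent directory instead of one
// getFileStatus RPC per resource. Names are illustrative, not Spark's code.
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileStatus, Path}

object GroupedStatusLookup {
  def statusesByPath(resources: Seq[Path], conf: Configuration): Map[Path, FileStatus] = {
    resources.groupBy(_.getParent).toSeq.flatMap { case (parent, children) =>
      val fs = parent.getFileSystem(conf)
      val wanted = children.map(p => fs.makeQualified(p)).toSet
      // A single directory listing covers every resource under this parent.
      fs.listStatus(parent)
        .filter(status => wanted.contains(status.getPath))
        .map(status => status.getPath -> status)
    }.toMap
  }
}
{code}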
[jira] [Created] (SPARK-44686) Add option to create RowEncoder in Encoders helper class.
Herman van Hövell created SPARK-44686: - Summary: Add option to create RowEncoder in Encoders helper class. Key: SPARK-44686 URL: https://issues.apache.org/jira/browse/SPARK-44686 Project: Spark Issue Type: New Feature Components: Connect, SQL Affects Versions: 3.5.0 Reporter: Herman van Hövell Assignee: Herman van Hövell -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44662) SPIP: Improving performance of BroadcastHashJoin queries with stream side join key on non partition columns
[ https://issues.apache.org/jira/browse/SPARK-44662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-44662: Target Version/s: (was: 3.3.3) > SPIP: Improving performance of BroadcastHashJoin queries with stream side > join key on non partition columns > --- > > Key: SPARK-44662 > URL: https://issues.apache.org/jira/browse/SPARK-44662 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.3 >Reporter: Asif >Priority: Major > > h2. *Q1. What are you trying to do? Articulate your objectives using > absolutely no jargon.* > On the lines of DPP which helps DataSourceV2 relations when the joining key > is a partition column, the same concept can be extended over to the case > where joining key is not a partition column. > The Keys of BroadcastHashJoin are already available before actual evaluation > of the stream iterator. These keys can be pushed down to the DataSource as a > SortedSet. > For non partition columns, the DataSources like iceberg have max/min stats on > column available at manifest level, and for formats like parquet , they have > max/min stats at various storage level. The passed SortedSet can be used to > prune using ranges at both driver level ( manifests files) as well as > executor level ( while actually going through chunks , row groups etc at > parquet level) > If the data is stored as Columnar Batch format , then it would not be > possible to filter out individual row at DataSource level, even though we > have keys. > But at the scan level, ( ColumnToRowExec) it is still possible to filter out > as many rows as possible , if the query involves nested joins. Thus reducing > the number of rows to join at the higher join levels. > Will be adding more details.. > h2. *Q2. What problem is this proposal NOT designed to solve?* > This can only help in BroadcastHashJoin's performance if the join is Inner or > Left Semi. > This will also not work if there are nodes like Expand, Generator , Aggregate > (without group by on keys not part of joining column etc) below the > BroadcastHashJoin node being targeted. > h2. *Q3. How is it done today, and what are the limits of current practice?* > Currently this sort of pruning at DataSource level is being done using DPP > (Dynamic Partition Pruning ) and IFF one of the join key column is a > Partitioning column ( so that cost of DPP query is justified and way less > than amount of data it will be filtering by skipping partitions). > The limitation is that DPP type approach is not implemented ( intentionally I > believe), if the join column is a non partition column ( because of cost of > "DPP type" query would most likely be way high as compared to any possible > pruning ( especially if the column is not stored in a sorted manner). > h2. *Q4. What is new in your approach and why do you think it will be > successful?* > 1) This allows pruning on non partition column based joins. > 2) Because it piggy backs on Broadcasted Keys, there is no extra cost of "DPP > type" query. > 3) The Data can be used by DataSource to prune at driver (possibly) and also > at executor level ( as in case of parquet which has max/min at various > structure levels) > 4) The big benefit should be seen in multilevel nested join queries. In the > current code base, if I am correct, only one join's pruning filter would get > pushed at scan level. Since it is on partition key may be that is sufficient. 
> But if it is a nested Join query , and may be involving different columns on > streaming side for join, each such filter push could do significant pruning. > This requires some handling in case of AQE, as the stream side iterator ( & > hence stage evaluation needs to be delayed, till all the available join > filters in the nested tree are pushed at their respective target > BatchScanExec). > h4. *Single Row Filteration* > 5) In case of nested broadcasted joins, if the datasource is column vector > oriented , then what spark would get is a ColumnarBatch. But because scans > have Filters from multiple joins, they can be retrieved and can be applied in > code generated at ColumnToRowExec level, using a new "containsKey" method on > HashedRelation. Thus only those rows which satisfy all the > BroadcastedHashJoins ( whose keys have been pushed) , will be used for join > evaluation. > The code is already there , will be opening a PR. For non partition table > TPCDS run on laptop with TPCDS data size of ( scale factor 4), I am seeing > 15% gain. > For partition table TPCDS, there is improvement in 4 - 5 queries to the tune > of 10% to 37%. > h2. *Q5. Who cares? If you are successful, what difference will it make?* > If use cases involve
[jira] [Updated] (SPARK-44662) SPIP: Improving performance of BroadcastHashJoin queries with stream side join key on non partition columns
[ https://issues.apache.org/jira/browse/SPARK-44662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-44662: Fix Version/s: (was: 3.3.3) > SPIP: Improving performance of BroadcastHashJoin queries with stream side > join key on non partition columns > --- > > Key: SPARK-44662 > URL: https://issues.apache.org/jira/browse/SPARK-44662 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.3 >Reporter: Asif >Priority: Major > > h2. *Q1. What are you trying to do? Articulate your objectives using > absolutely no jargon.* > On the lines of DPP which helps DataSourceV2 relations when the joining key > is a partition column, the same concept can be extended over to the case > where joining key is not a partition column. > The Keys of BroadcastHashJoin are already available before actual evaluation > of the stream iterator. These keys can be pushed down to the DataSource as a > SortedSet. > For non partition columns, the DataSources like iceberg have max/min stats on > column available at manifest level, and for formats like parquet , they have > max/min stats at various storage level. The passed SortedSet can be used to > prune using ranges at both driver level ( manifests files) as well as > executor level ( while actually going through chunks , row groups etc at > parquet level) > If the data is stored as Columnar Batch format , then it would not be > possible to filter out individual row at DataSource level, even though we > have keys. > But at the scan level, ( ColumnToRowExec) it is still possible to filter out > as many rows as possible , if the query involves nested joins. Thus reducing > the number of rows to join at the higher join levels. > Will be adding more details.. > h2. *Q2. What problem is this proposal NOT designed to solve?* > This can only help in BroadcastHashJoin's performance if the join is Inner or > Left Semi. > This will also not work if there are nodes like Expand, Generator , Aggregate > (without group by on keys not part of joining column etc) below the > BroadcastHashJoin node being targeted. > h2. *Q3. How is it done today, and what are the limits of current practice?* > Currently this sort of pruning at DataSource level is being done using DPP > (Dynamic Partition Pruning ) and IFF one of the join key column is a > Partitioning column ( so that cost of DPP query is justified and way less > than amount of data it will be filtering by skipping partitions). > The limitation is that DPP type approach is not implemented ( intentionally I > believe), if the join column is a non partition column ( because of cost of > "DPP type" query would most likely be way high as compared to any possible > pruning ( especially if the column is not stored in a sorted manner). > h2. *Q4. What is new in your approach and why do you think it will be > successful?* > 1) This allows pruning on non partition column based joins. > 2) Because it piggy backs on Broadcasted Keys, there is no extra cost of "DPP > type" query. > 3) The Data can be used by DataSource to prune at driver (possibly) and also > at executor level ( as in case of parquet which has max/min at various > structure levels) > 4) The big benefit should be seen in multilevel nested join queries. In the > current code base, if I am correct, only one join's pruning filter would get > pushed at scan level. Since it is on partition key may be that is sufficient. 
> But if it is a nested Join query , and may be involving different columns on > streaming side for join, each such filter push could do significant pruning. > This requires some handling in case of AQE, as the stream side iterator ( & > hence stage evaluation needs to be delayed, till all the available join > filters in the nested tree are pushed at their respective target > BatchScanExec). > h4. *Single Row Filteration* > 5) In case of nested broadcasted joins, if the datasource is column vector > oriented , then what spark would get is a ColumnarBatch. But because scans > have Filters from multiple joins, they can be retrieved and can be applied in > code generated at ColumnToRowExec level, using a new "containsKey" method on > HashedRelation. Thus only those rows which satisfy all the > BroadcastedHashJoins ( whose keys have been pushed) , will be used for join > evaluation. > The code is already there , will be opening a PR. For non partition table > TPCDS run on laptop with TPCDS data size of ( scale factor 4), I am seeing > 15% gain. > For partition table TPCDS, there is improvement in 4 - 5 queries to the tune > of 10% to 37%. > h2. *Q5. Who cares? If you are successful, what difference will it make?* > If use cases involve m
[jira] [Created] (SPARK-44685) Remove deprecated Catalog#createExternalTable
Jia Fan created SPARK-44685: --- Summary: Remove deprecated Catalog#createExternalTable Key: SPARK-44685 URL: https://issues.apache.org/jira/browse/SPARK-44685 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.5.0 Reporter: Jia Fan Catalog#createExternalTable should be removed because it has been deprecated since 2.2.0. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
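For reference, the usual migration path for callers is Catalog#createTable, which has covered the external-table case since the 2.2.0 deprecation; a hedged sketch (table name and path are illustrative):
{code:scala}
// Hedged sketch of the migration; table/path names are illustrative only.
import org.apache.spark.sql.SparkSession

object CreateExternalTableMigration {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("createExternalTable-migration").getOrCreate()

    // Deprecated since 2.2.0:
    // spark.catalog.createExternalTable("events", "/data/events", "parquet")

    // Replacement: providing a path makes the table external (unmanaged).
    spark.catalog.createTable("events", "/data/events", "parquet")
    spark.stop()
  }
}
{code}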
[jira] [Resolved] (SPARK-44433) Implement termination of Python process for foreachBatch & streaming listener
[ https://issues.apache.org/jira/browse/SPARK-44433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takuya Ueshin resolved SPARK-44433. --- Assignee: Wei Liu Resolution: Fixed Issue resolved by pull request 42283 https://github.com/apache/spark/pull/42283 > Implement termination of Python process for foreachBatch & streaming listener > - > > Key: SPARK-44433 > URL: https://issues.apache.org/jira/browse/SPARK-44433 > Project: Spark > Issue Type: Task > Components: Connect, Structured Streaming >Affects Versions: 3.4.1 >Reporter: Raghu Angadi >Assignee: Wei Liu >Priority: Major > Fix For: 3.5.0 > > > In the first implementation of Python support for foreachBatch, the python > process termination is not handled correctly. > > See the long TODO in > [https://github.com/apache/spark/blob/master/connector/connect/server/src/main/scala/org/apache/spark/sql/connect/planner/StreamingForeachBatchHelper.scala] > > about an outline of the feature to terminate the runners by registering > StreamingQueryListners. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44684) Runtime min-max filter
[ https://issues.apache.org/jira/browse/SPARK-44684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] GANHONGNAN updated SPARK-44684: --- Remaining Estimate: 2,016h (was: 0.05h) Original Estimate: 2,016h (was: 0.05h) > Runtime min-max filter > -- > > Key: SPARK-44684 > URL: https://issues.apache.org/jira/browse/SPARK-44684 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.2.1 >Reporter: GANHONGNAN >Priority: Major > Labels: performance > Original Estimate: 2,016h > Remaining Estimate: 2,016h > > We can infer min-max index when building bloom filter and push it down to > datasource. > # Min-max index can skip part of data before loading them into memory. > # building min-max index can be done along with bloom filter building, so > aggregation for bloom filter building can be reused. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-44684) Runtime min-max filter
[ https://issues.apache.org/jira/browse/SPARK-44684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17751250#comment-17751250 ] GANHONGNAN commented on SPARK-44684: [~cloud_fan] could you pls review this Jira and assign it to me? > Runtime min-max filter > -- > > Key: SPARK-44684 > URL: https://issues.apache.org/jira/browse/SPARK-44684 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.2.1 >Reporter: GANHONGNAN >Priority: Major > Labels: performance > Original Estimate: 0.05h > Remaining Estimate: 0.05h > > We can infer min-max index when building bloom filter and push it down to > datasource. > # Min-max index can skip part of data before loading them into memory. > # building min-max index can be done along with bloom filter building, so > aggregation for bloom filter building can be reused. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-44684) Runtime min-max filter
GANHONGNAN created SPARK-44684: -- Summary: Runtime min-max filter Key: SPARK-44684 URL: https://issues.apache.org/jira/browse/SPARK-44684 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 3.2.1 Reporter: GANHONGNAN We can infer min-max index when building bloom filter and push it down to datasource. # Min-max index can skip part of data before loading them into memory. # building min-max index can be done along with bloom filter building, so aggregation for bloom filter building can be reused. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
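A DataFrame-level sketch of the idea only; the real change would sit next to the bloom-filter runtime filter in the optimizer, and the column names here are illustrative:
{code:scala}
// Hedged illustration: the same build-side pass that feeds the bloom-filter
// aggregate can also collect min/max bounds to push to the probe-side scan.
// The optimizer integration point is not shown.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{max, min}

object RuntimeMinMaxSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("runtime-minmax-sketch").getOrCreate()
    val buildSide = spark.range(0L, 1000000L).toDF("join_key")

    // One aggregation pass yields the range bounds alongside the bloom filter.
    val bounds = buildSide.agg(min("join_key").as("lo"), max("join_key").as("hi")).head()
    val (lo, hi) = (bounds.getLong(0), bounds.getLong(1))

    // lo/hi would be turned into join_key >= lo AND join_key <= hi data source
    // filters, letting Parquet/Iceberg readers skip files or row groups.
    println(s"range filter: join_key BETWEEN $lo AND $hi")
    spark.stop()
  }
}
{code}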
[jira] [Updated] (SPARK-44683) Logging level isn't passed to RocksDB state store provider correctly
[ https://issues.apache.org/jira/browse/SPARK-44683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siying Dong updated SPARK-44683: Description: We pass log4j's log level to RocksDB so that RocksDB debug log can go to log4j. However, we pass in the log level after we create the logger, so the way it is set isn't effective. This has two impacts: (1) setting DEBUG level doesn't make RocksDB generate DEBUG level logs; (2) setting WARN or ERROR level does prevent INFO level logging, but RocksDB still makes JNI calls to Scala, which is an unnecessary overhead. (was: We pass log4j's log level to RocksDB so that RocksDB debug log can go to log4j. However, we pass in log level after we create the logger. However, RocksDB only takes log level when a logger is created, so it never changes. This has two impacts: (1) setting DEBUG level don't make RocksDB generate DEBUG level logs; (2) setting WARN or ERROR level does prevent INFO level logging, but RocksDB still makes JNI calls to Scala, which is an unnecessary overhead.) > Logging level isn't passed to RocksDB state store provider correctly > > > Key: SPARK-44683 > URL: https://issues.apache.org/jira/browse/SPARK-44683 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 3.4.1 >Reporter: Siying Dong >Priority: Minor > > We pass log4j's log level to RocksDB so that RocksDB debug log can go to > log4j. However, we pass in the log level after we create the logger, so the > way it is set isn't effective. This has two impacts: (1) setting DEBUG level > doesn't make RocksDB generate DEBUG level logs; (2) setting WARN or ERROR level > does prevent INFO level logging, but RocksDB still makes JNI calls to Scala, > which is an unnecessary overhead. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44005) Improve error messages for regular Python UDTFs that return non-tuple values
[ https://issues.apache.org/jira/browse/SPARK-44005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allison Wang updated SPARK-44005: - Summary: Improve error messages for regular Python UDTFs that return non-tuple values (was: Improve error messages when regular Python UDTFs that return non-tuple values) > Improve error messages for regular Python UDTFs that return non-tuple values > > > Key: SPARK-44005 > URL: https://issues.apache.org/jira/browse/SPARK-44005 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.5.0 >Reporter: Allison Wang >Priority: Major > > Currently, if you have a UDTF like this: > {code:java} > class TestUDTF: > def eval(self, a: int): > yield a {code} > and run the UDTF, it will fail with a confusing error message like > {code:java} > Unexpected tuple 1 with StructType {code} > Note this works when arrow is enabled. We should improve error messages for > regular UDTFs. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44005) Improve error messages when regular Python UDTFs that return non-tuple values
[ https://issues.apache.org/jira/browse/SPARK-44005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allison Wang updated SPARK-44005: - Summary: Improve error messages when regular Python UDTFs that return non-tuple values (was: Support returning non-tuple values for regular Python UDTFs) > Improve error messages when regular Python UDTFs that return non-tuple values > - > > Key: SPARK-44005 > URL: https://issues.apache.org/jira/browse/SPARK-44005 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.5.0 >Reporter: Allison Wang >Priority: Major > > Currently, if you have a UDTF like this: > {code:java} > class TestUDTF: > def eval(self, a: int): > yield a {code} > and run the UDTF, it will fail with a confusing error message like > {code:java} > Unexpected tuple 1 with StructType {code} > Note this works when arrow is enabled. We should improve error messages for > regular UDTFs. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44005) Support returning non-tuple values for regular Python UDTFs
[ https://issues.apache.org/jira/browse/SPARK-44005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allison Wang updated SPARK-44005: - Description: Currently, if you have a UDTF like this: {code:java} class TestUDTF: def eval(self, a: int): yield a {code} and run the UDTF, it will fail with a confusing error message like {code:java} Unexpected tuple 1 with StructType {code} Note this works when arrow is enabled. We should improve error messages for regular UDTFs. was: Currently, if you have a UDTF like this: {code:java} class TestUDTF: def eval(self, a: int): yield a {code} and run the UDTF, it will fail with a confusing error message like {code:java} Unexpected tuple 1 with StructType {code} Note this works when arrow is enabled. We should support this use case for regular UDTFs. > Support returning non-tuple values for regular Python UDTFs > --- > > Key: SPARK-44005 > URL: https://issues.apache.org/jira/browse/SPARK-44005 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.5.0 >Reporter: Allison Wang >Priority: Major > > Currently, if you have a UDTF like this: > {code:java} > class TestUDTF: > def eval(self, a: int): > yield a {code} > and run the UDTF, it will fail with a confusing error message like > {code:java} > Unexpected tuple 1 with StructType {code} > Note this works when arrow is enabled. We should improve error messages for > regular UDTFs. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-44667) Uninstall large ML libraries for non-ML jobs
[ https://issues.apache.org/jira/browse/SPARK-44667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng reassigned SPARK-44667: - Assignee: Ruifeng Zheng > Uninstall large ML libraries for non-ML jobs > > > Key: SPARK-44667 > URL: https://issues.apache.org/jira/browse/SPARK-44667 > Project: Spark > Issue Type: Sub-task > Components: Project Infra >Affects Versions: 4.0.0 >Reporter: Ruifeng Zheng >Assignee: Ruifeng Zheng >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-44667) Uninstall large ML libraries for non-ML jobs
[ https://issues.apache.org/jira/browse/SPARK-44667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng resolved SPARK-44667. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 42334 [https://github.com/apache/spark/pull/42334] > Uninstall large ML libraries for non-ML jobs > > > Key: SPARK-44667 > URL: https://issues.apache.org/jira/browse/SPARK-44667 > Project: Spark > Issue Type: Sub-task > Components: Project Infra >Affects Versions: 4.0.0 >Reporter: Ruifeng Zheng >Assignee: Ruifeng Zheng >Priority: Major > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-44683) Logging level isn't passed to RocksDB state store provider correctly
Siying Dong created SPARK-44683: --- Summary: Logging level isn't passed to RocksDB state store provider correctly Key: SPARK-44683 URL: https://issues.apache.org/jira/browse/SPARK-44683 Project: Spark Issue Type: Bug Components: Structured Streaming Affects Versions: 3.4.1 Reporter: Siying Dong We pass log4j's log level to RocksDB so that RocksDB debug log can go to log4j. However, we pass in the log level after we create the logger, while RocksDB only takes the log level when a logger is created, so it never changes. This has two impacts: (1) setting DEBUG level doesn't make RocksDB generate DEBUG level logs; (2) setting WARN or ERROR level does prevent INFO level logging, but RocksDB still makes JNI calls to Scala, which is an unnecessary overhead. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
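A minimal sketch of the ordering a fix needs, written against the plain RocksDB JNI API rather than Spark's actual state store provider code; the helper name and the println-based forwarding are stand-ins:
{code:scala}
// Hedged sketch: set the level on the options *before* the Logger is constructed,
// so DEBUG takes effect and filtered levels never cross the JNI boundary.
import org.rocksdb.{InfoLogLevel, Logger, Options}

object RocksDbLoggerSketch {
  def optionsWithLogger(level: InfoLogLevel): Options = {
    val options = new Options()
    options.setInfoLogLevel(level) // must happen before the Logger below is created
    val logger = new Logger(options) {
      override protected def log(infoLogLevel: InfoLogLevel, logMsg: String): Unit = {
        // Forward native RocksDB log lines to the JVM-side logging framework (stubbed here).
        println(s"[rocksdb][$infoLogLevel] $logMsg")
      }
    }
    options.setLogger(logger)
    options
  }
}
{code}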
[jira] [Resolved] (SPARK-44663) Disable arrow optimization by default for Python UDTFs
[ https://issues.apache.org/jira/browse/SPARK-44663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takuya Ueshin resolved SPARK-44663. --- Fix Version/s: 3.5.0 Assignee: Allison Wang Resolution: Fixed Issue resolved by pull request 42329 https://github.com/apache/spark/pull/42329 > Disable arrow optimization by default for Python UDTFs > -- > > Key: SPARK-44663 > URL: https://issues.apache.org/jira/browse/SPARK-44663 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.5.0 >Reporter: Allison Wang >Assignee: Allison Wang >Priority: Major > Fix For: 3.5.0 > > > Disable arrow optimization to make Python UDTFs consistent with Python UDFs. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24815) Structured Streaming should support dynamic allocation
[ https://issues.apache.org/jira/browse/SPARK-24815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17751237#comment-17751237 ] Pavan Kotikalapudi commented on SPARK-24815: Here is the draft PR with initial implementation [https://github.com/apache/spark/pull/42352] and the design doc: [https://docs.google.com/document/d/1_YmfCsQQb9XhRdKh0ijbc-j8JKGtGBxYsk_30NVSTWo/edit?usp=sharing |https://docs.google.com/document/d/1_YmfCsQQb9XhRdKh0ijbc-j8JKGtGBxYsk_30NVSTWo/edit?usp=sharing]. Thanks for the review :) > Structured Streaming should support dynamic allocation > -- > > Key: SPARK-24815 > URL: https://issues.apache.org/jira/browse/SPARK-24815 > Project: Spark > Issue Type: Improvement > Components: Scheduler, Spark Core, Structured Streaming >Affects Versions: 2.3.1 >Reporter: Karthik Palaniappan >Priority: Minor > > For batch jobs, dynamic allocation is very useful for adding and removing > containers to match the actual workload. On multi-tenant clusters, it ensures > that a Spark job is taking no more resources than necessary. In cloud > environments, it enables autoscaling. > However, if you set spark.dynamicAllocation.enabled=true and run a structured > streaming job, the batch dynamic allocation algorithm kicks in. It requests > more executors if the task backlog is a certain size, and removes executors > if they idle for a certain period of time. > Quick thoughts: > 1) Dynamic allocation should be pluggable, rather than hardcoded to a > particular implementation in SparkContext.scala (this should be a separate > JIRA). > 2) We should make a structured streaming algorithm that's separate from the > batch algorithm. Eventually, continuous processing might need its own > algorithm. > 3) Spark should print a warning if you run a structured streaming job when > Core's dynamic allocation is enabled -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
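For context, a minimal sketch of the current behaviour the ticket describes: with the existing configs below, a structured streaming query simply runs under the batch dynamic-allocation heuristics (the proposed streaming-aware policy from the PR is not shown here):
{code:scala}
// Hedged illustration only: executors scale on task backlog/idleness, not on
// streaming batch durations, when today's batch-oriented DRA is enabled.
import org.apache.spark.sql.SparkSession

object StreamingDraIllustration {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("streaming-with-batch-dra")
      .config("spark.dynamicAllocation.enabled", "true")
      .config("spark.dynamicAllocation.shuffleTracking.enabled", "true")
      .getOrCreate()

    val counts = spark.readStream.format("rate").load()
      .groupBy("value").count()

    val query = counts.writeStream
      .format("console")
      .outputMode("complete")
      .start()
    query.awaitTermination()
  }
}
{code}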
[jira] [Updated] (SPARK-43797) Python User-defined Table Functions
[ https://issues.apache.org/jira/browse/SPARK-43797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allison Wang updated SPARK-43797: - Affects Version/s: 4.0.0 > Python User-defined Table Functions > --- > > Key: SPARK-43797 > URL: https://issues.apache.org/jira/browse/SPARK-43797 > Project: Spark > Issue Type: Umbrella > Components: PySpark >Affects Versions: 3.5.0, 4.0.0 >Reporter: Allison Wang >Priority: Major > > This is an umbrella ticket to support Python user-defined table functions. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-44678) Downgrade Hadoop to 3.3.4
[ https://issues.apache.org/jira/browse/SPARK-44678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-44678. --- Fix Version/s: 3.5.0 Resolution: Fixed Issue resolved by pull request 42345 [https://github.com/apache/spark/pull/42345] > Downgrade Hadoop to 3.3.4 > - > > Key: SPARK-44678 > URL: https://issues.apache.org/jira/browse/SPARK-44678 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 3.5.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Critical > Fix For: 3.5.0 > > > There is a community report on S3A committer performance regression. Although > it's one liner fix, there is no available Hadoop release with that fix at > this time. > HADOOP-18757: Bump corePoolSize of HadoopThreadPoolExecutor in s3a committer -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-44678) Downgrade Hadoop to 3.3.4
[ https://issues.apache.org/jira/browse/SPARK-44678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-44678: - Assignee: Dongjoon Hyun > Downgrade Hadoop to 3.3.4 > - > > Key: SPARK-44678 > URL: https://issues.apache.org/jira/browse/SPARK-44678 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 3.5.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Critical > > There is a community report on S3A committer performance regression. Although > it's one liner fix, there is no available Hadoop release with that fix at > this time. > HADOOP-18757: Bump corePoolSize of HadoopThreadPoolExecutor in s3a committer -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-44644) Improve error messages for creating Python UDTFs with pickling errors
[ https://issues.apache.org/jira/browse/SPARK-44644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takuya Ueshin resolved SPARK-44644. --- Fix Version/s: 4.0.0 Target Version/s: 3.5.0 Assignee: Allison Wang Resolution: Fixed Issue resolved by pull request 42309 https://github.com/apache/spark/pull/42309 > Improve error messages for creating Python UDTFs with pickling errors > - > > Key: SPARK-44644 > URL: https://issues.apache.org/jira/browse/SPARK-44644 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.5.0 >Reporter: Allison Wang >Assignee: Allison Wang >Priority: Major > Fix For: 4.0.0 > > > Currently, when users create a Python UDTF with a non-pickleable object, it > throws this error: > _pickle.PicklingError: Cannot pickle files that are not opened for reading: w > > We should make this more user-friendly -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-44258) Move Metadata to sql/api
[ https://issues.apache.org/jira/browse/SPARK-44258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rui Wang resolved SPARK-44258. -- Resolution: Fixed > Move Metadata to sql/api > > > Key: SPARK-44258 > URL: https://issues.apache.org/jira/browse/SPARK-44258 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.5.0 >Reporter: Rui Wang >Assignee: Rui Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-42887) Simple DataType interface
[ https://issues.apache.org/jira/browse/SPARK-42887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rui Wang resolved SPARK-42887. -- Resolution: Fixed > Simple DataType interface > - > > Key: SPARK-42887 > URL: https://issues.apache.org/jira/browse/SPARK-42887 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 3.5.0 >Reporter: Rui Wang >Assignee: Rui Wang >Priority: Major > > This JIRA proposes to move non public API from existing DataType class to > make DataType become a simple interface. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-44682) Make pandas error class message_parameters strings
Amanda Liu created SPARK-44682: -- Summary: Make pandas error class message_parameters strings Key: SPARK-44682 URL: https://issues.apache.org/jira/browse/SPARK-44682 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 3.5.0 Reporter: Amanda Liu -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-44681) Solve issue referencing github.com/apache/spark-connect-go as Go library
BoYang created SPARK-44681: -- Summary: Solve issue referencing github.com/apache/spark-connect-go as Go library Key: SPARK-44681 URL: https://issues.apache.org/jira/browse/SPARK-44681 Project: Spark Issue Type: Sub-task Components: Connect Contrib Affects Versions: 3.5.0 Reporter: BoYang -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-43351) Support Golang in Spark Connect
[ https://issues.apache.org/jira/browse/SPARK-43351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17751186#comment-17751186 ] BoYang commented on SPARK-43351: Thanks! We can keep it as 3.5.0 now. > Support Golang in Spark Connect > --- > > Key: SPARK-43351 > URL: https://issues.apache.org/jira/browse/SPARK-43351 > Project: Spark > Issue Type: New Feature > Components: Connect >Affects Versions: 3.5.0 >Reporter: BoYang >Assignee: BoYang >Priority: Major > Fix For: 3.5.0 > > > Support Spark Connect client side in Go programming language -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-44648) Set up memory limits for analyze in Python.
[ https://issues.apache.org/jira/browse/SPARK-44648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takuya Ueshin resolved SPARK-44648. --- Assignee: Takuya Ueshin Resolution: Fixed Issue resolved by pull request 42328 https://github.com/apache/spark/pull/42328 > Set up memory limits for analyze in Python. > --- > > Key: SPARK-44648 > URL: https://issues.apache.org/jira/browse/SPARK-44648 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 4.0.0 >Reporter: Takuya Ueshin >Assignee: Takuya Ueshin >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44679) java.lang.OutOfMemoryError: Requested array size exceeds VM limit
[ https://issues.apache.org/jira/browse/SPARK-44679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haitham Eltaweel updated SPARK-44679: - Attachment: code_sample.txt > java.lang.OutOfMemoryError: Requested array size exceeds VM limit > - > > Key: SPARK-44679 > URL: https://issues.apache.org/jira/browse/SPARK-44679 > Project: Spark > Issue Type: Bug > Components: EC2, PySpark >Affects Versions: 3.2.1 > Environment: We use Amazon EMR to run Pyspark jobs. > Amazon EMR version : emr-6.7.0 > Installed applications : > Tez 0.9.2, Spark 3.2.1, Hive 3.1.3, Sqoop 1.4.7, Hadoop 3.2.1, Zookeeper > 3.5.7, HCatalog 3.1.3, Livy 0.7.1 >Reporter: Haitham Eltaweel >Priority: Major > Attachments: code_sample.txt > > > We get the following error from our Pyspark application in Production env: > _java.lang.OutOfMemoryError: Requested array size exceeds VM limit_ > I simplified the code we used and shared it below so you can easily > investigate the issue. > We use Pyspark to read 900 MB text file which has one record. We use foreach > function to iterate over the Datafreme and apply some high order function. > The error occurs once foreach action is triggered. I think the issue is > related to the integer data type of the bytes array used to hold the > serialized dataframe. Since the dataframe record was too big, it seems the > serialized record exceeded the max integer value, hence the error occurred. > Note that the same error happens when using foreachBatch function with > writeStream. > Our prod data has many records larger than 100 MB. Appreciate your help to > provide a fix or a solution to that issue. > > *Find below the code snippet:* > from pyspark.sql import SparkSession,functions as f > > def check_file_name(row): > print("check_file_name called") > > def main(): > spark=SparkSession.builder.enableHiveSupport().getOrCreate() > inputPath = "s3://bucket-name/common/source/" > inputDF = spark.read.text(inputPath, wholetext=True) > inputDF = inputDF.select(f.date_format(f.current_timestamp(), > 'MMddHH').astype('string').alias('insert_hr'), > f.col("value").alias("raw_data"), > f.input_file_name().alias("input_file_name")) > inputDF.foreach(check_file_name) > > if __name__ == "__main__": > main() > *Find below spark-submit command used:* > spark-submit --master yarn --conf > spark.serializer=org.apache.spark.serializer.KryoSerializer --num-executors > 15 --executor-cores 4 --executor-memory 20g --driver-memory 20g --name > haitham_job --deploy-mode cluster big_file_process.py -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-44680) parameter markers are not blocked from DEFAULT (and other places)
Serge Rielau created SPARK-44680: Summary: parameter markers are not blocked from DEFAULT (and other places) Key: SPARK-44680 URL: https://issues.apache.org/jira/browse/SPARK-44680 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 3.4.0 Reporter: Serge Rielau scala> spark.sql("CREATE TABLE t11(c1 int default :parm)", args = Map("parm" -> 5)).show() -> success scala> spark.sql("describe t11"); [INVALID_DEFAULT_VALUE.UNRESOLVED_EXPRESSION] Failed to execute EXISTS_DEFAULT command because the destination table column `c1` has a DEFAULT value :parm, which fails to resolve as a valid expression. This likely extends to other DDL-y places. I can only find protection against placement in the body of a CREATE VIEW. I see two ways out of this: * Raise an error (as we do for CREATE VIEW v1(c1) AS SELECT ? ) * Improve the way we persist queries/expressions to substitute the at-DDL-time bound parameter value (it's not a bug, it's a feature) -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44678) Downgrade Hadoop to 3.3.4
[ https://issues.apache.org/jira/browse/SPARK-44678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-44678: -- Description: There is a community report on S3A committer performance regression. Although it's one liner fix, there is no available Hadoop release with that fix at this time. HADOOP-18757: Bump corePoolSize of HadoopThreadPoolExecutor in s3a committer > Downgrade Hadoop to 3.3.4 > - > > Key: SPARK-44678 > URL: https://issues.apache.org/jira/browse/SPARK-44678 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 3.5.0 >Reporter: Dongjoon Hyun >Priority: Critical > > There is a community report on S3A committer performance regression. Although > it's one liner fix, there is no available Hadoop release with that fix at > this time. > HADOOP-18757: Bump corePoolSize of HadoopThreadPoolExecutor in s3a committer -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44679) java.lang.OutOfMemoryError: Requested array size exceeds VM limit
[ https://issues.apache.org/jira/browse/SPARK-44679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haitham Eltaweel updated SPARK-44679: - Language: (was: Python) > java.lang.OutOfMemoryError: Requested array size exceeds VM limit > - > > Key: SPARK-44679 > URL: https://issues.apache.org/jira/browse/SPARK-44679 > Project: Spark > Issue Type: Bug > Components: EC2, PySpark >Affects Versions: 3.2.1 > Environment: We use Amazon EMR to run Pyspark jobs. > Amazon EMR version : emr-6.7.0 > Installed applications : > Tez 0.9.2, Spark 3.2.1, Hive 3.1.3, Sqoop 1.4.7, Hadoop 3.2.1, Zookeeper > 3.5.7, HCatalog 3.1.3, Livy 0.7.1 >Reporter: Haitham Eltaweel >Priority: Major > > We get the following error from our Pyspark application in Production env: > _java.lang.OutOfMemoryError: Requested array size exceeds VM limit_ > I simplified the code we used and shared it below so you can easily > investigate the issue. > We use Pyspark to read 900 MB text file which has one record. We use foreach > function to iterate over the Datafreme and apply some high order function. > The error occurs once foreach action is triggered. I think the issue is > related to the integer data type of the bytes array used to hold the > serialized dataframe. Since the dataframe record was too big, it seems the > serialized record exceeded the max integer value, hence the error occurred. > Note that the same error happens when using foreachBatch function with > writeStream. > Our prod data has many records larger than 100 MB. Appreciate your help to > provide a fix or a solution to that issue. > > *Find below the code snippet:* > from pyspark.sql import SparkSession,functions as f > > def check_file_name(row): > print("check_file_name called") > > def main(): > spark=SparkSession.builder.enableHiveSupport().getOrCreate() > inputPath = "s3://bucket-name/common/source/" > inputDF = spark.read.text(inputPath, wholetext=True) > inputDF = inputDF.select(f.date_format(f.current_timestamp(), > 'MMddHH').astype('string').alias('insert_hr'), > f.col("value").alias("raw_data"), > f.input_file_name().alias("input_file_name")) > inputDF.foreach(check_file_name) > > if __name__ == "__main__": > main() > *Find below spark-submit command used:* > spark-submit --master yarn --conf > spark.serializer=org.apache.spark.serializer.KryoSerializer --num-executors > 15 --executor-cores 4 --executor-memory 20g --driver-memory 20g --name > haitham_job --deploy-mode cluster big_file_process.py -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-44679) java.lang.OutOfMemoryError: Requested array size exceeds VM limit
Haitham Eltaweel created SPARK-44679: Summary: java.lang.OutOfMemoryError: Requested array size exceeds VM limit Key: SPARK-44679 URL: https://issues.apache.org/jira/browse/SPARK-44679 Project: Spark Issue Type: Bug Components: EC2, PySpark Affects Versions: 3.2.1 Environment: We use Amazon EMR to run Pyspark jobs. Amazon EMR version : emr-6.7.0 Installed applications : Tez 0.9.2, Spark 3.2.1, Hive 3.1.3, Sqoop 1.4.7, Hadoop 3.2.1, Zookeeper 3.5.7, HCatalog 3.1.3, Livy 0.7.1 Reporter: Haitham Eltaweel We get the following error from our Pyspark application in Production env: _java.lang.OutOfMemoryError: Requested array size exceeds VM limit_ I simplified the code we used and shared it below so you can easily investigate the issue. We use Pyspark to read a 900 MB text file which has one record. We use the foreach function to iterate over the DataFrame and apply a higher-order function. The error occurs once the foreach action is triggered. I think the issue is related to the integer data type of the byte array used to hold the serialized dataframe. Since the dataframe record was too big, it seems the serialized record exceeded the max integer value, hence the error occurred. Note that the same error happens when using the foreachBatch function with writeStream. Our prod data has many records larger than 100 MB. Appreciate your help to provide a fix or a solution to that issue.
*Find below the code snippet:*
from pyspark.sql import SparkSession, functions as f

def check_file_name(row):
    print("check_file_name called")

def main():
    spark = SparkSession.builder.enableHiveSupport().getOrCreate()
    inputPath = "s3://bucket-name/common/source/"
    inputDF = spark.read.text(inputPath, wholetext=True)
    inputDF = inputDF.select(f.date_format(f.current_timestamp(), 'MMddHH').astype('string').alias('insert_hr'),
                             f.col("value").alias("raw_data"),
                             f.input_file_name().alias("input_file_name"))
    inputDF.foreach(check_file_name)

if __name__ == "__main__":
    main()
*Find below spark-submit command used:* spark-submit --master yarn --conf spark.serializer=org.apache.spark.serializer.KryoSerializer --num-executors 15 --executor-cores 4 --executor-memory 20g --driver-memory 20g --name haitham_job --deploy-mode cluster big_file_process.py -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-44678) Downgrade Hadoop to 3.3.4
Dongjoon Hyun created SPARK-44678: - Summary: Downgrade Hadoop to 3.3.4 Key: SPARK-44678 URL: https://issues.apache.org/jira/browse/SPARK-44678 Project: Spark Issue Type: Bug Components: Build Affects Versions: 3.5.0 Reporter: Dongjoon Hyun -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-44671) Retry ExecutePlan in case initial request didn't reach server in Python client
[ https://issues.apache.org/jira/browse/SPARK-44671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-44671: Assignee: Hyukjin Kwon > Retry ExecutePlan in case initial request didn't reach server in Python client > -- > > Key: SPARK-44671 > URL: https://issues.apache.org/jira/browse/SPARK-44671 > Project: Spark > Issue Type: Task > Components: Connect, PySpark >Affects Versions: 3.5.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > > SPARK-44624 for Python -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-44671) Retry ExecutePlan in case initial request didn't reach server in Python client
[ https://issues.apache.org/jira/browse/SPARK-44671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-44671. -- Fix Version/s: 3.5.0 4.0.0 Resolution: Fixed Issue resolved by pull request 42338 [https://github.com/apache/spark/pull/42338] > Retry ExecutePlan in case initial request didn't reach server in Python client > -- > > Key: SPARK-44671 > URL: https://issues.apache.org/jira/browse/SPARK-44671 > Project: Spark > Issue Type: Task > Components: Connect, PySpark >Affects Versions: 3.5.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 3.5.0, 4.0.0 > > > SPARK-44624 for Python -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-44677) Drop legacy Hive-based ORC file format
[ https://issues.apache.org/jira/browse/SPARK-44677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17751148#comment-17751148 ] Dongjoon Hyun commented on SPARK-44677: --- +1. Thank you, [~chengpan]. > Drop legacy Hive-based ORC file format > -- > > Key: SPARK-44677 > URL: https://issues.apache.org/jira/browse/SPARK-44677 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Cheng Pan >Priority: Major > > Currently, Spark allows to use spark.sql.orc.impl=native/hive to switch the > ORC FileFormat implementation. > SPARK-23456(2.4) switched the default value of spark.sql.orc.impl from "hive" > to "native". and prepared to drop the "hive" implementation in the future. > > ... eventually, Apache Spark will drop old Hive-based ORC code. > The native implementation works well during the whole Spark 3.x period, so > it's a good time to consider dropping the "hive" one in Spark 4.0. > Also, we should take care about the backward-compatibility during change. > > BTW, IIRC, there was a different at Hive ORC CHAR implementation before. > > So, we couldn't remove it for backward-compatibility issues. Since Spark > > implements many CHAR features, we need to re-verify that {{native}} > > implementation has all legacy Hive-based ORC features -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-44677) Drop legacy Hive-based ORC file format
Cheng Pan created SPARK-44677: - Summary: Drop legacy Hive-based ORC file format Key: SPARK-44677 URL: https://issues.apache.org/jira/browse/SPARK-44677 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 4.0.0 Reporter: Cheng Pan Currently, Spark allows using spark.sql.orc.impl=native/hive to switch the ORC FileFormat implementation. SPARK-23456 (2.4) switched the default value of spark.sql.orc.impl from "hive" to "native" and prepared to drop the "hive" implementation in the future. > ... eventually, Apache Spark will drop old Hive-based ORC code. The native implementation has worked well throughout the Spark 3.x period, so it's a good time to consider dropping the "hive" one in Spark 4.0. We should also take care of backward compatibility during the change. > BTW, IIRC, there was a difference in the Hive ORC CHAR implementation before. So, > we couldn't remove it for backward-compatibility reasons. Since Spark > implements many CHAR features, we need to re-verify that the {{native}} > implementation has all legacy Hive-based ORC features. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
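For reference, a minimal sketch of switching between the two ORC implementations that SPARK-44677 discusses. The session setup, object name, and /tmp path below are illustrative only and are not taken from the ticket; spark.sql.orc.impl and its "native"/"hive" values are the configuration named above.
{code:scala}
import org.apache.spark.sql.SparkSession

object OrcImplSwitch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").getOrCreate()

    // "native": the default since SPARK-23456 (Spark 2.4), implemented in Spark itself.
    spark.conf.set("spark.sql.orc.impl", "native")
    spark.range(10).write.mode("overwrite").orc("/tmp/orc-impl-demo")

    // "hive": the legacy Hive-based reader/writer that SPARK-44677 proposes to drop.
    spark.conf.set("spark.sql.orc.impl", "hive")
    spark.read.orc("/tmp/orc-impl-demo").show()

    spark.stop()
  }
}
{code}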
[jira] [Assigned] (SPARK-44672) Fix git ignore rules related to Antlr
[ https://issues.apache.org/jira/browse/SPARK-44672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-44672: - Assignee: Yang Jie > Fix git ignore rules related to Antlr > - > > Key: SPARK-44672 > URL: https://issues.apache.org/jira/browse/SPARK-44672 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Affects Versions: 3.5.0, 4.0.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Minor > > Running `git status` after SPARK-44475 was merged shows the following untracked directories: > > {code:java} > sql/api/gen/ > sql/api/src/main/antlr4/org/apache/spark/sql/catalyst/parser/gen/ {code} > The .gitignore rules should be updated so that these paths are ignored. > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-44672) Fix git ignore rules related to Antlr
[ https://issues.apache.org/jira/browse/SPARK-44672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-44672. --- Fix Version/s: 3.5.0 4.0.0 Resolution: Fixed Issue resolved by pull request 42342 [https://github.com/apache/spark/pull/42342] > Fix git ignore rules related to Antlr > - > > Key: SPARK-44672 > URL: https://issues.apache.org/jira/browse/SPARK-44672 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Affects Versions: 3.5.0, 4.0.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Minor > Fix For: 3.5.0, 4.0.0 > > > Running `git status` after SPARK-44475 was merged shows the following untracked directories: > > {code:java} > sql/api/gen/ > sql/api/src/main/antlr4/org/apache/spark/sql/catalyst/parser/gen/ {code} > The .gitignore rules should be updated so that these paths are ignored. > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44634) Encoders.bean does no longer support nested beans with type arguments
[ https://issues.apache.org/jira/browse/SPARK-44634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Giambattista Bloisi updated SPARK-44634: Description: Hi, while upgrading a project from spark 2.4.0 to 3.4.1 version, I have encountered the same problem described in [java - Encoders.bean attempts to check the validity of a return type considering its generic type and not its concrete class, with Spark 3.4.0 - Stack Overflow|https://stackoverflow.com/questions/76045255/encoders-bean-attempts-to-check-the-validity-of-a-return-type-considering-its-ge]. Put it short, starting from Spark 3.4.x Encoders.bean throws an exception when the passed class contains a field whose type is a nested bean with type arguments: {code:java} class A { T value; // value getter and setter } class B { A stringHolder; // stringHolder getter and setter } Encoders.bean(B.class); // throws "SparkUnsupportedOperationException: [ENCODER_NOT_FOUND]..."{code} It looks like this is a regression introduced with [SPARK-42093 SQL Move JavaTypeInference to AgnosticEncoders|https://github.com/apache/spark/commit/18672003513d5a4aa610b6b94dbbc15c33185d3#diff-1191737b908340a2f4c22b71b1c40ebaa0da9d8b40c958089c346a3bda26943b] while getting rid of TypeToken, that somehow managed that case. was: Hi, while upgrading a project from spark 2.4.0 to 3.4.1 version, I have encountered the same problem described in [java - Encoders.bean attempts to check the validity of a return type considering its generic type and not its concrete class, with Spark 3.4.0 - Stack Overflow|https://stackoverflow.com/questions/76045255/encoders-bean-attempts-to-check-the-validity-of-a-return-type-considering-its-ge]. Put it short, starting from Spark 3.4.x Encoders.bean throws an exception when the passed class contains a field whose type is a nested bean with type arguments: {code:java} class A { T value; // value getter and setter } class B { A stringHolder; // stringHolder getter and setter } Encoders.bean(B.class); // throws "SparkUnsupportedOperationException: [ENCODER_NOT_FOUND]..."{code} It looks like this is a regression introduced with [SPARK-42093 SQL Move JavaTypeInference to AgnosticEncoders|https://github.com/apache/spark/commit/18672003513d5a4aa610b6b94dbbc15c33185d3#diff-1191737b908340a2f4c22b71b1c40ebaa0da9d8b40c958089c346a3bda26943b] while getting rid of TypeToken, that somehow managed that case. I'm going to submit a PR to re-enable this functionality. > Encoders.bean does no longer support nested beans with type arguments > - > > Key: SPARK-44634 > URL: https://issues.apache.org/jira/browse/SPARK-44634 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.1, 3.5.0, 4.0.0 >Reporter: Giambattista Bloisi >Priority: Major > > Hi, > while upgrading a project from spark 2.4.0 to 3.4.1 version, I have > encountered the same problem described in [java - Encoders.bean attempts to > check the validity of a return type considering its generic type and not its > concrete class, with Spark 3.4.0 - Stack > Overflow|https://stackoverflow.com/questions/76045255/encoders-bean-attempts-to-check-the-validity-of-a-return-type-considering-its-ge]. 
> Put it short, starting from Spark 3.4.x Encoders.bean throws an exception > when the passed class contains a field whose type is a nested bean with type > arguments: > > {code:java} > class A { >T value; >// value getter and setter > } > class B { >A stringHolder; >// stringHolder getter and setter > } > Encoders.bean(B.class); // throws "SparkUnsupportedOperationException: > [ENCODER_NOT_FOUND]..."{code} > > > It looks like this is a regression introduced with [SPARK-42093 SQL Move > JavaTypeInference to > AgnosticEncoders|https://github.com/apache/spark/commit/18672003513d5a4aa610b6b94dbbc15c33185d3#diff-1191737b908340a2f4c22b71b1c40ebaa0da9d8b40c958089c346a3bda26943b] > while getting rid of TypeToken, that somehow managed that case. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
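The type parameters in the code quoted in SPARK-44634 above appear to have been lost in formatting; the reported shape is presumably a generic holder class A<T> with an A<String> field inside B. The sketch below is a Scala approximation of that reproduction using @BeanProperty to generate the bean accessors; the original report uses plain Java beans, and the class and object names here are illustrative.
{code:scala}
import scala.beans.BeanProperty
import org.apache.spark.sql.Encoders

// A generic holder bean, nested inside another bean with a concrete type argument.
class A[T] {
  @BeanProperty var value: T = _
}

class B {
  @BeanProperty var stringHolder: A[String] = _
}

object EncodersBeanRepro {
  def main(args: Array[String]): Unit = {
    // Per the report, this worked on Spark 2.4.x but fails on 3.4.x with
    // SparkUnsupportedOperationException: [ENCODER_NOT_FOUND].
    val encoder = Encoders.bean(classOf[B])
    println(encoder.schema)
  }
}
{code}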
[jira] [Resolved] (SPARK-44674) Remove `BytecodeUtils` from `graphx` module
[ https://issues.apache.org/jira/browse/SPARK-44674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-44674. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 42343 [https://github.com/apache/spark/pull/42343] > Remove `BytecodeUtils` from `graphx` module > --- > > Key: SPARK-44674 > URL: https://issues.apache.org/jira/browse/SPARK-44674 > Project: Spark > Issue Type: Improvement > Components: GraphX >Affects Versions: 4.0.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Minor > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-44674) Remove `BytecodeUtils` from `graphx` module
[ https://issues.apache.org/jira/browse/SPARK-44674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-44674: - Assignee: Yang Jie > Remove `BytecodeUtils` from `graphx` module > --- > > Key: SPARK-44674 > URL: https://issues.apache.org/jira/browse/SPARK-44674 > Project: Spark > Issue Type: Improvement > Components: GraphX >Affects Versions: 4.0.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-44676) Ensure Spark Connect Scala client CloseableIterator is closed in all cases where an exception could be thrown
Juliusz Sompolski created SPARK-44676: - Summary: Ensure Spark Connect Scala client CloseableIterator is closed in all cases where an exception could be thrown Key: SPARK-44676 URL: https://issues.apache.org/jira/browse/SPARK-44676 Project: Spark Issue Type: Improvement Components: Connect Affects Versions: 4.0.0 Reporter: Juliusz Sompolski We already ensure that the CloseableIterator is consumed in all places, and in ExecutePlanResponseReattachableIterator we ensure that the iterator is closed in case of a GRPC error. We should also ensure that every place in the client that uses a CloseableIterator closes it gracefully, even when another exception is thrown, including InterruptedException. Some try { } finally { iterator.close() } blocks may be needed for that. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
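A minimal sketch of the try/finally pattern SPARK-44676 suggests. The CloseableIterator trait below is a hypothetical stand-in for the client's internal type, and consumeAndClose is an illustrative helper, not an existing API.
{code:scala}
// Hypothetical stand-in for the Spark Connect Scala client's internal CloseableIterator.
trait CloseableIterator[E] extends Iterator[E] {
  def close(): Unit
}

object CloseOnAnyException {
  // Consume the iterator and guarantee close() even if hasNext/next or the handler
  // throws (including InterruptedException), as the ticket proposes.
  def consumeAndClose[E](iter: CloseableIterator[E])(handle: E => Unit): Unit = {
    try {
      while (iter.hasNext) {
        handle(iter.next())
      }
    } finally {
      iter.close()
    }
  }
}
{code}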
[jira] [Assigned] (SPARK-44656) Close dangling iterators in SparkResult too (Spark Connect Scala)
[ https://issues.apache.org/jira/browse/SPARK-44656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hövell reassigned SPARK-44656: - Assignee: Alice Sayutina > Close dangling iterators in SparkResult too (Spark Connect Scala) > - > > Key: SPARK-44656 > URL: https://issues.apache.org/jira/browse/SPARK-44656 > Project: Spark > Issue Type: Improvement > Components: Connect >Affects Versions: 3.5.0 >Reporter: Alice Sayutina >Assignee: Alice Sayutina >Priority: Major > > SPARK-44636 followup. We didn't address iterators grabbed in SparkResult > there. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-44656) Close dangling iterators in SparkResult too (Spark Connect Scala)
[ https://issues.apache.org/jira/browse/SPARK-44656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hövell resolved SPARK-44656. --- Fix Version/s: 3.5.0 Resolution: Fixed > Close dangling iterators in SparkResult too (Spark Connect Scala) > - > > Key: SPARK-44656 > URL: https://issues.apache.org/jira/browse/SPARK-44656 > Project: Spark > Issue Type: Improvement > Components: Connect >Affects Versions: 3.5.0 >Reporter: Alice Sayutina >Assignee: Alice Sayutina >Priority: Major > Fix For: 3.5.0 > > > SPARK-44636 followup. We didn't address iterators grabbed in SparkResult > there. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-44675) Increase ReservedCodeCacheSize for release build
[ https://issues.apache.org/jira/browse/SPARK-44675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang reassigned SPARK-44675: --- Assignee: Yuming Wang > Increase ReservedCodeCacheSize for release build > > > Key: SPARK-44675 > URL: https://issues.apache.org/jira/browse/SPARK-44675 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Affects Versions: 4.0.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-44675) Increase ReservedCodeCacheSize for release build
[ https://issues.apache.org/jira/browse/SPARK-44675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang resolved SPARK-44675. - Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 42344 [https://github.com/apache/spark/pull/42344] > Increase ReservedCodeCacheSize for release build > > > Key: SPARK-44675 > URL: https://issues.apache.org/jira/browse/SPARK-44675 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Affects Versions: 4.0.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-44675) Increase ReservedCodeCacheSize for release build
Yuming Wang created SPARK-44675: --- Summary: Increase ReservedCodeCacheSize for release build Key: SPARK-44675 URL: https://issues.apache.org/jira/browse/SPARK-44675 Project: Spark Issue Type: Improvement Components: Project Infra Affects Versions: 4.0.0 Reporter: Yuming Wang -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-44674) Remove `BytecodeUtils` from `graphx` module
Yang Jie created SPARK-44674: Summary: Remove `BytecodeUtils` from `graphx` module Key: SPARK-44674 URL: https://issues.apache.org/jira/browse/SPARK-44674 Project: Spark Issue Type: Improvement Components: GraphX Affects Versions: 4.0.0 Reporter: Yang Jie -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-44673) Raise an ArrowIOError with empty messages
Deng An created SPARK-44673: --- Summary: Raise an ArrowIOError with empty messages Key: SPARK-44673 URL: https://issues.apache.org/jira/browse/SPARK-44673 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 2.4.4 Reporter: Deng An We encountered a problem while using PySpark 2.4.4, where the Python Runner of a certain task threw a pyarrow.lib.ArrowIOError without any message, which is very confusing. The task was scheduled multiple times, and all of its attempts failed with the same pyarrow.lib.ArrowIOError. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-44600) Make `repl` module daily test pass
[ https://issues.apache.org/jira/browse/SPARK-44600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Jie resolved SPARK-44600. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 42291 [https://github.com/apache/spark/pull/42291] > Make `repl` module daily test pass > -- > > Key: SPARK-44600 > URL: https://issues.apache.org/jira/browse/SPARK-44600 > Project: Spark > Issue Type: Sub-task > Components: Tests >Affects Versions: 4.0.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Major > Fix For: 4.0.0 > > > [https://github.com/apache/spark/actions/runs/5727123477/job/15518895421] > > {code:java} > - SPARK-15236: use Hive catalog *** FAILED *** > 18137 isContain was true Interpreter output contained 'Exception': > 18138 Welcome to > 18139 __ > 18140 / __/__ ___ _/ /__ > 18141 _\ \/ _ \/ _ `/ __/ '_/ > 18142 /___/ .__/\_,_/_/ /_/\_\ version 4.0.0-SNAPSHOT > 18143/_/ > 18144 > 18145 Using Scala version 2.12.18 (OpenJDK 64-Bit Server VM, Java 1.8.0_372) > 18146 Type in expressions to have them evaluated. > 18147 Type :help for more information. > 18148 > 18149 scala> > 18150 scala> java.lang.NoClassDefFoundError: > org/sparkproject/guava/cache/CacheBuilder > 18151at > org.apache.spark.sql.catalyst.catalog.SessionCatalog.(SessionCatalog.scala:197) > 18152at > org.apache.spark.sql.internal.BaseSessionStateBuilder.catalog$lzycompute(BaseSessionStateBuilder.scala:153) > 18153at > org.apache.spark.sql.internal.BaseSessionStateBuilder.catalog(BaseSessionStateBuilder.scala:152) > 18154at > org.apache.spark.sql.internal.BaseSessionStateBuilder.v2SessionCatalog$lzycompute(BaseSessionStateBuilder.scala:166) > 18155at > org.apache.spark.sql.internal.BaseSessionStateBuilder.v2SessionCatalog(BaseSessionStateBuilder.scala:166) > 18156at > org.apache.spark.sql.internal.BaseSessionStateBuilder.catalogManager$lzycompute(BaseSessionStateBuilder.scala:168) > 18157at > org.apache.spark.sql.internal.BaseSessionStateBuilder.catalogManager(BaseSessionStateBuilder.scala:168) > 18158at > org.apache.spark.sql.internal.BaseSessionStateBuilder$$anon$1.(BaseSessionStateBuilder.scala:185) > 18159at > org.apache.spark.sql.internal.BaseSessionStateBuilder.analyzer(BaseSessionStateBuilder.scala:185) > 18160at > org.apache.spark.sql.internal.BaseSessionStateBuilder.$anonfun$build$2(BaseSessionStateBuilder.scala:374) > 18161at > org.apache.spark.sql.internal.SessionState.analyzer$lzycompute(SessionState.scala:92) > 18162at > org.apache.spark.sql.internal.SessionState.analyzer(SessionState.scala:92) > 18163at > org.apache.spark.sql.execution.QueryExecution.$anonfun$analyzed$1(QueryExecution.scala:77) > 18164at > org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:138) > 18165at > org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$2(QueryExecution.scala:219) > 18166at > org.apache.spark.sql.execution.QueryExecution$.withInternalError(QueryExecution.scala:546) > 18167at > org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$1(QueryExecution.scala:219) > 18168at > org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:900) > 18169at > org.apache.spark.sql.execution.QueryExecution.executePhase(QueryExecution.scala:218) > 18170at > org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:77) > 18171at > org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:74) > 18172at > org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:66) > 
18173at org.apache.spark.sql.Dataset$.$anonfun$ofRows$2(Dataset.scala:100) > 18174at > org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:900) > 18175at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:98) > 18176at > org.apache.spark.sql.SparkSession.$anonfun$sql$4(SparkSession.scala:691) > 18177at > org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:900) > 18178at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:682) > 18179at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:713) > 18180at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:744) > 18181... 100 elided > 18182 Caused by: java.lang.ClassNotFoundException: > org.sparkproject.guava.cache.CacheBuilder > 18183at java.net.URLClassLoader.findClass(URLClassLoader.java:387) > 18184at java.lang.ClassLoader.loadClass(ClassLoader.java:418) > 18185at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352) > 18186at java.lang.ClassLoader.loadClass(ClassLoader.java:351) > 18187
[jira] [Assigned] (SPARK-44600) Make `repl` module daily test pass
[ https://issues.apache.org/jira/browse/SPARK-44600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Jie reassigned SPARK-44600: Assignee: Yang Jie > Make `repl` module daily test pass > -- > > Key: SPARK-44600 > URL: https://issues.apache.org/jira/browse/SPARK-44600 > Project: Spark > Issue Type: Sub-task > Components: Tests >Affects Versions: 4.0.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Major > > [https://github.com/apache/spark/actions/runs/5727123477/job/15518895421] > > {code:java} > - SPARK-15236: use Hive catalog *** FAILED *** > 18137 isContain was true Interpreter output contained 'Exception': > 18138 Welcome to > 18139 __ > 18140 / __/__ ___ _/ /__ > 18141 _\ \/ _ \/ _ `/ __/ '_/ > 18142 /___/ .__/\_,_/_/ /_/\_\ version 4.0.0-SNAPSHOT > 18143/_/ > 18144 > 18145 Using Scala version 2.12.18 (OpenJDK 64-Bit Server VM, Java 1.8.0_372) > 18146 Type in expressions to have them evaluated. > 18147 Type :help for more information. > 18148 > 18149 scala> > 18150 scala> java.lang.NoClassDefFoundError: > org/sparkproject/guava/cache/CacheBuilder > 18151at > org.apache.spark.sql.catalyst.catalog.SessionCatalog.(SessionCatalog.scala:197) > 18152at > org.apache.spark.sql.internal.BaseSessionStateBuilder.catalog$lzycompute(BaseSessionStateBuilder.scala:153) > 18153at > org.apache.spark.sql.internal.BaseSessionStateBuilder.catalog(BaseSessionStateBuilder.scala:152) > 18154at > org.apache.spark.sql.internal.BaseSessionStateBuilder.v2SessionCatalog$lzycompute(BaseSessionStateBuilder.scala:166) > 18155at > org.apache.spark.sql.internal.BaseSessionStateBuilder.v2SessionCatalog(BaseSessionStateBuilder.scala:166) > 18156at > org.apache.spark.sql.internal.BaseSessionStateBuilder.catalogManager$lzycompute(BaseSessionStateBuilder.scala:168) > 18157at > org.apache.spark.sql.internal.BaseSessionStateBuilder.catalogManager(BaseSessionStateBuilder.scala:168) > 18158at > org.apache.spark.sql.internal.BaseSessionStateBuilder$$anon$1.(BaseSessionStateBuilder.scala:185) > 18159at > org.apache.spark.sql.internal.BaseSessionStateBuilder.analyzer(BaseSessionStateBuilder.scala:185) > 18160at > org.apache.spark.sql.internal.BaseSessionStateBuilder.$anonfun$build$2(BaseSessionStateBuilder.scala:374) > 18161at > org.apache.spark.sql.internal.SessionState.analyzer$lzycompute(SessionState.scala:92) > 18162at > org.apache.spark.sql.internal.SessionState.analyzer(SessionState.scala:92) > 18163at > org.apache.spark.sql.execution.QueryExecution.$anonfun$analyzed$1(QueryExecution.scala:77) > 18164at > org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:138) > 18165at > org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$2(QueryExecution.scala:219) > 18166at > org.apache.spark.sql.execution.QueryExecution$.withInternalError(QueryExecution.scala:546) > 18167at > org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$1(QueryExecution.scala:219) > 18168at > org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:900) > 18169at > org.apache.spark.sql.execution.QueryExecution.executePhase(QueryExecution.scala:218) > 18170at > org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:77) > 18171at > org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:74) > 18172at > org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:66) > 18173at org.apache.spark.sql.Dataset$.$anonfun$ofRows$2(Dataset.scala:100) > 18174at > 
org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:900) > 18175at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:98) > 18176at > org.apache.spark.sql.SparkSession.$anonfun$sql$4(SparkSession.scala:691) > 18177at > org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:900) > 18178at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:682) > 18179at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:713) > 18180at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:744) > 18181... 100 elided > 18182 Caused by: java.lang.ClassNotFoundException: > org.sparkproject.guava.cache.CacheBuilder > 18183at java.net.URLClassLoader.findClass(URLClassLoader.java:387) > 18184at java.lang.ClassLoader.loadClass(ClassLoader.java:418) > 18185at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352) > 18186at java.lang.ClassLoader.loadClass(ClassLoader.java:351) > 18187... 130 more > 18188 > 18189 scala> | > 18190 scala> :quit (ReplSuite.scala:83) {code} -- This message was sent by Atlass
[jira] [Created] (SPARK-44672) Fix git ignore rules related to Antlr
Yang Jie created SPARK-44672: Summary: Fix git ignore rules related to Antlr Key: SPARK-44672 URL: https://issues.apache.org/jira/browse/SPARK-44672 Project: Spark Issue Type: Improvement Components: Project Infra Affects Versions: 3.5.0, 4.0.0 Reporter: Yang Jie Running `git status` after SPARK-44475 was merged shows the following untracked directories: {code:java} sql/api/gen/ sql/api/src/main/antlr4/org/apache/spark/sql/catalyst/parser/gen/ {code} The .gitignore rules should be updated so that these paths are ignored. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
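Based on the untracked directories listed in SPARK-44672, the fix presumably adds rules along these lines to .gitignore; the exact patterns in the merged change may differ.
{code:java}
# Generated ANTLR output introduced by SPARK-44475
sql/api/gen/
sql/api/src/main/antlr4/org/apache/spark/sql/catalyst/parser/gen/
{code}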
[jira] [Resolved] (SPARK-44666) Uninstall CodeQL/Go/Node in non-container jobs
[ https://issues.apache.org/jira/browse/SPARK-44666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng resolved SPARK-44666. --- Assignee: Ruifeng Zheng Resolution: Resolved > Uninstall CodeQL/Go/Node in non-container jobs > -- > > Key: SPARK-44666 > URL: https://issues.apache.org/jira/browse/SPARK-44666 > Project: Spark > Issue Type: Sub-task > Components: Project Infra >Affects Versions: 4.0.0 >Reporter: Ruifeng Zheng >Assignee: Ruifeng Zheng >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-44666) Uninstall CodeQL/Go/Node in non-container jobs
[ https://issues.apache.org/jira/browse/SPARK-44666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17750998#comment-17750998 ] Ruifeng Zheng commented on SPARK-44666: --- resolved in https://github.com/apache/spark/pull/42333 > Uninstall CodeQL/Go/Node in non-container jobs > -- > > Key: SPARK-44666 > URL: https://issues.apache.org/jira/browse/SPARK-44666 > Project: Spark > Issue Type: Sub-task > Components: Project Infra >Affects Versions: 4.0.0 >Reporter: Ruifeng Zheng >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-44070) Bump snappy-java 1.1.10.1
[ https://issues.apache.org/jira/browse/SPARK-44070?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17750985#comment-17750985 ] Anil Poriya commented on SPARK-44070: - [~chengpan] Hi, is there any plan to backport this fix to version 3.3.2, or is there an issue tracking the bump of snappy-java in Spark 3.3.2? > Bump snappy-java 1.1.10.1 > - > > Key: SPARK-44070 > URL: https://issues.apache.org/jira/browse/SPARK-44070 > Project: Spark > Issue Type: Dependency upgrade > Components: Build >Affects Versions: 3.5.0 >Reporter: Cheng Pan >Assignee: Cheng Pan >Priority: Major > Fix For: 3.4.1, 3.5.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org