[jira] [Commented] (SPARK-43106) Data lost from the table if the INSERT OVERWRITE query fails
[ https://issues.apache.org/jira/browse/SPARK-43106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17717882#comment-17717882 ] kalyan s commented on SPARK-43106: --
Thank you for the response [~dongjoon]. Most of our workloads have been running on 2.4, while we have made good progress moving workloads to 3.x this year. We noticed this in a few long-running workloads on static-partitioned/unpartitioned datasets. While HDFS has been our primary storage backend, the move to object stores on GCP has made this problem more pronounced, due to the inherent slowness of writes to these stores. [~vaibhavb], can you share some test code to help here?

> Data lost from the table if the INSERT OVERWRITE query fails
>
> Key: SPARK-43106
> URL: https://issues.apache.org/jira/browse/SPARK-43106
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.3.2
> Reporter: Vaibhav Beriwala
> Priority: Major
>
> When we run an INSERT OVERWRITE query for an unpartitioned table on Spark 3, Spark behaves as follows:
> 1) It first cleans up all the data from the actual table path.
> 2) It then launches a job that performs the actual insert.
>
> There are two major issues with this approach:
> 1) If the insert job launched in step 2 fails for any reason, the data from the original table is lost.
> 2) If the insert job in step 2 takes a long time to complete, the table data is unavailable to other readers for the entire duration of the job.
> This behavior is the same even for partitioned tables when using static partitioning. For dynamic partitioning, we do not delete the table data before the job launch.
>
> Is there a reason why we perform this delete before the job launch and not as part of the job commit operation? This issue is not present in Hive, where the data is probably cleaned up as part of the job commit operation. As part of SPARK-19183, we did add a new hook in the commit protocol for this exact purpose, but it seems its default behavior is still to delete the table data before the job launch.

-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
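The two orderings at issue can be sketched in plain Python (hypothetical helpers; this illustrates delete-before-job versus delete-at-commit, not Spark's actual code path):

```python
def overwrite_delete_first(table, run_insert_job):
    """Spark-3-style INSERT OVERWRITE: truncate first, then run the job."""
    table.clear()                    # step 1: old data is already gone
    table.extend(run_insert_job())   # step 2: a failure here loses everything

def overwrite_delete_on_commit(table, run_insert_job):
    """Hive-style: stage the new data, swap only at commit time."""
    staged = run_insert_job()        # old data stays readable while this runs
    table[:] = staged                # "commit": replace only at the very end

table = ["old-row"]

def failing_job():
    raise RuntimeError("executor lost")

try:
    overwrite_delete_on_commit(table, failing_job)
except RuntimeError:
    pass
print(table)  # old data survives the failed job
```

With the delete-first ordering, the same failing job would leave `table` empty, which is the data-loss scenario described above.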
[jira] [Commented] (SPARK-43237) Handle null exception message in event log
[ https://issues.apache.org/jira/browse/SPARK-43237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17717875#comment-17717875 ] Snoot.io commented on SPARK-43237: -- User 'warrenzhu25' has created a pull request for this issue: https://github.com/apache/spark/pull/40911 > Handle null exception message in event log > -- > > Key: SPARK-43237 > URL: https://issues.apache.org/jira/browse/SPARK-43237 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.4.0 >Reporter: Zhongwei Zhu >Assignee: Zhongwei Zhu >Priority: Minor > Fix For: 3.5.0
[jira] [Resolved] (SPARK-43237) Handle null exception message in event log
[ https://issues.apache.org/jira/browse/SPARK-43237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mridul Muralidharan resolved SPARK-43237. - Fix Version/s: 3.5.0 Resolution: Fixed Issue resolved by pull request 40911 [https://github.com/apache/spark/pull/40911]
[jira] [Assigned] (SPARK-43237) Handle null exception message in event log
[ https://issues.apache.org/jira/browse/SPARK-43237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mridul Muralidharan reassigned SPARK-43237: --- Assignee: Zhongwei Zhu
[jira] [Resolved] (SPARK-43270) Implement __dir__() in pyspark.sql.dataframe.DataFrame to include columns
[ https://issues.apache.org/jira/browse/SPARK-43270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-43270. -- Fix Version/s: 3.5.0 Resolution: Fixed Issue resolved by pull request 40907 [https://github.com/apache/spark/pull/40907]

> Implement __dir__() in pyspark.sql.dataframe.DataFrame to include columns
>
> Key: SPARK-43270
> URL: https://issues.apache.org/jira/browse/SPARK-43270
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.5.0
> Reporter: Beishao Cao
> Assignee: Beishao Cao
> Priority: Major
> Fix For: 3.5.0
> Attachments: Screenshot 2023-04-23 at 6.48.46 PM.png
> Original Estimate: 24h
> Remaining Estimate: 24h
>
> Currently, given {{df.|}}, the Databricks notebook only suggests the methods of DataFrame (see the attached screenshot). However, {{df.column_name}} is also legal and runnable.
> Hence we should override the parent {{__dir__()}} method on the Python {{DataFrame}} class to include column names. The benefit is that any engine that uses {{dir()}} to generate autocomplete suggestions (e.g. the IPython kernel, Databricks notebooks) will then suggest column names on the completion {{df.|}}.
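The proposed override can be sketched with a plain-Python stand-in for the DataFrame class (hypothetical `MiniFrame`; the real change lands in `pyspark.sql.dataframe.DataFrame`):

```python
class MiniFrame:
    """Toy stand-in for a DataFrame with named columns."""

    def __init__(self, columns):
        self.columns = list(columns)

    def __dir__(self):
        # Extend the default attribute listing with the column names, so
        # dir()-driven autocompletion (IPython, notebooks) suggests them.
        return sorted(set(super().__dir__()) | set(self.columns))

df = MiniFrame(["age", "name"])
print([a for a in dir(df) if not a.startswith("_")])  # includes 'age' and 'name'
```

`dir()` calls `__dir__()` under the hood, which is why autocomplete engines built on `dir()` pick up the extra names automatically.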
[jira] [Assigned] (SPARK-43270) Implement __dir__() in pyspark.sql.dataframe.DataFrame to include columns
[ https://issues.apache.org/jira/browse/SPARK-43270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-43270: Assignee: Beishao Cao
[jira] [Resolved] (SPARK-42940) Session management support streaming connect
[ https://issues.apache.org/jira/browse/SPARK-42940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-42940. -- Fix Version/s: 3.5.0 Resolution: Fixed Issue resolved by pull request 40937 [https://github.com/apache/spark/pull/40937]

> Session management support streaming connect
>
> Key: SPARK-42940
> URL: https://issues.apache.org/jira/browse/SPARK-42940
> Project: Spark
> Issue Type: Task
> Components: Connect, Structured Streaming
> Affects Versions: 3.5.0
> Reporter: Raghu Angadi
> Assignee: Raghu Angadi
> Priority: Major
> Fix For: 3.5.0
>
> Add session support for streaming jobs. E.g. a session should stay alive while a streaming job is alive.
> More complex scenarios, such as what happens when the client loses track of the session, may differ. Such semantics would be handled as part of session semantics across Spark Connect (including streaming).
[jira] [Assigned] (SPARK-42940) Session management support streaming connect
[ https://issues.apache.org/jira/browse/SPARK-42940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-42940: Assignee: Raghu Angadi
[jira] [Created] (SPARK-43324) DataSource V2: Handle UPDATE commands for delta-based sources
Anton Okolnychyi created SPARK-43324: Summary: DataSource V2: Handle UPDATE commands for delta-based sources Key: SPARK-43324 URL: https://issues.apache.org/jira/browse/SPARK-43324 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.5.0 Reporter: Anton Okolnychyi We should handle UPDATE commands for data sources that support row deltas.
[jira] [Created] (SPARK-43323) DataFrame.toPandas with Arrow enabled should handle exceptions properly
Takuya Ueshin created SPARK-43323: - Summary: DataFrame.toPandas with Arrow enabled should handle exceptions properly Key: SPARK-43323 URL: https://issues.apache.org/jira/browse/SPARK-43323 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 3.4.0 Reporter: Takuya Ueshin

Currently {{DataFrame.toPandas}} doesn't properly capture exceptions that happened in Spark.

{code:python}
>>> spark.conf.set("spark.sql.ansi.enabled", True)
>>> spark.conf.set('spark.sql.execution.arrow.pyspark.enabled', True)
>>> spark.sql("select 1/0").toPandas()
...
An error occurred while calling o53.getResult.
: org.apache.spark.SparkException: Exception thrown in awaitResult:
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:322)
...
{code}
[jira] [Commented] (SPARK-43106) Data lost from the table if the INSERT OVERWRITE query fails
[ https://issues.apache.org/jira/browse/SPARK-43106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17717855#comment-17717855 ] Dongjoon Hyun commented on SPARK-43106: ---
Thank you for reporting. To narrow down your issue, let me ask for more information, [~itskals].
# Is this specific to Apache Spark 3.3.2? Could you try other Apache Spark versions such as 3.4.0 or 3.3.1?
# What storage backend are you using now, HDFS or S3?
# Do you think you can provide us with a reproducible example?
[jira] [Updated] (SPARK-43322) Spark SQL docs for explode_outer and posexplode_outer omit behavior for null/empty
[ https://issues.apache.org/jira/browse/SPARK-43322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Juchnicki updated SPARK-43322: - Issue Type: Documentation (was: Improvement)

> Spark SQL docs for explode_outer and posexplode_outer omit behavior for null/empty
>
> Key: SPARK-43322
> URL: https://issues.apache.org/jira/browse/SPARK-43322
> Project: Spark
> Issue Type: Documentation
> Components: SQL
> Affects Versions: 3.4.0
> Reporter: Robert Juchnicki
> Priority: Minor
>
> The Spark SQL documentation for [explode_outer|https://spark.apache.org/docs/latest/api/sql/index.html#explode_outer] and [posexplode_outer|https://spark.apache.org/docs/latest/api/sql/index.html#posexplode_outer] omits mentioning that null or empty arrays produce nulls. The descriptions do not appear to be written down in a doc file and are likely pulled from the `ExpressionDescription` tags for the `Explode` and `PosExplode` generators when the `GeneratorOuter` wrapper is used.
[jira] [Created] (SPARK-43322) Spark SQL docs for explode_outer and posexplode_outer omit behavior for null/empty
Robert Juchnicki created SPARK-43322: Summary: Spark SQL docs for explode_outer and posexplode_outer omit behavior for null/empty Key: SPARK-43322 URL: https://issues.apache.org/jira/browse/SPARK-43322 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.4.0 Reporter: Robert Juchnicki The Spark SQL documentation for [explode_outer|https://spark.apache.org/docs/latest/api/sql/index.html#explode_outer] and [posexplode_outer|https://spark.apache.org/docs/latest/api/sql/index.html#posexplode_outer] omits mentioning that null or empty arrays produce nulls. The descriptions do not appear to be written down in a doc file and are likely pulled from the `ExpressionDescription` tags for the `Explode` and `PosExplode` generators when the `GeneratorOuter` wrapper is used.
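The undocumented behavior can be illustrated with a plain-Python model of the outer-explode semantics (an illustration of the documented behavior only, not Spark's implementation):

```python
def explode_outer(rows):
    """rows: (id, array) pairs. Outer explode yields one (id, element) pair
    per element, but emits a single (id, None) row when the array is null
    or empty -- the point the SQL docs currently omit."""
    out = []
    for key, arr in rows:
        if not arr:                  # null or empty array
            out.append((key, None))  # outer variant keeps the row; value is null
        else:
            out.extend((key, v) for v in arr)
    return out

print(explode_outer([(1, ["a", "b"]), (2, []), (3, None)]))
# → [(1, 'a'), (1, 'b'), (2, None), (3, None)]
```

The non-outer `explode` would instead drop rows 2 and 3 entirely, which is why the null/empty case deserves an explicit mention in the `_outer` docs.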
[jira] [Created] (SPARK-43321) Impl Dataset#JoinWith
Zhen Li created SPARK-43321: --- Summary: Impl Dataset#JoinWith Key: SPARK-43321 URL: https://issues.apache.org/jira/browse/SPARK-43321 Project: Spark Issue Type: Improvement Components: Connect Affects Versions: 3.5.0 Reporter: Zhen Li Implement the missing method Dataset#joinWith.
[jira] [Created] (SPARK-43320) Directly call Hive 2.3.9 API
Cheng Pan created SPARK-43320: - Summary: Directly call Hive 2.3.9 API Key: SPARK-43320 URL: https://issues.apache.org/jira/browse/SPARK-43320 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.5.0 Reporter: Cheng Pan
[jira] [Created] (SPARK-43319) Remove usage of deprecated DefaultKubernetesClient
Cheng Pan created SPARK-43319: - Summary: Remove usage of deprecated DefaultKubernetesClient Key: SPARK-43319 URL: https://issues.apache.org/jira/browse/SPARK-43319 Project: Spark Issue Type: Test Components: Kubernetes, Tests Affects Versions: 3.5.0 Reporter: Cheng Pan
[jira] [Resolved] (SPARK-43263) Upgrade FasterXML jackson to 2.15.0
[ https://issues.apache.org/jira/browse/SPARK-43263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-43263. -- Fix Version/s: 3.5.0 Assignee: Bjørn Jørgensen Resolution: Fixed Resolved by https://github.com/apache/spark/pull/40933 > Upgrade FasterXML jackson to 2.15.0 > --- > > Key: SPARK-43263 > URL: https://issues.apache.org/jira/browse/SPARK-43263 > Project: Spark > Issue Type: Dependency upgrade > Components: Build >Affects Versions: 3.5.0 >Reporter: Bjørn Jørgensen >Assignee: Bjørn Jørgensen >Priority: Major > Fix For: 3.5.0 > > > * #390: (yaml) Upgrade to Snakeyaml 2.0 (resolves > [CVE-2022-1471|https://nvd.nist.gov/vuln/detail/CVE-2022-1471]) > (contributed by @pjfannin
[jira] [Created] (SPARK-43318) spark reader csv and json support wholetext parameters
melin created SPARK-43318: - Summary: spark reader csv and json support wholetext parameters Key: SPARK-43318 URL: https://issues.apache.org/jira/browse/SPARK-43318 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.5.0 Reporter: melin Fix For: 3.5.0 The FTPInputStream used by Hadoop's FTPFileSystem does not support seek, so Spark's HadoopFileLinesReader fails to read. Support reading the entire file and then splitting it into lines, to avoid the read failure. [https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/ftp/FTPInputStream.java] [~cloud_fan]
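The proposed behavior can be sketched in plain Python (hypothetical helper; the real change would live inside Spark's CSV/JSON readers):

```python
import io

def read_lines_wholetext(stream):
    # For non-seekable inputs such as FTP streams, read the entire file in
    # one pass and split it into lines in memory, instead of seeking into
    # the middle of the file once per split as line-based readers do.
    return stream.read().decode("utf-8").splitlines()

ftp_like = io.BytesIO(b"a,1\nb,2\nc,3\n")   # stand-in for a non-seekable stream
print(read_lines_wholetext(ftp_like))        # → ['a,1', 'b,2', 'c,3']
```

The trade-off is memory: each file is materialized fully before splitting, which is why it would be an opt-in `wholetext` option rather than the default.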
[jira] [Commented] (SPARK-43170) The spark sql like statement is pushed down to parquet for execution, but the data cannot be queried
[ https://issues.apache.org/jira/browse/SPARK-43170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17717611#comment-17717611 ] Steve Loughran commented on SPARK-43170: FWIW, using S3 URLs ('s3://x/dwm_user_app_action_sum_all') means it's an AWS EMR deployment, with their private fork of Spark, etc. You might want to raise a support case there.

> The spark sql like statement is pushed down to parquet for execution, but the data cannot be queried
>
> Key: SPARK-43170
> URL: https://issues.apache.org/jira/browse/SPARK-43170
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.2.2
> Reporter: todd
> Priority: Major
> Attachments: image-2023-04-18-10-59-30-199.png, image-2023-04-19-10-59-44-118.png, screenshot-1.png
>
> -- DDL
> CREATE TABLE `ecom_dwm`.`dwm_user_app_action_sum_all` (
> `gaid` STRING COMMENT '',
> `beyla_id` STRING COMMENT '',
> `dt` STRING,
> `hour` STRING,
> `appid` STRING COMMENT 'package name')
> USING parquet
> PARTITIONED BY (dt, hour, appid)
> LOCATION 's3://x/dwm_user_app_action_sum_all'
>
> -- partitions info
> show partitions ecom_dwm.dwm_user_app_action_sum_all PARTITION (dt='20230412');
>
> dt=20230412/hour=23/appid=blibli.mobile.commerce
> dt=20230412/hour=23/appid=cn.shopee.app
> dt=20230412/hour=23/appid=cn.shopee.br
> dt=20230412/hour=23/appid=cn.shopee.id
> dt=20230412/hour=23/appid=cn.shopee.my
> dt=20230412/hour=23/appid=cn.shopee.ph
>
> -- query
> select DISTINCT(appid) from ecom_dwm.dwm_user_app_action_sum_all
> where dt='20230412' and appid like '%shopee%'
>
> -- result
> no data
>
> -- other
> I use Spark 3.0.1 and the Trino query engine to query the data.
>
> The physical scan node formed by Spark 3.2:
> (3) Scan parquet ecom_dwm.dwm_user_app_action_sum_all
> Output [3]: [dt#63, hour#64, appid#65]
> Batched: true
> Location: InMemoryFileIndex []
> PartitionFilters: [isnotnull(dt#63), isnotnull(appid#65), (dt#63 = 20230412), Contains(appid#65, shopee)]
> ReadSchema: struct<>
>
> !image-2023-04-18-10-59-30-199.png!
>
> -- sql plan detail
> {code:java}
> == Physical Plan ==
> CollectLimit (9)
> +- InMemoryTableScan (1)
>    +- InMemoryRelation (2)
>       +- * HashAggregate (8)
>          +- Exchange (7)
>             +- * HashAggregate (6)
>                +- * Project (5)
>                   +- * ColumnarToRow (4)
>                      +- Scan parquet ecom_dwm.dwm_user_app_action_sum_all (3)
> (1) InMemoryTableScan
> Output [1]: [appid#65]
> Arguments: [appid#65]
> (2) InMemoryRelation
> Arguments: [appid#65], CachedRDDBuilder(org.apache.spark.sql.execution.columnar.DefaultCachedBatchSerializer@ab5af13,StorageLevel(disk, memory, deserialized, 1 replicas),*(2) HashAggregate(keys=[appid#65], functions=[], output=[appid#65])
> +- Exchange hashpartitioning(appid#65, 200), ENSURE_REQUIREMENTS, [plan_id=24]
>    +- *(1) HashAggregate(keys=[appid#65], functions=[], output=[appid#65])
>       +- *(1) Project [appid#65]
>          +- *(1) ColumnarToRow
>             +- FileScan parquet ecom_dwm.dwm_user_app_action_sum_all[dt#63,hour#64,appid#65] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex(0 paths)[], PartitionFilters: [isnotnull(dt#63), isnotnull(appid#65), (dt#63 = 20230412), StartsWith(appid#65, com)], PushedFilters: [], ReadSchema: struct<>
> ,None)
> (3) Scan parquet ecom_dwm.dwm_user_app_action_sum_all
> Output [3]: [dt#63, hour#64, appid#65]
> Batched: true
> Location: InMemoryFileIndex []
> PartitionFilters: [isnotnull(dt#63), isnotnull(appid#65), (dt#63 = 20230412), StartsWith(appid#65, com)]
> ReadSchema: struct<>
> (4) ColumnarToRow [codegen id : 1]
> Input [3]: [dt#63, hour#64, appid#65]
> (5) Project [codegen id : 1]
> Output [1]: [appid#65]
> Input [3]: [dt#63, hour#64, appid#65]
> (6) HashAggregate [codegen id : 1]
> Input [1]: [appid#65]
> Keys [1]: [appid#65]
> Functions: []
> Aggregate Attributes: []
> Results [1]: [appid#65]
> (7) Exchange
> Input [1]: [appid#65]
> Arguments: hashpartitioning(appid#65, 200), ENSURE_REQUIREMENTS, [plan_id=24]
> (8) HashAggregate [codegen id : 2]
> Input [1]: [appid#65]
> Keys [1]: [appid#65]
> Functions: []
> Aggregate Attributes: []
> Results [1]: [appid#65]
> (9) CollectLimit
> Input [1]: [appid#65]
> Arguments: 1
> {code}
[jira] [Commented] (SPARK-43235) ClientDistributedCacheManager doesn't set the LocalResourceVisibility.PRIVATE if isPublic throws exception
[ https://issues.apache.org/jira/browse/SPARK-43235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17717566#comment-17717566 ] Pralabh Kumar commented on SPARK-43235: --- [~gurwls223] Can you please look into this?

> ClientDistributedCacheManager doesn't set the LocalResourceVisibility.PRIVATE if isPublic throws exception
>
> Key: SPARK-43235
> URL: https://issues.apache.org/jira/browse/SPARK-43235
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 3.4.0
> Reporter: Pralabh Kumar
> Priority: Minor
>
> Hi Spark Team.
> Currently the *ClientDistributedCacheManager* *getVisibility* method checks whether resource visibility can be set to private or public.
> In order to set *LocalResourceVisibility.PUBLIC*, isPublic checks the permissions of all ancestor directories of the executable directory. It goes up to the root folder to check the permissions of all the parents (ancestorsHaveExecutePermissions).
> checkPermissionOfOther calls FileStatus getFileStatus to check the permission. If getFileStatus throws an exception, Spark submit fails; the visibility is never set to private:
> if (isPublic(conf, uri, statCache)) {
> LocalResourceVisibility.PUBLIC
> } else {
> LocalResourceVisibility.PRIVATE
> }
> Generally, if the user doesn't have permission to check the root folder (specifically on cloud file systems such as GCS, for the buckets), the method throws IOException (Error accessing Bucket).
>
> *Ideally, if there is an error in isPublic, which means Spark isn't able to determine the execute permission of all the parent directories, it should set LocalResourceVisibility.PRIVATE. However, it currently throws an exception in isPublic and hence Spark submit fails.*
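The suggested fix amounts to a try/except fallback around the permission probe; a plain-Python sketch (the real code is Scala in ClientDistributedCacheManager, and the function names here are illustrative):

```python
def get_visibility(is_public_check):
    """Return the resource visibility, falling back to PRIVATE when the
    ancestor-permission check itself fails (e.g. no access to a bucket root)."""
    try:
        return "PUBLIC" if is_public_check() else "PRIVATE"
    except IOError:
        # If we cannot determine execute permission on every ancestor, the
        # safe answer is PRIVATE, rather than failing the whole spark-submit.
        return "PRIVATE"

def bucket_root_denied():
    raise IOError("Error accessing Bucket")

print(get_visibility(bucket_root_denied))  # → PRIVATE
```

PRIVATE is the conservative choice: it only costs a per-user re-upload of the resource, whereas the current behavior aborts the submission entirely.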
[jira] [Commented] (SPARK-43306) Migrate `ValueError` from Spark SQL types into error class
[ https://issues.apache.org/jira/browse/SPARK-43306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17717565#comment-17717565 ] ASF GitHub Bot commented on SPARK-43306: User 'itholic' has created a pull request for this issue: https://github.com/apache/spark/pull/40975 > Migrate `ValueError` from Spark SQL types into error class > -- > > Key: SPARK-43306 > URL: https://issues.apache.org/jira/browse/SPARK-43306 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.5.0 >Reporter: Haejoon Lee >Priority: Major > > Migrate `ValueError` from Spark SQL types into error class
[jira] [Commented] (SPARK-43304) Enable test_to_latex by supporting jinja2>=3.0.0
[ https://issues.apache.org/jira/browse/SPARK-43304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17717564#comment-17717564 ] ASF GitHub Bot commented on SPARK-43304: User 'itholic' has created a pull request for this issue: https://github.com/apache/spark/pull/40973 > Enable test_to_latex by supporting jinja2>=3.0.0 > > > Key: SPARK-43304 > URL: https://issues.apache.org/jira/browse/SPARK-43304 > Project: Spark > Issue Type: Sub-task > Components: Pandas API on Spark >Affects Versions: 3.5.0 >Reporter: Haejoon Lee >Priority: Major > > Please refer to [https://github.com/pandas-dev/pandas/pull/47970] for more detail.
[jira] [Commented] (SPARK-42843) Assign a name to the error class _LEGACY_ERROR_TEMP_2007
[ https://issues.apache.org/jira/browse/SPARK-42843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17717555#comment-17717555 ] ASF GitHub Bot commented on SPARK-42843: User 'liang3zy22' has created a pull request for this issue: https://github.com/apache/spark/pull/40955

> Assign a name to the error class _LEGACY_ERROR_TEMP_2007
>
> Key: SPARK-42843
> URL: https://issues.apache.org/jira/browse/SPARK-42843
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 3.5.0
> Reporter: Max Gekk
> Priority: Minor
> Labels: starter
>
> Choose a proper name for the error class *_LEGACY_ERROR_TEMP_2007* defined in {*}core/src/main/resources/error/error-classes.json{*}. The name should be short but complete (look at the examples in error-classes.json).
> Add a test which triggers the error from user code if such a test doesn't exist yet. Check the exception fields by using {*}checkError(){*}. That function checks only the valuable error fields and avoids depending on the error text message. In this way, tech editors can modify the error format in error-classes.json without worrying about Spark's internal tests. Migrate other tests that might trigger the error onto checkError().
> If you cannot reproduce the error from user space (using a SQL query), replace the error by an internal error, see {*}SparkException.internalError(){*}.
> Improve the error message format in error-classes.json if the current one is not clear. Propose a solution to users for how to avoid and fix such kinds of errors.
> Please look at the PRs below as examples:
> * [https://github.com/apache/spark/pull/38685]
> * [https://github.com/apache/spark/pull/38656]
> * [https://github.com/apache/spark/pull/38490]
[jira] [Created] (SPARK-43317) Support combine adjacent aggregation
XiDuo You created SPARK-43317: - Summary: Support combine adjacent aggregation Key: SPARK-43317 URL: https://issues.apache.org/jira/browse/SPARK-43317 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.5.0 Reporter: XiDuo You If there are adjacent aggregations in Partial and Final mode, we can combine them into Complete mode.
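The equivalence behind this optimization can be sketched with a toy sum aggregation (plain Python, not Spark's physical operators): a Partial pass immediately followed by a Final merge over the same grouping keys computes the same result as a single Complete pass.

```python
def agg_partial_then_final(rows):
    partial = {}                       # Partial mode: pre-aggregate input rows
    for key, value in rows:
        partial[key] = partial.get(key, 0) + value
    final = {}                         # Final mode: merge the partial buffers
    for key, value in partial.items():
        final[key] = final.get(key, 0) + value
    return final

def agg_complete(rows):
    out = {}                           # Complete mode: one pass, no merge step
    for key, value in rows:
        out[key] = out.get(key, 0) + value
    return out

rows = [("a", 1), ("b", 2), ("a", 3)]
print(agg_partial_then_final(rows) == agg_complete(rows))  # → True
```

Combining the two operators saves one pass over the intermediate buffers; it is only safe when the two aggregations are truly adjacent, with no exchange between them that the Partial/Final split was there to serve.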
[jira] [Created] (SPARK-43316) Add more CTE SQL tests
Runyao.Chen created SPARK-43316: --- Summary: Add more CTE SQL tests Key: SPARK-43316 URL: https://issues.apache.org/jira/browse/SPARK-43316 Project: Spark Issue Type: Test Components: SQL Affects Versions: 3.4.0 Reporter: Runyao.Chen CTE is a hot area in terms of regressions and needs more test coverage. We can borrow tests from other open-source DBMSs (Postgres, ZetaSQL, DuckDB).