[jira] [Commented] (SPARK-43106) Data lost from the table if the INSERT OVERWRITE query fails

2023-04-28 Thread kalyan s (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17717882#comment-17717882
 ] 

kalyan s commented on SPARK-43106:
--

Thank you for the response, [~dongjoon].

Most of our workloads have been running on 2.4, and we have made good 
progress moving them to 3.x this year.

We noticed this in a few long-running workloads on static 
partitions/unpartitioned datasets.

While HDFS has been our primary storage backend, moving to object stores on GCP 
has made this problem more pronounced, due to the inherent slowness of 
writing to them.

[~vaibhavb], can you share some test code to help here?

> Data lost from the table if the INSERT OVERWRITE query fails
> 
>
> Key: SPARK-43106
> URL: https://issues.apache.org/jira/browse/SPARK-43106
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.2
>Reporter: Vaibhav Beriwala
>Priority: Major
>
> When we run an INSERT OVERWRITE query for an unpartitioned table on Spark 3, 
> Spark behaves as follows:
> 1) It first cleans up all the data from the actual table path.
> 2) It then launches a job that performs the actual insert.
>  
> There are 2 major issues with this approach:
> 1) If the insert job launched in step 2 fails for any reason, the data from 
> the original table is lost.
> 2) If the insert job in step 2 takes a long time to complete, the table data 
> is unavailable to other readers for the entire duration of the job.
> The behavior is the same for partitioned tables when using static 
> partitioning. For dynamic partitioning, we do not delete the table data 
> before the job launch.
>  
> Is there a reason why we perform this delete before the job launch rather 
> than as part of the job commit operation? This issue does not exist with 
> Hive, where the data is presumably cleaned up as part of the job commit 
> operation. As part of SPARK-19183, we did add a new hook in the commit 
> protocol for this exact purpose, but it seems its default behavior is still 
> to delete the table data before the job launch.
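
For illustration, a minimal, untested sketch of the scenario described above 
(the table/view names and the deliberately failing UDF are made up; whether 
data is actually lost depends on the commit protocol configuration):

{code:python}
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Hypothetical unpartitioned table with some pre-existing rows.
spark.sql("CREATE TABLE t (id INT) USING parquet")
spark.sql("INSERT INTO t VALUES (1), (2), (3)")

# A UDF that always fails, used only to make the INSERT OVERWRITE job fail.
@udf("int")
def boom(x):
    raise RuntimeError("simulated task failure")

spark.udf.register("boom", boom)
spark.range(3).createOrReplaceTempView("src")

try:
    # The table location is cleared before the insert job runs, so once the
    # job fails the original rows (1, 2, 3) are already gone.
    spark.sql("INSERT OVERWRITE TABLE t SELECT boom(id) FROM src")
except Exception:
    pass

spark.sql("SELECT * FROM t").show()  # reportedly empty after the failed overwrite
{code}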



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-43237) Handle null exception message in event log

2023-04-28 Thread Snoot.io (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17717875#comment-17717875
 ] 

Snoot.io commented on SPARK-43237:
--

User 'warrenzhu25' has created a pull request for this issue:
https://github.com/apache/spark/pull/40911

> Handle null exception message in event log
> --
>
> Key: SPARK-43237
> URL: https://issues.apache.org/jira/browse/SPARK-43237
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Zhongwei Zhu
>Assignee: Zhongwei Zhu
>Priority: Minor
> Fix For: 3.5.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-43237) Handle null exception message in event log

2023-04-28 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan resolved SPARK-43237.
-
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 40911
[https://github.com/apache/spark/pull/40911]

> Handle null exception message in event log
> --
>
> Key: SPARK-43237
> URL: https://issues.apache.org/jira/browse/SPARK-43237
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Zhongwei Zhu
>Assignee: Zhongwei Zhu
>Priority: Minor
> Fix For: 3.5.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-43237) Handle null exception message in event log

2023-04-28 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan reassigned SPARK-43237:
---

Assignee: Zhongwei Zhu

> Handle null exception message in event log
> --
>
> Key: SPARK-43237
> URL: https://issues.apache.org/jira/browse/SPARK-43237
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Zhongwei Zhu
>Assignee: Zhongwei Zhu
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-43270) Implement __dir__() in pyspark.sql.dataframe.DataFrame to include columns

2023-04-28 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-43270.
--
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 40907
[https://github.com/apache/spark/pull/40907]

> Implement __dir__() in pyspark.sql.dataframe.DataFrame to include columns
> -
>
> Key: SPARK-43270
> URL: https://issues.apache.org/jira/browse/SPARK-43270
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Beishao Cao
>Assignee: Beishao Cao
>Priority: Major
> Fix For: 3.5.0
>
> Attachments: Screenshot 2023-04-23 at 6.48.46 PM.png
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Currently, given {{df.|}}, the Databricks notebook will only suggest the 
> methods of the DataFrame (see the attached screenshot). However, 
> {{df.column_name}} is also legal and runnable.
> Hence we should override the parent {{__dir__()}} method on the Python 
> {{DataFrame}} class to include column names. The benefit is that engines 
> that use {{dir()}} to generate autocomplete suggestions (e.g. the IPython 
> kernel, Databricks Notebooks) will suggest column names on the completion 
> {{df.|}}.
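
As a rough sketch of the idea (an illustrative subclass, not the actual patch; 
the real change would go into pyspark.sql.dataframe.DataFrame itself):

{code:python}
from pyspark.sql import DataFrame

class ColumnAwareDataFrame(DataFrame):
    def __dir__(self):
        # Keep the normal attributes/methods, then append the column names so
        # that dir()-based completion engines also suggest them after "df.".
        attrs = set(super().__dir__())
        attrs.update(self.columns)
        return sorted(attrs)
{code}

With such an override, dir(df) would include the column names alongside the 
DataFrame methods, which is exactly what IPython-style completion consumes.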



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-43270) Implement __dir__() in pyspark.sql.dataframe.DataFrame to include columns

2023-04-28 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-43270:


Assignee: Beishao Cao

> Implement __dir__() in pyspark.sql.dataframe.DataFrame to include columns
> -
>
> Key: SPARK-43270
> URL: https://issues.apache.org/jira/browse/SPARK-43270
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Beishao Cao
>Assignee: Beishao Cao
>Priority: Major
> Attachments: Screenshot 2023-04-23 at 6.48.46 PM.png
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Currently, given {{df.|}}, the Databricks notebook will only suggest the 
> methods of the DataFrame (see the attached screenshot). However, 
> {{df.column_name}} is also legal and runnable.
> Hence we should override the parent {{__dir__()}} method on the Python 
> {{DataFrame}} class to include column names. The benefit is that engines 
> that use {{dir()}} to generate autocomplete suggestions (e.g. the IPython 
> kernel, Databricks Notebooks) will suggest column names on the completion 
> {{df.|}}.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-42940) Session management support streaming connect

2023-04-28 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-42940.
--
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 40937
[https://github.com/apache/spark/pull/40937]

> Session management support streaming connect
> 
>
> Key: SPARK-42940
> URL: https://issues.apache.org/jira/browse/SPARK-42940
> Project: Spark
>  Issue Type: Task
>  Components: Connect, Structured Streaming
>Affects Versions: 3.5.0
>Reporter: Raghu Angadi
>Assignee: Raghu Angadi
>Priority: Major
> Fix For: 3.5.0
>
>
> Add session support for streaming jobs. 
> E.g. a session should stay alive while a streaming job is alive. 
> More complex scenarios, such as what happens when the client loses track of 
> the session, may be deferred; such semantics would be handled as part of 
> session semantics across Spark Connect (including streaming). 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42940) Session management support streaming connect

2023-04-28 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-42940:


Assignee: Raghu Angadi

> Session management support streaming connect
> 
>
> Key: SPARK-42940
> URL: https://issues.apache.org/jira/browse/SPARK-42940
> Project: Spark
>  Issue Type: Task
>  Components: Connect, Structured Streaming
>Affects Versions: 3.5.0
>Reporter: Raghu Angadi
>Assignee: Raghu Angadi
>Priority: Major
>
> Add session support for streaming jobs. 
> E.g. a session should stay alive while a streaming job is alive. 
> More complex scenarios, such as what happens when the client loses track of 
> the session, may be deferred; such semantics would be handled as part of 
> session semantics across Spark Connect (including streaming). 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-43324) DataSource V2: Handle UPDATE commands for delta-based sources

2023-04-28 Thread Anton Okolnychyi (Jira)
Anton Okolnychyi created SPARK-43324:


 Summary: DataSource V2: Handle UPDATE commands for delta-based 
sources
 Key: SPARK-43324
 URL: https://issues.apache.org/jira/browse/SPARK-43324
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.5.0
Reporter: Anton Okolnychyi


We should handle UPDATE commands for data sources that support row deltas.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-43323) DataFrame.toPandas with Arrow enabled should handle exceptions properly

2023-04-28 Thread Takuya Ueshin (Jira)
Takuya Ueshin created SPARK-43323:
-

 Summary: DataFrame.toPandas with Arrow enabled should handle 
exceptions properly
 Key: SPARK-43323
 URL: https://issues.apache.org/jira/browse/SPARK-43323
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 3.4.0
Reporter: Takuya Ueshin


Currently {{DataFrame.toPandas}} doesn't properly capture exceptions that 
happen in Spark.

{code:python}
>>> spark.conf.set("spark.sql.ansi.enabled", True)
>>> spark.conf.set('spark.sql.execution.arrow.pyspark.enabled', True)
>>> spark.sql("select 1/0").toPandas()
...
  An error occurred while calling o53.getResult.
: org.apache.spark.SparkException: Exception thrown in awaitResult:
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:322)
...
{code}
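
Until this is fixed, a hedged workaround sketch (not the proposed fix) is to 
fall back to the non-Arrow path, which surfaces the underlying ANSI error 
instead of the opaque getResult failure:

{code:python}
# Assumes the same session as above, with ANSI mode enabled.
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", False)
try:
    spark.sql("select 1/0").toPandas()
except Exception as e:
    # Expected: the divide-by-zero error raised by ANSI mode, rather than
    # "An error occurred while calling o53.getResult".
    print(type(e).__name__, e)
{code}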



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-43106) Data lost from the table if the INSERT OVERWRITE query fails

2023-04-28 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17717855#comment-17717855
 ] 

Dongjoon Hyun commented on SPARK-43106:
---

Thank you for reporting. To narrow down your issue, let me ask for more 
information, [~itskals].
# Is this specific to Apache Spark 3.3.2? Could you try other Apache Spark 
versions, like Apache Spark 3.4.0 or Apache Spark 3.3.1?
# What storage backend are you using now, HDFS or S3?
# Could you provide a reproducible example?

> Data lost from the table if the INSERT OVERWRITE query fails
> 
>
> Key: SPARK-43106
> URL: https://issues.apache.org/jira/browse/SPARK-43106
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.2
>Reporter: Vaibhav Beriwala
>Priority: Major
>
> When we run an INSERT OVERWRITE query for an unpartitioned table on Spark 3, 
> Spark behaves as follows:
> 1) It first cleans up all the data from the actual table path.
> 2) It then launches a job that performs the actual insert.
>  
> There are 2 major issues with this approach:
> 1) If the insert job launched in step 2 fails for any reason, the data from 
> the original table is lost.
> 2) If the insert job in step 2 takes a long time to complete, the table data 
> is unavailable to other readers for the entire duration of the job.
> The behavior is the same for partitioned tables when using static 
> partitioning. For dynamic partitioning, we do not delete the table data 
> before the job launch.
>  
> Is there a reason why we perform this delete before the job launch rather 
> than as part of the job commit operation? This issue does not exist with 
> Hive, where the data is presumably cleaned up as part of the job commit 
> operation. As part of SPARK-19183, we did add a new hook in the commit 
> protocol for this exact purpose, but it seems its default behavior is still 
> to delete the table data before the job launch.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43322) Spark SQL docs for explode_outer and posexplode_outer omit behavior for null/empty

2023-04-28 Thread Robert Juchnicki (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Juchnicki updated SPARK-43322:
-
Issue Type: Documentation  (was: Improvement)

> Spark SQL docs for explode_outer and posexplode_outer omit behavior for 
> null/empty
> --
>
> Key: SPARK-43322
> URL: https://issues.apache.org/jira/browse/SPARK-43322
> Project: Spark
>  Issue Type: Documentation
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Robert Juchnicki
>Priority: Minor
>
> The Spark SQL documentation for 
> [explode_outer|https://spark.apache.org/docs/latest/api/sql/index.html#explode_outer]
>  and 
> [posexplode_outer|https://spark.apache.org/docs/latest/api/sql/index.html#posexplode_outer]
>  omits mentioning that null or empty arrays produce nulls. The descriptions 
> do not appear to be written down in a doc file and are likely pulled from the 
> `ExpressionDescription` tags for the `Explode` and `PosExplode` generators 
> when the `GeneratorOuter` wrapper is used.
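
For reference, a small PySpark sketch of the omitted behavior (column names and 
data are made up):

{code:python}
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, explode_outer

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(1, ["a", "b"]), (2, []), (3, None)],
    ["id", "arr"],
)

# explode() drops the rows whose array is empty or null: only id=1 survives.
df.select("id", explode("arr").alias("elem")).show()

# explode_outer() keeps those rows and emits a null element for them.
df.select("id", explode_outer("arr").alias("elem")).show()
{code}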



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-43322) Spark SQL docs for explode_outer and posexplode_outer omit behavior for null/empty

2023-04-28 Thread Robert Juchnicki (Jira)
Robert Juchnicki created SPARK-43322:


 Summary: Spark SQL docs for explode_outer and posexplode_outer 
omit behavior for null/empty
 Key: SPARK-43322
 URL: https://issues.apache.org/jira/browse/SPARK-43322
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.4.0
Reporter: Robert Juchnicki


The Spark SQL documentation for 
[explode_outer|https://spark.apache.org/docs/latest/api/sql/index.html#explode_outer]
 and 
[posexplode_outer|https://spark.apache.org/docs/latest/api/sql/index.html#posexplode_outer]
 omits mentioning that null or empty arrays produce nulls. The descriptions do 
not appear to be written down in a doc file and are likely pulled from the 
`ExpressionDescription` tags for the `Explode` and `PosExplode` generators when 
the `GeneratorOuter` wrapper is used.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-43321) Impl Dataset#JoinWith

2023-04-28 Thread Zhen Li (Jira)
Zhen Li created SPARK-43321:
---

 Summary: Impl Dataset#JoinWith
 Key: SPARK-43321
 URL: https://issues.apache.org/jira/browse/SPARK-43321
 Project: Spark
  Issue Type: Improvement
  Components: Connect
Affects Versions: 3.5.0
Reporter: Zhen Li


Implement the missing Dataset method joinWith.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-43320) Directly call Hive 2.3.9 API

2023-04-28 Thread Cheng Pan (Jira)
Cheng Pan created SPARK-43320:
-

 Summary: Directly call Hive 2.3.9 API
 Key: SPARK-43320
 URL: https://issues.apache.org/jira/browse/SPARK-43320
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.5.0
Reporter: Cheng Pan






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-43319) Remove usage of deprecated DefaultKubernetesClient

2023-04-28 Thread Cheng Pan (Jira)
Cheng Pan created SPARK-43319:
-

 Summary: Remove usage of deprecated DefaultKubernetesClient
 Key: SPARK-43319
 URL: https://issues.apache.org/jira/browse/SPARK-43319
 Project: Spark
  Issue Type: Test
  Components: Kubernetes, Tests
Affects Versions: 3.5.0
Reporter: Cheng Pan






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-43263) Upgrade FasterXML jackson to 2.15.0

2023-04-28 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-43263.
--
Fix Version/s: 3.5.0
 Assignee: Bjørn Jørgensen
   Resolution: Fixed

Resolved by https://github.com/apache/spark/pull/40933

> Upgrade FasterXML jackson to 2.15.0
> ---
>
> Key: SPARK-43263
> URL: https://issues.apache.org/jira/browse/SPARK-43263
> Project: Spark
>  Issue Type: Dependency upgrade
>  Components: Build
>Affects Versions: 3.5.0
>Reporter: Bjørn Jørgensen
>Assignee: Bjørn Jørgensen
>Priority: Major
> Fix For: 3.5.0
>
>
> * #390: (yaml) Upgrade to Snakeyaml 2.0 (resolves 
> [CVE-2022-1471|https://nvd.nist.gov/vuln/detail/CVE-2022-1471])
>  (contributed by @pjfannin)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-43318) spark reader csv and json support wholetext parameters

2023-04-28 Thread melin (Jira)
melin created SPARK-43318:
-

 Summary: spark reader csv and json support wholetext parameters
 Key: SPARK-43318
 URL: https://issues.apache.org/jira/browse/SPARK-43318
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.5.0
Reporter: melin
 Fix For: 3.5.0


FTPInputStream, used by Hadoop's FTPFileSystem, does not support seek, so 
Spark's HadoopFileLinesReader fails to read from it. 

Supporting reading the entire file and then splitting it into lines would avoid 
the read failure.

 

[https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/ftp/FTPInputStream.java]

 

[~cloud_fan] 
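
As a rough sketch of the idea (the csv wholetext option itself does not exist 
yet and the FTP path below is hypothetical), one can emulate it today by 
reading each file as a whole with the text source, which already supports 
wholetext, and splitting lines before CSV parsing:

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

path = "ftp://host/data/example.csv"  # hypothetical FTP location

# Read each file as a single row; this mirrors the proposal of reading the
# whole file instead of relying on the seek() that FTPInputStream lacks.
whole = spark.read.option("wholetext", "true").text(path)

# Split each file's content into lines and parse the resulting RDD of
# CSV row strings.
lines = whole.rdd.flatMap(lambda row: row.value.splitlines())
df = spark.read.option("header", "true").csv(lines)
{code}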



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-43170) The spark sql like statement is pushed down to parquet for execution, but the data cannot be queried

2023-04-28 Thread Steve Loughran (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17717611#comment-17717611
 ] 

Steve Loughran commented on SPARK-43170:


FWIW, using S3 URLs ('s3://x/dwm_user_app_action_sum_all') means it's an 
AWS EMR deployment, with their private fork of Spark, etc. You might want to 
raise a support case there.

> The spark sql like statement is pushed down to parquet for execution, but the 
> data cannot be queried
> 
>
> Key: SPARK-43170
> URL: https://issues.apache.org/jira/browse/SPARK-43170
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.2
>Reporter: todd
>Priority: Major
> Attachments: image-2023-04-18-10-59-30-199.png, 
> image-2023-04-19-10-59-44-118.png, screenshot-1.png
>
>
> --DDL
> CREATE TABLE `ecom_dwm`.`dwm_user_app_action_sum_all` (
>   `gaid` STRING COMMENT '',
>   `beyla_id` STRING COMMENT '',
>   `dt` STRING,
>   `hour` STRING,
>   `appid` STRING COMMENT 'package name')
> USING parquet
> PARTITIONED BY (dt, hour, appid)
> LOCATION 's3://x/dwm_user_app_action_sum_all'
> -- partitions info
> show partitions ecom_dwm.dwm_user_app_action_sum_all PARTITION 
> (dt='20230412');
>  
> dt=20230412/hour=23/appid=blibli.mobile.commerce
> dt=20230412/hour=23/appid=cn.shopee.app
> dt=20230412/hour=23/appid=cn.shopee.br
> dt=20230412/hour=23/appid=cn.shopee.id
> dt=20230412/hour=23/appid=cn.shopee.my
> dt=20230412/hour=23/appid=cn.shopee.ph
>  
> -- query
> select DISTINCT(appid) from ecom_dwm.dwm_user_app_action_sum_all
> where dt='20230412' and appid like '%shopee%'
>  
> --result
>  no data 
>  
> -- other
> I used the Spark 3.0.1 version and the Trino query engine to query the data.
>  
>  
> The physical execution node formed by spark 3.2
> (3) Scan parquet ecom_dwm.dwm_user_app_action_sum_all
> Output [3]: [dt#63, hour#64, appid#65]
> Batched: true
> Location: InMemoryFileIndex []
> PartitionFilters: [isnotnull(dt#63), isnotnull(appid#65), (dt#63 = 20230412), 
> Contains(appid#65, shopee)]
> ReadSchema: struct<>
>  
>  
> !image-2023-04-18-10-59-30-199.png!
>  
>  -- sql plan detail
> {code:java}
> == Physical Plan ==
> CollectLimit (9)
> +- InMemoryTableScan (1)
>   +- InMemoryRelation (2)
> +- * HashAggregate (8)
>+- Exchange (7)
>   +- * HashAggregate (6)
>  +- * Project (5)
> +- * ColumnarToRow (4)
>+- Scan parquet 
> ecom_dwm.dwm_user_app_action_sum_all (3)
> (1) InMemoryTableScan
> Output [1]: [appid#65]
> Arguments: [appid#65]
> (2) InMemoryRelation
> Arguments: [appid#65], 
> CachedRDDBuilder(org.apache.spark.sql.execution.columnar.DefaultCachedBatchSerializer@ab5af13,StorageLevel(disk,
>  memory, deserialized, 1 replicas),*(2) HashAggregate(keys=[appid#65], 
> functions=[], output=[appid#65])
> +- Exchange hashpartitioning(appid#65, 200), ENSURE_REQUIREMENTS, [plan_id=24]
>+- *(1) HashAggregate(keys=[appid#65], functions=[], output=[appid#65])
>   +- *(1) Project [appid#65]
>  +- *(1) ColumnarToRow
> +- FileScan parquet 
> ecom_dwm.dwm_user_app_action_sum_all[dt#63,hour#64,appid#65] Batched: true, 
> DataFilters: [], Format: Parquet, Location: InMemoryFileIndex(0 paths)[], 
> PartitionFilters: [isnotnull(dt#63), isnotnull(appid#65), (dt#63 = 20230412), 
> StartsWith(appid#65, com)], PushedFilters: [], ReadSchema: struct<>
> ,None)
> (3) Scan parquet ecom_dwm.dwm_user_app_action_sum_all
> Output [3]: [dt#63, hour#64, appid#65]
> Batched: true
> Location: InMemoryFileIndex []
> PartitionFilters: [isnotnull(dt#63), isnotnull(appid#65), (dt#63 = 20230412), 
> StartsWith(appid#65, com)]
> ReadSchema: struct<>
> (4) ColumnarToRow [codegen id : 1]
> Input [3]: [dt#63, hour#64, appid#65]
> (5) Project [codegen id : 1]
> Output [1]: [appid#65]
> Input [3]: [dt#63, hour#64, appid#65]
> (6) HashAggregate [codegen id : 1]
> Input [1]: [appid#65]
> Keys [1]: [appid#65]
> Functions: []
> Aggregate Attributes: []
> Results [1]: [appid#65]
> (7) Exchange
> Input [1]: [appid#65]
> Arguments: hashpartitioning(appid#65, 200), ENSURE_REQUIREMENTS, [plan_id=24]
> (8) HashAggregate [codegen id : 2]
> Input [1]: [appid#65]
> Keys [1]: [appid#65]
> Functions: []
> Aggregate Attributes: []
> Results [1]: [appid#65]
> (9) CollectLimit
> Input [1]: [appid#65]
> Arguments: 1 {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-43235) ClientDistributedCacheManager doesn't set the LocalResourceVisibility.PRIVATE if isPublic throws exception

2023-04-28 Thread Pralabh Kumar (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17717566#comment-17717566
 ] 

Pralabh Kumar commented on SPARK-43235:
---

[~gurwls223] Can you please look into this?

> ClientDistributedCacheManager doesn't set the LocalResourceVisibility.PRIVATE 
> if isPublic throws exception
> --
>
> Key: SPARK-43235
> URL: https://issues.apache.org/jira/browse/SPARK-43235
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Pralabh Kumar
>Priority: Minor
>
> Hi Spark team,
> Currently the *ClientDistributedCacheManager* *getVisibility* method checks 
> whether a resource's visibility can be set to private or public. 
> In order to set *LocalResourceVisibility.PUBLIC*, isPublic checks the 
> permissions of all the ancestor directories of the executable directory. It 
> goes up to the root folder to check the permissions of all the parents 
> (ancestorsHaveExecutePermissions).
> checkPermissionOfOther calls FileStatus getFileStatus to check the 
> permission.
> If getFileStatus throws an exception, spark-submit fails. It does not fall 
> back to setting the visibility to PRIVATE.
> if (isPublic(conf, uri, statCache)) {
>   LocalResourceVisibility.PUBLIC
> } else {
>   LocalResourceVisibility.PRIVATE
> }
> Generally, if the user does not have permission to check the root folder 
> (specifically on cloud file systems such as GCS, for the buckets), the method 
> throws IOException ("Error accessing Bucket").
>  
> *Ideally, if there is an error in isPublic, which means Spark is not able to 
> determine the execute permission of all the parent directories, it should set 
> LocalResourceVisibility.PRIVATE. However, it currently throws an exception in 
> isPublic and hence spark-submit fails.*
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-43306) Migrate `ValueError` from Spark SQL types into error class

2023-04-28 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17717565#comment-17717565
 ] 

ASF GitHub Bot commented on SPARK-43306:


User 'itholic' has created a pull request for this issue:
https://github.com/apache/spark/pull/40975

> Migrate `ValueError` from Spark SQL types into error class
> --
>
> Key: SPARK-43306
> URL: https://issues.apache.org/jira/browse/SPARK-43306
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.5.0
>Reporter: Haejoon Lee
>Priority: Major
>
> Migrate `ValueError` from Spark SQL types into error class



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-43304) Enable test_to_latex by supporting jinja2>=3.0.0

2023-04-28 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17717564#comment-17717564
 ] 

ASF GitHub Bot commented on SPARK-43304:


User 'itholic' has created a pull request for this issue:
https://github.com/apache/spark/pull/40973

> Enable test_to_latex by supporting jinja2>=3.0.0
> 
>
> Key: SPARK-43304
> URL: https://issues.apache.org/jira/browse/SPARK-43304
> Project: Spark
>  Issue Type: Sub-task
>  Components: Pandas API on Spark
>Affects Versions: 3.5.0
>Reporter: Haejoon Lee
>Priority: Major
>
> Please refer to [https://github.com/pandas-dev/pandas/pull/47970] see more 
> detail.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-43304) Enable test_to_latex by supporting jinja2>=3.0.0

2023-04-28 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17717563#comment-17717563
 ] 

ASF GitHub Bot commented on SPARK-43304:


User 'itholic' has created a pull request for this issue:
https://github.com/apache/spark/pull/40973

> Enable test_to_latex by supporting jinja2>=3.0.0
> 
>
> Key: SPARK-43304
> URL: https://issues.apache.org/jira/browse/SPARK-43304
> Project: Spark
>  Issue Type: Sub-task
>  Components: Pandas API on Spark
>Affects Versions: 3.5.0
>Reporter: Haejoon Lee
>Priority: Major
>
> Please refer to [https://github.com/pandas-dev/pandas/pull/47970] see more 
> detail.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42843) Assign a name to the error class _LEGACY_ERROR_TEMP_2007

2023-04-28 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17717555#comment-17717555
 ] 

ASF GitHub Bot commented on SPARK-42843:


User 'liang3zy22' has created a pull request for this issue:
https://github.com/apache/spark/pull/40955

> Assign a name to the error class _LEGACY_ERROR_TEMP_2007
> 
>
> Key: SPARK-42843
> URL: https://issues.apache.org/jira/browse/SPARK-42843
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Max Gekk
>Priority: Minor
>  Labels: starter
>
> Choose a proper name for the error class *_LEGACY_ERROR_TEMP_2007* defined in 
> {*}core/src/main/resources/error/error-classes.json{*}. The name should be 
> short but complete (look at the examples in error-classes.json).
> Add a test which triggers the error from user code if such a test doesn't 
> exist yet. Check exception fields by using {*}checkError(){*}. That function 
> checks only the valuable error fields and avoids depending on the error text 
> message. This way, tech editors can modify the error format in 
> error-classes.json without worrying about Spark's internal tests. Migrate 
> other tests that might trigger the error onto checkError().
> If you cannot reproduce the error from user space (using a SQL query), 
> replace the error with an internal error; see 
> {*}SparkException.internalError(){*}.
> Improve the error message format in error-classes.json if the current one is 
> not clear. Propose a solution so users know how to avoid and fix such errors.
> Please look at the PRs below as examples:
>  * [https://github.com/apache/spark/pull/38685]
>  * [https://github.com/apache/spark/pull/38656]
>  * [https://github.com/apache/spark/pull/38490]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-43317) Support combine adjacent aggregation

2023-04-28 Thread XiDuo You (Jira)
XiDuo You created SPARK-43317:
-

 Summary: Support combine adjacent aggregation
 Key: SPARK-43317
 URL: https://issues.apache.org/jira/browse/SPARK-43317
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.5.0
Reporter: XiDuo You


If there are adjacent aggregations with Partial and Final modes, we can combine 
them into a single Complete-mode aggregation.
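
For illustration, a hedged PySpark sketch of where this shows up: when the 
child's output partitioning already satisfies the aggregation's required 
distribution, no extra Exchange is inserted and the Partial and Final 
HashAggregates end up adjacent (the exact plan shape may vary by version):

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.range(1000).withColumnRenamed("id", "k")

# repartition("k") hash-partitions by k, so the groupBy below needs no extra
# Exchange; the physical plan shows a Partial HashAggregate directly under a
# Final HashAggregate, the pair this issue proposes to combine into Complete.
df.repartition("k").groupBy("k").count().explain()
{code}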



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-43316) Add more CTE SQL tests

2023-04-28 Thread Runyao.Chen (Jira)
Runyao.Chen created SPARK-43316:
---

 Summary: Add more CTE SQL tests
 Key: SPARK-43316
 URL: https://issues.apache.org/jira/browse/SPARK-43316
 Project: Spark
  Issue Type: Test
  Components: SQL
Affects Versions: 3.4.0
Reporter: Runyao.Chen


CTEs are a hot area in terms of regressions and need more test coverage.

We can borrow test cases from other open-source DBMSs (Postgres, ZetaSQL, DuckDB).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org