[jira] [Resolved] (SPARK-43165) Move canWrite to DataTypeUtils
[ https://issues.apache.org/jira/browse/SPARK-43165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-43165. - Fix Version/s: 3.5.0 Resolution: Fixed Issue resolved by pull request 40825 [https://github.com/apache/spark/pull/40825] > Move canWrite to DataTypeUtils > -- > > Key: SPARK-43165 > URL: https://issues.apache.org/jira/browse/SPARK-43165 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.5.0 >Reporter: Rui Wang >Assignee: Rui Wang >Priority: Major > Fix For: 3.5.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43179) Add option for applications to control saving of metadata in the External Shuffle Service LevelDB
[ https://issues.apache.org/jira/browse/SPARK-43179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chandni Singh updated SPARK-43179: -- Summary: Add option for applications to control saving of metadata in the External Shuffle Service LevelDB (was: Add option for applications to control saving of metadata in External Shuffle Service LevelDB) > Add option for applications to control saving of metadata in the External > Shuffle Service LevelDB > - > > Key: SPARK-43179 > URL: https://issues.apache.org/jira/browse/SPARK-43179 > Project: Spark > Issue Type: Improvement > Components: Shuffle >Affects Versions: 3.4.0 >Reporter: Chandni Singh >Priority: Major > > Currently, the External Shuffle Service stores application metadata in > LevelDB. This is necessary to enable the shuffle server to resume serving > shuffle data for an application whose executors registered before the > NodeManager restarts. However, the metadata includes the application secret, > which is stored in LevelDB without encryption. This is a potential security > risk, particularly for applications with high security requirements. While > filesystem access control lists (ACLs) can help protect keys and > certificates, they may not be sufficient for some use cases. In response, we > have decided not to store metadata for these high-security applications in > LevelDB. As a result, these applications may experience more failures in the > event of a node restart, but we believe this trade-off is acceptable given > the increased security risk. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43179) Add option for applications to control saving of metadata in External Shuffle Service LevelDB
[ https://issues.apache.org/jira/browse/SPARK-43179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chandni Singh updated SPARK-43179: -- Summary: Add option for applications to control saving of metadata in External Shuffle Service LevelDB (was: Allow applications to control whether their metadata gets saved by the shuffle server in the db) > Add option for applications to control saving of metadata in External Shuffle > Service LevelDB > - > > Key: SPARK-43179 > URL: https://issues.apache.org/jira/browse/SPARK-43179 > Project: Spark > Issue Type: Improvement > Components: Shuffle >Affects Versions: 3.4.0 >Reporter: Chandni Singh >Priority: Major > > Currently, the External Shuffle Service stores application metadata in > LevelDB. This is necessary to enable the shuffle server to resume serving > shuffle data for an application whose executors registered before the > NodeManager restarts. However, the metadata includes the application secret, > which is stored in LevelDB without encryption. This is a potential security > risk, particularly for applications with high security requirements. While > filesystem access control lists (ACLs) can help protect keys and > certificates, they may not be sufficient for some use cases. In response, we > have decided not to store metadata for these high-security applications in > LevelDB. As a result, these applications may experience more failures in the > event of a node restart, but we believe this trade-off is acceptable given > the increased security risk. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-43179) Allow applications to control whether their metadata gets saved by the shuffle server in the db
Chandni Singh created SPARK-43179: - Summary: Allow applications to control whether their metadata gets saved by the shuffle server in the db Key: SPARK-43179 URL: https://issues.apache.org/jira/browse/SPARK-43179 Project: Spark Issue Type: Improvement Components: Shuffle Affects Versions: 3.4.0 Reporter: Chandni Singh Currently, the External Shuffle Service stores application metadata in LevelDB. This is necessary to enable the shuffle server to resume serving shuffle data for an application whose executors registered before the NodeManager restarts. However, the metadata includes the application secret, which is stored in LevelDB without encryption. This is a potential security risk, particularly for applications with high security requirements. While filesystem access control lists (ACLs) can help protect keys and certificates, they may not be sufficient for some use cases. In response, we have decided not to store metadata for these high-security applications in LevelDB. As a result, these applications may experience more failures in the event of a node restart, but we believe this trade-off is acceptable given the increased security risk. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
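[Editor's note] The trade-off described above can be illustrated with a minimal Python sketch. All names here (`AppInfo`, `save_to_db`, the dict standing in for LevelDB) are hypothetical and are not Spark's actual shuffle-service API; the point is only that opting an app out of the recovery store keeps its secret out of the DB at the cost of losing that app across a server restart.

```python
# Sketch (hypothetical names, not Spark's implementation): an opt-out flag
# that gates whether an app's metadata is persisted in the recovery store.

class AppInfo:
    def __init__(self, app_id, secret, save_to_db=True):
        self.app_id = app_id
        self.secret = secret
        self.save_to_db = save_to_db  # hypothetical per-application opt-out

class ShuffleServer:
    def __init__(self):
        self.live_apps = {}    # in-memory state, lost on restart
        self.recovery_db = {}  # stands in for LevelDB; survives restart

    def register(self, app):
        self.live_apps[app.app_id] = app
        if app.save_to_db:
            # the plaintext secret is what lands in the recovery store
            self.recovery_db[app.app_id] = app.secret

    def restart(self):
        # NodeManager restart: only DB-backed apps can be re-served
        self.live_apps = {
            app_id: AppInfo(app_id, secret)
            for app_id, secret in self.recovery_db.items()
        }

server = ShuffleServer()
server.register(AppInfo("app-1", "s3cret"))                       # normal app
server.register(AppInfo("app-2", "t0psecret", save_to_db=False))  # high-security app
server.restart()
print(sorted(server.live_apps))  # only app-1 survives the restart
```

The high-security app's secret never reaches the store, which is exactly the security/availability trade-off the issue accepts.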
[jira] [Commented] (SPARK-35877) Spark Protobuf jar has CVE issue CVE-2015-5237
[ https://issues.apache.org/jira/browse/SPARK-35877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17713891#comment-17713891 ] Abhay Dandekar commented on SPARK-35877: Dear team, any target version for this protobuf upgrade? I checked in the latest SPARK (spark-3.3.2-bin-hadoop3), and it is still using protobuf-java-2.5.0.jar. Thank you. > Spark Protobuf jar has CVE issue CVE-2015-5237 > -- > > Key: SPARK-35877 > URL: https://issues.apache.org/jira/browse/SPARK-35877 > Project: Spark > Issue Type: Bug > Components: Security, Spark Core >Affects Versions: 2.4.5, 3.1.1 >Reporter: jobit mathew >Priority: Minor > > Spark Protobuf jar has CVE issue CVE-2015-5237
[jira] [Commented] (SPARK-42845) Assign a name to the error class _LEGACY_ERROR_TEMP_2010
[ https://issues.apache.org/jira/browse/SPARK-42845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17713886#comment-17713886 ] Snoot.io commented on SPARK-42845: -- User 'liang3zy22' has created a pull request for this issue: https://github.com/apache/spark/pull/40817 > Assign a name to the error class _LEGACY_ERROR_TEMP_2010 > > > Key: SPARK-42845 > URL: https://issues.apache.org/jira/browse/SPARK-42845 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.5.0 >Reporter: Max Gekk >Priority: Minor > Labels: starter > > Choose a proper name for the error class *_LEGACY_ERROR_TEMP_2010* defined in > {*}core/src/main/resources/error/error-classes.json{*}. The name should be > short but complete (look at the example in error-classes.json). > Add a test which triggers the error from user code if such test still doesn't > exist. Check exception fields by using {*}checkError(){*}. The last function > checks valuable error fields only, and avoids dependencies from error text > message. In this way, tech editors can modify error format in > error-classes.json, and don't worry of Spark's internal tests. Migrate other > tests that might trigger the error onto checkError(). > If you cannot reproduce the error from user space (using SQL query), replace > the error by an internal error, see {*}SparkException.internalError(){*}. > Improve the error message format in error-classes.json if the current is not > clear. Propose a solution to users how to avoid and fix such kind of errors. > Please, look at the PR below as examples: > * [https://github.com/apache/spark/pull/38685] > * [https://github.com/apache/spark/pull/38656] > * [https://github.com/apache/spark/pull/38490] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
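[Editor's note] The checkError() pattern the ticket asks for can be sketched in plain Python. This is an illustration of the idea, not Spark's actual ScalaTest API: the test asserts on the stable error class and parameters, never on the rendered message, so editors can reword templates in error-classes.json without breaking tests.

```python
# Sketch of the checkError() idea: compare an error's class and parameters,
# not its formatted message text. Class/function names are illustrative.

class SparkLikeError(Exception):
    def __init__(self, error_class, parameters):
        self.error_class = error_class
        self.parameters = parameters
        # the human-readable text is rendered from a template and may change
        super().__init__(f"[{error_class}] {parameters}")

def check_error(exc, error_class, parameters):
    # assert only on the stable, machine-readable fields
    assert exc.error_class == error_class, exc.error_class
    assert exc.parameters == parameters, exc.parameters

try:
    raise SparkLikeError("DIVIDE_BY_ZERO", {"config": "spark.sql.ansi.enabled"})
except SparkLikeError as e:
    check_error(e, "DIVIDE_BY_ZERO", {"config": "spark.sql.ansi.enabled"})
    print("ok")
```

A test written this way keeps passing even if the message template for the class is later rewritten.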
[jira] [Commented] (SPARK-42845) Assign a name to the error class _LEGACY_ERROR_TEMP_2010
[ https://issues.apache.org/jira/browse/SPARK-42845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17713885#comment-17713885 ] Snoot.io commented on SPARK-42845: -- User 'liang3zy22' has created a pull request for this issue: https://github.com/apache/spark/pull/40817 > Assign a name to the error class _LEGACY_ERROR_TEMP_2010 > > > Key: SPARK-42845 > URL: https://issues.apache.org/jira/browse/SPARK-42845 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.5.0 >Reporter: Max Gekk >Priority: Minor > Labels: starter > > Choose a proper name for the error class *_LEGACY_ERROR_TEMP_2010* defined in > {*}core/src/main/resources/error/error-classes.json{*}. The name should be > short but complete (look at the example in error-classes.json). > Add a test which triggers the error from user code if such test still doesn't > exist. Check exception fields by using {*}checkError(){*}. The last function > checks valuable error fields only, and avoids dependencies from error text > message. In this way, tech editors can modify error format in > error-classes.json, and don't worry of Spark's internal tests. Migrate other > tests that might trigger the error onto checkError(). > If you cannot reproduce the error from user space (using SQL query), replace > the error by an internal error, see {*}SparkException.internalError(){*}. > Improve the error message format in error-classes.json if the current is not > clear. Propose a solution to users how to avoid and fix such kind of errors. > Please, look at the PR below as examples: > * [https://github.com/apache/spark/pull/38685] > * [https://github.com/apache/spark/pull/38656] > * [https://github.com/apache/spark/pull/38490] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-43170) The spark sql like statement is pushed down to parquet for execution, but the data cannot be queried
[ https://issues.apache.org/jira/browse/SPARK-43170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang resolved SPARK-43170. - Resolution: Not A Bug > The spark sql like statement is pushed down to parquet for execution, but the > data cannot be queried > > > Key: SPARK-43170 > URL: https://issues.apache.org/jira/browse/SPARK-43170 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.2 >Reporter: todd >Priority: Major > Attachments: image-2023-04-18-10-59-30-199.png, > image-2023-04-19-10-59-44-118.png, screenshot-1.png > > > --DDL > CREATE TABLE `ecom_dwm`.`dwm_user_app_action_sum_all` ( > `gaid` STRING COMMENT '', > `beyla_id` STRING COMMENT '', > `dt` STRING, > `hour` STRING, > `appid` STRING COMMENT '包名') > USING parquet > PARTITIONED BY (dt, hour, appid) > LOCATION 's3://x/dwm_user_app_action_sum_all' > – partitions info > show partitions ecom_dwm.dwm_user_app_action_sum_all PARTITION > (dt='20230412'); > > dt=20230412/hour=23/appid=blibli.mobile.commerce > dt=20230412/hour=23/appid=cn.shopee.app > dt=20230412/hour=23/appid=cn.shopee.br > dt=20230412/hour=23/appid=cn.shopee.id > dt=20230412/hour=23/appid=cn.shopee.my > dt=20230412/hour=23/appid=cn.shopee.ph > > — query > select DISTINCT(appid) from ecom_dwm.dwm_user_app_action_sum_all > where dt='20230412' and appid like '%shopee%' > > --result > nodata > > — other > I use spark3.0.1 version and trino query engine to query the data。 > > > The physical execution node formed by spark 3.2 > (3) Scan parquet ecom_dwm.dwm_user_app_action_sum_all Output [3]: [dt#63, > hour#64, appid#65|#63, hour#64, appid#65] Batched: true Location: > InMemoryFileIndex [] > PartitionFilters: [isnotnull(dt#63), isnotnull(appid#65), (dt#63 = 20230412), > Contains(appid#65, shopee)|#63), isnotnull(appid#65), (dt#63 = 20230412), > Contains(appid#65, shopee)] ReadSchema: struct<> > > > !image-2023-04-18-10-59-30-199.png! 
> > – sql plan detail > {code:java} > == Physical Plan == > CollectLimit (9) > +- InMemoryTableScan (1) > +- InMemoryRelation (2) > +- * HashAggregate (8) >+- Exchange (7) > +- * HashAggregate (6) > +- * Project (5) > +- * ColumnarToRow (4) >+- Scan parquet > ecom_dwm.dwm_user_app_action_sum_all (3) > (1) InMemoryTableScan > Output [1]: [appid#65] > Arguments: [appid#65] > (2) InMemoryRelation > Arguments: [appid#65], > CachedRDDBuilder(org.apache.spark.sql.execution.columnar.DefaultCachedBatchSerializer@ab5af13,StorageLevel(disk, > memory, deserialized, 1 replicas),*(2) HashAggregate(keys=[appid#65], > functions=[], output=[appid#65]) > +- Exchange hashpartitioning(appid#65, 200), ENSURE_REQUIREMENTS, [plan_id=24] >+- *(1) HashAggregate(keys=[appid#65], functions=[], output=[appid#65]) > +- *(1) Project [appid#65] > +- *(1) ColumnarToRow > +- FileScan parquet > ecom_dwm.dwm_user_app_action_sum_all[dt#63,hour#64,appid#65] Batched: true, > DataFilters: [], Format: Parquet, Location: InMemoryFileIndex(0 paths)[], > PartitionFilters: [isnotnull(dt#63), isnotnull(appid#65), (dt#63 = 20230412), > StartsWith(appid#65, com)], PushedFilters: [], ReadSchema: struct<> > ,None) > (3) Scan parquet ecom_dwm.dwm_user_app_action_sum_all > Output [3]: [dt#63, hour#64, appid#65] > Batched: true > Location: InMemoryFileIndex [] > PartitionFilters: [isnotnull(dt#63), isnotnull(appid#65), (dt#63 = 20230412), > StartsWith(appid#65, com)] > ReadSchema: struct<> > (4) ColumnarToRow [codegen id : 1] > Input [3]: [dt#63, hour#64, appid#65] > (5) Project [codegen id : 1] > Output [1]: [appid#65] > Input [3]: [dt#63, hour#64, appid#65] > (6) HashAggregate [codegen id : 1] > Input [1]: [appid#65] > Keys [1]: [appid#65] > Functions: [] > Aggregate Attributes: [] > Results [1]: [appid#65] > (7) Exchange > Input [1]: [appid#65] > Arguments: hashpartitioning(appid#65, 200), ENSURE_REQUIREMENTS, [plan_id=24] > (8) HashAggregate [codegen id : 2] > Input [1]: [appid#65] > Keys [1]: [appid#65] > 
Functions: [] > Aggregate Attributes: [] > Results [1]: [appid#65] > (9) CollectLimit > Input [1]: [appid#65] > Arguments: 1 {code}
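[Editor's note] The plan above shows a LIKE predicate arriving at the scan as a pushable filter such as Contains(appid#65, shopee) or StartsWith(appid#65, com). A simplified Python mimic of that rewrite (Spark's actual rule lives in the Catalyst optimizer; this sketch handles only the simple wildcard shapes and is not Spark code):

```python
# Simplified mimic of rewriting an escape-free LIKE pattern into a
# pushable filter: '%x%' -> Contains, 'x%' -> StartsWith, '%x' -> EndsWith,
# no wildcard -> EqualTo. Patterns this sketch can't handle stay as Like.

def simplify_like(pattern):
    if "_" in pattern:
        return ("Like", pattern)       # single-char wildcard: not handled here
    stripped = pattern.strip("%")
    if "%" in stripped:
        return ("Like", pattern)       # interior %: not handled in this sketch
    if pattern.startswith("%") and pattern.endswith("%"):
        return ("Contains", stripped)
    if pattern.endswith("%"):
        return ("StartsWith", stripped)
    if pattern.startswith("%"):
        return ("EndsWith", stripped)
    return ("EqualTo", pattern)

print(simplify_like("%shopee%"))  # ('Contains', 'shopee')
print(simplify_like("com%"))      # ('StartsWith', 'com')
```

This is why the query's `appid like '%shopee%'` shows up in the scan node as a Contains partition filter rather than a raw LIKE.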
[jira] [Updated] (SPARK-43170) The spark sql like statement is pushed down to parquet for execution, but the data cannot be queried
[ https://issues.apache.org/jira/browse/SPARK-43170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-43170: Attachment: screenshot-1.png > The spark sql like statement is pushed down to parquet for execution, but the > data cannot be queried > > > Key: SPARK-43170 > URL: https://issues.apache.org/jira/browse/SPARK-43170 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.2 >Reporter: todd >Priority: Major > Attachments: image-2023-04-18-10-59-30-199.png, > image-2023-04-19-10-59-44-118.png, screenshot-1.png > > > --DDL > CREATE TABLE `ecom_dwm`.`dwm_user_app_action_sum_all` ( > `gaid` STRING COMMENT '', > `beyla_id` STRING COMMENT '', > `dt` STRING, > `hour` STRING, > `appid` STRING COMMENT '包名') > USING parquet > PARTITIONED BY (dt, hour, appid) > LOCATION 's3://x/dwm_user_app_action_sum_all' > – partitions info > show partitions ecom_dwm.dwm_user_app_action_sum_all PARTITION > (dt='20230412'); > > dt=20230412/hour=23/appid=blibli.mobile.commerce > dt=20230412/hour=23/appid=cn.shopee.app > dt=20230412/hour=23/appid=cn.shopee.br > dt=20230412/hour=23/appid=cn.shopee.id > dt=20230412/hour=23/appid=cn.shopee.my > dt=20230412/hour=23/appid=cn.shopee.ph > > — query > select DISTINCT(appid) from ecom_dwm.dwm_user_app_action_sum_all > where dt='20230412' and appid like '%shopee%' > > --result > nodata > > — other > I use spark3.0.1 version and trino query engine to query the data。 > > > The physical execution node formed by spark 3.2 > (3) Scan parquet ecom_dwm.dwm_user_app_action_sum_all Output [3]: [dt#63, > hour#64, appid#65|#63, hour#64, appid#65] Batched: true Location: > InMemoryFileIndex [] > PartitionFilters: [isnotnull(dt#63), isnotnull(appid#65), (dt#63 = 20230412), > Contains(appid#65, shopee)|#63), isnotnull(appid#65), (dt#63 = 20230412), > Contains(appid#65, shopee)] ReadSchema: struct<> > > > !image-2023-04-18-10-59-30-199.png! 
> > – sql plan detail > {code:java} > == Physical Plan == > CollectLimit (9) > +- InMemoryTableScan (1) > +- InMemoryRelation (2) > +- * HashAggregate (8) >+- Exchange (7) > +- * HashAggregate (6) > +- * Project (5) > +- * ColumnarToRow (4) >+- Scan parquet > ecom_dwm.dwm_user_app_action_sum_all (3) > (1) InMemoryTableScan > Output [1]: [appid#65] > Arguments: [appid#65] > (2) InMemoryRelation > Arguments: [appid#65], > CachedRDDBuilder(org.apache.spark.sql.execution.columnar.DefaultCachedBatchSerializer@ab5af13,StorageLevel(disk, > memory, deserialized, 1 replicas),*(2) HashAggregate(keys=[appid#65], > functions=[], output=[appid#65]) > +- Exchange hashpartitioning(appid#65, 200), ENSURE_REQUIREMENTS, [plan_id=24] >+- *(1) HashAggregate(keys=[appid#65], functions=[], output=[appid#65]) > +- *(1) Project [appid#65] > +- *(1) ColumnarToRow > +- FileScan parquet > ecom_dwm.dwm_user_app_action_sum_all[dt#63,hour#64,appid#65] Batched: true, > DataFilters: [], Format: Parquet, Location: InMemoryFileIndex(0 paths)[], > PartitionFilters: [isnotnull(dt#63), isnotnull(appid#65), (dt#63 = 20230412), > StartsWith(appid#65, com)], PushedFilters: [], ReadSchema: struct<> > ,None) > (3) Scan parquet ecom_dwm.dwm_user_app_action_sum_all > Output [3]: [dt#63, hour#64, appid#65] > Batched: true > Location: InMemoryFileIndex [] > PartitionFilters: [isnotnull(dt#63), isnotnull(appid#65), (dt#63 = 20230412), > StartsWith(appid#65, com)] > ReadSchema: struct<> > (4) ColumnarToRow [codegen id : 1] > Input [3]: [dt#63, hour#64, appid#65] > (5) Project [codegen id : 1] > Output [1]: [appid#65] > Input [3]: [dt#63, hour#64, appid#65] > (6) HashAggregate [codegen id : 1] > Input [1]: [appid#65] > Keys [1]: [appid#65] > Functions: [] > Aggregate Attributes: [] > Results [1]: [appid#65] > (7) Exchange > Input [1]: [appid#65] > Arguments: hashpartitioning(appid#65, 200), ENSURE_REQUIREMENTS, [plan_id=24] > (8) HashAggregate [codegen id : 2] > Input [1]: [appid#65] > Keys [1]: [appid#65] > 
Functions: [] > Aggregate Attributes: [] > Results [1]: [appid#65] > (9) CollectLimit > Input [1]: [appid#65] > Arguments: 1 {code}
[jira] [Commented] (SPARK-43170) The spark sql like statement is pushed down to parquet for execution, but the data cannot be queried
[ https://issues.apache.org/jira/browse/SPARK-43170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17713861#comment-17713861 ] Yuming Wang commented on SPARK-43170: - Maybe your partition exists, but there is no data under the partition, such as the following: !screenshot-1.png! {noformat} yumwang@LM-SHC-16508156 dwm_user_app_action_sum_all2 % ls -R dt=20230412 ./dt=20230412: hour=23 ./dt=20230412/hour=23: appid=blibli.mobile.commerceappid=cn.shopee.br appid=cn.shopee.my appid=cn.shopee.app appid=cn.shopee.id appid=cn.shopee.ph ./dt=20230412/hour=23/appid=blibli.mobile.commerce: ./dt=20230412/hour=23/appid=cn.shopee.app: ./dt=20230412/hour=23/appid=cn.shopee.br: ./dt=20230412/hour=23/appid=cn.shopee.id: ./dt=20230412/hour=23/appid=cn.shopee.my: ./dt=20230412/hour=23/appid=cn.shopee.ph: {noformat} > The spark sql like statement is pushed down to parquet for execution, but the > data cannot be queried > > > Key: SPARK-43170 > URL: https://issues.apache.org/jira/browse/SPARK-43170 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.2 >Reporter: todd >Priority: Major > Attachments: image-2023-04-18-10-59-30-199.png, > image-2023-04-19-10-59-44-118.png, screenshot-1.png > > > --DDL > CREATE TABLE `ecom_dwm`.`dwm_user_app_action_sum_all` ( > `gaid` STRING COMMENT '', > `beyla_id` STRING COMMENT '', > `dt` STRING, > `hour` STRING, > `appid` STRING COMMENT '包名') > USING parquet > PARTITIONED BY (dt, hour, appid) > LOCATION 's3://x/dwm_user_app_action_sum_all' > – partitions info > show partitions ecom_dwm.dwm_user_app_action_sum_all PARTITION > (dt='20230412'); > > dt=20230412/hour=23/appid=blibli.mobile.commerce > dt=20230412/hour=23/appid=cn.shopee.app > dt=20230412/hour=23/appid=cn.shopee.br > dt=20230412/hour=23/appid=cn.shopee.id > dt=20230412/hour=23/appid=cn.shopee.my > dt=20230412/hour=23/appid=cn.shopee.ph > > — query > select DISTINCT(appid) from ecom_dwm.dwm_user_app_action_sum_all > where 
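[Editor's note] The situation in the comment above — a partition that exists in the metastore and matches the filter, but whose directory holds no data files — can be reproduced in miniature without Spark. This sketch only simulates the pruning-then-read sequence; the directory layout mirrors the `ls -R` output, and the scan logic is illustrative, not Spark's.

```python
import os
import tempfile

# Build the partition layout from the comment, but write no data files
# into the leaf directories.
root = tempfile.mkdtemp()
for appid in ["cn.shopee.app", "blibli.mobile.commerce"]:
    os.makedirs(os.path.join(root, "dt=20230412", "hour=23", f"appid={appid}"))

def scan(root, contains):
    """Prune partition directories by a Contains filter, then list files."""
    rows = []
    for dirpath, _dirs, files in os.walk(root):
        leaf = os.path.basename(dirpath)
        if leaf.startswith("appid=") and contains in leaf:
            rows.extend(files)  # data files would be read here
    return rows

print(scan(root, "shopee"))  # [] -> partition matched, but no data to return
```

The Contains filter correctly selects `appid=cn.shopee.app`, yet the query still returns no rows, matching the "partition exists but is empty" explanation rather than a pushdown bug.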
dt='20230412' and appid like '%shopee%' > > --result > nodata > > — other > I use spark3.0.1 version and trino query engine to query the data。 > > > The physical execution node formed by spark 3.2 > (3) Scan parquet ecom_dwm.dwm_user_app_action_sum_all Output [3]: [dt#63, > hour#64, appid#65|#63, hour#64, appid#65] Batched: true Location: > InMemoryFileIndex [] > PartitionFilters: [isnotnull(dt#63), isnotnull(appid#65), (dt#63 = 20230412), > Contains(appid#65, shopee)|#63), isnotnull(appid#65), (dt#63 = 20230412), > Contains(appid#65, shopee)] ReadSchema: struct<> > > > !image-2023-04-18-10-59-30-199.png! > > – sql plan detail > {code:java} > == Physical Plan == > CollectLimit (9) > +- InMemoryTableScan (1) > +- InMemoryRelation (2) > +- * HashAggregate (8) >+- Exchange (7) > +- * HashAggregate (6) > +- * Project (5) > +- * ColumnarToRow (4) >+- Scan parquet > ecom_dwm.dwm_user_app_action_sum_all (3) > (1) InMemoryTableScan > Output [1]: [appid#65] > Arguments: [appid#65] > (2) InMemoryRelation > Arguments: [appid#65], > CachedRDDBuilder(org.apache.spark.sql.execution.columnar.DefaultCachedBatchSerializer@ab5af13,StorageLevel(disk, > memory, deserialized, 1 replicas),*(2) HashAggregate(keys=[appid#65], > functions=[], output=[appid#65]) > +- Exchange hashpartitioning(appid#65, 200), ENSURE_REQUIREMENTS, [plan_id=24] >+- *(1) HashAggregate(keys=[appid#65], functions=[], output=[appid#65]) > +- *(1) Project [appid#65] > +- *(1) ColumnarToRow > +- FileScan parquet > ecom_dwm.dwm_user_app_action_sum_all[dt#63,hour#64,appid#65] Batched: true, > DataFilters: [], Format: Parquet, Location: InMemoryFileIndex(0 paths)[], > PartitionFilters: [isnotnull(dt#63), isnotnull(appid#65), (dt#63 = 20230412), > StartsWith(appid#65, com)], PushedFilters: [], ReadSchema: struct<> > ,None) > (3) Scan parquet ecom_dwm.dwm_user_app_action_sum_all > Output [3]: [dt#63, hour#64, appid#65] > Batched: true > Location: InMemoryFileIndex [] > PartitionFilters: [isnotnull(dt#63), 
isnotnull(appid#65), (dt#63 = 20230412), > StartsWith(appid#65, com)] > ReadSchema: struct<> > (4) ColumnarToRow [codegen id : 1] > Input [3]: [dt#63, hour#64, appid#65] > (5) Project [codegen id : 1] > Output [1]: [appid#65] > Input [3]: [dt#63, hour#64, appid#65] > (6) HashAggregate [codegen id : 1] > Input [1]: [appid#65] > Keys [1]: [appid#65] > Functions: [] > Aggregate Attributes: [] > Results [1]: [appid#65] > (7) Exchange > Input [1]: [appid#65] > Arguments: hashpartitioning(appid#65, 200), ENSURE_REQU
[jira] [Commented] (SPARK-43170) The spark sql like statement is pushed down to parquet for execution, but the data cannot be queried
[ https://issues.apache.org/jira/browse/SPARK-43170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17713858#comment-17713858 ] Yuming Wang commented on SPARK-43170: - I can't reproduce this issue: !image-2023-04-19-10-59-44-118.png! > The spark sql like statement is pushed down to parquet for execution, but the > data cannot be queried > > > Key: SPARK-43170 > URL: https://issues.apache.org/jira/browse/SPARK-43170 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.2 >Reporter: todd >Priority: Major > Attachments: image-2023-04-18-10-59-30-199.png, > image-2023-04-19-10-59-44-118.png > > > --DDL > CREATE TABLE `ecom_dwm`.`dwm_user_app_action_sum_all` ( > `gaid` STRING COMMENT '', > `beyla_id` STRING COMMENT '', > `dt` STRING, > `hour` STRING, > `appid` STRING COMMENT '包名') > USING parquet > PARTITIONED BY (dt, hour, appid) > LOCATION 's3://x/dwm_user_app_action_sum_all' > – partitions info > show partitions ecom_dwm.dwm_user_app_action_sum_all PARTITION > (dt='20230412'); > > dt=20230412/hour=23/appid=blibli.mobile.commerce > dt=20230412/hour=23/appid=cn.shopee.app > dt=20230412/hour=23/appid=cn.shopee.br > dt=20230412/hour=23/appid=cn.shopee.id > dt=20230412/hour=23/appid=cn.shopee.my > dt=20230412/hour=23/appid=cn.shopee.ph > > — query > select DISTINCT(appid) from ecom_dwm.dwm_user_app_action_sum_all > where dt='20230412' and appid like '%shopee%' > > --result > nodata > > — other > I use spark3.0.1 version and trino query engine to query the data。 > > > The physical execution node formed by spark 3.2 > (3) Scan parquet ecom_dwm.dwm_user_app_action_sum_all Output [3]: [dt#63, > hour#64, appid#65|#63, hour#64, appid#65] Batched: true Location: > InMemoryFileIndex [] > PartitionFilters: [isnotnull(dt#63), isnotnull(appid#65), (dt#63 = 20230412), > Contains(appid#65, shopee)|#63), isnotnull(appid#65), (dt#63 = 20230412), > Contains(appid#65, shopee)] ReadSchema: struct<> > > > 
!image-2023-04-18-10-59-30-199.png! > > – sql plan detail > {code:java} > == Physical Plan == > CollectLimit (9) > +- InMemoryTableScan (1) > +- InMemoryRelation (2) > +- * HashAggregate (8) >+- Exchange (7) > +- * HashAggregate (6) > +- * Project (5) > +- * ColumnarToRow (4) >+- Scan parquet > ecom_dwm.dwm_user_app_action_sum_all (3) > (1) InMemoryTableScan > Output [1]: [appid#65] > Arguments: [appid#65] > (2) InMemoryRelation > Arguments: [appid#65], > CachedRDDBuilder(org.apache.spark.sql.execution.columnar.DefaultCachedBatchSerializer@ab5af13,StorageLevel(disk, > memory, deserialized, 1 replicas),*(2) HashAggregate(keys=[appid#65], > functions=[], output=[appid#65]) > +- Exchange hashpartitioning(appid#65, 200), ENSURE_REQUIREMENTS, [plan_id=24] >+- *(1) HashAggregate(keys=[appid#65], functions=[], output=[appid#65]) > +- *(1) Project [appid#65] > +- *(1) ColumnarToRow > +- FileScan parquet > ecom_dwm.dwm_user_app_action_sum_all[dt#63,hour#64,appid#65] Batched: true, > DataFilters: [], Format: Parquet, Location: InMemoryFileIndex(0 paths)[], > PartitionFilters: [isnotnull(dt#63), isnotnull(appid#65), (dt#63 = 20230412), > StartsWith(appid#65, com)], PushedFilters: [], ReadSchema: struct<> > ,None) > (3) Scan parquet ecom_dwm.dwm_user_app_action_sum_all > Output [3]: [dt#63, hour#64, appid#65] > Batched: true > Location: InMemoryFileIndex [] > PartitionFilters: [isnotnull(dt#63), isnotnull(appid#65), (dt#63 = 20230412), > StartsWith(appid#65, com)] > ReadSchema: struct<> > (4) ColumnarToRow [codegen id : 1] > Input [3]: [dt#63, hour#64, appid#65] > (5) Project [codegen id : 1] > Output [1]: [appid#65] > Input [3]: [dt#63, hour#64, appid#65] > (6) HashAggregate [codegen id : 1] > Input [1]: [appid#65] > Keys [1]: [appid#65] > Functions: [] > Aggregate Attributes: [] > Results [1]: [appid#65] > (7) Exchange > Input [1]: [appid#65] > Arguments: hashpartitioning(appid#65, 200), ENSURE_REQUIREMENTS, [plan_id=24] > (8) HashAggregate [codegen id : 2] > Input [1]: 
[appid#65] > Keys [1]: [appid#65] > Functions: [] > Aggregate Attributes: [] > Results [1]: [appid#65] > (9) CollectLimit > Input [1]: [appid#65] > Arguments: 1 {code}
[jira] [Updated] (SPARK-43170) The spark sql like statement is pushed down to parquet for execution, but the data cannot be queried
[ https://issues.apache.org/jira/browse/SPARK-43170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-43170: Attachment: image-2023-04-19-10-59-44-118.png > The spark sql like statement is pushed down to parquet for execution, but the > data cannot be queried > > > Key: SPARK-43170 > URL: https://issues.apache.org/jira/browse/SPARK-43170 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.2 >Reporter: todd >Priority: Major > Attachments: image-2023-04-18-10-59-30-199.png, > image-2023-04-19-10-59-44-118.png > > > --DDL > CREATE TABLE `ecom_dwm`.`dwm_user_app_action_sum_all` ( > `gaid` STRING COMMENT '', > `beyla_id` STRING COMMENT '', > `dt` STRING, > `hour` STRING, > `appid` STRING COMMENT '包名') > USING parquet > PARTITIONED BY (dt, hour, appid) > LOCATION 's3://x/dwm_user_app_action_sum_all' > – partitions info > show partitions ecom_dwm.dwm_user_app_action_sum_all PARTITION > (dt='20230412'); > > dt=20230412/hour=23/appid=blibli.mobile.commerce > dt=20230412/hour=23/appid=cn.shopee.app > dt=20230412/hour=23/appid=cn.shopee.br > dt=20230412/hour=23/appid=cn.shopee.id > dt=20230412/hour=23/appid=cn.shopee.my > dt=20230412/hour=23/appid=cn.shopee.ph > > — query > select DISTINCT(appid) from ecom_dwm.dwm_user_app_action_sum_all > where dt='20230412' and appid like '%shopee%' > > --result > nodata > > — other > I use spark3.0.1 version and trino query engine to query the data。 > > > The physical execution node formed by spark 3.2 > (3) Scan parquet ecom_dwm.dwm_user_app_action_sum_all Output [3]: [dt#63, > hour#64, appid#65|#63, hour#64, appid#65] Batched: true Location: > InMemoryFileIndex [] > PartitionFilters: [isnotnull(dt#63), isnotnull(appid#65), (dt#63 = 20230412), > Contains(appid#65, shopee)|#63), isnotnull(appid#65), (dt#63 = 20230412), > Contains(appid#65, shopee)] ReadSchema: struct<> > > > !image-2023-04-18-10-59-30-199.png! 
> > – sql plan detail > {code:java} > == Physical Plan == > CollectLimit (9) > +- InMemoryTableScan (1) > +- InMemoryRelation (2) > +- * HashAggregate (8) >+- Exchange (7) > +- * HashAggregate (6) > +- * Project (5) > +- * ColumnarToRow (4) >+- Scan parquet > ecom_dwm.dwm_user_app_action_sum_all (3) > (1) InMemoryTableScan > Output [1]: [appid#65] > Arguments: [appid#65] > (2) InMemoryRelation > Arguments: [appid#65], > CachedRDDBuilder(org.apache.spark.sql.execution.columnar.DefaultCachedBatchSerializer@ab5af13,StorageLevel(disk, > memory, deserialized, 1 replicas),*(2) HashAggregate(keys=[appid#65], > functions=[], output=[appid#65]) > +- Exchange hashpartitioning(appid#65, 200), ENSURE_REQUIREMENTS, [plan_id=24] >+- *(1) HashAggregate(keys=[appid#65], functions=[], output=[appid#65]) > +- *(1) Project [appid#65] > +- *(1) ColumnarToRow > +- FileScan parquet > ecom_dwm.dwm_user_app_action_sum_all[dt#63,hour#64,appid#65] Batched: true, > DataFilters: [], Format: Parquet, Location: InMemoryFileIndex(0 paths)[], > PartitionFilters: [isnotnull(dt#63), isnotnull(appid#65), (dt#63 = 20230412), > StartsWith(appid#65, com)], PushedFilters: [], ReadSchema: struct<> > ,None) > (3) Scan parquet ecom_dwm.dwm_user_app_action_sum_all > Output [3]: [dt#63, hour#64, appid#65] > Batched: true > Location: InMemoryFileIndex [] > PartitionFilters: [isnotnull(dt#63), isnotnull(appid#65), (dt#63 = 20230412), > StartsWith(appid#65, com)] > ReadSchema: struct<> > (4) ColumnarToRow [codegen id : 1] > Input [3]: [dt#63, hour#64, appid#65] > (5) Project [codegen id : 1] > Output [1]: [appid#65] > Input [3]: [dt#63, hour#64, appid#65] > (6) HashAggregate [codegen id : 1] > Input [1]: [appid#65] > Keys [1]: [appid#65] > Functions: [] > Aggregate Attributes: [] > Results [1]: [appid#65] > (7) Exchange > Input [1]: [appid#65] > Arguments: hashpartitioning(appid#65, 200), ENSURE_REQUIREMENTS, [plan_id=24] > (8) HashAggregate [codegen id : 2] > Input [1]: [appid#65] > Keys [1]: [appid#65] > 
Functions: [] > Aggregate Attributes: [] > Results [1]: [appid#65] > (9) CollectLimit > Input [1]: [appid#65] > Arguments: 1 {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
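For context on the pushdown above: a LIKE pattern whose only wildcards are a leading and/or trailing '%' can be rewritten into a plain string predicate before being pushed to the source, which is why `like '%shopee%'` appears as Contains(appid#65, shopee) in the scan node, while the cached-plan dump shows StartsWith(appid#65, com), apparently pasted from a different run. The sketch below is a rough, standalone Python illustration of that rewrite, not Spark's actual implementation (Spark's optimizer rule is LikeSimplification):

```python
# Illustrative sketch: map a SQL LIKE pattern to a simple string predicate
# when its only wildcards are a leading/trailing '%', mirroring the kind of
# rewrite Spark's LikeSimplification rule performs before pushdown.
def simplify_like(pattern: str):
    """Return (predicate_name, needle), or None if the pattern cannot be
    simplified (inner wildcards or '_' single-char wildcards)."""
    if "_" in pattern:
        return None  # '_' needs real LIKE matching
    body = pattern.strip("%")
    if "%" in body:
        return None  # an inner '%' is not a simple prefix/suffix/substring
    if pattern.startswith("%") and pattern.endswith("%"):
        return ("Contains", body)
    if pattern.endswith("%"):
        return ("StartsWith", body)
    if pattern.startswith("%"):
        return ("EndsWith", body)
    return ("EqualTo", body)

print(simplify_like("%shopee%"))  # ('Contains', 'shopee')
print(simplify_like("com%"))      # ('StartsWith', 'com')
```

With this mapping, the filter in the scan node is exactly what the reported query should produce.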
[jira] [Commented] (SPARK-43170) The spark sql like statement is pushed down to parquet for execution, but the data cannot be queried
[ https://issues.apache.org/jira/browse/SPARK-43170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17713855#comment-17713855 ] todd commented on SPARK-43170: -- [~yumwang] The code only executes spark.sql("xxx"); it does not perform any cache-related operations. Yet the same code gives different results on Spark 3.0 and Spark 3.2 -- why? If it is convenient for you, you can reproduce it. > The spark sql like statement is pushed down to parquet for execution, but the > data cannot be queried > > > Key: SPARK-43170 > URL: https://issues.apache.org/jira/browse/SPARK-43170 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.2 >Reporter: todd >Priority: Major > Attachments: image-2023-04-18-10-59-30-199.png > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
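To make the failure mode concrete, here is a toy simulation of partition pruning over the appid partition values listed in the ticket (plain Python, not Spark). Applying the Contains filter the query asked for keeps the five shopee partitions, while the StartsWith(appid, com) filter shown in the cached-plan dump matches nothing, which would explain the empty result:

```python
# Partition values for dt=20230412/hour=23, copied from the ticket.
partitions = [
    "blibli.mobile.commerce",
    "cn.shopee.app", "cn.shopee.br", "cn.shopee.id",
    "cn.shopee.my", "cn.shopee.ph",
]

# Pruning with the filter the query requested: like '%shopee%' -> Contains.
contains = [p for p in partitions if "shopee" in p]

# Pruning with the filter visible in the pasted cached plan: StartsWith 'com'.
starts_with = [p for p in partitions if p.startswith("com")]

print(len(contains))  # 5 partitions survive
print(starts_with)    # [] -> "no data"
```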
[jira] [Assigned] (SPARK-43098) Should not handle the COUNT bug when the GROUP BY clause of a correlated scalar subquery is non-empty
[ https://issues.apache.org/jira/browse/SPARK-43098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-43098: --- Assignee: Jack Chen > Should not handle the COUNT bug when the GROUP BY clause of a correlated > scalar subquery is non-empty > - > > Key: SPARK-43098 > URL: https://issues.apache.org/jira/browse/SPARK-43098 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Jack Chen >Assignee: Jack Chen >Priority: Major > Fix For: 3.4.1, 3.5.0 > > > From [~allisonwang-db]: > There is no COUNT bug when the correlated equality predicates are also in the > group by clause. However, the current logic to handle the COUNT bug still > adds a default aggregate function value and returns incorrect results. > > {code:java} > create view t1(c1, c2) as values (0, 1), (1, 2); > create view t2(c1, c2) as values (0, 2), (0, 3); > select c1, c2, (select count(*) from t2 where t1.c1 = t2.c1 group by c1) from > t1; > -- Correct answer: [(0, 1, 2), (1, 2, null)] > +---+---+------------------+ > |c1 |c2 |scalarsubquery(c1)| > +---+---+------------------+ > |0 |1 |2 | > |1 |2 |0 | > +---+---+------------------+ > {code} > > This bug affects scalar subqueries in RewriteCorrelatedScalarSubquery, but > lateral subqueries handle it correctly in DecorrelateInnerQuery. Related: > https://issues.apache.org/jira/browse/SPARK-36113 > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-43098) Should not handle the COUNT bug when the GROUP BY clause of a correlated scalar subquery is non-empty
[ https://issues.apache.org/jira/browse/SPARK-43098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-43098. - Fix Version/s: 3.5.0 3.4.1 Resolution: Fixed Issue resolved by pull request 40811 [https://github.com/apache/spark/pull/40811] > Should not handle the COUNT bug when the GROUP BY clause of a correlated > scalar subquery is non-empty > - > > Key: SPARK-43098 > URL: https://issues.apache.org/jira/browse/SPARK-43098 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Jack Chen >Priority: Major > Fix For: 3.5.0, 3.4.1 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
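The COUNT-bug semantics at issue in SPARK-43098 can be modeled in plain Python (helper names here are illustrative, not Spark's): with a non-empty GROUP BY, an outer row that matches no inner rows produces no group at all, so the scalar subquery should yield NULL, not count(*)'s default of 0:

```python
# Toy model of: select c1, c2, (select count(*) from t2
#                               where t1.c1 = t2.c1 group by c1) from t1
t1 = [(0, 1), (1, 2)]
t2 = [(0, 2), (0, 3)]

def subquery(c1):
    """Correct semantics: an empty group under GROUP BY produces no row,
    hence the scalar result is NULL (None)."""
    matches = [r for r in t2 if r[0] == c1]
    return len(matches) if matches else None

def subquery_buggy(c1):
    """Buggy COUNT-bug handling: substitutes the aggregate's default value
    (0) even though GROUP BY means an empty group yields no row."""
    matches = [r for r in t2 if r[0] == c1]
    return len(matches) if matches else 0

correct = [(c1, c2, subquery(c1)) for c1, c2 in t1]
buggy = [(c1, c2, subquery_buggy(c1)) for c1, c2 in t1]
print(correct)  # [(0, 1, 2), (1, 2, None)] -- the ticket's correct answer
print(buggy)    # [(0, 1, 2), (1, 2, 0)]    -- the incorrect output reported
```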
[jira] [Resolved] (SPARK-43146) Implement eager evaluation.
[ https://issues.apache.org/jira/browse/SPARK-43146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-43146. -- Fix Version/s: 3.5.0 Resolution: Fixed Issue resolved by pull request 40800 [https://github.com/apache/spark/pull/40800] > Implement eager evaluation. > --- > > Key: SPARK-43146 > URL: https://issues.apache.org/jira/browse/SPARK-43146 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Takuya Ueshin >Assignee: Takuya Ueshin >Priority: Major > Fix For: 3.5.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-43146) Implement eager evaluation.
[ https://issues.apache.org/jira/browse/SPARK-43146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-43146: Assignee: Takuya Ueshin > Implement eager evaluation. > --- > > Key: SPARK-43146 > URL: https://issues.apache.org/jira/browse/SPARK-43146 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Takuya Ueshin >Assignee: Takuya Ueshin >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42592) Document SS guide doc for supporting multiple stateful operators (especially chained aggregations)
[ https://issues.apache.org/jira/browse/SPARK-42592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17713811#comment-17713811 ] Jungtaek Lim commented on SPARK-42592: -- [~XinrongM] We may have missed re-tagging fix versions for PRs that were merged in parallel with the RCs. I have changed the fix version for this ticket from 3.4.1 to 3.4.0. > Document SS guide doc for supporting multiple stateful operators (especially > chained aggregations) > -- > > Key: SPARK-42592 > URL: https://issues.apache.org/jira/browse/SPARK-42592 > Project: Spark > Issue Type: Documentation > Components: Structured Streaming >Affects Versions: 3.5.0 >Reporter: Jungtaek Lim >Assignee: Jungtaek Lim >Priority: Major > Fix For: 3.4.0, 3.5.0 > > > We changed the guide doc for SPARK-40925 via SPARK-42105, but in SPARK-42105 we only > removed the "limitation of global watermark" section. > That said, we haven't provided any example of the new functionality; in particular, > users need to know about the change to the SQL function (window) in chained > time window aggregations. > In this ticket, we will add an example of chained time window aggregations, > introducing the new behavior of the SQL function. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42592) Document SS guide doc for supporting multiple stateful operators (especially chained aggregations)
[ https://issues.apache.org/jira/browse/SPARK-42592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim updated SPARK-42592: - Fix Version/s: 3.4.0 (was: 3.4.1) > Document SS guide doc for supporting multiple stateful operators (especially > chained aggregations) > -- > > Key: SPARK-42592 > URL: https://issues.apache.org/jira/browse/SPARK-42592 > Project: Spark > Issue Type: Documentation > Components: Structured Streaming >Affects Versions: 3.5.0 >Reporter: Jungtaek Lim >Assignee: Jungtaek Lim >Priority: Major > Fix For: 3.4.0, 3.5.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-43178) Migrate UDF errors into error class
Haejoon Lee created SPARK-43178: --- Summary: Migrate UDF errors into error class Key: SPARK-43178 URL: https://issues.apache.org/jira/browse/SPARK-43178 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 3.5.0 Reporter: Haejoon Lee Migrate pyspark/sql/udf.py errors into error class -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
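As a sketch of what the migration in SPARK-43178 is about: the error-class pattern identifies each error by a stable class name plus message parameters instead of a free-form message string. The registry, class, and function names below are illustrative stand-ins, not PySpark's actual API:

```python
# Minimal illustration of the error-class pattern (names are hypothetical).
ERROR_CLASSES = {
    "NOT_CALLABLE": "Argument `{arg_name}` must be callable, got {arg_type}.",
}

class IllustrativeUDFError(TypeError):
    """Error carrying a stable error class and structured parameters,
    so callers can match on the class rather than parse the message."""
    def __init__(self, error_class: str, message_parameters: dict):
        self.error_class = error_class
        self.message_parameters = message_parameters
        super().__init__(ERROR_CLASSES[error_class].format(**message_parameters))

def register_udf(f):
    """Stand-in for a udf.py entry point raising a classified error."""
    if not callable(f):
        raise IllustrativeUDFError(
            error_class="NOT_CALLABLE",
            message_parameters={"arg_name": "f", "arg_type": type(f).__name__},
        )
    return f

try:
    register_udf(42)
except IllustrativeUDFError as e:
    print(e.error_class)  # NOT_CALLABLE
```

The payoff is that tests and user code can assert on `error_class` instead of brittle substring checks against message text.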
[jira] [Resolved] (SPARK-43173) `write jdbc` in `ClientE2ETestSuite` will test fail with out `-Phive`
[ https://issues.apache.org/jira/browse/SPARK-43173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-43173. -- Fix Version/s: 3.5.0 Resolution: Fixed Issue resolved by pull request 40837 [https://github.com/apache/spark/pull/40837] > `write jdbc` in `ClientE2ETestSuite` will test fail with out `-Phive` > - > > Key: SPARK-43173 > URL: https://issues.apache.org/jira/browse/SPARK-43173 > Project: Spark > Issue Type: Improvement > Components: Connect, Tests >Affects Versions: 3.5.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Minor > Fix For: 3.5.0 > > > Both > ``` > build/mvn clean install -Dtest=none > -DwildcardSuites=org.apache.spark.sql.ClientE2ETestSuite > ``` > and > ``` > build/sbt "connect-client-jvm/testOnly *ClientE2ETestSuite" > ``` > > will fail when using Java 11 & 17 > > {code:java} > - write jdbc *** FAILED *** > io.grpc.StatusRuntimeException: INTERNAL: No suitable driver > at io.grpc.Status.asRuntimeException(Status.java:535) > at > io.grpc.stub.ClientCalls$BlockingResponseStream.hasNext(ClientCalls.java:660) > at > scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:45) > at scala.collection.Iterator.foreach(Iterator.scala:943) > at scala.collection.Iterator.foreach$(Iterator.scala:943) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1431) > at org.apache.spark.sql.SparkSession.execute(SparkSession.scala:458) > at > org.apache.spark.sql.DataFrameWriter.executeWriteOperation(DataFrameWriter.scala:257) > at > org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:221) > at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:218) > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-43173) `write jdbc` in `ClientE2ETestSuite` will test fail with out `-Phive`
[ https://issues.apache.org/jira/browse/SPARK-43173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-43173: Assignee: Yang Jie > `write jdbc` in `ClientE2ETestSuite` will test fail with out `-Phive` > - > > Key: SPARK-43173 > URL: https://issues.apache.org/jira/browse/SPARK-43173 > Project: Spark > Issue Type: Improvement > Components: Connect, Tests >Affects Versions: 3.5.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Minor > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-43169) Update mima's previousSparkVersion to 3.4.0
[ https://issues.apache.org/jira/browse/SPARK-43169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-43169. -- Fix Version/s: 3.5.0 Resolution: Fixed Issue resolved by pull request 40830 [https://github.com/apache/spark/pull/40830] > Update mima's previousSparkVersion to 3.4.0 > --- > > Key: SPARK-43169 > URL: https://issues.apache.org/jira/browse/SPARK-43169 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Affects Versions: 3.5.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Major > Fix For: 3.5.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-43169) Update mima's previousSparkVersion to 3.4.0
[ https://issues.apache.org/jira/browse/SPARK-43169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-43169: Assignee: Yang Jie > Update mima's previousSparkVersion to 3.4.0 > --- > > Key: SPARK-43169 > URL: https://issues.apache.org/jira/browse/SPARK-43169 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Affects Versions: 3.5.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42539) User-provided JARs can override Spark's Hive metadata client JARs when using "builtin"
[ https://issues.apache.org/jira/browse/SPARK-42539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17713801#comment-17713801 ] Sun Chao commented on SPARK-42539: -- Oops my bad [~xkrogen] - you're right, this is not in Spark 3.5 release, sorry! I must have forgotten to mark it resolved when the second PR got merged. > User-provided JARs can override Spark's Hive metadata client JARs when using > "builtin" > -- > > Key: SPARK-42539 > URL: https://issues.apache.org/jira/browse/SPARK-42539 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.3, 3.2.3, 3.3.2 >Reporter: Erik Krogen >Priority: Major > Fix For: 3.5.0 > > > Recently we observed that on version 3.2.0 and Java 8, it is possible for > user-provided Hive JARs to break the ability for Spark, via the Hive metadata > client / {{IsolatedClientLoader}}, to communicate with Hive Metastore, when > using the default behavior of the "builtin" Hive version. After SPARK-35321, > when Spark is compiled against Hive >= 2.3.9 and the "builtin" Hive client > version is used, we will call the method {{Hive.getWithoutRegisterFns()}} > (from HIVE-21563) instead of {{Hive.get()}}. If the user has included, for > example, {{hive-exec-2.3.8.jar}} on their classpath, the client will break > with a {{NoSuchMethodError}}. This particular failure mode was resolved in > 3.2.1 by SPARK-37446, but while investigating, we found a general issue that > it's possible for user JARs to override Spark's own JARs -- but only inside > of the IsolatedClientLoader when using "builtin". This happens because even > when Spark is configured to use the "builtin" Hive classes, it still creates > a separate URLClassLoader for the HiveClientImpl used for HMS communication. 
> To get the set of JAR URLs to use for this classloader, Spark [collects all > of the JARs used by the user classloader (and its parent, and that > classloader's parent, and so > on)|https://github.com/apache/spark/blob/87e3d5625e76bb734b8dd753bfb25002822c8585/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveUtils.scala#L412-L438]. > Thus the newly created classloader will have all of the same JARs as the > user classloader, but the ordering has been reversed! User JARs get > prioritized ahead of system JARs, because the classloader hierarchy is > traversed from bottom-to-top. For example let's say we have user JARs > "foo.jar" and "hive-exec-2.3.8.jar". The user classloader will look like this: > {code} > MutableURLClassLoader > -- foo.jar > -- hive-exec-2.3.8.jar > -- parent: URLClassLoader > - spark-core_2.12-3.2.0.jar > - ... > - hive-exec-2.3.9.jar > - ... > {code} > This setup provides the expected behavior within the user classloader; it > will first check the parent, so hive-exec-2.3.9.jar takes precedence, and the > MutableURLClassLoader is only checked if the class doesn't exist in the > parent. But when a JAR list is constructed for the IsolatedClientLoader, it > traverses the URLs from MutableURLClassLoader first, then it's parent, so the > final list looks like (in order): > {code} > URLClassLoader [IsolatedClientLoader] > -- foo.jar > -- hive-exec-2.3.8.jar > -- spark-core_2.12-3.2.0.jar > -- ... > -- hive-exec-2.3.9.jar > -- ... > -- parent: boot classloader (JVM classes) > {code} > Now when a lookup happens, all of the JARs are within the same > URLClassLoader, and the user JARs are in front of the Spark ones, so the user > JARs get prioritized. This is the opposite of the expected behavior when > using the default user/application classloader in Spark, which has > parent-first behavior, prioritizing the Spark/system classes over the user > classes. (Note that this behavior is correct when using the > {{ChildFirstURLClassLoader}}.) 
> After SPARK-37446, the NoSuchMethodError is no longer an issue, but this > still breaks assumptions about how user JARs should be treated vs. system > JARs, and presents the ability for the client to break in other ways. For > example, SPARK-37446 describes a scenario whereby Hive 2.3.8 JARs have > been included; the changes in Hive 2.3.9 were needed to improve compatibility > with older HMS, so if a user were to accidentally include these older JARs, > it could break the ability of Spark to communicate with HMS 1.x. > I see two solutions to this: > *(A) Remove the separate classloader entirely when using "builtin"* > Starting from 3.0.0, due to SPARK-26839, when using Java 9+, we don't even > create a new classloader when using "builtin". This makes sense, as [called > out in this > comment|https://github.com/apache/spark/pull/24057#discussion_r265142878], > since the point of "builtin" is to use the existing JARs on the classpath > anyway. This proposes simply extending the changes from SPARK-26839 to all > Java versions, instead of restricting to Java 9+ only.
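The ordering inversion described in SPARK-42539 can be modeled in a few lines of plain Python, with each classloader reduced to its list of JAR URLs: parent-first delegation resolves to the system copy, but flattening the chain child-to-parent, as the JAR-list construction does, puts user JARs ahead of Spark's own:

```python
# Toy model of the two lookup strategies from the ticket's example.
user_loader = ["foo.jar", "hive-exec-2.3.8.jar"]                       # MutableURLClassLoader
system_loader = ["spark-core_2.12-3.2.0.jar", "hive-exec-2.3.9.jar"]   # its parent

def parent_first(name):
    """Normal delegation: consult the parent before the child."""
    for jar in system_loader + user_loader:
        if jar.startswith(name):
            return jar

# Flattened URL list built by walking the chain bottom-to-top,
# as happens when constructing the isolated client's URLClassLoader.
flattened = user_loader + system_loader

def flattened_lookup(name):
    """Single flat URLClassLoader: first match in the flattened list wins."""
    for jar in flattened:
        if jar.startswith(name):
            return jar

print(parent_first("hive-exec"))      # hive-exec-2.3.9.jar -> system copy wins
print(flattened_lookup("hive-exec"))  # hive-exec-2.3.8.jar -> user copy shadows it
```

This is only a sketch of the precedence problem; the real code deals in classloader objects and URL arrays rather than name lists.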
[jira] [Assigned] (SPARK-42539) User-provided JARs can override Spark's Hive metadata client JARs when using "builtin"
[ https://issues.apache.org/jira/browse/SPARK-42539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sun Chao reassigned SPARK-42539: Assignee: Erik Krogen > User-provided JARs can override Spark's Hive metadata client JARs when using > "builtin" > -- > > Key: SPARK-42539 > URL: https://issues.apache.org/jira/browse/SPARK-42539 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.3, 3.2.3, 3.3.2 >Reporter: Erik Krogen >Assignee: Erik Krogen >Priority: Major > Fix For: 3.5.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-42539) User-provided JARs can override Spark's Hive metadata client JARs when using "builtin"
[ https://issues.apache.org/jira/browse/SPARK-42539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Erik Krogen resolved SPARK-42539. - Resolution: Fixed [~csun] it looks like this didn't get marked as closed / fix-version updated when the PR was merged. I believe this went only into 3.5.0; the original PR went into branch-3.4 but was reverted and the second PR didn't make it to branch-3.4. I've marked the fix version as 3.5.0 but please correct me if I'm wrong here: {code:java} > glog apache/branch-3.4 | grep SPARK-42539 * 26009d47c1f 2023-02-28 Revert "[SPARK-42539][SQL][HIVE] Eliminate separate classloader when using 'builtin' Hive version for metadata client" [Hyukjin Kwon ] * 40a4019dfc5 2023-02-27 [SPARK-42539][SQL][HIVE] Eliminate separate classloader when using 'builtin' Hive version for metadata client [Erik Krogen ] > glog apache/master | grep SPARK-42539 * 2e34427d4f3 2023-03-01 [SPARK-42539][SQL][HIVE] Eliminate separate classloader when using 'builtin' Hive version for metadata client [Erik Krogen ] * 5627ceeddb4 2023-02-28 Revert "[SPARK-42539][SQL][HIVE] Eliminate separate classloader when using 'builtin' Hive version for metadata client" [Hyukjin Kwon ] * 27ad5830f9a 2023-02-27 [SPARK-42539][SQL][HIVE] Eliminate separate classloader when using 'builtin' Hive version for metadata client [Erik Krogen ] {code} > User-provided JARs can override Spark's Hive metadata client JARs when using > "builtin" > -- > > Key: SPARK-42539 > URL: https://issues.apache.org/jira/browse/SPARK-42539 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.3, 3.2.3, 3.3.2 >Reporter: Erik Krogen >Priority: Major > Fix For: 3.5.0 > > > Recently we observed that on version 3.2.0 and Java 8, it is possible for > user-provided Hive JARs to break the ability for Spark, via the Hive metadata > client / {{IsolatedClientLoader}}, to communicate with Hive Metastore, when > using the default behavior of the "builtin" Hive version. 
After SPARK-35321, > when Spark is compiled against Hive >= 2.3.9 and the "builtin" Hive client > version is used, we will call the method {{Hive.getWithoutRegisterFns()}} > (from HIVE-21563) instead of {{Hive.get()}}. If the user has included, for > example, {{hive-exec-2.3.8.jar}} on their classpath, the client will break > with a {{NoSuchMethodError}}. This particular failure mode was resolved in > 3.2.1 by SPARK-37446, but while investigating, we found a general issue that > it's possible for user JARs to override Spark's own JARs -- but only inside > of the IsolatedClientLoader when using "builtin". This happens because even > when Spark is configured to use the "builtin" Hive classes, it still creates > a separate URLClassLoader for the HiveClientImpl used for HMS communication. > To get the set of JAR URLs to use for this classloader, Spark [collects all > of the JARs used by the user classloader (and its parent, and that > classloader's parent, and so > on)|https://github.com/apache/spark/blob/87e3d5625e76bb734b8dd753bfb25002822c8585/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveUtils.scala#L412-L438]. > Thus the newly created classloader will have all of the same JARs as the > user classloader, but the ordering has been reversed! User JARs get > prioritized ahead of system JARs, because the classloader hierarchy is > traversed from bottom-to-top. For example let's say we have user JARs > "foo.jar" and "hive-exec-2.3.8.jar". The user classloader will look like this: > {code} > MutableURLClassLoader > -- foo.jar > -- hive-exec-2.3.8.jar > -- parent: URLClassLoader > - spark-core_2.12-3.2.0.jar > - ... > - hive-exec-2.3.9.jar > - ... > {code} > This setup provides the expected behavior within the user classloader; it > will first check the parent, so hive-exec-2.3.9.jar takes precedence, and the > MutableURLClassLoader is only checked if the class doesn't exist in the > parent. 
But when a JAR list is constructed for the IsolatedClientLoader, it > traverses the URLs from MutableURLClassLoader first, then its parent, so the > final list looks like (in order): > {code} > URLClassLoader [IsolatedClientLoader] > -- foo.jar > -- hive-exec-2.3.8.jar > -- spark-core_2.12-3.2.0.jar > -- ... > -- hive-exec-2.3.9.jar > -- ... > -- parent: boot classloader (JVM classes) > {code} > Now when a lookup happens, all of the JARs are within the same > URLClassLoader, and the user JARs are in front of the Spark ones, so the user > JARs get prioritized. This is the opposite of the expected behavior when > using the default user/application classloader in Spark, which has > parent-first behavior, prioritizing the Spark/system classes over the user > classes. (Note that thi
[jira] [Updated] (SPARK-42539) User-provided JARs can override Spark's Hive metadata client JARs when using "builtin"
[ https://issues.apache.org/jira/browse/SPARK-42539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Erik Krogen updated SPARK-42539: Fix Version/s: 3.5.0 > User-provided JARs can override Spark's Hive metadata client JARs when using > "builtin" > -- > > Key: SPARK-42539 > URL: https://issues.apache.org/jira/browse/SPARK-42539 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.3, 3.2.3, 3.3.2 >Reporter: Erik Krogen >Priority: Major > Fix For: 3.5.0 > > > Recently we observed that on version 3.2.0 and Java 8, it is possible for > user-provided Hive JARs to break the ability for Spark, via the Hive metadata > client / {{IsolatedClientLoader}}, to communicate with Hive Metastore, when > using the default behavior of the "builtin" Hive version. After SPARK-35321, > when Spark is compiled against Hive >= 2.3.9 and the "builtin" Hive client > version is used, we will call the method {{Hive.getWithoutRegisterFns()}} > (from HIVE-21563) instead of {{Hive.get()}}. If the user has included, for > example, {{hive-exec-2.3.8.jar}} on their classpath, the client will break > with a {{NoSuchMethodError}}. This particular failure mode was resolved in > 3.2.1 by SPARK-37446, but while investigating, we found a general issue that > it's possible for user JARs to override Spark's own JARs -- but only inside > of the IsolatedClientLoader when using "builtin". This happens because even > when Spark is configured to use the "builtin" Hive classes, it still creates > a separate URLClassLoader for the HiveClientImpl used for HMS communication. > To get the set of JAR URLs to use for this classloader, Spark [collects all > of the JARs used by the user classloader (and its parent, and that > classloader's parent, and so > on)|https://github.com/apache/spark/blob/87e3d5625e76bb734b8dd753bfb25002822c8585/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveUtils.scala#L412-L438]. 
> Thus the newly created classloader will have all of the same JARs as the > user classloader, but the ordering has been reversed! User JARs get > prioritized ahead of system JARs, because the classloader hierarchy is > traversed from bottom-to-top. For example let's say we have user JARs > "foo.jar" and "hive-exec-2.3.8.jar". The user classloader will look like this: > {code} > MutableURLClassLoader > -- foo.jar > -- hive-exec-2.3.8.jar > -- parent: URLClassLoader > - spark-core_2.12-3.2.0.jar > - ... > - hive-exec-2.3.9.jar > - ... > {code} > This setup provides the expected behavior within the user classloader; it > will first check the parent, so hive-exec-2.3.9.jar takes precedence, and the > MutableURLClassLoader is only checked if the class doesn't exist in the > parent. But when a JAR list is constructed for the IsolatedClientLoader, it > traverses the URLs from MutableURLClassLoader first, then its parent, so the > final list looks like (in order): > {code} > URLClassLoader [IsolatedClientLoader] > -- foo.jar > -- hive-exec-2.3.8.jar > -- spark-core_2.12-3.2.0.jar > -- ... > -- hive-exec-2.3.9.jar > -- ... > -- parent: boot classloader (JVM classes) > {code} > Now when a lookup happens, all of the JARs are within the same > URLClassLoader, and the user JARs are in front of the Spark ones, so the user > JARs get prioritized. This is the opposite of the expected behavior when > using the default user/application classloader in Spark, which has > parent-first behavior, prioritizing the Spark/system classes over the user > classes. (Note that this behavior is correct when using the > {{ChildFirstURLClassLoader}}.) > After SPARK-37446, the NoSuchMethodError is no longer an issue, but this > still breaks assumptions about how user JARs should be treated vs. system > JARs, and presents the ability for the client to break in other ways. 
For > example in SPARK-37446 it describes a scenario whereby Hive 2.3.8 JARs have > been included; the changes in Hive 2.3.9 were needed to improve compatibility > with older HMS, so if a user were to accidentally include these older JARs, > it could break the ability of Spark to communicate with HMS 1.x > I see two solutions to this: > *(A) Remove the separate classloader entirely when using "builtin"* > Starting from 3.0.0, due to SPARK-26839, when using Java 9+, we don't even > create a new classloader when using "builtin". This makes sense, as [called > out in this > comment|https://github.com/apache/spark/pull/24057#discussion_r265142878], > since the point of "builtin" is to use the existing JARs on the classpath > anyway. This proposes simply extending the changes from SPARK-26839 to all > Java versions, instead of restricting to Java 9+ only. > *(B) Reverse the ordering of parent/
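The bottom-to-top URL collection described in this thread can be illustrated with a short, self-contained sketch. This is not Spark's actual HiveUtils code; the class name, JAR paths, and loader setup below are invented for illustration only:

```java
import java.net.URL;
import java.net.URLClassLoader;
import java.util.ArrayList;
import java.util.List;

public class JarOrderDemo {
    // Simplified stand-in for the JAR-collection logic: walk the classloader
    // chain bottom-up and concatenate each loader's URLs into one flat list.
    public static List<URL> flattenUrls(ClassLoader cl) {
        List<URL> urls = new ArrayList<>();
        while (cl != null) {
            if (cl instanceof URLClassLoader) {
                for (URL u : ((URLClassLoader) cl).getURLs()) {
                    urls.add(u);
                }
            }
            cl = cl.getParent();
        }
        return urls;
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical JAR locations, just to make the ordering visible.
        URL sparkHive = new URL("file:/opt/spark/jars/hive-exec-2.3.9.jar");
        URL userHive = new URL("file:/tmp/user/hive-exec-2.3.8.jar");

        // Normal parent-first delegation: the parent's hive-exec-2.3.9.jar wins.
        URLClassLoader system = new URLClassLoader(new URL[]{sparkHive}, null);
        URLClassLoader user = new URLClassLoader(new URL[]{userHive}, system);

        List<URL> flat = flattenUrls(user);
        // In the flattened list the user JAR now precedes the Spark JAR, so a
        // single URLClassLoader built from it inverts the precedence.
        System.out.println(flat.indexOf(userHive) < flat.indexOf(sparkHive)); // prints true
    }
}
```

The sketch shows why option (B) in the thread talks about reversing the traversal: the delegation order of the original hierarchy and the iteration order of the flattened list are opposites.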
[jira] [Commented] (SPARK-43167) Streaming Connect console output format support
[ https://issues.apache.org/jira/browse/SPARK-43167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17713787#comment-17713787 ] Wei Liu commented on SPARK-43167: - Should be:
```
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 3.5.0-SNAPSHOT
      /_/

Using Python version 3.10.8 (main, Oct 13 2022 09:48:40)
Spark context Web UI available at http://10.10.105.160:4040
Spark context available as 'sc' (master = local[*], app id = local-1681856185012).
SparkSession available as 'spark'.
>>> spark
>>> q = spark.readStream.format("rate").load().writeStream.format("console").start()
23/04/18 15:17:12 WARN ResolveWriteToStream: Temporary checkpoint location created which is deleted normally when the query didn't fail: /private/var/folders/9k/pbxb4_690wv4smwhwbzwmqkwgp/T/temporary-64d68668-bc6f-46aa-8ea5-b66ddae09f91. If it's required to delete it under any circumstances, please set spark.sql.streaming.forceDeleteTempCheckpointLocation to true. Important to know deleting temp checkpoint folder is best effort.
23/04/18 15:17:12 WARN ResolveWriteToStream: spark.sql.adaptive.enabled is not supported in streaming DataFrames/Datasets and will be disabled.
-------------------------------------------
Batch: 0
-------------------------------------------
+---------+-----+
|timestamp|value|
+---------+-----+
+---------+-----+

-------------------------------------------
Batch: 1
-------------------------------------------
+--------------------+-----+
|           timestamp|value|
+--------------------+-----+
|2023-04-18 15:17:...|    0|
|2023-04-18 15:17:...|    1|
+--------------------+-----+

-------------------------------------------
Batch: 2
-------------------------------------------
+--------------------+-----+
|           timestamp|value|
+--------------------+-----+
|2023-04-18 15:17:...|    2|
|2023-04-18 15:17:...|    3|
+--------------------+-----+

-------------------------------------------
Batch: 3
-------------------------------------------
+--------------------+-----+
|           timestamp|value|
+--------------------+-----+
|2023-04-18 15:17:...|    4|
+--------------------+-----+
```
> Streaming Connect console output format support > --- > > Key: SPARK-43167 > URL: https://issues.apache.org/jira/browse/SPARK-43167 > Project: Spark > Issue Type: Task > Components: Connect, Structured Streaming >Affects Versions: 3.5.0 >Reporter: Wei Liu >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-43177) Add deprecation warning for input_file_name()
Yaohua Zhao created SPARK-43177: --- Summary: Add deprecation warning for input_file_name() Key: SPARK-43177 URL: https://issues.apache.org/jira/browse/SPARK-43177 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.4.0 Reporter: Yaohua Zhao With the new `_metadata` column, users shouldn’t need to use input_file_name() anymore. We should add a deprecation warning and update the docs. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-42452) Remove hadoop-2 profile from Apache Spark
[ https://issues.apache.org/jira/browse/SPARK-42452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun resolved SPARK-42452. -- Fix Version/s: 3.5.0 Assignee: Yang Jie Resolution: Fixed > Remove hadoop-2 profile from Apache Spark > - > > Key: SPARK-42452 > URL: https://issues.apache.org/jira/browse/SPARK-42452 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.5.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Major > Fix For: 3.5.0 > > > SPARK-40651 Drop Hadoop2 binary distribution from release process and > SPARK-42447 Remove Hadoop 2 GitHub Action job > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30542) Two Spark structured streaming jobs cannot write to same base path
[ https://issues.apache.org/jira/browse/SPARK-30542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17713623#comment-17713623 ] Wojciech Indyk commented on SPARK-30542: Will be fixed by this PR: https://github.com/apache/spark/pull/40821 > Two Spark structured streaming jobs cannot write to same base path > -- > > Key: SPARK-30542 > URL: https://issues.apache.org/jira/browse/SPARK-30542 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.3.0 >Reporter: Sivakumar >Priority: Major > > Hi All, > Spark Structured Streaming doesn't allow two structured streaming jobs to > write data to the same base directory, which was possible with DStreams. > Because a _spark_metadata directory is created by default for the first job, > a second job cannot use the same directory as its base path; the existing > _spark_metadata directory belongs to the other job, so an exception is thrown. > Is there any workaround for this, other than creating separate base paths > for both jobs? > Is it possible to create the _spark_metadata directory elsewhere, or to > disable it without any data loss? > If I had to change the base path for both jobs, my whole framework > would be impacted, so I don't want to do that. > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-43176) Deduplicate imports in Connect Tests
Ruifeng Zheng created SPARK-43176: - Summary: Deduplicate imports in Connect Tests Key: SPARK-43176 URL: https://issues.apache.org/jira/browse/SPARK-43176 Project: Spark Issue Type: Test Components: Connect, PySpark Affects Versions: 3.5.0 Reporter: Ruifeng Zheng -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43175) decom.sh can cause an UnsupportedOperationException
[ https://issues.apache.org/jira/browse/SPARK-43175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Iain Cardnell updated SPARK-43175: -- Description: decom.sh can cause an UnsupportedOperationException which then causes the Executor to die with a SparkUncaughtException and does not complete the decommission properly. *Problem:* SignalUtils.scala line 124: {code:java} if (escalate) { prevHandler.handle(sig) }{code} *Logs:* {noformat} failed - error: command '/opt/decom.sh' exited with 137: + echo 'Asked to decommission' + date + tee -a ++ ps -o pid -C java ++ awk '{ sub(/^[ \t]+/, ); print }' ++ tail -n 1 + WORKER_PID=17 + echo 'Using worker pid 17' + kill -s SIGPWR 17 + echo 'Waiting for worker pid to exit' + timeout 60 tail --pid=17 -f /dev/null , message: ""Asked to decommission\nMon Apr 17 23:44:35 UTC 2023\nUsing worker pid 17\nWaiting for worker pid to exit\n+ echo 'Asked to decommission'\n+ date\n+ tee -a\n++ ps -o pid -C java\n++ awk '{ sub(/^[ \\t]+/, \""\""); print }'\n++ tail -n 1\n+ WORKER_PID=17\n+ echo 'Using worker pid 17'\n+ kill -s SIGPWR 17\n+ echo 'Waiting for worker pid to exit'\n+ timeout 60 tail --pid=17 -f /dev/null\n""",2023-04-17T23:44:39Z, "java.lang.UnsupportedOperationException: invoking native signal handle not supported at java.base/jdk.internal.misc.Signal$NativeHandler.handle(Unknown Source) at jdk.unsupported/sun.misc.Signal$SunMiscHandler.handle(Unknown Source) at org.apache.spark.util.SignalUtils$ActionHandler.handle(SignalUtils.scala:124) at jdk.unsupported/sun.misc.Signal$InternalMiscHandler.handle(Unknown Source) at java.base/jdk.internal.misc.Signal$1.run(Unknown Source) at java.base/java.lang.Thread.run(Unknown Source)",2023-04-17T23:44:35.407488217Z "2023-04-17 23:44:35 [SIGPWR handler] ERROR org.apache.spark.util.SparkUncaughtExceptionHandler - Uncaught exception in thread Thread[SIGPWR handler,9,system] - {}",2023-04-17T23:44:35.407457859Z " ... 
1 more",2023-04-17T23:44:35.405548994Z " at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)",2023-04-17T23:44:35.405542621Z " at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)",2023-04-17T23:44:35.405536674Z " at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)",2023-04-17T23:44:35.405516396Z " at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:496)",2023-04-17T23:44:35.405416352Z " at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:584)",2023-04-17T23:44:35.405410491Z " ... at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:53)",2023-04-17T23:44:35.405262304Z " at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:142)",2023-04-17T23:44:35.405256591Z " at org.apache.spark.network.client.TransportResponseHandler.handle(TransportResponseHandler.java:209)",2023-04-17T23:44:35.405250814Z{noformat} In this case prevHandler is the NativeHandler (See [https://github.com/AdoptOpenJDK/openjdk-jdk11/blob/19fb8f93c59dfd791f62d41f332db9e306bc1422/src/java.base/share/classes/jdk/internal/misc/Signal.java#L280|https://github.com/AdoptOpenJDK/openjdk-jdk11/blob/19fb8f93c59dfd791f62d41f332db9e306bc1422/src/java.base/share/classes/jdk/internal/misc/Signal.java#L280]) and it throws the exception. *Possible Solutions:* * Check if prevHandler is an instance of NativeHandler and do not call it in that case. * try catch around the invoke of the handler and log a warning/error on exceptions. was: decom.sh can cause an UnsupportedOperationException which then causes the Executor to die with a SparkUncaughtException and does not complete the decommission properly. 
*Problem:* SignalUtils.scala line 124: {code:java} if (escalate) { prevHandler.handle(sig) }{code} *Logs:* {noformat} failed - error: command '/opt/decom.sh' exited with 137: + echo 'Asked to decommission' + date + tee -a ++ ps -o pid -C java ++ awk '{ sub(/^[ \t]+/, ); print }' ++ tail -n 1 + WORKER_PID=17 + echo 'Using worker pid 17' + kill -s SIGPWR 17 + echo 'Waiting for worker pid to exit' + timeout 60 tail --pid=17 -f /dev/null , message: ""Asked to decommission\nMon Apr 17 23:44:35 UTC 2023\nUsing worker pid 17\nWaiting for worker pid to exit\n+ echo 'Asked to decommission'\n+ date\n+ tee -a\n++ ps -o pid -C java\n++ awk '{ sub(/^[ \\t]+/, \""\""); print }'\n++ tail -n 1\n+ WORKER_PID=17\n+ echo 'Using worker pid 17'\n+ kill -s SIGPWR 17\n+ echo 'Waiting for worker pid to exit'\n+ timeout 60 tail --pid=17 -f /dev/null\n""",2023-04-17T23:44:39Z, "java.lang.UnsupportedOperationException: invoking native signal handle not supported at java.base/jdk.internal.misc.Signal$NativeHandler.handle(Unknown Source) at jdk.unsupported/sun.misc.Signal$SunMiscHandler.handle(Unknown Source)
[jira] [Updated] (SPARK-43175) decom.sh can cause an UnsupportedOperationException
[ https://issues.apache.org/jira/browse/SPARK-43175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Iain Cardnell updated SPARK-43175: -- Description: decom.sh can cause an UnsupportedOperationException which then causes the Executor to die with a SparkUncaughtException and does not complete the decommission properly. *Problem:* SignalUtils.scala line 124: {code:java} if (escalate) { prevHandler.handle(sig) }{code} *Logs:* {noformat} failed - error: command '/opt/decom.sh' exited with 137: + echo 'Asked to decommission' + date + tee -a ++ ps -o pid -C java ++ awk '{ sub(/^[ \t]+/, ); print }' ++ tail -n 1 + WORKER_PID=17 + echo 'Using worker pid 17' + kill -s SIGPWR 17 + echo 'Waiting for worker pid to exit' + timeout 60 tail --pid=17 -f /dev/null , message: ""Asked to decommission\nMon Apr 17 23:44:35 UTC 2023\nUsing worker pid 17\nWaiting for worker pid to exit\n+ echo 'Asked to decommission'\n+ date\n+ tee -a\n++ ps -o pid -C java\n++ awk '{ sub(/^[ \\t]+/, \""\""); print }'\n++ tail -n 1\n+ WORKER_PID=17\n+ echo 'Using worker pid 17'\n+ kill -s SIGPWR 17\n+ echo 'Waiting for worker pid to exit'\n+ timeout 60 tail --pid=17 -f /dev/null\n""",2023-04-17T23:44:39Z, "java.lang.UnsupportedOperationException: invoking native signal handle not supported at java.base/jdk.internal.misc.Signal$NativeHandler.handle(Unknown Source) at jdk.unsupported/sun.misc.Signal$SunMiscHandler.handle(Unknown Source) at org.apache.spark.util.SignalUtils$ActionHandler.handle(SignalUtils.scala:124) at jdk.unsupported/sun.misc.Signal$InternalMiscHandler.handle(Unknown Source) at java.base/jdk.internal.misc.Signal$1.run(Unknown Source) at java.base/java.lang.Thread.run(Unknown Source)",2023-04-17T23:44:35.407488217Z "2023-04-17 23:44:35 [SIGPWR handler] ERROR org.apache.spark.util.SparkUncaughtExceptionHandler - Uncaught exception in thread Thread[SIGPWR handler,9,system] - {}",2023-04-17T23:44:35.407457859Z " ... 
1 more",2023-04-17T23:44:35.405548994Z " at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)",2023-04-17T23:44:35.405542621Z " at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)",2023-04-17T23:44:35.405536674Z " at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)",2023-04-17T23:44:35.405516396Z " at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:496)",2023-04-17T23:44:35.405416352Z " at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:584)",2023-04-17T23:44:35.405410491Z " ... at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:53)",2023-04-17T23:44:35.405262304Z " at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:142)",2023-04-17T23:44:35.405256591Z " at org.apache.spark.network.client.TransportResponseHandler.handle(TransportResponseHandler.java:209)",2023-04-17T23:44:35.405250814Z{noformat} In this case prevHandler is the NativeHandler (See [https://github.com/AdoptOpenJDK/openjdk-jdk11/blob/19fb8f93c59dfd791f62d41f332db9e306bc1422/src/java.base/share/classes/jdk/internal/misc/Signal.java#L280]) and it throws the exception. *Possible Solutions:* * Check if prevHandler is an instance of NativeHandler and do not call it in that case. * try catch around the invoke of the handler and log a warning/error on exceptions. was: decom.sh can cause an UnsupportedOperationException which then causes the Executor to die with a SparkUncaughtException and does not complete the decommission properly. 
*Problem:* SignalUtils.scala line 124: {code:java} if (escalate) { prevHandler.handle(sig) }{code} *Logs:* {noformat} failed - error: command '/opt/decom.sh' exited with 137: + echo 'Asked to decommission' + date + tee -a ++ ps -o pid -C java ++ awk '{ sub(/^[ \t]+/, ); print }' ++ tail -n 1 + WORKER_PID=17 + echo 'Using worker pid 17' + kill -s SIGPWR 17 + echo 'Waiting for worker pid to exit' + timeout 60 tail --pid=17 -f /dev/null , message: ""Asked to decommission\nMon Apr 17 23:44:35 UTC 2023\nUsing worker pid 17\nWaiting for worker pid to exit\n+ echo 'Asked to decommission'\n+ date\n+ tee -a\n++ ps -o pid -C java\n++ awk '{ sub(/^[ \\t]+/, \""\""); print }'\n++ tail -n 1\n+ WORKER_PID=17\n+ echo 'Using worker pid 17'\n+ kill -s SIGPWR 17\n+ echo 'Waiting for worker pid to exit'\n+ timeout 60 tail --pid=17 -f /dev/null\n""",2023-04-17T23:44:39Z, "java.lang.UnsupportedOperationException: invoking native signal handle not supported at java.base/jdk.internal.misc.Signal$NativeHandler.handle(Unknown Source) at jdk.unsupported/sun.misc.Signal$SunMiscHandler.handle(Unknown Source) at org.apache.spark.util.SignalUtils$ActionHandler.handle(SignalUtils.scala:124) at jdk.unsupported/sun.misc.Signal$InternalMiscHandler.handle(Unknown
[jira] [Updated] (SPARK-43175) decom.sh can cause an UnsupportedOperationException
[ https://issues.apache.org/jira/browse/SPARK-43175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Iain Cardnell updated SPARK-43175: -- Description: decom.sh can cause an UnsupportedOperationException which then causes the Executor to die with a SparkUncaughtException and does not complete the decommission properly. *Problem:* SignalUtils.scala line 124: {code:java} if (escalate) { prevHandler.handle(sig) }{code} *Logs:* {noformat} failed - error: command '/opt/decom.sh' exited with 137: + echo 'Asked to decommission' + date + tee -a ++ ps -o pid -C java ++ awk '{ sub(/^[ \t]+/, ); print }' ++ tail -n 1 + WORKER_PID=17 + echo 'Using worker pid 17' + kill -s SIGPWR 17 + echo 'Waiting for worker pid to exit' + timeout 60 tail --pid=17 -f /dev/null , message: ""Asked to decommission\nMon Apr 17 23:44:35 UTC 2023\nUsing worker pid 17\nWaiting for worker pid to exit\n+ echo 'Asked to decommission'\n+ date\n+ tee -a\n++ ps -o pid -C java\n++ awk '{ sub(/^[ \\t]+/, \""\""); print }'\n++ tail -n 1\n+ WORKER_PID=17\n+ echo 'Using worker pid 17'\n+ kill -s SIGPWR 17\n+ echo 'Waiting for worker pid to exit'\n+ timeout 60 tail --pid=17 -f /dev/null\n""",2023-04-17T23:44:39Z, "java.lang.UnsupportedOperationException: invoking native signal handle not supported at java.base/jdk.internal.misc.Signal$NativeHandler.handle(Unknown Source) at jdk.unsupported/sun.misc.Signal$SunMiscHandler.handle(Unknown Source) at org.apache.spark.util.SignalUtils$ActionHandler.handle(SignalUtils.scala:124) at jdk.unsupported/sun.misc.Signal$InternalMiscHandler.handle(Unknown Source) at java.base/jdk.internal.misc.Signal$1.run(Unknown Source) at java.base/java.lang.Thread.run(Unknown Source)",2023-04-17T23:44:35.407488217Z "2023-04-17 23:44:35 [SIGPWR handler] ERROR org.apache.spark.util.SparkUncaughtExceptionHandler - Uncaught exception in thread Thread[SIGPWR handler,9,system] - {}",2023-04-17T23:44:35.407457859Z " ... 
1 more",2023-04-17T23:44:35.405548994Z " at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)",2023-04-17T23:44:35.405542621Z " at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)",2023-04-17T23:44:35.405536674Z " at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)",2023-04-17T23:44:35.405516396Z " at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:496)",2023-04-17T23:44:35.405416352Z " at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:584)",2023-04-17T23:44:35.405410491Z " ... at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:53)",2023-04-17T23:44:35.405262304Z " at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:142)",2023-04-17T23:44:35.405256591Z " at org.apache.spark.network.client.TransportResponseHandler.handle(TransportResponseHandler.java:209)",2023-04-17T23:44:35.405250814Z{noformat} In this case prevHandler is the NativeHandler (See [https://github.com/AdoptOpenJDK/openjdk-jdk11/blob/19fb8f93c59dfd791f62d41f332db9e306bc1422/src/java.base/share/classes/jdk/internal/misc/Signal.java#L280|Github JDK Source]) and it throws the exception. *Possible Solutions:* * Check if prevHandler is an instance of NativeHandler and do not call it in that case. * try catch around the invoke of the handler and log a warning/error on exceptions. was: decom.sh can cause an UnsupportedOperationException which then causes the Executor to die with a SparkUncaughtException and does not complete the decommission properly. 
*Problem:* SignalUtils.scala line 124: {code:java} if (escalate) { prevHandler.handle(sig) }{code} *Logs:* {noformat} failed - error: command '/opt/decom.sh' exited with 137: + echo 'Asked to decommission' + date + tee -a ++ ps -o pid -C java ++ awk '{ sub(/^[ \t]+/, ); print }' ++ tail -n 1 + WORKER_PID=17 + echo 'Using worker pid 17' + kill -s SIGPWR 17 + echo 'Waiting for worker pid to exit' + timeout 60 tail --pid=17 -f /dev/null , message: ""Asked to decommission\nMon Apr 17 23:44:35 UTC 2023\nUsing worker pid 17\nWaiting for worker pid to exit\n+ echo 'Asked to decommission'\n+ date\n+ tee -a\n++ ps -o pid -C java\n++ awk '{ sub(/^[ \\t]+/, \""\""); print }'\n++ tail -n 1\n+ WORKER_PID=17\n+ echo 'Using worker pid 17'\n+ kill -s SIGPWR 17\n+ echo 'Waiting for worker pid to exit'\n+ timeout 60 tail --pid=17 -f /dev/null\n""",2023-04-17T23:44:39Z, "java.lang.UnsupportedOperationException: invoking native signal handle not supported at java.base/jdk.internal.misc.Signal$NativeHandler.handle(Unknown Source) at jdk.unsupported/sun.misc.Signal$SunMiscHandler.handle(Unknown Source) at org.apache.spark.util.SignalUtils$ActionHandler.handle(SignalUtils.scala:124) at jdk.unsupported/sun.misc.Signal$InternalMiscHandle
[jira] [Created] (SPARK-43175) decom.sh can cause an UnsupportedOperationException
Iain Cardnell created SPARK-43175: - Summary: decom.sh can cause an UnsupportedOperationException Key: SPARK-43175 URL: https://issues.apache.org/jira/browse/SPARK-43175 Project: Spark Issue Type: Bug Components: Kubernetes Affects Versions: 3.3.0 Reporter: Iain Cardnell decom.sh can cause an UnsupportedOperationException which then causes the Executor to die with a SparkUncaughtException and does not complete the decommission properly. *Problem:* SignalUtils.scala line 124: {code:java} if (escalate) { prevHandler.handle(sig) }{code} *Logs:* {noformat} failed - error: command '/opt/decom.sh' exited with 137: + echo 'Asked to decommission' + date + tee -a ++ ps -o pid -C java ++ awk '{ sub(/^[ \t]+/, ); print }' ++ tail -n 1 + WORKER_PID=17 + echo 'Using worker pid 17' + kill -s SIGPWR 17 + echo 'Waiting for worker pid to exit' + timeout 60 tail --pid=17 -f /dev/null , message: ""Asked to decommission\nMon Apr 17 23:44:35 UTC 2023\nUsing worker pid 17\nWaiting for worker pid to exit\n+ echo 'Asked to decommission'\n+ date\n+ tee -a\n++ ps -o pid -C java\n++ awk '{ sub(/^[ \\t]+/, \""\""); print }'\n++ tail -n 1\n+ WORKER_PID=17\n+ echo 'Using worker pid 17'\n+ kill -s SIGPWR 17\n+ echo 'Waiting for worker pid to exit'\n+ timeout 60 tail --pid=17 -f /dev/null\n""",2023-04-17T23:44:39Z, "java.lang.UnsupportedOperationException: invoking native signal handle not supported at java.base/jdk.internal.misc.Signal$NativeHandler.handle(Unknown Source) at jdk.unsupported/sun.misc.Signal$SunMiscHandler.handle(Unknown Source) at org.apache.spark.util.SignalUtils$ActionHandler.handle(SignalUtils.scala:124) at jdk.unsupported/sun.misc.Signal$InternalMiscHandler.handle(Unknown Source) at java.base/jdk.internal.misc.Signal$1.run(Unknown Source) at java.base/java.lang.Thread.run(Unknown Source)",2023-04-17T23:44:35.407488217Z "2023-04-17 23:44:35 [SIGPWR handler] ERROR org.apache.spark.util.SparkUncaughtExceptionHandler - Uncaught exception in thread Thread[SIGPWR handler,9,system] 
- {}",2023-04-17T23:44:35.407457859Z " ... 1 more",2023-04-17T23:44:35.405548994Z " at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)",2023-04-17T23:44:35.405542621Z " at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)",2023-04-17T23:44:35.405536674Z " at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)",2023-04-17T23:44:35.405516396Z " at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:496)",2023-04-17T23:44:35.405416352Z " at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:584)",2023-04-17T23:44:35.405410491Z " at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:658)",2023-04-17T23:44:35.405402143Z " at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:722)",2023-04-17T23:44:35.405396413Z " at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:166)",2023-04-17T23:44:35.405390525Z " at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)",2023-04-17T23:44:35.405384806Z " at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)",2023-04-17T23:44:35.405378755Z " at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)",2023-04-17T23:44:35.405372709Z " at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)",2023-04-17T23:44:35.405359325Z " at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)",2023-04-17T23:44:35.405353609Z " at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)",2023-04-17T23:44:35.405347958Z " at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)",2023-04-17T23:44:35.405342114Z " 
at org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:102)",2023-04-17T23:44:35.405336302Z " at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)",2023-04-17T23:44:35.405330321Z " at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)",2023-04-17T23:44:35.405324741Z " at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)",2023-04-17T23:44:35.405319173Z " at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)",2023-04-17T23:44:35.405313526Z " at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.ja
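The second proposed solution above (a try/catch around the handler invocation) can be sketched as follows. The SignalHandler interface and class below are hypothetical stand-ins for illustration, not Spark's SignalUtils or the JDK's sun.misc.Signal API:

```java
// Minimal sketch: wrap the escalation call in try/catch so an
// UnsupportedOperationException from a previous (native) handler is
// logged instead of reaching the uncaught-exception handler.
public class SafeSignalEscalation {
    public interface SignalHandler {
        void handle(String signal);
    }

    // Returns true if the previous handler ran, false if it could not be invoked.
    public static boolean escalate(SignalHandler prevHandler, String signal) {
        try {
            prevHandler.handle(signal);
            return true;
        } catch (UnsupportedOperationException e) {
            // Mirrors the failure in the report: the JDK's internal
            // NativeHandler throws "invoking native signal handle not supported".
            System.err.println("Could not escalate " + signal
                + " to previous handler: " + e.getMessage());
            return false;
        }
    }

    public static void main(String[] args) {
        SignalHandler nativeLike = sig -> {
            throw new UnsupportedOperationException("invoking native signal handle not supported");
        };
        System.out.println(escalate(nativeLike, "PWR")); // prints false; JVM keeps running
    }
}
```

The first proposed solution (an instanceof check against the JDK's NativeHandler) would avoid the call entirely, but the try/catch variant also covers any other handler that cannot be invoked.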
[jira] [Commented] (SPARK-42657) Support to find and transfer client-side REPL classfiles to server as artifacts
[ https://issues.apache.org/jira/browse/SPARK-42657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17713583#comment-17713583 ] GridGain Integration commented on SPARK-42657: -- User 'vicennial' has created a pull request for this issue: https://github.com/apache/spark/pull/40675 > Support to find and transfer client-side REPL classfiles to server as > artifacts > - > > Key: SPARK-42657 > URL: https://issues.apache.org/jira/browse/SPARK-42657 > Project: Spark > Issue Type: Improvement > Components: Connect >Affects Versions: 3.4.0 >Reporter: Venkata Sai Akhil Gudesa >Assignee: Venkata Sai Akhil Gudesa >Priority: Major > Fix For: 3.5.0 > > > To run UDFs which are defined on the client side REPL, we require a mechanism > that can find the local REPL classfiles and then utilise the mechanism from > https://issues.apache.org/jira/browse/SPARK-42653 to transfer them to the > server as artifacts. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-43172) Expose host and bearer tokens from the spark connect client
[ https://issues.apache.org/jira/browse/SPARK-43172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-43172. -- Fix Version/s: 3.5.0 Resolution: Fixed Issue resolved by pull request 40836 [https://github.com/apache/spark/pull/40836] > Expose host and bearer tokens from the spark connect client > --- > > Key: SPARK-43172 > URL: https://issues.apache.org/jira/browse/SPARK-43172 > Project: Spark > Issue Type: Improvement > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Niranjan Jayakar >Assignee: Niranjan Jayakar >Priority: Major > Fix For: 3.5.0 > > > The `SparkConnectClient` class takes in a connection string to connect with > the spark connect service. > > As part of setting up the connection, it parses the connection string. Expose > the parsed host and bearer tokens as part of the class, so they may be > accessed by consumers. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-43172) Expose host and bearer tokens from the spark connect client
[ https://issues.apache.org/jira/browse/SPARK-43172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-43172: Assignee: Niranjan Jayakar > Expose host and bearer tokens from the spark connect client > --- > > Key: SPARK-43172 > URL: https://issues.apache.org/jira/browse/SPARK-43172 > Project: Spark > Issue Type: Improvement > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Niranjan Jayakar >Assignee: Niranjan Jayakar >Priority: Major > > The `SparkConnectClient` class takes in a connection string to connect with > the spark connect service. > > As part of setting up the connection, it parses the connection string. Expose > the parsed host and bearer tokens as part of the class, so they may be > accessed by consumers. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-43170) The Spark SQL LIKE statement is pushed down to Parquet for execution, but the data cannot be queried
[ https://issues.apache.org/jira/browse/SPARK-43170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17713576#comment-17713576 ] Yuming Wang commented on SPARK-43170: - Why it is {{CachedRDDBuilder}}? {noformat} (2) InMemoryRelation Arguments: [appid#65], CachedRDDBuilder(org.apache.spark.sql.execution.columnar.DefaultCachedBatchSerializer@ab5af13,StorageLevel(disk, memory, deserialized, 1 replicas),*(2) HashAggregate(keys=[appid#65], functions=[], output=[appid#65]) +- Exchange hashpartitioning(appid#65, 200), ENSURE_REQUIREMENTS, [plan_id=24] +- *(1) HashAggregate(keys=[appid#65], functions=[], output=[appid#65]) +- *(1) Project [appid#65] +- *(1) ColumnarToRow +- FileScan parquet ecom_dwm.dwm_user_app_action_sum_all[dt#63,hour#64,appid#65] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex(0 paths)[], PartitionFilters: [isnotnull(dt#63), isnotnull(appid#65), (dt#63 = 20230412), StartsWith(appid#65, com)], PushedFilters: [], ReadSchema: struct<> ,None) {noformat} > The spark sql like statement is pushed down to parquet for execution, but the > data cannot be queried > > > Key: SPARK-43170 > URL: https://issues.apache.org/jira/browse/SPARK-43170 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.2 >Reporter: todd >Priority: Major > Attachments: image-2023-04-18-10-59-30-199.png > > > --DDL > CREATE TABLE `ecom_dwm`.`dwm_user_app_action_sum_all` ( > `gaid` STRING COMMENT '', > `beyla_id` STRING COMMENT '', > `dt` STRING, > `hour` STRING, > `appid` STRING COMMENT '包名') > USING parquet > PARTITIONED BY (dt, hour, appid) > LOCATION 's3://x/dwm_user_app_action_sum_all' > – partitions info > show partitions ecom_dwm.dwm_user_app_action_sum_all PARTITION > (dt='20230412'); > > dt=20230412/hour=23/appid=blibli.mobile.commerce > dt=20230412/hour=23/appid=cn.shopee.app > dt=20230412/hour=23/appid=cn.shopee.br > dt=20230412/hour=23/appid=cn.shopee.id > 
dt=20230412/hour=23/appid=cn.shopee.my > dt=20230412/hour=23/appid=cn.shopee.ph > > — query > select DISTINCT(appid) from ecom_dwm.dwm_user_app_action_sum_all > where dt='20230412' and appid like '%shopee%' > > --result > nodata > > — other > I use spark3.0.1 version and trino query engine to query the data。 > > > The physical execution node formed by spark 3.2 > (3) Scan parquet ecom_dwm.dwm_user_app_action_sum_all Output [3]: [dt#63, > hour#64, appid#65|#63, hour#64, appid#65] Batched: true Location: > InMemoryFileIndex [] > PartitionFilters: [isnotnull(dt#63), isnotnull(appid#65), (dt#63 = 20230412), > Contains(appid#65, shopee)|#63), isnotnull(appid#65), (dt#63 = 20230412), > Contains(appid#65, shopee)] ReadSchema: struct<> > > > !image-2023-04-18-10-59-30-199.png! > > – sql plan detail > {code:java} > == Physical Plan == > CollectLimit (9) > +- InMemoryTableScan (1) > +- InMemoryRelation (2) > +- * HashAggregate (8) >+- Exchange (7) > +- * HashAggregate (6) > +- * Project (5) > +- * ColumnarToRow (4) >+- Scan parquet > ecom_dwm.dwm_user_app_action_sum_all (3) > (1) InMemoryTableScan > Output [1]: [appid#65] > Arguments: [appid#65] > (2) InMemoryRelation > Arguments: [appid#65], > CachedRDDBuilder(org.apache.spark.sql.execution.columnar.DefaultCachedBatchSerializer@ab5af13,StorageLevel(disk, > memory, deserialized, 1 replicas),*(2) HashAggregate(keys=[appid#65], > functions=[], output=[appid#65]) > +- Exchange hashpartitioning(appid#65, 200), ENSURE_REQUIREMENTS, [plan_id=24] >+- *(1) HashAggregate(keys=[appid#65], functions=[], output=[appid#65]) > +- *(1) Project [appid#65] > +- *(1) ColumnarToRow > +- FileScan parquet > ecom_dwm.dwm_user_app_action_sum_all[dt#63,hour#64,appid#65] Batched: true, > DataFilters: [], Format: Parquet, Location: InMemoryFileIndex(0 paths)[], > PartitionFilters: [isnotnull(dt#63), isnotnull(appid#65), (dt#63 = 20230412), > StartsWith(appid#65, com)], PushedFilters: [], ReadSchema: struct<> > ,None) > (3) Scan parquet 
ecom_dwm.dwm_user_app_action_sum_all > Output [3]: [dt#63, hour#64, appid#65] > Batched: true > Location: InMemoryFileIndex [] > PartitionFilters: [isnotnull(dt#63), isnotnull(appid#65), (dt#63 = 20230412), > StartsWith(appid#65, com)] > ReadSchema: struct<> > (4) ColumnarToRow [codegen id : 1] > Input [3]: [dt#63, hour#64, appid#65] > (5) Project [codegen id : 1] > Output [1]: [appid#65] > Input [3]: [dt#63, hour#64, appid#65] > (6) HashAggregate [codegen id : 1] > Input [1]: [appid#65] > Keys [1]: [appid#65] > Functions: [] > Aggregate Attributes: []
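Context for the mismatch in the plan above: Spark translates simple LIKE patterns into string pushdown filters — `'%x%'` becomes `Contains`, `'x%'` becomes `StartsWith`, `'%x'` becomes `EndsWith` — so the `appid like '%shopee%'` predicate correctly appears as `Contains(appid#65, shopee)` at scan (3), while the cached InMemoryRelation at (2) still carries `StartsWith(appid#65, com)`, apparently from an earlier `like 'com%'` query. A plain-Python sketch of that translation (the function name and return shape are ours, for illustration only, not Spark's actual code):

```python
def translate_like(pattern: str):
    # Mimics how Spark maps simple LIKE patterns onto string pushdown
    # filters (Contains / StartsWith / EndsWith). Illustration only.
    if "_" in pattern:
        return None  # single-char wildcard: no simple string filter
    core = pattern.strip("%")
    if "%" in core:
        return None  # interior wildcard: no simple string filter
    if pattern.startswith("%") and pattern.endswith("%"):
        return ("Contains", core)
    if pattern.endswith("%"):
        return ("StartsWith", core)
    if pattern.startswith("%"):
        return ("EndsWith", core)
    return ("EqualTo", core)

print(translate_like("%shopee%"))  # ('Contains', 'shopee')
print(translate_like("com%"))      # ('StartsWith', 'com')
```

Patterns with interior `%` or any `_` wildcard fall back to row-level evaluation without a simple string filter.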
[jira] [Commented] (SPARK-43170) The Spark SQL LIKE statement is pushed down to Parquet for execution, but the data cannot be queried
[ https://issues.apache.org/jira/browse/SPARK-43170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17713543#comment-17713543 ] todd commented on SPARK-43170: -- [~yumwang] no cache -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-43174) Fix SparkSQLCLIDriver completer
Yuming Wang created SPARK-43174: --- Summary: Fix SparkSQLCLIDriver completer Key: SPARK-43174 URL: https://issues.apache.org/jira/browse/SPARK-43174 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.5.0 Reporter: Yuming Wang -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43173) `write jdbc` in `ClientE2ETestSuite` will fail without `-Phive`
[ https://issues.apache.org/jira/browse/SPARK-43173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Jie updated SPARK-43173: - Description: both ``` build/mvn clean install -Dtest=none -DwildcardSuites=org.apache.spark.sql.ClientE2ETestSuite ``` and ``` build/sbt "connect-client-jvm/testOnly *ClientE2ETestSuite" ``` will test failed when using Java 11&17 {code:java} - write jdbc *** FAILED *** io.grpc.StatusRuntimeException: INTERNAL: No suitable driver at io.grpc.Status.asRuntimeException(Status.java:535) at io.grpc.stub.ClientCalls$BlockingResponseStream.hasNext(ClientCalls.java:660) at scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:45) at scala.collection.Iterator.foreach(Iterator.scala:943) at scala.collection.Iterator.foreach$(Iterator.scala:943) at scala.collection.AbstractIterator.foreach(Iterator.scala:1431) at org.apache.spark.sql.SparkSession.execute(SparkSession.scala:458) at org.apache.spark.sql.DataFrameWriter.executeWriteOperation(DataFrameWriter.scala:257) at org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:221) at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:218) {code} was: both ``` build/mvn clean install -Dtest=none -DwildcardSuites=org.apache.spark.sql.ClientE2ETestSuite ``` and ``` build/sbt "connect-client-jvm/testOnly *ClientE2ETestSuite" ``` will test failed {code:java} - write jdbc *** FAILED *** io.grpc.StatusRuntimeException: INTERNAL: No suitable driver at io.grpc.Status.asRuntimeException(Status.java:535) at io.grpc.stub.ClientCalls$BlockingResponseStream.hasNext(ClientCalls.java:660) at scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:45) at scala.collection.Iterator.foreach(Iterator.scala:943) at scala.collection.Iterator.foreach$(Iterator.scala:943) at scala.collection.AbstractIterator.foreach(Iterator.scala:1431) at org.apache.spark.sql.SparkSession.execute(SparkSession.scala:458) at 
org.apache.spark.sql.DataFrameWriter.executeWriteOperation(DataFrameWriter.scala:257) at org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:221) at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:218) {code} > `write jdbc` in `ClientE2ETestSuite` will test fail with out `-Phive` > - > > Key: SPARK-43173 > URL: https://issues.apache.org/jira/browse/SPARK-43173 > Project: Spark > Issue Type: Improvement > Components: Connect, Tests >Affects Versions: 3.5.0 >Reporter: Yang Jie >Priority: Minor > > both > ``` > build/mvn clean install -Dtest=none > -DwildcardSuites=org.apache.spark.sql.ClientE2ETestSuite > ``` > and > ``` > build/sbt "connect-client-jvm/testOnly *ClientE2ETestSuite" > ``` > > will test failed when using Java 11&17 > > {code:java} > - write jdbc *** FAILED *** > io.grpc.StatusRuntimeException: INTERNAL: No suitable driver > at io.grpc.Status.asRuntimeException(Status.java:535) > at > io.grpc.stub.ClientCalls$BlockingResponseStream.hasNext(ClientCalls.java:660) > at > scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:45) > at scala.collection.Iterator.foreach(Iterator.scala:943) > at scala.collection.Iterator.foreach$(Iterator.scala:943) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1431) > at org.apache.spark.sql.SparkSession.execute(SparkSession.scala:458) > at > org.apache.spark.sql.DataFrameWriter.executeWriteOperation(DataFrameWriter.scala:257) > at > org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:221) > at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:218) > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-43173) `write jdbc` in `ClientE2ETestSuite` will fail without `-Phive`
Yang Jie created SPARK-43173: Summary: `write jdbc` in `ClientE2ETestSuite` will test fail with out `-Phive` Key: SPARK-43173 URL: https://issues.apache.org/jira/browse/SPARK-43173 Project: Spark Issue Type: Improvement Components: Connect, Tests Affects Versions: 3.5.0 Reporter: Yang Jie both ``` build/mvn clean install -Dtest=none -DwildcardSuites=org.apache.spark.sql.ClientE2ETestSuite ``` and ``` build/sbt "connect-client-jvm/testOnly *ClientE2ETestSuite" ``` will test failed {code:java} - write jdbc *** FAILED *** io.grpc.StatusRuntimeException: INTERNAL: No suitable driver at io.grpc.Status.asRuntimeException(Status.java:535) at io.grpc.stub.ClientCalls$BlockingResponseStream.hasNext(ClientCalls.java:660) at scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:45) at scala.collection.Iterator.foreach(Iterator.scala:943) at scala.collection.Iterator.foreach$(Iterator.scala:943) at scala.collection.AbstractIterator.foreach(Iterator.scala:1431) at org.apache.spark.sql.SparkSession.execute(SparkSession.scala:458) at org.apache.spark.sql.DataFrameWriter.executeWriteOperation(DataFrameWriter.scala:257) at org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:221) at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:218) {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-43172) Expose host and bearer tokens from the spark connect client
Niranjan Jayakar created SPARK-43172: Summary: Expose host and bearer tokens from the spark connect client Key: SPARK-43172 URL: https://issues.apache.org/jira/browse/SPARK-43172 Project: Spark Issue Type: Improvement Components: Connect, PySpark Affects Versions: 3.4.0 Reporter: Niranjan Jayakar The `SparkConnectClient` class takes in a connection string to connect with the spark connect service. As part of setting up the connection, it parses the connection string. Expose the parsed host and bearer tokens as part of the class, so they may be accessed by consumers. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
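The connection string that `SparkConnectClient` parses follows the `sc://host:port/;param=value` scheme, with parameters carried after the path as `;key=value` pairs. A minimal sketch of extracting the host and bearer token — the helper name, the 15002 default port, and the returned tuple shape are assumptions for illustration, and the actual properties exposed by this change may differ:

```python
def parse_connect_string(conn: str):
    # Minimal sketch of Spark Connect connection-string parsing; the real
    # SparkConnectClient performs more validation than shown here.
    prefix = "sc://"
    if not conn.startswith(prefix):
        raise ValueError("Spark Connect URLs use the sc:// scheme")
    netloc, _, param_str = conn[len(prefix):].partition("/")
    host, _, port = netloc.partition(":")
    # Parameters ride after the path as ;key=value pairs.
    params = dict(p.partition("=")[::2]
                  for p in param_str.strip(";").split(";") if p)
    return host, int(port) if port else 15002, params.get("token")

host, port, token = parse_connect_string(
    "sc://example.host:15002/;token=abc123;use_ssl=true")
```

Exposing these parsed values as read-only properties lets consumers (e.g. tooling that needs to open a second channel to the same endpoint) reuse the credentials without re-parsing the string themselves.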
[jira] [Commented] (SPARK-43170) The Spark SQL LIKE statement is pushed down to Parquet for execution, but the data cannot be queried
[ https://issues.apache.org/jira/browse/SPARK-43170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17713503#comment-17713503 ] Yuming Wang commented on SPARK-43170: - Have you cached dwm_user_app_action_sum_all? -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42669) Short circuit local relation rpcs
[ https://issues.apache.org/jira/browse/SPARK-42669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17713495#comment-17713495 ] ASF GitHub Bot commented on SPARK-42669: User 'beliefer' has created a pull request for this issue: https://github.com/apache/spark/pull/40782 > Short circuit local relation rpcs > - > > Key: SPARK-42669 > URL: https://issues.apache.org/jira/browse/SPARK-42669 > Project: Spark > Issue Type: New Feature > Components: Connect >Affects Versions: 3.4.0 >Reporter: Herman van Hövell >Priority: Major > > Operations on LocalRelation can mostly be done locally (without sending > rpcs). We should leverage this. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-43137) Improve ArrayInsert if the position is foldable and equals to zero.
[ https://issues.apache.org/jira/browse/SPARK-43137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17713493#comment-17713493 ] ASF GitHub Bot commented on SPARK-43137: User 'beliefer' has created a pull request for this issue: https://github.com/apache/spark/pull/40833 > Improve ArrayInsert if the position is foldable and equals to zero. > --- > > Key: SPARK-43137 > URL: https://issues.apache.org/jira/browse/SPARK-43137 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.0 >Reporter: jiaan.geng >Priority: Major > > We want to make array_prepend reuse the implementation of array_insert, but performance is slightly worse when the position is foldable and equals zero. > The reason is that the generated code always checks whether the position is negative or positive, which makes it too long; overly long generated code can cause JIT compilation to fail. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
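To see why a foldable position matters, here is a small reference model of the reuse in question — plain Python, positive 1-based positions only (Spark's negative-index handling is deliberately omitted, and the null-padding behavior is an assumption of this sketch). When the position is a literal such as `1` for `array_prepend`, the negative-vs-positive branch can be resolved once at planning time instead of appearing in the generated code for every row:

```python
def array_insert(arr, pos, item):
    # Minimal model of Spark's array_insert for positive, 1-based
    # positions only; pads with None when pos is past the end.
    if pos <= 0:
        raise ValueError("this sketch only models positive positions")
    out = list(arr)
    idx = pos - 1
    if idx > len(out):
        out += [None] * (idx - len(out))
    out.insert(idx, item)
    return out

def array_prepend(arr, item):
    # The reuse SPARK-43137 targets: prepend is insert at position 1,
    # and with a literal (foldable) position the sign check can be elided.
    return array_insert(arr, 1, item)

print(array_insert([1, 2, 3], 2, 9))  # [1, 9, 2, 3]
print(array_prepend([1, 2], 0))       # [0, 1, 2]
```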
[jira] [Commented] (SPARK-42585) Streaming createDataFrame implementation
[ https://issues.apache.org/jira/browse/SPARK-42585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17713494#comment-17713494 ] ASF GitHub Bot commented on SPARK-42585: User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/40827 > Streaming createDataFrame implementation > > > Key: SPARK-42585 > URL: https://issues.apache.org/jira/browse/SPARK-42585 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Assignee: Max Gekk >Priority: Major > > createDataFrame in Spark Connect is now one protobuf message which doesn't > allow creating a large local DataFrame. We should make it streaming. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42552) Get ParseException when run sql: "SELECT 1 UNION SELECT 1;"
[ https://issues.apache.org/jira/browse/SPARK-42552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17713489#comment-17713489 ] Ignite TC Bot commented on SPARK-42552: --- User 'Hisoka-X' has created a pull request for this issue: https://github.com/apache/spark/pull/40823 > Get ParseException when run sql: "SELECT 1 UNION SELECT 1;" > --- > > Key: SPARK-42552 > URL: https://issues.apache.org/jira/browse/SPARK-42552 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.3 > Environment: Scala version 2.12.15 (OpenJDK 64-Bit Server VM, Java > 1.8.0_345) > Spark version 3.2.3-SNAPSHOT >Reporter: jiang13021 >Priority: Major > Fix For: 3.2.3 > > > When I run sql > {code:java} > scala> spark.sql("SELECT 1 UNION SELECT 1;") {code} > I get ParseException: > {code:java} > org.apache.spark.sql.catalyst.parser.ParseException: > mismatched input 'SELECT' expecting {, ';'}(line 1, pos 15)== SQL == > SELECT 1 UNION SELECT 1; > ---^^^ at > org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:266) > at > org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:127) > at > org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:51) > at > org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:77) > at org.apache.spark.sql.SparkSession.$anonfun$sql$2(SparkSession.scala:616) > at > org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:111) > at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:616) > at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775) > at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:613) > ... 
47 elided > {code} > If I run with parentheses , it works well > {code:java} > scala> spark.sql("(SELECT 1) UNION (SELECT 1);") > res4: org.apache.spark.sql.DataFrame = [1: int]{code} > This should be a bug > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
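For reference, the unparenthesized form is valid SQL: a set operation may join two bare SELECTs without parentheses. SQLite — used here purely as a convenient conforming parser, nothing Spark-specific — accepts it, which supports treating Spark's ParseException as a bug rather than a usage error:

```python
import sqlite3

# The exact query Spark 3.2.3 rejects, run against SQLite's parser.
conn = sqlite3.connect(":memory:")
rows = conn.execute("SELECT 1 UNION SELECT 1").fetchall()
print(rows)  # [(1,)] -- UNION deduplicates, so one row survives
```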
[jira] [Resolved] (SPARK-42151) Align UPDATE assignments with table attributes
[ https://issues.apache.org/jira/browse/SPARK-42151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-42151. - Fix Version/s: 3.5.0 Resolution: Fixed Issue resolved by pull request 40308 [https://github.com/apache/spark/pull/40308] > Align UPDATE assignments with table attributes > -- > > Key: SPARK-42151 > URL: https://issues.apache.org/jira/browse/SPARK-42151 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Anton Okolnychyi >Assignee: Anton Okolnychyi >Priority: Major > Fix For: 3.5.0 > > > Assignment in UPDATE commands should be aligned with table attributes prior > to rewriting those UPDATE commands. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42151) Align UPDATE assignments with table attributes
[ https://issues.apache.org/jira/browse/SPARK-42151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-42151: --- Assignee: Anton Okolnychyi > Align UPDATE assignments with table attributes > -- > > Key: SPARK-42151 > URL: https://issues.apache.org/jira/browse/SPARK-42151 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Anton Okolnychyi >Assignee: Anton Okolnychyi >Priority: Major > > Assignment in UPDATE commands should be aligned with table attributes prior > to rewriting those UPDATE commands. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-43153) Skip Spark execution when the dataframe is local.
[ https://issues.apache.org/jira/browse/SPARK-43153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng resolved SPARK-43153. --- Fix Version/s: 3.5.0 Resolution: Fixed Issue resolved by pull request 40806 [https://github.com/apache/spark/pull/40806] > Skip Spark execution when the dataframe is local. > - > > Key: SPARK-43153 > URL: https://issues.apache.org/jira/browse/SPARK-43153 > Project: Spark > Issue Type: Improvement > Components: Connect >Affects Versions: 3.4.0 >Reporter: Takuya Ueshin >Assignee: Takuya Ueshin >Priority: Major > Fix For: 3.5.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-43153) Skip Spark execution when the dataframe is local.
[ https://issues.apache.org/jira/browse/SPARK-43153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng reassigned SPARK-43153: - Assignee: Takuya Ueshin > Skip Spark execution when the dataframe is local. > - > > Key: SPARK-43153 > URL: https://issues.apache.org/jira/browse/SPARK-43153 > Project: Spark > Issue Type: Improvement > Components: Connect >Affects Versions: 3.4.0 >Reporter: Takuya Ueshin >Assignee: Takuya Ueshin >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org