[jira] [Resolved] (SPARK-43165) Move canWrite to DataTypeUtils
[ https://issues.apache.org/jira/browse/SPARK-43165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-43165. - Fix Version/s: 3.5.0 Resolution: Fixed Issue resolved by pull request 40825 [https://github.com/apache/spark/pull/40825] > Move canWrite to DataTypeUtils > -- > > Key: SPARK-43165 > URL: https://issues.apache.org/jira/browse/SPARK-43165 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.5.0 >Reporter: Rui Wang >Assignee: Rui Wang >Priority: Major > Fix For: 3.5.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43179) Add option for applications to control saving of metadata in the External Shuffle Service LevelDB
[ https://issues.apache.org/jira/browse/SPARK-43179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chandni Singh updated SPARK-43179: -- Summary: Add option for applications to control saving of metadata in the External Shuffle Service LevelDB (was: Add option for applications to control saving of metadata in External Shuffle Service LevelDB) > Add option for applications to control saving of metadata in the External > Shuffle Service LevelDB > - > > Key: SPARK-43179 > URL: https://issues.apache.org/jira/browse/SPARK-43179 > Project: Spark > Issue Type: Improvement > Components: Shuffle >Affects Versions: 3.4.0 >Reporter: Chandni Singh >Priority: Major > > Currently, the External Shuffle Service stores application metadata in > LevelDB. This is necessary to enable the shuffle server to resume serving > shuffle data for an application whose executors registered before the > NodeManager restarts. However, the metadata includes the application secret, > which is stored in LevelDB without encryption. This is a potential security > risk, particularly for applications with high security requirements. While > filesystem access control lists (ACLs) can help protect keys and > certificates, they may not be sufficient for some use cases. In response, we > have decided not to store metadata for these high-security applications in > LevelDB. As a result, these applications may experience more failures in the > event of a node restart, but we believe this trade-off is acceptable given > the increased security risk. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43179) Add option for applications to control saving of metadata in External Shuffle Service LevelDB
[ https://issues.apache.org/jira/browse/SPARK-43179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chandni Singh updated SPARK-43179: -- Summary: Add option for applications to control saving of metadata in External Shuffle Service LevelDB (was: Allow applications to control whether their metadata gets saved by the shuffle server in the db) > Add option for applications to control saving of metadata in External Shuffle > Service LevelDB > - > > Key: SPARK-43179 > URL: https://issues.apache.org/jira/browse/SPARK-43179 > Project: Spark > Issue Type: Improvement > Components: Shuffle >Affects Versions: 3.4.0 >Reporter: Chandni Singh >Priority: Major > > Currently, the External Shuffle Service stores application metadata in > LevelDB. This is necessary to enable the shuffle server to resume serving > shuffle data for an application whose executors registered before the > NodeManager restarts. However, the metadata includes the application secret, > which is stored in LevelDB without encryption. This is a potential security > risk, particularly for applications with high security requirements. While > filesystem access control lists (ACLs) can help protect keys and > certificates, they may not be sufficient for some use cases. In response, we > have decided not to store metadata for these high-security applications in > LevelDB. As a result, these applications may experience more failures in the > event of a node restart, but we believe this trade-off is acceptable given > the increased security risk. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-43179) Allow applications to control whether their metadata gets saved by the shuffle server in the db
Chandni Singh created SPARK-43179: - Summary: Allow applications to control whether their metadata gets saved by the shuffle server in the db Key: SPARK-43179 URL: https://issues.apache.org/jira/browse/SPARK-43179 Project: Spark Issue Type: Improvement Components: Shuffle Affects Versions: 3.4.0 Reporter: Chandni Singh Currently, the External Shuffle Service stores application metadata in LevelDB. This is necessary to enable the shuffle server to resume serving shuffle data for an application whose executors registered before the NodeManager restarts. However, the metadata includes the application secret, which is stored in LevelDB without encryption. This is a potential security risk, particularly for applications with high security requirements. While filesystem access control lists (ACLs) can help protect keys and certificates, they may not be sufficient for some use cases. In response, we have decided not to store metadata for these high-security applications in LevelDB. As a result, these applications may experience more failures in the event of a node restart, but we believe this trade-off is acceptable given the increased security risk. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
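[Editor's note] The trade-off described above can be illustrated with a minimal Python sketch. All names here (`AppInfo`, `save_to_db`, the dict standing in for LevelDB) are hypothetical and are not Spark's actual shuffle-service API; the point is only that opting an app out of the recovery store keeps its secret out of the DB at the cost of losing that app across a server restart.

```python
# Sketch (hypothetical names, not Spark's implementation): an opt-out flag
# that gates whether an app's metadata is persisted in the recovery store.

class AppInfo:
    def __init__(self, app_id, secret, save_to_db=True):
        self.app_id = app_id
        self.secret = secret
        self.save_to_db = save_to_db  # hypothetical per-application opt-out

class ShuffleServer:
    def __init__(self):
        self.live_apps = {}    # in-memory state, lost on restart
        self.recovery_db = {}  # stands in for LevelDB; survives restart

    def register(self, app):
        self.live_apps[app.app_id] = app
        if app.save_to_db:
            # the plaintext secret is what lands in the recovery store
            self.recovery_db[app.app_id] = app.secret

    def restart(self):
        # NodeManager restart: only DB-backed apps can be re-served
        self.live_apps = {
            app_id: AppInfo(app_id, secret)
            for app_id, secret in self.recovery_db.items()
        }

server = ShuffleServer()
server.register(AppInfo("app-1", "s3cret"))                       # normal app
server.register(AppInfo("app-2", "t0psecret", save_to_db=False))  # high-security app
server.restart()
print(sorted(server.live_apps))  # only app-1 survives the restart
```

The high-security app's secret never reaches the store, which is exactly the security/availability trade-off the issue accepts.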
[jira] [Commented] (SPARK-35877) Spark Protobuf jar has CVE issue CVE-2015-5237
[ https://issues.apache.org/jira/browse/SPARK-35877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17713891#comment-17713891 ] Abhay Dandekar commented on SPARK-35877: Dear team, any target version for this protobuf upgrade? I checked in the latest SPARK (spark-3.3.2-bin-hadoop3), and it is still using protobuf-java-2.5.0.jar. Thank you. > Spark Protobuf jar has CVE issue CVE-2015-5237 > -- > > Key: SPARK-35877 > URL: https://issues.apache.org/jira/browse/SPARK-35877 > Project: Spark > Issue Type: Bug > Components: Security, Spark Core >Affects Versions: 2.4.5, 3.1.1 >Reporter: jobit mathew >Priority: Minor > > Spark Protobuf jar has CVE issue CVE-2015-5237
[jira] [Commented] (SPARK-42845) Assign a name to the error class _LEGACY_ERROR_TEMP_2010
[ https://issues.apache.org/jira/browse/SPARK-42845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17713886#comment-17713886 ] Snoot.io commented on SPARK-42845: -- User 'liang3zy22' has created a pull request for this issue: https://github.com/apache/spark/pull/40817 > Assign a name to the error class _LEGACY_ERROR_TEMP_2010 > > > Key: SPARK-42845 > URL: https://issues.apache.org/jira/browse/SPARK-42845 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.5.0 >Reporter: Max Gekk >Priority: Minor > Labels: starter > > Choose a proper name for the error class *_LEGACY_ERROR_TEMP_2010* defined in > {*}core/src/main/resources/error/error-classes.json{*}. The name should be > short but complete (look at the example in error-classes.json). > Add a test which triggers the error from user code if such test still doesn't > exist. Check exception fields by using {*}checkError(){*}. The last function > checks valuable error fields only, and avoids dependencies from error text > message. In this way, tech editors can modify error format in > error-classes.json, and don't worry of Spark's internal tests. Migrate other > tests that might trigger the error onto checkError(). > If you cannot reproduce the error from user space (using SQL query), replace > the error by an internal error, see {*}SparkException.internalError(){*}. > Improve the error message format in error-classes.json if the current is not > clear. Propose a solution to users how to avoid and fix such kind of errors. > Please, look at the PR below as examples: > * [https://github.com/apache/spark/pull/38685] > * [https://github.com/apache/spark/pull/38656] > * [https://github.com/apache/spark/pull/38490] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
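[Editor's note] The checkError() pattern the ticket asks for can be sketched in plain Python. This is an illustration of the idea, not Spark's actual ScalaTest API: the test asserts on the stable error class and parameters, never on the rendered message, so editors can reword templates in error-classes.json without breaking tests.

```python
# Sketch of the checkError() idea: compare an error's class and parameters,
# not its formatted message text. Class/function names are illustrative.

class SparkLikeError(Exception):
    def __init__(self, error_class, parameters):
        self.error_class = error_class
        self.parameters = parameters
        # the human-readable text is rendered from a template and may change
        super().__init__(f"[{error_class}] {parameters}")

def check_error(exc, error_class, parameters):
    # assert only on the stable, machine-readable fields
    assert exc.error_class == error_class, exc.error_class
    assert exc.parameters == parameters, exc.parameters

try:
    raise SparkLikeError("DIVIDE_BY_ZERO", {"config": "spark.sql.ansi.enabled"})
except SparkLikeError as e:
    check_error(e, "DIVIDE_BY_ZERO", {"config": "spark.sql.ansi.enabled"})
    print("ok")
```

A test written this way keeps passing even if the message template for the class is later rewritten.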
[jira] [Commented] (SPARK-42845) Assign a name to the error class _LEGACY_ERROR_TEMP_2010
[ https://issues.apache.org/jira/browse/SPARK-42845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17713885#comment-17713885 ] Snoot.io commented on SPARK-42845: -- User 'liang3zy22' has created a pull request for this issue: https://github.com/apache/spark/pull/40817 > Assign a name to the error class _LEGACY_ERROR_TEMP_2010 > > > Key: SPARK-42845 > URL: https://issues.apache.org/jira/browse/SPARK-42845 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.5.0 >Reporter: Max Gekk >Priority: Minor > Labels: starter > > Choose a proper name for the error class *_LEGACY_ERROR_TEMP_2010* defined in > {*}core/src/main/resources/error/error-classes.json{*}. The name should be > short but complete (look at the example in error-classes.json). > Add a test which triggers the error from user code if such test still doesn't > exist. Check exception fields by using {*}checkError(){*}. The last function > checks valuable error fields only, and avoids dependencies from error text > message. In this way, tech editors can modify error format in > error-classes.json, and don't worry of Spark's internal tests. Migrate other > tests that might trigger the error onto checkError(). > If you cannot reproduce the error from user space (using SQL query), replace > the error by an internal error, see {*}SparkException.internalError(){*}. > Improve the error message format in error-classes.json if the current is not > clear. Propose a solution to users how to avoid and fix such kind of errors. > Please, look at the PR below as examples: > * [https://github.com/apache/spark/pull/38685] > * [https://github.com/apache/spark/pull/38656] > * [https://github.com/apache/spark/pull/38490] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-43170) The spark sql like statement is pushed down to parquet for execution, but the data cannot be queried
[ https://issues.apache.org/jira/browse/SPARK-43170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang resolved SPARK-43170. - Resolution: Not A Bug > The spark sql like statement is pushed down to parquet for execution, but the > data cannot be queried > > > Key: SPARK-43170 > URL: https://issues.apache.org/jira/browse/SPARK-43170 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.2 >Reporter: todd >Priority: Major > Attachments: image-2023-04-18-10-59-30-199.png, > image-2023-04-19-10-59-44-118.png, screenshot-1.png > > > --DDL > CREATE TABLE `ecom_dwm`.`dwm_user_app_action_sum_all` ( > `gaid` STRING COMMENT '', > `beyla_id` STRING COMMENT '', > `dt` STRING, > `hour` STRING, > `appid` STRING COMMENT '包名') > USING parquet > PARTITIONED BY (dt, hour, appid) > LOCATION 's3://x/dwm_user_app_action_sum_all' > – partitions info > show partitions ecom_dwm.dwm_user_app_action_sum_all PARTITION > (dt='20230412'); > > dt=20230412/hour=23/appid=blibli.mobile.commerce > dt=20230412/hour=23/appid=cn.shopee.app > dt=20230412/hour=23/appid=cn.shopee.br > dt=20230412/hour=23/appid=cn.shopee.id > dt=20230412/hour=23/appid=cn.shopee.my > dt=20230412/hour=23/appid=cn.shopee.ph > > — query > select DISTINCT(appid) from ecom_dwm.dwm_user_app_action_sum_all > where dt='20230412' and appid like '%shopee%' > > --result > nodata > > — other > I use spark3.0.1 version and trino query engine to query the data。 > > > The physical execution node formed by spark 3.2 > (3) Scan parquet ecom_dwm.dwm_user_app_action_sum_all Output [3]: [dt#63, > hour#64, appid#65|#63, hour#64, appid#65] Batched: true Location: > InMemoryFileIndex [] > PartitionFilters: [isnotnull(dt#63), isnotnull(appid#65), (dt#63 = 20230412), > Contains(appid#65, shopee)|#63), isnotnull(appid#65), (dt#63 = 20230412), > Contains(appid#65, shopee)] ReadSchema: struct<> > > > !image-2023-04-18-10-59-30-199.png! 
> > – sql plan detail > {code:java} > == Physical Plan == > CollectLimit (9) > +- InMemoryTableScan (1) > +- InMemoryRelation (2) > +- * HashAggregate (8) >+- Exchange (7) > +- * HashAggregate (6) > +- * Project (5) > +- * ColumnarToRow (4) >+- Scan parquet > ecom_dwm.dwm_user_app_action_sum_all (3) > (1) InMemoryTableScan > Output [1]: [appid#65] > Arguments: [appid#65] > (2) InMemoryRelation > Arguments: [appid#65], > CachedRDDBuilder(org.apache.spark.sql.execution.columnar.DefaultCachedBatchSerializer@ab5af13,StorageLevel(disk, > memory, deserialized, 1 replicas),*(2) HashAggregate(keys=[appid#65], > functions=[], output=[appid#65]) > +- Exchange hashpartitioning(appid#65, 200), ENSURE_REQUIREMENTS, [plan_id=24] >+- *(1) HashAggregate(keys=[appid#65], functions=[], output=[appid#65]) > +- *(1) Project [appid#65] > +- *(1) ColumnarToRow > +- FileScan parquet > ecom_dwm.dwm_user_app_action_sum_all[dt#63,hour#64,appid#65] Batched: true, > DataFilters: [], Format: Parquet, Location: InMemoryFileIndex(0 paths)[], > PartitionFilters: [isnotnull(dt#63), isnotnull(appid#65), (dt#63 = 20230412), > StartsWith(appid#65, com)], PushedFilters: [], ReadSchema: struct<> > ,None) > (3) Scan parquet ecom_dwm.dwm_user_app_action_sum_all > Output [3]: [dt#63, hour#64, appid#65] > Batched: true > Location: InMemoryFileIndex [] > PartitionFilters: [isnotnull(dt#63), isnotnull(appid#65), (dt#63 = 20230412), > StartsWith(appid#65, com)] > ReadSchema: struct<> > (4) ColumnarToRow [codegen id : 1] > Input [3]: [dt#63, hour#64, appid#65] > (5) Project [codegen id : 1] > Output [1]: [appid#65] > Input [3]: [dt#63, hour#64, appid#65] > (6) HashAggregate [codegen id : 1] > Input [1]: [appid#65] > Keys [1]: [appid#65] > Functions: [] > Aggregate Attributes: [] > Results [1]: [appid#65] > (7) Exchange > Input [1]: [appid#65] > Arguments: hashpartitioning(appid#65, 200), ENSURE_REQUIREMENTS, [plan_id=24] > (8) HashAggregate [codegen id : 2] > Input [1]: [appid#65] > Keys [1]: [appid#65] > 
Functions: [] > Aggregate Attributes: [] > Results [1]: [appid#65] > (9) CollectLimit > Input [1]: [appid#65] > Arguments: 1 {code}
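[Editor's note] The plan above shows a LIKE predicate arriving at the scan as a pushable filter such as Contains(appid#65, shopee) or StartsWith(appid#65, com). A simplified Python mimic of that rewrite (Spark's actual rule lives in the Catalyst optimizer; this sketch handles only the simple wildcard shapes and is not Spark code):

```python
# Simplified mimic of rewriting an escape-free LIKE pattern into a
# pushable filter: '%x%' -> Contains, 'x%' -> StartsWith, '%x' -> EndsWith,
# no wildcard -> EqualTo. Patterns this sketch can't handle stay as Like.

def simplify_like(pattern):
    if "_" in pattern:
        return ("Like", pattern)       # single-char wildcard: not handled here
    stripped = pattern.strip("%")
    if "%" in stripped:
        return ("Like", pattern)       # interior %: not handled in this sketch
    if pattern.startswith("%") and pattern.endswith("%"):
        return ("Contains", stripped)
    if pattern.endswith("%"):
        return ("StartsWith", stripped)
    if pattern.startswith("%"):
        return ("EndsWith", stripped)
    return ("EqualTo", pattern)

print(simplify_like("%shopee%"))  # ('Contains', 'shopee')
print(simplify_like("com%"))      # ('StartsWith', 'com')
```

This is why the query's `appid like '%shopee%'` shows up in the scan node as a Contains partition filter rather than a raw LIKE.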
[jira] [Updated] (SPARK-43170) The spark sql like statement is pushed down to parquet for execution, but the data cannot be queried
[ https://issues.apache.org/jira/browse/SPARK-43170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-43170: Attachment: screenshot-1.png > The spark sql like statement is pushed down to parquet for execution, but the > data cannot be queried > > > Key: SPARK-43170 > URL: https://issues.apache.org/jira/browse/SPARK-43170 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.2 >Reporter: todd >Priority: Major > Attachments: image-2023-04-18-10-59-30-199.png, > image-2023-04-19-10-59-44-118.png, screenshot-1.png > > > --DDL > CREATE TABLE `ecom_dwm`.`dwm_user_app_action_sum_all` ( > `gaid` STRING COMMENT '', > `beyla_id` STRING COMMENT '', > `dt` STRING, > `hour` STRING, > `appid` STRING COMMENT '包名') > USING parquet > PARTITIONED BY (dt, hour, appid) > LOCATION 's3://x/dwm_user_app_action_sum_all' > – partitions info > show partitions ecom_dwm.dwm_user_app_action_sum_all PARTITION > (dt='20230412'); > > dt=20230412/hour=23/appid=blibli.mobile.commerce > dt=20230412/hour=23/appid=cn.shopee.app > dt=20230412/hour=23/appid=cn.shopee.br > dt=20230412/hour=23/appid=cn.shopee.id > dt=20230412/hour=23/appid=cn.shopee.my > dt=20230412/hour=23/appid=cn.shopee.ph > > — query > select DISTINCT(appid) from ecom_dwm.dwm_user_app_action_sum_all > where dt='20230412' and appid like '%shopee%' > > --result > nodata > > — other > I use spark3.0.1 version and trino query engine to query the data。 > > > The physical execution node formed by spark 3.2 > (3) Scan parquet ecom_dwm.dwm_user_app_action_sum_all Output [3]: [dt#63, > hour#64, appid#65|#63, hour#64, appid#65] Batched: true Location: > InMemoryFileIndex [] > PartitionFilters: [isnotnull(dt#63), isnotnull(appid#65), (dt#63 = 20230412), > Contains(appid#65, shopee)|#63), isnotnull(appid#65), (dt#63 = 20230412), > Contains(appid#65, shopee)] ReadSchema: struct<> > > > !image-2023-04-18-10-59-30-199.png! 
> > – sql plan detail > {code:java} > == Physical Plan == > CollectLimit (9) > +- InMemoryTableScan (1) > +- InMemoryRelation (2) > +- * HashAggregate (8) >+- Exchange (7) > +- * HashAggregate (6) > +- * Project (5) > +- * ColumnarToRow (4) >+- Scan parquet > ecom_dwm.dwm_user_app_action_sum_all (3) > (1) InMemoryTableScan > Output [1]: [appid#65] > Arguments: [appid#65] > (2) InMemoryRelation > Arguments: [appid#65], > CachedRDDBuilder(org.apache.spark.sql.execution.columnar.DefaultCachedBatchSerializer@ab5af13,StorageLevel(disk, > memory, deserialized, 1 replicas),*(2) HashAggregate(keys=[appid#65], > functions=[], output=[appid#65]) > +- Exchange hashpartitioning(appid#65, 200), ENSURE_REQUIREMENTS, [plan_id=24] >+- *(1) HashAggregate(keys=[appid#65], functions=[], output=[appid#65]) > +- *(1) Project [appid#65] > +- *(1) ColumnarToRow > +- FileScan parquet > ecom_dwm.dwm_user_app_action_sum_all[dt#63,hour#64,appid#65] Batched: true, > DataFilters: [], Format: Parquet, Location: InMemoryFileIndex(0 paths)[], > PartitionFilters: [isnotnull(dt#63), isnotnull(appid#65), (dt#63 = 20230412), > StartsWith(appid#65, com)], PushedFilters: [], ReadSchema: struct<> > ,None) > (3) Scan parquet ecom_dwm.dwm_user_app_action_sum_all > Output [3]: [dt#63, hour#64, appid#65] > Batched: true > Location: InMemoryFileIndex [] > PartitionFilters: [isnotnull(dt#63), isnotnull(appid#65), (dt#63 = 20230412), > StartsWith(appid#65, com)] > ReadSchema: struct<> > (4) ColumnarToRow [codegen id : 1] > Input [3]: [dt#63, hour#64, appid#65] > (5) Project [codegen id : 1] > Output [1]: [appid#65] > Input [3]: [dt#63, hour#64, appid#65] > (6) HashAggregate [codegen id : 1] > Input [1]: [appid#65] > Keys [1]: [appid#65] > Functions: [] > Aggregate Attributes: [] > Results [1]: [appid#65] > (7) Exchange > Input [1]: [appid#65] > Arguments: hashpartitioning(appid#65, 200), ENSURE_REQUIREMENTS, [plan_id=24] > (8) HashAggregate [codegen id : 2] > Input [1]: [appid#65] > Keys [1]: [appid#65] > 
Functions: [] > Aggregate Attributes: [] > Results [1]: [appid#65] > (9) CollectLimit > Input [1]: [appid#65] > Arguments: 1 {code}
[jira] [Commented] (SPARK-43170) The spark sql like statement is pushed down to parquet for execution, but the data cannot be queried
[ https://issues.apache.org/jira/browse/SPARK-43170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17713861#comment-17713861 ] Yuming Wang commented on SPARK-43170: - Maybe your partition exists, but there is no data under the partition, such as the following: !screenshot-1.png! {noformat} yumwang@LM-SHC-16508156 dwm_user_app_action_sum_all2 % ls -R dt=20230412 ./dt=20230412: hour=23 ./dt=20230412/hour=23: appid=blibli.mobile.commerceappid=cn.shopee.br appid=cn.shopee.my appid=cn.shopee.app appid=cn.shopee.id appid=cn.shopee.ph ./dt=20230412/hour=23/appid=blibli.mobile.commerce: ./dt=20230412/hour=23/appid=cn.shopee.app: ./dt=20230412/hour=23/appid=cn.shopee.br: ./dt=20230412/hour=23/appid=cn.shopee.id: ./dt=20230412/hour=23/appid=cn.shopee.my: ./dt=20230412/hour=23/appid=cn.shopee.ph: {noformat} > The spark sql like statement is pushed down to parquet for execution, but the > data cannot be queried > > > Key: SPARK-43170 > URL: https://issues.apache.org/jira/browse/SPARK-43170 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.2 >Reporter: todd >Priority: Major > Attachments: image-2023-04-18-10-59-30-199.png, > image-2023-04-19-10-59-44-118.png, screenshot-1.png > > > --DDL > CREATE TABLE `ecom_dwm`.`dwm_user_app_action_sum_all` ( > `gaid` STRING COMMENT '', > `beyla_id` STRING COMMENT '', > `dt` STRING, > `hour` STRING, > `appid` STRING COMMENT '包名') > USING parquet > PARTITIONED BY (dt, hour, appid) > LOCATION 's3://x/dwm_user_app_action_sum_all' > – partitions info > show partitions ecom_dwm.dwm_user_app_action_sum_all PARTITION > (dt='20230412'); > > dt=20230412/hour=23/appid=blibli.mobile.commerce > dt=20230412/hour=23/appid=cn.shopee.app > dt=20230412/hour=23/appid=cn.shopee.br > dt=20230412/hour=23/appid=cn.shopee.id > dt=20230412/hour=23/appid=cn.shopee.my > dt=20230412/hour=23/appid=cn.shopee.ph > > — query > select DISTINCT(appid) from ecom_dwm.dwm_user_app_action_sum_all > where 
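[Editor's note] The situation in the comment above — a partition that exists in the metastore and matches the filter, but whose directory holds no data files — can be reproduced in miniature without Spark. This sketch only simulates the pruning-then-read sequence; the directory layout mirrors the `ls -R` output, and the scan logic is illustrative, not Spark's.

```python
import os
import tempfile

# Build the partition layout from the comment, but write no data files
# into the leaf directories.
root = tempfile.mkdtemp()
for appid in ["cn.shopee.app", "blibli.mobile.commerce"]:
    os.makedirs(os.path.join(root, "dt=20230412", "hour=23", f"appid={appid}"))

def scan(root, contains):
    """Prune partition directories by a Contains filter, then list files."""
    rows = []
    for dirpath, _dirs, files in os.walk(root):
        leaf = os.path.basename(dirpath)
        if leaf.startswith("appid=") and contains in leaf:
            rows.extend(files)  # data files would be read here
    return rows

print(scan(root, "shopee"))  # [] -> partition matched, but no data to return
```

The Contains filter correctly selects `appid=cn.shopee.app`, yet the query still returns no rows, matching the "partition exists but is empty" explanation rather than a pushdown bug.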
dt='20230412' and appid like '%shopee%' > > --result > nodata > > — other > I use spark3.0.1 version and trino query engine to query the data。 > > > The physical execution node formed by spark 3.2 > (3) Scan parquet ecom_dwm.dwm_user_app_action_sum_all Output [3]: [dt#63, > hour#64, appid#65|#63, hour#64, appid#65] Batched: true Location: > InMemoryFileIndex [] > PartitionFilters: [isnotnull(dt#63), isnotnull(appid#65), (dt#63 = 20230412), > Contains(appid#65, shopee)|#63), isnotnull(appid#65), (dt#63 = 20230412), > Contains(appid#65, shopee)] ReadSchema: struct<> > > > !image-2023-04-18-10-59-30-199.png! > > – sql plan detail > {code:java} > == Physical Plan == > CollectLimit (9) > +- InMemoryTableScan (1) > +- InMemoryRelation (2) > +- * HashAggregate (8) >+- Exchange (7) > +- * HashAggregate (6) > +- * Project (5) > +- * ColumnarToRow (4) >+- Scan parquet > ecom_dwm.dwm_user_app_action_sum_all (3) > (1) InMemoryTableScan > Output [1]: [appid#65] > Arguments: [appid#65] > (2) InMemoryRelation > Arguments: [appid#65], > CachedRDDBuilder(org.apache.spark.sql.execution.columnar.DefaultCachedBatchSerializer@ab5af13,StorageLevel(disk, > memory, deserialized, 1 replicas),*(2) HashAggregate(keys=[appid#65], > functions=[], output=[appid#65]) > +- Exchange hashpartitioning(appid#65, 200), ENSURE_REQUIREMENTS, [plan_id=24] >+- *(1) HashAggregate(keys=[appid#65], functions=[], output=[appid#65]) > +- *(1) Project [appid#65] > +- *(1) ColumnarToRow > +- FileScan parquet > ecom_dwm.dwm_user_app_action_sum_all[dt#63,hour#64,appid#65] Batched: true, > DataFilters: [], Format: Parquet, Location: InMemoryFileIndex(0 paths)[], > PartitionFilters: [isnotnull(dt#63), isnotnull(appid#65), (dt#63 = 20230412), > StartsWith(appid#65, com)], PushedFilters: [], ReadSchema: struct<> > ,None) > (3) Scan parquet ecom_dwm.dwm_user_app_action_sum_all > Output [3]: [dt#63, hour#64, appid#65] > Batched: true > Location: InMemoryFileIndex [] > PartitionFilters: [isnotnull(dt#63), 
isnotnull(appid#65), (dt#63 = 20230412), > StartsWith(appid#65, com)] > ReadSchema: struct<> > (4) ColumnarToRow [codegen id : 1] > Input [3]: [dt#63, hour#64, appid#65] > (5) Project [codegen id : 1] > Output [1]: [appid#65] > Input [3]: [dt#63, hour#64, appid#65] > (6) HashAggregate [codegen id : 1] > Input [1]: [appid#65] > Keys [1]: [appid#65] > Functions: [] > Aggregate Attributes: [] > Results [1]: [appid#65] > (7) Exchange > Input [1]: [appid#65] > Arguments: hashpartitioning(appid#65, 200), ENSURE_REQU
[jira] [Commented] (SPARK-43170) The spark sql like statement is pushed down to parquet for execution, but the data cannot be queried
[ https://issues.apache.org/jira/browse/SPARK-43170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17713858#comment-17713858 ] Yuming Wang commented on SPARK-43170: - I can't reproduce this issue: !image-2023-04-19-10-59-44-118.png! > The spark sql like statement is pushed down to parquet for execution, but the > data cannot be queried > > > Key: SPARK-43170 > URL: https://issues.apache.org/jira/browse/SPARK-43170 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.2 >Reporter: todd >Priority: Major > Attachments: image-2023-04-18-10-59-30-199.png, > image-2023-04-19-10-59-44-118.png > > > --DDL > CREATE TABLE `ecom_dwm`.`dwm_user_app_action_sum_all` ( > `gaid` STRING COMMENT '', > `beyla_id` STRING COMMENT '', > `dt` STRING, > `hour` STRING, > `appid` STRING COMMENT '包名') > USING parquet > PARTITIONED BY (dt, hour, appid) > LOCATION 's3://x/dwm_user_app_action_sum_all' > – partitions info > show partitions ecom_dwm.dwm_user_app_action_sum_all PARTITION > (dt='20230412'); > > dt=20230412/hour=23/appid=blibli.mobile.commerce > dt=20230412/hour=23/appid=cn.shopee.app > dt=20230412/hour=23/appid=cn.shopee.br > dt=20230412/hour=23/appid=cn.shopee.id > dt=20230412/hour=23/appid=cn.shopee.my > dt=20230412/hour=23/appid=cn.shopee.ph > > — query > select DISTINCT(appid) from ecom_dwm.dwm_user_app_action_sum_all > where dt='20230412' and appid like '%shopee%' > > --result > nodata > > — other > I use spark3.0.1 version and trino query engine to query the data。 > > > The physical execution node formed by spark 3.2 > (3) Scan parquet ecom_dwm.dwm_user_app_action_sum_all Output [3]: [dt#63, > hour#64, appid#65|#63, hour#64, appid#65] Batched: true Location: > InMemoryFileIndex [] > PartitionFilters: [isnotnull(dt#63), isnotnull(appid#65), (dt#63 = 20230412), > Contains(appid#65, shopee)|#63), isnotnull(appid#65), (dt#63 = 20230412), > Contains(appid#65, shopee)] ReadSchema: struct<> > > > 
!image-2023-04-18-10-59-30-199.png! > > – sql plan detail > {code:java} > == Physical Plan == > CollectLimit (9) > +- InMemoryTableScan (1) > +- InMemoryRelation (2) > +- * HashAggregate (8) >+- Exchange (7) > +- * HashAggregate (6) > +- * Project (5) > +- * ColumnarToRow (4) >+- Scan parquet > ecom_dwm.dwm_user_app_action_sum_all (3) > (1) InMemoryTableScan > Output [1]: [appid#65] > Arguments: [appid#65] > (2) InMemoryRelation > Arguments: [appid#65], > CachedRDDBuilder(org.apache.spark.sql.execution.columnar.DefaultCachedBatchSerializer@ab5af13,StorageLevel(disk, > memory, deserialized, 1 replicas),*(2) HashAggregate(keys=[appid#65], > functions=[], output=[appid#65]) > +- Exchange hashpartitioning(appid#65, 200), ENSURE_REQUIREMENTS, [plan_id=24] >+- *(1) HashAggregate(keys=[appid#65], functions=[], output=[appid#65]) > +- *(1) Project [appid#65] > +- *(1) ColumnarToRow > +- FileScan parquet > ecom_dwm.dwm_user_app_action_sum_all[dt#63,hour#64,appid#65] Batched: true, > DataFilters: [], Format: Parquet, Location: InMemoryFileIndex(0 paths)[], > PartitionFilters: [isnotnull(dt#63), isnotnull(appid#65), (dt#63 = 20230412), > StartsWith(appid#65, com)], PushedFilters: [], ReadSchema: struct<> > ,None) > (3) Scan parquet ecom_dwm.dwm_user_app_action_sum_all > Output [3]: [dt#63, hour#64, appid#65] > Batched: true > Location: InMemoryFileIndex [] > PartitionFilters: [isnotnull(dt#63), isnotnull(appid#65), (dt#63 = 20230412), > StartsWith(appid#65, com)] > ReadSchema: struct<> > (4) ColumnarToRow [codegen id : 1] > Input [3]: [dt#63, hour#64, appid#65] > (5) Project [codegen id : 1] > Output [1]: [appid#65] > Input [3]: [dt#63, hour#64, appid#65] > (6) HashAggregate [codegen id : 1] > Input [1]: [appid#65] > Keys [1]: [appid#65] > Functions: [] > Aggregate Attributes: [] > Results [1]: [appid#65] > (7) Exchange > Input [1]: [appid#65] > Arguments: hashpartitioning(appid#65, 200), ENSURE_REQUIREMENTS, [plan_id=24] > (8) HashAggregate [codegen id : 2] > Input [1]: 
[appid#65] > Keys [1]: [appid#65] > Functions: [] > Aggregate Attributes: [] > Results [1]: [appid#65] > (9) CollectLimit > Input [1]: [appid#65] > Arguments: 1 {code}
[jira] [Updated] (SPARK-43170) The spark sql like statement is pushed down to parquet for execution, but the data cannot be queried
[ https://issues.apache.org/jira/browse/SPARK-43170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-43170: Attachment: image-2023-04-19-10-59-44-118.png > The spark sql like statement is pushed down to parquet for execution, but the > data cannot be queried > > > Key: SPARK-43170 > URL: https://issues.apache.org/jira/browse/SPARK-43170 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.2 >Reporter: todd >Priority: Major > Attachments: image-2023-04-18-10-59-30-199.png, > image-2023-04-19-10-59-44-118.png > > > --DDL > CREATE TABLE `ecom_dwm`.`dwm_user_app_action_sum_all` ( > `gaid` STRING COMMENT '', > `beyla_id` STRING COMMENT '', > `dt` STRING, > `hour` STRING, > `appid` STRING COMMENT '包名') > USING parquet > PARTITIONED BY (dt, hour, appid) > LOCATION 's3://x/dwm_user_app_action_sum_all' > – partitions info > show partitions ecom_dwm.dwm_user_app_action_sum_all PARTITION > (dt='20230412'); > > dt=20230412/hour=23/appid=blibli.mobile.commerce > dt=20230412/hour=23/appid=cn.shopee.app > dt=20230412/hour=23/appid=cn.shopee.br > dt=20230412/hour=23/appid=cn.shopee.id > dt=20230412/hour=23/appid=cn.shopee.my > dt=20230412/hour=23/appid=cn.shopee.ph > > — query > select DISTINCT(appid) from ecom_dwm.dwm_user_app_action_sum_all > where dt='20230412' and appid like '%shopee%' > > --result > nodata > > — other > I use spark3.0.1 version and trino query engine to query the data。 > > > The physical execution node formed by spark 3.2 > (3) Scan parquet ecom_dwm.dwm_user_app_action_sum_all Output [3]: [dt#63, > hour#64, appid#65|#63, hour#64, appid#65] Batched: true Location: > InMemoryFileIndex [] > PartitionFilters: [isnotnull(dt#63), isnotnull(appid#65), (dt#63 = 20230412), > Contains(appid#65, shopee)|#63), isnotnull(appid#65), (dt#63 = 20230412), > Contains(appid#65, shopee)] ReadSchema: struct<> > > > !image-2023-04-18-10-59-30-199.png! 
> > – sql plan detail > {code:java} > == Physical Plan == > CollectLimit (9) > +- InMemoryTableScan (1) > +- InMemoryRelation (2) > +- * HashAggregate (8) >+- Exchange (7) > +- * HashAggregate (6) > +- * Project (5) > +- * ColumnarToRow (4) >+- Scan parquet > ecom_dwm.dwm_user_app_action_sum_all (3) > (1) InMemoryTableScan > Output [1]: [appid#65] > Arguments: [appid#65] > (2) InMemoryRelation > Arguments: [appid#65], > CachedRDDBuilder(org.apache.spark.sql.execution.columnar.DefaultCachedBatchSerializer@ab5af13,StorageLevel(disk, > memory, deserialized, 1 replicas),*(2) HashAggregate(keys=[appid#65], > functions=[], output=[appid#65]) > +- Exchange hashpartitioning(appid#65, 200), ENSURE_REQUIREMENTS, [plan_id=24] >+- *(1) HashAggregate(keys=[appid#65], functions=[], output=[appid#65]) > +- *(1) Project [appid#65] > +- *(1) ColumnarToRow > +- FileScan parquet > ecom_dwm.dwm_user_app_action_sum_all[dt#63,hour#64,appid#65] Batched: true, > DataFilters: [], Format: Parquet, Location: InMemoryFileIndex(0 paths)[], > PartitionFilters: [isnotnull(dt#63), isnotnull(appid#65), (dt#63 = 20230412), > StartsWith(appid#65, com)], PushedFilters: [], ReadSchema: struct<> > ,None) > (3) Scan parquet ecom_dwm.dwm_user_app_action_sum_all > Output [3]: [dt#63, hour#64, appid#65] > Batched: true > Location: InMemoryFileIndex [] > PartitionFilters: [isnotnull(dt#63), isnotnull(appid#65), (dt#63 = 20230412), > StartsWith(appid#65, com)] > ReadSchema: struct<> > (4) ColumnarToRow [codegen id : 1] > Input [3]: [dt#63, hour#64, appid#65] > (5) Project [codegen id : 1] > Output [1]: [appid#65] > Input [3]: [dt#63, hour#64, appid#65] > (6) HashAggregate [codegen id : 1] > Input [1]: [appid#65] > Keys [1]: [appid#65] > Functions: [] > Aggregate Attributes: [] > Results [1]: [appid#65] > (7) Exchange > Input [1]: [appid#65] > Arguments: hashpartitioning(appid#65, 200), ENSURE_REQUIREMENTS, [plan_id=24] > (8) HashAggregate [codegen id : 2] > Input [1]: [appid#65] > Keys [1]: [appid#65] > 
Functions: [] > Aggregate Attributes: [] > Results [1]: [appid#65] > (9) CollectLimit > Input [1]: [appid#65] > Arguments: 1 {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
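For context on the pushdown above: a LIKE pattern whose only wildcards are a leading and/or trailing '%' can be rewritten into a plain string predicate before being pushed to the source, which is why `like '%shopee%'` appears as Contains(appid#65, shopee) in the scan node, while the cached-plan dump shows StartsWith(appid#65, com), apparently pasted from a different run. The sketch below is a rough, standalone Python illustration of that rewrite, not Spark's actual implementation (Spark's optimizer rule is LikeSimplification):

```python
# Illustrative sketch: map a SQL LIKE pattern to a simple string predicate
# when its only wildcards are a leading/trailing '%', mirroring the kind of
# rewrite Spark's LikeSimplification rule performs before pushdown.
def simplify_like(pattern: str):
    """Return (predicate_name, needle), or None if the pattern cannot be
    simplified (inner wildcards or '_' single-char wildcards)."""
    if "_" in pattern:
        return None  # '_' needs real LIKE matching
    body = pattern.strip("%")
    if "%" in body:
        return None  # an inner '%' is not a simple prefix/suffix/substring
    if pattern.startswith("%") and pattern.endswith("%"):
        return ("Contains", body)
    if pattern.endswith("%"):
        return ("StartsWith", body)
    if pattern.startswith("%"):
        return ("EndsWith", body)
    return ("EqualTo", body)

print(simplify_like("%shopee%"))  # ('Contains', 'shopee')
print(simplify_like("com%"))      # ('StartsWith', 'com')
```

With this mapping, the filter in the scan node is exactly what the reported query should produce.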
[jira] [Commented] (SPARK-43170) The spark sql like statement is pushed down to parquet for execution, but the data cannot be queried
[ https://issues.apache.org/jira/browse/SPARK-43170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17713855#comment-17713855 ] todd commented on SPARK-43170: -- [~yumwang] The code only executes spark.sql("xxx"); it does not perform any cache-related operations. Yet the same code gives different results on Spark 3.0 and Spark 3.2 -- why? If it is convenient for you, you can reproduce it. > The spark sql like statement is pushed down to parquet for execution, but the > data cannot be queried > > > Key: SPARK-43170 > URL: https://issues.apache.org/jira/browse/SPARK-43170 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.2 >Reporter: todd >Priority: Major > Attachments: image-2023-04-18-10-59-30-199.png > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
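To make the failure mode concrete, here is a toy simulation of partition pruning over the appid partition values listed in the ticket (plain Python, not Spark). Applying the Contains filter the query asked for keeps the five shopee partitions, while the StartsWith(appid, com) filter shown in the cached-plan dump matches nothing, which would explain the empty result:

```python
# Partition values for dt=20230412/hour=23, copied from the ticket.
partitions = [
    "blibli.mobile.commerce",
    "cn.shopee.app", "cn.shopee.br", "cn.shopee.id",
    "cn.shopee.my", "cn.shopee.ph",
]

# Pruning with the filter the query requested: like '%shopee%' -> Contains.
contains = [p for p in partitions if "shopee" in p]

# Pruning with the filter visible in the pasted cached plan: StartsWith 'com'.
starts_with = [p for p in partitions if p.startswith("com")]

print(len(contains))  # 5 partitions survive
print(starts_with)    # [] -> "no data"
```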
[jira] [Assigned] (SPARK-43098) Should not handle the COUNT bug when the GROUP BY clause of a correlated scalar subquery is non-empty
[ https://issues.apache.org/jira/browse/SPARK-43098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-43098: --- Assignee: Jack Chen > Should not handle the COUNT bug when the GROUP BY clause of a correlated > scalar subquery is non-empty > - > > Key: SPARK-43098 > URL: https://issues.apache.org/jira/browse/SPARK-43098 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Jack Chen >Assignee: Jack Chen >Priority: Major > Fix For: 3.4.1, 3.5.0 > > > From [~allisonwang-db]: > There is no COUNT bug when the correlated equality predicates are also in the > group by clause. However, the current logic to handle the COUNT bug still > adds a default aggregate function value and returns incorrect results. > > {code:java} > create view t1(c1, c2) as values (0, 1), (1, 2); > create view t2(c1, c2) as values (0, 2), (0, 3); > select c1, c2, (select count(*) from t2 where t1.c1 = t2.c1 group by c1) from > t1; > -- Correct answer: [(0, 1, 2), (1, 2, null)] > +---+---+------------------+ > |c1 |c2 |scalarsubquery(c1)| > +---+---+------------------+ > |0 |1 |2 | > |1 |2 |0 | > +---+---+------------------+ > {code} > > This bug affects scalar subqueries in RewriteCorrelatedScalarSubquery, but > lateral subqueries handle it correctly in DecorrelateInnerQuery. Related: > https://issues.apache.org/jira/browse/SPARK-36113 > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-43098) Should not handle the COUNT bug when the GROUP BY clause of a correlated scalar subquery is non-empty
[ https://issues.apache.org/jira/browse/SPARK-43098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-43098. - Fix Version/s: 3.5.0 3.4.1 Resolution: Fixed Issue resolved by pull request 40811 [https://github.com/apache/spark/pull/40811] > Should not handle the COUNT bug when the GROUP BY clause of a correlated > scalar subquery is non-empty > - > > Key: SPARK-43098 > URL: https://issues.apache.org/jira/browse/SPARK-43098 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Jack Chen >Priority: Major > Fix For: 3.5.0, 3.4.1 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
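The COUNT-bug semantics at issue in SPARK-43098 can be modeled in plain Python (helper names here are illustrative, not Spark's): with a non-empty GROUP BY, an outer row that matches no inner rows produces no group at all, so the scalar subquery should yield NULL, not count(*)'s default of 0:

```python
# Toy model of: select c1, c2, (select count(*) from t2
#                               where t1.c1 = t2.c1 group by c1) from t1
t1 = [(0, 1), (1, 2)]
t2 = [(0, 2), (0, 3)]

def subquery(c1):
    """Correct semantics: an empty group under GROUP BY produces no row,
    hence the scalar result is NULL (None)."""
    matches = [r for r in t2 if r[0] == c1]
    return len(matches) if matches else None

def subquery_buggy(c1):
    """Buggy COUNT-bug handling: substitutes the aggregate's default value
    (0) even though GROUP BY means an empty group yields no row."""
    matches = [r for r in t2 if r[0] == c1]
    return len(matches) if matches else 0

correct = [(c1, c2, subquery(c1)) for c1, c2 in t1]
buggy = [(c1, c2, subquery_buggy(c1)) for c1, c2 in t1]
print(correct)  # [(0, 1, 2), (1, 2, None)] -- the ticket's correct answer
print(buggy)    # [(0, 1, 2), (1, 2, 0)]    -- the incorrect output reported
```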
[jira] [Resolved] (SPARK-43146) Implement eager evaluation.
[ https://issues.apache.org/jira/browse/SPARK-43146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-43146. -- Fix Version/s: 3.5.0 Resolution: Fixed Issue resolved by pull request 40800 [https://github.com/apache/spark/pull/40800] > Implement eager evaluation. > --- > > Key: SPARK-43146 > URL: https://issues.apache.org/jira/browse/SPARK-43146 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Takuya Ueshin >Assignee: Takuya Ueshin >Priority: Major > Fix For: 3.5.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-43146) Implement eager evaluation.
[ https://issues.apache.org/jira/browse/SPARK-43146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-43146: Assignee: Takuya Ueshin > Implement eager evaluation. > --- > > Key: SPARK-43146 > URL: https://issues.apache.org/jira/browse/SPARK-43146 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Takuya Ueshin >Assignee: Takuya Ueshin >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42592) Document SS guide doc for supporting multiple stateful operators (especially chained aggregations)
[ https://issues.apache.org/jira/browse/SPARK-42592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17713811#comment-17713811 ] Jungtaek Lim commented on SPARK-42592: -- [~XinrongM] We may have missed re-tagging fix versions for PRs that were merged in parallel with the RCs. I have changed the fix version for this ticket from 3.4.1 to 3.4.0. > Document SS guide doc for supporting multiple stateful operators (especially > chained aggregations) > -- > > Key: SPARK-42592 > URL: https://issues.apache.org/jira/browse/SPARK-42592 > Project: Spark > Issue Type: Documentation > Components: Structured Streaming >Affects Versions: 3.5.0 >Reporter: Jungtaek Lim >Assignee: Jungtaek Lim >Priority: Major > Fix For: 3.4.0, 3.5.0 > > > We changed the guide doc for SPARK-40925 via SPARK-42105, but in SPARK-42105 we only > removed the "limitation of global watermark" section. > That said, we haven't provided any example of the new functionality; in particular, > users need to know about the change to the SQL function (window) in chained > time window aggregations. > In this ticket, we will add an example of chained time window aggregations, > introducing the new behavior of the SQL function. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42592) Document SS guide doc for supporting multiple stateful operators (especially chained aggregations)
[ https://issues.apache.org/jira/browse/SPARK-42592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim updated SPARK-42592: - Fix Version/s: 3.4.0 (was: 3.4.1) > Document SS guide doc for supporting multiple stateful operators (especially > chained aggregations) > -- > > Key: SPARK-42592 > URL: https://issues.apache.org/jira/browse/SPARK-42592 > Project: Spark > Issue Type: Documentation > Components: Structured Streaming >Affects Versions: 3.5.0 >Reporter: Jungtaek Lim >Assignee: Jungtaek Lim >Priority: Major > Fix For: 3.4.0, 3.5.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-43178) Migrate UDF errors into error class
Haejoon Lee created SPARK-43178: --- Summary: Migrate UDF errors into error class Key: SPARK-43178 URL: https://issues.apache.org/jira/browse/SPARK-43178 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 3.5.0 Reporter: Haejoon Lee Migrate pyspark/sql/udf.py errors into error class -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
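As a sketch of what the migration in SPARK-43178 is about: the error-class pattern identifies each error by a stable class name plus message parameters instead of a free-form message string. The registry, class, and function names below are illustrative stand-ins, not PySpark's actual API:

```python
# Minimal illustration of the error-class pattern (names are hypothetical).
ERROR_CLASSES = {
    "NOT_CALLABLE": "Argument `{arg_name}` must be callable, got {arg_type}.",
}

class IllustrativeUDFError(TypeError):
    """Error carrying a stable error class and structured parameters,
    so callers can match on the class rather than parse the message."""
    def __init__(self, error_class: str, message_parameters: dict):
        self.error_class = error_class
        self.message_parameters = message_parameters
        super().__init__(ERROR_CLASSES[error_class].format(**message_parameters))

def register_udf(f):
    """Stand-in for a udf.py entry point raising a classified error."""
    if not callable(f):
        raise IllustrativeUDFError(
            error_class="NOT_CALLABLE",
            message_parameters={"arg_name": "f", "arg_type": type(f).__name__},
        )
    return f

try:
    register_udf(42)
except IllustrativeUDFError as e:
    print(e.error_class)  # NOT_CALLABLE
```

The payoff is that tests and user code can assert on `error_class` instead of brittle substring checks against message text.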
[jira] [Resolved] (SPARK-43173) `write jdbc` in `ClientE2ETestSuite` will test fail with out `-Phive`
[ https://issues.apache.org/jira/browse/SPARK-43173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-43173. -- Fix Version/s: 3.5.0 Resolution: Fixed Issue resolved by pull request 40837 [https://github.com/apache/spark/pull/40837] > `write jdbc` in `ClientE2ETestSuite` will test fail with out `-Phive` > - > > Key: SPARK-43173 > URL: https://issues.apache.org/jira/browse/SPARK-43173 > Project: Spark > Issue Type: Improvement > Components: Connect, Tests >Affects Versions: 3.5.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Minor > Fix For: 3.5.0 > > > Both > ``` > build/mvn clean install -Dtest=none > -DwildcardSuites=org.apache.spark.sql.ClientE2ETestSuite > ``` > and > ``` > build/sbt "connect-client-jvm/testOnly *ClientE2ETestSuite" > ``` > > will fail when using Java 11 & 17 > > {code:java} > - write jdbc *** FAILED *** > io.grpc.StatusRuntimeException: INTERNAL: No suitable driver > at io.grpc.Status.asRuntimeException(Status.java:535) > at > io.grpc.stub.ClientCalls$BlockingResponseStream.hasNext(ClientCalls.java:660) > at > scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:45) > at scala.collection.Iterator.foreach(Iterator.scala:943) > at scala.collection.Iterator.foreach$(Iterator.scala:943) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1431) > at org.apache.spark.sql.SparkSession.execute(SparkSession.scala:458) > at > org.apache.spark.sql.DataFrameWriter.executeWriteOperation(DataFrameWriter.scala:257) > at > org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:221) > at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:218) > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-43173) `write jdbc` in `ClientE2ETestSuite` will test fail with out `-Phive`
[ https://issues.apache.org/jira/browse/SPARK-43173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-43173: Assignee: Yang Jie > `write jdbc` in `ClientE2ETestSuite` will test fail with out `-Phive` > - > > Key: SPARK-43173 > URL: https://issues.apache.org/jira/browse/SPARK-43173 > Project: Spark > Issue Type: Improvement > Components: Connect, Tests >Affects Versions: 3.5.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Minor > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-43169) Update mima's previousSparkVersion to 3.4.0
[ https://issues.apache.org/jira/browse/SPARK-43169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-43169. -- Fix Version/s: 3.5.0 Resolution: Fixed Issue resolved by pull request 40830 [https://github.com/apache/spark/pull/40830] > Update mima's previousSparkVersion to 3.4.0 > --- > > Key: SPARK-43169 > URL: https://issues.apache.org/jira/browse/SPARK-43169 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Affects Versions: 3.5.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Major > Fix For: 3.5.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-43169) Update mima's previousSparkVersion to 3.4.0
[ https://issues.apache.org/jira/browse/SPARK-43169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-43169: Assignee: Yang Jie > Update mima's previousSparkVersion to 3.4.0 > --- > > Key: SPARK-43169 > URL: https://issues.apache.org/jira/browse/SPARK-43169 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Affects Versions: 3.5.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42539) User-provided JARs can override Spark's Hive metadata client JARs when using "builtin"
[ https://issues.apache.org/jira/browse/SPARK-42539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17713801#comment-17713801 ] Sun Chao commented on SPARK-42539: -- Oops my bad [~xkrogen] - you're right, this is not in Spark 3.5 release, sorry! I must have forgotten to mark it resolved when the second PR got merged. > User-provided JARs can override Spark's Hive metadata client JARs when using > "builtin" > -- > > Key: SPARK-42539 > URL: https://issues.apache.org/jira/browse/SPARK-42539 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.3, 3.2.3, 3.3.2 >Reporter: Erik Krogen >Priority: Major > Fix For: 3.5.0 > > > Recently we observed that on version 3.2.0 and Java 8, it is possible for > user-provided Hive JARs to break the ability for Spark, via the Hive metadata > client / {{IsolatedClientLoader}}, to communicate with Hive Metastore, when > using the default behavior of the "builtin" Hive version. After SPARK-35321, > when Spark is compiled against Hive >= 2.3.9 and the "builtin" Hive client > version is used, we will call the method {{Hive.getWithoutRegisterFns()}} > (from HIVE-21563) instead of {{Hive.get()}}. If the user has included, for > example, {{hive-exec-2.3.8.jar}} on their classpath, the client will break > with a {{NoSuchMethodError}}. This particular failure mode was resolved in > 3.2.1 by SPARK-37446, but while investigating, we found a general issue that > it's possible for user JARs to override Spark's own JARs -- but only inside > of the IsolatedClientLoader when using "builtin". This happens because even > when Spark is configured to use the "builtin" Hive classes, it still creates > a separate URLClassLoader for the HiveClientImpl used for HMS communication. 
> To get the set of JAR URLs to use for this classloader, Spark [collects all > of the JARs used by the user classloader (and its parent, and that > classloader's parent, and so > on)|https://github.com/apache/spark/blob/87e3d5625e76bb734b8dd753bfb25002822c8585/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveUtils.scala#L412-L438]. > Thus the newly created classloader will have all of the same JARs as the > user classloader, but the ordering has been reversed! User JARs get > prioritized ahead of system JARs, because the classloader hierarchy is > traversed from bottom-to-top. For example let's say we have user JARs > "foo.jar" and "hive-exec-2.3.8.jar". The user classloader will look like this: > {code} > MutableURLClassLoader > -- foo.jar > -- hive-exec-2.3.8.jar > -- parent: URLClassLoader > - spark-core_2.12-3.2.0.jar > - ... > - hive-exec-2.3.9.jar > - ... > {code} > This setup provides the expected behavior within the user classloader; it > will first check the parent, so hive-exec-2.3.9.jar takes precedence, and the > MutableURLClassLoader is only checked if the class doesn't exist in the > parent. But when a JAR list is constructed for the IsolatedClientLoader, it > traverses the URLs from MutableURLClassLoader first, then it's parent, so the > final list looks like (in order): > {code} > URLClassLoader [IsolatedClientLoader] > -- foo.jar > -- hive-exec-2.3.8.jar > -- spark-core_2.12-3.2.0.jar > -- ... > -- hive-exec-2.3.9.jar > -- ... > -- parent: boot classloader (JVM classes) > {code} > Now when a lookup happens, all of the JARs are within the same > URLClassLoader, and the user JARs are in front of the Spark ones, so the user > JARs get prioritized. This is the opposite of the expected behavior when > using the default user/application classloader in Spark, which has > parent-first behavior, prioritizing the Spark/system classes over the user > classes. (Note that this behavior is correct when using the > {{ChildFirstURLClassLoader}}.) 
> After SPARK-37446, the NoSuchMethodError is no longer an issue, but this > still breaks assumptions about how user JARs should be treated vs. system > JARs, and presents the ability for the client to break in other ways. For > example, SPARK-37446 describes a scenario whereby Hive 2.3.8 JARs have > been included; the changes in Hive 2.3.9 were needed to improve compatibility > with older HMS, so if a user were to accidentally include these older JARs, > it could break the ability of Spark to communicate with HMS 1.x. > I see two solutions to this: > *(A) Remove the separate classloader entirely when using "builtin"* > Starting from 3.0.0, due to SPARK-26839, when using Java 9+, we don't even > create a new classloader when using "builtin". This makes sense, as [called > out in this > comment|https://github.com/apache/spark/pull/24057#discussion_r265142878], > since the point of "builtin" is to use the existing JARs on the classpath > anyway. This proposes simply extending the changes from SPARK-26839 to all > Java versions, instead of restricting to Java 9+ only.
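The ordering inversion described in SPARK-42539 can be modeled in a few lines of plain Python, with each classloader reduced to its list of JAR URLs: parent-first delegation resolves to the system copy, but flattening the chain child-to-parent, as the JAR-list construction does, puts user JARs ahead of Spark's own:

```python
# Toy model of the two lookup strategies from the ticket's example.
user_loader = ["foo.jar", "hive-exec-2.3.8.jar"]                       # MutableURLClassLoader
system_loader = ["spark-core_2.12-3.2.0.jar", "hive-exec-2.3.9.jar"]   # its parent

def parent_first(name):
    """Normal delegation: consult the parent before the child."""
    for jar in system_loader + user_loader:
        if jar.startswith(name):
            return jar

# Flattened URL list built by walking the chain bottom-to-top,
# as happens when constructing the isolated client's URLClassLoader.
flattened = user_loader + system_loader

def flattened_lookup(name):
    """Single flat URLClassLoader: first match in the flattened list wins."""
    for jar in flattened:
        if jar.startswith(name):
            return jar

print(parent_first("hive-exec"))      # hive-exec-2.3.9.jar -> system copy wins
print(flattened_lookup("hive-exec"))  # hive-exec-2.3.8.jar -> user copy shadows it
```

This is only a sketch of the precedence problem; the real code deals in classloader objects and URL arrays rather than name lists.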
[jira] [Assigned] (SPARK-42539) User-provided JARs can override Spark's Hive metadata client JARs when using "builtin"
[ https://issues.apache.org/jira/browse/SPARK-42539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sun Chao reassigned SPARK-42539: Assignee: Erik Krogen > User-provided JARs can override Spark's Hive metadata client JARs when using > "builtin" > -- > > Key: SPARK-42539 > URL: https://issues.apache.org/jira/browse/SPARK-42539 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.3, 3.2.3, 3.3.2 >Reporter: Erik Krogen >Assignee: Erik Krogen >Priority: Major > Fix For: 3.5.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-42539) User-provided JARs can override Spark's Hive metadata client JARs when using "builtin"
[ https://issues.apache.org/jira/browse/SPARK-42539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Erik Krogen resolved SPARK-42539. - Resolution: Fixed [~csun] it looks like this didn't get marked as closed / fix-version updated when the PR was merged. I believe this went only into 3.5.0; the original PR went into branch-3.4 but was reverted and the second PR didn't make it to branch-3.4. I've marked the fix version as 3.5.0 but please correct me if I'm wrong here: {code:java} > glog apache/branch-3.4 | grep SPARK-42539 * 26009d47c1f 2023-02-28 Revert "[SPARK-42539][SQL][HIVE] Eliminate separate classloader when using 'builtin' Hive version for metadata client" [Hyukjin Kwon ] * 40a4019dfc5 2023-02-27 [SPARK-42539][SQL][HIVE] Eliminate separate classloader when using 'builtin' Hive version for metadata client [Erik Krogen ] > glog apache/master | grep SPARK-42539 * 2e34427d4f3 2023-03-01 [SPARK-42539][SQL][HIVE] Eliminate separate classloader when using 'builtin' Hive version for metadata client [Erik Krogen ] * 5627ceeddb4 2023-02-28 Revert "[SPARK-42539][SQL][HIVE] Eliminate separate classloader when using 'builtin' Hive version for metadata client" [Hyukjin Kwon ] * 27ad5830f9a 2023-02-27 [SPARK-42539][SQL][HIVE] Eliminate separate classloader when using 'builtin' Hive version for metadata client [Erik Krogen ] {code} > User-provided JARs can override Spark's Hive metadata client JARs when using > "builtin" > -- > > Key: SPARK-42539 > URL: https://issues.apache.org/jira/browse/SPARK-42539 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.3, 3.2.3, 3.3.2 >Reporter: Erik Krogen >Priority: Major > Fix For: 3.5.0 > > > Recently we observed that on version 3.2.0 and Java 8, it is possible for > user-provided Hive JARs to break the ability for Spark, via the Hive metadata > client / {{IsolatedClientLoader}}, to communicate with Hive Metastore, when > using the default behavior of the "builtin" Hive version. 
After SPARK-35321, > when Spark is compiled against Hive >= 2.3.9 and the "builtin" Hive client > version is used, we will call the method {{Hive.getWithoutRegisterFns()}} > (from HIVE-21563) instead of {{Hive.get()}}. If the user has included, for > example, {{hive-exec-2.3.8.jar}} on their classpath, the client will break > with a {{NoSuchMethodError}}. This particular failure mode was resolved in > 3.2.1 by SPARK-37446, but while investigating, we found a general issue that > it's possible for user JARs to override Spark's own JARs -- but only inside > of the IsolatedClientLoader when using "builtin". This happens because even > when Spark is configured to use the "builtin" Hive classes, it still creates > a separate URLClassLoader for the HiveClientImpl used for HMS communication. > To get the set of JAR URLs to use for this classloader, Spark [collects all > of the JARs used by the user classloader (and its parent, and that > classloader's parent, and so > on)|https://github.com/apache/spark/blob/87e3d5625e76bb734b8dd753bfb25002822c8585/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveUtils.scala#L412-L438]. > Thus the newly created classloader will have all of the same JARs as the > user classloader, but the ordering has been reversed! User JARs get > prioritized ahead of system JARs, because the classloader hierarchy is > traversed from bottom-to-top. For example let's say we have user JARs > "foo.jar" and "hive-exec-2.3.8.jar". The user classloader will look like this: > {code} > MutableURLClassLoader > -- foo.jar > -- hive-exec-2.3.8.jar > -- parent: URLClassLoader > - spark-core_2.12-3.2.0.jar > - ... > - hive-exec-2.3.9.jar > - ... > {code} > This setup provides the expected behavior within the user classloader; it > will first check the parent, so hive-exec-2.3.9.jar takes precedence, and the > MutableURLClassLoader is only checked if the class doesn't exist in the > parent. 
But when a JAR list is constructed for the IsolatedClientLoader, it > traverses the URLs from MutableURLClassLoader first, then its parent, so the > final list looks like (in order): > {code} > URLClassLoader [IsolatedClientLoader] > -- foo.jar > -- hive-exec-2.3.8.jar > -- spark-core_2.12-3.2.0.jar > -- ... > -- hive-exec-2.3.9.jar > -- ... > -- parent: boot classloader (JVM classes) > {code} > Now when a lookup happens, all of the JARs are within the same > URLClassLoader, and the user JARs are in front of the Spark ones, so the user > JARs get prioritized. This is the opposite of the expected behavior when > using the default user/application classloader in Spark, which has > parent-first behavior, prioritizing the Spark/system classes over the user > classes. (Note that thi
[jira] [Updated] (SPARK-42539) User-provided JARs can override Spark's Hive metadata client JARs when using "builtin"
[ https://issues.apache.org/jira/browse/SPARK-42539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Erik Krogen updated SPARK-42539: Fix Version/s: 3.5.0 > User-provided JARs can override Spark's Hive metadata client JARs when using > "builtin" > -- > > Key: SPARK-42539 > URL: https://issues.apache.org/jira/browse/SPARK-42539 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.3, 3.2.3, 3.3.2 >Reporter: Erik Krogen >Priority: Major > Fix For: 3.5.0 > > > Recently we observed that on version 3.2.0 and Java 8, it is possible for > user-provided Hive JARs to break the ability for Spark, via the Hive metadata > client / {{IsolatedClientLoader}}, to communicate with Hive Metastore, when > using the default behavior of the "builtin" Hive version. After SPARK-35321, > when Spark is compiled against Hive >= 2.3.9 and the "builtin" Hive client > version is used, we will call the method {{Hive.getWithoutRegisterFns()}} > (from HIVE-21563) instead of {{Hive.get()}}. If the user has included, for > example, {{hive-exec-2.3.8.jar}} on their classpath, the client will break > with a {{NoSuchMethodError}}. This particular failure mode was resolved in > 3.2.1 by SPARK-37446, but while investigating, we found a general issue that > it's possible for user JARs to override Spark's own JARs -- but only inside > of the IsolatedClientLoader when using "builtin". This happens because even > when Spark is configured to use the "builtin" Hive classes, it still creates > a separate URLClassLoader for the HiveClientImpl used for HMS communication. > To get the set of JAR URLs to use for this classloader, Spark [collects all > of the JARs used by the user classloader (and its parent, and that > classloader's parent, and so > on)|https://github.com/apache/spark/blob/87e3d5625e76bb734b8dd753bfb25002822c8585/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveUtils.scala#L412-L438]. 
> Thus the newly created classloader will have all of the same JARs as the > user classloader, but the ordering has been reversed! User JARs get > prioritized ahead of system JARs, because the classloader hierarchy is > traversed from bottom-to-top. For example let's say we have user JARs > "foo.jar" and "hive-exec-2.3.8.jar". The user classloader will look like this: > {code} > MutableURLClassLoader > -- foo.jar > -- hive-exec-2.3.8.jar > -- parent: URLClassLoader > - spark-core_2.12-3.2.0.jar > - ... > - hive-exec-2.3.9.jar > - ... > {code} > This setup provides the expected behavior within the user classloader; it > will first check the parent, so hive-exec-2.3.9.jar takes precedence, and the > MutableURLClassLoader is only checked if the class doesn't exist in the > parent. But when a JAR list is constructed for the IsolatedClientLoader, it > traverses the URLs from MutableURLClassLoader first, then its parent, so the > final list looks like (in order): > {code} > URLClassLoader [IsolatedClientLoader] > -- foo.jar > -- hive-exec-2.3.8.jar > -- spark-core_2.12-3.2.0.jar > -- ... > -- hive-exec-2.3.9.jar > -- ... > -- parent: boot classloader (JVM classes) > {code} > Now when a lookup happens, all of the JARs are within the same > URLClassLoader, and the user JARs are in front of the Spark ones, so the user > JARs get prioritized. This is the opposite of the expected behavior when > using the default user/application classloader in Spark, which has > parent-first behavior, prioritizing the Spark/system classes over the user > classes. (Note that this behavior is correct when using the > {{ChildFirstURLClassLoader}}.) > After SPARK-37446, the NoSuchMethodError is no longer an issue, but this > still breaks assumptions about how user JARs should be treated vs. system > JARs, and presents the ability for the client to break in other ways. 
For > example in SPARK-37446 it describes a scenario whereby Hive 2.3.8 JARs have > been included; the changes in Hive 2.3.9 were needed to improve compatibility > with older HMS, so if a user were to accidentally include these older JARs, > it could break the ability of Spark to communicate with HMS 1.x > I see two solutions to this: > *(A) Remove the separate classloader entirely when using "builtin"* > Starting from 3.0.0, due to SPARK-26839, when using Java 9+, we don't even > create a new classloader when using "builtin". This makes sense, as [called > out in this > comment|https://github.com/apache/spark/pull/24057#discussion_r265142878], > since the point of "builtin" is to use the existing JARs on the classpath > anyway. This proposes simply extending the changes from SPARK-26839 to all > Java versions, instead of restricting to Java 9+ only. > *(B) Reverse the ordering of parent/
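The bottom-to-top URL collection described in this thread can be illustrated with a short, self-contained sketch. This is not Spark's actual HiveUtils code; the class name, JAR paths, and loader setup below are invented for illustration only:

```java
import java.net.URL;
import java.net.URLClassLoader;
import java.util.ArrayList;
import java.util.List;

public class JarOrderDemo {
    // Simplified stand-in for the JAR-collection logic: walk the classloader
    // chain bottom-up and concatenate each loader's URLs into one flat list.
    public static List<URL> flattenUrls(ClassLoader cl) {
        List<URL> urls = new ArrayList<>();
        while (cl != null) {
            if (cl instanceof URLClassLoader) {
                for (URL u : ((URLClassLoader) cl).getURLs()) {
                    urls.add(u);
                }
            }
            cl = cl.getParent();
        }
        return urls;
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical JAR locations, just to make the ordering visible.
        URL sparkHive = new URL("file:/opt/spark/jars/hive-exec-2.3.9.jar");
        URL userHive = new URL("file:/tmp/user/hive-exec-2.3.8.jar");

        // Normal parent-first delegation: the parent's hive-exec-2.3.9.jar wins.
        URLClassLoader system = new URLClassLoader(new URL[]{sparkHive}, null);
        URLClassLoader user = new URLClassLoader(new URL[]{userHive}, system);

        List<URL> flat = flattenUrls(user);
        // In the flattened list the user JAR now precedes the Spark JAR, so a
        // single URLClassLoader built from it inverts the precedence.
        System.out.println(flat.indexOf(userHive) < flat.indexOf(sparkHive)); // prints true
    }
}
```

The sketch shows why option (B) in the thread talks about reversing the traversal: the delegation order of the original hierarchy and the iteration order of the flattened list are opposites.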
[jira] [Commented] (SPARK-43167) Streaming Connect console output format support
[ https://issues.apache.org/jira/browse/SPARK-43167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17713787#comment-17713787 ] Wei Liu commented on SPARK-43167: - Should be:
```
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 3.5.0-SNAPSHOT
      /_/

Using Python version 3.10.8 (main, Oct 13 2022 09:48:40)
Spark context Web UI available at http://10.10.105.160:4040
Spark context available as 'sc' (master = local[*], app id = local-1681856185012).
SparkSession available as 'spark'.
>>> spark
>>> q = spark.readStream.format("rate").load().writeStream.format("console").start()
23/04/18 15:17:12 WARN ResolveWriteToStream: Temporary checkpoint location created which is deleted normally when the query didn't fail: /private/var/folders/9k/pbxb4_690wv4smwhwbzwmqkwgp/T/temporary-64d68668-bc6f-46aa-8ea5-b66ddae09f91. If it's required to delete it under any circumstances, please set spark.sql.streaming.forceDeleteTempCheckpointLocation to true. Important to know deleting temp checkpoint folder is best effort.
23/04/18 15:17:12 WARN ResolveWriteToStream: spark.sql.adaptive.enabled is not supported in streaming DataFrames/Datasets and will be disabled.
-------------------------------------------
Batch: 0
-------------------------------------------
+---------+-----+
|timestamp|value|
+---------+-----+
+---------+-----+

-------------------------------------------
Batch: 1
-------------------------------------------
+--------------------+-----+
|           timestamp|value|
+--------------------+-----+
|2023-04-18 15:17:...|    0|
|2023-04-18 15:17:...|    1|
+--------------------+-----+

-------------------------------------------
Batch: 2
-------------------------------------------
+--------------------+-----+
|           timestamp|value|
+--------------------+-----+
|2023-04-18 15:17:...|    2|
|2023-04-18 15:17:...|    3|
+--------------------+-----+

-------------------------------------------
Batch: 3
-------------------------------------------
+--------------------+-----+
|           timestamp|value|
+--------------------+-----+
|2023-04-18 15:17:...|    4|
+--------------------+-----+
```
> Streaming Connect console output format support > --- > > Key: SPARK-43167 > URL: https://issues.apache.org/jira/browse/SPARK-43167 > Project: Spark > Issue Type: Task > Components: Connect, Structured Streaming >Affects Versions: 3.5.0 >Reporter: Wei Liu >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-43177) Add deprecation warning for input_file_name()
Yaohua Zhao created SPARK-43177: --- Summary: Add deprecation warning for input_file_name() Key: SPARK-43177 URL: https://issues.apache.org/jira/browse/SPARK-43177 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.4.0 Reporter: Yaohua Zhao With the new `_metadata` column, users shouldn’t need to use input_file_name() anymore. We should add a deprecation warning and update the docs. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-42452) Remove hadoop-2 profile from Apache Spark
[ https://issues.apache.org/jira/browse/SPARK-42452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun resolved SPARK-42452. -- Fix Version/s: 3.5.0 Assignee: Yang Jie Resolution: Fixed > Remove hadoop-2 profile from Apache Spark > - > > Key: SPARK-42452 > URL: https://issues.apache.org/jira/browse/SPARK-42452 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.5.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Major > Fix For: 3.5.0 > > > SPARK-40651 Drop Hadoop2 binary distribution from release process and > SPARK-42447 Remove Hadoop 2 GitHub Action job > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30542) Two Spark structured streaming jobs cannot write to same base path
[ https://issues.apache.org/jira/browse/SPARK-30542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17713623#comment-17713623 ] Wojciech Indyk commented on SPARK-30542: Will be fixed by this PR: https://github.com/apache/spark/pull/40821 > Two Spark structured streaming jobs cannot write to same base path > -- > > Key: SPARK-30542 > URL: https://issues.apache.org/jira/browse/SPARK-30542 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.3.0 >Reporter: Sivakumar >Priority: Major > > Hi All, > Spark Structured Streaming doesn't allow two structured streaming jobs to > write data to the same base directory, which was possible with DStreams. > Because a _spark_metadata directory is created by default for the first job, > a second job cannot use the same directory as its base path; the existing > _spark_metadata directory belongs to the other job, so an exception is thrown. > Is there any workaround for this, other than creating separate base paths > for both jobs? > Is it possible to create the _spark_metadata directory elsewhere, or to > disable it without any data loss? > If I had to change the base path for both jobs, my whole framework > would be impacted, so I don't want to do that. > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-43176) Deduplicate imports in Connect Tests
Ruifeng Zheng created SPARK-43176: - Summary: Deduplicate imports in Connect Tests Key: SPARK-43176 URL: https://issues.apache.org/jira/browse/SPARK-43176 Project: Spark Issue Type: Test Components: Connect, PySpark Affects Versions: 3.5.0 Reporter: Ruifeng Zheng -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43175) decom.sh can cause an UnsupportedOperationException
[ https://issues.apache.org/jira/browse/SPARK-43175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Iain Cardnell updated SPARK-43175: -- Description: decom.sh can cause an UnsupportedOperationException which then causes the Executor to die with a SparkUncaughtException and does not complete the decommission properly. *Problem:* SignalUtils.scala line 124: {code:java} if (escalate) { prevHandler.handle(sig) }{code} *Logs:* {noformat} failed - error: command '/opt/decom.sh' exited with 137: + echo 'Asked to decommission' + date + tee -a ++ ps -o pid -C java ++ awk '{ sub(/^[ \t]+/, ); print }' ++ tail -n 1 + WORKER_PID=17 + echo 'Using worker pid 17' + kill -s SIGPWR 17 + echo 'Waiting for worker pid to exit' + timeout 60 tail --pid=17 -f /dev/null , message: ""Asked to decommission\nMon Apr 17 23:44:35 UTC 2023\nUsing worker pid 17\nWaiting for worker pid to exit\n+ echo 'Asked to decommission'\n+ date\n+ tee -a\n++ ps -o pid -C java\n++ awk '{ sub(/^[ \\t]+/, \""\""); print }'\n++ tail -n 1\n+ WORKER_PID=17\n+ echo 'Using worker pid 17'\n+ kill -s SIGPWR 17\n+ echo 'Waiting for worker pid to exit'\n+ timeout 60 tail --pid=17 -f /dev/null\n""",2023-04-17T23:44:39Z, "java.lang.UnsupportedOperationException: invoking native signal handle not supported at java.base/jdk.internal.misc.Signal$NativeHandler.handle(Unknown Source) at jdk.unsupported/sun.misc.Signal$SunMiscHandler.handle(Unknown Source) at org.apache.spark.util.SignalUtils$ActionHandler.handle(SignalUtils.scala:124) at jdk.unsupported/sun.misc.Signal$InternalMiscHandler.handle(Unknown Source) at java.base/jdk.internal.misc.Signal$1.run(Unknown Source) at java.base/java.lang.Thread.run(Unknown Source)",2023-04-17T23:44:35.407488217Z "2023-04-17 23:44:35 [SIGPWR handler] ERROR org.apache.spark.util.SparkUncaughtExceptionHandler - Uncaught exception in thread Thread[SIGPWR handler,9,system] - {}",2023-04-17T23:44:35.407457859Z " ... 
1 more",2023-04-17T23:44:35.405548994Z " at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)",2023-04-17T23:44:35.405542621Z " at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)",2023-04-17T23:44:35.405536674Z " at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)",2023-04-17T23:44:35.405516396Z " at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:496)",2023-04-17T23:44:35.405416352Z " at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:584)",2023-04-17T23:44:35.405410491Z " ... at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:53)",2023-04-17T23:44:35.405262304Z " at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:142)",2023-04-17T23:44:35.405256591Z " at org.apache.spark.network.client.TransportResponseHandler.handle(TransportResponseHandler.java:209)",2023-04-17T23:44:35.405250814Z{noformat} In this case prevHandler is the NativeHandler (See [https://github.com/AdoptOpenJDK/openjdk-jdk11/blob/19fb8f93c59dfd791f62d41f332db9e306bc1422/src/java.base/share/classes/jdk/internal/misc/Signal.java#L280|https://github.com/AdoptOpenJDK/openjdk-jdk11/blob/19fb8f93c59dfd791f62d41f332db9e306bc1422/src/java.base/share/classes/jdk/internal/misc/Signal.java#L280]) and it throws the exception. *Possible Solutions:* * Check if prevHandler is an instance of NativeHandler and do not call it in that case. * try catch around the invoke of the handler and log a warning/error on exceptions. was: decom.sh can cause an UnsupportedOperationException which then causes the Executor to die with a SparkUncaughtException and does not complete the decommission properly. 
*Problem:* SignalUtils.scala line 124: {code:java} if (escalate) { prevHandler.handle(sig) }{code} *Logs:* {noformat} failed - error: command '/opt/decom.sh' exited with 137: + echo 'Asked to decommission' + date + tee -a ++ ps -o pid -C java ++ awk '{ sub(/^[ \t]+/, ); print }' ++ tail -n 1 + WORKER_PID=17 + echo 'Using worker pid 17' + kill -s SIGPWR 17 + echo 'Waiting for worker pid to exit' + timeout 60 tail --pid=17 -f /dev/null , message: ""Asked to decommission\nMon Apr 17 23:44:35 UTC 2023\nUsing worker pid 17\nWaiting for worker pid to exit\n+ echo 'Asked to decommission'\n+ date\n+ tee -a\n++ ps -o pid -C java\n++ awk '{ sub(/^[ \\t]+/, \""\""); print }'\n++ tail -n 1\n+ WORKER_PID=17\n+ echo 'Using worker pid 17'\n+ kill -s SIGPWR 17\n+ echo 'Waiting for worker pid to exit'\n+ timeout 60 tail --pid=17 -f /dev/null\n""",2023-04-17T23:44:39Z, "java.lang.UnsupportedOperationException: invoking native signal handle not supported at java.base/jdk.internal.misc.Signal$NativeHandler.handle(Unknown Source) at jdk.unsupported/sun.misc.Signal$SunMiscHandler.handle(Unknown Source)
[jira] [Updated] (SPARK-43175) decom.sh can cause an UnsupportedOperationException
[ https://issues.apache.org/jira/browse/SPARK-43175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Iain Cardnell updated SPARK-43175: -- Description: decom.sh can cause an UnsupportedOperationException which then causes the Executor to die with a SparkUncaughtException and does not complete the decommission properly. *Problem:* SignalUtils.scala line 124: {code:java} if (escalate) { prevHandler.handle(sig) }{code} *Logs:* {noformat} failed - error: command '/opt/decom.sh' exited with 137: + echo 'Asked to decommission' + date + tee -a ++ ps -o pid -C java ++ awk '{ sub(/^[ \t]+/, ); print }' ++ tail -n 1 + WORKER_PID=17 + echo 'Using worker pid 17' + kill -s SIGPWR 17 + echo 'Waiting for worker pid to exit' + timeout 60 tail --pid=17 -f /dev/null , message: ""Asked to decommission\nMon Apr 17 23:44:35 UTC 2023\nUsing worker pid 17\nWaiting for worker pid to exit\n+ echo 'Asked to decommission'\n+ date\n+ tee -a\n++ ps -o pid -C java\n++ awk '{ sub(/^[ \\t]+/, \""\""); print }'\n++ tail -n 1\n+ WORKER_PID=17\n+ echo 'Using worker pid 17'\n+ kill -s SIGPWR 17\n+ echo 'Waiting for worker pid to exit'\n+ timeout 60 tail --pid=17 -f /dev/null\n""",2023-04-17T23:44:39Z, "java.lang.UnsupportedOperationException: invoking native signal handle not supported at java.base/jdk.internal.misc.Signal$NativeHandler.handle(Unknown Source) at jdk.unsupported/sun.misc.Signal$SunMiscHandler.handle(Unknown Source) at org.apache.spark.util.SignalUtils$ActionHandler.handle(SignalUtils.scala:124) at jdk.unsupported/sun.misc.Signal$InternalMiscHandler.handle(Unknown Source) at java.base/jdk.internal.misc.Signal$1.run(Unknown Source) at java.base/java.lang.Thread.run(Unknown Source)",2023-04-17T23:44:35.407488217Z "2023-04-17 23:44:35 [SIGPWR handler] ERROR org.apache.spark.util.SparkUncaughtExceptionHandler - Uncaught exception in thread Thread[SIGPWR handler,9,system] - {}",2023-04-17T23:44:35.407457859Z " ... 
1 more",2023-04-17T23:44:35.405548994Z " at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)",2023-04-17T23:44:35.405542621Z " at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)",2023-04-17T23:44:35.405536674Z " at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)",2023-04-17T23:44:35.405516396Z " at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:496)",2023-04-17T23:44:35.405416352Z " at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:584)",2023-04-17T23:44:35.405410491Z " ... at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:53)",2023-04-17T23:44:35.405262304Z " at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:142)",2023-04-17T23:44:35.405256591Z " at org.apache.spark.network.client.TransportResponseHandler.handle(TransportResponseHandler.java:209)",2023-04-17T23:44:35.405250814Z{noformat} In this case prevHandler is the NativeHandler (See [https://github.com/AdoptOpenJDK/openjdk-jdk11/blob/19fb8f93c59dfd791f62d41f332db9e306bc1422/src/java.base/share/classes/jdk/internal/misc/Signal.java#L280]) and it throws the exception. *Possible Solutions:* * Check if prevHandler is an instance of NativeHandler and do not call it in that case. * try catch around the invoke of the handler and log a warning/error on exceptions. was: decom.sh can cause an UnsupportedOperationException which then causes the Executor to die with a SparkUncaughtException and does not complete the decommission properly. 
*Problem:* SignalUtils.scala line 124: {code:java} if (escalate) { prevHandler.handle(sig) }{code} *Logs:* {noformat} failed - error: command '/opt/decom.sh' exited with 137: + echo 'Asked to decommission' + date + tee -a ++ ps -o pid -C java ++ awk '{ sub(/^[ \t]+/, ); print }' ++ tail -n 1 + WORKER_PID=17 + echo 'Using worker pid 17' + kill -s SIGPWR 17 + echo 'Waiting for worker pid to exit' + timeout 60 tail --pid=17 -f /dev/null , message: ""Asked to decommission\nMon Apr 17 23:44:35 UTC 2023\nUsing worker pid 17\nWaiting for worker pid to exit\n+ echo 'Asked to decommission'\n+ date\n+ tee -a\n++ ps -o pid -C java\n++ awk '{ sub(/^[ \\t]+/, \""\""); print }'\n++ tail -n 1\n+ WORKER_PID=17\n+ echo 'Using worker pid 17'\n+ kill -s SIGPWR 17\n+ echo 'Waiting for worker pid to exit'\n+ timeout 60 tail --pid=17 -f /dev/null\n""",2023-04-17T23:44:39Z, "java.lang.UnsupportedOperationException: invoking native signal handle not supported at java.base/jdk.internal.misc.Signal$NativeHandler.handle(Unknown Source) at jdk.unsupported/sun.misc.Signal$SunMiscHandler.handle(Unknown Source) at org.apache.spark.util.SignalUtils$ActionHandler.handle(SignalUtils.scala:124) at jdk.unsupported/sun.misc.Signal$InternalMiscHandler.handle(Unknown
[jira] [Updated] (SPARK-43175) decom.sh can cause an UnsupportedOperationException
[ https://issues.apache.org/jira/browse/SPARK-43175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Iain Cardnell updated SPARK-43175: -- Description: decom.sh can cause an UnsupportedOperationException which then causes the Executor to die with a SparkUncaughtException and does not complete the decommission properly. *Problem:* SignalUtils.scala line 124: {code:java} if (escalate) { prevHandler.handle(sig) }{code} *Logs:* {noformat} failed - error: command '/opt/decom.sh' exited with 137: + echo 'Asked to decommission' + date + tee -a ++ ps -o pid -C java ++ awk '{ sub(/^[ \t]+/, ); print }' ++ tail -n 1 + WORKER_PID=17 + echo 'Using worker pid 17' + kill -s SIGPWR 17 + echo 'Waiting for worker pid to exit' + timeout 60 tail --pid=17 -f /dev/null , message: ""Asked to decommission\nMon Apr 17 23:44:35 UTC 2023\nUsing worker pid 17\nWaiting for worker pid to exit\n+ echo 'Asked to decommission'\n+ date\n+ tee -a\n++ ps -o pid -C java\n++ awk '{ sub(/^[ \\t]+/, \""\""); print }'\n++ tail -n 1\n+ WORKER_PID=17\n+ echo 'Using worker pid 17'\n+ kill -s SIGPWR 17\n+ echo 'Waiting for worker pid to exit'\n+ timeout 60 tail --pid=17 -f /dev/null\n""",2023-04-17T23:44:39Z, "java.lang.UnsupportedOperationException: invoking native signal handle not supported at java.base/jdk.internal.misc.Signal$NativeHandler.handle(Unknown Source) at jdk.unsupported/sun.misc.Signal$SunMiscHandler.handle(Unknown Source) at org.apache.spark.util.SignalUtils$ActionHandler.handle(SignalUtils.scala:124) at jdk.unsupported/sun.misc.Signal$InternalMiscHandler.handle(Unknown Source) at java.base/jdk.internal.misc.Signal$1.run(Unknown Source) at java.base/java.lang.Thread.run(Unknown Source)",2023-04-17T23:44:35.407488217Z "2023-04-17 23:44:35 [SIGPWR handler] ERROR org.apache.spark.util.SparkUncaughtExceptionHandler - Uncaught exception in thread Thread[SIGPWR handler,9,system] - {}",2023-04-17T23:44:35.407457859Z " ... 
1 more",2023-04-17T23:44:35.405548994Z " at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)",2023-04-17T23:44:35.405542621Z " at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)",2023-04-17T23:44:35.405536674Z " at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)",2023-04-17T23:44:35.405516396Z " at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:496)",2023-04-17T23:44:35.405416352Z " at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:584)",2023-04-17T23:44:35.405410491Z " ... at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:53)",2023-04-17T23:44:35.405262304Z " at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:142)",2023-04-17T23:44:35.405256591Z " at org.apache.spark.network.client.TransportResponseHandler.handle(TransportResponseHandler.java:209)",2023-04-17T23:44:35.405250814Z{noformat} In this case prevHandler is the NativeHandler (See [https://github.com/AdoptOpenJDK/openjdk-jdk11/blob/19fb8f93c59dfd791f62d41f332db9e306bc1422/src/java.base/share/classes/jdk/internal/misc/Signal.java#L280|Github JDK Source]) and it throws the exception. *Possible Solutions:* * Check if prevHandler is an instance of NativeHandler and do not call it in that case. * try catch around the invoke of the handler and log a warning/error on exceptions. was: decom.sh can cause an UnsupportedOperationException which then causes the Executor to die with a SparkUncaughtException and does not complete the decommission properly. 
*Problem:* SignalUtils.scala line 124: {code:java} if (escalate) { prevHandler.handle(sig) }{code} *Logs:* {noformat} failed - error: command '/opt/decom.sh' exited with 137: + echo 'Asked to decommission' + date + tee -a ++ ps -o pid -C java ++ awk '{ sub(/^[ \t]+/, ); print }' ++ tail -n 1 + WORKER_PID=17 + echo 'Using worker pid 17' + kill -s SIGPWR 17 + echo 'Waiting for worker pid to exit' + timeout 60 tail --pid=17 -f /dev/null , message: ""Asked to decommission\nMon Apr 17 23:44:35 UTC 2023\nUsing worker pid 17\nWaiting for worker pid to exit\n+ echo 'Asked to decommission'\n+ date\n+ tee -a\n++ ps -o pid -C java\n++ awk '{ sub(/^[ \\t]+/, \""\""); print }'\n++ tail -n 1\n+ WORKER_PID=17\n+ echo 'Using worker pid 17'\n+ kill -s SIGPWR 17\n+ echo 'Waiting for worker pid to exit'\n+ timeout 60 tail --pid=17 -f /dev/null\n""",2023-04-17T23:44:39Z, "java.lang.UnsupportedOperationException: invoking native signal handle not supported at java.base/jdk.internal.misc.Signal$NativeHandler.handle(Unknown Source) at jdk.unsupported/sun.misc.Signal$SunMiscHandler.handle(Unknown Source) at org.apache.spark.util.SignalUtils$ActionHandler.handle(SignalUtils.scala:124) at jdk.unsupported/sun.misc.Signal$InternalMiscHandle
[jira] [Created] (SPARK-43175) decom.sh can cause an UnsupportedOperationException
Iain Cardnell created SPARK-43175: - Summary: decom.sh can cause an UnsupportedOperationException Key: SPARK-43175 URL: https://issues.apache.org/jira/browse/SPARK-43175 Project: Spark Issue Type: Bug Components: Kubernetes Affects Versions: 3.3.0 Reporter: Iain Cardnell decom.sh can cause an UnsupportedOperationException which then causes the Executor to die with a SparkUncaughtException and does not complete the decommission properly. *Problem:* SignalUtils.scala line 124: {code:java} if (escalate) { prevHandler.handle(sig) }{code} *Logs:* {noformat} failed - error: command '/opt/decom.sh' exited with 137: + echo 'Asked to decommission' + date + tee -a ++ ps -o pid -C java ++ awk '{ sub(/^[ \t]+/, ); print }' ++ tail -n 1 + WORKER_PID=17 + echo 'Using worker pid 17' + kill -s SIGPWR 17 + echo 'Waiting for worker pid to exit' + timeout 60 tail --pid=17 -f /dev/null , message: ""Asked to decommission\nMon Apr 17 23:44:35 UTC 2023\nUsing worker pid 17\nWaiting for worker pid to exit\n+ echo 'Asked to decommission'\n+ date\n+ tee -a\n++ ps -o pid -C java\n++ awk '{ sub(/^[ \\t]+/, \""\""); print }'\n++ tail -n 1\n+ WORKER_PID=17\n+ echo 'Using worker pid 17'\n+ kill -s SIGPWR 17\n+ echo 'Waiting for worker pid to exit'\n+ timeout 60 tail --pid=17 -f /dev/null\n""",2023-04-17T23:44:39Z, "java.lang.UnsupportedOperationException: invoking native signal handle not supported at java.base/jdk.internal.misc.Signal$NativeHandler.handle(Unknown Source) at jdk.unsupported/sun.misc.Signal$SunMiscHandler.handle(Unknown Source) at org.apache.spark.util.SignalUtils$ActionHandler.handle(SignalUtils.scala:124) at jdk.unsupported/sun.misc.Signal$InternalMiscHandler.handle(Unknown Source) at java.base/jdk.internal.misc.Signal$1.run(Unknown Source) at java.base/java.lang.Thread.run(Unknown Source)",2023-04-17T23:44:35.407488217Z "2023-04-17 23:44:35 [SIGPWR handler] ERROR org.apache.spark.util.SparkUncaughtExceptionHandler - Uncaught exception in thread Thread[SIGPWR handler,9,system] 
- {}",2023-04-17T23:44:35.407457859Z " ... 1 more",2023-04-17T23:44:35.405548994Z " at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)",2023-04-17T23:44:35.405542621Z " at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)",2023-04-17T23:44:35.405536674Z " at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)",2023-04-17T23:44:35.405516396Z " at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:496)",2023-04-17T23:44:35.405416352Z " at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:584)",2023-04-17T23:44:35.405410491Z " at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:658)",2023-04-17T23:44:35.405402143Z " at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:722)",2023-04-17T23:44:35.405396413Z " at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:166)",2023-04-17T23:44:35.405390525Z " at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)",2023-04-17T23:44:35.405384806Z " at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)",2023-04-17T23:44:35.405378755Z " at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)",2023-04-17T23:44:35.405372709Z " at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)",2023-04-17T23:44:35.405359325Z " at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)",2023-04-17T23:44:35.405353609Z " at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)",2023-04-17T23:44:35.405347958Z " at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)",2023-04-17T23:44:35.405342114Z " 
at org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:102)",2023-04-17T23:44:35.405336302Z " at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)",2023-04-17T23:44:35.405330321Z " at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)",2023-04-17T23:44:35.405324741Z " at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)",2023-04-17T23:44:35.405319173Z " at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)",2023-04-17T23:44:35.405313526Z " at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.ja
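The second proposed solution above (a try/catch around the handler invocation) can be sketched as follows. The SignalHandler interface and class below are hypothetical stand-ins for illustration, not Spark's SignalUtils or the JDK's sun.misc.Signal API:

```java
// Minimal sketch: wrap the escalation call in try/catch so an
// UnsupportedOperationException from a previous (native) handler is
// logged instead of reaching the uncaught-exception handler.
public class SafeSignalEscalation {
    public interface SignalHandler {
        void handle(String signal);
    }

    // Returns true if the previous handler ran, false if it could not be invoked.
    public static boolean escalate(SignalHandler prevHandler, String signal) {
        try {
            prevHandler.handle(signal);
            return true;
        } catch (UnsupportedOperationException e) {
            // Mirrors the failure in the report: the JDK's internal
            // NativeHandler throws "invoking native signal handle not supported".
            System.err.println("Could not escalate " + signal
                + " to previous handler: " + e.getMessage());
            return false;
        }
    }

    public static void main(String[] args) {
        SignalHandler nativeLike = sig -> {
            throw new UnsupportedOperationException("invoking native signal handle not supported");
        };
        System.out.println(escalate(nativeLike, "PWR")); // prints false; JVM keeps running
    }
}
```

The first proposed solution (an instanceof check against the JDK's NativeHandler) would avoid the call entirely, but the try/catch variant also covers any other handler that cannot be invoked.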
[jira] [Commented] (SPARK-42657) Support to find and transfer client-side REPL classfiles to server as artifacts
[ https://issues.apache.org/jira/browse/SPARK-42657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17713583#comment-17713583 ] GridGain Integration commented on SPARK-42657: -- User 'vicennial' has created a pull request for this issue: https://github.com/apache/spark/pull/40675 > Support to find and transfer client-side REPL classfiles to server as > artifacts > - > > Key: SPARK-42657 > URL: https://issues.apache.org/jira/browse/SPARK-42657 > Project: Spark > Issue Type: Improvement > Components: Connect >Affects Versions: 3.4.0 >Reporter: Venkata Sai Akhil Gudesa >Assignee: Venkata Sai Akhil Gudesa >Priority: Major > Fix For: 3.5.0 > > > To run UDFs which are defined on the client side REPL, we require a mechanism > that can find the local REPL classfiles and then utilise the mechanism from > https://issues.apache.org/jira/browse/SPARK-42653 to transfer them to the > server as artifacts. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-43172) Expose host and bearer tokens from the spark connect client
[ https://issues.apache.org/jira/browse/SPARK-43172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-43172. -- Fix Version/s: 3.5.0 Resolution: Fixed Issue resolved by pull request 40836 [https://github.com/apache/spark/pull/40836] > Expose host and bearer tokens from the spark connect client > --- > > Key: SPARK-43172 > URL: https://issues.apache.org/jira/browse/SPARK-43172 > Project: Spark > Issue Type: Improvement > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Niranjan Jayakar >Assignee: Niranjan Jayakar >Priority: Major > Fix For: 3.5.0 > > > The `SparkConnectClient` class takes in a connection string to connect with > the spark connect service. > > As part of setting up the connection, it parses the connection string. Expose > the parsed host and bearer tokens as part of the class, so they may be > accessed by consumers. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-43172) Expose host and bearer tokens from the spark connect client
[ https://issues.apache.org/jira/browse/SPARK-43172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-43172: Assignee: Niranjan Jayakar > Expose host and bearer tokens from the spark connect client > --- > > Key: SPARK-43172 > URL: https://issues.apache.org/jira/browse/SPARK-43172 > Project: Spark > Issue Type: Improvement > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Niranjan Jayakar >Assignee: Niranjan Jayakar >Priority: Major > > The `SparkConnectClient` class takes in a connection string to connect with > the spark connect service. > > As part of setting up the connection, it parses the connection string. Expose > the parsed host and bearer tokens as part of the class, so they may be > accessed by consumers. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-43170) The Spark SQL LIKE statement is pushed down to Parquet for execution, but the data cannot be queried
[ https://issues.apache.org/jira/browse/SPARK-43170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17713576#comment-17713576 ] Yuming Wang commented on SPARK-43170: - Why it is {{CachedRDDBuilder}}? {noformat} (2) InMemoryRelation Arguments: [appid#65], CachedRDDBuilder(org.apache.spark.sql.execution.columnar.DefaultCachedBatchSerializer@ab5af13,StorageLevel(disk, memory, deserialized, 1 replicas),*(2) HashAggregate(keys=[appid#65], functions=[], output=[appid#65]) +- Exchange hashpartitioning(appid#65, 200), ENSURE_REQUIREMENTS, [plan_id=24] +- *(1) HashAggregate(keys=[appid#65], functions=[], output=[appid#65]) +- *(1) Project [appid#65] +- *(1) ColumnarToRow +- FileScan parquet ecom_dwm.dwm_user_app_action_sum_all[dt#63,hour#64,appid#65] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex(0 paths)[], PartitionFilters: [isnotnull(dt#63), isnotnull(appid#65), (dt#63 = 20230412), StartsWith(appid#65, com)], PushedFilters: [], ReadSchema: struct<> ,None) {noformat} > The spark sql like statement is pushed down to parquet for execution, but the > data cannot be queried > > > Key: SPARK-43170 > URL: https://issues.apache.org/jira/browse/SPARK-43170 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.2 >Reporter: todd >Priority: Major > Attachments: image-2023-04-18-10-59-30-199.png > > > --DDL > CREATE TABLE `ecom_dwm`.`dwm_user_app_action_sum_all` ( > `gaid` STRING COMMENT '', > `beyla_id` STRING COMMENT '', > `dt` STRING, > `hour` STRING, > `appid` STRING COMMENT '包名') > USING parquet > PARTITIONED BY (dt, hour, appid) > LOCATION 's3://x/dwm_user_app_action_sum_all' > – partitions info > show partitions ecom_dwm.dwm_user_app_action_sum_all PARTITION > (dt='20230412'); > > dt=20230412/hour=23/appid=blibli.mobile.commerce > dt=20230412/hour=23/appid=cn.shopee.app > dt=20230412/hour=23/appid=cn.shopee.br > dt=20230412/hour=23/appid=cn.shopee.id > 
dt=20230412/hour=23/appid=cn.shopee.my > dt=20230412/hour=23/appid=cn.shopee.ph > > — query > select DISTINCT(appid) from ecom_dwm.dwm_user_app_action_sum_all > where dt='20230412' and appid like '%shopee%' > > --result > nodata > > — other > I use spark3.0.1 version and trino query engine to query the data。 > > > The physical execution node formed by spark 3.2 > (3) Scan parquet ecom_dwm.dwm_user_app_action_sum_all Output [3]: [dt#63, > hour#64, appid#65|#63, hour#64, appid#65] Batched: true Location: > InMemoryFileIndex [] > PartitionFilters: [isnotnull(dt#63), isnotnull(appid#65), (dt#63 = 20230412), > Contains(appid#65, shopee)|#63), isnotnull(appid#65), (dt#63 = 20230412), > Contains(appid#65, shopee)] ReadSchema: struct<> > > > !image-2023-04-18-10-59-30-199.png! > > – sql plan detail > {code:java} > == Physical Plan == > CollectLimit (9) > +- InMemoryTableScan (1) > +- InMemoryRelation (2) > +- * HashAggregate (8) >+- Exchange (7) > +- * HashAggregate (6) > +- * Project (5) > +- * ColumnarToRow (4) >+- Scan parquet > ecom_dwm.dwm_user_app_action_sum_all (3) > (1) InMemoryTableScan > Output [1]: [appid#65] > Arguments: [appid#65] > (2) InMemoryRelation > Arguments: [appid#65], > CachedRDDBuilder(org.apache.spark.sql.execution.columnar.DefaultCachedBatchSerializer@ab5af13,StorageLevel(disk, > memory, deserialized, 1 replicas),*(2) HashAggregate(keys=[appid#65], > functions=[], output=[appid#65]) > +- Exchange hashpartitioning(appid#65, 200), ENSURE_REQUIREMENTS, [plan_id=24] >+- *(1) HashAggregate(keys=[appid#65], functions=[], output=[appid#65]) > +- *(1) Project [appid#65] > +- *(1) ColumnarToRow > +- FileScan parquet > ecom_dwm.dwm_user_app_action_sum_all[dt#63,hour#64,appid#65] Batched: true, > DataFilters: [], Format: Parquet, Location: InMemoryFileIndex(0 paths)[], > PartitionFilters: [isnotnull(dt#63), isnotnull(appid#65), (dt#63 = 20230412), > StartsWith(appid#65, com)], PushedFilters: [], ReadSchema: struct<> > ,None) > (3) Scan parquet 
ecom_dwm.dwm_user_app_action_sum_all > Output [3]: [dt#63, hour#64, appid#65] > Batched: true > Location: InMemoryFileIndex [] > PartitionFilters: [isnotnull(dt#63), isnotnull(appid#65), (dt#63 = 20230412), > StartsWith(appid#65, com)] > ReadSchema: struct<> > (4) ColumnarToRow [codegen id : 1] > Input [3]: [dt#63, hour#64, appid#65] > (5) Project [codegen id : 1] > Output [1]: [appid#65] > Input [3]: [dt#63, hour#64, appid#65] > (6) HashAggregate [codegen id : 1] > Input [1]: [appid#65] > Keys [1]: [appid#65] > Functions: [] > Aggregate Attributes: []
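Context for the mismatch in the plan above: Spark translates simple LIKE patterns into string pushdown filters — `'%x%'` becomes `Contains`, `'x%'` becomes `StartsWith`, `'%x'` becomes `EndsWith` — so the `appid like '%shopee%'` predicate correctly appears as `Contains(appid#65, shopee)` at scan (3), while the cached InMemoryRelation at (2) still carries `StartsWith(appid#65, com)`, apparently from an earlier `like 'com%'` query. A plain-Python sketch of that translation (the function name and return shape are ours, for illustration only, not Spark's actual code):

```python
def translate_like(pattern: str):
    # Mimics how Spark maps simple LIKE patterns onto string pushdown
    # filters (Contains / StartsWith / EndsWith). Illustration only.
    if "_" in pattern:
        return None  # single-char wildcard: no simple string filter
    core = pattern.strip("%")
    if "%" in core:
        return None  # interior wildcard: no simple string filter
    if pattern.startswith("%") and pattern.endswith("%"):
        return ("Contains", core)
    if pattern.endswith("%"):
        return ("StartsWith", core)
    if pattern.startswith("%"):
        return ("EndsWith", core)
    return ("EqualTo", core)

print(translate_like("%shopee%"))  # ('Contains', 'shopee')
print(translate_like("com%"))      # ('StartsWith', 'com')
```

Patterns with interior `%` or any `_` wildcard fall back to row-level evaluation without a simple string filter.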
[jira] [Commented] (SPARK-43170) The Spark SQL LIKE statement is pushed down to Parquet for execution, but the data cannot be queried
[ https://issues.apache.org/jira/browse/SPARK-43170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17713543#comment-17713543 ] todd commented on SPARK-43170: -- [~yumwang] no cache -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-43174) Fix SparkSQLCLIDriver completer
Yuming Wang created SPARK-43174: --- Summary: Fix SparkSQLCLIDriver completer Key: SPARK-43174 URL: https://issues.apache.org/jira/browse/SPARK-43174 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.5.0 Reporter: Yuming Wang -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43173) `write jdbc` in `ClientE2ETestSuite` will fail without `-Phive`
[ https://issues.apache.org/jira/browse/SPARK-43173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Jie updated SPARK-43173: - Description: both ``` build/mvn clean install -Dtest=none -DwildcardSuites=org.apache.spark.sql.ClientE2ETestSuite ``` and ``` build/sbt "connect-client-jvm/testOnly *ClientE2ETestSuite" ``` will test failed when using Java 11&17 {code:java} - write jdbc *** FAILED *** io.grpc.StatusRuntimeException: INTERNAL: No suitable driver at io.grpc.Status.asRuntimeException(Status.java:535) at io.grpc.stub.ClientCalls$BlockingResponseStream.hasNext(ClientCalls.java:660) at scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:45) at scala.collection.Iterator.foreach(Iterator.scala:943) at scala.collection.Iterator.foreach$(Iterator.scala:943) at scala.collection.AbstractIterator.foreach(Iterator.scala:1431) at org.apache.spark.sql.SparkSession.execute(SparkSession.scala:458) at org.apache.spark.sql.DataFrameWriter.executeWriteOperation(DataFrameWriter.scala:257) at org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:221) at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:218) {code} was: both ``` build/mvn clean install -Dtest=none -DwildcardSuites=org.apache.spark.sql.ClientE2ETestSuite ``` and ``` build/sbt "connect-client-jvm/testOnly *ClientE2ETestSuite" ``` will test failed {code:java} - write jdbc *** FAILED *** io.grpc.StatusRuntimeException: INTERNAL: No suitable driver at io.grpc.Status.asRuntimeException(Status.java:535) at io.grpc.stub.ClientCalls$BlockingResponseStream.hasNext(ClientCalls.java:660) at scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:45) at scala.collection.Iterator.foreach(Iterator.scala:943) at scala.collection.Iterator.foreach$(Iterator.scala:943) at scala.collection.AbstractIterator.foreach(Iterator.scala:1431) at org.apache.spark.sql.SparkSession.execute(SparkSession.scala:458) at 
org.apache.spark.sql.DataFrameWriter.executeWriteOperation(DataFrameWriter.scala:257) at org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:221) at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:218) {code} > `write jdbc` in `ClientE2ETestSuite` will test fail with out `-Phive` > - > > Key: SPARK-43173 > URL: https://issues.apache.org/jira/browse/SPARK-43173 > Project: Spark > Issue Type: Improvement > Components: Connect, Tests >Affects Versions: 3.5.0 >Reporter: Yang Jie >Priority: Minor > > both > ``` > build/mvn clean install -Dtest=none > -DwildcardSuites=org.apache.spark.sql.ClientE2ETestSuite > ``` > and > ``` > build/sbt "connect-client-jvm/testOnly *ClientE2ETestSuite" > ``` > > will test failed when using Java 11&17 > > {code:java} > - write jdbc *** FAILED *** > io.grpc.StatusRuntimeException: INTERNAL: No suitable driver > at io.grpc.Status.asRuntimeException(Status.java:535) > at > io.grpc.stub.ClientCalls$BlockingResponseStream.hasNext(ClientCalls.java:660) > at > scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:45) > at scala.collection.Iterator.foreach(Iterator.scala:943) > at scala.collection.Iterator.foreach$(Iterator.scala:943) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1431) > at org.apache.spark.sql.SparkSession.execute(SparkSession.scala:458) > at > org.apache.spark.sql.DataFrameWriter.executeWriteOperation(DataFrameWriter.scala:257) > at > org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:221) > at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:218) > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-43173) `write jdbc` in `ClientE2ETestSuite` will fail without `-Phive`
Yang Jie created SPARK-43173: Summary: `write jdbc` in `ClientE2ETestSuite` will test fail with out `-Phive` Key: SPARK-43173 URL: https://issues.apache.org/jira/browse/SPARK-43173 Project: Spark Issue Type: Improvement Components: Connect, Tests Affects Versions: 3.5.0 Reporter: Yang Jie both ``` build/mvn clean install -Dtest=none -DwildcardSuites=org.apache.spark.sql.ClientE2ETestSuite ``` and ``` build/sbt "connect-client-jvm/testOnly *ClientE2ETestSuite" ``` will test failed {code:java} - write jdbc *** FAILED *** io.grpc.StatusRuntimeException: INTERNAL: No suitable driver at io.grpc.Status.asRuntimeException(Status.java:535) at io.grpc.stub.ClientCalls$BlockingResponseStream.hasNext(ClientCalls.java:660) at scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:45) at scala.collection.Iterator.foreach(Iterator.scala:943) at scala.collection.Iterator.foreach$(Iterator.scala:943) at scala.collection.AbstractIterator.foreach(Iterator.scala:1431) at org.apache.spark.sql.SparkSession.execute(SparkSession.scala:458) at org.apache.spark.sql.DataFrameWriter.executeWriteOperation(DataFrameWriter.scala:257) at org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:221) at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:218) {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-43172) Expose host and bearer tokens from the spark connect client
Niranjan Jayakar created SPARK-43172: Summary: Expose host and bearer tokens from the spark connect client Key: SPARK-43172 URL: https://issues.apache.org/jira/browse/SPARK-43172 Project: Spark Issue Type: Improvement Components: Connect, PySpark Affects Versions: 3.4.0 Reporter: Niranjan Jayakar The `SparkConnectClient` class takes in a connection string to connect with the spark connect service. As part of setting up the connection, it parses the connection string. Expose the parsed host and bearer tokens as part of the class, so they may be accessed by consumers. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
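The connection string that `SparkConnectClient` parses follows the `sc://host:port/;param=value` scheme, with parameters carried after the path as `;key=value` pairs. A minimal sketch of extracting the host and bearer token — the helper name, the 15002 default port, and the returned tuple shape are assumptions for illustration, and the actual properties exposed by this change may differ:

```python
def parse_connect_string(conn: str):
    # Minimal sketch of Spark Connect connection-string parsing; the real
    # SparkConnectClient performs more validation than shown here.
    prefix = "sc://"
    if not conn.startswith(prefix):
        raise ValueError("Spark Connect URLs use the sc:// scheme")
    netloc, _, param_str = conn[len(prefix):].partition("/")
    host, _, port = netloc.partition(":")
    # Parameters ride after the path as ;key=value pairs.
    params = dict(p.partition("=")[::2]
                  for p in param_str.strip(";").split(";") if p)
    return host, int(port) if port else 15002, params.get("token")

host, port, token = parse_connect_string(
    "sc://example.host:15002/;token=abc123;use_ssl=true")
```

Exposing these parsed values as read-only properties lets consumers (e.g. tooling that needs to open a second channel to the same endpoint) reuse the credentials without re-parsing the string themselves.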
[jira] [Commented] (SPARK-43170) The Spark SQL LIKE statement is pushed down to Parquet for execution, but the data cannot be queried
[ https://issues.apache.org/jira/browse/SPARK-43170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17713503#comment-17713503 ] Yuming Wang commented on SPARK-43170: - Have you cached dwm_user_app_action_sum_all? -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42669) Short circuit local relation rpcs
[ https://issues.apache.org/jira/browse/SPARK-42669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17713495#comment-17713495 ] ASF GitHub Bot commented on SPARK-42669: User 'beliefer' has created a pull request for this issue: https://github.com/apache/spark/pull/40782 > Short circuit local relation rpcs > - > > Key: SPARK-42669 > URL: https://issues.apache.org/jira/browse/SPARK-42669 > Project: Spark > Issue Type: New Feature > Components: Connect >Affects Versions: 3.4.0 >Reporter: Herman van Hövell >Priority: Major > > Operations on LocalRelation can mostly be done locally (without sending > rpcs). We should leverage this. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-43137) Improve ArrayInsert if the position is foldable and equals to zero.
[ https://issues.apache.org/jira/browse/SPARK-43137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17713493#comment-17713493 ] ASF GitHub Bot commented on SPARK-43137: User 'beliefer' has created a pull request for this issue: https://github.com/apache/spark/pull/40833 > Improve ArrayInsert if the position is foldable and equals to zero. > --- > > Key: SPARK-43137 > URL: https://issues.apache.org/jira/browse/SPARK-43137 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.0 >Reporter: jiaan.geng >Priority: Major > > We want to make array_prepend reuse the implementation of array_insert, but performance is slightly worse when the position is foldable and equals zero. > The reason is that the generated code always checks whether the position is negative or positive, which makes it too long; overly long generated code can cause JIT compilation to fail. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
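To see why a foldable position matters, here is a small reference model of the reuse in question — plain Python, positive 1-based positions only (Spark's negative-index handling is deliberately omitted, and the null-padding behavior is an assumption of this sketch). When the position is a literal such as `1` for `array_prepend`, the negative-vs-positive branch can be resolved once at planning time instead of appearing in the generated code for every row:

```python
def array_insert(arr, pos, item):
    # Minimal model of Spark's array_insert for positive, 1-based
    # positions only; pads with None when pos is past the end.
    if pos <= 0:
        raise ValueError("this sketch only models positive positions")
    out = list(arr)
    idx = pos - 1
    if idx > len(out):
        out += [None] * (idx - len(out))
    out.insert(idx, item)
    return out

def array_prepend(arr, item):
    # The reuse SPARK-43137 targets: prepend is insert at position 1,
    # and with a literal (foldable) position the sign check can be elided.
    return array_insert(arr, 1, item)

print(array_insert([1, 2, 3], 2, 9))  # [1, 9, 2, 3]
print(array_prepend([1, 2], 0))       # [0, 1, 2]
```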
[jira] [Commented] (SPARK-42585) Streaming createDataFrame implementation
[ https://issues.apache.org/jira/browse/SPARK-42585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17713494#comment-17713494 ] ASF GitHub Bot commented on SPARK-42585: User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/40827 > Streaming createDataFrame implementation > > > Key: SPARK-42585 > URL: https://issues.apache.org/jira/browse/SPARK-42585 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Assignee: Max Gekk >Priority: Major > > createDataFrame in Spark Connect is now one protobuf message which doesn't > allow creating a large local DataFrame. We should make it streaming. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42552) Get ParseException when run sql: "SELECT 1 UNION SELECT 1;"
[ https://issues.apache.org/jira/browse/SPARK-42552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17713489#comment-17713489 ] Ignite TC Bot commented on SPARK-42552: --- User 'Hisoka-X' has created a pull request for this issue: https://github.com/apache/spark/pull/40823 > Get ParseException when run sql: "SELECT 1 UNION SELECT 1;" > --- > > Key: SPARK-42552 > URL: https://issues.apache.org/jira/browse/SPARK-42552 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.3 > Environment: Scala version 2.12.15 (OpenJDK 64-Bit Server VM, Java > 1.8.0_345) > Spark version 3.2.3-SNAPSHOT >Reporter: jiang13021 >Priority: Major > Fix For: 3.2.3 > > > When I run sql > {code:java} > scala> spark.sql("SELECT 1 UNION SELECT 1;") {code} > I get ParseException: > {code:java} > org.apache.spark.sql.catalyst.parser.ParseException: > mismatched input 'SELECT' expecting {, ';'}(line 1, pos 15)== SQL == > SELECT 1 UNION SELECT 1; > ---^^^ at > org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:266) > at > org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:127) > at > org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:51) > at > org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:77) > at org.apache.spark.sql.SparkSession.$anonfun$sql$2(SparkSession.scala:616) > at > org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:111) > at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:616) > at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775) > at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:613) > ... 
47 elided > {code} > If I run with parentheses , it works well > {code:java} > scala> spark.sql("(SELECT 1) UNION (SELECT 1);") > res4: org.apache.spark.sql.DataFrame = [1: int]{code} > This should be a bug > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
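For reference, the unparenthesized form is valid SQL: a set operation may join two bare SELECTs without parentheses. SQLite — used here purely as a convenient conforming parser, nothing Spark-specific — accepts it, which supports treating Spark's ParseException as a bug rather than a usage error:

```python
import sqlite3

# The exact query Spark 3.2.3 rejects, run against SQLite's parser.
conn = sqlite3.connect(":memory:")
rows = conn.execute("SELECT 1 UNION SELECT 1").fetchall()
print(rows)  # [(1,)] -- UNION deduplicates, so one row survives
```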
[jira] [Resolved] (SPARK-42151) Align UPDATE assignments with table attributes
[ https://issues.apache.org/jira/browse/SPARK-42151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-42151. - Fix Version/s: 3.5.0 Resolution: Fixed Issue resolved by pull request 40308 [https://github.com/apache/spark/pull/40308] > Align UPDATE assignments with table attributes > -- > > Key: SPARK-42151 > URL: https://issues.apache.org/jira/browse/SPARK-42151 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Anton Okolnychyi >Assignee: Anton Okolnychyi >Priority: Major > Fix For: 3.5.0 > > > Assignment in UPDATE commands should be aligned with table attributes prior > to rewriting those UPDATE commands. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42151) Align UPDATE assignments with table attributes
[ https://issues.apache.org/jira/browse/SPARK-42151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-42151: --- Assignee: Anton Okolnychyi > Align UPDATE assignments with table attributes > -- > > Key: SPARK-42151 > URL: https://issues.apache.org/jira/browse/SPARK-42151 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Anton Okolnychyi >Assignee: Anton Okolnychyi >Priority: Major > > Assignment in UPDATE commands should be aligned with table attributes prior > to rewriting those UPDATE commands. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-43153) Skip Spark execution when the dataframe is local.
[ https://issues.apache.org/jira/browse/SPARK-43153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng resolved SPARK-43153. --- Fix Version/s: 3.5.0 Resolution: Fixed Issue resolved by pull request 40806 [https://github.com/apache/spark/pull/40806] > Skip Spark execution when the dataframe is local. > - > > Key: SPARK-43153 > URL: https://issues.apache.org/jira/browse/SPARK-43153 > Project: Spark > Issue Type: Improvement > Components: Connect >Affects Versions: 3.4.0 >Reporter: Takuya Ueshin >Assignee: Takuya Ueshin >Priority: Major > Fix For: 3.5.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-43153) Skip Spark execution when the dataframe is local.
[ https://issues.apache.org/jira/browse/SPARK-43153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng reassigned SPARK-43153: - Assignee: Takuya Ueshin > Skip Spark execution when the dataframe is local. > - > > Key: SPARK-43153 > URL: https://issues.apache.org/jira/browse/SPARK-43153 > Project: Spark > Issue Type: Improvement > Components: Connect >Affects Versions: 3.4.0 >Reporter: Takuya Ueshin >Assignee: Takuya Ueshin >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org