[jira] [Commented] (SPARK-43170) The Spark SQL LIKE statement is pushed down to Parquet for execution, but the data cannot be queried
[ https://issues.apache.org/jira/browse/SPARK-43170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17717611#comment-17717611 ] Steve Loughran commented on SPARK-43170:

FWIW, the S3 URL 's3://x/dwm_user_app_action_sum_all' means this is an AWS EMR deployment, with their private fork of Spark; you might want to raise a support case there.

> The Spark SQL LIKE statement is pushed down to Parquet for execution, but the data cannot be queried
>
> Key: SPARK-43170
> URL: https://issues.apache.org/jira/browse/SPARK-43170
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.2.2
> Reporter: todd
> Priority: Major
> Attachments: image-2023-04-18-10-59-30-199.png, image-2023-04-19-10-59-44-118.png, screenshot-1.png
>
> -- DDL
> CREATE TABLE `ecom_dwm`.`dwm_user_app_action_sum_all` (
>   `gaid` STRING COMMENT '',
>   `beyla_id` STRING COMMENT '',
>   `dt` STRING,
>   `hour` STRING,
>   `appid` STRING COMMENT 'package name')
> USING parquet
> PARTITIONED BY (dt, hour, appid)
> LOCATION 's3://x/dwm_user_app_action_sum_all'
>
> -- partitions info
> show partitions ecom_dwm.dwm_user_app_action_sum_all PARTITION (dt='20230412');
>
> dt=20230412/hour=23/appid=blibli.mobile.commerce
> dt=20230412/hour=23/appid=cn.shopee.app
> dt=20230412/hour=23/appid=cn.shopee.br
> dt=20230412/hour=23/appid=cn.shopee.id
> dt=20230412/hour=23/appid=cn.shopee.my
> dt=20230412/hour=23/appid=cn.shopee.ph
>
> -- query
> select DISTINCT(appid) from ecom_dwm.dwm_user_app_action_sum_all
> where dt='20230412' and appid like '%shopee%'
>
> -- result
> no data
>
> -- other
> With Spark 3.0.1 and the Trino query engine, the same query returns the data.
>
> The physical scan node produced by Spark 3.2:
> (3) Scan parquet ecom_dwm.dwm_user_app_action_sum_all
> Output [3]: [dt#63, hour#64, appid#65]
> Batched: true
> Location: InMemoryFileIndex []
> PartitionFilters: [isnotnull(dt#63), isnotnull(appid#65), (dt#63 = 20230412), Contains(appid#65, shopee)]
> ReadSchema: struct<>
>
> !image-2023-04-18-10-59-30-199.png!
> -- sql plan detail
> {code:java}
> == Physical Plan ==
> CollectLimit (9)
> +- InMemoryTableScan (1)
>       +- InMemoryRelation (2)
>             +- * HashAggregate (8)
>                +- Exchange (7)
>                   +- * HashAggregate (6)
>                      +- * Project (5)
>                         +- * ColumnarToRow (4)
>                            +- Scan parquet ecom_dwm.dwm_user_app_action_sum_all (3)
>
> (1) InMemoryTableScan
> Output [1]: [appid#65]
> Arguments: [appid#65]
>
> (2) InMemoryRelation
> Arguments: [appid#65], CachedRDDBuilder(org.apache.spark.sql.execution.columnar.DefaultCachedBatchSerializer@ab5af13,StorageLevel(disk, memory, deserialized, 1 replicas),*(2) HashAggregate(keys=[appid#65], functions=[], output=[appid#65])
> +- Exchange hashpartitioning(appid#65, 200), ENSURE_REQUIREMENTS, [plan_id=24]
>    +- *(1) HashAggregate(keys=[appid#65], functions=[], output=[appid#65])
>       +- *(1) Project [appid#65]
>          +- *(1) ColumnarToRow
>             +- FileScan parquet ecom_dwm.dwm_user_app_action_sum_all[dt#63,hour#64,appid#65] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex(0 paths)[], PartitionFilters: [isnotnull(dt#63), isnotnull(appid#65), (dt#63 = 20230412), StartsWith(appid#65, com)], PushedFilters: [], ReadSchema: struct<>
> ,None)
>
> (3) Scan parquet ecom_dwm.dwm_user_app_action_sum_all
> Output [3]: [dt#63, hour#64, appid#65]
> Batched: true
> Location: InMemoryFileIndex []
> PartitionFilters: [isnotnull(dt#63), isnotnull(appid#65), (dt#63 = 20230412), StartsWith(appid#65, com)]
> ReadSchema: struct<>
>
> (4) ColumnarToRow [codegen id : 1]
> Input [3]: [dt#63, hour#64, appid#65]
>
> (5) Project [codegen id : 1]
> Output [1]: [appid#65]
> Input [3]: [dt#63, hour#64, appid#65]
>
> (6) HashAggregate [codegen id : 1]
> Input [1]: [appid#65]
> Keys [1]: [appid#65]
> Functions: []
> Aggregate Attributes: []
> Results [1]: [appid#65]
>
> (7) Exchange
> Input [1]: [appid#65]
> Arguments: hashpartitioning(appid#65, 200), ENSURE_REQUIREMENTS, [plan_id=24]
>
> (8) HashAggregate [codegen id : 2]
> Input [1]: [appid#65]
> Keys [1]: [appid#65]
> Functions: []
> Aggregate Attributes: []
> Results [1]: [appid#65]
>
> (9) CollectLimit
> Input [1]: [appid#65]
> Arguments: 1
> {code}
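For reference, the compiled partition filters for such a query can be inspected directly. The statement below is a sketch using only the table and column names from the DDL above; a double-wildcard LIKE is expected to compile into a Contains predicate, as seen in the report.

{code:sql}
-- Sketch: inspect how the LIKE predicate is compiled into partition filters.
-- Table and column names are taken from the DDL above; nothing else is assumed.
EXPLAIN FORMATTED
SELECT DISTINCT appid
FROM ecom_dwm.dwm_user_app_action_sum_all
WHERE dt = '20230412'
  AND appid LIKE '%shopee%';

-- A fresh, uncached plan should show a FileScan node with
-- PartitionFilters: [..., (dt#.. = 20230412), Contains(appid#.., shopee)]
{code}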
[ https://issues.apache.org/jira/browse/SPARK-43170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17713861#comment-17713861 ] Yuming Wang commented on SPARK-43170:

Maybe your partition exists, but there is no data under the partition, such as the following: !screenshot-1.png!
{noformat}
yumwang@LM-SHC-16508156 dwm_user_app_action_sum_all2 % ls -R dt=20230412
./dt=20230412:
hour=23

./dt=20230412/hour=23:
appid=blibli.mobile.commerce  appid=cn.shopee.br  appid=cn.shopee.my
appid=cn.shopee.app           appid=cn.shopee.id  appid=cn.shopee.ph

./dt=20230412/hour=23/appid=blibli.mobile.commerce:

./dt=20230412/hour=23/appid=cn.shopee.app:

./dt=20230412/hour=23/appid=cn.shopee.br:

./dt=20230412/hour=23/appid=cn.shopee.id:

./dt=20230412/hour=23/appid=cn.shopee.my:

./dt=20230412/hour=23/appid=cn.shopee.ph:
{noformat}
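This hypothesis is easy to check from SQL alone, since SHOW PARTITIONS only reflects metastore entries, not the presence of data files. The query below is a sketch against the reporter's table:

{code:sql}
-- Sketch: SHOW PARTITIONS lists metastore entries even when the partition
-- directories hold no files; a per-partition count exposes the empty ones.
SELECT hour, appid, count(*) AS row_count
FROM ecom_dwm.dwm_user_app_action_sum_all
WHERE dt = '20230412'
GROUP BY hour, appid;

-- Partitions without data files produce no groups here,
-- even though SHOW PARTITIONS still lists them.
{code}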
[ https://issues.apache.org/jira/browse/SPARK-43170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17713858#comment-17713858 ] Yuming Wang commented on SPARK-43170:

I can't reproduce this issue: !image-2023-04-19-10-59-44-118.png!
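For anyone retrying this, a minimal reproduction attempt might look like the sketch below. The table name repro_like_pushdown and the inserted value are made up, and the schema is trimmed to the columns that matter; the DDL form mirrors the reporter's table.

{code:sql}
-- Sketch of a minimal reproduction (illustrative names; schema trimmed).
CREATE TABLE repro_like_pushdown (
  gaid STRING,
  dt STRING,
  hour STRING,
  appid STRING)
USING parquet
PARTITIONED BY (dt, hour, appid);

-- Write one row into a partition that should match the LIKE pattern.
INSERT INTO repro_like_pushdown
PARTITION (dt = '20230412', hour = '23', appid = 'cn.shopee.app')
VALUES ('g1');

-- Expected to return cn.shopee.app; an empty result would match the report.
SELECT DISTINCT appid
FROM repro_like_pushdown
WHERE dt = '20230412' AND appid LIKE '%shopee%';
{code}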
[ https://issues.apache.org/jira/browse/SPARK-43170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17713855#comment-17713855 ] todd commented on SPARK-43170:

[~yumwang] The code only executes spark.sql("xxx") and performs no cache-related operations. Why does the same code produce different results on Spark 3.0 and Spark 3.2? If it is convenient for you, please try to reproduce it.
[ https://issues.apache.org/jira/browse/SPARK-43170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17713576#comment-17713576 ] Yuming Wang commented on SPARK-43170:

Why is it {{CachedRDDBuilder}}?
{noformat}
(2) InMemoryRelation
Arguments: [appid#65], CachedRDDBuilder(org.apache.spark.sql.execution.columnar.DefaultCachedBatchSerializer@ab5af13,StorageLevel(disk, memory, deserialized, 1 replicas),*(2) HashAggregate(keys=[appid#65], functions=[], output=[appid#65])
+- Exchange hashpartitioning(appid#65, 200), ENSURE_REQUIREMENTS, [plan_id=24]
   +- *(1) HashAggregate(keys=[appid#65], functions=[], output=[appid#65])
      +- *(1) Project [appid#65]
         +- *(1) ColumnarToRow
            +- FileScan parquet ecom_dwm.dwm_user_app_action_sum_all[dt#63,hour#64,appid#65] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex(0 paths)[], PartitionFilters: [isnotnull(dt#63), isnotnull(appid#65), (dt#63 = 20230412), StartsWith(appid#65, com)], PushedFilters: [], ReadSchema: struct<>
,None)
{noformat}
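For context, and as an assumption rather than a statement about the reporter's job: an InMemoryRelation backed by a CachedRDDBuilder normally appears only when the table, or an equivalent query, was cached earlier in the session, for example:

{code:sql}
-- Sketch: either statement registers a cached plan that later queries can be
-- answered from, surfacing as InMemoryRelation/CachedRDDBuilder in EXPLAIN.
CACHE TABLE ecom_dwm.dwm_user_app_action_sum_all;

-- Or caching a derived query. The name shopee_com_apps and the LIKE 'com%'
-- filter are purely illustrative, chosen because the cached plan above carries
-- StartsWith(appid#65, com), which is how LIKE 'com%' compiles.
CACHE TABLE shopee_com_apps AS
SELECT DISTINCT appid
FROM ecom_dwm.dwm_user_app_action_sum_all
WHERE dt = '20230412' AND appid LIKE 'com%';
{code}

Notably, the cached plan carries StartsWith(appid#65, com) while the query under discussion compiles to Contains(appid#65, shopee), which suggests the scan is being served from a plan cached for a different LIKE pattern.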
[ https://issues.apache.org/jira/browse/SPARK-43170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17713543#comment-17713543 ] todd commented on SPARK-43170:

[~yumwang] no cache
[ https://issues.apache.org/jira/browse/SPARK-43170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17713503#comment-17713503 ] Yuming Wang commented on SPARK-43170:

Have you cached dwm_user_app_action_sum_all?
[ https://issues.apache.org/jira/browse/SPARK-43170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17713412#comment-17713412 ] Hyukjin Kwon commented on SPARK-43170:

There'd be no more releases in Spark 3.2.X unless there's an important security issue, etc.
[ https://issues.apache.org/jira/browse/SPARK-43170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17713403#comment-17713403 ] todd commented on SPARK-43170:

Spark 3.2.x is currently used in production, and there is no plan to upgrade to a higher version for the time being. If this is a bug, won't it be fixed in the 3.2 line?
[ https://issues.apache.org/jira/browse/SPARK-43170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17713373#comment-17713373 ] Hyukjin Kwon commented on SPARK-43170:

Spark 3.2.X is EOL. Mind trying whether the same issue persists in higher versions?