[jira] [Commented] (SPARK-43170) The spark sql like statement is pushed down to parquet for execution, but the data cannot be queried

2023-04-28 Thread Steve Loughran (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17717611#comment-17717611
 ] 

Steve Loughran commented on SPARK-43170:


FWIW, the S3 URL 's3://x/dwm_user_app_action_sum_all' means this is an 
AWS EMR deployment, running their private fork of Spark, etc. You might want to 
raise a support case there.

> The spark sql like statement is pushed down to parquet for execution, but the 
> data cannot be queried
> 
>
> Key: SPARK-43170
> URL: https://issues.apache.org/jira/browse/SPARK-43170
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.2
>Reporter: todd
>Priority: Major
> Attachments: image-2023-04-18-10-59-30-199.png, 
> image-2023-04-19-10-59-44-118.png, screenshot-1.png
>
>
> --DDL
> CREATE TABLE `ecom_dwm`.`dwm_user_app_action_sum_all` (
>   `gaid` STRING COMMENT '',
>   `beyla_id` STRING COMMENT '',
>   `dt` STRING,
>   `hour` STRING,
>   `appid` STRING COMMENT '包名')
> USING parquet
> PARTITIONED BY (dt, hour, appid)
> LOCATION 's3://x/dwm_user_app_action_sum_all'
> – partitions  info
> show partitions ecom_dwm.dwm_user_app_action_sum_all PARTITION 
> (dt='20230412');
>  
> dt=20230412/hour=23/appid=blibli.mobile.commerce
> dt=20230412/hour=23/appid=cn.shopee.app
> dt=20230412/hour=23/appid=cn.shopee.br
> dt=20230412/hour=23/appid=cn.shopee.id
> dt=20230412/hour=23/appid=cn.shopee.my
> dt=20230412/hour=23/appid=cn.shopee.ph
>  
> — query
> select DISTINCT(appid) from ecom_dwm.dwm_user_app_action_sum_all
> where dt='20230412' and appid like '%shopee%'
>  
> --result
>  no data
>  
> — other
> I can query the data with Spark 3.0.1 and with the Trino query engine.
>  
>  
> The physical execution node formed by spark 3.2
> (3) Scan parquet ecom_dwm.dwm_user_app_action_sum_all
> Output [3]: [dt#63, hour#64, appid#65]
> Batched: true
> Location: InMemoryFileIndex []
> PartitionFilters: [isnotnull(dt#63), isnotnull(appid#65), (dt#63 = 20230412), Contains(appid#65, shopee)]
> ReadSchema: struct<>
>  
>  
> !image-2023-04-18-10-59-30-199.png!
>  
>  – sql plan detail
> {code:java}
> == Physical Plan ==
> CollectLimit (9)
> +- InMemoryTableScan (1)
>    +- InMemoryRelation (2)
>       +- * HashAggregate (8)
>          +- Exchange (7)
>             +- * HashAggregate (6)
>                +- * Project (5)
>                   +- * ColumnarToRow (4)
>                      +- Scan parquet ecom_dwm.dwm_user_app_action_sum_all (3)
> (1) InMemoryTableScan
> Output [1]: [appid#65]
> Arguments: [appid#65]
> (2) InMemoryRelation
> Arguments: [appid#65], 
> CachedRDDBuilder(org.apache.spark.sql.execution.columnar.DefaultCachedBatchSerializer@ab5af13,StorageLevel(disk,
>  memory, deserialized, 1 replicas),*(2) HashAggregate(keys=[appid#65], 
> functions=[], output=[appid#65])
> +- Exchange hashpartitioning(appid#65, 200), ENSURE_REQUIREMENTS, [plan_id=24]
>+- *(1) HashAggregate(keys=[appid#65], functions=[], output=[appid#65])
>   +- *(1) Project [appid#65]
>  +- *(1) ColumnarToRow
> +- FileScan parquet 
> ecom_dwm.dwm_user_app_action_sum_all[dt#63,hour#64,appid#65] Batched: true, 
> DataFilters: [], Format: Parquet, Location: InMemoryFileIndex(0 paths)[], 
> PartitionFilters: [isnotnull(dt#63), isnotnull(appid#65), (dt#63 = 20230412), 
> StartsWith(appid#65, com)], PushedFilters: [], ReadSchema: struct<>
> ,None)
> (3) Scan parquet ecom_dwm.dwm_user_app_action_sum_all
> Output [3]: [dt#63, hour#64, appid#65]
> Batched: true
> Location: InMemoryFileIndex []
> PartitionFilters: [isnotnull(dt#63), isnotnull(appid#65), (dt#63 = 20230412), 
> StartsWith(appid#65, com)]
> ReadSchema: struct<>
> (4) ColumnarToRow [codegen id : 1]
> Input [3]: [dt#63, hour#64, appid#65]
> (5) Project [codegen id : 1]
> Output [1]: [appid#65]
> Input [3]: [dt#63, hour#64, appid#65]
> (6) HashAggregate [codegen id : 1]
> Input [1]: [appid#65]
> Keys [1]: [appid#65]
> Functions: []
> Aggregate Attributes: []
> Results [1]: [appid#65]
> (7) Exchange
> Input [1]: [appid#65]
> Arguments: hashpartitioning(appid#65, 200), ENSURE_REQUIREMENTS, [plan_id=24]
> (8) HashAggregate [codegen id : 2]
> Input [1]: [appid#65]
> Keys [1]: [appid#65]
> Functions: []
> Aggregate Attributes: []
> Results [1]: [appid#65]
> (9) CollectLimit
> Input [1]: [appid#65]
> Arguments: 1 {code}
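
For readers comparing the plans quoted above: Spark's optimizer rewrites simple LIKE patterns into string predicates, which is why like '%shopee%' appears as Contains(appid#65, shopee) in the scan node. The sketch below is a plain-Python illustration of that mapping, not Spark code. The helper names simplify_like, evaluate and matching are hypothetical, and the partition values are copied from the SHOW PARTITIONS output in the description.

```python
import re

# Partition values copied from the SHOW PARTITIONS output in the ticket.
APPIDS = [
    "blibli.mobile.commerce",
    "cn.shopee.app",
    "cn.shopee.br",
    "cn.shopee.id",
    "cn.shopee.my",
    "cn.shopee.ph",
]

# Assumption: only a plain literal with an optional leading and/or trailing
# '%' is handled; patterns with '_', an inner '%' or escapes are not.
_SIMPLE_LIKE = re.compile(r"(%?)([^%_\\]+)(%?)")


def simplify_like(pattern):
    """Return (predicate_name, literal) for a simple LIKE pattern, else None."""
    match = _SIMPLE_LIKE.fullmatch(pattern)
    if match is None:
        return None
    leading, literal, trailing = match.groups()
    if leading and trailing:
        return ("Contains", literal)
    if trailing:
        return ("StartsWith", literal)
    if leading:
        return ("EndsWith", literal)
    return ("EqualTo", literal)


def evaluate(predicate, value):
    """Apply a (predicate_name, literal) pair to one partition value."""
    name, literal = predicate
    if name == "Contains":
        return literal in value
    if name == "StartsWith":
        return value.startswith(literal)
    if name == "EndsWith":
        return value.endswith(literal)
    return value == literal


def matching(pattern, values):
    """Return the values that the simplified LIKE predicate keeps."""
    predicate = simplify_like(pattern)
    if predicate is None:
        raise ValueError("pattern is not a simple LIKE pattern: " + pattern)
    return [value for value in values if evaluate(predicate, value)]


if __name__ == "__main__":
    print(simplify_like("%shopee%"), matching("%shopee%", APPIDS))
    print(simplify_like("com%"), matching("com%", APPIDS))
```

Applied to the listed partitions, the Contains form keeps the five cn.shopee.* values, while a StartsWith(appid, com) filter like the one in the detailed plan keeps none of them.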



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-43170) The spark sql like statement is pushed down to parquet for execution, but the data cannot be queried

2023-04-18 Thread Yuming Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17713861#comment-17713861
 ] 

Yuming Wang commented on SPARK-43170:
-

Maybe your partition exists, but there is no data under the partition, such as 
the following:
 !screenshot-1.png! 
{noformat}
yumwang@LM-SHC-16508156 dwm_user_app_action_sum_all2 % ls -R
dt=20230412

./dt=20230412:
hour=23

./dt=20230412/hour=23:
appid=blibli.mobile.commerce    appid=cn.shopee.br    appid=cn.shopee.my
appid=cn.shopee.app             appid=cn.shopee.id    appid=cn.shopee.ph

./dt=20230412/hour=23/appid=blibli.mobile.commerce:

./dt=20230412/hour=23/appid=cn.shopee.app:

./dt=20230412/hour=23/appid=cn.shopee.br:

./dt=20230412/hour=23/appid=cn.shopee.id:

./dt=20230412/hour=23/appid=cn.shopee.my:

./dt=20230412/hour=23/appid=cn.shopee.ph:
{noformat}
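
The same check can be scripted. The following is a minimal sketch, assuming the table directory is available on a local or mounted filesystem (the table in this ticket lives on S3, where an object-store listing would be needed instead) and that data files end in ".parquet". The function name empty_partitions and the data_suffix parameter are hypothetical.

```python
import os


def empty_partitions(table_root, data_suffix=".parquet"):
    """Return leaf directories under table_root that hold no data files.

    Paths are returned relative to table_root, sorted. A directory counts
    as a leaf when it has no subdirectories. Assumption: data files are
    recognised purely by their file name suffix.
    """
    empty = []
    for dirpath, dirnames, filenames in os.walk(table_root):
        if dirnames:
            continue  # not a leaf partition directory
        has_data = any(name.endswith(data_suffix) for name in filenames)
        if not has_data:
            empty.append(os.path.relpath(dirpath, table_root))
    return sorted(empty)


if __name__ == "__main__":
    # Hypothetical local copy of the table directory.
    for partition in empty_partitions("dwm_user_app_action_sum_all2"):
        print("no data files in", partition)
```

For the directory listing above, every appid=... directory would be reported, since none of them contains a file.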


[jira] [Commented] (SPARK-43170) The spark sql like statement is pushed down to parquet for execution, but the data cannot be queried

2023-04-18 Thread Yuming Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17713858#comment-17713858
 ] 

Yuming Wang commented on SPARK-43170:
-

I can't reproduce this issue:

!image-2023-04-19-10-59-44-118.png!




[jira] [Commented] (SPARK-43170) The spark sql like statement is pushed down to parquet for execution, but the data cannot be queried

2023-04-18 Thread todd (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17713855#comment-17713855
 ] 

todd commented on SPARK-43170:
--

[~yumwang] The code only executes spark.sql("xxx") and does not perform any 
cache-related operations. Why does the same code give different results on 
Spark 3.0 and Spark 3.2? If it is convenient for you, please try to reproduce it.




[jira] [Commented] (SPARK-43170) The spark sql like statement is pushed down to parquet for execution, but the data cannot be queried

2023-04-18 Thread Yuming Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17713576#comment-17713576
 ] 

Yuming Wang commented on SPARK-43170:
-

Why is there a {{CachedRDDBuilder}} in the plan?
{noformat}
(2) InMemoryRelation
Arguments: [appid#65], 
CachedRDDBuilder(org.apache.spark.sql.execution.columnar.DefaultCachedBatchSerializer@ab5af13,StorageLevel(disk,
 memory, deserialized, 1 replicas),*(2) HashAggregate(keys=[appid#65], 
functions=[], output=[appid#65])
+- Exchange hashpartitioning(appid#65, 200), ENSURE_REQUIREMENTS, [plan_id=24]
   +- *(1) HashAggregate(keys=[appid#65], functions=[], output=[appid#65])
  +- *(1) Project [appid#65]
 +- *(1) ColumnarToRow
+- FileScan parquet 
ecom_dwm.dwm_user_app_action_sum_all[dt#63,hour#64,appid#65] Batched: true, 
DataFilters: [], Format: Parquet, Location: InMemoryFileIndex(0 paths)[], 
PartitionFilters: [isnotnull(dt#63), isnotnull(appid#65), (dt#63 = 20230412), 
StartsWith(appid#65, com)], PushedFilters: [], ReadSchema: struct<>
,None)
{noformat}
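
A quick way to see whether a pasted plan involves cached data is to search its text for the cache-related operator names that appear in the excerpt above. This is a minimal sketch working on plain plan text only. The function name cache_nodes is hypothetical, and the list of markers is limited to the names visible in this ticket.

```python
# Operator names taken from the plan quoted in this ticket.
CACHE_MARKERS = ("InMemoryTableScan", "InMemoryRelation", "CachedRDDBuilder")


def cache_nodes(plan_text):
    """Return the cache-related operator names found in a plan dump."""
    return [marker for marker in CACHE_MARKERS if marker in plan_text]


if __name__ == "__main__":
    sample = (
        "CollectLimit (9)\n"
        "+- InMemoryTableScan (1)\n"
        "   +- InMemoryRelation (2)\n"
    )
    print(cache_nodes(sample))
```

An empty result means none of these operators occur in the text. It says nothing about why a cache node is present when one is found.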



[jira] [Commented] (SPARK-43170) The spark sql like statement is pushed down to parquet for execution, but the data cannot be queried

2023-04-18 Thread todd (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17713543#comment-17713543
 ] 

todd commented on SPARK-43170:
--

[~yumwang] No, there is no cache.




[jira] [Commented] (SPARK-43170) The spark sql like statement is pushed down to parquet for execution, but the data cannot be queried

2023-04-18 Thread Yuming Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17713503#comment-17713503
 ] 

Yuming Wang commented on SPARK-43170:
-

Have you cached dwm_user_app_action_sum_all?

> The spark sql like statement is pushed down to parquet for execution, but the 
> data cannot be queried
> 
>
> Key: SPARK-43170
> URL: https://issues.apache.org/jira/browse/SPARK-43170
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.2
>Reporter: todd
>Priority: Major
> Attachments: image-2023-04-18-10-59-30-199.png
>
>
> --DDL
> CREATE TABLE `ecom_dwm`.`dwm_user_app_action_sum_all` (
>   `gaid` STRING COMMENT '',
>   `beyla_id` STRING COMMENT '',
>   `dt` STRING,
>   `hour` STRING,
>   `appid` STRING COMMENT '包名')
> USING parquet
> PARTITIONED BY (dt, hour, appid)
> LOCATION 's3://x/dwm_user_app_action_sum_all'
> – partitions  info
> show partitions ecom_dwm.dwm_user_app_action_sum_all PARTITION 
> (dt='20230412');
>  
> dt=20230412/hour=23/appid=blibli.mobile.commerce
> dt=20230412/hour=23/appid=cn.shopee.app
> dt=20230412/hour=23/appid=cn.shopee.br
> dt=20230412/hour=23/appid=cn.shopee.id
> dt=20230412/hour=23/appid=cn.shopee.my
> dt=20230412/hour=23/appid=cn.shopee.ph
>  
> — query
> select DISTINCT(appid) from ecom_dwm.dwm_user_app_action_sum_all
> where dt='20230412' and appid like '%shopee%'
>  
> --result
>  nodata 
>  
> — other
> I use spark3.0.1 version and trino query engine to query the data。
>  
>  
> The physical execution node formed by spark 3.2
> (3) Scan parquet ecom_dwm.dwm_user_app_action_sum_all Output [3]: [dt#63, 
> hour#64, appid#65|#63, hour#64, appid#65] Batched: true Location: 
> InMemoryFileIndex []
> PartitionFilters: [isnotnull(dt#63), isnotnull(appid#65), (dt#63 = 20230412), 
> Contains(appid#65, shopee)|#63), isnotnull(appid#65), (dt#63 = 20230412), 
> Contains(appid#65, shopee)] ReadSchema: struct<>
>  
>  
> !image-2023-04-18-10-59-30-199.png!
>  
>  – sql plan detail
> {code:java}
> == Physical Plan ==
> CollectLimit (9)
> +- InMemoryTableScan (1)
>   +- InMemoryRelation (2)
> +- * HashAggregate (8)
>+- Exchange (7)
>   +- * HashAggregate (6)
>  +- * Project (5)
> +- * ColumnarToRow (4)
>+- Scan parquet 
> ecom_dwm.dwm_user_app_action_sum_all (3)
> (1) InMemoryTableScan
> Output [1]: [appid#65]
> Arguments: [appid#65]
> (2) InMemoryRelation
> Arguments: [appid#65], 
> CachedRDDBuilder(org.apache.spark.sql.execution.columnar.DefaultCachedBatchSerializer@ab5af13,StorageLevel(disk,
>  memory, deserialized, 1 replicas),*(2) HashAggregate(keys=[appid#65], 
> functions=[], output=[appid#65])
> +- Exchange hashpartitioning(appid#65, 200), ENSURE_REQUIREMENTS, [plan_id=24]
>+- *(1) HashAggregate(keys=[appid#65], functions=[], output=[appid#65])
>   +- *(1) Project [appid#65]
>  +- *(1) ColumnarToRow
> +- FileScan parquet 
> ecom_dwm.dwm_user_app_action_sum_all[dt#63,hour#64,appid#65] Batched: true, 
> DataFilters: [], Format: Parquet, Location: InMemoryFileIndex(0 paths)[], 
> PartitionFilters: [isnotnull(dt#63), isnotnull(appid#65), (dt#63 = 20230412), 
> StartsWith(appid#65, com)], PushedFilters: [], ReadSchema: struct<>
> ,None)
> (3) Scan parquet ecom_dwm.dwm_user_app_action_sum_all
> Output [3]: [dt#63, hour#64, appid#65]
> Batched: true
> Location: InMemoryFileIndex []
> PartitionFilters: [isnotnull(dt#63), isnotnull(appid#65), (dt#63 = 20230412), 
> StartsWith(appid#65, com)]
> ReadSchema: struct<>
> (4) ColumnarToRow [codegen id : 1]
> Input [3]: [dt#63, hour#64, appid#65]
> (5) Project [codegen id : 1]
> Output [1]: [appid#65]
> Input [3]: [dt#63, hour#64, appid#65]
> (6) HashAggregate [codegen id : 1]
> Input [1]: [appid#65]
> Keys [1]: [appid#65]
> Functions: []
> Aggregate Attributes: []
> Results [1]: [appid#65]
> (7) Exchange
> Input [1]: [appid#65]
> Arguments: hashpartitioning(appid#65, 200), ENSURE_REQUIREMENTS, [plan_id=24]
> (8) HashAggregate [codegen id : 2]
> Input [1]: [appid#65]
> Keys [1]: [appid#65]
> Functions: []
> Aggregate Attributes: []
> Results [1]: [appid#65]
> (9) CollectLimit
> Input [1]: [appid#65]
> Arguments: 1 {code}
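
As a quick sanity check on the two partition filters that appear in the quoted report (`Contains(appid#65, shopee)` in the scan node and `StartsWith(appid#65, com)` in the detailed plan), the sketch below evaluates both predicates against the six partition values shown in the SHOW PARTITIONS output. It is plain Python that only emulates the string semantics of those filters; it does not run Spark, and the helper names `contains` and `starts_with` are hypothetical.

```python
# Partition values copied from the SHOW PARTITIONS output in the report.
# Only the six partitions shown there are covered; the real table may have more.
APPIDS = [
    "blibli.mobile.commerce",
    "cn.shopee.app",
    "cn.shopee.br",
    "cn.shopee.id",
    "cn.shopee.my",
    "cn.shopee.ph",
]


def contains(values, needle):
    """Emulate Contains(appid, needle), i.e. appid LIKE '%needle%'."""
    return [v for v in values if needle in v]


def starts_with(values, prefix):
    """Emulate StartsWith(appid, prefix), i.e. appid LIKE 'prefix%'."""
    return [v for v in values if v.startswith(prefix)]


# Filter from the reported query: appid LIKE '%shopee%'
shopee_matches = contains(APPIDS, "shopee")

# Filter that appears in the detailed plan: StartsWith(appid#65, com)
com_matches = starts_with(APPIDS, "com")

print("Contains 'shopee':", shopee_matches)
print("StartsWith 'com':", com_matches)
```

On the listed values, the `Contains` predicate keeps the five `cn.shopee.*` partitions, while the `StartsWith` predicate keeps none of them. That suggests the detailed plan may have been captured from a different query than the one in the report, and that the empty result for `'%shopee%'` is more likely related to partition listing or pruning (the plan shows `InMemoryFileIndex(0 paths)[]`) than to the string predicate itself.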



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-43170) The spark sql like statement is pushed down to parquet for execution, but the data cannot be queried

2023-04-17 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17713412#comment-17713412
 ] 

Hyukjin Kwon commented on SPARK-43170:
--

There will be no more Spark 3.2.x releases unless an important security issue 
or something similarly critical comes up.







[jira] [Commented] (SPARK-43170) The spark sql like statement is pushed down to parquet for execution, but the data cannot be queried

2023-04-17 Thread todd (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17713403#comment-17713403
 ] 

todd commented on SPARK-43170:
--

We currently run Spark 3.2.x in production and have no plans to upgrade to a 
higher version for the time being. If this is a bug, will it not be fixed in 
Spark 3.2?







[jira] [Commented] (SPARK-43170) The spark sql like statement is pushed down to parquet for execution, but the data cannot be queried

2023-04-17 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17713373#comment-17713373
 ] 

Hyukjin Kwon commented on SPARK-43170:
--

Spark 3.2.x is EOL. Would you mind checking whether the issue persists in a 
newer version?



