[ 
https://issues.apache.org/jira/browse/SPARK-43170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

todd updated SPARK-43170:
-------------------------
    Description: 
-- DDL

CREATE TABLE `ecom_dwm`.`dwm_user_app_action_sum_all` (
  `gaid` STRING COMMENT '',
  `beyla_id` STRING COMMENT '',
  `dt` STRING,
  `hour` STRING,
  `appid` STRING COMMENT 'package name')
USING parquet
PARTITIONED BY (dt, hour, appid)
LOCATION 's3://xxxxx/dwm_user_app_action_sum_all'
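
For a self-contained reproduction, partitions like the ones listed below could be registered with ALTER TABLE ... ADD PARTITION (a minimal sketch; the S3 path here is illustrative, not the real location):
{code:sql}
-- Illustrative only: register one of the partitions shown below.
-- The real table already contains these partitions; the path is made up.
ALTER TABLE ecom_dwm.dwm_user_app_action_sum_all ADD IF NOT EXISTS
  PARTITION (dt = '20230412', hour = '23', appid = 'cn.shopee.app')
  LOCATION 's3://xxxxx/dwm_user_app_action_sum_all/dt=20230412/hour=23/appid=cn.shopee.app';
{code}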

-- partitions info
show partitions ecom_dwm.dwm_user_app_action_sum_all PARTITION (dt='20230412');
 
dt=20230412/hour=23/appid=blibli.mobile.commerce
dt=20230412/hour=23/appid=cn.shopee.app
dt=20230412/hour=23/appid=cn.shopee.br
dt=20230412/hour=23/appid=cn.shopee.id
dt=20230412/hour=23/appid=cn.shopee.my
dt=20230412/hour=23/appid=cn.shopee.ph
 
-- query
select DISTINCT(appid) from ecom_dwm.dwm_user_app_action_sum_all
where dt='20230412' and appid like '%shopee%'
 
-- result
no data
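
As additional diagnostics (illustrative queries, not part of the original run), the same partitions can be probed with an exact partition value and with a different string predicate, to see whether only the LIKE-derived partition filter misses:
{code:sql}
-- Sanity checks (illustrative): both predicates target the cn.shopee.* partitions
-- listed above for dt=20230412.
select count(*) from ecom_dwm.dwm_user_app_action_sum_all
where dt = '20230412' and appid = 'cn.shopee.app';

select DISTINCT(appid) from ecom_dwm.dwm_user_app_action_sum_all
where dt = '20230412' and instr(appid, 'shopee') > 0;
{code}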
 
-- other
I can query the data with Spark 3.0.1 and with the Trino query engine.
 
 
The physical scan node produced by Spark 3.2:
(3) Scan parquet ecom_dwm.dwm_user_app_action_sum_all
Output [3]: [dt#63, hour#64, appid#65]
Batched: true
Location: InMemoryFileIndex []
PartitionFilters: [isnotnull(dt#63), isnotnull(appid#65), (dt#63 = 20230412), Contains(appid#65, shopee)]
ReadSchema: struct<>
 
 
!image-2023-04-18-10-59-30-199.png!
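
For reference, the partition filters Spark generates for this query can be inspected with EXPLAIN FORMATTED (a minimal sketch; this is not necessarily the exact command used to capture the plan below):
{code:sql}
-- Print the numbered physical plan, including PartitionFilters, for the failing query.
EXPLAIN FORMATTED
select DISTINCT(appid) from ecom_dwm.dwm_user_app_action_sum_all
where dt = '20230412' and appid like '%shopee%';
{code}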
 
-- sql plan detail
{code:java}
== Physical Plan ==
CollectLimit (9)
+- InMemoryTableScan (1)
      +- InMemoryRelation (2)
            +- * HashAggregate (8)
               +- Exchange (7)
                  +- * HashAggregate (6)
                     +- * Project (5)
                        +- * ColumnarToRow (4)
                           +- Scan parquet ecom_dwm.dwm_user_app_action_sum_all (3)


(1) InMemoryTableScan
Output [1]: [appid#65]
Arguments: [appid#65]

(2) InMemoryRelation
Arguments: [appid#65], 
CachedRDDBuilder(org.apache.spark.sql.execution.columnar.DefaultCachedBatchSerializer@ab5af13,StorageLevel(disk,
 memory, deserialized, 1 replicas),*(2) HashAggregate(keys=[appid#65], 
functions=[], output=[appid#65])
+- Exchange hashpartitioning(appid#65, 200), ENSURE_REQUIREMENTS, [plan_id=24]
   +- *(1) HashAggregate(keys=[appid#65], functions=[], output=[appid#65])
      +- *(1) Project [appid#65]
         +- *(1) ColumnarToRow
            +- FileScan parquet 
ecom_dwm.dwm_user_app_action_sum_all[dt#63,hour#64,appid#65] Batched: true, 
DataFilters: [], Format: Parquet, Location: InMemoryFileIndex(0 paths)[], 
PartitionFilters: [isnotnull(dt#63), isnotnull(appid#65), (dt#63 = 20230412), 
StartsWith(appid#65, com)], PushedFilters: [], ReadSchema: struct<>
,None)

(3) Scan parquet ecom_dwm.dwm_user_app_action_sum_all
Output [3]: [dt#63, hour#64, appid#65]
Batched: true
Location: InMemoryFileIndex []
PartitionFilters: [isnotnull(dt#63), isnotnull(appid#65), (dt#63 = 20230412), 
StartsWith(appid#65, com)]
ReadSchema: struct<>

(4) ColumnarToRow [codegen id : 1]
Input [3]: [dt#63, hour#64, appid#65]

(5) Project [codegen id : 1]
Output [1]: [appid#65]
Input [3]: [dt#63, hour#64, appid#65]

(6) HashAggregate [codegen id : 1]
Input [1]: [appid#65]
Keys [1]: [appid#65]
Functions: []
Aggregate Attributes: []
Results [1]: [appid#65]

(7) Exchange
Input [1]: [appid#65]
Arguments: hashpartitioning(appid#65, 200), ENSURE_REQUIREMENTS, [plan_id=24]

(8) HashAggregate [codegen id : 2]
Input [1]: [appid#65]
Keys [1]: [appid#65]
Functions: []
Aggregate Attributes: []
Results [1]: [appid#65]

(9) CollectLimit
Input [1]: [appid#65]
Arguments: 1 {code}

  was:
--DDL

CREATE TABLE `ecom_dwm`.`dwm_user_app_action_sum_all` (
  `gaid` STRING COMMENT '',
  `beyla_id` STRING COMMENT '',
  `dt` STRING,
  `hour` STRING,
  `appid` STRING COMMENT 'package name')
USING parquet
PARTITIONED BY (dt, hour, appid)
LOCATION 's3://xxxxx/dwm_user_app_action_sum_all'

-- partitions  info
show partitions ecom_dwm.dwm_user_app_action_sum_all PARTITION (dt='20230412');
 
dt=20230412/hour=23/appid=blibli.mobile.commerce
dt=20230412/hour=23/appid=cn.shopee.app
dt=20230412/hour=23/appid=cn.shopee.br
dt=20230412/hour=23/appid=cn.shopee.id
dt=20230412/hour=23/appid=cn.shopee.my
dt=20230412/hour=23/appid=cn.shopee.ph
 
--- query
select DISTINCT(appid) from ecom_dwm.dwm_user_app_action_sum_all
where dt='20230412' and appid like '%shopee%'
 
--result
 no data
 
--- other
I can query the data with Spark 3.0.1 and with the Trino query engine.
 
 
The physical execution node formed by spark 3.2
(3) Scan parquet ecom_dwm.dwm_user_app_action_sum_all Output [3]: [dt#63, 
hour#64, appid#65] Batched: true Location: InMemoryFileIndex []
PartitionFilters: [isnotnull(dt#63), isnotnull(appid#65), (dt#63 = 20230412), 
Contains(appid#65, shopee)] ReadSchema: struct<>
 
 
!image-2023-04-18-10-59-30-199.png!
 
 


> The Spark SQL LIKE predicate is pushed down to the Parquet scan, but the
> data cannot be queried
> ----------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-43170
>                 URL: https://issues.apache.org/jira/browse/SPARK-43170
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.2.2
>            Reporter: todd
>            Priority: Blocker
>         Attachments: image-2023-04-18-10-59-30-199.png
>
>
> -- DDL
> CREATE TABLE `ecom_dwm`.`dwm_user_app_action_sum_all` (
>   `gaid` STRING COMMENT '',
>   `beyla_id` STRING COMMENT '',
>   `dt` STRING,
>   `hour` STRING,
>   `appid` STRING COMMENT 'package name')
> USING parquet
> PARTITIONED BY (dt, hour, appid)
> LOCATION 's3://xxxxx/dwm_user_app_action_sum_all'
> -- partitions info
> show partitions ecom_dwm.dwm_user_app_action_sum_all PARTITION 
> (dt='20230412');
>  
> dt=20230412/hour=23/appid=blibli.mobile.commerce
> dt=20230412/hour=23/appid=cn.shopee.app
> dt=20230412/hour=23/appid=cn.shopee.br
> dt=20230412/hour=23/appid=cn.shopee.id
> dt=20230412/hour=23/appid=cn.shopee.my
> dt=20230412/hour=23/appid=cn.shopee.ph
>  
> -- query
> select DISTINCT(appid) from ecom_dwm.dwm_user_app_action_sum_all
> where dt='20230412' and appid like '%shopee%'
>  
> -- result
> no data
>  
> -- other
> I can query the data with Spark 3.0.1 and with the Trino query engine.
>  
>  
> The physical scan node produced by Spark 3.2:
> (3) Scan parquet ecom_dwm.dwm_user_app_action_sum_all
> Output [3]: [dt#63, hour#64, appid#65]
> Batched: true
> Location: InMemoryFileIndex []
> PartitionFilters: [isnotnull(dt#63), isnotnull(appid#65), (dt#63 = 20230412), Contains(appid#65, shopee)]
> ReadSchema: struct<>
>  
>  
> !image-2023-04-18-10-59-30-199.png!
>  
> -- sql plan detail
> {code:java}
> == Physical Plan ==
> CollectLimit (9)
> +- InMemoryTableScan (1)
>       +- InMemoryRelation (2)
>             +- * HashAggregate (8)
>                +- Exchange (7)
>                   +- * HashAggregate (6)
>                      +- * Project (5)
>                         +- * ColumnarToRow (4)
>                            +- Scan parquet 
> ecom_dwm.dwm_user_app_action_sum_all (3)
> (1) InMemoryTableScan
> Output [1]: [appid#65]
> Arguments: [appid#65]
> (2) InMemoryRelation
> Arguments: [appid#65], 
> CachedRDDBuilder(org.apache.spark.sql.execution.columnar.DefaultCachedBatchSerializer@ab5af13,StorageLevel(disk,
>  memory, deserialized, 1 replicas),*(2) HashAggregate(keys=[appid#65], 
> functions=[], output=[appid#65])
> +- Exchange hashpartitioning(appid#65, 200), ENSURE_REQUIREMENTS, [plan_id=24]
>    +- *(1) HashAggregate(keys=[appid#65], functions=[], output=[appid#65])
>       +- *(1) Project [appid#65]
>          +- *(1) ColumnarToRow
>             +- FileScan parquet 
> ecom_dwm.dwm_user_app_action_sum_all[dt#63,hour#64,appid#65] Batched: true, 
> DataFilters: [], Format: Parquet, Location: InMemoryFileIndex(0 paths)[], 
> PartitionFilters: [isnotnull(dt#63), isnotnull(appid#65), (dt#63 = 20230412), 
> StartsWith(appid#65, com)], PushedFilters: [], ReadSchema: struct<>
> ,None)
> (3) Scan parquet ecom_dwm.dwm_user_app_action_sum_all
> Output [3]: [dt#63, hour#64, appid#65]
> Batched: true
> Location: InMemoryFileIndex []
> PartitionFilters: [isnotnull(dt#63), isnotnull(appid#65), (dt#63 = 20230412), 
> StartsWith(appid#65, com)]
> ReadSchema: struct<>
> (4) ColumnarToRow [codegen id : 1]
> Input [3]: [dt#63, hour#64, appid#65]
> (5) Project [codegen id : 1]
> Output [1]: [appid#65]
> Input [3]: [dt#63, hour#64, appid#65]
> (6) HashAggregate [codegen id : 1]
> Input [1]: [appid#65]
> Keys [1]: [appid#65]
> Functions: []
> Aggregate Attributes: []
> Results [1]: [appid#65]
> (7) Exchange
> Input [1]: [appid#65]
> Arguments: hashpartitioning(appid#65, 200), ENSURE_REQUIREMENTS, [plan_id=24]
> (8) HashAggregate [codegen id : 2]
> Input [1]: [appid#65]
> Keys [1]: [appid#65]
> Functions: []
> Aggregate Attributes: []
> Results [1]: [appid#65]
> (9) CollectLimit
> Input [1]: [appid#65]
> Arguments: 1 {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
