Hi, I am using Spark SQL 2.3.3 to read a Hive table that is partitioned by day, hour, platform, request_status, and is_sampled. The underlying data is in Parquet format on HDFS. Here is a SQL query that reads just *one partition*:
```scala
spark.sql("""
  SELECT rtb_platform_id, SUM(e_cpm)
  FROM raw_logs.fact_request
  WHERE day = '2019-08-01'
    AND hour = '00'
    AND platform = 'US'
    AND request_status = '3'
    AND is_sampled = 1
  GROUP BY rtb_platform_id
""").show
```

However, the stage description in the Spark web UI shows:

```
Listing leaf files and directories for 201616 paths:
viewfs://root/user/bilogs/logs/fact_request/day=2018-08-01/hour=11/platform=AS/request_status=0/is_sampled=0, ...
```

It seems the job is listing all of the table's partitions, so it takes far too long for a single partition. One workaround is to use the `spark.read.parquet` API to read the Parquet files directly, since Spark is partition-aware for partitioned directories. Still, I would like to know: is there a way to get the same partition pruning through Hive when using the `spark.sql` API?

Any help is highly appreciated! Thank you.

-- Hao Ren
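For context, here is a minimal sketch of the workaround I mean, reading one partition's directory directly so Spark never has to list the others. The path is illustrative, assembled from the WHERE-clause values and the layout shown in the listing above; adjust it to your actual table location:

```scala
import org.apache.spark.sql.functions.sum

// Read a single partition directory directly from HDFS.
// Spark's partition discovery applies only to the path given,
// so no other partitions are listed.
val df = spark.read.parquet(
  "viewfs://root/user/bilogs/logs/fact_request/" +
    "day=2019-08-01/hour=00/platform=US/request_status=3/is_sampled=1")

// Same aggregation as the SQL query.
df.groupBy("rtb_platform_id")
  .agg(sum("e_cpm"))
  .show()
```

(If it is relevant: I am aware of `spark.sql.hive.metastorePartitionPruning`, which I believe defaults to true in 2.3, but it does not seem to prevent the full file listing in my case.)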