Hi folks, I have created a table in the following manner:
CREATE EXTERNAL TABLE IF NOT EXISTS rum_beacon_partition (
.... list of columns
)
COMMENT 'User Infomation'
PARTITIONED BY (account_id String,
product String,
group_id String,
year String,
month String,
day String)
STORED AS PARQUET
LOCATION '/stream2store/nt_tp_collation'
I then ran MSCK REPAIR TABLE to generate the partition information.
I think partitions got generated correctly -- here is a query and it's
output:
"show table extended like 'rum_beacon_partition'
partition(account_id='XXXX',product='rum',group_id='XXXX',year='2017',month='09',day='12')
location:ceph://ceph-mon.service.consul:6789/stream2store/nt_tp_collation
/account_id=XXXX/product=rum/group_id= XXXX/year=2017/month=09/day=12
However, it does appear that when I issue a SQL query, the predicates do
not correctly limit the files touched:
explain extended select uri from rum_beacon_partition where
account_id='XXXX' and product='rum' and group_id='XXXX' and year='2017' and
month='09' limit 2
Produces output that seems to indicate that every file is being touched
(unless I'm misreading the output). It also crashes my filesystem so I
suspect there is some truth to it.
Optimized logical plan looks fine I think:
== Optimized Logical Plan == |
| Limit 2 |
| Project [uri#16519] |
| Filter (((((account_id#16511 = XXXX) && (product#16512 = rum)) &&
(group_id#16513 = XXXX)) && (year#16514 = 2017)) && (month#16515 = 09)) |
But in the physical plan it seems that a ton of files are touched (both in
account and date partitions)
Scan
ParquetRelation[ceph://ceph-mon.service.consul:6789/stream2store/nt_tp_collation/account_id=XXXX/product=rum/group_id=XXXX/year=2017/month=05/day=16,ceph://ceph-mon.service.consul:6789/stream2store/nt_tp_collation/account_id=XXXX/product=rum/group_id=XXXX/year=2017/month=05/day=17,ceph://ceph-mon.service.consul:6789/stream2store/nt_tp_collation/account_id=XXXX/product=rum/group_id=XXXX/year=2017/month=05/day=18,ceph://ceph-mon.service.consul:6789/stream2store/nt_tp_collation/account_id=XXXX/product=rum/group_id=XXXX/year=2017/month=05/day=19,ceph://ceph-mon.service.consul:6789/stream2store/nt_tp_collation/account_id=XXXX/product=rum/group_id=XXXX/year=2017/month=05/day=20,ceph://ceph-mon.service.consul:6789/stream2store/nt_tp_collation/account_id=XXXX/product=rum/group_id=XXXX/year=2017/month=05/day=21,ceph://ceph-mon.service.consul:6789/stream2store/nt_tp_collation/account_id=XXXX/product=rum/group_id=XXXX/year=2017/month=05/day=22,ceph://ceph-mon.service.consul:6789/stream2store/nt_tp_collation/account_id=XXXX/product=rum/group_id=XXXX/year=2017/month=05/day=23,ceph://ceph-mon.service.consul:6789/stream2store/nt_tp_collation/account_id=XXXX/product=rum/group_id=XXXX/year=2017/month=05/day=24,ceph://ceph-mon.service.consul:6789/stream2store/nt_tp_collation/account_id=XXXX/product=rum/group_id=XXXX/year=2017/month=05/day=25,ceph://ceph-mon.service.consul:6789/stream2store/nt_tp_collation/account_id=XXXX/product=rum/group_id=XXXX/year=2017/month=05/day=26,ceph://ceph-mon.service.consul:6789/stream2store/nt_tp_collation/account_id=XXXX/product=rum/group_id=XXXX/year=2017/month=05/day=27,ceph://ceph-mon.service.consul:6789/stream2store/nt_tp_collation/account_id=XXXX/product=rum/group_id=XXXX/year=2017/month=05/day=28,ceph://ceph-mon.service.consul:6789/stream2store/nt_tp_collation/account_id=XXXX/product=rum/group_id=XXXX/year=2017/month=05/day=29,ceph://ceph-mon.service.consul:6789/stream2store/nt_tp_collation/account_id=XXXX/product=rum/group_id=XXXX/year=2017/month=05/day=30,ceph://ceph-mon.service.consul:6789/stream2store/nt_tp_collation/account_id=XXXX/product=rum/group_id=XXXX/year=2017/month=05/day=31,ceph://ceph-mon.service.consul:6789/stream2store/nt_tp_collation/account_id=XXXX/product=rum/group_id=XXXX/year=2017/month=06/day=01,ceph://ceph-mon.service.consul:6789/stream2store/nt_tp_collation/account_id=XXXX/product=rum/group_id=XXXX/year=2017/month=06/day=02
I am hoping someone can offer debugging tips / advice on what to look for
in the logs. I'm on a pretty old version of Spark (1.5.2) but this seems
like something that I'm doing wrong.