Hi folks, I have created a table in the following manner:

CREATE EXTERNAL TABLE IF NOT EXISTS rum_beacon_partition (
     .... list of columns
)
COMMENT 'User Information'
PARTITIONED BY (
    account_id String,
    product String,
    group_id String,
    year String,
    month String,
    day String
)
STORED AS PARQUET
LOCATION '/stream2store/nt_tp_collation'


I then ran MSCK REPAIR TABLE to generate the partition information.
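Concretely, this is what I ran, plus a SHOW PARTITIONS afterwards to sanity-check what the metastore picked up:

```sql
MSCK REPAIR TABLE rum_beacon_partition;

-- Sanity check: list the partitions the metastore now knows about
SHOW PARTITIONS rum_beacon_partition;
```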


I think the partitions got generated correctly -- here is a query and its
output:


show table extended like 'rum_beacon_partition'
partition(account_id='XXXX',product='rum',group_id='XXXX',year='2017',month='09',day='12')


 location:ceph://ceph-mon.service.consul:6789/stream2store/nt_tp_collation/account_id=XXXX/product=rum/group_id=XXXX/year=2017/month=09/day=12
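In case it helps to see how I read those location strings: each partition column becomes a key=value path segment under the table location. Here is a throwaway Python sketch (the helper is hypothetical, not part of Spark) just to illustrate the Hive-style layout that MSCK REPAIR TABLE discovers:

```python
# Hypothetical helper (not part of Spark): parse a Hive-style partition
# path into its partition-column values.
def parse_partition_path(path):
    parts = {}
    for segment in path.strip("/").split("/"):
        # Only key=value segments carry partition values; plain directory
        # names (the table location itself) are skipped.
        if "=" in segment:
            key, _, value = segment.partition("=")
            parts[key] = value
    return parts

location = ("/stream2store/nt_tp_collation"
            "/account_id=XXXX/product=rum/group_id=XXXX"
            "/year=2017/month=09/day=12")
print(parse_partition_path(location))
# {'account_id': 'XXXX', 'product': 'rum', 'group_id': 'XXXX',
#  'year': '2017', 'month': '09', 'day': '12'}
```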

However, it does appear that when I issue a SQL query, the predicates do
not correctly limit the files touched:

explain extended select uri from rum_beacon_partition where
account_id='XXXX' and product='rum' and group_id='XXXX' and year='2017' and
month='09' limit 2

Produces output that seems to indicate that every file is being touched
(unless I'm misreading the output). It also crashes my filesystem, so I
suspect there is some truth to it.

The optimized logical plan looks fine, I think:

== Optimized Logical Plan ==
Limit 2
 Project [uri#16519]
  Filter (((((account_id#16511 = XXXX) && (product#16512 = rum)) && (group_id#16513 = XXXX)) && (year#16514 = 2017)) && (month#16515 = 09))


But in the physical plan it seems that a ton of files are touched (across
both the account and date partitions):

Scan
ParquetRelation[ceph://ceph-mon.service.consul:6789/stream2store/nt_tp_collation/account_id=XXXX/product=rum/group_id=XXXX/year=2017/month=05/day=16,ceph://ceph-mon.service.consul:6789/stream2store/nt_tp_collation/account_id=XXXX/product=rum/group_id=XXXX/year=2017/month=05/day=17,ceph://ceph-mon.service.consul:6789/stream2store/nt_tp_collation/account_id=XXXX/product=rum/group_id=XXXX/year=2017/month=05/day=18,ceph://ceph-mon.service.consul:6789/stream2store/nt_tp_collation/account_id=XXXX/product=rum/group_id=XXXX/year=2017/month=05/day=19,ceph://ceph-mon.service.consul:6789/stream2store/nt_tp_collation/account_id=XXXX/product=rum/group_id=XXXX/year=2017/month=05/day=20,ceph://ceph-mon.service.consul:6789/stream2store/nt_tp_collation/account_id=XXXX/product=rum/group_id=XXXX/year=2017/month=05/day=21,ceph://ceph-mon.service.consul:6789/stream2store/nt_tp_collation/account_id=XXXX/product=rum/group_id=XXXX/year=2017/month=05/day=22,ceph://ceph-mon.service.consul:6789/stream2store/nt_tp_collation/account_id=XXXX/product=rum/group_id=XXXX/year=2017/month=05/day=23,ceph://ceph-mon.service.consul:6789/stream2store/nt_tp_collation/account_id=XXXX/product=rum/group_id=XXXX/year=2017/month=05/day=24,ceph://ceph-mon.service.consul:6789/stream2store/nt_tp_collation/account_id=XXXX/product=rum/group_id=XXXX/year=2017/month=05/day=25,ceph://ceph-mon.service.consul:6789/stream2store/nt_tp_collation/account_id=XXXX/product=rum/group_id=XXXX/year=2017/month=05/day=26,ceph://ceph-mon.service.consul:6789/stream2store/nt_tp_collation/account_id=XXXX/product=rum/group_id=XXXX/year=2017/month=05/day=27,ceph://ceph-mon.service.consul:6789/stream2store/nt_tp_collation/account_id=XXXX/product=rum/group_id=XXXX/year=2017/month=05/day=28,ceph://ceph-mon.service.consul:6789/stream2store/nt_tp_collation/account_id=XXXX/product=rum/group_id=XXXX/year=2017/month=05/day=29,ceph://ceph-mon.service.consul:6789/stream2store/nt_tp_collation/account_id=XXXX/product=rum/group_id=XXXX/year=2017/month
=05/day=30,ceph://ceph-mon.service.consul:6789/stream2store/nt_tp_collation/account_id=XXXX/product=rum/group_id=XXXX/year=2017/month=05/day=31,ceph://ceph-mon.service.consul:6789/stream2store/nt_tp_collation/account_id=XXXX/product=rum/group_id=XXXX/year=2017/month=06/day=01,ceph://ceph-mon.service.consul:6789/stream2store/nt_tp_collation/account_id=XXXX/product=rum/group_id=XXXX/year=2017/month=06/day=02
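For what it's worth, here is roughly how I'm inspecting the settings that I suspect are relevant (I'm assuming these config keys exist in 1.5 -- please correct me if they don't):

```sql
-- Whether Hive Parquet tables are converted to Spark's native
-- ParquetRelation (which may list partitions differently than
-- the Hive reader)
SET spark.sql.hive.convertMetastoreParquet;

-- Whether partition predicates are pushed down to the metastore
SET spark.sql.hive.metastorePartitionPruning;
```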


I am hoping someone can offer debugging tips / advice on what to look for
in the logs. I'm on a pretty old version of Spark (1.5.2), but this seems
like something that I'm doing wrong.
