Hi all, 

We have a Hive table stored in S3 and registered in a Hive metastore. 
This table is partitioned with a key "day". 


So we access this table through the Spark DataFrame API as: 


sqlContext.read() 
.table("tablename) 
.where(col("day").between("2016-08-01","2016-08-02")) 


When the job is launched, we can see that Spark has "table" tasks whose reported 
duration is small (seconds) but which actually take minutes. 
In the logs we see that the paths of every partition are listed, regardless of 
the partition key values, and this listing goes on for minutes. 


16/08/03 13:17:16 INFO HadoopFsRelation: Listing s3a://buckets3/day=2016-07-24 
16/08/03 13:17:16 INFO HadoopFsRelation: Listing s3a://buckets3/day=2016-07-25 
.... 


Is this normal behaviour? Can we specify something in read().table(), 
maybe some options? 
I tried to find such options but I cannot find anything. 
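
For what it's worth, a workaround we are considering (not sure it is the right 
approach) is to skip the metastore table and read only the partition directories 
we need, using the "basePath" option so that the "day" column is still recovered 
from the directory names. A rough sketch, assuming the underlying files are 
Parquet directly under s3a://buckets3/ : 

DataFrame df = sqlContext.read() 
    .option("basePath", "s3a://buckets3/")        // keep "day" as a partition column 
    .parquet("s3a://buckets3/day=2016-08-01",     // list only the partitions we need 
             "s3a://buckets3/day=2016-08-02"); 

This avoids listing partitions outside the date range, but it bypasses the 
metastore schema, so I am not sure it is the recommended way. 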


Thanks, 
Mehdi 
