efficiently accessing partition data for datasets in S3 with SparkSQL

2015-07-05 Thread Steve Lindemann
I'm trying to use SparkSQL to efficiently query structured data from datasets in S3. The data is naturally partitioned by date, so I've laid it out in S3 as follows: s3://bucket/dataset/dt=2015-07-05/ s3://bucket/dataset/dt=2015-07-04/ s3://bucket/dataset/dt=2015-07-03/ etc. In each directory, da
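With a date-partitioned layout like the one above, the usual goal is partition pruning: only the `dt=` prefixes matching the query's date predicate should ever be read from S3. The idea can be sketched outside Spark in plain Python; the prefixes below follow the example layout, and the helper name is hypothetical, not a Spark or S3 API:

```python
from datetime import date

def prune_partitions(prefixes, start, end):
    """Keep only dt= partition prefixes whose date falls in [start, end]."""
    kept = []
    for p in prefixes:
        # Expect prefixes like 's3://bucket/dataset/dt=2015-07-05/'
        part = p.rstrip('/').rsplit('/', 1)[-1]      # e.g. 'dt=2015-07-05'
        key, _, value = part.partition('=')
        if key != 'dt':
            continue
        d = date(*map(int, value.split('-')))
        if start <= d <= end:
            kept.append(p)
    return kept

prefixes = [
    's3://bucket/dataset/dt=2015-07-05/',
    's3://bucket/dataset/dt=2015-07-04/',
    's3://bucket/dataset/dt=2015-07-03/',
]
# Only the partitions inside the requested date range survive the filter.
recent = prune_partitions(prefixes, date(2015, 7, 4), date(2015, 7, 5))
```

Spark's Parquet data source applies the same kind of filter on the discovered partition values when the predicate references the partition column, so a `WHERE dt >= '2015-07-04'` clause should avoid listing or reading the older directories.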

calling HiveContext.table or running a query reads files unnecessarily in S3

2015-07-04 Thread Steve Lindemann
Hi, I'm just getting started with Spark, so apologies if I'm missing something obvious. I'm using Spark 1.4 below. I've created a partitioned table in S3 (call it 'dataset'), with basic structure like so: s3://bucket/dataset/pk=a s3://bucket/dataset/pk=b s3://bucket/dataset/pk=c In
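The `pk=a`, `pk=b`, `pk=c` directories are what Spark's partition discovery turns into a `pk` column: each `key=value` path segment becomes a column value for the files underneath it. A minimal sketch of that inference in plain Python (an illustration of the idea, not Spark's actual implementation; the file name is made up):

```python
def partition_columns(path):
    """Extract key=value partition pairs from an S3-style path."""
    pairs = {}
    for seg in path.strip('/').split('/'):
        key, sep, value = seg.partition('=')
        if sep:                      # only segments of the form key=value
            pairs[key] = value
    return pairs

# Files under pk=a all get pk='a' without the column being stored in the data.
cols = partition_columns('s3://bucket/dataset/pk=a/part-00000.parquet')
```

Because the column lives in the path rather than in the files, a filter on `pk` can in principle be answered from directory names alone, which is why a table scan that touches every partition's files for such a query looks like a bug or a missing optimization.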