Spark has partition discovery if your data is laid out in a
parquet-friendly directory structure:
http://spark.apache.org/docs/latest/sql-programming-guide.html#partition-discovery
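
For example (a minimal sketch, where the path and the "year" partition column are just placeholders), laying the files out as key=value directories lets Spark pick up the partition column automatically:

# hypothetical layout:
#   /my/data/parquetTable/year=2016/part-00000.parquet
#   /my/data/parquetTable/year=2017/part-00000.parquet
df = sqlContext.read.parquet("/my/data/parquetTable")
df.printSchema()                     # includes the discovered "year" column
df.filter(df.year == 2017).count()  # partition pruning: only the year=2017 directory is scanned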

You can also use wildcards to get subdirectories (I'm using Spark 1.6 here):

data2 = sqlContext.read.load("/my/data/parquetTable/*", "parquet")  # gets all subdirectories
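
If the files sit more than one level down, the glob can be stacked (an untested sketch on the same made-up path):

data3 = sqlContext.read.load("/my/data/parquetTable/*/*", "parquet")  # two levels of subdirectories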

Another
option would be to CREATE a Hive table on top of your data that uses
PARTITIONED BY to identify the subdirectories, and then use Spark SQL to
query that Hive table.  There might be a cleaner way to do this in Spark
2.0+ but that is a common pattern for me in Spark 1.6 when I know the
directory structure but don't have "=" signs in the paths.
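
A rough sketch of that pattern (the table name, columns, and partition values below are made up, and sqlContext needs to be a Hive-enabled HiveContext):

sqlContext.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS my_table (id BIGINT, value STRING)
    PARTITIONED BY (dt STRING)
    STORED AS PARQUET
    LOCATION '/my/data/parquetTable'
""")

# the subdirectory names have no "=" signs, so each one is registered explicitly
sqlContext.sql("ALTER TABLE my_table ADD IF NOT EXISTS PARTITION (dt='2017-02-01') "
               "LOCATION '/my/data/parquetTable/2017-02-01'")

sqlContext.sql("SELECT dt, COUNT(*) AS cnt FROM my_table GROUP BY dt").show()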

Jon Gregg

On Fri, Feb 17, 2017 at 7:02 PM, 颜发才(Yan Facai) <facai....@gmail.com> wrote:

> Hi, Abdelfatah,
> How do you read these files? spark.read.parquet or spark.sql?
> Could you show some code?
>
>
> On Wed, Feb 15, 2017 at 8:47 PM, Ahmed Kamal Abdelfatah <
> ahmed.abdelfa...@careem.com> wrote:
>
>> Hi folks,
>>
>>
>>
>> How can I force Spark SQL to recursively get data stored in parquet
>> format from subdirectories? In Hive, I could achieve this by setting a few
>> Hive configs.
>>
>>
>>
>> set hive.input.dir.recursive=true;
>>
>> set hive.mapred.supports.subdirectories=true;
>>
>> set hive.supports.subdirectories=true;
>>
>> set mapred.input.dir.recursive=true;
>>
>>
>>
>> I tried to set these configs through Spark SQL queries, but I get 0
>> records every time, whereas Hive gives me the expected results. I also
>> put these configs in the hive-site.xml file, but nothing changed. How can
>> I handle this issue?
>>
>>
>>
>> Spark Version : 2.1.0
>>
>> I used Hive 2.1.1  on emr-5.3.1
>>
>>
>>
>> *Regards, *
>>
>>
>>
>>
>> *Ahmed Kamal*
>> *MTS in Data Science*
>>
>> *Email: ahmed.abdelfa...@careem.com*
>>
>>
>>
>>
>>
>
>
