Hi,

 In order to do that, you can write code that lists the HDFS directory
first, then its sub-directories. Using custom logic, first identify the
latest year/month/version, then read the Avro files in that directory into
a DataFrame, and finally add the year, month, and version columns to that
DataFrame using withColumn.
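
Here is a minimal sketch of that approach, assuming Spark 2.x with a
SparkSession called `spark` and the spark-avro package on the classpath
(the variable names are just examples):

    import org.apache.hadoop.fs.{FileSystem, Path}
    import org.apache.spark.sql.functions.lit

    val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)

    // List every version directory and keep the highest version per
    // year/month directory.
    val latestVersionDirs = fs
      .globStatus(new Path("/data/year=*/month=*/version=*"))
      .map(_.getPath)
      .groupBy(_.getParent) // one group per .../year=Y/month=M
      .values
      .map(_.maxBy(_.getName.stripPrefix("version=").toInt))

    // Read each latest directory and re-attach the partition values
    // parsed from the path, using withColumn.
    val monthlyDfs = latestVersionDirs.map { dir =>
      val version = dir.getName.stripPrefix("version=").toInt
      val month = dir.getParent.getName.stripPrefix("month=").toInt
      val year = dir.getParent.getParent.getName.stripPrefix("year=").toInt
      spark.read.format("com.databricks.spark.avro").load(dir.toString)
        .withColumn("year", lit(year))
        .withColumn("month", lit(month))
        .withColumn("version", lit(version))
    }

    // Combine the per-month DataFrames into a single one.
    val df = monthlyDfs.reduce(_ union _)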

Regards,
R Banerjee

On Fri, Nov 18, 2016 at 2:41 PM, Samy Dindane <s...@dindane.com> wrote:

> Thank you, Daniel. Unfortunately, we don't use Hive, just bare (Avro) files.
>
>
> On 11/17/2016 08:47 PM, Daniel Haviv wrote:
>
>> Hi Samy,
>> If you're working with Hive, you could create a partitioned table and
>> update its partitions' locations to point at the latest version, so that
>> when you query it with Spark you always get the latest version.
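>>
>> For example, a rough sketch (assuming a Hive-backed table named
>> `my_table` partitioned by year and month, with the name just an example,
>> and a SparkSession with Hive support enabled):
>>
>>     // Point the month=10 partition at its latest version directory.
>>     spark.sql(
>>       "ALTER TABLE my_table PARTITION (year=2016, month=10) " +
>>       "SET LOCATION '/data/year=2016/month=10/version=3'")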
>>
>> Daniel
>>
>> On Thu, Nov 17, 2016 at 9:05 PM, Samy Dindane <s...@dindane.com> wrote:
>>
>>     Hi,
>>
>>     I have some data partitioned this way:
>>
>>     /data/year=2016/month=9/version=0
>>     /data/year=2016/month=10/version=0
>>     /data/year=2016/month=10/version=1
>>     /data/year=2016/month=10/version=2
>>     /data/year=2016/month=10/version=3
>>     /data/year=2016/month=11/version=0
>>     /data/year=2016/month=11/version=1
>>
>>     When using this data, I'd like to load the last version only of each
>> month.
>>
>>     A simple way to do this is to call
>>     `load("/data/year=2016/month=11/version=1")` instead of `load("/data")`.
>>     The drawback of this solution is the loss of partitioning information
>>     such as `year` and `month`, which means it would no longer be possible
>>     to apply operations based on the year or the month.
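>>
>>     For example (a sketch; from what I understand, Spark's `basePath`
>>     option for partition discovery might keep those columns, though it
>>     still wouldn't select the latest version automatically):
>>
>>       // Loading a leaf directory directly drops the year/month columns:
>>       val df1 = spark.read.format("com.databricks.spark.avro")
>>         .load("/data/year=2016/month=11/version=1")
>>
>>       // With basePath set, Spark derives year/month/version from the path:
>>       val df2 = spark.read.format("com.databricks.spark.avro")
>>         .option("basePath", "/data")
>>         .load("/data/year=2016/month=11/version=1")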
>>
>>     Is it possible to ask Spark to load the last version only of each
>> month? How would you go about this?
>>
>>     Thank you,
>>
>>     Samy
>>
>
