During the planning process, the PartitionPruner gets the list of all the
partitions from metadata and evaluates the full expression against the
partition definition's fields, including calling any UDFs, to determine
whether it needs to include each partition.  If you turn up the logging
level for "hive.ql.optimizer.ppr.PartitionPruner", you can see it walk
through each partition in turn to decide whether to include it.

On 10/15/09 5:24 PM, "Vijay" <[email protected]> wrote:

> Thanks Ashish! This is the same approach I'm using as well and it seems to be
> working very good
> 
> One thing I wasn't sure at first but was later surprised was how Hive is able
> to figure out which partitions to work on using WHERE clauses. What I mean by
> that is if I do something like WHERE month(ds)=9, it is able to figure out
> that it needs to just use the partitions 2009-09-01 to 2009-09-30. How does
> the query engine know this? Does it evaluate partition column related
> expressions locally?
> 
> Thanks,
> Vijay
> 
> On Tue, Oct 13, 2009 at 2:25 PM, Ashish Thusoo <[email protected]> wrote:
>> We store the the partitioning as
>>  
>> YYYY-MM-DD
>>  
>> in that format the string representation of the date has the same
>> lexicographical ordering as the date itself. So if you have that as the
>> format 
>> of the string in the ds column (hive does not have date functions yet), then
>> the expressions of the kind
>>  
>> ds >= '2009-08-15' and ds <= '2009-09-15'
>>  
>> will pick up the right partitions.
>>  
>> For doing counts over the month you can either extract the month from the
>> date 
>> string using the substring(ds, 5, 2) udf in hive or you can use month(ds) and
>> then put
>> that in the group by clause of the query.
>>  
>> Ashish
>> 
>> 
>> From: Vijay [mailto:[email protected]]
>> Sent: Monday, October 12, 2009 5:05 PM
>> To: [email protected]
>> Subject: Questions on date arithmetic/calculations
>> 
>> Hi,
>> 
>> I have some basic questions on how hive handles dates and date arithmetic. I
>> apologize if this has already been addressed. Per most samples on this site
>> and elsewhere, I can have an access log table defined with a partition scheme
>> that looks like this: ds='09-08-09'. This is obviously pretty good to
>> partition the data. However, how can this information be used later in
>> queries? For example, if I want to select data for all dates between 08/15/09
>> and 09/15/09, how would I do that? The partition column ds cannot be used
>> with 
>>> = and similar operators right? Additionally, when is partitioned this way,
>> how can I do counts on month, etc? Obviously all of these queries need to be
>> expressed in a way hive can still take advantage of the partitioning scheme.
>> I 
>> hope that makes sense.
>> 
>> Thanks,
>> Vijay
> 

Reply via email to