On Wed, Oct 6, 2010 at 8:05 PM, Steven Wong <sw...@netflix.com> wrote:
> What Hive version are you running? Try an “explain extended” on your insert
> query and see if unneeded partitions are included.
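>
> For example (a sketch that simply reuses the insert query quoted later in
> this thread), you could run:
>
> EXPLAIN EXTENDED
> INSERT OVERWRITE TABLE daily_fact
> PARTITION (dt='2010-09-29')
> SELECT 20100929 AS stamp, tagtype, country, sum(num_requests) AS num_requests
> FROM hourly_fact HF
> WHERE (HF.dt = '2010-09-29' AND HF.hr > '07' )
>    OR (HF.dt = '2010-09-30' AND HF.hr <= '07' )
> GROUP BY 20100929, tagtype, country;
>
> and check which partition paths are listed as inputs in the printed plan.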
>
>
>
> Pacific Standard Time (PST) is UTC-08:00, while Pacific Daylight Time (PDT)
> is UTC-07:00. To cover a full PDT day with UTC-partitioned data, the
> condition should be:
>
> (HF.dt = '2010-09-29' AND HF.hr >= '07' ) OR (HF.dt = '2010-09-30' AND HF.hr
> < '07' )
>
> instead of:
>
> (HF.dt = '2010-09-29' AND HF.hr > '07' ) OR (HF.dt = '2010-09-30' AND HF.hr
> <= '07' )
>
>
>
> Good luck on the days we spring forward or fall back. :-)
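>
> To illustrate the caveat: US daylight saving time ends on 2010-11-07, so that
> local day spans 25 hours of UTC partitions and a fixed 7-hour offset no
> longer holds for the whole day; the bounds would presumably become something
> like:
>
> (HF.dt = '2010-11-07' AND HF.hr >= '07' ) OR (HF.dt = '2010-11-08' AND HF.hr
> < '08' )
>
> since the end of that local day falls under the 8-hour PST offset.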
>
>
>
>
>
> From: Marc Limotte [mailto:mslimo...@gmail.com]
> Sent: Wednesday, October 06, 2010 11:12 AM
> To: hive-user@hadoop.apache.org
> Subject: Re: RE: hive query doesn't seem to limit itself to partitions based
> on the WHERE clause
>
>
>
> Thanks for the response, Edward.
>
> The source table (hourly_fact) is partitioned on dt (date) and hr (hour),
> and I've confirmed that they are both String fields (CREATE stmt is below).
>
> The hourly_fact table contains 'number of requests' for each hour by a few
> dimensions.  The query is just trying to get a daily aggregation across
> those same dimensions.  The only trick is that the hourly_fact table has dt
> and hour in UTC time.  And the daily aggregation is being done for a PST
> (pacific std) day, hence the 7 hour offset.
>
> CREATE TABLE IF NOT EXISTS hourly_fact (
>   tagtype               STRING,
>   country               STRING,
>   company               INT,
>   request_keyword       STRING,
>   receiver_code         STRING,
>   referrer_domain       STRING,
>   num_requests          INT,
>   num_new_user_requests INT
> )
> PARTITIONED BY (dt STRING, hr STRING)
> ROW FORMAT DELIMITED
> STORED AS SEQUENCEFILE
> LOCATION "...";
>
> Marc
>
> On Tue, Oct 5, 2010 at 4:30 PM, Edward Capriolo <edlinuxg...@gmail.com>
> wrote:
>
> On Tue, Oct 5, 2010 at 3:36 PM, Marc Limotte <mslimo...@gmail.com> wrote:
>> Hi Namit,
>>
>> Hourly_fact is partitioned on dt and hr.
>>
>> Marc
>>
>> On Oct 3, 2010 10:00 PM, "Namit Jain" <nj...@facebook.com> wrote:
>>> What is your table hourly_fact partitioned on ?
>>>
>>> ________________________________________
>>> From: Marc Limotte [mslimo...@gmail.com]
>>> Sent: Friday, October 01, 2010 2:10 PM
>>> To: hive-user@hadoop.apache.org
>>> Subject: hive query doesn't seem to limit itself to partitions based on
>>> the WHERE clause
>>>
>>> Hi,
>>>
>>> From looking at the Hive log output, it seems that my job is accessing
>>> many more partitions than it needs to. For example, my query is something
>>> like:
>>>
>>> INSERT OVERWRITE TABLE daily_fact
>>> PARTITION (dt='2010-09-29')
>>> SELECT
>>> 20100929 as stamp,
>>> tagtype,
>>> country,
>>> sum(num_requests) AS num_requests
>>> FROM
>>> hourly_fact HF
>>> WHERE
>>> (HF.dt = '2010-09-29' AND HF.hr > '07' )
>>> OR (HF.dt = '2010-09-30' AND HF.hr <= '07' )
>>> GROUP BY
>>> 20100929, tagtype, country
>>>
>>> Based on the WHERE clause, I would expect it to look only at partitions
>>> in
>>> the date range 2010-09-29 08:00:00 through 2010-09-30 07:00:00. But, the
>>> log
>>> contains entries like:
>>>
>>> 10/10/01 19:13:09 INFO exec.ExecDriver: Adding input file
>>> hdfs://ny-prod-hc01:9000/home/hadoop/ala/out/hourly/dt=2010-08-15/hr=10
>>>
>>> And many other hours outside my WHERE constraint. I assume this means
>>> that
>>> it's processing those directories. The answer still comes out right, but
>>> I'm
>>> concerned about the performance.
>>>
>>> Would appreciate some help understanding what this means and how to fix
>>> it.
>>>
>>> Thanks,
>>> Marc
>>>
>>>
>>
>
> Possibly you defined HF.hr as an int column, and comparing it against the
> string '07' is resulting in a full table scan. Can you run EXPLAIN on the
> query?
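>
> A quick way to check the declared type of hr (just a sketch):
>
> DESCRIBE hourly_fact;
>
> The partition columns dt and hr are listed along with the regular columns,
> together with their types.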
>
>
Since you defined '07' as a string, you are getting a lexicographic
comparison rather than a numeric one. That is why you are seeing more
partitions than you expect. = will behave the same, but < and > will not. You
can try casting in the query, or drop and re-create the partitions using a
numeric type.
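
A minimal sketch of the cast mentioned above, reusing the names from the
thread (whether the partition pruner can take advantage of a casted predicate
depends on your Hive version, so verify with EXPLAIN EXTENDED):

SELECT 20100929 AS stamp, tagtype, country, sum(num_requests) AS num_requests
FROM hourly_fact HF
WHERE (HF.dt = '2010-09-29' AND cast(HF.hr AS INT) > 7)
   OR (HF.dt = '2010-09-30' AND cast(HF.hr AS INT) <= 7)
GROUP BY 20100929, tagtype, country;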
