[jira] [Created] (HIVE-9630) partition filter using date_sub result produces non-optimal plans

Robert Miller (JIRA) Mon, 09 Feb 2015 16:28:06 -0800

Robert Miller created HIVE-9630:
-----------------------------------

             Summary: partition filter using date_sub result produces 
non-optimal plans
                 Key: HIVE-9630
                 URL: https://issues.apache.org/jira/browse/HIVE-9630
             Project: Hive
          Issue Type: Bug
          Components: CLI
    Affects Versions: 0.14.0
         Environment: CentOS 6.5, HDP 2.2
            Reporter: Robert Miller



We queries which execute many times faster when comparing a date partition 
value to a date string than when comparing to the results of a date_sub 
(specifically: date_sub(to_date(from_unixtime(unix_timestamp())), 7) -- Looking 
for seven days earlier).

For example:  The following query requires a MR job to be spun-up and requires 
1917 seconds to tell me there are roughly 428 MM rows:

select count(*) from event_histories eh where eh.adjusted_activity_date_utc > 
date_sub(to_date(from_unixtime(unix_timestamp())), 7);

Replacing the "date_sub(to_date(from_unixtime(unix_timestamp())), 7)" with the 
literal value '2015-02-02' (the value for today) results in no MR job and the 
result of 428 MM is returned in 0.075 seconds.

There are queries which are using the same date_sub structure which appear to 
perform a full scan of the table

I expect the results of the date_sub function are not classed equivalent to the 
literal even though there is nothing in the date_sub function which will change 
during the course of the query.

Explaining the two queries:

explain select count(*) from event_histories eh where 
eh.adjusted_activity_date_utc > 
date_sub(to_date(from_unixtime(unix_timestamp())), 7);
OK
Explain
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: eh
            Statistics: Num rows: 1993093006 Data size: 901863505920 Basic 
stats: PARTIAL Column stats: NONE
            Filter Operator
              predicate: (adjusted_activity_date_utc > 
date_sub(to_date(from_unixtime(unix_timestamp())), 7)) (type: boolean)
              Statistics: Num rows: 664364335 Data size: 300621168489 Basic 
stats: COMPLETE Column stats: NONE
              Select Operator
                Statistics: Num rows: 664364335 Data size: 300621168489 Basic 
stats: COMPLETE Column stats: NONE
                Group By Operator
                  aggregations: count()
                  mode: hash
                  outputColumnNames: _col0
                  Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE 
Column stats: NONE
                  Reduce Output Operator
                    sort order:
                    Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE 
Column stats: NONE
                    value expressions: _col0 (type: bigint)
      Reduce Operator Tree:
        Group By Operator
          aggregations: count(VALUE._col0)
          mode: mergepartial
          outputColumnNames: _col0
          Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column 
stats: NONE
          Select Operator
            expressions: _col0 (type: bigint)
            outputColumnNames: _col0
            Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column 
stats: NONE
            File Output Operator
              compressed: false
              Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column 
stats: NONE
              table:
                  input format: org.apache.hadoop.mapred.TextInputFormat
                  output format: 
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                  serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink

Time taken: 1.853 seconds, Fetched: 49 row(s)


explain select count(*) from event_histories eh where 
eh.adjusted_activity_date_utc > '2015-02-02';
OK
Explain
STAGE DEPENDENCIES:
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-0
    Fetch Operator
      limit: 1
      Processor Tree:
        ListSink

Time taken: 0.06 seconds, Fetched: 10 row(s)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Created] (HIVE-9630) partition filter using date_sub result produces non-optimal plans

Reply via email to