[ 
https://issues.apache.org/jira/browse/HIVE-15876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Attardo updated HIVE-15876:
---------------------------------
    Description: 
I'm not entirely sure the solution implied by the subject is the best one, but 
it's a useful shorthand for addressing the problem I'm seeing.

I have a use case where I want to check whether a given unix timestamp (in 
milliseconds, matching the *1000 conversion below) falls within a date range 
expressed as a start_date and an end_date in 'yyyy-MM-dd' format. Without any 
UDFs, it would look something like this:

...
WHERE
  -- midnight at the start of start_date, converted from seconds to milliseconds
  time >= unix_timestamp('${start_date}','yyyy-MM-dd')*1000
  -- midnight at the start of the day AFTER end_date, i.e. an exclusive upper bound
  AND time < unix_timestamp(date_add('${end_date}', 1),'yyyy-MM-dd')*1000
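(For example, with start_date = '2017-02-01' and end_date = '2017-02-07' in a 
UTC session, the two boundary expressions fold to the constants 1485907200000 
and 1486512000000, so the predicate reduces to two comparisons against 
literals.)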

This condition is obviously quite a pain to read and write, and it is easy to 
get wrong each time it has to be reproduced. I would like to simplify it to 
something like:

...
WHERE time_in_range(time, '${start_date}', '${end_date}')

I was able to write a UDF for the above relatively easily, but when I tested 
it, it performed notably worse than the first example. The reason quickly 
became clear: although both versions are deterministic, that matters little in 
the UDF example, because the 'time' argument is not constant. The query 
optimizer can do nothing with the call as a whole, and must evaluate the full 
function on every row. In the first example, by contrast, the optimizer can 
see that the unix_timestamp and date_add sub-expressions are deterministic and 
have constant inputs, and so evaluates them only once for the whole query.
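
To make that concrete, here is a minimal sketch of such a time_in_range UDF, 
written as a Hive GenericUDF in Java (the usual implementation language for 
Hive UDFs, though any JVM language works). The class name and helper are 
hypothetical and it assumes millisecond timestamps; it is an illustration, not 
code from this ticket:

import java.text.ParseException;
import java.text.SimpleDateFormat;

import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.UDFType;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDF;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;

@UDFType(deterministic = true)
public class GenericUDFTimeInRange extends GenericUDF {

  @Override
  public ObjectInspector initialize(ObjectInspector[] args) throws UDFArgumentException {
    // A real implementation would validate arity and argument types here.
    return PrimitiveObjectInspectorFactory.javaBooleanObjectInspector;
  }

  @Override
  public Object evaluate(DeferredObject[] args) throws HiveException {
    Object t = args[0].get();
    if (t == null) {
      return null;
    }
    long time = Long.parseLong(t.toString());
    // The per-row cost described above: both constant date strings are
    // re-parsed on every single row.
    long lo = parseDay(args[1].get().toString());
    long hi = parseDay(args[2].get().toString()) + 86400000L; // start of the day after end_date
    return time >= lo && time < hi;
  }

  static long parseDay(String day) throws HiveException {
    try {
      return new SimpleDateFormat("yyyy-MM-dd").parse(day).getTime();
    } catch (ParseException e) {
      throw new HiveException("cannot parse date: " + day, e);
    }
  }

  @Override
  public String getDisplayString(String[] children) {
    return "time_in_range(" + String.join(", ", children) + ")";
  }
}

Even though the class is marked deterministic, the planner only constant-folds 
a call whose arguments are all constant, so evaluate(), and with it the date 
parsing, runs once per row.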

I would like a way to define a single composed function that preserves the 
optimizer's ability to see which of its constituent parts need to be evaluated 
only once. Whether that is some syntax in the CREATE FUNCTION DDL or an 
annotation within the Scala UDF code, either would be incredibly useful.
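
For what it's worth, a GenericUDF can already approximate this by hand: when 
an argument is a query-level constant, Hive passes it to initialize() as a 
ConstantObjectInspector, so the constant work can be hoisted out of the 
per-row path. Below is a sketch of that variant, as a drop-in change to the 
hypothetical class above (it additionally needs an import of 
org.apache.hadoop.hive.serde2.objectinspector.ConstantObjectInspector):

private Long cachedLo; // precomputed bounds, set only when both dates are constant
private Long cachedHi;

@Override
public ObjectInspector initialize(ObjectInspector[] args) throws UDFArgumentException {
  if (args[1] instanceof ConstantObjectInspector && args[2] instanceof ConstantObjectInspector) {
    try {
      String start = ((ConstantObjectInspector) args[1]).getWritableConstantValue().toString();
      String end = ((ConstantObjectInspector) args[2]).getWritableConstantValue().toString();
      cachedLo = parseDay(start);           // parsed once per query ...
      cachedHi = parseDay(end) + 86400000L; // ... instead of once per row
    } catch (HiveException e) {
      throw new UDFArgumentException(e.getMessage());
    }
  }
  return PrimitiveObjectInspectorFactory.javaBooleanObjectInspector;
}

// evaluate() then uses cachedLo/cachedHi when they are non-null, and falls
// back to per-row parsing otherwise.

This recovers once-per-query evaluation manually, but every UDF author has to 
rediscover and re-implement the pattern; a composed-function facility would 
let the optimizer do it automatically.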

> Add the ability to define composed functions
> --------------------------------------------
>
>                 Key: HIVE-15876
>                 URL: https://issues.apache.org/jira/browse/HIVE-15876
>             Project: Hive
>          Issue Type: Wish
>            Reporter: Peter Attardo
>            Priority: Minor
>


