Hi Jan,

I have date in different format also, so that is the reason I was thinking
to do by this approach. How can I make sure this will work on the selected
partition only and it will not scan the entire table. I will add your
suggestion in my UDF as deterministic thing.

My simple question here is- How to get the Yesterdays date which I can use
on the Date Partition I cannot use hiveconf here as I am working with Hive
0.6




*Raihan Jamal*



On Tue, Aug 7, 2012 at 10:37 AM, Jan Dolinár <dolik....@gmail.com> wrote:

> I'm afraid that  he query
>
> SELECT * FROM REALTIME where dt= yesterdaydate('yyyyMMdd') LIMIT 10;
>
> will scan entire table, because the functions is evaluated at runtime, so
> Hive doesn't know what the value is when it decides which files to scan. I
> am not 100% sure though, you should try it.
>
> Also, you might want to try to add annotation to your UDF saying that the
> function is deterministic:
> @*UDFType(deterministic=false*)
>
> I think Hive might be clever enough to evaluate it early enough to use the
> partition pruning correctly, since it operates on constant expression. But
> again, I'm not really sure, maybe someone with deeper knowledge of Hive
> optimizations will tell us more. It is actually quite interesting question.
>
> Another way to help Hive with the optimizations might be to skip passing
> the format string argument, if you have all dates in same format, you can
> call the function just like 'yesterdaydate()' and hardcode the format in
> the function.
>
> Jan
>
>
> On Tue, Aug 7, 2012 at 7:20 PM, Raihan Jamal <jamalrai...@gmail.com>wrote:
>
>> Hi Jan,
>>
>>
>> I figured that out, it is working fine for me now. The only question I
>> have is, if I am doing like this-
>>
>>
>>
>> SELECT * FROM REALTIME where dt= yesterdaydate('yyyyMMdd') LIMIT 10;
>>
>>
>>
>> Then the above query will be evaluated as below right?
>>
>>
>>
>> SELECT * FROM REALTIME where dt= ‘20120806’ LIMIT 10;
>>
>>
>>
>> So that means it will look for data in the corresponding dt partition 
>> *(20120806)
>> *only right as above table is partitioned on dt column ? And it will not
>> scan the whole table right?**
>>
>>
>>
>> *Raihan Jamal*
>>
>>
>>
>> On Mon, Aug 6, 2012 at 10:56 PM, Jan Dolinár <dolik....@gmail.com> wrote:
>>
>>> Hi Jamal,
>>>
>>> Check if the function really returns what it should and that your data
>>> are really in yyyyMMdd format. You can do this by simple query like this:
>>>
>>> SELECT dt, yesterdaydate('yyyyMMdd') FROM REALTIME LIMIT 1;
>>>
>>> I don't see anything wrong with the function itself, it works well for
>>> me (although I tested it in hive 0.7.1). The only thing I would change
>>> about it would be to optimize it by calling 'new' only at the time of
>>> construction and reusing the object when the function is called, but that
>>> should not affect the functionality at all.
>>>
>>> Best regards,
>>> Jan
>>>
>>>
>>>
>>>
>>> On Tue, Aug 7, 2012 at 3:39 AM, Raihan Jamal <jamalrai...@gmail.com>wrote:
>>>
>>>> *Problem*
>>>>
>>>> I created the below UserDefinedFunction to get the yesterday's day in
>>>> the format I wanted as I will be passing the format into this below method
>>>> from the query.
>>>>
>>>>
>>>>
>>>> *public final class YesterdayDate extends UDF {*
>>>>
>>>> * *
>>>>
>>>> *                public String evaluate(final String format) { *
>>>>
>>>> *                                DateFormat dateFormat = new
>>>> SimpleDateFormat(format); *
>>>>
>>>> *                                Calendar cal = Calendar.getInstance();
>>>> *
>>>>
>>>> *                                cal.add(Calendar.DATE, -1);     *
>>>>
>>>> *                                return
>>>> dateFormat.format(cal.getTime()).toString(); *
>>>>
>>>> *                } *
>>>>
>>>> *}*
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> So whenever I try to run the query like below by adding the jar to
>>>> classpath and creating the temporary function yesterdaydate, I always get
>>>> zero result back-
>>>>
>>>>
>>>>
>>>> hive> create temporary function *yesterdaydate* as
>>>> 'com.example.hive.udf.YesterdayDate';
>>>>
>>>> OK
>>>>
>>>> Time taken: 0.512 seconds
>>>>
>>>>
>>>>
>>>> Below is the query I am running-
>>>>
>>>>
>>>>
>>>> *hive> SELECT * FROM REALTIME where dt= yesterdaydate('yyyyMMdd')
>>>> LIMIT 10;*
>>>>
>>>> *OK*
>>>>
>>>> * *
>>>>
>>>> And I always get zero result back but the data is there in that table
>>>> for Aug 5th.**
>>>>
>>>>
>>>>
>>>> What wrong I am doing? Any suggestions will be appreciated.
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> NOTE:- As I am working with Hive 0.6 so it doesn’t support variable
>>>> substitution thing, so I cannot use hiveconf here and the above table has
>>>> been partitioned on dt(date) column.**
>>>>
>>>
>>>
>>
>

Reply via email to