Here is the document:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-DateFunctions


2013/11/25 郭士伟 <guoshi...@gmail.com>

> Or what about using the date functions? For example, unix_timestamp
> converts a date string to a Unix timestamp, and from_unixtime converts a
> Unix timestamp back to a string.
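>
> A minimal sketch (untested; some_table and date_field are placeholder
> names, assuming input like '22/11/13'):
>
> -- 'dd/MM/yy' parses the two-digit-year input, 'yyyy-MM-dd' reformats it
> select from_unixtime(unix_timestamp(date_field, 'dd/MM/yy'),
>                      'yyyy-MM-dd') as proper_date
> from some_table;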
>
>
> 2013/11/25 John Omernik <j...@omernik.com>
>
>> I wouldn't worry about efficiency too much:
>>
>> concat('20', split(date_field, '\\/')[2], '-',
>>        split(date_field, '\\/')[1], '-',
>>        split(date_field, '\\/')[0]) as proper_date  -- YYYY-MM-DD
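>>
>> For example, with date_field = '22/11/13' the indexes [2], [1] and [0]
>> pick out yy, mm and dd, so the expression should return '2013-11-22'.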
>>
>>
>>
>>
>> On Sun, Nov 24, 2013 at 12:13 PM, Baahu <bahub...@gmail.com> wrote:
>>
>>> Hi John,
>>> Thanks for the reply. I have been given a new format of data and now
>>> the logs aren't as messy as they were earlier, but yes, your mail gave
>>> me pointers which helped me in handling the new data.
>>>
>>> Now I am stuck while handling a date format: I am getting dates in the
>>> form 22/11/13, which is dd/mm/yy, and I have to rearrange this to
>>> yyyy/mm/dd. Can you please shed some light on this? I think we need to
>>> use split() to get the tokens and then rearrange, but I am not able to
>>> think of an efficient way to do this.
>>>
>>> Thanks.
>>>
>>>
>>> On Sun, Nov 24, 2013 at 5:25 PM, John Omernik <j...@omernik.com> wrote:
>>>
>>>> Put the logfile into a location on HDFS, and create an external table
>>>> pointing to that location. The external table should just have one
>>>> column, a string:
>>>>
>>>> CREATE EXTERNAL TABLE logfile_etl (message STRING)
>>>> LOCATION '/etl/logfile';
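>>>>
>>>> As a quick sanity check after creating it (assuming the files are
>>>> plain text, one log line per row, so Hive's default text format
>>>> applies):
>>>>
>>>> SELECT message FROM logfile_etl LIMIT 5;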
>>>>
>>>> I think that should work.
>>>>
>>>> Then Create another table
>>>>
>>>> CREATE TABLE logfile (ts STRING, `add` STRING, files STRING)
>>>> PARTITIONED BY (day STRING) STORED AS ORC;
>>>>
>>>> (Backticks around `add`, since ADD is a Hive keyword.)
>>>>
>>>> Copy the files into /etl/logfile, then run this Hive script:
>>>>
>>>> SET hive.exec.dynamic.partition=true;
>>>>
>>>> SET hive.exec.dynamic.partition.mode=nonstrict;
>>>>
>>>> SET hive.exec.max.dynamic.partitions=10000;
>>>>
>>>> SET hive.exec.max.dynamic.partitions.pernode=1000;
>>>>
>>>> INSERT INTO TABLE logfile PARTITION (day)
>>>> select substring(message, 0, 17) as ts,
>>>>        regexp_extract(message, '\\[([^\\]]+)\\]', 1) as `add`,
>>>>        regexp_extract(message, '\\] \\[([^\\]]+)\\]', 1) as files,
>>>>        concat('20', substring(message, 0, 8)) as day
>>>> from logfile_etl;
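>>>>
>>>> To verify the load afterwards (a hypothetical check, assuming the
>>>> insert above ran cleanly on your data):
>>>>
>>>> SHOW PARTITIONS logfile;
>>>> SELECT ts, `add`, files FROM logfile WHERE day = '2013-11-23' LIMIT 10;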
>>>>
>>>> Then delete the files in /etl/logfile (or move them to an archive).
>>>>
>>>> That will get you a day-partitioned ORC table, with compression and
>>>> ORC goodness. (I added the '20' in front of your date so that the
>>>> string sorts well, although it probably would without it; it's early,
>>>> and I have not had coffee yet.) The regexes are a little messy, but
>>>> based on your one line of data they should work. Also: if you have
>>>> data from before 2000, the concat('20', ...) bit obviously needs to be
>>>> updated. Note that I didn't use a regex on the date. Why? It appears
>>>> to be properly padded data, so a substring is fast. There are many
>>>> ways to skin this cat, so your way may be totally different from mine,
>>>> but this is how I'd approach it long term. (If it's a one-time thing,
>>>> I might not create the managed table, but if not, having partitions
>>>> and ORC files will make things faster.) If there are syntax errors, I
>>>> apologize; see earlier disclaimer about lack of proper bean-sourced
>>>> stimulants.
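>>>>
>>>> If you want to test the regexes on their own first, something like
>>>> this should echo the bracketed field back (again untested, pre-coffee):
>>>>
>>>> SELECT regexp_extract(': 13-11-23 06:23:45 [ADD] [file1.zip|file2.zip]',
>>>>                       '\\[([^\\]]+)\\]', 1);   -- expect: ADD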
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Sat, Nov 23, 2013 at 7:36 AM, Baahu <bahub...@gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>> I have a messy log file which I want to use to create a table. I am
>>>>> only interested in retrieving 3 columns (time, ADD, files), which
>>>>> are in bold.
>>>>> Sample entry from log file:
>>>>> *: 13-11-23 06:23:45 [ADD] [file1.zip|file2.zip] *  junkjunk|2013-11-23
>>>>> 06:23:44:592 EST|file3.zip xyz|2013-11-23 06:23:44:592 EST|file3.zip
>>>>>
>>>>> Can you please let me know how I should go about this? Regex seems to
>>>>> be the way out, but I am struggling with that as well!!
>>>>>
>>>>> Thanks,
>>>>> Baahu
>>>>>
>>>>>
>>>>
>>>
>>>
>>> --
>>> Twitter: http://twitter.com/Baahu
>>>
>>>
>>
>
