here is the document https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-DateFunctions
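For example, something along these lines should handle the dd/mm/yy case
(untested sketch; 'date_field' and 'mytable' are just placeholder names for
your string column and table):

SELECT from_unixtime(unix_timestamp(date_field, 'dd/MM/yy'), 'yyyy-MM-dd') AS proper_date
FROM mytable;

unix_timestamp(string, pattern) parses the string using a SimpleDateFormat
pattern, and from_unixtime() formats the resulting timestamp however you like.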
2013/11/25 郭士伟 <guoshi...@gmail.com>

> Or use the Date functions? Such as unix_timestamp to convert a date
> string to a unix timestamp, and from_unixtime to convert a unix timestamp
> back to a string.
>
>
> 2013/11/25 John Omernik <j...@omernik.com>
>
>> I wouldn't worry about efficiency too much:
>>
>> concat('20', split(date_field, '\\/')[2], '-', split(date_field,
>> '\\/')[1], '-', split(date_field, '\\/')[0]) as proper_date  -- YYYY-MM-DD
>>
>>
>> On Sun, Nov 24, 2013 at 12:13 PM, Baahu <bahub...@gmail.com> wrote:
>>
>>> Hi John,
>>> Thanks for the reply. I have been given a new format of data and now the
>>> logs aren't as messy as they were earlier, but your mail gave me pointers
>>> which helped me in handling the new data.
>>>
>>> Now I am stuck while handling a date format. I am getting dates in the
>>> form 22/11/13, which is dd/mm/yy, and I have to rearrange this to
>>> yyyy/mm/dd. Can you please shed some light on this? I think we need to
>>> use split() to get the tokens and then rearrange them, but I am not able
>>> to think of an efficient way to do this.
>>>
>>> Thanks.
>>>
>>>
>>> On Sun, Nov 24, 2013 at 5:25 PM, John Omernik <j...@omernik.com> wrote:
>>>
>>>> Put the logfile into a location on HDFS, and create an external table
>>>> pointing to that location. The external table should just have one
>>>> column, a string:
>>>>
>>>> CREATE EXTERNAL TABLE logfile_etl (message STRING) LOCATION '/etl/logfile';
>>>>
>>>> I think that should work.
>>>>
>>>> Then create another table:
>>>>
>>>> CREATE TABLE logfile (ts STRING, `ADD` STRING, files STRING)
>>>> PARTITIONED BY (day STRING) STORED AS ORC;
>>>>
>>>> Copy the files into /etl/logfile and run this hive file:
>>>>
>>>> SET hive.exec.dynamic.partition=true;
>>>> SET hive.exec.dynamic.partition.mode=nonstrict;
>>>> SET hive.exec.max.dynamic.partitions=10000;
>>>> SET hive.exec.max.dynamic.partitions.pernode=1000;
>>>>
>>>> INSERT INTO TABLE logfile PARTITION (day)
>>>> SELECT substring(message, 0, 17) as ts,
>>>>        regexp_extract(message, '\\[([^\\]]+)\\]', 1) as `ADD`,
>>>>        regexp_extract(message, '\\] \\[([^\\]]+)\\]', 1) as files,
>>>>        concat('20', substring(message, 0, 8)) as day
>>>> FROM logfile_etl;
>>>>
>>>> Then delete the files in /etl/logfile (or move them to an archive).
>>>>
>>>> That will get you a day-partitioned ORC table, with compression and ORC
>>>> goodness. (I added the '20' in front of your date so that the string
>>>> sorts well, although it probably would without it; it's early and I have
>>>> not had coffee yet.) The regexes are a little messy, but based on your
>>>> one line of data they should work. Also: if you have data from before
>>>> 2000, the concat('20', ...) obviously needs to be updated. Note that I
>>>> didn't use a regex on the date... why? It appears to be properly padded
>>>> data, so a substring is fast. This type of stuff has so many ways to
>>>> skin a cat, so your way may be totally different from mine, but this is
>>>> how I'd approach it long term. (If it's a one-time thing, I might not
>>>> create the managed table, but if you do, having partitions and ORC files
>>>> will make things faster.) If there are syntax errors I apologize, see
>>>> earlier disclaimer about the lack of proper bean-sourced stimulants.
>>>>
>>>>
>>>> On Sat, Nov 23, 2013 at 7:36 AM, Baahu <bahub...@gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>> I have a messy log file which I want to use to create a table. I am
>>>>> only interested in retrieving 3 columns (time, ADD, files), which are
>>>>> in bold.
>>>>>
>>>>> Sample entry from log file:
>>>>> *: 13-11-23 06:23:45 [ADD] [file1.zip|file2.zip]* junkjunk|2013-11-23
>>>>> 06:23:44:592 EST|file3.zip xyz|2013-11-23 06:23:44:592 EST|file3.zip
>>>>>
>>>>> Can you please let me know how I should go about this? Regex seems to
>>>>> be the way out, but I am struggling with that as well!!
>>>>>
>>>>> Thanks,
>>>>> Baahu
>>>
>>>
>>> --
>>> Twitter: http://twitter.com/Baahu