Re: Table creation for logfile data
Put the logfile into a location on HDFS, and create an external table pointing to that location. The external table should just have one column, a string:

CREATE EXTERNAL TABLE logfile_etl (message STRING) LOCATION '/etl/logfile';

I think that should work. Then create another table:

CREATE TABLE logfile (add STRING, ts STRING, files STRING) PARTITIONED BY (day STRING) STORED AS ORC;

Copy your files into /etl/logfile and run this Hive file:

SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.exec.max.dynamic.partitions=1000;
SET hive.exec.max.dynamic.partitions.pernode=1000;

INSERT INTO TABLE logfile PARTITION (day)
SELECT substring(message, 0, 17) as ts,
  regexp_extract(message, '\\[([^\\]]+)\\]', 1) as add,
  regexp_extract(message, '\\] \\[([^\\]]+)\\]', 1) as files,
  concat('20', substring(message, 0, 8)) as day
FROM logfile_etl;

Then delete the files in /etl/logfile (or move them to an archive). That will get you a day-partitioned ORC table, with compression and ORC goodness. (I added the '20' in front of your date so the string sorts well, although it probably would without it; it's early and I have not had coffee yet.) The regexes are a little messy, but based on your one line of data they should work. Also: if you have data from before 2000, the concat('20', ...) bit obviously needs to be updated. Note that I didn't use a regex on the date. Why? It appears to be properly padded data, so a substring is fast. This type of stuff has so many ways to skin a cat that your way may be totally different from mine, but this is how I'd approach it long term. (If it's a one-time thing, I might not create the managed table, but if you do, having partitions and ORC files will make things faster.) If there are syntax errors, I apologize; see earlier disclaimer about the lack of proper bean-sourced stimulants.
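For what it's worth, the extraction logic above can be sanity-checked against the sample log line with equivalent Python regexes (illustrative only, not the Hive call itself; note Hive's substring is 1-based, so substring(message, 0, 17) and the [:17] slice below grab the same 17 characters):

```python
import re

# Sample line from the original post
line = "13-11-23 06:23:45 [ADD] [file1.zip|file2.zip]"

ts = line[:17]                                        # first 17 chars: date + time
add = re.search(r'\[([^\]]+)\]', line).group(1)       # first bracketed field
files = re.search(r'\] \[([^\]]+)\]', line).group(1)  # second bracketed field
day = '20' + line[:8]                                 # '20' + yy-mm-dd

print(ts, add, files, day)
```

The first regex needs the `]` inside the character class (`[^\]]+`), which is the fix applied to the Hive snippet above; without it the pattern fails to compile.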
On Sat, Nov 23, 2013 at 7:36 AM, Baahu bahub...@gmail.com wrote: Hi, I have a messy log file which I want to use to create a table. I am only interested in retrieving 3 columns (time, ADD, files), which are in bold. Sample entry from log file:
*13-11-23 06:23:45 [ADD] [file1.zip|file2.zip]*
junkjunk|2013-11-23 06:23:44:592 EST|file3.zip
xyz|2013-11-23 06:23:44:592 EST|file3.zip
Can you please let me know how I should go about this? Regex seems to be the way out, but I am struggling with that as well!! Thanks, Baahu
Re: Table creation for logfile data
Hi John, Thanks for the reply. I have been given a new format of data, and the logs aren't as messy as they were earlier, but your mail gave me pointers which helped me in handling the new data. Now I am stuck handling a date format: I am getting dates in the form 22/11/13, which is dd/mm/yy, and I have to rearrange this into year/month/day order. Can you please shed some light on this? I think we need to use split() to get the tokens and then rearrange them, but I am not able to think of an efficient way to do this. Thanks.

On Sun, Nov 24, 2013 at 5:25 PM, John Omernik j...@omernik.com wrote: [quoted text hidden]

-- Twitter: http://twitter.com/Baahu
Re: Table creation for logfile data
I wouldn't worry about efficiency too much:

concat('20', split(date_field, '\\/')[2], '-',
       split(date_field, '\\/')[1], '-',
       split(date_field, '\\/')[0]) as proper_date -- YYYY-MM-DD

On Sun, Nov 24, 2013 at 12:13 PM, Baahu bahub...@gmail.com wrote: [quoted text hidden]
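The rearrangement John's split() expression performs can be sketched in Python (illustrative only; the field name date_field comes from his snippet, and the '20' prefix carries the same pre-2000 caveat as before):

```python
date_field = "22/11/13"                      # dd/mm/yy, as in the question
d, m, y = date_field.split('/')              # tokens: day, month, year
proper_date = '20' + y + '-' + m + '-' + d   # reassemble as YYYY-MM-DD

print(proper_date)
```

This is index [0] for day, [1] for month, [2] for year, matching the Hive split() array above.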
Re: Table creation for logfile data
Or use the date functions, such as unix_timestamp to convert a date string to a Unix timestamp, and from_unixtime to convert a Unix timestamp back to a string.

2013/11/25 John Omernik j...@omernik.com: [quoted text hidden]
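In Hive this round trip would presumably look something like from_unixtime(unix_timestamp(date_field, 'dd/MM/yy'), 'yyyy-MM-dd'). A quick Python equivalent of the same parse-then-reformat idea (illustrative, not the Hive call itself):

```python
from datetime import datetime

# Parse dd/mm/yy, then emit the sortable yyyy-MM-dd form
proper = datetime.strptime("22/11/13", "%d/%m/%y").strftime("%Y-%m-%d")

print(proper)
```

One design note: the parse-based approach also validates the date (a malformed field raises an error), whereas the split()/substring approach silently reorders whatever it is given.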
Re: Table creation for logfile data
Here is the document: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-DateFunctions

2013/11/25 郭士伟 guoshi...@gmail.com: [quoted text hidden]