Re: Table creation for logfile data

2013-11-24 Thread John Omernik
Put the log file into a location on HDFS, and create an external table
pointing to that location. The external table should have just one column,
a string:

CREATE EXTERNAL TABLE logfile_etl (message STRING) LOCATION '/etl/logfile';

I think that should work.

Then create another table:

CREATE TABLE logfile (ts STRING, ADD STRING, files STRING) PARTITIONED BY
(DAY STRING) STORED AS ORC;

Copy the files into /etl/logfile.

Then run this Hive script (note: I bumped max.dynamic.partitions up to match
the per-node limit; the original "=1" looks like a typo and would cap you at
one partition total):

SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.exec.max.dynamic.partitions=1000;
SET hive.exec.max.dynamic.partitions.pernode=1000;

INSERT INTO TABLE logfile PARTITION (day)
SELECT substring(message, 0, 17) AS ts,
  regexp_extract(message, '\\[([^\\]]+)\\]', 1) AS ADD,
  regexp_extract(message, '\\] \\[([^\\]]+)\\]', 1) AS files,
  concat('20', substring(message, 0, 8)) AS day
FROM logfile_etl;
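As a quick sanity check outside Hive, the same two patterns can be tried against the sample log line with Python's re module (the double backslashes in the HiveQL string literals become single backslashes in a raw Python string; this is only a sketch of the extraction logic, not part of the Hive job):

```python
import re

# The sample line from the original question (bold markers stripped)
line = ": 13-11-23 06:23:45 [ADD] [file1.zip|file2.zip]  junkjunk|2013-11-23 06:23:44:592 EST|file3.zip"

# First bracketed field: matches "[ADD]" and captures "ADD"
add = re.search(r'\[([^\]]+)\]', line).group(1)

# Bracketed field preceded by "] ": captures the pipe-separated file list
files = re.search(r'\] \[([^\]]+)\]', line).group(1)

print(add)    # ADD
print(files)  # file1.zip|file2.zip
```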

Then delete the files in /etl/logfile (or move them to an archive).

That will get you a day-partitioned ORC table, with compression and the other
ORC goodness. (I added the '20' in front of your date so that the string sorts
well, although it probably would without it; it's early and I have not had
coffee yet.) The regexes are a little messy, but based on your one line of
data they should work. Also: if you have data from before 2000, the
concat('20', ...) obviously needs to be updated. Note that I didn't use a
regex on the date. Why? It appears to be properly padded data, so a substring
is fast. There are many ways to skin a cat here, so your approach may be
totally different from mine, but this is how I'd handle it long term. (If
it's a one-time thing, I might not create the managed table, but if you do,
having partitions and ORC files will make things faster.) If there are syntax
errors, I apologize; see the earlier disclaimer about the lack of proper
bean-sourced stimulants.





On Sat, Nov 23, 2013 at 7:36 AM, Baahu bahub...@gmail.com wrote:

 Hi,
 I have a messy log file which I want to use to create a table. I am only
 interested in retrieving 3 columns (time, ADD, files), which are in bold.
 Sample entry from the log file:
 *: 13-11-23 06:23:45 [ADD] [file1.zip|file2.zip] *  junkjunk|2013-11-23
 06:23:44:592 EST|file3.zip xyz|2013-11-23 06:23:44:592 EST|file3.zip

 Can you please let me know how I should go about this? Regex seems to be
 the way out, but I am struggling with that as well!

 Thanks,
 Baahu




Re: Table creation for logfile data

2013-11-24 Thread Baahu
Hi John,
Thanks for the reply. I have been given a new format of data, and the logs
aren't as messy as they were earlier, but your mail gave me pointers which
helped me in handling the new data.

Now I am stuck handling a date format: I am getting dates in the form
22/11/13, which is dd/mm/yy, and I have to rearrange this to yyyy/mm/dd.
Can you please shed some light on this? I think we need to use split() to
get the tokens and then rearrange them, but I am not able to think of an
efficient way to do this.

Thanks.


-- 
Twitter:http://twitter.com/Baahu


Re: Table creation for logfile data

2013-11-24 Thread John Omernik
I wouldn't worry about efficiency too much:

concat('20', split(date_field, '\\/')[2], '-', split(date_field, '\\/')[1],
'-', split(date_field, '\\/')[0]) as proper_date -- YYYY-MM-DD
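The same rearrangement, sketched in Python purely to illustrate the split-and-reassemble idea (the helper name is mine, not from the thread), with the same assumption that every year is 2000 or later:

```python
def rearrange_date(d):
    # Split a dd/mm/yy string such as '22/11/13' on '/',
    # then reassemble as YYYY-MM-DD with a hard-coded '20' century.
    dd, mm, yy = d.split('/')
    return '20' + yy + '-' + mm + '-' + dd

print(rearrange_date('22/11/13'))  # 2013-11-22
```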








Re: Table creation for logfile data

2013-11-24 Thread 郭士伟
Or use the date functions? Such as unix_timestamp to convert the date string
to a Unix timestamp, and from_unixtime to convert the Unix timestamp back to
a string.
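For illustration, the parse-then-reformat idea behind unix_timestamp/from_unixtime looks like this with Python's datetime module (an analogy only; in Hive you would pass SimpleDateFormat patterns such as 'dd/MM/yy' and 'yyyy-MM-dd' instead):

```python
from datetime import datetime

def reformat(d):
    # Parse the dd/mm/yy string into a real date, then re-emit it
    # in ISO order, mirroring unix_timestamp(...) -> from_unixtime(...).
    return datetime.strptime(d, '%d/%m/%y').strftime('%Y-%m-%d')

print(reformat('22/11/13'))  # 2013-11-22
```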







Re: Table creation for logfile data

2013-11-24 Thread 郭士伟
Here is the documentation:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-DateFunctions

