Dudu, Thanks for your continued support. I need one more quick favor. I have one more log file, shown below.
STD-SERV 2016-06-29 12:10:39.142 c.f.c.s.F.ProcessingHandler [INFO] TID:101114719017567668 cluster1 ProcessingHandler Message timestamp: 1467216639090
STD-SERV 2016-06-29 12:10:39.143 c.f.c.s.F.ProcessingHandler [INFO] TID: 101114719017567668 cluster1: Processed request
STD-SERV 2016-06-29 12:10:39.163 c.f.c.s.F.WritingHandler [INFO] TID: 101114719017567668 Request received in writer
STD-SERV 2016-06-29 12:10:39.273 c.f.c.s.F.WritingHandler [INFO] TID: 101114719017567668 Processed request
STD-SERV 2016-06-29 12:10:39.273 c.f.c.s.F.WritingHandler [INFO] TID: 101114719017567668 Total time: 10 ms

I need to create 3 views for 3 requirements:

1) Create a view to get the timestamp, the TID number, and cluster1 for the "ProcessingHandler Message timestamp" lines. For these lines there is no space between "TID:" and the TID number.
2) Create a view to get the timestamp and TID for the "Request received in writer" lines. There is a space between "TID:" and the TID number.
3) Create a view to get the timestamp and TID for the "Total time:" lines. There is a space between "TID:" and the TID number.

How do I create the base table and the views? I am planning to join these 3 views on TID. Do I need to take any special considerations into account?

Regards,
Venkat

On Fri, Jun 24, 2016 at 5:17 PM, Arun Patel <arunp.bigd...@gmail.com> wrote:
> Dudu, Thanks for the clarification. Looks like I have an issue with my
> Hive installation. I tried in a different cluster and it works.
>
> Thanks again.
>
> On Fri, Jun 24, 2016 at 4:59 PM, Markovitz, Dudu <dmarkov...@paypal.com> wrote:
>
>> This is tested, working code.
>>
>> If you're using https://regex101.com, first replace backslash pairs (\\)
>> with a single backslash (\) and also use the 'g' modifier in order to
>> find all of the matches.
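A side note on the new log format quoted at the top of this message: the three line shapes, and the TID-spacing difference between them, can be prototyped quickly before being turned into RegexSerDe patterns. The regexes below are my own guesses at what the three views need, not something stated in the thread:

```python
import re

# The three sample lines from the new log file quoted above
msg_ts = ("STD-SERV 2016-06-29 12:10:39.142 c.f.c.s.F.ProcessingHandler [INFO] "
          "TID:101114719017567668 cluster1 ProcessingHandler Message timestamp: 1467216639090")
writer = ("STD-SERV 2016-06-29 12:10:39.163 c.f.c.s.F.WritingHandler [INFO] "
          "TID: 101114719017567668 Request received in writer")
total = ("STD-SERV 2016-06-29 12:10:39.273 c.f.c.s.F.WritingHandler [INFO] "
         "TID: 101114719017567668 Total time: 10 ms")

ts = r'(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3})'  # shared timestamp shape

# 1) "Message timestamp" lines: no space between "TID:" and the number
p1 = re.compile(r'\S+ ' + ts +
                r' \S+ \[\w+\] TID:(\d+) (\S+) ProcessingHandler Message timestamp: \d+')
# 2) "Request received in writer" lines: a space after "TID:"
p2 = re.compile(r'\S+ ' + ts + r' \S+ \[\w+\] TID: (\d+) Request received in writer')
# 3) "Total time:" lines: a space after "TID:"
p3 = re.compile(r'\S+ ' + ts + r' \S+ \[\w+\] TID: (\d+) Total time: (\d+) ms')

print(p1.match(msg_ts).groups())  # ('2016-06-29 12:10:39.142', '101114719017567668', 'cluster1')
print(p2.match(writer).groups())  # ('2016-06-29 12:10:39.163', '101114719017567668')
print(p3.match(total).groups())   # ('2016-06-29 12:10:39.273', '101114719017567668', '10')
```

In Hive these could become either three `input.regex` values (with the backslashes doubled) on three RegexSerDe tables, or, following Dudu's earlier advice, one permissive base table with three filtered views joined on the extracted TID.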
>>
>> The regular expression is:
>>
>> (\S+)\s+([0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2}),([0-9]{3})\s+(\S+)\s+\[([^]]+)\]\s+(\S+)\s+:\s+(TID:\s\d+)?\s*(.*)
>>
>> I'll send you a screen shot in private, since you don't want to expose
>> the data.
>>
>> Dudu
>>
>> *From:* Arun Patel [mailto:arunp.bigd...@gmail.com]
>> *Sent:* Friday, June 24, 2016 9:33 PM
>> *To:* user@hive.apache.org
>> *Subject:* Re: RegexSerDe with Filters
>>
>> Looks like the regex pattern is not working. I tested the pattern on
>> https://regex101.com/ and it does not find any match.
>>
>> Any suggestions?
>>
>> On Thu, Jun 23, 2016 at 3:01 PM, Markovitz, Dudu <dmarkov...@paypal.com> wrote:
>>
>> My pleasure.
>> Please feel free to reach me if needed.
>>
>> Dudu
>>
>> *From:* Arun Patel [mailto:arunp.bigd...@gmail.com]
>> *Sent:* Wednesday, June 22, 2016 2:57 AM
>> *To:* user@hive.apache.org
>> *Subject:* Re: RegexSerDe with Filters
>>
>> Thank you very much, Dudu. This really helps.
>>
>> On Tue, Jun 21, 2016 at 7:48 PM, Markovitz, Dudu <dmarkov...@paypal.com> wrote:
>>
>> Hi
>>
>> Here is the code (without the log data).
>>
>> I've created some of the views using different text-processing techniques.
>> The rest of the views could be created in similar ways.
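The single-backslash form of the expression can be sanity-checked with Python's `re` module as well. The sample line below is invented for illustration, since the real log data was deliberately left out of the thread (the assumed layout is host, timestamp, level, [thread], class, optional TID, free text):

```python
import re

# Dudu's expression with the backslash pairs already collapsed to single backslashes
pattern = re.compile(
    r'(\S+)\s+([0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2}),([0-9]{3})'
    r'\s+(\S+)\s+\[([^]]+)\]\s+(\S+)\s+:\s+(TID:\s\d+)?\s*(.*)')

# hypothetical log line, made up for this sketch
line = "nodeA 2016-06-21 14:07:01,123 INFO [main] c.f.c.s.SomeClass : TID: 987654 Reaping path: /tmp/foo"

m = pattern.match(line)
print(m.groups())
# ('nodeA', '2016-06-21 14:07:01', '123', 'INFO', 'main',
#  'c.f.c.s.SomeClass', 'TID: 987654', 'Reaping path: /tmp/foo')
```

The eight capture groups line up positionally with the eight columns of the external table defined further down: c1, ts, ts_frac, log_rec_level, c4, c5, tid, txt.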
>>
>> Dudu
>>
>> bash
>> ----------------------------------------------------------------------------------------------------
>>
>> hdfs dfs -mkdir -p /tmp/log/20160621
>> hdfs dfs -put logfile.txt /tmp/log/20160621
>>
>> hive
>> ----------------------------------------------------------------------------------------------------
>>
>> /*
>> External table log
>>
>> Defines all common columns + optional column 'tid', which appears in
>> most log records + the rest of the log ('txt')
>> */
>>
>> drop table if exists log;
>>
>> create external table log
>> (
>>     c1            string
>>    ,ts            string
>>    ,ts_frac       string
>>    ,log_rec_level string
>>    ,c4            string
>>    ,c5            string
>>    ,tid           string
>>    ,txt           string
>> )
>> partitioned by (dt date)
>> row format serde 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
>> with serdeproperties ('input.regex'='(\\S+)\\s+([0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2}),([0-9]{3})\\s+(\\S+)\\s+\\[([^]]+)\\]\\s+(\\S+)\\s+:\\s+(TID:\\s\\d+)?\\s*(.*)')
>> stored as textfile
>> location '/tmp/log'
>> ;
>>
>> alter table log add partition (dt=date '2016-06-21') location '/tmp/log/20160621';
>>
>> select * from log;
>>
>> ----------------------------------------------------------------------------------------------------
>>
>> /*
>> View log_v
>>
>> Base view for all other views
>> */
>>
>> drop view if exists log_v;
>>
>> create view log_v
>> as
>> select  c1
>>        ,cast (concat_ws ('.',ts,ts_frac) as timestamp) as ts
>>        ,log_rec_level
>>        ,c4
>>        ,c5
>>        ,cast (ltrim(substr (tid,5)) as bigint) as tid
>>        ,txt
>> from    log
>> ;
>>
>> select * from log_v;
>>
>> ----------------------------------------------------------------------------------------------------
>>
>> drop view if exists
log_v_reaping_path;
>>
>> create view log_v_reaping_path
>> as
>> select  c1
>>        ,ts
>>        ,log_rec_level
>>        ,c4
>>        ,c5
>>        ,substr (txt,15) as reaping_path
>> from    log_v
>> where   txt like 'Reaping path: %'
>> ;
>>
>> select * from log_v_reaping_path;
>>
>> ----------------------------------------------------------------------------------------------------
>>
>> drop view if exists log_v_published_to_kafka;
>>
>> create view log_v_published_to_kafka
>> as
>> select  c1
>>        ,ts
>>        ,log_rec_level
>>        ,c4
>>        ,c5
>>        ,tid
>>        ,      ltrim (kv [' Key'])               as key
>>        ,cast (ltrim (kv [' size'])   as bigint) as size
>>        ,      ltrim (kv [' topic'])             as topic
>>        ,cast (ltrim (kv [' partition']) as int) as partition
>>        ,cast (ltrim (kv [' offset']) as bigint) as offset
>> from   (select  c1
>>                ,ts
>>                ,log_rec_level
>>                ,c4
>>                ,c5
>>                ,tid
>>                ,str_to_map (substr (txt, locate ('.',txt)+1), ',', ':') as kv
>>         from    log_v
>>         where   txt like 'Published to Kafka. 
%'
>>        ) as t
>> ;
>>
>> select * from log_v_published_to_kafka;
>>
>> ----------------------------------------------------------------------------------------------------
>>
>> drop view if exists log_v_get_request;
>>
>> create view log_v_get_request
>> as
>> select  c1
>>        ,ts
>>        ,log_rec_level
>>        ,c4
>>        ,c5
>>        ,tid
>>        ,substr (txt,31) as path
>> from    log_v
>> where   txt like 'GET request received for path %'
>> ;
>>
>> select * from log_v_get_request;
>>
>> ----------------------------------------------------------------------------------------------------
>>
>> drop view if exists log_v_unlock_request;
>>
>> create view log_v_unlock_request
>> as
>> select  c1
>>        ,ts
>>        ,log_rec_level
>>        ,c4
>>        ,c5
>>        ,tid
>>        ,regexp_extract (txt,'rowkey (\\S+)',1) as rowkey
>>        ,regexp_extract (txt,'lock id (\\S+)',1) as lock_id
>> from    log_v
>> where   txt like 'Unlock request for schema DU %'
>> ;
>>
>> *From:* Markovitz, Dudu [mailto:dmarkov...@paypal.com]
>> *Sent:* Tuesday, June 21, 2016 2:26 PM
>> *To:* user@hive.apache.org
>> *Subject:* RE: RegexSerDe with Filters
>>
>> Hi
>>
>> I would suggest creating a single external table with daily partitions
>> and multiple views, each with the appropriate filtering.
>> If you'll send me a log sample (~100 rows) I'll send you an example.
>>
>> Dudu
>>
>> *From:* Arun Patel [mailto:arunp.bigd...@gmail.com]
>> *Sent:* Tuesday, June 21, 2016 1:51 AM
>> *To:* user@hive.apache.org
>> *Subject:* RegexSerDe with Filters
>>
>> Hello Hive Experts,
>>
>> I use Flume to ingest application-specific logs from syslog to HDFS.
>> Currently, I grep the HDFS directory for specific patterns (for multiple
>> types of requests) and then create reports. However, generating weekly
>> and monthly reports this way is not scalable.
>>
>> I would like to create multiple external tables on the daily HDFS directory,
>> partitioned by date, with RegexSerDe, and then create separate Parquet tables
>> for every kind of request.
>>
>> The question is: how do I create multiple (about 20) RegexSerDe tables on the
>> same data, applying filters? This would be just like the 20 grep commands I am
>> running today.
>>
>> Example:
>>
>> hadoop fs -cat /user/ffffprod/2016-06-20/* | grep 'STORE Request Received for APPXXXX' | awk '{print $4, $13, $14, $17, $20}'
>> hadoop fs -cat /user/ffffprod/2016-06-20/* | grep 'SCAN Request Received for APPYYYY' | awk '{print $4, $14, $19, $21, $22}'
>> hadoop fs -cat /user/ffffprod/2016-06-20/* | grep 'TOTAL TIME' | awk '{print $4, $24}'
>>
>> I would like to create tables that do this kind of job and then
>> write the output to Parquet tables.
>>
>> Please let me know how this can be done. Thank you!
>>
>> Regards,
>>
>> Arun
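As a footnote on the grep/awk pipelines above: each one amounts to "filter on a substring, then project whitespace-delimited fields", which is exactly what the corresponding Hive view has to reproduce. A small sketch of the third pipeline, using a synthetic 24-field line since the real syslog data isn't shown in the thread:

```python
# build a synthetic whitespace-delimited line with 24 fields;
# fields 5-6 carry the grep pattern 'TOTAL TIME'
fields = ['f%d' % i for i in range(1, 25)]
fields[4:6] = ['TOTAL', 'TIME']
line = ' '.join(fields)

# equivalent of: grep 'TOTAL TIME' | awk '{print $4, $24}'
# (awk fields are 1-indexed, Python lists are 0-indexed)
if 'TOTAL TIME' in line:
    parts = line.split()
    print(parts[3], parts[23])  # f4 f24
```

The field positions ($4, $24, etc.) become either capture-group positions in an `input.regex`, or `substr`/`regexp_extract` calls in a view over a single permissive base table, as in Dudu's example.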