My pleasure. Please feel free to reach me if needed. Dudu
From: Arun Patel [mailto:arunp.bigd...@gmail.com] Sent: Wednesday, June 22, 2016 2:57 AM To: user@hive.apache.org Subject: Re: RegexSerDe with Filters Thank you very much, Dudu. This really helps. On Tue, Jun 21, 2016 at 7:48 PM, Markovitz, Dudu <dmarkov...@paypal.com<mailto:dmarkov...@paypal.com>> wrote: Hi Here is the code (without the log data). I’ve created some of the views using different text processing technics. The rest of the views could be create in similar ways. Dudu bash ---------------------------------------------------------------------------------------------------- hdfs dfs -mkdir -p /tmp/log/20160621 hdfs dfs –put logfile.txt /tmp/log/20160621 hive ---------------------------------------------------------------------------------------------------- /* External table log Defines all common columns + optional column 'tid' which appears in most log records + the rest of the log ('txt') */ drop table if exists log; create external table log ( c1 string ,ts string ,ts_frac string ,log_rec_level string ,c4 string ,c5 string ,tid string ,txt string ) partitioned by (dt date) row format serde 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe' with serdeproperties ('input.regex'='(\\S+)\\s+([0-9]{4}-[0-9]{2}-[0-9]{2}<file:///\\S+)\s+(%5b0-9%5d%7b4%7d-%5b0-9%5d%7b2%7d-%5b0-9%5d%7b2%7d> [0-9]{2}:[0-9]{2}:[0-9]{2}),([0-9]{3})\\s+(\\S+)\\s+\\[([^]]+)\\]\\s+(\\S+)\\s+:\\s+(TID:\\s\\d+)?\\s*(.*)')<file:///\\s+(\S+)\s+\%5b(%5b%5e%5d%5d+)\%5d\s+(\S+)\s+:\s+(TID:\s\d+)%3f\s*(.*)')> stored as textfile location '/tmp/log' ; alter table log add partition (dt=date '2016-06-21') location '/tmp/log/20160621'; select * from log; ---------------------------------------------------------------------------------------------------- /* View log_v Base view for all other views */ drop view if exists log_v; create view log_v as select c1 ,cast (concat_ws ('.',ts,ts_frac) as timestamp) as ts ,log_rec_level ,c4 ,c5 ,cast (ltrim(substr (tid,5)) as bigint) as tid ,txt from log ; select * from log_v; ---------------------------------------------------------------------------------------------------- drop view if exists log_v_reaping_path; create view log_v_reaping_path as select c1 ,ts ,log_rec_level ,c4 ,c5 ,substr (txt,15) as reaping_path from log_V where txt like 'Reaping path: %' ; select * from log_v_reaping_path; ---------------------------------------------------------------------------------------------------- drop view if exists log_v_published_to_kafka; create view log_v_published_to_kafka as select c1 ,ts ,log_rec_level ,c4 ,c5 ,tid , ltrim (kv [' Key'] ) as key ,cast (ltrim (kv [' size'] ) as bigint ) as size , ltrim (kv [' topic'] ) as topic ,cast (ltrim (kv [' partition']) as int ) as partition ,cast (ltrim (kv [' offset'] ) as bigint ) as offset from (select c1 ,ts ,log_rec_level ,c4 ,c5 ,tid ,str_to_map (substr (txt ,locate ('.',txt)+1),',',':') as kv from log_V where txt like 'Published to Kafka. %' ) as t ; select * from log_v_published_to_kafka; ---------------------------------------------------------------------------------------------------- drop view if exists log_v_get_request; create view log_v_get_request as select c1 ,ts ,log_rec_level ,c4 ,c5 ,tid ,substr (txt,31) as path from log_V where txt like 'GET request received for path %' ; select * from log_v_get_request; ---------------------------------------------------------------------------------------------------- drop view if exists log_v_unlock_request; create view log_v_unlock_request as select c1 ,ts ,log_rec_level ,c4 ,c5 ,tid ,regexp_extract (txt,'rowkey (\\S+)',1<file:///\\S+)',1>) as rowkey ,regexp_extract (txt,'lock id (\\S+)',1<file:///\\S+)',1>) as lock_id from log_V where txt like 'Unlock request for schema DU %' ; From: Markovitz, Dudu [mailto:dmarkov...@paypal.com<mailto:dmarkov...@paypal.com>] Sent: Tuesday, June 21, 2016 2:26 PM To: user@hive.apache.org<mailto:user@hive.apache.org> Subject: RE: RegexSerDe with Filters Hi I would suggest creating a single external table with daily partitions and multiple views each with the appropriate filtering. If you’ll send me log sample (~100 rows) I’ll send you an example. Dudu From: Arun Patel [mailto:arunp.bigd...@gmail.com] Sent: Tuesday, June 21, 2016 1:51 AM To: user@hive.apache.org<mailto:user@hive.apache.org> Subject: RegexSerDe with Filters Hello Hive Experts, I use flume to ingest application specific logs from Syslog to HDFS. Currently, I grep the HDFS directory for specific patterns (for multiple types of requests) and then create reports. However, generating reports for Weekly and Monthly are not salable. I would like to create multiple external on the daily HDFS directory partitioned by date with RegexSerde and then create separate Parquet tables for every kind of request. Question is - How do I create multiple (about 20) RegexSerde tables on same data applying filters? This will be just like 20 grep commands I am running today. Example: hadoop fs -cat /user/ffffprod/2016-06-20/* | grep 'STORE Request Received for APPXXXX' | awk '{print $4, $13, $14, $17, $20}' hadoop fs -cat /user/ffffprod/2016-06-20/* | grep 'SCAN Request Received for APPYYYY' | awk '{print $4, $14, $19, $21, $22}' hadoop fs -cat /user/ffffprod/2016-06-20/* | grep 'TOTAL TIME' | awk '{print $4, $24}' I would like to create a tables which does this kind of job and then writes output to Parquet tables. Please let me know how this can be done. Thank you! Regards, Arun