Hi all,
I want to know that, whether Hive supports using a custom MapReduce script
to load data into a table with static/hybrid/dynamic partitions?
I'm working on a log analysis program, for instance, the Tomcat log.
Firstly, I import the raw/original log file into Hive, then I need to do
some analysis(eg. parse URIs to detect vulnerbility), and store the results
in another Hive table(probably adding several columns). I've achieved what
I've described above using a custom MapReduce script written in Perl.
But what disappoints me is that the query quite slow. So I decide to speed
up the query using Hive (dynamic) partitions. Unfortunately, I didn't
succed.
Then I began to wonder whether Hive supports using a custom MapReduce
script to load data into a table with partitions.
Here is the related HiveQL script:
------------------------------------------------------------------------------------------
------------------------------ original table, tomcat0
CREATE EXTERNAL TABLE if not exists tomcat0
(full_time STRING, client_ip STRING, hyphen STRING, server_ip STRING,
method STRING, uri STRING, protocol STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.dynamic_type.DynamicSerDe'
WITH SERDEPROPERTIES (
'serialization.format'='org.apache.hadoop.hive.serde2.thrift.TCTLSeparatedProtocol',
'quote.delim'='("|\\[|\\])',
'field.delim'=' ',
'serialization.null.format'='-')
STORED AS TEXTFILE
LOCATION '/var/hive_data/logdata_external';
load data local inpath '/home/hduser/tomcat_log' overwrite INTO TABLE
tomcat0;
------------------------------ target table, tomcat1
CREATE TABLE if not exists tomcat1
(time TIMESTAMP, time_zone STRING, client_ip STRING, hyphen STRING,
server_ip STRING, method STRING, uri STRING, protocol STRING,
uri_attack_type INT)
PARTITIONED BY (partition_date STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
add file /home/hduser/Hive.pl/Tomcat.pl;
INSERT overwrite TABLE tomcat1 partition(partition_date)
SELECT
TRANSFORM (full_time, client_ip, hyphen, server_ip, method, uri, protocol)
USING '/usr/bin/perl Tomcat.pl'
AS (time, time_zone, client_ip, hyphen, server_ip, method, uri, protocol,
uri_attack_type, partition_date)
FROM tomcat0
distribute by partition_date limit 2;
------------------------------------------------------------------------------------------
I will give a short explanation:
I'm trying use dynamic partition by date(not Hive timestamp), and the
partition column is generated by the custom MapReduce script. I don't know
whether it's acceptable in Hive.
I also tried to add static partition in advance:
alter table tomcat1 add partition (partition_time='2013-08-31');
but still no luck. :-(
The typical input/output of the custom MapReduce script, Tomcat.pl:
INPUT (7 columns):
31/Aug/2012:00:14:20+0800 202.104.70.250 - 192.168.202.245 GET
/wxcs/mains.html?1=1' HTTP/1.1
OUTPUT (10 columns):
2012-08-31 00:14:20 +0800 202.104.70.250 - 192.168.202.245 GET
/wxcs/mains.html?1=1' HTTP/1.1 1 2012-08-31
Error messages are typically like this:
------------------------------------------------------------------------------------------
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: [Error 20001]:
An error occurred while reading or writing to your custom script. It may
have crashed with an error.
at
org.apache.hadoop.hive.ql.exec.ScriptOperator.processOp(ScriptOperator.java:410)
at
org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:501)
at
org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:842)
at
org.apache.hadoop.hive.ql.exec.SelectOperator.processOp(SelectOperator.java:88)
at
org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:501)
at
org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:842)
at
org.apache.hadoop.hive.ql.exec.TableScanOperator.processOp(TableScanOperator.java:91)
at
org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:501)
at
org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:842)
at
org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:534)
... 9 more
------------------------------------------------------------------------------------------
I don't think the problem directly exists in my Perl script, though Hive
complains on this.
Sorry for such a long e-mail. :-(
Any input is appreciated. :-)