I looked in hive.log while doing the CREATE TABLE and found: 2009-11-30 17:51:53,240 WARN serde2.RegexSerDe (RegexSerDe.java:deserialize(179)) - 1 unmatched rows are found: Nov 2 23:00:04 sf45-cc2 sendmail[10905]: [ID 801593 mail.info] nA3704qO010905: from=nagios, size=286, class=0, nrcpts=1, msgid=<[email protected]>, relay=nag...@localhost
So instead of grabbing a log file at random, I went back and used exactly the same one I did when I was testing the Serde. This time, the SELECT statement worked fine. Maybe I'll run my regex over a large sampling of our syslog files and see how many don't conform to it. I may have to change my regex to be more forgiving of deviations... _____________________________________________ From: Barclay, Ken Sent: Monday, November 30, 2009 5:07 PM To: [email protected] Subject: Table created using RegexSerDe doesn't work when made external Hello, I'm getting only NULLs when trying to read data from an external table for which the data was staged. I'm using Cloudera's hive-0.4.0+14.tar.gz with hadoop-0.20.1+152.tar.gz on a Centos machine. My steps were: # CREATE TABLE CREATE EXTERNAL TABLE mailog_stg(month STRING, day STRING, time STRING, host STRING, logline STRING) ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe' WITH SERDEPROPERTIES ( "input.regex" = "^([^ ]+) ([^ ]+) ([^ ]+) ([^ ]+) (.+)$", "output.format.string" = "%1$s %2$s %3$s %4$s %5$s" ) STORED AS TEXTFILE LOCATION '/user/data/staging/mailog'; # COPY DATA FILE hadoop dfs -put /ken/sf45-cc2-maillogs/mail.20091102 /user/data/staging/mailog # DO A SELECT hive> select * from mailog_stg; OK NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL I verified my data is out there: [r...@jims-desktop ~]# hadoop dfs -cat /user/data/staging/mailog/m*|wc 4 77 1016 And in fact when I used LOAD DATA INPATH with a table defined the same way but without making it external, it worked fine. Any suggestions how to debug this? Thanks Ken
