Hi,
I have a question about the behavior of the class
org.apache.hadoop.hive.contrib.serde2.RegexSerDe. Here is the example I tested
using the Cloudera hive-0.7.1-cdh3u3 release. The class did NOT do what I
expected. Does anyone know the reason?
user:~/tmp> more Test.java
import java.io.*;
import java.text.*;

class Test {
    public static void main (String[] argv) throws Exception {
        String line = "aaa,\"bbb\",\"cc,c\"";
        String[] tokens = line.split(",(?=([^\"]*\"[^\"]*\")*[^\"]*$)");
        int i = 1;
        for (String t : tokens) {
            System.out.println(i + "> " + t);
            i++;
        }
    }
}
:~/tmp> java Test
1> aaa
2> "bbb"
3> "cc,c"
As you can see, the Java regular expression ",(?=([^\"]*\"[^\"]*\")*[^\"]*$)"
did what I wanted: used with String.split, it parses the string
aaa,"bbb","cc,c" into 3 tokens: (aaa), ("bbb"), and ("cc,c"). So the regular
expression works fine as a delimiter.
Now in Hive:
:~> more test.txt
aaa,"bbb","cc,c"
:~> hive
Hive history file=/tmp/user/hive_job_log_user_201204031242_591028210.txt
hive> create table test(
    > c1 string,
    > c2 string,
    > c3 string
    > )
    > row format
    > SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
    > WITH SERDEPROPERTIES (
    > "input.regex" = ",(?=([^\"]*\"[^\"]*\")*[^\"]*$)"
    > )
    > STORED AS TEXTFILE;
OK
Time taken: 0.401 seconds
hive> load data local inpath 'test.txt' overwrite into table test;
Copying data from file:/home/user/test.txt
Copying file: file:/home/user/test.txt
Loading data to table dev.test
Deleted hdfs://host/user/hive/warehouse/dev.db/test
OK
Time taken: 0.282 seconds
hive> select * from test;
OK
NULL NULL NULL
When I query this table, I don't get what I expected. I expected the output
to be the 3 strings, like this -----> aaa "bbb" "cc,c"
Why does the output give me 3 NULLs instead?
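One possible explanation, stated as an assumption and not checked against the
cdh3u3 source: String.split treats the pattern as a delimiter, while RegexSerDe
may instead match "input.regex" against the whole row and fill each column from
one capture group. Under that assumption, the sketch below compares the two
uses in plain Java. The class name RegexCheck and the pattern WHOLE are
hypothetical and only serve as an illustration; WHOLE is written for this one
sample line, not as a general CSV pattern.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexCheck {
    // Delimiter-style pattern, the same one passed to split() in Test.java.
    static final String DELIM = ",(?=([^\"]*\"[^\"]*\")*[^\"]*$)";

    // Hypothetical whole-line pattern: one capture group per column.
    static final String WHOLE = "([^,]*),(\"[^\"]*\"),(\"[^\"]*\")";

    // True only if the regex matches the entire line, not just part of it.
    static boolean matchesWholeLine(String regex, String line) {
        return Pattern.compile(regex).matcher(line).matches();
    }

    // Returns capture groups 1..n of a whole-line match, or null if no match.
    static String[] groups(String regex, String line, int n) {
        Matcher m = Pattern.compile(regex).matcher(line);
        if (!m.matches()) {
            return null;
        }
        String[] out = new String[n];
        for (int c = 0; c < n; c++) {
            out[c] = m.group(c + 1);
        }
        return out;
    }

    public static void main(String[] argv) {
        String line = "aaa,\"bbb\",\"cc,c\"";
        System.out.println("DELIM matches whole line: " + matchesWholeLine(DELIM, line));
        System.out.println("WHOLE matches whole line: " + matchesWholeLine(WHOLE, line));
        String[] cols = groups(WHOLE, line, 3);
        for (int i = 0; i < cols.length; i++) {
            System.out.println((i + 1) + "> " + cols[i]);
        }
    }
}
```

If RegexSerDe really works this way, the delimiter pattern can never match a
full row, which would be consistent with every column coming back NULL. A
pattern for "input.regex" may also need different backslash escaping inside a
Hive string literal than inside a Java string literal, so that is worth
checking separately.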
Thanks for your help.