Hi Ankit,

I know your problem because I had to deal with a thorn ('þ') separated file too. So far Hive cannot handle multibyte separators, so I turned to the custom SerDe option myself. If you manage to capture the 'þ' in the regex, you could try RegexSerDe.

I tried:

    ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
    WITH SERDEPROPERTIES ("input.regex" = "(.*)þ(.*)þ(.*)þ(.*)þ(.*)þ(.*)þ(.*)þ(.*)þ(.*)þ(.*)þ(.*)þ(.*)þ(.*)þ(.*)þ(.*)þ(.*)þ(.*)þ(.*)þ(.*)þ(.*)þ(.*)þ(.*)þ(.*)þ(.*)þ(.*)þ(.*)þ(.*)")

'þ' is recognized by 'þ' in my case, but this regex was too greedy. In the end I had to write a regex for every field between the separators, and that was so complicated that I wrote an MR job to replace the 'þ' with '~', which Hive accepts as a field separator (ROW FORMAT DELIMITED FIELDS TERMINATED BY '~').

I turned to another solution, and I am happy I did. Keep us posted if you find another way.

Jasper

2011/5/8 ankit bhatnagar <abhatna...@gmail.com>

> Hi
>
> I am facing a weird issue with the file parsing. My log files have a thorn
> 'þ' as separator.
> I tried writing a test case for the deserializer and am kind of confused by the
> fact that it works fine when I pass the line to the deserializer; however, when I
> run it on Hive the line is not split into columns and the table inside Hive has
> the thorn as it is.
>
> Any help would be appreciated.
>
> Thanks
> Ankit

--
Kind Regards \ Met Vriendelijke Groet,

Jasper Knulst
BI Consultant
VLC Den Haag
Gildeweg 5B
2632 BD Nootdorp
M: +31 (0)6 19 66 75 11
T: +31 (0)15 764 07 50
------------------------------------------------------------
Skype: jasper_knulst_vlc
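
For anyone following this thread, here is a rough Python sketch of the two ideas discussed above: a per-field pattern that cannot run past a separator, and a plain replacement of 'þ' by '~'. It is not the MR job or the SerDe pattern actually used in this thread. The helper names build_field_regex and replace_separator are made up for illustration, the input is assumed to be already decoded text, and whether the generated pattern behaves the same inside Hive's "input.regex" property is untested.

```python
import re

SEP = "\u00fe"  # the thorn character 'þ'


def build_field_regex(n_fields, sep=SEP):
    """Return a pattern with one capture group per field.

    Each group uses a negated character class, so a field can never
    swallow a separator the way a greedy (.*) can.
    """
    field = "([^" + re.escape(sep) + "]*)"
    return re.escape(sep).join([field] * n_fields)


def replace_separator(line, old=SEP, new="~"):
    """Swap the separator, refusing lines that already contain the new one."""
    if new in line:
        raise ValueError("line already contains the replacement separator")
    return line.replace(old, new)


# Three fields, the last one empty.
sample = "a" + SEP + "b c" + SEP
pattern = re.compile(build_field_regex(3))
fields = pattern.fullmatch(sample).groups()
converted = replace_separator(sample)
```

The check in replace_separator is there because a '~' already present in the data would silently add an extra column once the file is loaded with FIELDS TERMINATED BY '~'.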