Hi Ankit,

I know your problem because I had to deal with a thorn ('þ') separated file
too. Hive, so far, cannot handle multibyte separators, so I turned to the
custom SerDe option myself. If you manage to capture the 'þ' in the regex,
you could try the RegexSerDe as well.


I tried:

ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES ("input.regex" =
"(.*)þ(.*)þ(.*)þ(.*)þ(.*)þ(.*)þ(.*)þ(.*)þ(.*)þ(.*)þ(.*)þ(.*)þ(.*)þ(.*)þ(.*)þ(.*)þ(.*)þ(.*)þ(.*)þ(.*)þ(.*)þ(.*)þ(.*)þ(.*)þ(.*)þ(.*)þ(.*)")

'þ' is recognized by 'þ' in my case, but this regex was too greedy. In the
end I had to write a regex for each of the fields between the separators, and
that was so complicated that I wrote an MR job to replace the 'þ' with '~',
which Hive accepts as a field separator (ROW FORMAT DELIMITED FIELDS
TERMINATED BY '~').
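
For illustration, here is a minimal Python sketch of that replacement step. It
is not the actual MR job. It assumes the input is Latin-1 encoded, where 'þ' is
the single byte 0xFE (in UTF-8 it is the two bytes 0xC3 0xBE). The names
replace_separator and convert are hypothetical.

```python
from typing import BinaryIO

THORN = "\u00fe"  # 'þ', the separator in the source files
TILDE = "~"       # a single-byte separator that Hive accepts


def replace_separator(line: str, old: str = THORN, new: str = TILDE) -> str:
    """Swap every occurrence of the old separator for the new one."""
    # Refuse to convert if the new separator already occurs in the data,
    # because the result would then split into too many columns.
    if new in line:
        raise ValueError("line already contains the replacement separator")
    return line.replace(old, new)


def convert(src: BinaryIO, dst: BinaryIO, encoding: str = "latin-1") -> None:
    """Read lines from src, replace the separator, write them to dst.

    The encoding is an assumption; it must match the real input files.
    """
    for raw in src:
        line = raw.decode(encoding)
        dst.write(replace_separator(line).encode(encoding))
```

Used as a Hadoop Streaming mapper, this could be called as
convert(sys.stdin.buffer, sys.stdout.buffer) after importing sys. The check for
an existing '~' is only a precaution: if the data already contains the new
separator, the swap would silently create extra columns.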

I turned to another solution, and I am happy I did. Keep us posted if you find
another way.

Jasper

2011/5/8 ankit bhatnagar <abhatna...@gmail.com>

> Hi
>
> I am facing a weird issue with the file parsing. My log files have a thorn
> 'þ' as separator.
> I tried writing a test case for the deserializer and am kind of confused by
> the fact that it works fine when I pass the line to the deserializer; however,
> when I run it on Hive the line is not split into columns and the table inside
> Hive still contains the thorn as is.
>
> Any help would be appreciated.
>
> Thanks
> Ankit
>



-- 
Kind Regards \ Met Vriendelijke Groet,





Jasper Knulst

BI Consultant





VLC Den Haag
Gildeweg 5B
2632 BD  Nootdorp


M: +31 (0)6 19 66 75 11

T: +31 (0)15 764 07 50
------------------------------------------------------------

Skype: jasper_knulst_vlc
