Re: regexp_replace with unicode chars

Tom Hall Fri, 01 Mar 2013 09:25:00 -0800

Thanks Dean,

I dont think translate would work as the set of things to remove is massive.
Yeah, it's a one-off cleanup job while exporting to try redshift on our
datasets.
My guess is it's something about the way hive handles strings? Tried
"\\ufffd" as the replacement str but no joy either.


Cheers again,
Tom



On 1 March 2013 17:08, Dean Wampler <[email protected]>wrote:

> I think this should work, but you might investigate using the translate
> function instead. I suspect it will provide much better performance than
> using regexps. Also, Are you planning to do this once to create your final
> tables? If so, the performance overhead won't matter much.
>
> dean
>
>
> On Fri, Mar 1, 2013 at 10:52 AM, Tom Hall <[email protected]> wrote:
>
>> I would like to remove unicode chars that are outside the Basic
>> Multilingual Plane [1]
>>
>> I thought
>> select regexp_replace(some_column,"[^\\u0000-\\uffff]","\ufffd") from
>> my_table
>> would work but while the regexp does work the replacement str does not (I
>> can paste in the literal �, which you may or may not be able to see here
>> but it somehow did not fell right)
>>
>> I saw Deans previous post on using octals [2] but I think \ufffd is
>> outside the allowable range.
>>
>> Cheers,
>> Tom
>>
>>
>> [1]
>> http://en.wikipedia.org/wiki/Plane_%28Unicode%29#Basic_Multilingual_Plane
>> [2]
>> http://grokbase.com/t/hive/dev/131a4n562y/unicode-character-as-delimiter
>>
>
>
>
> --
> *Dean Wampler, Ph.D.*
> thinkbiganalytics.com
> +1-312-339-1330
>
>

Re: regexp_replace with unicode chars

Reply via email to