Re: regexp_replace with unicode chars

2013-03-01 Thread Mark Grover
The translate UDF does handle non-ASCII characters. It works on
Unicode code points rather than on individual chars.

Here is the unit test to demonstrate that:
https://github.com/apache/hive/blob/trunk/ql/src/test/queries/clientpositive/udf_translate.q#L36

But you guys are right. It doesn't solve Tom's original problem
because it doesn't take ranges.

I created HIVE-4100 to improve the regexp_replace UDF.


Mark
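
For readers following along, the code-point behaviour described above can be sketched in plain Java. This is not Hive's implementation: the class name CodePointTranslate and its translate method are hypothetical, and the rule that an unmatched "from" character deletes its input is an assumption borrowed from PostgreSQL-style translate. The sketch shows two things: a supplementary character is handled as one unit, and '[a-z]' is treated as five literal characters rather than as a range.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch, not Hive's code: a translate() that maps code points.
class CodePointTranslate {

    static String translate(String input, String from, String to) {
        int[] fromCps = from.codePoints().toArray();
        int[] toCps = to.codePoints().toArray();

        // Map each source code point to its replacement.
        // -1 marks "delete" when 'to' is shorter than 'from' (assumed rule).
        Map<Integer, Integer> mapping = new HashMap<>();
        for (int i = 0; i < fromCps.length; i++) {
            // First occurrence in 'from' wins.
            mapping.putIfAbsent(fromCps[i], i < toCps.length ? toCps[i] : -1);
        }

        StringBuilder out = new StringBuilder();
        input.codePoints().forEach(cp -> {
            int mapped = mapping.getOrDefault(cp, cp);
            if (mapped >= 0) {
                out.appendCodePoint(mapped);
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        // U+1F600 is one code point but two Java chars; it is mapped as a unit.
        System.out.println(translate("a\uD83D\uDE00b", "\uD83D\uDE00", "?"));
        // No range expansion: only '[', 'a', '-', 'z', ']' are mapped.
        System.out.println(translate("[a-z] abc", "[a-z]", "[A-Z]"));
    }
}
```

With this sketch, the 'b' and 'c' in the second call are left untouched, which is why a range-free translate does not help when the set of characters to remove is large.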



Re: regexp_replace with unicode chars

2013-03-01 Thread Dean Wampler
Does anyone know if translate takes ranges, as some implementations do? E.g.,

translate ('[a-z]', '[A-Z]')

Of course, that probably doesn't work for non-ASCII characters.



-- 
*Dean Wampler, Ph.D.*
thinkbiganalytics.com
+1-312-339-1330


Re: regexp_replace with unicode chars

2013-03-01 Thread Tom Hall
Thanks Dean,

I don't think translate would work, as the set of things to remove is massive.
Yeah, it's a one-off cleanup job while exporting to try Redshift on our
datasets.
My guess is it's something about the way Hive handles strings? I tried
"\\ufffd" as the replacement string but no joy either.

Cheers again,
Tom
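
One possible explanation for the "\\ufffd" attempt, sketched in plain Java since Hive's regexp_replace is built on java.util.regex: the replacement string of replaceAll does not interpret Unicode escapes, and a backslash there simply makes the next character literal. The class name ReplacementEscapes and the "X" pattern are made up for illustration, and Hive's own string-literal unescaping may change what actually reaches the regex engine, so treat this as a sketch of the Java layer only.

```java
// Hypothetical sketch of how java.util.regex treats replacement strings.
class ReplacementEscapes {

    // The replacement is six characters: backslash, u, f, f, f, d.
    // The regex engine reads the backslash as "take the next char literally".
    static String withEscapedText(String input) {
        return input.replaceAll("X", "\\ufffd");
    }

    // The replacement is the single character U+FFFD, resolved by the
    // Java compiler before the regex engine ever sees it.
    static String withLiteralChar(String input) {
        return input.replaceAll("X", "\uFFFD");
    }

    public static void main(String[] args) {
        System.out.println(withEscapedText("aXb")); // prints "aufffdb"
        System.out.println(withLiteralChar("aXb").length()); // prints 3
    }
}
```

If something similar happens inside Hive, the replacement would need to arrive as the actual U+FFFD character rather than as escape text.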





Re: regexp_replace with unicode chars

2013-03-01 Thread Dean Wampler
I think this should work, but you might investigate using the translate
function instead. I suspect it will provide much better performance than
using regexps. Also, are you planning to do this once to create your final
tables? If so, the performance overhead won't matter much.

dean






regexp_replace with unicode chars

2013-03-01 Thread Tom Hall
I would like to remove Unicode chars that are outside the Basic
Multilingual Plane [1]

I thought
select regexp_replace(some_column,"[^\\u-\\u]","\ufffd") from
my_table
would work, but while the regexp does work, the replacement string does not (I
can paste in the literal �, which you may or may not be able to see here,
but it somehow did not feel right).

I saw Dean's previous post on using octals [2] but I think \ufffd is outside
the allowable range.

Cheers,
Tom


[1]
http://en.wikipedia.org/wiki/Plane_%28Unicode%29#Basic_Multilingual_Plane
[2] http://grokbase.com/t/hive/dev/131a4n562y/unicode-character-as-delimiter
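
As a point of comparison outside Hive, the same cleanup can be sketched directly against java.util.regex, which matches by code point. The range U+0000 to U+FFFF is an assumption about what the query was meant to contain, and the class name BmpCleaner is hypothetical. The sketch only shows that a negated BMP range does match supplementary characters and that U+FFFD works as a replacement when it is passed as an actual character.

```java
import java.util.regex.Pattern;

// Hypothetical sketch: replace every code point outside the BMP with U+FFFD.
class BmpCleaner {

    // Assumed range: anything that is not in U+0000..U+FFFF.
    // A valid surrogate pair is read as one code point above U+FFFF.
    private static final Pattern NON_BMP = Pattern.compile("[^\\u0000-\\uFFFF]");

    static String replaceNonBmp(String input) {
        // The replacement is the literal U+FFFD character, not escape text.
        return NON_BMP.matcher(input).replaceAll("\uFFFD");
    }

    public static void main(String[] args) {
        String sample = "a\uD83D\uDE00b"; // 'a', U+1F600, 'b'
        String cleaned = replaceNonBmp(sample);
        System.out.println(cleaned.length());            // prints 3
        System.out.println(cleaned.equals("a\uFFFDb"));  // prints true
    }
}
```

Whether the equivalent Hive query behaves the same way depends on how Hive unescapes the two string literals before handing them to the regex engine.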