Re: testing a string for unicode characters

Paul Hastings Thu, 08 Dec 2005 21:12:27 -0800

S. Isaac Dealey wrote:
> I just noticed today (or rather, John Ashenfelter pointed out to me :)
> that some regular expression engines include \uXXXX to match unicode
> characters.


that's what unicode calls its level 1 support. bare butt minimum. i'm
weak w/regex so take all of the following w/a grain of salt. i think
regex in cf is based on perl, which is described as having "some" unicode
support via \p, \P, \X, and \u, none of which cf seems to support.

> So I did a couple tests on some data we had in a legacy database (it
> doesn't have any nvarchar columns... yet) and it turns out that using
> the \uXXXX pattern in a regular expression will mangle the hell out of
> an ASCII string... I'm also assuming (and I may be way off base) that

why can't you use chr()?

> using #chr(1-32)# (for a nonprinting ascii character) won't always
> match the same character in a unicode string. (I would think it
> depends on whether or not the string uses single or double-byte for
> the individual character, since I recall reading that unicode doesn't
> always use double-byte representation, but I suspect the regex engine
> does if you use \uXXXX.)

you're thinking about the unicode transformation formats (UTF-8, UTF-16),
where the number of bytes can vary per char, or maybe the now rare cases
of building graphemes via combining chars: "a" + combining acute accent
(U+0301) = á (latin small a w/acute). if you're dealing w/this internally
i wouldn't think that would be an issue.
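
to make the combining-char case concrete, here's a minimal java sketch. the class and method names are made up for illustration; only java.text.Normalizer is a real core java class. it shows that "a" followed by U+0301 is two chars and doesn't compare equal to the precomposed á (U+00E1) until you normalize to NFC.

```java
import java.text.Normalizer;

// Hypothetical demo class; the names are illustrative only.
class CombiningDemo {
    // Compose base letters and combining marks into precomposed form (NFC).
    static String compose(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        String decomposed = "a" + (char) 0x0301; // 'a' + combining acute accent
        String composed = String.valueOf((char) 0x00E1); // precomposed a-acute

        System.out.println(decomposed.length());              // two UTF-16 chars
        System.out.println(composed.length());                // one UTF-16 char
        System.out.println(decomposed.equals(composed));      // raw comparison
        System.out.println(compose(decomposed).equals(composed)); // after NFC
    }
}
```

if the data never contains decomposed sequences, the raw comparison is all you need and normalizing is just belt and braces.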

> 1) Does the regex engine in CF 6-7 support \uXXXX?

apparently not.
bigA="A";
p=refind("\u0041",bigA);
p returns 0.

i think if you need to use that kind of notation then maybe core java's
java.util.regex would do it. icu4j also has some nifty search classes that
might be useful.
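
for what it's worth, a minimal sketch of the same test run against java.util.regex instead of refind(). the class and helper names are invented for the example; Pattern and Matcher are the real core java classes, and they do understand the \uXXXX notation.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the names are illustrative only.
class UnicodeEscapeDemo {
    // True if the regex matches anywhere in the input string.
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        // The doubled backslash hands the regex engine a literal
        // backslash-u escape for code point 0041 (capital A).
        System.out.println(found("\\u0041", "A"));
        System.out.println(found("\\u0041", "b"));
    }
}
```

from cf you'd presumably reach the same class w/createObject("java", "java.util.regex.Pattern"); i haven't run that variant here, so treat it as an assumption.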

> 2) am I wrong about #chr(1-32)#? Will it always match the same
> character in a UTF-8 string?

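it should always match, unless you're using EBCDIC?? the original ASCII
stuff (code points 0-127) went straight into unicode as is to keep the
adoption uproar to a minimum, and UTF-8 encodes each of those as the same
single byte.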
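
a quick way to convince yourself, sketched in java (helper names are hypothetical, the charset API is core java): encode each low code point as UTF-8 and check that you get exactly one byte w/the same value.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; the names are illustrative only.
class AsciiUtf8Demo {
    // True if the code point encodes in UTF-8 as a single byte equal to its own value.
    static boolean sameSingleByte(int codePoint) {
        byte[] bytes = new String(Character.toChars(codePoint))
                .getBytes(StandardCharsets.UTF_8);
        return bytes.length == 1 && (bytes[0] & 0xFF) == codePoint;
    }

    public static void main(String[] args) {
        boolean allSame = true;
        for (int cp = 1; cp <= 32; cp++) {
            allSame &= sameSingleByte(cp); // the chr(1-32) range from the question
        }
        System.out.println(allSame);
        // Anything above 127, e.g. a-acute (0xE1), needs more than one byte.
        System.out.println(sameSingleByte(0xE1));
    }
}
```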

> 3) if CF 6+ supports \uXXXX and #chr(1-32)# won't always match the
> unicode equivalent, is there a way to test a string in CF to determine
> if it's unicode (digging into Java maybe)?

icu4j has a charset detector, core java's java.util.regex might help too.
but forensically determining a charset is neither pretty nor always correct.
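
one thing that is cheap to check: once the data is a java string it's unicode (UTF-16) internally no matter where it came from, so the practical question becomes "does it hold anything beyond plain ASCII?". a minimal sketch w/java.util.regex follows; the class and method names are hypothetical, and this is not charset detection, just a range test.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the names are illustrative only.
class NonAsciiCheck {
    // Matches any single char outside the 7-bit ASCII range.
    private static final Pattern NON_ASCII = Pattern.compile("[^\\x00-\\x7F]");

    // True if the string contains at least one char above 127.
    static boolean hasNonAscii(String s) {
        return NON_ASCII.matcher(s).find();
    }

    public static void main(String[] args) {
        System.out.println(hasNonAscii("plain ascii"));
        System.out.println(hasNonAscii("caf" + (char) 0x00E9)); // e-acute at the end
    }
}
```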

> exception of tab (9), newline (10,13) and space (32) characters. This
> database has some data containing vertical tabs (11) which I'm

this won't work?
bigA="AA#chr(11)#AA#chr(11)#AA";
replacedBigA=replace(bigA,chr(11),"zzzz","ALL");
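
if it turns out there's more than the vertical tab to clean up, here's a hedged java sketch that strips every control char below 32 except tab (9), newline (10) and carriage return (13). space (32) falls outside the range, so it survives too. the class name, method name and the exact character class are my assumptions about what you want to keep, not anything cf ships with.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the names are illustrative only.
class ControlCharStripper {
    // Control chars 0-8, 11, 12 and 14-31; tab, LF and CR are left alone.
    private static final Pattern UNWANTED =
            Pattern.compile("[\\x00-\\x08\\x0B\\x0C\\x0E-\\x1F]");

    // Replace every unwanted control char with the given replacement text.
    static String strip(String s, String replacement) {
        return UNWANTED.matcher(s).replaceAll(replacement);
    }

    public static void main(String[] args) {
        String bigA = "AA" + (char) 11 + "AA" + (char) 11 + "AA";
        System.out.println(strip(bigA, ""));
        System.out.println(strip(bigA, "zzzz"));
    }
}
```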
