S. Isaac Dealey wrote:
I just noticed today (or rather, John Ashenfelter pointed out to me :)
that some regular expression engines include \u to match unicode
characters.
that's what unicode calls its level 1 support. bare butt minimum. i'm
weak w/regex so take all of the following w/a grain of salt. i think
regex in cf is based on perl, which is described as having some unicode
support via \p, \P, \X, and \u, none of which cf seems to support.
So I did a couple tests on some data we had in a legacy database (it
doesn't have any nvarchar columns... yet) and it turns out that using
the \u pattern in a regular expression will mangle the hell out of
an ASCII string... I'm also assuming (and I may be way off base) that
why can't you use chr()?
using #chr(1-32)# (for a nonprinting ascii character) won't always
match the same character in a unicode string. (I would think it
depends on whether or not the string uses single or double-byte for
the individual character, since I recall reading that unicode doesn't
always use double-byte representation, but I suspect the regex engine
does if you use \u.)
you're thinking about unicode transforms where the number of bytes can
vary per char or maybe in some now rare cases of creating graphemes via
combining chars: a + combining acute (U+0301) = á (latin small a w/acute). if
you're dealing w/this internally i wouldn't think that would be an issue.
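
for what it's worth, the combining-char case can be folded back into single code points with core java's java.text.Normalizer. a minimal sketch, assuming the string is already in memory as a java String; the class and method names (ComposeDemo, compose) are made up for illustration:

```java
import java.text.Normalizer;

class ComposeDemo {
    // Compose base letters plus combining marks into precomposed
    // code points where Unicode defines one (normalization form NFC).
    static String compose(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        // 'a' followed by U+0301 COMBINING ACUTE ACCENT: two chars, one grapheme
        String decomposed = "a\u0301";
        String composed = compose(decomposed);

        System.out.println(decomposed.length());        // 2
        System.out.println(composed.length());          // 1
        System.out.println(composed.equals("\u00E1"));  // true
    }
}
```

once both sides of a comparison are in the same normalization form, a plain character match behaves the way you'd expect.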
1) Does the regex engine in CF 6-7 support \u?
apparently not.
bigA="A";
p=refind("\u0041",bigA);
p returns 0.
i think if you need to use that kind of notation then maybe try core
java's java.util.regex. icu4j also has some nifty search classes that
might be useful.
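
a rough sketch of the java.util.regex route, which does understand \uXXXX and \p{...} in patterns. the helper name (find) and its 1-based return value are my own choices to mirror refind(), not anything cf provides:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnicodeEscapeDemo {
    // Return the 1-based position of the first match, or 0 if there is
    // none, mimicking the convention refind() uses.
    static int find(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.start() + 1 : 0;
    }

    public static void main(String[] args) {
        // "\\u0041" reaches the regex engine as the six characters \u0041
        System.out.println(find("\\u0041", "A"));     // 1
        // \p{Lu} matches any uppercase letter
        System.out.println(find("\\p{Lu}", "abcD"));  // 4
    }
}
```

from cf the same classes should be reachable via createObject("java", "java.util.regex.Pattern"), though i haven't verified that on every cf 6-7 build.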
2) am I wrong about #chr(1-32)#? Will it always match the same
character in a UTF-8 string?
unless you're using EBCDIC?? the original ASCII stuff went straight into
unicode as is to keep the adoption uproar to a minimum.
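
a quick way to convince yourself: encode code points 1-32 as UTF-8 and look at the bytes. this is only a sketch (AsciiInUtf8Demo and utf8 are illustrative names), but it shows each one comes out as a single byte equal to its ASCII value:

```java
import java.nio.charset.StandardCharsets;

class AsciiInUtf8Demo {
    // Encode a single code point as UTF-8 and return the raw bytes.
    static byte[] utf8(int codePoint) {
        return new String(Character.toChars(codePoint)).getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Every code point from 1 to 32 should be one byte with the same value.
        for (int cp = 1; cp <= 32; cp++) {
            byte[] b = utf8(cp);
            if (b.length != 1 || b[0] != cp) {
                throw new IllegalStateException("unexpected encoding for " + cp);
            }
        }
        System.out.println(utf8(11).length);    // 1 (vertical tab)
        System.out.println(utf8(0xE1).length);  // 2 (latin small a w/acute)
    }
}
```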
3) if CF 6+ supports \u and #chr(1-32)# won't always match the
unicode equivalent, is there a way to test a string in CF to determine
if it's unicode (digging into Java maybe)?
icu4j has charset detector, core java's java.util.regex might help too.
but forensically determining a charset isn't pretty nor always correct.
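
if you still have the raw bytes, one cheap check with core java alone is a strict UTF-8 decode. a sketch under that assumption (Utf8Check and isValidUtf8 are hypothetical names); note it can only rule UTF-8 out, since pure ASCII and plenty of other byte sequences pass too:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class Utf8Check {
    // True if the bytes decode as UTF-8 without any malformed sequence.
    // Passing does not prove the data really is UTF-8.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] utf8 = {(byte) 0xC3, (byte) 0xA1};  // latin small a w/acute in UTF-8
        byte[] latin1 = {(byte) 0xE1};             // same letter in ISO-8859-1
        byte[] ascii = {0x41, 0x41};               // "AA", valid in both

        System.out.println(isValidUtf8(utf8));    // true
        System.out.println(isValidUtf8(latin1));  // false
        System.out.println(isValidUtf8(ascii));   // true
    }
}
```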
exception of tab (9), newline (10,13) and space (32) characters. This
database has some data containing vertical tabs (11) which I'm
this won't work?
bigA="AA#chr(11)#AA#chr(11)#AA";
replacedBigA=replace(bigA,chr(11),"","ALL");
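
if the goal is to drop every control character from 1 to 31 except tab (9) and the newline pair (10, 13) in one pass, a character class does it. here's a sketch in java, since its regex engine takes \x escapes directly; StripControls and strip are names i made up:

```java
class StripControls {
    // Remove ASCII control characters 1-31, keeping tab (9), LF (10)
    // and CR (13). Space (32) is outside the class, so it is kept too.
    static String strip(String s) {
        return s.replaceAll("[\\x01-\\x08\\x0B\\x0C\\x0E-\\x1F]", "");
    }

    public static void main(String[] args) {
        String vt = String.valueOf((char) 11);  // vertical tab
        String bigA = "AA" + vt + "AA" + vt + "AA";

        System.out.println(strip(bigA));                         // AAAAAA
        System.out.println(strip("A\tB C").equals("A\tB C"));    // true
    }
}
```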