testing a string for unicode characters

2005-12-08 Thread S. Isaac Dealey
Strange question, I'm hoping somebody knows...

I just noticed today (or rather, John Ashenfelter pointed out to me :)
that some regular expression engines include \u to match unicode
characters.

So I did a couple tests on some data we had in a legacy database (it
doesn't have any nvarchar columns... yet) and it turns out that using
the \u pattern in a regular expression will mangle the hell out of
an ASCII string... I'm also assuming (and I may be way off base) that
using #chr(1-32)# (for a nonprinting ascii character) won't always
match the same character in a unicode string. (I would think it
depends on whether or not the string uses single or double-byte for
the individual character, since I recall reading that unicode doesn't
always use double-byte representation, but I suspect the regex engine
does if you use \u.)

So this brings up a couple of questions:

1) Does the regex engine in CF 6-7 support \u?

2) am I wrong about #chr(1-32)#? Will it always match the same
character in a UTF-8 string?

3) if CF 6+ supports \u and #chr(1-32)# won't always match the
unicode equivalent, is there a way to test a string in CF to determine
if it's unicode (digging into Java maybe)?


The reason I need this info. is because I also just realized that all
the non-printing characters aren't valid in an XML document with the
exception of tab (9), newline (10,13) and space (32) characters. This
database has some data containing vertical tabs (11) which I'm
guessing were pasted from MS Word, and as a result is liable to be a
recurring problem, so I need to find a way to strip these characters
reliably from a string without mangling the string.


s. isaac dealey 434.293.6201
new epoch : isn't it time for a change?

add features without fixtures with
the onTap open source framework

http://www.fusiontap.com
http://coldfusion.sys-con.com/author/4806Dealey.htm


~|
Discover CFTicket - The leading ColdFusion Help Desk and Trouble 
Ticket application

http://www.houseoffusion.com/banners/view.cfm?bannerid=48

Message: http://www.houseoffusion.com/lists.cfm/link=i:4:226593
Archives: http://www.houseoffusion.com/cf_lists/threads.cfm/4
Subscription: http://www.houseoffusion.com/lists.cfm/link=s:4
Unsubscribe: http://www.houseoffusion.com/cf_lists/unsubscribe.cfm?user=89.70.4
Donations & Support: http://www.houseoffusion.com/tiny.cfm/54


Re: testing a string for unicode characters

2005-12-08 Thread Paul Hastings
S. Isaac Dealey wrote:
> I just noticed today (or rather, John Ashenfelter pointed out to me :)
> that some regular expression engines include \u to match unicode
> characters.

that's what unicode calls its level 1 support. bare butt minimum. i'm 
weak w/regex so take all of the following w/a grain of salt. i think 
regex in cf is based on perl which is described w/some unicode support 
via \p, \P, \X, and \u none of which cf seems to support.

> So I did a couple tests on some data we had in a legacy database (it
> doesn't have any nvarchar columns... yet) and it turns out that using
> the \u pattern in a regular expression will mangle the hell out of
> an ASCII string... I'm also assuming (and I may be way off base) that

why can't you use chr()?

> using #chr(1-32)# (for a nonprinting ascii character) won't always
> match the same character in a unicode string. (I would think it
> depends on whether or not the string uses single or double-byte for
> the individual character, since I recall reading that unicode doesn't
> always use double-byte representation, but I suspect the regex engine
> does if you use \u.)

you're thinking about unicode transforms where the number of bytes can 
vary per char or maybe in some now rare cases of creating graphemes via 
combining chars: a + combining acute accent = á (latin small a w/acute). if 
you're dealing w/this internally i wouldn't think that would be an issue.
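
For anyone who wants to see the combining-character case concretely, here is a minimal Java sketch. It assumes a JVM with java.text.Normalizer (Java 6 or later, so newer than what CF 6-7 shipped with), and the class and method names are made up for illustration. It shows that "a" followed by the combining acute accent (U+0301) is two chars until it is normalized to the single precomposed character U+00E1.

```java
import java.text.Normalizer;

// Hypothetical demo class; not part of CF or of anything in this thread.
final class CombiningDemo {
    // Compose base letters and combining marks into precomposed characters (NFC).
    static String compose(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        String decomposed = "a\u0301";              // 'a' + combining acute accent
        String composed = compose(decomposed);
        System.out.println(decomposed.length());    // prints 2
        System.out.println(composed.length());      // prints 1
        System.out.println(composed.equals("\u00E1")); // prints true
    }
}
```

None of this touches the control characters below 32: they have no decompositions, so normalization leaves them alone.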

> 1) Does the regex engine in CF 6-7 support \u?

apparently not.
bigA="A";
p=refind("\u0041",bigA);
p returns 0.

i think if you need to use that kind of notation then maybe try core 
java's java.util.regex. icu4j also has some nifty search classes that 
might be useful.
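
As a rough illustration of the java.util.regex route, the sketch below runs the same kind of test directly in Java. The class name and the find() helper are hypothetical; the helper just returns a 1-based position (0 for no match) so the result reads like refind(). java.util.regex.Pattern does accept four-hex-digit unicode escapes in a pattern.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the helper mimics refind()'s 1-based result.
final class UnicodeEscapeDemo {
    // Position of the first match (1-based), or 0 when nothing matches.
    static int find(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.start() + 1 : 0;
    }

    public static void main(String[] args) {
        // The doubled backslash hands the escape to the regex engine untouched.
        System.out.println(find("\\u0041", "A"));          // prints 1
        // Vertical tab (code point 11) found by its hex escape.
        System.out.println(find("\\u000B", "AA\u000BAA")); // prints 3
    }
}
```

From CF the same classes should be reachable through createObject("java", ...), though that call is not shown or tested here.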

> 2) am I wrong about #chr(1-32)#? Will it always match the same
> character in a UTF-8 string?

unless you're using EBCDIC?? the original ASCII stuff went straight into 
unicode as is to keep the adoption uproar to a minimum.
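
A quick way to convince yourself of this is to encode the characters both ways and compare the bytes. The sketch below is only an illustration (the class and method names are invented, and StandardCharsets needs Java 7 or later): it checks that each of the code points 1 through 32 comes out as the same single byte in US-ASCII and in UTF-8.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class comparing the ASCII and UTF-8 encodings of a code point.
final class AsciiInUtf8Demo {
    // True when cp encodes to one identical byte in both US-ASCII and UTF-8.
    static boolean sameByteInBoth(int cp) {
        String s = new String(Character.toChars(cp));
        byte[] ascii = s.getBytes(StandardCharsets.US_ASCII);
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        return ascii.length == 1 && utf8.length == 1
                && ascii[0] == utf8[0] && utf8[0] == (byte) cp;
    }

    // True when every code point from 1 to 32 passes the check above.
    static boolean controlRangeMatches() {
        for (int cp = 1; cp <= 32; cp++) {
            if (!sameByteInBoth(cp)) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(controlRangeMatches());  // prints true
        System.out.println(sameByteInBoth(0xE9));   // prints false (not ASCII, two bytes in UTF-8)
    }
}
```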

> 3) if CF 6+ supports \u and #chr(1-32)# won't always match the
> unicode equivalent, is there a way to test a string in CF to determine
> if it's unicode (digging into Java maybe)?

icu4j has charset detector, core java's java.util.regex might help too. 
but forensically determining a charset isn't pretty nor always correct.
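
Charset detection is about raw bytes. Once the data is already a string inside the JVM (which is what CF 6+ works with), every string is a sequence of UTF-16 chars, so the practical question is usually "does this contain anything outside 7-bit ASCII?". A minimal sketch of that narrower check, with a made-up class and method name:

```java
// Hypothetical helper; it answers "any char above 127?", not "which charset was this?".
final class NonAsciiCheck {
    // True when the string holds at least one char outside the 7-bit ASCII range.
    static boolean hasNonAscii(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) > 127) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(hasNonAscii("plain text"));  // prints false
        System.out.println(hasNonAscii("caf\u00E9"));   // prints true
    }
}
```

This says nothing about how the data was stored in the database or which encoding the driver used to read it.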

> exception of tab (9), newline (10,13) and space (32) characters. This
> database has some data containing vertical tabs (11) which I'm

this won't work?
bigA="AA#chr(11)#AA#chr(11)#AA";
replacedBigA=replace(bigA,chr(11),"","ALL");
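
If more than just chr(11) turns up, a more general version of the same idea is to drop every character below 32 except tab (9), line feed (10) and carriage return (13), which are the ones XML 1.0 allows. Below is a minimal Java sketch of that filter; the class and method names are hypothetical, and it deliberately handles only the low control characters discussed in this thread, not the other code points XML excludes (lone surrogates, U+FFFE, U+FFFF).

```java
// Hypothetical helper; removes only the control chars below 32 that XML 1.0 rejects.
final class XmlControlStripper {
    // Keep tab, line feed, carriage return and everything from space (32) upward.
    static String strip(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c >= 32 || c == 9 || c == 10 || c == 13) {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(strip("AA\u000BAA\u000BAA"));  // prints AAAAAA
    }
}
```

Because it copies every other character through untouched, accented or other non-ASCII text in the string is not affected.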
