Re: Bad character crashing cfxml

2006-04-09 Thread Rob Wilkerson
I opt to store user-defined data in CDATA blocks:


 


Another route is to use the XMLFormat() function.

On 4/9/06, Josh Nathanson <[EMAIL PROTECTED]> wrote:
> Hello All,
>
> When creating an xml document using cfxml, I get the following error:
>
> ---
> An error occured while Parsing an XML document.
> An invalid XML character (Unicode: 0x1c) was found in the element content of
> the document.
> --
>
> The values in question are text strings entered by users and could contain
> tabs, carriage returns, line feeds etc.  I've tried stripping those out
> using Replace but no luck, and tried XmlFormat(variable) with no luck.
>
> When I look up the unicode character 0x1c00, it says it's unassigned.  I
> don't see anything weird in the string values that might be causing the
> error.
>
> Any thoughts/ideas are appreciated.
>
> -- Josh Nathanson
>
>
> 

~|
Message: http://www.houseoffusion.com/lists.cfm/link=i:4:237309
Archives: http://www.houseoffusion.com/cf_lists/threads.cfm/4
Subscription: http://www.houseoffusion.com/lists.cfm/link=s:4
Unsubscribe: http://www.houseoffusion.com/cf_lists/unsubscribe.cfm?user=89.70.4
Donations & Support: http://www.houseoffusion.com/tiny.cfm/54


Re: Bad character crashing cfxml

2006-04-09 Thread Ryan Guill
use cdata?

If you ever have characters that could make the xml mis-formed,
especially all user entered data, put it in a cdata.






That will parse just fine.

http://www.w3schools.com/xml/xml_cdata.asp




Re: Bad character crashing cfxml

2006-04-09 Thread S . Isaac Dealey
> I opt to store user-defined data in CDATA blocks:

> 
>  
> 

> Another route is to use the XMLFormat() function.

Unfortunately the XMLFormat() function doesn't handle non-printing
characters. I believe the character 0x1C or "1C" is ASCII character
28 (the file separator), with characters 0-31 being non-printing
control characters. All of these with the exception of tab (9), line
feed (10) and carriage return (13) are invalid in an XML document.
The space is character 32 (hex 20) and is fine.

I'm not sure if the CDATA segment will allow you to use these
characters or if XML simply expects you to find your own method of
handling them that conforms to the standard of not including
non-printing characters. I'm reasonably certain that it rejects the
entities (&#7; for example would technically be the ASCII beep code)
and since that would prevent their use in attributes, my guess would
be they are also not allowed in CDATA segments.

This is all guess-work on my part, I haven't memorized the spec or
anything. :)
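
For reference, the XML 1.0 spec allows tab (0x9), line feed (0xA), carriage return (0xD) and everything from 0x20 up, minus the surrogate block and 0xFFFE/0xFFFF. Below is a minimal sketch of that rule, written in Python rather than CFML so it can be run on its own. The function names are invented for illustration. The parser is the one bundled with Python, not the one ColdFusion uses, so treat the result as a hint about what a conforming parser does, not as proof of CF's behavior. It tries 0x1C three ways: raw, inside CDATA, and as a numeric character reference.

```python
import xml.etree.ElementTree as ET


def is_xml10_char(ch):
    """True if ch is allowed by the XML 1.0 'Char' production."""
    cp = ord(ch)
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def parses(xml_text):
    """True if the standard-library parser accepts xml_text."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


bad = chr(0x1C)  # decimal 28, the ASCII "file separator"

raw_ok = parses("<item>a" + bad + "b</item>")
cdata_ok = parses("<item><![CDATA[a" + bad + "b]]></item>")
charref_ok = parses("<item>a&#28;b</item>")
```

With this parser all three attempts are rejected, which matches the guess above that CDATA does not make a control character legal.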

That being the case, when I found vertical tabs in some of my data
which needed to be embedded in XML documents, I created a wrapper
function for the native XMLFormat() function which removes all
non-printing characters except tab (9), line feed (10) and carriage
return (13).

Here is the code I used to accomplish this task (sorry for the
line-wrap):

function xmlstring(mystring) {
  // remove any non-printing characters (1-8, 11-12, 14-31)
  // with the exception of tabs and line-breaks
  // (the pattern must stay on one line, or the line break
  // becomes part of the character class)
  mystring = rereplace(mystring,"[#chr(1)#-#chr(8)##chr(11)#-#chr(12)##chr(14)#-#chr(31)#]","","ALL");
  // replace the &apos; entity produced by xmlformat() with
  // the character-code entity, because the xml parser
  // in early versions of ColdFusion MX doesn't
  // understand the &apos; entity
  return replacenocase(xmlformat(mystring),"&apos;","&##39;","ALL");
}
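
For anyone who wants to try the same two-step idea outside ColdFusion, here is a rough Python equivalent (Python only so it can be run standalone, not a translation of any CF internals). The name `xml_string` is made up for this sketch, and it also strips character 0, which the function above leaves alone. The point is the order: strip the illegal control characters first, then escape the markup characters.

```python
import re
from xml.sax.saxutils import escape

# Control characters that XML 1.0 forbids: everything below 0x20
# except tab (0x09), line feed (0x0A) and carriage return (0x0D).
_ILLEGAL = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def xml_string(value):
    """Strip illegal control characters, then escape XML markup."""
    cleaned = _ILLEGAL.sub("", value)
    # escape() handles &, < and > by default; the extra mapping covers
    # the quote characters. &#39; is used instead of &apos;, mirroring
    # the workaround in the function above.
    return escape(cleaned, {'"': "&quot;", "'": "&#39;"})
```

Escaping alone would leave the 0x1C in place, and stripping alone would leave a bare & or <, so neither step is enough by itself.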

hth


s. isaac dealey 434.293.6201
new epoch : isn't it time for a change?

add features without fixtures with
the onTap open source framework

http://www.fusiontap.com
http://coldfusion.sys-con.com/author/4806Dealey.htm




Re: Bad character crashing cfxml

2006-04-09 Thread Josh Nathanson
Thanks guys.  I did try the XmlFormat function (without the additional 
rereplace stuff) and it still didn't work.  I'll let you know shortly if the 
CDATA works, that's what I'll try next.  If THAT doesn't work, I'll try the 
rereplace method S. Isaac suggested.

-- Josh








Re: Bad character crashing cfxml

2006-04-09 Thread Denny Valliant
> mystring =
rereplace(mystring,"[#chr(1)#-#chr(8)##chr(11)#-#chr(12)##chr(14)#-#chr(31)#]","","ALL");

The only problem with this is that you have to add stuff that you want
removed.

You could also list what's allowed, vs. what's not allowed, like

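<cffunction name="..." output="false">
  <cfargument name="stringToClean" type="string">
  <cfreturn rereplacenocase(stringToClean,'[^a-z|A-Z|0-9|_]','','all')>
</cffunction>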

Which you never need to add to unless you need to add to it, so to speak.
Only useful if you don't know beforehand what's gonna get entered I
guess... um... :-P Whatever. I go now.
:Den




Re: Bad character crashing cfxml

2006-04-09 Thread S . Isaac Dealey
>> mystring = rereplace(mystring,"[#chr(1)#-#chr(8)##chr(11)#-#chr(12)##chr(14)#-#chr(31)#]","","ALL");

> The only problem with this is that that you have to add
> stuff that you want
> removed.

> You could also list what's allowed, vs. what's not
> allowed, like

> <cffunction name="..." output="false">
>  <cfargument name="stringToClean" type="string">
>  <cfreturn rereplacenocase(stringToClean,'[^a-z|A-Z|0-9|_]','','all')>
> </cffunction>

> Which you never need to add to unless you need to add to
> it, so to speak.
> Only useful if you don't know before hand what's gonna get
> entered i
> guess... um... :-P Whatever. I go now.
> :Den

Heh... Well, there is that. :)

The reason why I went with the removal of characters that aren't
allowed in XML is because most punctuation marks and Unicode
characters above Z, including both accented Latin characters
(umlauts, etc.) and non-Latin characters such as Kanji, are also
allowed. I was trying to ensure a solution which would work
universally, and so I didn't want to inadvertently remove any
character that would be legal.

Incidentally, you only need to specify a-z or A-Z when using
rereplacenocase (if using rereplace then you need both). Also the
| characters aren't necessary within the character class denoted with
[] and actually will behave differently because | is not a special
character in that context. In this expression it will add | to the
list of characters not removed, as opposed to separating
sub-expressions, which is what the | character does within
parentheses (each pair of which represents one or more subexpressions,
separated by pipes, i.e. (mary|joe) matches either "mary" or "joe").
So, rereplacenocase(string,'[^a-z0-9_]','','all') will have the effect
you intended.
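
The pipe behavior is easy to check in any regex engine. Here is a small sketch in Python (not CFML, just so it runs standalone; the sample string and variable names are made up). The `re.IGNORECASE` flag stands in for rereplacenocase, and ColdFusion's own engine may differ in details.

```python
import re

sample = "Zoë|Smith-Jones_42"

# Pipes inside [] are literal characters, so this whitelist keeps "|" too.
with_pipes = re.sub(r"[^a-z|A-Z|0-9|_]", "", sample)

# With a case-insensitive match, a single a-z range is enough.
without_pipes = re.sub(r"[^a-z0-9_]", "", sample, flags=re.IGNORECASE)

# Outside a character class, | separates alternatives.
alternation = re.fullmatch(r"(mary|joe)", "joe") is not None
```

Both whitelists also drop the "ë" and the hyphen, which is the trade-off mentioned above: a list of allowed characters removes anything legal that you forgot to put on it.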

Hope that's useful to you. :)

s. isaac dealey 434.293.6201
new epoch : isn't it time for a change?

add features without fixtures with
the onTap open source framework

http://www.fusiontap.com
http://coldfusion.sys-con.com/author/4806Dealey.htm




Re: Bad character crashing cfxml

2006-04-10 Thread Jochem van Dieten
Josh Nathanson wrote:
> 
> ---
> An error occured while Parsing an XML document.
> An invalid XML character (Unicode: 0x1c) was found in the element content of 
> the document.
> --
> 
> The values in question are text strings entered by users and could contain 
> tabs, carriage returns, line feeds etc.  I've tried stripping those out 
> using Replace but no luck, and tried XmlFormat(variable) with no luck.
> 
> When I look up the unicode character 0x1c00, it says it's unassigned.

Your user is probably entering ISO-8859-x and you are trying to 
process it as Unicode. Make sure your charsets match along the 
whole process.

Jochem



Re: Bad character crashing cfxml

2006-04-10 Thread Bryan Stevenson
I found CFXML a real pain to get all the whitespace etc out...even with 
CFSILENT 
..or even worse...all on one line.  I opted for the object method for 
producing XML...less obvious but far tighter IMHO

Bryan Stevenson B.Comm.
VP & Director of E-Commerce Development
Electric Edge Systems Group Inc.
phone: 250.480.0642
fax: 250.480.1264
cell: 250.920.8830
e-mail: [EMAIL PROTECTED]
web: www.electricedgesystems.com 




SOLVED Re: Bad character crashing cfxml

2006-04-10 Thread Josh Nathanson
All, thanks for the help with this issue.  Using CDATA didn't work as
apparently CF does parse whatever's within the CDATA block and was still
choking on the bad character.  I ultimately implemented the rereplace
suggested by S. Isaac, and after some tweaking it worked.  Here's the final
code:


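<cfset itemNameClean = rereplace(itemName,"[#chr(1)#-#chr(8)#-#chr(11)#-#chr(12)#-#chr(14)#-#chr(28)#-#chr(29)#-#chr(31)#-#chr(38)#]","","ALL")>
#itemNameClean#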

I had to add in chr(28) and chr(29) into the rereplace as those are the
equivalents to Unicode 0x1c and 0x1d which were the bad characters that the
user had entered somehow.  Also added chr(38) which is the '&' character,
also a baddie.

-- Josh






Re: SOLVED Re: Bad character crashing cfxml

2006-04-10 Thread S . Isaac Dealey
> All, thanks for the help with this issue.  Using CDATA
> didn't work as apparently CF does parse whatever's
> within the CDATA block and was still choking on the
> bad character.  I ultimately implemented the rereplace
> suggested by S. Isaac, and after some tweaking it
> worked. Here's the final code:

> <cfset itemNameClean = rereplace(itemName,"[#chr(1)#-#chr(8)#-#chr(11)#-#chr(12)#-#chr(14)#-#chr(28)#-#chr(29)#-#chr(31)#-#chr(38)#]","","ALL")>
> #itemNameClean#

> I had to add in chr(28) and chr(29) into the rereplace
> as those are the equivalents to Unicode 0x1c and 0x1d
> which were the bad characters that the user had entered
> somehow.  Also added chr(38) which is the '&' character,
> also a baddie.

> -- Josh

Hi Josh,

Without wanting to sound critical, I think you may need a little more
testing before declaring this issue resolved. You seem to have some
extra hyphens in the expression here, that's one issue (you just need
a bit of a primer on regex I think), and another issue is that you're
handling the & character without also handling the > (&gt;), <
(&lt;) or " (&quot;) characters, which means if a user enters any of
those into the string, they will also cause problems. This was the
reason why my implementation of it had used both XMLFormat() and the
regular expression, because neither one of them independently solved
the whole problem.

I'll let you decide about the extra special xml characters. :)

As to the regular expression, here's the explanation of where I see
the problem:

The original expression here:

[#chr(1)#-#chr(8)##chr(11)#-#chr(12)##chr(14)#-#chr(31)#]

is similar to

[a-zA-Z0-9]

Notice in this expression that there are two places where there is no
hyphen between two characters, both at "zA" and at "Z0". This is
because of the way the hyphen is interpreted within the class
designated by the [ and ] characters. The class itself tells the
regular expression engine to match any character within the class, so
[ab] will match the letter "a" or the letter "b". The hyphen then
allows you to specify a range of characters (in ASCII or Unicode
numeric order), so that [a-c] will match the letter "a" or the letter
"b" or the letter "c". The reason why many people use a-zA-Z instead
of A-z is because when you look at an ASCII table, there are several
non-alpha characters ([, \, ], ^, _ and `) between the letter "Z" and
the letter "a". Upper-case comes first in ASCII, so a-Z would be a
backwards range.

Now, in your expression above, you've added several hyphens, so your
expression is roughly equivalent of

[a-z-A-Z-0-9]

Off hand, since I haven't tested it, I don't know if this will produce
the same result. At a minimum, my expectation would be that it would
add the hyphen to the list of characters being removed, because I've
been able to include hyphens in a character class before, such as
[-0-9]. Since I'm guessing you want to allow users to use hyphens, I'm
thinking you don't want that to happen. On the other hand it could
potentially add other legal characters (9,10,13 and 32-37) to the list
of characters that are removed. You'll have to test it to know exactly
how it behaves.
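
As a quick way to run that test, here is a sketch in Python's regex engine (not ColdFusion's, so the result is only indicative; the sample string and variable names are made up). The first pattern is the class with the extra hyphens written with \x escapes, the second is the class as originally intended.

```python
import re

# The class with extra hyphens: chr(1)-chr(8)-chr(11)-chr(12)-chr(14)-
# chr(28)-chr(29)-chr(31)-chr(38), written with \x escapes.
extra_hyphens = re.compile(r"[\x01-\x08-\x0b-\x0c-\x0e-\x1c-\x1d-\x1f-\x26]")

# The intended class: control characters except tab, LF and CR.
intended = re.compile(r"[\x01-\x08\x0b\x0c\x0e-\x1f]")

sample = "well-known\x1c A&B\n"

cleaned_extra = extra_hyphens.sub("", sample)
cleaned_intended = intended.sub("", sample)
```

In this engine each hyphen that follows a complete range is taken as a literal, so the first pattern strips "-" and "&" along with the control characters, while the space and the line feed survive. The second pattern only removes the 0x1C. Another engine could read the same class differently, so it is still worth running against CF itself.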

hth

s. isaac dealey 434.293.6201
new epoch : isn't it time for a change?

add features without fixtures with
the onTap open source framework

http://www.fusiontap.com
http://coldfusion.sys-con.com/author/4806Dealey.htm

