Bug#500015: Cannot parse feed containing SOH character

Matt Kraai Wed, 24 Sep 2008 19:33:58 -0700

On Wed, Sep 24, 2008 at 10:12:41AM -0700, Rodrigo Gallardo wrote:
> > The feed at
> > 
> >  http://jc.ngo.org.uk/~nik/use.perl.journals.rss
> > 
> > currently contains a SOH character (i.e., the 0x01 character).  When I
> > click on it in Liferea, it displays the following error message:
> > 
> >  XML Parsing Error: reference to invalid character number
> >  Location: file:///
> >  Line Number 20, Column 45:
> > 
> >  <pre>Aha. On the line 580 of that we have a &#x1; character. Which leads me to
> >  --------------------------------------------^
> > 
> > The feed has a UTF-8 encoding declaration and the SOH character is a
> > valid Unicode character, so I think this error is in error.
> 
> As a matter of fact, the XML spec says 
> (http://www.w3.org/TR/REC-xml/#dt-character)
> that
> 
> Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
> 
> so &#x1; is not a valid char for an XML document.


I don't think this is a correct inference.  In
http://www.w3.org/TR/REC-xml/#charsets, it says

 Consequently, XML processors MUST accept any character in the range
 specified for Char.

 Character Range

 [2] Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] |     /* any Unicode character,
              [#xE000-#xFFFD] |                     excluding the surrogate
              [#x10000-#x10FFFF]                    blocks, FFFE, and FFFF. */

but it doesn't specify that it must accept *only* characters in that
range.  In fact, the next paragraph states

 All XML processors MUST accept the UTF-8 and UTF-16 encodings of
 Unicode 3.1 ...

In http://www.unicode.org/Public/3.1-Update/UnicodeData-3.1.0.txt, the
list of Unicode 3.1 characters, the SOH character is the second entry.
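
The two questions in play here (is a code point inside the Char
production quoted above, and does a given parser accept a reference to
it) can be checked mechanically. What follows is only a minimal Python
sketch: the helper names is_xml_char and parses are hypothetical, and
Python's bundled expat-based parser is used as a stand-in, which is not
necessarily the parser Liferea uses.

```python
import xml.etree.ElementTree as ET

# Ranges copied from production [2] Char as quoted in this message.
XML_CHAR_RANGES = (
    (0x9, 0x9), (0xA, 0xA), (0xD, 0xD),
    (0x20, 0xD7FF), (0xE000, 0xFFFD), (0x10000, 0x10FFFF),
)


def is_xml_char(code_point):
    """Return True if code_point falls in one of the Char ranges."""
    return any(lo <= code_point <= hi for lo, hi in XML_CHAR_RANGES)


def parses(document):
    """Return True if the expat-based ElementTree parser accepts document.

    This reports the behaviour of one parser only; it does not settle
    what the specification requires.
    """
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False


if __name__ == "__main__":
    print(is_xml_char(0x01))   # SOH, the character in the feed
    print(is_xml_char(0x09))   # TAB, explicitly listed in Char
    print(parses("<pre>a &#x1; character</pre>"))
    print(parses("<pre>a &#x9; character</pre>"))
```

Running it shows whether SOH is in the listed ranges and how this one
parser treats a character reference to it, which can be compared with
the error message Liferea displays.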

-- 
Matt                                                 http://ftbfs.org/


