Re: std.xml2 (collecting features) control character
On Thursday, 18 February 2016 at 21:53:24 UTC, Robert burner Schadek wrote:
> On Thursday, 18 February 2016 at 18:28:10 UTC, Alex Vincent wrote:
>> Regarding control characters: If you give me a complete sample file, I can run it through Mozilla's UTF stream conversion and/or XML parsing code (via either SAX or DOMParser) to tell you how that reacts as a reference. Mozilla supports XML 1.0, but not 1.1.
>
> thank you for making the effort
> https://github.com/burner/std.xml2/blob/master/tests/eduni/xml-1.1/out/010.xml

In this case, Firefox just passes the control characters through to the contentHandler.characters method:

    Starting runTest
    Retrieved source
    contentHandler.startDocument()
    contentHandler.startElement("", "foo", "foo", {})
    contentHandler.characters("\u0080")
    contentHandler.endElement("", "foo", "foo")
    contentHandler.endDocument()
    Done reading
Re: std.xml2 (collecting features) control character
On Friday, 19 February 2016 at 12:55:52 UTC, Kagamin wrote:
> http://dpaste.dzfl.pl/2f8a8ff10bde like this?

yes
Re: std.xml2 (collecting features) control character
On Friday, 19 February 2016 at 12:30:06 UTC, Robert burner Schadek wrote:
>     import std.conv : to;
>     ubyte[] arr = [0x3C, 0x66, 0x6F, 0x6F, 0x3E, 0xC2, 0x80, 0x3C, 0x2F, 0x66, 0x6F, 0x6F, 0x3E];
>     string s = cast(string) arr;
>     dstring ds = to!dstring(s);
>
> and see what happens

http://dpaste.dzfl.pl/2f8a8ff10bde like this?
Re: std.xml2 (collecting features) control character
On 2016-02-19 11:58, Kagamin via Digitalmars-d wrote:
> On Thursday, 18 February 2016 at 16:56:08 UTC, Robert burner Schadek wrote:
>> the hex dump is "3C 66 6F 6F 3E C2 80 3C 2F 66 6F 6F 3E"
>
> http://dpaste.dzfl.pl/80888ed31958 like this?

No, that program just takes the hex dump as a string. You would need to do something like:

    import std.conv : to;
    // the bytes of the hex dump, written as hex literals
    ubyte[] arr = [0x3C, 0x66, 0x6F, 0x6F, 0x3E, 0xC2, 0x80, 0x3C, 0x2F, 0x66, 0x6F, 0x6F, 0x3E];
    string s = cast(string) arr;   // reinterpret the bytes as UTF-8
    dstring ds = to!dstring(s);    // decode to UTF-32

and see what happens
Re: std.xml2 (collecting features) control character
On Thursday, 18 February 2016 at 16:56:08 UTC, Robert burner Schadek wrote:
> the hex dump is "3C 66 6F 6F 3E C2 80 3C 2F 66 6F 6F 3E"

http://dpaste.dzfl.pl/80888ed31958 like this?
Re: std.xml2 (collecting features) control character
On Thursday, 18 February 2016 at 18:28:10 UTC, Alex Vincent wrote:
> Regarding control characters: If you give me a complete sample file, I can run it through Mozilla's UTF stream conversion and/or XML parsing code (via either SAX or DOMParser) to tell you how that reacts as a reference. Mozilla supports XML 1.0, but not 1.1.

thank you for making the effort

https://github.com/burner/std.xml2/blob/master/tests/eduni/xml-1.1/out/010.xml
Re: std.xml2 (collecting features) control character
On Thursday, 18 February 2016 at 17:26:30 UTC, Adam D. Ruppe wrote:
> On Thursday, 18 February 2016 at 16:56:08 UTC, Robert burner Schadek wrote:
>> unix file says it is a utf8 encoded file, but no BOM is present.
>> the hex dump is "3C 66 6F 6F 3E C2 80 3C 2F 66 6F 6F 3E"
>
> Gah, I should have read this before replying... well, that does appear to be valid utf-8. Why is it throwing an exception then?
>
> I'm pretty sure that byte stream *is* actually well-formed xml 1.0 and should pass utf validation as well as the XML well-formedness check.

Regarding control characters: If you give me a complete sample file, I can run it through Mozilla's UTF stream conversion and/or XML parsing code (via either SAX or DOMParser) to tell you how that reacts as a reference. Mozilla supports XML 1.0, but not 1.1.
Re: std.xml2 (collecting features) control character
On Thursday, 18 February 2016 at 16:56:08 UTC, Robert burner Schadek wrote:
> unix file says it is a utf8 encoded file, but no BOM is present.
> the hex dump is "3C 66 6F 6F 3E C2 80 3C 2F 66 6F 6F 3E"

Gah, I should have read this before replying... well, that does appear to be valid utf-8. Why is it throwing an exception then?

I'm pretty sure that byte stream *is* actually well-formed xml 1.0 and should pass utf validation as well as the XML well-formedness check.
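
For reference, the C2 80 pair in that dump can be decoded by hand. The sketch below is only an illustration: decodeTwoByte is a made-up helper name, not part of Phobos or std.xml2, and it handles just the two-byte case. It is cross-checked against std.utf.decode.

```d
import std.utf : decode;

/// Hypothetical helper (not a Phobos function): decode one two-byte
/// UTF-8 sequence of the form 110xxxxx 10xxxxxx by hand.
dchar decodeTwoByte(ubyte lead, ubyte cont)
{
    assert((lead & 0xE0) == 0xC0, "lead byte must look like 110xxxxx");
    assert((cont & 0xC0) == 0x80, "continuation byte must look like 10xxxxxx");
    // five payload bits from the lead byte, six from the continuation byte
    return cast(dchar)(((lead & 0x1F) << 6) | (cont & 0x3F));
}

unittest
{
    // C2 80: (0x02 << 6) | 0x00 == 0x80, i.e. U+0080
    assert(decodeTwoByte(0xC2, 0x80) == 0x80);

    // Phobos agrees and consumes both code units
    string s = "\xC2\x80";
    size_t i = 0;
    assert(decode(s, i) == 0x80);
    assert(i == 2);
}
```

So the byte pair is the ordinary UTF-8 form of U+0080, which is consistent with the stream being valid UTF-8.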
Re: std.xml2 (collecting features) control character
On Thursday, 18 February 2016 at 16:54:10 UTC, Robert burner Schadek wrote:
> It does not, it has no prolog and therefore no EncodingInfo.

In that case, it needs to be valid UTF-8 or valid UTF-16, and it is a fatal error if there are any invalid bytes:

https://www.w3.org/TR/REC-xml/#charencoding

==
It is a fatal error if an XML entity is determined (via default, encoding declaration, or higher-level protocol) to be in a certain encoding but contains byte sequences that are not legal in that encoding. Specifically, it is a fatal error if an entity encoded in UTF-8 contains any ill-formed code unit sequences, as defined in section 3.9 of Unicode [Unicode]. Unless an encoding is determined by a higher-level protocol, it is also a fatal error if an XML entity contains no encoding declaration and its content is not legal UTF-8 or UTF-16.
==
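
A minimal sketch of that check for the UTF-8 case, assuming the raw bytes are already in memory. isWellFormedUtf8 is a name invented for this example; it only wraps std.utf.validate and does not attempt the UTF-16 or BOM detection the spec also covers.

```d
import std.utf : validate, UTFException;

/// Hypothetical helper: true if the bytes form well-formed UTF-8
/// according to std.utf.validate.
bool isWellFormedUtf8(const(ubyte)[] bytes)
{
    try
    {
        // reinterpret the bytes as UTF-8 code units; validate throws on bad input
        validate(cast(const(char)[]) bytes);
        return true;
    }
    catch (UTFException)
    {
        return false;
    }
}

unittest
{
    // the hex dump from this thread: <foo>, C2 80, </foo>
    assert(isWellFormedUtf8([0x3C, 0x66, 0x6F, 0x6F, 0x3E, 0xC2, 0x80,
                             0x3C, 0x2F, 0x66, 0x6F, 0x6F, 0x3E]));
    // a bare 0x80 is a continuation byte without a lead byte
    assert(!isWellFormedUtf8([0x3C, 0x66, 0x6F, 0x6F, 0x3E, 0x80]));
}
```

If the file contained a lone 80 byte rather than C2 80, this is the point where a parser would have to report a fatal error.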
Re: std.xml2 (collecting features) control character
On Thursday, 18 February 2016 at 16:54:10 UTC, Robert burner Schadek wrote:
> unix file says it is a utf8 encoded file, but no BOM is present.

the hex dump is "3C 66 6F 6F 3E C2 80 3C 2F 66 6F 6F 3E"
Re: std.xml2 (collecting features) control character
On Thursday, 18 February 2016 at 16:47:35 UTC, Adam D. Ruppe wrote:
> On Thursday, 18 February 2016 at 16:41:52 UTC, Robert burner Schadek wrote:
>> for instance, quite often I find <80> in tests that are supposed to be valid xml 1.0. they are invalid xml 1.1 though
>
> What char encoding does the document declare itself as?

It does not, it has no prolog and therefore no EncodingInfo.

unix file says it is a utf8 encoded file, but no BOM is present.
Re: std.xml2 (collecting features) control character
On Thursday, 18 February 2016 at 16:41:52 UTC, Robert burner Schadek wrote:
> for instance, quite often I find <80> in tests that are supposed to be valid xml 1.0. they are invalid xml 1.1 though

What char encoding does the document declare itself as?
Re: std.xml2 (collecting features) control character
for instance, quite often I find <80> in tests that are supposed to be valid xml 1.0. they are invalid xml 1.1 though
Re: std.xml2 (collecting features) control character
On Thursday, 18 February 2016 at 15:56:58 UTC, Robert burner Schadek wrote:
> When trying to validate/convert a utf string these lead to exceptions, because they are not valid utf characters.

That means the user didn't encode them properly... Which one specifically are you thinking of?

I'm pretty sure all those control characters have a spot in the Unicode space and can be properly encoded as UTF-8 (though I think even if they are properly encoded, some of them are illegal in XML anyway).

If they appear in another form, it is invalid and/or needs a charset conversion, which should be specified in the XML document itself.
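
To illustrate the point about encoding, here is a small sketch that runs a few control characters through std.utf.encode. encodeUtf8 is a hypothetical convenience wrapper written for this example, not an existing Phobos function.

```d
import std.utf : encode;

/// Hypothetical wrapper: return the UTF-8 code units for one code point.
string encodeUtf8(dchar c)
{
    char[4] buf;
    // encode fills buf and returns the number of code units written
    immutable len = encode(buf, c);
    return buf[0 .. len].idup;
}

unittest
{
    // C0 controls are single bytes
    assert(encodeUtf8(cast(dchar) 0x01) == "\x01");
    // C1 controls (U+0080 to U+009F) take two bytes, starting with C2
    assert(encodeUtf8(cast(dchar) 0x80) == "\xC2\x80");
    assert(encodeUtf8(cast(dchar) 0x9F) == "\xC2\x9F");
}
```

A raw 80 byte on its own would therefore be something other than UTF-8 (Latin-1 or Windows-1252, for instance) and would need a declared charset and a conversion step.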
Re: std.xml2 (collecting features) control character
While working on a new xml implementation I came across "control characters (CC)" [1]. When trying to validate/convert a utf string these lead to exceptions, because they are not valid utf characters. Unfortunately, some of these characters are allowed to appear in valid xml 1.* documents.

I currently see two options for how to go about it:

1. Do not allow CCs that do not work with existing functionality.
   Pros:
   * easy
   Cons:
   * the resulting xml implementation will not be xml 1.* complete

2. Add special cases to the existing functionality to handle CCs that are allowed in 1.0.
   Pros:
   * the resulting xml implementation will be xml 1.* complete
   Cons:
   * will make utf de/encoding slower, as I would need to add additional logic

Any other ideas, feedback?

[1] https://en.wikipedia.org/wiki/C0_and_C1_control_codes
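
One possible shape for the "which CCs are allowed" question is a per-code-point check that runs after ordinary UTF decoding, so the de/encoding routines stay untouched. This is only a sketch: the function names are invented for illustration, and the ranges are the Char production of XML 1.0 and the RestrictedChar production of XML 1.1.

```d
/// Hypothetical helper: XML 1.0 production [2] Char.
/// #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
bool isXml10Char(dchar c)
{
    return c == 0x9 || c == 0xA || c == 0xD
        || (c >= 0x20 && c <= 0xD7FF)
        || (c >= 0xE000 && c <= 0xFFFD)
        || (c >= 0x10000 && c <= 0x10FFFF);
}

/// Hypothetical helper: XML 1.1 production [2a] RestrictedChar.
/// These may appear in XML 1.1 only as character references.
bool isXml11RestrictedChar(dchar c)
{
    return (c >= 0x1 && c <= 0x8) || (c >= 0xB && c <= 0xC)
        || (c >= 0xE && c <= 0x1F) || (c >= 0x7F && c <= 0x84)
        || (c >= 0x86 && c <= 0x9F);
}

unittest
{
    // U+0080 is a legal literal character in 1.0 but restricted in 1.1
    assert(isXml10Char(cast(dchar) 0x80));
    assert(isXml11RestrictedChar(cast(dchar) 0x80));
    // most C0 controls are not XML 1.0 characters at all
    assert(!isXml10Char(cast(dchar) 0x01));
    // NEL (U+0085) is allowed literally in both versions
    assert(isXml10Char(cast(dchar) 0x85));
    assert(!isXml11RestrictedChar(cast(dchar) 0x85));
}
```

With a check like this the cost is paid by the xml parser per decoded character, not by every user of the utf functions.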