Re: std.xml2 (collecting features) control character

2016-02-19 Thread Alex Vincent via Digitalmars-d
On Thursday, 18 February 2016 at 21:53:24 UTC, Robert burner 
Schadek wrote:
On Thursday, 18 February 2016 at 18:28:10 UTC, Alex Vincent 
wrote:
Regarding control characters:  If you give me a complete 
sample file, I can run it through Mozilla's UTF stream 
conversion and/or XML parsing code (via either SAX or 
DOMParser) to tell you how that reacts as a reference.  
Mozilla supports XML 1.0, but not 1.1.


Thank you for making the effort.

https://github.com/burner/std.xml2/blob/master/tests/eduni/xml-1.1/out/010.xml


In this case, Firefox just passes the control characters through 
to the contentHandler.characters method:


Starting runTest
Retrieved source
contentHandler.startDocument()
contentHandler.startElement("", "foo", "foo", {})
contentHandler.characters("\u0080")
contentHandler.endElement("", "foo", "foo")
contentHandler.endDocument()
Done reading



Re: std.xml2 (collecting features) control character

2016-02-19 Thread Robert burner Schadek via Digitalmars-d

On Friday, 19 February 2016 at 12:55:52 UTC, Kagamin wrote:

http://dpaste.dzfl.pl/2f8a8ff10bde like this?


yes


Re: std.xml2 (collecting features) control character

2016-02-19 Thread Kagamin via Digitalmars-d
On Friday, 19 February 2016 at 12:30:06 UTC, Robert burner 
Schadek wrote:
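import std.conv : to;

// hex literals need the 0x prefix; the array literal converts to ubyte[]
ubyte[] arr = [0x3C, 0x66, 0x6F, 0x6F, 0x3E, 0xC2, 0x80, 0x3C, 0x2F,
               0x66, 0x6F, 0x6F, 0x3E];
string s = cast(string) arr;
dstring ds = to!dstring(s);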

and see what happens


http://dpaste.dzfl.pl/2f8a8ff10bde like this?


Re: std.xml2 (collecting features) control character

2016-02-19 Thread Robert burner Schadek via Digitalmars-d
On 2016-02-19 11:58, Kagamin via Digitalmars-d wrote:
> On Thursday, 18 February 2016 at 16:56:08 UTC, Robert burner Schadek
> wrote:
>> the hex dump is "3C 66 6F 6F 3E C2 80 3C 2F 66 6F 6F 3E"
>
> http://dpaste.dzfl.pl/80888ed31958 like this?
No, the program just takes the hex dump as a string.

you would need to do something like:

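import std.conv : to;

// hex literals need the 0x prefix; the array literal converts to ubyte[]
ubyte[] arr = [0x3C, 0x66, 0x6F, 0x6F, 0x3E, 0xC2, 0x80, 0x3C, 0x2F,
               0x66, 0x6F, 0x6F, 0x3E];
string s = cast(string) arr;
dstring ds = to!dstring(s);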

and see what happens


Re: std.xml2 (collecting features) control character

2016-02-19 Thread Kagamin via Digitalmars-d
On Thursday, 18 February 2016 at 16:56:08 UTC, Robert burner 
Schadek wrote:

the hex dump is "3C 66 6F 6F 3E C2 80 3C 2F 66 6F 6F 3E"


http://dpaste.dzfl.pl/80888ed31958 like this?


Re: std.xml2 (collecting features) control character

2016-02-18 Thread Robert burner Schadek via Digitalmars-d

On Thursday, 18 February 2016 at 18:28:10 UTC, Alex Vincent wrote:
Regarding control characters:  If you give me a complete sample 
file, I can run it through Mozilla's UTF stream conversion 
and/or XML parsing code (via either SAX or DOMParser) to tell 
you how that reacts as a reference.  Mozilla supports XML 1.0, 
but not 1.1.


Thank you for making the effort.

https://github.com/burner/std.xml2/blob/master/tests/eduni/xml-1.1/out/010.xml


Re: std.xml2 (collecting features) control character

2016-02-18 Thread Alex Vincent via Digitalmars-d
On Thursday, 18 February 2016 at 17:26:30 UTC, Adam D. Ruppe 
wrote:
On Thursday, 18 February 2016 at 16:56:08 UTC, Robert burner 
Schadek wrote:
unix file says it is a utf8 encoded file, but no BOM is 
present.


the hex dump is "3C 66 6F 6F 3E C2 80 3C 2F 66 6F 6F 3E"


Gah, I should have read this before replying... well, that does 
appear to be valid utf-8, so why is it throwing an exception 
then?


I'm pretty sure that byte stream *is* actually well-formed xml 
1.0 and should pass utf validation as well as the XML 
well-formedness check.


Regarding control characters:  If you give me a complete sample 
file, I can run it through Mozilla's UTF stream conversion and/or 
XML parsing code (via either SAX or DOMParser) to tell you how 
that reacts as a reference.  Mozilla supports XML 1.0, but not 
1.1.


Re: std.xml2 (collecting features) control character

2016-02-18 Thread Adam D. Ruppe via Digitalmars-d
On Thursday, 18 February 2016 at 16:56:08 UTC, Robert burner 
Schadek wrote:
unix file says it is a utf8 encoded file, but no BOM is 
present.


the hex dump is "3C 66 6F 6F 3E C2 80 3C 2F 66 6F 6F 3E"


Gah, I should have read this before replying... well, that does 
appear to be valid utf-8, so why is it throwing an exception then?


I'm pretty sure that byte stream *is* actually well-formed xml 
1.0 and should pass utf validation as well as the XML 
well-formedness check.
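
A minimal sketch of that claim, assuming only Phobos (std.utf and std.conv): it validates the 13 bytes from the hex dump and decodes them to code points. The function name decodeUtf8 is made up for this illustration and is not part of std.xml2.

```d
import std.conv : to;
import std.stdio : writefln;
import std.utf : validate;

// Hypothetical helper: treat raw bytes as UTF-8, validate, then decode.
dstring decodeUtf8(const(ubyte)[] raw)
{
    auto s = cast(const(char)[]) raw;
    validate(s);            // throws UTFException on ill-formed UTF-8
    return to!dstring(s);   // one dchar per code point
}

void main()
{
    // "<foo>", then C2 80 (the UTF-8 encoding of U+0080), then "</foo>"
    immutable ubyte[] raw = [0x3C, 0x66, 0x6F, 0x6F, 0x3E, 0xC2, 0x80,
                             0x3C, 0x2F, 0x66, 0x6F, 0x6F, 0x3E];
    auto ds = decodeUtf8(raw);
    writefln("bytes: %s, code points: %s", raw.length, ds.length);
    writefln("code point after <foo>: U+%04X", cast(uint) ds[5]);
}
```

Because C2 80 is the two-byte encoding of U+0080, the 13 bytes decode to 12 code points without an exception.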


Re: std.xml2 (collecting features) control character

2016-02-18 Thread Adam D. Ruppe via Digitalmars-d
On Thursday, 18 February 2016 at 16:54:10 UTC, Robert burner 
Schadek wrote:

It does not, it has no prolog and therefore no EncodingInfo.


In that case, it needs to be valid UTF-8 or valid UTF-16 and it 
is a fatal error if there are any invalid bytes:


https://www.w3.org/TR/REC-xml/#charencoding

==
 It is a fatal error if an XML entity is determined (via default, 
encoding declaration, or higher-level protocol) to be in a 
certain encoding but contains byte sequences that are not legal 
in that encoding. Specifically, it is a fatal error if an entity 
encoded in UTF-8 contains any ill-formed code unit sequences, as 
defined in section 3.9 of Unicode [Unicode]. Unless an encoding 
is determined by a higher-level protocol, it is also a fatal 
error if an XML entity contains no encoding declaration and its 
content is not legal UTF-8 or UTF-16.

==
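
A rough sketch of the UTF-8 half of that rule, assuming std.utf.validate is an acceptable stand-in for "no ill-formed code unit sequences". The helper name isWellFormedUtf8 is hypothetical. It contrasts the bytes from the hex dump (C2 80) with a bare 0x80 byte, which is what a raw Latin-1 or Windows-1252 control character would look like.

```d
import std.utf : UTFException, validate;

// Hypothetical helper: true if the bytes are well-formed UTF-8.
bool isWellFormedUtf8(const(ubyte)[] raw)
{
    try
    {
        validate(cast(const(char)[]) raw);
        return true;
    }
    catch (UTFException e)
    {
        // An XML processor would have to report this as a fatal error.
        return false;
    }
}

void main()
{
    // U+0080 properly encoded as C2 80
    immutable ubyte[] encoded = [0x3C, 0x66, 0x6F, 0x6F, 0x3E, 0xC2, 0x80,
                                 0x3C, 0x2F, 0x66, 0x6F, 0x6F, 0x3E];
    // a lone continuation byte 0x80, not legal UTF-8
    immutable ubyte[] bare = [0x3C, 0x66, 0x6F, 0x6F, 0x3E, 0x80,
                              0x3C, 0x2F, 0x66, 0x6F, 0x6F, 0x3E];

    assert(isWellFormedUtf8(encoded));
    assert(!isWellFormedUtf8(bare));
}
```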



Re: std.xml2 (collecting features) control character

2016-02-18 Thread Robert burner Schadek via Digitalmars-d
On Thursday, 18 February 2016 at 16:54:10 UTC, Robert burner 
Schadek wrote:
unix file says it is a utf8 encoded file, but no BOM is 
present.


the hex dump is "3C 66 6F 6F 3E C2 80 3C 2F 66 6F 6F 3E"


Re: std.xml2 (collecting features) control character

2016-02-18 Thread Robert burner Schadek via Digitalmars-d
On Thursday, 18 February 2016 at 16:47:35 UTC, Adam D. Ruppe 
wrote:
On Thursday, 18 February 2016 at 16:41:52 UTC, Robert burner 
Schadek wrote:
for instance, quite often I find <80> in tests that are 
supposed to be valid xml 1.0. they are invalid xml 1.1 though


What char encoding does the document declare itself as?


It does not, it has no prolog and therefore no EncodingInfo.

unix file says it is a utf8 encoded file, but no BOM is present.


Re: std.xml2 (collecting features) control character

2016-02-18 Thread Adam D. Ruppe via Digitalmars-d
On Thursday, 18 February 2016 at 16:41:52 UTC, Robert burner 
Schadek wrote:
for instance, quite often I find <80> in tests that are 
supposed to be valid xml 1.0. they are invalid xml 1.1 though


What char encoding does the document declare itself as?


Re: std.xml2 (collecting features) control character

2016-02-18 Thread Robert burner Schadek via Digitalmars-d
for instance, quite often I find <80> in tests that are supposed 
to be valid xml 1.0. they are invalid xml 1.1 though


Re: std.xml2 (collecting features) control character

2016-02-18 Thread Adam D. Ruppe via Digitalmars-d
On Thursday, 18 February 2016 at 15:56:58 UTC, Robert burner 
Schadek wrote:
When trying to validate/convert a UTF string these lead to 
exceptions, because they are not valid UTF characters.


That means the user didn't encode them properly...

Which one specifically are you thinking of? I'm pretty sure all 
those control characters have a spot in the Unicode space and can 
be properly encoded as UTF-8 (though I think even if they are 
properly encoded, some of them are illegal in XML anyway).


If they appear in another form, it is invalid and/or needs a 
charset conversion, which should be specified in the XML document 
itself.
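
To make "properly encoded but still illegal in XML" concrete, here is a sketch of the character classes from the W3C recommendations: the Char production of XML 1.0 and the RestrictedChar production of XML 1.1. The function names are made up for this illustration and are not taken from std.xml2.

```d
// XML 1.0: Char ::= #x9 | #xA | #xD | [#x20-#xD7FF]
//                 | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
bool isXml10Char(dchar c) pure nothrow @nogc @safe
{
    return c == 0x9 || c == 0xA || c == 0xD
        || (c >= 0x20 && c <= 0xD7FF)
        || (c >= 0xE000 && c <= 0xFFFD)
        || (c >= 0x10000 && c <= 0x10FFFF);
}

// XML 1.1: RestrictedChar ::= [#x1-#x8] | [#xB-#xC] | [#xE-#x1F]
//                           | [#x7F-#x84] | [#x86-#x9F]
// These may appear in an XML 1.1 document only as character references.
bool isXml11RestrictedChar(dchar c) pure nothrow @nogc @safe
{
    return (c >= 0x1 && c <= 0x8)
        || (c >= 0xB && c <= 0xC)
        || (c >= 0xE && c <= 0x1F)
        || (c >= 0x7F && c <= 0x84)
        || (c >= 0x86 && c <= 0x9F);
}

void main()
{
    assert(isXml10Char('\u0080'));           // C1 control: allowed literally in 1.0
    assert(isXml11RestrictedChar('\u0080')); // but restricted in 1.1
    assert(!isXml10Char('\u0001'));          // C0 control: not a Char in 1.0
    assert(isXml10Char('\u0009'));           // tab is always allowed
}
```

Under these productions a literal U+0080 is fine in XML 1.0 but has to be written as a character reference in XML 1.1, and most C0 controls are not allowed in XML 1.0 at all, however they are encoded.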


Re: std.xml2 (collecting features) control character

2016-02-18 Thread Robert burner Schadek via Digitalmars-d
While working on a new xml implementation I came across "control 
characters (CC)". [1]
When trying to validate/convert a UTF string these lead to 
exceptions, because they are not valid UTF characters.
Unfortunately, some of these characters are allowed to appear in 
valid xml 1.* documents.


I currently see two options for how to go about it:

1. Do not allow CCs that do not work with existing 
functionality.

1.Pros
  * easy
1.Cons
  * the resulting xml implementation will not be xml 1.* complete

2. Add special cases to the existing functionality to handle CCs 
that are allowed in 1.0.

2.Pros
  * the resulting xml implementation will be xml 1.* complete
2.Cons
  * will make utf de/encoding slower as I would need to add 
additional logic


Any other ideas, feedback?
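
One possible arrangement, sketched under the assumption that properly encoded CCs decode without error: leave the existing UTF decoding untouched and check XML legality on the decoded code points in a separate pass. The names isXml10Char and firstIllegalXml10Char are hypothetical, and this is not how std.xml2 is known to work.

```d
import std.utf : decode;

// XML 1.0 Char production from the W3C recommendation.
bool isXml10Char(dchar c) pure nothrow @nogc @safe
{
    return c == 0x9 || c == 0xA || c == 0xD
        || (c >= 0x20 && c <= 0xD7FF)
        || (c >= 0xE000 && c <= 0xFFFD)
        || (c >= 0x10000 && c <= 0x10FFFF);
}

// Returns the code-unit index of the first character that is not an
// XML 1.0 Char, or -1 if every character is allowed.
// Ill-formed UTF-8 still surfaces as a UTFException from decode.
ptrdiff_t firstIllegalXml10Char(string s)
{
    size_t i = 0;
    while (i < s.length)
    {
        immutable start = i;
        immutable dchar c = decode(s, i); // standard decoding; advances i
        if (!isXml10Char(c))
            return cast(ptrdiff_t) start;
    }
    return -1;
}

void main()
{
    assert(firstIllegalXml10Char("<foo>\u0080</foo>") == -1);
    assert(firstIllegalXml10Char("<a>\x01</a>") == 3);
}
```

The extra cost is one range check per decoded character in the XML layer. The UTF functions themselves need no special cases.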




[1] https://en.wikipedia.org/wiki/C0_and_C1_control_codes