Re: std.xml2 (collecting features)
For everyone's information, I've posted a pull request to Mr. Schadek's github repository, with a proposed Simple API for XML (SAX) stub. I'd really appreciate reviews of the stub's interfaces. https://github.com/burner/std.xml2/pull/5
Re: std.xml2 (collecting features)
On Sunday, 6 March 2016 at 11:46:00 UTC, Robert burner Schadek wrote: On Saturday, 5 March 2016 at 15:20:12 UTC, Craig Dillabaugh wrote: Robert, we have had some student interest in GSOC for XML. Would you be interested in mentoring a student to work with you on this. Craig Of course Great. Can you please get in touch by email so I can add you to the mentors list: craig dot dillabaugh at gmail dot com Cheers
Re: std.xml2 (collecting features)
On Sunday, 6 March 2016 at 11:46:00 UTC, Robert burner Schadek wrote: On Saturday, 5 March 2016 at 15:20:12 UTC, Craig Dillabaugh wrote: Robert, we have had some student interest in GSOC for XML. Would you be interested in mentoring a student to work with you on this. Craig Of course Hi, I don't know if this is the right spot to join the conversation; I'm a student and I'd really love to work on std.xml for GSoC! I'm just waiting for March 14 to apply.
Re: std.xml2 (collecting features)
On Saturday, 5 March 2016 at 15:20:12 UTC, Craig Dillabaugh wrote: Robert, we have had some student interest in GSOC for XML. Would you be interested in mentoring a student to work with you on this. Craig Of course
Re: std.xml2 (collecting features)
On Sunday, 3 May 2015 at 17:39:48 UTC, Robert burner Schadek wrote: std.xml has been considered not up to spec for nearly 3 years now. Time to build a successor. I currently plan the following features for it: - SAX and DOM parser - in-situ / slicing parsing when possible (forward range?) - compile time switch (CTS) for lazy attribute parsing - CTS for encoding (ubyte(ASCII), char(utf8), ... ) - CTS for input validating - performance Not much code yet, I'm currently building the performance test suite https://github.com/burner/std.xml2 Please post your feature requests, and please keep the posts DRY and on topic. Robert, we have had some student interest in GSOC for XML. Would you be interested in mentoring a student to work with you on this. Craig
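The "compile time switch (CTS)" items in the feature list above could work roughly as follows: the parser is templated on a flag, so the branch that is switched off is compiled out entirely. This is a minimal sketch of the pattern only; the names and structure are illustrative and not the actual std.xml2 API.

```d
import std.exception : enforce;

struct Parser(bool validate)
{
    string input;

    // Return the name of the tag at the front of the input.
    string nextTagName()
    {
        // The validation branch is compiled out entirely when the
        // template flag is false -- a zero-cost compile time switch.
        static if (validate)
            enforce(input.length >= 2 && input[0] == '<', "not a tag");
        size_t i = 1;
        while (i < input.length && input[i] != '>')
            ++i;
        return input[1 .. i];
    }
}

void main()
{
    auto strict = Parser!true("<foo>");
    auto fast   = Parser!false("<foo>");
    assert(strict.nextTagName() == "foo");
    assert(fast.nextTagName() == "foo");
}
```

The same pattern extends naturally to the encoding and lazy-attribute switches: each becomes a template parameter instead of a runtime branch.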
Re: std.xml2 (collecting features)
Adam D. Ruppe wrote: > On Wednesday, 2 March 2016 at 06:59:49 UTC, Tobias Müller wrote: >> What's the use case of DOM outside of browser >> interoperability/scripting? The API isn't particularly nice, >> especially in languages with a rich type system. > > I find my extended dom to be very nice, especially thanks to D's > type system. I use it for a lot of things: using web apis, html > scraping, config file stuff, working on my own documents, and > even as my web template system. > > Basically, dom.d made xml cool to me. Sure, some kind of DOM is certainly useful. But the standard XML-DOM isn't particularly nice. What's the point of a linked-list-style interface when you have ranges in the language?
Re: std.xml2 (collecting features)
On Wednesday, 2 March 2016 at 06:59:49 UTC, Tobias Müller wrote: What's the use case of DOM outside of browser interoperability/scripting? The API isn't particularly nice, especially in languages with a rich type system. I find my extended dom to be very nice, especially thanks to D's type system. I use it for a lot of things: using web apis, html scraping, config file stuff, working on my own documents, and even as my web template system. Basically, dom.d made xml cool to me.
Re: std.xml2 (collecting features)
On Wednesday, 2 March 2016 at 02:50:22 UTC, Alex Vincent wrote: I agree, but the Document Object Model (DOM) is a huge project. It's a project I'd love to take an active hand in driving. My dom.d implements a fair chunk of it already. https://github.com/adamdruppe/arsd/blob/master/dom.d Yes, indeed, it is quite a lot of code, but easy to use if you are familiar with javascript and css selectors. http://dpldocs.info/experimental-docs/arsd.dom.html
Re: std.xml2 (collecting features)
Dejan Lekic wrote: > If you really want to be serious about the XML package, then I > humbly believe implementing the commonly-known DOM interfaces is > a must. Luckily there is IDL available for it: > https://www.w3.org/TR/DOM-Level-2-Core/idl/dom.idl . Also, > speaking about DOM, all levels need to be supported! > > Also, I would recommend borrowing Tango's XML pull parser as > it is blazingly fast. > > Finally, integration with a signal/slot module should perhaps > be considered as well. > What's the use case of DOM outside of browser interoperability/scripting? The API isn't particularly nice, especially in languages with a rich type system.
Re: std.xml2 (collecting features)
On Wednesday, 24 February 2016 at 10:55:01 UTC, Dejan Lekic wrote: If you really want to be serious about the XML package, then I humbly believe implementing the commonly-known DOM interfaces is a must. Luckily there is IDL available for it: https://www.w3.org/TR/DOM-Level-2-Core/idl/dom.idl . Also, speaking about DOM, all levels need to be supported! I agree, but the Document Object Model (DOM) is a huge project. It's a project I'd love to take an active hand in driving. Also, DOM "level 4" is a living standard at whatwg.org, along with rules for parsing HTML. (Which naturally means the rules are always changing.) I have a partial implementation of DOM in JavaScript, so I am serious when I say it's going to take time. Ideally (imho), we'd have a set of related packages, prefixed with std.web: * html * xml * dom * css * javascript (Yes, I'm suggesting a rename of std.xml2 to std.web.xml.) But from what I can see, realistically the community is a long way from that. I'm trying to write the SAX interfaces now. I only have a limited amount of time to devote to this (a common complaint, I gather)...
Re: std.xml2 (collecting features)
On Thursday, 25 February 2016 at 23:59:04 UTC, crimaniak wrote: There are only a couple of ad-hoc checks for attribute values. This language is not XPath-compatible, so the easiest way to cover a lot of cases is a regex check on attributes. Something like "script[src/https:.+\\.googleapis\\.com/i]" The css3 selector standard offers three substring searches: [attr^=foo] if it begins with foo, [attr$=foo] if it ends with foo, and [attr*=foo] if it includes foo somewhere. dom.d supports all three now. So for your regex, you could probably match [src*=googleapis.com] well enough.
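The three CSS3 substring operators described above map directly onto std.algorithm primitives. This is a minimal sketch of the matching rules themselves, not dom.d's actual implementation; `attrMatches` is a hypothetical helper name.

```d
import std.algorithm.searching : canFind, endsWith, startsWith;

// Evaluate one CSS3 attribute substring operator against a value.
bool attrMatches(string op, string value, string needle)
{
    switch (op)
    {
        case "^=": return value.startsWith(needle); // begins with
        case "$=": return value.endsWith(needle);   // ends with
        case "*=": return value.canFind(needle);    // contains
        default:   assert(0, "unknown operator");
    }
}

void main()
{
    assert(attrMatches("*=", "https://ajax.googleapis.com/x.js", "googleapis.com"));
    assert(attrMatches("^=", "https://ajax.googleapis.com/x.js", "https:"));
    assert(!attrMatches("$=", "x.js", ".css"));
}
```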
Re: std.xml2 (collecting features)
On Sunday, 21 February 2016 at 23:57:40 UTC, Adam D. Ruppe wrote: On Sunday, 21 February 2016 at 23:01:22 UTC, crimaniak wrote: I will use it in my experiments, but the getElementsBySelector() selector language needs to be improved, I think. What, specifically, do you have in mind? There are only a couple of ad-hoc checks for attribute values. This language is not XPath-compatible, so the easiest way to cover a lot of cases is a regex check on attributes. Something like "script[src/https:.+\\.googleapis\\.com/i]"
Re: std.xml2 (collecting features)
On Tuesday, 23 February 2016 at 12:46:38 UTC, Dmitry wrote: On Tuesday, 23 February 2016 at 11:22:23 UTC, Joakim wrote: Then write a good XML extraction-only library and dub it. I see no reason to include this in Phobos You won't be able to sleep if it will be in Phobos? I use XML and I don't like checking tons of side libraries to see which will be good for me, which have support (bugfixes), which will still have support in a few years, etc. Lots of systems already use XML, and any serious language _must_ have official support for it. So are you trying to say C/C++ are not serious languages :o) Having said that, as much as I hate XML, basic support would be a nice feature for the language.
Re: std.xml2 (collecting features)
If you really want to be serious about the XML package, then I humbly believe implementing the commonly-known DOM interfaces is a must. Luckily there is IDL available for it: https://www.w3.org/TR/DOM-Level-2-Core/idl/dom.idl . Also, speaking about DOM, all levels need to be supported! Also, I would recommend borrowing Tango's XML pull parser as it is blazingly fast. Finally, integration with a signal/slot module should perhaps be considered as well.
Re: std.xml2 (collecting features)
On Tuesday, 23 February 2016 at 11:22:23 UTC, Joakim wrote: Then write a good XML extraction-only library and dub it. I see no reason to include this in Phobos You won't be able to sleep if it will be in Phobos? I use XML and I don't like checking tons of side libraries to see which will be good for me, which have support (bugfixes), which will still have support in a few years, etc. Lots of systems already use XML, and any serious language _must_ have official support for it. If data formats are your thing, you could help get Ludwig's JSON stuff in, or better yet, enable some nice binary data format. If it is better for you, that does not mean it will be better for everyone.
Re: std.xml2 (collecting features)
On Friday, 19 February 2016 at 12:13:53 UTC, Chris wrote: On Sunday, 3 May 2015 at 17:47:15 UTC, Joakim wrote: On Sunday, 3 May 2015 at 17:39:48 UTC, Robert burner Schadek wrote: std.xml has been considered not up to spec for nearly 3 years now. Time to build a successor. I currently plan the following features for it: - SAX and DOM parser - in-situ / slicing parsing when possible (forward range?) - compile time switch (CTS) for lazy attribute parsing - CTS for encoding (ubyte(ASCII), char(utf8), ... ) - CTS for input validating - performance Not much code yet, I'm currently building the performance test suite https://github.com/burner/std.xml2 Please post your feature requests, and please keep the posts DRY and on topic. My request: just skip it. XML is a horrible waste of space for a standard; better that D doesn't support it well, anything to discourage its use. I'd rather see you spend your time on something worthwhile. If data formats are your thing, you could help get Ludwig's JSON stuff in, or better yet, enable some nice binary data format. Glad to hear that someone is working on XML support. We cannot just "skip it". XML/HTML-like markup comes up all the time, here and there. I recently had to write a mini-parser (nowhere near the stuff Robert is doing, just a quick fix!) to extract data from XML input. This has nothing to do with personal preferences, it's just there [1] and has to be dealt with. [1] https://en.wikipedia.org/wiki/Speech_Synthesis_Markup_Language Then write a good XML extraction-only library and dub it. I see no reason to include this in Phobos, which will encourage those who don't know any better to use it, since it comes with the compiler. I'll close with a quote from Saint Linus of Torvalds, which I was unaware of till a couple days ago: "XML is crap. Really. There are no excuses. XML is nasty to parse for humans, and it's a disaster to parse even for computers. There's just no reason for that horrible crap to exist."
https://en.wikiquote.org/wiki/Linus_Torvalds#2014
Re: std.xml2 (collecting features)
On Thursday, 18 February 2016 at 15:39:01 UTC, Robert burner Schadek wrote: On Thursday, 18 February 2016 at 12:30:29 UTC, Andrei Alexandrescu wrote: Would the measuring be possible with 2995 as a dub package? -- Andrei yes, after having synced the dub package to the PR. I have brought the dub package up to date with the PR (v0.0.6)
Re: std.xml2 (collecting features)
On Sunday, 21 February 2016 at 23:01:22 UTC, crimaniak wrote: I will use it in my experiments, but the getElementsBySelector() selector language needs to be improved, I think. What, specifically, do you have in mind?
Re: std.xml2 (collecting features)
On Saturday, 20 February 2016 at 19:16:47 UTC, Adam D. Ruppe wrote: On Saturday, 20 February 2016 at 19:08:25 UTC, crimaniak wrote: - the ability to read documents with missing or incorrectly specified encoding - additional feature: relaxed mode for reading html and broken XML documents fyi, my dom.d can do those, I use it for web scraping where there's all kinds of hideous stuff out there. https://github.com/adamdruppe/arsd/blob/master/dom.d It works, thanks! I will use it in my experiments, but the getElementsBySelector() selector language needs to be improved, I think.
Re: std.xml2 (collecting features)
On Saturday, 20 February 2016 at 19:08:25 UTC, crimaniak wrote: - the ability to read documents with missing or incorrectly specified encoding - additional feature: relaxed mode for reading html and broken XML documents fyi, my dom.d can do those, I use it for web scraping where there's all kinds of hideous stuff out there. https://github.com/adamdruppe/arsd/blob/master/dom.d
Re: std.xml2 (collecting features)
On Sunday, 3 May 2015 at 17:39:48 UTC, Robert burner Schadek wrote: Please post your feature requests... - the ability to read documents with missing or incorrectly specified encoding - additional feature: relaxed mode for reading html and broken XML documents Some time ago I worked for Accusoft on document viewing/converting software. The main experience I took away: any theoretically possible type of error in a document shows up for real once the application is popular.
Re: std.xml2 (collecting features) control character
On Thursday, 18 February 2016 at 21:53:24 UTC, Robert burner Schadek wrote: On Thursday, 18 February 2016 at 18:28:10 UTC, Alex Vincent wrote: Regarding control characters: If you give me a complete sample file, I can run it through Mozilla's UTF stream conversion and/or XML parsing code (via either SAX or DOMParser) to tell you how that reacts as a reference. Mozilla supports XML 1.0, but not 1.1. thank you for making the effort https://github.com/burner/std.xml2/blob/master/tests/eduni/xml-1.1/out/010.xml In this case, Firefox just passes the control characters through to the contentHandler.characters method: Starting runTest Retrieved source contentHandler.startDocument() contentHandler.startElement("", "foo", "foo", {}) contentHandler.characters("\u0080") contentHandler.endElement("", "foo", "foo") contentHandler.endDocument() Done reading
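The Firefox event trace above follows the usual SAX callback set. A hypothetical D rendering of that handler interface, mirroring the trace (illustrative only, not the interface from the actual pull request), might look like:

```d
// Hypothetical SAX-style content handler, modeled on the Mozilla
// event trace above. Names and signatures are illustrative.
interface ContentHandler
{
    void startDocument();
    void startElement(string uri, string localName, string qName,
                      string[string] attributes);
    void characters(dstring text);
    void endElement(string uri, string localName, string qName);
    void endDocument();
}

// A handler that just records event names, useful for testing.
class Recorder : ContentHandler
{
    string[] events;
    void startDocument() { events ~= "startDocument"; }
    void startElement(string, string, string, string[string])
        { events ~= "startElement"; }
    void characters(dstring) { events ~= "characters"; }
    void endElement(string, string, string) { events ~= "endElement"; }
    void endDocument() { events ~= "endDocument"; }
}

void main()
{
    auto h = new Recorder;
    // Replay the trace from the Firefox run above:
    h.startDocument();
    h.startElement("", "foo", "foo", null);
    h.characters("\u0080"d);
    h.endElement("", "foo", "foo");
    h.endDocument();
    assert(h.events.length == 5);
    assert(h.events[2] == "characters");
}
```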
Re: std.xml2 (collecting features) control character
On Friday, 19 February 2016 at 12:55:52 UTC, Kagamin wrote: http://dpaste.dzfl.pl/2f8a8ff10bde like this? yes
Re: std.xml2 (collecting features) control character
On Friday, 19 February 2016 at 12:30:06 UTC, Robert burner Schadek wrote: ubyte[] arr = cast(ubyte[])[0x3C, 0x66, 0x6F, 0x6F, 0x3E, 0xC2, 0x80, 0x3C, 0x2F, 0x66, 0x6F, 0x6F, 0x3E]; string s = cast(string)arr; dstring ds = to!dstring(s); and see what happens http://dpaste.dzfl.pl/2f8a8ff10bde like this?
Re: std.xml2 (collecting features) control character
On 2016-02-19 11:58, Kagamin via Digitalmars-d wrote: > On Thursday, 18 February 2016 at 16:56:08 UTC, Robert burner Schadek > wrote: >> the hex dump is "3C 66 6F 6F 3E C2 80 3C 2F 66 6F 6F 3E" > > http://dpaste.dzfl.pl/80888ed31958 like this? No, the program just takes the hex dump as a string. You would need to do something like: ubyte[] arr = cast(ubyte[])[0x3C, 0x66, 0x6F, 0x6F, 0x3E, 0xC2, 0x80, 0x3C, 0x2F, 0x66, 0x6F, 0x6F, 0x3E]; string s = cast(string)arr; dstring ds = to!dstring(s); and see what happens
Re: std.xml2 (collecting features)
On Sunday, 3 May 2015 at 17:47:15 UTC, Joakim wrote: On Sunday, 3 May 2015 at 17:39:48 UTC, Robert burner Schadek wrote: std.xml has been considered not up to spec for nearly 3 years now. Time to build a successor. I currently plan the following features for it: - SAX and DOM parser - in-situ / slicing parsing when possible (forward range?) - compile time switch (CTS) for lazy attribute parsing - CTS for encoding (ubyte(ASCII), char(utf8), ... ) - CTS for input validating - performance Not much code yet, I'm currently building the performance test suite https://github.com/burner/std.xml2 Please post your feature requests, and please keep the posts DRY and on topic. My request: just skip it. XML is a horrible waste of space for a standard; better that D doesn't support it well, anything to discourage its use. I'd rather see you spend your time on something worthwhile. If data formats are your thing, you could help get Ludwig's JSON stuff in, or better yet, enable some nice binary data format. Glad to hear that someone is working on XML support. We cannot just "skip it". XML/HTML-like markup comes up all the time, here and there. I recently had to write a mini-parser (nowhere near the stuff Robert is doing, just a quick fix!) to extract data from XML input. This has nothing to do with personal preferences, it's just there [1] and has to be dealt with. [1] https://en.wikipedia.org/wiki/Speech_Synthesis_Markup_Language
Re: std.xml2 (collecting features) control character
On Thursday, 18 February 2016 at 16:56:08 UTC, Robert burner Schadek wrote: the hex dump is "3C 66 6F 6F 3E C2 80 3C 2F 66 6F 6F 3E" http://dpaste.dzfl.pl/80888ed31958 like this?
Re: std.xml2 (collecting features)
On Friday, 19 February 2016 at 04:02:02 UTC, Craig Dillabaugh wrote: Would you be interested in mentoring a student for the Google Summer of Code to do work on std.xml? Yes, why not!
Re: std.xml2 (collecting features)
On Thursday, 18 February 2016 at 10:18:18 UTC, Robert burner Schadek wrote: On Thursday, 18 February 2016 at 04:34:13 UTC, Alex Vincent wrote: I'm looking for a status update. DUB doesn't seem to have many options posted. I was thinking about starting a SAXParser implementation. I'm working on it, but recently I had to do some major restructuring of the code. Currently I'm trying to get this merged https://github.com/D-Programming-Language/phobos/pull/3880 because I had some problems with the encoding of test files. XML has a lot of corner cases, it just takes time. If you want to work on some XML stuff, please join me. It is probably more productive working together than creating two competing implementations. Would you be interested in mentoring a student for the Google Summer of Code to do work on std.xml?
Re: std.xml2 (collecting features) control character
On Thursday, 18 February 2016 at 18:28:10 UTC, Alex Vincent wrote: Regarding control characters: If you give me a complete sample file, I can run it through Mozilla's UTF stream conversion and/or XML parsing code (via either SAX or DOMParser) to tell you how that reacts as a reference. Mozilla supports XML 1.0, but not 1.1. thank you for making the effort https://github.com/burner/std.xml2/blob/master/tests/eduni/xml-1.1/out/010.xml
Re: std.xml2 (collecting features) control character
On Thursday, 18 February 2016 at 17:26:30 UTC, Adam D. Ruppe wrote: On Thursday, 18 February 2016 at 16:56:08 UTC, Robert burner Schadek wrote: unix file says it is a utf8 encoded file, but no BOM is present. the hex dump is "3C 66 6F 6F 3E C2 80 3C 2F 66 6F 6F 3E" Gah, I should have read this before replying... well, that does appear to be valid utf-8. Why is it throwing an exception then? I'm pretty sure that byte stream *is* actually well-formed xml 1.0 and should pass utf validation as well as the XML well-formedness check. Regarding control characters: If you give me a complete sample file, I can run it through Mozilla's UTF stream conversion and/or XML parsing code (via either SAX or DOMParser) to tell you how that reacts as a reference. Mozilla supports XML 1.0, but not 1.1.
Re: std.xml2 (collecting features)
On Thursday, 18 February 2016 at 10:18:18 UTC, Robert burner Schadek wrote: If you want to work on some XML stuff, please join me. It is probably more productive working together than creating two competing implementations. Oh, I absolutely agree, independent implementation is a bad thing. (Someone should rename DRY as "don't repeat yourself or others"... but DRYOO sounds weird.) Where's your repo?
Re: std.xml2 (collecting features) control character
On Thursday, 18 February 2016 at 16:56:08 UTC, Robert burner Schadek wrote: unix file says it is a utf8 encoded file, but no BOM is present. the hex dump is "3C 66 6F 6F 3E C2 80 3C 2F 66 6F 6F 3E" Gah, I should have read this before replying... well, that does appear to be valid utf-8. Why is it throwing an exception then? I'm pretty sure that byte stream *is* actually well-formed xml 1.0 and should pass utf validation as well as the XML well-formedness check.
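Adam's claim above is easy to check in plain D: the byte sequence C2 80 is a well-formed UTF-8 encoding of U+0080, so the whole dump decodes to "<foo>\u0080</foo>" without a UTF error. A quick sketch:

```d
import std.conv : to;
import std.utf : validate;

void main()
{
    // The hex dump from the post above.
    immutable ubyte[] arr = [0x3C, 0x66, 0x6F, 0x6F, 0x3E,   // <foo>
                             0xC2, 0x80,                     // U+0080
                             0x3C, 0x2F, 0x66, 0x6F, 0x6F, 0x3E]; // </foo>
    auto s = cast(string) arr;
    validate(s); // throws UTFException on malformed UTF-8 -- passes here
    dstring ds = to!dstring(s);
    assert(ds == "<foo>\u0080</foo>"d);
}
```

So any exception must come from the XML layer (e.g. an XML 1.1 character check), not from UTF decoding itself.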
Re: std.xml2 (collecting features) control character
On Thursday, 18 February 2016 at 16:54:10 UTC, Robert burner Schadek wrote: It does not, it has no prolog and therefore no EncodingInfo. In that case, it needs to be valid UTF-8 or valid UTF-16 and it is a fatal error if there's any invalid bytes: https://www.w3.org/TR/REC-xml/#charencoding == It is a fatal error if an XML entity is determined (via default, encoding declaration, or higher-level protocol) to be in a certain encoding but contains byte sequences that are not legal in that encoding. Specifically, it is a fatal error if an entity encoded in UTF-8 contains any ill-formed code unit sequences, as defined in section 3.9 of Unicode [Unicode]. Unless an encoding is determined by a higher-level protocol, it is also a fatal error if an XML entity contains no encoding declaration and its content is not legal UTF-8 or UTF-16. ==
Re: std.xml2 (collecting features) control character
On Thursday, 18 February 2016 at 16:54:10 UTC, Robert burner Schadek wrote: unix file says it is a utf8 encoded file, but no BOM is present. the hex dump is "3C 66 6F 6F 3E C2 80 3C 2F 66 6F 6F 3E"
Re: std.xml2 (collecting features) control character
On Thursday, 18 February 2016 at 16:47:35 UTC, Adam D. Ruppe wrote: On Thursday, 18 February 2016 at 16:41:52 UTC, Robert burner Schadek wrote: for instance, quite often I find <80> in tests that are supposed to be valid xml 1.0. they are invalid xml 1.1 though What char encoding does the document declare itself as? It does not, it has no prolog and therefore no EncodingInfo. unix file says it is a utf8 encoded file, but no BOM is present.
Re: std.xml2 (collecting features) control character
On Thursday, 18 February 2016 at 16:41:52 UTC, Robert burner Schadek wrote: for instance, quite often I find <80> in tests that are supposed to be valid xml 1.0. they are invalid xml 1.1 though What char encoding does the document declare itself as?
Re: std.xml2 (collecting features) control character
for instance, quite often I find <80> in tests that are supposed to be valid xml 1.0. they are invalid xml 1.1 though
Re: std.xml2 (collecting features) control character
On Thursday, 18 February 2016 at 15:56:58 UTC, Robert burner Schadek wrote: When trying to validate/convert an utf string these lead to exceptions, because they are not valid utf character. That means the user didn't encode them properly... Which one specifically are you thinking of? I'm pretty sure all those control characters have a spot in the Unicode space and can be properly encoded as UTF-8 (though I think even if they are properly encoded, some of them are illegal in XML anyway). If they appear in another form, it is invalid and/or needs a charset conversion, which should be specified in the XML document itself.
Re: std.xml2 (collecting features) control character
While working on a new xml implementation I came across "control characters (CC)". [1] When trying to validate/convert a utf string these lead to exceptions, because they are not valid utf characters. Unfortunately, some of these characters are allowed to appear in valid xml 1.* documents. I currently see two options for how to go about it:

1. Do not allow CCs that do not work with existing functionality.
   - Pros: easy
   - Cons: the resulting xml implementation will not be xml 1.* complete

2. Add special cases to the existing functionality to handle CCs that are allowed in 1.0.
   - Pros: the resulting xml implementation will be xml 1.* complete
   - Cons: will make utf de/encoding slower, as I would need to add additional logic

Any other ideas, feedback? [1] https://en.wikipedia.org/wiki/C0_and_C1_control_codes
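The spec-level difference driving the two options above is the Char production of each XML version: XML 1.0 admits C1 controls such as U+0080 as literal characters but forbids most of C0, while XML 1.1 widens Char to almost everything above NUL (though its RestrictedChar rule then requires most controls to be written as character references). A sketch of the two predicates, taken from the W3C grammars:

```d
// XML 1.0 Char: #x9 | #xA | #xD | [#x20-#xD7FF]
//             | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
bool isXml10Char(dchar c)
{
    return c == 0x9 || c == 0xA || c == 0xD
        || (c >= 0x20 && c <= 0xD7FF)
        || (c >= 0xE000 && c <= 0xFFFD)
        || (c >= 0x10000 && c <= 0x10FFFF);
}

// XML 1.1 Char: [#x1-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
// (RestrictedChar additionally forces most controls into references.)
bool isXml11Char(dchar c)
{
    return (c >= 0x1 && c <= 0xD7FF)
        || (c >= 0xE000 && c <= 0xFFFD)
        || (c >= 0x10000 && c <= 0x10FFFF);
}

void main()
{
    assert(isXml10Char('\u0080'));  // C1 controls are legal 1.0 chars
    assert(!isXml10Char('\u0001')); // most C0 is not
    assert(!isXml10Char('\u0000')); // NUL is never allowed
    assert(isXml11Char('\u0001'));  // 1.1 admits it (as a reference)
}
```

Option 2 then amounts to running a predicate like this after UTF decoding, instead of relying on the UTF layer to reject anything.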
Re: std.xml2 (collecting features)
On Thursday, 18 February 2016 at 12:30:29 UTC, Andrei Alexandrescu wrote: also I would like to see this https://github.com/D-Programming-Language/phobos/pull/2995 go in first to be able to accurately measure and compare performance Would the measuring be possible with 2995 as a dub package? -- Andrei yes, after having synced the dub package to the PR
Re: std.xml2 (collecting features)
On 02/18/2016 05:49 AM, Robert burner Schadek wrote: On Thursday, 18 February 2016 at 10:18:18 UTC, Robert burner Schadek wrote: If you want to work on some XML stuff, please join me. It is probably more productive working together than creating two competing implementations. also I would like to see this https://github.com/D-Programming-Language/phobos/pull/2995 go in first to be able to accurately measure and compare performance Would the measuring be possible with 2995 as a dub package? -- Andrei
Re: std.xml2 (collecting features)
On Thursday, 18 February 2016 at 10:18:18 UTC, Robert burner Schadek wrote: If you want to work on some XML stuff, please join me. It is probably more productive working together than creating two competing implementations. also I would like to see this https://github.com/D-Programming-Language/phobos/pull/2995 go in first to be able to accurately measure and compare performance
Re: std.xml2 (collecting features)
On Thursday, 18 February 2016 at 04:34:13 UTC, Alex Vincent wrote: I'm looking for a status update. DUB doesn't seem to have many options posted. I was thinking about starting a SAXParser implementation. I'm working on it, but recently I had to do some major restructuring of the code. Currently I'm trying to get this merged https://github.com/D-Programming-Language/phobos/pull/3880 because I had some problems with the encoding of test files. XML has a lot of corner cases, it just takes time. If you want to work on some XML stuff, please join me. It is probably more productive working together than creating two competing implementations.
Re: std.xml2 (collecting features)
On Sunday, 3 May 2015 at 17:39:48 UTC, Robert burner Schadek wrote: std.xml has been considered not up to spec for nearly 3 years now. Time to build a successor. I currently plan the following features for it: - SAX and DOM parser - in-situ / slicing parsing when possible (forward range?) - compile time switch (CTS) for lazy attribute parsing - CTS for encoding (ubyte(ASCII), char(utf8), ... ) - CTS for input validating - performance Not much code yet, I'm currently building the performance test suite https://github.com/burner/std.xml2 Please post your feature requests, and please keep the posts DRY and on topic. I'm looking for a status update. DUB doesn't seem to have many options posted. I was thinking about starting a SAXParser implementation.
Re: std.xml2 (collecting features)
On Sunday, 10 May 2015 at 08:54:09 UTC, Joakim wrote: One can do all these things with better formats than either XML or JSON. Hypothetically, yes, though formats better than XML don't exist. I personally find XML perfectly readable.
Re: std.xml2 (collecting features)
On Monday, 11 May 2015 at 15:20:12 UTC, Alex Parrill wrote: Can we please not turn this thread into an XML vs JSON flamewar? This is not a flamewar. JSON is ad hoc and I use it a lot, but it isn't actually suitable as a file and archival exchange format. It is important that people understand what the point of XML is in order to build something useful. Full XML support and tooling is very valuable for typed GC-backed batch processing. That means namespaces, entities, XQuery equivalents, DOMs, etc. A library-backed tooling pipeline would be a valuable asset for D. The value is not in _reading_ or _writing_ XML. The value is all about providing a framework for structured grammar/namespace based _processing_ and _transforms_.
Re: std.xml2 (collecting features)
Can we please not turn this thread into an XML vs JSON flamewar? XML is one of the most popular data formats (for better or for worse), so a parser would be a good addition to the standard library.
Re: std.xml2 (collecting features)
On Sunday, 10 May 2015 at 07:01:58 UTC, Marco Leise wrote: Well, I was mostly answering to w0rp here. JSON is both readable and easy to parse, no question. JSON is just javascript literals with some silly constraints. As crappy a format as it gets. Even pure Lisp would have been better. And much more powerful! :) One can't really answer this one. But with many hundreds of published data exchange formats built on XML, it can't have been too shabby all along. And sometimes small things matter, like being able to add comments along with the "payload". XML is actually great for what it is: eXtensible. It means you can build forward compatible formats and annotate existing formats with metadata without breaking existing (compliant) applications etc... It also means you can datamine files without knowing the full format. Or knowing that both sender and receiver will validate the XML the same way through XSD. Right, or build a database/archival service that is generic. XML is not going away until there is something better, and that won't happen anytime soon. It is also one of the few formats that I actually need library and _good_ DOM support for. (JSON can be done in an afternoon, so I don't care if it is supported or not...)
Re: std.xml2 (collecting features)
On Sunday, 10 May 2015 at 08:54:09 UTC, Joakim wrote: It's worse than shabby, it's a horrible, horrible choice. Not just for data formats, but for _anything_. XML should not be used. I feel the same way about XML, and I also think that having strong aesthetic internal emotional responses is often necessary to achieve excellence in engineering. But why do we often end up dealing with these two? Familiarity, that is the only reason. XML seems familiar to anybody who's written some HTML, and JSON became familiar to web developers initially. Starting from those two large niches, they've expanded out to become the two most popular data interchange formats, despite XML being a horrible mess and JSON being too simple for many uses. Sometimes you get to pick, but often not. I can hardly tell the UK Debt Management Office to give up XML and switch to msgpack structs (well, I can, but I am not sure they would listen). So at the moment for some data series I use a python library via PyD to convert xml files to JSON. But it would be nice to do it all in D. I am not sure XML is going away very soon since new protocols keep being created using it. (Most recent one I heard of is one for allowing hedge funds to achieve full transparency of their portfolio to end investors - not necessarily something that will achieve what people think it will, but one in tune with the times). Laeeth.
Re: std.xml2 (collecting features)
On Sunday, 10 May 2015 at 07:01:58 UTC, Marco Leise wrote: On Sat, 09 May 2015 10:28:52, "Joakim" wrote: On Monday, 4 May 2015 at 18:50:43 UTC, Marco Leise wrote: > Remember that while JSON is simpler, XML is not just a > structured container for bool, Number and String data. It > comes with many official side kicks covering a broad range of > use cases: > > XPath: > … > > XSL and XSLT > … > > XSL-FO (XSL formatting objects): > … > > XML Schema Definition (XSD): > … These are all incredibly dumb ideas. I don't deny that many people may use these things, but then people use hammers for all kinds of things they shouldn't use them for too. :) :) One can't really answer this one. But with many hundreds of published data exchange formats built on XML, it can't have been too shabby all along. It's worse than shabby, it's a horrible, horrible choice. Not just for data formats, but for _anything_. XML should not be used. And sometimes small things matter, like being able to add comments along with the "payload". JSON doesn't have that. Or knowing that both sender and receiver will validate the XML the same way through XSD. So if it doesn't blow up on your end, it will pass validation on the other end, too. One can do all these things with better formats than either XML or JSON. But why do we often end up dealing with these two? Familiarity, that is the only reason. XML seems familiar to anybody who's written some HTML, and JSON became familiar to web developers initially. Starting from those two large niches, they've expanded out to become the two most popular data interchange formats, despite XML being a horrible mess and JSON being too simple for many uses. I'd like to see a move back to binary formats, which is why I mentioned that to Robert. D would be an ideal language in which to show the superiority of binary to text formats, given its emphasis on efficiency.
Many devs have learned the wrong lessons from past closed binary formats, when open binary formats wouldn't have many of those deficiencies. There have been some interesting moves back to open binary formats/protocols in recent years, like Hessian (http://hessian.caucho.com/), Thrift (https://thrift.apache.org/), MessagePack (http://msgpack.org/), and Cap'n Proto (from the protobufs guy after he left google - https://capnproto.org/). I'd rather see phobos support these, which are the future, rather than flash-in-the-pan text formats like XML or JSON.
Re: std.xml2 (collecting features)
Am Sat, 09 May 2015 10:28:52 + schrieb "Joakim" : > On Monday, 4 May 2015 at 18:50:43 UTC, Marco Leise wrote: > > > You two are terrible at motivating people. "Better D doesn't > > support it well" and "JSON is superior through-and-through" is > > overly dismissive. > > … > > You seem to have missed the point of my post, which was to > discourage him from working on an XML module for phobos. As for > "motivating" him, I suggested better alternatives. And I never > said JSON was great, but it's certainly _much_ more readable than > XML, which is one of the basic goals of a text format. Well, I was mostly answering to w0rp here. JSON is both readable and easy to parse, no question. > > Remember that while JSON is simpler, XML is not just a > > structured container for bool, Number and String data. It > > comes with many official side kicks covering a broad range of > > use cases: > > > > XPath: > > … > > > > XSL and XSLT > > … > > > > XSL-FO (XSL formatting objects): > > … > > > > XML Schema Definition (XSD): > > … > > These are all incredibly dumb ideas. I don't deny that many > people may use these things, but then people use hammers for all > kinds of things they shouldn't use them for too. :) :) One can't really answer this one. But with many hundreds of published data exchange formats built on XML, it can't have been too shabby all along. And sometimes small things matter, like being able to add comments along with the "payload". JSON doesn't have that. Or knowing that both sender and receiver will validate the XML the same way through XSD. So if it doesn't blow up on your end, it will pass validation on the other end, too. Am Sat, 09 May 2015 13:04:57 + schrieb "Craig Dillabaugh" : > I have to agree with Joakim on this. Having spent much of this > past > week trying to get XML generated by gSOAP (project has some legacy > code) to work with JAXB (Java) has reinforced my dislike for XML. 
> I've used things like XPath and XSLT in the past, so I can appreciate their power, but think the 'jobs' they perform would be better supported elsewhere (i.e. language-specific XML frameworks). In trying to pass data between applications I just want a simple way of packaging up the data and ideally making serialization/deserialization easy for me. At some point the programmer working on these needs to understand and validate the data anyway. Sure you can use DTD/XML Schema to handle the validation part, but it is just easier to deal with that within your own code - without having to learn a 'whole new language' that is likely harder to grok than the tools you would have at your disposal in your language of choice. You see, the thing is that XSD is _not_ a whole new language; it is written in XML as well, probably specifically to make it so. Try to switch the perspective: with XSD (if it is sufficient for your validation needs) _one_ person needs to learn and write it, and other programmers (inside or outside the company) just use the XML library of their choice to handle validation via that schema. Once the schema is loaded it is usually no more than doc.validate(); (There are also good GUI tools to assist in writing XSD.) What you propose, on the other hand, is that everyone involved in the data exchange writes their own validation code in their language of choice, with either no access to existing sources or functionality that doesn't translate to their language! -- Marco
Re: std.xml2 (collecting features)
On Saturday, 9 May 2015 at 10:28:53 UTC, Joakim wrote: On Monday, 4 May 2015 at 18:50:43 UTC, Marco Leise wrote: On Sunday, 3 May 2015 at 17:47:15 UTC, Joakim wrote: clip Remember that while JSON is simpler, XML is not just a structured container for bool, Number and String data. It comes with many official sidekicks covering a broad range of use cases: XPath: * allows you to use XML files like a textual database * complex enough to allow for almost any imaginable query * many tools emerged to test XPath expressions against XML documents * also powers XSLT (http://www.liquid-technologies.com/xpath-tutorial.aspx) XSL (Extensible Stylesheet Language) and XSLT (XSL Transformations): * written as XML documents * standard way to transform XML from one structure into another * convert or "compile" data to XHTML or SVG for display in a browser * output to XSL-FO XSL-FO (XSL formatting objects): * written as XSL * type-setting for XML; an XSL-FO processor is similar to a LaTeX processor * reads an XML document (a "Format" document) and outputs to a PDF, RTF or similar format XML Schema Definition (XSD): * written as XML * linked in by an XML file * defines structure and validates content to some extent * can set constraints on how often an element can occur in a list * can validate data type of values (length, regex, positive, etc.) * database-like unique IDs and references These are all incredibly dumb ideas. I don't deny that many people may use these things, but then people use hammers for all kinds of things they shouldn't use them for too. :) I think XML is the most eat-your-own-dog-food language ever and nicely covers a wide range of use cases. The problem is you're still eating dog food. ;) I have to agree with Joakim on this. Having spent much of this past week trying to get XML generated by gSOAP (project has some legacy code) to work with JAXB (Java) has reinforced my dislike for XML. 
I've used things like XPath and XSLT in the past, so I can appreciate their power, but think the 'jobs' they perform would be better supported elsewhere (i.e. language-specific XML frameworks). In trying to pass data between applications I just want a simple way of packaging up the data and ideally making serialization/deserialization easy for me. At some point the programmer working on these needs to understand and validate the data anyway. Sure you can use DTD/XML Schema to handle the validation part, but it is just easier to deal with that within your own code - without having to learn a 'whole new language' that is likely harder to grok than the tools you would have at your disposal in your language of choice. Having said all that, as much as I share Joakim's sentiment that I wish XML would just go away, there is a lot of it out there, and I think having good support in Phobos is very valuable, so I thank Robert for his efforts. Craig
Re: std.xml2 (collecting features)
On Monday, 4 May 2015 at 18:50:43 UTC, Marco Leise wrote: On Sunday, 3 May 2015 at 17:47:15 UTC, Joakim wrote: My request: just skip it. XML is a horrible waste of space for a standard, better D doesn't support it well, anything to discourage its use. I'd rather see you spend your time on something worthwhile. If data formats are your thing, you could help get Ludwig's JSON stuff in, or better yet, enable some nice binary data format. You two are terrible at motivating people. "Better D doesn't support it well" and "JSON is superior through-and-through" is overly dismissive. To me it sounds like someone saying replace C++ with JavaScript, because C++ is a horrible standard and JavaScript is so much superior. Honestly. You seem to have missed the point of my post, which was to discourage him from working on an XML module for Phobos. As for "motivating" him, I suggested better alternatives. And I never said JSON was great, but it's certainly _much_ more readable than XML, which is one of the basic goals of a text format. Remember that while JSON is simpler, XML is not just a structured container for bool, Number and String data. 
It comes with many official sidekicks covering a broad range of use cases: XPath: * allows you to use XML files like a textual database * complex enough to allow for almost any imaginable query * many tools emerged to test XPath expressions against XML documents * also powers XSLT (http://www.liquid-technologies.com/xpath-tutorial.aspx) XSL (Extensible Stylesheet Language) and XSLT (XSL Transformations): * written as XML documents * standard way to transform XML from one structure into another * convert or "compile" data to XHTML or SVG for display in a browser * output to XSL-FO XSL-FO (XSL formatting objects): * written as XSL * type-setting for XML; an XSL-FO processor is similar to a LaTeX processor * reads an XML document (a "Format" document) and outputs to a PDF, RTF or similar format XML Schema Definition (XSD): * written as XML * linked in by an XML file * defines structure and validates content to some extent * can set constraints on how often an element can occur in a list * can validate data type of values (length, regex, positive, etc.) * database-like unique IDs and references These are all incredibly dumb ideas. I don't deny that many people may use these things, but then people use hammers for all kinds of things they shouldn't use them for too. :) I think XML is the most eat-your-own-dog-food language ever and nicely covers a wide range of use cases. The problem is you're still eating dog food. ;) In any case there are many XML-based file formats that we might want to parse. Amongst them SVG, OpenDocument (Open/LibreOffice), RSS feeds, several MS Office formats, XMP and other metadata formats. Sure, and if he has any real need for any of those, who are we to stop him? But if he's just looking for some way to contribute, there are better ways. On Monday, 4 May 2015 at 20:44:42 UTC, Jonathan M Davis wrote: Also true. 
Many of us just don't find enough time to work on D, and we don't seem to do a good job of encouraging larger contributions to Phobos, so newcomers don't tend to contribute like that. And there's so much to do all around that the big stuff just falls by the wayside, and it really shouldn't. This is why I keep asking Walter and Andrei for a list of "big stuff" on the wiki (the items don't have to be big, just important) so that newcomers know where help is most needed. Of course, it doesn't have to be them; it could be any member of the D core team, though whatever the BDFLs push for would have a bit more weight.
Re: std.xml2 (collecting features)
On 06/05/2015 07:31, Jacob Carlborg wrote: On 2015-05-06 01:38, Walter Bright wrote: I haven't read the Tango source code, but the performance of its XML was supposedly because it did not use the GC; it used slices. That's only true for the pull parser (not sure about the SAX parser). The DOM parser needs to allocate the nodes, but if I recall correctly those are allocated in a free list. Not sure which parser was used in the test. The direct comparisons were with the DOM parsers (I was playing with a D port of some C++ code at work at the time, and that is DOM-based). xmlp has alternate parsers (event-driven etc.) which were faster in some simple tests I did, but I don't recall if I did a direct comparison with Tango there.
Re: std.xml2 (collecting features)
An old friend of mine who was intimate with the Microsoft XML parsers was fond of saying, particularly with respect to XML parsers, that if you hadn't finished implementing and testing error handling and negative tests (i.e., malformed documents), your positive benchmarks were fairly meaningless. A whole lot of work goes into that 'second half' of things, and it can quickly cost performance. I didn't dive into, or don't recall, the specific details, as this was years ago. The (over-)generalization from there is an old adage: it's easy to write an incorrect program. On 5/5/2015 11:33 PM, Jacob Carlborg via Digitalmars-d wrote: On 2015-05-05 16:04, "Ola Fosheim Grøstad" wrote: In my opinion it is rather difficult to build a good API without also using the API in an application in parallel. So it would be a good strategy to build a specific DOM along with writing the XML infrastructure, like SVG/HTML. Agree. Also, some parsers, like RapidXML, only support a subset of XML. So they cannot be used for comparisons. The Tango parser has some limitations as well. In some places it sacrificed correctness for speed. There's a comment claiming the parser might read past the input if it's not well formed.
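The point about negative tests can be made concrete. Below is a toy sketch in D (names like `XmlException` and `checkNesting` are made up for illustration, not a Phobos API): it only verifies that start and end tags nest properly, yet it already shows the kind of malformed-input assertions a benchmark-worthy parser's test suite needs.

```d
import std.exception : assertThrown;
import std.string : indexOf, indexOfAny;

class XmlException : Exception
{
    this(string msg) { super(msg); }
}

// Toy checker, illustration only: verifies that start/end tags nest
// properly. A real parser must also reject bad attributes, entities,
// comments, CDATA, and so on.
void checkNesting(const(char)[] xml)
{
    const(char)[][] stack;
    while (true)
    {
        auto lt = xml.indexOf('<');
        if (lt < 0) break;
        auto gt = xml.indexOf('>', lt);
        if (gt < 0) throw new XmlException("unterminated tag");
        auto tag = xml[lt + 1 .. gt];
        if (tag.length && tag[0] == '/')
        {
            if (!stack.length || stack[$ - 1] != tag[1 .. $])
                throw new XmlException("mismatched end tag");
            stack = stack[0 .. $ - 1];
        }
        else if (tag.length && tag[$ - 1] != '/')   // skip empty elements
        {
            auto sp = tag.indexOfAny(" \t");
            stack ~= sp < 0 ? tag : tag[0 .. sp];
        }
        xml = xml[gt + 1 .. $];
    }
    if (stack.length) throw new XmlException("unclosed element");
}

unittest
{
    checkNesting("<a><b/></a>");                           // well-formed
    assertThrown!XmlException(checkNesting("<a><b></a>")); // mismatched tags
    assertThrown!XmlException(checkNesting("<a>"));        // unexpected EOF
}
```

Note that even this toy must decide what to do at end-of-input, which is exactly where the Tango comment about reading past a non-well-formed document comes in.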
Re: std.xml2 (collecting features)
On 2015-05-06 01:38, Walter Bright wrote: I haven't read the Tango source code, but the performance of its XML was supposedly because it did not use the GC; it used slices. That's only true for the pull parser (not sure about the SAX parser). The DOM parser needs to allocate the nodes, but if I recall correctly those are allocated in a free list. Not sure which parser was used in the test. -- /Jacob Carlborg
Re: std.xml2 (collecting features)
On 2015-05-05 16:04, "Ola Fosheim Grøstad" wrote: In my opinion it is rather difficult to build a good API without also using the API in an application in parallel. So it would be a good strategy to build a specific DOM along with writing the XML infrastructure, like SVG/HTML. Agree. Also, some parsers, like RapidXML, only support a subset of XML. So they cannot be used for comparisons. The Tango parser has some limitations as well. In some places it sacrificed correctness for speed. There's a comment claiming the parser might read past the input if it's not well formed. -- /Jacob Carlborg
Re: std.xml2 (collecting features)
On 5/5/2015 4:16 AM, Richard Webb wrote: Also, profiling showed a lot of time spent in the GC, and the recent improvements in that area might have changed things by now. I haven't read the Tango source code, but the performance of its XML was supposedly because it did not use the GC; it used slices.
Re: std.xml2 (collecting features)
On Tuesday, 5 May 2015 at 12:10:59 UTC, Jacob Carlborg wrote: Yes, of course it's slower. The DOM parser creates a DOM as well, which the pull parser doesn't. These other libraries, what kind of parsers are those using? I mean, it's not fair to compare a pull parser against a DOM parser. I agree. Most applications will use a DOM parser for convenience, so sacrificing some speed initially in favour of ease-of-use makes a lot of sense. As long as it is possible to improve it later (e.g. use SIMD scanning to find the end of CDATA etc.). In my opinion it is rather difficult to build a good API without also using the API in an application in parallel. So it would be a good strategy to build a specific DOM along with writing the XML infrastructure, like SVG/HTML. Also, some parsers, like RapidXML, only support a subset of XML. So they cannot be used for comparisons.
Re: std.xml2 (collecting features)
On 2015-05-05 12:41, "Mario Kröplin" wrote: Recently, I compared DOM parsers for an XML file of 100 MByte: 15.8 s tango.text.xml (SiegeLord/Tango-D2) 13.4 s ae.utils.xml (CyberShadow/ae) 8.5 s xml.etree (Python) Either the Tango DOM parser is slow compared to the Tango pull parser, Yes, of course it's slower. The DOM parser creates a DOM as well, which the pull parser doesn't. These other libraries, what kind of parsers are those using? I mean, it's not fair to compare a pull parser against a DOM parser. Could you try D1 Tango as well? Or do you have the benchmark available somewhere? or the D2 port ruined the performance. Might be the case as well, see this comment [1]. [1] http://forum.dlang.org/thread/vsbsxfeciryrdsjhh...@forum.dlang.org?page=3#post-mi8hs8:24b0j:241:40digitalmars.com -- /Jacob Carlborg
Re: std.xml2 (collecting features)
On 05/05/2015 11:41, "Mario Kröplin" wrote: Recently, I compared DOM parsers for an XML file of 100 MByte: 15.8 s tango.text.xml (SiegeLord/Tango-D2) 13.4 s ae.utils.xml (CyberShadow/ae) 8.5 s xml.etree (Python) Either the Tango DOM parser is slow compared to the Tango pull parser, or the D2 port ruined the performance. fwiw I did some tests a couple of years back with https://launchpad.net/d2-xml on 20-odd megabyte files and found it faster than Tango. Unfortunately that would need some work to test now, as xmlp is abandoned and wouldn't build last time I tried it :-( I also had some success with https://github.com/opticron/kxml, though it had some issues with chuffy entity decoding performance. Also, profiling showed a lot of time spent in the GC, and the recent improvements in that area might have changed things by now.
Re: std.xml2 (collecting features)
On Tuesday, 5 May 2015 at 10:41:37 UTC, Mario Kröplin wrote: On Monday, 4 May 2015 at 19:28:25 UTC, Jacob Carlborg wrote: On 2015-05-03 19:39, Robert burner Schadek wrote: Not much code yet, I'm currently building the performance test suite https://github.com/burner/std.xml2 I recommend benchmarking against the Tango pull parser. Recently, I compared DOM parsers for an XML file of 100 MByte: 15.8 s tango.text.xml (SiegeLord/Tango-D2) 13.4 s ae.utils.xml (CyberShadow/ae) 8.5 s xml.etree (Python) Either the Tango DOM parser is slow compared to the Tango pull parser, or the D2 port ruined the performance. As usual: system, compiler, compiler version, compilation flags?
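For reproducibility, the timing harness itself can be as small as the sketch below; `parseDom` is a placeholder for whichever parser is under test, and the system, compiler, and flags (e.g. `dmd -O -release -inline` vs. an LDC/GDC build) should be reported alongside the numbers, as requested above.

```d
import std.datetime.stopwatch : AutoStart, StopWatch;
import std.file : readText;
import std.stdio : writefln;

void main(string[] args)
{
    // e.g. a 100 MByte test document passed on the command line
    auto doc = readText(args[1]);

    auto sw = StopWatch(AutoStart.yes);
    // auto dom = parseDom(doc);   // placeholder: parser under test
    sw.stop();

    writefln("parsed %s bytes in %s", doc.length, sw.peek);
}
```

Reading the whole file up front keeps I/O out of the measured interval, so the numbers compare parsers rather than disks.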
Re: std.xml2 (collecting features)
On Monday, 4 May 2015 at 19:28:25 UTC, Jacob Carlborg wrote: On 2015-05-03 19:39, Robert burner Schadek wrote: Not much code yet, I'm currently building the performance test suite https://github.com/burner/std.xml2 I recommend benchmarking against the Tango pull parser. Recently, I compared DOM parsers for an XML file of 100 MByte: 15.8 s tango.text.xml (SiegeLord/Tango-D2) 13.4 s ae.utils.xml (CyberShadow/ae) 8.5 s xml.etree (Python) Either the Tango DOM parser is slow compared to the Tango pull parser, or the D2 port ruined the performance.
Re: std.xml2 (collecting features)
On Monday, 4 May 2015 at 19:31:59 UTC, Jonathan M Davis wrote: Given how D's arrays work, we have the opportunity to have an _extremely_ fast XML parser thanks to slices. Yes, that would be great. XML is a flexible go-to archive, exchange and application format. Things like entities, namespaces and so on make it non-trivial, but being able to conveniently process Inkscape and Open Office files etc. would be very useful. One should probably look at what applications generate XML and create some large test files with existing applications.
Re: std.xml2 (collecting features)
Am Tue, 05 May 2015 02:01:50 +0000 schrieb "weaselcat": > maybe off-topic, but it would be nice if the standard JSON, XML, etc. all had identical interfaces (except for implementation-specific quirks). This might be something worth discussing if it wasn't already agreed upon. I don't think this needs discussion. It is plain impossible to have a sophisticated JSON parser and a sophisticated XML parser share the same API. Established function names, structural differences in the formats, and feature sets differ too much. For example, in XML attributes and child elements are used somewhat interchangeably, whereas in JSON attributes don't exist. So while in JSON "obj.field" makes sense, in XML you would want to select either an attribute or an element with the name "field". -- Marco
Re: std.xml2 (collecting features)
On Sunday, 3 May 2015 at 17:39:48 UTC, Robert burner Schadek wrote: std.xml has been considered not up to specs for nearly 3 years now. Time to build a successor. I currently plan the following features for it: - SAX and DOM parser - in-situ / slicing parsing when possible (forward range?) - compile time switch (CTS) for lazy attribute parsing - CTS for encoding (ubyte(ASCII), char(utf8), ... ) - CTS for input validating - performance Not much code yet, I'm currently building the performance test suite https://github.com/burner/std.xml2 Please post your feature requests, and please keep the posts DRY and on topic. maybe off-topic, but it would be nice if the standard JSON, XML, etc. all had identical interfaces (except for implementation-specific quirks). This might be something worth discussing if it wasn't already agreed upon.
Re: std.xml2 (collecting features)
On 5/05/2015 10:45 a.m., Liam McSherry wrote: On Sunday, 3 May 2015 at 17:39:48 UTC, Robert burner Schadek wrote: std.xml has been considered not up to specs for nearly 3 years now. Time to build a successor. I currently plan the following features for it: - SAX and DOM parser - in-situ / slicing parsing when possible (forward range?) - compile time switch (CTS) for lazy attribute parsing - CTS for encoding (ubyte(ASCII), char(utf8), ... ) - CTS for input validating - performance Not much code yet, I'm currently building the performance test suite https://github.com/burner/std.xml2 Please post your feature requests, and please keep the posts DRY and on topic. Not a feature, but if `std.data.json` [1] gets accepted into Phobos, it may be something to consider naming this `std.data.xml` (although that might not as effectively differentiate it from `std.xml`). [1]: http://wiki.dlang.org/Review_Queue It really should be std.data.xml, to keep with the new structuring. Plus it'll make transitioning a little easier.
Re: std.xml2 (collecting features)
On Sunday, 3 May 2015 at 17:39:48 UTC, Robert burner Schadek wrote: std.xml has been considered not up to specs for nearly 3 years now. Time to build a successor. I currently plan the following features for it: - SAX and DOM parser - in-situ / slicing parsing when possible (forward range?) - compile time switch (CTS) for lazy attribute parsing - CTS for encoding (ubyte(ASCII), char(utf8), ... ) - CTS for input validating - performance Not much code yet, I'm currently building the performance test suite https://github.com/burner/std.xml2 Please post your feature requests, and please keep the posts DRY and on topic. Not a feature, but if `std.data.json` [1] gets accepted into Phobos, it may be something to consider naming this `std.data.xml` (although that might not as effectively differentiate it from `std.xml`). [1]: http://wiki.dlang.org/Review_Queue
Re: std.xml2 (collecting features)
On Monday, 4 May 2015 at 19:45:18 UTC, Andrei Alexandrescu wrote: On 5/4/15 12:31 PM, Jonathan M Davis wrote: Fast parsing is definitely a killer feature of D and the fact that std.xml botches that so badly is just embarrassing. To be frank what's more embarrassing is that we managed to do nothing about it for years (aside from endlessly wailing about it in an a capella ensemble). It's a failure of leadership (that Walter and I need to work on) that very many unimportant and arguably less interesting areas of Phobos get attention at the expense of this one. -- Andrei Also true. Many of us just don't find enough time to work on D, and we don't seem to do a good job of encouraging larger contributions to Phobos, so newcomers don't tend to contribute like that. And there's so much to do all around that the big stuff just falls by the wayside, and it really shouldn't. - Jonathan M Davis
Re: std.xml2 (collecting features)
On 5/4/2015 2:35 AM, "Ola Fosheim Grøstad" wrote: Wouldn't D-ranges make it impossible to use SIMD optimizations when scanning? Not at all. Algorithms can be specialized for various forms of input ranges, including ones where SIMD optimizations can be used. Specialization is one of the very cool things about D algorithms.
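As a minimal sketch of what such specialization can look like (the name `scanUntil` is made up for illustration): the generic overload accepts any character input range, while the overload for contiguous slices delegates to `memchr`, which libc implementations typically vectorize.

```d
import core.stdc.string : memchr;
import std.range.primitives : isInputRange;

// Contiguous input: one memchr call over the whole slice,
// letting libc use SIMD under the hood.
size_t scanUntil(const(char)[] s, char sentinel)
{
    auto p = memchr(s.ptr, sentinel, s.length);
    return p is null ? s.length : cast(const(char)*) p - s.ptr;
}

// Fallback for arbitrary character input ranges: one element at a time.
size_t scanUntil(R)(R r, char sentinel)
    if (isInputRange!R && !is(R : const(char)[]))
{
    size_t n;
    foreach (c; r)
    {
        if (c == sentinel) break;
        ++n;
    }
    return n;
}
```

A call like `scanUntil("abc>def", '>')` resolves to the slice overload at compile time; wrapping the same data in a non-sliceable range picks the generic one, with no change at the call site.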
Re: std.xml2 (collecting features)
On 5/4/2015 12:28 PM, Jacob Carlborg wrote: On 2015-05-03 19:39, Robert burner Schadek wrote: Not much code yet, I'm currently building the performance test suite https://github.com/burner/std.xml2 I recommend benchmarking against the Tango pull parser. I agree. The Tango XML parser has set the performance bar. If any new solution can't match that, throw it out and try again.
Re: std.xml2 (collecting features)
On 5/4/2015 12:31 PM, Jonathan M Davis wrote: Given how D's arrays work, we have the opportunity to have an _extremely_ fast XML parser thanks to slices. It's highly unlikely that any C or C++ solution is going to be able to compete, and if it can, it's likely to be far more complex than necessary. Parsing is an area where we definitely should write our own stuff rather than porting existing code from other languages or use existing libraries in other languages via C bindings. Fast parsing is definitely a killer feature of D and the fact that std.xml botches that so badly is just embarrassing. Tango's XML package was well regarded and the fastest in the business. It used slicing, and almost no memory allocation.
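The slicing idea is easy to illustrate: a returned tag name can simply alias the input buffer, so nothing is copied and nothing is allocated. This toy `tagName` (illustration only; it ignores comments, processing instructions, and malformed input) shows the principle:

```d
import std.string : indexOfAny;

// Illustration only: return the element name of a start tag as a
// slice into the original document -- zero allocations.
const(char)[] tagName(const(char)[] xml)
{
    assert(xml.length >= 2 && xml[0] == '<');
    auto end = xml.indexOfAny(" \t\r\n/>", 1);
    return end < 0 ? xml[1 .. $] : xml[1 .. end];
}

unittest
{
    string doc = "<book id='1'>...</book>";
    auto name = tagName(doc);
    assert(name == "book");
    // The slice points into `doc` itself; nothing was copied.
    assert(name.ptr == doc.ptr + 1);
}
```

A whole tokenizer built this way touches the GC only when the caller explicitly asks for a copy (e.g. to outlive the input buffer), which is essentially the Tango approach.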
Re: std.xml2 (collecting features)
On 5/4/15 12:31 PM, Jonathan M Davis wrote: On Monday, 4 May 2015 at 09:35:55 UTC, Ola Fosheim Grøstad wrote: However, it would make a lot of sense to just convert an existing XML solution with Boost license. I don't know which ones are any good, but RapidXML is at least Boost. Given how D's arrays work, we have the opportunity to have an _extremely_ fast XML parser thanks to slices. It's highly unlikely that any C or C++ solution is going to be able to compete, and if it can, it's likely to be far more complex than necessary. Parsing is an area where we definitely should write our own stuff rather than porting existing code from other languages or use existing libraries in other languages via C bindings. Fast parsing is definitely a killer feature of D and the fact that std.xml botches that so badly is just embarrassing. To be frank what's more embarrassing is that we managed to do nothing about it for years (aside from endlessly wailing about it in an a capella ensemble). It's a failure of leadership (that Walter and I need to work on) that very many unimportant and arguably less interesting areas of Phobos get attention at the expense of this one. -- Andrei
Re: std.xml2 (collecting features)
On 2015-05-03 19:39, Robert burner Schadek wrote: Not much code yet, I'm currently building the performance test suite https://github.com/burner/std.xml2 There are a couple of interesting comments about the Tango pull parser that can be worth mentioning: * Use -version=whitespace to retain whitespace as data nodes. We see a 25% increase in token count and a 10% throughput drop when parsing "hamlet.xml" with this option enabled (pullparser alone) * The parser is constructed with some tradeoffs relating to document integrity. It is generally optimized for well-formed documents, and currently may read past a document-end for those that are not well formed * Making some tiny unrelated change to the code can cause notable throughput changes. We're not yet clear why these swings are so pronounced (for changes outside the code path) but they seem to be related to the alignment of codegen. It could be a cache-line issue, or something else The last comment might not be relevant anymore since these are all quite old comments. -- /Jacob Carlborg
Re: std.xml2 (collecting features)
On Monday, 4 May 2015 at 09:35:55 UTC, Ola Fosheim Grøstad wrote: However, it would make a lot of sense to just convert an existing XML solution with Boost license. I don't know which ones are any good, but RapidXML is at least Boost. Given how D's arrays work, we have the opportunity to have an _extremely_ fast XML parser thanks to slices. It's highly unlikely that any C or C++ solution is going to be able to compete, and if it can, it's likely to be far more complex than necessary. Parsing is an area where we definitely should write our own stuff rather than porting existing code from other languages or use existing libraries in other languages via C bindings. Fast parsing is definitely a killer feature of D and the fact that std.xml botches that so badly is just embarrassing. - Jonathan M Davis
Re: std.xml2 (collecting features)
On Sunday, 3 May 2015 at 22:02:13 UTC, Walter Bright wrote: On 5/3/2015 2:31 PM, Ilya Yaroshenko wrote: Can it lazily reads huge files (files greater than memory)? If a range interface is used, it doesn't need to be aware of where the data is coming from. In fact, the xml package should NOT be doing I/O. Indeed. It should operate on ranges without caring where they came from (though it may end up supporting both input ranges and random-access ranges with the idea that it can support reading of a socket with a range in a less efficient manner or operating on a whole file at once as via a random-access range for more efficient parsing). But if I/O is a big concern, I'd suggest just using std.mmfile to do the trick, since then you can still operate on the whole file as a single array without having to actually have the whole thing in memory. - Jonathan M Davis
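For reference, the std.mmfile route is only a few lines. The OS pages the mapping in on demand, so even a file larger than RAM presents itself to a slicing parser as one contiguous array:

```d
import std.mmfile : MmFile;

void main()
{
    // Map the whole file read-only; nothing is copied into the GC heap.
    auto mmf = new MmFile("huge.xml");   // assumed example file name
    auto text = cast(const(char)[]) mmf[];
    // `text` can now be fed to a slicing parser exactly as if it
    // were an in-memory string.
}
```

One caveat worth noting: slices of `text` are only valid while the `MmFile` object is alive, so a DOM that keeps slices must also keep the mapping.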
Re: std.xml2 (collecting features)
On 2015-05-04 21:14, Jonathan M Davis wrote: If I were doing it, I'd do three types of parsers: 1. A parser that was pretty much as low level as you can get, where you basically get a range of XML attributes or tags. Exactly how to build that could be a bit entertaining, since it would have to be hierarchical, and ranges aren't, but something like a range of tags where you can get a range of its attributes and sub-tags from it so that the whole document can be processed without actually getting to the level of even a SAX parser. That parser could then be used to build the other parsers, and anyone who needed insanely fast speeds could use it rather than the SAX or DOM parser so long as they were willing to pay the inevitable loss in user-friendliness. 2. SAX parser built on the low level parser. 3. DOM parser built either on the low level parser or the SAX parser (whichever made more sense). I doubt that I'm really explaining the low level parser well enough or have even thought it through enough, but I really think that even a SAX parser is too high level for the base parser and that something slightly higher than a lexer (high enough to actually be processing XML rather than individual tokens but pretty much only as high as is required to do that) would be a far better choice. IIRC, Michel Fortin's work went in that direction, and he linked to his code in another post, so I'd suggest at least looking at that for ideas. This is the way the XML parser is structured in Tango: a pull parser at the lowest level, a SAX parser on top of that, and I think the DOM parser builds on top of the pull parser. The Tango pull parser can give you the following tokens: * start element * attribute * end element * end empty element * data * comment * cdata * doctype * pi -- /Jacob Carlborg
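A rough sketch of what that lowest layer's interface could look like in range terms; the names and the token set here are illustrative only, loosely mirroring the Tango token list above, and `popFront` is left as a stub:

```d
// Illustrative stub, not a working parser: the shape of a lazy token
// range whose payloads are slices of the input document.
enum TokenKind
{
    startElement, attribute, endElement, endEmptyElement,
    data, comment, cdata, doctype, pi,
}

struct Token
{
    TokenKind kind;
    const(char)[] name;   // slice into the document, not a copy
    const(char)[] value;  // attribute value / character data
}

struct TokenRange
{
    const(char)[] input;
    Token front;
    bool empty = true;    // a real parser would prime this in a ctor
    void popFront()
    {
        // advance `input` and fill `front` with fresh slices here
    }
}

// A SAX layer would foreach over TokenRange and fire callbacks;
// a DOM layer would consume the same tokens and allocate nodes.
```

Because both higher layers consume the same token stream, the hierarchical structure lives in the consumers (a stack of open elements), which is what lets the bottom layer stay a flat range.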
Re: std.xml2 (collecting features)
On 2015-05-03 19:39, Robert burner Schadek wrote: Not much code yet, I'm currently building the performance test suite https://github.com/burner/std.xml2 I recommend benchmarking against the Tango pull parser. -- /Jacob Carlborg
Re: std.xml2 (collecting features)
On Sunday, 3 May 2015 at 17:39:48 UTC, Robert burner Schadek wrote: std.xml has been considered not up to specs for nearly 3 years now. Time to build a successor. I currently plan the following features for it: - SAX and DOM parser - in-situ / slicing parsing when possible (forward range?) - compile time switch (CTS) for lazy attribute parsing - CTS for encoding (ubyte(ASCII), char(utf8), ... ) - CTS for input validating - performance Not much code yet, I'm currently building the performance test suite https://github.com/burner/std.xml2 Please post your feature requests, and please keep the posts DRY and on topic. If I were doing it, I'd do three types of parsers: 1. A parser that was pretty much as low level as you can get, where you basically get a range of XML attributes or tags. Exactly how to build that could be a bit entertaining, since it would have to be hierarchical, and ranges aren't, but something like a range of tags where you can get a range of its attributes and sub-tags from it so that the whole document can be processed without actually getting to the level of even a SAX parser. That parser could then be used to build the other parsers, and anyone who needed insanely fast speeds could use it rather than the SAX or DOM parser so long as they were willing to pay the inevitable loss in user-friendliness. 2. SAX parser built on the low level parser. 3. DOM parser built either on the low level parser or the SAX parser (whichever made more sense). I doubt that I'm really explaining the low level parser well enough or have even thought it through enough, but I really think that even a SAX parser is too high level for the base parser and that something slightly higher than a lexer (high enough to actually be processing XML rather than individual tokens but pretty much only as high as is required to do that) would be a far better choice. 
IIRC, Michel Fortin's work went in that direction, and he linked to his code in another post, so I'd suggest at least looking at that for ideas. Regardless, by building layers of XML parsers rather than just the standard ones, it should be possible to get higher performance while still having the more standard, user-friendly ones for those that don't need the full performance and do need the user-friendliness (though of course, we do want the SAX and DOM parsers to be efficient as well). - Jonathan M Davis
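A minimal sketch in D of the low-level layer Jonathan describes: a lazy input range of XML "events" whose payloads are slices of the input. All names here (TokenKind, XmlToken, XmlTokenRange) are invented for illustration, and this is nowhere near a conforming XML tokenizer; it only shows the shape of the interface.

```d
import std.string : indexOf;

enum TokenKind { open, close, text }

struct XmlToken
{
    TokenKind kind;
    string name; // tag name, or the text content for TokenKind.text
}

/// Naive tokenizer over a string, exposed as an input range.
/// Every `name` is a slice of the original input: no copying.
struct XmlTokenRange
{
    private string input;
    private XmlToken cur;
    private bool done;

    this(string s) { input = s; advance(); }

    @property bool empty() const { return done; }
    @property XmlToken front() const { return cur; }
    void popFront() { advance(); }

    private void advance()
    {
        if (input.length == 0) { done = true; return; }
        if (input[0] == '<')
        {
            auto end = input.indexOf('>');
            assert(end > 0, "malformed tag");
            auto tag = input[1 .. end];
            input = input[end + 1 .. $];
            cur = tag.length && tag[0] == '/'
                ? XmlToken(TokenKind.close, tag[1 .. $])
                : XmlToken(TokenKind.open, tag);
        }
        else
        {
            auto next = input.indexOf('<');
            immutable stop = next < 0 ? input.length : cast(size_t) next;
            cur = XmlToken(TokenKind.text, input[0 .. stop]);
            input = input[stop .. $];
        }
    }
}
```

A SAX layer would then be a thin loop dispatching these events to callbacks, and a DOM layer would fold them into a tree; both pay only for what they add on top.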
Re: std.xml2 (collecting features)
On Sun, 03 May 2015 14:00:11 -0700, Walter Bright wrote: > On 5/3/2015 10:39 AM, Robert burner Schadek wrote: > > - CTS for encoding (ubyte(ASCII), char(utf8), ... ) > > Encoding schemes should be handled by adapter algorithms, not in the XML parser itself, which should only handle UTF8. Unlike JSON, XML actually declares the encoding in the prolog, e.g. <?xml version="1.0" encoding="ISO-8859-1"?>. -- Marco
Re: std.xml2 (collecting features)
On Sunday, 3 May 2015 at 17:47:15 UTC, Joakim wrote: > My request: just skip it. XML is a horrible waste of space for > a standard, better D doesn't support it well, anything to > discourage its use. I'd rather see you spend your time on > something worthwhile. If data formats are your thing, you > could help get Ludwig's JSON stuff in, or better yet, enable > some nice binary data format. On Sun, 03 May 2015 18:44:11 +0000, "w0rp" wrote: > I agree that JSON is superior through-and-through, but legacy > support matters, and XML is in many places. It's good to have a > quality XML parsing library. You two are terrible at motivating people. "Better D doesn't support it well" and "JSON is superior through-and-through" are overly dismissive. To me it sounds like someone saying to replace C++ with JavaScript, because C++ is a horrible standard and JavaScript is so much superior. Honestly. Remember that while JSON is simpler, XML is not just a structured container for bool, Number and String data.
It comes with many official sidekicks covering a broad range of use cases:

XPath:
* allows you to use XML files like a textual database
* complex enough to allow for almost any imaginable query
* many tools have emerged to test XPath expressions against XML documents
* also powers XSLT (http://www.liquid-technologies.com/xpath-tutorial.aspx)

XSL (Extensible Stylesheet Language) and XSLT (XSL Transformations):
* written as XML documents
* standard way to transform XML from one structure into another
* convert or "compile" data to XHTML or SVG for display in a browser
* output to XSL-FO

XSL-FO (XSL formatting objects):
* written as XSL
* type-setting for XML; an XSL-FO processor is similar to a LaTeX processor
* reads an XML document (a "Format" document) and outputs to a PDF, RTF or similar format

XML Schema Definition (XSD):
* written as XML
* linked in by an XML file
* defines structure and validates content to some extent
* can set constraints on how often an element can occur in a list
* can validate the data type of values (length, regex, positive, etc.)
* database-like unique IDs and references

I think XML is the most eat-your-own-dog-food language ever and nicely covers a wide range of use cases. In any case there are many XML-based file formats that we might want to parse, amongst them SVG, OpenDocument (Open/LibreOffice), RSS feeds, several MS Office formats, XMP and other metadata formats. When it comes to which features to support, I personally have used XSD more than XPath and the tech using it. But quite frankly both would be expected by users. Based on XPath, XSL transformations can be added at any time later. Anything beyond that doesn't feel quite "core" enough to be in an XML module. -- Marco
Re: std.xml2 (collecting features)
On Sunday, 3 May 2015 at 22:02:13 UTC, Walter Bright wrote: On 5/3/2015 2:31 PM, Ilya Yaroshenko wrote: Can it lazily read huge files (files greater than memory)? If a range interface is used, it doesn't need to be aware of where the data is coming from. In fact, the xml package should NOT be doing I/O. Wouldn't D-ranges make it impossible to use SIMD optimizations when scanning? However, it would make a lot of sense to just convert an existing XML solution with a Boost license. I don't know which ones are any good, but RapidXML is at least Boost-licensed.
Re: std.xml2 (collecting features)
On Sunday, 3 May 2015 at 23:32:28 UTC, Michel Fortin wrote: This isn't a feature request (sorry?), but I just want to point out that you should feel free to borrow code from https://github.com/michelf/mfr-xml-d There's probably a lot you can reuse in there. nice, thank you
Re: std.xml2 (collecting features)
On 4/05/2015 5:39 a.m., Robert burner Schadek wrote: std.xml has been considered not up to specs for nearly 3 years now. Time to build a successor. I currently plan the following features for it: - SAX and DOM parser - in-situ / slicing parsing when possible (forward range?) - compile time switch (CTS) for lazy attribute parsing - CTS for encoding (ubyte(ASCII), char(utf8), ... ) - CTS for input validating - performance Not much code yet, I'm currently building the performance test suite https://github.com/burner/std.xml2 Please post your feature requests, and please keep the posts DRY and on topic. Preferably the interfaces would be made first, 1:1 as the spec requires. Then it's just a matter of building the actual reader/writer code. That way we could theoretically rewrite the reader/writer to support other formats such as HTML5/SVG, independently of Phobos. It would also be nice for it to be CTFE-able!
Re: std.xml2 (collecting features)
On 2015-05-03 17:39:46 +, "Robert burner Schadek" said: std.xml has been considered not up to specs for nearly 3 years now. Time to build a successor. I currently plan the following features for it: - SAX and DOM parser - in-situ / slicing parsing when possible (forward range?) - compile time switch (CTS) for lazy attribute parsing - CTS for encoding (ubyte(ASCII), char(utf8), ... ) - CTS for input validating - performance Not much code yet, I'm currently building the performance test suite https://github.com/burner/std.xml2 Please post your feature requests, and please keep the posts DRY and on topic. This isn't a feature request (sorry?), but I just want to point out that you should feel free to borrow code from https://github.com/michelf/mfr-xml-d There's probably a lot you can reuse in there. -- Michel Fortin michel.for...@michelf.ca http://michelf.ca
Re: std.xml2 (collecting features)
On 5/3/2015 2:31 PM, Ilya Yaroshenko wrote: Can it lazily read huge files (files greater than memory)? If a range interface is used, it doesn't need to be aware of where the data is coming from. In fact, the xml package should NOT be doing I/O.
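A sketch of Walter's point in D: with a range interface the parser never does I/O itself. A file larger than memory becomes a lazy range of bytes via byChunk and joiner, and a parser written against input ranges would take that range and an in-memory slice through the same code path. The streamFile helper below is hypothetical and drains the range eagerly just to demonstrate it; a real parser would consume it incrementally.

```d
import std.algorithm.iteration : joiner;
import std.array : array;
import std.stdio : File;

/// Present a file as a lazily produced byte stream, 4 KiB at a
/// time, then drain it. The file is never loaded whole: each chunk
/// is read only when the range is advanced into it.
string streamFile(string path)
{
    auto bytes = File(path).byChunk(4096).joiner; // lazy range of ubyte
    return cast(string) bytes.array;
}
```

The hypothetical parser call would then be `parseXml(File(path).byChunk(4096).joiner)` for a huge file or `parseXml("<root/>")` for a string, with no I/O code anywhere in the xml package.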
Re: std.xml2 (collecting features)
Can it lazily read huge files (files greater than memory)?
Re: std.xml2 (collecting features)
On 5/3/2015 10:39 AM, Robert burner Schadek wrote: Please post your feature requests, and please keep the posts DRY and on topic. Try to design the interface to it so it does not inherently require the implementation to allocate GC memory.
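One way to honor this constraint: have the parser return slices of the caller's input and mark the hot path @nogc, so allocation is ruled out by the compiler rather than by discipline. nextTagName below is a hypothetical helper, not proposed std.xml2 API, and deliberately handles only the happy path.

```d
/// Returns the name of the first tag in `input` as a slice of
/// `input` itself: no copy, no GC allocation. Returns null when no
/// tag is found. The @nogc attribute makes the no-allocation claim
/// a compile-time guarantee.
@nogc @safe pure
string nextTagName(string input)
{
    size_t i = 0;
    while (i < input.length && input[i] != '<') ++i;
    if (i == input.length) return null; // no tag in the input
    ++i; // skip '<'
    immutable start = i;
    while (i < input.length && input[i] != '>' && input[i] != ' ') ++i;
    return input[start .. i];
}
```

Because the result aliases the input, the in-situ/slicing goal from the original feature list and Walter's no-GC request end up being the same design decision.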
Re: std.xml2 (collecting features)
On 5/3/2015 10:39 AM, Robert burner Schadek wrote: - CTS for encoding (ubyte(ASCII), char(utf8), ... ) Encoding schemes should be handled by adapter algorithms, not in the XML parser itself, which should only handle UTF8.
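Walter's suggestion, sketched in D: keep the parser UTF-8-only and push transcoding into a lazy range adapter. std.utf.byUTF is exactly such an adapter; the toUtf8 wrapper below is a hypothetical convenience, not real API.

```d
import std.utf : byUTF;

/// Lazily re-encode a UTF-16 document as UTF-8 code units. No
/// intermediate buffer is built; each wchar is transcoded on demand
/// as the range is consumed.
auto toUtf8(wstring utf16Doc)
{
    return utf16Doc.byUTF!char; // lazy UTF-16 -> UTF-8 adapter
}
```

A UTF-8-only parser could then be handed `doc.byUTF!char` regardless of the source's Unicode encoding, and std.encoding provides similar transcoding for legacy code pages, so the parser core stays single-encoding.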
Re: std.xml2 (collecting features)
On Sunday, 3 May 2015 at 17:39:48 UTC, Robert burner Schadek wrote: std.xml has been considered not up to specs for nearly 3 years now. Time to build a successor. I currently plan the following features for it: - SAX and DOM parser - in-situ / slicing parsing when possible (forward range?) - compile time switch (CTS) for lazy attribute parsing - CTS for encoding (ubyte(ASCII), char(utf8), ... ) - CTS for input validating - performance Not much code yet, I'm currently building the performance test suite https://github.com/burner/std.xml2 Please post your feature requests, and please keep the posts DRY and on topic. Could you possibly use pegged to do it? It may simplify the parsing portion of it for you, at least.
Re: std.xml2 (collecting features)
On 5/3/2015 10:39 AM, Robert burner Schadek wrote: Please post your feature requests, and please keep the posts DRY and on topic. Pipeline range interface, for example: source.xmlparse(configuration).whatever();
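Why this pipeline style works in D: with UFCS, xmlparse is just a free function taking a range, so any source chains into it and its result chains onward into further range algorithms. Both names below (XmlParseConfig, xmlparse) are hypothetical stand-ins, and the "parser" stage merely forwards its input to show where a real lazy parser would sit in the chain.

```d
struct XmlParseConfig
{
    bool validate;
}

/// Stand-in parser stage: accepts any source, returns something
/// chainable. A real implementation would return a lazy range of
/// parse events instead of forwarding the source unchanged.
auto xmlparse(R)(R source, XmlParseConfig config)
{
    return source;
}
```

With this shape, `File("doc.xml").byChunk(4096).joiner.xmlparse(cfg)` and `"<root/>".xmlparse(cfg)` both read naturally, and downstream consumers can keep chaining std.algorithm primitives onto the result.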
Re: std.xml2 (collecting features)
On Sunday, 3 May 2015 at 17:47:15 UTC, Joakim wrote: On Sunday, 3 May 2015 at 17:39:48 UTC, Robert burner Schadek wrote: std.xml has been considered not up to specs for nearly 3 years now. Time to build a successor. I currently plan the following features for it: - SAX and DOM parser - in-situ / slicing parsing when possible (forward range?) - compile time switch (CTS) for lazy attribute parsing - CTS for encoding (ubyte(ASCII), char(utf8), ... ) - CTS for input validating - performance Not much code yet, I'm currently building the performance test suite https://github.com/burner/std.xml2 Please post your feature requests, and please keep the posts DRY and on topic. My request: just skip it. XML is a horrible waste of space for a standard, better D doesn't support it well, anything to discourage its use. I'd rather see you spend your time on something worthwhile. If data formats are your thing, you could help get Ludwig's JSON stuff in, or better yet, enable some nice binary data format. I agree that JSON is superior through-and-through, but legacy support matters, and XML is in many places. It's good to have a quality XML parsing library.
Re: std.xml2 (collecting features)
On Sunday, 3 May 2015 at 17:47:15 UTC, Joakim wrote: On Sunday, 3 May 2015 at 17:39:48 UTC, Robert burner Schadek wrote: std.xml has been considered not up to specs for nearly 3 years now. Time to build a successor. I currently plan the following features for it: - SAX and DOM parser - in-situ / slicing parsing when possible (forward range?) - compile time switch (CTS) for lazy attribute parsing - CTS for encoding (ubyte(ASCII), char(utf8), ... ) - CTS for input validating - performance Not much code yet, I'm currently building the performance test suite https://github.com/burner/std.xml2 Please post your feature requests, and please keep the posts DRY and on topic. My request: just skip it. XML is a horrible waste of space for a standard, better D doesn't support it well, anything to discourage its use. I'd rather see you spend your time on something worthwhile. If data formats are your thing, you could help get Ludwig's JSON stuff in, or better yet, enable some nice binary data format. That's not really an option considering the huge amount of XML data there is out there.
Re: std.xml2 (collecting features)
- CTS to disable parse-location tracking (line, column)
Re: std.xml2 (collecting features)
On Sunday, 3 May 2015 at 17:39:48 UTC, Robert burner Schadek wrote: std.xml has been considered not up to specs for nearly 3 years now. Time to build a successor. I currently plan the following features for it: - SAX and DOM parser - in-situ / slicing parsing when possible (forward range?) - compile time switch (CTS) for lazy attribute parsing - CTS for encoding (ubyte(ASCII), char(utf8), ... ) - CTS for input validating - performance Not much code yet, I'm currently building the performance test suite https://github.com/burner/std.xml2 Please post your feature requests, and please keep the posts DRY and on topic. My request: just skip it. XML is a horrible waste of space for a standard, better D doesn't support it well, anything to discourage its use. I'd rather see you spend your time on something worthwhile. If data formats are your thing, you could help get Ludwig's JSON stuff in, or better yet, enable some nice binary data format.