On Tuesday, 13 February 2018 at 20:10:59 UTC, Jonathan M Davis wrote:
On Tuesday, February 13, 2018 15:22:32 Kagamin via Digitalmars-d-announce wrote:
On Monday, 12 February 2018 at 16:50:16 UTC, Jonathan M Davis wrote:
> The core problem is that entity references get replaced with
> more XML that needs to be parsed. So, they can't simply be
> passed on for post-processing. As I understand it, they have
> to be replaced while the parsing is going on. And that means
> that you can't do something like return slices of the
> original input that don't bother with the entity references
> and then have a separate parser take that and process it
> further to deal with the entity references. The first parser
> has to deal with them, and that means not returning slices
> of the original input unless you're dealing purely with
> strings and are willing to allocate new strings in the cases
> where the data needs to be mutated because of an entity
> reference.

Standard entities like &amp; have the same problem, so the same solution should work too.

That depends on what exactly an entity reference can contain. If it can do something like put a start tag in there, and then it has to be terminated by the document putting an end tag in there or another entity reference containing an end tag, then it can't be handled after the fact like &amp; can be, since &amp; is just replaced by text. If an entity reference can't contain a start tag without a matching end tag, then sure. But I find the XML spec to be surprisingly hard to understand with regard to entity references. It's not clear to me where it's even legal to put them or not, let alone what you're allowed to put in them exactly. And I can't even really trust the XML grammar as long as entity references are involved, because the grammar in the spec is the grammar _after_ entity references have all been replaced, which I was quite dismayed to figure out.

If it's 100% sure that entity references can be treated as just text and that you can't end up with stuff like start tags or end tags being inserted and messing with the parsing such that they all have to be replaced for the XML to be correctly parsed, then I have no problem passing entity references along, and a higher level parser could try to do something with them, but it's not clear to me at all that an XML document with entity references is correct enough to be parsed while not replacing the entity references with whatever XML markup they contain. I had originally passed them along with the idea that a higher level parser could do something with them, but I decided that I couldn't do that if you could do something like drop a start tag in there and change the meaning of the stuff that needs to be parsed that isn't directly in the entity reference.
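
One way to probe the question above is to feed both cases to an existing parser and see what it does. The following is only a sketch of such an experiment, assuming Python's expat-based xml.etree.ElementTree is a fair stand-in for a conforming non-validating parser; it says nothing about dxml itself, and the names BALANCED, UNBALANCED and try_parse are made up for the illustration.

```python
import xml.etree.ElementTree as ET

# Entity whose replacement text is a complete, balanced element.
BALANCED = '<!DOCTYPE root [<!ENTITY e "<b>hi</b>">]><root>&e;</root>'

# Entity that only opens a tag; the document itself tries to close it.
UNBALANCED = '<!DOCTYPE root [<!ENTITY open "<b>">]><root>&open;hi</b></root>'


def try_parse(doc):
    """Return the parsed root element, or None if the parser rejects doc."""
    try:
        return ET.fromstring(doc)
    except ET.ParseError:
        return None


balanced_root = try_parse(BALANCED)
unbalanced_root = try_parse(UNBALANCED)

# A non-None result with a <b> child means the entity's markup was parsed
# as markup, not passed through as text.
print("balanced:", balanced_root, "child b:",
      None if balanced_root is None else balanced_root.find("b"))
print("unbalanced:", unbalanced_root)
```

If expat behaves the way I expect, the first document yields a real b child element (so the replacement text is parsed as markup, not treated as plain text), and the second is rejected because the entity's start tag is closed outside the entity. That would suggest entity references cannot simply be passed along as text, but also that a start tag cannot leak out of one.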


There's also the issue that entity references open a whole can of worms concerning security. It is quite possible to have an exponentially growing entity replacement that can take down any parser that expands entities without a limit.

<!DOCTYPE root [
 <!ELEMENT root ANY>
 <!ENTITY LOL "LOL">
 <!ENTITY LOL1 "&LOL;&LOL;&LOL;&LOL;&LOL;&LOL;&LOL;&LOL;&LOL;&LOL;">
 <!ENTITY LOL2 "&LOL1;&LOL1;&LOL1;&LOL1;&LOL1;&LOL1;&LOL1;&LOL1;&LOL1;&LOL1;">
 <!ENTITY LOL3 "&LOL2;&LOL2;&LOL2;&LOL2;&LOL2;&LOL2;&LOL2;&LOL2;&LOL2;&LOL2;">
 <!ENTITY LOL4 "&LOL3;&LOL3;&LOL3;&LOL3;&LOL3;&LOL3;&LOL3;&LOL3;&LOL3;&LOL3;">
 <!ENTITY LOL5 "&LOL4;&LOL4;&LOL4;&LOL4;&LOL4;&LOL4;&LOL4;&LOL4;&LOL4;&LOL4;">
 <!ENTITY LOL6 "&LOL5;&LOL5;&LOL5;&LOL5;&LOL5;&LOL5;&LOL5;&LOL5;&LOL5;&LOL5;">
 <!ENTITY LOL7 "&LOL6;&LOL6;&LOL6;&LOL6;&LOL6;&LOL6;&LOL6;&LOL6;&LOL6;&LOL6;">
 <!ENTITY LOL8 "&LOL7;&LOL7;&LOL7;&LOL7;&LOL7;&LOL7;&LOL7;&LOL7;&LOL7;&LOL7;">
 <!ENTITY LOL9 "&LOL8;&LOL8;&LOL8;&LOL8;&LOL8;&LOL8;&LOL8;&LOL8;&LOL8;&LOL8;">
]>
<root>&LOL9;</root>

Hope you have enough memory (this expands to 1 000 000 000 LOLs, i.e. 3 000 000 000 characters)
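
To put a number on it without actually blowing up memory, the expanded length can be computed from the declarations alone. This is a rough sketch, not a real DTD parser: it assumes the simple `<!ENTITY name "value">` shape used above, and lol_dtd, expanded_lengths and the two regexes are hypothetical helpers written for this post.

```python
import re

# Deliberately naive patterns; they only cover the declaration shape above.
ENTITY_DECL = re.compile(r'<!ENTITY\s+(\w+)\s+"([^"]*)">')
ENTITY_REF = re.compile(r'&(\w+);')


def lol_dtd(levels=9, fanout=10):
    """Rebuild the entity declarations from the example above."""
    decls = ['<!ENTITY LOL "LOL">']
    prev = "LOL"
    for i in range(1, levels + 1):
        decls.append('<!ENTITY LOL%d "%s">' % (i, ("&%s;" % prev) * fanout))
        prev = "LOL%d" % i
    return "\n".join(decls)


def expanded_lengths(dtd):
    """Map each entity name to the length of its fully expanded text."""
    raw = dict(ENTITY_DECL.findall(dtd))
    sizes = {}

    def size(name, stack=()):
        if name in stack:
            raise ValueError("recursive entity: " + name)
        if name not in sizes:
            text = raw[name]
            # Literal characters plus the expanded size of every reference.
            literal = len(ENTITY_REF.sub("", text))
            refs = ENTITY_REF.findall(text)
            sizes[name] = literal + sum(size(r, stack + (name,)) for r in refs)
        return sizes[name]

    for name in raw:
        size(name)
    return sizes


sizes = expanded_lengths(lol_dtd())
print(sizes["LOL9"])  # 3000000000
```

Each level multiplies the previous one by ten, so nine levels on top of a three-character string give 3 * 10^9 characters. A parser that wants to stay safe could run this kind of bookkeeping (or simply cap the total expanded size or nesting depth) before substituting anything.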


