If you go the strict override route, then it must be limited to parsing of
HTML, and cannot apply to general parsing of XML. There is a pathway for
that in the Web Connector but I will need to look at it in depth and I do
not have time this week. Perhaps this weekend.
Karl
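
For reference, here is a minimal sketch of the HTML-only override discussed in this thread, assuming the getScriptParseState hook proposed in the quoted message below. The class names SketchTagScanner and SketchHtmlScanner, the noteTag callback, and the closesScript helper are hypothetical and are not ManifoldCF code; only the idea of a hook that returns 0 in the base class and 1 inside a script comes from the proposal. It illustrates that the base scanner keeps its behavior for general XML, while only the HTML subclass treats '<' inside a script body as text:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a generic tag scanner; NOT the real TagParseState.
class SketchTagScanner {
  // Hook from the proposal: 0 = normal parsing, 1 = inside a script body.
  // The base class always reports 0, so general XML parsing is unaffected.
  protected int getScriptParseState() {
    return 0;
  }

  // Hypothetical callback invoked with each recognized tag name ("b", "/b", ...).
  protected void noteTag(String name) {
  }

  // Assumption of this sketch: inside a script, only "</script" may start a tag.
  private static boolean closesScript(String input, int pos) {
    return input.regionMatches(true, pos, "</script", 0, 8);
  }

  // Returns the names of the tags recognized in the input, in document order.
  List<String> scan(String input) {
    List<String> tags = new ArrayList<>();
    int i = 0;
    while (i < input.length()) {
      if (input.charAt(i) != '<') {
        i++;
        continue;
      }
      boolean atEnd = i + 1 >= input.length();
      boolean inScript = getScriptParseState() == 1 && !closesScript(input, i);
      if (atEnd || Character.isWhitespace(input.charAt(i + 1)) || inScript) {
        i++; // Not a tag: treat '<' as ordinary text.
        continue;
      }
      int end = input.indexOf('>', i);
      if (end < 0) {
        break; // Unterminated tag: stop scanning in this simplified sketch.
      }
      String name = input.substring(i + 1, end).trim().split("\\s+")[0];
      tags.add(name);
      noteTag(name);
      i = end + 1;
    }
    return tags;
  }
}

// Hypothetical HTML-specific subclass: the only place the state becomes 1.
class SketchHtmlScanner extends SketchTagScanner {
  private int scriptParseState = 0;

  @Override
  protected int getScriptParseState() {
    return scriptParseState;
  }

  @Override
  protected void noteTag(String name) {
    if (name.equalsIgnoreCase("script")) {
      scriptParseState = 1; // Entering a script body.
    } else if (name.equalsIgnoreCase("/script")) {
      scriptParseState = 0; // Leaving the script body.
    }
  }
}
```

Calling scan on a page whose script contains "a<b && c>d" with SketchHtmlScanner reports no tag for that comparison, whereas SketchTagScanner still reports a "b" tag there, which is the strict behavior a non-HTML caller would keep.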
On Mon, Sep 9, 2019 at 5:28 AM wrote:
> Hi Karl,
>
> I'm not sure we're going in the right direction by trying to apply a strict
> XML parser in the HTML connector. HTML is not required to be XML-compliant
> (otherwise it would be XHTML), so many web pages are not well-formed
> XML. Speaking of which, the HTML source code I took as an example passes
> HTML validation.
> I've spent some time understanding how the main browsers handle the script
> tag while creating their DOM representation. As a matter of fact, they
> basically pause the DOM creation when finding it, and hand the scripts over
> to dedicated engines. See for instance this blog post explaining it:
> https://hacks.mozilla.org/2017/09/building-the-dom-faster-speculative-parsing-async-defer-and-preload/
> As such, if we want to follow a similar approach, one way I have in mind
> could be the following:
>
> Have a "getScriptParseState" method in the TagParseState class:
>
> protected int getScriptParseState()
> {
>   return 0;
> }
>
> that would be overridden by the FormParseState class:
>
> protected int getScriptParseState()
> {
>   return scriptParseState;
> }
>
> Then use this method in the switch statement of the TagParseState class, in the
> TAGPARSESTATE_SAWLEFTANGLE case (line 271 in MCF v2.12):
>
>
> else if (bTagDepth == 0)
> {
>   if (isWhitespace(thisChar) || getScriptParseState() == 1)
>   {
>     // Not a tag.
>     currentState = TAGPARSESTATE_NORMAL;
>
>
> Because scriptParseState would only be set to 1 in the
> ScriptParseState class, which is specific to the web connector, we can be sure
> that a connector parsing a standard XML file will not be affected
> by our HTML-specific method.
>
> What do you think?
>
> Julien
>
> -----Original Message-----
> From: Karl Wright
> Sent: Friday, September 6, 2019 16:54
> To: dev
> Subject: Re: TagParseState behavior with Web connector
>
> *IF* you wanted to allow broken XML to still be parsed correctly, the
> first thing you must do is come up with a list of exceptions to standard
> XML parsing that you would want to support. Presuming that you have a
> browser that you think is doing a good job of handling the broken HTML in
> question, you can certainly experiment to determine what that browser does
> with specific exception cases that you come up with. Once that is done,
> then the state diagram for the tag parser must be modified in the minimal
> way to permit your exceptions to work.
>
> This is no small task, because you will be forced to consider certain tags
> as applying context, and since you are doing that, you are therefore going
> to necessarily break correct XML parsing in a non-HTML situation. For
> example:
>
> if a
> ... would, in a true XML setting, recognize the beginning of a tag,
> and you would not want to break the case where it really was a tag:
>
> text <b>bold text</b>
>
> So an exception rule you might propose might be that if you start a tag,
> but don't properly complete it, the tag is not considered valid. But then
> there's this case:
>
> if ad {dostuff};
>
> Since & begins an XML entity, what do you do here? The parser will
> correctly detect an invalid entity, but then it also needs to understand
> that it's also an invalid tag.
>
> There are a ton of cases, and they would all have to be handled correctly
> for javascript to consistently and successfully not be interpreted as tags.
>
> I'm willing to look at this but you're going to need to supply that list
> of cases.
>
> Karl
>
>
> On Fri, Sep 6, 2019 at 9:34 AM wrote:
>
> > Hi Karl,
> >
> > Thanks for your suggestion. It took me some time to think about it, but I
> > think we have two different approaches for this case:
> > 1. In your case, if a source is problematic, that is the source's
> > problem, not the parser's or connector's, so the latter should
> > just discard the document.
> > 2. In my case, we start from the principle that
> > in many situations (especially in web or enterprise scenarios), sources
> > cannot be changed as we want, for instance because they belong
> > to another party that has no interest in changing the code (think of any
> > website that does not care who parses it), or because the software is
> > no longer maintained (old versions of CMS systems, for instance).
> >
> > The question then is: do we want to enable connectors to be modified
> > so that they can handle special non-compliant cases (which is our
> > case), or do we want connectors that only and strictly index content
> > that respects given specifications?
> > The solutions here would be:
> > 1. Use CDATA
> > 2. Put the javascript code in its own