Re: TagParseState behavior with Web connector

2019-09-09 Thread Karl Wright
If you go the strict override route, then it must be limited to parsing of
HTML, and cannot apply to general parsing of XML.  There is a pathway for
that in the Web Connector but I will need to look at it in depth and I do
not have time this week.  Perhaps this weekend.

Karl


On Mon, Sep 9, 2019 at 5:28 AM  wrote:

> Hi Karl,
>
> I'm not sure we're going in the right direction by trying to apply a strict
> XML parser in the HTML connector. HTML is not required to be XML compliant
> (otherwise it would be XHTML), and many web pages are not. Speaking of
> which, the HTML source code I took as an example passes HTML validation.
> I've spent some time understanding how the main browsers handle the script
> tag while creating their DOM representation. As a matter of fact, they
> basically pause the DOM creation when they find it, and hand the script over
> to a dedicated engine. See for instance this blog post explaining it:
> https://hacks.mozilla.org/2017/09/building-the-dom-faster-speculative-parsing-async-defer-and-preload/
> As such, if we want to follow a similar approach, one way I have in mind
> could be the following:
>
> Have a "getScriptParseState" method in the TagParseState class :
>
> protected int getScriptParseState()
> {
>   return 0;
> }
>
> that would be overridden by the FormParseState class:
>
> protected int getScriptParseState()
> {
>   return scriptParseState;
> }
>
> Then use this method in the switch statement of the TagParseState class, in
> the TAGPARSESTATE_SAWLEFTANGLE case (line 271 in MCF v2.12):
>
> 
> else if (bTagDepth == 0)
> {
>   if (isWhitespace(thisChar) || getScriptParseState() == 1)
>   {
>     // Not a tag.
>     currentState = TAGPARSESTATE_NORMAL;
> 
>
> As the scriptParseState value would only be set to 1 in the
> ScriptParseState class, which is specific to the web connector, we can be
> sure that a connector that needs to parse a standard XML file will not be
> impacted by our HTML-specific method.
>
> What do you think ?
>
> Julien
>
> -Original Message-
> From: Karl Wright 
> Sent: Friday, September 6, 2019 16:54
> To: dev 
> Subject: Re: TagParseState behavior with Web connector
>
> *IF* you wanted to allow broken XML to be still correctly parsed, the
> first thing you must do is come up with a list of exceptions to standard
> XML parsing that you would want to support.  Presuming that you have a
> browser that you think is doing a good job of handling the broken HTML in
> question, you can certainly experiment to determine what that browser does
> with specific exception cases that you come up with.  Once that is done,
> then the state diagram for the tag parser must be modified in the minimal
> way to permit your exceptions to work.
>
> This is no small task, because you will be forced to consider certain tags
> as applying context, and since you are doing that, you are therefore going
> to necessarily break correct XML parsing in a non-HTML situation.  For
> example:
>
> if a
> ... would, in a true XML setting, recognize the beginning of a <b> tag,
> and you would not want to break the case where it really was a <b> tag:
>
>  text <b> bold text </b>
>
> So an exception rule you might propose might be that if you start a tag,
> but don't properly complete it, the tag is not considered valid.  But then
> there's this case:
>
>  if ad {dostuff};
>
> Since the & is an XML entity begin, what do you do here?  The parser will
> correctly detect an invalid entity, but then it also needs to understand
> that it's also an invalid tag.
>
> There are a ton of cases, and they would all have to be handled correctly
> for javascript to consistently and successfully not be interpreted as tags.
>
> I'm willing to look at this but you're going to need to supply that list
> of cases.
>
> Karl
>
>
> On Fri, Sep 6, 2019 at 9:34 AM  wrote:
>
> > Hi Karl,
> >
> > Thanks for your suggestion. Took me some time to think about it, but I
> > think we have two different approaches for this case:
> > 1. In your case, it seems like if a source is problematic, it is its
> > own problem, not the one of the parser/connector, so the latter should
> > just discard the doc
> > 2. In my case, we start from the principle that in many situations
> > (especially in web or enterprise scenarios), sources cannot be changed
> > as we want, be it for instance because they belong to another party
> > that has no interest in changing the code (think any website that does
> > not care who parses it), or because the software is not maintained
> > anymore (old versions of CMS systems for instance).
> >
> > The question then is: do we want to enable connectors to be modified
> > so that they can handle special non-compliant cases (which is our
> > case), or do we want connectors that only and strictly index content
> > that respects given specifications?
> > The solutions here would be :
> > 1. Use CDATA
> > 2. Put the javascript code in its own 

RE: TagParseState behavior with Web connector

Hi Karl, 

I'm not sure we're going in the right direction by trying to apply a strict XML 
parser in the HTML connector. HTML is not required to be XML compliant 
(otherwise it would be XHTML), and many web pages are not. Speaking of which, 
the HTML source code I took as an example passes HTML validation.
I've spent some time understanding how the main browsers handle the script tag 
while creating their DOM representation. As a matter of fact, they basically 
pause the DOM creation when they find it, and hand the script over to a 
dedicated engine. See for instance this blog post explaining it: 
https://hacks.mozilla.org/2017/09/building-the-dom-faster-speculative-parsing-async-defer-and-preload/
As such, if we want to follow a similar approach, one way I have in mind could 
be the following:

Have a "getScriptParseState" method in the TagParseState class :

protected int getScriptParseState()
{
  return 0;
}

that would be overridden by the FormParseState class: 

protected int getScriptParseState()
{ 
  return scriptParseState;
}

Then use this method in the switch statement of the TagParseState class, in the 
TAGPARSESTATE_SAWLEFTANGLE case (line 271 in MCF v2.12):


else if (bTagDepth == 0)
{
  if (isWhitespace(thisChar) || getScriptParseState() == 1)
  {
    // Not a tag.
    currentState = TAGPARSESTATE_NORMAL;


As the scriptParseState value would only be set to 1 in the 
ScriptParseState class, which is specific to the web connector, we can be sure 
that a connector that needs to parse a standard XML file will not be impacted 
by our HTML-specific method.  
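
To make the idea concrete, here is a minimal, self-contained sketch of the hook. 
It is not the MCF code: SketchTagParser and SketchScriptAwareParser are 
hypothetical, heavily simplified stand-ins for TagParseState and 
FormParseState, the three states are an assumption, and the sample input is 
made up for illustration. It also leaves open how the closing script tag would 
be recognized once the flag is set.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified stand-in for TagParseState (not the real MCF class).
class SketchTagParser {
  static final int STATE_NORMAL = 0;
  static final int STATE_SAWLEFTANGLE = 1;
  static final int STATE_IN_TAG = 2;

  private int currentState = STATE_NORMAL;
  private final StringBuilder tagText = new StringBuilder();
  // Raw text of every construct that was treated as a tag.
  final List<String> tags = new ArrayList<>();

  // Hook: 0 means "treat '<' as usual"; a subclass returns 1 inside a script body.
  protected int getScriptParseState() {
    return 0;
  }

  void feed(String text) {
    for (char thisChar : text.toCharArray()) {
      switch (currentState) {
        case STATE_NORMAL:
          if (thisChar == '<') {
            currentState = STATE_SAWLEFTANGLE;
          }
          break;
        case STATE_SAWLEFTANGLE:
          if (Character.isWhitespace(thisChar) || getScriptParseState() == 1) {
            // Not a tag.
            currentState = STATE_NORMAL;
          } else {
            tagText.setLength(0);
            tagText.append(thisChar);
            currentState = STATE_IN_TAG;
          }
          break;
        case STATE_IN_TAG:
          if (thisChar == '>') {
            tags.add(tagText.toString());
            currentState = STATE_NORMAL;
          } else {
            tagText.append(thisChar);
          }
          break;
        default:
          break;
      }
    }
  }
}

// Hypothetical stand-in for FormParseState: only this subclass ever reports 1.
class SketchScriptAwareParser extends SketchTagParser {
  int scriptParseState = 0; // assumed to be set to 1 while inside a script body

  @Override
  protected int getScriptParseState() {
    return scriptParseState;
  }
}

class ScriptHookDemo {
  public static void main(String[] args) {
    String script = "if a<b && c>d"; // made-up script fragment

    SketchTagParser strict = new SketchTagParser();
    strict.feed(script);
    System.out.println("base parser tags: " + strict.tags);

    SketchScriptAwareParser lenient = new SketchScriptAwareParser();
    lenient.scriptParseState = 1;
    lenient.feed(script);
    System.out.println("script-aware parser tags: " + lenient.tags);
  }
}
```

Because the base class always returns 0, a parser that never overrides the hook 
behaves exactly as before; only the subclass can switch the extra branch on.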

What do you think ? 

Julien

-Original Message-
From: Karl Wright  
Sent: Friday, September 6, 2019 16:54
To: dev 
Subject: Re: TagParseState behavior with Web connector

*IF* you wanted to allow broken XML to be still correctly parsed, the first 
thing you must do is come up with a list of exceptions to standard XML parsing 
that you would want to support.  Presuming that you have a browser that you 
think is doing a good job of handling the broken HTML in question, you can 
certainly experiment to determine what that browser does with specific 
exception cases that you come up with.  Once that is done, then the state 
diagram for the tag parser must be modified in the minimal way to permit your 
exceptions to work.

This is no small task, because you will be forced to consider certain tags as 
applying context, and since you are doing that, you are therefore going to 
necessarily break correct XML parsing in a non-HTML situation.  For
example:

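if a
... would, in a true XML setting, recognize the beginning of a <b> tag, and 
you would not want to break the case where it really was a <b> tag:

 text <b> bold text </b>

So an exception rule you might propose might be that if you start a tag, but 
don't properly complete it, the tag is not considered valid.  But then there's 
this case:

 if ad {dostuff};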

Since the & is an XML entity begin, what do you do here?  The parser will 
correctly detect an invalid entity, but then it also needs to understand that 
it's also an invalid tag.

There are a ton of cases, and they would all have to be handled correctly for 
javascript to consistently and successfully not be interpreted as tags.
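
As a rough illustration of what collecting such a list could look like, the 
sketch below runs a few candidate script fragments through two naive checks: 
"does a '<' followed by a non-whitespace character appear" and "does an '&' 
appear that is not a well-formed entity reference". Everything here is an 
assumption made for illustration: the class name ExceptionCaseList, both 
helper methods, and the sample fragments are hypothetical, and the checks are 
far simpler than what the real tag parser does.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for building a list of exception cases; not part of MCF.
class ExceptionCaseList {

  // Naive rule: '<' immediately followed by a non-whitespace character opens a tag.
  static boolean looksLikeTagStart(String text) {
    for (int i = 0; i + 1 < text.length(); i++) {
      if (text.charAt(i) == '<' && !Character.isWhitespace(text.charAt(i + 1))) {
        return true;
      }
    }
    return false;
  }

  // Naive rule: every '&' must be followed by name characters and then ';'.
  static boolean hasInvalidEntity(String text) {
    for (int i = text.indexOf('&'); i >= 0; i = text.indexOf('&', i + 1)) {
      int j = i + 1;
      while (j < text.length()
          && (Character.isLetterOrDigit(text.charAt(j)) || text.charAt(j) == '#')) {
        j++;
      }
      // Invalid if there is no name, or the name is not terminated by ';'.
      if (j == i + 1 || j >= text.length() || text.charAt(j) != ';') {
        return true;
      }
    }
    return false;
  }

  public static void main(String[] args) {
    // Made-up fragments; a real list would come from pages that fail to parse.
    List<String> cases = Arrays.asList(
        "if (a < b) {dostuff};",
        "if (a<b) {dostuff};",
        "if (a<b&c>d) {dostuff};",
        "x &amp; y");
    for (String c : cases) {
      System.out.println(c
          + " | tag start: " + looksLikeTagStart(c)
          + " | invalid entity: " + hasInvalidEntity(c));
    }
  }
}
```

Each fragment that trips one of the checks is a candidate entry for the list, 
to be compared against what a browser does with the same input.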

I'm willing to look at this but you're going to need to supply that list of 
cases.

Karl


On Fri, Sep 6, 2019 at 9:34 AM  wrote:

> Hi Karl,
>
> Thanks for your suggestion. Took me some time to think about it, but I 
> think we have two different approaches for this case:
> 1. In your case, it seems like if a source is problematic, it is its 
> own problem, not the one of the parser/connector, so the latter should 
> just discard the doc
> 2. In my case, we start from the principle that in many situations 
> (especially in web or enterprise scenarios), sources cannot be changed 
> as we want, be it for instance because they belong to another party 
> that has no interest in changing the code (think any website that does 
> not care who parses it), or because the software is not maintained 
> anymore (old versions of CMS systems for instance).
>
> The question then is: do we want to enable connectors to be modified 
> so that they can handle special non-compliant cases (which is our 
> case), or do we want connectors that only and strictly index content 
> that respects given specifications?
> The solutions here would be :
> 1. Use CDATA
> 2. Put the javascript code in its own file
> 3. Encode every problematic char in the javascript
> Each solution requires modifying the source webpage, which may be 
> impossible or refused by the source owner, and the last one would make 
> the javascript code less readable and harder for developers to 
> understand...
>
> So if I rephrase my question a bit, I would add to what I wrote in my 
> first email:
>
> Assuming that the mentioned source document MUST be parsed to manage 
> to perform the form based