Re: TagParseState behavior with Web connector

Karl Wright Mon, 09 Sep 2019 03:21:24 -0700

If you go the strict override route, then it must be limited to parsing of
HTML, and cannot apply to general parsing of XML.  There is a pathway for
that in the Web Connector but I will need to look at it in depth and I do
not have time this week.  Perhaps this weekend.
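
To make that constraint concrete, here is a minimal sketch in Java of the
override idea, under stated assumptions. It is not ManifoldCF code:
SketchTagParser, SketchHtmlParser, tagsSeen and the exposeState flag are
hypothetical names, and the scanner is far simpler than the real
TagParseState state machine. It also assumes that inside a script only
"</" may start a tag. The generic parser's hook always returns 0, so only
the HTML-specific subclass can switch the relaxed behavior on.

```java
import java.util.ArrayList;
import java.util.List;

class ScriptHookSketch {

  /** Simplified stand-in for a generic tag parser (not the real TagParseState). */
  static class SketchTagParser {
    /** Hook from the proposal: the generic parser is never inside a script. */
    protected int getScriptParseState() {
      return 0;
    }

    /** Receives the name of every tag the scanner recognizes. */
    protected void noteTag(String name) {
    }

    public final void parse(String input) {
      int i = 0;
      while (i < input.length()) {
        if (input.charAt(i) != '<' || i + 1 >= input.length()) {
          i++;
          continue;
        }
        char next = input.charAt(i + 1);
        // Assumption: inside a script, only "</" may start a tag.
        boolean scriptText = getScriptParseState() == 1 && next != '/';
        if (Character.isWhitespace(next) || scriptText) {
          i++; // Not a tag: treat '<' as ordinary text.
          continue;
        }
        int end = input.indexOf('>', i);
        if (end < 0) {
          return; // Unterminated tag: nothing more to report.
        }
        String body = input.substring(i + 1, end).trim();
        noteTag(body.split("\\s+")[0].toLowerCase());
        i = end + 1;
      }
    }
  }

  /** Simplified stand-in for the web connector's script-aware layer. */
  static class SketchHtmlParser extends SketchTagParser {
    private final boolean exposeState; // hypothetical switch, for comparison only
    private int scriptParseState = 0;
    final List<String> tagsOutsideScript = new ArrayList<>();

    SketchHtmlParser(boolean exposeState) {
      this.exposeState = exposeState;
    }

    @Override
    protected int getScriptParseState() {
      return exposeState ? scriptParseState : 0;
    }

    @Override
    protected void noteTag(String name) {
      if (name.equals("script")) {
        scriptParseState = 1;
      } else if (name.equals("/script")) {
        scriptParseState = 0;
      } else if (scriptParseState == 0) {
        tagsOutsideScript.add(name);
      }
    }
  }

  /** Returns the tags the HTML layer sees outside of script blocks. */
  static List<String> tagsSeen(String html, boolean exposeState) {
    SketchHtmlParser parser = new SketchHtmlParser(exposeState);
    parser.parse(html);
    return parser.tagsOutsideScript;
  }

  public static void main(String[] args) {
    String html =
        "<script>if(myvar <= 9) {doStuff();}</script><form action='login.jsp'>";
    System.out.println(tagsSeen(html, false)); // prints []
    System.out.println(tagsSeen(html, true));  // prints [form]
  }
}
```

Without the hook, the "<=" swallows the closing script tag and the form is
never reported. With it, the form is seen, while input that contains no
script tag is scanned exactly as before.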


Karl


On Mon, Sep 9, 2019 at 5:28 AM <[email protected]> wrote:

> Hi Karl,
>
> I'm not sure we're going in the right direction by trying to apply a strict
> XML parser in the HTML connector. HTML is not required to be XML compliant
> (when it is, it is XHTML), so strict XML is not what many web pages are
> made of. Speaking of which, the HTML source code I took as an example
> passes HTML validation.
> I've spent some time understanding how the main browsers handle the script
> tag while creating their DOM representation. As a matter of fact, they
> basically pause the DOM creation when finding it, and hand the scripts over
> to dedicated engines. See for instance this blog post explaining it:
> https://hacks.mozilla.org/2017/09/building-the-dom-faster-speculative-parsing-async-defer-and-preload/
> As such, if we want to follow a similar approach, one way I have in mind
> could be the following:
>
> Have a "getScriptParseState" method in the TagParseState class :
>
> protected int getScriptParseState()
> {
>   return 0;
> }
>
> that would be overridden by the FormParseState class:
>
> protected int getScriptParseState()
> {
>   return scriptParseState;
> }
>
> Then use this method in the switch statement of the TagParseState class,
> in the TAGPARSESTATE_SAWLEFTANGLE case (line 271 in MCF v2.12):
>
> ....
> else if (bTagDepth == 0)
>       {
>         if (isWhitespace(thisChar) || getScriptParseState() == 1 )
>         {
>           // Not a tag.
>           currentState = TAGPARSESTATE_NORMAL;
> ....
>
> As the scriptParseState variable would only be set to 1 in the
> ScriptParseState class, which is specific to the web connector, we are sure
> that a connector that needs to parse a standard XML file will not be
> impacted by our HTML-specific method.
>
> What do you think ?
>
> Julien
>
> -----Original Message-----
> From: Karl Wright <[email protected]>
> Sent: Friday, September 6, 2019 16:54
> To: dev <[email protected]>
> Subject: Re: TagParseState behavior with Web connector
>
> *IF* you wanted to allow broken XML to be still correctly parsed, the
> first thing you must do is come up with a list of exceptions to standard
> XML parsing that you would want to support.  Presuming that you have a
> browser that you think is doing a good job of handling the broken HTML in
> question, you can certainly experiment to determine what that browser does
> with specific exception cases that you come up with.  Once that is done,
> then the state diagram for the tag parser must be modified in the minimal
> way to permit your exceptions to work.
>
> This is no small task, because you will be forced to consider certain tags
> as applying context, and since you are doing that, you are therefore going
> to necessarily break correct XML parsing in a non-HTML situation.  For
> example:
>
> <script>if a<b {dostuff};</script>
>
> ... would, in a true XML setting, recognize the beginning of a <b> tag,
> and you would not want to break the case where it really was a <b> tag:
>
> <something> text <b> bold text </b> </something>
>
> So an exception rule you might propose might be that if you start a tag,
> but don't properly complete it, the tag is not considered valid.  But then
> there's this case:
>
> <script> if a<b&&c>d {dostuff};</script>
>
> Since the & begins an XML entity, what do you do here?  The parser will
> correctly detect an invalid entity, but then it also needs to understand
> that it's also an invalid tag.
>
> There are a ton of cases, and they would all have to be handled correctly
> for javascript to consistently and successfully not be interpreted as tags.
>
> I'm willing to look at this but you're going to need to supply that list
> of cases.
>
> Karl
>
>
> On Fri, Sep 6, 2019 at 9:34 AM <[email protected]> wrote:
>
> > Hi Karl,
> >
> > Thanks for your suggestion. Took me some time to think about it, but I
> > think we have two different approaches for this case:
> > 1. In your case, it seems that if a source is problematic, that is the
> > source's problem, not the parser's or connector's, so the latter should
> > just discard the doc.
> > 2. In my case, we start from the principle that in many situations
> > (especially in web or enterprise scenarios), sources cannot be changed
> > as we want, for instance because they belong to another party that has
> > no interest in changing the code (think of any website that does not
> > care who parses it), or because the software is no longer maintained
> > (old versions of CMS systems, for instance).
> >
> > The question then is: do we want to enable connectors to be modified
> > so that they can handle special non-compliant cases (which is our
> > case), or do we want connectors that strictly index only content
> > that respects given specifications?
> > The solutions here would be:
> > 1. Use CDATA
> > 2. Put the javascript code in its own file
> > 3. Encode every problematic char in the javascript
> > Each solution requires modifying the source webpage, which may be
> > impossible or refused by the source owner, and the last one would make
> > the javascript code less readable and harder for developers to
> > understand...
> >
> > So if I rephrase a bit my question, I would add to what I wrote in my
> > first email:
> >
> > Assuming that the mentioned source document MUST be parsed in order
> > to perform the form based authentication, and assuming that it cannot
> > be modified and thus cannot comply with any of the recommendations
> > listed above, what would be your recommended approach to modify the
> > connector so that it can optionally handle such cases, where we have
> > spotted a given sequence of characters that poses a problem?
> >
> > Regards,
> > Julien
> >
> > -----Original Message-----
> > From: Karl Wright <[email protected]>
> > Sent: Thursday, September 5, 2019 18:30
> > To: dev <[email protected]>
> > Subject: Re: TagParseState behavior with Web connector
> >
> > The parser requires that the document being parsed be valid XML.  Data
> > within non-CDATA sections is *required* to use entity references to
> > include < or > characters.  See:
> >
> >
> > https://stackoverflow.com/questions/330725/use-of-greater-than-symbol-in-xml
> >
> >
> > Karl
> >
> >
> > On Thu, Sep 5, 2019 at 12:10 PM Julien Massiera <
> > [email protected]> wrote:
> >
> > > Hi Karl,
> > >
> > > I discovered a problematic behavior with the
> > > org.apache.manifoldcf.connectorcommon.fuzzyml.TagParseState class
> > > when crawling web pages. This behavior poses a problem in particular
> > > for the scenario of form based authentication, as explained further in
> > > my email.
> > >
> > > The org.apache.manifoldcf.connectorcommon.fuzzyml.HTMLParseState
> > > class, which is called by the TagParseState on each noteTag() or
> > > noteEndTag() call, uses the
> > > org.apache.manifoldcf.crawler.connectors.webcrawler.ScriptParseState
> > > class to detect whether the parsing process is inside or outside a
> > > 'script' tag and then decide whether to act on the incoming data.
> > >
> > > The problem is that the TagParseState class is not aware of the type
> > > of tag currently parsed, so it continues to analyze any char
> > > encountered to detect tags even if it is actually parsing a script tag.
> > > So let's imagine you have a script tag built like this in a web page:
> > >
> > > <script>if(myvar <= 9) {.......}</script>
> > >
> > > When the TagParseState parses the char '<' it will consider that a
> > > new tag begins until it encounters a '>' char. So in the case above,
> > > the TagParseState will never catch the end of the script tag, and
> > > thus, the scriptParseState variable in the ScriptParseState class
> > > will remain in the SCRIPTPARSESTATE_INSCRIPT state and the rest of
> > > the web page will not be correctly handled by the other parsers.
> > >
> > > As a result, if you, for example, configure form authentication
> > > for your crawl and the form web page contains this kind of
> > > script tag prior to the form tag, the form will never be handled and
> > > the authentication will fail. This was the case I encountered, and I
> > > resolved it by forcing the scriptParseState to be
> > > SCRIPTPARSESTATE_NORMAL.
> > >
> > > I have difficulty finding an elegant way to solve this issue, so I
> > > would gladly welcome your thoughts on that.
> > >
> > > To make this behavior easy to reproduce, just create an
> > > HTML file with the following content:
> > >
> > >
> > > <!doctype html><html lang="fr"><head><meta name="Viewport"
> > > content="width=device-width, height=device-height"/><meta
> > > charset="utf-8"/><meta http-equiv="X-UA-Compatible" content="IE=edge"
> > > /><noscript><meta http-equiv="refresh" content="0;
> > > URL=error.jsp?errorMessage=error.JavaScriptDisabled"/></noscript><link
> > > rel="shortcut icon" type="image/x-icon"
> > > href="/form/images/favicon.ico"
> > > /><link rel="stylesheet"
> > > href="/form/css/jQuery/ui-ilex-theme/jquery-ui-1.10.4.custom.min.css"
> > > type="text/css"><link rel="stylesheet" type="text/css"
> > > href="/form/css/bootstrap.min.css" /><link rel="stylesheet"
> > > type="text/css"
> > > href="/form/css/styles_sign_and_go.css" /><link rel="stylesheet"
> > > type="text/css" href="/form/css/styles_custom.css" /><script
> > > src="/form/js/jQuery/jquery.min.js"></script><script
> > > src="/form/js/bootstrap.min.js"></script><script
> > > src="/form/js/authenticator.js"></script><script>$(document).ready(function()
> > > {$("button, input[type='submit'], input[type='cancel'],
> > > input[type='button']").addClass("ui-button ui-widget
> > > ui-state-default ui-corner-all");});</script><script
> > > src="/form/img_func.js"></script><!--[if lt IE 9]><script
> > > src="/form/js/ie_polyfills.js"></script><![endif]--><script
> > > src="/form/js/custom.js"></script><title>Redirection to source URL
> > > </title>
> > >                         </head>
> > >                         <body
> > > onload='give_focus_and_verif_cookie_enabled()'><script>var
> > > retryCount=0;function getIEVersion() { var match =
> > > navigator.userAgent.match(/(?:MSIE |Trident\/.*; rv:)(\d+)/); return
> > > match ? parseInt(match[1]) : -1; }function
> > > give_focus(){if(retryCount>100){return;}var currentIEVersion =
> > > getIEVersion();if(currentIEVersion <= 9){var bFound =
> > > false;if(document.forms[0]!=null){for(i=0; i <
> > > document.forms[0].length;
> > > i++){retryCount = retryCount+1;try{if (document.forms[0][i].type !=
> > > "hidden") { if (document.forms[0][i].disabled != true) {
> > >  document.forms[0][i].focus();     var bFound = true;  } } if (bFound
> > > == true)   break; } catch(err) { setTimeout("give_focus()",1000); }
> > > }}}}function
> > > give_focus_and_verif_cookie_enabled(){give_focus();if(!navigator.cookieEnabled){
> > > window.location.href="error.jsp?errorMessage=error.CookieDisabled";}
> > > }< /script><div id="wrapper"><div id="header"><div
> > > class="container"><div
> > > class="logo"></div><h1>Authentication</h1><div class="changeLang"><a
> > > href="?displayLang=en-gb">EN</a> | <a
> > > href="?displayLang=fr-fr">FR</a></div></div></div>
> > >
> > >                         <form action='login.jsp' method='post'
> > > name='theform'>
> > >                         <input type="hidden" name="csrfAuth"
> > > value="-aja2lwx5jf09">
> > >                         <input type="hidden"
> > > name="sng-remember-me-fingerprint" id="sng-remember-me-fingerprint"
> > > value="null" >
> > >
> > >
> > >                         </form>
> > >                     <div id="content">
> > >                       <div class="container">
> > >                         <div id="contenu_specifique_application" >
> > >                                  <div class="app msgLoading">
> > >                                    <div class="app-description"
> > > style="height:64px;">
> > >                         <h3>You will be redirected within a few
> > > seconds.</h3>
> > >                         </div>
> > >                         </div>
> > >                         </div>
> > >                         </div>
> > >                         </div>
> > >        <script>
> > >        $(document).ready(function(){
> > >
> > > document.getElementById('sng-remember-me-fingerprint').value
> > > = getStoreLocal('sng-remember-me-fingerprint');
> > >          document.theform.submit();
> > >          try {
> > >              history.replaceState(null, "", document.referrer );
> > >          } catch(err) {
> > >            // security error in edge
> > >          }
> > >        });
> > >        </script>
> > >                         </div></body></html>
> > >
> > >
> > >
> > >
> > > Regards
> > >
> > > --
> > > Julien MASSIERA
> > > Directeur développement produit
> > > France Labs – Les experts du Search
> > > www.francelabs.com
> > >
> > >
> >
> >
>
>
