If you go the strict override route, then it must be limited to parsing of HTML, and cannot apply to general parsing of XML. There is a pathway for that in the Web Connector but I will need to look at it in depth and I do not have time this week. Perhaps this weekend.
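To make that constraint concrete, here is a minimal sketch of how a script-aware override could stay confined to HTML parsing. It is not ManifoldCF code: ToyTagScanner, ToyHtmlScanner, inScript() and the noteTag(...) signature are hypothetical names made up for this illustration, and the real TagParseState state machine is considerably more involved. It also assumes one possible exception rule, namely that inside a script only "</script" may start a tag; that rule has not been agreed on in this thread.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical toy scanner (NOT ManifoldCF's TagParseState): it only collects tag names.
class ToyTagScanner {
  // Hook: generic XML-style scanning never has a script context.
  protected boolean inScript() { return false; }

  // Called for every recognized tag so subclasses can track context.
  protected void noteTag(String name, boolean isEnd) {}

  List<String> scan(String text) {
    List<String> tags = new ArrayList<>();
    int i = 0;
    while (i < text.length()) {
      if (text.charAt(i) != '<' || !startsTag(text, i)) { i++; continue; }
      int close = text.indexOf('>', i);
      if (close < 0) break; // unterminated tag: stop scanning
      String body = text.substring(i + 1, close).trim();
      boolean isEnd = body.startsWith("/");
      String name = (isEnd ? body.substring(1) : body).trim().split("\\s+")[0].toLowerCase();
      tags.add((isEnd ? "/" : "") + name);
      noteTag(name, isEnd);
      i = close + 1;
    }
    return tags;
  }

  private boolean startsTag(String text, int lt) {
    if (inScript()) {
      // Assumed exception rule: inside a script, only "</script" starts a tag.
      return text.regionMatches(true, lt, "</script", 0, 8);
    }
    // Rule mentioned in the thread: '<' followed by whitespace is not a tag.
    return lt + 1 < text.length() && !Character.isWhitespace(text.charAt(lt + 1));
  }
}

// HTML-only subclass: the only place where the script context is ever switched on.
class ToyHtmlScanner extends ToyTagScanner {
  private boolean scriptOpen = false;

  @Override protected boolean inScript() { return scriptOpen; }

  @Override protected void noteTag(String name, boolean isEnd) {
    if (name.equals("script")) scriptOpen = !isEnd;
  }
}

class ScriptAwareScanDemo {
  public static void main(String[] args) {
    String page = "<script>if(myvar <= 9) {}</script><form>";
    System.out.println("generic: " + new ToyTagScanner().scan(page));
    System.out.println("html:    " + new ToyHtmlScanner().scan(page));
  }
}
```

With the sample input, the generic scanner swallows the closing script tag into a bogus tag that starts at "<=", while the HTML subclass reports script, /script and form. Because the base class hook always returns false, a connector that uses the generic scanner on real XML keeps its current behavior.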
Karl

On Mon, Sep 9, 2019 at 5:28 AM <julien.massi...@francelabs.com> wrote:

> Hi Karl,
>
> I'm not sure we're going in the right direction by trying to apply a
> strict XML parser in the HTML connector. HTML is not necessarily XML
> compliant (otherwise it would be XHTML), and strict XML is therefore not
> what many web pages are made of. Speaking of which, the HTML source code
> I took as an example passes HTML validation.
>
> I've spent some time understanding how the main browsers handle the
> script tag while creating their DOM representation. As a matter of fact,
> they basically pause the DOM creation when they find one, and hand the
> script over to a dedicated engine. See for instance this blog post:
> https://hacks.mozilla.org/2017/09/building-the-dom-faster-speculative-parsing-async-defer-and-preload/
>
> If we want to follow a similar approach, one way I have in mind is the
> following:
>
> Have a "getScriptParseState" method in the TagParseState class:
>
> protected int getScriptParseState()
> {
>   return 0;
> }
>
> that would be overridden by the FormParseState class:
>
> protected int getScriptParseState()
> {
>   return scriptParseState;
> }
>
> Then use this method in the switch case of the TagParseState class for
> the TAGPARSESTATE_SAWLEFTANGLE case (l271 in MCF v2.12):
>
> ....
> else if (bTagDepth == 0)
> {
>   if (isWhitespace(thisChar) || getScriptParseState() == 1)
>   {
>     // Not a tag.
>     currentState = TAGPARSESTATE_NORMAL;
> ....
>
> As the scriptParseState parameter would only be set to 1 in the
> ScriptParseState class, which is specific to the web connector, we can be
> sure that a connector that needs to parse a standard XML file will not be
> impacted by our HTML-specific method.
>
> What do you think?
>
> Julien
>
> -----Original Message-----
> From: Karl Wright <daddy...@gmail.com>
> Sent: Friday, September 6, 2019 16:54
> To: dev <dev@manifoldcf.apache.org>
> Subject: Re: TagParseState behavior with Web connector
>
> *IF* you wanted to allow broken XML to still be parsed correctly, the
> first thing you must do is come up with a list of exceptions to standard
> XML parsing that you would want to support. Presuming that you have a
> browser that you think is doing a good job of handling the broken HTML in
> question, you can certainly experiment to determine what that browser
> does with the specific exception cases that you come up with. Once that
> is done, the state diagram for the tag parser must be modified in the
> minimal way that permits your exceptions to work.
>
> This is no small task, because you will be forced to treat certain tags
> as establishing context, and since you are doing that, you are
> necessarily going to break correct XML parsing in a non-HTML situation.
> For example:
>
> <script>if a<b {dostuff};</script>
>
> ... would, in a true XML setting, recognize the beginning of a <b> tag,
> and you would not want to break the case where it really was a <b> tag:
>
> <something> text <b> bold text </b> </something>
>
> So an exception rule you might propose is that if you start a tag but
> don't properly complete it, the tag is not considered valid. But then
> there's this case:
>
> <script> if a<b&&c>d {dostuff};</script>
>
> Since the & is the start of an XML entity, what do you do here? The
> parser will correctly detect an invalid entity, but then it also needs to
> understand that it's also an invalid tag.
>
> There are a ton of cases, and they would all have to be handled correctly
> for JavaScript to consistently and successfully not be interpreted as
> tags.
>
> I'm willing to look at this but you're going to need to supply that list
> of cases.
>
> Karl
>
>
> On Fri, Sep 6, 2019 at 9:34 AM <julien.massi...@francelabs.com> wrote:
>
> > Hi Karl,
> >
> > Thanks for your suggestion. It took me some time to think about it,
> > but I think we have two different approaches for this case:
> > 1. In your case, if a source is problematic, that is the source's own
> >    problem, not the parser's or the connector's, so the latter should
> >    just discard the doc.
> > 2. In my case, we start from the principle that in many situations
> >    (especially in web or enterprise scenarios), sources cannot be
> >    changed as we want, for instance because they belong to another
> >    party that has no interest in changing the code (think of any
> >    website that does not care who parses it), or because the software
> >    is not maintained anymore (old versions of CMS systems for
> >    instance).
> >
> > The question then is: do we want to enable connectors to be modified
> > so that they can handle special non-compliant cases (which is our
> > case), or do we want connectors that only and strictly index content
> > that respects given specifications?
> >
> > The solutions on the source side would be:
> > 1. Use CDATA
> > 2. Put the JavaScript code in its own file
> > 3. Encode every problematic char in the JavaScript
> >
> > Each solution requires modifying the source web page, which may be
> > impossible or refused by the source owner, and the last one would make
> > the JavaScript code less readable and harder for developers to
> > understand...
> >
> > So, to rephrase my question a bit, I would add the following to what I
> > wrote in my first email:
> >
> > Assuming that the source document in question MUST be parsed in order
> > to perform the form-based authentication, and assuming that it cannot
> > be modified and thus cannot comply with any of the recommendations
> > listed above, what would be your recommended approach to modify the
> > connector so that it can optionally handle such cases, where we have
> > spotted a given sequence of characters that poses a problem?
> >
> > Regards,
> > Julien
> >
> > -----Original Message-----
> > From: Karl Wright <daddy...@gmail.com>
> > Sent: Thursday, September 5, 2019 18:30
> > To: dev <dev@manifoldcf.apache.org>
> > Subject: Re: TagParseState behavior with Web connector
> >
> > The parser requires that the document being parsed be valid XML. Data
> > within non-CDATA sections is *required* to use entity references to
> > include < or > characters. See:
> >
> > https://stackoverflow.com/questions/330725/use-of-greater-than-symbol-in-xml
> >
> > Karl
> >
> >
> > On Thu, Sep 5, 2019 at 12:10 PM Julien Massiera <
> > julien.massi...@francelabs.com> wrote:
> >
> > > Hi Karl,
> > >
> > > I discovered a problematic behavior with the
> > > org.apache.manifoldcf.connectorcommon.fuzzyml.TagParseState class
> > > when crawling web pages. This behavior is a problem in particular
> > > for the form-based authentication scenario, as explained further
> > > down in my email.
> > >
> > > The org.apache.manifoldcf.connectorcommon.fuzzyml.HTMLParseState
> > > class, which is called by the TagParseState on each noteTag() or
> > > noteEndTag() call, uses the
> > > org.apache.manifoldcf.crawler.connectors.webcrawler.ScriptParseState
> > > class to detect whether the parsing process is inside or outside a
> > > 'script' tag, and then to decide whether or not to do something
> > > with the incoming data.
> > >
> > > The problem is that the TagParseState class is not aware of the type
> > > of tag currently being parsed, so it continues to analyze every char
> > > it encounters to detect tags, even if it is actually inside a script
> > > tag. So let's imagine you have a script tag built like this in a web
> > > page:
> > >
> > > <script>if(myvar <= 9) {.......}</script>
> > >
> > > When the TagParseState parses the char '<', it will consider that a
> > > new tag begins and lasts until it encounters a '>' char. So in the
> > > case above, the TagParseState will never catch the end of the script
> > > tag, and thus the scriptParseState variable in the ScriptParseState
> > > class will remain in the SCRIPTPARSESTATE_INSCRIPT state, and the
> > > rest of the web page will not be correctly handled by the other
> > > parsers.
> > >
> > > As a result, if you, for example, configure form authentication for
> > > your crawl and the form web page contains this kind of script tag
> > > before the form tag, the form will never be handled and the
> > > authentication will fail. This was the case I encountered, and I
> > > resolved it by forcing the scriptParseState to be
> > > SCRIPTPARSESTATE_NORMAL.
> > >
> > > I am having difficulty finding an elegant way to solve this issue,
> > > so I would gladly welcome your thoughts on that.
> > >
> > > To make this behavior easy to reproduce, just create an HTML file
> > > with the following content:
> > >
> > >
> > > <!doctype html><html lang="fr"><head><meta name="Viewport" content="width=device-width, height=device-height"/>
> > > <meta charset="utf-8"/><meta http-equiv="X-UA-Compatible" content="IE=edge" />
> > > <noscript><meta http-equiv="refresh" content="0; URL=error.jsp?errorMessage=error.JavaScriptDisabled"/></noscript>
> > > <link rel="shortcut icon" type="image/x-icon" href="/form/images/favicon.ico" />
> > > <link rel="stylesheet" href="/form/css/jQuery/ui-ilex-theme/jquery-ui-1.10.4.custom.min.css" type="text/css">
> > > <link rel="stylesheet" type="text/css" href="/form/css/bootstrap.min.css" />
> > > <link rel="stylesheet" type="text/css" href="/form/css/styles_sign_and_go.css" />
> > > <link rel="stylesheet" type="text/css" href="/form/css/styles_custom.css" />
> > > <script src="/form/js/jQuery/jquery.min.js"></script><script src="/form/js/bootstrap.min.js"></script>
> > > <script src="/form/js/authenticator.js"></script>
> > > <script>$(document).ready(function() {$("button, input[type='submit'], input[type='cancel'], input[type='button']").addClass("ui-button ui-widget ui-state-default ui-corner-all");});</script>
> > > <script src="/form/img_func.js"></script><!--[if lt IE 9]><script src="/form/js/ie_polyfills.js"></script><![endif]-->
> > > <script src="/form/js/custom.js"></script><title>Redirection to source URL </title>
> > > </head>
> > > <body onload='give_focus_and_verif_cookie_enabled()'><script>var retryCount=0;function getIEVersion() { var match = navigator.userAgent.match(/(?:MSIE |Trident\/.*; rv:)(\d+)/); return match ? parseInt(match[1]) : -1; }function
> > > give_focus(){if(retryCount>100){return;}var currentIEVersion = getIEVersion();if(currentIEVersion <= 9){var bFound = false;if(document.forms[0]!=null){for(i=0; i < document.forms[0].length;
> > > i++){retryCount = retryCount+1;try{if (document.forms[0][i].type != "hidden") { if (document.forms[0][i].disabled != true) { document.forms[0][i].focus(); var bFound = true; } } if (bFound == true) break; } catch(err) { setTimeout("give_focus()",1000); }
> > > }}}}function give_focus_and_verif_cookie_enabled(){give_focus();if(!navigator.cookieEnabled){
> > > window.location.href="error.jsp?errorMessage=error.CookieDisabled";}
> > > }< /script><div id="wrapper"><div id="header"><div class="container"><div class="logo"></div><h1>Authentication</h1><div class="changeLang"><a href="?displayLang=en-gb">EN</a> | <a href="?displayLang=fr-fr">FR</a></div></div></div>
> > >
> > > <form action='login.jsp' method='post' name='theform'>
> > >   <input type="hidden" name="csrfAuth" value="-aja2lwx5jf09">
> > >   <input type="hidden" name="sng-remember-me-fingerprint" id="sng-remember-me-fingerprint" value="null" >
> > >
> > > </form>
> > > <div id="content">
> > >   <div class="container">
> > >     <div id="contenu_specifique_application">
> > >       <div class="app msgLoading">
> > >         <div class="app-description" style="height:64px;">
> > >           <h3>You will be redirected within a few seconds.</h3>
> > >         </div>
> > >       </div>
> > >     </div>
> > >   </div>
> > > </div>
> > > <script>
> > > $(document).ready(function(){
> > >
> > >   document.getElementById('sng-remember-me-fingerprint').value = getStoreLocal('sng-remember-me-fingerprint');
> > >   document.theform.submit();
> > >   try {
> > >     history.replaceState(null, "", document.referrer );
> > >   } catch(err) {
> > >     // security error in edge
> > >   }
> > > });
> > > </script>
> > > </div></body></html>
> > >
> > >
> > > Regards
> > >
> > > --
> > > Julien MASSIERA
> > > Head of Product Development
> > > France Labs – The Search Experts
> > > www.francelabs.com