The parser requires that the document being parsed be valid XML. Data within non-CDATA sections is *required* to use entity references to include < or > characters. See:
https://stackoverflow.com/questions/330725/use-of-greater-than-symbol-in-xml

Karl

On Thu, Sep 5, 2019 at 12:10 PM Julien Massiera <julien.massi...@francelabs.com> wrote:

> Hi Karl,
>
> I discovered a problematic behavior with the
> org.apache.manifoldcf.connectorcommon.fuzzyml.TagParseState class when
> crawling web pages. This behavior is a problem in particular for
> form-based authentication, as explained further in my email.
>
> The org.apache.manifoldcf.connectorcommon.fuzzyml.HTMLParseState class,
> which is called by the TagParseState on each noteTag() or noteEndTag()
> call, uses the
> org.apache.manifoldcf.crawler.connectors.webcrawler.ScriptParseState
> class to detect whether the parsing process is inside or outside a
> 'script' tag and then decide whether to act on the incoming data.
>
> The problem is that the TagParseState class is not aware of the type of
> tag currently being parsed, so it continues to analyze every char it
> encounters to detect tags, even while it is parsing a script tag.
> So let's imagine you have a script tag built like this in a web page:
>
> <script>if(myvar <= 9) {.......}</script>
>
> When the TagParseState parses the char '<', it considers that a new
> tag begins and continues until it encounters a '>' char. So in the
> case above, the TagParseState will never catch the end of the script
> tag, and thus the scriptParseState variable in the ScriptParseState
> class will remain in the SCRIPTPARSESTATE_INSCRIPT state and the rest
> of the web page will not be correctly handled by the other parsers.
>
> As a result, if you, for example, configure form authentication for
> your crawl and the form web page contains this kind of script tag
> before the form tag, the form will never be handled and the
> authentication will fail. This was the case I encountered, and I
> resolved it by forcing the scriptParseState to be SCRIPTPARSESTATE_NORMAL.
>
> I am having difficulty finding an elegant way to solve this issue, so I
> would gladly welcome your thoughts on that.
>
> To make this behavior easy to reproduce, just create an HTML file
> with the following content:
>
>
> <!doctype html><html lang="fr"><head><meta name="Viewport"
> content="width=device-width, height=device-height"/><meta
> charset="utf-8"/><meta http-equiv="X-UA-Compatible" content="IE=edge"
> /><noscript><meta http-equiv="refresh" content="0;
> URL=error.jsp?errorMessage=error.JavaScriptDisabled"/></noscript><link
> rel="shortcut icon" type="image/x-icon" href="/form/images/favicon.ico"
> /><link rel="stylesheet"
> href="/form/css/jQuery/ui-ilex-theme/jquery-ui-1.10.4.custom.min.css"
> type="text/css"><link rel="stylesheet" type="text/css"
> href="/form/css/bootstrap.min.css" /><link rel="stylesheet" type="text/css"
> href="/form/css/styles_sign_and_go.css" /><link rel="stylesheet"
> type="text/css" href="/form/css/styles_custom.css" /><script
> src="/form/js/jQuery/jquery.min.js"></script><script
> src="/form/js/bootstrap.min.js"></script><script
> src="/form/js/authenticator.js"></script><script>$(document).ready(function()
> {$("button, input[type='submit'], input[type='cancel'],
> input[type='button']").addClass("ui-button ui-widget ui-state-default
> ui-corner-all");});</script><script
> src="/form/img_func.js"></script><!--[if lt IE 9]><script
> src="/form/js/ie_polyfills.js"></script><![endif]--><script
> src="/form/js/custom.js"></script><title>Redirection to source URL
> </title>
> </head>
> <body
> onload='give_focus_and_verif_cookie_enabled()'><script>var
> retryCount=0;function getIEVersion() { var match =
> navigator.userAgent.match(/(?:MSIE |Trident\/.*; rv:)(\d+)/); return match
> ? parseInt(match[1]) : -1; }function
> give_focus(){if(retryCount>100){return;}var currentIEVersion =
> getIEVersion();if(currentIEVersion <= 9){var bFound =
> false;if(document.forms[0]!=null){for(i=0; i < document.forms[0].length;
> i++){retryCount = retryCount+1;try{if (document.forms[0][i].type !=
> "hidden") { if (document.forms[0][i].disabled != true) {
> document.forms[0][i].focus(); var bFound = true; } } if (bFound ==
> true) break; } catch(err) { setTimeout("give_focus()",1000); }
> }}}}function
> give_focus_and_verif_cookie_enabled(){give_focus();if(!navigator.cookieEnabled){
>
> window.location.href="error.jsp?errorMessage=error.CookieDisabled";}}</script><div
> id="wrapper"><div id="header"><div class="container"><div
> class="logo"></div><h1>Authentication</h1><div class="changeLang"><a
> href="?displayLang=en-gb">EN</a> | <a
> href="?displayLang=fr-fr">FR</a></div></div></div>
>
> <form action='login.jsp' method='post'
> name='theform'>
> <input type="hidden" name="csrfAuth"
> value="-aja2lwx5jf09">
> <input type="hidden"
> name="sng-remember-me-fingerprint" id="sng-remember-me-fingerprint"
> value="null" >
>
>
> </form>
> <div id="content">
> <div class="container">
> <div id="contenu_specifique_application" >
> <div class="app msgLoading">
> <div class="app-description"
> style="height:64px;">
> <h3>You will be redirected within a few
> seconds.</h3>
> </div>
> </div>
> </div>
> </div>
> </div>
> <script>
> $(document).ready(function(){
> document.getElementById('sng-remember-me-fingerprint').value =
> getStoreLocal('sng-remember-me-fingerprint');
> document.theform.submit();
> try {
> history.replaceState(null, "", document.referrer );
> } catch(err) {
> // security error in edge
> }
> });
> </script>
> </div></body></html>
>
>
>
>
> Regards
>
> --
> Julien MASSIERA
> Product Development Director
> France Labs – The Search Experts
> www.francelabs.com
>
>
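
To make the behavior described in the quoted message concrete, here is a minimal Java sketch. It is not the actual TagParseState or ScriptParseState code; the class TagScanSketch and its methods scanTags, isScriptOpen, and indexOfIgnoreCase are hypothetical names invented for this illustration. It assumes a deliberately simplified scanner that treats everything between a '<' and the next '>' as one tag, and compares that with a variant that, after an opening script tag, skips ahead to the next "</script".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration only; this is NOT ManifoldCF's parser.
class TagScanSketch {

    // Returns the raw text found between each '<' and the next '>'.
    // When scriptAware is true, the body of a script element is skipped
    // so that a '<' inside the script is not mistaken for a tag start.
    static List<String> scanTags(String html, boolean scriptAware) {
        List<String> tags = new ArrayList<>();
        int pos = 0;
        while (pos < html.length()) {
            int open = html.indexOf('<', pos);
            if (open < 0) break;
            int close = html.indexOf('>', open + 1);
            if (close < 0) break;
            String tag = html.substring(open + 1, close);
            tags.add(tag);
            pos = close + 1;
            if (scriptAware && isScriptOpen(tag)) {
                // Jump straight to the closing script tag, ignoring the body.
                int end = indexOfIgnoreCase(html, "</script", pos);
                if (end < 0) break;
                pos = end;
            }
        }
        return tags;
    }

    // True if the raw tag text is an opening script tag (attributes allowed).
    static boolean isScriptOpen(String tag) {
        String name = tag.trim().toLowerCase(Locale.ROOT).split("\\s+", 2)[0];
        return name.equals("script");
    }

    // Case-insensitive indexOf, starting the search at index 'from'.
    static int indexOfIgnoreCase(String s, String needle, int from) {
        for (int i = from; i + needle.length() <= s.length(); i++) {
            if (s.regionMatches(true, i, needle, 0, needle.length())) return i;
        }
        return -1;
    }

    public static void main(String[] args) {
        String page = "<script>if(myvar <= 9) {.......}</script><form>";
        // Naive scan: the '<' of "<=" swallows the closing script tag.
        System.out.println(scanTags(page, false));
        // prints [script, = 9) {.......}</script, form]
        // Script-aware scan: the closing script tag is reported.
        System.out.println(scanTags(page, true));
        // prints [script, /script, form]
    }
}
```

In the naive scan no "/script" tag is ever reported, which matches the symptom above where the state stays at SCRIPTPARSESTATE_INSCRIPT. The script-aware variant is only one possible direction; the alternative stated in the reply is for the page itself to use entity references or a CDATA section.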