The parser requires that the document being parsed be valid XML.  Data
within non-CDATA sections is *required* to use entity references to include
< or > characters.  See:

https://stackoverflow.com/questions/330725/use-of-greater-than-symbol-in-xml
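
As a rough illustration of the well-formedness point, here is a minimal
sketch that uses the JDK's built-in XML parser as a stand-in. It is an
assumption for demonstration only and is not the ManifoldCF fuzzyml parser;
the class name XmlEscapeSketch and the helper isWellFormed are hypothetical.
It checks the script snippet from the quoted message in three forms: raw,
with an entity reference, and wrapped in a CDATA section.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper for illustration; uses the JDK XML parser, not ManifoldCF.
class XmlEscapeSketch {

    // Returns true if the JDK parser accepts the string as well-formed XML.
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            // Raised for well-formedness errors such as a raw '<' in text.
            return false;
        } catch (Exception e) {
            // Parser configuration or I/O problems are not expected here.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String raw = "<script>if(myvar <= 9) {.......}</script>";
        String entity = "<script>if(myvar &lt;= 9) {.......}</script>";
        String cdata = "<script><![CDATA[if(myvar <= 9) {.......}]]></script>";
        System.out.println("raw:    " + isWellFormed(raw));
        System.out.println("entity: " + isWellFormed(entity));
        System.out.println("cdata:  " + isWellFormed(cdata));
    }
}
```

Under these assumptions, only the raw form is rejected; the entity-reference
and CDATA forms are accepted.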


Karl


On Thu, Sep 5, 2019 at 12:10 PM Julien Massiera <
julien.massi...@francelabs.com> wrote:

> Hi Karl,
>
> I discovered a problematic behavior with the
> org.apache.manifoldcf.connectorcommon.fuzzyml.TagParseState class when
> crawling web pages. This behavior is a problem in particular for the
> scenario of form-based authentication, as explained later in this email.
>
> The org.apache.manifoldcf.connectorcommon.fuzzyml.HTMLParseState class,
> which TagParseState calls on each noteTag() or noteEndTag()
> invocation, uses the
> org.apache.manifoldcf.crawler.connectors.webcrawler.ScriptParseState
> class to detect whether parsing is inside or outside a 'script' tag,
> and decides accordingly whether to process the incoming data.
>
> The problem is that the TagParseState class is not aware of the type of
> tag currently parsed, so it continues to analyze any char encountered to
> detect tags even if it is actually parsing a script tag.
> So let's imagine you have a script tag built like this in a web page:
>
> <script>if(myvar <= 9) {.......}</script>
>
> When the TagParseState parses the char '<', it assumes that a new tag
> has begun and keeps reading until it encounters a '>' char. So in the
> case above, the TagParseState will never catch the end of the script
> tag, and thus the scriptParseState variable in the ScriptParseState
> class will remain in the SCRIPTPARSESTATE_INSCRIPT state and the rest of
> the web page will not be correctly handled by the other parsers.
>
> As a result, if you, for example, configure form authentication for
> your crawl and the form web page contains this kind of script tag
> before the form tag, the form will never be handled and the
> authentication will fail. This was the case I encountered, and I
> worked around it by forcing the scriptParseState to be
> SCRIPTPARSESTATE_NORMAL.
>
> I am having difficulty finding an elegant way to solve this issue, so I
> would gladly welcome your thoughts on it.
>
> To reproduce this behavior, just create an HTML file with the
> following content:
>
>
> <!doctype html><html lang="fr"><head><meta name="Viewport"
> content="width=device-width, height=device-height"/><meta
> charset="utf-8"/><meta http-equiv="X-UA-Compatible" content="IE=edge"
> /><noscript><meta http-equiv="refresh" content="0;
> URL=error.jsp?errorMessage=error.JavaScriptDisabled"/></noscript><link
> rel="shortcut icon" type="image/x-icon" href="/form/images/favicon.ico"
> /><link rel="stylesheet"
> href="/form/css/jQuery/ui-ilex-theme/jquery-ui-1.10.4.custom.min.css"
> type="text/css"><link rel="stylesheet" type="text/css"
> href="/form/css/bootstrap.min.css" /><link rel="stylesheet" type="text/css"
> href="/form/css/styles_sign_and_go.css" /><link rel="stylesheet"
> type="text/css" href="/form/css/styles_custom.css" /><script
> src="/form/js/jQuery/jquery.min.js"></script><script
> src="/form/js/bootstrap.min.js"></script><script
> src="/form/js/authenticator.js"></script><script>$(document).ready(function()
> {$("button, input[type='submit'], input[type='cancel'],
> input[type='button']").addClass("ui-button ui-widget ui-state-default
> ui-corner-all");});</script><script
> src="/form/img_func.js"></script><!--[if lt IE 9]><script
> src="/form/js/ie_polyfills.js"></script><![endif]--><script
> src="/form/js/custom.js"></script><title>Redirection to source URL
> </title>
>                         </head>
>                         <body
> onload='give_focus_and_verif_cookie_enabled()'><script>var
> retryCount=0;function getIEVersion() { var match =
> navigator.userAgent.match(/(?:MSIE |Trident\/.*; rv:)(\d+)/); return match
> ? parseInt(match[1]) : -1; }function
> give_focus(){if(retryCount>100){return;}var currentIEVersion =
> getIEVersion();if(currentIEVersion <= 9){var bFound =
> false;if(document.forms[0]!=null){for(i=0; i < document.forms[0].length;
> i++){retryCount = retryCount+1;try{if (document.forms[0][i].type !=
> "hidden") { if (document.forms[0][i].disabled != true) {
>  document.forms[0][i].focus();     var bFound = true;  } } if (bFound ==
> true)   break; } catch(err) { setTimeout("give_focus()",1000); }
> }}}}function
> give_focus_and_verif_cookie_enabled(){give_focus();if(!navigator.cookieEnabled){
>  
> window.location.href="error.jsp?errorMessage=error.CookieDisabled";}}</script><div
> id="wrapper"><div id="header"><div class="container"><div
> class="logo"></div><h1>Authentication</h1><div class="changeLang"><a
> href="?displayLang=en-gb">EN</a> | <a
> href="?displayLang=fr-fr">FR</a></div></div></div>
>
>                         <form action='login.jsp' method='post'
> name='theform'>
>                         <input type="hidden" name="csrfAuth"
> value="-aja2lwx5jf09">
>                         <input type="hidden"
> name="sng-remember-me-fingerprint" id="sng-remember-me-fingerprint"
> value="null" >
>
>
>                         </form>
>                     <div id="content">
>                       <div class="container">
>                         <div id="contenu_specifique_application" >
>                                  <div class="app msgLoading">
>                                    <div class="app-description"
> style="height:64px;">
>                         <h3>You will be redirected within a few
> seconds.</h3>
>                         </div>
>                         </div>
>                         </div>
>                         </div>
>                         </div>
>        <script>
>        $(document).ready(function(){
>          document.getElementById('sng-remember-me-fingerprint').value =
> getStoreLocal('sng-remember-me-fingerprint');
>          document.theform.submit();
>          try {
>              history.replaceState(null, "", document.referrer );
>          } catch(err) {
>            // security error in edge
>          }
>        });
>        </script>
>                         </div></body></html>
>
>
>
>
> Regards
>
> --
> Julien MASSIERA
> Directeur développement produit
> France Labs – Les experts du Search
> www.francelabs.com
>
>
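
The tag-scanning behavior described in the quoted message can be sketched
with a toy scanner. This is a hypothetical stand-in written for
illustration, not the actual TagParseState or ScriptParseState code; the
class name TagScanSketch and both methods are assumptions. The naive
variant treats every '<' as the start of a tag that runs to the next '>';
the script-aware variant skips raw text after an opening script tag until
it sees the literal "</script".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical stand-in for illustration; this is NOT the ManifoldCF code.
class TagScanSketch {

    // Naive scan: every '<' opens a tag that runs until the next '>'.
    static List<String> naiveTags(String html) {
        List<String> tags = new ArrayList<>();
        int pos = 0;
        while (true) {
            int open = html.indexOf('<', pos);
            if (open < 0) break;
            int close = html.indexOf('>', open + 1);
            if (close < 0) break;
            tags.add(html.substring(open + 1, close).trim());
            pos = close + 1;
        }
        return tags;
    }

    // Script-aware scan: after an opening script tag, jump straight to the
    // literal "</script" so that '<' inside the script body is not a tag.
    static List<String> scriptAwareTags(String html) {
        List<String> tags = new ArrayList<>();
        // Assumes lower-casing keeps the string length (true for ASCII input).
        String lower = html.toLowerCase(Locale.ROOT);
        int pos = 0;
        while (true) {
            int open = html.indexOf('<', pos);
            if (open < 0) break;
            int close = html.indexOf('>', open + 1);
            if (close < 0) break;
            String tag = html.substring(open + 1, close).trim();
            tags.add(tag);
            pos = close + 1;
            String lowerTag = tag.toLowerCase(Locale.ROOT);
            if (lowerTag.equals("script") || lowerTag.startsWith("script ")) {
                int end = lower.indexOf("</script", pos);
                if (end < 0) break; // unterminated script: stop scanning
                pos = end;
            }
        }
        return tags;
    }

    public static void main(String[] args) {
        String page = "<script>if(myvar <= 9) {.......}</script>"
                    + "<form action='login.jsp'></form>";
        System.out.println("naive:        " + naiveTags(page));
        System.out.println("script-aware: " + scriptAwareTags(page));
    }
}
```

In this sketch the naive variant never reports a "/script" end tag for the
sample input, because the '<' of "<=" swallows everything up to the '>' of
the closing tag. Any state machine waiting for that end tag would stay in
its in-script state, which matches the symptom described above.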
