Hi Karl,
I discovered a problematic behavior in the
org.apache.manifoldcf.connectorcommon.fuzzyml.TagParseState class when
crawling web pages. It is a particular problem for form-based
authentication, as explained later in this email.
The org.apache.manifoldcf.connectorcommon.fuzzyml.HTMLParseState class,
which TagParseState calls on every noteTag() or noteEndTag(), uses the
org.apache.manifoldcf.crawler.connectors.webcrawler.ScriptParseState
class to detect whether parsing is currently inside or outside a
'script' tag, and then decides whether to process the incoming data.
The problem is that the TagParseState class is not aware of the type of
tag currently being parsed, so it keeps analyzing every character to
detect tags even while it is inside a script tag.
So let's imagine you have a script tag built like this in a web page:
<script>if(myvar <= 9) {.......}</script>
When TagParseState reaches the '<' character of the '<=' operator, it
assumes that a new tag has begun and keeps reading until it encounters
a '>' character. In the case above, that '>' is the one that closes
</script>, so TagParseState never catches the end of the script tag.
The scriptParseState variable in the ScriptParseState class therefore
remains in the SCRIPTPARSESTATE_INSCRIPT state, and the rest of the web
page is not correctly handled by the other parsers.
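To make the failure mode concrete, here is a minimal stand-alone sketch.
It is NOT the ManifoldCF code: NaiveTagScanDemo and scanTags are
hypothetical names, and the scanner only mimics the behavior described
above (every '<' opens a tag that runs to the next '>').

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class NaiveTagScanDemo {
    // Hypothetical toy scanner (an assumption for illustration, not
    // ManifoldCF code): every '<' starts a tag that ends at the next '>'.
    static List<String> scanTags(String html) {
        List<String> tags = new ArrayList<>();
        int i = 0;
        while (i < html.length()) {
            int open = html.indexOf('<', i);
            if (open < 0) break;
            int close = html.indexOf('>', open + 1);
            if (close < 0) break;
            String body = html.substring(open + 1, close).trim();
            // The tag "name" is the first whitespace-delimited token.
            String name = body.split("\\s+", 2)[0].toLowerCase(Locale.ROOT);
            tags.add(name);
            i = close + 1;
        }
        return tags;
    }

    public static void main(String[] args) {
        String page = "<script>if(myvar <= 9) {.......}</script>"
                + "<form action='login.jsp'></form>";
        // prints [script, =, form, /form]: the "<=" is taken for a tag
        // and "/script" is never reported.
        System.out.println(scanTags(page));
    }
}
```

Because "/script" is never reported, anything that tracks an "in
script" flag from these events stays in the script state for the rest
of the page.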
As a result, if you configure form authentication for your crawl and
the login page contains this kind of script tag before the form tag,
the form is never handled and the authentication fails. This is the
case I encountered, and I worked around it by forcing scriptParseState
to SCRIPTPARSESTATE_NORMAL.
I am having difficulty finding an elegant way to solve this issue, so I
would gladly welcome your thoughts on it.
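One possible direction, purely as a sketch and not a patch: treat the
content of a script tag as raw text and, once an opening script tag has
been seen, only look for the next "</script" (case-insensitive) before
resuming normal tag detection. The class and method names below
(ScriptAwareTagScanDemo, scanTags, indexOfIgnoreCase) are hypothetical
and the code is independent of the ManifoldCF classes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class ScriptAwareTagScanDemo {
    // Case-insensitive indexOf built on String.regionMatches.
    static int indexOfIgnoreCase(String s, String needle, int from) {
        for (int k = from; k + needle.length() <= s.length(); k++) {
            if (s.regionMatches(true, k, needle, 0, needle.length())) {
                return k;
            }
        }
        return -1;
    }

    // Hypothetical toy scanner (an assumption for illustration): same
    // '<' ... '>' logic as a naive scanner, except that script content
    // is skipped as raw text up to the next "</script".
    static List<String> scanTags(String html) {
        List<String> tags = new ArrayList<>();
        int i = 0;
        while (i < html.length()) {
            int open = html.indexOf('<', i);
            if (open < 0) break;
            int close = html.indexOf('>', open + 1);
            if (close < 0) break;
            String body = html.substring(open + 1, close).trim();
            String name = body.split("\\s+", 2)[0].toLowerCase(Locale.ROOT);
            tags.add(name);
            i = close + 1;
            if (name.equals("script")) {
                // Inside a script: ignore every '<' that is not the
                // start of the closing script tag.
                int end = indexOfIgnoreCase(html, "</script", i);
                if (end < 0) break; // unterminated script, stop scanning
                i = end;
            }
        }
        return tags;
    }

    public static void main(String[] args) {
        String page = "<script>if(myvar <= 9) {.......}</script>"
                + "<form action='login.jsp'></form>";
        // prints [script, /script, form, /form]
        System.out.println(scanTags(page));
    }
}
```

With this approach the '<=' inside the script is never examined, so
the closing script tag and the form that follows are both reported.
Whether the same idea fits cleanly into TagParseState, which does not
currently know the tag type, is exactly the open question.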
To reproduce this behavior easily, just create an HTML file with the
following content:
<!doctype html><html lang="fr"><head><meta name="Viewport" content="width=device-width, height=device-height"/><meta charset="utf-8"/><meta http-equiv="X-UA-Compatible" content="IE=edge" /><noscript><meta http-equiv="refresh" content="0; URL=error.jsp?errorMessage=error.JavaScriptDisabled"/></noscript><link rel="shortcut icon" type="image/x-icon" href="/form/images/favicon.ico" /><link
rel="stylesheet" href="/form/css/jQuery/ui-ilex-theme/jquery-ui-1.10.4.custom.min.css" type="text/css"><link rel="stylesheet" type="text/css" href="/form/css/bootstrap.min.css" /><link rel="stylesheet" type="text/css" href="/form/css/styles_sign_and_go.css" /><link rel="stylesheet" type="text/css" href="/form/css/styles_custom.css" /><script src="/form/js/jQuery/jquery.min.js"></script><script
src="/form/js/bootstrap.min.js"></script><script src="/form/js/authenticator.js"></script><script>$(document).ready(function() {$("button, input[type='submit'], input[type='cancel'], input[type='button']").addClass("ui-button ui-widget ui-state-default ui-corner-all");});</script><script src="/form/img_func.js"></script><!--[if lt IE 9]><script src="/form/js/ie_polyfills.js"></script><![endif]--><script
src="/form/js/custom.js"></script><title>Redirection to source URL </title>
</head>
<body onload='give_focus_and_verif_cookie_enabled()'><script>var retryCount=0;function getIEVersion() { var match = navigator.userAgent.match(/(?:MSIE |Trident\/.*; rv:)(\d+)/); return match ? parseInt(match[1]) : -1; }function
give_focus(){if(retryCount>100){return;}var currentIEVersion = getIEVersion();if(currentIEVersion <= 9){var bFound = false;if(document.forms[0]!=null){for(i=0; i < document.forms[0].length; i++){retryCount = retryCount+1;try{if (document.forms[0][i].type != "hidden") { if
(document.forms[0][i].disabled != true) { document.forms[0][i].focus(); var bFound = true; } } if (bFound == true) break; } catch(err) { setTimeout("give_focus()",1000); } }}}}function give_focus_and_verif_cookie_enabled(){give_focus();if(!navigator.cookieEnabled){
window.location.href="error.jsp?errorMessage=error.CookieDisabled";}}</script><div id="wrapper"><div id="header"><div class="container"><div class="logo"></div><h1>Authentication</h1><div
class="changeLang"><a href="?displayLang=en-gb">EN</a> | <a href="?displayLang=fr-fr">FR</a></div></div></div>
<form action='login.jsp' method='post' name='theform'>
<input type="hidden" name="csrfAuth"
value="-aja2lwx5jf09">
<input type="hidden" name="sng-remember-me-fingerprint"
id="sng-remember-me-fingerprint" value="null" >
</form>
<div id="content">
<div class="container">
<div id="contenu_specifique_application" >
<div class="app msgLoading">
<div class="app-description"
style="height:64px;">
<h3>You will be redirected within a few seconds.</h3>
</div>
</div>
</div>
</div>
</div>
<script>
$(document).ready(function(){
document.getElementById('sng-remember-me-fingerprint').value =
getStoreLocal('sng-remember-me-fingerprint');
document.theform.submit();
try {
history.replaceState(null, "", document.referrer );
} catch(err) {
// security error in edge
}
});
</script>
</div></body></html>
Regards
--
Julien MASSIERA
Product Development Director
France Labs – Les experts du Search
www.francelabs.com