Hi Karl,

Thanks for your suggestion. It took me some time to think about it, but I 
think we have two different approaches to this case:
1. In your approach, if a source is problematic, that is the source's own 
problem, not the parser's or the connector's, so the latter should just 
discard the document.
2. In my approach, we start from the principle that in many situations 
(especially in web or enterprise scenarios), sources cannot be changed as we 
want, for instance because they belong to another party that has no interest 
in changing the code (think of any website that does not care who parses it), 
or because the software is no longer maintained (old versions of CMS systems, 
for instance).

The question then is: do we want to allow connectors to be modified so that 
they can handle special non-compliant cases (which is our case), or do we want 
connectors that strictly index only content that respects given 
specifications?
The solutions on the source side would be:
1. Use CDATA
2. Put the JavaScript code in its own file
3. Encode every problematic character in the JavaScript
Each solution requires modifying the source web page, which may be impossible 
or refused by the source owner, and the last one would make the JavaScript 
code less readable and harder for developers to understand.
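
To make options 1 and 3 concrete, here is a minimal sketch that uses the JDK's 
built-in XML parser as a stand-in for a strict parser. It is not ManifoldCF's 
fuzzyml parser, and the class and method names (ScriptEscapingDemo, 
parseScriptText) are made up for illustration. It only shows that an unescaped 
'<' is rejected, while the CDATA and entity-reference variants both give back 
the original script text.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical demo class, not part of ManifoldCF.
public class ScriptEscapingDemo {

  /**
   * Parses the input with a strict XML parser and returns the text of the
   * root element, or null if the input is not well-formed XML.
   */
  public static String parseScriptText(String xml) {
    try {
      DocumentBuilder builder =
          DocumentBuilderFactory.newInstance().newDocumentBuilder();
      Document doc = builder.parse(new InputSource(new StringReader(xml)));
      return doc.getDocumentElement().getTextContent();
    } catch (SAXException e) {
      // Not well-formed, e.g. a raw '<' inside the script body.
      return null;
    } catch (IOException | ParserConfigurationException e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    // Raw '<' in the script body: rejected by a strict parser.
    System.out.println(parseScriptText("<script>if(myvar <= 9) {}</script>"));
    // Option 1: wrap the script body in a CDATA section.
    System.out.println(
        parseScriptText("<script><![CDATA[if(myvar <= 9) {}]]></script>"));
    // Option 3: replace '<' with the entity reference.
    System.out.println(parseScriptText("<script>if(myvar &lt;= 9) {}</script>"));
  }
}
```

Both accepted variants still require editing the page itself, which is 
exactly the constraint in our case.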

So, to rephrase my question a bit, I would add the following to what I wrote 
in my first email:

Assuming that the source document in question MUST be parsed in order to 
perform the form-based authentication, and assuming that it cannot be modified 
and thus cannot comply with any of the recommendations listed above, what 
would be your recommended approach to modifying the connector so that it can 
optionally handle such cases, where we have spotted a given sequence of 
characters that causes a problem?
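
To illustrate the kind of optional behavior I have in mind, here is a rough, 
self-contained sketch. It is not ManifoldCF code: ScriptAwareScanner, 
visibleTags and the lenientScripts flag are hypothetical names, and the 
scanning is deliberately simplistic (it does not handle attributes containing 
'>', comments, or self-closing tags). The idea is only that, while inside a 
script element, a '<' is treated as plain text unless it starts the closing 
"</script" tag.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch, not the actual TagParseState/ScriptParseState logic.
public class ScriptAwareScanner {

  /**
   * Returns the names of the start tags seen outside of script elements.
   * With lenientScripts set, a '<' inside a script body only counts as a
   * tag start when it begins the closing "</script" tag.
   */
  public static List<String> visibleTags(String html, boolean lenientScripts) {
    List<String> tags = new ArrayList<>();
    boolean inScript = false;
    int i = 0;
    while (i < html.length()) {
      int open = html.indexOf('<', i);
      if (open < 0) {
        break;
      }
      if (inScript && lenientScripts
          && !html.regionMatches(true, open, "</script", 0, 8)) {
        i = open + 1; // plain script text such as "<=", keep scanning
        continue;
      }
      int close = html.indexOf('>', open);
      if (close < 0) {
        break;
      }
      String name = tagName(html.substring(open + 1, close));
      if (name.equals("/script")) {
        inScript = false;
      } else if (!inScript) {
        if (!name.startsWith("/")) {
          tags.add(name);
        }
        if (name.equals("script")) {
          inScript = true;
        }
      }
      i = close + 1;
    }
    return tags;
  }

  // First whitespace-delimited token of the tag body, lower-cased.
  private static String tagName(String body) {
    String trimmed = body.trim();
    int end = 0;
    while (end < trimmed.length()
        && !Character.isWhitespace(trimmed.charAt(end))) {
      end++;
    }
    return trimmed.substring(0, end).toLowerCase(Locale.ROOT);
  }

  public static void main(String[] args) {
    String html =
        "<script>if(myvar <= 9) {}</script><form action='login.jsp'></form>";
    System.out.println(visibleTags(html, false)); // strict scanning
    System.out.println(visibleTags(html, true));  // lenient script handling
  }
}
```

With the strict setting, the "<= 9) {}</script>" run is consumed as one bogus 
tag, the scanner never leaves the script state, and the form is missed. With 
the lenient setting, the form is seen. Whether such a switch belongs in the 
shared parser or only in the Web connector is part of what I am asking.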

Regards,
Julien

-----Original Message-----
From: Karl Wright <[email protected]> 
Sent: Thursday, September 5, 2019 18:30
To: dev <[email protected]>
Subject: Re: TagParseState behavior with Web connector

The parser requires that the document being parsed be valid XML.  Data within 
non-CDATA sections is *required* to use entity references to include < or > 
characters.  See:

https://stackoverflow.com/questions/330725/use-of-greater-than-symbol-in-xml


Karl


On Thu, Sep 5, 2019 at 12:10 PM Julien Massiera < 
[email protected]> wrote:

> Hi Karl,
>
> I discovered a problematic behavior with the 
> org.apache.manifoldcf.connectorcommon.fuzzyml.TagParseState class when 
> crawling web pages. This behavior poses a problem in particular for the 
> scenario of form-based authentication, as explained further in my email.
>
> The org.apache.manifoldcf.connectorcommon.fuzzyml.HTMLParseState class, 
> which is called by TagParseState on each noteTag() or noteEndTag() 
> call, uses the 
> org.apache.manifoldcf.crawler.connectors.webcrawler.ScriptParseState
> class to detect whether the parsing process is inside or outside a 
> 'script' tag and to decide whether or not to process the incoming data.
>
> The problem is that the TagParseState class is not aware of the type 
> of tag currently parsed, so it continues to analyze any char 
> encountered to detect tags even if it is actually parsing a script tag.
> So let's imagine you have a script tag built like this in a web page:
>
> <script>if(myvar <= 9) {.......}</script>
>
> When the TagParseState parses the char '<' it will consider that a new 
> tag begins until it encounters a '>' char. So in the case above, the 
> TagParseState will never catch the end of the script tag, and thus, 
> the scriptParseState variable in the ScriptParseState class will 
> remain in the SCRIPTPARSESTATE_INSCRIPT state and the rest of the web 
> page will not be correctly handled by the other parsers.
>
> As a result, if you configure, for example, form authentication for 
> your crawl and the form web page contains this kind of script tag 
> before the form tag, the form will never be handled and the 
> authentication will fail. This was the case I encountered, and I 
> resolved it by forcing the scriptParseState to be SCRIPTPARSESTATE_NORMAL.
>
> I am having difficulty finding an elegant way to solve this issue, so I 
> would gladly welcome your thoughts on it.
>
> To reproduce this behavior, just create an HTML file with the 
> following content:
>
>
> <!doctype html><html lang="fr"><head><meta name="Viewport"
> content="width=device-width, height=device-height"/><meta 
> charset="utf-8"/><meta http-equiv="X-UA-Compatible" content="IE=edge"
> /><noscript><meta http-equiv="refresh" content="0; 
> URL=error.jsp?errorMessage=error.JavaScriptDisabled"/></noscript><link
> rel="shortcut icon" type="image/x-icon" href="/form/images/favicon.ico"
> /><link rel="stylesheet"
> href="/form/css/jQuery/ui-ilex-theme/jquery-ui-1.10.4.custom.min.css"
> type="text/css"><link rel="stylesheet" type="text/css"
> href="/form/css/bootstrap.min.css" /><link rel="stylesheet" type="text/css"
> href="/form/css/styles_sign_and_go.css" /><link rel="stylesheet"
> type="text/css" href="/form/css/styles_custom.css" /><script 
> src="/form/js/jQuery/jquery.min.js"></script><script
> src="/form/js/bootstrap.min.js"></script><script
> src="/form/js/authenticator.js"></script><script>$(document).ready(fun
> ction() {$("button, input[type='submit'], input[type='cancel'], 
> input[type='button']").addClass("ui-button ui-widget ui-state-default 
> ui-corner-all");});</script><script
> src="/form/img_func.js"></script><!--[if lt IE 9]><script 
> src="/form/js/ie_polyfills.js"></script><![endif]--><script
> src="/form/js/custom.js"></script><title>Redirection to source URL 
> </title>
>                         </head>
>                         <body
> onload='give_focus_and_verif_cookie_enabled()'><script>var
> retryCount=0;function getIEVersion() { var match = 
> navigator.userAgent.match(/(?:MSIE |Trident\/.*; rv:)(\d+)/); return 
> match ? parseInt(match[1]) : -1; }function 
> give_focus(){if(retryCount>100){return;}var currentIEVersion = 
> getIEVersion();if(currentIEVersion <= 9){var bFound = 
> false;if(document.forms[0]!=null){for(i=0; i < 
> document.forms[0].length;
> i++){retryCount = retryCount+1;try{if (document.forms[0][i].type !=
> "hidden") { if (document.forms[0][i].disabled != true) {
>  document.forms[0][i].focus();     var bFound = true;  } } if (bFound ==
> true)   break; } catch(err) { setTimeout("give_focus()",1000); }
> }}}}function
> give_focus_and_verif_cookie_enabled(){give_focus();if(!navigator.cooki
> eEnabled){  
> window.location.href="error.jsp?errorMessage=error.CookieDisabled";}}<
> /script><div id="wrapper"><div id="header"><div class="container"><div 
> class="logo"></div><h1>Authentication</h1><div class="changeLang"><a 
> href="?displayLang=en-gb">EN</a> | <a 
> href="?displayLang=fr-fr">FR</a></div></div></div>
>
>                         <form action='login.jsp' method='post'
> name='theform'>
>                         <input type="hidden" name="csrfAuth"
> value="-aja2lwx5jf09">
>                         <input type="hidden"
> name="sng-remember-me-fingerprint" id="sng-remember-me-fingerprint"
> value="null" >
>
>
>                         </form>
>                     <div id="content">
>                       <div class="container">
>                         <div id="contenu_specifique_application" >
>                                  <div class="app msgLoading">
>                                    <div class="app-description"
> style="height:64px;">
>                         <h3>You will be redirected within a few 
> seconds.</h3>
>                         </div>
>                         </div>
>                         </div>
>                         </div>
>                         </div>
>        <script>
>        $(document).ready(function(){
>          document.getElementById('sng-remember-me-fingerprint').value 
> = getStoreLocal('sng-remember-me-fingerprint');
>          document.theform.submit();
>          try {
>              history.replaceState(null, "", document.referrer );
>          } catch(err) {
>            // security error in edge
>          }
>        });
>        </script>
>                         </div></body></html>
>
>
>
>
> Regards
>
> --
> Julien MASSIERA
> Product Development Director
> France Labs – Les experts du Search
> www.francelabs.com
>
>
