[ 
https://issues.apache.org/jira/browse/CONNECTORS-1623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16936866#comment-16936866
 ] 

Karl Wright commented on CONNECTORS-1623:
-----------------------------------------

I put together a fix but need to verify it.


> Script tags not ignored
> -----------------------
>
>                 Key: CONNECTORS-1623
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1623
>             Project: ManifoldCF
>          Issue Type: Bug
>          Components: Web connector
>    Affects Versions: ManifoldCF 2.13
>            Reporter: Julien Massiera
>            Assignee: Karl Wright
>            Priority: Critical
>             Fix For: ManifoldCF 2.14
>
>
> I discovered a problematic behavior in the 
> org.apache.manifoldcf.connectorcommon.fuzzyml.TagParseState class when 
> crawling web pages. It is a particular problem for form-based 
> authentication, as explained below. 
>  The org.apache.manifoldcf.connectorcommon.fuzzyml.HTMLParseState class, 
> which is called by TagParseState on each noteTag() or noteEndTag() call, 
> uses the org.apache.manifoldcf.crawler.connectors.webcrawler.ScriptParseState 
> class to detect whether parsing is inside or outside a 'script' tag and 
> to decide whether to act on the incoming data. The problem is that 
> TagParseState is not aware of the type of tag currently being parsed, so 
> it keeps analyzing every character to detect tags even while it is inside 
> a script tag. 
> For example, imagine a web page that contains a script tag like this: 
> {code:java}
> <script>if(myvar <= 9) {.......}</script>
> {code}
> When TagParseState reads the '<' character, it assumes that a new tag has 
> begun and treats everything up to the next '>' as part of that tag. In the 
> example above, the '<' of "<=" opens a bogus tag whose closing '>' is the 
> one belonging to "</script>", so TagParseState never catches the end of 
> the script tag. The scriptParseState variable in the ScriptParseState 
> class therefore remains in the SCRIPTPARSESTATE_INSCRIPT state, and the 
> rest of the web page is not handled correctly by the other parsers. 
>  As a result, if you configure form authentication for a crawl and the 
> page containing the form has this kind of script tag before the form tag, 
> the form is never handled and authentication fails. This is the case I 
> encountered, and I worked around it by forcing scriptParseState to 
> SCRIPTPARSESTATE_NORMAL.
> ref: 
> [http://mail-archives.apache.org/mod_mbox/manifoldcf-dev/201909.mbox/%3CCALUFAGA7eXi_gNBqWv2PRt2FaXuuKW5rTwLiXfceTkUAQfBvVg%40mail.gmail.com%3E]
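
For anyone following along, the failure described above can be reproduced with a small standalone sketch. This is not ManifoldCF code and not the fix mentioned in the comment: the class ScriptAwareScanner, the method tagsOutsideScript() and the scriptAware flag are all hypothetical names made up for illustration. With scriptAware=false, every '<' opens a tag that runs to the next '>', as the description says TagParseState does. With scriptAware=true, the scanner skips script content until it sees "</script", which is one possible way to avoid the problem.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical toy scanner for illustration; NOT ManifoldCF's TagParseState. */
public class ScriptAwareScanner {

  /**
   * Returns the tags seen while the scanner believes it is outside a script
   * element (end tags are reported with a leading "/").
   * scriptAware=false: every '<' starts a tag that runs to the next '>'.
   * scriptAware=true: after a script start tag, text is skipped up to "</script".
   */
  public static List<String> tagsOutsideScript(String html, boolean scriptAware) {
    List<String> visible = new ArrayList<>();
    boolean inScript = false;
    int i = 0;
    while (i < html.length()) {
      int open = html.indexOf('<', i);
      if (open < 0) break;
      int close = html.indexOf('>', open);
      if (close < 0) break;
      String name = tagName(html.substring(open + 1, close));
      i = close + 1;
      if (name.equals("script")) {
        inScript = true;
        if (scriptAware) {
          // Treat script content as raw text: jump straight to the end tag.
          int end = indexOfIgnoreCase(html, "</script", i);
          if (end < 0) break;
          i = end;
        }
      } else if (name.equals("/script")) {
        inScript = false;
      } else if (!inScript) {
        visible.add(name);
      }
    }
    return visible;
  }

  /** First whitespace-delimited token of the tag body, lower-cased. */
  private static String tagName(String body) {
    String trimmed = body.trim();
    int end = 0;
    while (end < trimmed.length() && !Character.isWhitespace(trimmed.charAt(end))) {
      end++;
    }
    return trimmed.substring(0, end).toLowerCase(Locale.ROOT);
  }

  /** Case-insensitive indexOf, so "</SCRIPT" is also recognized. */
  private static int indexOfIgnoreCase(String s, String needle, int from) {
    for (int k = from; k + needle.length() <= s.length(); k++) {
      if (s.regionMatches(true, k, needle, 0, needle.length())) return k;
    }
    return -1;
  }

  public static void main(String[] args) {
    String page = "<script>if(myvar <= 9) {}</script><form action=\"login\"></form>";
    System.out.println(tagsOutsideScript(page, false)); // prints []
    System.out.println(tagsOutsideScript(page, true));  // prints [form, /form]
  }
}
```

In the naive mode the "<" of "<=" is read as the start of a tag that ends at the '>' of "</script>", so the end tag is never reported, the in-script flag is never cleared, and the form is dropped. The sketch ignores comments, CDATA and quoted '>' characters inside attributes, all of which a real parser has to handle.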



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
