[
https://issues.apache.org/jira/browse/CONNECTORS-1623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Karl Wright reassigned CONNECTORS-1623:
---------------------------------------
Assignee: Karl Wright
> Script tags not ignored
> -----------------------
>
> Key: CONNECTORS-1623
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1623
> Project: ManifoldCF
> Issue Type: Bug
> Components: Web connector
> Affects Versions: ManifoldCF 2.13
> Reporter: Julien Massiera
> Assignee: Karl Wright
> Priority: Critical
>
> I discovered a problematic behavior in the
> org.apache.manifoldcf.connectorcommon.fuzzyml.TagParseState class when
> crawling web pages. It is a problem in particular for form-based
> authentication, as explained below.
> The org.apache.manifoldcf.connectorcommon.fuzzyml.HTMLParseState class,
> which TagParseState calls through its noteTag() and noteEndTag() methods,
> uses the org.apache.manifoldcf.crawler.connectors.webcrawler.ScriptParseState
> class to detect whether parsing is currently inside or outside a 'script'
> tag, and handles the incoming data accordingly. The problem is that the
> TagParseState class is not aware of the type of tag currently being parsed,
> so it keeps analyzing every char it encounters to detect tags, even while
> it is inside a script tag.
> So let's imagine you have a script tag built like this in a web page:
> {code:java}
> <script>if(myvar <= 9) {.......}</script>
> {code}
> When the TagParseState parses the char '<', it assumes that a new tag has
> begun and treats everything up to the next '>' char as part of that tag.
> So in the case above, the TagParseState never catches the end of the
> script tag. The scriptParseState variable in the ScriptParseState class
> therefore remains in the SCRIPTPARSESTATE_INSCRIPT state, and the rest of
> the web page is not handled correctly by the other parsers.
> As a result, if you configure, for example, form authentication for your
> crawl and the form's web page contains this kind of script tag before the
> form tag, the form is never handled and the authentication fails. This
> was the case I encountered, and I resolved it by forcing the
> scriptParseState to be SCRIPTPARSESTATE_NORMAL.
> ref :
> [http://mail-archives.apache.org/mod_mbox/manifoldcf-dev/201909.mbox/%3CCALUFAGA7eXi_gNBqWv2PRt2FaXuuKW5rTwLiXfceTkUAQfBvVg%40mail.gmail.com%3E]
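
The failure mode described in the report can be reproduced with a toy scanner. The sketch below is not ManifoldCF code: the class ScriptTagScanDemo and its methods are hypothetical names made up for illustration, and the scanner is far simpler than TagParseState. It only shows that treating every '<' as a tag start swallows the closing script tag, and that a scanner aware of being inside a script element (one possible direction for a fix, stated here as an assumption) still sees the form.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical demo class; not part of ManifoldCF.
final class ScriptTagScanDemo {

    /**
     * Returns the names of the tags reported while the scanner believes
     * it is outside of a script element.
     * scriptAware == false: every '<' starts a tag (the reported behavior).
     * scriptAware == true: inside a script element only "</script" starts a tag.
     */
    static List<String> tagsOutsideScript(String html, boolean scriptAware) {
        List<String> seen = new ArrayList<>();
        boolean inScript = false;
        int i = 0;
        while (i < html.length()) {
            if (html.charAt(i) != '<') {
                i++;
                continue;
            }
            // In script-aware mode, a '<' inside script content is plain text
            // unless it begins the closing script tag.
            if (inScript && scriptAware
                    && !html.regionMatches(true, i, "</script", 0, 8)) {
                i++;
                continue;
            }
            int close = html.indexOf('>', i);
            if (close < 0) {
                break; // unterminated tag, stop scanning
            }
            String name = tagName(html.substring(i + 1, close));
            if (inScript) {
                if (name.equals("/script")) {
                    inScript = false;
                }
            } else if (name.equals("script")) {
                inScript = true;
            } else {
                seen.add(name);
            }
            i = close + 1;
        }
        return seen;
    }

    /** Extracts the lower-cased first token of a tag body, e.g. "form". */
    static String tagName(String body) {
        String trimmed = body.trim();
        int end = 0;
        while (end < trimmed.length()
                && !Character.isWhitespace(trimmed.charAt(end))) {
            end++;
        }
        return trimmed.substring(0, end).toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        String html = "<script>if(myvar <= 9) {}</script>"
                + "<form action=\"/login\"></form>";
        System.out.println(tagsOutsideScript(html, false)); // prints []
        System.out.println(tagsOutsideScript(html, true));  // prints [form, /form]
    }
}
```

In the naive mode the '<' of "<= 9" opens a pseudo-tag that runs up to the '>' of the closing script tag, so "/script" is never reported and the form that follows is ignored, which matches the stuck SCRIPTPARSESTATE_INSCRIPT state described above.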
--
This message was sent by Atlassian Jira
(v8.3.4#803005)