[ https://issues.apache.org/jira/browse/CONNECTORS-1623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Karl Wright reassigned CONNECTORS-1623:
---------------------------------------

    Assignee: Karl Wright

> Script tags not ignored
> -----------------------
>
>                 Key: CONNECTORS-1623
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1623
>             Project: ManifoldCF
>          Issue Type: Bug
>          Components: Web connector
>    Affects Versions: ManifoldCF 2.13
>            Reporter: Julien Massiera
>            Assignee: Karl Wright
>            Priority: Critical
>
> I discovered a problematic behavior in the
> org.apache.manifoldcf.connectorcommon.fuzzyml.TagParseState class when
> crawling web pages. This behavior is a problem in particular for
> form-based authentication, as explained below.
> The org.apache.manifoldcf.connectorcommon.fuzzyml.HTMLParseState class,
> which the TagParseState calls on each noteTag() or noteEndTag(), uses the
> org.apache.manifoldcf.crawler.connectors.webcrawler.ScriptParseState
> class to detect whether parsing is inside or outside a 'script' tag, and
> then decides whether to process the incoming data. The problem is that the
> TagParseState class is not aware of the type of tag currently being parsed,
> so it keeps analyzing every char it encounters to detect tags, even while
> it is actually inside a script tag.
> So let's imagine you have a script tag built like this in a web page:
> {code:java}
> <script>if(myvar <= 9) {.......}</script>
> {code}
> When the TagParseState parses the '<' char, it assumes a new tag has begun
> and treats everything up to the next '>' char as part of that tag. So in
> the case above, the TagParseState never catches the end of the script tag,
> and thus the scriptParseState variable in the ScriptParseState class
> remains in the SCRIPTPARSESTATE_INSCRIPT state and the rest of the web page
> is not handled correctly by the other parsers.
> As a result, if you configure form authentication for your crawl and the
> form page contains this kind of script tag before the form tag, the form
> is never handled and the authentication fails. This is the case I
> encountered, and I resolved it by forcing the scriptParseState to
> SCRIPTPARSESTATE_NORMAL.
> ref:
> [http://mail-archives.apache.org/mod_mbox/manifoldcf-dev/201909.mbox/%3CCALUFAGA7eXi_gNBqWv2PRt2FaXuuKW5rTwLiXfceTkUAQfBvVg%40mail.gmail.com%3E]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
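
The failure mode described in the issue can be reproduced outside ManifoldCF with a small standalone sketch. The code below is not the ManifoldCF parser and does not reflect any fix made in the project: the class ScriptAwareTagScanner and its methods are hypothetical names used only for illustration. It contrasts a naive scan, where every '<' opens a tag that runs to the next '>', with a scan that skips script content until it finds "</script".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration only; none of these names exist in ManifoldCF.
class ScriptAwareTagScanner {

    // Naive scan: every '<' starts a tag that ends at the next '>'.
    static List<String> naiveTags(String html) {
        List<String> tags = new ArrayList<>();
        int i = 0;
        while (i < html.length()) {
            int open = html.indexOf('<', i);
            if (open < 0) break;
            int close = html.indexOf('>', open);
            if (close < 0) break;
            tags.add(html.substring(open + 1, close).trim());
            i = close + 1;
        }
        return tags;
    }

    // Script-aware scan: after a <script> start tag, skip ahead to "</script"
    // so that '<' characters inside the script body are not treated as tags.
    static List<String> scriptAwareTags(String html) {
        List<String> tags = new ArrayList<>();
        int i = 0;
        while (i < html.length()) {
            int open = html.indexOf('<', i);
            if (open < 0) break;
            int close = html.indexOf('>', open);
            if (close < 0) break;
            String tag = html.substring(open + 1, close).trim();
            tags.add(tag);
            i = close + 1;
            if (tagName(tag).equals("script")) {
                int end = indexOfIgnoreCase(html, "</script", i);
                if (end < 0) break; // unterminated script: stop scanning
                i = end;
            }
        }
        return tags;
    }

    // Tag name: characters up to the first whitespace or '/', lowercased.
    // An end tag such as "/script" yields an empty name.
    static String tagName(String tag) {
        int end = 0;
        while (end < tag.length()
                && !Character.isWhitespace(tag.charAt(end))
                && tag.charAt(end) != '/') {
            end++;
        }
        return tag.substring(0, end).toLowerCase(Locale.ROOT);
    }

    // Case-insensitive indexOf that never changes string length.
    static int indexOfIgnoreCase(String s, String needle, int from) {
        for (int i = from; i + needle.length() <= s.length(); i++) {
            if (s.regionMatches(true, i, needle, 0, needle.length())) return i;
        }
        return -1;
    }

    public static void main(String[] args) {
        String html = "<script>if(myvar <= 9) {.......}</script><form action=\"login\">";
        System.out.println("naive:        " + naiveTags(html));
        System.out.println("script-aware: " + scriptAwareTags(html));
    }
}
```

With the script from the issue, the naive scan never reports a "/script" end tag, because the '<' of "<=" swallows everything up to the '>' that closes "</script>". A state machine that waits for that end tag would therefore stay in its in-script state, which matches the behavior reported above.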