[ 
https://issues.apache.org/jira/browse/CONNECTORS-1623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16936934#comment-16936934
 ] 

Karl Wright commented on CONNECTORS-1623:
-----------------------------------------

Verification failed, with this unit test failure:

{code}
run-connector-common-tests:
    [junit] Testsuite: org.apache.manifoldcf.connectorcommon.fuzzyml.TestFuzzyML
    [junit] ERROR StatusLogger No log4j2 configuration file found. Using default configuration: logging only errors to the console.
    [junit] Tests run: 2, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 0.344 sec
    [junit]
    [junit] ------------- Standard Error -----------------
    [junit] ERROR StatusLogger No log4j2 configuration file found. Using default configuration: logging only errors to the console.
    [junit] ------------- ---------------- ---------------
    [junit] Testcase: testTags(org.apache.manifoldcf.connectorcommon.fuzzyml.TestFuzzyML):      FAILED
    [junit] null
    [junit] junit.framework.AssertionFailedError
    [junit]     at org.apache.manifoldcf.connectorcommon.fuzzyml.TestFuzzyML.testTags(TestFuzzyML.java:192)
    [junit]
    [junit]

BUILD FAILED
C:\wip\mcf\trunk\build.xml:290: The following error occurred while executing this line:
C:\wip\mcf\trunk\framework\build.xml:2030: Test org.apache.manifoldcf.connectorcommon.fuzzyml.TestFuzzyML failed
{code}

The test parses a real-world example HTML page, and it fails because the 
parser does not pick up the </script> tag at the end of the script section.  
The reason may be that end tags are still processed within the script section, 
which confuses the tag pairing.  That will not be straightforward to fix.  
[~julienFL], I am awaiting your suggestion on how to handle it.
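
One possible direction, sketched here only for discussion: once a <script> start tag has been seen, stop interpreting '<' as a tag start and skip ahead to the next "</script" (case-insensitive) that is followed by whitespace, '/' or '>'. This is how the HTML specification treats script content. The class and method names below (ScriptAwareScanner, scanTags, findScriptClose) are hypothetical and are not part of ManifoldCF; the scanner is deliberately simplified and does not handle comments or '>' inside attribute values.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch, not ManifoldCF code: a tiny tag scanner that treats
// the content of <script> as raw text until the matching end tag.
public class ScriptAwareScanner {

    /** Returns the tag names found, in order, e.g. "script", "/script", "form". */
    public static List<String> scanTags(String html) {
        List<String> tags = new ArrayList<>();
        int i = 0;
        while (i < html.length()) {
            int lt = html.indexOf('<', i);
            if (lt < 0) break;
            int gt = html.indexOf('>', lt);
            if (gt < 0) break;
            String name = tagName(html.substring(lt + 1, gt));
            tags.add(name);
            i = gt + 1;
            if (name.equals("script")) {
                // Raw-text mode: ignore every '<' until the real end tag.
                int close = findScriptClose(html, i);
                if (close < 0) break;   // unterminated script section
                i = close;              // next iteration reads "</script>" normally
            }
        }
        return tags;
    }

    /** Index of the next "</script" followed by whitespace, '/' or '>', or -1. */
    static int findScriptClose(String html, int from) {
        int pos = html.indexOf("</", from);
        while (pos >= 0) {
            int after = pos + 8; // length of "</script"
            if (html.regionMatches(true, pos, "</script", 0, 8)
                    && after < html.length()
                    && isTagNameEnd(html.charAt(after))) {
                return pos;
            }
            pos = html.indexOf("</", pos + 2);
        }
        return -1;
    }

    private static boolean isTagNameEnd(char c) {
        return c == '>' || c == '/' || Character.isWhitespace(c);
    }

    /** Extracts a lower-cased tag name, keeping a leading '/' for end tags. */
    private static String tagName(String inside) {
        String s = inside.trim();
        int end = 0;
        if (end < s.length() && s.charAt(end) == '/') end++;
        while (end < s.length() && Character.isLetterOrDigit(s.charAt(end))) end++;
        return s.substring(0, end).toLowerCase(Locale.ROOT);
    }
}
```

With this approach the '<' in "myvar <= 9" is never examined as a tag, so the end tag is still reported. Whether the same idea fits the character-at-a-time state machine in TagParseState is a separate question.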


> Script tags not ignored
> -----------------------
>
>                 Key: CONNECTORS-1623
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1623
>             Project: ManifoldCF
>          Issue Type: Bug
>          Components: Web connector
>    Affects Versions: ManifoldCF 2.13
>            Reporter: Julien Massiera
>            Assignee: Karl Wright
>            Priority: Critical
>             Fix For: ManifoldCF 2.14
>
>
> I discovered a problematic behavior with the 
> org.apache.manifoldcf.connectorcommon.fuzzyml.TagParseState class when 
> crawling web pages. This behavior poses a problem in particular for the 
> scenario of form-based authentication, as explained below. 
> The org.apache.manifoldcf.connectorcommon.fuzzyml.HTMLParseState class, which 
> is called by the TagParseState on each noteTag() or noteEndTag() call, 
> uses the org.apache.manifoldcf.crawler.connectors.webcrawler.ScriptParseState 
> class to detect whether the parsing process is inside or outside a 'script' 
> tag and to decide whether to process the incoming data. The problem is that 
> the TagParseState class is not aware of the type of tag currently being 
> parsed, so it continues to analyze every char it encounters to detect tags, 
> even while it is parsing the content of a script tag. 
> So let's imagine you have a script tag built like this in a web page: 
> {code:java}
> <script>if(myvar <= 9) {.......}</script>
> {code}
> When the TagParseState parses the char '<', it assumes that a new tag has 
> begun and treats everything up to the next '>' char as part of that tag. So 
> in the case above, the TagParseState will never catch the end of the script 
> tag, and thus the scriptParseState variable in the ScriptParseState class 
> will remain in the SCRIPTPARSESTATE_INSCRIPT state and the rest of the web 
> page will not be correctly handled by the other parsers. 
> As a result, if you, for example, configure form authentication for your 
> crawl and the form web page contains this kind of script tag before the 
> form tag, the form will never be handled and the authentication will fail. 
> This was the case I encountered, and I worked around it by forcing the 
> scriptParseState to be SCRIPTPARSESTATE_NORMAL.
> ref : 
> [http://mail-archives.apache.org/mod_mbox/manifoldcf-dev/201909.mbox/%3CCALUFAGA7eXi_gNBqWv2PRt2FaXuuKW5rTwLiXfceTkUAQfBvVg%40mail.gmail.com%3E]
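
The failure mode described in the issue above can be reproduced with a simplified model. The sketch below is not the actual TagParseState code; NaiveTagScanner and scan are hypothetical names, and the scanner simply assumes that every '<' starts a tag that runs up to the next '>'.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified model of the reported behavior (not ManifoldCF code).
public class NaiveTagScanner {

    /** Treats every '<' as a tag start and reads up to the next '>'. */
    public static List<String> scan(String html) {
        List<String> tags = new ArrayList<>();
        int i = 0;
        while (true) {
            int lt = html.indexOf('<', i);
            if (lt < 0) break;
            int gt = html.indexOf('>', lt + 1);
            if (gt < 0) break;
            tags.add(html.substring(lt + 1, gt)); // raw text between '<' and '>'
            i = gt + 1;
        }
        return tags;
    }

    public static void main(String[] args) {
        String page = "<script>if(myvar <= 9) {run();}</script><form></form>";
        // The '<' of "<=" swallows the real end tag into one bogus "tag".
        System.out.println(scan(page));
    }
}
```

In the returned list, the second entry starts at the "<=" comparison and ends at the '>' of the real end tag, so no "/script" entry is ever reported. A listener that waits for that end tag, as ScriptParseState does according to the description, would stay in its in-script state for the rest of the page.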



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
