Re: [VOTE] Release Apache ManifoldCF 2.14, RC1

2019-09-24 Thread Karl Wright
Ran all tests.

+1 from me.

Karl


On Tue, Sep 24, 2019 at 1:29 PM Karl Wright  wrote:

> Please vote on whether to release Apache ManifoldCF 2.14, RC1.
>
> There is a release tag at
> https://svn.apache.org/repos/asf/manifoldcf/tags/release-2.14-RC1 .
> There is a release artifact at
> https://dist.apache.org/repos/dist/dev/manifoldcf/apache-manifoldcf-2.14 .
>
> Note that this release of ManifoldCF has a new connector -- the Content
> Services Web Services connector.  In order to integrate this connector,
> changes to the dependencies for web services had to be made, so I urge
> those using the Alfresco and CMIS connectors to verify this release if at
> all possible.  Integration tests pass but there's nothing like an on-site
> test.
>
> RC1 also contains a fix for CONNECTORS-1623.
>
> Thanks,
> Karl
>


Re: [CANCEL] [VOTE] Release Apache ManifoldCF 2.14, RC0

2019-09-24 Thread Furkan KAMACI
Hi Karl,

Thanks for the information! Also, I hope your family member gets well
soon.

Kind Regards,
Furkan KAMACI

On Tue, Sep 24, 2019 at 8:47 PM Karl Wright  wrote:

> Hi Furkan,
>
> We do not do this as part of the release process.  The tags are versioned
> but the artifacts are named the same.  Only one of them ever is released so
> this is OK; the older artifacts are put in a different folder labeled
> "RC0", "RC1", etc.
>
> This process was originally designed by Jukka Zitting and Grant Ingersoll
> when MCF was graduating from the incubator and we have not changed it.
>
> Karl
>
>
> On Tue, Sep 24, 2019 at 1:36 PM Furkan KAMACI wrote:
>
> > Hi Karl,
> >
> > I didn't want to hijack latest vote thread. Do we need to add a suffix as
> > like RC-1 to dist file as like here:
> > https://dist.apache.org/repos/dist/dev/zookeeper/ ?
> >
> > Kind Regards,
> > Furkan KAMACI
> >
> > On Tue, Sep 24, 2019 at 7:32 PM Karl Wright  wrote:
> >
> > > CONNECTORS-1623.
> > >
> > > On Tue, Sep 24, 2019 at 9:04 AM Karl Wright  wrote:
> > >
> > > > Please vote on whether to release Apache ManifoldCF 2.14, RC0.
> > > >
> > > > There is a release tag at
> > > > https://svn.apache.org/repos/asf/manifoldcf/tags/release-2.14-RC0 .
> > > > There is a release artifact at
> > > > https://dist.apache.org/repos/dist/dev/manifoldcf/apache-manifoldcf-2.14 .
> > > >
> > > > Note that this release of ManifoldCF has a new connector -- the Content
> > > > Services Web Services connector.  In order to integrate this connector,
> > > > changes to the dependencies for web services had to be made, so I urge
> > > > those using the Alfresco and CMIS connectors to verify this release if at
> > > > all possible.  Integration tests pass but there's nothing like an on-site
> > > > test.
> > > >
> > > > Thanks,
> > > > Karl
> > > >
> > > >
> > >
> >
>


Re: [CANCEL] [VOTE] Release Apache ManifoldCF 2.14, RC0

2019-09-24 Thread Karl Wright
Hi Furkan,

We do not do this as part of the release process.  The tags are versioned,
but the artifacts all have the same name.  Only one of them is ever released,
so this is OK; the older artifacts are put in a different folder labeled
"RC0", "RC1", etc.

This process was originally designed by Jukka Zitting and Grant Ingersoll
when MCF was graduating from the incubator and we have not changed it.

Karl


On Tue, Sep 24, 2019 at 1:36 PM Furkan KAMACI wrote:

> Hi Karl,
>
> I didn't want to hijack latest vote thread. Do we need to add a suffix as
> like RC-1 to dist file as like here:
> https://dist.apache.org/repos/dist/dev/zookeeper/ ?
>
> Kind Regards,
> Furkan KAMACI
>
> On Tue, Sep 24, 2019 at 7:32 PM Karl Wright  wrote:
>
> > CONNECTORS-1623.
> >
> > On Tue, Sep 24, 2019 at 9:04 AM Karl Wright  wrote:
> >
> > > Please vote on whether to release Apache ManifoldCF 2.14, RC0.
> > >
> > > There is a release tag at
> > > https://svn.apache.org/repos/asf/manifoldcf/tags/release-2.14-RC0 .
> > > There is a release artifact at
> > > https://dist.apache.org/repos/dist/dev/manifoldcf/apache-manifoldcf-2.14 .
> > >
> > > Note that this release of ManifoldCF has a new connector -- the Content
> > > Services Web Services connector.  In order to integrate this connector,
> > > changes to the dependencies for web services had to be made, so I urge
> > > those using the Alfresco and CMIS connectors to verify this release if at
> > > all possible.  Integration tests pass but there's nothing like an on-site
> > > test.
> > >
> > > Thanks,
> > > Karl
> > >
> > >
> >
>


Re: [CANCEL] [VOTE] Release Apache ManifoldCF 2.14, RC0

2019-09-24 Thread Furkan KAMACI
Hi Karl,

I didn't want to hijack the latest vote thread. Do we need to add a suffix
such as RC-1 to the dist file, as is done here:
https://dist.apache.org/repos/dist/dev/zookeeper/ ?

Kind Regards,
Furkan KAMACI

On Tue, Sep 24, 2019 at 7:32 PM Karl Wright  wrote:

> CONNECTORS-1623.
>
> On Tue, Sep 24, 2019 at 9:04 AM Karl Wright  wrote:
>
> > Please vote on whether to release Apache ManifoldCF 2.14, RC0.
> >
> > There is a release tag at
> > https://svn.apache.org/repos/asf/manifoldcf/tags/release-2.14-RC0 .
> > There is a release artifact at
> > https://dist.apache.org/repos/dist/dev/manifoldcf/apache-manifoldcf-2.14 .
> >
> > Note that this release of ManifoldCF has a new connector -- the Content
> > Services Web Services connector.  In order to integrate this connector,
> > changes to the dependencies for web services had to be made, so I urge
> > those using the Alfresco and CMIS connectors to verify this release if at
> > all possible.  Integration tests pass but there's nothing like an on-site
> > test.
> >
> > Thanks,
> > Karl
> >
> >
>


[jira] [Resolved] (CONNECTORS-1623) Script tags not ignored

2019-09-24 Thread Karl Wright (Jira)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright resolved CONNECTORS-1623.
-
Resolution: Fixed

> Script tags not ignored
> ---
>
>              Key: CONNECTORS-1623
>              URL: https://issues.apache.org/jira/browse/CONNECTORS-1623
>          Project: ManifoldCF
>       Issue Type: Bug
>       Components: Web connector
> Affects Versions: ManifoldCF 2.13
>         Reporter: Julien Massiera
>         Assignee: Karl Wright
>         Priority: Critical
>          Fix For: ManifoldCF 2.14
>
>
> I discovered a problematic behavior in the 
> org.apache.manifoldcf.connectorcommon.fuzzyml.TagParseState class when 
> crawling web pages. It is a problem in particular for form-based 
> authentication, as explained below. 
> The org.apache.manifoldcf.connectorcommon.fuzzyml.HTMLParseState class, which 
> is called by TagParseState on each noteTag() or noteEndTag() call, uses the 
> org.apache.manifoldcf.crawler.connectors.webcrawler.ScriptParseState class 
> to detect whether parsing is inside or outside a 'script' tag and to decide 
> whether to act on the incoming data. The problem is that the TagParseState 
> class is not aware of the type of tag currently being parsed, so it keeps 
> analyzing every char it encounters to detect tags, even inside a script tag. 
> So let's imagine a web page contains a script tag with content like this: 
> {code:java}
> if(myvar <= 9) {...}
> {code}
> When TagParseState parses the '<' char, it considers that a new tag has 
> begun, and that tag lasts until it encounters a '>' char. So in the case 
> above, TagParseState never catches the end of the script tag; the 
> scriptParseState variable in the ScriptParseState class therefore remains in 
> the SCRIPTPARSESTATE_INSCRIPT state, and the rest of the web page is not 
> handled correctly by the other parsers. 
> As a result, if you configure form authentication for your crawl, for 
> example, and the form web page contains this kind of script tag before the 
> form tag, the form is never handled and the authentication fails. This is 
> the case I encountered, and I resolved it by forcing the scriptParseState 
> to be SCRIPTPARSESTATE_NORMAL.
> ref : 
> [http://mail-archives.apache.org/mod_mbox/manifoldcf-dev/201909.mbox/%3CCALUFAGA7eXi_gNBqWv2PRt2FaXuuKW5rTwLiXfceTkUAQfBvVg%40mail.gmail.com%3E]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[CANCEL] [VOTE] Release Apache ManifoldCF 2.14, RC0

2019-09-24 Thread Karl Wright
CONNECTORS-1623.

On Tue, Sep 24, 2019 at 9:04 AM Karl Wright  wrote:

> Please vote on whether to release Apache ManifoldCF 2.14, RC0.
>
> There is a release tag at
> https://svn.apache.org/repos/asf/manifoldcf/tags/release-2.14-RC0 .
> There is a release artifact at
> https://dist.apache.org/repos/dist/dev/manifoldcf/apache-manifoldcf-2.14 .
>
> Note that this release of ManifoldCF has a new connector -- the Content
> Services Web Services connector.  In order to integrate this connector,
> changes to the dependencies for web services had to be made, so I urge
> those using the Alfresco and CMIS connectors to verify this release if at
> all possible.  Integration tests pass but there's nothing like an on-site
> test.
>
> Thanks,
> Karl
>
>


[jira] [Commented] (CONNECTORS-1623) Script tags not ignored

2019-09-24 Thread Karl Wright (Jira)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16936967#comment-16936967
 ] 

Karl Wright commented on CONNECTORS-1623:
-

r1867468 (trunk)
r1867469 (release branch)


[jira] [Commented] (CONNECTORS-1623) Script tags not ignored

2019-09-24 Thread Karl Wright (Jira)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16936964#comment-16936964
 ] 

Karl Wright commented on CONNECTORS-1623:
-

I found a fix; committing it.


[jira] [Commented] (CONNECTORS-1623) Script tags not ignored

2019-09-24 Thread Karl Wright (Jira)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16936934#comment-16936934
 ] 

Karl Wright commented on CONNECTORS-1623:
-

Verification failed, with this unit test failure:

{code}
run-connector-common-tests:
[junit] Testsuite: org.apache.manifoldcf.connectorcommon.fuzzyml.TestFuzzyML
[junit] ERROR StatusLogger No log4j2 configuration file found. Using default configuration: logging only errors to the console.
[junit] Tests run: 2, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 0.344 sec
[junit]
[junit] - Standard Error -
[junit] ERROR StatusLogger No log4j2 configuration file found. Using default configuration: logging only errors to the console.
[junit] -  ---
[junit] Testcase: testTags(org.apache.manifoldcf.connectorcommon.fuzzyml.TestFuzzyML):  FAILED
[junit] null
[junit] junit.framework.AssertionFailedError
[junit] at org.apache.manifoldcf.connectorcommon.fuzzyml.TestFuzzyML.testTags(TestFuzzyML.java:192)
[junit]
[junit]

BUILD FAILED
C:\wip\mcf\trunk\build.xml:290: The following error occurred while executing this line:
C:\wip\mcf\trunk\framework\build.xml:2030: Test org.apache.manifoldcf.connectorcommon.fuzzyml.TestFuzzyML failed
{code}

The test parses a real-world example HTML page, and it fails because the 
parser does not correctly pick up the </script> tag at the end of the script 
section.  The reason may be that end tags are still processed within the 
script section, which confuses the tag pairing.  That will not be 
straightforward to fix.  [~julienFL], awaiting your suggestion for that.
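
One way to avoid this kind of confusion, sketched here under stated 
assumptions, is to ignore every '<' inside a script body unless it begins the 
closing script tag. This is not the change committed for this ticket and not 
the fuzzyml code; ScriptAwareTagScanner and scan() are hypothetical names used 
only for illustration, and comments, CDATA sections and '>' characters inside 
attribute values are deliberately not handled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration only; not part of ManifoldCF.
class ScriptAwareTagScanner {
    // Returns the lowercased tag names found in html, in document order.
    static List<String> scan(String html) {
        List<String> tags = new ArrayList<>();
        boolean inScript = false;
        int i = 0;
        while (i < html.length()) {
            if (html.charAt(i) != '<') {
                i++;
                continue;
            }
            // Inside a script body, '<' is ordinary text (for example "<=")
            // unless it starts the closing script tag.
            if (inScript && !html.regionMatches(true, i, "</script", 0, 8)) {
                i++;
                continue;
            }
            int close = html.indexOf('>', i);
            if (close < 0) {
                break; // unterminated tag: stop scanning
            }
            String body = html.substring(i + 1, close).trim();
            String name = body.split("\\s+")[0].toLowerCase(Locale.ROOT);
            tags.add(name);
            if (name.equals("script")) {
                inScript = true;
            } else if (name.equals("/script")) {
                inScript = false;
            }
            i = close + 1;
        }
        return tags;
    }

    public static void main(String[] args) {
        String page = "<script>if(myvar <= 9) {...}</script>"
            + "<form action=\"/login\"></form>";
        System.out.println(scan(page));
    }
}
```

With this approach the "<=" comparison is skipped as script text, so the 
closing script tag is reported and the form tag after it is seen outside the 
script section.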


[jira] [Commented] (CONNECTORS-1623) Script tags not ignored

2019-09-24 Thread Karl Wright (Jira)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16936866#comment-16936866
 ] 

Karl Wright commented on CONNECTORS-1623:
-

I put together a fix but need to verify it.


[jira] [Updated] (CONNECTORS-1623) Script tags not ignored

2019-09-24 Thread Karl Wright (Jira)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright updated CONNECTORS-1623:

Fix Version/s: ManifoldCF 2.14


[jira] [Assigned] (CONNECTORS-1623) Script tags not ignored

2019-09-24 Thread Karl Wright (Jira)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright reassigned CONNECTORS-1623:
---

Assignee: Karl Wright


[jira] [Created] (CONNECTORS-1623) Script tags not ignored

2019-09-24 Thread Julien Massiera (Jira)
Julien Massiera created CONNECTORS-1623:
---

         Summary: Script tags not ignored
             Key: CONNECTORS-1623
             URL: https://issues.apache.org/jira/browse/CONNECTORS-1623
         Project: ManifoldCF
      Issue Type: Bug
      Components: Web connector
Affects Versions: ManifoldCF 2.13
        Reporter: Julien Massiera


I discovered a problematic behavior in the 
org.apache.manifoldcf.connectorcommon.fuzzyml.TagParseState class when crawling 
web pages. It is a problem in particular for form-based authentication, as 
explained below. 

The org.apache.manifoldcf.connectorcommon.fuzzyml.HTMLParseState class, which 
is called by TagParseState on each noteTag() or noteEndTag() call, uses the 
org.apache.manifoldcf.crawler.connectors.webcrawler.ScriptParseState class to 
detect whether parsing is inside or outside a 'script' tag and to decide 
whether to act on the incoming data. The problem is that the TagParseState 
class is not aware of the type of tag currently being parsed, so it keeps 
analyzing every char it encounters to detect tags, even inside a script tag. 

So let's imagine a web page contains a script tag with content like this: 
{code:java}
if(myvar <= 9) {...}
{code}

When TagParseState parses the '<' char, it considers that a new tag has begun, 
and that tag lasts until it encounters a '>' char. So in the case above, 
TagParseState never catches the end of the script tag; the scriptParseState 
variable in the ScriptParseState class therefore remains in the 
SCRIPTPARSESTATE_INSCRIPT state, and the rest of the web page is not handled 
correctly by the other parsers. 

As a result, if you configure form authentication for your crawl, for example, 
and the form web page contains this kind of script tag before the form tag, 
the form is never handled and the authentication fails. This is the case I 
encountered, and I resolved it by forcing the scriptParseState to be 
SCRIPTPARSESTATE_NORMAL.

ref : 
[http://mail-archives.apache.org/mod_mbox/manifoldcf-dev/201909.mbox/%3CCALUFAGA7eXi_gNBqWv2PRt2FaXuuKW5rTwLiXfceTkUAQfBvVg%40mail.gmail.com%3E]
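
The failure mode described above can be reproduced outside ManifoldCF with a 
small sketch. The Java below is not the fuzzyml code: NaiveTagScanner, scan() 
and formsOutsideScript() are hypothetical names invented for illustration. The 
scanner assumes only what the report states, namely that every '<' opens a tag 
that runs to the next '>', and a simple in-script flag stands in for the 
script parse state.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration only; not part of ManifoldCF.
class NaiveTagScanner {
    // Treats every '<' as the start of a tag that ends at the next '>'.
    static List<String> scan(String html) {
        List<String> tags = new ArrayList<>();
        int i = 0;
        while (i < html.length()) {
            if (html.charAt(i) != '<') {
                i++;
                continue;
            }
            int close = html.indexOf('>', i);
            if (close < 0) {
                break; // unterminated tag: stop scanning
            }
            String body = html.substring(i + 1, close).trim();
            tags.add(body.split("\\s+")[0].toLowerCase(Locale.ROOT));
            i = close + 1;
        }
        return tags;
    }

    // Counts form tags seen while the in-script flag is clear.
    static int formsOutsideScript(String html) {
        boolean inScript = false;
        int forms = 0;
        for (String tag : scan(html)) {
            if (tag.equals("script")) {
                inScript = true;
            } else if (tag.equals("/script")) {
                inScript = false;
            } else if (tag.equals("form") && !inScript) {
                forms++;
            }
        }
        return forms;
    }

    public static void main(String[] args) {
        String page = "<script>if(myvar <= 9) {...}</script>"
            + "<form action=\"/login\"></form>";
        System.out.println(scan(page));
        System.out.println(formsOutsideScript(page));
    }
}
```

For the sample page, the scanner swallows everything from "<=" up to the end 
of the closing script tag as one bogus tag, so no "/script" tag is ever 
reported, the in-script flag is never cleared, and the form is not counted.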





Re: [VOTE] Release Apache ManifoldCF 2.14, RC0

2019-09-24 Thread Karl Wright
It would also be great if you constructed a patch and attached it to your
ticket.  Then we can have a back-and-forth about improvements if they are
needed, and about tests that exercise it.  I am really very pressed for time
of late, and medical issues in my family are going to really impact what
little time I have.

Karl


On Tue, Sep 24, 2019 at 9:25 AM Karl Wright  wrote:

> Please create a ticket.  I am so busy I cannot keep track of issues
> without tickets.
> Karl
>
>
>
> On Tue, Sep 24, 2019 at 9:23 AM Julien Massiera <
> julien.massi...@francelabs.com> wrote:
>
>> Hi Karl,
>>
>> is it possible to have in this v2.14 a fix for the problem I exposed two
>> weeks ago about the web connector ? ref :
>>
>> http://mail-archives.apache.org/mod_mbox/manifoldcf-dev/201909.mbox/%3CCALUFAGA7eXi_gNBqWv2PRt2FaXuuKW5rTwLiXfceTkUAQfBvVg%40mail.gmail.com%3E
>>
>> I didn't create a ticket on this subject but I can if you are ok with it.
>>
>> Julien
>>
>> On 24/09/2019 15:04, Karl Wright wrote:
>> > Please vote on whether to release Apache ManifoldCF 2.14, RC0.
>> >
>> > There is a release tag at
>> > https://svn.apache.org/repos/asf/manifoldcf/tags/release-2.14-RC0 .
>> > There is a release artifact at
>> > https://dist.apache.org/repos/dist/dev/manifoldcf/apache-manifoldcf-2.14 .
>> >
>> > Note that this release of ManifoldCF has a new connector -- the Content
>> > Services Web Services connector.  In order to integrate this connector,
>> > changes to the dependencies for web services had to be made, so I urge
>> > those using the Alfresco and CMIS connectors to verify this release if at
>> > all possible.  Integration tests pass but there's nothing like an on-site
>> > test.
>> >
>> > Thanks,
>> > Karl
>> >
>> --
>> Julien MASSIERA
>> Directeur développement produit
>> France Labs – Les experts du Search
>> Datafari – Vainqueur du trophée Big Data 2018 au Digital Innovation
>> Makers Summit
>> www.francelabs.com
>>
>>


Re: [VOTE] Release Apache ManifoldCF 2.14, RC0

2019-09-24 Thread Karl Wright
Please create a ticket.  I am so busy I cannot keep track of issues without
tickets.
Karl



On Tue, Sep 24, 2019 at 9:23 AM Julien Massiera <
julien.massi...@francelabs.com> wrote:

> Hi Karl,
>
> is it possible to have in this v2.14 a fix for the problem I exposed two
> weeks ago about the web connector ? ref :
>
> http://mail-archives.apache.org/mod_mbox/manifoldcf-dev/201909.mbox/%3CCALUFAGA7eXi_gNBqWv2PRt2FaXuuKW5rTwLiXfceTkUAQfBvVg%40mail.gmail.com%3E
>
> I didn't create a ticket on this subject but I can if you are ok with it.
>
> Julien
>
> On 24/09/2019 15:04, Karl Wright wrote:
> > Please vote on whether to release Apache ManifoldCF 2.14, RC0.
> >
> > There is a release tag at
> > https://svn.apache.org/repos/asf/manifoldcf/tags/release-2.14-RC0 .
> > There is a release artifact at
> > https://dist.apache.org/repos/dist/dev/manifoldcf/apache-manifoldcf-2.14 .
> >
> > Note that this release of ManifoldCF has a new connector -- the Content
> > Services Web Services connector.  In order to integrate this connector,
> > changes to the dependencies for web services had to be made, so I urge
> > those using the Alfresco and CMIS connectors to verify this release if at
> > all possible.  Integration tests pass but there's nothing like an on-site
> > test.
> >
> > Thanks,
> > Karl
> >
> --
> Julien MASSIERA
> Directeur développement produit
> France Labs – Les experts du Search
> Datafari – Vainqueur du trophée Big Data 2018 au Digital Innovation Makers
> Summit
> www.francelabs.com
>
>


Re: [VOTE] Release Apache ManifoldCF 2.14, RC0

2019-09-24 Thread Julien Massiera

Hi Karl,

Is it possible to include in v2.14 a fix for the web connector problem I 
described two weeks ago? Ref: 
http://mail-archives.apache.org/mod_mbox/manifoldcf-dev/201909.mbox/%3CCALUFAGA7eXi_gNBqWv2PRt2FaXuuKW5rTwLiXfceTkUAQfBvVg%40mail.gmail.com%3E


I didn't create a ticket on this subject but I can if you are ok with it.

Julien

On 24/09/2019 15:04, Karl Wright wrote:

> Please vote on whether to release Apache ManifoldCF 2.14, RC0.
>
> There is a release tag at
> https://svn.apache.org/repos/asf/manifoldcf/tags/release-2.14-RC0 .
> There is a release artifact at
> https://dist.apache.org/repos/dist/dev/manifoldcf/apache-manifoldcf-2.14 .
>
> Note that this release of ManifoldCF has a new connector -- the Content
> Services Web Services connector.  In order to integrate this connector,
> changes to the dependencies for web services had to be made, so I urge
> those using the Alfresco and CMIS connectors to verify this release if at
> all possible.  Integration tests pass but there's nothing like an on-site
> test.
>
> Thanks,
> Karl


--
Julien MASSIERA
Directeur développement produit
France Labs – Les experts du Search
Datafari – Vainqueur du trophée Big Data 2018 au Digital Innovation Makers 
Summit
www.francelabs.com



[VOTE] Release Apache ManifoldCF 2.14, RC0

2019-09-24 Thread Karl Wright
Please vote on whether to release Apache ManifoldCF 2.14, RC0.

There is a release tag at
https://svn.apache.org/repos/asf/manifoldcf/tags/release-2.14-RC0 .
There is a release artifact at
https://dist.apache.org/repos/dist/dev/manifoldcf/apache-manifoldcf-2.14 .

Note that this release of ManifoldCF has a new connector -- the Content
Services Web Services connector.  In order to integrate this connector,
changes to the dependencies for web services had to be made, so I urge
those using the Alfresco and CMIS connectors to verify this release if at
all possible.  Integration tests pass but there's nothing like an on-site
test.

Thanks,
Karl


[jira] [Resolved] (CONNECTORS-1566) Develop CSWS connector as a replacement for deprecated LiveLink LAPI connector

2019-09-24 Thread Karl Wright (Jira)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright resolved CONNECTORS-1566.
-
Resolution: Fixed

Confirmed that the connector is mostly working for multiple users.  We reserve 
the right to open individual tickets for problems that still need resolution.


> Develop CSWS connector as a replacement for deprecated LiveLink LAPI connector
> --
>
>              Key: CONNECTORS-1566
>              URL: https://issues.apache.org/jira/browse/CONNECTORS-1566
>          Project: ManifoldCF
>       Issue Type: Task
>       Components: LiveLink connector
> Affects Versions: ManifoldCF 2.12
>         Reporter: Karl Wright
>         Assignee: Karl Wright
>         Priority: Major
>          Fix For: ManifoldCF 2.14
>
>      Attachments: OTCS_IIS.png, OTCS_Tomcat.png, chrome_cgfC00ujx7.png
>
>
> LAPI is being deprecated.  We need to develop a replacement for it using the 
> ContentServer Web Services API.


