[jira] [Commented] (CONNECTORS-1680) WebConnector: Support the Document Base URL element

2021-12-17 Thread Markus Schuch (Jira)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17461592#comment-17461592
 ] 

Markus Schuch commented on CONNECTORS-1680:
---

Fixed with r1896101.

> WebConnector: Support the Document Base URL element
> ---
>
> Key: CONNECTORS-1680
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1680
> Project: ManifoldCF
> Issue Type: Improvement
> Components: Web connector
> Reporter: Markus Schuch
> Assignee: Karl Wright
> Priority: Major
> Fix For: ManifoldCF 2.21
>
>
> HTML allows a document to specify the base URL to use for all relative URLs
> in that document:
> {code:java}
> <html>
>   <head>
>     <base href="https://example.org/"/>
>     ...
>   </head>
>   ...
> </html>
> {code}
> [https://developer.mozilla.org/de/docs/Web/HTML/Element/base]
> The Web Connector should respect this element when handling relative links.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (CONNECTORS-1680) WebConnector: Support the Document Base URL element

2021-12-17 Thread Karl Wright (Jira)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17461575#comment-17461575
 ] 

Karl Wright commented on CONNECTORS-1680:
-

The baseDocumentIdentifier code is new. It adds the ability to set the base 
URL for all relative references, a change you requested and I implemented.

All relative document URLs should be resolved against baseDocumentIdentifier, 
which is initialized to documentIdentifier at the start.
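
The behavior described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the ManifoldCF code: the class name BaseAwareLinkResolver and the methods noteBaseHref and resolveLink are hypothetical, and only the field names baseDocumentIdentifier and documentIdentifier come from this thread. Resolution uses the standard java.net.URI.resolve.

```java
import java.net.URI;

/** Hypothetical sketch; not the actual ManifoldCF ProcessActivityLinkHandler. */
final class BaseAwareLinkResolver {
    private final String documentIdentifier;
    // Starts out equal to the document's own URL.
    private String baseDocumentIdentifier;

    BaseAwareLinkResolver(String documentIdentifier) {
        this.documentIdentifier = documentIdentifier;
        this.baseDocumentIdentifier = documentIdentifier;
    }

    /** Hypothetical hook, called when a base element with an href is seen. */
    void noteBaseHref(String href) {
        // The href may itself be relative, so resolve it against the document URL.
        baseDocumentIdentifier = URI.create(documentIdentifier).resolve(href).toString();
    }

    /** Resolves a link found in the document against the current base. */
    String resolveLink(String rawLink) {
        return URI.create(baseDocumentIdentifier).resolve(rawLink).toString();
    }

    public static void main(String[] args) {
        BaseAwareLinkResolver r =
            new BaseAwareLinkResolver("https://site.test/a/b/index.html");
        System.out.println(r.resolveLink("c.html")); // https://site.test/a/b/c.html
        r.noteBaseHref("https://example.org/");
        System.out.println(r.resolveLink("c.html")); // https://example.org/c.html
    }
}
```

Absolute links are unaffected by the base, since URI.resolve returns an absolute argument unchanged.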




[jira] [Commented] (CONNECTORS-1680) WebConnector: Support the Document Base URL element

2021-12-17 Thread Markus Schuch (Jira)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17461544#comment-17461544
 ] 

Markus Schuch commented on CONNECTORS-1680:
---

[~kwri...@metacarta.com] While fixing the Travis CI build, I found that 
{{org.apache.manifoldcf.crawler.connectors.webcrawler.tests.SessionLoginHSQLDBIT}}
fails after this change:

{code:java}
    [junit] -  ---
    [junit] Testcase: sessionCrawl(org.apache.manifoldcf.crawler.connectors.webcrawler.tests.SessionLoginHSQLDBIT): Caused an ERROR
    [junit] Wrong number of documents processed - expected 101, saw 1
    [junit] org.apache.manifoldcf.core.interfaces.ManifoldCFException: Wrong number of documents processed - expected 101, saw 1
    [junit]     at org.apache.manifoldcf.crawler.connectors.webcrawler.tests.SessionTester.executeTest(SessionTester.java:166)
    [junit]     at org.apache.manifoldcf.crawler.connectors.webcrawler.tests.SessionLoginHSQLDBIT.sessionCrawl(SessionLoginHSQLDBIT.java:59)
    [junit]
    [junit]
{code}



[jira] [Commented] (CONNECTORS-1680) WebConnector: Support the Document Base URL element

2021-11-28 Thread Markus Schuch (Jira)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17450108#comment-17450108
 ] 

Markus Schuch commented on CONNECTORS-1680:
---

I tried crawling a website that uses the Base URL element, and the discovered 
links look correct now.

I also looked over the code changes. One minor issue: the assignment of 
{{baseDocumentIdentifier}} in the constructor of {{ProcessActivityLinkHandler}} 
looks redundant (the field is assigned to itself):
https://github.com/apache/manifoldcf/commit/8b88d6d44aba54ee276619340ad36a9cc441932d#diff-9fff09bc306de2115d24403a5fe1c3c9fe831d30a54448291f1f90b4d5c6b2a7R3889
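
To illustrate the kind of self-assignment meant here, a minimal sketch follows. It is not the ManifoldCF source; the class names SelfAssignmentSketch, SelfAssigned and Initialized are hypothetical, and the sketch assumes the constructor has no parameter or local variable named baseDocumentIdentifier, so both sides of the assignment refer to the same field.

```java
/** Hypothetical illustration of the pitfall; not the ManifoldCF source. */
final class SelfAssignmentSketch {

    /** The field is assigned to itself, so the statement has no effect. */
    static final class SelfAssigned {
        private String baseDocumentIdentifier;

        SelfAssigned(String documentIdentifier) {
            // No parameter or local shadows the field here, so this is
            // "field = field": it compiles, but the field keeps its default (null).
            this.baseDocumentIdentifier = baseDocumentIdentifier;
        }

        String base() {
            return baseDocumentIdentifier;
        }
    }

    /** The field is initialized from the document identifier instead. */
    static final class Initialized {
        private String baseDocumentIdentifier;

        Initialized(String documentIdentifier) {
            this.baseDocumentIdentifier = documentIdentifier;
        }

        String base() {
            return baseDocumentIdentifier;
        }
    }

    public static void main(String[] args) {
        System.out.println(new SelfAssigned("https://example.org/").base()); // null
        System.out.println(new Initialized("https://example.org/").base());  // https://example.org/
    }
}
```

If the field already has an initializer elsewhere, a self-assignment like this is harmless but can simply be dropped.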


