[ 
https://issues.apache.org/jira/browse/CONNECTORS-1729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright reassigned CONNECTORS-1729:
---------------------------------------

    Assignee: Karl Wright

> The Confluence-v6 Repository Connector's attachment logic is incorrect
> ----------------------------------------------------------------------
>
>                 Key: CONNECTORS-1729
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1729
>             Project: ManifoldCF
>          Issue Type: Bug
>            Reporter: Nguyen Huu Nhat
>            Assignee: Karl Wright
>            Priority: Major
>
> Hi there,
> Since there is an unhandled issue that occurs in use, I would like to 
> suggest the following fix for the source code of the Confluence Repository 
> Connector.
> For details about this issue, please refer to the information below:
> h3. +*1. Connector Name*+
> confluence-v6 \ Confluence Repository Connector
> h3. +*2. Overview*+
>  * In the Confluence Repository Connector, there is an error in the logic 
> that determines whether the document has attachments or not.
>  * This incorrect logic causes attachments not to be crawled.
> ※ This error only occurs when crawling documents from Confluence server, 
> while crawling documents from Confluence Cloud (SaaS) still works normally.
>  * Formats of the document's ID when there is a file attached are as below:
>  ** Crawled from Confluence server: *<id of attachment>-<id of blog/page>*
>  ** Crawled from Confluence cloud (SaaS): *att<id of attachment>-<id of 
> blog/page>*
> h3. +*3. Reproduction*+
>  * On Confluence server:
>  ** Create a blog.
>  ** Add attachments to the newly created blog.
>  * On ManifoldCF:
>  ** Create a Confluence Repository Connector with the aforementioned 
> Confluence server information.
>  ** Create a job using the connector created above with the following details:
>  *** On the [Page] tab:
>  **** Process Attachments: (Check).
>  **** Type Specification: Blog.
>  ** Start job.
>  ** Check [Simple History Report].
> h3. +*4. Cause*+
>  * The logic that judges whether a document has a file attachment checks 
> whether the document's ID begins with *att*; if it does, the document is 
> judged to have a file attachment.
>  * However, when a file is attached, the ID of a document crawled from the 
> Confluence server is not prefixed with *att* (see the formats in item 2).
> h3. +*5. Solution*+
> My observation is as below:
>  * If a document has a file attachment, the ID of that document is a string 
> of characters joined by the *-* character.
>  * If a document does not have a file attachment, the ID of that document 
> does not contain the *-* character.
> Therefore, it is possible to judge whether a file is attached or not by 
> checking whether the ID contains the *-* character.
> h3. +*6. Suggested source code (based on release 2.22.1)*+
> ***Class: 
> org.apache.manifoldcf.crawler.connectors.confluence.v6.util.ConfluenceUtil***
> [https://github.com/apache/manifoldcf/blob/release-2.22.1/connectors/confluence-v6/connector/src/main/java/org/apache/manifoldcf/crawler/connectors/confluence/v6/util/ConfluenceUtil.java#L28]
> {code:java}
> -  private static final String ATTACHMENT_ID_PREFIX = "att";
> +  private static final String ATTACHMENT_ID_CHARACTER = "-";
> {code}
> [https://github.com/apache/manifoldcf/blob/release-2.22.1/connectors/confluence-v6/connector/src/main/java/org/apache/manifoldcf/crawler/connectors/confluence/v6/util/ConfluenceUtil.java#L47]
> {code:java}
>    public static Boolean isAttachment(String id) {
> -    return id.startsWith(ATTACHMENT_ID_PREFIX);
> +    return id.contains(ATTACHMENT_ID_CHARACTER);
>    }
> {code}
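
The difference between the two checks can be sketched with a small standalone class. This is only an illustration under stated assumptions, not the connector's code: the class name AttachmentIdCheck, the two method names, and the sample IDs are hypothetical, chosen to match the ID formats described in item 2.

```java
// Hypothetical standalone sketch; not part of ManifoldCF.
public class AttachmentIdCheck {

    // Behavior as of release 2.22.1: only IDs starting with "att" count.
    static boolean isAttachmentByPrefix(String id) {
        return id.startsWith("att");
    }

    // Suggested behavior: any ID containing "-" counts.
    static boolean isAttachmentByHyphen(String id) {
        return id.contains("-");
    }

    public static void main(String[] args) {
        // Made-up IDs following the formats in item 2.
        String serverAttachment = "98765-12345";    // <attachment id>-<page id>
        String cloudAttachment = "att98765-12345";  // att<attachment id>-<page id>
        String plainPage = "12345";                 // no attachment part

        for (String id : new String[] {serverAttachment, cloudAttachment, plainPage}) {
            System.out.println(id
                + " prefix=" + isAttachmentByPrefix(id)
                + " hyphen=" + isAttachmentByHyphen(id));
        }
    }
}
```

With these sample IDs, the prefix check misses the server-style ID while the hyphen check accepts both attachment formats and still rejects the plain page ID. The hyphen check relies on the reporter's observation that IDs of documents without attachments never contain the *-* character.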



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
