[ https://issues.apache.org/jira/browse/CONNECTORS-1729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Karl Wright resolved CONNECTORS-1729.
-------------------------------------
    Fix Version/s: ManifoldCF next
       Resolution: Fixed

r1903771

> The Confluence-v6 Repository Connector's attachment logic is incorrect
> ----------------------------------------------------------------------
>
>                 Key: CONNECTORS-1729
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1729
>             Project: ManifoldCF
>          Issue Type: Bug
>            Reporter: Nguyen Huu Nhat
>            Assignee: Karl Wright
>            Priority: Major
>             Fix For: ManifoldCF next
>
>
> Hi there,
> I have run into an issue in use that is not yet handled, so I would like to
> suggest the following fix for the source code of the Confluence Repository
> Connector.
> For details about this issue, please refer to the information below:
> h3. +*1. Connector Name*+
> confluence-v6 \ Confluence Repository Connector
> h3. +*2. Overview*+
> * In the Confluence Repository Connector, there is an error in the logic
> that determines whether the document has attachments or not.
> * The wrong logic leads to attachments not being crawled.
> ※ This error only occurs when crawling documents from Confluence server;
> crawling documents from Confluence Cloud (SaaS) still works normally.
> * The formats of the document's ID when there is a file attached are as below:
> ** Crawled from Confluence server: *<id of attachment>-<id of blog/page>*
> ** Crawled from Confluence Cloud (SaaS): *att<id of attachment>-<id of
> blog/page>*
> h3. +*3. Reproduction*+
> * On Confluence server:
> ** Create a blog.
> ** Add attachments to the newly created blog.
> * On ManifoldCF:
> ** Create a Confluence Repository Connector with the aforementioned
> Confluence server information.
> ** Create a job using the connector created above with the following details:
> *** On the [Page] tab:
> **** Process Attachments: (Check).
> **** Type Specification: Blog.
> ** Start job.
> ** Check [Simple History Report].
> h3. +*4. Cause*+
> * The logic that judges whether the document has a file attachment treats
> the document as having one only if its ID begins with *att*.
> * However, the ID of a document crawled from the Confluence server is not
> prefixed with *att* when a file is attached (see the formats in item 2).
> h3. +*5. Solution*+
> My observation is as below:
> * If a document has a file attachment, the ID of that document is a string
> of characters joined by a *-* character.
> * If a document does not have a file attachment, the ID of that document
> does not contain a *-* character.
> Therefore, it is possible to judge whether a file is attached or not by
> checking whether the ID contains a *-* character.
> h3. +*6. Suggested source code (based on release 2.22.1)*+
> ***Class:
> org.apache.manifoldcf.crawler.connectors.confluence.v6.util.ConfluenceUtil***
> [https://github.com/apache/manifoldcf/blob/release-2.22.1/connectors/confluence-v6/connector/src/main/java/org/apache/manifoldcf/crawler/connectors/confluence/v6/util/ConfluenceUtil.java#L28]
> {code:java}
> -  private static final String ATTACHMENT_ID_PREFIX = "att";
> +  private static final String ATTACHMENT_ID_CHARACTER = "-";
> {code}
> [https://github.com/apache/manifoldcf/blob/release-2.22.1/connectors/confluence-v6/connector/src/main/java/org/apache/manifoldcf/crawler/connectors/confluence/v6/util/ConfluenceUtil.java#L47]
> {code:java}
>   public static Boolean isAttachment(String id) {
> -    return id.startsWith(ATTACHMENT_ID_PREFIX);
> +    return id.contains(ATTACHMENT_ID_CHARACTER);
>   }
> {code}

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
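
For readers who want to see the difference between the two checks described in the report, here is a minimal standalone sketch. It is not the ManifoldCF source: the class name AttachmentIdCheck, the two method names, and the sample IDs are hypothetical and were chosen only to match the ID formats listed in item 2 of the report.

```java
// Standalone sketch; AttachmentIdCheck and the sample IDs below are
// hypothetical and are not part of ManifoldCF.
final class AttachmentIdCheck {
    private static final String ATTACHMENT_ID_PREFIX = "att";
    private static final String ATTACHMENT_ID_CHARACTER = "-";

    // Check as described for release 2.22.1: only IDs starting with "att" match.
    static boolean isAttachmentByPrefix(String id) {
        return id.startsWith(ATTACHMENT_ID_PREFIX);
    }

    // Check as proposed in the report: any ID containing "-" matches.
    static boolean isAttachmentBySeparator(String id) {
        return id.contains(ATTACHMENT_ID_CHARACTER);
    }

    public static void main(String[] args) {
        // Assumed sample IDs: Cloud-style, server-style, and a plain page ID.
        String[] ids = {"att12345-67890", "12345-67890", "67890"};
        for (String id : ids) {
            System.out.println(id
                + " prefix=" + isAttachmentByPrefix(id)
                + " separator=" + isAttachmentBySeparator(id));
        }
    }
}
```

With these sample IDs, the prefix check accepts only the Cloud-style ID, while the separator check accepts both the Cloud-style and the server-style ID and rejects the plain page ID. The separator check relies on the report's observation that IDs of documents without a file attachment never contain a "-" character.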