[ https://issues.apache.org/jira/browse/CONNECTORS-1009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14098837#comment-14098837 ]
Karl Wright commented on CONNECTORS-1009:
-----------------------------------------

Hi Prasad,

Please read this entry:

https://chemistry.apache.org/java/0.9.0/maven/apidocs/org/apache/chemistry/opencmis/client/api/Session.html#query%28java.lang.String,%20boolean%29

Note that we call the session.query() method as follows:

{code}
ItemIterable<QueryResult> results = session.query(cmisQuery, false).getPage(1000000000);
{code}

Note the "false" second argument, which if I read this right *should* cause the seed query to return only the latest versions. So, in theory, if you remove the document = document.getObjectOfLatestVersion() invocation, the connector should work.

Please also note that you can get the full-crawl equivalent of continuous crawling by just setting up a set of schedule windows and making sure you turn off the requirement that crawls only ever begin at the start of a window.

> Cmis Repository Connector does not handle Document updating properly
> --------------------------------------------------------------------
>
>                 Key: CONNECTORS-1009
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1009
>             Project: ManifoldCF
>          Issue Type: Bug
>          Components: CMIS connector
>    Affects Versions: ManifoldCF 1.7
>            Reporter: Prasad Perera
>            Priority: Minor
>             Fix For: ManifoldCF 1.7
>
>         Attachments: std_logs.txt, std_prints.diff
>
>
> As a part of the fix for CONNECTORS-1004, it seems CmisRepositoryConnector does not handle document updating properly.
> Case Scenario:
> * Create a continuous crawling job using CmisRepositoryConnector.
> * Update a document on the repository end.
> * The document keeps being submitted to the OutputConnector at each crawling interval even though it was not updated afterwards.
> One possible fix is needed at CmisRepositoryConnector:processDocument,
> activities.ingestDocumentWithException(nodeId, version, documentURI, rd);
> The documentURI should point to the old document URI (now it points to the latest documentURI discovered, which may confuse document references?)
> Also, in ECM systems, for example Alfresco, the document IDs are formulated with the version number as well.
> Ex: workspace://SpacesStore/8e12a887-3fa8-48d6-8516-5bcfad358ba2;1.0 --> version 1.0
>     workspace://SpacesStore/8e12a887-3fa8-48d6-8516-5bcfad358ba2;1.1 --> version 1.1
> When we set up a query to crawl a repository folder, we discover content by referring to the child nodes. Because of that, it now seems to queue all the document versions and submit them to the OutputConnector, thus producing duplicate documents at the output (search) side.
> Is there a way to avoid this problem? It would be great if the repository could just take the latest document version and submit it as an update.
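
For illustration only, the duplicate-version symptom described in the issue can be sketched outside the connector. The helper below is hypothetical (LatestVersionFilter and its methods are not part of ManifoldCF or OpenCMIS). It assumes object IDs end in ";<version label>" as in the Alfresco examples above, and that labels are dotted numbers such as 1.0 or 1.10. It keeps one ID per document, the one with the highest label. This is not how the connector resolves versions; it is just a way to see the grouping that "latest version only" implies.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration; not ManifoldCF or OpenCMIS code.
class LatestVersionFilter {

    // Part of the ID before the last ';', or the whole ID if there is no ';'.
    static String baseId(String objectId) {
        int sep = objectId.lastIndexOf(';');
        return sep < 0 ? objectId : objectId.substring(0, sep);
    }

    // Version label after the last ';', or "" if the ID carries no label.
    static String versionLabel(String objectId) {
        int sep = objectId.lastIndexOf(';');
        return sep < 0 ? "" : objectId.substring(sep + 1);
    }

    // Compares dotted numeric labels component by component, so "1.10" > "1.9".
    // Assumption: labels are purely numeric; anything else throws NumberFormatException.
    static int compareVersions(String a, String b) {
        String[] pa = a.isEmpty() ? new String[0] : a.split("\\.");
        String[] pb = b.isEmpty() ? new String[0] : b.split("\\.");
        int n = Math.max(pa.length, pb.length);
        for (int i = 0; i < n; i++) {
            int x = i < pa.length ? Integer.parseInt(pa[i]) : 0;
            int y = i < pb.length ? Integer.parseInt(pb[i]) : 0;
            if (x != y) {
                return Integer.compare(x, y);
            }
        }
        return 0;
    }

    // Keeps one ID per base ID (the highest version), in first-seen order of the base IDs.
    static List<String> latestOnly(List<String> objectIds) {
        Map<String, String> latest = new LinkedHashMap<>();
        for (String id : objectIds) {
            String key = baseId(id);
            String current = latest.get(key);
            if (current == null
                    || compareVersions(versionLabel(id), versionLabel(current)) > 0) {
                latest.put(key, id);
            }
        }
        return new ArrayList<>(latest.values());
    }
}
```

Passing the two example IDs from the issue to LatestVersionFilter.latestOnly(...) leaves only the ";1.1" entry. In the connector itself, the session.query(cmisQuery, false) call is meant to do this filtering on the repository side, so a client-side pass like this would only be a fallback.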