[ 
https://issues.apache.org/jira/browse/CONNECTORS-1009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14098837#comment-14098837
 ] 

Karl Wright commented on CONNECTORS-1009:
-----------------------------------------

Hi Prasad,

Please read this entry: 
https://chemistry.apache.org/java/0.9.0/maven/apidocs/org/apache/chemistry/opencmis/client/api/Session.html#query%28java.lang.String,%20boolean%29

Note that we call the session.query() method as follows:

{code}
      ItemIterable<QueryResult> results = session.query(cmisQuery, false).getPage(1000000000);
{code}

Note the "false" second argument (searchAllVersions), which, if I read this 
right, *should* cause the seed query to return only the latest versions.  So, 
in theory, if you remove the document = document.getObjectOfLatestVersion() 
invocation, the connector should work.

Please also note that you can perform the full-crawl equivalent of continuous 
crawling by just setting up a set of schedule windows, and making sure you 
turn off the requirement that crawls only ever begin at the start of a window.



> Cmis Repository Connector does not handle Document updating properly
> --------------------------------------------------------------------
>
>                 Key: CONNECTORS-1009
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1009
>             Project: ManifoldCF
>          Issue Type: Bug
>          Components: CMIS connector
>    Affects Versions: ManifoldCF 1.7
>            Reporter: Prasad Perera
>            Priority: Minor
>             Fix For: ManifoldCF 1.7
>
>         Attachments: std_logs.txt, std_prints.diff
>
>
> As a part of the fix for CONNECTORS-1004, it seems CmisRepositoryConnector 
> does not handle document updating properly.
> Case scenario:
> * Create a continuous crawling job using CmisRepositoryConnector.
> * Update a document on the repository end.
> * The document keeps being submitted to the OutputConnector at each crawling 
> interval, though it was not updated afterwards.
> One possible fix needed is at CmisRepositoryConnector:processDocument, in:
>  activities.ingestDocumentWithException(nodeId, version, documentURI, rd);
> The documentURI should point to the old document URI (now it points to the 
> latest documentURI discovered, which may confuse document references?)
> Also, in ECM systems, for example in Alfresco, the document IDs include the 
> version number as well.
> Ex: workspace://SpacesStore/8e12a887-3fa8-48d6-8516-5bcfad358ba2;1.0 --> 
> version 1.0
> workspace://SpacesStore/8e12a887-3fa8-48d6-8516-5bcfad358ba2;1.1 --> version 
> 1.1
> When we set up a query to crawl a repository folder, we discover content by 
> referring to the child nodes. Because of that, it now seems to queue all the 
> document versions and submit them to the OutputConnector, thus producing 
> duplicate documents at the output (search) side.
> Is there a way to avoid this problem? It would be great if the repository 
> could just take the latest document version and submit it as an update.
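
To illustrate the versioned-ID problem described in the issue above, here is a 
minimal, self-contained sketch. It is not the connector's actual code: the 
class LatestVersionFilter and its methods are hypothetical, and it assumes 
that IDs carry a ";<version>" suffix as in the Alfresco examples and that 
version labels are dotted integers such as 1.0 or 1.10.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, for illustration only; not part of ManifoldCF or OpenCMIS.
public class LatestVersionFilter {

    // Part of the ID before the last ';' (the whole ID if there is no version suffix).
    static String baseId(String objectId) {
        int sep = objectId.lastIndexOf(';');
        return sep < 0 ? objectId : objectId.substring(0, sep);
    }

    // Part of the ID after the last ';', or "" if there is no version suffix.
    static String versionLabel(String objectId) {
        int sep = objectId.lastIndexOf(';');
        return sep < 0 ? "" : objectId.substring(sep + 1);
    }

    // Compares dotted integer labels component by component, so "1.10" > "1.9".
    // Assumption: labels contain only digits and dots.
    static int compareVersions(String a, String b) {
        String[] pa = a.isEmpty() ? new String[0] : a.split("\\.");
        String[] pb = b.isEmpty() ? new String[0] : b.split("\\.");
        int n = Math.max(pa.length, pb.length);
        for (int i = 0; i < n; i++) {
            int va = i < pa.length ? Integer.parseInt(pa[i]) : 0;
            int vb = i < pb.length ? Integer.parseInt(pb[i]) : 0;
            if (va != vb) {
                return Integer.compare(va, vb);
            }
        }
        return 0;
    }

    // Keeps one ID per base ID: the one with the highest version label.
    static List<String> latestOnly(List<String> objectIds) {
        Map<String, String> latest = new LinkedHashMap<>();
        for (String id : objectIds) {
            String key = baseId(id);
            String current = latest.get(key);
            if (current == null
                    || compareVersions(versionLabel(id), versionLabel(current)) > 0) {
                latest.put(key, id);
            }
        }
        return new ArrayList<>(latest.values());
    }

    public static void main(String[] args) {
        List<String> ids = Arrays.asList(
            "workspace://SpacesStore/8e12a887-3fa8-48d6-8516-5bcfad358ba2;1.0",
            "workspace://SpacesStore/8e12a887-3fa8-48d6-8516-5bcfad358ba2;1.1");
        // Prints a list containing only the ";1.1" ID.
        System.out.println(latestOnly(ids));
    }
}
```

Keying on the base ID rather than the full versioned ID is what would let a 
new version be treated as an update of the same document instead of a new 
one. Whether the connector should do this itself, or rely on the query 
returning only latest versions, is a separate design question.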



--
This message was sent by Atlassian JIRA
(v6.2#6252)
