[ https://issues.apache.org/jira/browse/CONNECTORS-1153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Karl Wright resolved CONNECTORS-1153.
-------------------------------------
    Resolution: Fixed

> Documents crawled using manifoldcf 1.6 or earlier are needlessly recrawled
> after upgrade to 1.7 or later
> --------------------------------------------------------------------------
>
>                 Key: CONNECTORS-1153
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1153
>             Project: ManifoldCF
>          Issue Type: Bug
>    Affects Versions: ManifoldCF 1.7, ManifoldCF 1.8
>            Reporter: Aeham Abushwashi
>            Assignee: Karl Wright
>             Fix For: ManifoldCF 1.8.1, ManifoldCF 2.0.1, ManifoldCF 1.9,
>                      ManifoldCF 2.1
>
>
> After upgrading to mcf 1.7 or later, pre-existing documents are recrawled and
> re-indexed even if they have not changed in any way since their last
> pre-upgrade crawl. The impact can be significant for large ManifoldCF
> deployments with millions of static documents.
> There appear to be three contributing factors:
> 1. The empty transformation version of a legacy document is different from
> the initial value of "0+0!" - in PipelineObjectWithVersions#buildAddPipeline
> and IncrementalIngester#checkFetchDocument
> 2. Incorrect comparison of output versions in
> PipelineObjectWithVersions#buildAddPipeline, where oldOutputVersion is
> compared to a VersionContext object instead of the version string, which can
> be obtained by calling VersionContext#getVersionString - if
> IPipelineSpecification#getStageDescriptionString continues to return a
> VersionContext object, a rename of the method could be useful
> 3. In PipelineObjectWithVersions#buildAddPipeline, a null value for
> newAuthorityNameString is not treated the same as an empty string (like it is
> in other methods)
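
The three contributing factors in the report can be illustrated with a small, self-contained Java sketch. This is not the ManifoldCF source or the committed fix: SketchVersionContext and VersionCompareSketch are hypothetical stand-ins, and the only details taken from the report are the "0+0!" initial value, the getVersionString accessor, and the null-versus-empty authority name. The sketch also assumes that a legacy document's empty transformation version is stored as null or as an empty string, which the report does not state.

```java
import java.util.Objects;

// Hypothetical stand-in for ManifoldCF's VersionContext. Only the accessor
// named in the report (getVersionString) is modeled.
final class SketchVersionContext {
    private final String versionString;

    SketchVersionContext(String versionString) {
        this.versionString = versionString;
    }

    String getVersionString() {
        return versionString;
    }
}

// Hypothetical helper class; none of these methods exist in ManifoldCF.
final class VersionCompareSketch {
    // Initial transformation version value quoted in the report.
    static final String INITIAL_TRANSFORMATION_VERSION = "0+0!";

    // Factor 1: map a legacy "empty" transformation version (assumed here to
    // be null or "") onto the initial value so the two compare as equal.
    static String normalizeTransformationVersion(String version) {
        return (version == null || version.isEmpty())
                ? INITIAL_TRANSFORMATION_VERSION
                : version;
    }

    // Factor 2, broken form: String.equals(Object) is false for any
    // non-String argument, so the document always looks changed.
    static boolean buggyOutputUnchanged(String oldOutputVersion, SketchVersionContext ctx) {
        return oldOutputVersion.equals(ctx);
    }

    // Factor 2, corrected form: compare against the version string itself.
    static boolean fixedOutputUnchanged(String oldOutputVersion, SketchVersionContext ctx) {
        return Objects.equals(oldOutputVersion, ctx.getVersionString());
    }

    // Factor 3: treat a null authority name the same as an empty string.
    static boolean authorityUnchanged(String oldAuthority, String newAuthority) {
        String oldValue = (oldAuthority == null) ? "" : oldAuthority;
        String newValue = (newAuthority == null) ? "" : newAuthority;
        return oldValue.equals(newValue);
    }

    public static void main(String[] args) {
        SketchVersionContext ctx = new SketchVersionContext("v1");
        System.out.println(buggyOutputUnchanged("v1", ctx));
        System.out.println(fixedOutputUnchanged("v1", ctx));
        System.out.println(authorityUnchanged("", null));
        System.out.println(normalizeTransformationVersion(""));
    }
}
```

In this sketch, any one of the three mismatches is enough to make an unchanged document look modified, which is consistent with the needless recrawl described in the report.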