[ 
https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16736881#comment-16736881
 ] 

Karl Wright edited comment on CONNECTORS-1562 at 1/8/19 11:39 AM:
------------------------------------------------------------------

Good to know that you got beyond the crawling issue.
If you run any MCF job to completion, all no-longer-present documents should be 
removed from the index.  That applies to web jobs too.  So I expect that to 
work as per design.

If you remove a document from the site map and want MCF to notice that the 
document is now unreachable and should be removed, set a large hopcount 
maximum and also select "delete unreachable documents".  The one caution with 
this approach is that links BETWEEN documents will also be traversed, so if 
you want the sitemap to be a whitelist, set hopcount max = 2.
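
To illustrate the hopcount caution above, here is a toy Python model of 
hop-limited reachability. It is a sketch, not ManifoldCF code: the site 
layout, the page names, and the functions hop_counts and reachable are all 
hypothetical, and the seed is assumed to sit at hop 0. ManifoldCF may count 
hops differently, so the limit that makes the sitemap a whitelist in this 
model (1) need not match the value to configure in MCF.

```python
from collections import deque


def hop_counts(links, seed):
    """Breadth-first search: minimum number of link hops from seed to each page."""
    hops = {seed: 0}  # assumption of this toy model: the seed is hop 0
    queue = deque([seed])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in hops:
                hops[target] = hops[page] + 1
                queue.append(target)
    return hops


def reachable(links, seed, max_hops):
    """Pages whose hop count from the seed is at most max_hops."""
    return {p for p, h in hop_counts(links, seed).items() if h <= max_hops}


# Hypothetical site: the sitemap lists /a and /b, and /a also links to /c.
links = {
    "sitemap.xml": ["/a", "/b"],
    "/a": ["/c"],
}

# A large limit follows links BETWEEN documents, so /c is crawled too.
large = reachable(links, "sitemap.xml", max_hops=100)

# A tight limit keeps only what the sitemap lists directly.
tight = reachable(links, "sitemap.xml", max_hops=1)

# Remove /b from the sitemap and recompute: /b is now unreachable and is
# the document a "delete unreachable documents" pass would have to drop.
links["sitemap.xml"] = ["/a"]
after = reachable(links, "sitemap.xml", max_hops=100)
unreachable = large - after
```

In this model the large limit still reaches /c through /a, which is why the 
sitemap only behaves as a whitelist when the limit is tight.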



was (Author: kwri...@metacarta.com):
Good to know that you got beyond the crawling issue.
If you run any MCF job to completion, all no-longer-present documents should be 
removed from the index.  That applies to web jobs too.  So I expect that to 
work as per design.

If you remove a document from the site map, and you want MCF to pick up that 
the document is now unreachable and should be removed, you can do this by 
setting a hopcount maximum that is large but also selecting "delete unreachable 
documents".  The only thing I'd caution you about if you use this approach is 
that links BETWEEN documents will also be traversed, so if you want the sitemap 
to be a whitelist then you want hopcount max = 1.


> Documents unreachable due to hopcount are not considered unreachable on 
> cleanup pass
> ------------------------------------------------------------------------------------
>
>                 Key: CONNECTORS-1562
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1562
>             Project: ManifoldCF
>          Issue Type: Bug
>          Components: Elastic Search connector, Web connector
>    Affects Versions: ManifoldCF 2.11
>         Environment: ManifoldCF 2.11
> Elasticsearch 6.3.2
> Web input connector
> Elasticsearch output connector
> Job crawls website input and outputs content to Elasticsearch
>            Reporter: Tim Steenbeke
>            Assignee: Karl Wright
>            Priority: Critical
>              Labels: starter
>             Fix For: ManifoldCF 2.12
>
>         Attachments: Screenshot from 2018-12-31 11-17-29.png, 
> manifoldcf.log.cleanup, manifoldcf.log.init, manifoldcf.log.reduced
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> My documents aren't removed from the Elasticsearch index after rerunning the 
> job with changed seeds.
> I update my job to change the seedmap and rerun it, or use the scheduler to 
> keep it running even after updating it.
> After the rerun, the unreachable documents don't get deleted.
> It only adds documents when they can be reached.



