[ 
https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16723782#comment-16723782
 ] 

Tim Steenbeke edited comment on CONNECTORS-1562 at 12/18/18 7:48 AM:
---------------------------------------------------------------------

[~kwri...@metacarta.com]
If we update to ManifoldCF 2.12, can we then use the seedmap as we originally 
intended?

So we create a job with X seeds, ES output, web input, and HopCount 0 for links 
and redirects:
 # Put X seeds in the seedmap
 # Run the job
 # X documents get pushed to ES
 # Update the job to have X minus 20 seeds
 # Wait until the scheduled time
 # Run the job
 # 20 documents get deleted from ES
 # X minus 20 documents get updated
 # Wait until the scheduled time
 # ...

Will it work like this?
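
To make the expectation concrete, here is a minimal Python sketch of the outcome the steps above describe. It is only set arithmetic over the two seed lists, not ManifoldCF code. The function name `expected_changes` and the example.com URLs are made up for illustration, and it assumes that with a hop count of 0 each seed corresponds to exactly one indexed document.

```python
def expected_changes(previous_seeds, current_seeds):
    """Return (to_delete, to_update, to_add) for a rerun of the job.

    Hypothetical helper: models the behaviour we hope for, not what
    ManifoldCF actually does internally.
    """
    previous = set(previous_seeds)
    current = set(current_seeds)
    to_delete = previous - current   # seeds removed from the seedmap
    to_update = previous & current   # seeds kept, so re-crawled and updated
    to_add = current - previous      # seeds newly added to the seedmap
    return to_delete, to_update, to_add


# Illustrative numbers only: X = 100 seeds, then 20 of them are removed.
first_run = [f"https://example.com/page/{i}" for i in range(100)]
second_run = first_run[20:]

to_delete, to_update, to_add = expected_changes(first_run, second_run)
print(len(to_delete), len(to_update), len(to_add))  # prints: 20 80 0
```

In other words, after the second run we would expect the ES index to hold exactly the documents for the seeds still in the seedmap.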



> Documents unreachable due to hopcount are not considered unreachable on 
> cleanup pass
> ------------------------------------------------------------------------------------
>
>                 Key: CONNECTORS-1562
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1562
>             Project: ManifoldCF
>          Issue Type: Bug
>          Components: Elastic Search connector, Web connector
>    Affects Versions: ManifoldCF 2.11
>         Environment: ManifoldCF 2.11
> Elasticsearch 6.3.2
> Web input connector
> Elasticsearch output connector
> Job crawls website input and outputs content to Elasticsearch
>            Reporter: Tim Steenbeke
>            Assignee: Karl Wright
>            Priority: Critical
>              Labels: starter
>             Fix For: ManifoldCF 2.12
>
>         Attachments: manifoldcf.log.cleanup, manifoldcf.log.init, 
> manifoldcf.log.reduced
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> My documents aren't removed from the Elasticsearch index after rerunning the 
> job with changed seeds.
> I update my job to change the seedmap and rerun it, or use the scheduler to 
> keep it running even after updating it.
> After the rerun, the unreachable documents don't get deleted.
> It only adds documents when they can be reached.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
