[ 
https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16724016#comment-16724016
 ] 

Tim Steenbeke edited comment on CONNECTORS-1562 at 12/18/18 12:47 PM:
----------------------------------------------------------------------

There is no regex, and it is not possible to write one for this. That is the 
issue with creating the exclude list/blacklist. A regex is already being used 
for images, documents and other files that don't have to be crawled.

'Started acting strange' means it stopped working and crashed: because of the 
number of URLs, it stopped indexing. No messages or errors were given; the job 
just stopped working.
----
This is not the question. Please answer my question.
Is this the way we have to run the job?
{panel}
_*Tim Steenbeke added a comment*_

We create a job with 1 seed, which is the full seed-map (29000+ URLs), ES 
output, web input, and hop count 1 for links and 0 for redirects:
 # run job
 # +-29000 documents get pushed to ES
 # sitemap gets updated (e.g.: 29000 URLs become 28990 URLs)
 # wait till scheduled time
 # run job
 # documents get added/deleted (e.g.: 10 documents deleted)
 # wait till scheduled time
 # ...{panel}
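
To make the expected result of steps 3 to 6 concrete, here is a minimal sketch in Python. It is not ManifoldCF code: the function name `plan_sync` and the example URLs are hypothetical, and it only assumes that each run sees the complete current URL list from the sitemap. It shows which documents I expect to be added to and deleted from ES between two runs:

```python
def plan_sync(previous_urls, current_urls):
    """Return (to_add, to_delete) between two runs over the sitemap.

    Hypothetical helper for illustration only; not a ManifoldCF API.
    """
    previous, current = set(previous_urls), set(current_urls)
    to_add = current - previous      # new in the sitemap: should be indexed
    to_delete = previous - current   # gone from the sitemap: should be removed
    return sorted(to_add), sorted(to_delete)


# Hypothetical example data: one URL dropped from the sitemap, one added.
first_run = ["https://example.com/a", "https://example.com/b", "https://example.com/c"]
second_run = ["https://example.com/a", "https://example.com/c", "https://example.com/d"]

to_add, to_delete = plan_sync(first_run, second_run)
print("add:", to_add)
print("delete:", to_delete)
```

In the scenario above, the second run would have an empty add list and 10 URLs in the delete list.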
 


was (Author: steenti):
There is no regex, and it is not possible to write one for this. That is the 
issue with creating the exclude list/blacklist.


'Started acting strange' means it stopped working and crashed.
----
This is not the question. Please answer my question.
Is this the way we have to run the job?
{panel}
_*Tim Steenbeke added a comment*_

We create a job with 1 seed, which is the full seed-map (29000+ URLs), ES 
output, web input, and hop count 1 for links and 0 for redirects:
 # run job
 # +-29000 documents get pushed to ES
 # sitemap gets updated (e.g.: 29000 URLs become 28990 URLs)
 # wait till scheduled time
 # run job
 # documents get added/deleted (e.g.: 10 documents deleted)
 # wait till scheduled time
 # ...{panel}
 

> Documents unreachable due to hopcount are not considered unreachable on 
> cleanup pass
> ------------------------------------------------------------------------------------
>
>                 Key: CONNECTORS-1562
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1562
>             Project: ManifoldCF
>          Issue Type: Bug
>          Components: Elastic Search connector, Web connector
>    Affects Versions: ManifoldCF 2.11
>         Environment: ManifoldCF 2.11
> Elasticsearch 6.3.2
> Web input connector
> Elastic output connector
> Job crawls website input and outputs content to elastic
>            Reporter: Tim Steenbeke
>            Assignee: Karl Wright
>            Priority: Critical
>              Labels: starter
>             Fix For: ManifoldCF 2.12
>
>         Attachments: manifoldcf.log.cleanup, manifoldcf.log.init, 
> manifoldcf.log.reduced
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> My documents aren't removed from the Elasticsearch index after rerunning 
> with the changed seeds.
> I update my job to change the seed map and rerun it, or use the scheduler to 
> keep it running even after updating it.
> After the rerun, the unreachable documents don't get deleted.
> It only adds documents when they can be reached.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
