[ https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16723981#comment-16723981 ]
Tim Steenbeke edited comment on CONNECTORS-1562 at 12/18/18 11:52 AM:
----------------------------------------------------------------------

[~kwri...@metacarta.com] - So then with the seed map URL:
we create a job with 1 seed, which is the full seed map (29000+ URLs), ES output, web input, and hop count 1 for links and 0 for redirects:
# run job
# approximately 29000 documents get pushed to ES
# sitemap gets updated (e.g.: 29000 URLs become 28990 URLs)
# wait till scheduled time
# run job
# documents get added/deleted (e.g.: 10 documents deleted)
# wait till scheduled time
# ...

Last time we tried this, ManifoldCF started acting strangely because of the number of URLs/links located in the sitemap URL.
(sitemap URL: [https://www.uantwerpen.be/admin/system/sitemap/sitemap.aspx?lang=en&html=true])
(blacklist URLs: [https://www.uantwerpen.be/admin/system/sitemap/sitemap_revokes.aspx?lang=en&html=true])

was (Author: steenti):
[~kwri...@metacarta.com] - So then with the seed map URL:
we create a job with 1 seed, which is the full seed map (29000+ URLs), ES output, web input, and hop count 1 for links and 0 for redirects:
# run job
# approximately 29000 documents get pushed to ES
# sitemap gets updated (e.g.: 29000 URLs become 28990 URLs)
# wait till scheduled time
# run job
# documents get added/deleted (e.g.: 10 documents deleted)
# wait till scheduled time
# ...

Last time we tried this, ManifoldCF started acting strangely because of the number of URLs/links located in the sitemap URL.
(sitemap URL: [https://www.uantwerpen.be/admin/system/sitemap/sitemap.aspx?lang=en&html=true])

> Documents unreachable due to hopcount are not considered unreachable on cleanup pass
> ------------------------------------------------------------------------------------
>
>                 Key: CONNECTORS-1562
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1562
>             Project: ManifoldCF
>          Issue Type: Bug
>          Components: Elastic Search connector, Web connector
>    Affects Versions: ManifoldCF 2.11
>         Environment: ManifoldCF 2.11
>                      Elasticsearch 6.3.2
>                      Web input connector
>                      Elastic output connector
>                      Job crawls website input and outputs content to Elastic
>            Reporter: Tim Steenbeke
>            Assignee: Karl Wright
>            Priority: Critical
>              Labels: starter
>             Fix For: ManifoldCF 2.12
>
>         Attachments: manifoldcf.log.cleanup, manifoldcf.log.init, manifoldcf.log.reduced
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> My documents aren't removed from the Elasticsearch index after rerunning the changed seeds.
> I update my job to change the seed map and rerun it, or use the scheduler to keep it running even after updating it.
> After the rerun, the unreachable documents don't get deleted.
> It only adds documents when they can be reached.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)