[jira] [Comment Edited] (CONNECTORS-1562) Documents unreachable due to hopcount are not considered unreachable on cleanup pass
[ https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16736881#comment-16736881 ]

Karl Wright edited comment on CONNECTORS-1562 at 1/8/19 11:39 AM:
--
Good to know that you got beyond the crawling issue.

If you run any MCF job to completion, all no-longer-present documents should be removed from the index. That applies to web jobs too. So I expect that to work as per design.

If you remove a document from the site map, and you want MCF to pick up that the document is now unreachable and should be removed, you can do this by setting a hopcount maximum that is large but also selecting "delete unreachable documents". The only thing I'd caution you about if you use this approach is that links BETWEEN documents will also be traversed, so if you want the sitemap to be a whitelist then you want hopcount max = 2.

was (Author: kwri...@metacarta.com):
Good to know that you got beyond the crawling issue.

If you run any MCF job to completion, all no-longer-present documents should be removed from the index. That applies to web jobs too. So I expect that to work as per design.

If you remove a document from the site map, and you want MCF to pick up that the document is now unreachable and should be removed, you can do this by setting a hopcount maximum that is large but also selecting "delete unreachable documents". The only thing I'd caution you about if you use this approach is that links BETWEEN documents will also be traversed, so if you want the sitemap to be a whitelist then you want hopcount max = 1.
> Documents unreachable due to hopcount are not considered unreachable on
> cleanup pass
>
> Key: CONNECTORS-1562
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1562
> Project: ManifoldCF
> Issue Type: Bug
> Components: Elastic Search connector, Web connector
> Affects Versions: ManifoldCF 2.11
> Environment: ManifoldCF 2.11
> Elasticsearch 6.3.2
> Web input connector
> Elastic output connector
> Job crawls website input and outputs content to Elastic
> Reporter: Tim Steenbeke
> Assignee: Karl Wright
> Priority: Critical
> Labels: starter
> Fix For: ManifoldCF 2.12
>
> Attachments: Screenshot from 2018-12-31 11-17-29.png, manifoldcf.log.cleanup, manifoldcf.log.init, manifoldcf.log.reduced
>
> Original Estimate: 4h
> Remaining Estimate: 4h
>
> My documents aren't removed from the Elasticsearch index after rerunning the changed seeds.
> I update my job to change the seed map and rerun it, or use the scheduler to keep it running even after updating it.
> After the rerun the unreachable documents don't get deleted.
> It only adds documents when they can be reached.

-- This message was sent by Atlassian JIRA (v7.6.3#76005)
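Karl's whitelist reasoning can be illustrated with a toy hop-counted crawl. This is only a sketch of the general idea, not ManifoldCF's actual implementation; the `reachable` helper and the link graph are made up:

```python
from collections import deque

def reachable(links, seeds, max_hops):
    """Breadth-first walk of a link graph, discarding anything more
    than max_hops hops away from a seed (hopcount-style pruning)."""
    hops = {s: 0 for s in seeds}
    queue = deque(seeds)
    while queue:
        doc = queue.popleft()
        if hops[doc] >= max_hops:
            continue  # don't follow links out of documents at the limit
        for target in links.get(doc, []):
            if target not in hops:
                hops[target] = hops[doc] + 1
                queue.append(target)
    return set(hops)

# Toy graph: the sitemap seed lists two pages; page /a links on to /b and /c.
links = {
    "sitemap": ["/a", "/b"],
    "/a": ["/b", "/c"],
}
print(sorted(reachable(links, ["sitemap"], 1)))  # ['/a', '/b', 'sitemap']
print(sorted(reachable(links, ["sitemap"], 2)))  # ['/a', '/b', '/c', 'sitemap']
```

With a limit of 1 only the sitemap's own entries survive, so the sitemap acts as a whitelist; raising the limit lets `/c` in via a link BETWEEN documents, which is exactly the caution in the comment above.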
[jira] [Comment Edited] (CONNECTORS-1562) Documents unreachable due to hopcount are not considered unreachable on cleanup pass
[ https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16736881#comment-16736881 ]

Karl Wright edited comment on CONNECTORS-1562 at 1/8/19 8:38 AM:
-
Good to know that you got beyond the crawling issue.

If you run any MCF job to completion, all no-longer-present documents should be removed from the index. That applies to web jobs too. So I expect that to work as per design.

If you remove a document from the site map, and you want MCF to pick up that the document is now unreachable and should be removed, you can do this by setting a hopcount maximum that is large but also selecting "delete unreachable documents". The only thing I'd caution you about if you use this approach is that links BETWEEN documents will also be traversed, so if you want the sitemap to be a whitelist then you want hopcount max = 1.

was (Author: kwri...@metacarta.com):
Good to know that you got beyond the crawling issue.

If you run any MCF job to completion, all no-longer-present documents should be removed from the index. That applies to web jobs too. So I expect that to work as per design.
[jira] [Comment Edited] (CONNECTORS-1562) Documents unreachable due to hopcount are not considered unreachable on cleanup pass
[ https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16731290#comment-16731290 ]

Karl Wright edited comment on CONNECTORS-1562 at 12/31/18 12:06 PM:

{code}
Error: Repeated service interruptions - failure processing document: Stream Closed
{code}

This is not a crash; it just means that the job aborts. It also comes with numerous stack traces in the manifoldcf log, one for each time the document retries. That stack trace would be very helpful to have. Thanks!

was (Author: kwri...@metacarta.com):
{code}
Error: Repeated service interruptions - failure processing document: Stream Closed
{code}

This is not a crash; it just means that the job aborts. It also comes with numerous stack traces, one for each time the document retries. That stack trace would be very helpful to have. Thanks!
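The retry-then-abort behaviour Karl describes can be sketched roughly as follows. The `process_with_retries` helper is hypothetical, not ManifoldCF code; it only illustrates why one failing document produces several stack traces in the log before the job aborts:

```python
import logging
import traceback

def process_with_retries(fetch, url, max_retries=3):
    """Retry a document on transient I/O failures, logging the full
    stack trace on every attempt (these are the per-retry traces that
    end up in the log); after max_retries, abort with a 'repeated
    service interruptions' style error."""
    for attempt in range(1, max_retries + 1):
        try:
            return fetch(url)
        except IOError:
            logging.warning("attempt %d failed for %s\n%s",
                            attempt, url, traceback.format_exc())
    raise RuntimeError(
        "Repeated service interruptions - failure processing document: " + url)
```

In this sketch the final error message is all the job status shows, while the root cause ("Stream Closed") only appears in the logged stack traces, which is why those traces are the useful artifact to attach.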
[jira] [Comment Edited] (CONNECTORS-1562) Documents unreachable due to hopcount are not considered unreachable on cleanup pass
[ https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16724016#comment-16724016 ]

Tim Steenbeke edited comment on CONNECTORS-1562 at 12/18/18 12:47 PM:
--
There is no regex; there is no possibility to make a regex for this. That's the issue with creating the exclude/blacklist. There is already a regex being used for images, documents and other files that don't have to be crawled.

'Started acting strange': it stopped working and crashed because of the amount of URLs. It stopped indexing; no messages or errors were given, the job just stopped working.

That is not the question. Please answer my question: is this the way we have to run the job?

{panel}
_*Tim Steenbeke added a comment*_
We create a job with 1 seed, which is the full seed map (+29000 URLs), ES output, web input, and hop-count 1 for links and 0 for redirects:
# run job
# +-29000 documents get pushed to ES
# sitemap gets updated (e.g.: 29000 URLs become 28990 URLs)
# wait till scheduled time
# run job
# documents get added/deleted (e.g.: 10 documents deleted)
# wait till scheduled time
# ...
{panel}

was (Author: steenti):
There is no regex; there is no possibility to make a regex for this. That's the issue with creating the exclude/blacklist.

'Started acting strange': it stopped working and crashed.

That is not the question. Please answer my question: is this the way we have to run the job?

{panel}
_*Tim Steenbeke added a comment*_
We create a job with 1 seed, which is the full seed map (+29000 URLs), ES output, web input, and hop-count 1 for links and 0 for redirects:
# run job
# +-29000 documents get pushed to ES
# sitemap gets updated (e.g.: 29000 URLs become 28990 URLs)
# wait till scheduled time
# run job
# documents get added/deleted (e.g.: 10 documents deleted)
# wait till scheduled time
# ...
{panel}
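The scheduled-run cycle in the steps above amounts to diffing each crawl against the index. A minimal sketch, assuming a plain dict stands in for the index (the `sync_index` helper is made up for illustration, not the ES output connector's API):

```python
def sync_index(index, crawl_results):
    """Reconcile an index (dict of url -> doc) with the documents found
    by the latest crawl: upsert what was reached, delete the rest."""
    reached = set(crawl_results)
    deleted = [url for url in index if url not in reached]
    for url in deleted:
        del index[url]           # cleanup pass: unreachable docs go away
    index.update(crawl_results)  # add/update everything that was reached
    return deleted

index = {}
sync_index(index, {f"/page{i}": "v1" for i in range(29000)})  # run 1: 29000 docs pushed
# the sitemap shrinks by 10 before the next scheduled run
removed = sync_index(index, {f"/page{i}": "v2" for i in range(28990)})
print(len(removed), len(index))  # 10 28990
```

This is the behaviour the workflow above expects; the bug in this issue is that hopcount-unreachable documents were not being treated as "not reached" during that cleanup pass.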
[jira] [Comment Edited] (CONNECTORS-1562) Documents unreachable due to hopcount are not considered unreachable on cleanup pass
[ https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16723981#comment-16723981 ]

Tim Steenbeke edited comment on CONNECTORS-1562 at 12/18/18 11:52 AM:
--
[~kwri...@metacarta.com] - So then, with the seed map URL: we create a job with 1 seed, which is the full seed map (+29000 URLs), ES output, web input, and hop-count 1 for links and 0 for redirects:
# run job
# +-29000 documents get pushed to ES
# sitemap gets updated (e.g.: 29000 URLs become 28990 URLs)
# wait till scheduled time
# run job
# documents get added/deleted (e.g.: 10 documents deleted)
# wait till scheduled time
# ...

Last time we tried this, ManifoldCF started acting strange because of the amount of URLs/links located in the sitemap URL.
(sitemap url: [https://www.uantwerpen.be/admin/system/sitemap/sitemap.aspx?lang=en=true])
(blacklist url's: [https://www.uantwerpen.be/admin/system/sitemap/sitemap_revokes.aspx?lang=en=true])

was (Author: steenti):
[~kwri...@metacarta.com] - So then, with the seed map URL: we create a job with 1 seed, which is the full seed map (+29000 URLs), ES output, web input, and hop-count 1 for links and 0 for redirects:
# run job
# +-29000 documents get pushed to ES
# sitemap gets updated (e.g.: 29000 URLs become 28990 URLs)
# wait till scheduled time
# run job
# documents get added/deleted (e.g.: 10 documents deleted)
# wait till scheduled time
# ...

Last time we tried this, ManifoldCF started acting strange because of the amount of URLs/links located in the sitemap URL.
(sitemap url: [https://www.uantwerpen.be/admin/system/sitemap/sitemap.aspx?lang=en=true])
[jira] [Comment Edited] (CONNECTORS-1562) Documents unreachable due to hopcount are not considered unreachable on cleanup pass
[ https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16723782#comment-16723782 ]

Tim Steenbeke edited comment on CONNECTORS-1562 at 12/18/18 7:48 AM:
-
[~kwri...@metacarta.com] If we update to ManifoldCF 2.12, can we then use the seed map as originally intended by us? So we create a job with X seeds, ES output, web input, and hop-count 0 for links and redirects:
# put X seeds in the seed map
# run job
# X documents get pushed to ES
# update job to have X minus 20 seeds; wait till scheduled time
# run job
# 20 documents get deleted from ES
# X minus 20 documents get updated
# wait till scheduled time
# ...
Will it work like this?

was (Author: steenti):
If we update to ManifoldCF 2.12, can we then use the seed map as originally intended by us? So we create a job with X seeds, ES output, web input, and hop-count 0 for links and redirects:
# put X seeds in the seed map
# run job
# X documents get pushed to ES
# update job to have X minus 20 seeds; wait till scheduled time
# run job
# 20 documents get deleted from ES
# X minus 20 documents get updated
# wait till scheduled time
# ...
Will it work like this?
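If cleanup behaves as asked, the bookkeeping for the steps above is plain set arithmetic. A sketch with made-up numbers (X = 100; the names are illustrative only):

```python
# Run 1: X = 100 seeds in the seed map, all pushed to ES.
run1_seeds = {f"/doc{i}" for i in range(100)}
# Run 2: the job is updated to X minus 20 seeds.
run2_seeds = {f"/doc{i}" for i in range(80)}

to_delete = run1_seeds - run2_seeds  # the 20 dropped seeds should be deleted from ES
to_update = run1_seeds & run2_seeds  # the remaining X minus 20 should be updated
print(len(to_delete), len(to_update))  # 20 80
```

The original report describes the additions working but `to_delete` never being applied, which is the behaviour the CONNECTORS-1562 fix in ManifoldCF 2.12 is meant to correct.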