[jira] [Commented] (CONNECTORS-1562) Document removal Elastic
[ https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716411#comment-16716411 ]

Karl Wright commented on CONNECTORS-1562:

I tried this out using a small number of the specific seeds provided. I started with the following:

{code}
https://www.uantwerpen.be/en/
https://www.uantwerpen.be/en/about-uantwerp/
https://www.uantwerpen.be/en/about-uantwerp/about-uantwerp/
https://www.uantwerpen.be/en/about-uantwerp/catering-conventionhalls/
https://www.uantwerpen.be/en/about-uantwerp/catering-conventionhalls/convention-halls/
https://www.uantwerpen.be/en/about-uantwerp/catering-conventionhalls/convention-halls/hof-van-liere/
https://www.uantwerpen.be/en/about-uantwerp/catering-conventionhalls/university-club
https://www.uantwerpen.be/en/about-uantwerp/facts-figures
{code}

This generated seven ingestions. I then more-or-less randomly removed a few seeds, leaving this:

{code}
https://www.uantwerpen.be/en/
https://www.uantwerpen.be/en/about-uantwerp/
https://www.uantwerpen.be/en/about-uantwerp/about-uantwerp/
https://www.uantwerpen.be/en/about-uantwerp/catering-conventionhalls/university-club
https://www.uantwerpen.be/en/about-uantwerp/facts-figures
{code}

Rerunning produced zero deletions, and a refetch of all seven previously-ingested documents, with no new ingestions.

Finally, I removed all the seeds and ran it again. A deletion was logged for every indexed document.

My quick analysis of what is happening here is this:

- ManifoldCF keeps grave markers around for hopcount tracking. Hopcount tracking in MCF is extremely complex and much care is taken to avoid miscalculating the number of hops to a document, no matter what order documents are processed in. In order to make that work, documents cannot be deleted from the queue just because their hopcount is too large; instead, quite a number of documents are put in the queue and may or may not be fetched, depending on whether they wind up with a low enough hopcount.
- The document deletion phase removes unreachable documents, but documents that simply have too great a hopcount but otherwise are in the queue are not precisely unreachable.

In other words, the cleanup phase of a job seems to interact badly with documents that are reachable but just have too great a hopcount; these documents seem to be overlooked for cleanup, and will ONLY be cleaned up when they become truly unreachable.

This is not intended behavior. However, fixing it is also a behavior change in a very complex part of the software, and will therefore require great care to correct without breaking something. Because it is not something simple, you should expect me to require a couple of weeks elapsed time to come up with the right fix.

Furthermore, it is still true that this model is not one that I'd recommend for crawling a web site. The web connector is not designed to operate with hundreds of thousands of seeds; hundreds, maybe, or thousands on a bad day, but trying to control exactly what MCF indexes by fooling with the seed list is not what it was designed for.
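The interaction described above can be illustrated with a toy model. This is only a sketch under stated assumptions, not ManifoldCF's actual code: the link graph, the shortened page paths, and the function names `hop_counts`, `observed_cleanup`, and `intended_cleanup` are all hypothetical. It assumes every page links to every other page (as site navigation typically does) and a hop limit of 0.

```python
from collections import deque


def hop_counts(seeds, links):
    """Breadth-first search: minimum number of link hops from any seed."""
    hops = {url: 0 for url in seeds}
    queue = deque(seeds)
    while queue:
        url = queue.popleft()
        for target in links.get(url, ()):
            if target not in hops:
                hops[target] = hops[url] + 1
                queue.append(target)
    return hops


def observed_cleanup(indexed, seeds, links):
    """Behaviour described in the comment: only truly unreachable documents are deleted."""
    reachable = hop_counts(seeds, links)
    return {url for url in indexed if url not in reachable}


def intended_cleanup(indexed, seeds, links, max_hops):
    """Intended behaviour: documents beyond the hop limit are deleted as well."""
    hops = hop_counts(seeds, links)
    # Unreachable documents get a hop count just past the limit.
    return {url for url in indexed if hops.get(url, max_hops + 1) > max_hops}


# Hypothetical link graph: every page links to every other page.
pages = ["/en/", "/en/about/", "/en/about/halls/", "/en/about/club", "/en/about/facts"]
links = {p: [q for q in pages if q != p] for p in pages}

indexed = set(pages)                    # first run: every page was a seed
reduced_seeds = ["/en/", "/en/about/"]  # second run: three seeds removed

print(sorted(observed_cleanup(indexed, reduced_seeds, links)))
print(sorted(intended_cleanup(indexed, reduced_seeds, links, max_hops=0)))
print(observed_cleanup(indexed, [], links) == set(pages))
```

In this model the removed seeds remain reachable through links from the remaining seeds, so the "observed" cleanup deletes nothing until every seed is gone, which matches the three runs reported in the comment.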
> Document removal Elastic
>
> Key: CONNECTORS-1562
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1562
> Project: ManifoldCF
> Issue Type: Bug
> Components: Elastic Search connector, Web connector
> Affects Versions: ManifoldCF 2.11
> Environment: ManifoldCF 2.11
>   Elasticsearch 6.3.2
>   Web input connector
>   Elastic output connector
>   Job crawls website input and outputs content to Elastic
> Reporter: Tim Steenbeke
> Assignee: Karl Wright
> Priority: Critical
> Labels: starter
> Attachments: 30URLSeeds.png, 3URLSeed.png, Screenshot from 2018-12-10 14-07-46.png
> Original Estimate: 4h
> Remaining Estimate: 4h
>
> My documents aren't removed from the Elasticsearch index after rerunning the job with changed seeds.
> I update my job to change the seedmap and rerun it, or use the scheduler to keep it running even after updating it.
> After the rerun, the unreachable documents don't get deleted.
> It only adds documents when they can be reached.

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (CONNECTORS-1562) Documents unreachable due to hopcount are not considered unreachable on cleanup pass
[ https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Karl Wright updated CONNECTORS-1562:
Summary: Documents unreachable due to hopcount are not considered unreachable on cleanup pass (was: Document removal Elastic)
[jira] [Reopened] (CONNECTORS-1562) Document removal Elastic
[ https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Karl Wright reopened CONNECTORS-1562:
[jira] [Commented] (CONNECTORS-1562) Document removal Elastic
[ https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16714835#comment-16714835 ]

Karl Wright commented on CONNECTORS-1562:

Hi [~SteenTi], you are in essence making a seed list that is intended to be the entire list of all URLs that are crawled, and using hopcount filtering to try and make sure no links are taken. You are then removing individual seeds and expecting the individual URLs to be removed from the index.

This is a usage model that is not well tested (because of the hopcount involvement), so I can well believe it doesn't do exactly what you'd expect. We do not generally recommend this model because the seed list may well wind up being huge. If there's no way you can create an index page of some kind, then you might be stuck with it, but bear in mind that the Web Connector is not designed to support this model.

If this is the model you nevertheless intend to operate under, I will reopen the ticket and try to reproduce the problem, but it will not be looked at until next weekend at the earliest, as this is not my day job and this is not a supported model.
[jira] [Updated] (CONNECTORS-1562) Document removal Elastic
[ https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Steenbeke updated CONNECTORS-1562:
Attachment: (was: Screenshot from 2018-12-05 09-01-46.png)
[jira] [Updated] (CONNECTORS-1562) Document removal Elastic
[ https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Steenbeke updated CONNECTORS-1562:
Attachment: Screenshot from 2018-12-10 14-07-46.png
[jira] [Commented] (CONNECTORS-1562) Document removal Elastic
[ https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16714692#comment-16714692 ]

Tim Steenbeke commented on CONNECTORS-1562:

# I created a job with a Null output connector.
# I put 30 URLs in as seeds.
# I set the hop filter to 0 so no links or redirects will be checked.
# I ran the job. Check Simple History: all the documents get fetched and processed (if: RESPONSECODENOTINDEXABLE).
# I edited the job.
# I deleted all but 3 URLs; the seeds are now just 3 URLs.
# I ran the job. Check Simple History: all documents get fetched even though they aren't in the seeds anymore, no document gets deleted, and the job ends.

!30URLSeeds.png!
!3URLSeed.png!
!Screenshot from 2018-12-10 14-07-46.png!
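With a hop limit of 0 the job can only index the seeds themselves, so the documents one would expect to be deleted after editing the seed list are simply the seeds that were removed. A minimal sketch of that check follows, using the two seed lists quoted in Karl Wright's reproduction in this thread (the 30 URLs from this comment are not listed, so they are not used here). The helper name `expected_deletions` is hypothetical and nothing here calls ManifoldCF.

```python
BASE = "https://www.uantwerpen.be/en/"

# Seed list of the first run (from the reproduction in this thread).
first_run = {BASE + path for path in [
    "",
    "about-uantwerp/",
    "about-uantwerp/about-uantwerp/",
    "about-uantwerp/catering-conventionhalls/",
    "about-uantwerp/catering-conventionhalls/convention-halls/",
    "about-uantwerp/catering-conventionhalls/convention-halls/hof-van-liere/",
    "about-uantwerp/catering-conventionhalls/university-club",
    "about-uantwerp/facts-figures",
]}

# Seed list of the second run, after a few seeds were removed.
second_run = {BASE + path for path in [
    "",
    "about-uantwerp/",
    "about-uantwerp/about-uantwerp/",
    "about-uantwerp/catering-conventionhalls/university-club",
    "about-uantwerp/facts-figures",
]}


def expected_deletions(old_seeds, new_seeds):
    """With hop limit 0, every seed dropped from the list should be deleted."""
    return sorted(set(old_seeds) - set(new_seeds))


for url in expected_deletions(first_run, second_run):
    print(url)
```

Comparing this list against the deletion entries in Simple History shows at a glance which expected deletions did not happen.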
[jira] [Updated] (CONNECTORS-1562) Document removal Elastic
[ https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Steenbeke updated CONNECTORS-1562:
Attachment: 3URLSeed.png
[jira] [Updated] (CONNECTORS-1562) Document removal Elastic
[ https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Steenbeke updated CONNECTORS-1562:
Attachment: 30URLSeeds.png
[jira] [Comment Edited] (CONNECTORS-1562) Document removal Elastic
[ https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16714610#comment-16714610 ]

Tim Steenbeke edited comment on CONNECTORS-1562 at 12/10/18 11:49 AM:

Manifold doesn't delete documents it should delete. You quote the text where I say there were no deletions and then ask me if there were any? (On a side note: it did, however, delete just 3 documents and not 10, so it partially worked.)

was (Author: steenti):
Manifold doesn't delete documents it should delete. You quote the text where I say there were no deletions and then ask me if there were any? (On a side note: it did, however, delete just 3 documents and not 10.)
[jira] [Comment Edited] (CONNECTORS-1562) Document removal Elastic
[ https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16714595#comment-16714595 ]

Karl Wright edited comment on CONNECTORS-1562 at 12/10/18 11:58 AM:

[~SteenTi], good that the scheduler is working as expected.

{quote}
Next I edited the seeds and deleted some links and let the job run scheduled again. There were 0 Deletions and the Simple History also showed 0 deletion messages.
{quote}

The scheduler doesn't have any impact on the way a job runs, unless you tell it to do a "minimal" run rather than a "complete" one. There's a pulldown for every schedule record you create that lets you decide which it's going to be. What is selected for your schedule record?

Also, were you able to see deletions when you followed my steps above?

was (Author: kwri...@metacarta.com):
[~SteenTi], good that the scheduler is working as expected.

{quote}
Next I edited the seeds and deleted some links and let the job run scheduled again. There were 0 Deletions and the Simple History also showed 0 deletion messages.
{quote}

The scheduler doesn't have any impact on the way a job runs, unless you tell it to do a "minimal" run rather than a "complete" one. There's a pulldown for every schedule record you create that lets you decide which it's going to be. What is selected for your schedule record?

Also, were you able to see deletions when you follows my steps above?
[jira] [Issue Comment Deleted] (CONNECTORS-1562) Document removal Elastic
[ https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Steenbeke updated CONNECTORS-1562:
Comment: was deleted (was: Manifold doesn't delete documents it should delete. you quote the text where i say there were no deletions and than ask me if there were any ? ( on a site-note: It did however just deleted 3 documents and not 10 so it partially worked))
[jira] [Comment Edited] (CONNECTORS-1562) Document removal Elastic
[ https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16714610#comment-16714610 ]

Tim Steenbeke edited comment on CONNECTORS-1562 at 12/10/18 11:42 AM:

Manifold doesn't delete documents it should delete. You quote the text where I say there were no deletions and then ask me if there were any? (On a side note: it did, however, delete just 3 documents and not 10.)

was (Author: steenti):
Manifold doesn't delete documents it should delete. You quote the text where I say there were no deletions and then ask me if there were any? (On a side note: it did, however, delete just the documents that shouldn't have been indexed in the first place, documents that were added to ES but weren't in scope in the original run.)
[jira] [Commented] (CONNECTORS-1562) Document removal Elastic
[ https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16714610#comment-16714610 ]

Tim Steenbeke commented on CONNECTORS-1562:

Manifold doesn't delete documents it should delete. You quote the text where I say there were no deletions and then ask me if there were any? (On a side note: it did, however, delete just the documents that shouldn't have been indexed in the first place, documents that were added to ES but weren't in scope in the original run.)
[jira] [Commented] (CONNECTORS-1562) Document removal Elastic
[ https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16714595#comment-16714595 ]

Karl Wright commented on CONNECTORS-1562:

[~SteenTi], good that the scheduler is working as expected.

{quote}
Next I edited the seeds and deleted some links and let the job run scheduled again. There were 0 Deletions and the Simple History also showed 0 deletion messages.
{quote}

The scheduler doesn't have any impact on the way a job runs, unless you tell it to do a "minimal" run rather than a "complete" one. There's a pulldown for every schedule record you create that lets you decide which it's going to be. What is selected for your schedule record?

Also, were you able to see deletions when you followed my steps above?
[jira] [Comment Edited] (CONNECTORS-1562) Document removal Elastic
[ https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1671#comment-1671 ]

Tim Steenbeke edited comment on CONNECTORS-1562 at 12/10/18 8:55 AM:

Hi [~kwri...@metacarta.com],

So I set up a job as you explained above. The scheduler worked fine now, even with multiple values. I tested the same with the ES output connector and it also started up at the scheduled time. So it seems there was an issue in the import of the job schedule, which has been resolved now.

Next I edited the seeds and deleted some links and let the job run scheduled again. There were 0 deletions and the Simple History also showed 0 deletion messages. Also, in the Document Status for the jobs there were no deletions registered. (Also on the null output, but this is probably normal because it's Null.)

was (Author: steenti):
Hi [~kwri...@metacarta.com],

So I set up a job as you explained above. The scheduler worked fine now, even with multiple values. I tested the same with the ES output connector and it also started up at the scheduled time. So it seems there was an issue in the import of the job schedule, which has been resolved now.

Next I edited the seeds and deleted some links and let the job run scheduled again. There were 0 deletions and the Simple History also showed 0 deletion messages. (Also on the null output, but this is probably normal because it's Null.)
[jira] [Commented] (CONNECTORS-1562) Document removal Elastic
[ https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1671#comment-1671 ]

Tim Steenbeke commented on CONNECTORS-1562:

Hi [~kwri...@metacarta.com],

So I set up a job as you explained above. The scheduler worked fine now, even with multiple values. I tested the same with the ES output connector and it also started up at the scheduled time. So it seems there was an issue in the import of the job schedule, which has been resolved now.

Next I edited the seeds and deleted some links and let the job run scheduled again. There were 0 deletions and the Simple History also showed 0 deletion messages. (Also on the null output, but this is probably normal because it's Null.)