[jira] [Resolved] (CONNECTORS-1562) Documents unreachable due to hopcount are not considered unreachable on cleanup pass
[ https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karl Wright resolved CONNECTORS-1562. - Resolution: Fixed Fix Version/s: ManifoldCF 2.12 r1849001 | kwright | 2018-12-15 12:47:31 -0500 (Sat, 15 Dec 2018) | 1 line Final fix for CONNECTORS-1562. r1849000 | kwright | 2018-12-15 12:02:07 -0500 (Sat, 15 Dec 2018) | 1 line More debugging and refactoring r1848999 | kwright | 2018-12-15 09:29:23 -0500 (Sat, 15 Dec 2018) | 1 line Log all delete dependencies that we record, and do more refactoring r1848992 | kwright | 2018-12-15 07:56:23 -0500 (Sat, 15 Dec 2018) | 1 line More minor refactoring of HopCount module r1848991 | kwright | 2018-12-15 07:46:16 -0500 (Sat, 15 Dec 2018) | 1 line Minor refactoring to bring code off of the java 1.4 world r1848981 | kwright | 2018-12-15 03:23:57 -0500 (Sat, 15 Dec 2018) | 1 line Improve hopcount logging further, this time on the query side r1848911 | kwright | 2018-12-14 00:58:42 -0500 (Fri, 14 Dec 2018) | 1 line Improve hopcount logging and commenting > Documents unreachable due to hopcount are not considered unreachable on > cleanup pass > > > Key: CONNECTORS-1562 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1562 > Project: ManifoldCF > Issue Type: Bug > Components: Elastic Search connector, Web connector >Affects Versions: ManifoldCF 2.11 > Environment: Manifoldcf 2.11 > Elasticsearch 6.3.2 > Web inputconnector > elastic outputconnecotr > Job crawls website input and outputs content to elastic >Reporter: Tim Steenbeke >Assignee: Karl Wright >Priority: Critical > Labels: starter > Fix For: ManifoldCF 2.12 > > Attachments: manifoldcf.log.cleanup, manifoldcf.log.init, > manifoldcf.log.reduced > > Original Estimate: 4h > Remaining Estimate: 4h > > My documents aren't removed from ElasticSearch index after rerunning the > changed seeds > I update my job to change the seedmap and rerun it or use the schedualer to > keep it runneng even after updating it. > After the rerun the unreachable documents don't get deleted. > It only adds doucments when they can be reached. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (CONNECTORS-1562) Documents unreachable due to hopcount are not considered unreachable on cleanup pass
[ https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1671#comment-1671 ] Karl Wright commented on CONNECTORS-1562: - It's a bit more complicated than I originally thought. Once the job has been run, the thing is corrupted because there's an existing distance that never gets invalidated, and that can never be fixed so long as the job hangs around. So I will need to capture a run from scratch with database debugging on in order to see exactly what dependencies are getting recorded, and why the query meant to pick them up and invalidate them on subsequent runs is missing the records inserted during the seeding process. > Documents unreachable due to hopcount are not considered unreachable on > cleanup pass > > > Key: CONNECTORS-1562 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1562 > Project: ManifoldCF > Issue Type: Bug > Components: Elastic Search connector, Web connector >Affects Versions: ManifoldCF 2.11 > Environment: Manifoldcf 2.11 > Elasticsearch 6.3.2 > Web inputconnector > elastic outputconnecotr > Job crawls website input and outputs content to elastic >Reporter: Tim Steenbeke >Assignee: Karl Wright >Priority: Critical > Labels: starter > Attachments: manifoldcf.log.cleanup, manifoldcf.log.init, > manifoldcf.log.reduced > > Original Estimate: 4h > Remaining Estimate: 4h > > My documents aren't removed from ElasticSearch index after rerunning the > changed seeds > I update my job to change the seedmap and rerun it or use the schedualer to > keep it runneng even after updating it. > After the rerun the unreachable documents don't get deleted. > It only adds doucments when they can be reached. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (CONNECTORS-1562) Documents unreachable due to hopcount are not considered unreachable on cleanup pass
[ https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16722182#comment-16722182 ] Karl Wright commented on CONNECTORS-1562: - I think I determined what the problem is: no delete dependencies are being recorded for seeds. That means we never invalidate the initial hopcount answers, which explains why this problem seems confined to seeds and nothing else. A fix should be straightforward but I will also want to construct an integration test that exercises it before a final commit is done. > Documents unreachable due to hopcount are not considered unreachable on > cleanup pass > > > Key: CONNECTORS-1562 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1562 > Project: ManifoldCF > Issue Type: Bug > Components: Elastic Search connector, Web connector >Affects Versions: ManifoldCF 2.11 > Environment: Manifoldcf 2.11 > Elasticsearch 6.3.2 > Web inputconnector > elastic outputconnecotr > Job crawls website input and outputs content to elastic >Reporter: Tim Steenbeke >Assignee: Karl Wright >Priority: Critical > Labels: starter > Attachments: manifoldcf.log.cleanup, manifoldcf.log.init, > manifoldcf.log.reduced > > Original Estimate: 4h > Remaining Estimate: 4h > > My documents aren't removed from ElasticSearch index after rerunning the > changed seeds > I update my job to change the seedmap and rerun it or use the schedualer to > keep it runneng even after updating it. > After the rerun the unreachable documents don't get deleted. > It only adds doucments when they can be reached. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (CONNECTORS-1562) Documents unreachable due to hopcount are not considered unreachable on cleanup pass
[ https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16722124#comment-16722124 ] Karl Wright commented on CONNECTORS-1562: - Analysis: On the reduced pass, some documents had 'link' hopcount of 1, but they all had 'redirect' hopcount of 0. The hopcount computation queue, furthermore, was processed but found always empty, which means that no "invalid" markers were ever detected. This argues that the seeding phase did not operate as expected as far as marking hop count rows as invalid. > Documents unreachable due to hopcount are not considered unreachable on > cleanup pass > > > Key: CONNECTORS-1562 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1562 > Project: ManifoldCF > Issue Type: Bug > Components: Elastic Search connector, Web connector >Affects Versions: ManifoldCF 2.11 > Environment: Manifoldcf 2.11 > Elasticsearch 6.3.2 > Web inputconnector > elastic outputconnecotr > Job crawls website input and outputs content to elastic >Reporter: Tim Steenbeke >Assignee: Karl Wright >Priority: Critical > Labels: starter > Attachments: manifoldcf.log.cleanup, manifoldcf.log.init, > manifoldcf.log.reduced > > Original Estimate: 4h > Remaining Estimate: 4h > > My documents aren't removed from ElasticSearch index after rerunning the > changed seeds > I update my job to change the seedmap and rerun it or use the schedualer to > keep it runneng even after updating it. > After the rerun the unreachable documents don't get deleted. > It only adds doucments when they can be reached. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (CONNECTORS-1562) Documents unreachable due to hopcount are not considered unreachable on cleanup pass
[ https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karl Wright updated CONNECTORS-1562: Attachment: manifoldcf.log.reduced > Documents unreachable due to hopcount are not considered unreachable on > cleanup pass > > > Key: CONNECTORS-1562 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1562 > Project: ManifoldCF > Issue Type: Bug > Components: Elastic Search connector, Web connector >Affects Versions: ManifoldCF 2.11 > Environment: Manifoldcf 2.11 > Elasticsearch 6.3.2 > Web inputconnector > elastic outputconnecotr > Job crawls website input and outputs content to elastic >Reporter: Tim Steenbeke >Assignee: Karl Wright >Priority: Critical > Labels: starter > Attachments: manifoldcf.log.cleanup, manifoldcf.log.init, > manifoldcf.log.reduced > > Original Estimate: 4h > Remaining Estimate: 4h > > My documents aren't removed from ElasticSearch index after rerunning the > changed seeds > I update my job to change the seedmap and rerun it or use the schedualer to > keep it runneng even after updating it. > After the rerun the unreachable documents don't get deleted. > It only adds doucments when they can be reached. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (CONNECTORS-1562) Documents unreachable due to hopcount are not considered unreachable on cleanup pass
[ https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16722066#comment-16722066 ] Karl Wright commented on CONNECTORS-1562: - Tonight's researches were inconclusive because logging was not adequate for hopcount querying. I've added better logging and run the example again, recollecting the reduced phase log as before. This has been attached but not yet examined. > Documents unreachable due to hopcount are not considered unreachable on > cleanup pass > > > Key: CONNECTORS-1562 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1562 > Project: ManifoldCF > Issue Type: Bug > Components: Elastic Search connector, Web connector >Affects Versions: ManifoldCF 2.11 > Environment: Manifoldcf 2.11 > Elasticsearch 6.3.2 > Web inputconnector > elastic outputconnecotr > Job crawls website input and outputs content to elastic >Reporter: Tim Steenbeke >Assignee: Karl Wright >Priority: Critical > Labels: starter > Attachments: manifoldcf.log.cleanup, manifoldcf.log.init, > manifoldcf.log.reduced > > Original Estimate: 4h > Remaining Estimate: 4h > > My documents aren't removed from ElasticSearch index after rerunning the > changed seeds > I update my job to change the seedmap and rerun it or use the schedualer to > keep it runneng even after updating it. > After the rerun the unreachable documents don't get deleted. > It only adds doucments when they can be reached. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (CONNECTORS-1562) Documents unreachable due to hopcount are not considered unreachable on cleanup pass
[ https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karl Wright updated CONNECTORS-1562: Attachment: (was: manifoldcf.log.reduced) > Documents unreachable due to hopcount are not considered unreachable on > cleanup pass > > > Key: CONNECTORS-1562 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1562 > Project: ManifoldCF > Issue Type: Bug > Components: Elastic Search connector, Web connector >Affects Versions: ManifoldCF 2.11 > Environment: Manifoldcf 2.11 > Elasticsearch 6.3.2 > Web inputconnector > elastic outputconnecotr > Job crawls website input and outputs content to elastic >Reporter: Tim Steenbeke >Assignee: Karl Wright >Priority: Critical > Labels: starter > Attachments: manifoldcf.log.cleanup, manifoldcf.log.init > > Original Estimate: 4h > Remaining Estimate: 4h > > My documents aren't removed from ElasticSearch index after rerunning the > changed seeds > I update my job to change the seedmap and rerun it or use the schedualer to > keep it runneng even after updating it. > After the rerun the unreachable documents don't get deleted. > It only adds doucments when they can be reached. -- This message was sent by Atlassian JIRA (v7.6.3#76005)