[jira] [Commented] (CONNECTORS-501) Medium-scale web crawl with hopcount-based filtering fails to find correct number of documents
[ https://issues.apache.org/jira/browse/CONNECTORS-501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13432210#comment-13432210 ]

Karl Wright commented on CONNECTORS-501:

Another potentially more interesting trick would be to only recrawl those documents that have a hopcount that is on the edge, and whose hopcounts decline.

Medium-scale web crawl with hopcount-based filtering fails to find correct number of documents
--

Key: CONNECTORS-501
URL: https://issues.apache.org/jira/browse/CONNECTORS-501
Project: ManifoldCF
Issue Type: Bug
Components: Framework agents process, Web connector
Affects Versions: ManifoldCF 0.6
Reporter: Karl Wright
Assignee: Karl Wright
Fix For: ManifoldCF 0.7
Attachments: capture.txt

The new web crawler Postgresql load test, which uses hopcount-based filtering, does not discover all 0 documents it is supposed to. It only discovered 10603 when I ran it just now.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
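One possible reading of the recrawl idea in the comment above, as a minimal Python sketch rather than ManifoldCF code. It assumes "on the edge" means a document whose recorded hopcount equals the configured maximum, so its children were previously out of range. The function name, the dictionaries, and the sample values are all hypothetical.

```python
def documents_to_recrawl(old_hops, new_hops, max_hops):
    """Pick documents that sat exactly at the hopcount limit and now have
    a smaller hopcount, so their children may have come into range.

    old_hops / new_hops map document id -> hopcount (hypothetical data).
    """
    return sorted(
        doc
        for doc, old in old_hops.items()
        if old == max_hops and new_hops.get(doc, old) < old
    )


MAX_HOPS = 3
old_hops = {"a": 3, "b": 3, "c": 2, "d": 3}
new_hops = {"a": 1, "b": 3, "c": 1, "d": 2}

# "b" is on the edge but did not decline; "c" declined but was not on the edge.
print(documents_to_recrawl(old_hops, new_hops, MAX_HOPS))  # ['a', 'd']
```

Under this reading, documents that were already inside the limit need no recrawl, because their children were in range before the change.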
[jira] [Commented] (CONNECTORS-501) Medium-scale web crawl with hopcount-based filtering fails to find correct number of documents
[ https://issues.apache.org/jira/browse/CONNECTORS-501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13430945#comment-13430945 ]

Karl Wright commented on CONNECTORS-501:

Even after adding the new logic, I'm still seeing random differences from the expected number of documents, on the order of 8% or so. Currently I'm stumped as to a scenario that would account for it; it seems to me I'll need to do a run on a smaller set and attempt some forensics.
[jira] [Commented] (CONNECTORS-501) Medium-scale web crawl with hopcount-based filtering fails to find correct number of documents
[ https://issues.apache.org/jira/browse/CONNECTORS-501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13431222#comment-13431222 ]

Karl Wright commented on CONNECTORS-501:

I have confirmed that most if not all deletions are the result of hopcount delete code being triggered. Furthermore, I have a scenario that would account for the deletions. The scenario looks like this:

- Start with two documents, a and b
- There are two paths from a to b, one longer than the other
- There are two paths from the seed to a, one longer than the other
- If we arrive at b via the longer path from seed to a and the shorter path from a to b, then b may be removed along with the (shorter) link from a to b
- The system will not recover because only the longer link from a to b will be discoverable after the shorter link has been removed

Basically this means that we cannot remove intrinsic links even though job queue entries have been removed.
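The scenario in the comment above can be reproduced on a toy link graph. This is a simplified model and not ManifoldCF's hopcount implementation: the node names `x` and `c`, the limit `MAX_HOPS = 2`, and the breadth-first `hopcounts` helper are assumptions made for illustration.

```python
from collections import deque


def hopcounts(links, seed):
    """Breadth-first hop counts from seed over (parent, child) link records."""
    children = {}
    for parent, child in links:
        children.setdefault(parent, []).append(child)
    dist = {seed: 0}
    queue = deque([seed])
    while queue:
        node = queue.popleft()
        for child in children.get(node, []):
            if child not in dist:
                dist[child] = dist[node] + 1
                queue.append(child)
    return dist


def in_range(links, seed, max_hops):
    """Documents whose hop count from seed does not exceed max_hops."""
    return {doc for doc, hops in hopcounts(links, seed).items() if hops <= max_hops}


MAX_HOPS = 2
all_links = {
    ("seed", "a"),              # shorter path seed -> a (1 hop)
    ("seed", "x"), ("x", "a"),  # longer path seed -> a (2 hops)
    ("a", "b"),                 # shorter path a -> b (1 hop)
    ("a", "c"), ("c", "b"),     # longer path a -> b (2 hops)
}

# Early in the crawl only the longer seed -> a path and the a -> b link are known,
# so b appears to be 3 hops away and is judged out of range.
early = hopcounts({("seed", "x"), ("x", "a"), ("a", "b")}, "seed")
print(early["b"])  # 3

# With every link record kept, b is 2 hops from the seed and belongs in the crawl.
expected = in_range(all_links, "seed", MAX_HOPS)

# If the a -> b link record was deleted together with b, only a -> c -> b remains.
after_removal = in_range(all_links - {("a", "b")}, "seed", MAX_HOPS)
print(sorted(expected - after_removal))  # ['b']
```

In this model b is lost for good once the a -> b record is gone, even after a's hop count drops to 1, which matches the conclusion that the intrinsic link has to survive the removal of the job queue entry.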
[jira] [Commented] (CONNECTORS-501) Medium-scale web crawl with hopcount-based filtering fails to find correct number of documents
[ https://issues.apache.org/jira/browse/CONNECTORS-501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13431248#comment-13431248 ]

Karl Wright commented on CONNECTORS-501:

The fix for CONNECTORS-464 seems to be the source of this bug. The fix removed intrinsic links on either end of a document in the jobqueue. The logic in question may well have been in place to address this problem.
[jira] [Commented] (CONNECTORS-501) Medium-scale web crawl with hopcount-based filtering fails to find correct number of documents
[ https://issues.apache.org/jira/browse/CONNECTORS-501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13431434#comment-13431434 ]

Karl Wright commented on CONNECTORS-501:

Hmm, I tried a straight reversion of the fix for CONNECTORS-464, and that also did not arrive at the correct doc count. Meanwhile, I attempted to revert the key changes for CONNECTORS-464 from the CONNECTORS-501 branch, but wound up with code that clearly still deletes too many intrinsiclink table records. Debugging now...