[ https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16716411#comment-16716411 ]
Karl Wright commented on CONNECTORS-1562:
-----------------------------------------

I tried this out using a small number of the specific seeds provided. I started with the following:

{code}
https://www.uantwerpen.be/en/
https://www.uantwerpen.be/en/about-uantwerp/
https://www.uantwerpen.be/en/about-uantwerp/about-uantwerp/
https://www.uantwerpen.be/en/about-uantwerp/catering-conventionhalls/
https://www.uantwerpen.be/en/about-uantwerp/catering-conventionhalls/convention-halls/
https://www.uantwerpen.be/en/about-uantwerp/catering-conventionhalls/convention-halls/hof-van-liere/
https://www.uantwerpen.be/en/about-uantwerp/catering-conventionhalls/university-club
https://www.uantwerpen.be/en/about-uantwerp/facts-figures
{code}

This generated seven ingestions. I then more-or-less randomly removed a few seeds, leaving this:

{code}
https://www.uantwerpen.be/en/
https://www.uantwerpen.be/en/about-uantwerp/
https://www.uantwerpen.be/en/about-uantwerp/about-uantwerp/
https://www.uantwerpen.be/en/about-uantwerp/catering-conventionhalls/university-club
https://www.uantwerpen.be/en/about-uantwerp/facts-figures
{code}

Rerunning produced zero deletions, and a refetch of all seven previously-ingested documents, with no new ingestions.

Finally, I removed all the seeds and ran it again. A deletion was logged for every indexed document.

My quick analysis of what is happening here is this:

- ManifoldCF keeps grave markers around for hopcount tracking. Hopcount tracking in MCF is extremely complex and much care is taken to avoid miscalculating the number of hops to a document, no matter what order documents are processed in.
  In order to make that work, documents cannot be deleted from the queue just because their hopcount is too large; instead, quite a number of documents are put in the queue and may or may not be fetched, depending on whether they wind up with a low enough hopcount.
- The document deletion phase removes unreachable documents, but documents that are still in the queue and merely have too great a hopcount are not, strictly speaking, unreachable.

In other words, the cleanup phase of a job seems to interact badly with documents that are reachable but have too great a hopcount; these documents seem to be overlooked for cleanup, and will ONLY be cleaned up when they become truly unreachable.

This is not intended behavior. However, fixing it means a behavior change in a very complex part of the software, and it will therefore require great care to correct without breaking something. Because it is not something simple, you should expect it to take me a couple of weeks of elapsed time to come up with the right fix.

Furthermore, it is still true that this model is not one that I'd recommend for crawling a web site. The web connector is not designed to operate with hundreds of thousands of seeds; hundreds, maybe, or thousands on a bad day, but trying to control exactly what MCF indexes by fooling with the seed list is not what it was designed for.
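
The interaction described in the analysis above can be illustrated with a toy model. This is only a sketch under stated assumptions, not ManifoldCF code: the link graph, the paths, and the names `LINKS`, `hopcounts`, and `run_job` are all hypothetical, and real hopcount tracking in MCF is far more involved. The model treats "unreachable" as "no path from any current seed" and reports separately the documents that stay indexed even though their hopcount now exceeds the limit.

```python
from collections import deque

# Hypothetical three-page site: each page links one level deeper.
LINKS = {
    "/en/": ["/en/about/"],
    "/en/about/": ["/en/about/facts"],
}


def hopcounts(links, seeds):
    """Breadth-first hop distance from the nearest seed to every reachable URL."""
    dist = {seed: 0 for seed in seeds}
    queue = deque(seeds)
    while queue:
        url = queue.popleft()
        for target in links.get(url, ()):
            if target not in dist:
                dist[target] = dist[url] + 1
                queue.append(target)
    return dist


def run_job(links, seeds, max_hops, index):
    """Simulate one job run against `index` (a set of indexed URLs, mutated in place).

    Returns (deleted, overlooked):
      deleted    - documents removed because no seed reaches them any more
      overlooked - documents still indexed although their hopcount exceeds max_hops
    """
    dist = hopcounts(links, seeds)

    # Ingest everything within the hopcount limit.
    for url, hops in dist.items():
        if hops <= max_hops:
            index.add(url)

    # Cleanup as described in the comment: only truly unreachable documents go.
    deleted = {url for url in index if url not in dist}
    index.difference_update(deleted)

    # Reachable, but too many hops away: these are the documents left behind.
    overlooked = {url for url in index if dist[url] > max_hops}
    return deleted, overlooked
```

With `max_hops=1`, running the job with seeds `["/en/", "/en/about/"]` indexes all three pages. Rerunning with only `["/en/"]` deletes nothing and reports `/en/about/facts` as overlooked, since it is now two hops away but still reachable. Rerunning with an empty seed list deletes every indexed page, which mirrors the three runs reported above.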

> Document removal Elastic
> ------------------------
>
> Key: CONNECTORS-1562
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1562
> Project: ManifoldCF
> Issue Type: Bug
> Components: Elastic Search connector, Web connector
> Affects Versions: ManifoldCF 2.11
> Environment: ManifoldCF 2.11
>              Elasticsearch 6.3.2
>              Web input connector
>              Elastic output connector
>              Job crawls website input and outputs content to Elastic
> Reporter: Tim Steenbeke
> Assignee: Karl Wright
> Priority: Critical
> Labels: starter
> Attachments: 30URLSeeds.png, 3URLSeed.png, Screenshot from 2018-12-10 14-07-46.png
>
> Original Estimate: 4h
> Remaining Estimate: 4h
>
> My documents aren't removed from the Elasticsearch index after rerunning the job with changed seeds.
> I update my job to change the seedmap and rerun it, or use the scheduler to keep it running even after updating it.
> After the rerun, the unreachable documents don't get deleted.
> It only adds documents when they can be reached.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)