Hello.

Does MCF use a robots.txt located at http://aaa.bb.com/ccc/robots.txt, or does it look for robots.txt only at the root, http://aaa.bb.com/ ?

I restarted the job today, after many hours, so I assume the cached copy had expired, but MCF still crawls everything in the subfolders.

This is what I see in the MCF Postgres table robotsdata:

"<binary data>";"aaa.bb.com:80";1410939267040
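For reference, the robots exclusion standard only defines robots.txt at the root of a host, so a standard-compliant crawler would never look for it under /ccc/. Below is a minimal sketch of how the robots.txt location is typically derived from a page URL (an illustrative helper, not MCF's actual code; class and method names are hypothetical):

    import java.net.URL;

    public class RobotsUrl {
        // Illustrative helper: per the robots exclusion standard the
        // file lives only at the root of a host, so the page's path
        // and query are discarded; only protocol, host, and port matter.
        static URL robotsUrlFor(URL page) throws Exception {
            return new URL(page.getProtocol(), page.getHost(), page.getPort(), "/robots.txt");
        }

        public static void main(String[] args) throws Exception {
            URL page = new URL("http://aaa.bb.com/ccc/folder1/doc1.pdf");
            // Prints http://aaa.bb.com/robots.txt -- the /ccc/ prefix plays no part.
            System.out.println(robotsUrlFor(page));
        }
    }

If MCF follows the standard here, the file would have to live at http://aaa.bb.com/robots.txt to be seen at all.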
Details of the MCF job:
Seeds: http://aaa.bb.com/ccc/
Include in crawl: .*
Include in index: .*
Include only hosts matching seeds? X

Thanks a lot
Mario

From: Karl Wright [mailto:daddy...@gmail.com]
Sent: Tuesday, September 16, 2014 19:22
To: user@manifoldcf.apache.org
Subject: Re: Web crawling, robots.txt and access credentials

Hi Mario,

I looked at your robots.txt. In its current form, it should disallow EVERYTHING from your site. The reason is that some of your paths start with "/", but the allow clauses do not.

As for why MCF is letting files through, I suspect that this is because MCF caches robots data. If you changed the file and expected MCF to pick that up immediately, it won't. The cached copy expires after, I believe, 1 hour. It's kept in the database, so even if you recycle the agents process, that won't purge the cache.

Karl

On Tue, Sep 16, 2014 at 11:44 AM, Karl Wright <daddy...@gmail.com> wrote:

Authentication never bypasses robots.txt. You will want to turn on connector debug logging to see the decisions the web connector is making about which documents are fetched or not fetched, and why.

Karl

On Tue, Sep 16, 2014 at 11:04 AM, Bisonti Mario <mario.biso...@vimar.com> wrote:

Hello.

I would like to crawl some documents in a subfolder of a web site: http://aaa.bb.com/

The structure is:
http://aaa.bb.com/ccc/folder1
http://aaa.bb.com/ccc/folder2
http://aaa.bb.com/ccc/folder3

Folder ccc and its subfolders are protected with Basic authentication:
username: joe
password: ppppp

I want to permit crawling of only some documents in folder1, so I put a robots.txt at http://aaa.bb.com/ccc/robots.txt with the following contents:

User-agent: *
Disallow: /
Allow: folder1/doc1.pdf
Allow: folder1/doc2.pdf
Allow: folder1/doc3.pdf

In MCF 1.7 I set up a web repository connection with "Obey robots.txt for all fetches", and under Access credentials: http://aaa.bb.com/ccc/ with Basic authentication, joe and ppp.

When I create a job with:
Include in crawl: .*
Include in index: .*
Include only hosts matching seeds? X

and start it, it crawls all the content of folder1, folder2, and folder3 instead of, as I expected, only:
http://aaa.bb.com/ccc/folder1/doc1.pdf
http://aaa.bb.com/ccc/folder1/doc2.pdf
http://aaa.bb.com/ccc/folder1/doc3.pdf

Why is this? Does Basic authentication perhaps bypass "Obey robots.txt for all fetches"?

Thanks a lot for your help.
Mario
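Based on Karl's diagnosis above (the Disallow value starts with "/" while the Allow values do not), a corrected file would presumably need every path written as an absolute path from the host root, something like:

    User-agent: *
    Disallow: /
    Allow: /ccc/folder1/doc1.pdf
    Allow: /ccc/folder1/doc2.pdf
    Allow: /ccc/folder1/doc3.pdf

The /ccc/ prefix is needed because robots.txt paths are always resolved against the host root. Note also that Allow is an extension to the original robots exclusion standard, so it is worth verifying that MCF's parser honors it and gives the more specific Allow lines precedence over the blanket Disallow.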