Hello.
Does MCF use the robots.txt at http://aaa.bb.com/ccc/robots.txt, or does it
look for robots.txt only at the root, http://aaa.bb.com/ ?
I restarted today, so after many hours I supposed the cache had expired, but MCF
still scans everything in the subfolders.
In MCF's Postgres table robotsdata I read this row (see the timestamp conversion
sketch below):
"<binary data>";"aaa.bb.com:80";1410939267040

Details of the MCF job:
Seeds:
http://aaa.bb.com/ccc/
Include in crawl: .*
Include in index: .*
Include only hosts matching seeds? X
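
About the robotsdata row above: converting the trailing number as epoch
milliseconds gives a concrete date to compare with the restart time. A minimal
sketch, assuming the third column really is a millisecond timestamp (I am not
sure whether it records the fetch time or the expiration time):

import java.time.Instant;

public class RobotsCacheTime {
    public static void main(String[] args) {
        // Value copied from the robotsdata row; assumed to be milliseconds since the epoch.
        long millis = 1410939267040L;
        // Prints the instant in UTC (2014-09-17T07:34:27.040Z) for comparison with the restart time.
        System.out.println(Instant.ofEpochMilli(millis));
    }
}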

Thanks a lot
Mario








From: Karl Wright [mailto:daddy...@gmail.com]
Sent: Tuesday, September 16, 2014 19:22
To: user@manifoldcf.apache.org
Subject: Re: Web crawling, robots.txt and access credentials

Hi Mario,
I looked at your robots.txt.  In its current form, it should disallow 
EVERYTHING from your site.  The reason is that some of your paths start with 
"/", but the allow clauses do not.
As for why MCF is letting files through, I suspect that this is because MCF 
caches robots data.  If you changed the file and expected MCF to pick that up 
immediately, it won't.  The cached copy expires after, I believe, 1 hour.  It's 
kept in the database so even if you recycle the agents process it won't purge 
the cache.
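
Coming back to the leading-slash problem: for illustration, a version with
consistent paths might look like the sketch below (assuming the documents really
live under /ccc/folder1/ as seen from the host root, and that the crawler honors
Allow directives, which are an extension of the original robots.txt standard):

User-agent: *
Disallow: /
Allow: /ccc/folder1/doc1.pdf
Allow: /ccc/folder1/doc2.pdf
Allow: /ccc/folder1/doc3.pdf
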
Karl

On Tue, Sep 16, 2014 at 11:44 AM, Karl Wright <daddy...@gmail.com> wrote:
Authentication does not bypass robots ever.
You will want to turn on connector debug logging to see the decisions that the 
web connector is making with respect to which documents are fetched or not 
fetched, and why.
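
If it helps, connector debug logging is usually enabled by adding a property to
properties.xml and then restarting the agents process; a sketch, assuming a
default MCF layout (the property name is from memory, so please double-check it
against the documentation):

<property name="org.apache.manifoldcf.connectors" value="DEBUG"/>

The fetch and exclusion decisions should then show up in manifoldcf.log.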

Karl

On Tue, Sep 16, 2014 at 11:04 AM, Bisonti Mario <mario.biso...@vimar.com> wrote:

Hello.

I would like to crawl some documents in a subfolder of a web site:
http://aaa.bb.com/

Structure is:
http://aaa.bb.com/ccc/folder1
http://aaa.bb.com/ccc/folder2
http://aaa.bb.com/ccc/folder3

The ccc folder and its subfolders are protected with Basic authentication:
username: joe
password: ppppp

I want to permit crawling of only some documents in folder1,
so I put a robots.txt at
http://aaa.bb.com/ccc/robots.txt

The contents of the robots.txt file are:
User-agent: *
Disallow: /
Allow: folder1/doc1.pdf
Allow: folder1/doc2.pdf
Allow: folder1/doc3.pdf


On MCF 1.7 I set up a Web repository connection with
“Obey robots.txt for all fetches”
and, under Access credentials:
http://aaa.bb.com/ccc/
Basic authentication: joe and ppp
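
As a side note, a quick way to see exactly what the server returns for that
robots URL with the same credentials is a small standalone fetch; this is only a
hypothetical check I use for debugging, not part of the MCF configuration:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.Base64;

public class CheckRobotsFetch {
    public static void main(String[] args) throws Exception {
        // Fetch the robots file with the same Basic credentials used in the
        // repository connection, to see the status code and body the server returns.
        URL url = new URL("http://aaa.bb.com/ccc/robots.txt");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        String token = Base64.getEncoder().encodeToString("joe:ppppp".getBytes("UTF-8"));
        conn.setRequestProperty("Authorization", "Basic " + token);
        System.out.println("HTTP status: " + conn.getResponseCode());
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}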

When I create a job with:
Include in crawl: .*
Include in index: .*
Include only hosts matching seeds? X

and start it, it crawls all the content of folder1, folder2, and folder3
instead of, as I expected, only:
http://aaa.bb.com/ccc/folder1/doc1.pdf
http://aaa.bb.com/ccc/folder1/doc2.pdf
http://aaa.bb.com/ccc/folder1/doc3.pdf


Why does this happen?

Perhaps the Basic authentication bypasses the “Obey robots.txt for all
fetches” setting?

Thanks a lot for your help.
Mario


