Continuing the noble tradition of replying to my own messages, I have
a small update on the topic of the crawler crawling outside of the
given list of hosts in spite of db.ignore.external.links=true...

2006/10/25, Tomi NA <[EMAIL PROTECTED]>:

> > Could you give an example of a root URL, which leads to this symptom
> > (i.e. leaks outside the original site)?
>
> I'll try to find out exactly where the crawler starts to run loose as
> I have several web sites in my initial URL list.

I'm using Nutch 0.9 now and have run into the problem again. It's
hard to reproduce because I have dozens of hosts in my initial URL
list and the crawler only leaves them days after the crawl starts, so
it's very difficult to pinpoint how or why the crawler steps outside
its bounds.
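
For reference, one way to hard-limit a crawl to known hosts,
independently of db.ignore.external.links, is the URL filter file
(conf/crawl-urlfilter.txt for the "crawl" command, or
conf/regex-urlfilter.txt otherwise). A rough sketch -- the host names
here are placeholders, not my actual seed list:

```
# accept URLs on the seed hosts only (example hosts; adapt to your list)
+^http://([a-z0-9-]+\.)*example\.com/
+^http://([a-z0-9-]+\.)*example\.org/

# reject everything else
-.
```

The final "-." is what actually fences the crawl in; without it the
default filters may still let foreign hosts through.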

Has anyone else run into this problem?
Is there anything else I need to set up besides
db.ignore.external.links=true?
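
In case it matters, this is how I assume the property should be set in
conf/nutch-site.xml (sketch based on the nutch-default.xml entry --
please correct me if the override belongs elsewhere):

```xml
<!-- conf/nutch-site.xml: override of the nutch-default.xml setting -->
<property>
  <name>db.ignore.external.links</name>
  <value>true</value>
  <description>If true, outlinks leading to a different host
  than the page they were found on are ignored.</description>
</property>
```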

TIA,
t.n.a.

_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
