Continuing the noble tradition of replying to my own messages, I have a small update on the topic of the crawler crawling outside of the given list of hosts in spite of db.ignore.external.links=true...
2006/10/25, Tomi NA <[EMAIL PROTECTED]>:
> > Could you give an example of a root URL, which leads to this symptom
> > (i.e. leaks outside the original site)?
>
> I'll try to find out exactly where the crawler starts to run loose as
> I have several web sites in my initial URL list.

I'm using Nutch 0.9 now and have run into the problem again. It's hard to reproduce: I have dozens of hosts in my initial URL list, and the crawler only leaves them days after the crawl starts, so it's very difficult to pinpoint how or why it steps outside its bounds.

Has anyone else run into this problem? Is there anything else I need to set up besides db.ignore.external.links=true?

TIA,
t.n.a.
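P.S. In case it helps anyone spot a misconfiguration, this is the override I have in conf/nutch-site.xml, as a minimal sketch (db.ignore.external.links is the property name as documented in nutch-default.xml; the rest of my config is assumed to be otherwise stock):

  <?xml version="1.0"?>
  <configuration>
    <!-- When true, outlinks pointing at hosts other than the one the
         page was fetched from are supposed to be dropped, keeping the
         crawl on the injected hosts without any URLFilter rules. -->
    <property>
      <name>db.ignore.external.links</name>
      <value>true</value>
    </property>
  </configuration>

As a fallback I'm considering locking the URL filters to the seed hosts as well: with the one-step crawl command in 0.9 that would mean editing conf/crawl-urlfilter.txt and replacing the catch-all accept rule with one line per seed host, along the lines of the pattern shipped in the default file (example.com standing in for each of my hosts):

  +^http://([a-z0-9-]*\.)*example.com/

That shouldn't be necessary if the property works as advertised, but it would at least stop the leak while I track this down.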
