I am using Fetcher2; please refer to the code fragment below:
if (newUrl != null && !newUrl.equals(fit.url.toString())) {
  UTF8 redirUrl = new UTF8(newUrl);
  if (maxRedirect > 0) {
    redirecting = true;
    redirectCount++;
    fit = FetchItem.create(redirUrl, new CrawlDatum(), byIP);
    FetchItemQueue fiq = fetchQueues.getFetchItemQueue(fit.queueID);
    fiq.addInProgressFetchItem(fit);
    if (LOG.isDebugEnabled()) {
      LOG.debug(" - protocol redirect to " + redirUrl + " (fetching now)");
    }
Add a statement to this block that checks whether redirUrl and fit.url are
from the same host. I'll provide a patch through JIRA soon, but I need some
time to learn how to submit one.
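
A minimal sketch of such a same-host check, for illustration only. The class
RedirectHostCheck and the method sameHost are hypothetical names, not part of
Nutch, and the comparison is a plain case-insensitive match on the host part
of each URL; this is an assumption about what "same host" should mean here,
not the eventual patch.

```java
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical helper (not part of Nutch): decides whether a redirect
// target stays on the same host as the URL that was originally fetched.
final class RedirectHostCheck {

    private RedirectHostCheck() {}

    // Returns true only if both URLs parse and their hosts match,
    // ignoring case. Unparseable URLs are treated as "not the same host"
    // so that a bad redirect target is never followed by accident.
    static boolean sameHost(String fromUrl, String toUrl) {
        try {
            String fromHost = new URL(fromUrl).getHost();
            String toHost = new URL(toUrl).getHost();
            return !fromHost.isEmpty() && fromHost.equalsIgnoreCase(toHost);
        } catch (MalformedURLException e) {
            return false;
        }
    }
}
```

In the fragment above, one possible use would be to extend the condition to
something like `maxRedirect > 0 && RedirectHostCheck.sameHost(fit.url.toString(), newUrl)`,
so that a redirect to another host is not fetched immediately.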
If you would rather not change the code, setting "http.max.redirect" to 0 in
your Nutch site configuration should also work.
----- Original Message -----
From: "Tomi N/A" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Thursday, April 19, 2007 10:07 PM
Subject: Re: Fetching outside the domain ?
> 2007/4/19, qi wu <[EMAIL PROTECTED]>:
>> I found a bug in Fetcher that causes the problem you reported...
>> Currently, Nutch performs the external link check only during parsing, which
>> ensures that all generated outlinks are on the same host as the
>> from-URL. But for links that are redirected during fetch, this is
>> not enough. We also need to make sure the redirected URL is on the same
>> host as the source URL.
>> Just take the link below as an example:
>> http://www.nxtravel.net/?feed=AS&template=Lander_Hybrid&rank=4&keyword=Loans&d=unsecured-direct-loan.com&rid=http%3A%2F%2Fwww.google.com%2Furl%3Fsa%3DL%26ai%3DBLo7nXConRq6MG5_IhQS6xtEClJquHNzjjKMGrOuW0wTAuAIQBBgEIInKzAcoBzABOAFQ0PfZ2vj_____AWCdudCBkAWYAeeHAZgBhogBqgEFMDI1MTSyAQxueHRyYXZlbC5uZXTIAQHaAQxueHRyYXZlbC5uZXTIApS06QHZAzr5xMjNnhl44AMC%26num%3D4%26q%3Dhttp%3A%2F%2Funsecured-direct-loan.com%2Funsecured-loans-online.html%26usg%3DAFrqEzct1VSZnZ48RrXOwHNyxS8qzm9O_w
>> it will be redirected to
>> http://unsecured-direct-loan.com/unsecured-loans-online.html
>
> Nice to know I haven't lost it completely: finally someone else
> acknowledged the problem exists. :)
> Could you please clarify what you meant by "So just add external link
> check for moved and temp_moved urls should fix this problem"?
>
> TIA,
> t.n.a.
>