This is just in case a page forwards the parser to a new URL.
The real URL filtering (in Nutch 0.8) is done in
ParseOutputFormat, around line 100.
HTH
Stefan

On 04.02.2006, at 06:31, Fuad Efendi wrote:

I am also checking Fetcher; it seems strange to me:

case ProtocolStatus.MOVED:
case ProtocolStatus.TEMP_MOVED:
        handleFetch(fle, output);
        String newurl = pstat.getMessage();
        newurl = URLFilters.filter(newurl);

So, we are calling "handleFetch" before "filter"... Error?

EnabledHost - sends a redirect to DisabledHost
DisabledHost - parsed(!); links to unknown hosts are probably stored (if not
disabled explicitly)
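A minimal sketch of the order being suggested (hypothetical code, not the actual Nutch Fetcher; the filter here is a toy stand-in for URLFilters.filter, which returns null for dropped URLs): check the redirect target first, so a redirect to a disabled host is never handled at all.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: filter a redirect target *before* processing it.
public class RedirectFilterSketch {

    // Toy stand-in for URLFilters.filter(): null means "filtered out".
    static String filter(String url, List<String> disabledHosts) {
        for (String host : disabledHosts) {
            if (url.startsWith("http://" + host)) {
                return null;
            }
        }
        return url;
    }

    public static void main(String[] args) {
        List<String> disabled = Arrays.asList("disabledhost.example.com");

        // A redirect pointing at a disabled host never reaches handleFetch.
        String newUrl = filter("http://disabledhost.example.com/page", disabled);
        System.out.println(newUrl == null
            ? "redirect target filtered; skipping"
            : "would handle fetch for " + newUrl);
    }
}
```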


-----Original Message-----
From: Fuad Efendi [mailto:[EMAIL PROTECTED]
Sent: Friday, February 03, 2006 11:47 PM
To: [email protected]
Cc: [email protected]
Subject: RE: takes too long to remove a page from WEBDB


We have following code:

org.apache.nutch.parse.ParseOutputFormat.java
...
[94]    toUrl = urlNormalizer.normalize(toUrl);
[95]    toUrl = URLFilters.filter(toUrl);
...


It normalizes, then filters the normalized URL, then writes it to crawl_parse.

In some cases the normalized URL is not the same as the raw URL, and so it is
not filtered.
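To illustrate how the order matters (toy normalizer and toy exclude rule, made up for this example; they are not the real Nutch plugins): a filter pattern that matches the raw form can miss the URL once the normalizer has rewritten it.

```java
// Toy illustration of the normalize-then-filter ordering; the normalizer
// and the exclude rule are hypothetical, not real Nutch plugins.
public class NormalizeThenFilter {

    // Toy normalizer: strips a trailing "index.html".
    static String normalize(String url) {
        String suffix = "index.html";
        return url.endsWith(suffix)
            ? url.substring(0, url.length() - suffix.length())
            : url;
    }

    // Toy filter rule: exclude URLs containing "index.html" (null = dropped).
    static String filter(String url) {
        return url.contains("index.html") ? null : url;
    }

    public static void main(String[] args) {
        String raw = "http://www.example.com/dir/index.html";
        String normalized = normalize(raw);

        System.out.println(filter(raw));        // null: raw form is excluded
        System.out.println(filter(normalized)); // passes: pattern no longer matches
    }
}
```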


-----Original Message-----
From: Fuad Efendi [mailto:[EMAIL PROTECTED]
Sent: Friday, February 03, 2006 10:53 PM
To: [email protected]
Subject: RE: takes too long to remove a page from WEBDB


It will also be generated if a non-filtered page sends a redirect
to another page (which should be filtered)...

I have the same problem in my modified DOMContentUtils.java,
...
if (url.getHost().equals(base.getHost())) { outlinks.add (..........); }
...

- it doesn't help; I still see some URLs from "filtered" hosts...
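For reference, a self-contained version of that same-host check (hypothetical class and method names; java.net.URL is standard). Note that a check like this only guards outlinks found during parsing; redirect targets reported by the protocol layer never pass through it, which may be why URLs from filtered hosts still show up.

```java
import java.net.URL;

// Hypothetical sketch of a same-host outlink check like the one above.
public class SameHostCheck {

    static boolean sameHost(String baseUrl, String candidateUrl) {
        try {
            return new URL(candidateUrl).getHost()
                    .equalsIgnoreCase(new URL(baseUrl).getHost());
        } catch (java.net.MalformedURLException e) {
            return false; // unparsable URLs are treated as different hosts
        }
    }

    public static void main(String[] args) {
        String base = "http://www.example.com/page.html";
        // Only same-host outlinks would be added.
        System.out.println(sameHost(base, "http://www.example.com/a.html")); // true
        System.out.println(sameHost(base, "http://other.example.org/b"));    // false
    }
}
```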


-----Original Message-----
From: Keren Yu [mailto:[EMAIL PROTECTED]
Sent: Friday, February 03, 2006 4:01 PM
To: [email protected]
Subject: Re: takes too long to remove a page from WEBDB


Hi Stefan,

As I understand it, when you use 'nutch generate' to
generate the fetch list, it doesn't call the urlfilter.
Only 'nutch updatedb' and 'nutch fetch' call the
urlfilter. So the page will be generated again after
30 days even if you use a URL filter to filter it.
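For completeness, excluding a host via the URL filter is normally done with a pattern in the regex URL filter configuration (conf/regex-urlfilter.txt in 0.8-style setups; the exact file name depends on the version and the enabled plugin). A sketch, with a hypothetical host name:

```
# Exclude everything on the unwanted host (hypothetical host name):
-^http://disabledhost\.example\.com/

# Accept everything else (usually kept as the last rule):
+.
```

Rules are applied top to bottom; the first matching `-` or `+` pattern wins.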

Best regards,
Keren

--- Stefan Groschupf <[EMAIL PROTECTED]> wrote:

Not if you filter it in the URL filter.
There is a database-based URL filter, I think somewhere
in Jira; it can help filter larger lists of URLs.

On 03.02.2006, at 21:35, Keren Yu wrote:

Hi Stefan,

Thank you. You are right. I have to use a URL filter
and remove it from the index. But after 30 days, the
page will be generated again when generating the
fetch list.

Thanks,
Keren

--- Stefan Groschupf <[EMAIL PROTECTED]> wrote:

Also, it makes no sense, since the page will come back
as soon as the link is found on a page.
Use a URL filter instead and remove it from the index.
Removing it from the webdb makes no sense.

On 03.02.2006, at 21:27, Keren Yu wrote:

Hi everyone,

It took about 10 minutes to remove a page from the
WEBDB using WebDBWriter. Does anyone know a faster
method to remove a page?

Thanks,
Keren


__________________________________________________
Do You Yahoo!?
Tired of spam?  Yahoo! Mail has the best spam
protection around
http://mail.yahoo.com





---------------------------------------------------------------
company:        http://www.media-style.com
forum:        http://www.text-mining.org
blog:            http://www.find23.net

