Re: [MASSMAIL]Re: How to get the status page after crawl?

Jorge Luis Betancourt González Wed, 18 Mar 2015 21:45:45 -0700

In short, you're after a parse plugin. These are quite common, as we usually 
want to extract more data from the webpage than just all the text in one field. 
My general advice is to browse through the source code of the plugins shipped 
with Nutch; they are usually a great starting point. Another piece of advice is 
to pick the plugin most similar to the one you're trying to write and take it 
as a starting point. If the plugins you're studying come with tests, even 
better: you can learn a lot about a plugin by taking a peek at its tests.
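
As a rough illustration of "extracting more data than just all the text in one 
field", here is a minimal, standalone sketch in plain Java. It has no Nutch 
dependencies: the class name ExtraFieldExtractor, the extract method, and the 
choice of <meta> tags as the data source are all hypothetical. In a real plugin, 
logic like this would presumably live in an implementation of one of Nutch's 
parse extension points (for example HtmlParseFilter) and write into the parse 
metadata rather than a plain Map; check the shipped plugins for the actual 
wiring.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch: pulls <meta name=... content=...>
// pairs out of raw HTML so each one could become its own field.
public class ExtraFieldExtractor {

    // Simplifying assumption: attributes are double-quoted and "name"
    // comes before "content". Real pages need a proper HTML parser.
    private static final Pattern META = Pattern.compile(
            "<meta\\s+name=\"([^\"]+)\"\\s+content=\"([^\"]*)\"[^>]*>",
            Pattern.CASE_INSENSITIVE);

    // Returns field name (lower-cased) -> value, in document order.
    public static Map<String, String> extract(String html) {
        Map<String, String> fields = new LinkedHashMap<>();
        Matcher m = META.matcher(html);
        while (m.find()) {
            fields.put(m.group(1).toLowerCase(), m.group(2));
        }
        return fields;
    }

    public static void main(String[] args) {
        String html = "<html><head>"
                + "<meta name=\"author\" content=\"Jane Doe\">"
                + "<meta name=\"Category\" content=\"news\">"
                + "</head><body>some text</body></html>";
        // Each entry stands in for a custom field added next to the page text.
        System.out.println(extract(html));
    }
}
```

The Map only stands in for wherever the plugin would store its extra fields; 
the point is that the extraction step can be written and tested on its own 
before it is hooked into Nutch.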


One of the great things about Nutch is that you can write your plugins in peace 
without even looking into the source of other parts that you don't need, which 
is kind of nice for beginners. 

Regards,

----- Original Message -----
From: "Mohammed Omer" <[email protected]>
To: [email protected], "julien alvez" <[email protected]>
Sent: Thursday, March 19, 2015 12:09:29 AM
Subject: [MASSMAIL]Re: How to get the status page after crawl?

There's a straightforward tutorial on writing a plugin to add custom
fields, albeit based on v0.9, at
http://sujitpal.blogspot.com/2009/07/nutch-custom-plugin-to-parse-and-add.html

More info on plugins at:
http://wiki.apache.org/nutch/AboutPlugins
http://wiki.apache.org/nutch/WritingPluginExample-1.2

Thank you,

Mo

On Wed, Mar 18, 2015 at 10:25 AM, julien <[email protected]> wrote:

> Hello,
>
> After a Nutch crawl, I would like to recover the status of all URLs.
> Do you know how I can retrieve the status of the URLs (code 200, 404, 503, ...)?
>
> Thanks
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/How-to-get-the-status-page-after-crawl-tp4193761.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
>
