Re: [Nutch-general] Using Nutch for special content pages

Zaheed Haque Tue, 09 Jan 2007 01:30:55 -0800

Hi:

To find a specific text, subject, or group of texts you need to process
the document, i.e. you need to download the page to your disk, process
it, and then delete or keep it based on your rules. Either way you still
have to download the page first. This means you will need a lot of disk
space "temporarily" if you are planning to crawl the world :-)
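
To make the "delete or keep based on rules" step concrete, here is a rough
sketch in Python. It is not Nutch code and not a plugin. The keyword list,
the function names and the min_hits threshold are all made up for
illustration, and a real rule for .no pages would also need Norwegian terms.

```python
import re

# Hypothetical topic vocabulary (an assumption, not from Nutch);
# a real list would be tuned for ship/engine material.
KEYWORDS = {"ship", "engine", "vessel", "propeller"}


def count_hits(text, keywords=KEYWORDS):
    """Count how many words of the page text appear in the keyword set."""
    words = re.findall(r"\w+", text.lower())
    return sum(1 for word in words if word in keywords)


def should_keep(text, keywords=KEYWORDS, min_hits=2):
    """Rule: keep a page only if it mentions the topic at least min_hits times."""
    return count_hits(text, keywords) >= min_hits
```

The page text still has to be fetched and parsed before should_keep() can
run on it, so the rule only saves long-term storage, not the download.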


There is a Creative Commons plugin in Nutch, src/plugin/creativecommons,
which does something similar and could be a good starting point. As you
have lots of time, it would be best to make the new plugin a bit generic
:-) so we can all enjoy it!

Cheers

On 1/9/07, Tor Harald Thorland <[EMAIL PROTECTED]> wrote:
>
> Hello,
>
> I have a question about Nutch.
> I'm a total newbie and am wondering:
> Is it possible to set up Nutch to crawl any address it finds, and only
> store pages where it finds something about a subject?
> I'd like to make a search site for ship/engine related material, and
> was thinking of starting with .no domains. (I have lots of time for
> this, and the pages I'm looking for do not really get "outdated",
> but I don't want to waste a lot of disk space etc. on pages which
> don't include what I'm looking for.)
>
> Best Regards
> Tor Harald Thorland
>
>
>

_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
