I think I found the answer to my previous question by doing this:

 bin/nutch readlinkdb crawl/linkdb/ -dump test

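For what it's worth, this is roughly how I am inspecting the dump for
those file types (assuming readlinkdb -dump writes plain-text part-*
files under the test/ directory):

 # per-file count of dumped entries that mention image/script suffixes
 grep -icE '\.(gif|jpg|png|ico|css|js|swf)' test/part-*
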

But my next question is: why does the result show URLs ending in
'gif', 'js', etc.?

I have this line in my crawl-urlfilter.txt, so I don't expect to crawl
things like images and JavaScript files:

# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP|js|rss|swf)$

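To rule out a typo in the pattern itself, here is a quick hand test of
the regex against a sample URL (just a sanity check of the expression
with egrep-style matching; example.com is a placeholder, and Nutch
applies the pattern through its URL filter plugins rather than grep):

 # prints the URL if the pattern matches, i.e. the '-' rule would exclude it
 echo 'http://example.com/images/logo.gif' | \
   grep -E '\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP|js|rss|swf)$'
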

Can you please tell me how to fix my problem?

Thank you.

On 4/11/07, Meryl Silverburgh <[EMAIL PROTECTED]> wrote:
> Hi,
>
> I read this article about nutch crawling:
> http://today.java.net/pub/a/today/2006/01/10/introduction-to-nutch-1.html
>
> How can I dump out the valid links which have been crawled?
> The command described in the article does not work in Nutch 0.9. What
> should I use instead?
>
> bin/nutch readdb crawl-tinysite/db -dumplinks
>
> Thank you for any help.
>
