I think I found the answer to my previous question by doing this:

bin/nutch readlinkdb crawl/linkdb/ -dump test

But my next question is why the result shows URLs with 'gif', 'js', etc. I have these lines in my crawl-urlfilter.txt, so I don't expect to crawl things like images or JavaScript files:

# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP|js|rss|swf)$

Can you please tell me how to fix my problem?

Thank you.

On 4/11/07, Meryl Silverburgh <[EMAIL PROTECTED]> wrote:
Hi,

I read this article about Nutch crawling:
http://today.java.net/pub/a/today/2006/01/10/introduction-to-nutch-1.html

How can I dump the valid links that have been crawled? The command described in the article does not work in Nutch 0.9:

bin/nutch readdb crawl-tinysite/db -dumplinks

What should I use instead?

Thank you for any help.
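
For anyone wanting to see exactly which URLs the suffix rule quoted above matches, here is a minimal sketch that runs the same pattern against a few URLs outside of Nutch. It is only an approximation: Python's re module stands in for the Java regex engine that Nutch uses, the sample URLs and the is_excluded helper are hypothetical, and it says nothing about which Nutch tools actually apply the filter. It does show that, because of the trailing $, the rule only matches when the suffix is the very end of the URL, and that only the listed spellings (for example gif and GIF, but not Gif) are covered.

```python
import re

# Suffix rule copied from the URL filter line quoted above, without the
# leading "-" (which marks an exclude rule in the filter file).
SKIP_SUFFIXES = re.compile(
    r"\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm"
    r"|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP|js|rss|swf)$"
)


def is_excluded(url: str) -> bool:
    """Return True if the suffix rule matches this URL (hypothetical helper)."""
    return SKIP_SUFFIXES.search(url) is not None


# Hypothetical sample URLs, not taken from any real crawl.
SAMPLE_URLS = [
    "http://example.com/index.html",
    "http://example.com/logo.gif",
    "http://example.com/app.js",
    "http://example.com/app.js?v=2",   # query string after the suffix
    "http://example.com/photo.Gif",    # mixed case is not in the list
]

if __name__ == "__main__":
    for url in SAMPLE_URLS:
        print(("skip " if is_excluded(url) else "keep ") + url)
```

If the URLs in the dump end in a query string or use a spelling that is not in the list, the rule as written would not match them; whether that explains the output above is an open question.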
