Re: [Nutch-general] How to dump all the valid links which has been crawled?

Briggs Thu, 19 Apr 2007 14:58:01 -0700

Those are links that were discovered during the crawl. That does not
mean they were fetched; they weren't.
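
If the goal is just a list without those suffixes, one option is to
post-filter the URLs yourself. Below is a minimal sketch, not a Nutch
feature: it reuses the suffix pattern from the filter rule quoted below
(minus the leading "-") with plain grep. The helper name
drop_skipped_suffixes and the example.com URLs are made up for
illustration, and it assumes you have already pulled the URLs out of the
dump, one per line; the dump layout is not shown in this thread, so that
step is left out.

```shell
#!/bin/sh
# Suffix pattern copied from the crawl-urlfilter.txt rule quoted in this
# thread, without the leading "-" that marks it as an exclude rule.
SKIP_SUFFIXES='\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP|js|rss|swf)$'

# Hypothetical helper (not part of Nutch): reads URLs on stdin, one per
# line, and prints only those that do NOT end in a skipped suffix.
drop_skipped_suffixes() {
    grep -Ev "$SKIP_SUFFIXES"
}

# Example run with made-up URLs.
printf '%s\n' \
    'http://example.com/index.html' \
    'http://example.com/logo.gif' \
    'http://example.com/app.js' \
    'http://example.com/about' | drop_skipped_suffixes
```

This only hides the entries from the listing. It does not change what
the linkdb contains, and it says nothing about which URLs were fetched.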


On 4/12/07, Meryl Silverburgh <[EMAIL PROTECTED]> wrote:
> I think I found the answer to my previous question by doing this:
>
>  bin/nutch readlinkdb crawl/linkdb/ -dump test
>
>
> But my next question is why the result shows URLs ending in 'gif',
> 'js', etc.
>
> I have this line in my crawl-urlfilter.txt, so I don't expect to
> crawl things like images or JavaScript files:
>
> # skip image and other suffixes we can't yet parse
> -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP|js|rss|swf)$
>
>
> Can you please tell me how to fix my problem?
>
> Thank you.
>
> On 4/11/07, Meryl Silverburgh <[EMAIL PROTECTED]> wrote:
> > Hi,
> >
> > I read this article about nutch crawling:
> > http://today.java.net/pub/a/today/2006/01/10/introduction-to-nutch-1.html
> >
> > How can I dump the valid links which have been crawled?
> > The command described in the article does not work in Nutch 0.9. What
> > should I use instead?
> >
> > bin/nutch readdb crawl-tinysite/db -dumplinks
> >
> > Thank you for any help.
> >
>


-- 
"Conscious decisions by conscious minds are what make reality real"
