That one is a bit more complicated because it has to do with the complexities of the underlying scoring algorithm(s). But, basically, it means "give me the top 35 links within the crawl db and dump them into 'test'". The top links are ranked by their score, which reflects how many other links, from other pages/sites, point to them.
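For example (just an illustration; the exact paths and output layout can vary a bit between nutch versions), after a crawl you could run:

  bin/nutch readdb crawl/crawldb -topN 35 test
  cat test/part-00000

and you should get 35 lines, highest score first, looking roughly like:

  1.2840523   http://example.com/
  0.9731445   http://example.com/about.html
  ...

(the scores and URLs above are made up, of course)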
Basically, when the crawler crawls, it stores all discovered links in the db. If the crawler finds the same link from multiple resources (other pages), then that link's score goes up. That is just a simple explanation, but I think it is close enough. You may want to look more into the OPIC scoring filter and how that algorithm works if you really want to get into the guts of the code (there's a rough toy sketch of the idea at the bottom of this mail). You can also see how scoring is calculated by running the nutch example web application and clicking on the 'explain' link on a result.

On 4/19/07, Meryl Silverburgh <[EMAIL PROTECTED]> wrote:
> Can you please tell me what is the meaning of this command? What are
> the top 35 links? How does nutch rank the top 35 links?
>
> "bin/nutch readdb crawl/crawldb -topN 35 test"
>
> On 4/19/07, Briggs <[EMAIL PROTECTED]> wrote:
> > Those links are links that were discovered. It does not mean that they
> > were fetched; they weren't.
> >
> > On 4/12/07, Meryl Silverburgh <[EMAIL PROTECTED]> wrote:
> > > I think I found the answer to my previous question by doing this:
> > >
> > > bin/nutch readlinkdb crawl/linkdb/ -dump test
> > >
> > > But my next question is why the result shows URLs with 'gif', 'js',
> > > etc.
> > >
> > > I have this line in my crawl-urlfilter.txt, so I don't expect I will
> > > crawl things like images, javascript files:
> > >
> > > # skip image and other suffixes we can't yet parse
> > > -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP|js|rss|swf)$
> > >
> > > Can you please tell me how to fix my problem?
> > >
> > > Thank you.
> > >
> > > On 4/11/07, Meryl Silverburgh <[EMAIL PROTECTED]> wrote:
> > > > Hi,
> > > >
> > > > I read this article about nutch crawling:
> > > > http://today.java.net/pub/a/today/2006/01/10/introduction-to-nutch-1.html
> > > >
> > > > How can I dump out the valid links which have been crawled?
> > > > The command described in the article does not work in nutch 0.9. What
> > > > should I use instead?
> > > >
> > > > bin/nutch readdb crawl-tinysite/db -dumplinks
> > > >
> > > > Thank you for any help.
> >
> > --
> > "Conscious decisions by conscious minds are what make reality real"

--
"Conscious decisions by conscious minds are what make reality real"
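For what it's worth, here is the toy sketch I mentioned above. It is plain Java, not Nutch's actual classes and not the real OPIC scoring filter; it just illustrates the idea that each page passes a share of its own score to everything it links to, so a URL discovered from several pages accumulates a higher score. The URLs and numbers are made up.

import java.util.HashMap;
import java.util.Map;

public class OpicSketch {
    public static void main(String[] args) {
        Map<String, Double> score = new HashMap<>();

        // page A (score 1.0) links to two URLs; each outlink gets 0.5
        contribute(score, 1.0, "http://example.com/popular", "http://example.com/a");

        // page B (score 1.0) links to three URLs, including the same "popular" one;
        // each outlink gets ~0.33, so "popular" now holds 0.5 + 0.33
        contribute(score, 1.0, "http://example.com/popular", "http://example.com/b", "http://example.com/c");

        // the URL discovered from both pages ends up with the highest score
        score.forEach((url, s) -> System.out.printf("%.4f  %s%n", s, url));
    }

    // split a page's score evenly among its outlinks and add each share
    // to the target URL's running total
    static void contribute(Map<String, Double> score, double pageScore, String... outlinks) {
        double share = pageScore / outlinks.length;
        for (String url : outlinks) {
            score.merge(url, share, Double::sum);
        }
    }
}

The real algorithm does more bookkeeping than this (and "readdb -topN" is just reading the scores that the crawl already computed), but that's the basic shape of it.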
