That one is a bit more complicated because it has to do with the complexities of the underlying scoring algorithm(s). But, basically, it means "give me the top 35 links within the crawl db and dump them into 'test'". The top links are ranked by their score, which reflects how many other links, from other pages/sites, point to them.
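For example (just an illustration; the exact paths and output layout can vary a bit between nutch versions), after a crawl you could run:

  bin/nutch readdb crawl/crawldb -topN 35 test
  cat test/part-00000

and you should get 35 lines, highest score first, looking roughly like:

  1.2840523   http://example.com/
  0.9731445   http://example.com/about.html
  ...

(the scores and URLs above are made up, of course)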
Basically, when the crawler crawls, it stores all discovered links in the db. If the crawler finds the same link from multiple resources (other pages), then that link's score goes up. That is just a simple explanation, but I think it is close enough. You may want to look more into the OPIC scoring filter and how that algorithm works if you really want to get into the guts of the code (there's a rough toy sketch of the idea at the bottom of this mail). You can also see how scoring is calculated by running the nutch example web application and clicking on the 'explain' link on a result.

On 4/19/07, Meryl Silverburgh <[EMAIL PROTECTED]> wrote:
> Can you please tell me what is the meaning of this command? What are
> the top 35 links? How does nutch rank the top 35 links?
>
> "bin/nutch readdb crawl/crawldb -topN 35 test"
>
> On 4/19/07, Briggs <[EMAIL PROTECTED]> wrote:
> > Those links are links that were discovered. It does not mean that they
> > were fetched; they weren't.
> >
> > On 4/12/07, Meryl Silverburgh <[EMAIL PROTECTED]> wrote:
> > > I think I found the answer to my previous question by doing this:
> > >
> > > bin/nutch readlinkdb crawl/linkdb/ -dump test
> > >
> > > But my next question is why the result shows URLs with 'gif', 'js',
> > > etc.
> > >
> > > I have this line in my crawl-urlfilter.txt, so I don't expect I will
> > > crawl things like images, javascript files:
> > >
> > > # skip image and other suffixes we can't yet parse
> > > -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP|js|rss|swf)$
> > >
> > > Can you please tell me how to fix my problem?
> > >
> > > Thank you.
> > >
> > > On 4/11/07, Meryl Silverburgh <[EMAIL PROTECTED]> wrote:
> > > > Hi,
> > > >
> > > > I read this article about nutch crawling:
> > > > http://today.java.net/pub/a/today/2006/01/10/introduction-to-nutch-1.html
> > > >
> > > > How can I dump out the valid links which have been crawled?
> > > > The command described in the article does not work in nutch 0.9. What
> > > > should I use instead?
> > > >
> > > > bin/nutch readdb crawl-tinysite/db -dumplinks
> > > >
> > > > Thank you for any help.
> >
> > --
> > "Conscious decisions by conscious minds are what make reality real"

--
"Conscious decisions by conscious minds are what make reality real"
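For what it's worth, here is the toy sketch I mentioned above. It is plain Java, not Nutch's actual classes and not the real OPIC scoring filter; it just illustrates the idea that each page passes a share of its own score to everything it links to, so a URL discovered from several pages accumulates a higher score. The URLs and numbers are made up.

import java.util.HashMap;
import java.util.Map;

public class OpicSketch {
    public static void main(String[] args) {
        Map<String, Double> score = new HashMap<>();

        // page A (score 1.0) links to two URLs; each outlink gets 0.5
        contribute(score, 1.0, "http://example.com/popular", "http://example.com/a");

        // page B (score 1.0) links to three URLs, including the same "popular" one;
        // each outlink gets ~0.33, so "popular" now holds 0.5 + 0.33
        contribute(score, 1.0, "http://example.com/popular", "http://example.com/b", "http://example.com/c");

        // the URL discovered from both pages ends up with the highest score
        score.forEach((url, s) -> System.out.printf("%.4f  %s%n", s, url));
    }

    // split a page's score evenly among its outlinks and add each share
    // to the target URL's running total
    static void contribute(Map<String, Double> score, double pageScore, String... outlinks) {
        double share = pageScore / outlinks.length;
        for (String url : outlinks) {
            score.merge(url, share, Double::sum);
        }
    }
}

The real algorithm does more bookkeeping than this (and "readdb -topN" is just reading the scores that the crawl already computed), but that's the basic shape of it.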
