keep count of selected url

2005-10-12 Thread Daniele Menozzi
Hi all, I am interested in keeping count of the number of times each URL is selected by a user. The problem is not "how do I know which page is clicked", but "how can I store every page/number-of-clicks tuple?" What is the best way to store this information and let Nutch use it? Can yo

Re: what contibute to fetch slowing down

2005-10-10 Thread Daniele Menozzi
On 09:59:45 03/Oct , Doug Cutting wrote: > I suspect threads are hanging, probably in the parser, I tried not parsing, but without good results. If I use 100 threads, I can download pages at 500KB/s for about 5 seconds, but after that, the download rate falls to 0. If I set 20 threads, I can do

Re: Re[2]: what contibute to fetch slowing down

2005-10-10 Thread Daniele Menozzi
On 03:36:45 03/Oct , Michael wrote: > 3mbit, 100 threads = 15 pages/sec > cpu is low during fetch, so it's bandwidth limit. Yes, CPU is low, and even memory is quite free. But with 10MB in/out I cannot obtain good results (and I do not parse results, I simply fetch them). If I use 100 threads, I

Re: what contibute to fetch slowing down

2005-09-28 Thread Daniele Menozzi
On 10:27:55 28/Sep , AJ Chen wrote: > I started the crawler with about 2000 sites. The fetcher could achieve > 7 pages/sec initially, but the performance gradually dropped to about 2 > pages/sec, sometimes even 0.5 pages/sec. The fetch list had 300k pages > and I used 500 threads. What are th

Re: Index Infos

2005-09-17 Thread Daniele Menozzi
On 10:09:12 17/Sep , Michael Ji wrote: > there is a book called "Lucene in Action" published, Ah, OK. I thought the indexing phase was a modified "version" of the original Lucene; in that case, I will look at Lucene's web site. Thanks! Menoz -- Free Software E

Index Infos

2005-09-17 Thread Daniele Menozzi
Hi all, can you please point me to a document that describes the indexing step? I haven't found anything in the wiki, and I do not understand very well what happens... Thank you again!!! Menoz -- Free Software Enthusiast Debian Powered Lin

Re: Problems on Crawling

2005-09-17 Thread Daniele Menozzi
On 11:44:00 17/Sep , Piotr Kosiorowski wrote: > Yes - depth means in fact - number of iterations of > generate/fetch/update cycle. ok, now it's clear :) > nutch generate - will include already fetched pages in new segment for > fetching after some time (I think default is 30 days and you can

Re: Clustering

2005-09-16 Thread Daniele Menozzi
On 19:37:42 16/Sep , Dawid Weiss wrote: > I also provided a sample implementation and it is a plugin available in > Nutch) using Carrot2 clustering components -- > > http://carrot2.sf.net, or the demo at http://carrot.cs.put.poznan.pl very interesting.. But, what are the main differences betwee

Re: Problems on Crawling

2005-09-16 Thread Daniele Menozzi
On 19:33:57 16/Sep , Piotr Kosiorowski wrote: > bin/nutch updatedb db $s1 > command updates WebDB with links you fetched in segment $s1. ok, so the depth value is only used to stop the crawling at a certain point, and proceed with the indexing, right? But, another thing: how can I refresh old p

Clustering

2005-09-16 Thread Daniele Menozzi
Hi All, I'm interested in clustering (data clustering, more or less like vivisimo.com does). Is there a plugin or an add-on for it? I'm also interested in writing it, so if someone has some advice, or some lines of code, it would be very helpful :) Thank you, Menoz --

Problems on Crawling

2005-09-16 Thread Daniele Menozzi
Hi all, I have questions regarding org.apache.nutch.tools.CrawlTool: I have not really understood the relationship between depth, segments, and fetching. Take the tutorial, for example. I understand these 2 steps: bin/nutch admin db -create bin/nutch inject db -dmozfile conte

Re: Nutch API

2005-09-12 Thread Daniele Menozzi
On 19:29:45 12/Sep , Fredrik Andersson wrote: > Hello Daniele! Hi! > Yes. Look at the bin/nutch script and you will see the entry points in the > Java classes. oh, ok, thank you > > How many pages can nutch actually manage? Is there a limit? > > Nope. The filesystem sets the limit, and it's tot

Nutch API

2005-09-12 Thread Daniele Menozzi
Hi all, I'm interested in the Nutch project, and it seems pretty good, but there are a few things I've not understood: - can Nutch's functions (crawling, indexing, etc.) be called from an external program written in Java, or do I always have to use a shell script? If so, where can I find information on A