Hi all, I am interested in keeping count of the number of times each URL
is selected by a user.
The problem is not "how do I know what page is clicked", but "how can
I store every page/number-of-clicks tuple?"
What is the best way to store this information, and let Nutch use it?
Can yo
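I don't know what Nutch's preferred mechanism for this would be, but one minimal sketch, assuming the search front-end appends each clicked URL (one per line) to a log file such as clicks.log (both the logging step and the file name are my assumptions, not anything Nutch provides), is to aggregate the log with standard tools:

```shell
# Hypothetical clicks.log: one clicked URL per line, as the search
# front-end records each selection (this logging is assumed, not built in).
printf '%s\n' \
  'http://example.com/a' \
  'http://example.com/b' \
  'http://example.com/a' > clicks.log

# Emit one page/number-of-clicks tuple per line, most-clicked first.
sort clicks.log | uniq -c | sort -rn
```

How to feed the resulting counts back into Nutch's ranking is exactly the open question above.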
On 09:59:45 03/Oct , Doug Cutting wrote:
> I suspect threads are hanging, probably in the parser,
I tried not parsing, but without good results.
If I use 100 threads, I can download pages at 500KB/s for about 5 seconds,
but after that, the download rate falls to 0. If I set 20 threads, I can
do
On 03:36:45 03/Oct , Michael wrote:
> 3mbit, 100 threads = 15 pages/sec
> cpu is low during fetch, so it's a bandwidth limit.
Yes, CPU usage is low, and even memory is mostly free. But with a 10MB in/out
connection I cannot obtain good results (and I do not parse the results, I
simply fetch them).
If I use 100 threads, I
On 10:27:55 28/Sep , AJ Chen wrote:
> I started the crawler with about 2000 sites. The fetcher could achieve
> 7 pages/sec initially, but the performance gradually dropped to about 2
> pages/sec, sometimes even 0.5 pages/sec. The fetch list had 300k pages
> and I used 500 threads. What are th
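For what it's worth, the thread counts discussed in these threads are normally set in conf/nutch-site.xml (overriding conf/nutch-default.xml): fetcher.threads.fetch controls the total, and fetcher.threads.per.host caps concurrent requests to a single site, which often matters more than the total once the fetch list is dominated by a few slow hosts. Property names are as I recall them for this era of Nutch; verify against your own conf/nutch-default.xml. A sketch:

```xml
<property>
  <name>fetcher.threads.fetch</name>
  <value>100</value>
</property>
<property>
  <name>fetcher.threads.per.host</name>
  <value>2</value>
</property>
```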
On 10:09:12 17/Sep , Michael Ji wrote:
> there is a book called "Lucene in Action" published,
Ah, ok, I believed that the indexing phase was a modified version of the
original Lucene; in that case, I will look at Lucene's web site.
Thanks!
Menoz
--
Free Software E
Hi all, can you please point me to a document that describes the
indexing step? I haven't found anything in the wiki, and I do not
understand very well what happens...
Thank you again!!!
Menoz
--
Free Software Enthusiast
Debian Powered Lin
On 11:44:00 17/Sep , Piotr Kosiorowski wrote:
> Yes - depth means in fact - number of iterations of the
> generate/fetch/update cycle.
ok, now it's clear :)
> nutch generate - will include already fetched pages in new segment for
> fetching after some time (I think default is 30 days and you can
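The cycle Piotr describes can be sketched as a loop. To keep the sketch self-contained, bin/nutch is replaced here by a stub that only prints the command it would run, and the segment naming is made up for illustration; on a real installation you would invoke bin/nutch directly and use the segment directory that generate actually creates:

```shell
# Stub standing in for bin/nutch: prints instead of running (assumption:
# replace this with the real bin/nutch on an actual installation).
nutch() { echo "bin/nutch $*"; }

DEPTH=3   # "depth" = number of generate/fetch/updatedb iterations
for i in $(seq 1 "$DEPTH"); do
  s="segments/pass-$i"        # hypothetical segment name
  nutch generate db segments  # pick due URLs into a new segment
  nutch fetch "$s"            # fetch the pages in that segment
  nutch updatedb db "$s"      # fold the fetched links back into the WebDB
done
```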
On 19:37:42 16/Sep , Dawid Weiss wrote:
> I also provided a sample implementation and it is a plugin available in
> Nutch) using Carrot2 clustering components --
>
> http://carrot2.sf.net, or the demo at http://carrot.cs.put.poznan.pl
Very interesting... But what are the main differences betwee
On 19:33:57 16/Sep , Piotr Kosiorowski wrote:
> bin/nutch updatedb db $s1
> command updates WebDB with links you fetched in segment $s1.
Ok, so the depth value is only used to stop the crawling at a certain
point and then proceed with the indexing, right?
But another thing: how can I refresh old p
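On refreshing old pages: generate will re-select pages whose fetch interval has expired (the default is 30 days, per the other thread here). That interval is a configurable property, something like the following in conf/nutch-site.xml; the property name is as I recall it for this version, so verify it against conf/nutch-default.xml:

```xml
<property>
  <name>db.default.fetch.interval</name>
  <value>30</value> <!-- days before a fetched page is due for re-fetch -->
</property>
```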
Hi all, I'm interested in clustering (data clustering, more or less like
vivisimo.com does); is there a plugin or an add-on for it?
I'm also interested in writing it, so if someone has some advice, or some
lines of code, it would be very helpful :)
Thank you,
Menoz
--
Hi all, I have questions regarding org.apache.nutch.tools.CrawlTool: I have
not really understood the relationship between depth, segments, and
fetching.
Take for example the tutorial; I understand these two steps:
bin/nutch admin db -create
bin/nutch inject db -dmozfile conte
On 19:29:45 12/Sep , Fredrik Andersson wrote:
> Hello Daniele!
Hi!
> Yes. Look at the bin/nutch script and you will see the entry points in the
> Java classes.
oh,ok,thank you
> > How many pages can nutch actually manage? Is there a limit?
>
> Nope. The filesystem sets the limit, and it's tot
Hi all, I'm interested in the Nutch project, and it seems pretty good, but
there are a few things I haven't understood:
- Can Nutch's functions (crawling, indexing, etc.) be called from an external
program written in Java, or do I always have to use a shell script?
If so, where can I find information on A