I have started to see this problem recently. topN=20 per crawl, but
fetched pages = 15 - 17, while error pages = 2000 - 5000. >25000
pages are missing. This is reproducible with Nutch 0.7.1; both protocol-http
and protocol-httpclient are included.
Depending on how you have Nutch con
I have started to see this problem recently. topN=20 per crawl, but
fetched pages = 15 - 17, while error pages = 2000 - 5000. >25000
pages are missing. This is reproducible with Nutch 0.7.1; both protocol-http
and protocol-httpclient are included.
I also see lots of "Response content
Jérôme Charron wrote:
A related issue is that these two plugins replicate a lot of code. At
some point we should try to fix that. See:
http://www.nabble.com/protocol-http-versus-protocol-httpclient-t521282.html
I have begun working on this. Is nobody else? Can I go on?
> > A related issue is that these two plugins replicate a lot of code. At
> > some point we should try to fix that. See:
> > http://www.nabble.com/protocol-http-versus-protocol-httpclient-t521282.html
I have begun working on this. Is nobody else? Can I go on?
Jérôme
--
http://motrech.fr
Andrzej Bialecki wrote:
Hmm... I'm not saying it's flawless, there were surely some mysterious
things going on with it. That large crawl you mention, was it with the
(recently updated in Nutch) release 3.0? What were the issues?
No, it was in early December, with the previous version. I don't
Doug Cutting wrote:
Stefan Groschupf wrote:
However, if it is known to be buggy, maybe we should not ship it as
the default HTTP protocol plugin, as it is today.
+1
I have found protocol-http to be more reliable for large crawls than
protocol-httpclient and would be in favor of switching the
Stefan Groschupf wrote:
However, if it is known to be buggy, maybe we should not ship it as
the default HTTP protocol plugin, as it is today.
+1
I have found protocol-http to be more reliable for large crawls than
protocol-httpclient and would be in favor of switching the default back
to protoc
Stefan Groschupf wrote:
OK I will do that tomorrow!
However, if it is known to be buggy, maybe we should not ship it as
the default HTTP protocol plugin, as it is today.
Newbies checking out Nutch will use the version that does not fetch
all pages, since most people start with the standard config
The same problem on FreeBSD 6.0 + jdk1.4.2
I think it was also reported some time ago by Rod Taylor.
Switch to protocol-http.
SG> Hi there,
SG> is there someone out there that can confirm a problem we discovered?
SG> We were wondering why not all pages of a generated segment were
SG> fetched.
OK I will do that tomorrow!
However, if it is known to be buggy, maybe we should not ship it as
the default HTTP protocol plugin, as it is today.
Newbies checking out Nutch will use the version that does not fetch
all pages, since most people start with the standard configuration.
Am 19.12.2005 u
Stefan Groschupf wrote:
Anyway, today we noticed that when fetching with protocol-httpclient, the
sum of errors and fetched pages is much less than the size defined when
generating the segment.
Changing to protocol-http solves the problem.
Has anyone else noticed this behavior?
I haven't, but this plugi
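
The symptom described in this thread (fetched pages plus errors adding up to less than the generated segment size) can be checked with simple arithmetic. Below is a minimal sketch in Python; the function name `missing_pages` and the numbers are hypothetical and are not taken from Nutch or from any crawl reported here.

```python
def missing_pages(generated: int, fetched: int, errors: int) -> int:
    """Return how many generated URLs are unaccounted for after a fetch.

    Hypothetical helper: every generated URL should end up either
    fetched or counted as an error, so any remainder is "missing".
    """
    accounted = fetched + errors
    # Clamp at zero so a fully accounted segment reports no shortfall.
    return max(generated - accounted, 0)


if __name__ == "__main__":
    # Illustrative numbers only, not from the thread.
    print(missing_pages(generated=1000, fetched=600, errors=50))
```

A non-zero result for a finished fetch would match the behavior reported above; zero would mean every generated URL was either fetched or logged as an error.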