Hi,
in the fetcher, at line 192, in case the status is NOTMODIFIED we collect
null as content, but we already have the content.
I'm worried about what happens with a page that does not change for 60
days, since the concept of Nutch is to delete segments that are older
than db.default.fetch.interval,
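For what it's worth, the fallback Stefan is hinting at might look like the sketch below: when the server answers 304 Not Modified, reuse the previously stored content instead of collecting null. The class and method names here are hypothetical illustrations, not Nutch's actual fetcher API.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: keep the last successfully fetched content per URL,
// and fall back to it when the server answers 304 Not Modified.
public class NotModifiedFallback {
    // hypothetical cache: url -> previously fetched bytes
    private final Map<String, byte[]> cache = new HashMap<>();

    public byte[] resolveContent(String url, int status, byte[] fetched) {
        if (status == 304) {            // HTTP 304 Not Modified
            return cache.get(url);      // reuse the content we already have
        }
        cache.put(url, fetched);        // remember fresh content for next time
        return fetched;
    }
}
```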
Stefan Groschupf wrote:
Hi,
in the fetcher, at line 192, in case the status is NOTMODIFIED we collect
null as content, but we already have the content.
I'm worried about what happens with a page that does not change for 60 days,
since the concept of Nutch is to delete segments that are older than
[
http://issues.apache.org/jira/browse/NUTCH-134?page=comments#action_12378387 ]
Dawid Weiss commented on NUTCH-134:
---
(back from holidays, so a bit delayed, but) I confirm Andrzej's suggestion -- a
plain-text-only summary is ideal for clustering for
As far as I know, a lot of HTTP servers respond with chunked content, at
least all that return dynamically generated pages.
Should I file a bug?
Any thoughts?
In fact, the requests issued from the http plugin are in HTTP 1.0, so the
servers should never return chunked content.
I think that the
Getting Clustered results in better form.
-
Key: NUTCH-265
URL: http://issues.apache.org/jira/browse/NUTCH-265
Project: Nutch
Type: Improvement
Components: searcher
Versions: 0.7.2
Reporter: Kris K
The
Chris Fellows wrote:
Hello,
So the last discussion on merging segments was back in
Jan. Has there been any progress in this direction?
What would be the benefit of being able to merge
segments? Would being able to merge segments open up
new functionality options, or is merging just a
convenience?
[
http://issues.apache.org/jira/browse/NUTCH-265?page=comments#action_12378425 ]
Dawid Weiss commented on NUTCH-265:
---
The clustering interface is very simple in Nutch because it usually needs to be
adjusted to the needs of a particular application.
I'm almost sure that this is not related to http 1.0 requests.
Am 08.05.2006 um 03:20 schrieb Jérôme Charron:
As far as I know, a lot of HTTP servers respond with chunked content, at
least all that return dynamically generated pages.
Should I file a bug?
Any thoughts?
In fact, the requests
[
http://issues.apache.org/jira/browse/NUTCH-134?page=comments#action_12378458 ]
Doug Cutting commented on NUTCH-134:
+1 for Summary as Writable and change HitSummarizer.getSummary() to return a
Summary directly rather than a String. I don't think
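A Summary along the lines Doug suggests would serialize itself the way a Hadoop Writable does, via write(DataOutput)/readFields(DataInput). The sketch below is a standalone illustration of that shape; it deliberately does not implement the real org.apache.hadoop.io.Writable interface, and the fragment-list layout is an assumption, not Nutch's actual Summary format.

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Standalone sketch of a Writable-style Summary: a list of text fragments
// that can be written to and read back from a binary stream.
public class Summary {
    private final List<String> fragments = new ArrayList<>();

    public void add(String fragment) { fragments.add(fragment); }

    public void write(DataOutput out) throws IOException {
        out.writeInt(fragments.size());           // fragment count first
        for (String f : fragments) out.writeUTF(f);
    }

    public void readFields(DataInput in) throws IOException {
        fragments.clear();                        // Writable contract: reset state
        int n = in.readInt();
        for (int i = 0; i < n; i++) fragments.add(in.readUTF());
    }

    @Override public String toString() { return String.join(" ... ", fragments); }
}
```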
http://www.apple.com, for example, answers with chunked content even if
you request with an HTTP 1.0 header.
Am 08.05.2006 um 03:20 schrieb Jérôme Charron:
As far as I know, a lot of HTTP servers respond with chunked content, at
least all that return dynamically generated pages.
Should I file a bug?
That's great.
Well, my follow up to that then is:
Will the new tool allow any form of diff'ing
segments? In practice this would allow you to run a
crawl on a series of sites one week. Then run another
crawl on the same sites a week or so later. Diff the
segments and allow users to search on
Chris Fellows wrote:
That's great.
Well, my follow up to that then is:
Will the new tool allow any form of diff'ing
segments? In practice this would allow you to run a
No, it does only two things - merging and slicing. That's already one
too many... ;)
crawl on a series of sites one
Furthermore, we can read in the HTTP/1.1 specification that "A server MUST
NOT send transfer-codings to an HTTP/1.0 client."
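For readers following the thread, this is roughly what a client has to do when a misbehaving server sends a chunked body anyway: each chunk is a hex size line, CRLF, that many bytes of data, CRLF, terminated by a zero-size chunk. A minimal decoder sketch (not Nutch code; trailers are ignored for brevity):

```java
import java.io.ByteArrayOutputStream;

// Minimal sketch of decoding an HTTP "Transfer-Encoding: chunked" body.
public class ChunkedDecoder {
    public static byte[] decode(byte[] body) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int pos = 0;
        while (pos < body.length) {
            int lineEnd = pos;
            while (body[lineEnd] != '\r') lineEnd++;          // end of size line
            String sizeLine = new String(body, pos, lineEnd - pos);
            int semi = sizeLine.indexOf(';');                 // drop chunk extensions
            if (semi >= 0) sizeLine = sizeLine.substring(0, semi);
            int size = Integer.parseInt(sizeLine.trim(), 16); // chunk size is hex
            pos = lineEnd + 2;                                // skip CRLF after size
            if (size == 0) break;                             // zero-size = last chunk
            out.write(body, pos, size);                       // copy chunk data
            pos += size + 2;                                  // skip data + CRLF
        }
        return out.toByteArray();
    }
}
```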
I once did a socket implementation against
Anonymizer. This is a well-established proxy service
that services $100K+ government and private contracts.
Their server always
Just remembered, got around it by using HTTPClient
which handles reading the response (chunked or not)
transparently. Haven't looked at the nutch code, but
if we were to use HTTPClient 3.0.x or later, should
take care of it.
--- Chris Fellows [EMAIL PROTECTED] wrote:
Furthermore, we can read
hadoop bug when doing updatedb
--
Key: NUTCH-266
URL: http://issues.apache.org/jira/browse/NUTCH-266
Project: Nutch
Type: Bug
Versions: 0.8-dev
Environment: windows xp, JDK 1.4.2_04
Reporter: Eugen Kochuev
I constantly get
Chris Fellows wrote:
Just remembered, got around it by using HTTPClient
which handles reading the response (chunked or not)
transparently. Haven't looked at the nutch code, but
if we were to use HTTPClient 3.0.x or later, should
take care of it.
Take a look at protocol-httpclient. This
[ http://issues.apache.org/jira/browse/NUTCH-264?page=all ]
Andrzej Bialecki closed NUTCH-264:
---
Resolution: Fixed
A version of this patch was included in rev. 405183
Tools for merging and filtering CrawlDb-s and LinkDb-s
[ http://issues.apache.org/jira/browse/NUTCH-263?page=all ]
Andrzej Bialecki closed NUTCH-263:
---
Resolution: Fixed
Patch applied in rev. 405179. If further improvements are needed please re-open
this issue.
MapWritable.equals() doesn't work
Okay, saw the code in the http-protocol plugin. I
remember looking at this about a year ago. RFC 2616
(HTTP/1.1) does say, as Jerome pointed out:
A server MUST NOT send transfer-codings to an
HTTP/1.0 client.
Regardless, I can attest that there are servers out
there that return chunked content
Indexer doesn't consider linkdb when calculating boost value
Key: NUTCH-267
URL: http://issues.apache.org/jira/browse/NUTCH-267
Project: Nutch
Type: Bug
Components: indexer
Versions: 0.8-dev
[
http://issues.apache.org/jira/browse/NUTCH-267?page=comments#action_12378560 ]
Doug Cutting commented on NUTCH-267:
The OPIC score is much like a count of incoming links, but a bit more refined.
OPIC(P) is one plus the sum of the OPIC contributions
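A toy illustration of the OPIC idea Doug describes: every page starts with "cash" of 1.0 and splits it evenly among its outlinks, and a page's score is one plus the sum of the contributions it receives. This single-pass sketch is a deliberate simplification for intuition, not Nutch's actual scoring code.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Single-pass OPIC-style sketch: score(P) = 1 + sum of cash contributions
// from pages linking to P, where each page splits 1.0 over its outlinks.
public class OpicSketch {
    public static Map<String, Double> scores(Map<String, List<String>> outlinks) {
        Map<String, Double> received = new HashMap<>();
        Set<String> pages = new HashSet<>(outlinks.keySet());
        for (Map.Entry<String, List<String>> e : outlinks.entrySet()) {
            double share = 1.0 / e.getValue().size();  // split cash over outlinks
            for (String target : e.getValue()) {
                received.merge(target, share, Double::sum);
                pages.add(target);                     // targets are pages too
            }
        }
        Map<String, Double> score = new HashMap<>();
        for (String page : pages)
            score.put(page, 1.0 + received.getOrDefault(page, 0.0));
        return score;
    }
}
```

With A linking to C and B linking to C and A, C receives 1.0 + 0.5 and scores 2.5, A scores 1.5, and B (no inlinks) scores 1.0.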