nutch is loosing not modified pages

2006-05-08 Thread Stefan Groschupf
Hi, in the fetcher (line 192), when the status is NOTMODIFIED we collect null as content, but we already have the content. I'm worried about what happens to a page that does not change for 60 days, since the concept of Nutch is to delete segments that are older than db.default.fetch.interval,

Re: nutch is loosing not modified pages

2006-05-08 Thread Andrzej Bialecki
Stefan Groschupf wrote: Hi, in the fetcher (line 192), when the status is NOTMODIFIED we collect null as content, but we already have the content. I'm worried about what happens to a page that does not change for 60 days, since the concept of Nutch is to delete segments that are older than

[jira] Commented: (NUTCH-134) Summarizer doesn't select the best snippets

2006-05-08 Thread Dawid Weiss (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-134?page=comments#action_12378387 ] Dawid Weiss commented on NUTCH-134: --- (back from holidays, so a bit delayed, but) I confirm Andrzej's suggestion -- a plain-text-only summarizer is ideal for clustering for

Re: http chunked content

2006-05-08 Thread Jérôme Charron
As far as I know, a lot of HTTP servers respond with chunked content, at least all those that return dynamically generated pages. Should I file a bug? Any thoughts? In fact, the requests issued from the http plugin are HTTP 1.0, so servers should never return chunked content. I think that the

[jira] Created: (NUTCH-265) Getting Clustered results in better form.

2006-05-08 Thread Kris K (JIRA)
Getting Clustered results in better form.
Key: NUTCH-265
URL: http://issues.apache.org/jira/browse/NUTCH-265
Project: Nutch
Type: Improvement
Components: searcher
Versions: 0.7.2
Reporter: Kris K
The

Re: Merging segments

2006-05-08 Thread Andrzej Bialecki
Chris Fellows wrote: Hello, So the last discussion on merging segments was back in Jan. Has there been any progress in this direction? What would be the benefit of being able to merge segments? Would being able to merge segments open up new functionality options or is merging just a convenience?

[jira] Commented: (NUTCH-265) Getting Clustered results in better form.

2006-05-08 Thread Dawid Weiss (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-265?page=comments#action_12378425 ] Dawid Weiss commented on NUTCH-265: --- The clustering interface is very simple in Nutch because it usually needs to be adjusted to the needs of a particular application.

Re: http chunked content

2006-05-08 Thread Stefan Groschupf
I'm almost sure that this is not related to HTTP 1.0 requests. On 08.05.2006 at 03:20, Jérôme Charron wrote: As far as I know, a lot of HTTP servers respond with chunked content, at least all those that return dynamically generated pages. Should I file a bug? Any thoughts? In fact, the requests

[jira] Commented: (NUTCH-134) Summarizer doesn't select the best snippets

2006-05-08 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-134?page=comments#action_12378458 ] Doug Cutting commented on NUTCH-134: +1 for Summary as Writable and change HitSummarizer.getSummary() to return a Summary directly rather than a String. I don't think

Re: http chunked content

2006-05-08 Thread Stefan Groschupf
http://www.apple.com, for example, answers with chunked content even if you request with an HTTP 1.0 header. On 08.05.2006 at 03:20, Jérôme Charron wrote: As far as I know, a lot of HTTP servers respond with chunked content, at least all those that return dynamically generated pages. Should I file a bug?

Re: Merging segments

2006-05-08 Thread Chris Fellows
That's great. Well, my follow up to that then is: Will the new tool allow any form of diff'ing segments? In practice this would allow you to run a crawl on a series of sites one week. Then run another crawl on the same sites a week or so later. Diff the segments and allow users to search on

Re: Merging segments

2006-05-08 Thread Andrzej Bialecki
Chris Fellows wrote: That's great. Well, my follow up to that then is: Will the new tool allow any form of diff'ing segments? In practice this would allow you to run a No, it does only two things - merging and slicing. That's already one too many... ;) crawl on a series of sites one

Re: http chunked content

2006-05-08 Thread Chris Fellows
Furthermore, we can read in the HTTP/1.1 specification that "A server MUST NOT send transfer-codings to an HTTP/1.0 client." I once did a socket implementation against Anonymizer. This is a well-established proxy service that services $100K+ government and private contracts. Their server always

Re: http chunked content

2006-05-08 Thread Chris Fellows
Just remembered, got around it by using HTTPClient which handles reading the response (chunked or not) transparently. Haven't looked at the nutch code, but if we were to use HTTPClient 3.0.x or later, should take care of it. --- Chris Fellows [EMAIL PROTECTED] wrote: Furthermore, we can read

[jira] Created: (NUTCH-266) hadoop bug when doing updatedb

2006-05-08 Thread Eugen Kochuev (JIRA)
hadoop bug when doing updatedb
Key: NUTCH-266
URL: http://issues.apache.org/jira/browse/NUTCH-266
Project: Nutch
Type: Bug
Versions: 0.8-dev
Environment: windows xp, JDK 1.4.2_04
Reporter: Eugen Kochuev
I constantly get

Re: http chunked content

2006-05-08 Thread Andrzej Bialecki
Chris Fellows wrote: Just remembered, got around it by using HTTPClient which handles reading the response (chunked or not) transparently. Haven't looked at the nutch code, but if we were to use HTTPClient 3.0.x or later, should take care of it. Take a look at protocol-httpclient. This

[jira] Closed: (NUTCH-264) Tools for merging and filtering CrawlDb-s and LinkDb-s

2006-05-08 Thread Andrzej Bialecki (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-264?page=all ] Andrzej Bialecki closed NUTCH-264: --- Resolution: Fixed A version of this patch was included in rev. 405183 Tools for merging and filtering CrawlDb-s and LinkDb-s

[jira] Closed: (NUTCH-263) MapWritable.equals() doesn't work properly

2006-05-08 Thread Andrzej Bialecki (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-263?page=all ] Andrzej Bialecki closed NUTCH-263: --- Resolution: Fixed Patch applied in rev. 405179. If further improvements are needed please re-open this issue. MapWritable.equals() doesn't work

Re: http chunked content

2006-05-08 Thread Chris Fellows
Okay, saw the code in the http-protocol plugin. I remember looking at this about a year ago. RFC 2616 (HTTP/1.1) does say, as Jerome pointed out: A server MUST NOT send transfer-codings to an HTTP/1.0 client. Regardless, I can attest that there are servers out there that return chunked content
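
For readers following this thread, here is a minimal sketch of what tolerating a chunked body involves when a server sends one despite an HTTP/1.0 request. It is an illustration only, not the code in Nutch's http plugin or in HTTPClient: the class name ChunkedBodyDecoder and its methods are hypothetical, and it assumes the response headers have already been consumed and a "Transfer-Encoding: chunked" header was seen.

```java
import java.io.ByteArrayOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical helper (not part of Nutch): decodes a chunked HTTP message body. */
final class ChunkedBodyDecoder {

    private ChunkedBodyDecoder() {}

    /** Reads chunks until the zero-size chunk and returns the concatenated payload. */
    static byte[] decode(InputStream in) throws IOException {
        ByteArrayOutputStream body = new ByteArrayOutputStream();
        while (true) {
            String sizeLine = readLine(in);
            if (sizeLine == null) {
                throw new EOFException("stream ended before the last chunk");
            }
            // A chunk-size line may carry extensions after ';'; ignore them.
            int semi = sizeLine.indexOf(';');
            if (semi >= 0) {
                sizeLine = sizeLine.substring(0, semi);
            }
            int size = Integer.parseInt(sizeLine.trim(), 16); // size is hexadecimal
            if (size < 0) {
                throw new IOException("negative chunk size");
            }
            if (size == 0) {
                break; // last chunk
            }
            byte[] chunk = new byte[size];
            int off = 0;
            while (off < size) {
                int n = in.read(chunk, off, size - off);
                if (n < 0) {
                    throw new EOFException("stream ended inside a chunk");
                }
                off += n;
            }
            body.write(chunk, 0, size);
            readLine(in); // consume the CRLF that terminates the chunk data
        }
        // Skip optional trailer headers up to the blank line (or end of stream).
        String trailer;
        while ((trailer = readLine(in)) != null && !trailer.isEmpty()) {
            // trailers are ignored in this sketch
        }
        return body.toByteArray();
    }

    /** Reads one line terminated by LF (CR is dropped); returns null at end of stream. */
    private static String readLine(InputStream in) throws IOException {
        StringBuilder sb = new StringBuilder();
        int c;
        boolean sawAny = false;
        while ((c = in.read()) != -1) {
            sawAny = true;
            if (c == '\n') {
                break;
            }
            if (c != '\r') {
                sb.append((char) c);
            }
        }
        return sawAny ? sb.toString() : null;
    }
}
```

A caller would pass the socket's input stream, positioned just after the blank line that ends the headers, to ChunkedBodyDecoder.decode(...). Using an existing client library, as suggested in this thread, avoids maintaining this kind of parsing by hand.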

[jira] Created: (NUTCH-267) Indexer doesn't consider linkdb when calculating boost value

2006-05-08 Thread Chris Schneider (JIRA)
Indexer doesn't consider linkdb when calculating boost value
Key: NUTCH-267
URL: http://issues.apache.org/jira/browse/NUTCH-267
Project: Nutch
Type: Bug
Components: indexer
Versions: 0.8-dev

[jira] Commented: (NUTCH-267) Indexer doesn't consider linkdb when calculating boost value

2006-05-08 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-267?page=comments#action_12378560 ] Doug Cutting commented on NUTCH-267: The OPIC score is much like a count of incoming links, but a bit more refined. OPIC(P) is one plus the sum of the OPIC contributions