[ http://issues.apache.org/jira/browse/NUTCH-230?page=comments#action_12370426 ]
Andrzej Bialecki commented on NUTCH-230:
Yes, these are good examples - I'll prepare a patch to make this a boolean
setting; if false (default) the calculation will b
[ http://issues.apache.org/jira/browse/NUTCH-230?page=comments#action_12370424 ]
Ken Krugler commented on NUTCH-230:
So Doug beat me to this comment :)
I was going to describe the two cases we'd run into...
1. There's a great page, but most of the links
Andrzej Bialecki wrote:
When we used WebDB it was possible to overlap generate / fetch / update
cycles, because we would "lock" pages selected by FetchListTool for a
period of time.
Now we don't do this. The advantage is that we don't have to rewrite
CrawlDB. But operations on CrawlDB are con
(Better late than never... I forgot that I hadn't yet responded to your posting.)
Doug Cutting wrote:
I think all that you're saying is that we should not run two CrawlDB
updates at once, right? But there are lots of reasons we cannot do
that besides the OPIC calculation.
When we used WebDB it was
[ http://issues.apache.org/jira/browse/NUTCH-230?page=comments#action_12370381 ]
Doug Cutting commented on NUTCH-230:
Andrzej, that's true if we think links that are filtered are bad links, but if
we instead think of them as non-links then this fix is c
[ http://issues.apache.org/jira/browse/NUTCH-230?page=comments#action_12370356 ]
Andrzej Bialecki commented on NUTCH-230:
Hmmm, this is a deeply philosophical question... Should you spread out the OPIC
score to all links that a page sports, or jus
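As a rough illustration of the two options under discussion (not Nutch's actual code), the sketch below divides a page's score either by the count of all outlinks or by the count of outlinks that survive filtering. The names `distribute_score`, `passes_filter`, and `count_filtered` are hypothetical, and this excerpt does not say which behavior the proposed boolean setting selects by default.

```python
def distribute_score(page_score, outlinks, passes_filter, count_filtered):
    """Split a page's score among its outlinks (hypothetical sketch).

    count_filtered=True  treats filtered links as real links: they count
                         toward the divisor, so their share is simply dropped.
    count_filtered=False treats filtered links as non-links: only the links
                         that pass the filter share the score.
    """
    kept = [url for url in outlinks if passes_filter(url)]
    divisor = len(outlinks) if count_filtered else len(kept)
    if divisor == 0:
        return {}  # nothing to distribute to
    share = page_score / divisor
    # Only links that pass the filter ever receive a contribution.
    return {url: share for url in kept}


# Hypothetical data: four outlinks, two of which are filtered out.
links = ["http://a/", "http://b/", "http://spam1/", "http://spam2/"]
keep = lambda url: "spam" not in url

as_links = distribute_score(1.0, links, keep, count_filtered=True)
as_non_links = distribute_score(1.0, links, keep, count_filtered=False)
```

With these inputs each kept link receives 0.25 when filtered links count toward the divisor (half of the page's score is not passed on), and 0.5 when they are treated as non-links.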
The probability of encountering a $ sign somewhere inside a URL is not
insignificant... I agree that it's very unlikely (perhaps even illegal)
to use ^ in URLs, but $ is sometimes used.
I'd have to take a look at the spec, but I think both characters should
be URL-encoded anyway. Maybe it'd b
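For reference, a minimal sketch of what percent-encoding does to these two characters, using Python's standard `urllib.parse.quote`. The helper name `encode_url_chars` and the sample URL are made up for illustration; whether the spec actually requires encoding either character is left open here, as in the message above.

```python
from urllib.parse import quote


def encode_url_chars(url):
    # Keep ":" and "/" unescaped so the scheme and path separators survive;
    # other characters outside quote()'s always-safe set are percent-encoded.
    return quote(url, safe=":/")


encoded = encode_url_chars("http://example.com/a$b^c")
print(encoded)  # http://example.com/a%24b%5Ec
```

Here `$` becomes `%24` and `^` becomes `%5E`, so neither character appears literally in the encoded form.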