[jira] Commented: (NUTCH-230) OPIC score for outlinks should be based on # of valid links, not total # of links.

2006-03-14 Thread Andrzej Bialecki (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-230?page=comments#action_12370426 ] Andrzej Bialecki commented on NUTCH-230: Yes, these are good examples - I'll prepare a patch to make this a boolean setting; if false (default) the calculation will b

[jira] Commented: (NUTCH-230) OPIC score for outlinks should be based on # of valid links, not total # of links.

2006-03-14 Thread Ken Krugler (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-230?page=comments#action_12370424 ] Ken Krugler commented on NUTCH-230: So Doug beat me to this comment :) I was going to describe the two cases we'd run into... 1. There's a great page, but most of the links

Re: OPIC score calculation issues

2006-03-14 Thread Doug Cutting
Andrzej Bialecki wrote: When we used WebDB it was possible to overlap generate / fetch / update cycles, because we would "lock" pages selected by FetchListTool for a period of time. Now we don't do this. The advantage is that we don't have to rewrite CrawlDB. But operations on CrawlDB are con

Re: OPIC score calculation issues

2006-03-14 Thread Andrzej Bialecki
(Better late than never... I forgot that I hadn't yet responded to your posting). Doug Cutting wrote: I think all that you're saying is that we should not run two CrawlDB updates at once, right? But there are lots of reasons we cannot do that besides the OPIC calculation. When we used WebDB it was

[jira] Commented: (NUTCH-230) OPIC score for outlinks should be based on # of valid links, not total # of links.

2006-03-14 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-230?page=comments#action_12370381 ] Doug Cutting commented on NUTCH-230: Andrzej, that's true if we think links that are filtered are bad links, but if we instead think of them as non-links then this fix is c

[jira] Commented: (NUTCH-230) OPIC score for outlinks should be based on # of valid links, not total # of links.

2006-03-14 Thread Andrzej Bialecki (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-230?page=comments#action_12370356 ] Andrzej Bialecki commented on NUTCH-230: Hmmm, this is a deeply philosophical question... Should you spread out the OPIC score to all links that a page sports, or jus
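
To make the question in NUTCH-230 concrete, here is a minimal sketch of the two ways a page's score could be divided among its outlinks, going only by the issue title (total number of links versus number of valid links). It is not Nutch code: the function name, its parameters, and the `count_filtered` flag are hypothetical, and the flag is not meant to match the boolean setting proposed in the thread, whose exact semantics are cut off in this listing.

```python
def outlink_score(page_score: float, total_links: int, valid_links: int,
                  count_filtered: bool) -> float:
    """Return the score passed to each valid outlink of a page.

    All names here are illustrative assumptions, not Nutch's API.
    count_filtered=True  -> divide by every link found on the page
    count_filtered=False -> divide only by links that survive URL filtering
    """
    denominator = total_links if count_filtered else valid_links
    if denominator == 0:
        # No outlinks to distribute to, so nothing is passed on.
        return 0.0
    return page_score / denominator


# A page with score 1.0 and 10 links, of which only 2 pass the URL filters.
per_link_total = outlink_score(1.0, total_links=10, valid_links=2, count_filtered=True)
per_link_valid = outlink_score(1.0, total_links=10, valid_links=2, count_filtered=False)
```

With the total-link denominator, the score assigned to the eight filtered links is never delivered to any page; with the valid-link denominator, the whole score goes to the two links that are kept.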

Re: Much faster RegExp lib needed in nutch?

2006-03-14 Thread Dawid Weiss
The probability of encountering a $ sign somewhere inside a URL is not insignificant... I agree that it's very unlikely (perhaps even illegal) to use ^ in URLs, but $ signs are sometimes used. I'd have to take a look at the spec, but I think both characters should be URL-encoded anyway. Maybe it'd b