[
https://issues.apache.org/jira/browse/NUTCH-706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12851923#action_12851923
]
Ken Krugler commented on NUTCH-706:
---
Two comments about this:
1. From my experiences with
[
https://issues.apache.org/jira/browse/NUTCH-797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12846459#action_12846459
]
Ken Krugler commented on NUTCH-797:
---
Agreed re crawler-commons...feels like there's a beef
[
https://issues.apache.org/jira/browse/NUTCH-797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12846424#action_12846424
]
Ken Krugler commented on NUTCH-797:
---
I thought this same issue (relative URL with leading
[
https://issues.apache.org/jira/browse/NUTCH-786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12830109#action_12830109
]
Ken Krugler commented on NUTCH-786:
---
Is this something that should also be applied to craw
[
https://issues.apache.org/jira/browse/NUTCH-751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12798890#action_12798890
]
Ken Krugler commented on NUTCH-751:
---
I agree that this should be in crawler-commons. E.g.
[
https://issues.apache.org/jira/browse/NUTCH-751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12753069#action_12753069
]
Ken Krugler commented on NUTCH-751:
---
I'm using HttpClient 4.0 in Bixo, and I agree that Nu
[
https://issues.apache.org/jira/browse/NUTCH-731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12722242#action_12722242
]
Ken Krugler commented on NUTCH-731:
---
This is definitely an issue - I've been pinging vario
[
https://issues.apache.org/jira/browse/NUTCH-101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12722014#action_12722014
]
Ken Krugler commented on NUTCH-101:
---
1. Not sure if the reported problem with "Disallow:"
[
https://issues.apache.org/jira/browse/NUTCH-739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12714277#action_12714277
]
Ken Krugler commented on NUTCH-739:
---
There's another approach that works well here, and th
[
https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12497525
]
Ken Krugler commented on NUTCH-25:
---
I use [ICU|http://krugle.com/kse/projects/BYfaaku] for most issues like this.
The
[
https://issues.apache.org/jira/browse/NUTCH-353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12466261
]
Ken Krugler commented on NUTCH-353:
---
Wait, looks like maybe change 490607 (fix for NUTCH-273) might fix the issue I
[
https://issues.apache.org/jira/browse/NUTCH-353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12466260
]
Ken Krugler commented on NUTCH-353:
---
Another small note about this (see NUTCH-411 for a related but different
probl
[
http://issues.apache.org/jira/browse/NUTCH-385?page=comments#action_12444162 ]
Ken Krugler commented on NUTCH-385:
---
There is a middle ground, though we don't know yet how important it is to
address.
When we crawl partner sites, we sometimes
[
http://issues.apache.org/jira/browse/NUTCH-353?page=comments#action_12439304 ]
Ken Krugler commented on NUTCH-353:
---
+1 that the redirect target is not always the "real" URL that we want to keep.
For example, http://www.ibm.com/developerworks
[
http://issues.apache.org/jira/browse/NUTCH-272?page=comments#action_12412621 ]
Ken Krugler commented on NUTCH-272:
---
The generate.max.per.host parameter does work, but with the following
limitations that we've run into:
1. The current code uses the enti
[
http://issues.apache.org/jira/browse/NUTCH-230?page=comments#action_12370424 ]
Ken Krugler commented on NUTCH-230:
---
So Doug beat me to this comment :)
I was going to describe the two cases we'd run into...
1. There's a great page, but most of the links
OPIC score for outlinks should be based on # of valid links, not total # of
links.
--
Key: NUTCH-230
URL: http://issues.apache.org/jira/browse/NUTCH-230
Project: Nutch
Type: Improvement
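The NUTCH-230 summary above can be illustrated with a minimal sketch. It assumes the page's score is split equally among outlinks and that some predicate decides which links count as "valid"; the issue summary does not say what that predicate is, so `is_valid` and `outlink_scores` are hypothetical names used only for illustration, and this is not Nutch's actual OPIC code.

```python
from typing import Callable, Dict, Iterable


def outlink_scores(page_score: float,
                   outlinks: Iterable[str],
                   is_valid: Callable[[str], bool]) -> Dict[str, float]:
    """Split page_score equally among valid outlinks only.

    is_valid is a hypothetical predicate (for example, "passes the URL
    filters"); the issue summary does not define what "valid" means.
    """
    valid = [url for url in outlinks if is_valid(url)]
    if not valid:
        # No valid outlinks: nothing to distribute.
        return {}
    # Divide by the number of valid links, not the total number of links.
    share = page_score / len(valid)
    return {url: share for url in valid}


# Hypothetical data: four outlinks, two of which are treated as valid.
links = ["a", "b", "c", "d"]
scores = outlink_scores(1.0, links, lambda url: url in {"a", "b"})
```

Dividing by the total link count would give each link 0.25 here and half of the page's score would go to links that are never followed; dividing by the valid count gives each surviving link 0.5.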
17 matches