[jira] Commented: (NUTCH-419) unavailable robots.txt kills fetch

2009-02-28 Thread Doug Cook (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12677704#action_12677704 ] Doug Cook commented on NUTCH-419: - I ran into this same problem, and spent some time

[jira] Updated: (NUTCH-419) unavailable robots.txt kills fetch

2009-02-28 Thread Doug Cook (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Doug Cook updated NUTCH-419: Attachment: diffs Here's a context diff. Hopefully this will work, am rusty at creating patches, and did

[jira] Commented: (NUTCH-566) Sun's URL class has bug in creation of relative query URLs

2007-10-31 Thread Doug Cook (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12539146 ] Doug Cook commented on NUTCH-566: - Hi Doğacan. Thanks for following up. The issue has gotten a little more

[jira] Commented: (NUTCH-567) Proper (?) handling of URIs in TagSoup.

2007-10-17 Thread Doug Cook (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12535593 ] Doug Cook commented on NUTCH-567: - What a nice birthday present! I will check out the fix and see how it works

Re: Anyone looked for a better HTML parser?

2007-10-16 Thread Doug Cook
Sami Siren-2 wrote: Do you have urls of such bad content available to look at? Thousands. Here is one: http://www.valtravieso.com/ver_finca.phtml?idioma=1 The hrefs that have &sub in them get interpreted as the subset character by tagsoup, and thus become broken links. With a few

[jira] Commented: (NUTCH-436) Incorrect handling of relative paths when the embedded URL path is empty

2007-10-16 Thread Doug Cook (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12535272 ] Doug Cook commented on NUTCH-436: - It looks like Nutch-566, and associated patch, which I recently filed

[jira] Created: (NUTCH-566) Sun's URL class has bug in creation of relative query URLs

2007-10-10 Thread Doug Cook (JIRA)
: fetcher Affects Versions: 0.9.0, 0.8.1, 0.8 Environment: MacOS X and Linux (CentOS 4.5) both Reporter: Doug Cook Priority: Minor I'm using 0.8.1, but this will affect all other versions as well. Relative links of the form ?blah are resolved incorrectly
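
For reference, this is how RFC 3986 (section 5.4) expects a query-only relative reference such as ?blah to resolve: the base path is kept and only the query is replaced. The sketch below uses Python's standard urllib.parse.urljoin purely as an illustration. It is not the RelativeURL.java workaround attached to the issue, and the example.com URL and the `resolve` helper name are assumptions made for the example.

```python
from urllib.parse import urljoin


def resolve(base: str, relative: str) -> str:
    """Resolve a relative reference against a base URL (RFC 3986 semantics)."""
    return urljoin(base, relative)


# Reference case from RFC 3986, section 5.4.1: "?y" keeps the base path.
rfc_example = resolve("http://a/b/c/d;p?q", "?y")

# Hypothetical page URL, used only for illustration.
page_example = resolve("http://example.com/dir/page.html?old=1", "?blah")

print(rfc_example)
print(page_example)
```

A resolver that drops the last path segment of the base when handed a query-only reference would not match the RFC example above.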

[jira] Updated: (NUTCH-566) Sun's URL class has bug in creation of relative query URLs

2007-10-10 Thread Doug Cook (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Doug Cook updated NUTCH-566: Attachment: RelativeURL.java Here's a static method to work around the problem. Sun's URL class has bug

[jira] Commented: (NUTCH-25) needs 'character encoding' detector

2007-08-01 Thread Doug Cook (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12517066 ] Doug Cook commented on NUTCH-25: Cool -- will take a look at the new patch (and will try to make stripGarbage more

[jira] Commented: (NUTCH-25) needs 'character encoding' detector

2007-07-25 Thread Doug Cook (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12515342 ] Doug Cook commented on NUTCH-25: Doğacan, Thanks for the quick feedback. * EncodingDetector api is way too open

[jira] Commented: (NUTCH-25) needs 'character encoding' detector

2007-07-25 Thread Doug Cook (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12515461 ] Doug Cook commented on NUTCH-25: Can you provide a link on icu4j's language detection? http://www.icu-project.org

[jira] Commented: (NUTCH-25) needs 'character encoding' detector

2007-07-24 Thread Doug Cook (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12515026 ] Doug Cook commented on NUTCH-25: OK, I've got more data, and a proposed solution. I created a test set with a number

[jira] Updated: (NUTCH-25) needs 'character encoding' detector

2007-07-24 Thread Doug Cook (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Doug Cook updated NUTCH-25: --- Attachment: EncodingDetector.java patch needs 'character encoding' detector

[jira] Updated: (NUTCH-25) needs 'character encoding' detector

2007-07-24 Thread Doug Cook (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Doug Cook updated NUTCH-25: --- Attachment: (was: EncodingDetector.java) needs 'character encoding' detector

[jira] Updated: (NUTCH-25) needs 'character encoding' detector

2007-07-24 Thread Doug Cook (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Doug Cook updated NUTCH-25: --- Attachment: EncodingDetector.java I cleaned up EncodingDetector a little; here's a functionally identical

[jira] Commented: (NUTCH-25) needs 'character encoding' detector

2007-07-21 Thread Doug Cook (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12514426 ] Doug Cook commented on NUTCH-25: Not sure where this belongs architecturally and aesthetically -- will think about

[jira] Commented: (NUTCH-25) needs 'character encoding' detector

2007-07-21 Thread Doug Cook (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12514438 ] Doug Cook commented on NUTCH-25: As far as the problem cases, I'm running a test now on my test DB (the ~60K doc one

[jira] Commented: (NUTCH-25) needs 'character encoding' detector

2007-07-20 Thread Doug Cook (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12514375 ] Doug Cook commented on NUTCH-25: Hi, Doğacan. My sincere apologies for the slow response, especially given

[jira] Commented: (NUTCH-25) needs 'character encoding' detector

2007-07-20 Thread Doug Cook (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12514377 ] Doug Cook commented on NUTCH-25: I should also add that a significant number of the URLs seem to have been fixed

[jira] Commented: (NUTCH-25) needs 'character encoding' detector

2007-07-20 Thread Doug Cook (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12514382 ] Doug Cook commented on NUTCH-25: Oops, spoke too soon. On running a more extensive test, I saw quite a few

[jira] Commented: (NUTCH-25) needs 'character encoding' detector

2007-05-22 Thread Doug Cook (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12498041 ] Doug Cook commented on NUTCH-25: Thanks! I'll take a look at your proposed patch... (that was fast! ask and ye shall

[jira] Commented: (NUTCH-25) needs 'character encoding' detector

2007-05-21 Thread Doug Cook (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12497507 ] Doug Cook commented on NUTCH-25: We might want to think about raising the priority of this. I've seen encoding

[jira] Commented: (NUTCH-353) pages that serverside forwards will be refetched every time

2007-01-20 Thread Doug Cook (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12466284 ] Doug Cook commented on NUTCH-353: - I have a local fix for this problem (partly Paul Gauthier's work, partly mine

[jira] Commented: (NUTCH-416) CrawlDatum status and CrawlDbReducer refactoring

2006-12-20 Thread Doug Cook (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-416?page=comments#action_12460080 ] Doug Cook commented on NUTCH-416: - You may also want to make the status codes ORed values, so that, for example, all of the various kinds of failure all have
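
One possible reading of the ORed-values suggestion is sketched below in Python: every failure status shares a common bit, so a single mask test covers all of them. The constant names and bit values are hypothetical and are not Nutch's CrawlDatum status codes.

```python
# Hypothetical status layout: a class bit ORed with a detail code.
FETCHED = 0x100   # class bit for successful fetches
FAILED = 0x200    # class bit shared by every kind of failure

FAILED_TIMEOUT = FAILED | 0x01
FAILED_NOT_FOUND = FAILED | 0x02
FAILED_ROBOTS_DENIED = FAILED | 0x03
FETCHED_OK = FETCHED | 0x01


def is_failure(status: int) -> bool:
    """True for any status that carries the FAILED class bit."""
    return bool(status & FAILED)


def detail(status: int) -> int:
    """Strip the class bits and return only the detail code."""
    return status & 0xFF
```

With this layout, callers that only care whether a fetch failed test one bit instead of enumerating every failure code.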

[jira] Created: (NUTCH-410) Faster RegexNormalize with more features

2006-11-29 Thread Doug Cook (JIRA)
: 0.8 Environment: Tested on MacOS X 10.4.7/10.4.8 Reporter: Doug Cook Priority: Minor The patch associated with this is backwards-compatible and has several improvements over the stock 0.8 RegexURLNormalizer: 1) About a 34% performance improvement, from only

Re: Should URL normalization iterate?

2006-11-29 Thread Doug Cook
on all 10 rules 10 times. That will cover the case where your rules 'chain' in a 10 step sequence. Sure it's an edge case to do that, but I can see rule sets where you construct 3-step chains (like swapping strings or something). Thanks Neal On 8/30/06, Doug Cook wrote

[jira] Updated: (NUTCH-409) Add short circuit notion to filters to speedup mixed site/subsite crawling

2006-11-25 Thread Doug Cook (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-409?page=all ] Doug Cook updated NUTCH-409: Attachment: shortcircuit.patch Add short circuit notion to filters to speedup mixed site/subsite crawling

Re: More fetcher speed increases

2006-11-25 Thread Doug Cook
and AutomatonURLFilter combination sounds interesting. Could you please attach the patch to JIRA? Thanks - Scott On 11/17/06, Doug Cook wrote: Hi, folks, I, too, was slowed down by reduce operations in fetch. Some benchmarking showed that in my case, the limiting operation was filtering

[jira] Commented: (NUTCH-409) Add short circuit notion to filters to speedup mixed site/subsite crawling

2006-11-25 Thread Doug Cook (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-409?page=comments#action_12452617 ] Doug Cook commented on NUTCH-409: - I should also note that this approach is still not optimal (though it is faster for my usage pattern). I'm still running

Re: need help to speed up map-reduce

2006-11-06 Thread Doug Cook
I've been planning to spend some time looking at this, but haven't gotten round to it yet -- I see the same (serious) performance problems on a single machine setup -- reduce takes quite a bit longer than the fetch (map) operation in my case, and this is on a very fast 4-CPU machine with a ton of

[jira] Created: (NUTCH-396) mergesegs sorts URLs, making segments useless for subsequent fetch

2006-11-03 Thread Doug Cook (JIRA)
Components: generator Affects Versions: 0.8 Environment: Mac OS X 10.4.7 Reporter: Doug Cook Priority: Minor Mergesegs leaves the output segment in URL-sorted order. This is a problem if the segment was just generated and not yet fetched - the fetcher likes

Re: [jira] Commented: (NUTCH-353) pages that serverside forwards will be refetched every time

2006-10-03 Thread Doug Cook
In this case, the site uses the right kind of redirect. Unfortunately, as you point out, it's not at all clear that we can rely on sites correctly choosing the type of redirect (I tried a few sites and most were 302s, even in cases where the redirect was to the permanent, canonical version of

[jira] Commented: (NUTCH-353) pages that serverside forwards will be refetched every time

2006-10-02 Thread Doug Cook (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-353?page=comments#action_12439248 ] Doug Cook commented on NUTCH-353: - This is definitely a complex issue. It is also high priority -- issues with redirects and duplicates, which URL is chosen

[jira] Commented: (NUTCH-364) Javascript parser creates some fairly bogus URLs

2006-09-19 Thread Doug Cook (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-364?page=comments#action_12435945 ] Doug Cook commented on NUTCH-364: - I've been looking into this a little bit. I see two problems: (1) The current two pass heuristic URL-like string extractor has

[jira] Commented: (NUTCH-365) Flexible URL normalization

2006-09-18 Thread Doug Cook (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-365?page=comments#action_12435449 ] Doug Cook commented on NUTCH-365: - It still seems to me that iterative normalization is useful and not risky. By definition, a normalizer is something which

[jira] Commented: (NUTCH-365) Flexible URL normalization

2006-09-09 Thread Doug Cook (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-365?page=comments#action_12433613 ] Doug Cook commented on NUTCH-365: - Hi, Andrzej. Sounds very cool. Haven't had a chance to check out the patch yet to see if it supports this, but attaching

[jira] Commented: (NUTCH-365) Flexible URL normalization

2006-09-09 Thread Doug Cook (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-365?page=comments#action_12433617 ] Doug Cook commented on NUTCH-365: - PS. I like your idea of combining URL filters and normalization. In a sense, a filter is just a normalizer that happens

[jira] Created: (NUTCH-363) Fetcher normalizes everything at least twice

2006-09-08 Thread Doug Cook (JIRA)
: 0.8 Environment: OS X 10.4.7 Reporter: Doug Cook Priority: Minor New links are normalized twice by the fetcher: First in DOMContentUtils.getOutlinks, where the constructor Outlink(url.toString(), linkText.toString().trim(), conf) normalizes the URL. The second

[jira] Created: (NUTCH-364) Javascript parser creates some fairly bogus URLs

2006-09-08 Thread Doug Cook (JIRA)
Environment: OS X 10.4.7 Reporter: Doug Cook If one crawls, say, http://www.metropoleparis.com/2000/501/ with the Javascript parser enabled, one gets outlinks of the form: 2006-09-08 16:55:06,301 DEBUG js.JSParseFilter - - outlink from JS: 'http://www.metropoleparis.com/2000/501

Re: Missing pages anchor text

2006-08-31 Thread Doug Cook
the anchor text we have, it's by far the most important page feature for relevance. -doug Doug Cook wrote: Hi Stefan, Yes, you're right. The index built without deduping does not have the first instance of the problem (though of course, it's also filled with duplicates, so it has other

Should URL normalization iterate?

2006-08-30 Thread Doug Cook
Hi, I've run across a few patterns in URLs where applying a normalization puts the URL in a form matching another normalization pattern (or even the same one). But that pattern won't get executed because the patterns are applied only once. Should normalization iterate until no patterns match
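
The question above can be sketched as a fixed-point loop: reapply the whole rule set until a pass changes nothing, with a cap on the number of passes as a guard against rules that never settle. This is a minimal Python sketch, not Nutch's RegexURLNormalizer; the two rules, the function names, and the `max_passes` default are assumptions chosen for illustration.

```python
import re

# Hypothetical rules (pattern, replacement); not taken from Nutch's configuration.
RULES = [
    (re.compile(r"/\./"), "/"),         # collapse "/./" into "/"
    (re.compile(r"(?<!:)/{2,}"), "/"),  # collapse repeated slashes, except after "scheme:"
]


def normalize_once(url: str) -> str:
    """Apply every rule exactly once, in order."""
    for pattern, replacement in RULES:
        url = pattern.sub(replacement, url)
    return url


def normalize(url: str, max_passes: int = 10) -> str:
    """Repeat full passes until the URL stops changing or the cap is reached."""
    for _ in range(max_passes):
        result = normalize_once(url)
        if result == url:
            return url
        url = result
    return url


print(normalize_once("http://example.com/a/././b"))
print(normalize("http://example.com/a/././b"))
```

A single pass leaves one "/./" behind here because regex substitution does not rescan text it has already consumed, which is the kind of case where applying the patterns only once falls short.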

Re: Missing pages anchor text

2006-08-29 Thread Doug Cook
So maybe we should think about a general solution to the forwarding problem. Greetings, Stefan On 28.08.2006 at 11:33, Doug Cook wrote: Hi, folks, I have just started digging into relevance issues with Nutch, and I'm running into some mysteries. Before I dig too deep, I wanted

Missing pages anchor text

2006-08-28 Thread Doug Cook
Hi, folks, I have just started digging into relevance issues with Nutch, and I'm running into some mysteries. Before I dig too deep, I wanted to check to see if these were known issues (a quick search of the email archives and of JIRA didn't turn up anything). I'm running 0.8 with a handful of