[Nutch-dev] [jira] Commented: (NUTCH-25) needs 'character encoding' detector

2007-08-01 Thread Doug Cook (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12517066 ] Doug Cook commented on NUTCH-25: Cool -- will take a look at the new patch (and will try to make stripGarbage more

[Nutch-dev] [jira] Commented: (NUTCH-25) needs 'character encoding' detector

2007-07-25 Thread Doug Cook (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12515461 ] Doug Cook commented on NUTCH-25: > Can you provide a link on icu4j's language detection? http://www.icu-pro

[Nutch-dev] [jira] Commented: (NUTCH-25) needs 'character encoding' detector

2007-07-25 Thread Doug Cook (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12515342 ] Doug Cook commented on NUTCH-25: Doğacan, Thanks for the quick feedback. > * EncodingDetector api is way too o

[Nutch-dev] [jira] Updated: (NUTCH-25) needs 'character encoding' detector

2007-07-24 Thread Doug Cook (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Doug Cook updated NUTCH-25: --- Attachment: EncodingDetector.java I cleaned up EncodingDetector a little; here's a functionally identical

[Nutch-dev] [jira] Updated: (NUTCH-25) needs 'character encoding' detector

2007-07-24 Thread Doug Cook (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Doug Cook updated NUTCH-25: --- Attachment: (was: EncodingDetector.java) > needs 'character encoding'

[Nutch-dev] [jira] Updated: (NUTCH-25) needs 'character encoding' detector

2007-07-24 Thread Doug Cook (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Doug Cook updated NUTCH-25: --- Attachment: EncodingDetector.java patch > needs 'character encoding'

[Nutch-dev] [jira] Commented: (NUTCH-25) needs 'character encoding' detector

2007-07-24 Thread Doug Cook (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12515026 ] Doug Cook commented on NUTCH-25: OK, I've got more data, and a proposed solution. I created a test set with a n

[Nutch-dev] [jira] Commented: (NUTCH-25) needs 'character encoding' detector

2007-07-21 Thread Doug Cook (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12514438 ] Doug Cook commented on NUTCH-25: As far as the problem cases, I'm running a test now on my test DB (the ~60K doc

[Nutch-dev] [jira] Commented: (NUTCH-25) needs 'character encoding' detector

2007-07-21 Thread Doug Cook (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12514426 ] Doug Cook commented on NUTCH-25: Not sure where this belongs architecturally and aesthetically -- will think about

[Nutch-dev] [jira] Commented: (NUTCH-25) needs 'character encoding' detector

2007-07-20 Thread Doug Cook (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12514382 ] Doug Cook commented on NUTCH-25: Oops, spoke too soon. On running a more extensive test, I saw quite a few

[Nutch-dev] [jira] Commented: (NUTCH-25) needs 'character encoding' detector

2007-07-20 Thread Doug Cook (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12514377 ] Doug Cook commented on NUTCH-25: I should also add that a significant number of the URLs seem to have been fixed by

[Nutch-dev] [jira] Commented: (NUTCH-25) needs 'character encoding' detector

2007-07-20 Thread Doug Cook (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12514375 ] Doug Cook commented on NUTCH-25: Hi, Doğacan. My sincere apologies for the slow response, especially given the

[Nutch-dev] [jira] Commented: (NUTCH-25) needs 'character encoding' detector

2007-05-22 Thread Doug Cook (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12498041 ] Doug Cook commented on NUTCH-25: Thanks! I'll take a look at your proposed patch... (that was fast! ask and ye

[Nutch-dev] [jira] Commented: (NUTCH-25) needs 'character encoding' detector

2007-05-21 Thread Doug Cook (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12497507 ] Doug Cook commented on NUTCH-25: We might want to think about raising the priority of this. I've seen enc

[Nutch-dev] [jira] Commented: (NUTCH-353) pages that serverside forwards will be refetched every time

2007-01-20 Thread Doug Cook (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12466284 ] Doug Cook commented on NUTCH-353: - I have a local fix for this problem (partly Paul Gauthier's work, partly

[Nutch-dev] [jira] Commented: (NUTCH-416) CrawlDatum status and CrawlDbReducer refactoring

2006-12-20 Thread Doug Cook (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-416?page=comments#action_12460080 ] Doug Cook commented on NUTCH-416: - You may also want to make the status codes ORed values, so that, for example, all of the various kinds of failure all have a
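
The suggestion above is cut off in this listing, so the following is only a minimal sketch of one way to read it: every failure status carries a shared bit, so a single mask test covers all failure kinds. The class name, constant names, and bit values are hypothetical and are not Nutch's actual CrawlDatum codes.

```java
// Sketch only: hypothetical status codes, not the real CrawlDatum values.
final class StatusFlagsSketch {
    // Shared bit that marks any kind of failure (assumption).
    static final int FAILURE = 0x80;

    // Individual failure kinds are ORed with the shared bit.
    static final int FAIL_TIMEOUT = FAILURE | 0x01;
    static final int FAIL_NOT_FOUND = FAILURE | 0x02;

    // A non-failure status does not carry the shared bit.
    static final int FETCH_SUCCESS = 0x10;

    // One mask test answers "is this any kind of failure?".
    static boolean isFailure(int status) {
        return (status & FAILURE) != 0;
    }

    public static void main(String[] args) {
        System.out.println(isFailure(FAIL_TIMEOUT));   // true
        System.out.println(isFailure(FAIL_NOT_FOUND)); // true
        System.out.println(isFailure(FETCH_SUCCESS));  // false
    }
}
```

With this layout, code that only cares whether a fetch failed does not need to enumerate each failure code, and new failure kinds can be added without touching the callers.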

Re: [Nutch-dev] Should URL normalization iterate?

2006-11-29 Thread Doug Cook
f rewrite rules you > have. So if you have 10 rules, you iterate on all 10 rules 10 times. > That > will cover the case where your rules 'chain' in a 10 step sequence. Sure > it's an edge case to do that, but I can see rule sets where you construct > 3-step chains (lik

[Nutch-dev] [jira] Updated: (NUTCH-410) Faster RegexNormalize with more features

2006-11-29 Thread Doug Cook (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-410?page=all ] Doug Cook updated NUTCH-410: Attachment: betterRegexNorm.patch > Faster RegexNormalize with more features > > > Ke

[Nutch-dev] [jira] Created: (NUTCH-410) Faster RegexNormalize with more features

2006-11-29 Thread Doug Cook (JIRA)
: 0.8 Environment: Tested on MacOS X 10.4.7/10.4.8 Reporter: Doug Cook Priority: Minor The patch associated with this is backwards-compatible and has several improvements over the stock 0.8 RegexURLNormalizer: 1) About a 34% performance improvement, from only

[Nutch-dev] [jira] Commented: (NUTCH-409) Add "short circuit" notion to filters to speedup mixed site/subsite crawling

2006-11-25 Thread Doug Cook (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-409?page=comments#action_12452617 ] Doug Cook commented on NUTCH-409: - I should also note that this approach is still not optimal (though it is faster for my usage pattern). I'm still runnin

Re: [Nutch-dev] More fetcher speed increases

2006-11-25 Thread Doug Cook
efixURLFilter and AutomatonURLFilter combination > sounds interesting. Could you please attach the patch to JIRA? Thanks > > - Scott > > On 11/17/06, Doug Cook <[EMAIL PROTECTED]> wrote: >> >> Hi, folks, >> >> I, too, was slowed down by reduce operations in fe

[Nutch-dev] [jira] Updated: (NUTCH-409) Add "short circuit" notion to filters to speedup mixed site/subsite crawling

2006-11-25 Thread Doug Cook (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-409?page=all ] Doug Cook updated NUTCH-409: Attachment: shortcircuit.patch > Add "short circuit" notion to filters to speedup mixed site/sub

[Nutch-dev] [jira] Created: (NUTCH-409) Add "short circuit" notion to filters to speedup mixed site/subsite crawling

2006-11-25 Thread Doug Cook (JIRA)
ect: Nutch Issue Type: Improvement Components: fetcher Affects Versions: 0.8 Reporter: Doug Cook Priority: Minor In the case where one is crawling a mixture of sites and sub-sites, the prefix matcher can match the sites quite quickly, but either the regex or
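
The issue text is truncated here, so the following is a rough sketch of what a "short circuit" in a filter chain could look like, not the content of shortcircuit.patch. It assumes a cheap prefix filter that can accept a URL outright, so the slower regex filter is never consulted for whole-site matches. The `Verdict` enum, the class and method names, and the example hosts are all hypothetical and are not part of Nutch's URLFilter API.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.regex.Pattern;

// Sketch only: hypothetical types, not Nutch's URLFilter interface.
final class ShortCircuitFilterSketch {
    // ACCEPT_NOW ends the chain early; CONTINUE defers to the next filter.
    enum Verdict { ACCEPT_NOW, REJECT, CONTINUE }

    // Cheap check: a URL under a known site prefix is accepted immediately.
    static Function<String, Verdict> prefixFilter(List<String> prefixes) {
        return url -> {
            for (String prefix : prefixes) {
                if (url.startsWith(prefix)) {
                    return Verdict.ACCEPT_NOW;
                }
            }
            return Verdict.CONTINUE;
        };
    }

    // Slower check: only reached when no earlier filter short-circuited.
    static Function<String, Verdict> regexFilter(Pattern allowed) {
        return url -> allowed.matcher(url).find() ? Verdict.CONTINUE : Verdict.REJECT;
    }

    // Runs filters in order and stops at the first decisive verdict.
    static boolean accept(String url, List<Function<String, Verdict>> chain) {
        for (Function<String, Verdict> filter : chain) {
            Verdict verdict = filter.apply(url);
            if (verdict == Verdict.ACCEPT_NOW) {
                return true;
            }
            if (verdict == Verdict.REJECT) {
                return false;
            }
        }
        return true; // nothing rejected the URL
    }

    public static void main(String[] args) {
        List<Function<String, Verdict>> chain = Arrays.asList(
            prefixFilter(Arrays.asList("http://site.example/")),
            regexFilter(Pattern.compile("^http://other\\.example/docs/")));
        System.out.println(accept("http://site.example/any/page", chain));
        System.out.println(accept("http://other.example/docs/a", chain));
        System.out.println(accept("http://other.example/private", chain));
    }
}
```

The ordering matters in this sketch: the prefix filter comes first so that whole-site URLs skip the regex pass, and only sub-site URLs pay for the pattern match.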

[Nutch-dev] More fetcher speed increases

2006-11-16 Thread Doug Cook
Hi, folks, I, too, was slowed down by reduce operations in fetch. Some benchmarking showed that in my case, the limiting operation was filtering (though a distant second was the time spent calculating Levenshtein distances, presumably part of the spellchecking that Sami just removed to speed thin

Re: [Nutch-dev] need help to speed up map-reduce

2006-11-06 Thread Doug Cook
I've been planning to spend some time looking at this, but haven't gotten round to it yet -- I see the same (serious) performance problems on a single machine setup -- reduce takes quite a bit longer than the fetch (map) operation in my case, and this is on a very fast 4-CPU machine with a ton of

[Nutch-dev] [jira] Created: (NUTCH-396) mergesegs sorts URLs, making segments useless for subsequent fetch

2006-11-03 Thread Doug Cook (JIRA)
Components: generator Affects Versions: 0.8 Environment: Mac OS X 10.4.7 Reporter: Doug Cook Priority: Minor Mergesegs leaves the output segment in URL-sorted order. This is a problem if the segment was just generated and not yet fetched - the fetcher likes

Re: [Nutch-dev] [jira] Commented: (NUTCH-353) pages that serverside forwards will be refetched every time

2006-10-03 Thread Doug Cook
In this case, the site uses the "right" kind of redirect. Unfortunately, as you point out, it's not at all clear that we can rely on sites correctly choosing the type of redirect (I tried a few sites and most were 302s, even in cases where the redirect was to the permanent, canonical version of t

[Nutch-dev] [jira] Commented: (NUTCH-353) pages that serverside forwards will be refetched every time

2006-10-02 Thread Doug Cook (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-353?page=comments#action_12439248 ] Doug Cook commented on NUTCH-353: - This is definitely a complex issue. It is also high priority -- issues with redirects and duplicates, which URL is chosen, and

[Nutch-dev] [jira] Commented: (NUTCH-364) Javascript parser creates some fairly bogus URLs

2006-09-19 Thread Doug Cook (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-364?page=comments#action_12435945 ] Doug Cook commented on NUTCH-364: - I've been looking into this a little bit. I see two problems: (1) The current "two pass" heuristic URL-like strin

[Nutch-dev] [jira] Commented: (NUTCH-365) Flexible URL normalization

2006-09-18 Thread Doug Cook (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-365?page=comments#action_12435449 ] Doug Cook commented on NUTCH-365: - It still seems to me that iterative normalization is useful and not risky. By definition, a "normalizer" is somet

[Nutch-dev] [jira] Commented: (NUTCH-365) Flexible URL normalization

2006-09-09 Thread Doug Cook (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-365?page=comments#action_12433617 ] Doug Cook commented on NUTCH-365: - PS. I like your idea of combining URL filters & normalization. In a sense, a "filter" is just a normalizer t

[Nutch-dev] [jira] Commented: (NUTCH-365) Flexible URL normalization

2006-09-09 Thread Doug Cook (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-365?page=comments#action_12433613 ] Doug Cook commented on NUTCH-365: - Hi, Andrzej. Sounds very cool. Haven't had a chance to check out the patch yet to see if it supports this, but attach

[Nutch-dev] [jira] Created: (NUTCH-364) Javascript parser creates some fairly bogus URLs

2006-09-08 Thread Doug Cook (JIRA)
Environment: OS X 10.4.7 Reporter: Doug Cook If one crawls, say, http://www.metropoleparis.com/2000/501/ with the Javascript parser enabled, one gets outlinks of the form: 2006-09-08 16:55:06,301 DEBUG js.JSParseFilter - - outlink from JS: 'http://www.metropoleparis.com/2000/501/&

[Nutch-dev] [jira] Created: (NUTCH-363) Fetcher normalizes everything at least twice

2006-09-08 Thread Doug Cook (JIRA)
: 0.8 Environment: OS X 10.4.7 Reporter: Doug Cook Priority: Minor New links are normalized twice by the fetcher: First in DOMContentUtils.getOutlinks, where the constructor Outlink(url.toString(), linkText.toString().trim(), conf) normalizes the URL. The second

Re: [Nutch-dev] Missing pages & anchor text

2006-08-31 Thread Doug Cook
Hi, Andrzej. Thanks for the quick response! > Andrzej Bialecki wrote: > Doug Cook wrote: > > I'm thinking I should file issues on the following- > > > > 1. The scoring bug. Not sure what to file here, since such things are > hard > > to pi

Re: [Nutch-dev] Missing pages & anchor text

2006-08-31 Thread Doug Cook
what happens to the inbound anchor text. We should work very very hard to keep all the anchor text we have, it's by far the most important page feature for relevance. -doug Doug Cook wrote: > > Hi Stefan, > > Yes, you're right. The index built without deduping does not h

[Nutch-dev] Should URL normalization iterate?

2006-08-30 Thread Doug Cook
Hi, I've run across a few patterns in URLs where applying a normalization puts the URL in a form matching another normalization pattern (or even the same one). But that pattern won't get executed because the patterns are applied only once. Should normalization iterate until no patterns match (wi
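
The question above can be sketched as follows: a minimal illustration, assuming regex-based rules, of applying the rule set repeatedly until the URL stops changing or a pass limit is reached. The class name, the two rules, and the `maxPasses` parameter are hypothetical and are not taken from Nutch's RegexURLNormalizer.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

// Sketch only: hypothetical rules, not the stock Nutch normalizer.
final class IterativeNormalizerSketch {
    // Rules are applied in insertion order on every pass.
    private static final Map<Pattern, String> RULES = new LinkedHashMap<>();
    static {
        RULES.put(Pattern.compile("/\\./"), "/");        // drop "/./" segments
        RULES.put(Pattern.compile("(?<!:)/{2,}"), "/"); // collapse "//" except after "scheme:"
    }

    // One pass: each rule is applied exactly once, as in a non-iterating normalizer.
    static String normalizeOnce(String url) {
        String out = url;
        for (Map.Entry<Pattern, String> rule : RULES.entrySet()) {
            out = rule.getKey().matcher(out).replaceAll(rule.getValue());
        }
        return out;
    }

    // Repeats passes until a fixed point is reached or maxPasses is exhausted.
    static String normalize(String url, int maxPasses) {
        String current = url;
        for (int pass = 0; pass < maxPasses; pass++) {
            String next = normalizeOnce(current);
            if (next.equals(current)) {
                return current; // no rule changed anything
            }
            current = next;
        }
        return current; // pass limit reached; guards against rules that never settle
    }

    public static void main(String[] args) {
        String url = "http://example.com/a/././b";
        // A single pass leaves one "/./" behind because regex matches do not overlap.
        System.out.println(normalizeOnce(url));  // http://example.com/a/./b
        System.out.println(normalize(url, 10));  // http://example.com/a/b
    }
}
```

The pass limit is the design choice worth noting: it bounds the work even if a rule set contains two patterns that keep rewriting each other's output.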

Re: [Nutch-dev] Missing pages & anchor text

2006-08-29 Thread Doug Cook
he forwarding problem also in another case. > https://issues.apache.org/jira/browse/NUTCH-353 > So maybe we should think about a general solution to the forwarding > problem. > > Greetings, > Stefan > > > On 28.08.2006 at 11:33, Doug Cook wrote: > >> >>

[Nutch-dev] Missing pages & anchor text

2006-08-28 Thread Doug Cook
Hi, folks, I have just started digging into relevance issues with Nutch, and I'm running into some mysteries. Before I dig too deep, I wanted to check to see if these were known issues (a quick search of the email archives and of JIRA didn't turn up anything). I'm running 0.8 with a handful of pa