[
https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12517066
]
Doug Cook commented on NUTCH-25:
Cool -- will take a look at the new patch (and will try to make stripGarbage
more
[
https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12515461
]
Doug Cook commented on NUTCH-25:
> Can you provide a link on icu4j's language detection?
http://www.icu-pro
[
https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12515342
]
Doug Cook commented on NUTCH-25:
Doğacan,
Thanks for the quick feedback.
> * EncodingDetector api is way too o
[
https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Doug Cook updated NUTCH-25:
---
Attachment: EncodingDetector.java
I cleaned up EncodingDetector a little; here's a functionally identical
[
https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Doug Cook updated NUTCH-25:
---
Attachment: (was: EncodingDetector.java)
> needs 'character encoding
[
https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Doug Cook updated NUTCH-25:
---
Attachment: EncodingDetector.java
patch
> needs 'character encoding
[
https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12515026
]
Doug Cook commented on NUTCH-25:
OK, I've got more data, and a proposed solution.
I created a test set with a n
[
https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12514438
]
Doug Cook commented on NUTCH-25:
As far as the problem cases, I'm running a test now on my test DB (the ~60K doc
[
https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12514426
]
Doug Cook commented on NUTCH-25:
Not sure where this belongs architecturally and aesthetically -- will think
about
[
https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12514382
]
Doug Cook commented on NUTCH-25:
Oops, spoke too soon. On running a more extensive test, I saw quite a few
[
https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12514377
]
Doug Cook commented on NUTCH-25:
I should also add that a significant number of the URLs seem to have been fixed
by
[
https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12514375
]
Doug Cook commented on NUTCH-25:
Hi, Doğacan.
My sincere apologies for the slow response, especially given the
[
https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12498041
]
Doug Cook commented on NUTCH-25:
Thanks! I'll take a look at your proposed patch... (that was fast! ask and ye
[
https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12497507
]
Doug Cook commented on NUTCH-25:
We might want to think about raising the priority of this. I've seen enc
[
https://issues.apache.org/jira/browse/NUTCH-353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12466284
]
Doug Cook commented on NUTCH-353:
-
I have a local fix for this problem (partly Paul Gauthier's work, partly
[
http://issues.apache.org/jira/browse/NUTCH-416?page=comments#action_12460080 ]
Doug Cook commented on NUTCH-416:
-
You may also want to make the status codes ORed values, so that, for example,
all of the various kinds of failure all have a
f rewrite rules you
> have. So if you have 10 rules, you iterate on all 10 rules 10 times.
> That
> will cover the case where your rules 'chain' in a 10 step sequence. Sure
> it's an edge case to do that, but I can see rule sets where you construct
> 3-step chains (lik
[ http://issues.apache.org/jira/browse/NUTCH-410?page=all ]
Doug Cook updated NUTCH-410:
Attachment: betterRegexNorm.patch
> Faster RegexNormalize with more features
>
>
> Ke
: 0.8
Environment: Tested on MacOS X 10.4.7/10.4.8
Reporter: Doug Cook
Priority: Minor
The patch associated with this is backwards-compatible and has several
improvements over the stock 0.8 RegexURLNormalizer:
1) About a 34% performance improvement, from only
[
http://issues.apache.org/jira/browse/NUTCH-409?page=comments#action_12452617 ]
Doug Cook commented on NUTCH-409:
-
I should also note that this approach is still not optimal (though it is faster
for my usage pattern). I'm still runnin
efixURLFilter and AutomatonURLFilter combination
> sounds interesting. Could you please attach the patch to JIRA? Thanks
>
> - Scott
>
> On 11/17/06, Doug Cook <[EMAIL PROTECTED]> wrote:
>>
>> Hi, folks,
>>
>> I, too, was slowed down by reduce operations in fe
[ http://issues.apache.org/jira/browse/NUTCH-409?page=all ]
Doug Cook updated NUTCH-409:
Attachment: shortcircuit.patch
> Add "short circuit" notion to filters to speedup mixed site/sub
ect: Nutch
Issue Type: Improvement
Components: fetcher
Affects Versions: 0.8
Reporter: Doug Cook
Priority: Minor
In the case where one is crawling a mixture of sites and sub-sites, the prefix
matcher can match the sites quite quickly, but either the regex or
Hi, folks,
I, too, was slowed down by reduce operations in fetch. Some benchmarking
showed that in my case, the limiting operation was filtering (though a
distant second was the time spent calculating Levenshtein distances,
presumably part of the spellchecking that Sami just removed to speed thin
I've been planning to spend some time looking at this, but haven't gotten
round to it yet -- I see the same (serious) performance problems on a single
machine setup -- reduce takes quite a bit longer than the fetch (map)
operation in my case, and this is on a very fast 4-CPU machine with a ton of
Components: generator
Affects Versions: 0.8
Environment: Mac OS X 10.4.7
Reporter: Doug Cook
Priority: Minor
Mergesegs leaves the output segment in URL-sorted order.
This is a problem if the segment was just generated and not yet fetched - the
fetcher likes
In this case, the site uses the "right" kind of redirect. Unfortunately, as
you point out, it's not at all clear that we can rely on sites correctly
choosing the type of redirect (I tried a few sites and most were 302s, even
in cases where the redirect was to the permanent, canonical version of t
[
http://issues.apache.org/jira/browse/NUTCH-353?page=comments#action_12439248 ]
Doug Cook commented on NUTCH-353:
-
This is definitely a complex issue. It is also high priority -- issues with
redirects and duplicates, which URL is chosen, and
[
http://issues.apache.org/jira/browse/NUTCH-364?page=comments#action_12435945 ]
Doug Cook commented on NUTCH-364:
-
I've been looking into this a little bit. I see two problems:
(1) The current "two pass" heuristic URL-like strin
[
http://issues.apache.org/jira/browse/NUTCH-365?page=comments#action_12435449 ]
Doug Cook commented on NUTCH-365:
-
It still seems to me that iterative normalization is useful and not risky. By
definition, a "normalizer" is somet
[
http://issues.apache.org/jira/browse/NUTCH-365?page=comments#action_12433617 ]
Doug Cook commented on NUTCH-365:
-
PS. I like your idea of combining URL filters & normalization. In a sense, a
"filter" is just a normalizer t
[
http://issues.apache.org/jira/browse/NUTCH-365?page=comments#action_12433613 ]
Doug Cook commented on NUTCH-365:
-
Hi, Andrzej.
Sounds very cool. Haven't had a chance to check out the patch yet to see if it
supports this, but attach
Environment: OS X 10.4.7
Reporter: Doug Cook
If one crawls, say,
http://www.metropoleparis.com/2000/501/
with the Javascript parser enabled, one gets outlinks of the form:
2006-09-08 16:55:06,301 DEBUG js.JSParseFilter - - outlink from JS:
'http://www.metropoleparis.com/2000/501/&
: 0.8
Environment: OS X 10.4.7
Reporter: Doug Cook
Priority: Minor
New links are normalized twice by the fetcher:
First in DOMContentUtils.getOutlinks, where the constructor
Outlink(url.toString(), linkText.toString().trim(), conf) normalizes the URL.
The second
Hi, Andrzej.
Thanks for the quick response!
> Andrzej Bialecki wrote:
> Doug Cook wrote:
> > I'm thinking I should file issues on the following-
> >
> > 1. The scoring bug. Not sure what to file here, since such things are
> hard
> > to pi
what
happens to the inbound anchor text. We should work very very hard to keep
all the anchor text we have, it's by far the most important page feature for
relevance.
-doug
Doug Cook wrote:
>
> Hi Stefan,
>
> Yes, you're right. The index built without deduping does not h
Hi,
I've run across a few patterns in URLs where applying a normalization puts
the URL in a form matching another normalization pattern (or even the same
one). But that pattern won't get executed because the patterns are applied
only once.
Should normalization iterate until no patterns match (wi
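
[Illustration of the question in the snippet above, not taken from the original message.] A minimal sketch of what iterating until no pattern changes the URL could look like. It is not Nutch's RegexURLNormalizer; the class name IterativeNormalizerSketch, the Rule type, and the maxPasses cap are all hypothetical. The cap is an assumption added here so the loop always terminates, even if two rules keep undoing each other.

```java
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical sketch: apply regex rewrite rules repeatedly until the URL
// stops changing (a fixed point) or an assumed pass limit is reached.
final class IterativeNormalizerSketch {

    /** One illustrative rewrite rule: a compiled regex and its replacement. */
    static final class Rule {
        final Pattern pattern;
        final String replacement;

        Rule(String regex, String replacement) {
            this.pattern = Pattern.compile(regex);
            this.replacement = replacement;
        }
    }

    /**
     * Runs every rule once per pass, in order. Stops early when a full pass
     * leaves the URL unchanged; otherwise stops after maxPasses passes.
     */
    static String normalize(String url, List<Rule> rules, int maxPasses) {
        String current = url;
        for (int pass = 0; pass < maxPasses; pass++) {
            String before = current;
            for (Rule rule : rules) {
                current = rule.pattern.matcher(current).replaceAll(rule.replacement);
            }
            if (current.equals(before)) {
                break; // no rule matched in a way that changed the URL
            }
        }
        return current;
    }
}
```

With a single rule that collapses one "/segment/../" pair per match, one pass over http://example.com/a/b/../../c leaves a second pair behind that the same rule would match; a further pass removes it. That is the "applied only once" situation described above.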
he forwarding problem also in a other case.
> https://issues.apache.org/jira/browse/NUTCH-353
> So may be we should think about a general solution of the forwarding
> problem.
>
> Greetings,
> Stefan
>
>
> Am 28.08.2006 um 11:33 schrieb Doug Cook:
>
>>
>>
Hi, folks,
I have just started digging into relevance issues with Nutch, and I'm
running into some mysteries. Before I dig too deep, I wanted to check to see
if these were known issues (a quick search of the email archives and of JIRA
didn't turn up anything). I'm running 0.8 with a handful of pa
39 matches