[
https://issues.apache.org/jira/browse/NUTCH-419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12677704#action_12677704
]
Doug Cook commented on NUTCH-419:
-
I ran into this same problem, and spent some time
[
https://issues.apache.org/jira/browse/NUTCH-419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Doug Cook updated NUTCH-419:
Attachment: diffs
Here's a context diff. Hopefully this will work, am rusty at creating patches,
and did
[
https://issues.apache.org/jira/browse/NUTCH-566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12539146
]
Doug Cook commented on NUTCH-566:
-
Hi Doğacan.
Thanks for following up. The issue has gotten a little more
[
https://issues.apache.org/jira/browse/NUTCH-567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12535593
]
Doug Cook commented on NUTCH-567:
-
What a nice birthday present!
I will check out the fix and see how it works
Sami Siren-2 wrote:
Do you have urls of such bad content available to look at?
Thousands. Here is one:
http://www.valtravieso.com/ver_finca.phtml?idioma=1
The hrefs that have &sub in them get interpreted as the subset character
by tagsoup, and thus become broken links. With a few
[
https://issues.apache.org/jira/browse/NUTCH-436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12535272
]
Doug Cook commented on NUTCH-436:
-
It looks like NUTCH-566, and associated patch, which I recently filed
: fetcher
Affects Versions: 0.9.0, 0.8.1, 0.8
Environment: MacOS X and Linux (CentOS 4.5) both
Reporter: Doug Cook
Priority: Minor
I'm using 0.8.1, but this will affect all other versions as well.
Relative links of the form ?blah are resolved incorrectly
[
https://issues.apache.org/jira/browse/NUTCH-566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Doug Cook updated NUTCH-566:
Attachment: RelativeURL.java
Here's a static method to work around the problem.
Sun's URL class has bug
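For illustration only, here is a minimal Java sketch of one way to resolve query-only links such as ?blah. It is not the attached RelativeURL.java, whose contents are not shown here. The class and method names are hypothetical, and the sketch assumes an absolute, hierarchical base URL with an authority part. It keeps the base path and replaces only the query, which is what RFC 3986 section 5.2.2 specifies for a reference with an empty path.

```java
import java.net.URI;

/** Hypothetical helper for illustration; not the attachment from NUTCH-566. */
final class RelativeUrlSketch {
    private RelativeUrlSketch() {}

    /**
     * Resolves a link against a base URL. Assumes the base is absolute and
     * hierarchical (scheme://authority/path). Links starting with '?' keep
     * the base path and replace only the query (and fragment, if any).
     */
    static String resolve(String base, String link) {
        URI baseUri = URI.create(base);
        if (link.startsWith("?")) {
            String path = baseUri.getRawPath();
            StringBuilder sb = new StringBuilder();
            sb.append(baseUri.getScheme()).append("://");
            sb.append(baseUri.getRawAuthority());
            sb.append(path == null ? "" : path); // base path is kept unchanged
            sb.append(link);                     // new query replaces the old one
            return sb.toString();
        }
        // Everything else is delegated to the standard resolver.
        return baseUri.resolve(link).toString();
    }
}
```

Calling RelativeUrlSketch.resolve with a base of http://www.example.com/dir/page.phtml?idioma=1 and a link of ?idioma=2 keeps /dir/page.phtml and swaps the query. Note that URI.create throws IllegalArgumentException on characters that are illegal in a URI, so real crawl data would need escaping first.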
[
https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12517066
]
Doug Cook commented on NUTCH-25:
Cool -- will take a look at the new patch (and will try to make stripGarbage
more
[
https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12515342
]
Doug Cook commented on NUTCH-25:
Doğacan,
Thanks for the quick feedback.
* EncodingDetector api is way too open
[
https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12515461
]
Doug Cook commented on NUTCH-25:
Can you provide a link on icu4j's language detection?
http://www.icu-project.org
[
https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12515026
]
Doug Cook commented on NUTCH-25:
OK, I've got more data, and a proposed solution.
I created a test set with a number
[
https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Doug Cook updated NUTCH-25:
---
Attachment: EncodingDetector.java
patch
needs 'character encoding' detector
[
https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Doug Cook updated NUTCH-25:
---
Attachment: (was: EncodingDetector.java)
needs 'character encoding' detector
[
https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Doug Cook updated NUTCH-25:
---
Attachment: EncodingDetector.java
I cleaned up EncodingDetector a little; here's a functionally identical
[
https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12514426
]
Doug Cook commented on NUTCH-25:
Not sure where this belongs architecturally and aesthetically -- will think
about
[
https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12514438
]
Doug Cook commented on NUTCH-25:
As far as the problem cases, I'm running a test now on my test DB (the ~60K doc
one
[
https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12514375
]
Doug Cook commented on NUTCH-25:
Hi, Doğacan.
My sincere apologies for the slow response, especially given
[
https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12514377
]
Doug Cook commented on NUTCH-25:
I should also add that a significant number of the URLs seem to have been fixed
[
https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12514382
]
Doug Cook commented on NUTCH-25:
Oops, spoke too soon. On running a more extensive test, I saw quite a few
[
https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12498041
]
Doug Cook commented on NUTCH-25:
Thanks! I'll take a look at your proposed patch... (that was fast! ask and ye
shall
[
https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12497507
]
Doug Cook commented on NUTCH-25:
We might want to think about raising the priority of this. I've seen encoding
[
https://issues.apache.org/jira/browse/NUTCH-353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12466284
]
Doug Cook commented on NUTCH-353:
-
I have a local fix for this problem (partly Paul Gauthier's work, partly mine
[
http://issues.apache.org/jira/browse/NUTCH-416?page=comments#action_12460080 ]
Doug Cook commented on NUTCH-416:
-
You may also want to make the status codes ORed values, so that, for example,
all of the various kinds of failure all have
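As a rough sketch of the ORed-values idea, assuming it means that each kind of failure shares a common bit: the constant names and values below are hypothetical and are not taken from Nutch or from the NUTCH-416 patch.

```java
/** Hypothetical status codes; names and values are illustrative only. */
final class StatusCodesSketch {
    private StatusCodesSketch() {}

    // High bits mark the category, low bits distinguish the specific kind.
    static final int SUCCESS = 0x100;
    static final int FAILURE = 0x200;

    static final int FAILURE_TIMEOUT       = FAILURE | 0x01;
    static final int FAILURE_NOT_FOUND     = FAILURE | 0x02;
    static final int FAILURE_ROBOTS_DENIED = FAILURE | 0x03;

    /** True for any status that carries the shared FAILURE bit. */
    static boolean isFailure(int status) {
        return (status & FAILURE) != 0;
    }
}
```

With this layout a caller can test for "any failure" with a single mask instead of enumerating every failure code, while the low bits still identify the specific cause.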
: 0.8
Environment: Tested on MacOS X 10.4.7/10.4.8
Reporter: Doug Cook
Priority: Minor
The patch associated with this is backwards-compatible and has several
improvements over the stock 0.8 RegexURLNormalizer:
1) About a 34% performance improvement, from only
on all 10 rules 10 times.
That
will cover the case where your rules 'chain' in a 10 step sequence. Sure
it's an edge case to do that, but I can see rule sets where you construct
3-step chains (like swapping strings or something).
Thanks
Neal
On 8/30/06, Doug Cook [EMAIL PROTECTED] wrote
[ http://issues.apache.org/jira/browse/NUTCH-409?page=all ]
Doug Cook updated NUTCH-409:
Attachment: shortcircuit.patch
Add short circuit notion to filters to speedup mixed site/subsite crawling
and AutomatonURLFilter combination
sounds interesting. Could you please attach the patch to JIRA? Thanks
- Scott
On 11/17/06, Doug Cook [EMAIL PROTECTED] wrote:
Hi, folks,
I, too, was slowed down by reduce operations in fetch. Some benchmarking
showed that in my case, the limiting operation was filtering
[
http://issues.apache.org/jira/browse/NUTCH-409?page=comments#action_12452617 ]
Doug Cook commented on NUTCH-409:
-
I should also note that this approach is still not optimal (though it is faster
for my usage pattern). I'm still running
I've been planning to spend some time looking at this, but haven't gotten
round to it yet -- I see the same (serious) performance problems on a single
machine setup -- reduce takes quite a bit longer than the fetch (map)
operation in my case, and this is on a very fast 4-CPU machine with a ton of
Components: generator
Affects Versions: 0.8
Environment: Mac OS X 10.4.7
Reporter: Doug Cook
Priority: Minor
Mergesegs leaves the output segment in URL-sorted order.
This is a problem if the segment was just generated and not yet fetched - the
fetcher likes
In this case, the site uses the right kind of redirect. Unfortunately, as
you point out, it's not at all clear that we can rely on sites correctly
choosing the type of redirect (I tried a few sites and most were 302s, even
in cases where the redirect was to the permanent, canonical version of
[
http://issues.apache.org/jira/browse/NUTCH-353?page=comments#action_12439248 ]
Doug Cook commented on NUTCH-353:
-
This is definitely a complex issue. It is also high priority -- issues with
redirects and duplicates, which URL is chosen
[
http://issues.apache.org/jira/browse/NUTCH-364?page=comments#action_12435945 ]
Doug Cook commented on NUTCH-364:
-
I've been looking into this a little bit. I see two problems:
(1) The current two pass heuristic URL-like string extractor has
[
http://issues.apache.org/jira/browse/NUTCH-365?page=comments#action_12435449 ]
Doug Cook commented on NUTCH-365:
-
It still seems to me that iterative normalization is useful and not risky. By
definition, a normalizer is something which
[
http://issues.apache.org/jira/browse/NUTCH-365?page=comments#action_12433613 ]
Doug Cook commented on NUTCH-365:
-
Hi, Andrzej.
Sounds very cool. Haven't had a chance to check out the patch yet to see if it
supports this, but attaching
[
http://issues.apache.org/jira/browse/NUTCH-365?page=comments#action_12433617 ]
Doug Cook commented on NUTCH-365:
-
PS. I like your idea of combining URL filters and normalization. In a sense, a
filter is just a normalizer that happens
: 0.8
Environment: OS X 10.4.7
Reporter: Doug Cook
Priority: Minor
New links are normalized twice by the fetcher:
First in DOMContentUtils.getOutlinks, where the constructor
Outlink(url.toString(), linkText.toString().trim(), conf) normalizes the URL.
The second
Environment: OS X 10.4.7
Reporter: Doug Cook
If one crawls, say,
http://www.metropoleparis.com/2000/501/
with the Javascript parser enabled, one gets outlinks of the form:
2006-09-08 16:55:06,301 DEBUG js.JSParseFilter - - outlink from JS:
'http://www.metropoleparis.com/2000/501
the anchor text we have, it's by far the most important page feature for
relevance.
-doug
Doug Cook wrote:
Hi Stefan,
Yes, you're right. The index built without deduping does not have the
first instance of the problem (though of course, it's also filled with
duplicates, so it has other
Hi,
I've run across a few patterns in URLs where applying a normalization puts
the URL in a form matching another normalization pattern (or even the same
one). But that pattern won't get executed because the patterns are applied
only once.
Should normalization iterate until no patterns match
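To make the question concrete, here is a minimal Java sketch of iterating regex rules until the URL stops changing. It is not the stock RegexURLNormalizer; the class, the Rule type, and the maxPasses cap are assumptions made for illustration. The cap guards against rule sets that never settle.

```java
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical fixed-point normalizer; not Nutch's RegexURLNormalizer. */
final class IterativeNormalizerSketch {
    private IterativeNormalizerSketch() {}

    /** One substitution rule: a compiled pattern and its replacement. */
    static final class Rule {
        final Pattern pattern;
        final String replacement;

        Rule(String regex, String replacement) {
            this.pattern = Pattern.compile(regex);
            this.replacement = replacement;
        }
    }

    /**
     * Applies every rule in order, then repeats the whole pass until a pass
     * leaves the URL unchanged or maxPasses passes have run.
     */
    static String normalize(String url, List<Rule> rules, int maxPasses) {
        String current = url;
        for (int pass = 0; pass < maxPasses; pass++) {
            String next = current;
            for (Rule rule : rules) {
                next = rule.pattern.matcher(next).replaceAll(rule.replacement);
            }
            if (next.equals(current)) {
                break; // fixed point reached: no rule changed anything
            }
            current = next;
        }
        return current;
    }
}
```

For example, a single rule that collapses one "/segment/../" at a time needs two passes on http://example.com/a/b/../../c: the first pass yields http://example.com/a/../c, and only the second yields http://example.com/c. With maxPasses set to 1 the sketch behaves like a single-pass normalizer.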
So maybe we should think about a general solution to the forwarding
problem.
Greetings,
Stefan
On 28.08.2006 at 11:33, Doug Cook wrote:
Hi, folks,
I have just started digging into relevance issues with Nutch, and I'm
running into some mysteries. Before I dig too deep, I wanted
Hi, folks,
I have just started digging into relevance issues with Nutch, and I'm
running into some mysteries. Before I dig too deep, I wanted to check to see
if these were known issues (a quick search of the email archives and of JIRA
didn't turn up anything). I'm running 0.8 with a handful of
43 matches