[ 
https://issues.apache.org/jira/browse/NUTCH-1062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1062:
----------------------------------------
    Fix Version/s:     (was: 2.4)
                   2.3.1

> Migrate BasicURLNormalizer from Apache ORO to java.util.regex
> -------------------------------------------------------------
>
>                 Key: NUTCH-1062
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1062
>             Project: Nutch
>          Issue Type: Improvement
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>             Fix For: 1.10, 2.3.1
>
>
> This issue tracks the migration from ORO to j.u.regex. There is a small
> problem, though. I began the migration mostly because of the double-slash
> issue: fixing it needs lookbehind, which ORO does not support. The lookbehind
> was meant to prevent the "//" after the URL scheme from being reduced to one
> slash. The current BasicURLNormalizer has this problem built in!
> {code}
>         // this pattern tries to find spots like "xx//yy" in the url,
>         // which could be replaced by a "/"
>         adjacentSlashRule = new Rule();
>         adjacentSlashRule.pattern = (Perl5Pattern)      
>           compiler.compile("/{2,}", Perl5Compiler.READ_ONLY_MASK);     
>         adjacentSlashRule.substitution = new Perl5Substitution("/");
> {code}
> But this gives the wrong result, because it touches the scheme as well. What
> to do? Migrate to j.u.regex and keep this `feature` intact?
> Edit: reading further, it looks like this is fixed at a later stage. A slash
> is added for the URI schemes http and ftp.
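
For illustration, here is a minimal sketch of how a j.u.regex lookbehind could collapse adjacent slashes while leaving the "//" after the scheme alone. This is not the patch attached to the issue: the class name, the method names, and the `(?<!:)` pattern are assumptions made for this example. The first method mirrors the ORO rule quoted above so the two behaviours can be compared.

```java
import java.util.regex.Pattern;

// Hypothetical sketch, not the actual BasicURLNormalizer code.
class AdjacentSlashSketch {

    // Same expression as the ORO rule: any run of two or more slashes.
    private static final Pattern NAIVE = Pattern.compile("/{2,}");

    // Assumed alternative: a run of slashes that is not directly preceded
    // by ':' (negative lookbehind), so the "//" in "http://" never matches.
    private static final Pattern GUARDED = Pattern.compile("(?<!:)/{2,}");

    static String collapseNaive(String url) {
        return NAIVE.matcher(url).replaceAll("/");
    }

    static String collapseGuarded(String url) {
        return GUARDED.matcher(url).replaceAll("/");
    }

    public static void main(String[] args) {
        String url = "http://example.com//a///b";
        System.out.println(collapseNaive(url));   // http:/example.com/a/b
        System.out.println(collapseGuarded(url)); // http://example.com/a/b
    }
}
```

Note that the guarded pattern also skips any "://" that appears later in the URL, for example in a query parameter that carries another URL. Whether that is desirable depends on what the normalizer is supposed to do with such parameters.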



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
