[ https://issues.apache.org/jira/browse/NUTCH-1062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Lewis John McGibbney updated NUTCH-1062:
----------------------------------------
    Fix Version/s:     (was: 2.4)
                   2.3.1

> Migrate BasicURLNormalizer from Apache ORO to java.util.regex
> -------------------------------------------------------------
>
>                 Key: NUTCH-1062
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1062
>             Project: Nutch
>          Issue Type: Improvement
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>             Fix For: 1.10, 2.3.1
>
>
> Issue for migration from ORO to j.u.regex. There is a small problem here. I
> began the migration mostly because of the double slash issue, using lookbehind,
> which is not supported in ORO. This was to prevent the URL scheme from being
> reduced to one slash. The current Basic URL Normalizer has this problem
> built-in!
> {code}
>     // this pattern tries to find spots like "xx//yy" in the url,
>     // which could be replaced by a "/"
>     adjacentSlashRule = new Rule();
>     adjacentSlashRule.pattern = (Perl5Pattern)
>       compiler.compile("/{2,}", Perl5Compiler.READ_ONLY_MASK);
>     adjacentSlashRule.substitution = new Perl5Substitution("/");
> {code}
> But it provides the wrong solution, as it touches the scheme as well. What to do?
> Migrate to j.u.regex and keep this `feature` intact?
> Edit: reading further, it looks like this is fixed at a later stage. A slash
> is added for the URI schemes http and ftp.
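
For illustration only, here is a minimal sketch of how the double-slash rule might look with java.util.regex and a negative lookbehind. It is not the committed Nutch patch: the class name `SlashCollapseSketch`, the method names, and the exact pattern `(?<!:)/{2,}` are assumptions made for this example.

```java
import java.util.regex.Pattern;

// Hypothetical sketch, not the actual BasicURLNormalizer code.
class SlashCollapseSketch {

    // Same expression as the ORO rule quoted above: any run of 2+ slashes.
    private static final Pattern NAIVE = Pattern.compile("/{2,}");

    // Assumed alternative: a run of 2+ slashes that is NOT directly
    // preceded by ':' (negative lookbehind), so "scheme://" is left alone.
    private static final Pattern GUARDED = Pattern.compile("(?<!:)/{2,}");

    static String collapseNaive(String url) {
        return NAIVE.matcher(url).replaceAll("/");
    }

    static String collapseGuarded(String url) {
        return GUARDED.matcher(url).replaceAll("/");
    }

    public static void main(String[] args) {
        String url = "http://example.com//a///b";
        // prints "http:/example.com/a/b" (scheme separator damaged)
        System.out.println(collapseNaive(url));
        // prints "http://example.com/a/b" (scheme separator preserved)
        System.out.println(collapseGuarded(url));
    }
}
```

Note that this guard only checks the character immediately before the run of slashes. A URL with three slashes after the scheme, such as `file:///tmp`, would still be shortened to `file://tmp`, so a real rule would likely need additional handling for such cases.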