[ https://issues.apache.org/jira/browse/NUTCH-1990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Nagel updated NUTCH-1990:
-----------------------------------
    Attachment: NUTCH-1990-v1.patch

Uuuh, a lot of garbage :(

I've also run the test after giving BasicURLNormalizer a main() method:
* found another bug in the current version: "http://107jamz.com/registration/?referer=http://107jamz.com" loses the double slash in the query part. That's because the slash and dot segment normalization is currently run on the part returned by url.getFile(), which includes the query. It should be run only on the part returned by getPath(). But that's fixed by the new version.
* the trial is 50% slower using Julien's test set. But that's expected because only a small fraction of the URLs contain paths with dot segments or double slashes.
* but after a check is added to avoid needless work, it's as fast as previously (maybe slightly faster):
  0:49.78 (before), 1:03.11 (trial), 0:45.49 (patch v1)

> Use URI.normalise() in BasicURLNormalizer
> -----------------------------------------
>
>                 Key: NUTCH-1990
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1990
>             Project: Nutch
>          Issue Type: Improvement
>    Affects Versions: 1.9
>            Reporter: Julien Nioche
>            Assignee: Julien Nioche
>         Attachments: NUTCH-1990-trial1.patch, NUTCH-1990-v1.patch
>
>
> One of the things that
> [BasicURLNormalizer|https://github.com/apache/nutch/blob/trunk/src/plugin/urlnormalizer-basic/src/java/org/apache/nutch/net/urlnormalizer/basic/BasicURLNormalizer.java]
> does is to remove unnecessary dot segments in the path.
> Instead of implementing the logic ourselves with some antiquated regex
> library, we should simply use
> [http://docs.oracle.com/javase/7/docs/api/java/net/URI.html#normalize()]
> which does the same and is probably more efficient.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
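
For illustration only, the two points from the comment (normalize the path but not the query, and skip URLs that have nothing to normalize) could be sketched roughly as follows. This is not the code from NUTCH-1990-v1.patch: the class and method names are hypothetical, and it works on java.net.URI directly (getRawPath() playing the role of getPath()) rather than on java.net.URL as BasicURLNormalizer does.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical sketch, not the actual Nutch patch.
public class PathNormalizationSketch {

    // Conservative pre-check on the path only: true if it might contain
    // dot segments or doubled slashes. A false positive (e.g. "/.hidden")
    // merely costs one extra normalize() call.
    static boolean mayNeedNormalization(String path) {
        return path != null && (path.contains("/.") || path.contains("//"));
    }

    // Normalizes the path component and leaves the query string untouched.
    // Returns the input unchanged if it cannot be parsed as a URI or if the
    // path has nothing to normalize.
    static String normalizePathOnly(String urlString) {
        try {
            URI uri = new URI(urlString);
            if (!mayNeedNormalization(uri.getRawPath())) {
                return urlString; // the common case: no work needed
            }
            // URI.normalize() rewrites only the path of a hierarchical URI.
            return uri.normalize().toString();
        } catch (URISyntaxException e) {
            return urlString;
        }
    }

    public static void main(String[] args) {
        // The "//" here is in the query, so the pre-check skips this URL.
        System.out.println(normalizePathOnly(
            "http://107jamz.com/registration/?referer=http://107jamz.com"));
        // Dot segments in the path are removed; the query is kept as-is.
        System.out.println(normalizePathOnly(
            "http://example.com/a/./b/../c?ref=http://example.com"));
    }
}
```

Because the pre-check looks at the path alone, the first URL is returned unchanged even though its query contains "//", which is the case the comment reports as broken when normalization is applied to getFile().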