[ 
https://issues.apache.org/jira/browse/NUTCH-1990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-1990:
-----------------------------------
    Attachment: NUTCH-1990-v1.patch

Uuuh, a lot of garbage :(  I've also run the test after giving 
BasicURLNormalizer a main() method:
* found another bug in the current version: 
"http://107jamz.com/registration/?referer=http://107jamz.com" loses the double 
slash in the query part. That's because the slash and dot segment 
normalization is currently run on the part returned by url.getFile(). It should 
be run only on the part returned by getPath(). But that's fixed by the new version.
* the trial is 50% slower using Julien's test set. But that's expected because 
only a small fraction of the URLs contain paths with dot segments or double 
slashes.
* but after a check is added to avoid needless work, it's as fast as previously 
(maybe slightly faster): 0:49.78 (before), 1:03.11 (trial), 0:45.49 (patch v1)
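
To illustrate the two points above (normalize only the getPath() part, and skip 
URLs that need no work), here is a minimal sketch. It is not the code from the 
attached patch: the class name PathNormalizeSketch and the helpers 
needsNormalization() and normalize() are made up for this example, and the 
sketch assumes the input is a syntactically valid http URL and ignores any 
fragment.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;

public class PathNormalizeSketch {

  /**
   * Cheap, conservative pre-check (hypothetical helper): true if the path
   * may contain a double slash or a dot segment. It can also return true
   * for harmless paths such as "/a/.hidden"; that only costs extra work.
   */
  public static boolean needsNormalization(String path) {
    return path.contains("//") || path.contains("/.");
  }

  /**
   * Normalizes slashes and dot segments in the path only.
   * The query is copied over unchanged; a fragment, if any, is dropped.
   */
  public static String normalize(String urlString) {
    URL url;
    try {
      url = URI.create(urlString).toURL();
    } catch (MalformedURLException e) {
      throw new IllegalArgumentException(e);
    }

    // getPath() excludes the query, getFile() would include it
    String path = url.getPath();
    if (!needsNormalization(path)) {
      return urlString; // fast path: most URLs should end up here
    }

    // Collapse runs of slashes explicitly, then let URI.normalize()
    // resolve the "." and ".." segments.
    String collapsed = path.replaceAll("/{2,}", "/");
    String normalizedPath = URI.create(collapsed).normalize().getRawPath();

    StringBuilder sb = new StringBuilder();
    sb.append(url.getProtocol()).append("://").append(url.getAuthority());
    sb.append(normalizedPath);
    if (url.getQuery() != null) {
      sb.append('?').append(url.getQuery());
    }
    return sb.toString();
  }

  public static void main(String[] args) throws Exception {
    String s = "http://107jamz.com/registration/?referer=http://107jamz.com";
    URL url = URI.create(s).toURL();
    System.out.println("getFile(): " + url.getFile()); // path plus query
    System.out.println("getPath(): " + url.getPath()); // path only
    System.out.println(normalize(s));
    System.out.println(
        normalize("http://example.com/a//b/./c/../d?x=http://example.com"));
  }
}
```

With the check in place, the first URL is returned untouched, so the double 
slash in its query part survives; only the second one goes through the slash 
and dot segment handling.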


> Use URI.normalise() in BasicURLNormalizer
> -----------------------------------------
>
>                 Key: NUTCH-1990
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1990
>             Project: Nutch
>          Issue Type: Improvement
>    Affects Versions: 1.9
>            Reporter: Julien Nioche
>            Assignee: Julien Nioche
>         Attachments: NUTCH-1990-trial1.patch, NUTCH-1990-v1.patch
>
>
> One of the things that 
> [BasicURLNormalizer|https://github.com/apache/nutch/blob/trunk/src/plugin/urlnormalizer-basic/src/java/org/apache/nutch/net/urlnormalizer/basic/BasicURLNormalizer.java]
>  does is to remove unnecessary dot segments in the path.
> Instead of implementing the logic ourselves with some antiquated regex 
> library, we should simply use 
> [http://docs.oracle.com/javase/7/docs/api/java/net/URI.html#normalize()] 
> which does the same and is probably more efficient.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
