[ 
https://issues.apache.org/jira/browse/DROIDS-45?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12717083#action_12717083
 ] 

Mingfai Ma commented on DROIDS-45:
----------------------------------

I'm not sure whether a null path should be normalized to "/", as in:
{code}
assertEquals("http://www.apache.org/",
        normalizer.normalize("http://www.apache.org"));
{code}

If a website behaves differently for a null path and a "/" path, this could 
cause problems. 

{code}
// LinkNormalizer
// apply patterns
if (path != null && !"".equals(path)) {
    for (Pattern pattern : PATH_REPLACEMENTS.keySet()) {
        path = pattern.matcher(path).replaceAll(PATH_REPLACEMENTS.get(pattern));
    }
} else {
    // a null or empty path is normalized to "/"
    path = "/";
}
{code}

Changing "/" to a null path is odd but may cause fewer problems. E.g. for 
http://www.apache.org, it just redirects the request to "http://www.apache.org", 
and the fetching operation won't be affected. I tested a couple of 
popular/famous websites, and they either redirect a null-path request to 
another URL or to the "/" path. One of the main functions of this normalization 
is to avoid duplicate links as much as possible. 
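
To make the trade-off concrete, here is a minimal sketch of the "empty path 
becomes /" direction using only java.net.URI. The class and method names 
(EmptyPathSketch, withRootPath) are made up for illustration and are not part 
of Droids; the opposite direction ("/" to empty path) would be analogous.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper, not the Droids LinkNormalizer.
final class EmptyPathSketch {

    // Returns the URI unchanged unless it is hierarchical with an empty path,
    // in which case the path is replaced by "/".
    static URI withRootPath(URI uri) {
        String path = uri.getRawPath(); // null for opaque URIs such as mailto:
        if (uri.isOpaque() || uri.getAuthority() == null
                || (path != null && !path.isEmpty())) {
            return uri;
        }
        try {
            // Assumption: query and fragment need no special re-encoding here.
            return new URI(uri.getScheme(), uri.getAuthority(), "/",
                    uri.getQuery(), uri.getFragment());
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        Set<URI> seen = new HashSet<URI>();
        seen.add(withRootPath(URI.create("http://www.apache.org")));
        seen.add(withRootPath(URI.create("http://www.apache.org/")));
        // Both spellings collapse to a single entry after normalization.
        System.out.println(seen);
    }
}
```

Whichever direction is chosen, the point is that both spellings map to the 
same key, so the link is queued only once.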

> Fail to resolve outlink correctly
> ---------------------------------
>
>                 Key: DROIDS-45
>                 URL: https://issues.apache.org/jira/browse/DROIDS-45
>             Project: Droids
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 0.01
>            Reporter: Mingfai Ma
>         Attachments: DROIDS-45b.patch, DROIDS-45c.patch
>
>
> I've encountered several cases where outlinks are not extracted correctly. 
> Most are caused by the use of URI.resolve(). 
> 1. For a base URI of new URI("http://www.domain.com"), <a 
> href="test.html">test.html</a> will be resolved to 
> http://www.domain.comtest.html
> 2. For a base URI of new URI("http://www.domain.com/index.php"), <a 
> href="?test=true">test with param</a> will be resolved to 
> http://www.domain.com/?test=true
> 3. For <a href="http://www.yahoo.com\n">line break!</a>, URI.resolve will 
> throw an exception, whereas a browser can resolve the URI. (Remark: I 
> didn't check whether this scenario affects the default Tika/NekoHTML parsing.)
> I suspect there are many different scenarios, many of them probably 
> caused by non-standard usage (but a crawler has to handle non-standard usage 
> in order to function). Obviously, we cannot cater for every case, and I 
> suggest treating a resolve failure as a bug if a link works in a Mozilla 
> browser but not in the Droids LinkExtractor. 
> This issue is related to the LinkExtractor created in DROIDS-8.
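
For reference, the three cases in the quoted description could be handled by 
a lenient wrapper around URI.resolve() along the following lines. This is 
only a sketch under stated assumptions, not the code in the attached patches: 
OutlinkResolveSketch and its resolve method are hypothetical names, and 
query-only references are treated the way browsers treat them (keep the base 
path, replace the query).

```java
import java.net.URI;

// Hypothetical helper, not the Droids LinkExtractor.
final class OutlinkResolveSketch {

    static URI resolve(URI base, String href) {
        // Case 3: strip surrounding whitespace such as a trailing line break.
        String ref = href.trim();

        // Case 1: give a base with an empty path a "/" path before resolving.
        // Any query or fragment on such a base is dropped in this sketch.
        String basePath = base.getRawPath();
        if (base.isAbsolute() && base.getRawAuthority() != null
                && basePath != null && basePath.isEmpty()) {
            base = URI.create(base.getScheme() + "://"
                    + base.getRawAuthority() + "/");
        }

        // Case 2: a query-only reference keeps the base path and replaces
        // the base query (and fragment), as browsers do.
        if (ref.startsWith("?")) {
            String s = base.toString();
            int cut = s.length();
            int q = s.indexOf('?');
            if (q >= 0) {
                cut = q;
            }
            int h = s.indexOf('#');
            if (h >= 0 && h < cut) {
                cut = h;
            }
            return URI.create(s.substring(0, cut) + ref);
        }

        return base.resolve(ref);
    }

    public static void main(String[] args) {
        System.out.println(resolve(
                URI.create("http://www.domain.com"), "test.html"));
        System.out.println(resolve(
                URI.create("http://www.domain.com/index.php"), "?test=true"));
        System.out.println(resolve(
                URI.create("http://www.domain.com/"), "http://www.yahoo.com\n"));
    }
}
```

Other non-standard inputs (spaces inside the href, backslashes, and so on) 
would still make URI.create() throw, so a real implementation would need 
further escaping or error handling.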
