[ 
https://issues.apache.org/jira/browse/DROIDS-45?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mingfai Ma updated DROIDS-45:
-----------------------------

    Attachment: LinkResolverTests.java
                LinkResolver.java

Changed the API based on Thorsten's comment. 

Note that these two classes need further work before they can go into Droids: 
they are not in a Droids package, they carry no license terms, and their style 
does not match the original LinkExtractor. They are attached as a starting 
point for a Droids implementation.

> Fail to resolve outlink correctly
> ---------------------------------
>
>                 Key: DROIDS-45
>                 URL: https://issues.apache.org/jira/browse/DROIDS-45
>             Project: Droids
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 0.01
>            Reporter: Mingfai Ma
>         Attachments: LinkResolver.java, LinkResolver.java, 
> LinkResolverTests.java, LinkResolverTests.java
>
>
> I've encountered several cases in which outlinks are not extracted 
> correctly. Most are caused by the use of URI.resolve(). 
> 1. For a base URI of new URI("http://www.domain.com"), <a 
> href="test.html">test.html</a> is resolved to 
> http://www.domain.comtest.html rather than http://www.domain.com/test.html
> 2. For a base URI of new URI("http://www.domain.com/index.php"), <a 
> href="?test=true">test with param</a> is resolved to 
> http://www.domain.com/?test=true rather than 
> http://www.domain.com/index.php?test=true
> 3. For <a href="http://www.yahoo.com\n">line break!</a>, URI.resolve() 
> throws an exception, whereas a browser resolves the URI. (Remark: I 
> didn't check whether this scenario affects the default Tika/NekoHTML 
> parsing.)
> I suspect there are many different scenarios, many of them probably 
> caused by non-standard usage (but a crawler has to handle non-standard usage 
> in order to function). Obviously, we cannot cater for every case, so I 
> suggest treating a resolve failure as a bug if a link works in a Mozilla 
> browser but not in the Droids LinkExtractor. 
> This issue is related to the LinkExtractor created in DROIDS-8.
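
For illustration, the three cases above could be handled roughly as in the 
sketch below. This is not the attached LinkResolver.java; the class name 
LenientLinkResolver and its resolve() method are hypothetical, and the sketch 
assumes that the "\n" in case 3 is a real line-break character in the href 
value. It only pre-processes the input before falling back to URI.resolve().

```java
import java.net.URI;

/** Hypothetical helper for illustration; not the attached LinkResolver.java. */
final class LenientLinkResolver {

    private LenientLinkResolver() {}

    /**
     * Resolves href against base, working around the three cases in the report.
     * Throws IllegalArgumentException if href is still malformed after cleanup.
     */
    static URI resolve(URI base, String href) {
        // Case 3: drop line breaks and tabs, then surrounding whitespace.
        String ref = href.replaceAll("[\\r\\n\\t]", "").trim();

        // Case 1: a base such as http://www.domain.com has an empty path;
        // give it the path "/" so relative paths are merged correctly.
        if (base.isAbsolute() && !base.isOpaque()
                && base.getRawAuthority() != null
                && base.getRawPath().isEmpty()) {
            String query = base.getRawQuery();
            base = URI.create(base.getScheme() + "://" + base.getRawAuthority()
                    + "/" + (query == null ? "" : "?" + query));
        }

        // Case 2: a query-only reference keeps the base path and replaces
        // only the query (and fragment) of the base.
        if (ref.startsWith("?")) {
            String b = base.toString();
            int cut = b.length();
            int q = b.indexOf('?');
            int h = b.indexOf('#');
            if (q >= 0) cut = Math.min(cut, q);
            if (h >= 0) cut = Math.min(cut, h);
            return URI.create(b.substring(0, cut) + ref);
        }

        // Everything else: the standard resolution is sufficient.
        return base.resolve(URI.create(ref));
    }

    public static void main(String[] args) {
        System.out.println(resolve(URI.create("http://www.domain.com"), "test.html"));
        System.out.println(resolve(URI.create("http://www.domain.com/index.php"), "?test=true"));
        System.out.println(resolve(URI.create("http://www.domain.com/"), "http://www.yahoo.com\n"));
    }
}
```

The cleanup runs before resolution so that the stock URI.resolve() is still 
used for ordinary links; other non-standard inputs (for example, spaces 
inside the href) are not covered and would still raise an exception.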

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
