Fail to resolve outlink correctly
---------------------------------

                 Key: DROIDS-45
                 URL: https://issues.apache.org/jira/browse/DROIDS-45
             Project: Droids
          Issue Type: Bug
          Components: core
    Affects Versions: 0.01
            Reporter: Mingfai Ma


I've encountered several cases where outlinks are not extracted correctly. Most 
are caused by the use of URI.resolve(). 

1. For a base URI of new URI("http://www.domain.com"), <a 
href="test.html">test.html</a> is resolved to 
http://www.domain.comtest.html instead of http://www.domain.com/test.html

2. For a base URI of new URI("http://www.domain.com/index.php"), <a 
href="?test=true">test with param</a> is resolved to 
http://www.domain.com/?test=true instead of 
http://www.domain.com/index.php?test=true

3. For <a href="http://www.yahoo.com\n">line break!</a>, URI.resolve() throws 
an exception, whereas a browser resolves the link. (Remark: I didn't check 
whether this scenario affects the default Tika/NekoHTML parsing.)
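
The three cases above could be worked around with a small wrapper around 
URI.resolve(). This is only a minimal sketch and not the Droids LinkExtractor 
code: the class name LenientLinkResolver and its resolve() helper are 
hypothetical, and it assumes hierarchical http-style base URIs. It trims 
surrounding whitespace (case 3), gives a base with an empty path the root path 
"/" (case 1), and keeps the base path for query-only references (case 2).

```java
import java.net.URI;

final class LenientLinkResolver {

    /** Hypothetical helper: resolves href against base, covering cases 1-3. */
    static URI resolve(URI base, String href) {
        // Case 3: drop leading/trailing whitespace such as "\n".
        // Whitespace inside the reference would still be rejected.
        String cleaned = href.trim();

        // Case 1: give an absolute, hierarchical base with an empty path
        // the root path "/" before resolving anything against it.
        if (base.isAbsolute() && !base.isOpaque()
                && base.getRawAuthority() != null
                && base.getRawPath().isEmpty()) {
            String query = base.getRawQuery();
            base = URI.create(base.getScheme() + "://" + base.getRawAuthority()
                    + "/" + (query == null ? "" : "?" + query));
        }

        // Case 2: a query-only reference keeps the base path and replaces
        // only the query (and fragment) of the base.
        if (cleaned.startsWith("?")) {
            String s = base.toString();
            int end = s.length();
            int q = s.indexOf('?');
            if (q >= 0) end = q;
            int h = s.indexOf('#');
            if (h >= 0 && h < end) end = h;
            return URI.create(s.substring(0, end) + cleaned);
        }

        // Everything else is left to the JDK.
        return base.resolve(cleaned);
    }

    public static void main(String[] args) {
        URI bare = URI.create("http://www.domain.com");
        URI page = URI.create("http://www.domain.com/index.php");

        // Plain JDK result next to the lenient result for cases 1 and 2.
        System.out.println(bare.resolve("test.html")
                + " vs " + resolve(bare, "test.html"));
        System.out.println(page.resolve("?test=true")
                + " vs " + resolve(page, "?test=true"));

        // Case 3: the trailing line break is trimmed before resolving.
        System.out.println(resolve(page, "http://www.yahoo.com\n"));
    }
}
```

Invalid references still raise IllegalArgumentException here, so a caller 
would probably want to catch that and skip the link rather than abort the 
whole page.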

I suspect there are many more such scenarios, most of them probably caused by 
non-standard usage (but a crawler has to handle non-standard usage in order to 
function). Obviously, we cannot cater for every case, so I suggest treating a 
resolve failure as a bug if a link works in a Mozilla browser but not in the 
Droids LinkExtractor. 

This issue is related to the LinkExtractor created in DROIDS-8.
