[
https://issues.apache.org/jira/browse/DROIDS-45?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12694995#action_12694995
]
Mingfai Ma commented on DROIDS-45:
----------------------------------
The LinkExtractor doesn't append '/' automatically, and I think it shouldn't,
since a server may handle a URL with and without the trailing '/' differently.
For a root domain URL it may be OK, but for a deeper URL we can't just assume
that the last segment of the request path is a directory.
Apache mod_dir should append a trailing slash, but unfortunately not all web
servers on the internet have this feature enabled :-)
http://httpd.apache.org/docs/2.2/mod/mod_dir.html
> Fail to resolve outlink correctly
> ---------------------------------
>
> Key: DROIDS-45
> URL: https://issues.apache.org/jira/browse/DROIDS-45
> Project: Droids
> Issue Type: Bug
> Components: core
> Affects Versions: 0.01
> Reporter: Mingfai Ma
>
> I've encountered several cases where outlinks are not extracted correctly.
> Most are caused by the use of URI.resolve().
> 1. For a base URI of new URI("http://www.domain.com"), <a
> href="test.html">test.html</a> will be resolved to
> http://www.domain.comtest.html
> 2. For a base URI of new URI("http://www.domain.com/index.php"), <a
> href="?test=true">test with param</a> will be resolved to
> http://www.domain.com/?test=true
> 3. For <a href="http://www.yahoo.com\n">line break!</a>, URI.resolve() will
> throw an exception, whereas a browser can resolve the URI. (Remark: I
> didn't check whether this scenario affects the default Tika/NekoHTML parsing.)
> I suspect there are many different scenarios, many of them probably
> caused by non-standard usage (but a crawler has to handle non-standard usage
> in order to function). Obviously, we cannot cater for every case, so I suggest
> treating a resolve failure as a bug if a link works in a Mozilla browser but
> not in the Droids LinkExtractor.
> This issue is related to the LinkExtractor created in DROIDS-8.
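
The three cases listed above could be handled with a small wrapper around
URI.resolve(). The following is only a minimal sketch, not the Droids
implementation: the class LenientResolver and the method resolveLenient are
hypothetical names introduced for illustration. It assumes that an empty base
path can safely be treated as "/", that a query-only href should keep the base
path, and that surrounding whitespace in an href can be dropped.

```java
import java.net.URI;

// Hypothetical helper for illustration only; it is not part of Droids.
final class LenientResolver {

    /** Resolves href against base, working around the three reported cases. */
    static URI resolveLenient(URI base, String href) {
        // Case 3: drop surrounding whitespace, including a trailing line break.
        String ref = href.trim();

        // Only rewrite bases of the form scheme://authority[/path][?query].
        boolean hierarchical = base.isAbsolute() && base.getRawAuthority() != null;
        if (hierarchical) {
            String origin = base.getScheme() + "://" + base.getRawAuthority();
            String path = base.getRawPath();

            // Case 1: treat an empty base path as "/" (root URL only).
            if (path == null || path.isEmpty()) {
                path = "/";
            }

            // Case 2: a query-only reference keeps the base path and
            // replaces just the query string.
            if (ref.startsWith("?")) {
                return URI.create(origin + path + ref);
            }

            // Rebuild the base with the normalized path (fragment dropped,
            // since it plays no part in resolution).
            String query = base.getRawQuery();
            base = URI.create(origin + path + (query == null ? "" : "?" + query));
        }
        return base.resolve(ref);
    }

    public static void main(String[] args) {
        System.out.println(resolveLenient(URI.create("http://www.domain.com"), "test.html"));
        System.out.println(resolveLenient(URI.create("http://www.domain.com/index.php"), "?test=true"));
        System.out.println(resolveLenient(URI.create("http://www.domain.com/"), "http://www.yahoo.com\n"));
    }
}
```

Note that the sketch adds "/" only when the base path is empty. It never
appends a slash to a deeper path, so it does not assume that the last segment
of a request path is a directory. An href that is still not a valid URI after
trimming makes URI.create() throw IllegalArgumentException, which a caller
would have to catch.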