[ 
https://issues.apache.org/jira/browse/NUTCH-566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-566:
---------------------------------------

    Patch Info: Patch Available
    
> Sun's URL class has bug in creation of relative query URLs
> ----------------------------------------------------------
>
>                 Key: NUTCH-566
>                 URL: https://issues.apache.org/jira/browse/NUTCH-566
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher
>    Affects Versions: 0.8, 0.8.1, 0.9.0
>         Environment: MacOS X and Linux (CentOS 4.5) both
>            Reporter: Doug Cook
>            Priority: Minor
>             Fix For: 1.7, 2.2
>
>         Attachments: RelativeURL.java
>
>
> I'm using 0.8.1, but this will affect all other versions as well.
> Relative links of the form "?blah" are resolved incorrectly. For example, 
> with a base URL of http://www.fleurie.org/entreprise.asp, and a relative link 
> of "?id_entrep=111", Nutch will resolve this pair to the link
> "http://www.fleurie.org/?id_entrep=111". No such URL exists, and all browsers 
> I tried will resolve the pair to 
> "http://www.fleurie.org/entreprise.asp?id_entrep=111".
> I tracked this down to what could be called a bug in Sun's URL class. 
> According to Sun's spec, they parse the relative URL according to RFC 2396. 
> But the original RFC for relative links was RFC 1808, and the two RFCs differ 
> in how they handle relative links beginning with "?". Most browsers 
> (Netscape/Mozilla, IE, Safari) implemented RFC 1808, and stuck with it (for 
> compatibility and also because the behavior makes more sense). Apparently 
> even the people that wrote RFC 2396 recognized that this was a mistake, and 
> the specified behavior was changed in RFC 3986 to match what browsers do. 
> For a discussion of this, see  
> http://gbiv.com/protocols/uri/rev-2002/issues.html#003-relative-query
> Sun's URL implementation, however, still implements RFC 2396, as far as I can 
> tell, and is out of step with the rest of the world.
> This breaks link extraction on a number of sites.
> I implemented a simple workaround, which I'm attaching. It is a static method 
> to create URLs which behaves exactly as new URL(URL base, String 
> relativePath), and I use it as a drop-in replacement for that in 
> DOMContentUtils, Javascript link extraction, etc. Obviously, it really only 
> matters wherever links are extracted. I haven't included the calling code 
> from DOMContentUtils, etc. because my local versions are largely rewritten, 
> but it should be pretty obvious.
> I put it in the org.apache.nutch.net directory, but obviously feel free to 
> move it to another place if you feel it belongs there!
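
For illustration, the workaround described above could look roughly like the
sketch below. This is not the attached RelativeURL.java; the class name
RelativeUrlSketch and the method resolve are hypothetical, and the approach
(prepending the last path segment of the base to a query-only link before
handing it to java.net.URL) is one assumed way to get the RFC 3986 result.
Edge cases, such as a last path segment that contains a colon, are not
handled.

```java
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical sketch, not the RelativeURL.java attached to the issue.
public class RelativeUrlSketch {

    /**
     * Resolves target against base like new URL(base, target), except that a
     * target starting with "?" keeps the last path segment of the base, which
     * is what RFC 3986 and browsers do.
     */
    public static URL resolve(URL base, String target) throws MalformedURLException {
        target = target.trim();
        if (!target.startsWith("?")) {
            // Not a query-only link: defer to the standard constructor.
            return new URL(base, target);
        }
        // getPath() excludes the query, so "/entreprise.asp?x=1" gives "/entreprise.asp".
        String path = base.getPath();
        // lastIndexOf returns -1 when there is no '/', so this keeps the whole path.
        String lastSegment = path.substring(path.lastIndexOf('/') + 1);
        // "?id_entrep=111" becomes "entreprise.asp?id_entrep=111", an ordinary
        // relative path that both RFC 2396 and RFC 3986 resolve the same way.
        return new URL(base, lastSegment + target);
    }

    public static void main(String[] args) throws MalformedURLException {
        URL base = new URL("http://www.fleurie.org/entreprise.asp");
        String link = "?id_entrep=111";

        // Result depends on the URL implementation in the running JDK.
        System.out.println("new URL(base, link): " + new URL(base, link));

        // Prints http://www.fleurie.org/entreprise.asp?id_entrep=111
        System.out.println("resolve(base, link): " + resolve(base, link));
    }
}
```

A caller would replace new URL(base, link) with
RelativeUrlSketch.resolve(base, link) wherever links are extracted. Links that
do not start with "?" go through the standard constructor unchanged.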

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
