[ https://issues.apache.org/jira/browse/NUTCH-566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lewis John McGibbney updated NUTCH-566:
---------------------------------------

    Patch Info: Patch Available

> Sun's URL class has bug in creation of relative query URLs
> ----------------------------------------------------------
>
>                 Key: NUTCH-566
>                 URL: https://issues.apache.org/jira/browse/NUTCH-566
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher
>    Affects Versions: 0.8, 0.8.1, 0.9.0
>         Environment: MacOS X and Linux (CentOS 4.5) both
>            Reporter: Doug Cook
>            Priority: Minor
>             Fix For: 1.7, 2.2
>
>         Attachments: RelativeURL.java
>
>
> I'm using 0.8.1, but this will affect all other versions as well.
> Relative links of the form "?blah" are resolved incorrectly. For example,
> with a base URL of http://www.fleurie.org/entreprise.asp, and a relative link
> of "?id_entrep=111", Nutch will resolve this pair to the link
> "http://www.fleurie.org/?id_entrep=111". No such URL exists, and all browsers
> I tried will resolve the pair to
> "http://www.fleurie.org/entreprise.asp?id_entrep=111".
> I tracked this down to what could be called a bug in Sun's URL class.
> According to Sun's spec, they parse the relative URL according to RFC 2396.
> But the original RFC for relative links was RFC 1808, and the two RFCs differ
> in how they handle relative links beginning with "?". Most browsers
> (Netscape/Mozilla, IE, Safari) implemented RFC 1808, and stuck with it (for
> compatibility and also because the behavior makes more sense). Apparently
> even the people that wrote RFC 2396 recognized that this was a mistake, and
> the specified behavior was changed in RFC 3986 to match what browsers do.
> For a discussion of this, see
> http://gbiv.com/protocols/uri/rev-2002/issues.html#003-relative-query
> Sun's URL implementation, however, still implements RFC 2396, as far as I can
> tell, and is out of step with the rest of the world.
> This breaks link extraction on a number of sites.
> I implemented a simple workaround, which I'm attaching. It is a static method
> to create URLs which behaves exactly as new URL(URL base, String
> relativePath), and I use it as a drop-in replacement for that in
> DOMContentUtils, JavaScript link extraction, etc. Obviously, it really only
> matters wherever links are extracted. I haven't included the calling code
> from DOMContentUtils, etc. because my local versions are largely rewritten,
> but it should be pretty obvious.
> I put it in the org.apache.nutch.net directory, but obviously feel free to
> move it to another place if you feel it belongs there!
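
For readers without access to the attachment, the kind of workaround described in the issue could look roughly like the sketch below. This is not the attached RelativeURL.java; the class name RelativeUrlSketch and the method name resolve are hypothetical, chosen only for illustration. The idea, assuming the base URL's path ends in the document name, is to prepend the last path segment of the base to any target that starts with "?", so that new URL(URL, String) keeps the document name as RFC 3986 and browsers do.

```java
import java.net.MalformedURLException;
import java.net.URL;

/** Illustrative sketch only; hypothetical names, not the attached RelativeURL.java. */
@SuppressWarnings("deprecation") // the URL constructors are deprecated on recent JDKs
class RelativeUrlSketch {

    /**
     * Resolves target against base like new URL(base, target), except that a
     * query-only target ("?...") keeps the last path segment of the base.
     */
    static URL resolve(URL base, String target) throws MalformedURLException {
        target = target.trim();
        if (target.startsWith("?")) {
            String path = base.getPath();
            // Everything after the final '/', e.g. "entreprise.asp".
            // Empty when the base path is empty or ends in '/'.
            String lastSegment = path.substring(path.lastIndexOf('/') + 1);
            target = lastSegment + target;
        }
        return new URL(base, target);
    }

    public static void main(String[] args) throws MalformedURLException {
        URL base = new URL("http://www.fleurie.org/entreprise.asp");
        System.out.println(resolve(base, "?id_entrep=111"));
        // prints http://www.fleurie.org/entreprise.asp?id_entrep=111
    }
}
```

Targets that do not start with "?" go straight to the standard constructor, so a helper of this shape could replace new URL(base, relativePath) at link-extraction call sites without changing other behavior.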