[jira] [Updated] (NUTCH-566) Sun's URL class has bug in creation of relative query URLs

2014-04-26 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-566:
--

Fix Version/s: (was: 1.9)

 Sun's URL class has bug in creation of relative query URLs
 --

 Key: NUTCH-566
 URL: https://issues.apache.org/jira/browse/NUTCH-566
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 0.8, 0.8.1, 0.9.0
 Environment: MacOS X and Linux (CentOS 4.5) both
Reporter: Doug Cook
Priority: Minor
 Attachments: RelativeURL.java


 I'm using 0.8.1, but this will affect all other versions as well.
 Relative links of the form ?blah are resolved incorrectly. For example, 
 with a base URL of http://www.fleurie.org/entreprise.asp and a relative link 
 of ?id_entrep=111, Nutch resolves the pair to
 http://www.fleurie.org/?id_entrep=111. No such URL exists, and all browsers 
 I tried resolve the pair to 
 http://www.fleurie.org/entreprise.asp?id_entrep=111.
 I tracked this down to what could be called a bug in Sun's URL class. 
 According to Sun's spec, they parse the relative URL according to RFC 2396. 
 But the original RFC for relative links was RFC 1808, and the two RFCs differ 
 in how they handle relative links beginning with ?. Most browsers 
 (Netscape/Mozilla, IE, Safari) implemented RFC 1808, and stuck with it (for 
 compatibility and also because the behavior makes more sense). Apparently 
 even the people that wrote RFC 2396 recognized that this was a mistake, and 
 the specified behavior was changed in RFC 3986 to match what browsers do. 
 For a discussion of this, see 
 http://gbiv.com/protocols/uri/rev-2002/issues.html#003-relative-query
 Sun's URL implementation, however, still implements RFC 2396, as far as I can 
 tell, and is out of step with the rest of the world.
 This breaks link extraction on a number of sites.
 I implemented a simple workaround, which I'm attaching. It is a static method 
 that takes the same arguments as new URL(URL base, String relativePath) but 
 resolves query-only links the way browsers do. I use it as a drop-in 
 replacement for that constructor in DOMContentUtils, JavaScript link 
 extraction, etc. It only matters where links are extracted. I haven't 
 included the calling code from DOMContentUtils, etc. because my local 
 versions are largely rewritten, but the change should be straightforward.
 I put it in the org.apache.nutch.net package, but feel free to move it 
 elsewhere if you feel it belongs there!
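
 For readers without the attachment, here is a minimal sketch of the kind of
 workaround described above. It is not the attached RelativeURL.java: the
 class name RelativeUrlSketch and the method name resolve are made up for
 illustration. The idea is to prepend the last path segment of the base to a
 query-only link before handing it to new URL(URL, String), so the result
 keeps the base document's path as RFC 3986 and browsers do. Edge cases (for
 instance a last path segment containing ':') are not handled.

```java
import java.net.MalformedURLException;
import java.net.URL;

/** Hypothetical helper for illustration; not the attached RelativeURL.java. */
public class RelativeUrlSketch {

    /**
     * Resolves target against base. A query-only target ("?...") keeps the
     * last path segment of the base, as RFC 3986 and browsers do. All other
     * targets are passed to new URL(base, target) unchanged.
     */
    public static URL resolve(URL base, String target) throws MalformedURLException {
        target = target.trim();
        if (target.startsWith("?")) {
            String basePath = base.getPath();
            // Last path segment of the base, e.g. "entreprise.asp" (may be empty).
            String lastSegment = basePath.substring(basePath.lastIndexOf('/') + 1);
            target = lastSegment + target;
        }
        return new URL(base, target);
    }

    public static void main(String[] args) throws MalformedURLException {
        URL base = new URL("http://www.fleurie.org/entreprise.asp");
        String link = "?id_entrep=111";
        // Plain constructor, for comparison with the behaviour reported above.
        System.out.println("java.net.URL: " + new URL(base, link));
        // Workaround: keeps /entreprise.asp in the resolved URL.
        System.out.println("workaround:   " + resolve(base, link));
    }
}
```

 Callers in the link extraction code would use
 RelativeUrlSketch.resolve(base, href) wherever they currently call
 new URL(base, href).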



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (NUTCH-566) Sun's URL class has bug in creation of relative query URLs

2013-05-22 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-566:
--

Fix Version/s: 1.8

 Sun's URL class has bug in creation of relative query URLs
 --

 Key: NUTCH-566
 URL: https://issues.apache.org/jira/browse/NUTCH-566
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 0.8, 0.8.1, 0.9.0
 Environment: MacOS X and Linux (CentOS 4.5) both
Reporter: Doug Cook
Priority: Minor
 Fix For: 2.3, 1.8

 Attachments: RelativeURL.java



--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (NUTCH-566) Sun's URL class has bug in creation of relative query URLs

2013-01-12 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-566:
---

Fix Version/s: 2.2
   1.7

 Sun's URL class has bug in creation of relative query URLs
 --

 Key: NUTCH-566
 URL: https://issues.apache.org/jira/browse/NUTCH-566
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 0.8, 0.8.1, 0.9.0
 Environment: MacOS X and Linux (CentOS 4.5) both
Reporter: Doug Cook
Priority: Minor
 Fix For: 1.7, 2.2

 Attachments: RelativeURL.java


