[jira] [Updated] (NUTCH-797) parse-tika is not properly constructing URLs when the target begins with a ?
[ https://issues.apache.org/jira/browse/NUTCH-797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Nagel updated NUTCH-797:
--
Attachment: test_nutch_797.html

Tested using parsechecker (cf. NUTCH-1743) with attached sample document:
* fixed for trunk and parse-tika
* still open for parse-html in 2.x

Same applies to NUTCH-566 and NUTCH-952.

parse-tika is not properly constructing URLs when the target begins with a ?
--
Key: NUTCH-797
URL: https://issues.apache.org/jira/browse/NUTCH-797
Project: Nutch
Issue Type: Bug
Components: parser
Affects Versions: 1.1, nutchgora
Environment: Win 7, Java(TM) SE Runtime Environment (build 1.6.0_16-b01). Also reproduces on RHEL and Java 1.4.2
Reporter: Robert Hohman
Assignee: Andrzej Bialecki
Priority: Minor
Fix For: 1.9
Attachments: NUTCH-797.patch, pureQueryUrl-2.patch, pureQueryUrl.patch, test_nutch_797.html

This is my first bug and patch on Nutch, so apologies if I have not provided enough detail.

In crawling the page at
http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0&sk=0
there are links in the page that look like this:

<a href="?co=0&sk=0&p=2&pi=1">2</a></td><td><a href="?co=0&sk=0&p=3&pi=1">3</a>

In org.apache.nutch.parse.tika.DOMContentUtils rev 916362 (trunk), as getOutlinks looks for links, it comes across this link and constructs a new URL from a base URL built from "http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0&sk=0" and a target of "?co=0&sk=0&p=2&pi=1".

The URL class, per RFC 3986 at http://labs.apache.org/webarch/uri/rfc/rfc3986.html#relative-merge, defines how to merge these two, and per the RFC, the URL class merges these to:
http://careers3.accenture.com/Careers/ASPX/?co=0&sk=0&p=2&pi=1
because the RFC explicitly states that the rightmost URL segment (the Search.aspx in this case) should be ripped off before combining.

While this is compliant with the RFC, it means the URLs which are created for the next round of fetching are incorrect. Modern browsers seem to handle this case (I checked IE8 and Firefox 3.5), so I'm guessing this is an obscure exception or handling of what is a poorly formed URL on Accenture's part.

I have fixed this by modifying DOMContentUtils to look for the case where a ? begins the target, and then pulling the rightmost component out of the base and inserting it into the target before the ?, so the target in this example becomes:
Search.aspx?co=0&sk=0&p=2&pi=1
The URL class then properly constructs the new URL as:
http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0&sk=0&p=2&pi=1

If it is agreed that this solution works, I believe the other HTML parsers in Nutch would need to be modified in a similar way.

Can I get feedback on this proposed solution? Specifically I'm worried about unforeseen side effects.

Much thanks

Here is the patch info:

Index: src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java
===
--- src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java (revision 916362)
+++ src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java (working copy)
@@ -299,6 +299,50 @@
     return false;
   }

+  private URL fixURL(URL base, String target) throws MalformedURLException
+  {
+    // handle params that are embedded into the base url - move them to target
+    // so URL class constructs the new url class properly
+    if (base.toString().indexOf(';') > 0)
+      return fixEmbeddedParams(base, target);
+
+    // handle the case that there is a target that is a pure query.
+    // Strictly speaking this is a violation of RFC 2396 section 5.2.2 on how to assemble
+    // URLs but I've seen this in numerous places, for example at
+    // http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0&sk=0
+    // It has urls in the page of the form href="?co=0&sk=0&pg=1", and by default
+    // URL constructs the base+target combo as
+    // http://careers3.accenture.com/Careers/ASPX/?co=0&sk=0&pg=1, incorrectly
+    // dropping the Search.aspx target
+    //
+    // Browsers handle these just fine, they must have an exception similar to this
+    if (target.startsWith("?"))
+    {
+      return fixPureQueryTargets(base, target);
+    }
+
+    return new URL(base, target);
+  }
+
+  private URL fixPureQueryTargets(URL base, String target) throws MalformedURLException
+  {
+    if (!target.startsWith("?"))
+      return new URL(base, target);
+
+    String basePath = base.getPath();
+    String
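The workaround described in the report (prepend the base URL's rightmost path segment to a target that starts with "?", then let the URL class resolve it) can be sketched as follows. This is a minimal illustration, not the committed Nutch code: the class name PureQueryResolver and the methods resolve and resolveToString are hypothetical, and the "&" separators in the example query strings are assumed.

```java
import java.net.MalformedURLException;
import java.net.URL;

public class PureQueryResolver {

    // Hypothetical helper: resolve a link target against a base URL, keeping
    // the base's last path segment when the target is only a query string.
    public static URL resolve(URL base, String target) throws MalformedURLException {
        target = target.trim();
        if (target.startsWith("?")) {
            String basePath = base.getPath();
            // Everything after the last '/' is the rightmost segment, e.g. "Search.aspx".
            // If the path has no '/' or ends with '/', this is the whole path or "".
            String rightMost = basePath.substring(basePath.lastIndexOf('/') + 1);
            target = rightMost + target;
        }
        return new URL(base, target);
    }

    // Convenience wrapper (also hypothetical) that works on strings and turns
    // the checked exception into an unchecked one.
    public static String resolveToString(String base, String target) {
        try {
            return resolve(new URL(base), target).toString();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        String base = "http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0&sk=0";
        // prints http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0&sk=0&p=2&pi=1
        System.out.println(resolveToString(base, "?co=0&sk=0&p=2&pi=1"));
    }
}
```

Targets that do not start with "?" are passed to the URL constructor unchanged, so ordinary relative and absolute links should resolve as before.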
[jira] [Updated] (NUTCH-797) parse-tika is not properly constructing URLs when the target begins with a ?
[ https://issues.apache.org/jira/browse/NUTCH-797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Nagel updated NUTCH-797:
--
Attachment: NUTCH-797-2x.patch

Patch for 2.x:
- port URLUtil.resolveURL() from 1.x (including unit test)
- removed fixEmbeddedParams(): it's still in 1.x but unused (NUTCH-797 removed/deactivated NUTCH-1115)

parse-tika is not properly constructing URLs when the target begins with a ?
--
Key: NUTCH-797
Affects Versions: 1.1, nutchgora
Fix For: 1.9
Attachments: NUTCH-797-2x.patch, NUTCH-797.patch, pureQueryUrl-2.patch, pureQueryUrl.patch, test_nutch_797.html
[jira] [Updated] (NUTCH-797) parse-tika is not properly constructing URLs when the target begins with a ?
[ https://issues.apache.org/jira/browse/NUTCH-797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Nagel updated NUTCH-797:
--
Fix Version/s: 1.8

parse-tika is not properly constructing URLs when the target begins with a ?
--
Key: NUTCH-797
Affects Versions: 1.1, nutchgora
Fix For: 2.3, 1.8
Attachments: NUTCH-797.patch, pureQueryUrl-2.patch, pureQueryUrl.patch
[jira] [Updated] (NUTCH-797) parse-tika is not properly constructing URLs when the target begins with a ?
[ https://issues.apache.org/jira/browse/NUTCH-797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lewis John McGibbney updated NUTCH-797:
--
Fix Version/s: 2.2 (was: 2.1)

parse-tika is not properly constructing URLs when the target begins with a ?
--
Key: NUTCH-797
Affects Versions: 1.1, nutchgora
Fix For: 2.2
Attachments: NUTCH-797.patch, pureQueryUrl-2.patch, pureQueryUrl.patch
[jira] [Updated] (NUTCH-797) parse-tika is not properly constructing URLs when the target begins with a ?
[ https://issues.apache.org/jira/browse/NUTCH-797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lewis John McGibbney updated NUTCH-797:
--
Affects Version/s: nutchgora
Fix Version/s: 2.1 (was: nutchgora)

Set and Classify

parse-tika is not properly constructing URLs when the target begins with a ?
--
Key: NUTCH-797
Affects Versions: 1.1, nutchgora
Fix For: 2.1
Attachments: NUTCH-797.patch, pureQueryUrl-2.patch, pureQueryUrl.patch
[jira] [Updated] (NUTCH-797) parse-tika is not properly constructing URLs when the target begins with a ?
[ https://issues.apache.org/jira/browse/NUTCH-797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrzej Bialecki updated NUTCH-797:
--
Attachment: NUTCH-797.patch

Tentative patch, which changes the meaning of fixEmbeddedParams to removeEmbeddedParams.

parse-tika is not properly constructing URLs when the target begins with a ?
--
Key: NUTCH-797
Affects Versions: 1.1
Fix For: nutchgora
Attachments: NUTCH-797.patch, pureQueryUrl-2.patch, pureQueryUrl.patch
[jira] [Updated] (NUTCH-797) parse-tika is not properly constructing URLs when the target begins with a ?
[ https://issues.apache.org/jira/browse/NUTCH-797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-797:
--
Fix Version/s: 2.0, 1.4

Back on radar: has this ever been committed at all?

parse-tika is not properly constructing URLs when the target begins with a ?
--
Key: NUTCH-797
Affects Versions: 1.1
Fix For: 1.4, 2.0
Attachments: pureQueryUrl-2.patch, pureQueryUrl.patch