[ https://issues.apache.org/jira/browse/NUTCH-797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sebastian Nagel updated NUTCH-797:
----------------------------------
    Attachment: test_nutch_797.html

Tested using parsechecker (cf. NUTCH-1743) with attached sample document:
* fixed for trunk and parse-tika
* still open for parse-html in 2.x

Same applies to NUTCH-566 and NUTCH-952.

> parse-tika is not properly constructing URLs when the target begins with a "?"
> ------------------------------------------------------------------------------
>
>                 Key: NUTCH-797
>                 URL: https://issues.apache.org/jira/browse/NUTCH-797
>             Project: Nutch
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.1, nutchgora
>         Environment: Win 7, Java(TM) SE Runtime Environment (build 1.6.0_16-b01)
> Also repro's on RHEL and java 1.4.2
>            Reporter: Robert Hohman
>            Assignee: Andrzej Bialecki
>            Priority: Minor
>             Fix For: 1.9
>
>         Attachments: NUTCH-797.patch, pureQueryUrl-2.patch, pureQueryUrl.patch, test_nutch_797.html
>
>
> This is my first bug and patch on nutch, so apologies if I have not provided enough detail.
> In crawling the page at http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0&sk=0 there are links in the page that look like this:
> <a href="?co=0&sk=0&p=2&pi=1">2</a></td><td><a href="?co=0&sk=0&p=3&pi=1">3</a>
> in org.apache.nutch.parse.tika.DOMContentUtils rev 916362 (trunk), as getOutlinks looks for links, it comes across this link, and constructs a new url with a base URL class built from "http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0&sk=0", and a target of "?co=0&sk=0&p=2&pi=1"
> The URL class, per RFC 3986 at http://labs.apache.org/webarch/uri/rfc/rfc3986.html#relative-merge, defines how to merge these two, and per the RFC, the URL class merges these to:
> http://careers3.accenture.com/Careers/ASPX/?co=0&sk=0&p=2&pi=1
> because the RFC explicitly states that the rightmost url segment (the Search.aspx in this case) should be ripped off before combining.
> While this is compliant with the RFC, it means the URLs which are created for the next round of fetching are incorrect. Modern browsers seem to handle this case (I checked IE8 and Firefox 3.5), so I'm guessing this is an obscure exception or handling of what is a poorly formed url on accenture's part.
> I have fixed this by modifying DOMContentUtils to look for the case where a ? begins the target, and then pulling the rightmost component out of the base and inserting it into the target before the ?, so the target in this example becomes:
> Search.aspx?co=0&sk=0&p=2&pi=1
> The URL class then properly constructs the new url as:
> http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0&sk=0&p=2&pi=1
> If it is agreed that this solution works, I believe the other html parsers in nutch would need to be modified in a similar way.
> Can I get feedback on this proposed solution? Specifically I'm worried about unforeseen side effects.
> Much thanks
> Here is the patch info:
> Index: src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java
> ===================================================================
> --- src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java (revision 916362)
> +++ src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java (working copy)
> @@ -299,6 +299,50 @@
>      return false;
>    }
>
> +  private URL fixURL(URL base, String target) throws MalformedURLException
> +  {
> +    // handle params that are embedded into the base url - move them to target
> +    // so URL class constructs the new url class properly
> +    if (base.toString().indexOf(';') > 0)
> +      return fixEmbeddedParams(base, target);
> +
> +    // handle the case that there is a target that is a pure query.
> +    // Strictly speaking this is a violation of RFC 2396 section 5.2.2 on how to assemble
> +    // URLs but I've seen this in numerous places, for example at
> +    // http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0&sk=0
> +    // It has urls in the page of the form href="?co=0&sk=0&pg=1", and by default
> +    // URL constructs the base+target combo as
> +    // http://careers3.accenture.com/Careers/ASPX/?co=0&sk=0&pg=1, incorrectly
> +    // dropping the Search.aspx target
> +    //
> +    // Browsers handle these just fine, they must have an exception similar to this
> +    if (target.startsWith("?"))
> +    {
> +      return fixPureQueryTargets(base, target);
> +    }
> +
> +    return new URL(base, target);
> +  }
> +
> +  private URL fixPureQueryTargets(URL base, String target) throws MalformedURLException
> +  {
> +    if (!target.startsWith("?"))
> +      return new URL(base, target);
> +
> +    String basePath = base.getPath();
> +    String baseRightMost="";
> +    int baseRightMostIdx = basePath.lastIndexOf("/");
> +    if (baseRightMostIdx != -1)
> +    {
> +      baseRightMost = basePath.substring(baseRightMostIdx+1);
> +    }
> +
> +    if (target.startsWith("?"))
> +      target = baseRightMost+target;
> +
> +    return new URL(base, target);
> +  }
> +
>    /**
>     * Handles cases where the url param information is encoded into the base
>     * url as opposed to the target.
> @@ -400,8 +444,7 @@
>          if (target != null && !noFollow && !post)
>            try {
>
> -            URL url = (base.toString().indexOf(';') > 0) ?
> -              fixEmbeddedParams(base, target) : new URL(base, target);
> +            URL url = fixURL(base, target);
>              outlinks.add(new Outlink(url.toString(),
>                  linkText.toString().trim()));
>            } catch (MalformedURLException e) {

--
This message was sent by Atlassian JIRA
(v6.2#6252)
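
For anyone who wants to try the pure-query case outside of Nutch, here is a minimal standalone sketch of the idea behind fixPureQueryTargets from the quoted patch. The class name PureQueryResolve and the method name resolve are hypothetical and do not exist in Nutch; the sketch only assumes java.net.URL from the JDK. It prints the result of the plain two-argument URL constructor next to the patched resolution, so the two can be compared on whatever JVM is at hand. As a side note, my reading of RFC 3986 section 5.2.2 is that a reference with an empty path keeps the base path, while the examples in RFC 2396 Appendix C resolve "?y" by dropping the last segment, so the patched result appears to agree with the newer RFC.

```java
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical standalone class, not part of Nutch.
public class PureQueryResolve {

    /**
     * Resolves target against base. If target is a pure query ("?..."),
     * the last path segment of the base is put in front of it first, so
     * the resolved URL keeps that segment (same idea as the quoted patch).
     */
    static URL resolve(URL base, String target) throws MalformedURLException {
        if (target.startsWith("?")) {
            String path = base.getPath();
            // lastIndexOf returns -1 when there is no '/', so +1 yields 0
            // and the whole (possibly empty) path is used.
            String lastSegment = path.substring(path.lastIndexOf('/') + 1);
            target = lastSegment + target;
        }
        return new URL(base, target);
    }

    public static void main(String[] args) throws MalformedURLException {
        URL base = new URL(
            "http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0&sk=0");
        String target = "?co=0&sk=0&p=2&pi=1";

        // Plain constructor, as used before the patch.
        System.out.println("plain:   " + new URL(base, target));
        // Resolution with the pure-query handling applied.
        System.out.println("patched: " + resolve(base, target));
    }
}
```

Targets that do not start with "?" go through the plain constructor unchanged, so the sketch should not affect ordinary relative links.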