Hi, all.

If the "href" attribute value of an A tag in an HTML document begins with a question mark (?), web browsers treat it as a query string and resolve it without any problem.
But Nutch generates a malformed URL from it because of the java.net.URL class, so it cannot crawl the right page.

Let's look at the source code of org.apache.nutch.parse.html.DOMContentUtils.getOutlinks:

    URL url = (base.toString().indexOf(';') > 0) ?
        fixEmbeddedParams(base, target) : new URL(base, target);
    outlinks.add(new Outlink(url.toString(),
        linkText.toString().trim()));

And see http://java.sun.com/javase/6/docs/api/java/net/URL.html#URL(java.net.URL, java.lang.String)

    public URL(URL context, String spec)

    If the spec's path component begins with a slash character "/" then the
    path is treated as absolute and the spec path replaces the context path.
    Otherwise, the path is treated as a relative path and is appended to the
    context path, as described in RFC2396. Also, in this case, the path is
    canonicalized through the removal of directory changes made by
    occurrences of ".." and ".".

Because of this constructor, Nutch gets the malformed URL. A target that consists only of a query string has an empty path, so it falls into the "relative path" case and the last segment of the base path is dropped.

For example, if the base URL is "http://some.domain/dir/page?param=value" and the target URL is "?param1=value1&param2=value2", new URL(base, target) makes "http://some.domain/dir/?param1=value1&param2=value2", not "http://some.domain/dir/page?param1=value1&param2=value2". Nutch would then crawl the wrong URL.

I think the DOMContentUtils.getOutlinks() method should be modified.

Thanks in advance.

- Donghyeok Kang
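
P.S. A rough sketch of the kind of change I mean, not a tested patch against Nutch. The class name QueryOnlyLinks and the helper resolve() are made up for illustration and do not exist in Nutch. The idea is simply to put the last path segment of the base URL back in front of a target that starts with "?" before handing it to new URL(base, target):

```java
import java.net.MalformedURLException;
import java.net.URL;

class QueryOnlyLinks {

    // Hypothetical helper (not part of Nutch): resolves target against base,
    // keeping the last path segment of base when target is only a query string.
    public static URL resolve(URL base, String target) throws MalformedURLException {
        if (!target.startsWith("?")) {
            // Every other kind of target goes through the usual constructor.
            return new URL(base, target);
        }
        String path = base.getPath();                                   // e.g. "/dir/page"
        String lastSegment = path.substring(path.lastIndexOf('/') + 1); // e.g. "page"
        // "page?param1=..." is an ordinary relative reference, so the
        // constructor appends it to "/dir/" and the page name survives.
        return new URL(base, lastSegment + target);
    }

    public static void main(String[] args) throws MalformedURLException {
        URL base = new URL("http://some.domain/dir/page?param=value");
        String target = "?param1=value1&param2=value2";

        // Plain constructor, for comparison (this is the call described above).
        System.out.println(new URL(base, target));

        // With the helper:
        // http://some.domain/dir/page?param1=value1&param2=value2
        System.out.println(resolve(base, target));
    }
}
```

In getOutlinks() this would presumably replace the plain new URL(base, target) branch. I have not checked how it should interact with fixEmbeddedParams().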