On Sun, 2009-04-05 at 00:44 +0100, Robin Howlett wrote:
> I was just looking through NoRobotClient and have a concern about whether
> Droids will actually respect robots.txt when force allow is false in most
> scenarios; consider the following robots.txt:
It is easier to have a test class to debug this.

>
> User-agent: *
> Disallow: /foo/
>
> and the starting URI: http://www.example.com/foo/bar.html
>
> In the code I see - in NoRobotClient.isUrlAllowed() - the following:
>
> String path = uri.getPath();
> String basepath = baseURI.getPath();

The base path in our example is http://www.example.com.

> if (path.startsWith(basepath)) {
>     path = path.substring(basepath.length());
>     if (!path.startsWith("/")) {
>         path = "/" + path;
>     }
> }

path is /foo/bar.html

> ...
>
> Boolean allowed = this.rules != null ? this.rules.isAllowed( path ) : null;
> if (allowed == null) {
>     allowed = this.wildcardRules != null ? this.wildcardRules.isAllowed( path ) : null;
> }
> if (allowed == null) {
>     allowed = Boolean.TRUE;
> }
>
> The path will always be converted to /bar.html and is checked against the
> Rules in rules and wildcardRules, but won't be found. However, basepath
> (which will now be /foo) is never checked against the Rules, therefore
> giving an incorrect true result for the isUrlAllowed method, no?

Hmm, see above, I disagree but have not debugged it yet. Will do that now.

salu2

> robin

-- 
Thorsten Scherler <thorsten.at.apache.org>
Open Source <consulting, training and solutions>
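For anyone who wants that test class: below is a minimal, self-contained sketch of the scenario under discussion. It is NOT the actual Droids/NoRobot code; `relativize` copies the quoted substring logic from `isUrlAllowed()`, and `isAllowed` is an assumed stand-in for `Rules.isAllowed` doing a simple prefix match against the disallow list. It just makes the two readings of the code easy to compare: whether the base path is "/" (the full path is checked) or "/foo" (the prefix gets stripped, as Robin describes).

```java
import java.net.URI;
import java.util.List;

public class RobotsPathDemo {

    // Mirrors the substring logic quoted from NoRobotClient.isUrlAllowed()
    static String relativize(String path, String basepath) {
        if (path.startsWith(basepath)) {
            path = path.substring(basepath.length());
            if (!path.startsWith("/")) {
                path = "/" + path;
            }
        }
        return path;
    }

    // Hypothetical stand-in for Rules.isAllowed: a plain prefix match
    // against the Disallow entries (not the real NoRobot implementation).
    static boolean isAllowed(String path, List<String> disallowed) {
        for (String prefix : disallowed) {
            if (path.startsWith(prefix)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<String> disallowed = List.of("/foo/");   // Disallow: /foo/
        String path = URI.create("http://www.example.com/foo/bar.html").getPath();

        // If the base path is "/", stripping it leaves "/foo/bar.html",
        // which the rules match: correctly disallowed.
        String kept = relativize(path, "/");
        System.out.println(kept + " allowed? " + isAllowed(kept, disallowed));

        // The case Robin worries about: a base path of "/foo" strips the
        // prefix, leaving "/bar.html", which the rules no longer match.
        String stripped = relativize(path, "/foo");
        System.out.println(stripped + " allowed? " + isAllowed(stripped, disallowed));
    }
}
```

Running it shows `/foo/bar.html allowed? false` for the first case and `/bar.html allowed? true` for the second, so the outcome really does hinge on what `baseURI.getPath()` returns, which is what needs debugging.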
