[ https://issues.apache.org/jira/browse/NUTCH-101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Otis Gospodnetic resolved NUTCH-101.
------------------------------------

    Resolution: Fixed

Thank you Ken.

> RobotRulesParser
> ----------------
>
>                 Key: NUTCH-101
>                 URL: https://issues.apache.org/jira/browse/NUTCH-101
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher
>    Affects Versions: 0.6, 0.7, 0.7.1, 0.8
>            Reporter: Fuad Efendi
>
> I noticed this code in protocol-http & protocol-httpclient plugins:
>   } else if ( (line.length() >= 6)
>               && (line.substring(0, 6).equalsIgnoreCase("Allow:")) ) {
> However, according to the original 1994 protocol description, there is NO
> "Allow:" field. To allow, simply use "Disallow: ".
> http://www.robotstxt.org/wc/norobots.html
> Please, try to test with www.newegg.com/robots.txt
> - their site has this:
> User-agent: *
> Disallow:
> And Nutch does not work with New Egg, but it should!
> Sorry guys, I don't have enough time to double-ensure, could you please
> verify all this...
> I noticed strange discussion at nutch-agent:lucene.apache.org, it seems that
> we need to test ......./robots.txt
> User-agent: ia_archiver
> Disallow: /
> User-agent: Googlebot-Image
> Disallow: /
> User-agent: Nutch
> Disallow: /
> User-agent: TurnitinBot
> Disallow: /
> - everything according to standard protocol. Can you retest please whether it
> works with multiline? It's a standard!
> I see this in code:
>       StringTokenizer tok = new StringTokenizer(agentNames, ",");
>
> Comma separated? It's not accepted standard yet...
> Sorry WebExpertsAmerica, I really didn't have any time to make any test...
> Please do not execute tests against production sites.
> Thanks!

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
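
For illustration only, the two cases raised in the report (an empty
"Disallow:" under "User-agent: *", and several consecutive User-agent
records) could be checked with something like the sketch below. This is
not Nutch's RobotRulesParser and not the fix that was committed; the class
name RobotsSketch and both method names are hypothetical. It assumes that
agent names are compared by case-insensitive equality, that a new record
starts at the first User-agent line following any other field, and that
only Disallow lines are interpreted.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical sketch of 1994-style robots.txt handling; not Nutch code. */
class RobotsSketch {

    /**
     * Returns the Disallow prefixes that apply to agentName, falling back
     * to the "*" record when no record names the agent explicitly.
     * An empty list means nothing is disallowed.
     */
    static List<String> disallowedFor(String robotsTxt, String agentName) {
        List<String> specific = new ArrayList<>();
        List<String> wildcard = new ArrayList<>();
        boolean foundSpecific = false;  // some record named our agent
        boolean matchesAgent = false;   // current record names our agent
        boolean matchesStar = false;    // current record names "*"
        boolean inAgentLines = false;   // previous field was User-agent

        for (String raw : robotsTxt.split("\\r?\\n")) {
            // Drop trailing comments, then split "field: value".
            int hash = raw.indexOf('#');
            String line = (hash >= 0 ? raw.substring(0, hash) : raw).trim();
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue; // blank or malformed line
            }
            String field = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = line.substring(colon + 1).trim();

            if (field.equals("user-agent")) {
                if (!inAgentLines) {
                    // First User-agent line after a rule: a new record begins.
                    matchesAgent = false;
                    matchesStar = false;
                }
                inAgentLines = true;
                if (value.equals("*")) {
                    matchesStar = true;
                } else if (value.equalsIgnoreCase(agentName)) {
                    matchesAgent = true;
                    foundSpecific = true;
                }
            } else if (field.equals("disallow")) {
                inAgentLines = false;
                if (value.isEmpty()) {
                    continue; // empty Disallow: nothing is disallowed
                }
                if (matchesAgent) {
                    specific.add(value);
                }
                if (matchesStar) {
                    wildcard.add(value);
                }
            } else {
                inAgentLines = false; // other fields are ignored in this sketch
            }
        }
        return foundSpecific ? specific : wildcard;
    }

    /** True when no applicable Disallow prefix matches the start of path. */
    static boolean isAllowed(String robotsTxt, String agentName, String path) {
        for (String prefix : disallowedFor(robotsTxt, agentName)) {
            if (path.startsWith(prefix)) {
                return false;
            }
        }
        return true;
    }
}
```

Under these assumptions, the two-line file quoted from newegg.com leaves
every path allowed, while the four-record example blocks the agent named
"Nutch" from every path and leaves an agent that is not listed
unrestricted. The input strings should be local test data rather than live
fetches, in line with the reporter's request not to test against
production sites.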