[ http://issues.apache.org/jira/browse/NUTCH-101?page=comments#action_12331658 ]
Fuad Efendi commented on NUTCH-101:
-----------------------------------

1. There is a bug in method parseRules(byte[] robotContent):

    ...
    StringTokenizer lineParser = new StringTokenizer(content, "\n\r");
    ...

Should be:

    ...
    content = content.replaceAll("\\r+", "\n");
    content = content.replaceAll("\\n+", "\n");
    StringTokenizer lineParser = new StringTokenizer(content, "\n");
    ...

(or something better)

Even more characters should be allowed:
- newline (line feed) character ('\n'),
- carriage-return character followed immediately by a newline character ("\r\n"),
- standalone carriage-return character ('\r'),
- next-line character ('\u0085'),
- line-separator character ('\u2028'),
- paragraph-separator character ('\u2029').

2. The code contains a check for "Allow:"; however, it works fine with the standard empty "Disallow:" (which means allow everything).

3. There is a minor bug in main():

    ...
    String[] robotNames = new String[argv.length - 1];
    ...

Must be:

    ...
    String[] robotNames = new String[argv.length - 2];
    ...

> RobotRulesParser
> ----------------
>
>          Key: NUTCH-101
>          URL: http://issues.apache.org/jira/browse/NUTCH-101
>      Project: Nutch
>         Type: Bug
>   Components: fetcher
>     Versions: 0.7, 0.8-dev
>     Reporter: Fuad Efendi
>
> I noticed this code in protocol-http & protocol-httpclient plugins:
>     } else if ( (line.length() >= 6)
>                 && (line.substring(0, 6).equalsIgnoreCase("Allow:")) ) {
> However, according to the original 1994 protocol description, there is NO
> "Allow:" field. To allow, simply use "Disallow: ".
> http://www.robotstxt.org/wc/norobots.html
> Please, try to test with www.newegg.com/robots.txt
> - their site has this:
> User-agent: *
> Disallow:
> And Nutch does not work with New Egg, but it should!
> Sorry guys, I don't have enough time to double-ensure, could you please
> verify all this...
> I noticed strange discussion at nutch-agent:lucene.apache.org, it seems that
> we need to test ......./robots.txt
> User-agent: ia_archiver
> Disallow: /
> User-agent: Googlebot-Image
> Disallow: /
> User-agent: Nutch
> Disallow: /
> User-agent: TurnitinBot
> Disallow: /
> - everything according to standard protocol. Can you retest please whether it
> works with multiline? It's a standard!
> I see this in code:
>     StringTokenizer tok = new StringTokenizer(agentNames, ",");
> Comma separated? It's not accepted standard yet...
> Sorry WebExpertsAmerica, I really didn't have any time to make any test...
> Please do not execute tests against production sites.
> Thanks!
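
For point 1 of the comment, here is a minimal sketch of one possible "something better": splitting robots.txt content on every line terminator listed there in a single pass. This is not the Nutch code. The class name RobotsTxtLines and the method splitLines are hypothetical, and skipping empty lines is an assumption made to mirror how StringTokenizer collapses consecutive delimiters.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch: splits robots.txt content into lines.
final class RobotsTxtLines {

    // "\r\n" is listed first so that a CRLF pair counts as one terminator.
    // The remaining alternatives are the single-character terminators from
    // the comment: \n, \r, U+0085, U+2028 and U+2029.
    private static final Pattern LINE_BREAK =
            Pattern.compile("\\r\\n|[\\n\\r\\u0085\\u2028\\u2029]");

    private RobotsTxtLines() {
    }

    // Returns the non-empty lines of the given content, in order.
    // Empty lines are dropped (assumption: same effect as StringTokenizer,
    // which never returns empty tokens).
    static List<String> splitLines(String content) {
        List<String> lines = new ArrayList<>();
        for (String line : LINE_BREAK.split(content)) {
            if (!line.isEmpty()) {
                lines.add(line);
            }
        }
        return lines;
    }
}
```

A caller would pass the decoded robots.txt text, for example RobotsTxtLines.splitLines(content), and iterate over the result instead of using a StringTokenizer. On Java 8 and later the regex construct \R could probably replace the explicit pattern, but it also treats vertical tab and form feed as line breaks, which may or may not be wanted here.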