Hi All,
OK, let's push something above average into this aged topic: what about robots?

Recently I browsed Nutch (and Yahoo invested a lot in India) and Bixo (and I invested something). Both still have problems, and (if you are really trying to be a polite robot):

1. Don't follow http://www.robotstxt.org; this is a privately owned website, the information there is at least 12 years out of date, and don't forget to click the Google Ads.
2. There are some new standards, and all such standards were pushed by robots!
3. . A LOT!!!

I was thinking about coding style: projects such as Nutch, Solr, Bixo and Cascading use static classes, so the codebase seems very small, and a single class can do 10 times more than your 100 classes. But I think it's better to improve the existing codebase instead of doing a complete rewrite (and, very bad, copy-paste!).

Here are some improvements which I am going to work on; tell me if you are interested:

- Spell check: user-agent, useragent, usreagent, .
- UTF-8: according to the specs, a URL must be decoded before applying a rule from robots.txt; additionally, %2F need not be decoded!

For instance, both Nutch and Bixo rely on Droids, but nothing happens.

I think the framework should be clear enough that we can add new rules (such as recrawl rate or sitemap, or even custom domain-specific rules such as the Nutch RegEx filter).

I want to push some code, but I think it's much better to follow the Nutch coding style (local/static/private) instead of this extremely naïve interface and implementation.

Thanks,
Fuad Efendi
+1 416-993-2060
http://www.linkedin.com/in/liferay
Tokenizer Inc.
http://www.tokenizer.ca/
Data Mining, Vertical Search
(sorry for the Search Engine Optimization trick, but it is so popular here!)
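
P.S. To make the spell-check and decoding items above a bit more concrete, here is a rough sketch. It is in Python only for brevity (Nutch and Bixo are Java), and it is not code from any of these projects. The function names `normalize_directive`, `decode_path` and `path_matches`, the list of known directives, and the 0.75 similarity cutoff are all assumptions made for illustration.

```python
import difflib
import re
from urllib.parse import unquote

# Assumed list of directives; a real parser may recognise more.
KNOWN_DIRECTIVES = ["user-agent", "disallow", "allow", "sitemap", "crawl-delay"]


def normalize_directive(name, cutoff=0.75):
    """Map a possibly misspelled directive name to a known one, else None.

    The cutoff is an arbitrary assumption, not a value from any spec.
    """
    key = name.strip().lower()
    if key in KNOWN_DIRECTIVES:
        return key
    close = difflib.get_close_matches(key, KNOWN_DIRECTIVES, n=1, cutoff=cutoff)
    return close[0] if close else None


def decode_path(path):
    """Percent-decode a path as UTF-8, but leave encoded slashes alone.

    Any %2F or %2f stays encoded (and is written back as %2F), so an
    encoded slash is never confused with a real path separator.
    """
    parts = re.split(r"%2[fF]", path)
    return "%2F".join(unquote(part) for part in parts)


def path_matches(rule, url_path):
    """Prefix-match a robots.txt rule against a URL path after decoding both."""
    return decode_path(url_path).startswith(decode_path(rule))
```

With this sketch, `normalize_directive("usreagent")` returns `"user-agent"`, and `decode_path("/a%2Fb/%7Ejoe")` returns `"/a%2Fb/~joe"`, so a rule `/~joe` matches the path `/%7Ejoe/index.html` while `/a/b` does not match `/a%2Fb`.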
