Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change 
notification.

The following page has been changed by KenKrugler:
http://wiki.apache.org/nutch/ApacheConUs2009MeetUp

------------------------------------------------------------------------------
  
  Below are some potential topics for discussion - feel free to add/comment.
  
- * Potential synergies between crawler projects - e.g. sharing robots.txt 
processing code.
+  * Potential synergies between crawler projects - e.g. sharing robots.txt 
processing code.
- * How to avoid end-user abuse - webmasters sometimes block crawlers because 
users configure them to be impolite.
+  * How to avoid end-user abuse - webmasters sometimes block crawlers because 
users configure them to be impolite.
- * Politeness vs. efficiency - various options for how to be considered 
polite, while still crawling quickly.
+  * Politeness vs. efficiency - various options for how to be considered 
polite, while still crawling quickly.
- * robots.txt processing - current problems with existing implementations
+  * robots.txt processing - current problems with existing implementations
- * Avoiding crawler traps - link farms, honeypots, etc.
+  * Avoiding crawler traps - link farms, honeypots, etc.
- * Parsing content - home grown, Neko/TagSoup, Tika, screen scraping
+  * Parsing content - home grown, Neko/TagSoup, Tika, screen scraping
- * Search infrastructure - options for serving up crawl results (Nutch, Solr, 
Katta, others?)
+  * Search infrastructure - options for serving up crawl results (Nutch, Solr, 
Katta, others?)
- * Testing challenges - is it possible to unit test a crawler?
+  * Testing challenges - is it possible to unit test a crawler?
- * Fuzzy classification - mime-type, charset, language.
+  * Fuzzy classification - mime-type, charset, language.
- * The future of Nutch, Droids, Heritrix, Bixo, etc.
+  * The future of Nutch, Droids, Heritrix, Bixo, etc.
- * Optimizing for types of crawling - intranet, focused, whole web.
+  * Optimizing for types of crawling - intranet, focused, whole web.
  
