I may be awfully wrong about this, but below is my plan for super-fast crawling. I prepared it for a venture that no longer needs it, but it looks like fun to do anyway. What would you all say: is there a need, and what's wrong with the plan?
Thank you,
Mark

Fast Crawl Plan
===============

The goal of Nutch is an exhaustive crawl. It works best for internal sites,
or intranets, and it has known problems with wide-web search. It is
optimized for correctness, and since it is an open-source engine meant to be
safe in anyone's hands, its polite crawling is hard to mess up -- but it is
not optimized for performance.

I also see another area that slows it down: it uses a database. That makes
it easy to program, scale, and operate, but it does not make it a fast
runner; the fastest crawlers tend to keep their hot state out of the
database. Therefore, I would write my own crawler, optimized for
performance. Here is what my approach would be:

- look at the Nutch code for snippets (for example, Fetcher.java), so as
  not to reinvent the wheel;
- having made the individual in-thread performance reasonably fast, take
  the following optimization steps;
- use a fast mechanism for real-time thread coordination -- not a database,
  but JavaSpaces (the free GigaSpaces implementation);
- prepare URLs so that different threads fetch from different domains
  simultaneously, with more-or-less polite crawling within each domain
  (see the sketch at the end of this mail);
- build in blocking detection -- today we don't even know when, or if, we
  are blocked, and being blocked can show up as timeouts;
- do it on one crawler for starters, but keep in mind that the code should
  later scale out to a Hadoop cluster.

Mark

On Tue, Nov 24, 2009 at 11:32 AM, MilleBii <mille...@gmail.com> wrote:
> Why would DNS local caching work? It only helps if you crawl the same
> site often... in which case you are hit by the politeness delay.
>
> If your segments contain only/mainly different sites, it is not really
> going to help.
>
> So far I have not seen my quad core + 100 Mb/s + pseudo-distributed
> Hadoop going faster than 10 fetches/s. Let me check the DNS and I will
> tell you.
>
> I vote for 100 fetches/s -- not sure how to get it, though.
>
>
>
> 2009/11/24, Dennis Kubes <ku...@apache.org>:
> > Hi Mark,
> >
> > I just put this up on the wiki. Hope it helps:
> >
> > http://wiki.apache.org/nutch/OptimizingCrawls
> >
> > Dennis
> >
> >
> > Mark Kerzner wrote:
> >> Hi, guys,
> >>
> >> my goal is to do my crawls at 100 fetches per second, observing, of
> >> course, polite crawling. But when the URLs are all on different
> >> domains, what, theoretically, would stop some software from
> >> downloading from 100 domains at once, achieving the desired speed?
> >>
> >> But whatever I do, I can't make Nutch crawl at that speed. Even if it
> >> starts at a few dozen URLs/second, it slows down at the end (as
> >> discussed by many, and by Krugler).
> >>
> >> Should I write something of my own, or are there fast crawlers?
> >>
> >> Thanks!
> >>
> >> Mark
> >>
> >
>
> --
> Sent from my mobile
>
> -MilleBii-
>
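P.S. To make the per-domain bullet in my plan concrete, here is a rough
sketch of the scheduler I have in mind. It is plain Java; all class and
method names are mine (none of this is Nutch or GigaSpaces API), and the
2-second politeness delay is an arbitrary placeholder. Real code would add
robots.txt handling, blocking detection on the response, and the JavaSpaces
hand-off between machines.

import java.net.URI;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class PoliteFetchScheduler {

    // Minimum gap between two hits on the same host (placeholder value).
    private static final long HOST_DELAY_MS = 2000;

    // One FIFO queue of pending URLs per host.
    private final Map<String, ConcurrentLinkedQueue<String>> queues =
            new ConcurrentHashMap<>();

    // Timestamp of the last fetch per host; doubles as the "claim" lock.
    private final Map<String, Long> lastFetch = new ConcurrentHashMap<>();

    public void add(String url) {
        String host = URI.create(url).getHost();
        queues.computeIfAbsent(host, h -> new ConcurrentLinkedQueue<String>())
              .add(url);
    }

    // Worker loop: scan all hosts and fetch from any host whose politeness
    // delay has elapsed. Different hosts proceed in parallel; the same host
    // is never hit twice inside HOST_DELAY_MS.
    private void workerLoop() {
        while (!Thread.currentThread().isInterrupted()) {
            boolean idle = true;
            for (Map.Entry<String, ConcurrentLinkedQueue<String>> e
                    : queues.entrySet()) {
                String host = e.getKey();
                long now = System.currentTimeMillis();
                Long last = lastFetch.get(host);
                if (last != null && now - last < HOST_DELAY_MS) {
                    continue; // still inside the politeness window
                }
                // Claim the host atomically so only one worker takes it.
                boolean claimed = (last == null)
                        ? lastFetch.putIfAbsent(host, now) == null
                        : lastFetch.replace(host, last, now);
                if (!claimed) {
                    continue;
                }
                String url = e.getValue().poll();
                if (url != null) {
                    fetch(url);
                    idle = false;
                }
            }
            if (idle) {
                try {
                    Thread.sleep(100); // nothing ready; back off briefly
                } catch (InterruptedException ie) {
                    return;
                }
            }
        }
    }

    // Stand-in for the real fetch: HTTP GET, blocking detection on the
    // response code, parse, enqueue outlinks.
    private void fetch(String url) {
        System.out.println(Thread.currentThread().getName()
                + " fetching " + url);
    }

    public void start(int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int i = 0; i < threads; i++) {
            pool.submit(this::workerLoop);
        }
    }

    public static void main(String[] args) {
        PoliteFetchScheduler s = new PoliteFetchScheduler();
        s.add("http://example.com/a");
        s.add("http://example.com/b"); // same host: fetched 2 s apart
        s.add("http://example.org/c"); // different host: in parallel
        s.start(4);
    }
}

The point of the design is the compare-and-swap on the last-fetch
timestamp: two threads can never hit the same host inside the delay
window, while a hundred different hosts can be in flight at once, which is
exactly what 100 fetches/s across 100 domains needs.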