On Fri, 25 Mar 2011 13:08:40 -0700 (PDT), Otis Gospodnetic <[email protected]> wrote:
> Hi,
>
> Somebody (Paul?) mentioned using Droids for doing a 50M page crawl. Anyone else
> using Droids for crawls of that size?

Yes, but a "little" bit more :) I had a seed of 60M hosts and crawled about 2 billion pages on a 16-node cluster. The crawl took about 3 weeks with an average bandwidth of 35 Mbit/s per node.

> I'm asking because I have a need to do a "semi-vertical" crawl on up to 10K
> domains and I'm considering Droids vs. Nutch. This may translate to several
> times that many different servers - say 100K. And that may translate to a few
> 100M web pages. Too big for Droids without having a persistent link queue,
> right?

I had no more than 128 droid threads running per machine, with a total in-memory queue limit (for all threads per node) of ~500,000 pages. I had a couple of tweaks here and there, plus an efficient tree structure for storing visited URLs. With 10 GB you could easily go up to 2 million entries when running only one droid thread per VM, perhaps even more.

Here is something pretty to look at: http://twitpic.com/4d87i7 ;)

My crawler design was rather simple:

((one host - one droid, stay on the host) * 128 threads) * 16 nodes

plus one master node for the global seed queue.

If your crawl needs to behave more organically and may run in a less controlled environment, take a look at 80Legs.

If you choose to go with Droids, you will have to spend some time on the HTTP client settings and implement some workarounds in Droids / HttpClient to:

- prevent stale/stuck sockets,
- handle chunked stream aborts correctly (crawler traps serving you an endless stream of links, for instance),
- take care of a rather critical robots issue.

The NekoHTML parser is also full of surprises, so be careful with that. At least when you are lazy like me and use the DOM model instead of SAX to preparse some meta tags and do content processing.

I don't see how my spare time would allow me to clean up my battlefield and release all of it back to the community any time soon, but I am not sitting on that code either. Talk to me if you want to port aspects of that back to Droids.
I would be more than happy to rip out chunks of that code and pass it along for proper integration into the main branch.

Regards,
Paul.
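
For readers who want a concrete picture of the "one host - one droid, stay on the host" design with a bounded in-memory queue and a tree of visited URLs, here is a minimal sketch. It is not the code from the crawl described above and it does not use the Droids API. It is written in Python for brevity, and the names `VisitedUrls`, `HostDroid`, `offer` and `queue_limit` are hypothetical. The tree layout (host first, then one level per path segment) is only one plausible way to build such a structure.

```python
from collections import deque
from urllib.parse import urlsplit


class VisitedUrls:
    """Prefix tree of visited URLs: host first, then one level per path segment.

    Assumption for this sketch: scheme and trailing slashes are ignored,
    so http://h/a and https://h/a/ count as the same URL.
    """

    _END = None  # marker key: a URL ends at this node

    def __init__(self):
        self._root = {}
        self._count = 0

    @staticmethod
    def _key(url):
        parts = urlsplit(url)
        host = (parts.hostname or "").lower()
        segments = [s for s in parts.path.split("/") if s]
        if parts.query:
            segments.append("?" + parts.query)
        return [host] + segments

    def add(self, url):
        """Record url. Return True if it was new, False if already seen."""
        node = self._root
        for part in self._key(url):
            node = node.setdefault(part, {})
        if self._END in node:
            return False
        node[self._END] = True
        self._count += 1
        return True

    def __contains__(self, url):
        node = self._root
        for part in self._key(url):
            node = node.get(part)
            if node is None:
                return False
        return self._END in node

    def __len__(self):
        return self._count


class HostDroid:
    """One droid per host: accepts only links on its own host."""

    def __init__(self, host, visited, queue_limit):
        self.host = host.lower()
        self.visited = visited
        self.queue_limit = queue_limit  # in-memory cap, hypothetical parameter
        self.queue = deque()

    def offer(self, url):
        """Queue url if it is on this host, the queue has room, and it is unseen."""
        if (urlsplit(url).hostname or "").lower() != self.host:
            return False  # stay on the host
        if len(self.queue) >= self.queue_limit:
            return False  # queue full: drop the link without marking it visited
        if not self.visited.add(url):
            return False  # already visited or already queued
        self.queue.append(url)
        return True
```

Each worker thread would own one `HostDroid` at a time and ask the master node for a new host once its queue runs dry. Sharing path prefixes in the tree is what keeps memory per URL low when many URLs live under the same host and directory.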
