Hey guys, I've focused on a very specific web-crawl task while refactoring my copy of droids:
- Tasks now have a context which can be used to share arbitrary data, and so do HTTP entities (needed to store headers for a longer period of time)
- Various crawl-related metrics/counters exposed via JMX / MBeans (plus a munin config)
- robots.txt / domain caching (there is currently a serious problem with this in trunk! Solutions provided, but no patches... yet :/)
- A bunch of new link filters to clean up a URL and prevent processing the same page over and over again
- A visited-URL tree structure to keep memory consumption down. I also plan to use the same tree as a task queue, which should lower memory consumption significantly.
- Pluggable DNS resolvers
- Plenty of small bugfixes; some of them really minor, others I tried to report along with solutions.
- Handling of redirect codes (via meta and/or header) and exposing that information to the extractors, writers, etc.
- Improved encoding detection/handling
- Added TaskExecutionDeciders; simply filtering the links was not sufficient in some rare cases.
- Simplification: removed the Spring dependencies, threw away classes/functionality I did not need, and added nice and easy methods for spawning new crawlers/droids.
- A thread pool for independent parallel execution of droids, limited by the load a single node in a cluster can take (see http://codewut.de/content/load-sensitive-threadpool-queue-java)
- Managing new droids by polling new hosts to be crawled from an external source
- An extended delay framework, so delays can be computed based on the processing/response time of the last page/task
- Proxy support
- Plenty of tweaks to the http-client params to prevent/skip hung sockets and slow responses (like 1200 baud)
- A mechanism for a clean exit (shutdown / SIG hooks): finish all items in the queue and close all writers properly. Alternatively, a quick exit can be triggered to flush the remaining items from the queue.
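To make the visited-URL tree a bit more concrete, here is a minimal sketch of the idea (class and method names are mine for illustration, not the actual droids code): URLs are stored segment by segment in a trie, so URLs sharing a prefix (host, leading path segments) share storage instead of each being kept as a full string.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of a visited-URL trie: each URL segment becomes a node, so
// URLs with a common prefix share nodes and memory.
public class VisitedUrlTrie {
    private static final class Node {
        final Map<String, Node> children = new HashMap<>();
        boolean terminal; // true if a complete URL ends at this node
    }

    private final Node root = new Node();

    // Returns true if the URL was new (and marks it as visited).
    public boolean markVisited(String url) {
        Node current = root;
        for (String segment : url.split("/")) {
            current = current.children.computeIfAbsent(segment, s -> new Node());
        }
        if (current.terminal) {
            return false; // seen before
        }
        current.terminal = true;
        return true;
    }

    public boolean isVisited(String url) {
        Node current = root;
        for (String segment : url.split("/")) {
            current = current.children.get(segment);
            if (current == null) {
                return false;
            }
        }
        return current.terminal;
    }
}
```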
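The extended delay framework boils down to something like the following (the factor and bounds are made-up illustration values, not droids defaults): a host that responds slowly earns a proportionally longer pause before the next request.

```java
// Sketch of a delay computed from the last task's response time.
public class AdaptiveDelay {
    private final long minDelayMs;
    private final long maxDelayMs;
    private final double factor;

    public AdaptiveDelay(long minDelayMs, long maxDelayMs, double factor) {
        this.minDelayMs = minDelayMs;
        this.maxDelayMs = maxDelayMs;
        this.factor = factor;
    }

    // With factor 2.0, a page that took 800 ms earns a 1600 ms pause,
    // clamped to [minDelayMs, maxDelayMs].
    public long nextDelay(long lastResponseTimeMs) {
        long delay = (long) (lastResponseTimeMs * factor);
        return Math.max(minDelayMs, Math.min(maxDelayMs, delay));
    }
}
```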
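And the clean-exit mechanism is essentially a JVM shutdown hook along these lines (names are illustrative, not the actual droids API): stop accepting new tasks, drain the queue within a grace period, fall back to a quick exit if it takes too long, then close the writers.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.TimeUnit;

// Sketch of a clean exit via a shutdown hook: finish queued items,
// then flush and close all writers.
public class CleanExit {
    public static void install(ExecutorService pool, Runnable closeWriters,
                               long graceSeconds) {
        Runtime.getRuntime().addShutdownHook(new Thread(() -> {
            pool.shutdown(); // finish queued items, accept no new ones
            try {
                if (!pool.awaitTermination(graceSeconds, TimeUnit.SECONDS)) {
                    pool.shutdownNow(); // quick exit: drop remaining items
                }
            } catch (InterruptedException e) {
                pool.shutdownNow();
                Thread.currentThread().interrupt();
            }
            closeWriters.run(); // close all writers properly
        }));
    }
}
```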
- Stuff I have already forgotten :/

Maybe I should also mention where I am going with this beat-up version of droids: I am building a pseudo-distributed crawler farm. Pseudo-distributed because there is no controlling server and no shared task queue. Each node in my cluster runs multiple droids, each one crawling one host. Extracted data is collected from all of the instances per node (not per droid) and fed into HDFS. Each node has a thread pool which polls new crawl specs from a master queue (JDBC in my case, although I am thinking about HBase or Membase).

So yes, I took a huge step away from the idea of implementing a generic droids framework and focused instead on a very specific way to crawl the web. So far I have tried my best to make droids fault tolerant (the internet is b-r-o-k-e-n, you have no idea how bad it is!) and to produce helpful logs.

What's next for me? Probably rewriting the robots.txt client. The crawlers themselves passed a first, smaller crawl of ~20m pages with pleasing results.

Fire away if you have questions.

Regards,
Paul

On Tue, 8 Feb 2011 09:50:06 +0100, Chapuis Bertil <[email protected]> wrote:
> In previous emails and jira comments I saw several people mentioning the
> fact that they have a local copy of droids which evolved too much to be
> merged back with the trunk. This is my case, and I think Paul Rogalinski
> is in the same situation.
>
> Since the patches have only been applied periodically on the trunk during
> the last months, I'd love to know if someone else is in the same situation
> and what kind of changes they made locally.

-- 
Paul Rogalinski - Softwareentwicklung
Jamnitzerstr. 17 - 81543 Munich - Germany
mailto: [email protected]
Phone: +49-179-3574356
msn: [email protected] - aim: pu1s4r - icq: 1177279 - skype: pulsar
