Hey Otis, I am starting a "big" crawl (~50m hosts) this week or next. I am sure it will surface some new bugs and issues to solve. Furthermore, there is still the robots.txt part to be taken care of. I have been contracted by another company to implement that crawler, and I also have permission to contribute most of the work back (probably everything but the content-extraction part). So I see no real problem in releasing the sources after a minor code review.
If you just need some specific pieces now, let's meet on IRC at freenode/#droids (I am not monitoring the channel actively though - ping me on ircnet/pulsar) or some IM (see below for my accounts). I'll be able to pull stuff out and post it online as needed.

Regards,
Paul.

msn: [email protected] · aim: pu1s4r · icq: 1177279 · skype: pulsar · yahoo: paulrogalinski · gtalk/XMPP: [email protected]

On Wed, 9 Feb 2011 11:32:18 -0800 (PST), Otis Gospodnetic <[email protected]> wrote:
> Hi,
>
> Wow, juicy!
> I have just one question: (when) can you contribute individual pieces of your great work? :)
>
> Otis
> ----
> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
> Lucene ecosystem search :: http://search-lucene.com/
>
> ----- Original Message ----
>> From: paul.vc <[email protected]>
>> To: [email protected]
>> Sent: Tue, February 8, 2011 6:31:05 AM
>> Subject: Re: Local copies of droids
>>
>> Hey Guys,
>>
>> I focused on a very specific web-crawl task when refactoring my copy of droids:
>>
>> - Tasks now have a context which can be used to share arbitrary data
>> - So do HTTP entities (needed to store headers for a longer period of time)
>> - Various crawl-related metrics/counters exposed via JMX/MBeans (+munin config)
>> - robots.txt / domain caching - currently there is a huge problem in the trunk with that! Solutions provided, but no patches... yet :/
>> - A bunch of new link filters to clean up a URL and prevent processing the same page over and over again
>> - A visited-URL tree structure to keep memory consumption down. I also plan to use the same tree as a task queue, which should lower memory consumption significantly.
>> - Pluggable DNS resolvers
>> - Plenty of small bug fixes, some of them really minor. Others I tried to report and provide solutions for.
>> - Handling of redirect codes (via meta and/or header) and exposing that information to the extractors / writers etc.
>> - Improved encoding detection / handling
>> - Added TaskExecutionDeciders - simply filtering the links was not sufficient in some rare cases.
>> - Simplification: removed Spring dependencies, threw away classes/functionality I did not need, added nice and easy methods for spawning new crawlers/droids.
>> - A thread pool for independent parallel execution of droids, limited by the load a single node in a cluster can take (see http://codewut.de/content/load-sensitive-threadpool-queue-java)
>> - Managing new droids by polling new hosts to be crawled from an external source
>> - An extended delay framework, so delays can be computed based upon the processing/response time of the last page/task
>> - Proxy support
>> - Plenty of tweaks to the http-client params to prevent/skip hung sockets and slow responses (like 1200 baud)
>> - A mechanism to do a clean exit (shutdown / SIG hooks): finish all items in the queue and close all writers properly. Alternatively, a quick exit can be triggered to flush remaining items from the queue.
>> - Stuff I have already forgotten :/
>>
>> Maybe I should also mention where I am going with my beaten-up version of droids: I am building a pseudo-distributed crawler farm - pseudo-distributed because there is no controlling server and no shared task queue. Each node in my cluster runs multiple droids, each one crawling one host. Extracted data is collected from all of the instances per node (not per droid) and fed into HDFS. Each node has a thread pool which polls new crawl specs from a master queue (in my case JDBC - although I am thinking about HBase or Membase).
>>
>> So yes, I took a huge step away from the idea of implementing a generic droids framework and focused rather on a very specific way to crawl the web. Right now I have tried my best to make droids fault tolerant (the internet is b-r-o-k-e-n, you have no idea how bad it is!) and produce helpful logs.
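[Editor's note: the "clean exit" item in the list above could be sketched in Java roughly as below. This is a minimal illustration of the technique - a JVM shutdown hook that drains or flushes a task queue before closing writers - and the class and method names are invented here, not actual Droids APIs.]

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of a clean/quick exit: a shutdown hook either finishes all queued
// tasks (clean exit) or flushes the queue (quick exit), then closes writers.
public class CrawlerShutdown {
    final BlockingQueue<String> taskQueue = new LinkedBlockingQueue<>();
    final AtomicBoolean quickExit = new AtomicBoolean(false);
    final AtomicInteger processed = new AtomicInteger();
    volatile boolean writersClosed = false;

    public void installHook() {
        // Runs on SIGTERM/SIGINT as well as on a normal System.exit().
        Runtime.getRuntime().addShutdownHook(new Thread(this::drainAndClose));
    }

    void drainAndClose() {
        if (quickExit.get()) {
            taskQueue.clear();                 // quick exit: drop remaining items
        } else {
            String task;
            while ((task = taskQueue.poll()) != null) {
                process(task);                 // clean exit: finish queued items
            }
        }
        closeWriters();
    }

    // Stand-ins for the real crawl/output logic.
    void process(String url) { processed.incrementAndGet(); }
    void closeWriters()      { writersClosed = true; }
}
```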
>>
>> What's next for me? Probably rewriting the robots.txt client. The crawler itself passed a first, smaller crawl of ~20m pages with pleasing results.
>>
>> Fire away if you have questions.
>>
>> Regards,
>> Paul.
>>
>> On Tue, 8 Feb 2011 09:50:06 +0100, Chapuis Bertil <[email protected]> wrote:
>> > In previous emails and JIRA comments I saw several people mentioning the fact that they have a local copy of droids which has evolved too much to be merged back with the trunk. This is my case, and I think Paul Rogalinski is in the same situation.
>> >
>> > Since the patches have only been applied periodically to the trunk during the last months, I'd love to know if someone else is in the same situation and what kind of changes they made locally.
>>
>> --
>> Paul Rogalinski - Softwareentwicklung - Jamnitzerstr. 17 - 81543 Munich - Germany - mailto: [email protected] - Phone: +49-179-3574356 - msn: [email protected] - aim: pu1s4r - icq: 1177279 - skype: pulsar
