Hi, Wow, juicy! I have just 1 question: (when) can you contribute individual pieces of your great work? :)
Otis
----
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/

----- Original Message ----
> From: paul.vc <[email protected]>
> To: [email protected]
> Sent: Tue, February 8, 2011 6:31:05 AM
> Subject: Re: Local copies of droids
>
> Hey Guys,
>
> I've focused on a very specific web-crawl task when refactoring my copy of
> droids:
>
> - Tasks now have a context which can be used to share arbitrary data
> - ... and so do HTTP entities (needed to store headers for a longer period
>   of time)
> - Various crawl-related metrics/counters exposed via JMX / MBeans
>   (+ munin config)
> - robots.txt / domain caching - currently there is a huge problem in the
>   trunk with that! Solutions provided, but no patches... yet :/
> - A bunch of new link filters to clean up a URL and prevent processing the
>   same page over and over again
> - Visited-URL tree structure to keep the memory consumption down. I also
>   plan to use the same tree as a task queue, which should lower the memory
>   consumption significantly.
> - Pluggable DNS resolvers
> - Plenty of small bug fixes, some of them really minor. Others I tried to
>   report and provide solutions for.
> - Handling of redirect codes (via meta and/or header) and exposing that
>   information to the extractors / writers etc.
> - Improved encoding detection / handling
> - Added TaskExecutionDeciders - simply filtering the links was not
>   sufficient in some rare cases.
> - Simplification: removed Spring dependencies, threw away classes and
>   functionality I didn't need, added nice and easy methods for spawning
>   new crawlers / droids.
> - Thread pool for independent parallel execution of droids, limited by the
>   load a single node in a cluster can take (see
>   http://codewut.de/content/load-sensitive-threadpool-queue-java )
> - Managing new droids by polling new hosts to be crawled from an external
>   source
> - Extended the delay framework so delays can be computed based upon the
>   processing/response time of the last page/task
> - Proxy support
> - Plenty of tweaks to the http-client params to prevent/skip hung sockets
>   and slow responses (like 1200 baud)
> - Mechanism to do a clean exit (shutdown / SIG hooks): finish all items in
>   the queue, close all writers properly. Alternatively, a quick exit can be
>   triggered to flush remaining items from the queue.
> - Stuff I have already forgotten :/
>
> Maybe I should also mention where I am going with this beaten-up version of
> droids: I am building a pseudo-distributed crawler farm - pseudo-distributed
> because there is no controlling server or shared task queue. Each node in
> my cluster runs multiple droids, each one crawling one host. Extracted data
> is collected from all of the instances per node (not per droid) and fed into
> HDFS. Each node has a thread pool which polls new crawl specs from a master
> queue (in my case JDBC, although I am thinking about HBase or Membase).
>
> So yes, I took a huge step away from the idea of implementing a generic
> droids framework and focused instead on a very specific way to crawl the
> web. Right now I have tried my best to make droids fault tolerant (the
> internet is b-r-o-k-e-n, you have no idea how bad it is!) and produce
> helpful logs.
>
> What's next for me? Probably rewriting the robots.txt client. The crawler
> itself passed a first smaller crawl of ~20m pages with pleasing results.
>
> Fire away if you have questions.
>
> Regards,
> Paul.
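Paul's visited-URL tree itself isn't included in the mail. One way such a structure can cut memory compared to a flat `HashSet<String>` of full URLs is a trie keyed on URL segments, so shared prefixes like the scheme and host are stored only once. A minimal sketch of that idea - all class and method names here are assumptions, not Paul's actual code:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of a visited-URL trie (illustrative only, not the droids code):
// each URL segment becomes a node, so "http://example.com/a/b" and
// "http://example.com/a/c" share the "http:" / "example.com" / "a" prefix.
class VisitedUrlTree {
    private static final class Node {
        final Map<String, Node> children = new HashMap<>();
        boolean visited; // true if a URL ends exactly at this node
    }

    private final Node root = new Node();

    /** Marks the URL as visited; returns true if it had not been seen before. */
    boolean add(String url) {
        Node node = root;
        for (String segment : url.split("/")) {
            node = node.children.computeIfAbsent(segment, s -> new Node());
        }
        boolean isNew = !node.visited;
        node.visited = true;
        return isNew;
    }

    /** Returns true only if this exact URL was previously added. */
    boolean contains(String url) {
        Node node = root;
        for (String segment : url.split("/")) {
            node = node.children.get(segment);
            if (node == null) {
                return false;
            }
        }
        return node.visited;
    }
}
```

The same trie could double as the task queue Paul mentions, since unvisited leaves are exactly the pending URLs.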
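The extended delay framework - delays computed from the processing/response time of the last page - can be sketched as a pause proportional to the last observed response time, clamped to a sane range, so slow hosts are hit less often. The formula and names below are assumptions for illustration, not the actual implementation:

```java
// Sketch of a response-time-driven crawl delay (illustrative only):
// a host that answered slowly earns a proportionally longer pause,
// bounded between a politeness floor and a throughput ceiling.
class AdaptiveDelay {
    private final long minDelayMs;
    private final long maxDelayMs;
    private final double factor;

    AdaptiveDelay(long minDelayMs, long maxDelayMs, double factor) {
        this.minDelayMs = minDelayMs;
        this.maxDelayMs = maxDelayMs;
        this.factor = factor;
    }

    /** Delay before the next request, derived from the last response time. */
    long nextDelay(long lastResponseTimeMs) {
        long delay = (long) (lastResponseTimeMs * factor);
        return Math.max(minDelayMs, Math.min(maxDelayMs, delay));
    }
}
```

With `factor = 2.0`, a page that took 3 seconds to fetch would delay the next request on that host by 6 seconds (capped at the configured maximum).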
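The clean-exit mechanism can be sketched as a JVM shutdown hook that either finishes every queued item or discards the backlog before the writers are closed, so a SIGTERM does not leave half-written output. Structure and names below are assumptions, not the droids code:

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Sketch of a clean/quick exit (illustrative only): a shutdown hook drains
// the task queue, then closes the writers. Setting quickExit skips the
// remaining work and just flushes the queue.
class CleanExit {
    private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();
    private volatile boolean quickExit = false;
    private int processed = 0;

    void submit(String task)    { queue.add(task); }
    void requestQuickExit()     { quickExit = true; }
    int processedCount()        { return processed; }
    private void process(String task) { processed++; } // stand-in for real work
    private void closeWriters() { /* flush and close output writers here */ }

    /** Drains the queue (or drops it, on quick exit) and closes writers. */
    void drainAndClose() {
        if (quickExit) {
            queue.clear(); // flush remaining items without processing them
        } else {
            for (String task; (task = queue.poll()) != null; ) {
                process(task); // finish everything still queued
            }
        }
        closeWriters();
    }

    /** Registers drainAndClose to run on JVM shutdown (SIGTERM, System.exit). */
    void installHook() {
        Runtime.getRuntime().addShutdownHook(new Thread(this::drainAndClose));
    }
}
```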
>
> On Tue, 8 Feb 2011 09:50:06 +0100, Chapuis Bertil <[email protected]>
> wrote:
> > In previous emails and jira comments I saw several people mentioning the
> > fact they have a local copy of droids which evolved too much to be merged
> > back with the trunk. This is my case, and I think Paul Rogalinski is in
> > the same situation.
> >
> > Since the patches have only been applied periodically on the trunk during
> > the last months, I'd love to know if someone else is in the same situation
> > and what kind of changes they made locally.
>
> --
> Paul Rogalinski - Softwareentwicklung - Jamnitzerstr. 17 - 81543 Munich -
> Germany - mailto: [email protected] - Phone: +49-179-3574356 -
> msn: [email protected] - aim: pu1s4r - icq: 1177279 - skype: pulsar
