Hi,

Wow, juicy!
I have just 1 question: (when) can you contribute individual pieces of your 
great work? :)

Otis
----
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



----- Original Message ----
> From: paul.vc <[email protected]>
> To: [email protected]
> Sent: Tue, February 8, 2011 6:31:05 AM
> Subject: Re: Local copies of droids
> 
> Hey Guys,
> 
> I've focused on a very specific web-crawl task when refactoring my copy of
> droids:
> 
> - Tasks now have a context which can be used to share arbitrary data
> - So do http entities (needed to store headers for a longer period of
> time)
> - Various crawl-related metrics / counters exposed via JMX / MBeans
> (+munin config)
> - robots.txt / domain caching - currently there is a huge problem in the
> trunk with that! Solutions provided but no patches... yet :/
> - A bunch of new link filters to clean up a URL and prevent processing the
> same page over and over again
> - Visited-URL tree structure to keep the memory consumption down. I also
> plan to use the same tree as a task queue, which should lower the memory
> consumption significantly.
> - Pluggable DNS resolvers
> - Plenty of small bugfixes, some of them really minor. Others I tried to
> report and provide solutions for.
> - Handling of redirect codes (via meta and/or header) and exposing that
> information to the extractors / writers etc.
> - Improved encoding detection / handling
> - Added TaskExecutionDeciders - simply filtering the links was not
> sufficient in some rare cases.
> - Simplification: removed Spring dependencies, threw away classes /
> functionality I don't need, and added nice and easy methods for spawning
> new crawlers / droids.
> - Thread pool for independent parallel execution of droids, limited by the
> load a single node in a cluster can take (see
> http://codewut.de/content/load-sensitive-threadpool-queue-java )
> - Managing new droids by polling new hosts to be crawled from an external
> source
> - Extended the delay framework so delays can be computed based upon the
> processing / response time of the last page/task
> - Proxy support
> - Plenty of tweaks to the http-client params to prevent/skip hung sockets
> and slow responses (like 1200 baud)
> - Mechanism to do a clean exit (shutdown / SIG hooks), finish all items in
> the queue, and close all writers properly. Alternatively, a quick exit can
> be triggered to flush remaining items from the queue.
> - Stuff I have already forgotten :/
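The visited-URL tree is essentially a trie keyed on host and path segments, so shared URL prefixes are stored once instead of once per URL. A minimal sketch of that idea in Java - class and method names here are illustrative, not the actual droids code:

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Illustrative visited-URL tree: instead of keeping every full URL
 * string in a set, shared path prefixes are stored once as trie nodes.
 */
public class VisitedUrlTree {
    private static final class Node {
        final Map<String, Node> children = new HashMap<>();
        boolean visited;
    }

    private final Node root = new Node();

    /** Marks host + path as visited; returns true if it was new. */
    public boolean markVisited(String host, String path) {
        Node node = root.children.computeIfAbsent(host, k -> new Node());
        for (String segment : path.split("/")) {
            if (segment.isEmpty()) continue;
            node = node.children.computeIfAbsent(segment, k -> new Node());
        }
        boolean wasNew = !node.visited;
        node.visited = true;
        return wasNew;
    }

    public boolean isVisited(String host, String path) {
        Node node = root.children.get(host);
        if (node == null) return false;
        for (String segment : path.split("/")) {
            if (segment.isEmpty()) continue;
            node = node.children.get(segment);
            if (node == null) return false;
        }
        return node.visited;
    }
}
```

The same structure doubles as a task queue by walking the tree for nodes not yet marked visited, which is why it can replace a separate URL frontier.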
> 
> Maybe I should also mention where I am going with my beaten-up version of
> droids: I am building a pseudo-distributed crawler farm - pseudo-distributed
> because there is no controlling server and no shared task queue. Each node in
> my cluster runs multiple droids, each one crawling one host. Extracted data
> is collected from all of the instances per node (not per droid) and fed into
> HDFS. Each node has a thread pool which polls new crawl specs from a master
> queue (in my case JDBC - although I am thinking about HBase or Membase).
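The per-node control loop described above can be sketched roughly like this - the CrawlSpecSource interface and all names are my own illustration, not the actual code, and the real pool is additionally load-sensitive as described in the linked article:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/**
 * Illustrative per-node loop: a fixed pool of worker slots, each
 * running one droid per host polled from an external master queue.
 */
public class CrawlerNode {
    /** Source of crawl specs; returns null when there is nothing left. */
    interface CrawlSpecSource {
        String pollNextHost() throws InterruptedException;
    }

    final List<String> crawled = Collections.synchronizedList(new ArrayList<>());
    private final ExecutorService pool;
    private final CrawlSpecSource source;

    public CrawlerNode(int droidsPerNode, CrawlSpecSource source) {
        this.pool = Executors.newFixedThreadPool(droidsPerNode);
        this.source = source;
    }

    public void run() throws InterruptedException {
        String host;
        while ((host = source.pollNextHost()) != null) {
            final String h = host;
            pool.submit(() -> crawlHost(h)); // one droid per host
        }
        pool.shutdown(); // clean exit: let queued droids finish
        pool.awaitTermination(1, TimeUnit.MINUTES);
    }

    private void crawlHost(String host) {
        crawled.add(host); // placeholder for spawning a droid for this host
    }
}
```

With a JDBC-backed CrawlSpecSource this gives the master-less setup: each node pulls work independently, so no coordinating server is needed.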
> 
> So yes, I took a huge step away from the idea of implementing a generic
> droids framework and focused rather on a very specific way to crawl the
> web. So far I have tried my best to make droids fault tolerant (the internet
> is b-r-o-k-e-n, you have no idea how bad it is!) and produce helpful logs.
> 
> What's next for me? Probably rewriting the robots.txt client. The crawlers
> themselves did pass a first smaller crawl of ~20m pages with pleasing
> results.
> 
> Fire away if you have questions.
> 
> Regards,
> Paul.
> 
> 
> On Tue, 8 Feb 2011 09:50:06 +0100, Chapuis Bertil <[email protected]>
> wrote:
> > In previous emails and jira comments I saw several people mentioning the
> > fact that they have a local copy of droids which evolved too much to be
> > merged back with the trunk. This is my case, and I think Paul Rogalinski
> > is in the same situation.
> > 
> > Since the patches have only been applied periodically on the trunk
> > during the last months, I'd love to know if someone else is in the same
> > situation and what kind of changes they made locally.
> 
> -- 
> Paul Rogalinski - Softwareentwicklung - Jamnitzerstr. 17 - 81543 Munich -
> Germany - mailto: [email protected] - Phone: +49-179-3574356 -
> msn: [email protected] - aim: pu1s4r - icq: 1177279 - skype: pulsar
> 
