Hey guys, I've focused on a very specific web-crawl task while refactoring my copy of droids:
- Tasks now have a context which can be used to share arbitrary data, and so do HTTP entities (needed to store headers for a longer period of time)
- Various crawl-related metrics/counters exposed via JMX / MBeans (plus a munin config)
- robots.txt / domain caching (there is currently a serious problem with this in trunk! Solutions provided, but no patches... yet :/)
- A bunch of new link filters to clean up a URL and prevent processing the same page over and over again
- A visited-URL tree structure to keep memory consumption down. I also plan to use the same tree as a task queue, which should lower memory consumption significantly.
- Pluggable DNS resolvers
- Plenty of small bugfixes; some of them really minor, others I tried to report along with solutions.
- Handling of redirect codes (via meta and/or header) and exposing that information to the extractors, writers, etc.
- Improved encoding detection/handling
- Added TaskExecutionDeciders; simply filtering the links was not sufficient in some rare cases.
- Simplification: removed the Spring dependencies, threw away classes/functionality I did not need, and added nice and easy methods for spawning new crawlers/droids.
- A thread pool for independent parallel execution of droids, limited by the load a single node in a cluster can take (see http://codewut.de/content/load-sensitive-threadpool-queue-java)
- Managing new droids by polling new hosts to be crawled from an external source
- An extended delay framework, so delays can be computed based on the processing/response time of the last page/task
- Proxy support
- Plenty of tweaks to the http-client params to prevent/skip hung sockets and slow responses (like 1200 baud)
- A mechanism for a clean exit (shutdown / SIG hooks): finish all items in the queue and close all writers properly. Alternatively, a quick exit can be triggered to flush the remaining items from the queue.
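To make the visited-URL tree a bit more concrete, here is a minimal sketch of the idea (class and method names are mine for illustration, not the actual droids code): URLs are stored segment by segment in a trie, so URLs sharing a prefix (host, leading path segments) share storage instead of each being kept as a full string.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of a visited-URL trie: each URL segment becomes a node, so
// URLs with a common prefix share nodes and memory.
public class VisitedUrlTrie {
    private static final class Node {
        final Map<String, Node> children = new HashMap<>();
        boolean terminal; // true if a complete URL ends at this node
    }

    private final Node root = new Node();

    // Returns true if the URL was new (and marks it as visited).
    public boolean markVisited(String url) {
        Node current = root;
        for (String segment : url.split("/")) {
            current = current.children.computeIfAbsent(segment, s -> new Node());
        }
        if (current.terminal) {
            return false; // seen before
        }
        current.terminal = true;
        return true;
    }

    public boolean isVisited(String url) {
        Node current = root;
        for (String segment : url.split("/")) {
            current = current.children.get(segment);
            if (current == null) {
                return false;
            }
        }
        return current.terminal;
    }
}
```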
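The extended delay framework boils down to something like the following (the factor and bounds are made-up illustration values, not droids defaults): a host that responds slowly earns a proportionally longer pause before the next request.

```java
// Sketch of a delay computed from the last task's response time.
public class AdaptiveDelay {
    private final long minDelayMs;
    private final long maxDelayMs;
    private final double factor;

    public AdaptiveDelay(long minDelayMs, long maxDelayMs, double factor) {
        this.minDelayMs = minDelayMs;
        this.maxDelayMs = maxDelayMs;
        this.factor = factor;
    }

    // With factor 2.0, a page that took 800 ms earns a 1600 ms pause,
    // clamped to [minDelayMs, maxDelayMs].
    public long nextDelay(long lastResponseTimeMs) {
        long delay = (long) (lastResponseTimeMs * factor);
        return Math.max(minDelayMs, Math.min(maxDelayMs, delay));
    }
}
```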
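And the clean-exit mechanism is essentially a JVM shutdown hook along these lines (names are illustrative, not the actual droids API): stop accepting new tasks, drain the queue within a grace period, fall back to a quick exit if it takes too long, then close the writers.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.TimeUnit;

// Sketch of a clean exit via a shutdown hook: finish queued items,
// then flush and close all writers.
public class CleanExit {
    public static void install(ExecutorService pool, Runnable closeWriters,
                               long graceSeconds) {
        Runtime.getRuntime().addShutdownHook(new Thread(() -> {
            pool.shutdown(); // finish queued items, accept no new ones
            try {
                if (!pool.awaitTermination(graceSeconds, TimeUnit.SECONDS)) {
                    pool.shutdownNow(); // quick exit: drop remaining items
                }
            } catch (InterruptedException e) {
                pool.shutdownNow();
                Thread.currentThread().interrupt();
            }
            closeWriters.run(); // close all writers properly
        }));
    }
}
```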
- Stuff I have already forgotten :/

Maybe I should also mention where I am going with this beat-up version of droids: I am building a pseudo-distributed crawler farm. Pseudo-distributed because there is no controlling server and no shared task queue. Each node in my cluster runs multiple droids, each one crawling one host. Extracted data is collected from all of the instances per node (not per droid) and fed into HDFS. Each node has a thread pool which polls new crawl specs from a master queue (JDBC in my case, although I am thinking about HBase or Membase).

So yes, I took a huge step away from the idea of implementing a generic droids framework and focused instead on a very specific way to crawl the web. So far I have tried my best to make droids fault tolerant (the internet is b-r-o-k-e-n, you have no idea how bad it is!) and to produce helpful logs.

What's next for me? Probably rewriting the robots.txt client. The crawlers themselves passed a first, smaller crawl of ~20m pages with pleasing results.

Fire away if you have questions.

Regards,
Paul

On Tue, 8 Feb 2011 09:50:06 +0100, Chapuis Bertil <[email protected]> wrote:
> In previous emails and jira comments I saw several people mentioning the
> fact that they have a local copy of droids which evolved too much to be
> merged back with the trunk. This is my case, and I think Paul Rogalinski
> is in the same situation.
>
> Since the patches have only been applied periodically on the trunk during
> the last months, I'd love to know if someone else is in the same situation
> and what kind of changes they made locally.

-- 
Paul Rogalinski - Softwareentwicklung
Jamnitzerstr. 17 - 81543 Munich - Germany
mailto: [email protected]
Phone: +49-179-3574356
msn: [email protected] - aim: pu1s4r - icq: 1177279 - skype: pulsar
