Hey Otis, I am starting a "big" crawl (~50m hosts) this week or next. I am sure it will surface some new bugs and issues to solve. Furthermore, there is still the robots.txt part to be taken care of. I have been contracted by another company to implement that crawler, and I also have permission to contribute most of the work back (probably everything but the content-extraction part). So I see no real problem in releasing the sources after a minor code review.
If you just need some specific pieces now, let's meet on IRC at freenode/#droids (I am not monitoring the channel actively though - ping me on ircnet/pulsar) or some IM (see below for my accounts). I'll be able to pull stuff out and post it online as needed.

Regards,
Paul.

msn: [email protected] · aim: pu1s4r · icq: 1177279 · skype: pulsar · yahoo: paulrogalinski · gtalk/XMPP: [email protected]

On Wed, 9 Feb 2011 11:32:18 -0800 (PST), Otis Gospodnetic <[email protected]> wrote:
> Hi,
>
> Wow, juicy!
> I have just one question: (when) can you contribute individual pieces of your great work? :)
>
> Otis
> ----
> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
> Lucene ecosystem search :: http://search-lucene.com/
>
> ----- Original Message ----
>> From: paul.vc <[email protected]>
>> To: [email protected]
>> Sent: Tue, February 8, 2011 6:31:05 AM
>> Subject: Re: Local copies of droids
>>
>> Hey Guys,
>>
>> I focused on a very specific web-crawl task when refactoring my copy of droids:
>>
>> - Tasks now have a context which can be used to share arbitrary data
>> - So do HTTP entities (needed to store headers for a longer period of time)
>> - Various crawl-related metrics/counters exposed via JMX/MBeans (+munin config)
>> - robots.txt / domain caching - currently there is a huge problem in the trunk with that! Solutions provided, but no patches... yet :/
>> - A bunch of new link filters to clean up a URL and prevent processing the same page over and over again
>> - A visited-URL tree structure to keep memory consumption down. I also plan to use the same tree as a task queue, which should lower memory consumption significantly.
>> - Pluggable DNS resolvers
>> - Plenty of small bug fixes, some of them really minor. Others I tried to report and provide solutions for.
>> - Handling of redirect codes (via meta and/or header) and exposing that information to the extractors / writers etc.
>> - Improved encoding detection / handling
>> - Added TaskExecutionDeciders - simply filtering the links was not sufficient in some rare cases.
>> - Simplification: removed Spring dependencies, threw away classes/functionality I did not need, added nice and easy methods for spawning new crawlers/droids.
>> - A thread pool for independent parallel execution of droids, limited by the load a single node in a cluster can take (see http://codewut.de/content/load-sensitive-threadpool-queue-java)
>> - Managing new droids by polling new hosts to be crawled from an external source
>> - An extended delay framework, so delays can be computed based upon the processing/response time of the last page/task
>> - Proxy support
>> - Plenty of tweaks to the http-client params to prevent/skip hung sockets and slow responses (like 1200 baud)
>> - A mechanism to do a clean exit (shutdown / SIG hooks): finish all items in the queue and close all writers properly. Alternatively, a quick exit can be triggered to flush remaining items from the queue.
>> - Stuff I have already forgotten :/
>>
>> Maybe I should also mention where I am going with my beaten-up version of droids: I am building a pseudo-distributed crawler farm - pseudo-distributed because there is no controlling server and no shared task queue. Each node in my cluster runs multiple droids, each one crawling one host. Extracted data is collected from all of the instances per node (not per droid) and fed into HDFS. Each node has a thread pool which polls new crawl specs from a master queue (in my case JDBC - although I am thinking about HBase or Membase).
>>
>> So yes, I took a huge step away from the idea of implementing a generic droids framework and focused rather on a very specific way to crawl the web. Right now I have tried my best to make droids fault tolerant (the internet is b-r-o-k-e-n, you have no idea how bad it is!) and produce helpful logs.
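[Editor's note: the "clean exit" item in the list above could be sketched in Java roughly as below. This is a minimal illustration of the technique - a JVM shutdown hook that drains or flushes a task queue before closing writers - and the class and method names are invented here, not actual Droids APIs.]

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of a clean/quick exit: a shutdown hook either finishes all queued
// tasks (clean exit) or flushes the queue (quick exit), then closes writers.
public class CrawlerShutdown {
    final BlockingQueue<String> taskQueue = new LinkedBlockingQueue<>();
    final AtomicBoolean quickExit = new AtomicBoolean(false);
    final AtomicInteger processed = new AtomicInteger();
    volatile boolean writersClosed = false;

    public void installHook() {
        // Runs on SIGTERM/SIGINT as well as on a normal System.exit().
        Runtime.getRuntime().addShutdownHook(new Thread(this::drainAndClose));
    }

    void drainAndClose() {
        if (quickExit.get()) {
            taskQueue.clear();                 // quick exit: drop remaining items
        } else {
            String task;
            while ((task = taskQueue.poll()) != null) {
                process(task);                 // clean exit: finish queued items
            }
        }
        closeWriters();
    }

    // Stand-ins for the real crawl/output logic.
    void process(String url) { processed.incrementAndGet(); }
    void closeWriters()      { writersClosed = true; }
}
```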
>>
>> What's next for me? Probably rewriting the robots.txt client. The crawler itself passed a first, smaller crawl of ~20m pages with pleasing results.
>>
>> Fire away if you have questions.
>>
>> Regards,
>> Paul.
>>
>> On Tue, 8 Feb 2011 09:50:06 +0100, Chapuis Bertil <[email protected]> wrote:
>> > In previous emails and JIRA comments I saw several people mentioning the fact that they have a local copy of droids which has evolved too much to be merged back with the trunk. This is my case, and I think Paul Rogalinski is in the same situation.
>> >
>> > Since the patches have only been applied periodically to the trunk during the last months, I'd love to know if someone else is in the same situation and what kind of changes they made locally.
>>
>> --
>> Paul Rogalinski - Softwareentwicklung - Jamnitzerstr. 17 - 81543 Munich - Germany - mailto: [email protected] - Phone: +49-179-3574356 - msn: [email protected] - aim: pu1s4r - icq: 1177279 - skype: pulsar
