I'm also using a local branch which I'm now starting to integrate back into
the trunk. I'm mostly creating 0.0.2 issues, in the hope that 0.0.1 will be
released soon(ish). Some of the local branches described in this thread
sound very interesting, and it would be cool to see at least the smaller
bullet points committed back into the trunk, both to reduce the diff
between the trunk and these local branches and to make committing larger
improvements possible.

On Wed, Feb 9, 2011 at 10:16 PM, paul.vc <[email protected]> wrote:

> Hey Otis,
>
> I am starting a "big" crawl (~50m hosts) this or next week. I am sure it
> will bring back some new bugs and issues to solve. Furthermore, there is
> still the robots.txt part to be taken care of. I have been contracted to
> implement that crawler by another company, and I also have permission to
> contribute most of the work back (probably everything but the
> content-extraction part). So I see no real problems in releasing the
> sources after a minor code review.
>
> If you just need some specific pieces now, let's meet on IRC at
> freenode/#droids (I'm not monitoring the channel actively though - ping me
> on ircnet/pulsar) or some IM (see below for my accounts). I'll be able to
> pull stuff out and post it online as needed.
>
> Regards,
> Paul.
>
> msn: [email protected] · aim: pu1s4r · icq: 1177279 · skype: pulsar ·
> yahoo: paulrogalinski · gtalk/XMPP: [email protected]
>
>
> On Wed, 9 Feb 2011 11:32:18 -0800 (PST), Otis Gospodnetic
> <[email protected]> wrote:
> > Hi,
> >
> > Wow, juicy!
> > I have just 1 question: (when) can you contribute individual pieces of
> > your
> > great work? :)
> >
> > Otis
> > ----
> > Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
> > Lucene ecosystem search :: http://search-lucene.com/
> >
> >
> >
> > ----- Original Message ----
> >> From: paul.vc <[email protected]>
> >> To: [email protected]
> >> Sent: Tue, February 8, 2011 6:31:05 AM
> >> Subject: Re: Local copies of droids
> >>
> >> Hey Guys,
> >>
> >> I've focused on a very specific web-crawl task when refactoring my
> >> copy of droids:
> >>
> >> - Tasks now have a context which can be used to share arbitrary data
> >> - so do http entities (needed to store headers for a longer period of
> >> time)
> >> - Various crawl-related metrics / counters exposed via JMX / MBeans
> >> (+munin config)
> >> - robots.txt / domain caching - currently there is a huge problem in
> >> the trunk with that! Solutions provided but no patches... yet :/
> >> - A bunch of new link filters to clean up a URL and prevent processing
> >> the same page over and over again
> >> - Visited-url tree structure to keep the memory consumption down. I
> >> also plan to use the same tree as a task queue, which should lower the
> >> memory consumption significantly.
> >> - Pluggable DNS resolvers
> >> - Plenty of small bugfixes, some of them really minor. Others I tried
> >> to report and provide solutions for.
> >> - Handling of redirect codes (via meta and/or header) and exposing
> >> that information to the extractors / writers etc.
> >> - Improved encoding detection / handling
> >> - Added TaskExecutionDeciders - simply filtering the links was not
> >> sufficient in some rare cases.
> >> - Simplification: removed Spring dependencies, threw away classes /
> >> functionality I don't need, and added nice and easy methods for
> >> spawning new crawlers / droids.
> >> - Thread pool for independent parallel execution of droids, limited by
> >> the load a single node in a cluster can take (see
> >> http://codewut.de/content/load-sensitive-threadpool-queue-java )
> >> - Managing new droids by polling new hosts to be crawled from an
> >> external source
> >> - Extended the delay framework so delays can be computed based upon
> >> the processing / response time of the last page/task
> >> - Proxy support
> >> - Plenty of tweaks to the http-client params to prevent/skip hung
> >> sockets and slow responses (like 1200 baud)
> >> - Mechanism to do a clean exit (shutdown / SIG hooks): finish all
> >> items in the queue and close all writers properly. Alternatively, a
> >> quick exit can be triggered to flush the remaining items from the
> >> queue.
> >> - Stuff I have already forgotten :/
> >>
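[Editor's note: the load-sensitive thread pool in Paul's list might be sketched
roughly as below - a minimal, hypothetical Java version of the idea behind the
linked codewut.de article, namely a work queue that rejects new tasks while the
node's load average is too high. The class and parameter names here are
illustrative, not from the droids code base.]

```java
import java.lang.management.ManagementFactory;
import java.util.concurrent.LinkedBlockingQueue;

// A work queue that refuses new tasks while the OS load average exceeds a
// cap, so a ThreadPoolExecutor backs off instead of piling work onto an
// already-busy node.
public class LoadSensitiveQueue<E> extends LinkedBlockingQueue<E> {
    private final double maxLoad;

    public LoadSensitiveQueue(int capacity, double maxLoad) {
        super(capacity);
        this.maxLoad = maxLoad;
    }

    @Override
    public boolean offer(E e) {
        double load = ManagementFactory.getOperatingSystemMXBean()
                                       .getSystemLoadAverage();
        // getSystemLoadAverage() returns a negative value when the load
        // average is unavailable (e.g. on Windows); accept the task then.
        if (load >= 0 && load > maxLoad) {
            return false; // the executor's RejectedExecutionHandler fires
        }
        return super.offer(e);
    }
}
```

[Plugged into a `ThreadPoolExecutor`, a rejected `offer` triggers the
executor's `RejectedExecutionHandler`, which can park or retry the task.]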
> >> Maybe I should also mention where I am going with my beat-up version
> >> of droids: I am building a pseudo-distributed crawler farm -
> >> pseudo-distributed because there is no controlling server and no
> >> shared task queue. Each node in my cluster runs multiple droids, each
> >> one crawling one host. Extracted data is collected from all of the
> >> instances per node (not per droid) and fed into HDFS. Each node has a
> >> thread pool which polls new crawl specs from a master queue (in my
> >> case JDBC - although I am thinking about HBase or Membase).
> >>
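[Editor's note: the per-node loop Paul describes - poll a crawl spec from a
master queue and hand each host to its own droid thread - might look roughly
like the sketch below. The master queue is abstracted here as an in-memory
`BlockingQueue`; in the setup described it would be backed by JDBC, HBase, or
Membase. All names are hypothetical, not from the droids code base.]

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

// Polls host names ("crawl specs") from a master queue and runs each one
// on its own droid thread, up to droidThreads hosts in parallel per node.
public class CrawlSpecPoller {
    private final BlockingQueue<String> masterQueue;
    private final ExecutorService droids;
    private final Consumer<String> crawlOneHost;
    private volatile boolean running = true;

    public CrawlSpecPoller(BlockingQueue<String> masterQueue,
                           int droidThreads, Consumer<String> crawlOneHost) {
        this.masterQueue = masterQueue;
        this.droids = Executors.newFixedThreadPool(droidThreads);
        this.crawlOneHost = crawlOneHost;
    }

    // Poll specs until shut down; each polled host becomes one droid task.
    public void run() throws InterruptedException {
        while (running) {
            String host = masterQueue.poll(1, TimeUnit.SECONDS);
            if (host == null) continue; // timed out; re-check running flag
            droids.submit(() -> crawlOneHost.accept(host));
        }
    }

    // In this sketch, call shutdown only once the master queue is drained.
    public void shutdown() throws InterruptedException {
        running = false;
        droids.shutdown();
        droids.awaitTermination(10, TimeUnit.SECONDS);
    }
}
```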
> >> So yes, I took a huge step away from the idea of implementing a
> >> generic droids framework and focused rather on a very specific way to
> >> crawl the web. Right now I have tried my best to make droids fault
> >> tolerant (the internet is b-r-o-k-e-n, you have no idea how bad it
> >> is!) and produce helpful logs.
> >>
> >> What's next for me? Probably rewriting the robots.txt client. The
> >> crawlers themselves did pass a first smaller crawl of ~20m pages with
> >> pleasing results.
> >>
> >> Fire away if you have questions.
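[Editor's note: the clean-exit mechanism in Paul's list - a shutdown hook,
triggered on SIGTERM/SIGINT, that finishes the queued items and closes the
writers before the process dies - could be sketched as below. The names are
hypothetical, not the actual droids code.]

```java
import java.util.concurrent.BlockingQueue;

// Installs a JVM shutdown hook that drains the remaining task queue and
// then closes the writers, so a SIGTERM results in a clean exit.
public class CleanExit {
    /** Returns the hook so it can also be invoked directly in tests. */
    public static Thread install(BlockingQueue<Runnable> queue,
                                 Runnable closeWriters) {
        Thread hook = new Thread(() -> {
            Runnable task;
            while ((task = queue.poll()) != null) {
                task.run(); // finish all items still in the queue
            }
            closeWriters.run(); // flush and close all writers properly
        });
        Runtime.getRuntime().addShutdownHook(hook);
        return hook;
    }
}
```

[A "quick exit" variant would simply skip the `task.run()` loop and discard
the queue before closing the writers.]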
> >>
> >> Regards,
> >> Paul.
> >>
> >>
> >> On Tue, 8 Feb 2011 09:50:06 +0100, Chapuis Bertil
> >> <[email protected]> wrote:
> >> > In previous emails and jira comments I saw several people mentioning
> >> > the fact that they have a local copy of droids which has evolved too
> >> > much to be merged back with the trunk. This is my case, and I think
> >> > Paul Rogalinski is in the same situation.
> >> >
> >> > Since the patches have only been applied periodically on the trunk
> >> > during the last months, I'd love to know if someone else is in the
> >> > same situation and what kind of changes they made locally.
> >>
> >> --
> >> Paul Rogalinski - Softwareentwicklung - Jamnitzerstr. 17 - 81543
> >> Munich - Germany - mailto: [email protected] - Phone:
> >> +49-179-3574356 - msn: [email protected] - aim: pu1s4r - icq: 1177279 -
> >> skype: pulsar
> >>
>
> --
> Paul Rogalinski - Softwareentwicklung - Jamnitzerstr. 17 - 81543 Munich -
> Germany - mailto: [email protected] - Phone: +49-179-3574356 -
> msn: [email protected] - aim: pu1s4r - icq: 1177279 - skype: pulsar
>
