I'm also using a local branch which I'm now starting to integrate back into the trunk. I'm mostly creating 0.0.2 issues, in the hope that 0.0.1 will be released soon(ish). Some of the local branches described in this thread sound very interesting, and it would be cool to see at least the smaller bullet points committed back into the trunk, both to reduce the diff between the trunk and these local branches and to make committing larger improvements possible.
On Wed, Feb 9, 2011 at 10:16 PM, paul.vc <[email protected]> wrote:
> Hey Otis,
>
> I am starting a "big" crawl (~50m hosts) this week or next. I am sure it
> will bring back some new bugs and issues to solve. Furthermore, there is
> still the robots.txt part to be taken care of. I have been contracted to
> implement that crawler by another company, and I also have permission to
> contribute most of the work back (probably all but the content-extraction
> part). So I see no real problem in releasing the sources after a minor
> code review.
>
> If you just need some specific pieces now, let's meet on IRC
> freenode/#droids (not monitoring the channel actively though - ping me on
> ircnet/pulsar) or some IM (see below for my accounts). I'll be able to
> pull stuff out and post it online as needed.
>
> Regards,
> Paul.
>
> msn: [email protected] · aim: pu1s4r · icq: 1177279 · skype: pulsar ·
> yahoo: paulrogalinski · gtalk/XMPP: [email protected]
>
>
> On Wed, 9 Feb 2011 11:32:18 -0800 (PST), Otis Gospodnetic
> <[email protected]> wrote:
> > Hi,
> >
> > Wow, juicy!
> > I have just 1 question: (when) can you contribute individual pieces of
> > your great work?
> > :)
> >
> > Otis
> > ----
> > Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
> > Lucene ecosystem search :: http://search-lucene.com/
> >
> > ----- Original Message ----
> >> From: paul.vc <[email protected]>
> >> To: [email protected]
> >> Sent: Tue, February 8, 2011 6:31:05 AM
> >> Subject: Re: Local copies of droids
> >>
> >> Hey Guys,
> >>
> >> I've focused on a very specific web-crawl task when refactoring my
> >> copy of droids:
> >>
> >> - Tasks now have a context which can be used to share arbitrary data
> >> - So do http entities (needed to store headers for a longer period
> >>   of time)
> >> - Various crawl-related metrics / counters exposed via JMX / MBeans
> >>   (+munin config)
> >> - robots.txt / domain caching - currently there is a huge problem in
> >>   the trunk with that! Solutions provided but no patches... yet :/
> >> - A bunch of new link filters to clean up a URL and prevent
> >>   processing the same page over and over again
> >> - Visited-URL tree structure to keep the memory consumption down. I
> >>   also plan to use the same tree as a task queue, which should lower
> >>   memory consumption significantly.
> >> - Pluggable DNS resolvers
> >> - Plenty of small bugfixes, some of them really minor. Others I
> >>   tried to report and provide solutions for.
> >> - Handling of redirect codes (via meta and/or header) and exposing
> >>   that information to the extractors / writers etc.
> >> - Improved encoding detection / handling
> >> - Added TaskExecutionDeciders - simply filtering the links was not
> >>   sufficient in some rare cases.
> >> - Simplification: removed Spring dependencies, threw away classes /
> >>   functionality not needed by me, nice and easy methods for spawning
> >>   new crawlers / droids.
> >> - Thread pool for independent parallel execution of droids, limited
> >>   by the load a single node in a cluster can take (see
> >>   http://codewut.de/content/load-sensitive-threadpool-queue-java )
> >> - Managing new droids by polling new hosts to be crawled from an
> >>   external source
> >> - Extended delay framework so delays can be computed based upon the
> >>   processing / response time of the last page/task
> >> - Proxy support
> >> - Plenty of tweaks to the http-client params to prevent/skip hung
> >>   sockets and slow responses (like 1200 baud)
> >> - Mechanism to do a clean exit (shutdown / SIG hooks), finish all
> >>   items in the queue, close all writers properly. Alternatively a
> >>   quick exit can be triggered to flush remaining items from the
> >>   queue.
> >> - Stuff I have already forgotten :/
> >>
> >> Maybe I should also mention where I am going with this beat-up
> >> version of droids: I am building a pseudo-distributed crawler farm -
> >> pseudo-distributed because there is no controlling server and no
> >> shared task queue. Each node in my cluster runs multiple droids,
> >> each one crawling one host. Extracted data is collected from all of
> >> the instances per node (not per droid) and fed into HDFS. Each node
> >> has a thread pool which polls new crawl specs from a master queue
> >> (in my case JDBC - although I am thinking about HBase or Membase).
> >>
> >> So yes, I took a huge step away from the idea of implementing a
> >> generic droids framework and focused instead on a very specific way
> >> to crawl the web. For now I have tried my best to make droids fault
> >> tolerant (the internet is b-r-o-k-e-n, you have no idea how bad it
> >> is!) and produce helpful logs.
> >>
> >> What's next for me? Probably rewriting the robots.txt client. The
> >> crawlers themselves did pass a first smaller crawl with ~20m pages
> >> with pleasing results.
> >>
> >> Fire away if you have questions.
> >>
> >> Regards,
> >> Paul.
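[Editor's note: the load-sensitive thread pool Paul links to might look roughly like the sketch below - a producer that only hands new tasks to the pool while the node's 1-minute load average stays under a ceiling. All names here (LoadSensitiveExecutor, submitWhenIdle, the 250 ms poll interval) are illustrative assumptions, not the Droids API or the linked implementation.]

```java
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Sketch: gate task submission on the node's system load average. */
public class LoadSensitiveExecutor {
    private final ExecutorService pool;
    private final double maxLoad;
    private final OperatingSystemMXBean os =
            ManagementFactory.getOperatingSystemMXBean();

    public LoadSensitiveExecutor(int threads, double maxLoad) {
        this.pool = Executors.newFixedThreadPool(threads);
        this.maxLoad = maxLoad;
    }

    /** Block the producer until the node has spare capacity, then submit.
     *  getSystemLoadAverage() returns -1 on platforms where it is
     *  unsupported, which this treats as "idle". */
    public Future<?> submitWhenIdle(Runnable task) throws InterruptedException {
        while (os.getSystemLoadAverage() > maxLoad) {
            Thread.sleep(250);
        }
        return pool.submit(task);
    }

    public void shutdown() {
        pool.shutdown();
    }
}
```

The effect is back-pressure at the producer rather than an unbounded queue: a node that is already saturated simply stops accepting new crawl tasks until load drops.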
> >>
> >>
> >> On Tue, 8 Feb 2011 09:50:06 +0100, Chapuis Bertil
> >> <[email protected]> wrote:
> >> > In previous emails and jira comments I saw several people
> >> > mentioning the fact that they have a local copy of droids which
> >> > has evolved too much to be merged back with the trunk. This is my
> >> > case, and I think Paul Rogalinski is in the same situation.
> >> >
> >> > Since the patches have only been applied periodically to the trunk
> >> > during the last months, I'd love to know if someone else is in the
> >> > same situation and what kind of changes they made locally.
> >>
> >> --
> >> Paul Rogalinski - Softwareentwicklung - Jamnitzerstr. 17 - 81543
> >> Munich - Germany - mailto: [email protected] - Phone:
> >> +49-179-3574356 - msn: [email protected] - aim: pu1s4r - icq:
> >> 1177279 - skype: pulsar
>
> --
> Paul Rogalinski - Softwareentwicklung - Jamnitzerstr. 17 - 81543 Munich -
> Germany - mailto: [email protected] - Phone: +49-179-3574356 -
> msn: [email protected] - aim: pu1s4r - icq: 1177279 - skype: pulsar
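[Editor's note: Paul's visited-URL tree idea - sharing path prefixes instead of storing every URL as a full string - can be sketched as a small trie keyed on path segments. The class and method names below are illustrative, not actual Droids code.]

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch of a visited-URL trie: "/a/b" and "/a/c" share one "a" node,
 *  so memory grows with the number of distinct segments, not with the
 *  total length of all visited URLs. */
public class VisitedUrlTrie {
    private static final class Node {
        final Map<String, Node> children = new HashMap<>();
        boolean visited;
    }

    private final Node root = new Node();

    /** Walks/creates the node for this path; returns true only the
     *  first time the path is seen, so callers can use it as a
     *  "should I crawl this?" check. */
    public boolean markVisited(String path) {
        Node n = root;
        for (String seg : path.split("/")) {
            if (seg.isEmpty()) continue;
            n = n.children.computeIfAbsent(seg, k -> new Node());
        }
        boolean first = !n.visited;
        n.visited = true;
        return first;
    }
}
```

The same structure could plausibly double as the task queue Paul mentions: unvisited leaf nodes are exactly the pending work, so no second copy of the URL set is needed.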
