HelloWorld,
I'm currently building a crawler based on the droids-core implementation,
trying not to change anything in the core API / interfaces yet. Due to
the lack of documentation I was not eager to dive straight into a lot
of crawler code of unclear quality. Perhaps this was a mistake, but so
far the approach suits me quite well.
My goal is to have a crawler with a very small footprint to be embedded
into a Hadoop map/reduce job. So I am not using Spring (IMHO too much
overhead to initialize when running inside map/reduce), recrawling or
even multi-threaded crawling. I do plan to spawn a lot of droids, each
taking care of one domain. Each droid has no need to jump domains or
hosts. Extracted data will be written into an HBase cluster for further
processing.
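To make the "one droid per domain, no jumping" idea concrete, here is a
minimal sketch in plain Java. It is not droids-core code: the class name
SingleHostCrawler and the fetchLinks function (a stand-in for "fetch the
page and extract its links") are hypothetical, and the Hadoop and HBase
parts are left out entirely.

```java
import java.net.URI;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

/** Hypothetical single-threaded crawler confined to one host (not part of droids-core). */
public class SingleHostCrawler {
    private final String host;
    // Assumption: stands in for "fetch the page and extract its links".
    private final Function<URI, List<URI>> fetchLinks;

    public SingleHostCrawler(String host, Function<URI, List<URI>> fetchLinks) {
        this.host = host;
        this.fetchLinks = fetchLinks;
    }

    /** True only for URIs on the host this crawler is responsible for. */
    public boolean accepts(URI uri) {
        String h = uri.getHost();
        return h != null && h.equalsIgnoreCase(host);
    }

    /** Crawls breadth-first from the seed; returns the URIs in visit order. */
    public List<URI> crawl(URI seed, int maxPages) {
        Deque<URI> queue = new ArrayDeque<>();
        Set<URI> seen = new HashSet<>();
        List<URI> visited = new ArrayList<>();
        if (accepts(seed)) {
            queue.add(seed);
            seen.add(seed);
        }
        while (!queue.isEmpty() && visited.size() < maxPages) {
            URI current = queue.poll();
            visited.add(current);
            for (URI link : fetchLinks.apply(current)) {
                // Drop off-host links and anything already queued or visited.
                if (accepts(link) && seen.add(link)) {
                    queue.add(link);
                }
            }
        }
        return visited;
    }
}
```

Passing the fetch step in as a function keeps the loop free of any
container or threading setup, which is the small-footprint property I am
after inside a map/reduce task.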
This is not a hobby side project but a project with a real-world
deployment, and it needs to be pretty much bulletproof. I am not
obsessing over beautiful architecture but rather focusing on stable,
clean and hopefully bug-free code. Along the way I am finding smaller
bugs in the droids-core implementation and thinking about additions and
minor changes to the API.
I am not sure *all* of this has its place in the droids-core module - in
the end my requirements are not very generic. But if somebody is
interested I am open to discussing how my work can help improve
droids-core.
Greetings,
Paul.
P.S.
Just parked my butt over at #droids/freenode. My timezone is CET and
I'll be checking activity on that channel in the evenings. To wake me up,
a ping on any IM mentioned in the signature will help.
Chapuis Bertil wrote:
IMHO one of the primary requirements is to clean the trunk: for example, the
work which has been done in the droids-crawler project has to be integrated
with the droids-core project. Then making some refactoring and implementing
some new features will be much easier.
--
paul rogalinski · mailto: [email protected] · msn: [email protected] · aim:
pu1s4r · icq: 1177279 · skype: pulsar