Hi,

I've just released a significant fork and extension of the Apache Droids framework, which I've been using for my own purposes for a while:

http://open.trickl.com/trickl-crawler/index.html

I've released it under the ASL, and the intent is that any useful code might be integrated into the official trunk of Droids in the future.

I've taken a rather brutal but pragmatic approach to using the framework: where the design hasn't met my needs, I've duplicated and revised code from the framework. So, for example, you will see that I have copied and changed significant chunks of the API, which are available under com.trickl.crawler.api. Obviously, in a perfect world, I would have worked with your development team to discuss changes and find sensible workarounds, but sadly I didn't have the time for that, so I just rushed ahead and made changes to my modified implementation where I needed them. So there will be conflicts in design, and perhaps philosophy, over some of my core changes, many of which you might regard as unnecessary. However, hopefully there will still be a significant chunk of code that is useful, and perhaps some of the design changes were indeed worthwhile.

There's quite a lot of code to digest in one sitting, but broadly the significant extensions to functionality are:

- Many more handlers. I really wanted to cope with the variety of responses web servers can throw back at me, and to deal with them solely with Spring-configured beans. For example, a recent requirement I dealt with was a web server whose information required an HTTP POST request with specific headers and post data formatted as JSON. It returned data in JSON format (but the server wrongly gave the content type as "text/html"). I then needed a particular JSON parameter which contained HTML, which I needed to parse into XML, then convert using XSL, then bind using JAXB (phew!). All configured via Spring beans.

- More flexibility in the parsers (so I've played with other HTML parser implementations).

- Class loader selection for "classpath:/" resources. I had a specific requirement where, when using "classpath:/" to load a resource, I needed to specify the actual class loader, as I've been working in a web server environment where the services jar (with the required resource) was separate from the jar that contained the Droids framework.

- JSON and SOAP processing. This may be against the philosophy of the framework. My particular example where I needed this was scraping information about films from Wikipedia. I decided I also wanted the film ratings, which are available from org.cara.webcarasearch - however, they require a SOAP request to get this information. While it's not appropriate for a web "crawler" which just follows links, it does seem appropriate for a web "walker" (which is really what my misnamed framework was designed for), where I have a set of known data sources I want to collect data from automatically.

- A "delegating" droid. My requirement here is that I have a single queue of tasks, but some of those tasks need to go to different droids (i.e. some tasks are SOAP tasks, some just grab images, some render web pages). So all the tasks are sent to the delegating droid, which then delegates to another droid depending on some criteria (I use a custom field to decide).

- A rendering droid, based on the Mozilla Gecko engine. I haven't touched this code in over a year... all I can say is that it once worked in a single-threaded environment, but had issues in multi-threaded environments. Since writing the code, I no longer have a requirement for rendering web pages, so I've not maintained it.

- Support for timed tasks. Some web servers are very slow to respond, and rather than allowing them to consume resources for too long, I needed the ability to kill some tasks after a time limit.

There are some major design issues as well. Probably too many to list here; should you wish to discuss them, it might be worth going over each individually. Many of them arise because my idea of a droid task is more "general" than that assumed in the main branch of Droids. So fields such as "depth" do not make any sense for a SOAP task. My top-level class "Task" just requires an identifier.

I hope the code is useful and presents ideas for refinement and development of the main branch of Apache Droids.

Best regards,
Tim
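
A note on the delegating droid, for anyone who wants a concrete picture of it: below is a minimal standalone sketch in Java. It is not the code from the release and does not use the Droids or trickl-crawler APIs. Every name in it (Task, Droid, RoutedTask, the "kind" field, RecordingDroid) is made up for illustration, assuming only that a task needs an identifier and that a custom field decides which droid receives it.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class DelegatingDroidSketch {

    /** The most general task: nothing but an identifier (no "depth", no URL). */
    public interface Task {
        String getId();
    }

    /** Anything that can process a task. Hypothetical, not the Droids interface. */
    public interface Droid {
        void handle(Task task);
    }

    /** A task with a custom routing field, e.g. "soap" or "image" (assumed names). */
    public static final class RoutedTask implements Task {
        private final String id;
        private final String kind;

        public RoutedTask(String id, String kind) {
            this.id = id;
            this.kind = kind;
        }

        @Override
        public String getId() {
            return id;
        }

        public String getKind() {
            return kind;
        }
    }

    /** Stand-in for a real SOAP or image droid: it only records the ids it receives. */
    public static final class RecordingDroid implements Droid {
        public final List<String> seen = new ArrayList<>();

        @Override
        public void handle(Task task) {
            seen.add(task.getId());
        }
    }

    /** Receives every task from the single queue and forwards it by routing field. */
    public static final class DelegatingDroid implements Droid {
        private final Map<String, Droid> delegates = new LinkedHashMap<>();

        public void register(String kind, Droid droid) {
            delegates.put(kind, droid);
        }

        @Override
        public void handle(Task task) {
            if (!(task instanceof RoutedTask)) {
                throw new IllegalArgumentException("No routing field on task " + task.getId());
            }
            String kind = ((RoutedTask) task).getKind();
            Droid target = delegates.get(kind);
            if (target == null) {
                throw new IllegalArgumentException("No droid registered for kind " + kind);
            }
            target.handle(task);
        }
    }

    public static void main(String[] args) {
        RecordingDroid soapDroid = new RecordingDroid();
        RecordingDroid imageDroid = new RecordingDroid();

        DelegatingDroid delegating = new DelegatingDroid();
        delegating.register("soap", soapDroid);
        delegating.register("image", imageDroid);

        // One queue of mixed tasks, all sent to the delegating droid.
        delegating.handle(new RoutedTask("film-rating-1", "soap"));
        delegating.handle(new RoutedTask("poster-1", "image"));

        System.out.println("soap droid handled:  " + soapDroid.seen);
        System.out.println("image droid handled: " + imageDroid.seen);
    }
}
```

In a Spring setup the register calls would presumably be replaced by a map property on the delegating bean, but that wiring is left out here to keep the sketch self-contained.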
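
A note on the timed tasks: the general idea of giving up on a slow task after a time limit can be sketched with the standard java.util.concurrent classes. This is only an illustration under my own assumptions, not the implementation in the release; the class name TimedTaskSketch, the method runWithTimeout and the fallback value are all hypothetical.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class TimedTaskSketch {

    /**
     * Runs the work on a helper thread and waits at most timeoutMillis for it.
     * If the deadline passes, the work is cancelled and the fallback is returned.
     */
    public static <T> T runWithTimeout(Callable<T> work, long timeoutMillis, T fallback) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        Future<T> future = executor.submit(work);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Ask the worker to stop by interrupting its thread.
            future.cancel(true);
            return fallback;
        } catch (InterruptedException e) {
            // The caller itself was interrupted: restore the flag and give up.
            Thread.currentThread().interrupt();
            future.cancel(true);
            return fallback;
        } catch (ExecutionException e) {
            throw new IllegalStateException("Task failed", e.getCause());
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        // A task that answers immediately, within its one-second limit.
        String fast = runWithTimeout(() -> "fetched", 1000, "timed out");

        // A task that simulates a slow server and misses its 100 ms limit.
        String slow = runWithTimeout(() -> {
            Thread.sleep(5000);
            return "fetched";
        }, 100, "timed out");

        System.out.println("fast task: " + fast);
        System.out.println("slow task: " + slow);
    }
}
```

One caveat with this approach: cancel(true) only interrupts the worker thread, so a task blocked in I/O that ignores interruption keeps running until its own socket timeout fires. Killing such a task for real would also need a timeout on the underlying connection.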
