On 12/13/2011 12:29 PM, Tim Gee wrote:
Hi,
   I've just released a significant fork and extension of the Apache Droids
framework, which I've been using for my own purposes for a while.
   http://open.trickl.com/trickl-crawler/index.html
   I've released it under the ASL and the intent is that any useful code
might be integrated into the official trunk of Droids in the future. I've
taken a rather brutal, but pragmatic approach to using the framework -
where the design hasn't met my needs I've duplicated and revised code from
the framework. So, for example, you will see that I have copied and changed
significant chunks of the API, which are now available under
com.trickl.crawler.api. Obviously, in a perfect world, I would work with
your development team to discuss changes and find sensible workarounds, but
sadly I didn't have the time for that so I just rushed ahead and made
changes where I needed them to my modified implementation.
   So there will be conflicts in design and perhaps philosophy about some of
my core changes, many of which you might regard as unnecessary. However,
hopefully, there will still be a significant chunk of code that is useful
and perhaps some design changes were indeed worthwhile.


Hello Tim,

I've been meaning to take some time to look through your release, but sadly I have not yet had the opportunity. Thank you for releasing it under the Apache License 2.0 (AL2); I hope we can incorporate some of your changes back into the core.

You've made some interesting changes and additions. Being able to process different content types (JSON, for example) might not make sense for every crawler, but it is quite useful. In my implementation, I'm doing all sorts of status code handling and recording the results in a database. I figure if I'm crawling my own material, I might as well know what is broken and where the redirects are. So that functionality is certainly very useful when going over a known set of content.
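
A minimal sketch of that kind of status-code bookkeeping follows. It is purely illustrative and is not code from Droids or trickl-crawler: the class name CrawlStatusLog, its methods, and the example URLs are all hypothetical, and an in-memory list stands in for the database.

```java
import java.util.ArrayList;
import java.util.EnumMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: records the HTTP status seen for each crawled URL. */
public class CrawlStatusLog {

    /** Coarse buckets for HTTP status codes. */
    public enum Outcome { OK, REDIRECT, BROKEN, SERVER_ERROR, OTHER }

    /** One crawled URL; location is the redirect target, or null if none. */
    public static final class Entry {
        public final String url;
        public final int status;
        public final String location;

        Entry(String url, int status, String location) {
            this.url = url;
            this.status = status;
            this.location = location;
        }
    }

    // Assumption: an in-memory list stands in for the database table.
    private final List<Entry> entries = new ArrayList<>();

    /** Maps a status code to a bucket by its hundreds class. */
    public static Outcome classify(int status) {
        if (status >= 200 && status < 300) return Outcome.OK;
        // Simplification: every 3xx counts as a redirect, including 304.
        if (status >= 300 && status < 400) return Outcome.REDIRECT;
        if (status >= 400 && status < 500) return Outcome.BROKEN;
        if (status >= 500 && status < 600) return Outcome.SERVER_ERROR;
        return Outcome.OTHER;
    }

    /** Stores one crawl result. */
    public void record(String url, int status, String location) {
        entries.add(new Entry(url, status, location));
    }

    /** Counts recorded entries per bucket. */
    public Map<Outcome, Integer> summary() {
        Map<Outcome, Integer> counts = new EnumMap<>(Outcome.class);
        for (Entry e : entries) {
            counts.merge(classify(e.status), 1, Integer::sum);
        }
        return counts;
    }

    /** Returns the entries that fall into the given bucket. */
    public List<Entry> withOutcome(Outcome wanted) {
        List<Entry> matches = new ArrayList<>();
        for (Entry e : entries) {
            if (classify(e.status) == wanted) {
                matches.add(e);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        CrawlStatusLog log = new CrawlStatusLog();
        log.record("http://example.org/", 200, null);
        log.record("http://example.org/old", 301, "http://example.org/new");
        log.record("http://example.org/missing", 404, null);
        System.out.println(log.summary());
        for (Entry e : log.withOutcome(Outcome.REDIRECT)) {
            System.out.println(e.url + " -> " + e.location);
        }
    }
}
```

In a real crawler the record call would sit wherever the HTTP response is received, and the list would be replaced by inserts into whatever store is in use.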

You mentioned that you've tried other HTML parsers. How does HtmlCleaner differ from JTidy? Do you have any feeling for how those and/or Neko compare to the ones from Tika? I've got a few pages that Tika blows up on.

How did you handle Spring? I see you aren't using the droids-spring module. I know so little about Spring that I don't know where its deficiencies are.

It would be nice to merge some of your functionality in. I hope I have time to look at it soon. Obviously patches are always welcome, and they may work well for the low-hanging fruit. The more significant changes would require more effort.
