About your use case: if you take Droids as a framework and are ready to get your hands into the code, I think you can solve nearly all the problems you mentioned.
I personally used Droids more as an ETL tool than as a crawler. To give you an idea of my main use case, I used it to build a dataset containing all the interventions [1] of the Swiss parliament as well as the parliamentarians' votes, which are stored in PDF files [2]. Then, we used social network analysis metrics to analyse the collected data. Droids was a very flexible and easy-to-customize foundation.

[1] - http://www.parlament.ch/ab/frameset/f/n/4817/345655/f_n_4817_345655_345933.htm
[2] - http://www.parlament.ch/poly/Abstimmung/48/out/vote_48_5042.pdf

On 2 March 2011 19:21, Jeremy Arnold <[email protected]> wrote:
> Hello droids developers,
>
> I wanted to talk a little bit about my use case and hopefully get an
> idea of how far away the code base is from it, as well as where I
> could put in personal time to help get it there.
>
> I'm interested in having a standalone searching/crawling service so I
> can have 1 application hosting the searching and indexing used by many
> different custom apps. I would like to have the option to use Solr or
> elasticsearch for indexing/searching. I've been working with elastic
> search for the past few days and am growing very fond of the
> flexibility it provides. I'm also a sucker for JSON over HTTP
> services. I need to be able to start crawls for both a filesystem and
> webpages from within my custom webapp, or have the crawls run at
> scheduled times. Those crawls then need to be indexed. I also need to
> have the ability to integrate my own content handlers so I can specify
> how certain pieces of content are indexed (e.g., custom PDF metadata).
> I also need to be able to easily add or remove items in an index from
> within my custom app, as well as the obvious updating items in an
> index.
>
> How far is the codebase away from being able to be used in the
> scenario described above?
>
> I've spent a lot of time over the past 3 days looking at the droids
> code base. It looks really promising but I'm not sure where it really
> stands overall. I know the elasticsearch piece doesn't exist, and I
> would love to put together that contribution if it seems like an
> acceptable counterpart to the existing droids-solr module. I would
> also like to take on the task of bringing a lot more consistency to
> the code base (e.g., commenting, code consistency). I'm just a bit
> concerned about taking on such a large task and submitting it as a
> patch. I also see a few places where testing would be beneficial and
> do not mind attacking that as well, I just don't want to waste time
> testing things that may be going away in the future.
>
> I'd like to start a discussion about my particular use case and where
> it would be most beneficial for me to get involved, or if there is a
> project out there that is better suited for my use case. I've
> evaluated Nutch but I can't say it's been the best experience and it
> doesn't quite fit into what I am trying to do. It does look like Nutch
> 2 will fit well into my use case but the timeline does not.
>
>
> Thanks,
> Jeremy
>

--
Bertil Chapuis
Agimem Sàrl
http://www.agimem.com
