Dear All,

It is very early days, but I would like to announce a new open source project named Behemoth, which we have put on Google Code under the Apache License (http://code.google.com/p/behemoth-pebble/).

Behemoth lets you deploy GATE or UIMA applications over a Hadoop cluster in order to do very large scale document analysis. It uses a very simple representation format which serves as common ground between UIMA-generated and GATE-generated annotations, hence achieving compatibility between both systems. Since it is Hadoop-based, it benefits from all of Hadoop's features (scalability, fault tolerance, etc.) and, most notably, the backing of a thriving open source community. Quite a few Apache projects already fit into it or will do so: Nutch, Tika, Mahout, HBase, etc.

The documentation is virtually non-existent (apart from some basic wiki pages), but this should hopefully be fixed at some point soon. Again, the project is at a very early stage, so do not expect anything stable. This also means that user feedback is more likely to influence the design or implementation.

Apart from the Google Code pages for the project, the best place to discuss Behemoth or get updates on it is the DigitalPebble user group at http://groups.google.com/group/digitalpebble.

We have used Behemoth on a 100K-document corpus on a small Amazon EC2 cluster with a GATE application and found that it worked fine. If you have a cluster available and a large corpus to process with UIMA or GATE, maybe you should give Behemoth a try.

Best regards,

Julien Nioche

--
DigitalPebble Ltd
http://www.digitalpebble.com
