It is exactly the "production"-oriented aspects of UIMA that, IMHO, make it unattractive for many users. Almost all commercial implementations of NLP use NLP as one component in their overall software stack, i.e., they typically have their own custom framework, platform, and architecture that do many things other than NLP. What these commercial systems need are open source libraries that do specific things, leaving their developers free to put them together according to their own requirements and tradeoffs. Any NLP system that forces the developer into one particular way (e.g., UIMA) of putting things together will not be attractive, and might steer them away from otherwise great algorithm implementations because of the significant additional baggage that the algorithms come with.
Today, there are so many ways of connecting components together, including workflows, platforms, configuration management, parallel batch processing (Hadoop, anyone?), parallel stream processing (Storm), etc. Almost the entire code base of UIMA has nothing to do with NLP. There's a big reason why IBM gave away all this code to Apache, and kept their core algorithms to themselves - it was clear to them where the value is. At least to me, it is clear where the value is - algorithms, and not frameworks. Software developers in industry are very capable of putting together their own frameworks. What they need help with are core NLP algorithms that they don't have the background to do themselves.

One example I would suggest (at least according to my view) is the difference between Lucene and Nutch. Being a library, Lucene has pretty much taken over search engine software development. Nutch, on the other hand, tries to be a full-fledged platform for crawling, indexing and search, and has not gathered anywhere near the same usage levels.

My vote is to please keep OpenNLP clean, smart, algorithm-centered, user-focused. Keep it simple. Math, stat, and algos. And excel at it. Please don't dumb OpenNLP down with unnecessary bloat that any decent software team can do easily, and might often prefer to implement in a different way. Connectors, not merging.

My two bits... :-)

Cheers,
Jeyendran

-----Original Message-----
From: Jörn Kottmann [mailto:[email protected]]
Sent: Monday, July 09, 2012 1:50 AM
To: [email protected]
Subject: Re: Apache "Text Analysis" top-level project?

On 07/09/2012 05:56 AM, Lance Norskog wrote:
> Would it make sense to join OpenNLP, UIMA, and Open Relevance into one
> top-level "Text Analysis" project? There are already cross-project
> connections between UIMA and OpenNLP. ORP seems dormant. It also seems
> a more natural place than OpenNLP for a database of tagged text.
>

OpenNLP and UIMA align nicely in my opinion.
OpenNLP just implements engines for various NLP tasks without any further support. UIMA, on the other hand, can do a lot of the additional things you need to run OpenNLP in a production system, e.g. scaling the engines to many machines, providing workflow support, resource loading and management, etc. So there is not really an overlap between the two.

UIMA has some NLP-related add-ons in its sandbox. Some of them duplicate functionality that OpenNLP also provides, e.g. POS tagging or the dictionary annotator, but that does not seem to amount to much.

Lucene contains a lot of NLP code for stemming and word segmentation in different languages. That's probably the biggest NLP-related code base at Apache next to OpenNLP.

Jörn
