RE: Apache "Text Analysis" top-level project?

Jeyendran Balakrishnan Mon, 09 Jul 2012 11:19:25 -0700

It is exactly the "production"-oriented aspects of UIMA that, IMHO, make it 
unattractive for many users.
Almost all commercial implementations of NLP use NLP as one of the components 
in their overall software stack,
i.e., they typically have their own custom framework, platform, and 
architecture that do many things other than NLP.
What these commercial systems need are open source libraries to do specific 
things, leaving their developers free to put them together according to their 
own requirements and tradeoffs.
Any NLP system that forces the developer to use one particular way (e.g., UIMA) of 
putting things together will not be attractive, 
and might steer them away from otherwise great algorithm implementations due to 
the significant additional baggage that the algorithms come with.

Today, there are so many ways of connecting components together, including 
workflows, platforms, configuration management, parallel batch processing 
(Hadoop, anyone?), parallel stream processing (Storm), etc. Almost the entire 
code-base of UIMA has nothing to do with NLP. There's a big reason why IBM gave 
away all this code to Apache, and kept their core algorithms to themselves - it 
was clear to them where the value is. At least to me, it is clear where the 
value is - algorithms, and not frameworks.

Software developers in industry are very capable of easily putting together 
their own frameworks. What they need help with are core NLP algorithms that 
they don't have the background to do themselves. One example I would suggest 
(at least in my view) is the difference between Lucene and Nutch. Being 
a library, Lucene has pretty much taken over search engine software 
development. Nutch, on the other hand, tries to be a full-fledged platform for 
crawling, indexing and search, and has not gathered anywhere near the same 
usage levels.

My vote is to please keep OpenNLP clean, smart, algorithm-centered, 
user-focused. 
Keep it simple. 
Math, stat, and algos. 
And excel at it.

Please don’t dumb OpenNLP down with unnecessary bloat that any decent software 
team can do easily, and might often prefer to implement in a different way.
Connectors, not merging.

My two bits... :-)

Cheers,
Jeyendran

-----Original Message-----
From: Jörn Kottmann [mailto:[email protected]] 
Sent: Monday, July 09, 2012 1:50 AM
To: [email protected]
Subject: Re: Apache "Text Analysis" top-level project?

On 07/09/2012 05:56 AM, Lance Norskog wrote:
> Would it make sense to join OpenNLP, UIMA, and Open Relevance into one 
> top-level "Text Analysis" project? There are already cross-project 
> connections between UIMA and OpenNLP. ORP seems dormant. It also seems 
> a more natural place than OpenNLP for a database of tagged text.
>
>

OpenNLP and UIMA align nicely in my opinion. OpenNLP just implements engines 
for various NLP tasks without any further support.
UIMA, on the other hand, can do a lot of the additional things you need to run 
OpenNLP in a production system, e.g. scaling the engines to many machines, 
providing workflow support, resource loading and management, etc.
So there is not really an overlap between the two.

UIMA has some NLP-related add-ons in its sandbox; some of them duplicate 
functionality that is also provided by OpenNLP, e.g. POS tagging or the 
dictionary annotator, but that does not seem to be that much.

Lucene contains a lot of NLP code for stemming and word segmentation in 
different languages. That's probably the biggest NLP-related code base next to 
OpenNLP at Apache.

Jörn
