I am going to make a really naive comment / suggestion.  (There are a lot of 
IFs in this post, so I apologize in advance.)

It's my observation that there are lots of companies out there that really 
wouldn't mind OpenNLP having better coverage of POS tagging and chunking in a 
whole assortment of languages.  It's not a long-term competitive advantage to do 
it by themselves.  They also probably have neither the skills nor the time to 
make it happen (without pooling).  I have worked for three companies so far that 
fit into that category; at one I got very close to just paying for the labeling 
and donating the content... clearly it didn't happen.

Coverage in this part varies by company but 

As I see it:

1) Part of the problem is the labeling of the content.  What if we were able to 
turk this?  It may require breaking down the labeling process into a whole 
bunch of subtasks.  Further, it would probably require finding a subset of 
turkers capable of helping with this type of advanced labeling task.  I am a 
fan of companies like CrowdFlower that build on top of Amazon Mechanical Turk 
and have pre-validated turkers who are known to perform well on certain task 
styles.

2) Assuming labeling of good enough quality could be achieved with (1), could 
we have a fund / charity / Kickstarter to pay for this labeling?  Perhaps the 
funding is split up by language, so that companies could, for instance, vote 
with their money on what they need to get fleshed out.
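
To make (1) a bit more concrete, here is a minimal sketch of one way redundant 
turker judgments for a sentence might be merged: a simple per-token majority 
vote, with low-agreement tokens flagged for expert review.  Everything here is 
an assumption for illustration (the function name, the 0.6 threshold, the 
Penn-style tags); it is not something OpenNLP or CrowdFlower provides.

```python
from collections import Counter


def aggregate_tags(judgments, min_agreement=0.6):
    """Merge per-worker POS tag sequences for one sentence by majority vote.

    judgments: list of tag lists, one list per worker, all the same length.
    Returns one tag per token, or None where agreement is below
    min_agreement (hypothetical threshold) so an expert can review it.
    """
    merged = []
    for token_votes in zip(*judgments):
        tag, count = Counter(token_votes).most_common(1)[0]
        if count / len(token_votes) >= min_agreement:
            merged.append(tag)
        else:
            merged.append(None)  # not enough agreement; send to review
    return merged


# Three hypothetical workers tagging the same three-token sentence.
votes = [
    ["DT", "NN", "VBZ"],
    ["DT", "NN", "VBZ"],
    ["DT", "JJ", "NN"],
]
print(aggregate_tags(votes))
```

Tokens that come back as None would be the ones worth paying a trained 
annotator to look at, which keeps the expensive labeling to a minimum.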

Of course, 1 + 2 don't solve the complete picture.

Thoughts? Heckles?

I actually work for a large corp where I can argue that we need to put into the 
pot for several European languages and a couple of Asian ones.

C




On Oct 1, 2013, at 11:58 PM, Thomas Zastrow <[email protected]> wrote:

> Dear all,
> 
> Some of you already mentioned the Brat tool, so let me point you to WebAnno. 
> It is based on Brat but adds some functionality, for example extensions 
> for crowdsourcing:
> 
> http://code.google.com/p/webanno/
> 
> Best,
> 
> Tom
> 
> 
> 
> 
> On 01.10.2013 17:01, Michael Schmitz wrote:
>> Hi, I've used OpenNLP for a few years--in particular the chunker, POS
>> tagger, and tokenizer.  We're grateful for a high performance library
>> with an Apache license, but one of our greatest complaints is the
>> quality of the models.  Yes--we're aware we can train our own--but
>> most people are looking for something that is good enough out of the
>> box (we aim for this with our products).  I'm not surprised that
>> volunteer engineers don't want to spend their time annotating data ;-)
>> 
>> I'm curious what other people see as the biggest shortcomings of
>> OpenNLP or the most important next steps for OpenNLP.  I may have an
>> opportunity to contribute to the project and I'm trying to figure out
>> where the community thinks the biggest impact could be made.
>> 
>> Peace.
>> Michael Schmitz
> 
