This seems like a fairly big deal to me.  I've recently switched to using
FreeLing in my dissertation work because of this -- I wanted a single tool
for a basic pipeline that gave me coverage of both English and Spanish.
Now, we could in principle use FreeLing to generate a seed corpus for
other languages and then semi-supervise up from there, but we would then
also inherit FreeLing's errors.
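
To make that concrete, the bootstrapping step might look roughly like the
sketch below.  This is only a minimal sketch against the OpenNLP 1.5.x
training API -- the file names and the "es" language code are placeholders,
and it assumes a separate script has already converted FreeLing's output
into OpenNLP's one-sentence-per-line word_tag training format:

    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import java.io.InputStreamReader;

    import opennlp.tools.postag.POSModel;
    import opennlp.tools.postag.POSSample;
    import opennlp.tools.postag.POSTaggerME;
    import opennlp.tools.postag.WordTagSampleStream;
    import opennlp.tools.util.ObjectStream;
    import opennlp.tools.util.PlainTextByLineStream;
    import opennlp.tools.util.TrainingParameters;

    public class TrainSeedTagger {
        public static void main(String[] args) throws Exception {
            // Seed corpus: one sentence per line, tokens as word_tag
            // pairs, converted from FreeLing output elsewhere.
            // (Placeholder file name.)
            ObjectStream<String> lines = new PlainTextByLineStream(
                    new InputStreamReader(
                            new FileInputStream("es-seed-from-freeling.train"),
                            "UTF-8"));
            ObjectStream<POSSample> samples = new WordTagSampleStream(lines);

            // Train a POS model with default parameters; the two null
            // arguments are the optional tag and ngram dictionaries.
            POSModel model = POSTaggerME.train("es", samples,
                    TrainingParameters.defaultParams(), null, null);

            FileOutputStream out = new FileOutputStream("es-pos-seed.bin");
            model.serialize(out);
            out.close();
        }
    }

Of course, a model trained this way is only as good as the seed
annotations, which is exactly where FreeLing's errors would creep in.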

jds


On Wed, Oct 2, 2013 at 3:33 AM, Chris Collins <[email protected]> wrote:

> I am going to make a really naive comment / idea / input.  (There are a
> lot of IFs in this post, so I apologize in advance.)
>
> It's my observation that there are lots of companies out there that really
> wouldn't mind OpenNLP having better coverage of POS tagging and chunking in
> a whole assortment of languages.  It's not a long-term competitive
> advantage to do it by themselves.  They also probably have neither the
> skills nor the time to make it happen (without pooling).  I have worked for
> three companies so far that fit into that category; at one I got very close
> to just paying for the labeling and donating the content... clearly it
> didn't happen.
>
> Which languages need coverage varies by company, but as I see it:
>
> 1) Part of the problem is the labeling of the content.  What if we were
> able to turk this?  It might require breaking the labeling process down
> into a whole bunch of subtasks, and it would probably require finding a
> subset of turkers capable of aiding in labeling for this kind of advanced
> task.  I am a fan of companies like CrowdFlower that build on top of
> Amazon Mechanical Turk and have pre-validated turkers who are known to
> perform well on certain task styles.
>
> 2) Assuming labeling of sufficient quality could be achieved with (1),
> could we have a fund / charity / Kickstarter to pay for this labeling?
> Perhaps the funding could be split up by language, so that companies
> could vote with their money on which languages get fleshed out.
>
> Of course, 1 + 2 don't solve the complete picture.
>
> Thoughts?  Heckles?
>
> I actually work for a large corp that, I can argue, needs to put into the
> pot for several European languages and a couple of Asian ones.
>
> C
>
>
>
>
> On Oct 1, 2013, at 11:58 PM, Thomas Zastrow <[email protected]> wrote:
>
> > Dear all,
> >
> > Some of you have already mentioned the Brat tool, so let me point you
> > to WebAnno.  It is based on Brat but adds more functionality, such as
> > extensions for crowdsourcing:
> >
> > http://code.google.com/p/webanno/
> >
> > Best,
> >
> > Tom
> >
> >
> >
> >
> > On 01.10.2013 17:01, Michael Schmitz wrote:
> >> Hi, I've used OpenNLP for a few years -- in particular the chunker, POS
> >> tagger, and tokenizer.  We're grateful for a high-performance library
> >> with an Apache license, but one of our greatest complaints is the
> >> quality of the models.  Yes, we're aware we can train our own, but most
> >> people are looking for something that is good enough out of the box (we
> >> aim for this with our products).  I'm not surprised that volunteer
> >> engineers don't want to spend their time annotating data ;-)
> >>
> >> I'm curious what other people see as the biggest shortcomings of
> >> OpenNLP, or the most important next steps for OpenNLP.  I may have an
> >> opportunity to contribute to the project, and I'm trying to figure out
> >> where the community thinks the biggest impact could be made.
> >>
> >> Peace.
> >> Michael Schmitz
> >
>
>
