This seems like a fairly big deal to me. I've recently switched to using Freeling in my dissertation work because of this -- I wanted to use the same tool for a basic pipeline that gave me coverage of English and Spanish. Now, we could in principle use Freeling to generate a seed corpus for other languages, then semi-supervise up from there, but we would then also inherit Freeling's errors.
jds

On Wed, Oct 2, 2013 at 3:33 AM, Chris Collins <[email protected]> wrote:

> I am going to make a really naive comment / idea / input. (There are a
> lot of IFs in this post, so I apologize in advance.)
>
> It's my observation that there are lots of companies out there that really
> wouldn't mind OpenNLP having better coverage of POS tagging and chunking in
> a whole assortment of languages. It's not a long-term competitive advantage
> to do it by themselves. They also probably have neither the skills nor the
> time to make it happen (without pooling). I have worked for three companies
> so far that fit into that category; at one I got very close to just paying
> for the labeling and donating the content... clearly it didn't happen.
>
> Coverage in this part varies by company but
>
> As I see it:
>
> 1) Part of the problem is the labeling of the content. What if we were
> able to turk this? It may require breaking down the labeling process into
> a whole bunch of subtasks. Further, it would probably require finding a
> subset of turkers capable of aiding in labeling for this type of advanced
> task. I am a fan of companies like CrowdFlower that build on top of Amazon
> Mechanical Turk and have pre-validated turkers that are known to perform
> well with certain task styles.
>
> 2) Assuming labeling of good enough quality could be achieved with (1),
> could we have a fund / charity / Kickstarter to pay for this labeling?
> Perhaps the funding is split up by language, so for instance companies
> could vote with their money on what they need to get fleshed out.
>
> Of course 1 + 2 don't solve the complete picture.
>
> Thoughts? Heckles?
>
> I actually work for a large corp where I can argue that we need to put
> into the pot for several European languages and a couple of Asian ones.
>
> C
>
> On Oct 1, 2013, at 11:58 PM, Thomas Zastrow <[email protected]> wrote:
>
> > Dear all,
> >
> > Some of you already mentioned the Brat tool, so let me point you to
> > WebAnno. It is based on Brat, but has some more functionality, for
> > example extensions for crowdsourcing:
> >
> > http://code.google.com/p/webanno/
> >
> > Best,
> >
> > Tom
> >
> > On 01.10.2013 17:01, Michael Schmitz wrote:
> >> Hi, I've used OpenNLP for a few years--in particular the chunker, POS
> >> tagger, and tokenizer. We're grateful for a high performance library
> >> with an Apache license, but one of our greatest complaints is the
> >> quality of the models. Yes--we're aware we can train our own--but
> >> most people are looking for something that is good enough out of the
> >> box (we aim for this with our products). I'm not surprised that
> >> volunteer engineers don't want to spend their time annotating data ;-)
> >>
> >> I'm curious what other people see as the biggest shortcomings of
> >> OpenNLP or the most important next steps for OpenNLP. I may have an
> >> opportunity to contribute to the project and I'm trying to figure out
> >> where the community thinks the biggest impact could be made.
> >>
> >> Peace.
> >> Michael Schmitz
