We are in progress.  So far we have trained sentence models for several 
languages but have not done any detailed evaluation of quality (yet).  

Sentence segmentation
===================
We pulled Wikipedia dumps for several languages.  The plan is to use these as 
the corpus for labeling in the different training exercises.  For now we pulled 
about 100 articles, typically a couple of pages of text each, and stripped them 
of any markup (which is enough for training a language model).  We then handed 
these articles to a native speaker to simply mark up sentence boundaries.  I am 
pretty confident that this was one task we didn't really need a native speaker 
for (at least for the first pass we applied).
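
To make the training step concrete, here is a minimal sketch (using the 
OpenNLP 1.5 API; the file names, language code, and training parameters are 
placeholders rather than our actual setup) of how a sentence model can be 
trained from a file with one marked-up sentence per line:

  import java.io.BufferedOutputStream;
  import java.io.FileInputStream;
  import java.io.FileOutputStream;
  import java.io.OutputStream;
  import java.nio.charset.Charset;

  import opennlp.tools.sentdetect.SentenceDetectorME;
  import opennlp.tools.sentdetect.SentenceModel;
  import opennlp.tools.sentdetect.SentenceSample;
  import opennlp.tools.sentdetect.SentenceSampleStream;
  import opennlp.tools.util.ObjectStream;
  import opennlp.tools.util.PlainTextByLineStream;

  public class TrainSentenceModel {

    public static void main(String[] args) throws Exception {
      // Training data: one sentence per line, as produced by the
      // native-speaker pass over the stripped Wikipedia articles.
      Charset charset = Charset.forName("UTF-8");
      ObjectStream<String> lineStream =
          new PlainTextByLineStream(new FileInputStream("xx-sent.train"), charset);
      ObjectStream<SentenceSample> sampleStream = new SentenceSampleStream(lineStream);

      SentenceModel model;
      try {
        // "xx" stands in for the language code; cutoff 5 / 100 iterations
        // are the usual defaults, not tuned values.
        model = SentenceDetectorME.train("xx", sampleStream, true, null, 5, 100);
      } finally {
        sampleStream.close();
      }

      // Serialize the model so it can be loaded with SentenceDetectorME later.
      OutputStream modelOut =
          new BufferedOutputStream(new FileOutputStream("xx-sent.bin"));
      try {
        model.serialize(modelOut);
      } finally {
        modelOut.close();
      }
    }
  }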

For this exercise, distribution of the articles was done manually via email to 
internal employees.  A general workflow / editorial engine is in the works, but 
it is really focused on the POS training exercise.

POS
====
Still in the planning stage.  We are playing with how we can turn this into as 
much of a turking task as possible and how we will effectively measure the 
quality of the labeled data (that is, if we can come up with turking exercises 
that require only a minimum qualification).  We need to build more substantial 
tools for breaking up our dataset into a variety of labeling tasks.  Because we 
are trying to use minimally trained turkers, it is likely we would effectively 
break up labeling an individual sentence into numerous tasks and have overlap 
in what turkers label.  Because of this we need to re-assemble our labeled data 
sets, measure and act on disagreements, etc.  All so we can choose cheap 
labor :-}
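
To make the "re-assemble and measure disagreements" part more concrete, here is 
a rough sketch (the class name, the threshold, and the assumption that each 
token is labeled by a handful of turkers are all made up for illustration) of 
merging overlapping turker labels by majority vote and flagging tokens whose 
agreement is too low to resolve automatically:

  import java.util.ArrayList;
  import java.util.HashMap;
  import java.util.List;
  import java.util.Map;

  // Merges POS labels assigned by several turkers to the same token.
  // The majority label wins; tokens without sufficient agreement are
  // returned as null so they can be re-queued or sent to a more
  // qualified annotator.
  public class TurkerVoteMerger {

    public static String merge(List<String> labels, double minAgreement) {
      // Count how often each label was assigned to this token.
      Map<String, Integer> counts = new HashMap<String, Integer>();
      for (String label : labels) {
        Integer c = counts.get(label);
        counts.put(label, c == null ? 1 : c + 1);
      }

      // Find the most frequent label.
      String best = null;
      int bestCount = 0;
      for (Map.Entry<String, Integer> e : counts.entrySet()) {
        if (e.getValue() > bestCount) {
          best = e.getKey();
          bestCount = e.getValue();
        }
      }

      double agreement = (double) bestCount / labels.size();
      // Below the agreement threshold the token is considered unresolved.
      return agreement >= minAgreement ? best : null;
    }

    public static void main(String[] args) {
      List<String> labels = new ArrayList<String>();
      labels.add("NN");
      labels.add("NN");
      labels.add("VB");

      // 2 of 3 turkers agree: ~0.67 agreement.
      System.out.println(merge(labels, 0.6));   // NN
      System.out.println(merge(labels, 0.9));   // null -> needs review
    }
  }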



Sorry if this didn't tell you very much (it possibly even seems dumb).

We are doing all of this in a partial vacuum.  Things that would have been 
useful to know are:


1) I can understand that you cannot distribute the original training set for 
English etc., perhaps because of distribution rights.  Knowing where, or at 
least the flavor of where, the original corpus came from would be nice.  
Knowing what type of people labeled the data, how many of them there were, and 
how much data was labeled would be useful in determining if we are off.

2) What are the planned models?  Are there any existing open source projects 
that want help on these exercises?  

3) I see that with 1.5 there seems to be better support for taking training 
sets from other file formats.  What are the motivations?  Is it so that ONLP 
can take advantage of existing training sets (which would help with 2), or is 
it generally to help the community interoperate better? 


Let me know if I can be of help.

Best

C

On Apr 27, 2011, at 11:16 AM, Jörn Kottmann wrote:

> On 4/27/11 7:56 PM, Chris Collins wrote:
>> I think that is a great idea.  I didn't really want to blast the mailing 
>> list as I am not a contributor as of today.  I have been using ONLP for a 
>> couple of years now, when it came time to train sentence and POS models in 
>> languages not currently supported I was surprised to see no guidelines, 
>> suggestions or best practices.  Further I see that with 1.5 support for 
>> reading training sets became more flexible but I have no idea what the 
>> public facing plans are for supporting new languages and what the 
>> methodology was going to be.  I am not looking for an answer to these 
>> questions from you, but I certainly would have appreciated a better ecosystem 
>> on the ONLP website.  If there was such a thing I would certainly 
>> contribute our findings (albeit perhaps not the best ones :-} )
> 
> We finally started to work on the documentation, and the 1.5.1 release will 
> come with a docbook containing documentation, including how to train OpenNLP 
> on certain data sets.
> 
> It would be really nice if you could share your experience with us: on which 
> languages and which data sets did you train?
> 
> Jörn
