We are in progress. So far we have trained sentence models for several languages but have not done any detailed evaluation of quality (yet).
Sentence segmentation
=====================

We pulled Wikipedia dumps for several languages. The plan is to use these as the corpus for labeling across the different training exercises. For now we pulled about 100 articles, each typically a couple of pages of text, and stripped them of any markup (which is enough for training a language model). We then handed these articles to a native speaker to simply mark up sentence boundaries. I am pretty confident this was one task we didn't really need a native speaker for (at least for the first pass we applied). A rough sketch of how such hand-marked data feeds into the 1.5 trainer is at the end of this mail.

For this exercise the distribution of the articles was all done manually via email to internal employees. A general workflow / editorial engine is in the works, but it is really focused on the POS training exercise.

POS
===

Still in the planning stage. We are playing with how we can turn this into as much of a turking task as possible, and with how we will effectively measure the quality of the labeled data (that is, if we can come up with turking exercises that require a minimum qualification). We need to build more substantial tools for breaking up our dataset into a variety of labeling tasks. Because we are trying to use minimally trained turkers, we would likely break the labeling of an individual sentence into numerous tasks and have overlap in what the turkers label. Because of this we need to re-assemble our labeled data sets, measure and act on disagreements, etc. All so we can choose cheap labor :-} (A toy sketch of the kind of vote/agreement bookkeeping I mean is also at the end of this mail.)

Sorry if this didn't tell you very much (possibly it even seems dumb). We are doing all of this in a partial vacuum. Things that would have been useful to know are:

1) I can understand that you cannot distribute the original training set for English etc., perhaps because of distribution rights. Knowing where, or at least the flavor of where, the original corpus came from would be nice. What type of people and how many were used in labeling the data, and how much data there was, would be useful in determining if we are off.

2) What are the planned models, and are there any existing open source projects that want help on these exercises?

3) I see that with 1.5 there seems to be better support for taking training sets from other file formats. What are the motivations? Is it so that ONLP can take advantage of existing training sets that will help with 2), or is it generally to help the community interoperate better?

Let me know if I can be of help.

Best

C

On Apr 27, 2011, at 11:16 AM, Jörn Kottmann wrote:

> On 4/27/11 7:56 PM, Chris Collins wrote:
>> I think that is a great idea. I didn't really want to blast the mailing
>> list as I am not a contributor as of today. I have been using ONLP for a
>> couple of years now, and when it came time to train sentence and POS models in
>> languages not currently supported I was surprised to see no guidelines,
>> suggestions or best practices. Further, I see that with 1.5 support for
>> reading training sets became more flexible, but I have no idea what the
>> public-facing plans are for supporting new languages and what the
>> methodology was going to be. I am not looking for an answer to these
>> questions from you, but I certainly would have appreciated a better ecosystem
>> on the ONLP website. If there was such a thing I would certainly
>> share what our findings were (albeit perhaps not the best ones :-} )
>
> We finally started to work on the documentation, and the 1.5.1 release will
> come with a docbook
> containing documentation, also how to train OpenNLP on certain data sets.
>
> It would be really nice if you could share your experience with us, on which
> languages and which data sets did you train?
>
> Jörn
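
P.S. Since you asked about the sentence training: below is a rough sketch of the kind of call I mean for feeding the hand-marked sentence data into the trainer. It assumes the 1.5-style API (SentenceDetectorME / SentenceSampleStream) and a training file with one sentence per line and an empty line between documents. The file names xx-sent.train / xx-sent.bin and the language code "xx" are placeholders, and the exact train(...) overload may differ a bit between 1.5.x builds, so please treat it as a sketch rather than tested code.

import java.io.BufferedOutputStream;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.OutputStream;
import java.nio.charset.Charset;

import opennlp.tools.sentdetect.SentenceDetectorME;
import opennlp.tools.sentdetect.SentenceModel;
import opennlp.tools.sentdetect.SentenceSample;
import opennlp.tools.sentdetect.SentenceSampleStream;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
import opennlp.tools.util.TrainingParameters;

public class SentenceTrainerSketch {

  public static void main(String[] args) throws Exception {
    Charset charset = Charset.forName("UTF-8");

    // "xx-sent.train" is a placeholder: one sentence per line,
    // an empty line marks a document boundary.
    ObjectStream<String> lineStream =
        new PlainTextByLineStream(new FileInputStream("xx-sent.train"), charset);
    ObjectStream<SentenceSample> sampleStream = new SentenceSampleStream(lineStream);

    SentenceModel model;
    try {
      // "xx" is a placeholder language code.
      model = SentenceDetectorME.train("xx", sampleStream, true, null,
          TrainingParameters.defaultParams());
    } finally {
      sampleStream.close();
    }

    // write the trained model out so it can be loaded with SentenceDetectorME later
    OutputStream modelOut = new BufferedOutputStream(new FileOutputStream("xx-sent.bin"));
    try {
      model.serialize(modelOut);
    } finally {
      modelOut.close();
    }
  }
}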
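
And here is a second, toy sketch of the vote/agreement bookkeeping I mean for re-assembling overlapping turker POS labels. None of this is OpenNLP API; every class name, threshold and tag value is made up for illustration, and a real pipeline would read the turker output from files instead of hard-coding it.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy sketch: merge overlapping POS labels from several turkers and
// flag tokens whose agreement falls below a threshold.
public class PosVoteSketch {

  // Tag votes for a single token: tag -> number of turkers who chose it.
  static class TokenVotes {
    final Map<String, Integer> counts = new HashMap<String, Integer>();

    void add(String tag) {
      Integer c = counts.get(tag);
      counts.put(tag, c == null ? 1 : c + 1);
    }

    // Fraction of annotators that picked the most popular tag.
    double agreement() {
      int total = 0, best = 0;
      for (int c : counts.values()) {
        total += c;
        if (c > best) best = c;
      }
      return total == 0 ? 0.0 : (double) best / total;
    }

    String majorityTag() {
      String best = null;
      int bestCount = -1;
      for (Map.Entry<String, Integer> e : counts.entrySet()) {
        if (e.getValue() > bestCount) {
          bestCount = e.getValue();
          best = e.getKey();
        }
      }
      return best;
    }
  }

  public static void main(String[] args) {
    // One sentence, three turkers, overlapping labels per token (made-up data).
    String[] tokens = {"The", "dog", "barks"};
    String[][] turkerTags = {
        {"DT", "NN", "VBZ"},
        {"DT", "NN", "VBZ"},
        {"DT", "NN", "NNS"}   // one disagreement on the last token
    };

    List<TokenVotes> votes = new ArrayList<TokenVotes>();
    for (int i = 0; i < tokens.length; i++) votes.add(new TokenVotes());

    for (String[] tags : turkerTags)
      for (int i = 0; i < tags.length; i++)
        votes.get(i).add(tags[i]);

    // Emit the majority tag per token and flag low-agreement tokens for adjudication.
    for (int i = 0; i < tokens.length; i++) {
      TokenVotes v = votes.get(i);
      System.out.printf("%s\t%s\tagreement=%.2f%s%n",
          tokens[i], v.majorityTag(), v.agreement(),
          v.agreement() < 0.75 ? "  <-- needs adjudication" : "");
    }
  }
}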
