How can I de-tokenize a CoNLL training set?
I have tried some commands, but none of them seems to work. I ran:
./detokenizer.sh models/CoNLL2009-ST-English-train.txt > models/CoNLL2009-ST-English-train.sent
where detokenizer.sh is:
#!/bin/bash
# Extract the FORM column (field 2) of each tab-separated CoNLL 2009
# line and join the tokens of every sentence onto a single line.
SEP="\t"
TAG="[^${SEP}]*"
SENTENCESEP="<SENTENCE123456789SEP>"
perl -pe "s/^${TAG}${SEP}(${TAG}).*$/\1/g" "$1" \
    | perl -pe "s/^\s*$/\n/g" \
    | perl -pe "s/^$/${SENTENCESEP}/g" \
    | perl -pe "s/\n/ /g" \
    | perl -pe "s/ ${SENTENCESEP} /\n/g"
Then, with the sentences having all tokens separated by spaces, I need to merge the words back together while adding the <space> markers, but I don't know how to do that with the DictionaryDetokenizer:
./opennlp DictionaryDetokenizer ../models/en-detokenizer.xml < ../models/CoNLL2009-ST-English-train.sent
as it merges the tokens of each sentence but does not add the <space> marker.
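What I am after is something like the Java sketch below. I am assuming the opennlp-tools Detokenizer interface has the detokenize(String[] tokens, String splitMarker) method from the 1.5.x API docs, and the class name SplitMarkerDetokenizer is made up here. It reads the one-sentence-per-line output of the script above and prints each sentence with a <SPLIT> tag wherever the dictionary merges two tokens, since <SPLIT> is the marker the tokenizer training format uses:

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import opennlp.tools.tokenize.DetokenizationDictionary;
import opennlp.tools.tokenize.Detokenizer;
import opennlp.tools.tokenize.DictionaryDetokenizer;

public class SplitMarkerDetokenizer {
    public static void main(String[] args) throws IOException {
        // args[0]: detokenizer dictionary, e.g. en-detokenizer.xml
        InputStream dictIn = new FileInputStream(args[0]);
        Detokenizer detokenizer =
            new DictionaryDetokenizer(new DetokenizationDictionary(dictIn));
        dictIn.close();

        BufferedReader in = new BufferedReader(
            new InputStreamReader(System.in, "UTF-8"));
        String line;
        while ((line = in.readLine()) != null) {
            if (line.isEmpty())
                continue;
            // Tokens arrive separated by single spaces, one sentence
            // per line; detokenize(tokens, marker) joins them and
            // writes the marker wherever two tokens are merged
            // without a space in between.
            System.out.println(
                detokenizer.detokenize(line.split(" "), "<SPLIT>"));
        }
    }
}

With opennlp-tools on the classpath, it would run as

java SplitMarkerDetokenizer ../models/en-detokenizer.xml < ../models/CoNLL2009-ST-English-train.sent > ../models/CoNLL2009-ST-English-train.split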
Thanks in advance,
Joan.
On 04/10/2012 04:51 PM, Jörn Kottmann wrote:
On 04/10/2012 04:44 PM, Joan Codina wrote:
But to train the system I only found this file, which is small:
http://opennlp.cvs.sourceforge.net/viewvc/opennlp/opennlp/src/test/resources/opennlp/tools/tokenize/token.train?view=markup
It contains only 121 sentences. I don't know whether that is enough, or whether there are other annotated training sets.
No, that is not enough. Get a training data set for the language you need. Most of the data sets referenced in the Corpora section can be used to train the tokenizer. These corpora are already tokenized and can be de-tokenized into training data for the tokenizer.
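Such training data is one sentence per line with whitespace-separated tokens, plus a <SPLIT> tag wherever the tokenizer has to split two tokens that are not separated by whitespace; the lines in the token.train file you linked look roughly like:

Pierre Vinken<SPLIT>, 61 years old<SPLIT>, will join the board as a nonexecutive director Nov. 29<SPLIT>.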
Jörn