Hi Jörn,
I am trying to train the tokenizer on a Chinese corpus and got the
exception below on the console:
Indexing events with TwoPass using cutoff of 5
Computing event counts... done. 4476143 events
Indexing... done.
Sorting and merging events... done. Reduced 4476143 events to 358244.
Done indexing in 30.55 s.
opennlp.tools.util.InsufficientTrainingDataException: Training data must contain more than one outcome
        at opennlp.tools.ml.AbstractEventTrainer.train(AbstractEventTrainer.java:78)
        at opennlp.tools.ml.AbstractEventTrainer.train(AbstractEventTrainer.java:93)
        at opennlp.tools.tokenize.TokenizerME.train(TokenizerME.java:247)
        at com.mzdee.nlp.Tokenizer.main(Tokenizer.java:207)
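After digging a bit, I wonder if the problem is my input format: if I
read the user documentation correctly, TokenSampleStream expects each
line to be tokenized already, with tokens separated by whitespace or by
a <SPLIT> tag where the original text has no whitespace. My corpus is
raw Chinese text with neither, so perhaps every training event ends up
with the same outcome. A training line would then look roughly like
this (a made-up example):

    我<SPLIT>喜欢<SPLIT>自然语言<SPLIT>处理<SPLIT>。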
Still, I am new to NLP and don't quite understand what's going on. The
code snippet is below:
InputStreamFactory inputStreamFactory = new MarkableFileInputStreamFactory(
        new File("/Users/aaron/resume-corpus/corpus_一_20140804162433.txt"));
Charset charset = Charset.forName("UTF-8");
ObjectStream<String> lineStream = new PlainTextByLineStream(inputStreamFactory, charset);
ObjectStream<TokenSample> sampleStream = new TokenSampleStream(lineStream);

TokenizerModel model;
try {
    // The old API does not seem to compile with the latest version:
    // model = TokenizerME.train("zh", sampleStream, true, TrainingParameters.defaultParams());
    String languageCode = "zh";
    boolean useAlphaNumericOptimization = false;
    model = TokenizerME.train(sampleStream,
            TokenizerFactory.create(null, languageCode, null, useAlphaNumericOptimization, null),
            TrainingParameters.defaultParams());
} finally {
    sampleStream.close();
}

OutputStream modelOut = null;
try {
    modelOut = new BufferedOutputStream(
            new FileOutputStream("/Users/aaron/resume-corpus/zh-token.bin"));
    model.serialize(modelOut);
} finally {
    if (modelOut != null)
        modelOut.close();
}
The line I commented out above seems to be out of date with the latest version.
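Once training succeeds, I plan to sanity-check the model by loading it
back and tokenizing an arbitrary test sentence (a minimal sketch; the
test string is just made up):

    try (InputStream modelIn = new FileInputStream("/Users/aaron/resume-corpus/zh-token.bin")) {
        // Load the serialized model and run it over a test string.
        TokenizerModel loadedModel = new TokenizerModel(modelIn);
        TokenizerME tokenizer = new TokenizerME(loadedModel);
        String[] tokens = tokenizer.tokenize("我喜欢自然语言处理。");
        System.out.println(String.join(" ", tokens));
    }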
Help !!!
> On 1 Sep 2017, at 7:18 PM, Joern Kottmann <[email protected]> wrote:
>
> Our current tokenizer can be trained to segment Chinese just by
> following the user documentation,
> but it might not work very well. We never tried this.
>
> Do you have a corpus you can train on?
>
> OntoNotes has some Chinese text and could probably be used.
>
> Jörn
>
> On Fri, Sep 1, 2017 at 11:15 AM, 王春华 <[email protected]> wrote:
>> Hello everyone,
>>
>> I wonder if there is any tokenizing model for Chinese text, or where to find
>> guidelines on how to generate one myself.
>>
>> thanks!
>> Aaron