Hi Jörn,
I am trying to train the tokenizer on a Chinese corpus and got the
exception below on the console:
Indexing events with TwoPass using cutoff of 5
Computing event counts... done. 4476143 events
Indexing... done.
Sorting and merging events... done. Reduced 4476143 events to 358244.
Done indexing in 30.55 s.
opennlp.tools.util.InsufficientTrainingDataException: Training data must contain more than one outcome
        at opennlp.tools.ml.AbstractEventTrainer.train(AbstractEventTrainer.java:78)
        at opennlp.tools.ml.AbstractEventTrainer.train(AbstractEventTrainer.java:93)
        at opennlp.tools.tokenize.TokenizerME.train(TokenizerME.java:247)
        at com.mzdee.nlp.Tokenizer.main(Tokenizer.java:207)
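After digging a bit, I wonder if the problem is my input format: if I
read the user documentation correctly, TokenSampleStream expects each
line to be tokenized already, with tokens separated by whitespace or by
a <SPLIT> tag where the original text has no whitespace. My corpus is
raw Chinese text with neither, so perhaps every training event ends up
with the same outcome. A training line would then look roughly like
this (a made-up example):

    我<SPLIT>喜欢<SPLIT>自然语言<SPLIT>处理<SPLIT>。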
Still, I am new to NLP and don't quite understand what's going on. The
code snippet is below:
InputStreamFactory inputStreamFactory = new MarkableFileInputStreamFactory(
        new File("/Users/aaron/resume-corpus/corpus_一_20140804162433.txt"));
Charset charset = Charset.forName("UTF-8");
ObjectStream<String> lineStream = new PlainTextByLineStream(inputStreamFactory, charset);
ObjectStream<TokenSample> sampleStream = new TokenSampleStream(lineStream);

TokenizerModel model;
try {
    // The old API does not seem to compile with the latest version:
    // model = TokenizerME.train("zh", sampleStream, true, TrainingParameters.defaultParams());
    String languageCode = "zh";
    boolean useAlphaNumericOptimization = false;
    model = TokenizerME.train(sampleStream,
            TokenizerFactory.create(null, languageCode, null, useAlphaNumericOptimization, null),
            TrainingParameters.defaultParams());
} finally {
    sampleStream.close();
}

OutputStream modelOut = null;
try {
    modelOut = new BufferedOutputStream(
            new FileOutputStream("/Users/aaron/resume-corpus/zh-token.bin"));
    model.serialize(modelOut);
} finally {
    if (modelOut != null)
        modelOut.close();
}
The line I commented out above seems to be out of date with the latest version.
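Once training succeeds, I plan to sanity-check the model by loading it
back and tokenizing an arbitrary test sentence (a minimal sketch; the
test string is just made up):

    try (InputStream modelIn = new FileInputStream("/Users/aaron/resume-corpus/zh-token.bin")) {
        // Load the serialized model and run it over a test string.
        TokenizerModel loadedModel = new TokenizerModel(modelIn);
        TokenizerME tokenizer = new TokenizerME(loadedModel);
        String[] tokens = tokenizer.tokenize("我喜欢自然语言处理。");
        System.out.println(String.join(" ", tokens));
    }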
Help !!!
> On 1 Sep 2017, at 7:18 PM, Joern Kottmann <[email protected]> wrote:
>
> Our current tokenizer can be trained to segment Chinese just by
> following the user documentation,
> but it might not work very well. We never tried this.
>
> Do you have a corpus you can train on?
>
> OntoNotes has some Chinese text and could probably be used.
>
> Jörn
>
> On Fri, Sep 1, 2017 at 11:15 AM, 王春华 <[email protected]> wrote:
>> Hello everyone,
>>
>> I wonder if there is any tokenizing model for Chinese text, or where to find
>> guidelines on how to generate one myself.
>>
>> thanks!
>> Aaron