I can help you do this with the API.

You should train your own TokenizerModel, just like any other model:

TokenizerModel model = TokenizerME.train(language, sampleStream, useAlphaNumericOptimization, trainingParameters);

In your case, I suggest you write your own TokenSampleStream class:

ObjectStream<TokenSample> sampleStream = new MyTokenSampleStream(...);

This class should implement the ObjectStream<TokenSample> interface. The method that matters here is read (depending on your OpenNLP version you may also have to implement reset and close):

public TokenSample read()

A TokenSample basically has to be filled with:
* String text; // which represents a sentence
* List<Span> tokenSpans; // the list of spans into which your sentence must be tokenized.
e.g. TokenSample t: {
text = "my token sample stream"
tokenSpans = { [0, 8], [9, 22] }
...
}

As you can see, this TokenSample contains two tokens: "my token" and "sample stream". The end offset of each span is exclusive, so the space at index 8 belongs to neither token.
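To make the offsets concrete, here is a rough sketch of how a line marked up with <SPLIT> could be turned into the plain text plus its spans. SplitSpans and Parsed are names I made up for illustration, they are not part of OpenNLP, and the sketch uses plain int pairs so that it runs without the library. In real code each pair would become an opennlp.tools.util.Span, and the text and spans would go into a TokenSample. I am also assuming that whitespace around a <SPLIT> marker should stay out of the tokens.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of OpenNLP: strips <SPLIT> markers from a
// training line and records where each token sits in the resulting text.
class SplitSpans {
    static final String MARKER = "<SPLIT>";

    // Plain text plus [start, end) offsets (end is exclusive).
    static final class Parsed {
        final String text;
        final List<int[]> spans;

        Parsed(String text, List<int[]> spans) {
            this.text = text;
            this.spans = spans;
        }
    }

    static Parsed parse(String line) {
        StringBuilder text = new StringBuilder();
        List<int[]> spans = new ArrayList<>();
        int pos = 0;
        while (pos <= line.length()) {
            int next = line.indexOf(MARKER, pos);
            String piece = next < 0 ? line.substring(pos) : line.substring(pos, next);
            int offset = text.length();
            text.append(piece);

            // Assumption: leading/trailing whitespace is not part of the token.
            int start = 0;
            int end = piece.length();
            while (start < end && Character.isWhitespace(piece.charAt(start))) start++;
            while (end > start && Character.isWhitespace(piece.charAt(end - 1))) end--;
            if (end > start) {
                spans.add(new int[] {offset + start, offset + end});
            }

            if (next < 0) break;
            pos = next + MARKER.length();
        }
        return new Parsed(text.toString(), spans);
    }

    public static void main(String[] args) {
        Parsed p = parse("my token <SPLIT>sample stream");
        System.out.println(p.text);
        for (int[] s : p.spans) {
            System.out.println("[" + s[0] + ", " + s[1] + ") " + p.text.substring(s[0], s[1]));
        }
    }
}
```

For the input "my token <SPLIT>sample stream" this yields the text and the two spans from the example above.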

The constructor of MyTokenSampleStream should load the training data (from a file, a database, or wherever it lives), and on each invocation of the read method you should return:
* a new TokenSample from your data
* null if you don't have more samples
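The read contract described above can be sketched like this. ListSampleStream is a stand-in I invented so the example runs without OpenNLP on the classpath. A real MyTokenSampleStream would declare "implements ObjectStream<TokenSample>" and hold TokenSample objects, but the shape of read would be the same. The reset and close methods are included on the assumption that your OpenNLP version requires them.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Stand-in for illustration only; not an OpenNLP class.
class ListSampleStream<T> {
    private final List<T> samples;
    private Iterator<T> it;

    // The constructor receives (or loads) all the training data up front.
    ListSampleStream(List<T> samples) {
        this.samples = samples;
        this.it = samples.iterator();
    }

    // Returns the next sample, or null once the data is exhausted.
    T read() {
        return it.hasNext() ? it.next() : null;
    }

    // Rewinds to the first sample so the data can be read again.
    void reset() {
        it = samples.iterator();
    }

    // Nothing to release for an in-memory list.
    void close() {
    }

    public static void main(String[] args) {
        ListSampleStream<String> stream =
                new ListSampleStream<>(Arrays.asList("first sample", "second sample"));
        String sample;
        while ((sample = stream.read()) != null) {
            System.out.println(sample);
        }
        stream.close();
    }
}
```

If your data is too large to hold in memory, the same contract works with a reader that fetches one record per call to read.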

TokenizerME.train will read the samples from your sampleStream and train your custom model. You can then save the model or use it directly, depending on your needs.

Cheers,
    Riccardo

On 11/02/2012 18:46, Lee Hinman wrote:
Hey Guys,

I'm trying to train a tokenizer that ignores spaces and only uses <SPLIT> to 
determine where to split. I wasn't able to find anything in the javadocs, is this 
possible with OpenNLP? If so, could someone point me in the right direction regarding 
it?

- Lee Hinman
