Re: Tokenizer issue - Quotation marks

Jörn Kottmann Wed, 02 Mar 2011 05:09:04 -0800

On 3/2/11 1:47 PM, Rohana Rajapakse wrote:

My NameFinder training model (created from CONLL + Reuters) has <START> and <END> markups for person names. It doesn't
have <SPLIT> markups. I am trying the testTokenizer() test with TokenizerTestUtil.createMaxentTokenModel() to create a model
using my training data file. I had to remove the <START> and <END> tags and add a few <SPLIT> tags to get the test to
work (to get "Number of Outcomes" to match). It learns a model now, but the model is not perfect. I need to add <SPLIT> markups
for all single and double quotes etc.


By the way, where is the "TokenizerConverter" that you mentioned? My
download (from SourceForge) doesn't have it. Also, where is the converter that you
created to produce name finder training data from CONLL03? Am I missing some code in my
download?

Also, please point me to your "docbook". I would like to know more about the detokenizer. I
can't find a "release candidate" on the download site.

The release candidate can be found here:
http://people.apache.org/~joern/releases/opennlp-1.5.1-incubating/rc1/

Just use your name finder training file with the TokenizerConverter. Pieces of the work are in 1.5.0, and all the things
you are missing are in 1.5.1. The docbook is also included in the 1.5.1 distribution.
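
To give a rough idea of what such a conversion involves, here is a minimal Python sketch. It is not the OpenNLP TokenizerConverter and does not use any OpenNLP API; the function name to_tokenizer_sample and the punctuation sets are assumptions made for illustration. It strips the <START>/<END> name tags from a whitespace-tokenized line and puts <SPLIT> wherever a token would normally be written without a space next to its neighbour. Only double quotes are handled, and they are assumed to come in balanced pairs within a line.

```python
import re

# Matches name finder tags such as <START>, <START:person> and <END>.
NAME_TAG = re.compile(r"<START(?::[^>]+)?>|<END>")
SPLIT = "<SPLIT>"

# Assumed punctuation sets; extend them for your own data.
ATTACH_LEFT = {".", ",", ";", ":", "!", "?", ")", "]"}   # glued to the previous token
ATTACH_RIGHT = {"(", "["}                                # glued to the next token


def to_tokenizer_sample(line):
    """Turn one name finder training line into a tokenizer training line."""
    tokens = NAME_TAG.sub(" ", line).split()
    pieces = []
    glue_next = False    # True when the previous token attaches to the current one
    quote_open = False   # tracks whether a double quote is currently open
    for tok in tokens:
        attach_left = tok in ATTACH_LEFT
        attach_right = tok in ATTACH_RIGHT
        if tok == '"':
            # An opening quote attaches to the right, a closing quote to the left.
            attach_left, attach_right = quote_open, not quote_open
            quote_open = not quote_open
        if pieces:
            pieces.append(SPLIT if (attach_left or glue_next) else " ")
        pieces.append(tok)
        glue_next = attach_right
    return "".join(pieces)


if __name__ == "__main__":
    print(to_tokenizer_sample('<START> John Smith <END> said , " Hello . "'))
```

Each output line keeps the original whitespace-separated tokens but marks the token boundaries that have no whitespace in normal text, which is the information the tokenizer has to learn. Single quotes and apostrophes are ambiguous and would need their own rules.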

I suggest that you just re-try with the rc1.

Jörn
