On 3/2/11 1:47 PM, Rohana Rajapakse wrote:
My NameFinder training model (created from CONLL + Reuters) has <START> and <END> markups for person names. It doesn't
have <SPLIT> markups. I am trying the testTokenizer() test with TokenizerTestUtil.createMaxentTokenModel() to create a model
using my training data file. I had to remove the <START> and <END> tags and add a few <SPLIT> tags to get the test to
work (to get "Number of Outcomes" to match). It learns a model now, but the model is not perfect. I need to add <SPLIT> markups
for all single and double quotes etc.
By the way, where is the "TokenizerConverter" that you mentioned? My
download (from SourceForge) doesn't have it. Also, where is the converter you
created to convert CONLL03 into name finder training data? Am I missing some code in my
download?
Also, please point me to your "docbook". I would like to know more about the detokenizer. I
can't find a "release candidate" on the download site.
The release candidate can be found here:
http://people.apache.org/~joern/releases/opennlp-1.5.1-incubating/rc1/
Just use your name finder training file with the TokenizerConverter.
Pieces of the work are in 1.5.0, and all the things you are missing are in 1.5.1. The docbook
is also included in the 1.5.1 distribution.
I suggest that you just re-try with the rc1.
Jörn
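
The manual conversion described in the question (dropping the <START>/<END> name tags and marking with <SPLIT> the places where two tokens are not separated by whitespace) can be sketched roughly as below. This is only an illustration in Python, not the TokenizerConverter shipped with OpenNLP. The function name to_tokenizer_line and the two punctuation sets are assumptions made for the example, and quotes are deliberately left out because deciding which side a quote attaches to needs more context than a fixed rule set provides.

```python
import re

# Matches name finder tags such as <START>, <START:person> and <END>.
NAME_TAG = re.compile(r"<START(?::[^>]+)?>|<END>")

# Assumed, deliberately small rule sets (hypothetical, for illustration only).
ATTACH_LEFT = {".", ",", ";", ":", "!", "?", ")"}   # glued to the previous token
ATTACH_RIGHT = {"("}                                # glued to the next token


def to_tokenizer_line(name_finder_line):
    """Turn one whitespace-tokenized name finder line into tokenizer training format."""
    # Remove the name tags, then split on whitespace to recover the tokens.
    tokens = NAME_TAG.sub(" ", name_finder_line).split()
    out = []
    glue_next = False
    for tok in tokens:
        if out and (glue_next or tok in ATTACH_LEFT):
            # No whitespace in the original text here: mark the token boundary.
            out.append("<SPLIT>")
        elif out:
            out.append(" ")
        out.append(tok)
        glue_next = tok in ATTACH_RIGHT
    return "".join(out)


if __name__ == "__main__":
    line = "<START> Pierre Vinken <END> , 61 years old , will join the board ."
    print(to_tokenizer_line(line))
```

Applied line by line to a name finder training file, this yields sentences in which whitespace and <SPLIT> together mark every token boundary. Handling quotes, apostrophes and similar cases is what the detokenizer rules in the 1.5.1 distribution are for, so the real converter is the better choice for actual training data.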