Thanks. I have created the training files (CONLL03 + Reuters) and trained the models, using the latin-detokenizer that came with the download. The trained model solves the double quotation mark problem (e.g. "mistakes" now results in three tokens: ", mistakes and ").
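For illustration, here is a minimal sketch of what the <SPLIT> annotation in a tokenizer training line implies, assuming tokens are separated either by whitespace or by a <SPLIT> tag. The class and method names (SplitTagSketch, tokensFromTrainingLine) are hypothetical and are not part of OpenNLP; the code only mimics the annotation and does not train or run a model.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not an OpenNLP class: it lists the tokens that a
// training line annotated with <SPLIT> tags is meant to describe.
public class SplitTagSketch {

    // Assumption: a token boundary is either whitespace or a <SPLIT> tag.
    private static final String SPLIT_TAG = Pattern.quote("<SPLIT>");

    static List<String> tokensFromTrainingLine(String line) {
        List<String> tokens = new ArrayList<>();
        // First break the line on whitespace ...
        for (String chunk : line.trim().split("\\s+")) {
            // ... then break each chunk at the <SPLIT> tags.
            for (String token : chunk.split(SPLIT_TAG)) {
                if (!token.isEmpty()) {
                    tokens.add(token);
                }
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Double quotes marked as separate tokens on both sides of the word.
        System.out.println(tokensFromTrainingLine("\"<SPLIT>mistakes<SPLIT>\""));
        // A possessive, where only the 's is split off.
        System.out.println(tokensFromTrainingLine("Tom<SPLIT>'s book"));
    }
}
```

The first call yields the three tokens ", mistakes and " mentioned above. The second yields Tom, 's and book, which is the kind of annotation a possessive would need.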
I have tried adding the same detokenizer rules for the single quote. However, they seem to conflict with the other uses of the single quote (e.g. possessives and contractions such as Tom's, It's etc.). This means we will have to handle such cases separately. I will try adding <SPLIT> tags for those cases (e.g. Tom<SPLIT>'s, it<SPLIT>'s etc.). I don't know which takes priority, the rules in the detokenizer or the <SPLIT> tags...

Rohana

-----Original Message-----
From: Jörn Kottmann [mailto:[email protected]]
Sent: 02 March 2011 13:08
To: [email protected]
Subject: Re: Tokenizer issue - Quotation marks

On 3/2/11 1:47 PM, Rohana Rajapakse wrote:
> My NameFinder training model (created from CONLL + Reuters) has <START>
> and <END> markups for person names. It doesn't have <SPLIT> markups. I am
> trying the testTokenizer() test with
> TokenizerTestUtil.createMaxentTokenModel() to create a model using my
> training data file. I had to remove the <START> and <END> tags and add a few
> <SPLIT> tags to get the test to work (to get "Number of Outcomes" to match).
> It learns a model now, but not a perfect one. I need to add <SPLIT> markups
> for all single and double quotes etc.
>
> By the way, where is the "TokenizerConverter" that you had mentioned? My
> download (from sourceforge) doesn't have it. Also, where is the converter
> that you created to produce name finder data from CONLL03? Am I missing
> some code in my download?
>
> Also, please point me to your "docbook". I would like to know more about the
> detokenizer. I can't find a "release candidate" on the download site.
>

The release candidate can be found here:
http://people.apache.org/~joern/releases/opennlp-1.5.1-incubating/rc1/

Just use your name finder training file with the TokenizerConverter. Pieces of the work are in 1.5.0, and all the things you are missing are in 1.5.1. The docbook is also included in the 1.5.1 distribution.

I suggest that you just re-try with the rc1.

Jörn
