Experimenting gives the following results: only digits marked with $ are detected as money.

------sentences: [The drop last week unwound most of the prior week's jump, suggesting employers were not laying off workers in response to tighter fiscal policy, especially the $85 billion in across-the-board government spending cuts that have dampened factory activity]
------tokenizing
------finding money
[[29..32) money]
[29..32) money

prepare model
------sentences: [buy milk $2]
------tokenizing
buy milk $ 2
------finding money
[[2..4) money]
[2..4) money
------pos tagging
VB NN $ CD
------saving message to database

prepare model
------sentences: [buy milk usd 2]
------tokenizing
buy milk usd 2
------finding money
[]
------pos tagging
VB NN CD
------saving message to database

prepare model
------sentences: [Buy milk two Dollars]
------tokenizing
Buy milk two Dollars
------finding money
[]
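To make the span notation in the output above easier to read: a span such as [2..4) is a half-open range of token indices, so the end index is not included. Below is a minimal sketch in plain Java that does not call OpenNLP at all; the class SpanDemo and the method coveredText are hypothetical names used only for illustration.

```java
import java.util.Arrays;

// Hypothetical helper, not part of OpenNLP: shows how a half-open
// token span such as [2..4) selects tokens from a tokenized sentence.
class SpanDemo {

    // Joins tokens[start] .. tokens[end - 1] with single spaces.
    static String coveredText(String[] tokens, int start, int end) {
        return String.join(" ", Arrays.copyOfRange(tokens, start, end));
    }

    public static void main(String[] args) {
        // Tokens as printed in the log for the sentence "buy milk $2".
        String[] tokens = {"buy", "milk", "$", "2"};
        // Span [2..4) covers the tokens at index 2 and 3.
        System.out.println(coveredText(tokens, 2, 4));
    }
}
```

For the "buy milk $2" case this selects the tokens "$" and "2", which matches the observation that the amount is only found when the $ sign is present.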
I have not noticed any difference between SimpleTokenizer and TokenizerME in this case.

On Thu, May 23, 2013 at 5:00 PM, Jörn Kottmann <[email protected]> wrote:

> On 05/23/2013 02:56 PM, Яков Керанчук wrote:
>
>> Thanks for the suggestion about training my own model, I'll try it.
>>
>> I use the standard en-token.bin model; the text contains mixed
>> upper- and lower-case words.
>>
>
> For the English model you should use the SimpleTokenizer; the token output
> from the en-token.bin model is not compatible with the training data.
>
> Jörn
>

--
Best regards,
Yakov Keranchuk
+79263768032
