Thanks for the advice, I will train my own model.

On Thu, May 23, 2013 at 5:25 PM, Jörn Kottmann <[email protected]> wrote:

> No there are no differences in your samples. Try to use capital USD
> instead of usd.
>
> The model was trained on English news text from the 90s try to give it
> some (old) news articles for testing.
>
> Jörn
>
>
> On 05/23/2013 03:16 PM, Яков Керанчук wrote:
>
>> Experimenting shows next results: only $ marked digits determined as money
>>
>> ------sentences: [The drop last week unwound most of the prior week's jump,
>> suggesting employers were not laying off workers in response to tighter
>> fiscal policy, especially the $85 billion in across-the-board government
>> spending cuts that have dampened factory activity]
>> ------tokenizing
>> ------finding money
>> [[29..32) money]
>> [29..32) money
>> prepare model
>> ------sentences: [buy milk $2]
>> ------tokenizing
>> buy
>> milk
>> $
>> 2
>> ------finding money
>> [[2..4) money]
>> [2..4) money
>> ------pos tagging
>> VB
>> NN
>> $
>> CD
>> ------saving message to database
>> prepare model
>> ------sentences: [buy milk usd 2]
>> ------tokenizing
>> buy
>> milk
>> usd
>> 2
>> ------finding money
>> []
>> ------pos tagging
>> VB
>> NN
>> CD
>> ------saving message to database
>> prepare model
>> ------sentences: [Buy milk two Dollars]
>> ------tokenizing
>> Buy
>> milk
>> two
>> Dollars
>> ------finding money
>> []
>>
>> I have not noticed difference between SimpleTokenizer and TokenizerME in
>> this case
>>
>>
>> On Thu, May 23, 2013 at 5:00 PM, Jörn Kottmann <[email protected]> wrote:
>>
>>> On 05/23/2013 02:56 PM, Яков Керанчук wrote:
>>>
>>>> Thanks for suggestion with own model, I'll try
>>>>
>>>> I use standard en-token.bin model, text contains mixed upper-lower case
>>>> words.
>>>
>>> For the english model you should use the SimpleTokenizer, the token output
>>> from the en-token.bin model is not compatible with the training data.
>>>
>>> Jörn

--
Best regards,
Yakov Keranchuk
+79263768032
