Re: DetokenizationOperation.MERGE_BOTH

Jörn Kottmann Thu, 29 Mar 2012 00:36:16 -0700

+1 to add a MERGE_BOTH.

I recently worked with newspaper texts where the quotation marks were
never separated from the surrounding text by white space. For that it
would be nice to use MERGE_BOTH to retrain the tokenizer.


Jörn

On 03/29/2012 12:38 AM, [email protected] wrote:

Hi!

I need something like DetokenizationOperation.MERGE_BOTH to train a
Tokenizer from NameFinder data. A sample of the data is:

... devolva - me o livro .... (give the book back to me)

I need to detokenize it to "devolva-me o livro"

So I would need to add the hyphen to the detokenizer dictionary and
configure it to something like MERGE_BOTH, but we don't have such an option.
Do you see another way of doing it, or should I extend
DetokenizationOperation?

Thanks
William
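
To make the request concrete, here is a minimal, self-contained sketch of the behavior a MERGE_BOTH operation would have: a token marked with it attaches to both its left and right neighbors, so no space is emitted on either side. This is not the OpenNLP implementation. The class MergeBothSketch, the Op enum, and the detokenize method are hypothetical names used only for illustration, and the dictionary is modeled as a plain Map rather than the real detokenizer dictionary.

```java
import java.util.List;
import java.util.Map;

// Illustrative only: a stand-in for a detokenizer, not OpenNLP code.
class MergeBothSketch {

    // Hypothetical operations; MERGE_BOTH is the one proposed in this thread.
    enum Op { NO_OPERATION, MERGE_TO_LEFT, MERGE_TO_RIGHT, MERGE_BOTH }

    // Joins tokens with single spaces, except where a dictionary entry
    // says a token should attach to its left and/or right neighbor.
    static String detokenize(List<String> tokens, Map<String, Op> dictionary) {
        StringBuilder text = new StringBuilder();
        for (int i = 0; i < tokens.size(); i++) {
            Op current = dictionary.getOrDefault(tokens.get(i), Op.NO_OPERATION);
            if (i > 0) {
                Op previous = dictionary.getOrDefault(tokens.get(i - 1), Op.NO_OPERATION);
                boolean previousAttachesRight =
                        previous == Op.MERGE_TO_RIGHT || previous == Op.MERGE_BOTH;
                boolean currentAttachesLeft =
                        current == Op.MERGE_TO_LEFT || current == Op.MERGE_BOTH;
                // Emit a separating space only if neither side asks to merge.
                if (!previousAttachesRight && !currentAttachesLeft) {
                    text.append(' ');
                }
            }
            text.append(tokens.get(i));
        }
        return text.toString();
    }

    public static void main(String[] args) {
        List<String> tokens = List.of("devolva", "-", "me", "o", "livro");
        // prints "devolva-me o livro"
        System.out.println(detokenize(tokens, Map.of("-", Op.MERGE_BOTH)));
    }
}
```

With only MERGE_TO_LEFT or MERGE_TO_RIGHT on the hyphen, one of the two spaces would remain ("devolva- me" or "devolva -me"), which is why a two-sided operation is needed for this data.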
