It's somewhere in https://github.com/apertium/lttoolbox - I don't know the exact location.
The entrypoint that does tokenization is lt-proc, so start from lt-proc.cc and trace execution to the code that does the tokenization. That's also a good way to learn the codebase.

-- Tino Didriksen

On Mon, 16 Mar 2020 at 16:00, 杨伟哲 <gavinwzma...@gmail.com> wrote:
> Hi Tino and Fammie,
>
> Due to my mistake in sending the email before, I am not sure whether you
> have received the email I sent, so I'm sending it to you again now. I hope
> you receive it.
>
> These days, I read the Wikipedia description of tokenization and got a
> general idea of how it works. I am also learning some ICU syntax every day.
> In the meantime, I'm also searching for information on how to handle
> tokenized Unicode vocabularies.
>
> Recently I have been reading the "further reading"[1] of my proposed
> project[2], which is about HFST. The code is a bit hard to understand. But
> my task is "Update lttoolbox to be fully Unicode compliant with regards to
> medication to alphabetical symbols". May I know exactly how tokenization is
> implemented in lttoolbox and the specific code that I'm going to update?
>
> Regards,
>
> Weizhe
>
> [1] https://github.com/hfst/hfst/blob/master/tools/src/hfst-tokenize.cc
> [2] http://wiki.apertium.org/wiki/Ideas_for_Google_Summer_of_Code/Robust_tokenisation
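
One way to follow the advice about starting from lt-proc and tracing towards the tokenization code is to list every source line that mentions a likely keyword and then read outwards from the hits. The sketch below is only a search helper, not part of lttoolbox. It assumes the repository has been cloned locally, and the search terms ("token", "isalpha", "alphabetic") and the function name `find_mentions` are guesses made for illustration, not identifiers known to exist in the codebase.

```python
import os
import re


def find_mentions(root, pattern=r"token|isalpha|alphabetic", exts=(".cc", ".h")):
    """Return (path, line_number, line) for each source line matching pattern.

    The default pattern is a guess at useful search terms; adjust it as the
    trace through the code reveals the real function and class names.
    """
    regex = re.compile(pattern, re.IGNORECASE)
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            # Only look at C++ sources and headers.
            if not name.endswith(exts):
                continue
            path = os.path.join(dirpath, name)
            # errors="replace" keeps the scan going on oddly encoded files.
            with open(path, encoding="utf-8", errors="replace") as handle:
                for number, line in enumerate(handle, start=1):
                    if regex.search(line):
                        hits.append((path, number, line.rstrip()))
    return hits
```

Calling `find_mentions("lttoolbox")` on a local clone (the directory name is an assumption) returns the matching lines, which can then be read in context, starting with whatever the lt-proc entrypoint calls.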
_______________________________________________
Apertium-stuff mailing list
Apertium-stuff@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/apertium-stuff