It's somewhere in https://github.com/apertium/lttoolbox - I don't know the
exact location.

The entrypoint that does tokenization is lt-proc, so start from lt-proc.cc
and trace execution to somewhere that does tokenization. That's also a good
way to learn the codebase.

-- Tino Didriksen


On Mon, 16 Mar 2020 at 16:00, 杨伟哲 <gavinwzma...@gmail.com> wrote:

> Hi Tino and Fammie,
>
> Because of a mistake I made when sending my earlier email, I am not sure
> whether you received it, so I am sending it again. I hope this one reaches
> you.
>
> These days I have been reading the Wikipedia description of tokenization and
> have a general idea of how it works. I am also learning some ICU syntax every
> day. In the meantime, I am searching for information on how to handle
> tokenized Unicode vocabularies.
>
> Recently I have been reading the "further reading"[1] for my proposed
> project[2], which is about HFST. The code is a bit hard to understand, but my
> task is "Update lttoolbox to be fully Unicode compliant with regards to
> medication to alphabetical symbols". May I know exactly how tokenization is
> implemented in lttoolbox, and which specific code I am going to update?
>
> Regards,
>
> Weizhe
>
> [1] https://github.com/hfst/hfst/blob/master/tools/src/hfst-tokenize.cc
>
> [2] http://wiki.apertium.org/wiki/Ideas_for_Google_Summer_of_Code/Robust_tokenisation
>
_______________________________________________
Apertium-stuff mailing list
Apertium-stuff@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/apertium-stuff
