Hi Eijisan,

There's also the tokeniser used for Nuosu, which uses the transducer itself
to tokenise:
https://github.com/apertium/apertium-iii

I believe this is a later implementation of what's described in the thesis
sent by Kevin in [2].
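
To give a feel for the idea (a toy sketch, not the apertium-iii code):
treat the analyser as an oracle that says whether a string is a known
form, and segment by longest match with backtracking.  The real
tokeniser uses the transducer itself rather than doing a lookup per
substring, and the lexicon below is obviously made up:

    # Toy stand-in for the transducer: a set of known surface forms.
    KNOWN = {"猫", "が", "好き", "好きだ", "だ"}

    def has_analysis(form):
        """Stand-in for a transducer lookup."""
        return form in KNOWN

    def tokenise(text, start=0):
        """Longest-match segmentation with backtracking; returns a list
        of tokens, or None if nothing covers the whole input."""
        if start == len(text):
            return []
        # try the longest candidate first, back off on failure
        for end in range(len(text), start, -1):
            if has_analysis(text[start:end]):
                rest = tokenise(text, end)
                if rest is not None:
                    return [text[start:end]] + rest
        return None

    print(tokenise("猫が好きだ"))  # ['猫', 'が', '好きだ'] with this lexicon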

This method has some downsides (for one, it can only produce segmentations
built from forms the analyser already knows, and it has no principled way to
choose between competing segmentations beyond heuristics like longest
match), but it also has some advantages over a statistical model (no
training data needed, and every token it emits is guaranteed to be
analysable).  Perhaps a way to get started would be to explore the pros and
cons of each approach and think about what a hybrid model could achieve.  It
would be good to join the IRC channel to discuss all this with the mentors.
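
On the hybrid point, one shape it could take (purely a sketch, with a
made-up lexicon and frequencies): let the transducer propose every
segmentation it accepts, and let a statistical model (here just toy
unigram counts) choose among them:

    import math

    # Made-up surface forms with made-up corpus counts.
    COUNTS = {"猫": 50, "が": 500, "好き": 80, "好きだ": 20, "だ": 400}
    TOTAL = sum(COUNTS.values())

    def segmentations(text, start=0):
        """All segmentations into known forms (transducer stand-in)."""
        if start == len(text):
            yield []
            return
        for end in range(start + 1, len(text) + 1):
            if text[start:end] in COUNTS:
                for rest in segmentations(text, end):
                    yield [text[start:end]] + rest

    def score(tokens):
        """Toy unigram log-probability; a real model could be smarter."""
        return sum(math.log(COUNTS[t] / TOTAL) for t in tokens)

    best = max(segmentations("猫が好きだ"), key=score)
    print(best)  # the statistical side arbitrates the ambiguity

The interesting questions are what the statistical side should score
(surface forms, lemmas, tag sequences?) and what to do when the
transducer accepts no segmentation at all.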

Another good way to get started (and it would help you do the above too)
would be to integrate the tokeniser from apertium-iii into apertium-jpn:
https://github.com/apertium/apertium-jpn

You would need to modify Makefile.am and the modes.xml file, drop in the
tokeniser script, and that should be about it.  Then see if you can get it to
analyse text without spaces (test it first with the same text,
hand-tokenised, to see what the output is).  Again, come to IRC for
guidance.
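
For the "test it first with hand-tokenised text" part, something along
these lines is what I mean (hedged sketch: I'm assuming the compiled
analyser ends up as jpn.automorf.bin, and that the tokeniser script
reads raw text on stdin and writes space-separated tokens on stdout;
adjust to whatever apertium-jpn and the script actually do):

    import subprocess

    def analyse(text):
        # run the compiled analyser over `text`
        return subprocess.run(["lt-proc", "jpn.automorf.bin"],
                              input=text, capture_output=True,
                              text=True).stdout

    def tokenise(text):
        # assumed interface for the tokeniser script
        return subprocess.run(["python3", "tokeniser.py"],
                              input=text, capture_output=True,
                              text=True).stdout

    hand = "猫 が 好き だ"   # hand-tokenised reference
    raw = "猫が好きだ"        # the same sentence, no spaces

    print(analyse(hand))            # reference analyses
    print(analyse(tokenise(raw)))   # should match, token for token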

The tokeniser.py script is a bit slow, mainly because of Python string
processing.  Rewriting it in C/C++ would be useful, and also a good way to
get a better handle on how it works.

--
Jonathan


On Fri, Feb 24, 2023, 13:03 Eiji Miyamoto <motopo...@gmail.com> wrote:

> Thank you for your reply. The project seems cool to work on for GSoC 2023,
> and I would like to participate in it. I reckon there are two tasks on the
> page; could you tell me where to start?
>
> On Fri, 24 Feb 2023 at 08:20, Kevin Brubeck Unhammer <unham...@fsfe.org>
> wrote:
>
>> > I'd like to participate in Google Summer of Code 2023 at Apertium.
>> > In particular, I'm interested in adding a new language pair, and I am
>> > thinking of adding Japanese-English, as I speak Japanese. I previously
>> > took an online summer school course on natural language processing at
>> > Tokyo University.
>> > Could you tell me more about the project?
>>
>> Hi,
>>
>> Getting some support for Japanese would be great! I'm not sure if you
>> saw the whole IRC discussion, but what we really need in that regard is
>> support for the *tokenisation* step, where our regular methods[1] fail
>> us, since the text might have no spaces and lots of
>> tokenisation ambiguity. There has been some prior work[2], and it's
>> already listed as a potential GSoC project[3].
>>
>> Support for anything-Japanese depends on tokenisation. It's also a big
>> enough job that it would qualify as a full GSoC project, so if you were
>> hoping for jpn-eng in a summer you would be disappointed (but having a
>> toy language pair to test with would help!). On the other hand, if we
>> get good spaceless tokenisation we open up the possibility for not just
>> Japanese, but Thai, Lao, Chinese etc. – and of course all those writing
>> systems used before the invention of the space character :)
>>
>> regards,
>> Kevin
>>
>> [1] https://wiki.apertium.org/wiki/LRLM
>> [2] http://hdl.handle.net/10066/20002
>> [3]
>> https://wiki.apertium.org/wiki/Task_ideas_for_Google_Code-in/Tokenisation_for_spaceless_orthographies
_______________________________________________
Apertium-stuff mailing list
Apertium-stuff@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/apertium-stuff
