Re: [RFC]Japanese tokenization/tagging restructuring proposal

NOKUBI Takatsugu Mon, 25 Aug 2014 01:01:49 -0700

At Sun, 24 Aug 2014 14:21:52 +0200,
Silvan Jegen wrote:
> Because the tagger library used by them (called 'sen') does the
> tokenization and tagging in one step, these two steps cannot be separated
> as cleanly as required by the interfaces used in LT.


Yes, almost all Japanese morphological analysis systems behave this way;
it comes from the nature of the Japanese language.

Japanese sentences have no separators between words. To analyse a
sentence, a morphological analyser scores candidate segmentations using a
dictionary that lists each word with its POS and a kind of score. The POS
is important information for deciding where the word boundaries are.
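
As a rough illustration of why segmentation and POS assignment happen
together, here is a toy lattice search in Java. The dictionary, the costs
and the connCost rule are all invented for this example, and the class
name LatticeSketch is hypothetical; this is a sketch of the general
dictionary-plus-score idea, not sen's actual code or API.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class LatticeSketch {

    // One dictionary entry: surface form, POS, and a word cost (lower is better).
    static final class Entry {
        final String surface;
        final String pos;
        final int cost;
        Entry(String surface, String pos, int cost) {
            this.surface = surface;
            this.pos = pos;
            this.cost = cost;
        }
    }

    // One lattice node: an entry, its best predecessor, and the path cost so far.
    static final class Node {
        final Entry entry;
        final Node prev;
        final int total;
        Node(Entry entry, Node prev, int total) {
            this.entry = entry;
            this.prev = prev;
            this.total = total;
        }
    }

    // Toy dictionary; the words and costs are assumptions made for this sketch.
    static final List<Entry> DICT = List.of(
        new Entry("すもも", "noun", 3),
        new Entry("もも", "noun", 2),
        new Entry("も", "particle", 1),
        new Entry("の", "particle", 1),
        new Entry("うち", "noun", 2));

    // Toy connection cost: penalise two adjacent words with the same POS.
    static int connCost(String prevPos, String nextPos) {
        return prevPos.equals(nextPos) ? 5 : 0;
    }

    // Returns the cheapest sequence of dictionary entries covering the text.
    static List<Entry> segment(String text) {
        int n = text.length();
        // endingAt.get(i) holds the lattice nodes whose surface ends at offset i.
        List<List<Node>> endingAt = new ArrayList<>();
        for (int i = 0; i <= n; i++) {
            endingAt.add(new ArrayList<>());
        }
        endingAt.get(0).add(new Node(new Entry("", "BOS", 0), null, 0));
        for (int start = 0; start < n; start++) {
            if (endingAt.get(start).isEmpty()) {
                continue; // no word ends here, so no word may start here
            }
            for (Entry e : DICT) {
                if (!text.startsWith(e.surface, start)) {
                    continue;
                }
                // The POS of the previous word influences which boundary wins.
                Node best = null;
                for (Node prev : endingAt.get(start)) {
                    int total = prev.total + connCost(prev.entry.pos, e.pos) + e.cost;
                    if (best == null || total < best.total) {
                        best = new Node(e, prev, total);
                    }
                }
                endingAt.get(start + e.surface.length()).add(best);
            }
        }
        Node best = null;
        for (Node node : endingAt.get(n)) {
            if (best == null || node.total < best.total) {
                best = node;
            }
        }
        if (best == null) {
            throw new IllegalArgumentException("no segmentation found");
        }
        // Walk back from the last node, skipping the BOS marker.
        List<Entry> path = new ArrayList<>();
        for (Node node = best; node.prev != null; node = node.prev) {
            path.add(node.entry);
        }
        Collections.reverse(path);
        return path;
    }

    public static void main(String[] args) {
        for (Entry e : segment("すもももももももものうち")) {
            System.out.println(e.surface + "\t" + e.pos);
        }
    }
}
```

With these toy costs the input is split as すもも / も / もも / も / もも /
の / うち. Every way of splitting the run of も has the same total word
cost here, so only the POS-based connection cost decides where the
boundaries fall, and each word comes out already tagged.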

> My proposal would be to avoid this issue by working around the current
> interface as follows.
> 
> 1. JapaneseWordTokenizer calls sen's analyze method which tokenizes the
>    text and adds POS tags. We save the tokenized and tagged items in a
>    private "analyzedTokens" field of JapaneseWordTokenizer.
> 2. The JapaneseWordTokenizer just returns null (or an empty List<String>).
> 3. When the JapaneseTagger is called with the above (null/empty)
>    List<String> as input we ignore the input parameter. Instead we get the
>    "analyzedTokens" field directly from the JapaneseWordTokenizer
>    (a reference to which we saved within the JapaneseTagger)
>    and build the needed AnalyzedTokenReadings directly.

I think that would make sense.
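
The three steps quoted above could be sketched roughly as follows. This
is only an illustration under assumptions: Analyzer, TaggedToken,
SketchTokenizer and SketchTagger are hypothetical stand-ins, not the real
sen or LanguageTool classes, and the tagger returns plain strings where
the real one would build AnalyzedTokenReadings.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class SharedAnalysisSketch {

    // Stand-in for one item of the analyzer output: surface form plus POS tag.
    static final class TaggedToken {
        final String surface;
        final String pos;
        TaggedToken(String surface, String pos) {
            this.surface = surface;
            this.pos = pos;
        }
    }

    // Hypothetical stand-in for a one-step analyzer; NOT sen's real API.
    interface Analyzer {
        List<TaggedToken> analyze(String text);
    }

    static final class SketchTokenizer {
        private final Analyzer analyzer;
        private List<TaggedToken> analyzedTokens = Collections.emptyList();

        SketchTokenizer(Analyzer analyzer) {
            this.analyzer = analyzer;
        }

        // Steps 1 and 2: analyze once, keep the result, return an empty list.
        List<String> tokenize(String text) {
            analyzedTokens = analyzer.analyze(text);
            return Collections.emptyList();
        }

        List<TaggedToken> getAnalyzedTokens() {
            return analyzedTokens;
        }
    }

    static final class SketchTagger {
        private final SketchTokenizer tokenizer;

        SketchTagger(SketchTokenizer tokenizer) {
            this.tokenizer = tokenizer;
        }

        // Step 3: ignore the argument and read the tokenizer's stored result.
        List<String> tag(List<String> ignoredTokens) {
            List<String> readings = new ArrayList<>();
            for (TaggedToken t : tokenizer.getAnalyzedTokens()) {
                readings.add(t.surface + "/" + t.pos);
            }
            return readings;
        }
    }
}
```

In this sketch the tagger only gives the right result if tokenize() was
called for the same text immediately before tag(), since the two objects
share the stored list.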

> Before working on the implementation of these changes further I wanted
> to ask whether you think this is the way to go or if we should stick to
> the current behavior.

The change probably has no side effects.
