Re: [RFC]Japanese tokenization/tagging restructuring proposal

Silvan Jegen Mon, 25 Aug 2014 13:39:07 -0700

On Mon, Aug 25, 2014 at 12:47:06PM +0200, Daniel Naber wrote:
> On 2014-08-25 12:27, Silvan Jegen wrote:
> 
> > I agree that it would be about equally confusing (and inelegant) but at
> > least it would save some unnecessary work for LT.
> 
> I don't think we should argue with performance unless there's a 
> real-world use case that's actually too slow and we can show that the 
> new solution is actually significantly faster.


I don't know of a real-world use case, but I tested both
implementations using languagetool-standalone.jar on a 114MB text file. I
ran each version ten times, and on average the proposed one was about
15% faster (note that the testing was not very rigorous and the variation
between runs was surprisingly high at times).
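
A comparison like this could be timed roughly as follows. This is only a
minimal sketch, not the procedure actually used for the numbers above: the
class name TimingSketch and the placeholder workload are hypothetical. It
reports the standard deviation next to the mean, so a noisy series of runs is
visible at a glance.

```java
class TimingSketch {

    /** Runs the task the given number of times; returns each wall-clock duration in ms. */
    static double[] timeRuns(Runnable task, int runs) {
        double[] millis = new double[runs];
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            task.run();
            millis[i] = (System.nanoTime() - start) / 1_000_000.0;
        }
        return millis;
    }

    /** Arithmetic mean; 0 for an empty array. */
    static double mean(double[] xs) {
        double sum = 0;
        for (double x : xs) sum += x;
        return xs.length == 0 ? 0 : sum / xs.length;
    }

    /** Sample standard deviation; 0 when there are fewer than two values. */
    static double stdDev(double[] xs) {
        if (xs.length < 2) return 0;
        double m = mean(xs);
        double sq = 0;
        for (double x : xs) sq += (x - m) * (x - m);
        return Math.sqrt(sq / (xs.length - 1));
    }

    public static void main(String[] args) {
        // Placeholder workload (hypothetical): replace with a call that runs
        // the implementation under test on the input file.
        Runnable placeholder = () -> {
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < 100_000; i++) sb.append(i);
            if (sb.length() == 0) throw new IllegalStateException();
        };
        double[] t = timeRuns(placeholder, 10);
        System.out.printf("mean %.2f ms, std dev %.2f ms%n", mean(t), stdDev(t));
    }
}
```

When timing inside a single JVM, the first few runs are usually slower because
of JIT warm-up, so it can help to discard them before averaging.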

This simple test also highlighted an oversight of mine. If the
tokenized List<String> result is ignored, the replaceSoftHyphens
function won't have anything to work with. That means that at least
some of the speed gain is due to this function not being used. Not
handling soft hyphens does make sense for Japanese, since they are
only very rarely used there. They do seem to be allowed according to
3.1.10f in http://www.w3.org/TR/2009/NOTE-jlreq-20090604/ though.
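
To make the soft hyphen point concrete, here is a minimal sketch of the kind
of per-token work that gets skipped when the token list is ignored. It is not
LanguageTool's actual replaceSoftHyphens code; the class SoftHyphenSketch and
its methods are hypothetical, and the sketch assumes that "handling" simply
means dropping U+00AD (SOFT HYPHEN) from each token.

```java
import java.util.ArrayList;
import java.util.List;

class SoftHyphenSketch {

    /** U+00AD, the Unicode SOFT HYPHEN character. */
    static final char SOFT_HYPHEN = '\u00AD';

    /** Returns a copy of the tokens with every soft hyphen removed (assumed behaviour). */
    static List<String> stripSoftHyphens(List<String> tokens) {
        List<String> result = new ArrayList<>(tokens.size());
        for (String token : tokens) {
            result.add(token.replace(String.valueOf(SOFT_HYPHEN), ""));
        }
        return result;
    }

    /** Cheap check for whether a text contains any soft hyphen at all. */
    static boolean containsSoftHyphen(String text) {
        return text.indexOf(SOFT_HYPHEN) >= 0;
    }

    public static void main(String[] args) {
        List<String> tokens = List.of("co\u00ADoperate", "plain");
        System.out.println(stripSoftHyphens(tokens));
        System.out.println(containsSoftHyphen("plain"));
    }
}
```

A check like containsSoftHyphen on the test file would show how much of the
measured difference could come from skipped soft hyphen handling at all.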


Cheers,

Silvan


_______________________________________________
Languagetool-devel mailing list
Languagetool-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/languagetool-devel
