[lingu-dev][lang guesser]Next steps

Jocelyn Merand Tue, 19 Sep 2006 13:07:25 -0700

Hi all lingu-workers,

Since the official end of my SoC, I have taken a good week of rest. I am now
returning to the keyboard...


For those who are not up to date with the language guesser component project:
I have successfully completed the SoC, and the component can guess the
language of quite short texts. Often, however, it is not able to return
*all* the languages included in the text.

Here are suggestions and remarks from Thomas Lange about the component. They
aim to make it clearer and smarter, not to develop new features.

  - "If this is about sentences using the XBreakiterator's functionality to
    identify sentence boundaries might be useful." (about breaking sentences
    into sets of words)

  - "Ok. You may use an STL container for this. Even though that would be C++
    mixing them in the source code should be fine." (about N-Gram memory
    allocation)

  - Have a look at a specific data structure called a Bloom filter to store
    N-Grams

  - Results for mixed-language texts are quite bad (general conclusion)

All these points make me doubt whether libtextcat is worth using for the
next version of the component, because it is coded in C and the code does not
seem to be designed for reuse or easy modification.

In addition, if we want to guess the languages of texts which are composed
of several different languages, we have to find typical text parts such as
quoted or bracketed word sequences. I expect that there is a UNO component to
do that, isn't there? I chatted with Thomas LEBARBE, a researcher at Grenoble
University (France), during the OOoCon, and he suggested that I use something
he called "virgulo", which is a kind of grammatical separator. At the
beginning of the summer, when I was searching for a good way to guess
multi-language texts, I also thought that language changes often occur at the
beginning or end of grammatical blocks. So analyzing the text block by block
should be a possible way to improve the efficiency of multi-guesses.
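
To make the block-by-block idea concrete, here is a minimal sketch in plain
C++. It is only an illustration: the function name splitIntoBlocks and the
choice of delimiters (ASCII double quotes, round and square brackets) are
assumptions, and it does not use any UNO component or the "virgulo" separator
mentioned above.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: cut a text into blocks at ASCII quotes and brackets,
// so that each block could be handed to the guesser separately.
std::vector<std::string> splitIntoBlocks(const std::string& text)
{
    const std::string delimiters = "\"()[]";   // assumed set of separators
    const std::string blanks = " \t\r\n";
    std::vector<std::string> blocks;
    std::string current;

    for (char c : text) {
        if (delimiters.find(c) != std::string::npos) {
            // A delimiter closes the current block; blank blocks are dropped.
            if (current.find_first_not_of(blanks) != std::string::npos)
                blocks.push_back(current);
            current.clear();
        } else {
            current += c;
        }
    }
    // Keep whatever follows the last delimiter.
    if (current.find_first_not_of(blanks) != std::string::npos)
        blocks.push_back(current);
    return blocks;
}
```

Each returned block would then be guessed on its own. Very short blocks will
probably still be hard to guess, so a real implementation may need to merge
them with their neighbours.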

About the Bloom filter: it is very interesting, but it only records whether an
N-Gram may be present, not how often it occurs, so it is not useful if you
want to get the rank (frequency) of all the N-Grams. Thank you for making me
aware of this strange thing.
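
A toy sketch may help to show why. The class below is an assumption made for
illustration only (the name NGramBloomFilter, the 4096-bit size and the two
hash positions derived from std::hash are all arbitrary choices). It stores
only bits, so inserting the same N-Gram once or a hundred times leaves the
filter in the same state.

```cpp
#include <bitset>
#include <cstddef>
#include <functional>
#include <string>

// Toy Bloom filter for N-Grams (illustrative sizes, not tuned).
// It can answer "possibly present" or "definitely absent", nothing more.
class NGramBloomFilter
{
    static constexpr std::size_t kBits = 4096;   // assumed filter size
    std::bitset<kBits> bits_;

    // Derive two bit positions by salting the N-Gram before hashing.
    static std::size_t index(const std::string& ngram, int which)
    {
        std::string salted = ngram;
        salted += static_cast<char>('A' + which);
        return std::hash<std::string>()(salted) % kBits;
    }

public:
    void insert(const std::string& ngram)
    {
        // Setting a bit that is already set changes nothing: no count is kept.
        bits_.set(index(ngram, 0));
        bits_.set(index(ngram, 1));
    }

    bool mightContain(const std::string& ngram) const
    {
        // True may be a false positive; false is always correct.
        return bits_.test(index(ngram, 0)) && bits_.test(index(ngram, 1));
    }
};
```

A counting structure such as std::map<std::string, std::size_t> keeps the
frequencies that a ranked fingerprint needs, at the cost of more memory.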

It seems that a complete refactoring will be needed if we want to implement
new functionality and have real multi-guess features. I propose to develop a
complete C++ library, not from scratch of course: I will take inspiration from
libtextcat, especially for the fingerprint comparison, which has been
implemented in libtextcat in a very efficient way. Unfortunately, this
algorithm is ad hoc and I think I will have to study it closely. We would then
have, for example, a component called "XFingerprintMaker" that would also be
very useful for other linguistic purposes.
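
As a rough idea of what a fingerprint maker could compute, here is a minimal
sketch of a rank-ordered N-Gram profile with an "out-of-place" comparison, in
the style of Cavnar and Trenkle's N-Gram text categorization. This is not
libtextcat's code and not its optimized comparison: the names Fingerprint,
makeFingerprint and outOfPlaceDistance, the fixed byte trigrams and the
tie-breaking rule are all assumptions made for illustration.

```cpp
#include <algorithm>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical fingerprint: N-Grams ordered from most to least frequent.
typedef std::vector<std::string> Fingerprint;
typedef std::pair<std::string, std::size_t> NGramCount;

// Count byte trigrams and keep the maxSize most frequent ones.
// Ties are broken alphabetically so the result is deterministic.
Fingerprint makeFingerprint(const std::string& text, std::size_t maxSize)
{
    const std::size_t n = 3;
    std::map<std::string, std::size_t> counts;
    for (std::size_t i = 0; i + n <= text.size(); ++i)
        ++counts[text.substr(i, n)];

    std::vector<NGramCount> ranked(counts.begin(), counts.end());
    std::sort(ranked.begin(), ranked.end(),
              [](const NGramCount& a, const NGramCount& b) {
                  if (a.second != b.second) return a.second > b.second;
                  return a.first < b.first;
              });

    Fingerprint fp;
    for (std::size_t i = 0; i < ranked.size() && i < maxSize; ++i)
        fp.push_back(ranked[i].first);
    return fp;
}

// Sum of rank differences between sample and reference. An N-Gram that is
// missing from the reference costs the maximum penalty, reference.size().
std::size_t outOfPlaceDistance(const Fingerprint& sample,
                               const Fingerprint& reference)
{
    std::map<std::string, std::size_t> rankInReference;
    for (std::size_t r = 0; r < reference.size(); ++r)
        rankInReference[reference[r]] = r;

    std::size_t distance = 0;
    for (std::size_t r = 0; r < sample.size(); ++r) {
        std::map<std::string, std::size_t>::const_iterator it =
            rankInReference.find(sample[r]);
        if (it == rankInReference.end())
            distance += reference.size();
        else
            distance += (it->second > r) ? it->second - r : r - it->second;
    }
    return distance;
}
```

The language whose reference fingerprint gives the smallest distance would be
the best guess. For multi-guess, the same comparison could be run on each
block of the text separately.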

It is probably not worth sending everybody the present version of the
component, because I think it will be modified.

None of what I have said here is the priority; these are, of course, just the
next steps. Thomas, could you please send me the latest component snapshot, in
case you have made modifications on your side? I will restart from there.

Best regards


Jocelyn
