Hi, Thomas


First of all, I'm surprised to see that the deadline of the project is next
Monday. Since the project was sponsored by Intel and started with one week
of delay, I believed it would also end one week late (around the end of
August). No problem, I'll rearrange my agenda to finish it in time. I will
continue to contribute to the project freely after this deadline to improve
the accuracy of the guessing and to add new languages.



Now I'm worried about Unicode, because libtextcat is definitely not
designed to handle encodings wider than 8-bit ones. I have also thought
about the value of a Unicode-based analysis: I tried to build a set of
rules to guess the language of a text from its character codes, and it
hardly works. Finally, I think the N-gram analysis already includes a
code-based analysis that should be sufficient to pick out the most probable
languages; when the N-gram analyzer counts N-grams, it can also count
single characters. This is why I decided to use libtextcat for short texts
rather than a dedicated algorithm. I tested libtextcat on short texts and
defined and implemented some tricks to analyze them, such as: "reduce the
minimum size of an N-gram for short texts" or "add white space before and
after single words to improve categorization by introducing marks for the
beginning and end of the word". Basically, "hello" has these 2-grams: "he",
"el", "ll" and "lo". If I add spaces, I also introduce the 2-grams " h" and
"o ", which is much more expressive, for example with English words that
end in "ing".
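To make the padding trick concrete, here is a minimal sketch (not the
libtextcat code itself, just an illustration of the idea) of extracting
character 2-grams with and without the surrounding spaces:

```python
def ngrams(word, n=2, pad=False):
    """Extract character N-grams from a word.

    With pad=True, a space is added before and after the word, so the
    N-grams also capture the beginning and the end of the word.
    """
    if pad:
        word = " " + word + " "
    return [word[i:i + n] for i in range(len(word) - n + 1)]

print(ngrams("hello"))            # ['he', 'el', 'll', 'lo']
print(ngrams("hello", pad=True))  # [' h', 'he', 'el', 'll', 'lo', 'o ']
```

With padding, a suffix like "ing" produces the distinctive 3-gram "ng "
only when it really ends a word, which is exactly the extra signal
described above.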



Today, I have a problem with character encoding. The best way to guess the
language would be to always use the same character encoding for every text
and to compare its fingerprint against the language fingerprints (all
encoded the same way). The encoding that seems best for this is UTF-16,
but it is a 2-byte encoding. To use it, I would have to modify libtextcat
to accept 2-byte characters, which is a big job (I am modifying the program
that builds the fingerprints, and the rest will be done before the end of
the week).
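A small illustration of why a single common encoding matters (my own
example, not from libtextcat): the same text encoded two different ways
yields completely different byte sequences, so byte-level N-grams can only
be compared after every text is converted to one agreed encoding.

```python
# The same string in two encodings gives different byte streams,
# hence different byte-level N-grams.
text = "café"

latin1 = text.encode("latin-1")    # 1 byte per character
utf16 = text.encode("utf-16-le")   # 2 bytes per character

print(latin1)            # b'caf\xe9'
print(len(latin1))       # 4
print(len(utf16))        # 8
```

The doubling of the byte count is also why libtextcat, which walks the
input one byte at a time, would need real modification to work on UTF-16
directly.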



I have also added methods to configure the component (set the fingerprint
DB and enable/disable languages). I have not written tests for them yet,
which is why I'm not sending the code now.



I will write all the documentation and comments next weekend and on Monday.



About debugging, I have added the lines you sent me last week, but I still
debug manually.



I also read (in this news post:
http://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-to-you.html
) that Google will soon publish its N-gram set (which is about 25 GB!). It
could be really interesting to use a subset of this huge database to build
our fingerprints, and I'm watching for the official release.



Regards, Jocelyn
