Hi Thomas,
First of all, I'm surprised to see that the deadline of the project is next Monday. Since the project is sponsored by Intel and started one week late, I believed it would also end one week late (around the end of August). No problem, I'll rearrange my agenda to finish it in time. After the deadline I will keep contributing to the project freely, to improve the accuracy of guessing and to add new languages.

My current worry is Unicode, because libtextcat is definitely not designed for encodings wider than 8 bits. I have also thought about the value of a dedicated Unicode analysis: I tried to build a set of rules that guess the language of a text from its character codes, and it hardly works. In the end, I think the N-gram analysis already includes a code-based analysis that should be sufficient to pick out the most probable languages: when the N-gram analyzer counts N-grams, it can also count single characters. This is why I decided to use libtextcat for short texts rather than a dedicated algorithm.

I tested libtextcat on short texts, and I defined and implemented some tricks to analyze short texts with it, such as: "reduce the minimum size of an N-gram for short texts" and "add white space before and after single words to improve categorisation by introducing markers for the beginning and end of the word". Basically, "hello" has these 2-grams: "he", "el", "ll" and "lo". If I add spaces, I also introduce the 2-grams " h" and "o ", which is much more expressive, for example with English words that end in "ing".

Today, I have a problem with character encoding. The best way to guess the language would be to always use the same character encoding for every text and to compare its fingerprint with the language fingerprints (all built with that same encoding). The encoding that appears best suited for this is UTF-16, but it is a 2-byte-based encoding.
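Just to illustrate the padding trick described above, here is a small Python sketch (my own illustration, not libtextcat's actual C code; the function name ngrams is mine):

```python
def ngrams(text, n):
    """Return the character n-grams of a string, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

word = "hello"
# Without padding, word boundaries are invisible:
print(ngrams(word, 2))              # ['he', 'el', 'll', 'lo']
# Padding with spaces adds begin/end markers such as ' h' and 'o ':
print(ngrams(" " + word + " ", 2))  # [' h', 'he', 'el', 'll', 'lo', 'o ']
# The end marker makes suffixes like English "ing" show up as 'g ':
print(ngrams(" sing ", 2))          # [' s', 'si', 'in', 'ng', 'g ']
```

Note that with n = 1 the same function just counts single characters, which is why a separate code-based analysis seems unnecessary.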
To use it, I would have to modify libtextcat to accept 2-byte characters, which will be a big job (I am currently modifying the program that makes the fingerprints; the rest will be done before the end of the week).

I have also added methods to configure the component (set the fingerprint DB and enable/disable languages). I have not written tests for it yet, which is why I'm not sending it now. I will write all the documentation and comments next weekend and on Monday. About debugging: I added the lines you sent me last week, but I still debug manually.

I also read (in this news post: http://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-to-you.html ) that Google will soon publish its N-gram set (which is about 25 GB!). It could be really interesting to use a subset of this huge database to build our fingerprints, and I'm watching for its official release.

Regards,
Jocelyn
