Hi all lingu-workers,

Since the official end of my SoC, I took a good week off. I am now back at the keyboard.

For those who are not up to date on the language guesser component project: I have successfully completed the SoC, and the component can guess the language of quite short texts. Often, however, it is not able to return *all* the languages contained in a text.

Here are suggestions and remarks from Thomas Lange about the component. They are meant to make it clearer and smarter, not to add new features.

- "If this is about sentences using the XBreakiterator's functionality to identify sentence boundaries might be useful." (about breaking sentences into sets of words)
- "Ok. You may use an STL container for this. Even though that would be C++ mixing them in the source code should be fine." (about N-gram memory allocation)
- Have a look at a specific data structure called a Bloom filter for storing N-grams.
- Results for mixed-language texts are quite bad (general conclusion).

All these points make me doubt whether libtextcat is worth using for the next version of the component. It is written in C, and the code does not seem to be designed for reuse or easy modification.

In addition, if we want to guess the languages of a text which mixes several languages, we have to find typical text parts such as quoted or bracketed word sequences. I expect there is a UNO component that does this, isn't there?

I chatted with Thomas LEBARBE, a researcher at Grenoble University (France), during the OOoCon, and he suggested that I use something he called "virgulo", which is a kind of grammatical separator. At the beginning of the summer, when I was looking for a good way to guess multi-language texts, I had also thought that language changes often occur at the beginning or end of grammatical blocks. So analyzing the text block by block should be a possible way to improve the efficiency of multi-guesses.

About the Bloom filter: it is very interesting, but it is not useful if you want to get the rank (frequency) of all the N-grams, because it only records whether an N-gram has been seen, not how often. Thank you for making me aware of this curious structure.

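To make the point about ranks concrete, here is a minimal sketch of N-gram storage with an STL container. It assumes byte-level N-grams, and the names countNGrams and rankNGrams are hypothetical; nothing here is taken from libtextcat or from the component. A std::map keeps the exact count of each N-gram, which is the information a rank needs and which a plain Bloom filter (membership test only) cannot provide.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical names for illustration; this is not libtextcat's API.
typedef std::map<std::string, std::size_t> NGramCounts;
typedef std::pair<std::string, std::size_t> NGramEntry;

// Count every byte-level N-gram of length n found in text.
// Assumption: bytes, not characters, so multi-byte encodings are not handled.
NGramCounts countNGrams(const std::string& text, std::size_t n)
{
    assert(n > 0);
    NGramCounts counts;
    for (std::size_t i = 0; i + n <= text.size(); ++i)
        ++counts[text.substr(i, n)];
    return counts;
}

// Order two entries by descending count; ties are broken alphabetically
// so that the resulting ranking is deterministic.
static bool moreFrequent(const NGramEntry& a, const NGramEntry& b)
{
    if (a.second != b.second)
        return a.second > b.second;
    return a.first < b.first;
}

// Return the N-grams ordered from most to least frequent.
// The index of an N-gram in the result is its rank.
std::vector<std::string> rankNGrams(const NGramCounts& counts)
{
    std::vector<NGramEntry> entries(counts.begin(), counts.end());
    std::sort(entries.begin(), entries.end(), moreFrequent);

    std::vector<std::string> ranked;
    ranked.reserve(entries.size());
    for (std::size_t i = 0; i < entries.size(); ++i)
        ranked.push_back(entries[i].first);
    return ranked;
}
```

For example, rankNGrams(countNGrams("banana", 2)) yields "an", "na", "ba": the first two occur twice and are ordered alphabetically, the last occurs once.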
It seems that a complete refactoring will be needed if we want to implement new functionality and get real multi-guess features. I propose to develop a complete C++ library. It would not be written from scratch: I will take inspiration from libtextcat, especially for the fingerprint comparison, which libtextcat implements very efficiently. Unfortunately, that algorithm is ad hoc, so I will have to study it closely. We would then have, for example, a component called "XFingerprintMaker" that would also be very useful for other linguistic purposes.

It is probably not worth sending the present version of the component to everybody, because I think it will be modified.

None of what I have described here is the priority; these are the next steps.

Thomas, could you please send me the latest component snapshot, in case you have made modifications on your side? I will restart from there.

Best regards,
Jocelyn
