Hi,


After two weeks without news, here is what it's all about:


(From the emails I have exchanged with Thomas Lange)


(email sent on June 7th to Thomas)

From me

N-grams work fine on long texts, but there is no trivial way to guess the language of a single word. I think a good approach would be to use the context of the word whose language we are trying to determine. I'm working on a few ideas:

use the languages already guessed for the previous and next words



From Thomas

Sure, I thought about this one too as a possible improvement.
But providing the context or not should be decided by the caller.
Thus your component should only make use of the provided text.

The assumption of course would be that the text is in one language only.
Of course it might be a good idea to have a return value that indicates that
there are (likely) several languages involved.
As to whether there should be a list of those languages, I'm not sure. Currently I
fail to see the need for returning such a list.

BTW: Mathias said it would be preferred if the implementation of the component
would be in C or C++. This would eliminate the overhead of involving a UNO bridge
to a different language binding.


From me

So, this is my first idea for an algorithm that guesses the language:

1 : guess the language of the whole text with n-grams (if there are not too many foreign words, the language should be guessed correctly)

2 : guess the language of every paragraph and check it with the spell checker



From Thomas

No! Involving any other linguistic component is not a good idea.
After all, the whole point of submitting the project was to get rid of the large overhead (time and
memory) involved in using the spellchecker or thesaurus.
If you want to do this there should at least be a separate interface, or something similarly distinct
from the 'regular' method.

Also I think this component should not know about paragraphs.
It should just be about strings. What amount of text is put into it should be left to the caller.



From me

3 : for all sub-texts that are not well recognized by the spell checker, run a more advanced analysis

the advanced analysis could consist of:

1 : run the n-gram analysis on the word (or word sequence) => get the possible languages in order of relevance

2 : pass it to the different spell checkers to confirm the language (obviously in the order determined before)



From Thomas

The trouble begins with "in the order determined before". Who is going to determine it?
What languages will be in it and how long will it be?


My today's comment

I had some ideas for designing a fancy, large-scale way to guess the language of a text. I have since understood that it was too complex and probably too big.


(Email sent on June 14th from Thomas to answer me)

From Thomas


When I reviewed my latest mail to you in my mind it occurred to me that
I made a mistake in one of my statements.
It is not always true to assume that the whole text was in one language.
A possible use for the component would be to guess the 'primary' language
of a sentence in order to pass it on to the respective grammar checker.
This would be useful if we have mixed language attributes within a sentence.


My today's comment

OK, so we assume that there may be foreign words in a text. I think it could be interesting to first determine the "average" (primary) language of the text. The n-gram algorithm is good for this!

But, again, why not also be able to guess the language of the foreign parts and words? Feel free to comment on this idea.

In this case, we need a way to avoid redundant passes over the text. To do this, I think we could use a one-pass n-gram counter that records not only the counts for the whole text but also intermediate results per sentence or, if the text is short, per word. In one of my reports to Thomas I spoke about an "n-gram improvement"; this is it. OK, "improvement" is maybe a bit excessive.
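To make the one-pass idea concrete, here is a small sketch of such a counter. Everything here (the `NGramCounts` structure, the naive sentence split on `.`/`!`/`?`, the use of trigrams) is illustrative and assumed, not an existing API:

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical one-pass trigram counter: every trigram updates both the
// per-sentence map and the whole-text map in the same pass, so later
// guesses on sub-parts need no re-reading of the text.
struct NGramCounts {
    std::map<std::string, int> total;                    // whole-text counts
    std::vector<std::map<std::string, int>> bySentence;  // one map per sentence
};

static void addTrigrams(const std::string& s,
                        std::map<std::string, int>& local,
                        std::map<std::string, int>& global) {
    for (std::size_t i = 0; i + 3 <= s.size(); ++i) {
        std::string g = s.substr(i, 3);
        ++local[g];
        ++global[g];  // the whole-text count is updated in the same pass
    }
}

NGramCounts countTrigrams(const std::string& text) {
    NGramCounts c;
    c.bySentence.emplace_back();
    std::string sentence;
    for (char ch : text) {
        sentence += ch;
        if (ch == '.' || ch == '!' || ch == '?') {  // naive sentence split
            addTrigrams(sentence, c.bySentence.back(), c.total);
            sentence.clear();
            c.bySentence.emplace_back();
        }
    }
    if (!sentence.empty())
        addTrigrams(sentence, c.bySentence.back(), c.total);
    else if (c.bySentence.back().empty())
        c.bySentence.pop_back();  // drop the trailing empty slot
    return c;
}
```

A caller could then guess the language of the whole text from `total` and of any single sentence from `bySentence[i]`, without a second pass over the string.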

The problem now is to give it a "serializable" interface. Of course, with this process the text is provided once, and the "client" asks for guesses afterwards. So, for example, if we use it for spell/grammar checking, we would give it the text, ask for the main (most probable) language, and analyze the text with that language (on the checker side). If some words are not recognized, it could be interesting to ask for a more precise analysis of those parts (or words) to guess a language possibly different from the "main" one. The system I'm proposing allows the text to be passed only once; on the guesser side there is only one n-gram counting pass, and further analyses can be made without multiple heavy string transfers and without repeating the heavy processing (namely the counting). In addition, when the component is used as a simple one-shot guesser, it should hardly be less CPU- and memory-efficient than the "normal" one. I don't know whether this functionality justifies the development effort, so tell me whether you find it interesting.


(June 15th email)

From me

When I read your Monday email, I realized that I had been wrong from the beginning! Before I read your criticism, I thought the goal was to assist the spell/grammar checker, so it was clear to me that in that case a spell checker component would be loaded at all times anyway. OK, mea culpa. I was quite disappointed, and I started to think about what you had said.


From Thomas

I think the communication problem was more on my side.
I should have made sure at first that you understood what it was about in detail.

But I think almost nothing you have done so far will be wasted, since not using
the spell checker or thesaurus in your component won't have any effect on
the implementation of the n-gram algorithm or on evaluating the Unicode code points of characters.


My today's comment

I should have understood that the project's main goal is to design a fairly simple tool that can be used by other components, not a spell-checking assistant (more a server, less a client). OK, please forget this. Sorry for the misunderstanding.


From me

Now, I want to make sure I am clear on all the goals and constraints.

So correct me if I'm wrong:

The component I will produce should be able to guess the language without any third-party help and, of course (this has been clear to me from the beginning), with a small memory footprint and good CPU efficiency.


From Thomas

Yes.
If it can be done without too much overhead, perhaps because internally you will already have the necessary data, it would also be OK to give a list of languages sorted by likelihood, with the top-scoring one first.
As for whether the likelihoods should be returned as well, I'm somewhat ambivalent. Even though it will always be interesting from a theoretical point of view, and it raises trust in the result if the likelihood is 98% and not only 69%, from the programmatic point of view we will probably always just choose the top-scoring entry, and that's it. Also, giving likelihoods sometimes produces a rather arbitrary result, since for some algorithms they can't be computed properly at all. Thus a different implementation of the same API may have trouble 'generating' those values, since they are not given by the algorithm.
Thus you may choose as you like here. To me this information does not really matter, but if you think it will be useful, just go on.

My today's comment

Well, to take the example of the n-gram algorithm in libtextcat, it only uses a fingerprint that weighs no more than 4 KB per supported language. Does that fit the memory constraint? In addition, these fingerprints should be built with the same n-gram counter the guesser uses. (I think libtextcat does so.)
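As I understand the libtextcat approach, a fingerprint is just a ranked list of a language's most frequent n-grams, and texts are matched with an "out-of-place" distance. Here is a sketch of that idea; the names, the fixed penalty for unknown n-grams, and the linear search are all simplifications of mine, not libtextcat's actual code:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <map>
#include <string>
#include <utility>
#include <vector>

// A rank fingerprint: n-grams of a language, most frequent first. A few
// hundred ranked n-grams fits easily in the ~4 KB per language quoted above.
using Fingerprint = std::vector<std::string>;

Fingerprint buildFingerprint(const std::map<std::string, int>& counts,
                             std::size_t topN) {
    std::vector<std::pair<std::string, int>> v(counts.begin(), counts.end());
    std::stable_sort(v.begin(), v.end(),
                     [](const std::pair<std::string, int>& a,
                        const std::pair<std::string, int>& b) {
                         return a.second > b.second;  // highest count first
                     });
    Fingerprint fp;
    for (std::size_t i = 0; i < v.size() && i < topN; ++i)
        fp.push_back(v[i].first);
    return fp;
}

// Out-of-place distance: for each n-gram of the document fingerprint, add
// how far its rank is from its rank in the language fingerprint; n-grams
// absent from the language fingerprint get a fixed penalty.
int outOfPlace(const Fingerprint& doc, const Fingerprint& lang) {
    int dist = 0;
    const int penalty = static_cast<int>(lang.size());
    for (std::size_t i = 0; i < doc.size(); ++i) {
        auto it = std::find(lang.begin(), lang.end(), doc[i]);
        if (it == lang.end())
            dist += penalty;
        else
            dist += std::abs(static_cast<int>(it - lang.begin()) -
                             static_cast<int>(i));
    }
    return dist;
}
```

The guesser would compute `outOfPlace` between the text's fingerprint and each loaded language fingerprint, and report the language with the smallest distance first.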

From me (in the last email)

From the start, I thought that the level of accuracy should be quite high (about 95%). From what you told me about precision, I now believe that you do not require such a high rate, do you?


From Thomas

Hmm... As for a text in a single language such a high precision would be most desirable.
But as the text gets smaller you will be less and less able to guarantee such precision. For example, I think you will have trouble if the word is only about 5 characters long and contains only ASCII characters, and is thus missing any special character, e.g. from French or Russian.
For example, take "Theater" at the beginning of a sentence: is it the German or the English word?
One might take a chance and vote for German, since if it were not at the start of a sentence it would have to be lowercase in English.

And if we are talking about a text with multiple languages it is even more unlikely to have such a high precision for guessing the primary language. Especially if for example two languages are involved for about the same amount of words, and even if you got the respective text parts for those languages with 100% accuracy. In such cases you will always be in a pinch.
If it is not only about one or two foreign words in a 30-word sentence, you'll be unlikely to achieve a precision of 95%. Therefore in this case it is quite acceptable to go with less precision, e.g. 70%.

Because of these fundamentally different needs, I have come to think it might actually be a good idea to have different API calls for those two tasks, thus allowing them to be implemented with different algorithms.
Please comment about this.

My today's comment

In my previous emails I wasn't really clear about precision and, as Thomas said, precision means different things from a theoretical and a practical point of view. To me (and, I think, to Thomas), theoretically, the more precise a guesser is, the more often it puts the "right" language at the top of the list. Practically, a guesser may still be acceptable if it ranks the right language for a single word in second or third place.

What about a special interface for different use cases?

If I understand Thomas correctly, guessing the language of one word is so different from guessing that of a long text that we should offer different ways of using the component, for example one service for long texts and another for words. I think what I'm proposing, setting the text once and then asking about different parts of it, should handle these different use cases. If you don't think so, please tell me more.

Even if designing different services for different use cases is not totally clear to me, I have no doubt that the guessing process must at least be quite different. In addition, I think I should define a heuristic that determines the "best" parameters and a way to combine the results of the different methods, for example if I use n-grams plus Unicode code-point discrimination. These methods shouldn't be given the same weight for a single word as for a 1000-word text; I mean that the Unicode test is probably much more useful for short texts. Please see below what I am wondering about the Unicode test.

From Thomas
For the API we also need to think about some special problems and what to do in those cases. For example, if the function returns a list of possible languages, that list should not be too long. One way to handle this is to allow the caller to specify a maximum number of entries to be returned, or, if you are going to use probabilities, to return only languages with a probability higher than a provided value.

Another problem that easily comes to mind is: what are you going to do if for a given word you found 10 matching languages each with a probability of 9-11%?
Even if we take the high scorer, it would be pure luck to get the correct language.
In such cases it might be useful to indicate that no really useful result was found, or that it could not be pinned down to a single language. But in the end the caller might still want to get a (single) language, no matter how bad the guess may be.

Since, after the previous mail, you should now know what the intended applications for the component are, please think about what useful results the API should return to serve those applications, even in the odd cases.
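The result-capping idea described above could look like this. The function name, the `Guess` pair, and both parameters are invented for illustration; the input list is assumed to already be sorted best-first:

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// One guess: a language code and its estimated probability.
using Guess = std::pair<std::string, double>;

// Return at most maxEntries guesses, dropping everything below a
// caller-provided probability threshold.
std::vector<Guess> capResults(const std::vector<Guess>& sortedGuesses,
                              std::size_t maxEntries, double minProbability) {
    std::vector<Guess> out;
    for (const Guess& g : sortedGuesses) {
        if (out.size() >= maxEntries || g.second < minProbability)
            break;  // the list is sorted, so nothing later can qualify either
        out.push_back(g);
    }
    return out;
}
```

Both limits could become parameters of the guessing call itself, which keeps the decision about list length on the caller's side, as Thomas suggests.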


My today's comment

Basically, we would like to think that a high likelihood means we can trust the order of the guessed languages, even if it's not totally certain. In fact, the ordering between language guesses cannot be guaranteed, but we must make a choice in spite of this lack of precision. I mean that this problem (choosing the "best guess" among all credible results) will exist whatever solution we choose. So the real problem is to detect situations where the result is not meaningful. I have tested libtextcat, and when I scrambled the sentences I tested, I saw "UNKNOWN" returned as the language. So it seems that libtextcat filters its results itself. I'm going to study this point.

Independently of libtextcat, I have had an idea for distinguishing good guesses from bad ones: why not use the variance of the relevance scores (across all languages)? If it is low, it means the result is probably not meaningful. This evaluation should be possible, because one way to compare language fingerprints with the text's fingerprint is to calculate the "distance" between them (am I wrong?). So why not calculate the variance of those distances?
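A minimal sketch of this variance idea: if the distances from the text to every language fingerprint are nearly equal, no language stands out and the guess could be reported as unreliable ("UNKNOWN"). The threshold is an assumed tuning parameter, not a known-good value:

```cpp
#include <cassert>
#include <vector>

// Population variance of the distances from the text to each language.
double variance(const std::vector<double>& distances) {
    double mean = 0.0;
    for (double d : distances) mean += d;
    mean /= static_cast<double>(distances.size());
    double var = 0.0;
    for (double d : distances) var += (d - mean) * (d - mean);
    return var / static_cast<double>(distances.size());
}

// Low variance means all languages scored about the same, so the
// top-ranked guess should not be trusted.
bool isReliable(const std::vector<double>& distances, double threshold) {
    return variance(distances) >= threshold;
}
```

Whether a plain variance is the right statistic (rather than, say, the gap between the best and second-best distance) would need to be tested against real data.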


Latest news

What about Unicode?

I wonder whether the Unicode analysis is relevant. If we use n-grams on a text, they already include n-grams containing "special" Unicode characters. So is it really useful to analyze Unicode code points if they are already covered by the n-gram algorithm? For example, when guessing the language of the short French text "La grande école", the character "é" should not appear in an English sentence, but then neither should the bigrams " é" and "éc". So whether a Unicode analysis can improve on the n-gram result, I'm not sure. Of course, I'm not so familiar with Asian languages, and I don't know whether the n-gram algorithm can help us there; Unicode could help in those cases. So I don't entirely disagree with the idea of using Unicode, but I think this point needs to be discussed.
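As a tiny illustration of what a Unicode test could add on very short input, here is the crudest possible check: does a UTF-8 string contain any non-ASCII byte at all? This separates scripts (Latin vs. CJK, for instance), not languages within a script, so at best it would complement the n-gram result rather than replace it:

```cpp
#include <cassert>
#include <string>

// Returns true if the UTF-8 string contains any byte outside ASCII.
// UTF-8 lead and continuation bytes are all >= 0x80, so this detects
// any non-ASCII code point without decoding the string.
bool hasNonAscii(const std::string& utf8) {
    for (unsigned char c : utf8)
        if (c >= 0x80)
            return true;
    return false;
}
```

A real Unicode test would decode code points and look up their script ranges; this byte-level version only shows why the signal is weak for words like "Theater" that are pure ASCII in any language.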


I'm getting some help from Emmanuel Giguet, a French researcher who defended his thesis in 1998 on the analysis of formal structures in multilingual texts. Unfortunately, this thesis is written in French only (an irritating custom in France), sorry. I hope he will be able to help me with the single-word problem. However, he also wrote this: http://users.info.unicaen.fr/~giguet/iwpt95/ . I think it should be really interesting.

I'm also going to ask the libtextcat maintainer about a possible change of license / fork.


What about the SDK ?

The final goal of the project is to produce an independent component (independent of other components, and with an interface that is programming-language independent). That's why UNO is the best way to design it. To get started, I downloaded the SDK and Thomas gave me an example.

I, too, have had problems with the linguistics samples of the SDK, and I'm currently working on it. I can't compile the simple spell checker example, and this is blocking me; if someone could help me with this, I would appreciate it.


What about services ?

Well, I think the service could be simply defined like this:


# set the text to be analyzed

void setText(String)


# analyze the whole text

void guess()


# analyze only a subset

void guessArea(int fromWord, int toWord)


# return the nth language guessed (to be called after guess())

String getLanguage(int rank)


# only return the "best" language

String getLanguage()


# return the relevance of the nth language guessed (to be called after guess())

# I'm not sure the component should expose this value!

int getRelevance(int rank)

With the implementation I propose, the n-gram counting is done only when setText() is called (or at the first call of a guessX() function). Maybe another pair of methods would be important: I don't know if it's useful, but the component could be constrained to a given set of languages to consider. Therefore, I would introduce:


# load the language's fingerprint and prepare the component to consider this language

void enableLanguage(String)


# unload the language's fingerprint

void disableLanguage(String)
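To show the intended call sequence (set the text once, guess, then query the ranked result), here is a minimal plain-C++ mock of the proposed service. The real component would expose this through UNO and run the n-gram machinery internally; this stub hardcodes a fake result purely for illustration:

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Illustrative stand-in for the proposed UNO service.
class LanguageGuesser {
    std::vector<std::string> ranked_;  // language codes, best guess first
public:
    void setText(const std::string& text) {
        // a real implementation would run the one-pass n-gram counter here
        (void)text;
        ranked_.clear();
    }
    void guess() {
        // a real implementation would compare the counts against the loaded
        // fingerprints; here we just pretend French scored first, English second
        ranked_ = {"fr", "en"};
    }
    std::string getLanguage() const {
        return ranked_.empty() ? "UNKNOWN" : ranked_.front();
    }
    std::string getLanguage(std::size_t rank) const {
        return rank < ranked_.size() ? ranked_[rank] : "UNKNOWN";
    }
};
```

Note the out-of-range behavior: returning "UNKNOWN" for a rank beyond the list is one possible answer to Thomas's question about odd cases, though an exception or an empty string would work too.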


Feel free to comment on my report and correct me if necessary.

Best regards
