You're right, Yonik, that's what I meant.

As an example, consider two situations:
- Situation A: one field ("text") where the texts of all documents are
indexed;
- Situation B: one field per language ("text_en", "text_fr", etc.) where
only the texts of documents in that language are indexed.

Now, take a query with the French word "voiture" (meaning "car"). There are of
course many more French documents than English ones containing this word. In
situation A, the idf of "voiture" is the same for every document, so the
scores of documents will depend only on the tf of "voiture". French and
English documents will be treated equally. In situation B, the idf of
"voiture" in the field "text_en" will be very high due to the rarity of this
word in English. English documents will then always be ranked higher.

Then, take a query with the English word "car". This word is very common in
French (it means "because" but can also have other meanings). In
situation A, French documents will be ranked higher because of their high
tf. Situation B is better in this case, as the frequency of "car" in French
is not taken into account in the ranking of English documents. The score of
English documents will be calculated normally, while French documents will
be ranked lower due to the very low idf (or may not appear at all if stop
words are used).
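
To make the two examples more concrete, here is a small Python sketch of the
idf values in both situations. It assumes the classic Lucene formula
idf = 1 + ln(numDocs / (docFreq + 1)), with docFreq counted per field and
numDocs counted over the whole index. The corpus size and document
frequencies are invented for illustration only.

```python
import math

# Hypothetical index: 1000 French and 1000 English documents.
NUM_DOCS = 2000

# Made-up number of documents containing each term, per document language.
DOC_FREQ = {
    "voiture": {"fr": 200, "en": 2},
    "car": {"fr": 600, "en": 200},
}


def classic_idf(doc_freq, num_docs=NUM_DOCS):
    """Classic Lucene-style idf: 1 + ln(numDocs / (docFreq + 1))."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


def idf_situation_a(term):
    """Situation A: one shared "text" field, so one idf for all languages."""
    return classic_idf(sum(DOC_FREQ[term].values()))


def idf_situation_b(term, lang):
    """Situation B: docFreq is counted inside the field text_<lang> only."""
    return classic_idf(DOC_FREQ[term][lang])


for term in ("voiture", "car"):
    print(term,
          "A/text: %.2f" % idf_situation_a(term),
          "B/text_fr: %.2f" % idf_situation_b(term, "fr"),
          "B/text_en: %.2f" % idf_situation_b(term, "en"))
```

With these numbers, "voiture" gets a much higher idf in "text_en" than in
"text_fr", and "car" gets a lower idf in "text_fr" than in "text_en", which
is the asymmetry described above.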

So far, situation B with a boost for documents in the same language as the
query seems the more promising option (at least as far as the relevance
problem is concerned).
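
As a rough illustration of that idea, the sketch below builds a query string
in the standard Lucene query syntax (field:term^boost) that searches every
per-language field and boosts the one matching the query language. The
function name, the boost value of 2.0 and the assumption that the query
language is already known are all hypothetical, and the term is assumed to
be a single word with no special characters.

```python
def build_boosted_query(term, query_lang, langs=("en", "fr"), boost=2.0):
    """Search all text_<lang> fields, boosting the query's own language."""
    clauses = []
    for lang in langs:
        clause = "text_%s:%s" % (lang, term)
        if lang == query_lang:
            # Only the field in the query's language gets the boost.
            clause += "^%s" % boost
        clauses.append(clause)
    return " OR ".join(clauses)


print(build_boosted_query("voiture", "fr"))
# prints: text_en:voiture OR text_fr:voiture^2.0
```

The boost only counteracts the idf asymmetry if it is large enough, so its
value would have to be tuned on real data.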

Another interesting solution is to use language-dependent stop words in
situation A to avoid altering the idf for other languages. But this causes
other problems (stop words are not always desirable).

I hope this example clarifies the identified problem of cross-lingual
retrieval in Lucene. Sorry if it is not as clear as I would like (English is
not my mother tongue).

Nicolas

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Yonik Seeley
Sent: Sunday, March 2, 2008 03:45
To: solr-user@lucene.apache.org
Subject: Re: Proposition of a new feature: Dynamic Field Types

On Sat, Mar 1, 2008 at 9:38 PM, Otis Gospodnetic
<[EMAIL PROTECTED]> wrote:
> I don't quite follow everything here (examples?), but I believe IDF of a
> term is not a per-field value, but "index-wide".

I think Nicolas meant that idfs are field specific, and that is the
case (index-wide, per field).

-Yonik
