Re: Multi language indexing

Doron Cohen Mon, 07 May 2007 17:54:50 -0700

bhecht <[EMAIL PROTECTED]> wrote on 07/05/2007 10:26:27:

> I have implemented my own analyzer for each country.
> So as I see it, when I index these records, I want to
> provide lucene, with a specific analyzer per record
> i'm indexing.
>
> When a user performs a query in my JSF form, I will
> use the country value he entered, to get the needed
> analyzer, and query lucene with the users query and
> the needed analyzer.
>
> The user may also choose not to enter a country value
> to his search, and here comes in the solution you gave
> me, to duplicate each field, and index it using a non
> stemming analyzer (A standard analyzer without stop
> words defined).
>
> Am I going the right direction?


Sounds ok to me, except that the text seems to mix up
stemming and stop-word elimination. Perhaps that is
just a typo, but anyhow: the StandardAnalyzer
constructor takes a stop-words list and eliminates
those words (e.g. "is"), but it does not do stemming
(e.g. "knives" --> "knive"). (Though both a stop list
and a stemming algorithm are language specific.)
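
To make the distinction concrete, here is a small sketch in
plain Java. It does not use Lucene at all: ToyAnalysis and its
methods are hypothetical names, and toyStem is a deliberately
naive suffix stripper, not the Porter or Snowball algorithm.
It only illustrates that stop-word removal drops whole tokens
while stemming rewrites the tokens that remain.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Hypothetical, Lucene-free illustration of two separate analysis steps. */
final class ToyAnalysis {
    private ToyAnalysis() {}

    /** Lower-cases the text and splits it on anything that is not a letter or digit. */
    static List<String> tokenize(String text) {
        List<String> out = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                out.add(t);
            }
        }
        return out;
    }

    /** Stop-word elimination: tokens found in the stop list are dropped entirely. */
    static List<String> removeStopWords(List<String> tokens, Set<String> stopWords) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            if (!stopWords.contains(t)) {
                out.add(t);
            }
        }
        return out;
    }

    /**
     * Toy stemming: strips one trailing 's' from tokens longer than three
     * characters. This is an assumption made for illustration only; a real
     * stemmer applies many language-specific rules.
     */
    static List<String> toyStem(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            if (t.length() > 3 && t.endsWith("s")) {
                out.add(t.substring(0, t.length() - 1));
            } else {
                out.add(t);
            }
        }
        return out;
    }
}
```

Running only removeStopWords corresponds to an analyzer with a
stop list and no stemmer; running only tokenize corresponds to
the language-neutral case with neither step.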

So, rephrasing the discussion so far, assuming:

1) a single field "F" (for simplicity),
2) (doc) language always known at indexing
3) (user) language sometimes known at search

I think a reasonable solution might be:

1) use PerFieldAnalyzerWrapper
2) index each doc to F and to F_LAN
3) F would be language neutral - no
   stemming and no stop words elimination
4) F_LAN (e.g. F_en) would be language specific,
   so a specific language stopwords list would be
   used, and a specific stemmer would be used.
5) Search would go to F_LAN when the language is
   known and to F when the language is not known,
   in each case using the same analysis that was
   applied to that field at indexing time.
6) Note Karl's suggestion to search both F and F_LAN,
   assigning a higher boost to F_LAN. Useful when there
   is some uncertainty about the "marked language".
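
The field routing in the steps above can be sketched in plain
Java as follows. This is only an illustration of the naming and
selection logic, with no Lucene calls: LanguageFields and its
methods are hypothetical names, and the "F_" + language naming
simply follows the F_en example. In a real setup the analyzer
for each of these field names would be registered with the
PerFieldAnalyzerWrapper.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

/** Hypothetical helper deciding which fields to write and which to search. */
final class LanguageFields {
    private LanguageFields() {}

    /** Name of the language-specific field, e.g. ("F", "en") gives "F_en". */
    static String fieldFor(String base, String lang) {
        return base + "_" + lang;
    }

    /** Indexing: each document is written to the neutral field and to its language field. */
    static List<String> fieldsToIndex(String base, String docLang) {
        return Arrays.asList(base, fieldFor(base, docLang));
    }

    /** Search: use F_LAN when the user's language is known, otherwise the neutral F. */
    static String fieldToSearch(String base, Optional<String> userLang) {
        return userLang.map(lang -> fieldFor(base, lang)).orElse(base);
    }

    /**
     * Variant from step 6: search both fields, with a caller-chosen boost on
     * F_LAN and a boost of 1.0 on F. The boost value itself is an assumption
     * to be tuned; nothing in this thread prescribes one.
     */
    static Map<String, Float> boostedFields(String base, String userLang, float lanBoost) {
        Map<String, Float> boosts = new LinkedHashMap<>();
        boosts.put(fieldFor(base, userLang), lanBoost);
        boosts.put(base, 1.0f);
        return boosts;
    }
}
```

For example, fieldToSearch("F", Optional.of("en")) selects F_en,
while fieldToSearch("F", Optional.empty()) falls back to F. The
query text must then be analyzed with whatever analyzer was
used for the selected field at indexing time.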

There can be other considerations - for instance (1) the
certainty of language id; (2) fallback to English when the
language is unknown...

Note that SnowballFilter can be used to apply
stemming to the token stream produced by
StandardAnalyzer.

Doron



