On 08.10.12 17.03, Maciej Liżewski wrote:

Now there are two possibilities:
1. When fields are left untouched, the processing (stemming, etc.) is
the same for every document, which is rather wrong because Polish
stemming is different from English stemming... :)
2. Attributes are mapped to *_lang fields, and every *_lang field has
a different processing definition (stemming, stop words, etc.).

The latter seems more reasonable to me and is the more common practice. There are several stemmers you may try out, such as Hunspell.

If you want to detect languages, I would use TikaLanguageIdentifierUpdateProcessorFactory:
http://wiki.apache.org/solr/LanguageDetection

It can be configured by using an Update Request Processor:
http://wiki.apache.org/solr/UpdateRequestProcessor

This part I understand,
but I am confused about how to perform valid queries in both cases. I
have a single (simple) page which should work Google-like: you enter
some text and get results. But there is no "language guess" process
for queries... Do I have to specify on each query whether it should
search in the 'text_en' or 'text_pl' field? If so, that is not very
good, because I would like users to get all documents that match the
query no matter what language they are written in. There are many
similar words, technical names, etc., which are the same in many
languages...

I think you should search in both fields, yes. I will explain why further down.

In other words - how to achieve Google-like search with stemming for
multiple languages, without forcing users to select the language they
would like to search in?

Google guesses the query language. If you hit www.google.com, you will be redirected to www.google.pl if you're sitting in Poland. You can achieve something similar in your application by detecting the browser's locale, for instance; many web application frameworks have support for this. Then, at query time, you may give a higher boost to the fields belonging to the detected language, while still searching the fields for the other languages.
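
To make the idea concrete, here is a minimal Python sketch of how such query parameters could be built. It is only an illustration under several assumptions that are not part of your setup as described: the request is handled by the edismax query parser, the fields are named text_en and text_pl as in your example, the language is taken from the browser's Accept-Language header, and the boost factor of 2.0 is arbitrary. The function names are made up for this example.

```python
# Hypothetical mapping from language code to the Solr field holding that language.
FIELDS = {"en": "text_en", "pl": "text_pl"}


def preferred_language(accept_language, default="en"):
    """Return the supported language ranked highest in an Accept-Language header."""
    candidates = []
    for position, part in enumerate(accept_language.split(",")):
        pieces = part.strip().split(";")
        tag = pieces[0].strip().lower()
        if not tag:
            continue
        quality = 1.0  # a missing q value means full preference
        for param in pieces[1:]:
            name, _, value = param.strip().partition("=")
            if name == "q":
                try:
                    quality = float(value)
                except ValueError:
                    quality = 0.0
        # Sort by descending quality, then by position in the header.
        candidates.append((-quality, position, tag.split("-")[0]))
    for _, _, lang in sorted(candidates):
        if lang in FIELDS:
            return lang
    return default


def build_params(user_query, accept_language, boost=2.0):
    """Search every language field, boosting the one for the detected language."""
    lang = preferred_language(accept_language)
    qf = " ".join(
        f"{field}^{boost}" if code == lang else field
        for code, field in sorted(FIELDS.items())
    )
    return {"q": user_query, "defType": "edismax", "qf": qf}


params = build_params("serwer", "pl-PL,pl;q=0.9,en;q=0.8")
print(params["qf"])
```

The resulting dictionary can be URL-encoded and sent to your search handler. Because every language field stays in qf, documents in other languages still match; the boost only influences ranking.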

Erlend
--
Erlend Garåsen
Center for Information Technology Services
University of Oslo
P.O. Box 1086 Blindern, N-0317 OSLO, Norway
Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050
