On 08.10.12 17.03, Maciej Liżewski wrote:
Now there are two possibilities: 1. when fields are untouched, data processing (stemming, etc.) is the same for every document, which is rather wrong because Polish stemming is different from English stemming... :) 2. attributes are mapped to *_lang and every *_lang field has a different processing definition (stemming, stop words, etc.).
The latter seems more reasonable to me and is the more common practice. There are different stemmers you may try out, such as Hunspell.
If you want to detect languages, I would use TikaLanguageIdentifierUpdateProcessorFactory:
http://wiki.apache.org/solr/LanguageDetection
It can be configured by using an Update Request Processor:
http://wiki.apache.org/solr/UpdateRequestProcessor
This part I understand, but I am confused about how to perform valid queries in both cases. I have a single (simple) page which should work Google-like: you enter some text and get results. But there is no "language guess" process for queries... Do I have to specify on each query whether it should search in the 'text_en' or 'text_pl' fields? If so, that is not very good, because I would like users to get all documents that match the query no matter what language they are written in. There are many similar words, technical names, etc., which are the same in many languages...
I think you should search in both fields, yes. I will explain why further down.
In other words: how to achieve Google-like search with stemming for multiple languages without forcing users to select the language they would like to search in?
Google guesses the query language. If you hit www.google.com, you will be redirected to www.google.pl if you're sitting in Poland. You can achieve something similar in your application by detecting the browser's locale, etc.; many web application frameworks have support for this. Then you may give (at query time) a higher boost to the fields belonging to the detected language. Since you still search in all language fields, documents in other languages keep matching; they just rank lower.
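A minimal sketch of the query-time boosting idea in Python, not a definitive implementation. It assumes the fields are called text_en and text_pl (as in the question), that the language guess comes from the browser's Accept-Language header, and that the query goes to Solr's edismax parser via its defType, q and qf parameters. The helper names and the boost value of 3.0 are made up for illustration.

```python
from urllib.parse import urlencode

# Hypothetical per-language fields, following the text_en / text_pl naming in this thread.
LANG_FIELDS = {"en": "text_en", "pl": "text_pl"}


def preferred_language(accept_language, supported=LANG_FIELDS):
    """Return the supported language with the highest q-value, or None."""
    best, best_q = None, 0.0
    for part in accept_language.split(","):
        pieces = part.strip().split(";")
        tag = pieces[0].strip().lower()
        if not tag:
            continue
        q = 1.0  # a missing q-value means full preference
        for p in pieces[1:]:
            p = p.strip()
            if p.startswith("q="):
                try:
                    q = float(p[2:])
                except ValueError:
                    q = 0.0  # ignore entries with a malformed weight
        lang = tag.split("-")[0]  # "pl-PL" -> "pl"
        if lang in supported and q > best_q:
            best, best_q = lang, q
    return best


def build_query_params(user_query, accept_language, boost=3.0):
    """Search every language field, boosting the one matching the guessed language."""
    lang = preferred_language(accept_language)
    qf = " ".join(
        f"{field}^{boost if code == lang else 1.0}"
        for code, field in LANG_FIELDS.items()
    )
    return {"defType": "edismax", "q": user_query, "qf": qf}


if __name__ == "__main__":
    params = build_query_params("wyszukiwanie", "pl-PL,pl;q=0.9,en;q=0.8")
    # Query string to append to the select handler URL of your Solr core.
    print(urlencode(params))
```

If no supported language can be guessed from the header, all fields get the same weight, so the user still gets matches from every language.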
Erlend
--
Erlend Garåsen
Center for Information Technology Services
University of Oslo
P.O. Box 1086 Blindern, N-0317 OSLO, Norway
Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050
