You can store everything in one field if you manage to hide a language code alongside the text. XML is overkill but effective for this. At one point we investigated how to let a Lucene analyzer see more than one field (the language code as well as the text), but I don't think we came up with anything.
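A minimal sketch of that single-field idea, in plain Python: pack the language code into the stored value itself, then unpack it at analysis time to pick a per-language analyzer. The separator, the pack/unpack helpers, and the toy "analyzers" here are all illustrative assumptions, not PyLucene or Solr API.

```python
# Sketch: hide a language code inside a single field value, then recover
# it when analyzing. Everything here is a stand-in for illustration.

SEP = "\u0001"  # an unlikely control character used as a delimiter

def pack(lang, text):
    """Store the language code together with the text in one value."""
    return lang + SEP + text

def unpack(value):
    """Recover (lang, text) from a packed field value."""
    lang, _, text = value.partition(SEP)
    return lang, text

# Toy stand-ins for per-language analyzers, keyed by language code.
analyzers = {
    "en": lambda text: text.lower().split(),
    "pt": lambda text: text.lower().split(),
}

def analyze(value):
    """Unpack the field value and run the matching analyzer on the text."""
    lang, text = unpack(value)
    return analyzers[lang](text)

print(analyze(pack("en", "Searching Across Languages")))
# ['searching', 'across', 'languages']
```

A real implementation would have to do this inside a custom Analyzer/TokenizerFactory so the hidden code never leaks into the token stream.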
On Thu, Mar 20, 2008 at 12:39 PM, David King <[EMAIL PROTECTED]> wrote:
> > Unless you can come up with language-neutral tokenization and
> > stemming, you need to:
> > a) know the language of each document.
> > b) run a different analyzer depending on the language.
> > c) force the user to tell you the language of the query.
> > d) run the query through the same analyzer.
>
> I can do all of those. This implies storing all of the different
> languages in different fields, right? Then changing the default
> search-field to the language of the query for every query?
>
> > On Thu, Mar 20, 2008 at 12:17 PM, David King <[EMAIL PROTECTED]> wrote:
> >
> >>> You may be interested in a recent discussion that took place on a
> >>> similar subject:
> >>> http://www.mail-archive.com/solr-user@lucene.apache.org/msg09332.html
> >>
> >> Interesting, yes. But since it doesn't actually exist, it's not much
> >> help.
> >>
> >> I guess what I'm asking is: if my approach seems convoluted, I'm
> >> probably doing it wrong, so how *are* people solving the problem of
> >> searching over multiple languages? What is the canonical way to do
> >> this?
> >>
> >>> Nicolas
> >>>
> >>> -----Original Message-----
> >>> From: David King [mailto:[EMAIL PROTECTED]]
> >>> Sent: Wednesday, March 19, 2008 20:07
> >>> To: solr-user@lucene.apache.org
> >>> Subject: Language support
> >>>
> >>> This has probably been asked before, but I'm having trouble finding
> >>> it. Basically, we want to be able to search for content across
> >>> several languages, given that we know what language a datum and a
> >>> query are in. Is there an obvious way to do this?
> >>>
> >>> Here's the longer version: I am trying to index content that occurs
> >>> in multiple languages, including Asian languages. I'm in the
> >>> process of moving from PyLucene to Solr.
> >>> In PyLucene, I would have a list of analysers:
> >>>
> >>> analyzers = dict(en = pyluc.SnowballAnalyzer("English"),
> >>>                  cs = pyluc.CzechAnalyzer(),
> >>>                  pt = pyluc.SnowballAnalyzer("Portuguese"),
> >>>                  ...
> >>>
> >>> Then when I want to index something, I do
> >>>
> >>> writer = pyluc.IndexWriter(store, analyzer, create)
> >>> writer.addDocument(d.doc)
> >>>
> >>> That is, I tell Lucene the language of every datum, and the
> >>> analyser to use when writing out the field. Then when I want to
> >>> search against it, I do
> >>>
> >>> analyzer = LanguageAnalyzer.getanal(lang)
> >>> q = pyluc.QueryParser(field, analyzer).parse(value)
> >>>
> >>> And use that QueryParser to parse the query in the given language
> >>> before sending it off to PyLucene. (Off-topic: getanal() is perhaps
> >>> my favourite function name ever.) So the language of a given datum
> >>> is attached to the datum itself. In Solr, however, this appears to
> >>> be attached to the field, not to the individual data in it:
> >>>
> >>> <fieldType name="text_greek" class="solr.TextField">
> >>>   <analyzer class="org.apache.lucene.analysis.el.GreekAnalyzer"/>
> >>> </fieldType>
> >>>
> >>> Does this mean that there's no way to have a single "contents"
> >>> field that has content in multiple languages, and still have the
> >>> queries be parsed and stemmed correctly? How are other people
> >>> handling this? Does it make sense to write a tokeniser factory and
> >>> a query factory that look at, say, the 'lang' field and return the
> >>> correct tokenisers? Does this already exist?
> >>>
> >>> The other alternative is to have a text_zh field, a text_en field,
> >>> etc., and to modify the query to search on that field depending on
> >>> the language of the query, but that seems kind of hacky to me,
> >>> especially if a query may be against more than one language. Is
> >>> this the accepted way to go about it?
> >>> Is there a benefit to this method over writing a detecting
> >>> tokeniser factory?
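The per-language-field alternative discussed above (text_en, text_zh, ...) amounts to simple query routing, which can be sketched in a few lines. The field-naming convention and this helper are assumptions for illustration, not a real Solr client API:

```python
# Sketch: route a query to per-language fields such as text_en, text_zh.
# When the query may be in more than one language, OR the clauses
# together so each field is analyzed by its own language's chain.

def language_query(terms, langs):
    """Build a Solr/Lucene query string targeting one field per language."""
    clauses = ["text_%s:(%s)" % (lang, terms) for lang in langs]
    return " OR ".join(clauses)

print(language_query("stemming", ["en"]))
# text_en:(stemming)
print(language_query("chat", ["en", "fr"]))
# text_en:(chat) OR text_fr:(chat)
```

The benefit over a detecting tokeniser factory is that each field keeps a single, well-defined analysis chain in schema.xml; the cost is that the client must know (or guess) the candidate languages at query time.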