why not just using the StandardAnalyzer? it works pretty well even with Asian languages!
On Wed, Jan 19, 2011 at 12:23 AM, Shai Erera <ser...@gmail.com> wrote: > If you index documents, each in a different language, but all its fields > are > of the same language, then what you can do is the following: > > Create separate indexes per language > ------------------------------------------------------- > This will work and is not too hard to set up. Requires some maintenance > code > (e.g. directing a search request against the relevant index) but nothing > too > complicated. The advantage of using this approach is that you don't risk > running into issues like search for "die" when the language is German, yet > you will find documents in English indexed w/ that word. So your searches > are "language safe". A disadvantage is that if you ever require to do > cross-language operation, like search two languages, you need to do search > federation which is less good. Also, maintenance becomes a slight pain, > because you e.g. need to optimize multiple indexes, make sure they don't > try > to optimize at once, resulting in a sudden burst of IO. > > Create one index > ------------------------- > Here, you'd use IndexWriter.addDocument(doc, analyzer) method and pass the > proper Analyzer per the doc's language. That way, all your documents are > located in the same index so administration is really simple. They also > don't step on each other toes - each document is analyzed exactly as it > should. You might get into weird situations like the "die" example > (fetching > a document in incorrect language), but that's easily solvable by indexing > for each document a "language" field and use it as a Filter during the > search. You can cache that Filter so that its posting list isn't traversed > for every query but instead only once. > > We use the second approach and we're required to support 32 languages. > While > in most deployments the number never exceeds 3-4 languages, I know of some > that handle > 10. If you're careful enough, it just works. > > Hope this helps. > > Shai > > On Wed, Jan 19, 2011 at 9:44 AM, Paul Libbrecht <p...@hoplahup.net> wrote: > > > > > But for this, you need a skillfully designed: > > - set of fields > > - multiplexing analyzer > > - query expansion > > In one of my projects, we do not split language by fields and it's a > > pain... I'm having recurring issues in one sense or the other. > > - the "die" example that Oti s mentioned is a good one: stop-word in > > German, essential verb in English > > - I had recently issues with the contribution of the word Fourier (for > the > > name of series): in English it stays fourier, in French in becomes fouri. > > So: if the resource is contributed in French, the indexed value is fouri, > > English seekers won't find it; if the resource is contributed in English, > > French seekers won't find it. > > So my last lesson: always have a whitespace-lowercase unstemmed field > also > > at hand and prefer it over the others in your query expansion. > > > > A wiki page should probably be made. > > > > paul > > > > > > Le 19 janv. 2011 à 07:53, Vinaya Kumar Thimmappa a écrit : > > > I think we should be using lucene with snowball jar's which means one > > index for all languages (ofcourse size of index is always a matter of > > concerns). > > > > > > Hope this helps. > > > -vinaya > > > > > > On Tuesday 18 January 2011 11:23 PM, Clemens Wyss wrote: > > >> What is the "best practice" to support multiple languages, i.e. > > Lucene-Documents that have multiple language content/fields? > > >> Should > > >> a) each language be indexed in a seperate index/directory or should > > >> b) the Documents (in a single directory) hold the diverse localized > > fields? > > >> > > >> We most often will be searching "language dependent" which (at least > > performance wise) mandates one-directory-per-language... > > >> > > >> Any (lucene specific) white papers on this topic? > > >> > > >> Thx in advance > > >> Clemens > > >> > > >> --------------------------------------------------------------------- > > >> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > > >> For additional commands, e-mail: java-user-h...@lucene.apache.org > > >> > > >> > > > > > > --------------------------------------------------------------------- > > > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > > > For additional commands, e-mail: java-user-h...@lucene.apache.org > > > > > > > > > --------------------------------------------------------------------- > > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > > For additional commands, e-mail: java-user-h...@lucene.apache.org > > > > >