Re: Best practices for multiple languages?

Luca Rondanini Wed, 19 Jan 2011 09:59:51 -0800

Why not just use the StandardAnalyzer? It works pretty well, even with
Asian languages!




On Wed, Jan 19, 2011 at 12:23 AM, Shai Erera <[email protected]> wrote:

> If you index documents that are each in a different language, but all of a
> document's fields are in the same language, then you can do the following:
>
> Create separate indexes per language
> -------------------------------------------------------
> This works and is not too hard to set up. It requires some maintenance code
> (e.g. directing a search request to the relevant index), but nothing too
> complicated. The advantage of this approach is that you don't risk running
> into issues such as searching for "die" when the language is German and
> finding English documents indexed with that word. So your searches are
> "language safe". A disadvantage is that if you ever need a cross-language
> operation, such as searching two languages at once, you need search
> federation, which works less well. Maintenance also becomes a slight pain:
> you need, for example, to optimize multiple indexes and make sure they
> don't all optimize at once, which would cause a sudden burst of IO.
>
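The routing and federation described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the class `PerLanguageIndexes` and its methods are hypothetical, and each "index" is a plain in-memory list standing in for a real Lucene index, so no Lucene API is shown here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical stand-in for "one index per language": every language code
// maps to its own in-memory document list instead of a real Lucene index.
class PerLanguageIndexes {
    private final Map<String, List<String>> indexes = new HashMap<>();

    // Add a document to the index that belongs to its language.
    void add(String language, String text) {
        indexes.computeIfAbsent(language, k -> new ArrayList<>()).add(text);
    }

    // Route the request to exactly one language's index; documents in
    // other languages are never consulted, so the search is "language safe".
    List<String> search(String language, String term) {
        String needle = term.toLowerCase(Locale.ROOT);
        List<String> hits = new ArrayList<>();
        List<String> docs = indexes.getOrDefault(language, Collections.<String>emptyList());
        for (String doc : docs) {
            List<String> tokens = Arrays.asList(doc.toLowerCase(Locale.ROOT).split("\\s+"));
            if (tokens.contains(needle)) {
                hits.add(doc);
            }
        }
        return hits;
    }

    // Naive "federation": query each requested index in turn and concatenate
    // the results. A real setup would also have to merge scores.
    List<String> searchAcross(List<String> languages, String term) {
        List<String> hits = new ArrayList<>();
        for (String language : languages) {
            hits.addAll(search(language, term));
        }
        return hits;
    }
}
```

With an English document "never say die" and a German document "die Katze sitzt", a search for "die" routed to "de" only sees the German document, while `searchAcross` over both languages returns both.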
> Create one index
> -------------------------
> Here, you'd use the IndexWriter.addDocument(doc, analyzer) method and pass
> the proper Analyzer for the doc's language. That way, all your documents
> live in the same index, so administration is really simple. They also
> don't step on each other's toes: each document is analyzed exactly as it
> should be. You might get into weird situations like the "die" example
> (fetching a document in the wrong language), but that's easily solved by
> indexing a "language" field for each document and using it as a Filter
> during the search. You can cache that Filter so that its posting list is
> traversed only once instead of for every query.
>
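The idea of a cached "language" filter over a single index can be sketched as below. This is a minimal illustration, not Lucene code: the class `SingleIndexWithLanguageFilter`, its fields, and the `filterBuilds` counter are all hypothetical, and a `BitSet` per language stands in for a cached Filter.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.BitSet;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical single "index": parallel lists hold each document's text and
// its language field. A BitSet per language plays the role of a cached filter.
class SingleIndexWithLanguageFilter {
    private final List<String> texts = new ArrayList<>();
    private final List<String> languages = new ArrayList<>();
    private final Map<String, BitSet> filterCache = new HashMap<>();
    int filterBuilds = 0; // counts how often a filter was actually computed

    void add(String language, String text) {
        texts.add(text);
        languages.add(language);
        filterCache.clear(); // cached filters are stale once the index changes
    }

    // Build the language filter once, then reuse it for later queries.
    private BitSet languageFilter(String language) {
        return filterCache.computeIfAbsent(language, lang -> {
            filterBuilds++;
            BitSet bits = new BitSet(texts.size());
            for (int i = 0; i < languages.size(); i++) {
                if (languages.get(i).equals(lang)) {
                    bits.set(i);
                }
            }
            return bits;
        });
    }

    // Only documents whose language bit is set are considered at all.
    List<String> search(String language, String term) {
        String needle = term.toLowerCase(Locale.ROOT);
        BitSet allowed = languageFilter(language);
        List<String> hits = new ArrayList<>();
        for (int i = allowed.nextSetBit(0); i >= 0; i = allowed.nextSetBit(i + 1)) {
            List<String> tokens = Arrays.asList(texts.get(i).toLowerCase(Locale.ROOT).split("\\s+"));
            if (tokens.contains(needle)) {
                hits.add(texts.get(i));
            }
        }
        return hits;
    }
}
```

Running the same German query twice builds the filter only once; the second query reuses the cached bits, which mirrors the point about not traversing the posting list for every query.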
> We use the second approach and we're required to support 32 languages.
> While in most deployments the number never exceeds 3-4 languages, I know of
> some that handle more than 10. If you're careful enough, it just works.
>
> Hope this helps.
>
> Shai
>
> On Wed, Jan 19, 2011 at 9:44 AM, Paul Libbrecht <[email protected]> wrote:
>
> >
> > But for this, you need a skillfully designed:
> > - set of fields
> > - multiplexing analyzer
> > - query expansion
> > In one of my projects, we do not split languages by field and it's a
> > pain... I keep running into issues in one direction or the other.
> > - the "die" example that Otis mentioned is a good one: a stop word in
> > German, an essential verb in English
> > - I recently had issues with the word Fourier (as in the name of the
> > series): in English it stays fourier, in French it becomes fouri.
> > So: if the resource is contributed in French, the indexed value is fouri
> > and English seekers won't find it; if the resource is contributed in
> > English, French seekers won't find it.
> > So my last lesson: always have a whitespace-lowercase unstemmed field at
> > hand as well, and prefer it over the others in your query expansion.
> >
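The Fourier mismatch and the unstemmed fallback field can be illustrated with a small sketch. Everything here is an assumption made for the example: the class `UnstemmedFallback` is hypothetical, and the two "stemmers" are toy rules chosen only to reproduce the fourier/fouri behaviour described above; they are not the Snowball algorithms.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

class UnstemmedFallback {
    // Toy stemmer (hypothetical): French strips a trailing "er", English
    // strips a trailing "s". Real stemmers apply many more rules.
    static String toyStem(String token, String language) {
        if (language.equals("fr") && token.endsWith("er") && token.length() > 4) {
            return token.substring(0, token.length() - 2);
        }
        if (language.equals("en") && token.endsWith("s") && token.length() > 3) {
            return token.substring(0, token.length() - 1);
        }
        return token;
    }

    // The "whitespace-lowercase unstemmed" field: split on whitespace, lowercase.
    static Set<String> exactTokens(String text) {
        return new HashSet<>(Arrays.asList(text.trim().toLowerCase(Locale.ROOT).split("\\s+")));
    }

    // The stemmed field, analyzed with the document's own language.
    static Set<String> stemmedTokens(String text, String language) {
        Set<String> stems = new HashSet<>();
        for (String token : exactTokens(text)) {
            stems.add(toyStem(token, language));
        }
        return stems;
    }

    // Stemmed-only matching: the query is stemmed with the seeker's language.
    static boolean matchesStemmedOnly(String doc, String docLang, String query, String queryLang) {
        String q = toyStem(query.toLowerCase(Locale.ROOT), queryLang);
        return stemmedTokens(doc, docLang).contains(q);
    }

    // Expanded query: try the unstemmed field first, then the stemmed one.
    static boolean matchesExpanded(String doc, String docLang, String query, String queryLang) {
        if (exactTokens(doc).contains(query.toLowerCase(Locale.ROOT))) {
            return true;
        }
        return matchesStemmedOnly(doc, docLang, query, queryLang);
    }
}
```

For a French document containing "Fourier", the stemmed field holds "fouri", so an English query stemmed to "fourier" misses it; the expanded query still matches through the unstemmed field.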
> > A wiki page should probably be made.
> >
> > paul
> >
> >
> > Le 19 janv. 2011 à 07:53, Vinaya Kumar Thimmappa a écrit :
> > > I think we should be using Lucene with the Snowball jars, which means one
> > > index for all languages (of course, the size of the index is always a
> > > concern).
> > >
> > > Hope this helps.
> > > -vinaya
> > >
> > > On Tuesday 18 January 2011 11:23 PM, Clemens Wyss wrote:
> > >> What is the "best practice" for supporting multiple languages, i.e.
> > >> Lucene Documents that have content/fields in multiple languages?
> > >> Should
> > >> a) each language be indexed in a separate index/directory or should
> > >> b) the Documents (in a single directory) hold the various localized
> > >> fields?
> > >>
> > >> We will most often be searching "language dependent", which (at least
> > >> performance-wise) mandates one-directory-per-language...
> > >>
> > >> Are there any (Lucene-specific) white papers on this topic?
> > >>
> > >> Thx in advance
> > >> Clemens
> > >>
> > >> ---------------------------------------------------------------------
> > >> To unsubscribe, e-mail: [email protected]
> > >> For additional commands, e-mail: [email protected]
> > >>
> > >>
> > >
> > >
> >
> >
> >
> >
>
