Hi,

I'm trying to implement multiple language support in Solr Cloud (4.7).
Although we have different languages in the index, we have so far only
supported English for indexing and querying. To provide some context, our current
index size is 35 GB with close to 15 million documents. We've two shards
with two replicas per shard. I'm using a composite id to support
de-duplication, which routes documents having the same (dedup) field
value to the same shard.
The language is known in advance for every document being indexed, which
removes the need for runtime language detection. Similarly, the language
will be known at query time as well. Furthermore, there's no need for
multi-lingual support.

Based on my understanding so far, there are three approaches which are
widely adopted: multi-field indexing, multi-core indexing, and multiple
languages in one field (based on Solr in Action).

The first option seems easy to implement. But then, I've around 40 fields which
are getting indexed currently, though a majority of them are type="string"
and not being analyzed. I'm planning to support around 10 languages, which
translates to 400 field definitions in the same schema. And this is bound
to grow with the addition of languages and fields. My apprehension is whether
this approach becomes a maintenance nightmare. Does it affect overall
scalability? Does it affect any existing features like Suggester,
Spellcheck, etc.? I was thinking of including the language as part of the id
key. It'll look like "Language!Dedup_id!url" so that documents are spread
across the two shards.
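
To make the first option concrete, here is a minimal Python sketch of the
id layout and the field-count arithmetic. Only the "Language!Dedup_id!url"
layout and the 40 x 10 = 400 count come from the description above; the
function names, the "<field>_<language>" naming pattern and the sample
values are assumptions made purely for illustration.

```python
from itertools import product


def composite_id(language: str, dedup_id: str, url: str) -> str:
    """Build an id shaped like "Language!Dedup_id!url"."""
    # '!' is the separator, so it must not appear inside the two prefixes.
    for part in (language, dedup_id):
        if "!" in part:
            raise ValueError(f"prefix must not contain '!': {part!r}")
    return f"{language}!{dedup_id}!{url}"


def localized_fields(base_fields, languages):
    """List one field name per (field, language) pair, e.g. "title_en"."""
    # The "<field>_<language>" pattern is an assumed naming convention.
    return [f"{field}_{lang}" for field, lang in product(base_fields, languages)]


if __name__ == "__main__":
    # Hypothetical sample values.
    print(composite_id("en", "dedup123", "http://example.com/page"))
    fields = [f"field{i}" for i in range(40)]
    languages = [f"lang{i}" for i in range(10)]
    print(len(localized_fields(fields, languages)))  # 40 x 10 = 400
```

If only the analyzed fields were given per-language variants, the list
passed as base_fields would be much shorter than 40, since the
type="string" fields are not analyzed in the first place.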

The second option, a dedicated core per language, sounds easy in terms of
maintaining config files. Also, routing requests will be fairly easy, as the
language will always be known up front, both at index and at query time. But, as
I looked into the documents, 60% of our total index will be in English,
while the remaining 40% will be spread across the other 10-14 languages. Some
languages have only a few thousand documents, which perhaps doesn't merit a
dedicated core.
On top of that, this approach has the potential of getting into a complex
infrastructure, which might be hard to maintain.
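
As a rough sketch of what language-based routing could look like, the
Python below maps a language to a core name and builds a select URL. It
also shows one possible way of dealing with the low-volume languages,
namely letting them share a single catch-all core. The core names, the
DEDICATED set, the shared-core fallback and the base URL are all
hypothetical; only the /solr/<name>/select path shape is standard Solr.

```python
from urllib.parse import urlencode

# Hypothetical: languages with enough content to justify their own core.
DEDICATED = {"en", "de", "fr"}
# Hypothetical name of a shared core for low-volume languages.
SHARED_CORE = "docs_other"


def core_for(language: str) -> str:
    """Pick the core for a language that is known up front."""
    lang = language.lower()
    return f"docs_{lang}" if lang in DEDICATED else SHARED_CORE


def select_url(base_url: str, language: str, query: str) -> str:
    """Build the select URL for the core that serves this language."""
    params = urlencode({"q": query, "wt": "json"})
    return f"{base_url.rstrip('/')}/{core_for(language)}/select?{params}"


if __name__ == "__main__":
    print(select_url("http://localhost:8983/solr", "en", "solr cloud"))
    print(select_url("http://localhost:8983/solr", "is", "solr cloud"))
```

The same core_for lookup would be used on the indexing side, so that a
document and the queries for its language always hit the same core.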

I read about the use of multiple languages in a single field in Trey
Grainger's book. It looks like a great approach, but I'm not sure if it is meant
to address my scenario. My first impression is that it's more geared
towards supporting multi-lingual content, but I may be completely wrong. Also,
this is not supported by Solr / Lucene out of the box.

I know there are a lot of people in this group who have a great deal of
experience with supporting multiple languages in Solr. I'm trying to gather
their inputs / experience on best practices to help me decide on the right
approach. Any pointers on this will be highly appreciated.

Thanks,
Shamik
