Hi,

I'm trying to implement multiple language support in Solr Cloud (4.7). Although our index contains documents in different languages, so far we have only supported English for both indexing and querying.

To provide some context, our current index size is 35 GB with close to 15 million documents. We have two shards with two replicas per shard. I'm using a composite id to support de-duplication, which routes all documents with the same dedup field value to a specific shard. The language of every document is known before it is indexed, which removes the need for runtime language detection. Similarly, the language will be known at query time. Put another way, each document and each query is in a single language, so there's no need for multi-lingual support.
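To illustrate the composite id setup described above, here is a minimal Python sketch. It assumes the id has the form `dedup!url`, where the part before the `!` separator is the shard key used by the compositeId router. The function names and sample values are hypothetical and only meant to show the shape of the key, not the actual indexing code.

```python
SEPARATOR = "!"  # separator used by Solr's compositeId router


def composite_id(dedup_value: str, url: str) -> str:
    """Build an id of the assumed form '<dedup>!<url>'."""
    if SEPARATOR in dedup_value:
        # A '!' inside the shard key would change how the id is split.
        raise ValueError("dedup value must not contain '!'")
    return f"{dedup_value}{SEPARATOR}{url}"


def shard_key(doc_id: str) -> str:
    """Return the routing prefix (everything before the first '!')."""
    return doc_id.split(SEPARATOR, 1)[0]


# Hypothetical sample values: two documents with the same dedup value.
id_a = composite_id("abc123", "http://example.com/a")
id_b = composite_id("abc123", "http://example.com/b")
```

Both ids share the prefix `abc123`, so they carry the same shard key and the router places them on the same shard.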
Based on my understanding so far, there are three widely adopted approaches: multi-field indexing, multi-core indexing, and multiple languages in one field (based on Solr in Action).

The first option seems easy to implement. But I have around 40 fields being indexed currently, though a majority of them are type="string" and not analyzed. I'm planning to support around 10 languages, which translates to 400 field definitions in the same schema, and this is poised to grow as languages and fields are added. My apprehension is whether this approach becomes a maintenance nightmare. Does it affect overall scalability? Does it affect any existing features like Suggester, Spellcheck, etc.? I was thinking of including the language as part of the id key. It would look like "Language!Dedup_id!url" so that documents are spread across the two shards.

The second option, a dedicated core per language, sounds easy in terms of maintaining config files. Routing requests will also be fairly easy, as the language will always be known up front, both at indexing and at query time. But when I looked into the documents, 60% of our total index will be in English, while the remaining 40% will be spread across the other 10-14 languages. Some languages have only a few thousand documents, which perhaps doesn't merit a dedicated core. On top of that, this approach could lead to a complex infrastructure that might be hard to maintain.

I read about using multiple languages in a single field in Trey Grainger's book. It looks like a great approach, but I'm not sure it is meant to address my scenario. My first impression is that it's geared more towards supporting multi-lingual content, but I may be completely wrong. Also, this is not supported by Solr / Lucene out of the box.

I know there are a lot of people in this group who have excelled at supporting multiple languages in Solr.
I'm trying to gather their input and experience on best practices to help me decide on the right approach. Any pointers will be highly appreciated.

Thanks,
Shamik