Re: Multiple Languages in Same Core
In reply to Trey Grainger solrt...@gmail.com, Thu, Mar 27, 2014 at 6:03 PM

Thanks Trey! Last week I ordered the eBook. I look forward to seeing the information in it.

Jeremy
Re: Multiple Languages in Same Core
In reply to Liu Bo diabl...@gmail.com, Wed, Mar 26, 2014 at 4:34 AM

In addition to the two approaches Liu Bo mentioned (separate core per language and separate field per language), it is also possible to put multiple languages in a single field. This saves you the overhead of multiple cores and of having to search across multiple fields at query time.

The idea here is that you can run multiple analyzers (e.g. one for German, one for English, one for Chinese, etc.) and stack the outputted TokenStreams for each of these within a single field. It is also possible to swap out the languages you want to use on a case-by-case basis (e.g. per document, per field, or even per word) if you really need to for advanced use cases.

All three of these methods, including code examples and the pros and cons of each, are discussed in the Multilingual Search chapter of Solr in Action, which Alexandre referenced. If you don't have the book, you can also just download and run the code examples for free, though they may be harder to follow without the context from the book.

Thanks,

Trey Grainger
Co-author, Solr in Action
Director of Engineering, Search Analytics @CareerBuilder
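The single-field approach Trey describes in this thread (several analyzers whose token streams are stacked in one field) can be pictured with a toy sketch. This is plain Python, not the Lucene/Solr API and not the code from Solr in Action: the two "analyzers" are made-up normalizers, and `stack_tokens` is a hypothetical helper that only shows how terms from different languages can share the same token position.

```python
from typing import Callable, Dict, List, Tuple

Analyzer = Callable[[str], str]


def english_stem(token: str) -> str:
    """Toy English normalizer (assumption): lowercase and strip a plural 's'."""
    t = token.lower()
    return t[:-1] if t.endswith("s") and len(t) > 3 else t


def german_stem(token: str) -> str:
    """Toy German normalizer (assumption): lowercase and strip a trailing 'en'."""
    t = token.lower()
    return t[:-2] if t.endswith("en") and len(t) > 4 else t


def stack_tokens(text: str, analyzers: Dict[str, Analyzer]) -> List[Tuple[int, str]]:
    """Run every analyzer on each whitespace token and emit all distinct
    results at the same position, which is the 'stacking' idea."""
    stacked: List[Tuple[int, str]] = []
    for position, raw in enumerate(text.split()):
        seen = set()
        for lang in sorted(analyzers):  # fixed order keeps output deterministic
            term = analyzers[lang](raw)
            if term not in seen:
                seen.add(term)
                stacked.append((position, term))
    return stacked


if __name__ == "__main__":
    analyzers = {"en": english_stem, "de": german_stem}
    for position, term in stack_tokens("Books lesen", analyzers):
        print(position, term)
```

Because every variant sits at the same position, a query analyzed for either language can match the one field, at the cost of a larger index and some cross-language noise.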
Re: Multiple Languages in Same Core
In reply to Alexandre Rafalovitch arafa...@gmail.com, 25 March 2014 09:15

Hi Jeremy,

There are a lot of multi-language discussions, with two main approaches:

1. Like yours, one core per language.
2. All in one core, where each language has its own field.

We have multi-language support in a single core; each multilingual field has its own suffix, such as name_en_US. We customized the query handler to hide the query details from the client.

The main reason we want to do this is NRT indexing and search. Take a product, for example: a product has price and quantity, which are common fields used for filtering and sorting, while name and description are multi-language fields. If we split a product into different cores, updating a common field may end up as an update in all of the multi-language cores.

As to scalability, we don't change Solr cores/collections when a new language is added, but we probably need to update our customized index process and run a full re-index.

This approach suits our requirements for now, but you may have your own concerns.

We have a suggest filter problem similar to yours: we want to return suggest results filtered by store. I can't find a way to build the dictionary with a query in my version of Solr (4.6). What I do is run a query on an N-gram analyzed field, with filter queries on the store_id field. The suggest is actually a query. It may not perform as well as a real suggester, but it can do the trick.

You can try building an additional N-gram field for suggestions only and searching on it with an fq on your Locale field.

All the best

Liu Bo
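Liu Bo's two suggestions (a locale suffix per multilingual field, and "suggest as a query" against an N-gram field with a filter query) might look roughly like this on the client side. It is a minimal sketch, not Liu Bo's actual handler: the field name `suggest_ngram` and the helper names are hypothetical, and `Locale` stands in for whatever the locale field is really called. Only the standard `q`, `fq` and `rows` request parameters are used.

```python
from urllib.parse import urlencode


def localized_field(base: str, locale: str) -> str:
    """Field-per-language naming with a locale suffix, e.g. name_en_US."""
    return f"{base}_{locale}"


def escape_phrase(text: str) -> str:
    """Escape backslashes and double quotes so the text is safe inside a quoted phrase."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def suggest_params(prefix: str, filter_field: str, filter_value: str,
                   ngram_field: str = "suggest_ngram", rows: int = 5) -> dict:
    """Build request parameters for a 'suggest as a query' lookup:
    match the typed prefix on an N-gram field and restrict by a filter query."""
    return {
        "q": f'{ngram_field}:"{escape_phrase(prefix)}"',
        "fq": f"{filter_field}:{filter_value}",
        "rows": rows,
    }


if __name__ == "__main__":
    print(localized_field("name", "en_US"))
    # The encoded string would be appended to the core's select URL.
    print(urlencode(suggest_params("sho", "Locale", "en")))
```

The filter query keeps suggestions from other stores or locales out of the result, which is the effect a per-locale dictionary would otherwise provide.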
Multiple Languages in Same Core
I recently deployed Solr to back the site search feature of a site I work on. The site itself is available in hundreds of languages. With the initial release of site search we have enabled the feature for ten of those languages. This is distributed across eight cores, with two Chinese languages plus Korean combined into one CJK core and each of the other seven languages in their own individual cores. The reason for splitting these into separate cores was so that we could have the same field names across all cores but have different configuration for analyzers, etc., per core.

Now I have some questions on this approach.

1) Scalability: Considering I need to scale this to many dozens more languages, perhaps hundreds more, is there a better way so that I don't end up needing dozens or hundreds of cores? My initial plan was that many languages that didn't have special support within Solr would simply get lumped into a single default core that has some default analyzers that are applicable to the majority of languages.

1b) Related to this: is there a practical limit to the number of cores that can be run on one instance of Lucene?

2) Auto Suggest: In phase two I intend to add auto-suggestions as a user types a query. In reviewing how this is implemented and how the suggestion dictionary is built I have concerns. If I have more than one language in a single core (and I keep the same field name for suggestions on all languages within a core) then it seems that I could get suggestions from another language returned with a suggest query. Is there a way to build a separate dictionary for each language, but keep these languages within the same core?

If it's helpful to know: I have a field in every core for Locale. Values will be the locale of the language of that document, e.g. en, es, zh_hans, etc.

I'd like to be able to:

1) when building a suggestion dictionary, divide it into multiple dictionaries, grouping them by locale, and
2) supply a parameter to the suggest query that allows the suggest component to only return suggestions from the appropriate dictionary for that locale.

If the answer to #1 is "keep splitting groups of languages that have different analyzers into their own cores" and the answer to #2 is "that's not supported", then I'd be curious: where would I start to write my own extension that supported #2? I looked last night at the suggest lookup classes, dictionary classes, etc., but I didn't see a clear point where it would be clean to implement something like I'm suggesting above.

Best Regards,
Jeremy Thomerson
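The behaviour asked for in the two numbered points above (one dictionary per locale, plus a locale parameter at lookup time) can be modelled in a few lines. This is a toy in-memory illustration of the desired semantics only. It does not use or extend Solr's suggest component, and `build_dictionaries` and `suggest` are hypothetical names.

```python
from bisect import bisect_left
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def build_dictionaries(docs: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group suggestion terms by locale; each dictionary is a sorted,
    de-duplicated list of lowercased terms."""
    grouped: Dict[str, set] = defaultdict(set)
    for locale, term in docs:
        grouped[locale].add(term.lower())
    return {locale: sorted(terms) for locale, terms in grouped.items()}


def suggest(dictionaries: Dict[str, List[str]], locale: str,
            prefix: str, limit: int = 5) -> List[str]:
    """Return up to `limit` terms starting with `prefix`, taken only from
    the dictionary of the requested locale."""
    terms = dictionaries.get(locale, [])
    p = prefix.lower()
    start = bisect_left(terms, p)  # first term that sorts at or after the prefix
    matches: List[str] = []
    for term in terms[start:]:
        if not term.startswith(p) or len(matches) == limit:
            break
        matches.append(term)
    return matches


if __name__ == "__main__":
    docs = [("en", "Shoes"), ("en", "Shirt"), ("es", "Sombrero"), ("en", "Hat")]
    dictionaries = build_dictionaries(docs)
    print(suggest(dictionaries, "en", "sh"))
    print(suggest(dictionaries, "es", "sh"))
```

A real implementation would also need weights and incremental rebuilds, but the lookup contract (locale in, suggestions from that locale only) stays the same.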
Re: Multiple Languages in Same Core
In reply to Jeremy Thomerson jer...@thomersonfamily.com, Tue, Mar 25, 2014 at 4:43 AM

Solr In Action has a significant discussion on the multi-lingual approach. They also have some code samples out there. Might be worth a look.

Regards,
Alex.

Personal website: http://www.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all at once. Lately, it doesn't seem to be working. (Anonymous - via GTD book)