Re: Multiple Languages in Same Core

2014-03-31 Thread Jeremy Thomerson
Thanks Trey! Last week I ordered the eBook. I look forward to seeing the
information in it.

Jeremy



Re: Multiple Languages in Same Core

2014-03-27 Thread Trey Grainger
In addition to the two approaches Liu Bo mentioned (separate core per
language and separate field per language), it is also possible to put
multiple languages in a single field. This saves you the overhead of
multiple cores and of having to search across multiple fields at query
time. The idea here is that you can run multiple analyzers (e.g. one for
German, one for English, one for Chinese) and stack the output
TokenStreams for each of these within a single field. It is also possible
to swap out the languages you want to use on a case-by-case basis (e.g.
per document, per field, or even per word) if you really need to for
advanced use cases.
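
For illustration, a minimal Lucene-level sketch of this stacking idea (not
necessarily the book's implementation, and assuming a recent Lucene where
the language analyzers take no Version argument): adding the same field
several times, each wrapping a different analyzer's TokenStream, appends
all of the tokens within that one field.

import java.io.IOException;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.de.GermanAnalyzer;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;

public class StackedLanguageField {
    // Analyze one text value with several language analyzers and add
    // each resulting TokenStream to the same field, so all tokens end
    // up stacked in the single "content_multi" field.
    static void addMultilingualDoc(IndexWriter writer, String text) throws IOException {
        Document doc = new Document();
        Analyzer[] analyzers = { new EnglishAnalyzer(), new GermanAnalyzer() };
        for (Analyzer analyzer : analyzers) {
            TokenStream ts = analyzer.tokenStream("content_multi", text);
            // Multi-valued field instances are appended at index time,
            // separated by the field's positionIncrementGap.
            doc.add(new TextField("content_multi", ts));
        }
        writer.addDocument(doc);
    }
}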

All three of these methods, including code examples and the pros and cons
of each, are discussed in the Multilingual Search chapter of Solr in Action,
which Alexandre referenced. If you don't have the book, you can also just
download and run the code examples for free, though they may be harder to
follow without the context from the book.

Thanks,

Trey Grainger
Co-author, Solr in Action
Director of Engineering, Search & Analytics @CareerBuilder






Re: Multiple Languages in Same Core

2014-03-26 Thread Liu Bo
Hi Jeremy

There have been a lot of multi-language discussions; the two main approaches are:
 1. like yours, one core per language
 2. all in one core, where each language has its own field.

We have multi-language support in a single core; each multilingual field
has its own suffix, such as name_en_US. We customized the query handler to
hide the query details from the client.
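
For illustration, per-language suffix fields like these are typically
declared with dynamicField rules in schema.xml. A sketch, with hypothetical
type names rather than Liu Bo's actual schema:

<!-- One dynamicField per supported locale; the suffix selects the analyzer chain. -->
<dynamicField name="*_en_US" type="text_en"  indexed="true" stored="true"/>
<dynamicField name="*_de_DE" type="text_de"  indexed="true" stored="true"/>
<dynamicField name="*_zh_CN" type="text_cjk" indexed="true" stored="true"/>

The query-side rewrite could live in several extension points; one common
way is a SearchComponent that maps a logical field name to its
locale-suffixed variant before the query runs. A Solr 4.x-era sketch, with
made-up class and parameter names (not Liu Bo's actual code):

import java.io.IOException;
import org.apache.solr.common.params.CommonParams;
import org.apache.solr.common.params.ModifiableSolrParams;
import org.apache.solr.common.params.SolrParams;
import org.apache.solr.handler.component.ResponseBuilder;
import org.apache.solr.handler.component.SearchComponent;

public class LocaleFieldComponent extends SearchComponent {
  public void prepare(ResponseBuilder rb) throws IOException {
    SolrParams params = rb.req.getParams();
    String locale = params.get("locale", "en_US"); // client-supplied locale
    String q = params.get(CommonParams.Q, "");
    // Naive rewrite of name: into name_<locale>:; a real implementation
    // would parse the query rather than string-replace.
    ModifiableSolrParams rewritten = new ModifiableSolrParams(params);
    rewritten.set(CommonParams.Q, q.replace("name:", "name_" + locale + ":"));
    rb.req.setParams(rewritten);
  }

  public void process(ResponseBuilder rb) throws IOException {
    // Nothing to do per request here; the rewrite happens in prepare().
  }

  public String getDescription() {
    return "Rewrites logical field names to locale-suffixed fields";
  }

  public String getSource() {
    return null;
  }
}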
The main reason we want to do this is NRT indexing and search;
take a product for example:

a product has price and quantity, which are common fields used for
filtering and sorting, while name and description are multi-language
fields; if we split products into different cores, an update to a common
field may end up as an update in all of the multi-language cores.

As to scalability: we don't change Solr cores/collections when a new
language is added, but we probably need to update our customized indexing
process and run a full re-index.

This approach suits our requirements for now, but you may have your own
concerns.

We have a similar suggest-filtering problem to yours: we want to return
suggest results filtered by store, and I can't find a way to build the
dictionary from a query in my version of Solr (4.6).

What I do instead is run a query on an N-gram-analyzed field with filter
queries on the store_id field. The suggestion is actually a query. It may
not perform as well as the suggester, but it can do the trick.

You could try the same: build an additional N-gram field for suggestions
only, and search on it with an fq on your Locale field.
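
A sketch of that trick, with hypothetical field and type names: an edge
n-gram index analyzer makes prefix matching cheap, while the query analyzer
leaves the typed prefix whole.

<fieldType name="text_suggest" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="20"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
<field name="suggest" type="text_suggest" indexed="true" stored="true"/>

Each keystroke then becomes an ordinary filtered query, for example:

http://localhost:8983/solr/products/select?q=suggest:iph&fq=locale:en&fl=name&rows=5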

All the best

Liu Bo






Multiple Languages in Same Core

2014-03-24 Thread Jeremy Thomerson
I recently deployed Solr to back the site search feature of a site I work
on. The site itself is available in hundreds of languages. With the initial
release of site search we have enabled the feature for ten of those
languages. This is distributed across eight cores, with two Chinese
languages plus Korean combined into one CJK core and each of the other
seven languages in their own individual cores. The reason for splitting
these into separate cores was so that we could have the same field names
across all cores but have different configuration for analyzers, etc., per
core.

Now I have some questions on this approach.

1) Scalability: Considering I need to scale this to many dozens more
languages, perhaps hundreds more, is there a better way so that I don't end
up needing dozens or hundreds of cores? My initial plan was that many
languages that didn't have special support within Solr would simply get
lumped into a single default core that has some default analyzers that
are applicable to the majority of languages.

1b) Related to this: is there a practical limit to the number of cores that
can be run on one instance of Lucene?

2) Auto Suggest: In phase two I intend to add auto-suggestions as a user
types a query. In reviewing how this is implemented and how the suggestion
dictionary is built I have concerns. If I have more than one language in a
single core (and I keep the same field name for suggestions on all
languages within a core) then it seems that I could get suggestions from
another language returned with a suggest query. Is there a way to build a
separate dictionary for each language, but keep these languages within the
same core?

If it's helpful to know: I have a field in every core for "Locale". Values
will be the locale of the language of that document, i.e. "en", "es",
"zh_hans", etc. I'd like to be able to: 1) when building a suggestion
dictionary, divide it into multiple dictionaries, grouping them by locale,
and 2) supply a parameter to the suggest query that allows the suggest
component to only return suggestions from the appropriate dictionary for
that locale.

If the answer to #1 is "keep splitting groups of languages that have
different analyzers into their own cores" and the answer to #2 is "that's
not supported", then I'd be curious: where would I start to write my own
extension that supported #2? I looked last night at the suggest lookup
classes, dictionary classes, etc. But I didn't see a clear point where it
would be clean to implement something like I'm suggesting above.

Best Regards,
Jeremy Thomerson


Re: Multiple Languages in Same Core

2014-03-24 Thread Alexandre Rafalovitch
Solr In Action has a significant discussion on the multi-lingual
approach. They also have some code samples out there. Might be worth a
look.

Regards,
   Alex.
Personal website: http://www.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- "Time is the quality of nature that keeps events from happening all
at once. Lately, it doesn't seem to be working." (Anonymous, via GTD
book)

