Re: multilingual content and indexing
On Tue, Jul 12, 2016 at 3:53 PM, Lukas Kahwe Smith wrote:
>> Alternatively, you can create different index definitions for each subtree
>> (see [1]), e.g. using the “includedPaths” property. This would lead to
>> smaller indexes, with the downside that you would have to create an index
>> definition if you add a new language tree.

Another way would be to have your index definition under each node:

/content/en/oak:index/fooIndex
/content/jp/oak:index/fooIndex

and have each index's analyzer configured as per the language.

Chetan Mehrotra
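As a rough illustration of the layout described above, the following sketch builds one index definition per language tree as plain Python dictionaries. It does not talk to a repository. The property names (`type`, `async`, `analyzers/default/class`) follow my reading of the Oak Lucene documentation and should be checked against it, and the analyzer class names are assumptions about what is available on the classpath.

```python
# Hypothetical mapping from language subtree to a Lucene analyzer class.
# Which analyzers are actually deployable in a given Oak setup is an assumption.
ANALYZERS = {
    "en": "org.apache.lucene.analysis.en.EnglishAnalyzer",
    "jp": "org.apache.lucene.analysis.ja.JapaneseAnalyzer",
}


def index_definitions(root="/content", name="fooIndex", analyzers=ANALYZERS):
    """Return {index path: node properties}, one definition per language tree."""
    definitions = {}
    for lang, analyzer_class in sorted(analyzers.items()):
        path = f"{root}/{lang}/oak:index/{name}"
        definitions[path] = {
            "jcr:primaryType": "oak:QueryIndexDefinition",
            "type": "lucene",
            "async": "async",
            # Child nodes are modelled as nested dicts: analyzers/default/class.
            "analyzers": {"default": {"class": analyzer_class}},
        }
    return definitions


if __name__ == "__main__":
    for path, props in index_definitions().items():
        print(path, "->", props["analyzers"]["default"]["class"])
```

Adding a new language tree then means adding one entry to the mapping, which mirrors the maintenance cost mentioned in the quoted text.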
Re: multilingual content and indexing
> On 12 Jul 2016, at 12:15, Michael Marth wrote:
>
> Hi Lukas,
>
> I am not entirely sure what you want to achieve (or what exactly you mean
> with “dealing with multi language content”), but trying to answer a bit:
>
> Let’s say you have distinct content trees for different languages, e.g.
> /content/en
> /content/jp
> etc.
>
> You can choose to index all these trees in one (Lucene) index for full text
> search and filter the results in your query, i.e. put the burden on the query
> engine.
> This is a simple setup which leads to a large index (although I personally
> have not seen this to be a problem)

For example, if you index multilingual content under the same field while doing monolingual searches, you tend to get suboptimal sorting, since the word distribution values of one language affect those of another.

> Alternatively, you can create different index definitions for each subtree
> (see [1]), e.g. using the “includedPaths” property. This would lead to
> smaller indexes, with the downside that you would have to create an index
> definition if you add a new language tree.
> This approach has the additional benefit that you can define
> language-specific Lucene analyzers for each sub tree, so that e.g. in the
> example above the Japanese index would have its own analyzer.

OK, so it's possible to tweak this with the standard indexer in Oak without having to switch to an external indexer like Solr just for this. Good to hear.

regards,
Lukas Kahwe Smith
sm...@pooteeweet.org
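To make the point about word distributions concrete, here is a small worked sketch. It uses the inverse document frequency formula of Lucene's classic TF-IDF similarity, idf = 1 + ln(numDocs / (docFreq + 1)). Whether a given Oak setup scores with that similarity is an assumption, and all document counts are made up for illustration.

```python
import math


def classic_idf(num_docs, doc_freq):
    """IDF as defined by Lucene's classic TF-IDF similarity."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


# Hypothetical corpus: two language trees of equal size.
en_docs, jp_docs = 1000, 1000
# Hypothetical English term found in 10 English documents and no Japanese ones.
doc_freq = 10

separate = classic_idf(en_docs, doc_freq)            # English-only index
combined = classic_idf(en_docs + jp_docs, doc_freq)  # one index for both trees

if __name__ == "__main__":
    print(f"idf in English-only index: {separate:.3f}")
    print(f"idf in combined index:     {combined:.3f}")
```

In the combined index the term looks rarer than it is within English content, so it is weighted more heavily, which is one way ranking for a monolingual search can shift.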
Re: multilingual content and indexing
Hi Lukas,

I am not entirely sure what you want to achieve (or what exactly you mean with “dealing with multi language content”), but trying to answer a bit:

Let’s say you have distinct content trees for different languages, e.g.
/content/en
/content/jp
etc.

You can choose to index all these trees in one (Lucene) index for full text search and filter the results in your query, i.e. put the burden on the query engine.
This is a simple setup which leads to a large index (although I personally have not seen this to be a problem).

Alternatively, you can create different index definitions for each subtree (see [1]), e.g. using the “includedPaths” property. This would lead to smaller indexes, with the downside that you would have to create an index definition if you add a new language tree.
This approach has the additional benefit that you can define language-specific Lucene analyzers for each sub tree, so that e.g. in the example above the Japanese index would have its own analyzer.

HTH
Michael

[1] http://jackrabbit.apache.org/oak/docs/query/lucene.html

On 12/07/16 10:15, "Lukas Kahwe Smith" wrote:

>Aloha,
>
>I did a bit of searching but didn’t find anything specific on any plans for
>dealing with multi language content in any specific way inside Oak.
>Specifically I am wondering, as indexing all content from different languages
>together can lead to suboptimal sorting and needless overhead. So are there
>any plans to deal with this specifically?
>
>If not inside Oak, are there any projects on top of Oak (or inside AEM) that
>deal with this?
>
>Or is this basically considered to be a case where one needs to plug in a
>custom indexer and figure it out on one's own?
>
>regards,
>Lukas Kahwe Smith
>sm...@pooteeweet.org
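The first option above (one index for all trees, with the language filter in the query) could look roughly like this. The helper builds a JCR-SQL2 full text query restricted to one language tree. The function name and the `nt:base` default are hypothetical choices for illustration.

```python
def fulltext_query(language_root, search_term, node_type="nt:base"):
    """Build a JCR-SQL2 full text query limited to one language subtree."""

    def quote(value):
        # JCR-SQL2 string literals escape a single quote by doubling it.
        # This does not escape the full text expression syntax of CONTAINS.
        return "'" + value.replace("'", "''") + "'"

    return (
        f"SELECT * FROM [{node_type}] AS s "
        f"WHERE ISDESCENDANTNODE(s, {quote(language_root)}) "
        f"AND CONTAINS(s.*, {quote(search_term)})"
    )


if __name__ == "__main__":
    print(fulltext_query("/content/en", "oak"))
```

The path restriction narrows the result set, but the shared index still holds the terms of every language.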
multilingual content and indexing
Aloha,

I did a bit of searching but didn’t find anything specific on any plans for dealing with multi language content in any specific way inside Oak. Specifically I am wondering, as indexing all content from different languages together can lead to suboptimal sorting and needless overhead. So are there any plans to deal with this specifically?

If not inside Oak, are there any projects on top of Oak (or inside AEM) that deal with this?

Or is this basically considered to be a case where one needs to plug in a custom indexer and figure it out on one's own?

regards,
Lukas Kahwe Smith
sm...@pooteeweet.org