Re: multilingual content and indexing

2016-07-12 Thread Chetan Mehrotra
On Tue, Jul 12, 2016 at 3:53 PM, Lukas Kahwe Smith wrote:
>> Alternatively, you can create different index definitions for each subtree
>> (see [1]), e.g. using the “includedPaths” property. This would lead to
>> smaller indexes, with the downside that you would have to create an index
>> definition if you add a new language tree.

Another way would be to have an index definition under each language node:

/content/en/oak:index/fooIndex
/content/jp/oak:index/fooIndex

and to configure the analyzer in each index definition for that language.
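
A minimal sketch of that layout, written as plain Python data rather than real repository content. The analyzers/default/class location follows the Oak Lucene documentation; the analyzer class names are assumptions and need to be available in your deployment, and the helper name is made up for illustration.

```python
# Hypothetical mapping from language tree to Lucene analyzer class.
ANALYZER_BY_TREE = {
    "/content/en": "org.apache.lucene.analysis.en.EnglishAnalyzer",
    "/content/jp": "org.apache.lucene.analysis.ja.JapaneseAnalyzer",
}

def nested_index_layout(index_name="fooIndex"):
    """Map the analyzer property path of each per-tree index to its class name."""
    layout = {}
    for tree, analyzer in ANALYZER_BY_TREE.items():
        definition_path = f"{tree}/oak:index/{index_name}"
        layout[f"{definition_path}/analyzers/default/class"] = analyzer
    return layout

for path, analyzer in nested_index_layout().items():
    print(path, "=", analyzer)
```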

Chetan Mehrotra


Re: multilingual content and indexing

2016-07-12 Thread Lukas Kahwe Smith

> On 12 Jul 2016, at 12:15, Michael Marth wrote:
> 
> Hi Lukas,
> 
> I am not entirely sure what you want to achieve (or what exactly you mean
> by “dealing with multi language content”), but trying to answer a bit:
> 
> Let’s say you have distinct content trees for different languages, e.g.
> /content/en
> /content/jp
> etc.
> 
> You can choose to index all these trees in one (Lucene) index for full text
> search and filter the results in your query, i.e. put the burden on the query
> engine.
> This is a simple setup which leads to a large index (although I personally
> have not seen this to be a problem).

For example, if you index multilingual content under the same field while doing
monolingual searches, you tend to get suboptimal sorting, since the word
distribution values of one language affect the word distribution of another.
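
A toy illustration of the effect (a sketch, not Lucene's actual scoring code): a smoothed inverse document frequency depends on how many documents contain a term and how many documents the index holds, so adding documents in another language shifts term weights. The two tiny corpora and the idf helper are made up for this example.

```python
import math

def idf(term, docs):
    """Smoothed inverse document frequency: log((N + 1) / (df + 1)) + 1."""
    df = sum(1 for doc in docs if term in doc.split())
    return math.log((len(docs) + 1) / (df + 1)) + 1

english = ["die hard review", "die another day", "hard rain"]
german = ["die katze trinkt", "die sonne scheint", "die zeit rennt"]

# "die" is an ordinary content word in English but an article in German.
idf_die_english = idf("die", english)
idf_hard_english = idf("hard", english)
idf_die_mixed = idf("die", english + german)
idf_hard_mixed = idf("hard", english + german)

print(f"English only: die={idf_die_english:.3f} hard={idf_hard_english:.3f}")
print(f"Mixed index:  die={idf_die_mixed:.3f} hard={idf_hard_mixed:.3f}")
```

In the English-only corpus both terms weigh the same; in the mixed corpus the German documents push the weight of "die" down and that of "hard" up, which changes the ordering of English results even for a purely English query.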

> Alternatively, you can create different index definitions for each subtree
> (see [1]), e.g. using the “includedPaths” property. This would lead to
> smaller indexes, with the downside that you would have to create an index
> definition if you add a new language tree.
> This approach has the additional benefit that you can define
> language-specific Lucene analyzers for each subtree, so that e.g. in the
> example above the Japanese index would have its own analyzer.

OK, so it’s possible to tweak this with the standard indexer in Oak without
having to switch to an external indexer like Solr just for this. Good to hear.

regards,
Lukas Kahwe Smith
sm...@pooteeweet.org







Re: multilingual content and indexing

2016-07-12 Thread Michael Marth
Hi Lukas,

I am not entirely sure what you want to achieve (or what exactly you mean by
“dealing with multi language content”), but trying to answer a bit:

Let’s say you have distinct content trees for different languages, e.g.
/content/en
/content/jp
etc.

You can choose to index all these trees in one (Lucene) index for full text
search and filter the results in your query, i.e. put the burden on the query
engine.
This is a simple setup which leads to a large index (although I personally have
not seen this to be a problem).
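
As a sketch of this "filter in the query" variant: with one full-text index over all trees, each query carries a path restriction for the language tree. The helper name and the use of nt:base are assumptions for illustration; ISDESCENDANTNODE and CONTAINS are standard JCR-SQL2 constructs.

```python
def fulltext_query(language_root, text):
    """Build a JCR-SQL2 full-text query restricted to one language tree."""
    # Escape single quotes by doubling them, as in SQL string literals.
    safe_root = language_root.replace("'", "''")
    safe_text = text.replace("'", "''")
    return (
        "SELECT * FROM [nt:base] AS s "
        f"WHERE ISDESCENDANTNODE(s, '{safe_root}') "
        f"AND CONTAINS(s.*, '{safe_text}')"
    )

query = fulltext_query("/content/jp", "oak")
print(query)
```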

Alternatively, you can create different index definitions for each subtree (see
[1]), e.g. using the “includedPaths” property. This would lead to smaller
indexes, with the downside that you would have to create an index definition if
you add a new language tree.
This approach has the additional benefit that you can define language-specific
Lucene analyzers for each subtree, so that e.g. in the example above the
Japanese index would have its own analyzer.
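
A rough sketch of what such per-language definitions could look like, expressed as Python data and printed as JSON rather than written to a repository. The property names (type, async, compatVersion, includedPaths, queryPaths, analyzers/default/class) are the ones described in the Oak Lucene documentation [1]; the index names, the helper, and the analyzer classes are assumptions, and the analyzers need to be available in your deployment.

```python
import json

# Assumed analyzer class per language tree; check that it is on the classpath.
ANALYZERS = {
    "en": "org.apache.lucene.analysis.en.EnglishAnalyzer",
    "jp": "org.apache.lucene.analysis.ja.JapaneseAnalyzer",
}

def index_definition(language):
    """Describe one Lucene index definition limited to /content/<language>."""
    root = f"/content/{language}"
    return {
        "jcr:primaryType": "oak:QueryIndexDefinition",
        "type": "lucene",
        "async": "async",
        "compatVersion": 2,
        "includedPaths": [root],  # only index content below this subtree
        "queryPaths": [root],     # only serve queries restricted to this subtree
        "analyzers": {"default": {"class": ANALYZERS[language]}},
    }

# Hypothetical index names, one definition per language tree.
definitions = {
    f"/oak:index/fulltext-{language}": index_definition(language)
    for language in ANALYZERS
}
print(json.dumps(definitions, indent=2))
```

Adding a new language tree then means adding one entry to the mapping and creating the corresponding definition, which is the maintenance cost mentioned above.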

HTH
Michael

[1] http://jackrabbit.apache.org/oak/docs/query/lucene.html



On 12/07/16 10:15, "Lukas Kahwe Smith" wrote:

>Aloha,
>
>I did a bit of searching but didn’t find anything specific on any plans for
>dealing with multi language content in any specific way inside Oak.
>Specifically, I am wondering because indexing all content from different
>languages together can lead to suboptimal sorting and needless overhead. So
>are there any plans to deal with this specifically?
>
>If not inside Oak, are there any projects on top of Oak (or inside AEM) that
>deal with this?
>
>Or is this basically considered to be a case where one needs to plug in a
>custom indexer and figure it out on one’s own?
>
>regards,
>Lukas Kahwe Smith
>sm...@pooteeweet.org
>
>
>


multilingual content and indexing

2016-07-12 Thread Lukas Kahwe Smith
Aloha,

I did a bit of searching but didn’t find anything specific on any plans for
dealing with multi language content in any specific way inside Oak.
Specifically, I am wondering because indexing all content from different
languages together can lead to suboptimal sorting and needless overhead. So are
there any plans to deal with this specifically?

If not inside Oak, are there any projects on top of Oak (or inside AEM) that
deal with this?

Or is this basically considered to be a case where one needs to plug in a custom
indexer and figure it out on one’s own?

regards,
Lukas Kahwe Smith
sm...@pooteeweet.org




