umlauts / diacritic expansion

2019-04-16 Thread Michael Sokolov
I'm learning how to index and search German today, and I understand that
vowels with umlauts are conventionally expanded into two ASCII
characters, e.g. "für" -> "fuer". People may search for the expanded
form "fuer", but they might also search with the diacritic ("für"), or
lazily search using the stripped form "fur".

My question: is there a standard CharFilter or TokenFilter that
expands to both (ASCII) forms, for characters with umlauts and perhaps
other diacritics I might be unaware of in other languages having
similar multiple renderings in ASCII?

-Mike
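
To make the three spellings in the question concrete, here is a minimal
plain-Java sketch that produces the original, expanded, and stripped forms of
a word. It is not a Lucene CharFilter or TokenFilter; the class name
UmlautVariants, the method variants, and the two lookup tables are
hypothetical and cover only the German umlauts and ß.

```java
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

public class UmlautVariants {
    // Two-letter ASCII renderings (assumed table, German only).
    private static final Map<Character, String> EXPANDED = Map.of(
        '\u00e4', "ae", '\u00f6', "oe", '\u00fc', "ue",   // ä ö ü
        '\u00c4', "Ae", '\u00d6', "Oe", '\u00dc', "Ue",   // Ä Ö Ü
        '\u00df', "ss");                                  // ß

    // Renderings with the diacritic simply dropped.
    private static final Map<Character, String> STRIPPED = Map.of(
        '\u00e4', "a", '\u00f6', "o", '\u00fc', "u",
        '\u00c4', "A", '\u00d6', "O", '\u00dc', "U",
        '\u00df', "ss");

    // Replace every character that has an entry in the table.
    private static String apply(String word, Map<Character, String> table) {
        StringBuilder sb = new StringBuilder();
        for (char c : word.toCharArray()) {
            sb.append(table.getOrDefault(c, String.valueOf(c)));
        }
        return sb.toString();
    }

    // Original form first, then the expanded and stripped forms (duplicates collapse).
    public static Set<String> variants(String word) {
        Set<String> out = new LinkedHashSet<>();
        out.add(word);
        out.add(apply(word, EXPANDED));
        out.add(apply(word, STRIPPED));
        return out;
    }
}
```

Calling UmlautVariants.variants("für") yields the forms für, fuer, and fur in
that order. Indexing all of them is one way to match every spelling; reducing
them all to a single form is the other.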

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: umlauts / diacritic expansion

2019-04-16 Thread Ralf Heyde
Hey,

Take a look at ASCIIFoldingFilter; it is quite generic.

Does this answer your question?

Cheers Ralf

Sent from my iPhone




RE: umlauts / diacritic expansion

2019-04-16 Thread Markus Jelsma
Hello Michael,

To have ü and ue treated alike, take a look at GermanNormalizationFilter [1],
which folds both to u.

Regards,
Markus

[1] https://lucene.apache.org/core/7_6_0/analyzers-common/org/apache/lucene/analysis/de/GermanNormalizationFilter.html
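
The rules documented for GermanNormalizationFilter (ß becomes ss; ä, ö, ü
become a, o, u; ae and oe become a and o; ue becomes u when the u does not
follow a vowel or q) can be approximated in plain Java as below. This is a
simplified sketch written from that description, not Lucene's implementation;
the class name GermanFoldSketch and its methods are hypothetical, and the
input is assumed to be lowercased already.

```java
public class GermanFoldSketch {
    // Vowels (including umlauts) and q, which block the "ue" -> "u" rule.
    private static boolean isVowelOrQ(char c) {
        return "aeiouyq\u00e4\u00f6\u00fc".indexOf(c) >= 0;
    }

    public static String normalize(String term) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < term.length(); i++) {
            char c = term.charAt(i);
            char prev = i > 0 ? term.charAt(i - 1) : ' ';
            switch (c) {
                case '\u00e4': out.append('a'); break;   // ä -> a
                case '\u00f6': out.append('o'); break;   // ö -> o
                case '\u00fc': out.append('u'); break;   // ü -> u
                case '\u00df': out.append("ss"); break;  // ß -> ss
                case 'e': {
                    // Drop the e of ae/oe, and of ue unless the u follows a vowel or q.
                    boolean afterAO = prev == 'a' || prev == 'o';
                    boolean afterU = prev == 'u'
                        && !(i > 1 && isVowelOrQ(term.charAt(i - 2)));
                    if (!(afterAO || afterU)) {
                        out.append('e');
                    }
                    break;
                }
                default:
                    out.append(c);
            }
        }
        return out.toString();
    }
}
```

With this sketch "für", "fuer", and "fur" all normalize to "fur", so the three
query spellings meet at one indexed term, while "quelle" and "neue" are left
unchanged. Because the rule is a heuristic, it also shortens some words where
ue is not an umlaut spelling.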

 
 



Re: umlauts / diacritic expansion

2019-04-16 Thread Ralf Heyde
Ah sorry, I had that wrong: ASCIIFoldingFilter does not turn umlauts into
ue/ae. It folds ü to u and ä to a (ß does become ss).

If you use fuzzy matching based on Levenshtein distance, you could allow an
edit distance of 1 or 2; that might be close to what you need.
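
To check the "distance of 1 or 2" suggestion against the example word, here is
a small self-contained Levenshtein implementation in plain Java. The class
name EditDistanceSketch is hypothetical and this is the textbook
dynamic-programming algorithm, not Lucene's fuzzy matching code.

```java
public class EditDistanceSketch {
    // Classic two-row dynamic-programming Levenshtein distance
    // (insertions, deletions, and substitutions each cost 1).
    public static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // distance from the empty prefix of a
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // distance to the empty prefix of b
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(
                    Math.min(curr[j - 1] + 1, prev[j] + 1),
                    prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }
}
```

For the example, "für" and "fur" are 1 edit apart, "fuer" and "fur" are 1 edit
apart, and "für" and "fuer" are 2 edits apart, so a maximum distance of 2
covers all three spellings. The same threshold also admits unrelated words
(for instance "fur" and "for" are 1 edit apart), so fuzzy matching is looser
than normalizing the spellings.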

Sent from my iPhone




Re: umlauts / diacritic expansion

2019-04-17 Thread Michael Sokolov
Thanks, GermanNormalizationFilter does seem to address this problem.




Re: umlauts / diacritic expansion

2019-04-17 Thread Michael Sokolov
Right, ASCIIFoldingFilter seems to map Ü [LATIN CAPITAL LETTER U
WITH DIAERESIS] to "U", not "UE".
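
The stripping behaviour (Ü to U rather than UE) can be reproduced with the
Java standard library alone: decompose to NFD and delete the combining marks.
This is only an approximation for illustration, not ASCIIFoldingFilter itself;
the class name DiacriticStripSketch is hypothetical, and unlike the Lucene
filter this approach leaves ß untouched because ß has no decomposition.

```java
import java.text.Normalizer;

public class DiacriticStripSketch {
    // Decompose accented letters into base letter + combining mark (NFD),
    // then remove every character in the Unicode "Mark" category.
    public static String stripDiacritics(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}", "");
    }
}
```

Here stripDiacritics("Über") gives "Uber" and stripDiacritics("für") gives
"fur", which is the stripped form from the original question, never the
expanded "Ueber" or "fuer".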
