[ 
https://issues.apache.org/jira/browse/LUCENE-1029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12535049
 ] 

Mark Miller commented on LUCENE-1029:
-------------------------------------

My comment about stemming was not meant to compare a stemmer to a diacritical 
stripper, but rather to point out that the result of such an operation does not 
necessarily have to create something 'legal' (just as a stemmer does not create 
'legal' words). This was in response to the comment 'Some of the 
ISOLatin1AccentFilter are legal while others are illegal.'

Your point about semantic meaning is well taken, but was not intended to be 
part of the comparison I was going for. My bad. 

I think that the fact that ripping diacriticals can change the meaning of words 
goes without saying...otherwise, why even have them in the language? As Uwe 
said, the main motivating factor is to allow easy entry with the keyboard of 
another language. Of course this must come with a compromise. Other search 
engines I have seen (CPL, SearchServer, etc.) offer the exact functionality 
of this class.

Literally, this thing is called an accent filter...letters go in, accents come 
off. Doing more really does seem like a job for another class. If I can borrow 
a word I didn't know from DM Smith, transliteration seems to go beyond an 
ISOLatin1AccentFilter. This is a tough sell I know -- programmers seem to push 
the definition of filter to its limits and IMHO into the realm of 
transform/translate.
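
As a rough illustration of "letters go in, accents come off", here is a minimal 
sketch in plain Java. It is not the ISOLatin1AccentFilter implementation; the 
class name AccentStripSketch and the helper stripAccents are hypothetical, and 
the approach (Unicode NFD decomposition followed by removal of combining marks 
via java.text.Normalizer) is an assumption chosen for brevity. It shows both 
points from the discussion: stripping collapses distinct words such as sää and 
saa, and letters like ø and æ have no accent to strip, so handling them would 
mean transliteration rather than accent removal.

```java
import java.text.Normalizer;

public class AccentStripSketch {

    // Hypothetical helper, not part of Lucene: decompose each character into
    // base letter + combining marks (NFD), then drop the combining marks.
    public static String stripAccents(String input) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        // "s\u00e4\u00e4" is the Finnish word sää; after stripping it is
        // indistinguishable from saa.
        System.out.println(stripAccents("s\u00e4\u00e4"));

        // \u00e5 (å) decomposes to a + combining ring, so it becomes a.
        System.out.println(stripAccents("\u00e5"));

        // \u00f8 (ø) and \u00e6 (æ) have no canonical decomposition, so a
        // pure accent stripper leaves them unchanged.
        System.out.println(stripAccents("\u00f8\u00e6"));
    }
}
```

Mapping ø to o or æ to ae would need an explicit substitution table on top of 
this, which is the step that moves from filtering toward transliteration.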

Anyhow...I apologize for beating a dead horse...<g>

> Illegal character replacements in ISOLatin1AccentFilter
> -------------------------------------------------------
>
>                 Key: LUCENE-1029
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1029
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>    Affects Versions: 2.2
>            Reporter: Marko Asplund
>
> The ISOLatin1AccentFilter class is responsible for replacing "accented 
> characters in the ISO Latin 1 character set by their unaccented equivalent".
> Some of the replacements performed for Scandinavian characters (used e.g. in 
> the Finnish, Swedish, Danish languages etc.) are illegal. The Scandinavian 
> characters are different from the accented characters used e.g. in 
> Latin-based languages such as French in that these characters (ä, ö, å) 
> represent entirely independent sounds in the language and therefore cannot be 
> represented with any other sound without change of meaning. It is therefore 
> illegal to replace these characters with any other character.
> This means for example that you can't change the Finnish word sää (weather) 
> to saa (will have) because these are two entirely different words with 
> different meanings. The same applies to Scandinavian languages as well.
> There's no connection between the sounds represented by ä and a; ö and o; or 
> å and a. 
> In addition to the three characters mentioned above, Danish and Norwegian use 
> other special characters such as ø and æ. It should be checked whether the 
> replacement is legal for these characters.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

