[jira] Updated: (LUCENE-1390) add ISOLatinAccentFilter and deprecate ISOLatin1AccentFilter

Steven Rowe (JIRA) Mon, 29 Sep 2008 16:55:19 -0700

     [ https://issues.apache.org/jira/browse/LUCENE-1390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Steven Rowe updated LUCENE-1390:
--------------------------------

    Attachment: ASCIIFoldingFilter.patch

Changes from Andi's version:

# Changed the name of the class to ASCIIFoldingFilter
# Added the Unicode character descriptions to comments on each character
# Added a test class
# Added several other Unicode blocks from which characters are converted to 
their ASCII equivalents.  Added characters include digits and punctuation.

I did not provide mappings for characters from the Math block - flattening 
circled plus, for example, didn't seem appropriate.

I *did* provide mappings for IPA and two other phonetic character blocks, and 
I'm not sure whether this is appropriate.  I was following what seemed to me to 
be the logic of Andi's mappings, and those provided by ISOLatin1AccentFilter: 
convert characters to those that *look like* them in ASCII.  As a result, e.g., 
the character described as "LATIN SMALL LETTER TURNED M" (U+026F) from the IPA 
block is mapped to "m", regardless of its actual phonetic value.
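
The "looks like" rule can be sketched roughly as below. This is a minimal illustration, not code from the patch: the class name AsciiFoldSketch and the handful of mappings are hand-picked assumptions, and it works on a plain String rather than on a Lucene TokenStream.

```java
// Hypothetical sketch of visual ASCII folding; the mappings below are a
// hand-picked illustration, not the table from ASCIIFoldingFilter.patch.
final class AsciiFoldSketch {

    // Replaces each mapped character with its ASCII look-alike and copies
    // everything else through unchanged.
    static String fold(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c < 0x80) {          // already ASCII, nothing to do
                out.append(c);
                continue;
            }
            switch (c) {
                case '\u00E8': // LATIN SMALL LETTER E WITH GRAVE
                case '\u00E9': // LATIN SMALL LETTER E WITH ACUTE
                    out.append('e');
                    break;
                case '\u026F': // LATIN SMALL LETTER TURNED M (IPA block)
                case '\u0270': // LATIN SMALL LETTER TURNED M WITH LONG LEG
                    out.append('m');  // chosen by appearance, not phonetic value
                    break;
                case '\u00C6': // LATIN CAPITAL LETTER AE
                    out.append("AE"); // one character may fold to several
                    break;
                case '\u2460': // CIRCLED DIGIT ONE
                    out.append('1');
                    break;
                default:
                    out.append(c);    // unmapped, e.g. U+2295 CIRCLED PLUS
            }
        }
        return out.toString();
    }
}
```

Under these assumptions, a call such as AsciiFoldSketch.fold(...) on "café" would give "cafe", while a Math-block character like circled plus is left alone because it has no case label.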

There are lots of mappings in there now.  I generated the mappings by Perl 
scripting over the contents of the Unicode 5.1 version of UnicodeData.txt from 
Unicode.org, after grep'ing e.g. for "LATIN" and "LETTER" or "DIGRAPH", etc., 
and then moved things around to the appropriate places by hand.  I guess this 
is one weakness of this patch: it's large enough that manual verification is 
tough.  It's my hope that adding the Unicode character descriptions will allow 
for at least improved verifiability.
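
For anyone who wants to reproduce or spot-check the candidate list, the grep step might look something like the following. It is written in Java rather than Perl and is only a sketch: the class name UnicodeDataGrepSketch, the output format, and my reading of the filter as "LATIN and (LETTER or DIGRAPH)" are assumptions. It relies only on the UnicodeData.txt layout, where fields are separated by semicolons, field 0 is the code point in hex and field 1 is the character name.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: pick folding candidates out of UnicodeData.txt lines
// and emit them as commented case labels for later manual sorting.
final class UnicodeDataGrepSketch {

    static List<String> candidates(List<String> lines) {
        List<String> out = new ArrayList<>();
        for (String line : lines) {
            String[] fields = line.split(";", -1);
            if (fields.length < 2) {
                continue;                       // not a data record
            }
            String name = fields[1];
            // Assumed filter: name mentions LATIN plus LETTER or DIGRAPH.
            boolean wanted = name.contains("LATIN")
                    && (name.contains("LETTER") || name.contains("DIGRAPH"));
            if (!wanted) {
                continue;
            }
            int codePoint = Integer.parseInt(fields[0], 16);
            if (codePoint < 0x80 || codePoint > 0xFFFF) {
                continue;                       // ASCII already, or not a single Java char
            }
            // The Unicode name goes into the comment to ease verification.
            out.add(String.format("case '\\u%04X': // [%s]", codePoint, name));
        }
        return out;
    }
}
```

The output is only a starting point: which ASCII string each candidate folds to, and where it belongs in the switch, still has to be decided by hand.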

> add ISOLatinAccentFilter and deprecate ISOLatin1AccentFilter
> ------------------------------------------------------------
>
>                 Key: LUCENE-1390
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1390
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>         Environment: any
>            Reporter: Andi Vajda
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: ASCIIFoldingFilter.patch, ISOLatinAccentFilter.java
>
>
> The ISOLatin1AccentFilter is removing accents from accented characters in the 
> ISO Latin 1 character set.
> It does what it does and there is no bug with it.
> It would be nicer, though, if there were a more comprehensive version of this 
> code that included not just ISO-Latin-1 (ISO-8859-1) but the entire Latin-1 
> Supplement and Latin Extended-A Unicode blocks.
> See: http://en.wikipedia.org/wiki/Latin-1_Supplement_unicode_block
> See: http://en.wikipedia.org/wiki/Latin_Extended-A_unicode_block
> That way, all languages using roman characters are covered.
> A new class, ISOLatinAccentFilter, is attached. It is intended to supersede 
> ISOLatin1AccentFilter which should get deprecated.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

