[ 
https://issues.apache.org/jira/browse/LUCENE-7321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Provalov updated LUCENE-7321:
----------------------------------
    Description: 
One of the challenges in search is recall of an item with a common typing
variant.  These cases can be as simple as lower/upper case in most languages
or accented characters, or they can involve more complex morphological
phenomena such as prefix omission or constructing a character with a
combining mark.  This component addresses the cases that are not covered by
the ASCII folding component or that are more complex to design with other
tools.  The idea is that a linguist can provide the mappings in a
tab-delimited file, which can then be used directly by Solr.

The mappings are maintained in a tab-delimited file, which can simply be
copied and pasted from an Excel spreadsheet.  This lets linguists create the
mappings and developers then include them in the Solr configuration.  In a
few cases, when the mappings grow complex, some additional debugging may be
required.  A mapping can take any sequence of characters to any other
sequence of characters.
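
To make the idea of sequence-to-sequence mappings concrete, here is a minimal
sketch in plain Java.  It is not the code from the attached patch: the class
name, the method names, the sample mappings, and the longest-match-first rule
are all assumptions chosen for illustration.  It reads "source<TAB>target"
lines and rewrites an input string from left to right.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only; not the component from LUCENE-7321.patch.
public class CharacterMappingSketch {

    // Parse "source<TAB>target" lines. Blank lines, lines without a tab,
    // and lines with an empty source sequence are skipped.
    static Map<String, String> parseMappings(String tsv) {
        Map<String, String> mappings = new LinkedHashMap<>();
        for (String line : tsv.split("\\R")) {
            int tab = line.indexOf('\t');
            if (tab <= 0) {
                continue;
            }
            mappings.put(line.substring(0, tab), line.substring(tab + 1));
        }
        return mappings;
    }

    // Scan left to right. At each position, apply the longest source
    // sequence that matches (an assumed tie-breaking rule); otherwise
    // copy the character through unchanged.
    static String applyMappings(String input, Map<String, String> mappings) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < input.length()) {
            String best = null;
            for (String source : mappings.keySet()) {
                if (input.startsWith(source, i)
                        && (best == null || source.length() > best.length())) {
                    best = source;
                }
            }
            if (best == null) {
                out.append(input.charAt(i));
                i++;
            } else {
                out.append(mappings.get(best));
                i += best.length();
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Made-up sample mappings:
        //   Cyrillic yo (U+0451) -> ye (U+0435), a one-to-one mapping
        //   "sch" -> "sh", a multi-character mapping
        String tsv = "\u0451\t\u0435\nsch\tsh\n";
        Map<String, String> mappings = parseMappings(tsv);
        System.out.println(applyMappings("\u0451lka", mappings));
        System.out.println(applyMappings("schema", mappings));
    }
}
```

In a real Lucene or Solr setup the mappings would be applied inside the
analysis chain rather than on whole strings; the sketch only shows the
file-to-mapping step and the replacement logic.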

Some of the cases I discuss in the detailed document are handling the voiced
vowels for Japanese; common typing substitutions for Korean, Russian, and
Polish; transliteration for Polish and Arabic; prefix removal for Arabic; and
suffix folding for Japanese.  In the appendix, I give an example of
implementing a Russian lightweight stemmer using this component.

  was:
One of the challenges in search is recall of an item with a common typing 
variant.  These cases can be as simple as lower/upper case in most languages, 
accented characters, or more complex morphological phenomena like prefix 
omitting, or constructing a character with some combining mark.  This component 
addresses the cases, which are not covered by ASCII folding component, or more 
complex to design with other tools.  The idea is that a linguist could provide 
the mappings in a tab-delimited file, which then can be directly used by Solr.

The mappings are maintained in the tab-delimited file, which could be just a 
copy paste from Excel spreadsheet.  This gives the linguists the opportunity to 
create the mappings, then for the developer to include them in Solr 
configuration.  There are a few cases, when the mappings grow complex, where 
some additional debugging may be required.  The mappings can contain any 
sequence of characters to any other sequence of characters.

Some of the cases I discuss in detail document are handling the voiced vowels 
for Japanese; common typing substitutions for Korean, Russian, Polish; 
transliteration for Polish, Arabic; prefix removal for Arabic; suffix folding 
for Japanese.


> Character Mapping
> -----------------
>
>                 Key: LUCENE-7321
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7321
>             Project: Lucene - Core
>          Issue Type: New Feature
>          Components: modules/analysis
>    Affects Versions: 4.6.1, 6.0, 5.4.1, 6.0.1
>            Reporter: Ivan Provalov
>            Priority: Minor
>              Labels: patch
>             Fix For: 6.0.1
>
>         Attachments: CharacterMappingComponent.pdf, LUCENE-7321.patch
>


