Re: Re: Multi-lingual Search & Accent Marks

Audrey Lorberfeld - audrey.lorberf...@ibm.com Tue, 03 Sep 2019 06:31:01 -0700

Languages are the best. Thank you all so much!

-- 
Audrey Lorberfeld
Data Scientist, w3 Search
Digital Workplace Engineering
CIO, Finance and Operations
IBM
audrey.lorberf...@ibm.com


On 8/30/19, 4:09 PM, "Walter Underwood" <wun...@wunderwood.org> wrote:

    The right transliteration for accents is language-dependent. In English, a 
diaeresis can be stripped because it is only used to mark neighboring vowels as 
independently pronounced. In German, the “typewriter umlaut” adds an “e”.
    
    English: coöperate -> cooperate
    German: Glück -> Glueck
    
    Some stemmers will handle the typewriter umlauts for you. The InXight 
stemmers used to do that.
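
The language-dependent rules described above can be sketched in a few lines of Python. This is only an illustration, not how any particular stemmer or analyzer does it: the `fold` function, its `lang` codes, and the `GERMAN_TYPEWRITER` table are hypothetical names, and the table covers just the lowercase and uppercase umlauts.

```python
import unicodedata

# Hypothetical rule table: German "typewriter umlauts" add an "e".
GERMAN_TYPEWRITER = {
    "ä": "ae", "ö": "oe", "ü": "ue",
    "Ä": "Ae", "Ö": "Oe", "Ü": "Ue",
}


def strip_marks(text):
    """Drop combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def fold(text, lang):
    """Fold accents using a rule chosen by language code (assumed: 'en', 'de')."""
    # Compose first so each umlaut is a single code point the table can match.
    text = unicodedata.normalize("NFC", text)
    if lang == "de":
        text = "".join(GERMAN_TYPEWRITER.get(ch, ch) for ch in text)
    # Anything left over (for example the English diaeresis) is simply stripped.
    return strip_marks(text)


print(fold("coöperate", "en"))  # cooperate
print(fold("Glück", "de"))      # Glueck
```

The same input character (ö or ü) produces different output depending on the language, which is why a single language-agnostic folding filter cannot be right for every language.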
    
    The English diaeresis is a fussy usage, but it does occur in text. For 
years, MS Word corrected “naive” to “naïve”. There may even be a curse 
associated with its usage.
    
    
    https://www.newyorker.com/culture/culture-desk/the-curse-of-the-diaeresis
    
    In German, there are corner cases where just stripping the umlaut changes 
one word into another, like schön/schon.
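
A small Python check makes the schön/schon corner case concrete. It groups words by their normalized form and reports any group with more than one member. The helper names (`strip_marks`, `typewriter`, `collisions`) and the word list are made up for this sketch.

```python
import unicodedata
from collections import defaultdict


def strip_marks(text):
    """Plain stripping: decompose, then drop combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def typewriter(text):
    """German-style folding: lowercase umlauts become vowel + 'e'."""
    table = str.maketrans({"ä": "ae", "ö": "oe", "ü": "ue"})
    return strip_marks(unicodedata.normalize("NFC", text).translate(table))


def collisions(words, normalizer):
    """Return normalized forms shared by more than one distinct word."""
    groups = defaultdict(set)
    for word in words:
        groups[normalizer(word)].add(word)
    return {key: sorted(group) for key, group in groups.items() if len(group) > 1}


words = ["schön", "schon", "Glück"]
print(collisions(words, strip_marks))  # {'schon': ['schon', 'schön']}
print(collisions(words, typewriter))   # {}
```

Running a check like this over a real vocabulary is one way to estimate how often plain stripping would merge distinct words before committing to it.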
    
    Isn’t language fun?
    
    wunder
    Walter Underwood
    wun...@wunderwood.org
    
    http://observer.wunderwood.org/ (my blog)
    
    > On Aug 30, 2019, at 12:48 PM, Erick Erickson <erickerick...@gmail.com> 
wrote:
    > 
    > It Depends (tm). In this case on how sophisticated/precise your users 
are. If your users are exclusively extremely conversant in the language and are 
expected to have keyboards that allow easy access to all the accents… then I 
might leave them in. In some cases removing them can change the meaning of a 
word.
    > 
    > That said, most installations I’ve seen remove them. They’re still 
present in any returned stored field so the doc looks good. And then you bypass 
all the nonsense about perhaps ingesting a doc that “somehow” had accents 
removed and/or people not putting accents in their search and the like.
    > 
    > MappingCharFilterFactory works.
    > 
    >> On Aug 30, 2019, at 1:54 PM, Audrey Lorberfeld - 
audrey.lorberf...@ibm.com <audrey.lorberf...@ibm.com> wrote:
    >> 
    >> Atita,
    >> 
    >> Thanks for that insight! 
    >> 
    >> As the conversation has progressed, we are now leaning towards not 
having the ASCII-folding filter in our pipelines in order to keep marks like 
umlauts and tildes. Instead, we might add acute and grave accents to a file 
pointed at by the MappingCharFilterFactory to simply strip those more common 
accent marks...
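
One way to produce such a mapping file is to generate it, so that only acute and grave accents are listed and umlauts and tildes pass through untouched. The Python sketch below emits lines in the `"source" => "target"` form used by Solr's stock mapping files such as mapping-ISOLatin1Accent.txt. The function name, the choice of base letters, and the idea of generating the file at all are assumptions for illustration, not something settled in this thread.

```python
import unicodedata

ACUTE = "\u0301"  # combining acute accent
GRAVE = "\u0300"  # combining grave accent


def mapping_lines(letters="aeiouAEIOU"):
    """Build mapping-file lines that strip only acute and grave accents."""
    lines = []
    for base in letters:
        for mark in (ACUTE, GRAVE):
            composed = unicodedata.normalize("NFC", base + mark)
            # Keep the pair only if Unicode has a single precomposed character.
            if len(composed) == 1:
                lines.append(f'"{composed}" => "{base}"')
    return lines


lines = mapping_lines()
print("\n".join(lines[:4]))
```

The resulting file would then be referenced through the `mapping` attribute of the `solr.MappingCharFilterFactory` char filter in the field type. Characters such as ü and ñ never appear in the file, so they reach the tokenizer unchanged.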
    >> 
    >> Any other opinions are welcome!
    >> 
    >> -- 
    >> Audrey Lorberfeld
    >> Data Scientist, w3 Search
    >> Digital Workplace Engineering
    >> CIO, Finance and Operations
    >> IBM
    >> audrey.lorberf...@ibm.com
    >> 
    >> 
    >> On 8/30/19, 10:27 AM, "Atita Arora" <atitaar...@gmail.com> wrote:
    >> 
    >>   We work on a German index. We neutralize accents before indexing, i.e.
    >>   umlauts become 'ae', 'ue', etc., and we do the same at query time for an
    >>   appropriate match.
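
The point of applying the same normalization on both sides can be shown with a toy inverted index in Python. Everything here (`TinyIndex`, `normalize`, the whitespace tokenizer) is a hypothetical stand-in for a real analysis chain, assuming the umlaut-to-'ae'/'oe'/'ue' rule described above.

```python
import unicodedata
from collections import defaultdict

UMLAUTS = str.maketrans({"ä": "ae", "ö": "oe", "ü": "ue"})


def normalize(term):
    """Shared by indexing and querying so both sides produce the same terms."""
    return unicodedata.normalize("NFC", term).lower().translate(UMLAUTS)


class TinyIndex:
    def __init__(self):
        self.postings = defaultdict(set)  # normalized term -> doc ids
        self.stored = {}                  # doc id -> original text, unchanged

    def add(self, doc_id, text):
        self.stored[doc_id] = text
        for token in text.split():
            self.postings[normalize(token)].add(doc_id)

    def search(self, query):
        ids = self.postings.get(normalize(query), set())
        return [self.stored[i] for i in sorted(ids)]


index = TinyIndex()
index.add(1, "Viel Glück")
print(index.search("Glueck"))  # ['Viel Glück']
print(index.search("glück"))   # ['Viel Glück']
print(index.search("Gluck"))   # []
```

Because the stored text is kept separately from the normalized terms, results still display the original umlauts even though matching happens on the folded form.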
    >> 
    >>   On Fri, Aug 30, 2019, 4:22 PM Audrey Lorberfeld - 
audrey.lorberf...@ibm.com
    >>   <audrey.lorberf...@ibm.com> wrote:
    >> 
    >>> Hi All,
    >>> 
    >>> Just wanting to test the waters here – for those of you with search
    >>> engines that index multiple languages, do you use ASCII-folding in your
    >>> schema? We are onboarding Spanish documents into our index right now and
    >>> keep going back and forth on whether we should preserve accent marks. 
From
    >>> our query logs, it seems people generally do not include accents when
    >>> searching, but you never know…
    >>> 
    >>> Thank you in advance for sharing your experiences!
    >>> 
    >>> --
    >>> Audrey Lorberfeld
    >>> Data Scientist, w3 Search
    >>> Digital Workplace Engineering
    >>> CIO, Finance and Operations
    >>> IBM
    >>> audrey.lorberf...@ibm.com
    >>> 
    >>> 
    >> 
    >> 
    >
