[jira] [Commented] (LUCENE-8419) Return token unchanged for pathological Stempel tokens

Adrien Grand (JIRA) Fri, 27 Jul 2018 05:35:34 -0700


    [ 
https://issues.apache.org/jira/browse/LUCENE-8419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16559683#comment-16559683
 ]


Adrien Grand commented on LUCENE-8419:
--------------------------------------

[~ab] Do you have any ideas how to improve this?

> Return token unchanged for pathological Stempel tokens
> ------------------------------------------------------
>
>                 Key: LUCENE-8419
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8419
>             Project: Lucene - Core
>          Issue Type: New Feature
>          Components: modules/analysis
>            Reporter: Trey Jones
>            Priority: Major
>              Labels: stemmer, stemming
>         Attachments: dotc.txt, dotdotc.txt, twoletter.txt
>
>
> In the aggregate, Stempel does a good job, but certain tokens get stemmed 
> pathologically, conflating completely unrelated words in the search index. 
> Depending on the scoring function, documents returned may have no form of the 
> word that was in the query, only unrelated forms (see ć examples below).
> It's probably not possible to fix the stemmer, and it's probably not possible 
> to catch _every_ error, but catching and ignoring certain large classes of 
> errors would greatly improve precision, and doing it in the stemmer would 
> prevent losses to recall that happen from cleaning up these errors outside 
> the stemmer.
> An obvious example is that numbers ending in 1 have the last two digits 
> replaced with ć, so 12341 is stemmed as 123ć. Numbers ending in 31 have the 
> last four digits removed and replaced with ć, so 12331 is stemmed as 1ć. 
> Mixed letters and numbers are treated the same: abc123451 is stemmed as 
> abc1234ć, and abc1231 is stemmed as abcć.
> *Proposed solution:* any token that ends in a number should not be stemmed, 
> it should just be returned unchanged.
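
The first proposal could be sketched as a thin guard around the stemmer. This is only an illustration in Python, not Lucene code: `stem_unless_ends_in_digit` and `toy_stemmer` are hypothetical names, and `toy_stemmer` is a stand-in that merely reproduces the two numeric patterns described in the report, not Stempel itself.

```python
from typing import Callable

ASCII_DIGITS = "0123456789"


def stem_unless_ends_in_digit(token: str, stem: Callable[[str], str]) -> str:
    """Return the token unchanged if it ends in an ASCII digit, else stem it."""
    if token and token[-1] in ASCII_DIGITS:
        return token
    return stem(token)


def toy_stemmer(token: str) -> str:
    """Hypothetical stand-in (NOT Stempel) mimicking the reported behaviour."""
    if token.endswith("31") and len(token) >= 4:
        return token[:-4] + "ć"   # last four characters replaced with ć
    if token.endswith("1") and len(token) >= 2:
        return token[:-2] + "ć"   # last two characters replaced with ć
    return token


print(toy_stemmer("12331"))                             # 1ć
print(stem_unless_ends_in_digit("12331", toy_stemmer))  # 12331
```

With the guard in place, tokens such as 12341 and abc1231 come back unchanged, while tokens that do not end in a digit still reach the stemmer.
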
> One-letter stems from the set [a-zńć] are generally useless and often absurd.
> ć is the worst offender by far (it's the ending of the infinitive form of 
> verbs). All of these tokens (found on Polish Wikipedia/Wiktionary) get 
> stemmed to ć:
>  * acque Adrien aguas Águas Alainem Alandh Amores Ansoe Arau asinaio aŭdas 
> audyt Awiwie Ayres Baby badż Baina Bains Balue Baon baque Barbola Bazy Beau 
> beim Beroe Betz Blaue blenda bleue Blizzard boor Boruca Boym Brodła Brogi 
> Bronksie Brydż Budgie Budiafa bujny Buon Buot Button Caan Cains Canoe Canona 
> caon Celu Charl Chloe ciag Cioma Cmdr Conseil Conso Cotton Cramp Creel Cuyk 
> cyan czcią Czermny czto D.III Daws Daxue dazzle decy Defoe Dereń Detroit 
> digue Dior Ditton Dojlido dosei douk DRaaS drag drau Dudacy dudas Dutton Duty 
> Dziób eayd Edwy Edyp eiro Eltz Emain erar ESaaS faan Fetz figurar Fitz foam 
> Frau Fugue GAAB gaan Gabirol Gaon gasue Gaup Geol GeoMIP Getz gigue Ginny 
> Gioią Girl Goam Gołymin Gosei Götz grasso Grodnie Gula Guroo gyan HAAB Haan 
> Heim Héroe Hitz Hoam Hohenho Hosei Huon Hutton Huub hyaina Iberii inkuby 
> Inoue Issue ITaaS Iudas Izmaile Jaan Jaws jedyn Jews jira Josepho Jost Josue 
> Judas Kaan Kaleido Karoo Katz Kazue Kehoe khayag kiwa Kiwu Klaas kmdr Kokei 
> Konoe kozer kpią Kringle ksiezyce Któż Kutz L231 L331 Laan Lalli Laon Laws 
> łebka Leroo Liban Ligue Liro Lisoli Logue Loja Londyn Lubomyr Luque Lutz 
> Lytton łzawy Maan mains Mainy malpaco Mammal mandag MBaaS meeki Merl Metz 
> MIDAS middag Miras mmol modą moins Monty Moryń motz mróż Mutz Müzesi MVaaS 
> Naam nabrzeża Nadab Nadala Nalewki Nd:YAG neol News Nieszawa Nimue Nyam ÖAAB 
> oblał oddala okala Olień opar oppi Orioł Osioł osoagi Osyki Otóż Output 
> Oxalido pasmową Patton Pearl Peau peoplk Petz poar Pobrzeża poecie Pogue Pono 
> posagi posł Praha Pringle probie progi Prońko Prosper prwdę Psioł Pułka Putz 
> QDTOE Quien Qwest radża raga Rains reht Reich Retz Revue Right RITZ Roam 
> Rogue Roque rosii RU31 Rutki Ryan SAAB saasso salue Sampaio Satz Sears 
> Sekisho semo Setton Sgan Siloe Sitz Skopje Slot Šmarje Smrkci Soar sopo 
> sozinho springa Steel Stip Straz Strip Suez sukuby Sumach Surgucie Sutton 
> svasso Szosą szto Tadas Taira tęczy Teodorą teol Tisii Tisza Toluca Tomoe 
> Toque TPMŻ Traiana Trask Traue Tulyag Tuque Turinga Undas Uniw usque Vague 
> Value Venue Vidas Vogue Voor W331 Waringa weht Weich Weija Wheel widmem WKAG 
> worku Wotton Wryk Wschowie wsiach wsiami Wybrzeża wydala Wyraz XLIII XVIII 
> XXIII Yaski yeol YONO Yorki zakręcie Zijab zipo.
> Four-character tokens ending in 31 (like 2,31 9,31 1031 1131 7431 8331 a331) 
> also all get stemmed to ć.
> Below are examples of other tokens (from Polish Wikipedia/Wiktionary) that 
> get stemmed to one-letter tokens in [a-zńć]. Note that i, o, u, w, and z are 
> stop words, so they do not appear as input tokens in the lists.
>  * a: a, addo, adygea, jhwh, also
>  * b: b, bdrm, barr, bebek, berr, bounty, bures, burr, berm, birm
>  * c: alzira, c, carr, county, haight, hermas, kidoń, paich, pieter, połóż, 
> radoń, soest, tatort, voight, zaba, biegną, pokaż, wskaż, zoisyt
>  * d: award, d, dlek, deeb
>  * e: e, eddy, eloi
>  * f: f, farr, firm
>  * g: g, geagea, grunty, gwdy, gyro, górą
>  * h: h
>  * i: inre, isro
>  * j: j, judo
>  * k: k, kgtj, kpzr, karr, kerr, ksok
>  * l: l, leeb, loeb
>  * m: m, magazyn, marr, mayor, merr, mnsi, murr, mgły, najmu
>  * n: johnowi, n
>  * o: obzr, offy
>  * p: p, pace, paoli, parr, pasji, pawełek, pyro, pirsy, plmb
>  * q: q
>  * r: r, rite, rrek
>  * s: s, sarr, site, sowie, szok
>  * t: leźnie, t, tnsw, tooi
>  * u: noite
>  * w: wmro, warr, wifi, wyspom, wątki
>  * x: x
>  * y: jesteś, lafleur, nate, nowsze, violeur, y, yach, douleur
>  * z: czok, skrawek
>  * ń: cisew, esso
> All other one-character stems I have encountered have been for one-character 
> input tokens (especially those in other writing systems).
> *Proposed solution:* if a token gets stemmed to a one-letter stem (either in 
> general, or specifically if the letter is one of [a-zńć]), the input token 
> should be returned unchanged.
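
A minimal sketch of the one-letter rule, again in Python rather than Lucene's Java. The names `stem_unless_one_letter` and `lookup_stemmer` are hypothetical. The "Detroit" and "magazyn" entries come from the lists above; the "foobar" entry is made up purely to show a stem that passes through.

```python
import re
from typing import Callable

# One-letter stems in the set named in the report.
ONE_LETTER_STEM = re.compile(r"[a-zńć]")


def stem_unless_one_letter(token: str, stem: Callable[[str], str]) -> str:
    """Stem the token, but fall back to the input if the stem is one letter."""
    stemmed = stem(token)
    if ONE_LETTER_STEM.fullmatch(stemmed):
        return token
    return stemmed


# Hypothetical lookup standing in for the real stemmer.
SAMPLE_STEMS = {
    "Detroit": "ć",    # from the report
    "magazyn": "m",    # from the report
    "foobar": "foo",   # made-up entry, not Stempel output
}


def lookup_stemmer(token: str) -> str:
    return SAMPLE_STEMS.get(token, token)


print(stem_unless_one_letter("Detroit", lookup_stemmer))  # Detroit
print(stem_unless_one_letter("foobar", lookup_stemmer))   # foo
```

Dropping the character class and testing `len(stemmed) == 1` instead would give the "in general" variant mentioned in the proposal.
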
> There are other patterns of unreliable stems, though the ones above are the 
> worst.
> Two-letter stems are generally unreliable (see attachment twoletter.txt). 
> The specific stems my, um, ąc, and ły are particularly random.
> Two- and three-letter stems fitting the patterns .ć and ..ć are generally not 
> useful (see attachments dotc.txt and dotdotc.txt for full lists of examples). 
> The specific stems ać, eć, yć, ąć, ść, and źć are particularly random.
> The specific stems ować, iwać, obić, snąć, ywać, ium also stand out as 
> egregious:
>  * ium: IIIC, Treze
>  * iwać: Blefa, Crew, Iwano, Krall, Leseur, Maksiu, Stefa, Wrycz, cygar, horou
>  * obić: Dawka, Obiło, dawka, obicia, obito
>  * ować: Abdou, Bangu, Beess, Biblie, Birmie, Bohle, Bredy, Buddę, Czubą, 
> Darją, Fatou, Firmie, Füssli, Ghany, Haeng, Katją, Koszyc, Ligę, Limie, 
> Madou, Ozmy, Pitou, Riess, Sloane, Smółka, Soeng, TheFa, UWSS, firmie, ligę, 
> szury, úzkost
>  * snąć: Koziej, Schwab, Serial, Spain, serial
>  * ywać: Ariza, odkuł, sorgo
> *Proposed solution:* Return the input token if the stem meets one or more of 
> the following criteria:
>  # stem matches /^[a-zął][a-zćń]$/
>  # stem matches /^.ć$/
>  # stem is one of my, um, ąc, ły, ać, eć, yć, ąć, ść, or źć
>  # stem matches /^..ć$/
>  # stem is one of ować, iwać, obić, snąć, ywać, ium
> Note: (1) is a superset of (2) and (3). (2) does not cover my, um, ąc, or ły 
> in (3), so (2) and part of (3) could be combined.
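
The five criteria could be combined into a single check along these lines. This is a sketch under assumptions: the patterns in (2) and (4) are treated as anchored at both ends, which is what the ".ć and ..ć" wording suggests, and stems are assumed to use precomposed characters (ć as one code point). `is_unreliable_stem` and `guarded_stem` are hypothetical names.

```python
import re
from typing import Callable

TWO_LETTER = re.compile(r"[a-zął][a-zćń]")   # criterion 1
DOT_C = re.compile(r".ć")                     # criterion 2 (assumed anchored)
DOT_DOT_C = re.compile(r"..ć")                # criterion 4 (assumed anchored)
RANDOM_TWO_LETTER = {"my", "um", "ąc", "ły", "ać", "eć", "yć", "ąć", "ść", "źć"}  # criterion 3
EGREGIOUS = {"ować", "iwać", "obić", "snąć", "ywać", "ium"}                        # criterion 5


def is_unreliable_stem(stem: str) -> bool:
    """True if the stem meets at least one of the five criteria."""
    return bool(
        TWO_LETTER.fullmatch(stem)
        or DOT_C.fullmatch(stem)
        or stem in RANDOM_TWO_LETTER
        or DOT_DOT_C.fullmatch(stem)
        or stem in EGREGIOUS
    )


def guarded_stem(token: str, stem: Callable[[str], str]) -> str:
    """Return the input token when the stemmer yields an unreliable stem."""
    stemmed = stem(token)
    return token if is_unreliable_stem(stemmed) else stemmed
```

One detail worth noting: ść and źć start with letters outside the character class in (1), so they are caught only by (2) or the explicit list in (3).
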
> *General workaround:* Unpack Stempel into constituent parts, recreate 
> Stempel's stopword list as a stop filter (see LUCENE-8417), use polish_stem 
> as a stemmer, use a pattern_replace filter to replace 
> /^([a-zął]?[a-zćń]|..ć|\d.*ć)$/ with '', and then a length filter to remove 
> zero-length tokens, and add a stop filter with ować, iwać, obić, snąć, ywać, 
> ium. Since many tokens are lost by this process, you need to also have an 
> unstemmed index of the same text so you don't lose recall. (That's not 
> exactly "easy", but it's what I've had to do.)
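
The post-stemming part of this workaround (pattern replace, length filter, stop filter) can be imitated outside the analysis chain to check which stems it would drop. The sketch below is plain Python, not an Elasticsearch or Lucene configuration; `clean_stems` is a hypothetical helper and the input list is illustrative.

```python
import re
from typing import Iterable, List

# The pattern from the workaround, applied to already-stemmed tokens.
BAD_STEM = re.compile(r"^([a-zął]?[a-zćń]|..ć|\d.*ć)$")
STOP_STEMS = {"ować", "iwać", "obić", "snąć", "ywać", "ium"}


def clean_stems(stems: Iterable[str]) -> List[str]:
    """Drop stems the workaround would remove; keep the rest in order."""
    kept = []
    for stem in stems:
        stem = BAD_STEM.sub("", stem)   # pattern_replace step
        if not stem:                    # length filter: drop empty tokens
            continue
        if stem in STOP_STEMS:          # final stop filter
            continue
        kept.append(stem)
    return kept


print(clean_stems(["ć", "my", "abć", "123ć", "ować", "dom"]))  # ['dom']
```

Every dropped stem is a token that disappears from the stemmed field, which is why the unstemmed index is needed alongside it.
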



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
