[ https://issues.apache.org/jira/browse/LUCENE-8419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16559683#comment-16559683 ]
Adrien Grand commented on LUCENE-8419:
--------------------------------------
[~ab] Do you have any ideas how to improve this?
> Return token unchanged for pathological Stempel tokens
> ------------------------------------------------------
>
> Key: LUCENE-8419
> URL: https://issues.apache.org/jira/browse/LUCENE-8419
> Project: Lucene - Core
> Issue Type: New Feature
> Components: modules/analysis
> Reporter: Trey Jones
> Priority: Major
> Labels: stemmer, stemming
> Attachments: dotc.txt, dotdotc.txt, twoletter.txt
>
>
> In the aggregate, Stempel does a good job, but certain tokens get stemmed
> pathologically, conflating completely unrelated words in the search index.
> Depending on the scoring function, documents returned may have no form of the
> word that was in the query, only unrelated forms (see ć examples below).
> It's probably not possible to fix the stemmer, and it's probably not possible
> to catch _every_ error, but catching and ignoring certain large classes of
> errors would greatly improve precision, and doing it in the stemmer would
> prevent losses to recall that happen from cleaning up these errors outside
> the stemmer.
> An obvious example is that numbers ending in 1 have the last two digits
> replaced with ć. So 12341 is stemmed as 123ć. Numbers ending in 31 have the
> last four digits removed and replaced with ć, so 12331 is stemmed as 1ć. Mixed
> letters and numbers are treated the same: abc123451 is stemmed as abc1234ć,
> abc1231 is stemmed as abcć.
> *Proposed solution:* any token that ends in a digit should not be stemmed;
> it should just be returned unchanged.
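A minimal Python sketch of this guard, for illustration only. The `stem` argument stands in for whatever stemmer is in use. The `mimic_stem` helper is a hypothetical stand-in that only reproduces the reported 12341 -> 123ć behaviour; it is not Stempel. Treating only ASCII 0-9 as digits is an assumption.

```python
from typing import Callable

ASCII_DIGITS = "0123456789"  # assumption: only ASCII digits count as "a number"


def stem_unless_trailing_digit(token: str, stem: Callable[[str], str]) -> str:
    """Return the token unchanged if it ends in a digit; otherwise stem it."""
    if token and token[-1] in ASCII_DIGITS:
        return token
    return stem(token)


def mimic_stem(token: str) -> str:
    """Hypothetical stand-in: replaces the last two characters with 'ć'."""
    return token[:-2] + "ć"


if __name__ == "__main__":
    print(mimic_stem("12341"))                              # 123ć
    print(stem_unless_trailing_digit("12341", mimic_stem))  # 12341
```

In a real analyzer the check would sit in front of the stemmer call, so tokens such as abc1231 never reach the stemming tables.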
> One-letter stems from the set [a-zńć] are generally useless and often absurd.
> ć is the worst offender by far (it's the ending of the infinitive form of
> verbs). All of these tokens (found on Polish Wikipedia/Wiktionary) get
> stemmed to ć:
> * acque Adrien aguas Águas Alainem Alandh Amores Ansoe Arau asinaio aŭdas
> audyt Awiwie Ayres Baby badż Baina Bains Balue Baon baque Barbola Bazy Beau
> beim Beroe Betz Blaue blenda bleue Blizzard boor Boruca Boym Brodła Brogi
> Bronksie Brydż Budgie Budiafa bujny Buon Buot Button Caan Cains Canoe Canona
> caon Celu Charl Chloe ciag Cioma Cmdr Conseil Conso Cotton Cramp Creel Cuyk
> cyan czcią Czermny czto D.III Daws Daxue dazzle decy Defoe Dereń Detroit
> digue Dior Ditton Dojlido dosei douk DRaaS drag drau Dudacy dudas Dutton Duty
> Dziób eayd Edwy Edyp eiro Eltz Emain erar ESaaS faan Fetz figurar Fitz foam
> Frau Fugue GAAB gaan Gabirol Gaon gasue Gaup Geol GeoMIP Getz gigue Ginny
> Gioią Girl Goam Gołymin Gosei Götz grasso Grodnie Gula Guroo gyan HAAB Haan
> Heim Héroe Hitz Hoam Hohenho Hosei Huon Hutton Huub hyaina Iberii inkuby
> Inoue Issue ITaaS Iudas Izmaile Jaan Jaws jedyn Jews jira Josepho Jost Josue
> Judas Kaan Kaleido Karoo Katz Kazue Kehoe khayag kiwa Kiwu Klaas kmdr Kokei
> Konoe kozer kpią Kringle ksiezyce Któż Kutz L231 L331 Laan Lalli Laon Laws
> łebka Leroo Liban Ligue Liro Lisoli Logue Loja Londyn Lubomyr Luque Lutz
> Lytton łzawy Maan mains Mainy malpaco Mammal mandag MBaaS meeki Merl Metz
> MIDAS middag Miras mmol modą moins Monty Moryń motz mróż Mutz Müzesi MVaaS
> Naam nabrzeża Nadab Nadala Nalewki Nd:YAG neol News Nieszawa Nimue Nyam ÖAAB
> oblał oddala okala Olień opar oppi Orioł Osioł osoagi Osyki Otóż Output
> Oxalido pasmową Patton Pearl Peau peoplk Petz poar Pobrzeża poecie Pogue Pono
> posagi posł Praha Pringle probie progi Prońko Prosper prwdę Psioł Pułka Putz
> QDTOE Quien Qwest radża raga Rains reht Reich Retz Revue Right RITZ Roam
> Rogue Roque rosii RU31 Rutki Ryan SAAB saasso salue Sampaio Satz Sears
> Sekisho semo Setton Sgan Siloe Sitz Skopje Slot Šmarje Smrkci Soar sopo
> sozinho springa Steel Stip Straz Strip Suez sukuby Sumach Surgucie Sutton
> svasso Szosą szto Tadas Taira tęczy Teodorą teol Tisii Tisza Toluca Tomoe
> Toque TPMŻ Traiana Trask Traue Tulyag Tuque Turinga Undas Uniw usque Vague
> Value Venue Vidas Vogue Voor W331 Waringa weht Weich Weija Wheel widmem WKAG
> worku Wotton Wryk Wschowie wsiach wsiami Wybrzeża wydala Wyraz XLIII XVIII
> XXIII Yaski yeol YONO Yorki zakręcie Zijab zipo.
> Four-character tokens ending in 31 (like 2,31 9,31 1031 1131 7431 8331 a331)
> also all get stemmed to ć.
> Below are examples of other tokens (from Polish Wikipedia/Wiktionary) that
> get stemmed to one-letter tokens in [a-zńć]. Note that i, o, u, w, and z are
> stop words, and so don't show up in the list.
> * a: a, addo, adygea, jhwh, also
> * b: b, bdrm, barr, bebek, berr, bounty, bures, burr, berm, birm
> * c: alzira, c, carr, county, haight, hermas, kidoń, paich, pieter, połóż,
> radoń, soest, tatort, voight, zaba, biegną, pokaż, wskaż, zoisyt
> * d: award, d, dlek, deeb
> * e: e, eddy, eloi
> * f: f, farr, firm
> * g: g, geagea, grunty, gwdy, gyro, górą
> * h: h
> * i: inre, isro
> * j: j, judo
> * k: k, kgtj, kpzr, karr, kerr, ksok
> * l: l, leeb, loeb
> * m: m, magazyn, marr, mayor, merr, mnsi, murr, mgły, najmu
> * n: johnowi, n
> * o: obzr, offy
> * p: p, pace, paoli, parr, pasji, pawełek, pyro, pirsy, plmb
> * q: q
> * r: r, rite, rrek
> * s: s, sarr, site, sowie, szok
> * t: leźnie, t, tnsw, tooi
> * u: noite
> * w: wmro, warr, wifi, wyspom, wątki
> * x: x
> * y: jesteś, lafleur, nate, nowsze, violeur, y, yach, douleur
> * z: czok, skrawek
> * ń: cisew, esso
> All other one-character stems I have encountered have been for one-character
> input tokens (especially those in other writing systems).
> *Proposed solution:* if a token gets stemmed to a one-letter stem (either in
> general, or specifically if the letter is one of [a-zńć]), the input token
> should be returned unchanged.
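One way this check could look, sketched in Python under stated assumptions. The `stem` callable and the lookup table are hypothetical stand-ins (the three table entries are taken from the examples in this issue), and stems are NFC-normalised first so that ć counts as a single character. The `restrict_to_letters` flag covers the two variants mentioned above: any one-character stem, or only stems in [a-zńć].

```python
import unicodedata
from typing import Callable

ONE_LETTER_STEMS = frozenset("abcdefghijklmnopqrstuvwxyzńć")


def keep_token_if_one_letter_stem(
    token: str,
    stem: Callable[[str], str],
    restrict_to_letters: bool = True,
) -> str:
    """Stem the token, but fall back to the input if the stem is one letter."""
    stemmed = unicodedata.normalize("NFC", stem(token))
    if len(stemmed) == 1 and (
        not restrict_to_letters or stemmed in ONE_LETTER_STEMS
    ):
        return token
    return stemmed


# Hypothetical stand-in for a stemmer, using examples reported in this issue.
REPORTED = {"Detroit": "ć", "judo": "j", "dawka": "obić"}


def lookup_stem(token: str) -> str:
    return REPORTED.get(token, token)


if __name__ == "__main__":
    for word in ("Detroit", "judo", "dawka"):
        print(word, "->", keep_token_if_one_letter_stem(word, lookup_stem))
```

One-character input tokens are unaffected either way, since returning the input and returning the stem give the same result for them.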
> There are other patterns of unreliable stems, though the ones above are the
> worst.
> Two-letter stems are generally unreliable (see attachment twoletter.txt).
> The specific stems my, um, ąc, and ły are particularly random.
> Two- and three-letter stems fitting the patterns .ć and ..ć are generally not
> useful (see attachments dotc.txt and dotdotc.txt for full lists of examples).
> The specific stems ać, eć, yć, ąć, ść, and źć are particularly random.
> The specific stems ować, iwać, obić, snąć, ywać, ium also stand out as
> egregious:
> * ium: IIIC, Treze
> * iwać: Blefa, Crew, Iwano, Krall, Leseur, Maksiu, Stefa, Wrycz, cygar, horou
> * obić: Dawka, Obiło, dawka, obicia, obito
> * ować: Abdou, Bangu, Beess, Biblie, Birmie, Bohle, Bredy, Buddę, Czubą,
> Darją, Fatou, Firmie, Füssli, Ghany, Haeng, Katją, Koszyc, Ligę, Limie,
> Madou, Ozmy, Pitou, Riess, Sloane, Smółka, Soeng, TheFa, UWSS, firmie, ligę,
> szury, úzkost
> * snąć: Koziej, Schwab, Serial, Spain, serial
> * ywać: Ariza, odkuł, sorgo
> *Proposed solution:* Return the input token if the stem meets one or more of
> the following criteria:
> # stem matches /^[a-zął][a-zćń]$/
> # stem matches /^.ć$/
> # stem is one of my, um, ąc, ły, ać, eć, yć, ąć, ść, or źć
> # stem matches /^..ć$/
> # stem is one of ować, iwać, obić, snąć, ywać, ium
> Note: (1) covers most of (2) and (3); the exceptions are stems such as ść
> and źć, whose first letter falls outside [a-zął]. (2) does not cover my, um,
> ąc, or ły in (3), so (2) and part of (3) could be combined.
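The five criteria can be sketched as a single predicate. This is an illustrative Python version, not a patch against Stempel. It assumes every pattern is meant to match the whole stem (hence `fullmatch`), and the function name `is_unreliable_stem` is made up for this example.

```python
import re

TWO_LETTER = re.compile(r"[a-zął][a-zćń]")  # criterion 1
ONE_PLUS_C = re.compile(r".ć")              # criterion 2
TWO_PLUS_C = re.compile(r"..ć")             # criterion 4

# Criterion 3: specific two-letter stems.
SHORT_STEMS = frozenset(
    {"my", "um", "ąc", "ły", "ać", "eć", "yć", "ąć", "ść", "źć"}
)
# Criterion 5: specific longer stems.
LONG_STEMS = frozenset({"ować", "iwać", "obić", "snąć", "ywać", "ium"})


def is_unreliable_stem(stem: str) -> bool:
    """True if the stem meets at least one of the five criteria."""
    return bool(
        TWO_LETTER.fullmatch(stem)
        or ONE_PLUS_C.fullmatch(stem)
        or stem in SHORT_STEMS
        or TWO_PLUS_C.fullmatch(stem)
        or stem in LONG_STEMS
    )


if __name__ == "__main__":
    for candidate in ("my", "ść", "ium", "ować"):
        print(candidate, is_unreliable_stem(candidate))  # True for all four
```

A stemmer wrapper would call this on the stem and return the input token whenever it yields True. Dropping individual criteria from the `or` chain gives the less aggressive variants.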
> *General workaround:* Unpack Stempel into constituent parts, recreate
> Stempel's stopword list as a stop filter (see LUCENE-8417), use polish_stem
> as a stemmer, use a pattern_replace filter to replace
> /^([a-zął]?[a-zćń]|..ć|\d.*ć)$/ with '', and then a length filter to remove
> zero-length tokens, and add a stop filter with ować, iwać, obić, snąć, ywać,
> ium. Since many tokens are lost by this process, you need to also have an
> unstemmed index of the same text so you don't lose recall. (That's not
> exactly "easy", but it's what I've had to do.)
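To make the post-stemming part of this workaround concrete, here is a plain-Python emulation of the three filter steps (pattern replace, length filter, stop filter). It is not analyzer configuration, the function name `clean_stems` is hypothetical, and it assumes the stems have already been produced. The separate unstemmed index is not modelled.

```python
import re
from typing import Iterable, List

# Regex quoted from the workaround above.
UNRELIABLE = re.compile(r"^([a-zął]?[a-zćń]|..ć|\d.*ć)$")
STOP_STEMS = frozenset({"ować", "iwać", "obić", "snąć", "ywać", "ium"})


def clean_stems(stems: Iterable[str]) -> List[str]:
    """Apply the three post-stemming steps to an iterable of stems."""
    # Step 1: pattern replace, turning unreliable stems into ''.
    replaced = (UNRELIABLE.sub("", stem) for stem in stems)
    # Step 2: length filter, dropping zero-length tokens.
    non_empty = (stem for stem in replaced if len(stem) >= 1)
    # Step 3: stop filter for the specific longer stems.
    return [stem for stem in non_empty if stem not in STOP_STEMS]


if __name__ == "__main__":
    # "kot" is only an illustrative placeholder for a stem that survives.
    print(clean_stems(["ć", "123ć", "my", "obić", "kot"]))  # ['kot']
```

Every stem dropped here removes the token from the stemmed field entirely, which is why the workaround needs the second, unstemmed index.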
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)