Mike, thanks for the tips. I hadn't thought about stacking dictionaries; that works great.
I'm now trying out four dictionaries in two config groups:

Group 1 - 1 - Normal text synonyms (Roman numeral to text)
Group 1 - 2 - INT to text
Group 2 - 3 - Roman Numeral Int to text
Group 2 - 4 - Roman Numeral text to Int

So if the cataloger enters
- "Scary Movie 5" it is indexed with "V" and "Five"
- "Scary Movie V" it is indexed with "5" and "Five"
- "Scary Movie Five" it is indexed with "5" and "V"

This could really cut down on the need to add variations of the title (246) tags.

Now I need to see what I can do about hyphenated numbers. We have about 500 titles like "A history of America in thirty-six postage stamps". The above setup adds "30","6","xxx","vi" to the index since thirty-six is treated as two separate words. I don't know yet whether that is going to be a problem in real-life usage. And while it is possible to have a synonym dictionary with 36 -> thirty-six, it doesn't work because I believe the search subsystem would never send "thirty-six"; it will break it up into two words.

For your suggestion about handling & and other special characters, there is one thing that I don't understand. Would the normalizer translate & into ☃, and then the synonym dictionary would map ☃ to & along with ☃ to ‘and’? Or would this method not be using the synonym dictionaries at all?

Josh Stompro - LARL IT Director

-----Original Message-----
From: Open-ils-general [mailto:open-ils-general-boun...@list.georgialibraries.org] On Behalf Of Mike Rylander
Sent: Thursday, May 25, 2017 11:19 AM
To: Evergreen Discussion Group
Subject: Re: [OPEN-ILS-GENERAL] Synonym Dictionary - Numbers, &

Josh,

To cover numbers, it looks like you just need to add dictionaries (I probably wouldn't use just one for everything) for uint, etc. Note, you can stack dictionaries.

As for & (along with |, !, and maybe parens), it may be best to simply map those to some well-known token in search_normalize() that's very unlikely to be used in the real world.
Perhaps some unicode codepoint, like ☃ and friends. Those (&, |, !, and parens) are special characters used by tsearch itself.

HTH,

--
Mike Rylander
 | President
 | Equinox Open Library Initiative
 | phone: 1-877-OPEN-ILS (673-6457)
 | email: mi...@equinoxinitiative.org
 | web: http://equinoxinitiative.org

On Thu, May 25, 2017 at 11:05 AM, Josh Stompro <stomp...@exchange.larl.org> wrote:
> Hello, I’ve followed the steps in the following wiki pages to enable a
> synonym dictionary but I’m not getting the results I expect.
>
> https://wiki.evergreen-ils.org/doku.php?id=scratchpad:brush_up_search#synonym_dictionary
>
> Spelled out numbers do get translated to digits (six -> 6) but digits
> don’t get translated ( 6 -> six).
>
> When I test the synonym dictionary with something like the following
> it looks like it works:
>
> select ts_lexize('synonym_larl', '6');
>
>  ts_lexize
> -----------
>  {six}
> (1 row)
>
> But when I look at the metabib.title_field_entry for a record that
> has been reindexed I see the following.
>
> select * from metabib.title_field_entry where source=102449 limit 100;
>
>    id    | source | field |                          value                           | index_vector
> ---------+--------+-------+----------------------------------------------------------+--------------
>  2402931 | 102449 |     6 | Little house on the prairie Season 6 [disc 2] test seven | '2':9A,13C,20C '6':7A,12C,18C '7':14C 'disc':8A,19C 'hous':13C 'house':2A 'littl':12C 'little':1A 'on':3A,14C 'prairi':16C 'prairie':5A 'season':6A,17C 'seven':11A,22C 'test':10A,21C 'the':4A,15C
>
> Seven gets added as ‘seven’ and ‘7’, but the ‘2’ and ‘6’ do not.
>
> So I’m wondering if the search configuration needs to cover numeric
> tokens to make that work?
>
> select * from ts_debug('synonym_larl', '6');
>
>  alias |   description    | token | dictionaries | dictionary | lexemes
> -------+------------------+-------+--------------+------------+---------
>  uint  | Unsigned integer | 6     | {simple}     | simple     | {6}
>
> \dF+ synonym_larl;
>
> Text search configuration "public.synonym_larl"
> Parser: "pg_catalog.default"
>
>       Token      | Dictionaries
> -----------------+--------------
>  asciihword      | synonym_larl
>  asciiword       | synonym_larl
>  email           | simple
>  file            | simple
>  float           | simple
>  host            | simple
>  hword           | simple
>  hword_asciipart | synonym_larl
>  hword_numpart   | simple
>  hword_part      | simple
>  int             | simple
>  numhword        | simple
>  numword         | simple
>  sfloat          | simple
>  uint            | simple
>  url             | simple
>  url_path        | simple
>  version         | simple
>  word            | simple
>
> Maybe the uint token needs to be set to synonym_larl also? But I’m
> wondering if this has bad side effects?
>
> Also, another mapping we would like to make is ‘&’ -> ‘and’ , ‘and’ -> ‘&’.
> But it doesn’t look like tsearch knows how to categorize ‘&’ as a token.
>
> select * from ts_debug('synonym_larl', '&');
>
>  alias |  description  | token | dictionaries | dictionary | lexemes
> -------+---------------+-------+--------------+------------+---------
>  blank | Space symbols | &     | {}           |            |
>
> Works fine going the other way and the ‘&’ ends up in the index.
>
> select * from ts_debug('synonym_larl', 'and');
>
>    alias   |   description   | token |  dictionaries  |  dictionary  | lexemes
> -----------+-----------------+-------+----------------+--------------+---------
>  asciiword | Word, all ASCII | and   | {synonym_larl} | synonym_larl | {&}
>
> Thanks
> Josh
>
> Lake Agassiz Regional Library - Moorhead MN larl.org
> Josh Stompro | Office 218.233.3757 EXT-139
> LARL IT Director | Cell 218.790.2110
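
For reference, the uint mapping asked about in the quoted message could be sketched roughly as follows. This assumes the names shown in the thread (a text search configuration and a synonym dictionary both called synonym_larl). Listing simple after the synonym dictionary is an assumption on my part, chosen so that tokens the synonym file does not recognize still get indexed. It has not been tested against an Evergreen database.

```sql
-- Assumes the text search configuration and the synonym dictionary are both
-- named synonym_larl, as in the \dF+ and ts_lexize output quoted above.
-- Tokens the synonym dictionary does not recognize fall through to simple.
ALTER TEXT SEARCH CONFIGURATION synonym_larl
    ALTER MAPPING FOR uint
    WITH synonym_larl, simple;

-- Re-check how a bare digit is handled after the change.
SELECT alias, token, dictionaries, dictionary, lexemes
FROM ts_debug('synonym_larl', '6');
```

If the mapping takes effect, the dictionary column for the uint token should show synonym_larl rather than simple. Existing records would presumably still need to be reindexed before metabib.title_field_entry reflects the change.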