+1 to ICU, and I'd also be interested in any follow-up.

In case transliteration would also be helpful for your case: I took a cursory glance at the out-of-the-box transliterator IDs (https://github.com/unicode-org/icu/tree/main/icu4c/source/data/translit) and I don't think there's anything for the scripts you're interested in (though I didn't really know what I was looking for, so you may want to look yourself).

If you _do_ find yourself wanting transliteration for these scripts and can't find an out-of-the-box implementation, I'll also note that I _think_ it may be more straightforward than one might initially assume to write, load, register, and employ a custom transliterator rule file. I haven't actually tried this yet, but the possibility occurred to me in the course of working on LUCENE-8972 and I thought I'd share the idea. Feel free to reach out if you decide to tackle custom transliteration; I have some preliminary ideas about how to proceed with it.
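
To make the custom-rule idea a little more concrete, here is a minimal, dependency-free sketch of what a rule set boils down to: an ordered table of source sequences and their Latin replacements, applied left to right with the longest match winning. This is Python used purely for illustration, not ICU or ICU4J; the names `RULES` and `transliterate` are hypothetical, and the table covers only a handful of Inuktitut syllabics, so it is not a complete or authoritative romanization. In an ICU rule file each entry would be written roughly as `ᐃ > i;`.

```python
# Hypothetical stand-in for a custom transliterator rule set.
# Illustrative only: a few Inuktitut syllabics, not ICU data.
RULES = {
    "ᐃ": "i",
    "ᓄ": "nu",
    "ᒃ": "k",
    "ᑎ": "ti",
    "ᑐ": "tu",
    "ᑦ": "t",
}


def transliterate(text, rules=RULES):
    """Rewrite text left to right, trying the longest matching rule first."""
    longest = max(map(len, rules))
    out = []
    i = 0
    while i < len(text):
        # Try the longest candidate chunk first, then shorter ones.
        for size in range(min(longest, len(text) - i), 0, -1):
            chunk = text[i:i + size]
            if chunk in rules:
                out.append(rules[chunk])
                i += size
                break
        else:
            # No rule applies: pass the character through unchanged.
            out.append(text[i])
            i += 1
    return "".join(out)


print(transliterate("ᐃᓄᒃᑎᑐᑦ"))  # prints: inuktitut
```

Characters without a rule pass through untouched, so mixed syllabic and Latin text survives. A real rule file would need a full mapping for each script and orthography involved.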

Michael

On Fri, Jun 11, 2021 at 10:21 AM Alexandre Rafalovitch <[email protected]> wrote:
>
> Hi Peter,
>
> This is a fascinating problem. I would not mind seeing a resolved
> solution fed back into the list.
>
> I think your best bet lies in exploring the icu4j library that ships
> with Solr, but needs to be enabled in solrconfig.xml. A little bit is
> explained at
> https://solr.apache.org/guide/8_8/language-analysis.html#unicode-collation
> and
> https://solr.apache.org/guide/8_8/charfilterfactories.html#solr-icunormalizer2charfilterfactory
>
> After that, it is basically "the shoulders of the giants". If you are
> trying to trace the true support then ICU4J is the implementation of
> http://site.icu-project.org/ (International Components for Unicode)
> which implements Unicode, which seems to have support for the
> languages you discuss: https://www.unicode.org/charts/#scripts
> (Unified Canadian Aboriginal Syllabics). This seems to imply that word
> and sentence boundaries (which is what I assume you are after) are
> also in Unicode, therefore in ICU, therefore in ICU4j, therefore in
> Solr.
>
> And that brings us back to the valid magical invocation. The specific
> invocation would depend on the exact search issue you are trying to
> resolve and figuring out the language codes/names for your
> languages/locales.
>
> I did do a Thai language demo of phonetic search against Thai text.
> Very long time ago, so not a copy/paste, but still relevant. This is
> excerpt from my demo:
> https://github.com/arafalov/solr-thai-test/blob/master/collection1/conf/schema.xml#L34-L55
>
> <!--
>     During indexing:
>     1) tokenize Thai text with built-in rules+dictionary
>     2) map it to latin characters (with special accents indicating tones
>     3) get rid of tone marks, as nobody uses them
>     4) do some phonetic (BMF) broadening to match possible
>        alternative spellings in English
>
>     During querying, we don't want this field type matching
>     Thai text on query (BMFF is a little too aggressive for that). So, we
>     are doing English-specific query chain
> -->
> <fieldType name="thai_english" class="solr.TextField">
>   <analyzer type="index">
>     <tokenizer class="solr.ICUTokenizerFactory"/>
>     <filter class="solr.ICUTransformFilterFactory" id="Thai-Latin" />
>     <filter class="solr.ICUTransformFilterFactory"
>             id="NFD; [:Nonspacing Mark:] Remove; NFC" />
>     <filter class="solr.BeiderMorseFilterFactory" />
>   </analyzer>
>   <analyzer type="query">
>     <tokenizer class="solr.StandardTokenizerFactory" />
>     <filter class="solr.LowerCaseFilterFactory" />
>     <filter class="solr.BeiderMorseFilterFactory" />
>   </analyzer>
> </fieldType>
>
> Hope this helps,
>    Alex.
> P.s. If you progress but still get stuck, feel free to reach out
> directly as well. I am in Montreal, the questions resonated with me.
>
> On Thu, 10 Jun 2021 at 15:38, Peter Tyrrell <[email protected]> wrote:
> >
> > I'm quite familiar with indexing English and French languages in Solr, but
> > has anybody got any tips on indexing and querying (Canadian) indigenous
> > First Nations languages? Depending on the language, terms may be written in
> > a syllabic script
> > (https://en.wikipedia.org/wiki/Canadian_Aboriginal_syllabics) or in
> > Americanist phonetic notation
> > (https://en.wikipedia.org/wiki/Americanist_phonetic_notation).
> >
> >
> > Peter
> >
> > Peter Tyrrell, MLIS
> > Lead Developer at Andornot
> > 1-866-266-2525 x706 / [email protected]
> >
