Re: Approaches to indexing indigenous languages?

Michael Gibney Fri, 11 Jun 2021 07:46:16 -0700

+1 to ICU, and I'd also be interested in follow-up. In case
transliteration might also be helpful for your case, I took a cursory
glance at the out-of-the-box transliteration IDs
(https://github.com/unicode-org/icu/tree/main/icu4c/source/data/translit)
and I don't think there's anything for the scripts you're interested
in (but I also didn't really know what I was looking for, so you may
want to look yourself). If you _do_ find yourself in the position of
wanting transliteration for these scripts and not being able to find
an out-of-the-box implementation, I'll also note that I _think_ it may
be more straightforward than one might initially assume to write, load,
register, and employ a custom transliterator rule file. I haven't
actually tried this yet, but the possibility occurred to me in the
course of working on LUCENE-8972 and I thought I'd share the idea.
Feel free to reach out if you decide to try to tackle the custom
transliteration; I have some preliminary ideas about how to proceed
with it.


Michael


On Fri, Jun 11, 2021 at 10:21 AM Alexandre Rafalovitch
<[email protected]> wrote:
>
> Hi Peter,
>
> This is a fascinating problem. I would not mind seeing the eventual
> solution fed back to the list.
>
> I think your best bet lies in exploring the icu4j library, which
> ships with Solr but needs to be enabled in solrconfig.xml. A little
> bit is explained at
> https://solr.apache.org/guide/8_8/language-analysis.html#unicode-collation
> and 
> https://solr.apache.org/guide/8_8/charfilterfactories.html#solr-icunormalizer2charfilterfactory
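>
> As a rough sketch of the enabling step, assuming the stock Solr 8.x
> directory layout in which the ICU jars live under
> contrib/analysis-extras (the paths are an assumption; adjust them to
> your install), the <lib> directives in solrconfig.xml might look like:
>
>         <!-- assumed paths: icu4j jar in lib/, the Lucene ICU
>              analysis jar in lucene-libs/ -->
>         <lib dir="${solr.install.dir:../../../..}/contrib/analysis-extras/lib"
>              regex=".*\.jar" />
>         <lib dir="${solr.install.dir:../../../..}/contrib/analysis-extras/lucene-libs"
>              regex=".*\.jar" />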
>
> After that, it is basically "standing on the shoulders of giants". If
> you are trying to trace where the support actually comes from: ICU4J
> is the Java implementation of ICU (International Components for
> Unicode, http://site.icu-project.org/), which in turn implements the
> Unicode standard, and Unicode seems to have support for the languages
> you discuss: https://www.unicode.org/charts/#scripts (Unified
> Canadian Aboriginal Syllabics). This seems to imply that word and
> sentence boundaries (which is what I assume you are after) are also
> in Unicode, therefore in ICU, therefore in ICU4J, therefore in Solr.
>
> And that brings us back to the valid magical invocation. The specific
> invocation would depend on the exact search issue you are trying to
> resolve and figuring out the language codes/names for your
> languages/locales.
>
> I did a Thai language demo of phonetic search against Thai text. It
> was a very long time ago, so it is not copy/paste ready, but it is
> still relevant. This is an excerpt from my demo:
> https://github.com/arafalov/solr-thai-test/blob/master/collection1/conf/schema.xml#L34-L55
>
>         <!--
>             During indexing:
>             1) tokenize Thai text with built-in rules+dictionary
>             2) map it to Latin characters (with special accents
>                indicating tones)
>             3) get rid of tone marks, as nobody uses them
>             4) do some phonetic (BMF) broadening to match possible
>                alternative spellings in English
>
>             During querying, we don't want this field type matching
>             Thai text on query (BMF is a little too aggressive for
>             that). So, we are doing an English-specific query chain.
>         -->
>         <fieldType name="thai_english" class="solr.TextField">
>             <analyzer type="index">
>                 <tokenizer class="solr.ICUTokenizerFactory"/>
>                 <filter class="solr.ICUTransformFilterFactory"
>                         id="Thai-Latin" />
>                 <filter class="solr.ICUTransformFilterFactory"
>                         id="NFD; [:Nonspacing Mark:] Remove; NFC" />
>                 <filter class="solr.BeiderMorseFilterFactory" />
>             </analyzer>
>             <analyzer type="query">
>                 <tokenizer class="solr.StandardTokenizerFactory" />
>                 <filter class="solr.LowerCaseFilterFactory" />
>                 <filter class="solr.BeiderMorseFilterFactory" />
>             </analyzer>
>         </fieldType>
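>
> For a syllabics field, an untested starting point along the same
> lines might be the following. The field type name is hypothetical,
> and whether ICUFoldingFilterFactory's diacritic folding is desirable
> for Americanist phonetic notation is an open question:
>
>         <!-- hypothetical name; not validated against real text -->
>         <fieldType name="text_syllabics" class="solr.TextField">
>             <analyzer>
>                 <!-- ICU's Unicode-aware word segmentation -->
>                 <tokenizer class="solr.ICUTokenizerFactory"/>
>                 <!-- ICU normalization, case and diacritic folding -->
>                 <filter class="solr.ICUFoldingFilterFactory"/>
>             </analyzer>
>         </fieldType>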
>
> Hope this helps,
>     Alex.
> P.S. If you progress but still get stuck, feel free to reach out
> directly as well. I am in Montreal, and the questions resonated with me.
>
> On Thu, 10 Jun 2021 at 15:38, Peter Tyrrell <[email protected]> wrote:
> >
> > I'm quite familiar with indexing English and French languages in Solr, but 
> > has anybody got any tips on indexing and querying (Canadian) indigenous 
> > First Nations languages? Depending on the language, terms may be written in 
> > a syllabic script 
> > (https://en.wikipedia.org/wiki/Canadian_Aboriginal_syllabics) or in 
> > Americanist phonetic notation 
> > (https://en.wikipedia.org/wiki/Americanist_phonetic_notation).
> >
> >
> > Peter
> >
> > Peter Tyrrell, MLIS
> > Lead Developer at Andornot
> > 1-866-266-2525 x706 / [email protected]
> >
