https://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=9729

--- Comment #10 from David Cook <dc...@prosentient.com.au> ---
Oh I've had some fun playing with ICU...

chain.xml:
<icu_chain locale="">
  <tokenize rule="l"/>
  <transliterate rule="[:Punctuation:] } [:WhiteSpace:] > ''"/>
  <transform rule="[:WhiteSpace:] Remove "/>
  <display/>
  <casemap rule="l"/>
</icu_chain>

echo -n '.NET. test' | yaz-icu -c chain.xml
1 1 '.net'' '.NET''
2 1 'test' 'test'

--
Here we tokenize based on the line break (i.e. space), and then we perform our
transliterate and transform rules as per
http://userguide.icu-project.org/transforms/general.

With the transliterate, we can use the following syntax:

"before_context { text_to_replace } after_context > completed_result |
result_to_revisit ;"

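So here the "text_to_replace" is the [:Punctuation:] and the "after_context" is
[:WhiteSpace:], and the completed result is meant to be transliterating the
punctuation into nothing. (In ICU rule syntax, '' is actually a literal single
quote, which is where the extra ' in the output above comes from.)

So we replace the "." at the end of NET but we don't touch the "." at the
start.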

Of course, that doesn't really work in practice, because it misses sooo many
other scenarios:

echo -n 'Was that a good idea?' | yaz-icu -c chain.xml
1 1 'was' 'Was'
2 1 'that' 'that'
3 1 'a' 'a'
4 1 'good' 'good'
5 1 'idea?' 'idea?'
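
To make the failure mode easier to see, here is a rough Python emulation of what the rule is intended to do. It is not ICU and not what yaz-icu runs: a regex lookahead stands in for the "after_context", string.punctuation is an ASCII-only stand-in for [:Punctuation:], and the function names are made up for illustration.

```python
import re
import string

# ASCII-only stand-in for ICU's [:Punctuation:]; an assumption for illustration.
PUNCT = "[" + re.escape(string.punctuation) + "]"


def strip_punct_before_space(text):
    """Drop punctuation only when whitespace follows it (the after-context)."""
    return re.sub(PUNCT + r"(?=\s)", "", text)


def tokens(text):
    """Apply the rule, split on whitespace, and lowercase each token."""
    return [t.lower() for t in strip_punct_before_space(text).split()]


print(tokens(".NET. test"))
print(tokens("Was that a good idea?"))
```

The "?" survives in the second call because nothing follows it: the lookahead needs a whitespace character after the punctuation, and the end of the input is not one.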

I'm not really sure how to solve this problem in an efficient way. We could
just map "C#", "C++", and ".NET" to "csharp", "cplusplus", and "dotnet", but
that's not a very scalable or comprehensive solution for all Koha users.
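
For what it's worth, the mapping idea would look something like the Python sketch below. The table and the function name are hypothetical, and this is not how Koha or Zebra would actually wire it in; it only shows why the approach needs one hand-maintained entry per term.

```python
# Hypothetical synonym table; the three entries are the ones named above.
SYNONYMS = {"c#": "csharp", "c++": "cplusplus", ".net": "dotnet"}


def normalize_token(token):
    """Lowercase the token and swap in its mapped form if one exists."""
    key = token.lower()
    return SYNONYMS.get(key, key)


print([normalize_token(t) for t in "C# C++ .NET Perl".split()])
```

Anything not in the table just falls through lowercased, so every new term with significant punctuation would need its own entry.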
