Several Reply-To: follows today..

--- Forwarded from Steffen Nurpmeso <[email protected]> ---
Date: Mon, 02 Dec 2024 20:52:08 +0100
Author: Steffen Nurpmeso <[email protected]>
From: Steffen Nurpmeso <[email protected]>
To: Doug Ewell <[email protected]>
Subject: Re: [private] German sharp S uppercase mapping
Message-ID: <20241202195208.puaParvJ@steffen%sdaoden.eu>

Doug Ewell wrote in
<sj0pr03mb659877d11b14cd8301fa25bfca...@sj0pr03mb6598.namprd03.prod.outl\
ook.com>:
|Steffen Nurpmeso wrote:
|
|>|Casing for text meant for human readers should follow current local
|>|conventions.
|>|
|>|Casing for text meant for machine processing (file systems,
|>|databases, etc.) must remain stable, even when local conventions
|>|change.
|>
|> Sorry that makes totally no sense to me.
|
|I am guessing you haven’t had to provide support for systems (computer \
|or otherwise) which depend on standards that are not stable, or which \
|introduce their own instability.

Sure, i use ISO C (ha!), not to mention IDNA 2003/8.

|When your internal database lookup function expects the uppercase form \
|of „schließen” to be „SCHLIESSEN”, and one day the user-level function \
|fails because the internal lookup now expects „SCHLIEẞEN”, it won’t \
|matter much that the internal function is more correct.

Sounds like bad design really.  Ok ok that sounds fat now, but
really i have a hard time transposing your words to real life
software.  You know, and that is *so* bad in real life (i actually
drowned in examples, some of which i produced myself).
Unicode has stability, U+00DF is small and U+1E9E is uppercase.

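As a hedged illustration of the lookup problem discussed above: a key
derived from uppercasing depends on which uppercase mapping is in force,
whereas a key derived from case folding comes out the same for all three
spellings. This is only a sketch using Python's built-in str methods;
lookup_key is a hypothetical helper name, not something from this thread.

```python
# Sketch only: lookup_key is a hypothetical helper, not code from the thread.

def lookup_key(text: str) -> str:
    """Return a case-insensitive key using Unicode full case folding."""
    return text.casefold()

lower = "schlie\u00dfen"        # schließen, with U+00DF
upper_ss = "SCHLIESSEN"         # traditional uppercase form
upper_sharp = "SCHLIE\u1e9eEN"  # SCHLIEẞEN, with U+1E9E

# Python's default uppercase mapping of U+00DF is the two letters "SS".
print(lower.upper())

# U+1E9E lowercases to U+00DF, so the round trip is not symmetric.
print(upper_sharp.lower())

# Folding all three spellings yields one and the same key.
keys = {lookup_key(s) for s in (lower, upper_ss, upper_sharp)}
print(keys)
```

With folded keys the stored form does not depend on which of the two
uppercase spellings a user-level function happens to produce.
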
The issue is old it seems:

  # (cd /x/doc/coding/charset-plus/data/; grep -ri 1E9E)
  [hand selected lines]
  auxiliary/SentenceBreakProperty.txt:1E9E ; Upper # L& LATIN CAPITAL LETTER SHARP S
  extracted/DerivedName.txt:1E9E ; LATIN CAPITAL LETTER SHARP S
  extracted/DerivedGeneralCategory.txt:1E9E ; Lu # LATIN CAPITAL LETTER SHARP S
  CaseFolding.txt:1E9E; F; 0073 0073; # LATIN CAPITAL LETTER SHARP S
  CaseFolding.txt:1E9E; S; 00DF; # LATIN CAPITAL LETTER SHARP S
  DerivedCoreProperties.txt:1E9E ; Uppercase # L& LATIN CAPITAL LETTER SHARP S
  DerivedCoreProperties.txt:1E9E ; Changes_When_Lowercased # L& LATIN CAPITAL LETTER SHARP S
                                   ^ here
  DerivedCoreProperties.txt:1E9E ; Changes_When_Casefolded # L& LATIN CAPITAL LETTER SHARP S
                                   ^ here
  DerivedCoreProperties.txt:1E9E ; Changes_When_Casemapped # L& LATIN CAPITAL LETTER SHARP S
                                   ^ here
  DerivedNormalizationProps.txt:1E9E ; NFKC_CF; 0073 0073 # L& LATIN CAPITAL LETTER SHARP S
  DerivedNormalizationProps.txt:1E9E ; Changes_When_NFKC_Casefolded # L& LATIN CAPITAL LETTER SHARP S
  NamesList.txt: * uppercase is "SS" or 1E9E
  NamesList.txt:1E9E LATIN CAPITAL LETTER SHARP S
  UnicodeData.txt:1E9E;LATIN CAPITAL LETTER SHARP S;Lu;0;L;;;;;N;;;;00DF;

So a complete implementation dealing with Unicode always had to
deal with this issue.  Even my s-ctext which i started by the end
of March 2013 and practically stopped in October 2013 due to a CVE
to a codebase i maintain, without having been informed on it, that
is, but to which i will hopefully come back at a later time, knew
about that already.

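To make the two CaseFolding.txt records in the grep output above
concrete, here is a minimal sketch that parses exactly those two lines
and applies either the full (F) or the simple (S) folding. The function
names parse_folds and fold are made up for illustration, and a real
implementation would load the complete file rather than two records.

```python
# The two CaseFolding.txt records for U+1E9E, copied from the grep output
# with the "CaseFolding.txt:" file prefix removed.
RECORDS = """\
1E9E; F; 0073 0073; # LATIN CAPITAL LETTER SHARP S
1E9E; S; 00DF; # LATIN CAPITAL LETTER SHARP S
"""

def parse_folds(text):
    """Map (code point, status) to the folded string."""
    folds = {}
    for line in text.splitlines():
        data = line.split("#", 1)[0].strip()  # drop the trailing comment
        if not data:
            continue
        code, status, mapping = [f.strip() for f in data.split(";")[:3]]
        folded = "".join(chr(int(cp, 16)) for cp in mapping.split())
        folds[(int(code, 16), status)] = folded
    return folds

def fold(s, folds, full=True):
    """Fold s using status F (full) or S (simple); status C serves both."""
    want = "F" if full else "S"
    out = []
    for ch in s:
        cp = ord(ch)
        out.append(folds.get((cp, "C"), folds.get((cp, want), ch)))
    return "".join(out)

FOLDS = parse_folds(RECORDS)
print(fold("\u1e9e", FOLDS, full=True))   # full folding: two code points
print(fold("\u1e9e", FOLDS, full=False))  # simple folding: one code point
```

Characters without a record in the table pass through unchanged, so
with only these two records everything except U+1E9E is left alone.
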
  #?0|kent:.s-ctext.git$ git grep -i Changes_When_Lowercase master|wc -l
  573
  #?0|kent:.s-ctext.git$ git grep -i Changes_When_Lowercase master|tail -1
  master:tools/ucd-props.h:# define sct_Changes_When_Lowercased (1ull<<47)

(I want to point out that the header comments
/* Aiieeh, we cannot use enum due to datatype restrictions <-> portability */)

 -- End forward <20241202195208.puaParvJ@steffen%sdaoden.eu>

--steffen
|
|Der Kragenbaer,                The moon bear,
|der holt sich munter           he cheerfully and one by one
|einen nach dem anderen runter  wa.ks himself off
|(By Robert Gernhardt)
|
|And in Fall, feel "The Dropbear Bard"s ball(s).
|
|The banded bear
|without a care,
|Banged on himself for e'er and e'er
|
|Farewell, dear collar bear
