> Proposed Support Library
> ========================
> 
>    Assumptions
>    -----------
> 
>    The main assumption is that we'll keep using APR for character set

s/character set/character encoding/.

>    conversion, meaning that the recoding solution to choose would not
>    need to provide any other functionality than recoding.

s/recoding/converting between NFD and NFC UTF8 encodings/.

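For reference, recoding either way with utf8proc is a single call per
direction; a minimal sketch (untested, and the wrapper names are mine,
not from any existing code):

    #include <utf8proc.h>

    /* Return a malloc'd NFC copy of the UTF-8 string STR, or NULL on
       invalid input.  The caller must free() the result. */
    static char *
    to_nfc(const char *str)
    {
      return (char *) utf8proc_NFC((const utf8proc_uint8_t *) str);
    }

    /* Likewise, but decomposing to NFD. */
    static char *
    to_nfd(const char *str)
    {
      return (char *) utf8proc_NFD((const utf8proc_uint8_t *) str);
    }

So whichever normal form we pick, the recoding step itself is the same
shape; the interesting differences are in how often we have to run it.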

> Proposed Normal Form
> ====================
> 
> The proposed internal 'normal form' should be NFC, if only if
> it were because it's the most compact form of the two [...]
> would give the maximum performance from utf8proc [...]

I'm not very familiar with all the issues here, but although choosing
NFC may make individual conversions more efficient, I wonder if a
solution that involves normalizing to NFD could have benefits that are
more significant than this.  (Reading through this doc sequentially, we
get to this section on choosing NFC before we get to the list of
possible solutions, and it looks like premature optimization.)

For example, a solution that involves normalizing all input to NFD would
have the advantage that on MacOSX it would need to do *no* conversions,
and that it would continue to work with old repositories in Mac-only
shops.

Further down the road, once we have all this normalization in place and
can guarantee that all new repositories and certain parts of old
repositories (revision >= N, perhaps) are already normalized, then we
won't need to do normalization of repository paths in the majority of
cases.
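To make that concrete, the work we would be skipping is essentially an
"is this path already normalized?" test plus the occasional conversion.
A naive sketch (my own illustration, not proposing this exact code; a
real version would want a cheaper quick-check than normalize-and-compare):

    #include <stdlib.h>
    #include <string.h>
    #include <utf8proc.h>

    /* Return non-zero if PATH is already in NFC.  Normalizes a copy and
       compares; invalid UTF-8 is treated as not normalized. */
    static int
    is_normalized_nfc(const char *path)
    {
      utf8proc_uint8_t *norm = utf8proc_NFC((const utf8proc_uint8_t *) path);
      int same;

      if (norm == NULL)
        return 0;
      same = (strcmp((const char *) norm, path) == 0);
      free(norm);
      return same;
    }

For repositories (or revision ranges) known to be normalized already, we
could skip even that check.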

We will still need to run conversions on client input (from the OS and
from the user), at least on non-MacOSX clients.  In these cases we are
already running native-to-UTF8 conversions (although these are bypassed
when the native encoding is already UTF8), so I wonder whether the
overhead of normalizing to NFD is really that much greater than
normalizing to NFC.
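In other words, the client-side pipeline would look roughly like the
existing native-to-UTF8 conversion with one extra call bolted on.  A
simplified sketch using APR's xlate API and utf8proc (buffer sizing and
error handling are hand-waved; the function name is just illustrative):

    #include <string.h>
    #include <apr_pools.h>
    #include <apr_xlate.h>
    #include <utf8proc.h>

    /* Convert NAME from the locale encoding to UTF-8, then normalize
       to NFD.  Returns a malloc'd string, or NULL on error. */
    static char *
    native_to_utf8_nfd(const char *name, apr_pool_t *pool)
    {
      apr_xlate_t *convset;
      char buf[1024];
      apr_size_t inbytes = strlen(name);
      apr_size_t outbytes = sizeof(buf) - 1;

      if (apr_xlate_open(&convset, "UTF-8", APR_LOCALE_CHARSET, pool)
          != APR_SUCCESS)
        return NULL;
      if (apr_xlate_conv_buffer(convset, name, &inbytes,
                                buf, &outbytes) != APR_SUCCESS)
        return NULL;
      buf[sizeof(buf) - 1 - outbytes] = '\0';

      /* The only cost beyond what we already pay is this one call. */
      return (char *) utf8proc_NFD((const utf8proc_uint8_t *) buf);
    }

(And when the native encoding is UTF8 the xlate step is bypassed today,
so only the normalization call would remain.)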

I'm just not clear if these ideas have already been considered and
dismissed.

- Julian

