> Proposed Support Library
> ========================
>
> Assumptions
> -----------
>
> The main assumption is that we'll keep using APR for character set
> conversion, meaning that the recoding solution to choose would not
> need to provide any other functionality than recoding.

s/character set/character encoding/.

s/recoding/converting between NFD and NFC UTF8 encodings/.

> Proposed Normal Form
> ====================
>
> The proposed internal 'normal form' should be NFC, if only if
> it were because it's the most compact form of the two [...]
> would give the maximum performance from utf8proc [...]

I'm not very familiar with all the issues here, but although choosing
NFC may make individual conversions more efficient, I wonder if a
solution that involves normalizing to NFD could have benefits that are
more significant than this.  (Reading through this doc sequentially, we
get to this section on choosing NFC before we get to the list of
possible solutions, so it looks like premature optimization.)

For example, a solution that involves normalizing all input to NFD
would have the advantages that on MacOSX it would need to do *no*
conversions and would continue to work with old repositories in
Mac-only workshops.

Further down the road, once we have all this normalization in place
and can guarantee that all new repositories and certain parts of old
repositories (revision >= N, perhaps) are already normalized, we won't
need to normalize repository paths in the majority of cases.

We will still need to run conversions on client input (from the OS and
from the user), at least on non-MacOSX clients.  In these cases we are
already running native-to-UTF8 conversions (although these are bypassed
when the native encoding is UTF8), so I wonder if the overhead of
normalizing to NFD is really that much greater than NFC.

I'm just not clear if these ideas have already been considered and
dismissed.

- Julian
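For anyone less familiar with the two forms, here is a quick sketch of
the compactness point being debated (using Python's stdlib
`unicodedata` purely for illustration, not utf8proc):

```python
import unicodedata

# 'café' with the precomposed code point U+00E9 (LATIN SMALL LETTER E
# WITH ACUTE).  NFC keeps it precomposed; NFD decomposes it into a base
# letter plus U+0301 COMBINING ACUTE ACCENT, so NFD is one code point
# longer here -- this is the sense in which NFC is "the most compact
# form of the two".
s = "caf\u00e9"
nfc = unicodedata.normalize("NFC", s)
nfd = unicodedata.normalize("NFD", s)

print(len(nfc))    # 4 code points
print(len(nfd))    # 5 code points
print(nfc == nfd)  # False: byte/code-point comparison differs even
                   # though the strings are canonically equivalent
```

The MacOSX point above is the flip side of the same example: the HFS+
filesystem stores names in a decomposed form, so paths coming back from
the OS there already look like `nfd`, not `nfc`.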