lars.schnei...@autodesk.com writes:

> +static int validate_encoding(const char *path, const char *enc,
> +			const char *data, size_t len, int die_on_error)
> +{
> +	/* We only check for UTF here as UTF?? can be an alias for UTF-?? */
> +	if (startscase_with(enc, "UTF")) {
> +		/*
> +		 * Check for detectable errors in UTF encodings
> +		 */
> +		if (has_prohibited_utf_bom(enc, data, len)) {
> +			const char *error_msg = _(
> +				"BOM is prohibited in '%s' if encoded as %s");
> +			/*
> +			 * This advice is shown for UTF-??BE and UTF-??LE encodings.
> +			 * We cut off the last two characters of the encoding name
> +			 * to generate the encoding name suitable for BOMs.
> +			 */
Yuck. The code pretends to abstract away the details in a helper
has_prohibited_x() yet the caller still knows quite a lot.

> +			const char *advise_msg = _(
> +				"The file '%s' contains a byte order "
> +				"mark (BOM). Please use %s as "
> +				"working-tree-encoding.");
> +			char *upper_enc = xstrdup_toupper(enc);
> +			upper_enc[strlen(upper_enc)-2] = '\0';
> +			advise(advise_msg, path, upper_enc);
> +			free(upper_enc);

I think this up-casing is more problematic than not doing it, not from
the point of view of the internal code, but from the point of view of
the end-user experience.

When the user writes utf16le or utf-16le and the data does not trigger
the BOM check, we are likely to convert it successfully. I do not see
the merit of suggesting UTF16 or UTF-16 in such a case over telling
them to just drop the byte-order suffix from the encoding name
(i.e. utf16 or utf-16).

If you are trying to force/nudge people in the direction of the
canonical way of spelling things (which may not be a bad idea), then
"utf16le" as the original input would want to result in "UTF-16", with
a dash, in the advice, no? On the other hand, if we are not enforcing
such a policy decision but merely explaining a way to work around this
check, then it may be better to give a variant with the smaller
difference from the original (i.e. without up-casing).
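
To illustrate the first point about the helper (only a sketch; the name
check_utf_bom and its exact shape are mine, not from the patch): if the
check and the resulting message lived together, the caller would not
need to know that the advised name comes from chopping two characters
off an up-cased copy of 'enc', e.g.

	/* sketch only -- not part of the patch */
	static int check_utf_bom(const char *path, const char *enc,
				 const char *data, size_t len)
	{
		char *upper_enc;

		if (!has_prohibited_utf_bom(enc, data, len))
			return 0;

		/* the name-mangling knowledge stays inside the helper */
		upper_enc = xstrdup_toupper(enc);
		upper_enc[strlen(upper_enc) - 2] = '\0';
		advise(_("The file '%s' contains a byte order "
			 "mark (BOM). Please use %s as "
			 "working-tree-encoding."), path, upper_enc);
		free(upper_enc);
		return -1;
	}

with the caller reduced to something like
"if (check_utf_bom(path, enc, data, len)) ...".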
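
To make the "smaller difference from the original" alternative concrete
(again just a sketch, and the variable name is made up), the caller
could keep the user's own spelling and merely drop the byte-order
suffix:

			/* e.g. "utf-16le" becomes "utf-16", "utf16le" becomes "utf16" */
			char *suggestion = xstrndup(enc, strlen(enc) - 2);
			advise(advise_msg, path, suggestion);
			free(suggestion);

That keeps the advice as close as possible to what the user originally
wrote.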