> On 17 Dec 2017, at 18:14, Torsten Bögershausen <[email protected]> wrote:
>
> On Mon, Dec 11, 2017 at 04:50:23PM +0100, [email protected] wrote:
>> From: Lars Schneider <[email protected]>
>>
>> +`encoding`
>> +^^^^^^^^^^
>> +
>> +By default Git assumes UTF-8 encoding for text files. The `encoding`
>
> This is probably "ASCII" and it's supersets like ISO-8859-1 or UTF-8.
I am not sure I understand what you want to tell me here.
Do you think UTF-8 encoding is not correct in the sentence above?
>
>> +attribute sets the encoding to be used in the working directory.
>> +If the path is added to the index, then Git encodes the content to
>> +UTF-8. On checkout the content is encoded back to the original
>> +encoding. Consequently, you can use all built-in Git text processing
>> +tools (e.g. git diff, line ending conversions, etc.) with your
>> +non-UTF-8 encoded file.
>> +
>> +Please note that re-encoding content can cause errors and requires
>> +resources. Use the `encoding` attribute only if you cannot store
>> +a file in UTF-8 encoding and if you want Git to be able to process
>> +the text.
>> +
>> +------------------------
>> +*.txt text encoding=UTF-16
>> +------------------------
>
> I think that encoding (or worktree-encoding as said elsewhere) must imply
> text.
> (That is not done in convert.c, but assuming binary or even auto
> does not make much sense to me)
"text" only means "LF". "-text" means CRLF which would be perfectly fine
for UTF-16. Therefore I don't think we need to imply text.
Does this make sense?
>
>
>> +
>> +All `iconv` encodings with a stable round-trip conversion to and from
>> +UTF-8 are supported. You can see a full list with the following command:
>> +
>> +------------------------
>> +iconv --list
>> +------------------------
>> +
>> `eol`
>> ^^^^^
>>
>> diff --git a/convert.c b/convert.c
>> index 20d7ab67bd..ee19c17104 100644
>> --- a/convert.c
>> +++ b/convert.c
>> @@ -7,6 +7,7 @@
>> #include "sigchain.h"
>> #include "pkt-line.h"
>> #include "sub-process.h"
>> +#include "utf8.h"
>>
>> /*
>> * convert.c - convert a file when checking it out and checking it in.
>> @@ -256,6 +257,149 @@ static int will_convert_lf_to_crlf(size_t len, struct
>> text_stat *stats,
>>
>> }
>
> I would avoid to use these #ifdefs here.
> All functions can be in strbuf.c (and may have #ifdefs there), but not here.
I'll try that. But wouldn't it make more sense to move the functions to utf.c?
>
>>
>> +#ifdef NO_ICONV
>> +#ifndef _ICONV_T
>> +/* The type is just a placeholder and not actually used. */
>> +typedef void* iconv_t;
>> +#endif
>> +#endif
>> +
>> +static struct encoding {
>> + const char *name;
>> + iconv_t to_git; /* conversion to Git's canonical encoding (UTF-8)
>> */
>> + iconv_t to_worktree; /* conversion to user-defined encoding */
>> + struct encoding *next;
>> +} *encoding, **encoding_tail;
>
> This seems like an optimazation, assuning that iconv_open() eats a lot of
> ressources.
> I don't think this is the case. (Undere MacOS we run iconv_open() together
> with
> every opendir(), and optimizing this out did not show any measurable
> improvements)
True, but then I would need to free() the memory in a lot of places.
Therefore I thought it is easier to keep the object. OK for you?
>> +static const char *default_encoding = "utf-8";
>
> The most portable form is "UTF-8" (correct me if that is wrong)
It shouldn't matter. But I've changed it to uppercase to be on the safe side.
>> +static iconv_t invalid_conversion = (iconv_t)-1;
>> +
>> +static int encode_to_git(const char *path, const char *src, size_t src_len,
>> + struct strbuf *buf, struct encoding *enc)
>> +{
>> +#ifndef NO_ICONV
>> + char *dst, *re_src;
>> + int dst_len, re_src_len;
>> +
>> + /*
>> + * No encoding is specified or there is nothing to encode.
>> + * Tell the caller that the content was not modified.
>> + */
>> + if (!enc || (src && !src_len))
>> + return 0;
>> +
>> + /*
>> + * Looks like we got called from "would_convert_to_git()".
>> + * This means Git wants to know if it would encode (= modify!)
>> + * the content. Let's answer with "yes", since an encoding was
>> + * specified.
>> + */
>> + if (!buf && !src)
>> + return 1;
>> +
>> + if (enc->to_git == invalid_conversion) {
>> + enc->to_git = iconv_open(default_encoding, encoding->name);
>> + if (enc->to_git == invalid_conversion)
>> + warning(_("unsupported encoding %s"), encoding->name);
>> + }
>
> /* There are 2 different types of reaction:
> Either users know what that a warning means: You asked for
> problems,
> and do the right thing. Other may may ignore the warning
> - in both cases an error is useful */
Agreed!
>> + if (enc->to_worktree == invalid_conversion)
>> + enc->to_worktree = iconv_open(encoding->name, default_encoding);
>> +
>> + /*
>> + * Do nothing if the encoding is not supported. This could happen in
>> + * two cases:
>> + * (1) The encoding is garbage. That is no problem as we would not
>> + * encode the content to UTF-8 on "add" and we would not encode
>> + * it back on "checkout".
>> + * (2) Git users use different iconv versions that support different
>> + * encodings. This could lead to problems, as the content might
>> + * not be encoded on "add" but encoded back on "checkout" (or
>> + * the other way around).
>> + * We print a one-time warning to the user in both cases above.
>> + */
>
> Isn't an error better than "garbage in -> garbage out" ?
Agreed. I changed the warning to an error.
>> diff --git a/t/t0028-encoding.sh b/t/t0028-encoding.sh
>
> (I didn't review the test yet)
>
> Another comment for a possible improvement:
> "git diff" could be told to write the worktree-encoding into the diff,
> and "git apply" can pick that up.
Yes, we could do that. However, I would tackle that in a separate series.
Thanks,
Lars