> On 23 Feb 2018, at 21:11, Junio C Hamano <[email protected]> wrote:
>
> Junio C Hamano <[email protected]> writes:
>
>> Lars Schneider <[email protected]> writes:
>>
>>> I still think it would be nice to see diffs for arbitrary encodings.
>>> Would it be an option to read the `encoding` attribute and use it in
>>> `git diff`?
>>
>> Reusing that gitk-only thing and suddenly start doing so would break
>> gitk users, no? The tool expects the diff to come out encoded in
>> the encoding that is specified by that attribute (which is learned
>> from get_path_encoding helper) and does its thing.
>>
>> I guess that gitk uses diff-tree plumbing and you won't be applying
>> this change to the plumbing, perhaps? If so, it might not be too
>> bad, but those who decided to postprocess "git diff" output (instead
>> of "git diff-tree" output) mimicking how gitk does it by thinking
>> that is the safe and sane thing to do will be broken by such a
>> change. You could do "use the encoding only when a command line
>> option says so", but then people will add a configuration variable
>> to turn it always on and these existing scripts will be broken.
>>
>> I do not personally have much sympathy for the last case (i.e. those
>> who scripted around 'git diff' instead of 'git diff-tree' to get
>> broken), so making the new feature only work with the Porcelain "git
>> diff" might be an option. I'll need a bit more time to formulate
>> the rest of my thought ;-)
>
> So we are introducing in this series a way to say in what encoding
> the things should be placed in the working tree files (i.e. the
> w-t-e attribute attached to the paths). Currently there is no
> mechanism to say what encoding the in-repo contents are and UTF-8 is
> assumed when conversion from/to w-t-e is required, but there is no
> fundamental reason why it shouldn't be customizable (if anything, as
> a piece of fact on the in-repo data, in-repo-encoding is *more*
> appropriate to be an attribute than w-t-e that can merely be project
> preference at best, as I mentioned earlier in this thread).
Correct.
> We always use the in-repo contents when generating 'diff'. I think
> by "attribute to be used in diff", what you are really after is
> to convert the in-repo contents to that encoding _BEFORE_ running
> 'diff' on it. E.g. in-repo UTF-16 that can have NUL bytes all over
> the place will not diff well with the xdiff machinery, but if you
> first convert it to UTF-8 and have xdiff work on it, you can get
> reasonable result out of it. It is unclear what encoding you want
> your final diff output in (it is equally valid in such a set-up to
> desire your patch output in UTF-16 or UTF-8), but assuming that you
> want UTF-8 in your patch output, perhaps we do not have to break
> gitk users by hijacking the 'encoding' attribute. Instead what you
> want is a single bit that says between in-repo or working tree which
> representation should be given to the xdiff machinery.

I fear that we could confuse users with an additional knob/bit that
defines what we diff against. Git has always diffed against in-repo
content and I feel it should stay that way.

However, I agree with your earlier emails that "working-tree-encoding"
is just one half of the feature. I also think it would be nice to be
able to define the "in-repo-encoding" as well. Then we could define
something like this:
*.foo text in-repo-encoding=UTF-16LE
This tells Git that the file is stored as UTF-16LE in the repository.
Git could then generate a diff by converting the content to UTF-8
first. I feel that the final patch should be in UTF-16LE again. Maybe
over time we could then deprecate the "encoding" attribute, as the
"in-repo-encoding" attribute serves a similar purpose (maybe gitk can
switch to it).
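
To make the idea concrete, here is a rough Python sketch of "diff via
UTF-8 conversion, patch back in UTF-16LE". This is not Git code and says
nothing about how xdiff would be wired up; the function name
diff_via_utf8 and the file names a/file.foo and b/file.foo are made up
for illustration, and difflib stands in for the real diff machinery.

```python
import difflib


def diff_via_utf8(old_blob: bytes, new_blob: bytes,
                  in_repo_encoding: str = "utf-16-le") -> bytes:
    """Hypothetical helper: diff two in-repo blobs via a UTF-8 detour."""
    # Convert the in-repo bytes to UTF-8 so a line-based diff is not
    # confused by the NUL bytes that UTF-16 puts between ASCII characters.
    old_utf8 = old_blob.decode(in_repo_encoding).encode("utf-8")
    new_utf8 = new_blob.decode(in_repo_encoding).encode("utf-8")

    # Run a plain unified diff on the UTF-8 representation.
    patch_lines = difflib.diff_bytes(
        difflib.unified_diff,
        old_utf8.splitlines(keepends=True),
        new_utf8.splitlines(keepends=True),
        b"a/file.foo",
        b"b/file.foo",
    )
    patch_utf8 = b"".join(patch_lines)

    # Convert the finished patch back to the in-repo encoding.
    return patch_utf8.decode("utf-8").encode(in_repo_encoding)


old = "hello\nworld\n".encode("utf-16-le")
new = "hello\ngit\n".encode("utf-16-le")
patch = diff_via_utf8(old, new)
print(patch.decode("utf-16-le"))
```

Whether the final patch really should be converted back is exactly the
open question from above; dropping the last conversion would give the
UTF-8 patch output instead.
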
In that case we could also do things like this:
*.bar text working-tree-encoding=SHIFT-JIS in-repo-encoding=UTF-16LE
SHIFT-JIS encoded files would be reencoded to UTF-16LE on checkin.
On checkout the opposite would happen. This way we would lift the
"UTF-8 is the only in-repo encoding" limitation of the current w-t-e
implementation.
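
As a rough illustration of that round trip, here is a Python sketch.
Again, this is not Git code: checkin and checkout are hypothetical
names, and the codec names are Python's spellings of SHIFT-JIS and
UTF-16LE.

```python
def checkin(worktree_bytes: bytes,
            working_tree_encoding: str = "shift_jis",
            in_repo_encoding: str = "utf-16-le") -> bytes:
    """Hypothetical: re-encode working tree content for storage."""
    return worktree_bytes.decode(working_tree_encoding).encode(in_repo_encoding)


def checkout(repo_bytes: bytes,
             working_tree_encoding: str = "shift_jis",
             in_repo_encoding: str = "utf-16-le") -> bytes:
    """Hypothetical: re-encode stored content for the working tree."""
    return repo_bytes.decode(in_repo_encoding).encode(working_tree_encoding)


original = "日本語\n".encode("shift_jis")
stored = checkin(original)
# The round trip should give back the same working tree bytes.
assert checkout(stored) == original
```

A real implementation would still have to decide what to do with
content that cannot be converted in one of the two directions.
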
Does this sound sensible to you? That being said, I think
"in-repo-encoding" deserves its own series.
- Lars