> On 15 Dec 2017, at 10:58, Jeff King <p...@peff.net> wrote:
> 
> On Mon, Dec 11, 2017 at 04:50:23PM +0100, lars.schnei...@autodesk.com wrote:
> 
>> From: Lars Schneider <larsxschnei...@gmail.com>
>> 
>> Git and its tools (e.g. git diff) expect all text files in UTF-8
>> encoding. Git will happily accept content in all other encodings, too,
>> but it might not be able to process the text (e.g. viewing diffs or
>> changing line endings).
>> 
>> Add an attribute to tell Git what encoding the user has defined for a
>> given file. If the content is added to the index, then Git converts the
>> content to a canonical UTF-8 representation. On checkout Git will
>> reverse the conversion.
> 
> This made me wonder what happens when you have a file that _was_ utf16,
> and then you later convert it to utf8 and add a .gitattributes entry.
> 
> I tried this with your patch:
> 
>  git init repo
>  cd repo
> 
>  echo foo | iconv -t utf16 >file
>  git add file
>  git commit -m utf16
> 
>  echo 'file encoding=utf16' >.gitattributes
>  touch file ;# make it stat-dirty
>  git commit -m convert
> 
>  git checkout HEAD^
> 
> That works OK, because we try to read the attributes from the
> destination tree for a checkout. If you do this:
> 
>  echo 'file encoding=utf16' >.git/info/attributes
>  git checkout HEAD^
> 
> then we get:
> 
>  warning: failed to encode 'file' from utf-8 to utf16
> 
> At least it figured out that it couldn't convert the content. It's
> slightly troubling that it would try in the first place, though; are
> there encoding pairs where we might accidentally generate nonsense?

At this point we interpret utf-16 content as utf-8 and try to convert
it to utf-16. That of course fails because utf-16 content is no valid
utf-8. How could we stop trying that? How could Git possibly know what
kind of encoding is used (apart from our new hint in gitattributes)?

I asked myself the question about nonsense encoding pairs, too. That
was the reason I added the "X -> UTF-8 -> Y && X == Y" check on Git
add. However, with ".git/info/attributes" you could circumvent that
check and probably generate nonsense.


> It _should_ be uncommon, I think, to have a repo-level attribute set
> like that. It's very common for us to use whatever happens to be in the
> checked-out .gitattributes for some attributes (e.g., when doing a diff
> of an older commit), but I think for checkout-related ones it's not.  So
> I think it may generally work in practice. And certainly the line-ending
> code would share any similar problems, but at least there the result is
> less confusing than mojibake.
> 
> Playing around, I also managed to do this:
> 
>  echo 'file encoding=utf16' >.gitattributes
>  echo foo >file
> 
>  # I did these with an old version of git that didn't
>  # support the new attribute, so they blindly added the utf8 content.
>  git add .
>  git commit -m convert
> 
>  git.compile checkout HEAD^
> 
> which yielded:
> 
>  fatal: encoding 'file' from utf16 to utf-8 and back is not the same
> 
> Now obviously my situation is somewhat nonsense. I was trying to force
> the in-repo representation to utf8, but ended up with a mismatched
> working tree file. But what's somewhat troubling is that I couldn't
> checkout _away_ from that state due to the die() in convert_to_git().
> Which is in turn just there as part of refresh_index().
> 
> And indeed, other commands hit the same problem:
> 
>  $ git.compile diff
>  fatal: encoding 'file' from utf16 to utf-8 and back is not the same
> 
>  $ git.compile checkout -f
>  fatal: encoding 'file' from utf16 to utf-8 and back is not the same
> 
> It may make sense to die() during "git add ." (since we're actually
> changing the index entry, and we don't want to put nonsense into a
> tree). But I'm not sure it's the best thing for operations which just
> want to read the content. For them, perhaps it would be more appropriate
> to issue a warning and return the untouched content.

Absolutely! Thanks for spotting this. I will try to run die() only on
"git add" in v2.

- Lars

Reply via email to