Re: [RFC/PATCH 2/8 v2] git_remote_helpers: fix input when running under Python 3

2013-01-15 Thread Junio C Hamano
John Keeping  writes:

>> That really feels wrong.  Displaying is a separate issue and it is
>> the _right_ thing to punt the problem at the lower-level machinery
>> level.
>
> But the display will require decoding the ref name to a Unicode string,
> which depends on the encoding of the underlying ref name, so it feels
> like it should be decoded where it's read (see [1]).

If you botch the decoding in a way you cannot recover the original
byte string, you cannot create a ref whose name is the original byte
string, no?  Keeping the original byte string internally (this
includes where you use it to create new refs or update existing
refs), and attempting to convert it to Unicode when you choose to
show that string as a part of a message to the user (and falling
back to replacing some bytes to '?' if you cannot, but do so only in
the message), you won't have that problem.
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC/PATCH 2/8 v2] git_remote_helpers: fix input when running under Python 3

2013-01-15 Thread John Keeping
On Tue, Jan 15, 2013 at 12:51:13PM -0800, Junio C Hamano wrote:
> John Keeping  writes:
>> Although 2to3 will fix most issues in Python 2 code to make it run under
>> Python 3, it does not handle the new strict separation between byte
>> strings and unicode strings.  There is one instance in
>> git_remote_helpers where we are caught by this, which is when reading
>> refs from "git for-each-ref".
>>
>> While we could fix this by explicitly handling refs as byte strings,
>> this is merely punting the problem to users of the library since the
>> same problem will be encountered as soon you want to display the ref
>> name to a user.
>>
>> Instead of doing this, explicit decode the incoming byte string into a
>> unicode string.
> 
> That really feels wrong.  Displaying is a separate issue and it is
> the _right_ thing to punt the problem at the lower-level machinery
> level.

But the display will require decoding the ref name to a Unicode string,
which depends on the encoding of the underlying ref name, so it feels
like it should be decoded where it's read (see [1]).

>> Following the lead of pygit2 (the Python bindings for
>> libgit2 - see [1] and [2]),...
> 
> I do not think other people getting it wrong is not an excuse to
> repeat the same mistake.
>
> Is it really so cumbersome to handle byte strings as byte strings in
> Python?

As [1] says, there is a potential for bugs whenever people attempt to
combine Unicode and byte strings.  I think it also violates the
principle of least surprise if a ref name (a string) doesn't behave like
a normal string.

[1] 
http://docs.python.org/3.3/howto/unicode.html#tips-for-writing-unicode-aware-programs


John
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC/PATCH 2/8 v2] git_remote_helpers: fix input when running under Python 3

2013-01-15 Thread Junio C Hamano
John Keeping  writes:

> Although 2to3 will fix most issues in Python 2 code to make it run under
> Python 3, it does not handle the new strict separation between byte
> strings and unicode strings.  There is one instance in
> git_remote_helpers where we are caught by this, which is when reading
> refs from "git for-each-ref".
>
> While we could fix this by explicitly handling refs as byte strings,
> this is merely punting the problem to users of the library since the
> same problem will be encountered as soon you want to display the ref
> name to a user.
>
> Instead of doing this, explicit decode the incoming byte string into a
> unicode string.

That really feels wrong.  Displaying is a separate issue and it is
the _right_ thing to punt the problem at the lower-level machinery
level.

> Following the lead of pygit2 (the Python bindings for
> libgit2 - see [1] and [2]),...

I do not think other people getting it wrong is not an excuse to
repeat the same mistake.

Is it really so cumbersome to handle byte strings as byte strings in
Python?
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[RFC/PATCH 2/8 v2] git_remote_helpers: fix input when running under Python 3

2013-01-15 Thread John Keeping
Although 2to3 will fix most issues in Python 2 code to make it run under
Python 3, it does not handle the new strict separation between byte
strings and unicode strings.  There is one instance in
git_remote_helpers where we are caught by this, which is when reading
refs from "git for-each-ref".

While we could fix this by explicitly handling refs as byte strings,
this is merely punting the problem to users of the library since the
same problem will be encountered as soon you want to display the ref
name to a user.

Instead of doing this, explicit decode the incoming byte string into a
unicode string.  Following the lead of pygit2 (the Python bindings for
libgit2 - see [1] and [2]), use the filesystem encoding by default,
providing a way for callers to override this if necessary.

[1] 
https://github.com/libgit2/pygit2/blob/e34911b63e5d2266f9f72a4e3f32e27b13190feb/src/pygit2/reference.c#L261
[2] 
https://github.com/libgit2/pygit2/blob/e34911b63e5d2266f9f72a4e3f32e27b13190feb/include/pygit2/utils.h#L55

Signed-off-by: John Keeping 
---

I think this is in fact the best way to handle this, and I hope the
above description clarified why I don't think we want to treat refs as
byte strings in Python 3.

My only remaining question is whether it would be better to set the
error mode when decoding to "replace" instead of "strict" (the default).
"strict" will cause a UnicodeError if the string cannot be decoded
whereas "replace" will use U+FFFD (the replacement character). [3]

I think it's better to use "strict" and let the user know that
something has gone wrong rather than silently change the string, but I'd
welcome other opinions.

[3] http://docs.python.org/2/library/codecs.html#codec-base-classes

 git_remote_helpers/git/importer.py | 14 --
 1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/git_remote_helpers/git/importer.py 
b/git_remote_helpers/git/importer.py
index e28cc8f..5bc16a4 100644
--- a/git_remote_helpers/git/importer.py
+++ b/git_remote_helpers/git/importer.py
@@ -1,5 +1,6 @@
 import os
 import subprocess
+import sys
 
 from git_remote_helpers.util import check_call, check_output
 
@@ -10,17 +11,26 @@ class GitImporter(object):
 This importer simply delegates to git fast-import.
 """
 
-def __init__(self, repo):
+def __init__(self, repo, ref_encoding=None):
 """Creates a new importer for the specified repo.
+
+If ref_encoding is specified that refs are decoded using that
+encoding.  Otherwise the system filesystem encoding is used.
 """
 
 self.repo = repo
+self.ref_encoding = ref_encoding
 
 def get_refs(self, gitdir):
 """Returns a dictionary with refs.
 """
 args = ["git", "--git-dir=" + gitdir, "for-each-ref", "refs/heads"]
-lines = check_output(args).strip().split('\n')
+encoding = self.ref_encoding
+if encoding is None:
+encoding = sys.getfilesystemencoding()
+if encoding is None:
+encoding = sys.getdefaultencoding()
+lines = check_output(args).decode(encoding).strip().split('\n')
 refs = {}
 for line in lines:
 value, name = line.split(' ')
-- 
1.8.1

--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html