Re: [RFC/PATCH 2/8 v2] git_remote_helpers: fix input when running under Python 3
John Keeping writes: >> That really feels wrong. Displaying is a separate issue and it is >> the _right_ thing to punt the problem at the lower-level machinery >> level. > > But the display will require decoding the ref name to a Unicode string, > which depends on the encoding of the underlying ref name, so it feels > like it should be decoded where it's read (see [1]). If you botch the decoding in a way you cannot recover the original byte string, you cannot create a ref whose name is the original byte string, no? Keeping the original byte string internally (this includes where you use it to create new refs or update existing refs), and attempting to convert it to Unicode when you choose to show that string as a part of a message to the user (and falling back to replacing some bytes to '?' if you cannot, but do so only in the message), you won't have that problem. -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC/PATCH 2/8 v2] git_remote_helpers: fix input when running under Python 3
On Tue, Jan 15, 2013 at 12:51:13PM -0800, Junio C Hamano wrote: > John Keeping writes: >> Although 2to3 will fix most issues in Python 2 code to make it run under >> Python 3, it does not handle the new strict separation between byte >> strings and unicode strings. There is one instance in >> git_remote_helpers where we are caught by this, which is when reading >> refs from "git for-each-ref". >> >> While we could fix this by explicitly handling refs as byte strings, >> this is merely punting the problem to users of the library since the >> same problem will be encountered as soon you want to display the ref >> name to a user. >> >> Instead of doing this, explicit decode the incoming byte string into a >> unicode string. > > That really feels wrong. Displaying is a separate issue and it is > the _right_ thing to punt the problem at the lower-level machinery > level. But the display will require decoding the ref name to a Unicode string, which depends on the encoding of the underlying ref name, so it feels like it should be decoded where it's read (see [1]). >> Following the lead of pygit2 (the Python bindings for >> libgit2 - see [1] and [2]),... > > I do not think other people getting it wrong is not an excuse to > repeat the same mistake. > > Is it really so cumbersome to handle byte strings as byte strings in > Python? As [1] says, there is a potential for bugs whenever people attempt to combine Unicode and byte strings. I think it also violates the principle of least surprise if a ref name (a string) doesn't behave like a normal string. [1] http://docs.python.org/3.3/howto/unicode.html#tips-for-writing-unicode-aware-programs John -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC/PATCH 2/8 v2] git_remote_helpers: fix input when running under Python 3
John Keeping writes: > Although 2to3 will fix most issues in Python 2 code to make it run under > Python 3, it does not handle the new strict separation between byte > strings and unicode strings. There is one instance in > git_remote_helpers where we are caught by this, which is when reading > refs from "git for-each-ref". > > While we could fix this by explicitly handling refs as byte strings, > this is merely punting the problem to users of the library since the > same problem will be encountered as soon you want to display the ref > name to a user. > > Instead of doing this, explicit decode the incoming byte string into a > unicode string. That really feels wrong. Displaying is a separate issue and it is the _right_ thing to punt the problem at the lower-level machinery level. > Following the lead of pygit2 (the Python bindings for > libgit2 - see [1] and [2]),... I do not think other people getting it wrong is not an excuse to repeat the same mistake. Is it really so cumbersome to handle byte strings as byte strings in Python? -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[RFC/PATCH 2/8 v2] git_remote_helpers: fix input when running under Python 3
Although 2to3 will fix most issues in Python 2 code to make it run under Python 3, it does not handle the new strict separation between byte strings and unicode strings. There is one instance in git_remote_helpers where we are caught by this, which is when reading refs from "git for-each-ref". While we could fix this by explicitly handling refs as byte strings, this is merely punting the problem to users of the library since the same problem will be encountered as soon you want to display the ref name to a user. Instead of doing this, explicit decode the incoming byte string into a unicode string. Following the lead of pygit2 (the Python bindings for libgit2 - see [1] and [2]), use the filesystem encoding by default, providing a way for callers to override this if necessary. [1] https://github.com/libgit2/pygit2/blob/e34911b63e5d2266f9f72a4e3f32e27b13190feb/src/pygit2/reference.c#L261 [2] https://github.com/libgit2/pygit2/blob/e34911b63e5d2266f9f72a4e3f32e27b13190feb/include/pygit2/utils.h#L55 Signed-off-by: John Keeping --- I think this is in fact the best way to handle this, and I hope the above description clarified why I don't think we want to treat refs as byte strings in Python 3. My only remaining question is whether it would be better to set the error mode when decoding to "replace" instead of "strict" (the default). "strict" will cause a UnicodeError if the string cannot be decoded whereas "replace" will use U+FFFD (the replacement character). [3] I think it's better to use "strict" and let the user know that something has gone wrong rather than silently change the string, but I'd welcome other opinions. [3] http://docs.python.org/2/library/codecs.html#codec-base-classes git_remote_helpers/git/importer.py | 14 -- 1 file changed, 12 insertions(+), 2 deletions(-) diff --git a/git_remote_helpers/git/importer.py b/git_remote_helpers/git/importer.py index e28cc8f..5bc16a4 100644 --- a/git_remote_helpers/git/importer.py +++ b/git_remote_helpers/git/importer.py @@ -1,5 +1,6 @@ import os import subprocess +import sys from git_remote_helpers.util import check_call, check_output @@ -10,17 +11,26 @@ class GitImporter(object): This importer simply delegates to git fast-import. """ -def __init__(self, repo): +def __init__(self, repo, ref_encoding=None): """Creates a new importer for the specified repo. + +If ref_encoding is specified that refs are decoded using that +encoding. Otherwise the system filesystem encoding is used. """ self.repo = repo +self.ref_encoding = ref_encoding def get_refs(self, gitdir): """Returns a dictionary with refs. """ args = ["git", "--git-dir=" + gitdir, "for-each-ref", "refs/heads"] -lines = check_output(args).strip().split('\n') +encoding = self.ref_encoding +if encoding is None: +encoding = sys.getfilesystemencoding() +if encoding is None: +encoding = sys.getdefaultencoding() +lines = check_output(args).decode(encoding).strip().split('\n') refs = {} for line in lines: value, name = line.split(' ') -- 1.8.1 -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html