Re: [PATCH v3 6/8] git-remote-testpy: hash bytes explicitly

John Keeping Sun, 27 Jan 2013 06:19:38 -0800

On Sun, Jan 27, 2013 at 05:44:37AM +0100, Michael Haggerty wrote:
> On 01/26/2013 10:44 PM, Junio C Hamano wrote:
> > John Keeping <j...@keeping.me.uk> writes:
> >> @@ -45,7 +45,7 @@ def get_repo(alias, url):
> >>      repo.get_head()
> >>  
> >>      hasher = _digest()
> >> -    hasher.update(repo.path)
> >> +    hasher.update(repo.path.encode('utf-8'))
> >>      repo.hash = hasher.hexdigest()
> >>  
> >>      repo.get_base_path = lambda base: os.path.join(
> 
> This will still fail under Python 2.x if repo.path is a byte string that
> contains non-ASCII characters.


I had forgotten about Python 2 while doing this.

>                                 And it will fail under Python 3.1 and
> later if repo.path contains characters using the surrogateescape
> encoding option [1], as it will if the original command-line argument
> contained bytes that cannot be decoded into Unicode using the user's
> default encoding:

Interesting.  I wasn't aware of the "surrogateescape" error handler.
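
For reference, a minimal sketch of what that handler does under Python
3.1 or later.  The sample bytes are made up for illustration (they match
the Latin-1 "français" example quoted further down) and are not taken
from the patch:

```python
# Python 3.1+ only; 'surrogateescape' does not exist in 2.x or 3.0.
raw = b'fran\xe7ais'          # Latin-1 bytes, not valid UTF-8

# Undecodable bytes become lone surrogates (0xe7 -> U+DCE7).
text = raw.decode('utf-8', 'surrogateescape')

# A strict encode cannot represent the lone surrogate...
try:
    text.encode('utf-8')
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# ...but encoding with the same handler restores the original bytes.
restored = text.encode('utf-8', 'surrogateescape')
```

So a plain .encode('utf-8') on such a path raises, while encoding with
'surrogateescape' gives back exactly the bytes that were passed in.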

> 'surrogateescape' is not supported in Python 3.0, but I think it would
> be quite acceptable only to support Python 3.x for x >= 1.

I agree.

> But 'surrogateescape' doesn't seem to be supported at all in Python 2.x
> (I tested 2.7.3 and it's not there).
> 
> Here you don't really need byte-for-byte correctness; it would be enough
> to get *some* byte string that is unique for a given input (ideally,
> consistent with ASCII or UTF-8 for backwards compatibility).  So you
> could use
> 
>     b = s.encode('utf-8', 'backslashreplace')
> 
> Unfortunately, this doesn't work under Python 2.x:
> 
>     $ python2 -c "
>     import sys
>     print(repr(sys.argv[1]))
>     print(repr(sys.argv[1].encode('utf-8', 'backslashreplace')))
>     " $(echo français|iconv -t latin1)
>     'fran\xe7ais'
>     Traceback (most recent call last):
>       File "<string>", line 4, in <module>
>     UnicodeDecodeError: 'ascii' codec can't decode byte 0xe7 in position 4: ordinal not in range(128)
> 
> Apparently when you call bytestring.encode(), Python first tries to
> decode the string to Unicode using the 'ascii' encoding.

Actually it appears to use sys.getdefaultencoding() to do this initial
decode.  Not that it makes much difference here since the failure is the
same.
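
To make the hidden step visible, here is a rough sketch that spells out
the implicit decode explicitly.  It is written for Python 3, where bytes
has no .encode() at all, and it hard-codes 'ascii' on the assumption
that this is what sys.getdefaultencoding() returns on a stock Python 2:

```python
raw = b'fran\xe7ais'   # same Latin-1 sample as in the quoted example

# Python 2's bytestring.encode(...) effectively does
# bytestring.decode(<default encoding>).encode(...); the decode is
# the part that fails, before the error handler is ever consulted.
try:
    raw.decode('ascii').encode('utf-8', 'backslashreplace')
    error = None
except UnicodeDecodeError as exc:
    error = exc
```

The 'backslashreplace' handler only applies to the encode step, so it
never gets a chance to help.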

> So to handle all of the cases across Python versions as closely as
> possible to the old 2.x code, it might be necessary to make the code
> explicitly depend on the Python version number, like:
> 
>     hasher = _digest()
>     if sys.hexversion < 0x03000000:
>         pathbytes = repo.path
>     elif sys.hexversion < 0x03010000:
>         # If support for Python 3.0.x is desired (note: result can
>         # be different in this case than under 2.x or 3.1+):
>         pathbytes = repo.path.encode(sys.getfilesystemencoding(),
>                                      'backslashreplace')
>     else:
>         pathbytes = repo.path.encode(sys.getfilesystemencoding(),
>                                      'surrogateescape')
>     hasher.update(pathbytes)
>     repo.hash = hasher.hexdigest()

If we don't want a version check, it would probably look like this
(ignoring Python 3.0, since I don't think we need to support it):

    hasher = _digest()
    try:
        codecs.lookup_error('surrogateescape')
    except LookupError:
        pathbytes = repo.path
    else:
        pathbytes = repo.path.encode(sys.getfilesystemencoding(),
                                     'surrogateescape')
    hasher.update(pathbytes)
    repo.hash = hasher.hexdigest()

The variant with an explicit version check seems better to me, although
either way this should probably be a utility function.
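
As a rough sketch of such a utility function, using the version check
and ignoring Python 3.0: the name path_to_bytes is hypothetical, and
hashlib.sha1 stands in for _digest() here only as an assumption for the
sake of a runnable example.

```python
import hashlib
import sys


def path_to_bytes(path):
    """Return a byte string suitable for hashing (hypothetical helper)."""
    if sys.hexversion < 0x03000000:
        # Python 2: the path is already a byte string.
        return path
    # Python 3.1+: turn any surrogate escapes back into the original bytes.
    return path.encode(sys.getfilesystemencoding(), 'surrogateescape')


hasher = hashlib.sha1()   # stand-in for _digest(); an assumption
hasher.update(path_to_bytes('/tmp/repo'))
digest = hasher.hexdigest()
```

get_repo() would then only need hasher.update(path_to_bytes(repo.path)).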


John