[issue7693] tarfile.extractall can't have unicode extraction path

2010-01-17 Thread Peter Bienstman

Peter Bienstman  added the comment:

> Lars Gustäbel  added the comment:
> 
> So, use the pax format. It stores the filenames as utf-8 and this way you
>  will be on the safe side.
> 
> I hope we both agree that the solution to your particular problem is
>  nothing tarfile.py can provide.

If I want to extract a pax archive to a unicode path with non-Latin 
characters, how should I encode the path before passing it to 'extractall'? 
Would utf-8 be OK?

Peter

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue7693] tarfile.extractall can't have unicode extraction path

2010-01-16 Thread Ezio Melotti

Ezio Melotti  added the comment:

Lars, I think the situation can still be improved. If tarfile works with byte 
strings, it should accept only byte strings or unicode strings that can be 
encoded in ASCII, and encode them as soon as it gets them.
In the problem reported by Peter, he was passing u".", which is an ASCII-only 
unicode string. Later in the program this string gets mixed with a byte 
string, which causes an implicit decoding, i.e. it turns the byte string into 
unicode (and fails if the filename is non-ASCII). Even if the decoding 
succeeds, eventually tarfile will have to convert the unicode string to a byte 
string again.

A better approach would be to encode all unicode strings that are passed in, 
using the ASCII codec.
If the unicode strings are ASCII-only (like the u"." Peter was passing), they 
can be encoded without problems. When they get mixed with other strings, they 
are all byte strings, so no implicit decoding happens.
If the unicode strings are non-ASCII, the encoding will fail immediately and 
tell users that they have to encode the unicode string themselves before 
passing it to the function.
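
A minimal sketch of this idea, written for Python 3 so that it runs on a 
current interpreter. The helper name to_ascii_bytes is hypothetical and is not 
part of tarfile; it only illustrates "encode early, fail early".

```python
def to_ascii_bytes(path):
    """Return path as a byte string (hypothetical helper, not in tarfile).

    Byte strings pass through unchanged. Text is encoded with the ASCII
    codec, so non-ASCII text is rejected at the call site instead of
    failing later when it is mixed with byte strings.
    """
    if isinstance(path, bytes):
        return path
    return path.encode("ascii")  # raises UnicodeEncodeError for non-ASCII


print(to_ascii_bytes("."))  # ASCII-only text is accepted

try:
    to_ascii_bytes("\ua000a.ogg")
except UnicodeEncodeError as exc:
    print("rejected early:", exc.reason)
```

With such a helper, every path inside the module would be a byte string, so 
no implicit decoding could happen further down.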




[issue7693] tarfile.extractall can't have unicode extraction path

2010-01-16 Thread Lars Gustäbel

Lars Gustäbel  added the comment:

So, use the pax format. It stores the filenames as utf-8 and this way you will 
be on the safe side.

I hope we both agree that the solution to your particular problem is nothing 
tarfile.py can provide. So, I am going to close this issue now.
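
A minimal sketch of the pax behaviour mentioned above, assuming Python 3 and 
an in-memory archive so that no files are touched. The file name is the one 
from the original report; the variable names are only illustrative.

```python
import io
import tarfile

name = "\ua000a.ogg"  # same file name as in the original report
payload = b"A"

# Write a pax archive into memory.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w", format=tarfile.PAX_FORMAT) as tar:
    info = tarfile.TarInfo(name=name)
    info.size = len(payload)
    tar.addfile(info, io.BytesIO(payload))

# The UTF-8 bytes of the name are present in the raw archive data.
raw = buf.getvalue()
has_utf8_name = name.encode("utf-8") in raw

# Reading the archive back yields the original text name.
buf.seek(0)
with tarfile.open(fileobj=buf, mode="r") as tar:
    names = tar.getnames()

print(has_utf8_name, names == [name])
```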

--
resolution:  -> works for me
status: open -> closed




[issue7693] tarfile.extractall can't have unicode extraction path

2010-01-15 Thread Peter Bienstman

Peter Bienstman  added the comment:

On Friday 15 January 2010 02:14:30 pm Lars Gustäbel wrote:
> Lars Gustäbel  added the comment:
> 
> I suppose you do not have a real problem here. I thought your problem was
>  that you want to use unicode pathnames as input and output to tarfile. You
>  don't need that.
> 
> You want to transfer an archive from one system to another. You can do that
>  with tarfile already. Python 3.x's tarfile does the same as Python 2.x's
>  tarfile, except that in 3.x *all* strings are unicode strings.
> 
> If you have different encodings on these systems, that should not be a
>  problem unless these encodings are not compatible with each other. If you
>  want to use a tar archive created on a utf-8 system on a iso-8859-1 system
>  that is no problem, as long as you use the pax format and all the utf-8
>  characters used are also valid iso-8859-1 characters.

I think I *do* have a problem. I want to create a tar archive on one system, 
where the filenames could contain non-Latin characters. I'm sending this tar 
file over a socket to a different system (with potentially a different 
encoding), where I want to extract it to a directory whose name could contain 
non-Latin characters.




[issue7693] tarfile.extractall can't have unicode extraction path

2010-01-15 Thread Lars Gustäbel

Lars Gustäbel  added the comment:

I suppose you do not have a real problem here. I thought your problem was that 
you want to use unicode pathnames as input and output to tarfile. You don't 
need that.

You want to transfer an archive from one system to another. You can do that 
with tarfile already. Python 3.x's tarfile does the same as Python 2.x's 
tarfile, except that in 3.x *all* strings are unicode strings.

If you have different encodings on these systems, that should not be a problem 
unless these encodings are not compatible with each other. If you want to use a 
tar archive created on a utf-8 system on an iso-8859-1 system, that is no 
problem, as long as you use the pax format and all the utf-8 characters used 
are also valid iso-8859-1 characters.
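
As a rough illustration of "compatible" here, a file name survives the trip 
only if each of its characters can be encoded in the target system's 
encoding. A small Python 3 sketch; the function name representable is made up 
for this example.

```python
def representable(name, encoding):
    """Return True if every character of name can be encoded in encoding."""
    try:
        name.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


# U+00E9 exists in iso-8859-1, so this name can be used on both systems.
print(representable("caf\u00e9.ogg", "iso-8859-1"))

# U+A000 (the character from the original report) does not.
print(representable("\ua000a.ogg", "iso-8859-1"))
print(representable("\ua000a.ogg", "utf-8"))
```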




[issue7693] tarfile.extractall can't have unicode extraction path

2010-01-15 Thread Peter Bienstman

Peter Bienstman  added the comment:

On Friday 15 January 2010 11:51:24 am Lars Gustäbel wrote:
> Lars Gustäbel  added the comment:
> 
> First, use a string pathname for extractall(). Most likely, your script is
>  going to work. Convert all pathnames to strings using
>  sys.getfilesystemencoding() before you add() them. Ensure that all systems
>  you are going to use the archives on have the same filesystem encoding,
>  e.g. utf-8. 

Unfortunately, that is beyond my control. Am I then totally out of luck? Would 
the implementation of tarfile in 3.0 be usable on 2.6 (perhaps with small 
modifications)?

>  Pax archives are probably the best choice if you plan to keep
>  the archives for several years. If you simply want to transfer data from
>  one system to the other throwing the archives away afterwards, the format
>  is rather irrelevant.

The archives are throw-away, transfer only, but they could be used on any 
system.




[issue7693] tarfile.extractall can't have unicode extraction path

2010-01-15 Thread Lars Gustäbel

Lars Gustäbel  added the comment:

First, use a string pathname for extractall(). Most likely, your script is 
going to work. Convert all pathnames to strings using 
sys.getfilesystemencoding() before you add() them. Ensure that all systems you 
are going to use the archives on have the same filesystem encoding, e.g. utf-8. 
Pax archives are probably the best choice if you plan to keep the archives for 
several years. If you simply want to transfer data from one system to the other 
throwing the archives away afterwards, the format is rather irrelevant.
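
A sketch of the conversion step, written so that it runs on Python 3, where 
encoding a text path gives a byte string (in 2.x the same call turns a unicode 
object into a str). The helper name encode_for_filesystem and the example path 
are made up for illustration.

```python
import sys

# Encoding that Python uses to convert text file names for the OS.
fs_encoding = sys.getfilesystemencoding()


def encode_for_filesystem(path):
    """Hypothetical helper: encode a text path with the filesystem encoding."""
    return path.encode(fs_encoding)


encoded = encode_for_filesystem("music/a.ogg")
print(fs_encoding, encoded)
```

On Python 3, os.fsencode() performs a similar conversion and additionally 
applies the interpreter's filesystem error handler.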




[issue7693] tarfile.extractall can't have unicode extraction path

2010-01-15 Thread Peter Bienstman

Peter Bienstman  added the comment:

So what do you suggest then as the best approach if I want to use unicode paths 
in tar files in Python 2.x in a way that is portable across different systems?




[issue7693] tarfile.extractall can't have unicode extraction path

2010-01-15 Thread Lars Gustäbel

Lars Gustäbel  added the comment:

In the 2.x branch tarfile is not prepared to deal with unicode pathnames at 
all. This changed in Python 3. The fact that it works anyway (in the majority 
of cases) to add filenames as unicode objects is pure coincidence - I suppose 
you have a utf-8 system encoding. On a latin-1 system your script would fail 
much earlier during the add() call.

Some reading: http://docs.python.org/library/tarfile.html#unicode-issues




[issue7693] tarfile.extractall can't have unicode extraction path

2010-01-13 Thread Ezio Melotti

Ezio Melotti  added the comment:

When test.tar is opened, the filename is read as a byte string. In extract(), 
self._extract_member(tarinfo, os.path.join(path, tarinfo.name)) therefore calls 
os.path.join() with path == u'.' and tarinfo.name == '\xea\x80\x80a.ogg'.
Because path is unicode, os.path.join() implicitly decodes the byte string 
tarinfo.name with the ascii codec, and since it contains non-ASCII bytes, the 
UnicodeDecodeError is raised.
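
The implicit step can be made visible with an explicit decode. This is a 
Python 3 sketch of what 2.x does behind the scenes when it evaluates 
path += '/' + b in posixpath.join; Python 3 itself refuses to mix text and 
byte strings, so the decode has to be spelled out here.

```python
path = "."                    # text, like u"." in the report
name = b"\xea\x80\x80a.ogg"   # byte string as read from the archive

error = None
try:
    # '/' + name is still a byte string; appending it to a text path
    # forces a decode, which Python 2 performed with the ascii codec.
    joined = path + (b"/" + name).decode("ascii")
except UnicodeDecodeError as exc:
    error = exc

print(error)
```

The offending byte 0xea sits at position 1 because the '/' separator comes 
first in the byte string being decoded.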




[issue7693] tarfile.extractall can't have unicode extraction path

2010-01-13 Thread Lars Gustäbel

Changes by Lars Gustäbel :


--
assignee:  -> lars.gustaebel
nosy: +lars.gustaebel




[issue7693] tarfile.extractall can't have unicode extraction path

2010-01-13 Thread Ezio Melotti

Changes by Ezio Melotti :


--
components: +Library (Lib) -Extension Modules
nosy: +ezio.melotti
priority:  -> normal
stage:  -> test needed
type: crash -> behavior
versions: +Python 2.7




[issue7693] tarfile.extractall can't have unicode extraction path

2010-01-13 Thread Peter Bienstman

New submission from Peter Bienstman :

import tarfile

fname = unichr(40960) + u"a.ogg"

f = file(fname, "w")
f.write("A")
f.close()

tar_pipe = tarfile.open("test.tar", mode="w|",
                        format=tarfile.PAX_FORMAT)
tar_pipe.add(fname)
tar_pipe.close()

tar_pipe = tarfile.open("test.tar")
tar_pipe.extractall(u".") # Just "." as string works fine.

This gives:

Traceback (most recent call last):
  File "a.py", line 15, in <module>
    tar_pipe.extractall(u".") # Just "." as string works fine.
  File "/usr/lib/python2.6/tarfile.py", line 2031, in extractall
    self.extract(tarinfo, path)
  File "/usr/lib/python2.6/tarfile.py", line 2068, in extract
    self._extract_member(tarinfo, os.path.join(path, tarinfo.name))
  File "/usr/lib/python2.6/posixpath.py", line 70, in join
    path += '/' + b
UnicodeDecodeError: 'ascii' codec can't decode byte 0xea in position 1: ordinal not in range(128)

--
components: Extension Modules
messages: 97717
nosy: pbienst
severity: normal
status: open
title: tarfile.extractall can't have unicode extraction path
type: crash
versions: Python 2.6
