[issue36534] tarfile: handling Windows (path) illegal characters in archive member names

2020-10-07 Thread Eryk Sun


Eryk Sun  added the comment:

> extract the sanitizing function into a common module 
> (could be *pathlib*?) to avoid duplicates

I would prefer something common, cross-platform, and function-based such as 
os.path.isreservedname and os.path.sanitizename. In posixpath, it would just 
have to reserve and sanitize slash [/] and null [\0]. The real work would be in 
ntpath.
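
A minimal sketch of what the posixpath side could look like. The function names follow the proposal above and are hypothetical (nothing in this thread says they exist), and the "_" replacement character is an assumption:

```python
# Hypothetical helpers: on POSIX only slash and null cannot appear in a
# single file name component.
_POSIX_RESERVED_CHARS = frozenset("/\0")


def isreservedname(name):
    """Return True if the component contains a slash or a null character."""
    return any(ch in _POSIX_RESERVED_CHARS for ch in name)


def sanitizename(name, replacement="_"):
    """Replace each slash or null character with the replacement string."""
    return "".join(
        replacement if ch in _POSIX_RESERVED_CHARS else ch for ch in name
    )


print(isreservedname("a/b"))     # contains a slash
print(sanitizename("a/b\0c"))    # both reserved characters are replaced
```

The ntpath versions would share the same signatures but carry the Windows-specific rules.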

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue36534] tarfile: handling Windows (path) illegal characters in archive member names

2020-10-07 Thread Cristi Fati


Cristi Fati  added the comment:

As I see things now, there are multiple things (not necessarily related to this 
issue) to deal with:

1. Update *tarfile* and add *\_sanitize\_windows\_name* (name can change), which 
uses *pathlib.\_WindowsFlavour.reserved\_names* (or some public wrapper), and 
also handles control chars (pointed out by @eryksun), so that it covers as many 
cases as possible (I'd say all, but there's almost always one that gets away).

2. Fix *pathlib.\_WindowsFlavour.reserved\_names*

3. Apply the fix to *zipfile* as well

4. (optional) extract the sanitizing function into a common module (could be 
*pathlib*?) to avoid duplicates

--




[issue36534] tarfile: handling Windows (path) illegal characters in archive member names

2020-10-07 Thread STINNER Victor


STINNER Victor  added the comment:

> IIRC there's already an open issue for that.

Ah, I found bpo-27827 "pathlib is_reserved fails for some reserved paths on 
Windows", open since 2016 (by you ;-)).

--




[issue36534] tarfile: handling Windows (path) illegal characters in archive member names

2020-10-06 Thread Eryk Sun


Eryk Sun  added the comment:

> This issue is about tarfile. Maybe create another issue to enhance 
> the pathlib module?

IIRC there's already an open issue for that. But in case anyone were to look to 
pathlib as an example of what should be reserved, I wanted to highlight here 
how its reserved_names list is incomplete and how its is_reserved() method is 
insufficient.

--




[issue36534] tarfile: handling Windows (path) illegal characters in archive member names

2020-10-06 Thread STINNER Victor


STINNER Victor  added the comment:

> pathlib._WindowsFlavour.is_reserved() fails to reserve names (...)

This issue is about tarfile. Maybe create another issue to enhance the pathlib 
module?

--




[issue36534] tarfile: handling Windows (path) illegal characters in archive member names

2020-10-06 Thread Eryk Sun


Eryk Sun  added the comment:

> The pathlib module has _WindowsFlavour.reserved_names list of 
> Windows reserved names:

pathlib._WindowsFlavour.reserved_names is missing "CONIN$" and "CONOUT$". Prior 
to Windows 8 these two are reserved as relative names. In Windows 8+, they're 
also reserved in directories, just like the other reserved device names.

pathlib._WindowsFlavour.is_reserved() fails to reserve names containing ASCII 
control characters [0-31], vertical bar [|], the file-stream delimiter [:] 
(i.e. "filename:streamname:streamtype"), and the five wildcard characters 
[*?"<>]. (Maybe it should allow the file-stream delimiter, but that requires 
validating that a file stream is proper.) It fails to reserve names that end 
with a dot or space, which includes UNC and device paths except for \\?\ 
verbatim paths. It fails to match all reserved base names, which begin with a 
reserved device name, followed by zero or more spaces, a dot or colon, and zero 
or more characters. If names that contain colon are already reserved, then this 
check only has to be modified to strip trailing spaces before comparing against 
the list of reserved device names.
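
The rules above can be sketched for a single path component roughly as follows. This is an illustration under assumptions, not pathlib's implementation: the function name is hypothetical, the input is assumed to contain no path separators, and "." and ".." are exempted from the trailing-dot rule as an assumption.

```python
# Device names from this thread, including CONIN$ and CONOUT$.
_RESERVED_DEVICES = frozenset(
    ["CON", "CONIN$", "CONOUT$", "PRN", "AUX", "NUL"]
    + [f"COM{i}" for i in range(1, 10)]
    + [f"LPT{i}" for i in range(1, 10)]
)
# ASCII control characters 0-31, vertical bar, colon, and the five wildcards.
_RESERVED_CHARS = frozenset(map(chr, range(32))) | frozenset('|:*?"<>')


def is_reserved_component(name):
    """Hypothetical check of one path component against the rules above."""
    if name in (".", ".."):
        return False  # assumption: current/parent directory entries are fine
    if any(ch in _RESERVED_CHARS for ch in name):
        return True
    if name.endswith((".", " ")):
        return True
    # Colons are already rejected, so only "DEVICE", "DEVICE<spaces>" and
    # "DEVICE<spaces>.<anything>" remain: cut at the first dot, strip spaces.
    base = name.partition(".")[0].rstrip(" ")
    return base.upper() in _RESERVED_DEVICES


for candidate in ("con", "con . eggs", "con spam", "COM1.txt", "readme.txt"):
    print(repr(candidate), is_reserved_component(candidate))
```

Stream names such as "filename:streamname" are rejected here because every colon is treated as reserved; allowing them would need the extra validation mentioned above.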

--




[issue36534] tarfile: handling Windows (path) illegal characters in archive member names

2020-10-06 Thread STINNER Victor


STINNER Victor  added the comment:

> Also, while _sanitize_windows_name() handles trailing dots, for some reason 
> it overlooks trailing spaces. It also doesn't handle reserved DOS device 
> names.

The pathlib module has _WindowsFlavour.reserved_names list of Windows
reserved names:

>>> import pathlib, pprint
>>> pprint.pprint(sorted(pathlib._WindowsFlavour.reserved_names))
['AUX',
 'COM1',
 'COM2',
 'COM3',
 'COM4',
 'COM5',
 'COM6',
 'COM7',
 'COM8',
 'COM9',
 'CON',
 'LPT1',
 'LPT2',
 'LPT3',
 'LPT4',
 'LPT5',
 'LPT6',
 'LPT7',
 'LPT8',
 'LPT9',
 'NUL',
 'PRN']

--
nosy: +vstinner




[issue36534] tarfile: handling Windows (path) illegal characters in archive member names

2019-04-05 Thread Eryk Sun


Eryk Sun  added the comment:

_sanitize_windows_name() fails to translate the reserved control characters 
(0x01-0x1F) and backslash in names. 

What I've seen done in some cases (e.g. Unix network shares mapped to SMB) is 
to translate names using the private use area block, e.g. 0xF001 - 0xF07F. 
Windows has no problem with characters in this range in a filename. (Displaying 
these characters sensibly is another matter.) For Windows 10, this is 
especially useful since the Linux subsystem automatically translates this PUA 
block back to ASCII when accessing a Windows volume via drvfs. For example:

C:\Temp\pua>python -q
>>> import sys
>>> sys.platform
'win32'
>>> name = ''.join(map(chr, range(0xf001, 0xf080)))
>>> _ = open(name, 'w')
>>> ^Z

C:\Temp\pua>bash -c "python3 -q"
>>> import os, sys
>>> sys.platform
'linux'
>>> os.listdir()
['\x01\x02\x03\x04\x05\x06\x07\x08\t\n\x0b\x0c\r\x0e\x0f
  \x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f
   !"#$%&\'()*+,-./0123456789:;<=>?
  @ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_
  `abcdefghijklmnopqrstuvwxyz{|}~\x7f']
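
A rough sketch of this kind of translation, as it might be applied to a member name before extraction. The set of translated characters (control characters 0x01-0x1F plus \ : * ? " < > |) and the helper names are assumptions for illustration; the 0xF000 offset matches the 0xF001 - 0xF07F range mentioned above.

```python
# Assumed set of characters to translate: control characters 0x01-0x1F and
# the characters Windows rejects inside a file name component.
_ILLEGAL = frozenset(map(chr, range(0x01, 0x20))) | frozenset('\\:*?"<>|')

# Map each character to its private use area counterpart (U+F000 + code).
_TO_PUA = {ord(ch): 0xF000 + ord(ch) for ch in _ILLEGAL}
_FROM_PUA = {pua: code for code, pua in _TO_PUA.items()}


def to_pua(name):
    """Translate illegal characters in one path component to the PUA block."""
    return name.translate(_TO_PUA)


def from_pua(name):
    """Reverse the translation done by to_pua()."""
    return name.translate(_FROM_PUA)


translated = to_pua("what?.txt")
print(ascii(translated))
print(from_pua(translated))
```

The round trip is only lossless for names that did not already contain characters from this PUA range.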

Also, while _sanitize_windows_name() handles trailing dots, for some reason it 
overlooks trailing spaces. It also doesn't handle reserved DOS device names. 
The reserved names include NUL, CON, CONIN$, CONOUT$, AUX, PRN, COM[1-9], 
LPT[1-9], and these names plus zero or more spaces and possibly a dot or colon 
and any subsequent characters. For example:

>>> import os
>>> os.path._getfullpathname('con')
'.\\con'
>>> os.path._getfullpathname('con  ')
'.\\con'
>>> os.path._getfullpathname('con:')
'.\\con'
>>> os.path._getfullpathname('con :')
'.\\con'
>>> os.path._getfullpathname('con : spam')
'.\\con'
>>> os.path._getfullpathname('con . eggs')
'.\\con'

It's not a reserved device name if the first character after zero or more 
spaces is not a dot or colon. For example:

>>> os.path._getfullpathname('con spam')
'C:\\con spam'

We can create filenames with reserved device names or trailing spaces and dots 
by using a \\?\ prefixed path (i.e. a non-normalized device path). However, 
most programs don't use \\?\ paths, so it's probably better to translate these 
names.

--
nosy: +eryksun




[issue36534] tarfile: handling Windows (path) illegal characters in archive member names

2019-04-05 Thread Karthikeyan Singaravelan


Change by Karthikeyan Singaravelan :


--
components: +Windows
nosy: +lars.gustaebel, paul.moore, steve.dower, tim.golden, zach.ware




[issue36534] tarfile: handling Windows (path) illegal characters in archive member names

2019-04-05 Thread Cristi Fati


New submission from Cristi Fati :

Although tar is a Unix-based format (and mostly used there), it is gaining 
popularity on Windows too.

Since tarfile also runs on Windows, I think it should handle (work around) path 
incompatibilities, as zipfile (`ZipFile._sanitize_windows_name`) does.

Applies to all branches.

More details on [Tarfile/Zipfile extractall() changing filename of some 
files](https://stackoverflow.com/questions/55340013/tarfile-zipfile-extractall-changing-filename-of-some-files/55348443#55348443).

The current zipfile handling can also be improved, as it has a small bug: if 
the archive contains two files named "file:" and "file_", both end up with the 
same sanitized name, so extraction won't work as expected. But this is a rare 
corner case.
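
A small sketch of how such a collision could be detected before extraction. It assumes that sanitizing replaces each illegal character with an underscore, which is what the "file:" / "file_" example implies; the helper name and the character set are illustrative, not zipfile's API.

```python
import collections

# Assumed set of characters replaced by "_" during sanitizing.
_ILLEGAL = ':<>|"?*'
_TABLE = str.maketrans(_ILLEGAL, "_" * len(_ILLEGAL))


def find_collisions(names):
    """Group member names that become identical after sanitizing."""
    groups = collections.defaultdict(list)
    for name in names:
        groups[name.translate(_TABLE)].append(name)
    # Keep only sanitized names claimed by more than one member.
    return {target: members for target, members in groups.items()
            if len(members) > 1}


print(find_collisions(["file:", "file_", "other"]))
```

An extractor could use such a check to warn, or to pick a different replacement for the colliding members.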

I didn't prepare a patch because the one I submitted for another issue 
(https://bugs.python.org/issue36247, which I consider an ugly one) wasn't well 
received and was rejected (for different reasons). If this issue gets the green 
light from whoever is in charge, I'll be happy to provide one.

--
components: Library (Lib)
messages: 339486
nosy: CristiFati
priority: normal
severity: normal
status: open
title: tarfile: handling Windows (path) illegal characters in archive member 
names
type: enhancement
versions: Python 3.7
