[issue45981] Get raw file name in bytes from ZipFile

2021-12-15 Thread Devourer Station


Devourer Station  added the comment:

I do think providing a rawfile field in the ZipInfo struct helps.
As a library, ZipFile should let users know what they are dealing with.
Users get data from zip files, and ZipFile shouldn't corrupt it.
I don't mean that we should provide everything in raw bytes.
What I mean is that DATA may be CONVERTED, but must not be CORRUPTED.

--

___
Python tracker 
<https://bugs.python.org/issue45981>
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue45981] Get raw file name in bytes from ZipFile

2021-12-04 Thread Devourer Station


Devourer Station  added the comment:

Null bytes appear in abnormal zip files. (I haven't seen any multibyte encoding 
that represents a character with null bytes)

But non-UTF-8 encodings are common in normal zip files, as Windows uses 
different encodings for different language settings. (Linux, on the other hand, 
encourages everyone to use UTF-8 regardless of their language settings.)

It's a pity that very little software nowadays supports specifying an encoding 
when extracting archives.
(We have the unzip-iconv patch on Linux, even though the patch was never 
accepted by unzip.)

Changing the language and rebooting my OS makes no sense, and I don't know why.

--




[issue45981] Get raw file name in bytes from ZipFile

2021-12-04 Thread Devourer Station

Devourer Station  added the comment:

In file Lib/zipfile.py:
    flags = centdir[5]
    if flags & 0x800:
        # UTF-8 file names extension
        filename = filename.decode('utf-8')
    else:
        # Historical ZIP filename encoding
        filename = filename.decode('cp437')

ZipFile simply decodes every file name that is not flagged as UTF-8 using CP437.

In file Lib/zipfile.py:
    # This is used to ensure paths in generated ZIP files always use
    # forward slashes as the directory separator, as required by the
    # ZIP format specification.
    if os.sep != "/" and os.sep in filename:
        filename = filename.replace(os.sep, "/")

And it replaces every '\\' with '/' on Windows.

Suppose we have a file named '\x97\x5c\x92\x9b', which is '予兆' in Japanese 
encoded in SHIFT_JIS.
The problem is the second byte:

  '\x5c' is '\\' (backslash) in ASCII

So ZipFile decodes the bytes as CP437 and then replaces every '\\' with '/'.
One of the two bytes of the Japanese character '予' is overwritten, so it is 
no longer the same character.
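
A minimal sketch of the two steps described above, applied to the example 
bytes. It does not call zipfile; the backslash replacement is simulated with a 
hard-coded '\\' so that it behaves the same on any OS (an assumption standing in 
for os.sep on Windows).

```python
# Example bytes from this report: '予兆' encoded in Shift_JIS.
raw = b"\x97\x5c\x92\x9b"

# Step 1: the UTF-8 flag is not set, so the name is decoded as CP437.
decoded = raw.decode("cp437")

# Step 2: simulate the separator normalisation as it would run on Windows,
# where os.sep is '\\'.
normalised = decoded.replace("\\", "/")

# Going back to bytes shows that 0x5c has been turned into 0x2f.
roundtrip = normalised.encode("cp437")
print(roundtrip)

# The result can no longer be decoded to the original name.
print(roundtrip.decode("shift_jis", errors="replace"))
```

The CP437 round trip itself is lossless; only the replacement step changes a 
byte.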

Some say we can replace '/' with '\\' again and encode the result as CP437 to 
get the raw bytes back.
But what if both '/'('\x2f') and '\\'('\x5c') appear in the raw filename?
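
To illustrate why the replacement cannot be undone, here is a small sketch. The 
helper normalise() is hypothetical and only mimics the decode-then-replace 
behaviour, again assuming the Windows separator.

```python
def normalise(raw: bytes) -> str:
    """Mimic CP437 decoding followed by the '\\' -> '/' replacement."""
    return raw.decode("cp437").replace("\\", "/")


# Two different raw names, each containing both 0x2f and 0x5c.
first = b"a/b\\c"
second = b"a\\b/c"

# Both collapse to the same string, so the mapping is not reversible.
print(normalise(first) == normalise(second))

# Replacing every '/' with '\\' again recovers neither original.
undone = normalise(first).replace("/", "\\").encode("cp437")
print(undone in (first, second))
```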

Simply replacing '\\' in a byte stream without knowing the encoding is by no 
means a good way.
Maybe we can provide a rawname field in the ZipInfo struct?
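
Until something like that exists, the raw names can be read straight from the 
central directory. The following is only a sketch: raw_names() is a 
hypothetical helper, not part of zipfile, and it assumes a single-disk archive 
without ZIP64, with no data prepended and no archive comment containing the 
end-of-central-directory signature.

```python
import struct

EOCD_SIG = b"PK\x05\x06"           # end of central directory record
CDH_SIG = b"PK\x01\x02"            # central directory file header
EOCD_FMT = "<4sHHHHIIH"            # 22-byte fixed part
CDH_FMT = "<4sHHHHHHIIIHHHHHII"    # 46-byte fixed part


def raw_names(data: bytes) -> list[bytes]:
    """Return the undecoded file names of a ZIP archive held in memory."""
    eocd_pos = data.rfind(EOCD_SIG)
    if eocd_pos < 0:
        raise ValueError("end of central directory record not found")
    eocd = struct.unpack_from(EOCD_FMT, data, eocd_pos)
    total_entries, cd_offset = eocd[4], eocd[6]

    names = []
    pos = cd_offset
    for _ in range(total_entries):
        fields = struct.unpack_from(CDH_FMT, data, pos)
        if fields[0] != CDH_SIG:
            raise ValueError("bad central directory header")
        name_len, extra_len, comment_len = fields[10], fields[11], fields[12]
        start = pos + struct.calcsize(CDH_FMT)
        # The name is stored as raw bytes right after the fixed header.
        names.append(data[start:start + name_len])
        pos = start + name_len + extra_len + comment_len
    return names
```

With the bytes in hand, the caller can choose the encoding, for example 
raw_names(data)[0].decode('shift_jis').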

--




[issue45981] Get raw file name in bytes from ZipFile

2021-12-04 Thread Devourer Station


New submission from Devourer Station :

It's quite annoying that ZipFile corrupts the filename by simply replacing '\\' 
with '/', and does not give us the raw file name in bytes.

--
components: Library (Lib)
messages: 407665
nosy: accelerator0099
priority: normal
severity: normal
status: open
title: Get raw file name in bytes from ZipFile
type: enhancement
versions: Python 3.10
