[issue28080] Allow reading member names with bogus encodings in zipfile

2022-03-23 Thread Gregory P. Smith


Gregory P. Smith  added the comment:

Thanks Serhiy!

--
resolution:  -> fixed
stage: patch review -> commit review
status: open -> closed

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue28080] Allow reading member names with bogus encodings in zipfile

2022-03-23 Thread Stephen J. Turnbull


Stephen J. Turnbull  added the comment:

I'm not going to have time to look at the PR for a couple days.

I don't understand what the use case is for writing or appending with filenames 
in a non-UTF-8 encoding.  At least in my experience, reading such files is 
rare, but I have never been asked to write one.  The correspondents who send me 
zipfiles with the directory encoded in shift_jisx0213 are perfectly happy to 
read zipfiles with the directory encoded in UTF-8.

If that is true for other users, then unzipping the file to a temporary 
directory with the appropriate --metadata-encoding, adding the required paths 
there, and zipping a new archive seems perfectly workable.  In that case making 
this feature read-only makes the most sense to me.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue28080] Allow reading member names with bogus encodings in zipfile

2022-03-22 Thread Serhiy Storchaka


Serhiy Storchaka  added the comment:


New changeset a25a985535ccbb7df8caddc0017550ff4eae5855 by Serhiy Storchaka in 
branch 'main':
bpo-28080: Add support for the fallback encoding in ZIP files (GH-32007)
https://github.com/python/cpython/commit/a25a985535ccbb7df8caddc0017550ff4eae5855


--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue28080] Allow reading member names with bogus encodings in zipfile

2022-03-22 Thread Gregory P. Smith


Gregory P. Smith  added the comment:

Your PR looks good to me.

I agree with not making it easy to _write_ zipfiles with non-standard encoding 
used for names.

There is a possibility that someone wants that ability when writing zip files 
(not yet clear) in https://bugs.python.org/issue40172.  I don't like it, but if 
the need exists, that feature can be addressed on that issue via relevant 
options.

--
nosy: +gregory.p.smith

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue28080] Allow reading member names with bogus encodings in zipfile

2022-03-20 Thread Serhiy Storchaka


Serhiy Storchaka  added the comment:

I experimented with this a lot. There is a problem with the append mode. We can 
read in the append mode, therefore we need an encoding. But when we close a 
ZipFile after appending, non-ASCII file names will be encoded in UTF-8 in the 
central directory. Next time when we open the archive for reading with 
different encoding we will get an error because filenames in the central 
directory and in local headers are different. We need to write non-ASCII files 
back with the specified encoding to get a self-consistent data.

Finally I left this as it was initially. We can return to the problem with the 
append module later.

The differences between PR 32007 and your patches:

* The parameter was renamed to metadata_encoding to avoid confusion with 
existing parameter of ZipFile.open() encoding. In future I am going to use it 
also for comments. The attribute and the CLI option were renamed 
correspondingly.
* --metadata-encoding can also be used with the -t option.
* "surrogateescape" no longer used. If the encoding in not suitable, you will 
get an error. Use the default and decode filenames manually in such cases. We 
can change this in future.
* Updated documentation.
* Tests were significantly rewritten. Now they test the behavior with wrong 
metadata_encoding, mixed UTF-8 and legacy encodings, and reading after append.

I was going to make more changes, but left it for future.

--
versions: +Python 3.11 -Python 3.7

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue28080] Allow reading member names with bogus encodings in zipfile

2022-03-20 Thread Serhiy Storchaka


Change by Serhiy Storchaka :


--
pull_requests: +30095
pull_request: https://github.com/python/cpython/pull/32007

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue28080] Allow reading member names with bogus encodings in zipfile

2016-12-27 Thread Stephen J. Turnbull

Stephen J. Turnbull added the comment:

Thanks for followup!  I was just about to write you, now that 3.6 is out.  
Season's Greetings!

First, how do you propose to proceed with issue28115 ("use argparse for the 
ZipFile module")?  If you expect to commit that first (I'm in no hurry for this 
patch, BTW, as long as it gets into 3.7 I'm happy), this issue should depend on 
it and use argparse too.

I don't see any good reason for allowing non-UTF-8 encoding to a file open for 
writing, and a good reason (the ZipFile standard) for not allowing it.  
Certainly the CLI should not allow it, any more than it does now.  At least in 
my experiments InfoZip and the default zip utilities on Windows and Mac DTRT 
with UTF-8 zipfiles, so there is no absolute need for writing nonconforming 
zipfiles.  If you want to block on a convert-to-UTF-8 option I can do that (but 
I don't need it myself).  (Note to self: if writing to existing zipfile is 
extension of existing file, need to prevent mixed encodings.  Also warn about 
conversion.)

I thought I checked that comments were decoded.  Maybe that's only on the UTF-8 
path?  Or maybe I needed more coffee.  (Hope so, that would be a messy problem 
if ASCII/Latin1 returns bytes and UTF-8 returns str!)  I'll think about this.  
Yes, it's a backwards-compatibility issue so needs care.  Would be weird if 
names are decoded but other metadata (comments) not, though.  Surely someone 
would complain if they actually used comments?  (I'm thinking maybe a 
compatibility break might be OK?  With deprecation cycle?)  I expect to check 
all execution paths accessing metadata and have a proposed patch by 12/31.

I think I'm still short some tests, will check and write them if needed.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue28080] Allow reading member names with bogus encodings in zipfile

2016-12-26 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

A ZipFile can be read when open in not read-only mode. Thus the encoding 
argument should be accepted when mode != 'r'.

It would be weird to read file names and write new entries with different 
encodings. Thus the encoding argument should affect output encoding too.

You have named the new ZipFile attribute metadataEncoding. Indeed, I missed 
this, and other developers missed this when ported to Python 3, but the 
specification says, that the UTF-8 bit affect not just the encoding of file 
names, but the encoding of comments. Thus a file comment must be a string, and 
be decoded with the same encoding as a file name. Currently it is of type 
bytes. I don't know what is the best way to resolve this issue without breaking 
backward compatibility. Perhaps add the text_comment property.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue28080] Allow reading member names with bogus encodings in zipfile

2016-12-26 Thread INADA Naoki

Changes by INADA Naoki :


--
nosy: +inada.naoki

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue28080] Allow reading member names with bogus encodings in zipfile

2016-09-12 Thread Stephen J. Turnbull

Stephen J. Turnbull added the comment:

Cleaned up a few loose ends while it's all fresh in mind.  Will ping python-dev 
in 4-6 weeks for review for 3.7.

Thanks to Serhiy for review.  The current version of the patch is much improved 
over the initial submission due to his efforts.

--
Added file: http://bugs.python.org/file44593/encoded-member-names-v3

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue28080] Allow reading member names with bogus encodings in zipfile

2016-09-12 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

Python is programming language, I don't understand what you mean saying 
"available to nonprogrammers". As a programmer you can recode ZipInfo name 
before outputting or what you want to do with it:

filename = filename.encode('cp437').decode(encoding)

In command line you can can use iconv or other converters:

python3 -m zipfile -l myzip.zip | iconv -t cp437 | iconv -f "$encoding"

This is not very handy, but works in most cases.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue28080] Allow reading member names with bogus encodings in zipfile

2016-09-12 Thread Stephen J. Turnbull

Stephen J. Turnbull added the comment:

If you have a workaround that's available to nonprogrammers, I'd like to hear 
about it.  I have found none, that's why I went to the trouble to put together 
a patch even though I knew that the odds of actually getting it in to Python 
3.6 was very low -- my patch (or Sergey Dorofeev's, but that needs work to be 
applicable to trunk) does everything I've ever needed, so I suppose it would do 
for all the use cases so far posted (except umedoblock's encoding-guessing 
approach, but that can be handled by many 3rd-party encoding-guessing codecs, I 
think, and IMO should be).

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue28080] Allow reading member names with bogus encodings in zipfile

2016-09-12 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

Sorry, I can't push the patch in a haste. In needs more design discussion, 
comparing with other implementations (I found only that ZipFile in Java can 
take the charset argument, but "charset" is common name for text encoding in 
Java), wider discussion. The problem exists for years, and there are 
workarounds.

--
versions:  -Python 3.6

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue28080] Allow reading member names with bogus encodings in zipfile

2016-09-11 Thread Stephen J. Turnbull

Stephen J. Turnbull added the comment:

Can't reply on Rietveld?  Lost 2 hours work!

Patch updated (encoded-member-names-v2), most changes accepted.  Not happy 
about name change or default to cp437, I want this API to be hard to use and 
not be part of the normal process (utf-8 or cp437).  Considering errors= 
argument, but that must default to 'strict' -- the problem this patch solves is 
zip utilities extracting to files with unreadable names, surrogateescape is 
more of the same.  Two incomplete tests (assertRaisesRegex and capture main() 
stderr) still in progress, must do dayjob now.

--
Added file: http://bugs.python.org/file44571/encoded-member-names-v2

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue28080] Allow reading member names with bogus encodings in zipfile

2016-09-11 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

Added comments on Rietveld. Maybe I'll commit rewritten patch tomorrow before 
12:00 UTC (oh, already today!).

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue28080] Allow reading member names with bogus encodings in zipfile

2016-09-11 Thread Stephen J. Turnbull

Stephen J. Turnbull added the comment:

Re: wait for 3.7 if reviewers are busy, understood.  N.B. Contributor agreement 
is now on file (I received the PDF from python.org already).

Re: existing patches:
My patch is very similar in the basic approach to Sergey Dorofeev's patch in 
issue10614.  Main differences:
(1) Sergey's patch treats the "encoding" parameter as a first class citizen 
with a default to cp437, whereas mine treats it as a special case defaulting to 
None, with utf-8 and cp437 getting special treatment as the standard encodings. 
 Subtle point, but I like it this way.
(2) My patch includes support for the argument in the __main__ script.
(3) Sergey's patch misses one execution path in the current code so needs 
update before application.

The Japanese patches by umedoblock are very Japanese-centric, and worse, they 
try to guess the encoding by the crude method of seeing what decodes 
successfully.  They are not acceptable IMO.

Aaargh.  Just noticed the Japanese in test_zipfile.py.  Will change it to use 
\u escapes soon.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue28080] Allow reading member names with bogus encodings in zipfile

2016-09-11 Thread Ned Deily

Ned Deily added the comment:

Stephen, thanks for submitting the patch.  Unless another core developer has 
time to review and commit this prior to the feature code off tomorrow, this 
will probably need to wait for 3.7.  Also, at the moment, your tracker user 
record does not indicate that there is a contributor agreement on file for you. 
 You could followup on that with Ewa at the PSF or just submit a new form.

--
nosy: +ned.deily
versions: +Python 3.7

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue28080] Allow reading member names with bogus encodings in zipfile

2016-09-11 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

This issue is a duplicate of issue10614. But proposed patch looks more ready.

--
assignee:  -> serhiy.storchaka
stage:  -> patch review

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue28080] Allow reading member names with bogus encodings in zipfile

2016-09-11 Thread Stephen J. Turnbull

Stephen J. Turnbull added the comment:

I should have a contributor agreement form on file.

Ned Deily suggested that I try to get this patch in before the 12 noon deadline 
Sept. 12, so here it is.

I believe the patch is "safe" in the sense that its functionality needs to be 
explicitly enabled, and it should be very difficult to persuade it to 
inadvertantly write to any file.  No existing execution paths should be changed 
at all.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue28080] Allow reading member names with bogus encodings in zipfile

2016-09-11 Thread Stephen J. Turnbull

Stephen J. Turnbull added the comment:

Suggested NEWS/whatsnew entry:

Add a new *memberNameEncoding* argument to the ZipFile constructor, allowing
:mod:`zipfile` to read filenames in non-conforming encodings from the
zipfile as Unicode.  This implementation assumes all member names have the same 
encoding.

Motivation:

There are applications in Japan that create zipfiles with directories 
containing filenames encoded in Shift JIS.  There may be such software in other 
countries as well.  As this is a violation of the Zip format definition, this 
library implements only an option to read such files.

Done:

(1) Add a memberNameEncoding argument to the main() function, which may be set 
from the command line with "--membernameencoding={codec}".  This command line 
option may be used with -e or -l, but not -c or -t.  There is no point to it in 
the latter, since the member names are not printed.
(2) Add a memberNameEncoding argument to the ZipFile constructor.  This is the 
only way to set it, so this is global to the ZipFile.
(3) Add this attribute to repr.
(4) Add a check that the mode is `read` in main() and in the ZipFile 
constructor, and if not invoke USAGE and exit or raise RuntimeError.
(5) When retrieving member names in constructing ZipInfo instances, check if 
memberNameEncoding is set, and if so use it, unless the UTF-8 bit is set. In 
that case, obey the UTF-8 bit, as the specified encoding is surely user error.
(6) Add a CODEC_USAGE message.
(7) Update the docs (docstrings, library reference, NEWS).
(8) Add tests:
(a) List a zipfile's SJIS-encoded directory.
(b) List a UTF-8-encoded directory and an ISO-8859-1-encoded directory as 
Shift-JIS.
(c) Check that USAGE is invoked on attempts to write a zipfile in main().
(d) Check that an appropriate error is raised on attempts to write in other 
functions.
Many other tests are run as well.
ALL TESTS PASS.
(9) Docs build without error.

To do (?):

(10) NEWS/whatsnew
(11) Check relevant code paths are all covered by tests.
(12) Review docs for clarity and organization.

Not done:

I don't think these are appropriate/needed at this time, but listed in case 
somebody thinks otherwise.

(13) Add a subtype of RuntimeError (see 7d)?
(14) Issue warning if both membernameencoding and utf-8 bit are set (see 4)?
(15) Support InfoZip encoding extension mentioned in APPNOTE.TXT - .ZIP File 
Format Specification, v6.3.4.
(16) Support per-member encodings (I think the zipfile standard permits, but 
not sure).

--
keywords: +needs review
status: pending -> open
Added file: http://bugs.python.org/file44564/encoded-member-names

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue28080] Allow reading member names with bogus encodings in zipfile

2016-09-11 Thread Serhiy Storchaka

New submission from Serhiy Storchaka:

Could you please provide more information sjt?

--
nosy: +serhiy.storchaka
status: open -> pending

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue28080] Allow reading member names with bogus encodings in zipfile

2016-09-11 Thread Stephen J. Turnbull

Changes by Stephen J. Turnbull :


--
components: Library (Lib)
keywords: patch
nosy: sjt
priority: normal
severity: normal
status: open
title: Allow reading member names with bogus encodings in zipfile
type: enhancement
versions: Python 3.6

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com