[issue42369] Reading ZipFile not thread-safe

2022-03-20 Thread Serhiy Storchaka


Change by Serhiy Storchaka :


--
resolution:  -> fixed
stage: patch review -> resolved
status: open -> closed
versions: +Python 3.10, Python 3.11, Python 3.9 -Python 3.7, Python 3.8

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue42369] Reading ZipFile not thread-safe

2022-03-20 Thread miss-islington


miss-islington  added the comment:


New changeset 4aa8b802513340d12a6ffea3d5e2228ac6c7d5b8 by Miss Islington (bot) 
in branch '3.9':
bpo-42369: Fix thread safety of zipfile._SharedFile.tell (GH-26974)
https://github.com/python/cpython/commit/4aa8b802513340d12a6ffea3d5e2228ac6c7d5b8


--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue42369] Reading ZipFile not thread-safe

2022-03-20 Thread miss-islington


miss-islington  added the comment:


New changeset 4352ca234e979ad1c7158981addf899b119cd448 by Miss Islington (bot) 
in branch '3.10':
bpo-42369: Fix thread safety of zipfile._SharedFile.tell (GH-26974)
https://github.com/python/cpython/commit/4352ca234e979ad1c7158981addf899b119cd448


--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue42369] Reading ZipFile not thread-safe

2022-03-20 Thread miss-islington


Change by miss-islington :


--
pull_requests: +30097
pull_request: https://github.com/python/cpython/pull/32009

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue42369] Reading ZipFile not thread-safe

2022-03-20 Thread miss-islington


Change by miss-islington :


--
nosy: +miss-islington
nosy_count: 9.0 -> 10.0
pull_requests: +30096
pull_request: https://github.com/python/cpython/pull/32008

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue42369] Reading ZipFile not thread-safe

2022-03-20 Thread Serhiy Storchaka


Serhiy Storchaka  added the comment:


New changeset e730ae7effe4f13b24f1b5fb1fca005709c86acb by Kevin Mehall in 
branch 'main':
bpo-42369: Fix thread safety of zipfile._SharedFile.tell (GH-26974)
https://github.com/python/cpython/commit/e730ae7effe4f13b24f1b5fb1fca005709c86acb


--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue42369] Reading ZipFile not thread-safe

2022-02-01 Thread Matthew Davis


Matthew Davis  added the comment:

In addition to fixing any unexpected behavior, can we update the documentation 
[1] to state what the expected behavior is in terms of thread safety?

[1] https://docs.python.org/3/library/zipfile.html

--
nosy: +mdavis-xyz

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue42369] Reading ZipFile not thread-safe

2022-01-03 Thread Thomas


Thomas  added the comment:

@khaledk I finally got some time off, so here you go 
https://github.com/1/ParallelZipFile

I can not offer any support for a more correct implementation of the zip 
specification due to time constraints, but maybe the code is useful for you 
anyway.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue42369] Reading ZipFile not thread-safe

2021-11-07 Thread Spencer Brown


Change by Spencer Brown :


--
nosy: +Spencer Brown

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue42369] Reading ZipFile not thread-safe

2021-08-02 Thread Khaled K


Khaled K  added the comment:

Hi Thomas, I'm facing the same issue. Would you care to opensource your 
implementation.

--
nosy: +khaledkoutini

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue42369] Reading ZipFile not thread-safe

2021-07-01 Thread Thomas


Thomas  added the comment:

The monkey patch works for me! Thank you very much! (I have only tested 
reading, not writing).

However, the lock contention of Python's ZipFile is so bad that using multiple 
threads actually makes the code run _slower_ than single threaded code when 
reading a zip file with many small files. For this reason, I am not using 
ZipFile any longer. Instead, I have implemented a subset of the zip spec 
without locks, which gives me a speedup of over 2500 % for reading many small 
files compared to ZipFile.

I think that the architecture of ZipFile should be reconsidered, but this 
exceeds the scope of this issue.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue42369] Reading ZipFile not thread-safe

2021-06-30 Thread Kevin Mehall


Kevin Mehall  added the comment:

I think I found the root cause of this problem and proposed a fix in 
https://github.com/python/cpython/pull/26974

To monkey-patch this fix on existing versions of Python, I'm using:

class PatchedSharedFile(zipfile._SharedFile):
def __init__(self, *args):
super().__init__(*args)
self.tell = lambda: self._pos
zipfile._SharedFile = PatchedSharedFile

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue42369] Reading ZipFile not thread-safe

2021-06-30 Thread Kevin Mehall


Change by Kevin Mehall :


--
keywords: +patch
nosy: +kevinmehall
nosy_count: 5.0 -> 6.0
pull_requests: +25538
stage:  -> patch review
pull_request: https://github.com/python/cpython/pull/26974

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue42369] Reading ZipFile not thread-safe

2021-01-25 Thread Jen Garcia


Change by Jen Garcia :


--
nosy: +cuibonobo

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue42369] Reading ZipFile not thread-safe

2020-11-17 Thread Serhiy Storchaka


Change by Serhiy Storchaka :


--
nosy: +serhiy.storchaka

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue42369] Reading ZipFile not thread-safe

2020-11-16 Thread Thomas


Thomas  added the comment:

I have simplified the test case a bit more:

import multiprocessing.pool, zipfile

# Create a ZipFile with two files and same content
with zipfile.ZipFile("test.zip", "w", zipfile.ZIP_STORED) as z:
z.writestr("file1", b"0"*1)
z.writestr("file2", b"0"*1)

# Read file1  with two threads at once
with zipfile.ZipFile("test.zip", "r") as z:
pool = multiprocessing.pool.ThreadPool(2)
while True:
pool.map(z.read, ["file1", "file1"])

Two files are sufficient to cause the error. It does not matter which files are 
read or which content they have.

I also narrowed down the point of failure a bit. After

self._file.seek(self._pos)

in _SharedFile.read ( 
https://github.com/python/cpython/blob/c79667ff7921444911e8a5dfa5fba89294915590/Lib/zipfile.py#L742
 ), the following assertion should hold:

assert(self._file.tell() == self._pos)

The issue occurs when seeking to position 35 (size of header + length of name). 
Most of the time, self._file.tell() will then be 35 as expected, but sometimes 
it is 8227 instead, i.e. 35 + 8192.

I am not sure how this can happen since the file object should be locked.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue42369] Reading ZipFile not thread-safe

2020-11-16 Thread Thomas


Thomas  added the comment:

Scratch what I said in the previous message. I thought that the lock was 
created in _SharedFile and did not notice that it was passed as a parameter.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue42369] Reading ZipFile not thread-safe

2020-11-16 Thread Thomas


Thomas  added the comment:

I have not observed any segfaults yet. Only zipfile.BadZipFile exceptions so 
far.

The exact file at which it crashes is fairly random. It even crashes if all 
threads try to read the same file multiple times.

I think the root cause of the problem is that the reads of zef_file in 
ZipFile.read are not locked properly.

https://github.com/python/cpython/blob/c79667ff7921444911e8a5dfa5fba89294915590/Lib/zipfile.py#L1515

The underlying file object is shared between all ZipExtFiles. Every time a 
thread makes a call to ZipFile.read, a new lock is created in _SharedFile, but 
that lock only protects against multiple threads reading the same ZipExtFile. 
Multiple threads reading different ZipExtFiles with the same underlying file 
object will cause trouble. The locks do nothing in this scenario because they 
are individual to each thread and not shared.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue42369] Reading ZipFile not thread-safe

2020-11-16 Thread Eric V. Smith


Eric V. Smith  added the comment:

I'm changing from "crash" to "behavior". We use "crash" for a segfault or 
equivalent. I realize that most people are unlikely to know this, but we 
consider "crash" to be more alarming, so I want to make sure it's correct.

Also: when this happens, is it always for file 127, or does it change on each 
run?

--
nosy: +eric.smith
type: crash -> behavior

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue42369] Reading ZipFile not thread-safe

2020-11-16 Thread Ma Lin


Change by Ma Lin :


--
nosy: +malin

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue42369] Reading ZipFile not thread-safe

2020-11-16 Thread Thomas


Change by Thomas :


--
components: +Library (Lib)
type:  -> crash

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue42369] Reading ZipFile not thread-safe

2020-11-16 Thread Thomas


New submission from Thomas :

According to https://docs.python.org/3.5/whatsnew/changelog.html#id108 
bpo-14099, reading multiple ZipExtFiles should be thread-safe, but it is not.

I created a small example where two threads try to read files from the same 
ZipFile simultaneously, which crashes with a Bad CRC-32 error. This is 
especially surprising since all files in the ZipFile only contain 0-bytes and 
have the same CRC.

My use case is a ZipFile with 82000 files. Creating multiple ZipFiles from the 
same "physical" zip file is not a satisfactory workaround because it takes 
several seconds each time. Instead, I open it only once and clone it for each 
thread:

with zipfile.ZipFile("/tmp/dummy.zip", "w") as dummy:
pass

def clone_zipfile(z):
z_cloned = zipfile.ZipFile("/tmp/dummy.zip")
z_cloned.NameToInfo = z.NameToInfo
z_cloned.fp = open(z.fp.name, "rb")
return z_cloned

This is a much better solution for my use case than locking. I am using 
multiple threads because I want to finish my task faster, but locking defeats 
that purpose.

However, this cloning is somewhat of a dirty hack and will break when the file 
is not a real file but rather a file-like object.

Unfortunately, I do not have a solution for the general case.

--
files: test.py
messages: 381090
nosy: Thomas
priority: normal
severity: normal
status: open
title: Reading ZipFile not thread-safe
versions: Python 3.7, Python 3.8
Added file: https://bugs.python.org/file49601/test.py

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com