[issue40757] tarfile: ignore_zeros = True won't raise exception even on invalid (non-zero) TARs
mxmlnkn added the comment:

I think you misunderstood. foo.txt is a file that actually exists but contains non-TAR data. E.g., try:

    base64 /dev/urandom | head -c $(( 2048 )) > foo.txt
    python3 -c 'import tarfile; print(list(tarfile.open("foo.txt")))'

    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/usr/lib/python3.9/tarfile.py", line 1616, in open
        raise ReadError("file could not be opened successfully")
    tarfile.ReadError: file could not be opened successfully

    python3 -c 'import tarfile; print(list(tarfile.open("foo.txt", ignore_zeros=True)))'

    []

--
status: pending -> open

___
Python tracker <https://bugs.python.org/issue40757>
___
___
Python-bugs-list mailing list
Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue45287] zipfile.is_zipfile returns true for a rar file containing zips
New submission from mxmlnkn:

I have created a RAR file containing two zip files like this:

    zip bag.zip README.md CHANGELOG.md
    zip bag1.zip CHANGELOG.md
    rar a zips.rar bag.zip bag1.zip

When calling `zipfile.is_zipfile` on zips.rar, it returns true even though it obviously is not a zip. The zips.rar file doesn't even begin with the magic bytes `PK` for zip but with `Rar!`.

--
files: zips.rar
messages: 402624
nosy: mxmlnkn
priority: normal
severity: normal
status: open
title: zipfile.is_zipfile returns true for a rar file containing zips
Added file: https://bugs.python.org/file50305/zips.rar

___
Python tracker <https://bugs.python.org/issue45287>
___
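A possible stopgap for callers, sketched under assumptions: in CPython, `zipfile.is_zipfile` looks for the end-of-central-directory record near the end of the file rather than checking the leading bytes, which would explain the result above if the RAR stores its last member uncompressed. The helper name `is_plain_zipfile` and the `ZIP_MAGICS` tuple are hypothetical, not part of `zipfile`. Note that this check is stricter than the ZIP format itself, which permits prepended data (e.g., self-extracting archives), so it will reject those as well.

```python
import zipfile

# Signatures a plain ZIP file may start with:
# local file header, end of central directory (empty archive), spanned-archive marker.
ZIP_MAGICS = (b"PK\x03\x04", b"PK\x05\x06", b"PK\x07\x08")


def is_plain_zipfile(path):
    """Return True only if path starts with a ZIP signature AND is_zipfile agrees."""
    with open(path, "rb") as f:
        magic = f.read(4)
    return magic in ZIP_MAGICS and zipfile.is_zipfile(path)
```

With this, a file starting with `Rar!` is rejected before `is_zipfile` is even consulted.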
[issue45286] zipfile missing API for links
New submission from mxmlnkn:

When using zipfile as a library to get simple files, there is no method to determine whether a ZipInfo object is a link or not. Moreover, links can simply be opened and read like normal files and will contain the link path, which is unexpected in most cases, I think.

ZipInfo already has an `is_dir` getter. It would be nice if there also was an `is_link` getter. Note that `__repr__` actually shows the file mode, which is `lrwxrwxrwx`, but there is not even a getter for the file mode.

For now, I can try to use the code from `__repr__` to extract the file mode from the `external_attr` member, but the contents of that member are not documented in zipfile and, assuming it is the same as in the ZIP file format specification, it is OS-dependent.

In addition to `is_link`, some getter like `linkname` or so would be nice.

As to how it should behave when calling `open` or `read` on a link, I'm not sure.

--
messages: 402617
nosy: mxmlnkn
priority: normal
severity: normal
status: open
title: zipfile missing API for links

___
Python tracker <https://bugs.python.org/issue45286>
___
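The workaround mentioned above could look roughly like the following sketch. It assumes the common convention that archives created on Unix (`create_system == 3`) store the POSIX file mode in the upper 16 bits of `external_attr`; as the report notes, zipfile does not document this. The names `is_link` and `link_target` are hypothetical helpers, not zipfile API, and decoding the target as UTF-8 is also an assumption.

```python
import stat
import zipfile

UNIX = 3  # value of ZipInfo.create_system for archives created on Unix


def is_link(info):
    """Guess whether a ZipInfo describes a symbolic link (Unix-created archives only)."""
    mode = info.external_attr >> 16  # assumed: POSIX mode in the upper 16 bits
    return info.create_system == UNIX and stat.S_ISLNK(mode)


def link_target(archive, info):
    """Return the link target, which is stored as the member's file content."""
    return archive.read(info).decode("utf-8")  # assumed encoding
```

For members written on other systems the helper simply returns False, since the meaning of `external_attr` is not known there.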
[issue40843] tarfile: ignore_zeros = True exceedingly slow on a sparse tar file
New submission from mxmlnkn:

Consider this example replicating a real use case where I was downloading the first ~1GiB of the 1.191TiB ImageNet archive in sequential order to preview it:

    echo "foo" > bar
    tar cf sparse.tar bar

    #!/usr/bin/env python3
    # -*- coding: utf-8 -*-

    import os
    import tarfile
    import time

    t0 = time.time()
    for tarInfo in tarfile.open( 'sparse.tar', 'r:', ignore_zeros = True ):
        pass
    t1 = time.time()
    print( f"Small TAR took {t1 - t0}s to iterate over" )

    # Extend the archive with a 2 GiB sparse tail.
    # 'r+b' keeps the existing TAR data; 'wb' would discard it before extending.
    f = open( 'sparse.tar', 'r+b' )
    f.truncate( 2*1024*1024*1024 )
    f.close()

    t0 = time.time()
    for tarInfo in tarfile.open( 'sparse.tar', 'r:', ignore_zeros = True ):
        pass
    t1 = time.time()
    print( f"Small TAR with sparse tail took {t1 - t0}s to iterate over" )

Output:

    Small TAR took 0.00020813941955566406s to iterate over
    Small TAR with sparse tail took 6.999570846557617s to iterate over

So, tarfile iterates over sparse holes at ~300MiB/s. That sounds fast, but it is really slow for 1.2TiB, especially considering that tarfile is doing basically >nothing<.

There should be better options, like using os.lseek with os.SEEK_DATA, if available, to skip those empty holes.

An alternative would be an option to tell tarfile the maximum number of zeros it should skip. Personally, I only use the ignore_zeros option to be able to work with concatenated TARs, which in my case only have up to 19*512 bytes of empty tar blocks to be skipped. Anything longer would indicate an invalid file. I'm aware that these maximum runs of zeros vary depending on the tar blocking factor, so it should be adjustable.

--
messages: 370611
nosy: mxmlnkn
priority: normal
severity: normal
status: open
title: tarfile: ignore_zeros = True exceedingly slow on a sparse tar file
versions: Python 3.7

___
Python tracker <https://bugs.python.org/issue40843>
___
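To illustrate the os.lseek suggestion, here is a minimal sketch of how a reader could locate the next non-hole region of a file. It is not how tarfile works today; `next_data_offset` is a hypothetical helper, and `os.SEEK_DATA` is only available on some platforms and only effective on file systems that report holes (elsewhere the whole file is typically treated as data).

```python
import errno
import os


def next_data_offset(fd, offset):
    """Return the offset of the next data region at or after offset.

    Returns None if only a hole (or nothing) remains up to the end of the file.
    Falls back to returning offset unchanged where os.SEEK_DATA is unavailable.
    """
    if not hasattr(os, "SEEK_DATA"):
        return offset
    try:
        return os.lseek(fd, offset, os.SEEK_DATA)
    except OSError as e:
        if e.errno == errno.ENXIO:  # no more data after offset
            return None
        raise
```

A reader that hits a zero block could call this and round the result down to a multiple of the 512-byte block size before resuming, instead of reading every empty block in between.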
[issue40757] tarfile: ignore_zeros = True won't raise exception even on invalid (non-zero) TARs
New submission from mxmlnkn:

Normally, when opening an existing non-TAR file, e.g., a file with random data, an exception is raised:

    tarfile.open( "foo.txt" )

    ---
    ReadError                                 Traceback (most recent call last)
     in ()
    ----> 1 f = tarfile.open( "notes.txt", ignore_zeros = False )

    /usr/lib/python3.7/tarfile.py in open(cls, name, mode, fileobj, bufsize, **kwargs)
       1576                     fileobj.seek(saved_pos)
       1577                     continue
    -> 1578             raise ReadError("file could not be opened successfully")
       1579
       1580         elif ":" in mode:

    ReadError: file could not be opened successfully

However, when specifying ignore_zeros = True, this check against invalid data seems to be turned off. Note that it is >invalid< data, not >zero< data, and therefore should still raise an exception!

    tarfile.open( "foo.txt", ignore_zeros = True )

Iterating over that opened tarfile also works without an exception; however, nothing will be iterated over, i.e., it behaves like an empty TAR instead of like an invalid TAR.

--
components: Library (Lib)
messages: 369816
nosy: mxmlnkn
priority: normal
severity: normal
status: open
title: tarfile: ignore_zeros = True won't raise exception even on invalid (non-zero) TARs
type: behavior
versions: Python 3.7

___
Python tracker <https://bugs.python.org/issue40757>
___
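Until the behavior changes, one possible caller-side guard is to validate the file with the default settings first and only then reopen it with ignore_zeros = True. This is a sketch, not a fix for tarfile itself; `open_tar_ignoring_zeros` is a hypothetical helper name. It relies on `tarfile.is_tarfile`, which opens the file without ignore_zeros, so it only validates the start of the file and would also reject an archive that begins with zero blocks.

```python
import tarfile


def open_tar_ignoring_zeros(path):
    """Open path with ignore_zeros=True, but reject files that do not start like a TAR."""
    # is_tarfile() uses the default (strict) settings, so invalid leading data is detected.
    if not tarfile.is_tarfile(path):
        raise tarfile.ReadError(f"{path!r} could not be opened as a TAR archive")
    return tarfile.open(path, ignore_zeros=True)
```

For the foo.txt example above, this raises ReadError instead of returning an archive that looks empty.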