[issue40757] tarfile: ignore_zeros = True won't raise exception even on invalid (non-zero) TARs

2022-01-21 Thread mxmlnkn


mxmlnkn  added the comment:

I think you misunderstood. foo.txt is a file that actually exists but 
contains non-TAR data. For example, try:

base64 /dev/urandom | head -c $(( 2048 ))  > foo.txt
python3 -c 'import tarfile; print(list(tarfile.open("foo.txt")))'

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/lib/python3.9/tarfile.py", line 1616, in open
    raise ReadError("file could not be opened successfully")
tarfile.ReadError: file could not be opened successfully

python3 -c 'import tarfile; print(list(tarfile.open("foo.txt", ignore_zeros=True)))'

[]
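
A possible workaround, sketched under assumptions: probe the file once 
without ignore_zeros so that non-TAR data at the start still raises 
ReadError, and only then reopen it with ignore_zeros=True. The helper name 
`open_tar_checked` is hypothetical and not part of tarfile. Note that this 
only validates the first header; invalid data further into the file would 
still be skipped silently.

```python
import tarfile


def open_tar_checked(path, **kwargs):
    """Open `path` with ignore_zeros=True, but only after a strict probe.

    Hypothetical helper, not part of the tarfile module.
    """
    # Strict probe: without ignore_zeros, non-TAR data at the start of the
    # file makes tarfile.open raise tarfile.ReadError.
    with tarfile.open(path, **kwargs):
        pass
    # Lenient reopen so that concatenated TARs can still be iterated.
    return tarfile.open(path, ignore_zeros=True, **kwargs)
```

Usage would look like `for member in open_tar_checked("foo.txt"): ...`, 
which raises for the random-data file above instead of yielding nothing.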

--
status: pending -> open

___
Python tracker 
<https://bugs.python.org/issue40757>
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue45287] zipfile.is_zipfile returns true for a rar file containing zips

2021-09-25 Thread mxmlnkn


New submission from mxmlnkn :

I have created a RAR file containing two zip files like this:

zip bag.zip README.md CHANGELOG.md
zip bag1.zip CHANGELOG.md
rar a zips.rar bag.zip bag1.zip

And when calling `zipfile.is_zipfile` on zips.rar, it returns true even though 
it obviously is not a zip. The zips.rar file doesn't even begin with the magic 
bytes `PK` for zip but with `Rar!`.
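
Until this is resolved, a stricter check can be layered on top of 
`zipfile.is_zipfile`. The following is only a sketch: the helper name 
`is_plain_zipfile` and the list of accepted signatures are my assumptions, 
and the check deliberately rejects archives with prepended data (e.g. 
self-extracting ZIPs), which zipfile otherwise accepts.

```python
import zipfile

# Signatures a ZIP file may start with: local file header, end of central
# directory (empty archive), and the spanned-archive marker.
ZIP_MAGICS = (b"PK\x03\x04", b"PK\x05\x06", b"PK\x07\x08")


def is_plain_zipfile(path):
    """Return True only if `path` starts with a ZIP signature and
    zipfile.is_zipfile also accepts it. Hypothetical helper."""
    with open(path, "rb") as f:
        magic = f.read(4)
    return magic in ZIP_MAGICS and zipfile.is_zipfile(path)
```

For zips.rar, the first four bytes are `Rar!`, so the helper returns False 
before `zipfile.is_zipfile` is even consulted.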

--
files: zips.rar
messages: 402624
nosy: mxmlnkn
priority: normal
severity: normal
status: open
title: zipfile.is_zipfile returns true for a rar file containing zips
Added file: https://bugs.python.org/file50305/zips.rar

___
Python tracker 
<https://bugs.python.org/issue45287>
___



[issue45286] zipfile missing API for links

2021-09-25 Thread mxmlnkn


New submission from mxmlnkn :

When using zipfile as a library to get simple files, there is no method to 
determine whether a ZipInfo object is a link or not. Moreover, links can 
simply be opened and read like normal files and will contain the link path, 
which is unexpected in most cases, I think.

ZipInfo already has an `is_dir` getter. It would be nice if there also was an 
`is_link` getter. Note that `__repr__` actually shows the filemode, which is 
`lrwxrwxrwx`. But there is not even a getter for the file mode.

For now, I can try to use the code from `__repr__` to extract the file mode 
from the `external_attr` member, but the contents of that member are not 
documented in zipfile and, assuming it is the same as in the ZIP file format 
specification, it's OS-dependent.

In addition to `is_link`, a getter like `linkname` would be nice. As to how 
it should behave when calling `open` or `read` on a link, I'm not sure.
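
To illustrate the `external_attr` approach, here is a rough sketch. It 
assumes that the upper 16 bits of `external_attr` hold a Unix file mode and 
that this is only meaningful when `create_system` is 3 (Unix). The function 
names `is_link` and `read_link_target` are hypothetical, not existing 
zipfile API.

```python
import stat
import zipfile


def is_link(info):
    """Guess whether a ZipInfo entry is a symbolic link. Hypothetical helper."""
    # Assumption: the upper 16 bits of external_attr are a Unix st_mode,
    # which only applies to entries created on a Unix-like system (3).
    unix_mode = info.external_attr >> 16
    return info.create_system == 3 and stat.S_ISLNK(unix_mode)


def read_link_target(archive, info):
    """Return the link path, which is stored as the entry's file contents."""
    return archive.read(info).decode("utf-8")
```

With an open `zipfile.ZipFile` named `archive`, one could then filter with 
`[i for i in archive.infolist() if is_link(i)]`.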

--
messages: 402617
nosy: mxmlnkn
priority: normal
severity: normal
status: open
title: zipfile missing API for links

___
Python tracker 
<https://bugs.python.org/issue45286>
___



[issue40843] tarfile: ignore_zeros = True exceedingly slow on a sparse tar file

2020-06-02 Thread mxmlnkn


New submission from mxmlnkn :

Consider this example, which replicates a real use case: I had downloaded 
only the first ~1GiB of the 1.191TiB ImageNet archive in sequential order so 
that I could preview it:

echo "foo" > bar
tar cf sparse.tar bar


#!/usr/bin/env python3
# -*- coding: utf-8 -*-

import os
import tarfile
import time

t0 = time.time()
for tarInfo in tarfile.open( 'sparse.tar', 'r:', ignore_zeros = True ):
    pass
t1 = time.time()
print( f"Small TAR took {t1 - t0}s to iterate over" )

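# Extend the file in place: 'r+b' keeps the existing TAR data, whereas 'wb'
# would discard it before truncate() adds the sparse tail.
f = open( 'sparse.tar', 'r+b' )
f.truncate( 2*1024*1024*1024 )
f.close()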

t0 = time.time()
for tarInfo in tarfile.open( 'sparse.tar', 'r:', ignore_zeros = True ):
    pass
t1 = time.time()
print( f"Small TAR with sparse tail took {t1 - t0}s to iterate over" )


Output:

Small TAR took 0.00020813941955566406s to iterate over
Small TAR with sparse tail took 6.999570846557617s to iterate over


So, tarfile iterates over sparse holes at only ~300MiB/s. That sounds fast 
but is really slow for 1.2TiB, especially considering that tarfile is 
basically doing >nothing<.

There should be better options like using os.lseek with os.SEEK_DATA if 
available to skip those empty holes.
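
A minimal sketch of what I mean, not a patch for tarfile: the helper name 
`next_data_offset` is hypothetical, and it assumes that falling back to the 
given offset is acceptable on platforms without os.SEEK_DATA. A real 
implementation would also have to round the result down to a 512-byte block 
boundary before parsing the next header.

```python
import errno
import os


def next_data_offset(fd, offset):
    """Return the offset of the next data region at or after `offset`,
    or None if only a hole remains until the end of the file.

    Hypothetical helper; falls back to `offset` without os.SEEK_DATA.
    """
    seek_data = getattr(os, "SEEK_DATA", None)
    if seek_data is None:
        # The platform cannot report holes, so nothing can be skipped.
        return offset
    try:
        return os.lseek(fd, offset, seek_data)
    except OSError as error:
        if error.errno == errno.ENXIO:
            # No data at or after `offset`: the rest of the file is a hole.
            return None
        raise
```

On filesystems that do not track holes, SEEK_DATA simply reports the whole 
file as data, so the helper degrades to the current behavior.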

An alternative would be an option to tell tarfile the maximum number of 
zeros it should skip.
Personally, I only use the ignore_zeros option to be able to work with 
concatenated TARs, which in my case only have up to 19*512 byte empty tar 
blocks to be skipped. Anything longer would indicate an invalid file. I'm aware 
that these maximum runs of zeros vary depending on the tar blocking factor, so 
it should be adjustable.

--
messages: 370611
nosy: mxmlnkn
priority: normal
severity: normal
status: open
title: tarfile: ignore_zeros = True exceedingly slow on a sparse tar file
versions: Python 3.7

___
Python tracker 
<https://bugs.python.org/issue40843>
___



[issue40757] tarfile: ignore_zeros = True won't raise exception even on invalid (non-zero) TARs

2020-05-24 Thread mxmlnkn


New submission from mxmlnkn :

Normally, when opening an existing non-TAR file, e.g., a file with random data, 
an exception is raised:

tarfile.open( "foo.txt" )

---
ReadError Traceback (most recent call last)
 in ()
> 1 f = tarfile.open( "notes.txt", ignore_zeros = False )

/usr/lib/python3.7/tarfile.py in open(cls, name, mode, fileobj, bufsize, 
**kwargs)
   1576 fileobj.seek(saved_pos)
   1577 continue
-> 1578 raise ReadError("file could not be opened successfully")
   1579 
   1580 elif ":" in mode:

ReadError: file could not be opened successfully

However, when specifying ignore_zeros = True, this check against invalid data 
seems to be turned off. Note that it is >invalid< data not >zero< data and 
therefore should still raise an exception!

tarfile.open( "foo.txt", ignore_zeros = True )

Iterating over that opened tarfile also works without an exception; however, 
nothing will be iterated over, i.e., it behaves like an empty TAR instead of 
like an invalid TAR.

--
components: Library (Lib)
messages: 369816
nosy: mxmlnkn
priority: normal
severity: normal
status: open
title: tarfile: ignore_zeros = True won't raise exception even on invalid 
(non-zero) TARs
type: behavior
versions: Python 3.7

___
Python tracker 
<https://bugs.python.org/issue40757>
___