New submission from Jussi Judin:

I managed to create a tarball that brings out quite nasty behavior in 
tarfile.TarFile.extract() and tarfile.TarFile.extractall() when the tarball 
contains hard links that point to themselves, that is, to a file of the same 
name that is already included in the tarball. In Python 2.7 this leads to an 
exception, and with Python 3.4-3.6 the same file is extracted from the tarball 
multiple times.

First we create a tarball that causes this behavior:

$ mkdir -p tardata/1/2/3/4/5/6/7/8/9
$ dd if=/dev/zero of=tardata/1/2/3/4/5/6/7/8/9/zeros.data bs=1000000 count=500
# tar by default adds all directories recursively multiple times to the
# archive, but duplicates are created as hard links:
$ find tardata | xargs tar cvfz tardata.tar.gz
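
For anyone who would rather not create a 500 MB file, a similar archive layout 
can probably be built directly with the tarfile module. This is only a minimal 
sketch, not part of the attached script: build_self_link_tar and its parameters 
are hypothetical names, and it assumes that a hard-link entry whose linkname 
equals its own name is equivalent to what GNU tar writes in the case above.

```python
import io
import tarfile


def build_self_link_tar(name="data.bin", payload=b"\0" * 1024, links=2):
    """Return the bytes of an uncompressed tar archive that holds one
    regular file followed by `links` hard-link entries pointing at the
    same name (hypothetical helper, for illustration only)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        # First entry: the regular file with its data.
        regular = tarfile.TarInfo(name)
        regular.size = len(payload)
        tar.addfile(regular, io.BytesIO(payload))
        # Following entries: header-only hard links that name themselves.
        for _ in range(links):
            link = tarfile.TarInfo(name)
            link.type = tarfile.LNKTYPE
            link.linkname = name
            tar.addfile(link)
    return buf.getvalue()
```

Writing the returned bytes to a file gives a small archive whose member list 
has the same shape as the one listed by the commands above.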

Then let's extract the tarball with the tarfile module. The following commands 
demonstrate what happens with the attached tartest.py file:

$ python2.7.13 tartest.py noskip tardata.tar.gz /tmp/tardata-python-2.7.13
...
tardata/1/2/3/4/5/6/7/8/9/zeros.data
...
tardata/1/2/3/4/5/6/7/8/9/zeros.data
Traceback (most recent call last):
  File "tartest.py", line 17, in <module>
    unarchive(skip, archive, dest)
  File "tartest.py", line 12, in unarchive
    tar_fd.extract(info, dest)
  File "python/2.7.13/lib/python2.7/tarfile.py", line 2118, in extract
    self._extract_member(tarinfo, os.path.join(path, tarinfo.name))
  File "python/2.7.13/lib/python2.7/tarfile.py", line 2202, in _extract_member
    self.makelink(tarinfo, targetpath)
  File "python/2.7.13/lib/python2.7/tarfile.py", line 2286, in makelink
    os.link(tarinfo._link_target, targetpath)
OSError: [Errno 2] No such file or directory

And with Python 3.6.0 (and the earlier Python 3 releases that I have tested):

$ time python3.6.0 tartest.py noskip tardata.tar.gz /tmp/tardata-python-3.6.0
...
tardata/1/2/3/4/5/6/7/8/9/zeros.data <-- this is extracted 11 times
...
real    0m42.747s
user    0m17.564s
sys     0m6.144s

If we then make tartest.py skip extraction of hard links that point to 
themselves:

$ time python3.6.0 tartest.py skip tardata.tar.gz /tmp/tardata-python-3.6.0
...
tardata/1/2/3/4/5/6/7/8/9/zeros.data <-- this is extracted once
...
Skipping tardata/1/2/3/4/5/6/7/8/9/zeros.data <-- skipped hard links 10 times
...
real    0m2.688s
user    0m1.816s
sys     0m0.532s
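
The attached tartest.py is not reproduced in this message, so here is a minimal 
sketch of what its skip mode could look like. The names unarchive, skip, 
archive, dest, tar_fd and info come from the traceback above; is_self_link is a 
hypothetical helper, and comparing linkname to name is my assumption about how 
a self-referencing hard link is detected.

```python
import tarfile


def is_self_link(info):
    """True for a hard-link member whose target is its own name
    (hypothetical helper, not part of the tarfile API)."""
    return info.islnk() and info.linkname == info.name


def unarchive(skip, archive, dest):
    """Extract every member of `archive` into `dest`, optionally
    skipping hard links that point to themselves."""
    with tarfile.open(archive) as tar_fd:
        for info in tar_fd:
            if skip and is_self_link(info):
                print("Skipping", info.name)
                continue
            print(info.name)
            tar_fd.extract(info, dest)
```

With skip set to True the file data is written once, when the regular-file 
entry is reached, and the later link entries are only reported.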

From the user CPU time it's obvious that a lot of unneeded decompression is 
happening in the first Python 3.6 run when we compare the two Python 3.6 
results. If I use TarFile.extractall(), it behaves the same way as calling 
TarFile.extract() individually on each TarInfo object. GNU tar seems to skip 
over the extraction of the actual file data when it encounters this situation.

----------
components: Library (Lib)
files: tartest.py
messages: 288284
nosy: Jussi Judin
priority: normal
severity: normal
status: open
title: TarFile.extract() suffers from hard links inside tarball
type: behavior
versions: Python 2.7, Python 3.4, Python 3.5, Python 3.6
Added file: http://bugs.python.org/file46658/tartest.py

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue29612>
_______________________________________