New submission from Jussi Judin:

I managed to create a tarball that brings out quite nasty behavior in the tarfile.TarFile.extract() and tarfile.TarFile.extractall() functions. It happens when the tarball contains hard links that point to themselves, that is, to a file that is itself included in the tarball. In Python 2.7 this leads to an exception, and with Python 3.4-3.6 the same file is extracted from the tarball multiple times.
First we create a tarball that causes this behavior:

$ mkdir -p tardata/1/2/3/4/5/6/7/8/9
$ dd if=/dev/zero of=tardata/1/2/3/4/5/6/7/8/9/zeros.data bs=1000000 count=500
# tar by default adds all directories recursively multiple times to the archive, but duplicates are created as hard links:
$ find tardata | xargs tar cvfz tardata.tar.gz

Then let's extract the tarball with the tarfile module. The following commands demonstrate what happens with the attached tartest.py file:

$ python2.7.13 tartest.py noskip tardata.tar.gz /tmp/tardata-python-2.7.13
...
tardata/1/2/3/4/5/6/7/8/9/zeros.data
...
tardata/1/2/3/4/5/6/7/8/9/zeros.data
Traceback (most recent call last):
  File "tartest.py", line 17, in <module>
    unarchive(skip, archive, dest)
  File "tartest.py", line 12, in unarchive
    tar_fd.extract(info, dest)
  File "python/2.7.13/lib/python2.7/tarfile.py", line 2118, in extract
    self._extract_member(tarinfo, os.path.join(path, tarinfo.name))
  File "python/2.7.13/lib/python2.7/tarfile.py", line 2202, in _extract_member
    self.makelink(tarinfo, targetpath)
  File "python/2.7.13/lib/python2.7/tarfile.py", line 2286, in makelink
    os.link(tarinfo._link_target, targetpath)
OSError: [Errno 2] No such file or directory

And with Python 3.6.0 (and the earlier Python 3 releases that I have tested):

$ time python3.6.0 tartest.py noskip tardata.tar.gz /tmp/tardata-python-3.6.0
...
tardata/1/2/3/4/5/6/7/8/9/zeros.data <-- this is extracted 11 times
...
real 0m42.747s
user 0m17.564s
sys 0m6.144s

If we then make the tarfile extraction skip hard links that point to themselves:

$ time python3.6.0 tartest.py skip tardata.tar.gz /tmp/tardata-python-3.6.0
...
tardata/1/2/3/4/5/6/7/8/9/zeros.data <-- this is extracted once
...
Skipping tardata/1/2/3/4/5/6/7/8/9/zeros.data <-- skipped hard links 10 times
...
real 0m2.688s
user 0m1.816s
sys 0m0.532s

Comparing the user CPU time of the two Python 3.6 runs, it is obvious that a lot of unneeded decompression is happening.
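The attached tartest.py is not reproduced in this message, so the following is only a minimal sketch of what a "skip" mode along these lines could look like. It assumes that a self-referencing hard link shows up as a TarInfo entry whose linkname equals its own name. The helper names is_self_link, extract_skipping_self_links and make_demo_archive are hypothetical and used for illustration only; make_demo_archive builds a tiny stand-in archive rather than the 500 MB one above.

```python
import io
import tarfile


def is_self_link(info):
    """Return True for a hard link entry whose target is its own member name."""
    return info.islnk() and info.linkname == info.name


def extract_skipping_self_links(archive, dest):
    """Extract members one by one, skipping self-referencing hard links.

    Returns the names of the skipped entries (hypothetical helper).
    """
    # Use the "data" extraction filter where this Python version offers it.
    kwargs = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
    skipped = []
    with tarfile.open(archive) as tar_fd:
        for info in tar_fd:
            if is_self_link(info):
                skipped.append(info.name)
                continue
            tar_fd.extract(info, dest, **kwargs)
    return skipped


def make_demo_archive(path, copies=3):
    """Write a small archive: one regular file followed by `copies` self-links."""
    payload = b"\0" * 1024
    with tarfile.open(path, "w:gz") as tar_fd:
        regular = tarfile.TarInfo("tardata/zeros.data")
        regular.size = len(payload)
        tar_fd.addfile(regular, io.BytesIO(payload))
        for _ in range(copies):
            # A hard link entry that points at its own member name.
            link = tarfile.TarInfo("tardata/zeros.data")
            link.type = tarfile.LNKTYPE
            link.linkname = "tardata/zeros.data"
            tar_fd.addfile(link)
```

Calling extract_skipping_self_links("tardata.tar.gz", "/tmp/out") would then extract the regular file once and return the names of the link entries it skipped. This only works around the problem in the calling script; it does not change how tarfile itself handles such entries.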
If I use TarFile.extractall(), it behaves the same way as calling TarFile.extract() individually on each TarInfo object. GNU tar seems to skip over the extraction of the actual file data when it encounters this situation.

----------
components: Library (Lib)
files: tartest.py
messages: 288284
nosy: Jussi Judin
priority: normal
severity: normal
status: open
title: TarFile.extract() suffers from hard links inside tarball
type: behavior
versions: Python 2.7, Python 3.4, Python 3.5, Python 3.6
Added file: http://bugs.python.org/file46658/tartest.py

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue29612>
_______________________________________
_______________________________________________
Python-bugs-list mailing list