With tar files, extracting the file you want requires reading through all the files before it, so reading the k-th file takes time proportional to k (or to the total size of the first k files, anyway). If you extract all N files separately that way, the total work is 1 + 2 + 3 + ... + N = N(N+1)/2, which is O(N^2). So if untarring the archive once is already slow, doing it N times is going to be *really* unpleasant. :)
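If the per-file work doesn't have to run in parallel, one way to avoid the quadratic re-reads is GNU tar's --to-command option, which streams each member to a command's stdin in a single pass over the archive. A minimal self-contained sketch (the demo archive and `wc -c` are just stand-ins for your real archive and processing command):

```shell
# Build a small demo archive, then process every member in ONE pass:
# GNU tar runs the given command once per regular file, with the member's
# bytes on the command's stdin (the member name is exported as $TAR_FILENAME).
mkdir -p demo
printf 'hello' > demo/a.txt
printf 'world!' > demo/b.txt
tar -czf demo.tar.gz demo
tar -xzf demo.tar.gz --to-command='wc -c'   # one byte count per file
```

This reads the compressed stream exactly once and never writes the extracted files to disk, though the per-file commands run sequentially rather than in parallel.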
On Tue, Mar 29, 2011 at 5:41 PM, Cook, Malcolm <[email protected]> wrote:
> ooops, more like:
>
>     tar -tf big-file.tar.gz | parallel tar -O -x -f big-file.tar.gz '|' someCommandThatReadsFromStdIn
>
> Malcolm Cook
> Stowers Institute for Medical Research - Bioinformatics
> Kansas City, Missouri USA
>
>> -----Original Message-----
>> From: [email protected]
>> [mailto:[email protected]] On Behalf Of Cook, Malcolm
>> Sent: Tuesday, March 29, 2011 4:35 PM
>> To: 'Ole Tange'; 'Jay Hacker'
>> Cc: '[email protected]'
>> Subject: RE: Processing files from a tar archive in parallel
>>
>> Hmmm
>>
>> use tar -t to list the filenames, pipe that into parallel to
>> call tar again to extract just that file and pipe it to some
>> other command:
>>
>>     tar -t big-file.tar.gz | parallel tar -f big-file.tar.gz - '|' someCommandThatReadsFromStdIn
>>
>> Malcolm Cook
>> Stowers Institute for Medical Research - Bioinformatics
>> Kansas City, Missouri USA
>>
>>> -----Original Message-----
>>> From: [email protected]
>>> [mailto:[email protected]] On Behalf Of Ole Tange
>>> Sent: Tuesday, March 29, 2011 4:14 PM
>>> To: Jay Hacker
>>> Cc: [email protected]
>>> Subject: Re: Processing files from a tar archive in parallel
>>>
>>> On Tue, Mar 29, 2011 at 10:14 PM, Jay Hacker <[email protected]> wrote:
>>>> On Tue, Mar 29, 2011 at 11:20 AM, Hans Schou <[email protected]> wrote:
>>>>> On Tue, 29 Mar 2011, Jay Hacker wrote:
>>>>>
>>>>>> I have a large gzipped tar archive containing many small files; just
>>>>>> untarring it takes a lot of time and space. I'd like to be able to
>>>>>> process each file in the archive, ideally without untarring the
>>>>>> whole thing first,
>>> :
>>>>> tar xvf big-file.tar.gz | parallel echo "Proc this file {}"
>>>>>
>>>>> Parallel will start when the first file is untarred.
>>> :
>>>> That is a great idea. However, can I be sure the file is completely
>>>> written to disk before tar prints the filename?
>>>
>>> While I loved Hans' idea, it does indeed have a race condition. This
>>> should run 'ls -l' on each file after decompressing and clearly fails
>>> now and then:
>>>
>>> $ tar xvf ../i.tgz | parallel ls -l > ls-l
>>> ls: cannot access 1792: No such file or directory
>>> ls: cannot access 209: No such file or directory
>>> ls: cannot access 21: No such file or directory
>>> ls: cannot access 2256: No such file or directory
>>> ls: cannot access 2349: No such file or directory
>>> ls: cannot access 2363: No such file or directory
>>> ls: cannot access 246: No such file or directory
>>> ls: cannot access 2712: No such file or directory
>>>
>>> But you could unpack in a new dir and use:
>>> http://www.gnu.org/software/parallel/man.html#example__gnu_parallel_as_dir_processor
>>>
>>> That seems to work.
>>>
>>> /Ole
