With tar files, extracting the file you want requires reading through all the files before it, so reading the k-th file takes time proportional to k (or to the total size of the first k files, anyway). If you extract all N files separately that way, the total work is 1 + 2 + 3 + ... + N = N(N+1)/2, which is O(N^2). So if untarring the archive once is already slow, doing it N times is going to be *really* unpleasant. :)
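If the per-file work doesn't have to run in parallel, one way to avoid the quadratic re-reads is GNU tar's --to-command option, which streams each member to a command's stdin in a single pass over the archive. A minimal self-contained sketch (the demo archive and `wc -c` are just stand-ins for your real archive and processing command):

```shell
# Build a small demo archive, then process every member in ONE pass:
# GNU tar runs the given command once per regular file, with the member's
# bytes on the command's stdin (the member name is exported as $TAR_FILENAME).
mkdir -p demo
printf 'hello' > demo/a.txt
printf 'world!' > demo/b.txt
tar -czf demo.tar.gz demo
tar -xzf demo.tar.gz --to-command='wc -c'   # one byte count per file
```

This reads the compressed stream exactly once and never writes the extracted files to disk, though the per-file commands run sequentially rather than in parallel.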
On Tue, Mar 29, 2011 at 5:41 PM, Cook, Malcolm <[email protected]> wrote:
> ooops, more like:
>
>     tar -tf big-file.tar.gz | parallel tar -O -x -f big-file.tar.gz '|' someCommandThatReadsFromStdIn
>
> Malcolm Cook
> Stowers Institute for Medical Research - Bioinformatics
> Kansas City, Missouri USA
>
>> -----Original Message-----
>> From: [email protected]
>> [mailto:[email protected]] On Behalf Of Cook, Malcolm
>> Sent: Tuesday, March 29, 2011 4:35 PM
>> To: 'Ole Tange'; 'Jay Hacker'
>> Cc: '[email protected]'
>> Subject: RE: Processing files from a tar archive in parallel
>>
>> Hmmm
>>
>> use tar -t to list the filenames, pipe that into parallel to
>> call tar again to extract just that file and pipe it to some
>> other command:
>>
>>     tar -t big-file.tar.gz | parallel tar -f big-file.tar.gz - '|' someCommandThatReadsFromStdIn
>>
>> Malcolm Cook
>> Stowers Institute for Medical Research - Bioinformatics
>> Kansas City, Missouri USA
>>
>>> -----Original Message-----
>>> From: [email protected]
>>> [mailto:[email protected]] On Behalf Of Ole Tange
>>> Sent: Tuesday, March 29, 2011 4:14 PM
>>> To: Jay Hacker
>>> Cc: [email protected]
>>> Subject: Re: Processing files from a tar archive in parallel
>>>
>>> On Tue, Mar 29, 2011 at 10:14 PM, Jay Hacker <[email protected]> wrote:
>>>> On Tue, Mar 29, 2011 at 11:20 AM, Hans Schou <[email protected]> wrote:
>>>>> On Tue, 29 Mar 2011, Jay Hacker wrote:
>>>>>
>>>>>> I have a large gzipped tar archive containing many small files; just
>>>>>> untarring it takes a lot of time and space. I'd like to be able to
>>>>>> process each file in the archive, ideally without untarring the
>>>>>> whole thing first,
>>> :
>>>>> tar xvf big-file.tar.gz | parallel echo "Proc this file {}"
>>>>>
>>>>> Parallel will start when the first file is untarred.
>>> :
>>>> That is a great idea. However, can I be sure the file is completely
>>>> written to disk before tar prints the filename?
>>>
>>> While I loved Hans' idea, it does indeed have a race condition. This
>>> should run 'ls -l' on each file after decompressing and clearly fails
>>> now and then:
>>>
>>> $ tar xvf ../i.tgz | parallel ls -l > ls-l
>>> ls: cannot access 1792: No such file or directory
>>> ls: cannot access 209: No such file or directory
>>> ls: cannot access 21: No such file or directory
>>> ls: cannot access 2256: No such file or directory
>>> ls: cannot access 2349: No such file or directory
>>> ls: cannot access 2363: No such file or directory
>>> ls: cannot access 246: No such file or directory
>>> ls: cannot access 2712: No such file or directory
>>>
>>> But you could unpack in a new dir and use:
>>> http://www.gnu.org/software/parallel/man.html#example__gnu_parallel_as_dir_processor
>>>
>>> That seems to work.
>>>
>>> /Ole
