Re: [R] gzfile with multiple entries in the archive

John James Fri, 17 Nov 2006 07:28:18 -0800

Following suggestions from Prof. Ripley and several others to use gzfile,
here's rough code that will unzip a tgz into your working directory and
return a list of the files. (It doesn't warn you that it is overwriting
files!)


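The magic numbers (the 512-byte block size and the positions of the name
and size fields in the header) come from the current tar header
specification; the read block size and the limits on blocks and files are
arbitrary.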
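For reference, here is a minimal sketch of where those numbers come from,
assuming the POSIX ustar layout: a 512-byte header with the member name in
bytes 0-99, the size as octal text in bytes 124-135 and the type flag in
byte 156. The name parse_tar_header is made up for illustration and is not
part of the function below.

```r
# Pull a few fields out of one 512-byte tar header block.
# Offsets assume the ustar layout; R indexes from 1, so byte 0 is block[1].
parse_tar_header <- function(block) {
  stopifnot(is.raw(block), length(block) == 512)
  # Return the text of bytes from..to, cut at the first NUL byte
  field <- function(from, to) {
    bytes <- block[from:to]
    nul <- which(bytes == as.raw(0))
    if (length(nul) > 0) bytes <- bytes[seq_len(nul[1] - 1)]
    rawToChar(bytes)
  }
  list(
    name     = field(1, 100),    # bytes 0-99: member name
    size     = field(125, 136),  # bytes 124-135: size, as octal text
    typeflag = field(157, 157)   # byte 156: "0" (or empty) for a regular file
  )
}
```

The size comes back as octal text here; converting it to a number is a
separate step.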

It is inefficient in that it re-opens and re-reads the temporary file from
the start for every member of the archive. I couldn't get the file pointer
to stay put while changing the readBin mode back from 'character' to 'raw',
although switching the other way (from 'raw' to 'character') works! Is
there a setting I've missed?
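One possible way round this, sketched on the assumption that every member
has a plain ustar header (no GNU long-name or pax extension entries): stay
in 'raw' mode throughout and convert the header fields with rawToChar, so
the connection is opened once and read strictly forwards. list_tgz is a
hypothetical name, and the sketch only lists members rather than
extracting them.

```r
# Walk a gzipped tar archive once, reading raw bytes only.
# Returns a named list: member name -> size in bytes.
list_tgz <- function(path) {
  con <- gzfile(path, "rb")
  on.exit(close(con))
  members <- list()
  repeat {
    header <- readBin(con, "raw", n = 512)
    # A short read or an all-zero block marks the end of the archive
    if (length(header) < 512 || all(header == as.raw(0))) break
    name_bytes <- header[1:100]
    name <- rawToChar(name_bytes[name_bytes != as.raw(0)])  # drop NUL padding
    size_bytes <- header[125:135]
    size_text <- trimws(rawToChar(size_bytes[size_bytes != as.raw(0)]))
    size <- strtoi(size_text, base = 8L)  # the size field is octal text
    if (is.na(size)) stop("unreadable size field for ", name)
    members[[name]] <- size
    # Skip the body, which is padded up to a multiple of 512 bytes
    padded <- ceiling(size / 512) * 512
    if (padded > 0) readBin(con, "raw", n = padded)
  }
  members
}
```

Skipping a body this way still reads it into memory, so very large members
would need to be skipped in smaller chunks.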

Also, is there a better way to do the convert(..) function?
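
A hedged suggestion for the conversion: in current versions of R, base
strtoi() parses text in a given base, which covers the octal-to-decimal
case directly. It returns an integer, so it gives NA for sizes beyond the
integer range; the second helper below is a sketch of a fallback that
works in doubles. Both function names are made up for illustration.

```r
# Tar size fields hold octal digits as text, e.g. "00000001750"
octal_to_decimal <- function(x) strtoi(x, base = 8L)

# Fallback for values too large for an R integer: sum digit * 8^position
octal_to_numeric <- function(x) {
  digits <- as.integer(strsplit(x, "")[[1]])
  sum(digits * 8^rev(seq_along(digits) - 1))
}

octal_to_decimal("00000001750")  # 1000
```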

All criticisms gratefully received, especially being pointed to an existing
function.
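
On the existing-function point: later versions of R than the one this
thread was written against ship utils::untar(), which can list and extract
gzipped tar archives from within R on any platform. A minimal sketch,
assuming such a version; list_and_extract is a hypothetical wrapper name,
and tar = "internal" asks for R's own implementation rather than an
external tar program.

```r
# List the members of an archive, extract them all into exdir,
# and return the member names.
list_and_extract <- function(archive, exdir = ".") {
  members <- utils::untar(archive, list = TRUE, tar = "internal")
  utils::untar(archive, exdir = exdir, tar = "internal")
  members
}
```

Like the code below, this overwrites existing files without warning.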

John James
Mango Solutions

unzip <- function(x, archiveDirectory = '.', zipExtension = 'tgz',
                  block = 50000, maxBlocks = 100, maxCountFiles = 100) {
        # Example
        # unzip('test.tgz')
        # Reinterpret the decimal digits of 'oct' as digits in base 'oldRoot'
        # (tar stores sizes as octal text) and return the value as a number.
        convert <- function(oct = 2, oldRoot = 8, newRoot = 10) {
                if (newRoot == 16)
                        return(structure(convert(oct, oldRoot, 10),
                                         class = 'hexmode'))
                if (newRoot > 10)
                        stop('WIP')
                if (inherits(oct, 'hexmode')) {
                        oct <- unclass(oct)
                        if (newRoot == 10)
                                return(oct)
                        stop('WIP')
                }
                oct <- as.numeric(oct)
                ret <- 0
                oldPower <- 1
                while (oct > 0.1) {
                        # Peel off the lowest digit and weight it by the
                        # matching power of oldRoot
                        newoct <- floor(oct / newRoot)
                        rem <- oct - newoct * newRoot
                        ret <- rem * oldPower + ret
                        oldPower <- oldPower * oldRoot
                        oct <- newoct
                }
                ret
        }
        listOfFiles <- list()
        theArchives <- list.files(archiveDirectory, pattern = zipExtension)
        # Signal an error (rather than returning a condition object)
        # when no archive matches
        if(length(grep(x, theArchives))==0)
                stop(paste('No archive matching *', x, '*.',
                           zipExtension, ' found', sep=''))
        what <- file.path(archiveDirectory, theArchives[grep(x, theArchives)])
        tmp <- tempfile()
        nextBlockStartsAt <- readUpTo <- countFiles <- mu <- safety <- 0
        zz <- gzfile(what, 'rb')
        ww <- file(tmp, 'wb')
        on.exit(unlink(tmp))
        while(length(mu)>0) {
                if(safety > maxBlocks)  {
                        # Close both connections before signalling the error
                        close(zz)
                        close(ww)
                        stop('Archive file too large')
                }
                safety <- safety + 1
                mu <- readBin(zz, 'raw', block)
                writeBin(mu, ww) 
        }
        close(zz)
        close(ww)
        while(countFiles < maxCountFiles){
                countFiles <- countFiles + 1
                zz <- file(tmp, 'rb')
                stuff <- readBin(zz, 'raw', n=nextBlockStartsAt)
                header <- readBin(zz, character(), n=100)
                header <- header[nchar(header)>0][c(1,5)]
                close(zz)
                if(any(is.na(header))) {
                        break;
                }
                listOfFiles[[countFiles]] <- header[1]
                zz <- file(tmp, 'rb')
                body <- readBin(zz, 'raw',
                                n = 512 + nextBlockStartsAt + convert(header[2]))
                writeBin(body[-c(1:(512 + nextBlockStartsAt))], header[1])
                readUpTo <- 512 + nextBlockStartsAt + convert(header[2])
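                # Bodies are padded to a multiple of 512 bytes: round up to
                # the next boundary, with no extra block when readUpTo is
                # already a multiple of 512
                nextBlockStartsAt <- ceiling(readUpTo / 512) * 512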
                close(zz)
          }
        listOfFiles
}

-----Original Message-----
From: Prof Brian Ripley [mailto:[EMAIL PROTECTED] 
Sent: 14 November 2006 15:18
To: John James
Cc: r-help@stat.math.ethz.ch
Subject: Re: [R] gzfile with multiple entries in the archive

On Tue, 14 Nov 2006, John James wrote:

> If I open a tgz archive with gzfile and then parse it using readLines I miss
> the initial line of each member of the archive - and also the name of the
> file, although the archive is otherwise complete (but useless!).

You can use a gzfile connection to read the underlying .tar file, but that 
is not a text file and you will need to pick its structure apart yourself 
via readBin and readChar.

> Is there any way within R to extract both the list of files in a tgz archive
> and to extract any one of these files?

> Clearly I can use zcat and tar on Linux, but I need this to work within the
> R environment on Windows!

You could use tar on Windows: it is in the R tools set.

-- 
Brian D. Ripley,                  [EMAIL PROTECTED]
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595

______________________________________________
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
