Following suggestions from Prof. Ripley and several others to use gzfile, here's rough code that will unzip a tgz into your working directory and return a list of the files. (It doesn't warn you that it is overwriting files!)
The magic numbers refer to the current tar header specification; the block sizes etc. are arbitrary. It is inefficient in that it re-reads the file from the start for every file. I couldn't get the file pointer to stay and change the readBin mode back from 'character' to 'raw' although the reverse is used! Is there a setting I've missed? Also, is there a better way to do the convert(..) function? All criticisms gratefully received, especially being pointed to an existing function. John James Mango Solutions unzip <- function(x, archiveDirectory = '.', zipExtension='tgz', block=50000, maxBlocks=100, maxCountFiles=100) { # Example # unzip('test.tgz') convert <- function(oct= 2, oldRoot=8, newRoot=10) { if((newRoot==16)) return(structure(convert(oct, oldRoot, 10), class='hexmode')) if(newRoot>10) return(simpleError('WIP')) if(class(oct)=='hexmode') { oct <- unclass(oct) if(newRoot==10) return(oct) oldRoot <- 10 return(simpleError('WIP')) } oct <- as.numeric(oct) ret <- 0 oldPower <- 1 while(oct > 0.1){ newoct <- floor(oct / newRoot) rem <- oct - newoct * newRoot ret <- rem * oldPower + ret oldPower <- oldPower * oldRoot oct <- newoct } if(newRoot==16) ret <- structure(ret, class = 'hexmode') ret } listOfFiles <- list() theArchives <- list.files(archiveDirectory, pattern = zipExtension) if(length(grep(x, theArchives))==0) return(simpleError(paste('No archive matching *', x, '*.', zipExtension, ' found'))) what <- paste(archiveDirectory, theArchives[grep(x, theArchives)], sep=.Platform$file.sep) tmp <- tempfile() nextBlockStartsAt <- readUpTo <- countFiles <- mu <- safety <- 0 zz <- gzfile(what, 'rb') ww <- file(tmp, 'wb') on.exit(unlink(tmp)) while(length(mu)>0) { if(safety > maxBlocks) { return(simpleError(paste('Archive File too large'))) } safety <- safety + 1 mu <- readBin(zz, 'raw', block) writeBin(mu, ww) } close(zz) close(ww) while(countFiles < maxCountFiles){ countFiles <- countFiles + 1 zz <- file(tmp, 'rb') stuff <- readBin(zz, 'raw', n=nextBlockStartsAt) header <- readBin(zz, character(), n=100) header <- header[nchar(header)>0][c(1,5)] close(zz) if(any(is.na(header))) { break; } listOfFiles[[countFiles]] <- header[1] zz <- file(tmp, 'rb') body <- readBin(zz, 'raw', n = 512 + nextBlockStartsAt + convert(header[2])) writeBin(body[-c(1:(512 + nextBlockStartsAt))], header[1]) readUpTo <- 512 + nextBlockStartsAt + convert(header[2]) nextBlockStartsAt <- (readUpTo%/%512 + 1) * 512 close(zz) } listOfFiles } -----Original Message----- From: Prof Brian Ripley [mailto:[EMAIL PROTECTED] Sent: 14 November 2006 15:18 To: John James Cc: r-help@stat.math.ethz.ch Subject: Re: [R] gzfile with multiple entries in the archive On Tue, 14 Nov 2006, John James wrote: > If I open a tgz archive with gzfile and then parse it using readLines I miss > the initial line of each member of the archive - and also the name of the > file although the archive otherwise complete (but useless!). You can use a gzfile connection to read the underlying .tar file, but that is not a text file and you will need to pick its structure apart yourself via readBin and readChar. > Is there any way within R to extract both the list of files in a tgz archive > and to extract any one of these files? > Clearly I can use zcat and tar on Linux, but I need this to work within the > R environment on Windows! You could use tar on Windows: it is in the R tools set. -- Brian D. Ripley, [EMAIL PROTECTED] Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/ University of Oxford, Tel: +44 1865 272861 (self) 1 South Parks Road, +44 1865 272866 (PA) Oxford OX1 3TG, UK Fax: +44 1865 272595 ______________________________________________ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.