-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1
Apologies for entering this late, but last week was extremely busy. I hadn't realized that you would implement something so quickly and I could have saved you some time. I was in the process of adding facilities for gzipped tar files in the Rcompression package. A new version (0.3-0) is available from the www.omegahat.org respositories www.omegahat.org/R for source and Windows. This uses code from the zlib-1.2.3 contrib/ directory to do the extraction and table of contents. So it should be pretty quick. And it allows for "event-driven" programming with callback functions, and has "hints" for avoiding vector resizing issues which make it considerably faster. D. John James wrote: > Following suggestions from Prof. Ripley and several others to use gzfile, > here's rough code that will unzip a tgz into your working directory and > return a list of the files. (It doesn't warn you that it is overwriting > files!) > > The magic numbers refer to the current tar header specification; the block > sizes etc. are arbitrary. > > It is inefficient in that it re-reads the file from the start for every > file. I couldn't get the file pointer to stay and change the readBin mode > back from 'character' to 'raw' although the reverse is used! Is there a > setting I've missed? > > Also, is there a better way to do the convert(..) function? > > All criticisms gratefully received, especially being pointed to an existing > function. > > John James > Mango Solutions > > unzip <- function(x, archiveDirectory = '.', zipExtension='tgz', > block=50000, maxBlocks=100, maxCountFiles=100) { > # Example > # unzip('test.tgz') > convert <- function(oct= 2, oldRoot=8, newRoot=10) { > if((newRoot==16)) > return(structure(convert(oct, oldRoot, 10), > class='hexmode')) > if(newRoot>10) > return(simpleError('WIP')) > if(class(oct)=='hexmode') { > oct <- unclass(oct) > if(newRoot==10) > return(oct) > oldRoot <- 10 > return(simpleError('WIP')) > } > oct <- as.numeric(oct) > ret <- 0 > oldPower <- 1 > while(oct > 0.1){ > newoct <- floor(oct / newRoot) > rem <- oct - newoct * newRoot > ret <- rem * oldPower + ret > oldPower <- oldPower * oldRoot > oct <- newoct > } > if(newRoot==16) > ret <- structure(ret, class = 'hexmode') > ret > } > listOfFiles <- list() > theArchives <- list.files(archiveDirectory, pattern = zipExtension) > if(length(grep(x, theArchives))==0) > return(simpleError(paste('No archive matching *', x, '*.', > zipExtension, ' found'))) > what <- paste(archiveDirectory, theArchives[grep(x, theArchives)], > sep=.Platform$file.sep) > tmp <- tempfile() > nextBlockStartsAt <- readUpTo <- countFiles <- mu <- safety <- 0 > zz <- gzfile(what, 'rb') > ww <- file(tmp, 'wb') > on.exit(unlink(tmp)) > while(length(mu)>0) { > if(safety > maxBlocks) { > return(simpleError(paste('Archive File too large'))) > } > safety <- safety + 1 > mu <- readBin(zz, 'raw', block) > writeBin(mu, ww) > } > close(zz) > close(ww) > while(countFiles < maxCountFiles){ > countFiles <- countFiles + 1 > zz <- file(tmp, 'rb') > stuff <- readBin(zz, 'raw', n=nextBlockStartsAt) > header <- readBin(zz, character(), n=100) > header <- header[nchar(header)>0][c(1,5)] > close(zz) > if(any(is.na(header))) { > break; > } > listOfFiles[[countFiles]] <- header[1] > zz <- file(tmp, 'rb') > body <- readBin(zz, 'raw', n = 512 + nextBlockStartsAt + > convert(header[2])) > writeBin(body[-c(1:(512 + nextBlockStartsAt))], header[1]) > readUpTo <- 512 + nextBlockStartsAt + convert(header[2]) > nextBlockStartsAt <- (readUpTo%/%512 + 1) * 512 > close(zz) > } > listOfFiles > } > > -----Original Message----- > From: Prof Brian Ripley [mailto:[EMAIL PROTECTED] > Sent: 14 November 2006 15:18 > To: John James > Cc: r-help@stat.math.ethz.ch > Subject: Re: [R] gzfile with multiple entries in the archive > > On Tue, 14 Nov 2006, John James wrote: > > >>If I open a tgz archive with gzfile and then parse it using readLines I > > miss > >>the initial line of each member of the archive - and also the name of the >>file although the archive otherwise complete (but useless!). > > > You can use a gzfile connection to read the underlying .tar file, but that > is not a text file and you will need to pick its structure apart yourself > via readBin and readChar. > > >>Is there any way within R to extract both the list of files in a tgz > > archive > >>and to extract any one of these files? > > >>Clearly I can use zcat and tar on Linux, but I need this to work within > > the > >>R environment on Windows! > > > You could use tar on Windows: it is in the R tools set. > - -- Duncan Temple Lang [EMAIL PROTECTED] Department of Statistics work: (530) 752-4782 4210 Mathematical Sciences Building fax: (530) 752-7099 One Shields Ave. University of California at Davis Davis, CA 95616, USA -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.3 (Darwin) iD8DBQFFX44l9p/Jzwa2QP4RAh9GAJ9H0HMc8YOQV3OCehf5Zk4GFc9ApACfebXn j3Jxj57iXe935pXaR2mRA0o= =HAmN -----END PGP SIGNATURE----- ______________________________________________ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.