Re: [R] gzfile with multiple entries in the archive

Duncan Temple Lang Sat, 18 Nov 2006 14:52:12 -0800

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1


Apologies for entering this late, but last week was extremely busy.

I hadn't realized that you would implement something so quickly,
or I could have saved you some time.
I was in the process of adding facilities for gzipped tar
files to the Rcompression package.
A new version (0.3-0) is available from the www.omegahat.org
repositories

  www.omegahat.org/R

for source and Windows.

This uses code from the zlib-1.2.3 contrib/ directory
to do the extraction and to list the table of contents, so it
should be pretty quick.  It also allows "event-driven" programming
with callback functions, and accepts "hints" that avoid
repeated vector resizing, which makes it considerably
faster.
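
To show why a size hint helps with callback-driven processing, here is
a minimal sketch in plain R. It does not use the Rcompression API:
walk_entries(), make_collector() and the hint argument are hypothetical
names invented for this example. The collector preallocates its storage
from the hint rather than growing a list on every callback.

```r
# Hypothetical driver: calls callback(name, data) once per entry.
# It stands in for an archive reader and is not part of Rcompression.
walk_entries <- function(entries, callback) {
  for (nm in names(entries)) callback(nm, entries[[nm]])
  invisible(length(entries))
}

# Collector that preallocates storage from a size hint (assumed interface).
make_collector <- function(hint = 16L) {
  store  <- vector("list", hint)
  labels <- character(hint)
  n <- 0L
  list(
    callback = function(name, data) {
      n <<- n + 1L
      if (n > length(store)) {
        # Hint was too small: double the capacity instead of growing by one.
        newlen <- max(1L, 2L * length(store))
        length(store)  <<- newlen
        length(labels) <<- newlen
      }
      store[[n]] <<- data
      labels[n]  <<- name
    },
    result = function() {
      keep <- seq_len(n)
      structure(store[keep], names = labels[keep])
    }
  )
}

# Usage with made-up entries.
entries <- list("a.txt" = as.raw(1:3), "b.txt" = as.raw(4:9))
col <- make_collector(hint = length(entries))
walk_entries(entries, col$callback)
sizes <- vapply(col$result(), length, integer(1))
```

With an accurate hint the storage is allocated once; with a poor one the
doubling step keeps the number of reallocations small.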

 D.


John James wrote:
> Following suggestions from Prof. Ripley and several others to use gzfile,
> here's rough code that will unzip a tgz into your working directory and
> return a list of the files. (It doesn't warn you that it is overwriting
> files!)
> 
> The magic numbers refer to the current tar header specification; the block
> sizes etc. are arbitrary.
> 
> It is inefficient in that it re-reads the file from the start for every
> file. I couldn't get the file pointer to stay and change the readBin mode
> back from 'character' to 'raw' although the reverse is used! Is there a
> setting I've missed?
> 
> Also, is there a better way to do the convert(..) function?
> 
> All criticisms gratefully received, especially being pointed to an existing
> function.
> 
> John James
> Mango Solutions
> 
> unzip <- function(x, archiveDirectory = '.', zipExtension='tgz',
>                   block=50000, maxBlocks=100, maxCountFiles=100) {
>       # Example
>       # unzip('test.tgz')
>       convert <- function(oct= 2, oldRoot=8, newRoot=10) {
>               if(newRoot==16)
>                       return(structure(convert(oct, oldRoot, 10),
>                                        class='hexmode'))
>               if(newRoot>10)
>                       return(simpleError('WIP'))
>               if(class(oct)=='hexmode') {
>                       oct <- unclass(oct)
>                       if(newRoot==10)
>                               return(oct)
>                       oldRoot  <- 10
>                       return(simpleError('WIP'))
>               }
>               oct <- as.numeric(oct)
>               ret <- 0
>               oldPower <- 1
>               while(oct > 0.1){
>                       newoct <- floor(oct / newRoot)
>                       rem <- oct - newoct * newRoot 
>                       ret <- rem * oldPower + ret
>                       oldPower <- oldPower * oldRoot
>                       oct <- newoct
>               }
>               if(newRoot==16)
>                       ret <- structure(ret,  class = 'hexmode')
>               ret
>       }
>       listOfFiles <- list()
>       theArchives <- list.files(archiveDirectory, pattern = zipExtension)
>       if(length(grep(x, theArchives))==0)
>               return(simpleError(paste('No archive matching *', x, '*.',
>                                        zipExtension, ' found')))
>       what <- paste(archiveDirectory, theArchives[grep(x, theArchives)],
>                     sep=.Platform$file.sep)
>       tmp <- tempfile()
>       nextBlockStartsAt <- readUpTo <- countFiles <- mu <- safety <- 0
>       zz <- gzfile(what, 'rb')
>       ww <- file(tmp, 'wb')
>       on.exit(unlink(tmp))
>       while(length(mu)>0) {
>               if(safety > maxBlocks)  {
>                       return(simpleError(paste('Archive File too large')))
>               }
>               safety <- safety + 1
>               mu <- readBin(zz, 'raw', block)
>               writeBin(mu, ww) 
>       }
>       close(zz)
>       close(ww)
>       while(countFiles < maxCountFiles){
>               countFiles <- countFiles + 1
>               zz <- file(tmp, 'rb')
>               stuff <- readBin(zz, 'raw', n=nextBlockStartsAt)
>               header <- readBin(zz, character(), n=100)
>               header <- header[nchar(header)>0][c(1,5)]
>               close(zz)
>               if(any(is.na(header))) {
>                       break;
>               }
>               listOfFiles[[countFiles]] <- header[1]
>               zz <- file(tmp, 'rb')
>               body <- readBin(zz, 'raw', n = 512 + nextBlockStartsAt +
>                                               convert(header[2]))
>               writeBin(body[-c(1:(512 + nextBlockStartsAt))], header[1])
>               readUpTo <- 512 + nextBlockStartsAt + convert(header[2])
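>               # round up to the next 512-byte boundary; a member whose data
>               # already ends on a boundary must not skip an extra block
>               nextBlockStartsAt <- ceiling(readUpTo / 512) * 512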
>               close(zz)
>         }
>       listOfFiles
> }
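
On the convert(..) question: base R's strtoi() parses a string in a
given base, so the octal size field can be converted directly. A
minimal sketch, assuming the field arrives as raw bytes padded with
NULs or spaces; octal_field() is a name made up for this example.
Note that strtoi() returns NA for values that do not fit in an R
integer, so this would not cope with members of 2 GB or more.

```r
# Convert an octal tar header field (raw bytes) to an integer.
octal_field <- function(bytes) {
  nul <- which(bytes == as.raw(0))
  if (length(nul)) bytes <- bytes[seq_len(nul[1L] - 1L)]  # stop at first NUL
  txt <- gsub("^ +| +$", "", rawToChar(bytes))            # trim space padding
  if (nzchar(txt)) strtoi(txt, base = 8L) else 0L
}

size_a <- octal_field(charToRaw("00000001750"))             # octal 1750
size_b <- octal_field(c(charToRaw("0001750 "), as.raw(0)))  # space + NUL padded
```

Both calls give 1000, the decimal value of octal 1750.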
> 
> -----Original Message-----
> From: Prof Brian Ripley [mailto:[EMAIL PROTECTED] 
> Sent: 14 November 2006 15:18
> To: John James
> Cc: r-help@stat.math.ethz.ch
> Subject: Re: [R] gzfile with multiple entries in the archive
> 
> On Tue, 14 Nov 2006, John James wrote:
> 
> 
>> If I open a tgz archive with gzfile and then parse it using readLines I miss
>> the initial line of each member of the archive - and also the name of the
>> file, although the archive is otherwise complete (but useless!).
> 
> 
> You can use a gzfile connection to read the underlying .tar file, but that 
> is not a text file and you will need to pick its structure apart yourself 
> via readBin and readChar.
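
Picking the structure apart can be done in one pass over a single
gzfile connection if everything is read as raw and the text fields are
converted afterwards with rawToChar(). The following is a rough sketch
only, assuming plain v7/ustar headers (name in bytes 1-100, octal size
in bytes 125-136, data padded to a multiple of 512 bytes); GNU
long-name and pax extensions are not handled, and list_tgz() is a
hypothetical name.

```r
# List the members of a gzipped tar file without re-opening it.
list_tgz <- function(path) {
  con <- gzfile(path, "rb")
  on.exit(close(con))

  # Extract a NUL-terminated text field from a 512-byte header block.
  field <- function(hdr, from, len) {
    b <- hdr[from:(from + len - 1L)]
    nul <- which(b == as.raw(0))
    if (length(nul)) b <- b[seq_len(nul[1L] - 1L)]
    rawToChar(b)
  }

  out <- list()
  repeat {
    hdr <- readBin(con, "raw", n = 512L)
    # A short read or an all-zero block marks the end of the archive.
    if (length(hdr) < 512L || all(hdr == as.raw(0))) break
    name <- field(hdr, 1L, 100L)
    size <- strtoi(gsub("^ +| +$", "", field(hdr, 125L, 12L)), base = 8L)
    if (is.na(size)) stop("unreadable size field for member: ", name)
    out[[length(out) + 1L]] <-
      data.frame(name = name, size = size, stringsAsFactors = FALSE)
    # Read and discard the data blocks to reach the next header.
    skip <- 512 * ceiling(size / 512)
    if (skip > 0) readBin(con, "raw", n = skip)
  }
  do.call(rbind, out)
}
```

To extract a member instead of skipping it, keep the first size bytes
of the discarded read and pass them to writeBin().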
> 
> 
>> Is there any way within R to extract both the list of files in a tgz archive
>> and to extract any one of these files?
> 
> 
>> Clearly I can use zcat and tar on Linux, but I need this to work within the
>> R environment on Windows!
> 
> 
> You could use tar on Windows: it is in the R tools set.
> 
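
For anyone reading this with a more recent R: the utils package has
since gained tar() and untar(), which can list and extract members of
a gzipped tar file without an external tar program. A minimal sketch,
assuming an R version that provides these functions; the file names
are invented for the example.

```r
# Build a small archive to work with (file names are illustrative).
d <- tempfile("tgzdemo")
dir.create(d)
old <- setwd(d)
writeLines("first", "one.txt")
writeLines("second", "two.txt")
tgz <- file.path(d, "demo.tgz")
tar(tgz, files = c("one.txt", "two.txt"),
    compression = "gzip", tar = "internal")
setwd(old)

# List the members, then extract just one of them into a new directory.
members <- untar(tgz, list = TRUE, tar = "internal")
out <- file.path(d, "out")
dir.create(out)
untar(tgz, files = "two.txt", exdir = out, tar = "internal")
extracted <- readLines(file.path(out, "two.txt"))
```

Passing tar = "internal" asks for R's own implementation, which is
what makes this usable on Windows without the tools set.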

- --
Duncan Temple Lang                    [EMAIL PROTECTED]
Department of Statistics              work:  (530) 752-4782
4210 Mathematical Sciences Building   fax:   (530) 752-7099
One Shields Ave.
University of California at Davis
Davis,
CA 95616,
USA
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.3 (Darwin)

iD8DBQFFX44l9p/Jzwa2QP4RAh9GAJ9H0HMc8YOQV3OCehf5Zk4GFc9ApACfebXn
j3Jxj57iXe935pXaR2mRA0o=
=HAmN
-----END PGP SIGNATURE-----

