Peter and Dan,
I like the idea of replacing all open() with galaxy_open() in all tools. You 
can tell the format by looking at the first 4 bytes (see the C code below from 
the UCSC browser team). Is there some pythonic way of overriding open?

galaxy_open() needs to read the first four bytes of the file to see if it is 
compressed and, if so, call gzip.open inside the function and pass back the 
handle.
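
A rough sketch of that sniffing step in Python 3, just to make it concrete. The
names sniff() and open_text() are made up for illustration, and I am assuming
tools want text-mode handles, which is why the compressed branches ask for "rt":

```python
import bz2
import gzip


def sniff(filename):
    """Return "gz", "bz2" or None based on the first four bytes of the file."""
    with open(filename, "rb") as handle:
        first4bytes = handle.read(4)
    if first4bytes.startswith(b"\x1f\x8b"):
        return "gz"
    if first4bytes.startswith(b"BZh"):
        return "bz2"
    return None


def open_text(filename):
    """Open a plain, gzipped or bzip2ed file for reading as text."""
    kind = sniff(filename)
    if kind == "gz":
        # gzip.open defaults to binary mode, so request text explicitly.
        return gzip.open(filename, "rt")
    if kind == "bz2":
        return bz2.open(filename, "rt")
    return open(filename, "r")
```

The caller never sees which branch was taken; it just iterates over lines of
the returned handle.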

For now, it would require a global sweep through the tools to change open() to 
galaxy_open(), but it is probably a good idea to have tool developers avoid 
calling open directly.

You would need special handling if there are multiple files in the 
compressed archive, but that support could be added later.
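
For the zip case, a minimal sketch of what "special handling" could mean at
first: accept archives with exactly one file and refuse anything else. The
function name open_single_member_zip() and the utf-8 default are my
assumptions, not anything Galaxy defines:

```python
import io
import zipfile


def open_single_member_zip(filename, encoding="utf-8"):
    """Return a text handle for the only file inside a zip archive."""
    archive = zipfile.ZipFile(filename, "r")
    # Ignore directory entries; only real files count as members.
    members = [info for info in archive.infolist() if not info.is_dir()]
    if len(members) != 1:
        archive.close()
        raise ValueError(
            "%s holds %d files; expected exactly one" % (filename, len(members))
        )
    # The archive object is deliberately left open here, because the member
    # handle reads from it for as long as the caller uses it.
    return io.TextIOWrapper(archive.open(members[0]), encoding=encoding)
```

Archives with several members raise ValueError, which leaves room to add a
real multi-file policy later.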

-Robert


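import bz2
import gzip
import zipfile

# Magic numbers, mirroring the UCSC C code below (".Z" is not handled here).
SIGNATURES = [(b"\x1f\x8b", "gz"), (b"BZ", "bz2"), (b"PK\x03\x04", "zip")]


def galaxy_open(filename, mode="r"):
    # Sniff the first four bytes; fall back to plain open() if there is
    # no known compression signature. Only meaningful when reading.
    with open(filename, "rb") as handle:
        ext = getExtensionFromHdrSig(handle.read(4))
    if ext is not None:
        return openCompressed(filename, mode, ext)
    return open(filename, mode)


def getExtensionFromHdrSig(first4bytes):
    # Python port of the C routine below.
    for signature, ext in SIGNATURES:
        if first4bytes.startswith(signature):
            return ext
    return None


def openCompressed(filename, mode, ext):
    if ext == "gz":
        return gzip.open(filename, mode)
    elif ext == "bz2":
        return bz2.BZ2File(filename, mode)
    elif ext == "zip":
        # Returns an archive object, not a file handle.
        return zipfile.ZipFile(filename, mode)
    raise ValueError("unsupported compression: %s" % ext)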
        
          

char *getExtensionFromHdrSig(char *first4bytes)
/* Check if header has signature of supported compression stream,
   and return the extension for it, or NULL if no sig found. */
{
char *ext = NULL;
if (startsWith("\x1f\x8b", first4bytes)) ext = "gz";
else if (startsWith("\x1f\x9d\x90", first4bytes)) ext = "Z";
else if (startsWith("BZ", first4bytes)) ext = "bz2";
else if (startsWith("PK\x03\x04", first4bytes)) ext = "zip";
return ext;
}

On Jul 8, 2013, at 4:05 AM, Peter Cock wrote:

> On Thu, Jul 4, 2013 at 9:49 PM, Robert Baertsch
> <robert.baert...@gmail.com> wrote:
>> Dan,
>> Do these readers support gzip files?
>> 
>>       reader = fastqVerboseErrorReader
>>       reader = fastqReader
> 
> Presumably you are writing a Python script using this library?
> The answer is a qualified yes. Instead of passing them a normal
> file handle using open("example.fastq") you instead use
> gzip.open("example.fastq") via import gzip.
> 
>> Do I have to define a special type in galaxy for gzipped files or will the 
>> fastq type be ok?
>> 
> 
> This needs a special file format - but you are not the first person to
> look at this, some groups have defined custom gzipped variants of
> the FASTQ formats within their own Galaxy instances. I've not
> done this but there should be some useful emails in the archive.
> 
> Note you'd also need to modify any tool definitions so that they
> can accept a gzipped FASTQ file.
> 
>> Ideally, I would like to keep my files zipped and not have galaxy unzip 
>> them, since they triple in size when unzipped.
>> 
>> I'm happy to do a pull request if you don't support this but I want to make 
>> sure I'm in line with your roadmap.
> 
> Personally I would like a more general system in Galaxy for
> potentially any file type to be held compressed in a range of
> formats (e.g. using gzip, bgzf, xz, bz2, etc), with exclusions
> for things like BAM which are already compressed. This way
> naive tools would get the gzipped file uncompressed to a
> temporary folder before use (i.e. no change for the tool wrapper),
> but if a tool accepts a gzipped file it will get that (less disk IO
> and CPU usage, but requires updating tool wrappers).
> 
> That idea is quite ambitious though ;)
> 
>> I have written a simple tool to convert Illumina fastq to mapsplice fastq. 
>> Does that already exist somewhere?
>> 
> 
> I don't know.
> 
> Peter
