I will implement this if the Galaxy team likes the approach.

We did this in the UCSC Genome Browser code years ago: a single open_helper call
handles gzip, HTTP, FTP and pipes. There is no need to care about how the data is
compressed or where the data resides.

Wouldn't it be great to be able to pipe data between workflow steps rather than
writing to disk? I admit that this will require some work, but the first step
is to abstract the open.
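
To make the idea concrete, here is a minimal Python sketch of what such an abstracted open might look like. It is not the UCSC code and not anything in Galaxy today. The name open_helper is borrowed from above, but the signature, the "a list means a command to pipe from" convention, and the _Prepend class are assumptions made up for illustration. Compression is detected from the first bytes of the stream rather than from the file name.

```python
import bz2
import gzip
import io
import subprocess
import urllib.request

GZIP_MAGIC = b"\x1f\x8b"   # first two bytes of any gzip (and BGZF) stream
BZIP2_MAGIC = b"BZh"       # first three bytes of a bzip2 stream
URL_SCHEMES = ("http://", "https://", "ftp://")


class _Prepend(io.RawIOBase):
    """Hypothetical helper: replay sniffed bytes, then continue with the stream."""

    def __init__(self, head, raw):
        self._head = head
        self._raw = raw

    def readable(self):
        return True

    def readinto(self, buf):
        if self._head:
            # Hand back the bytes that were consumed while sniffing the format.
            chunk, self._head = self._head[:len(buf)], self._head[len(buf):]
        else:
            chunk = self._raw.read(len(buf)) or b""
        buf[:len(chunk)] = chunk
        return len(chunk)

    def close(self):
        self._raw.close()
        super().close()


def open_helper(source, encoding="utf-8"):
    """Return a text stream for a path, a URL, or a command (given as a list)."""
    if isinstance(source, (list, tuple)):
        # Assumed convention: a list is an argv whose stdout we read.
        raw = subprocess.Popen(list(source), stdout=subprocess.PIPE).stdout
    elif source.startswith(URL_SCHEMES):
        raw = urllib.request.urlopen(source)
    else:
        raw = open(source, "rb")

    head = raw.read(3)  # sniff the magic bytes
    stream = io.BufferedReader(_Prepend(head, raw))
    if head.startswith(GZIP_MAGIC):
        stream = gzip.GzipFile(fileobj=stream, mode="rb")
    elif head.startswith(BZIP2_MAGIC):
        stream = bz2.BZ2File(stream)
    return io.TextIOWrapper(stream, encoding=encoding)
```

A caller would then write something like `for line in open_helper(path_or_url): ...` without knowing whether the data is local, remote, or compressed. Cleanup (closing the underlying stream, reaping the child process) and random access are left out of this sketch.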

On Jul 9, 2013, at 10:38 AM, Peter Cock wrote:

> On Tue, Jul 9, 2013 at 5:53 PM, Robert Baertsch <rbaer...@ucsc.edu> wrote:
>> On Jul 8, 2013, at 3:33 PM, Peter Cock wrote:
>>> The tools available in Galaxy are written in a range
>>> of languages including C, Perl, R, etc. Yes, some are in Python,
>>> but of those most are independent of Galaxy and can be used
>>> separately from Galaxy.
>> 
>> The helper function would have to be ported to R. We are talking
>> about how Galaxy compresses data. Once we decide that, we
>> can determine how best to implement it.
> 
> Individual tools called from Galaxy read and create the files -
> and we can't usually control them at this level (modifying them all
> to call a Galaxy managed file open mechanism is not an option).
> 
>> Proposal: Do not treat compressed data as a separate data type.
>> Treat it as an independent attribute that can be applied to any data.
>> Otherwise you will have to create a gzipped, zip and bz2 type for
>> every type that you want to compress.
> 
> That's what I've been saying - the fact that some people are
> already using a new gzipped FASTQ format within their Galaxy
> instances is practical, but I view it as a short term solution only.
> 
>>>> Encoding the gzip status in the datatype will create an explosion of
>>>> datatypes. Compression is not actually a datatype, it tells you nothing
>>>> about the content data that is stored in the file.
>>> 
>>> What we'd previously discussed was a dual system, holding
>>> the file type as now (e.g. FASTA, SAM, GFF3, etc) and any
>>> compression (e.g., None, normal GZIP, BGZF which is a
>>> GZIP variant, BZIP2, etc).
>> 
>> What about tabular? Should we create tab.gz, tab.bz2 and tab.zip also?
> 
> Note ZIP is a bit different, as it is often a multiple file bundle -
> it behaves differently from GZIP, BGZF, XZ, BZIP2 etc in that
> regard.
> 
> But otherwise, yes. As a specific example, the tabix tool uses BGZF
> compressed tabular data to combine compression and efficient
> random access. This would be useful for many annotation files
> (e.g. GTF, GFF3).
> 
>> This will quickly get out of hand and create a mess for tool
>> developers that need to support all these types.
> 
> Why? Individual tool developers don't need to know if Galaxy
> is keeping the original data file on disk compressed - unless
> the tool XML says otherwise, Galaxy would hide this detail
> and call the tool with an uncompressed input file.
> 
> (A Unix named pipe which decompresses the file on the fly would
> be a potential alternative - but only if the tool XML was marked
> up to say that an input could be streamed. The default must be
> to assume potential random access to the input files)
> 
>> The tool code and tool xml should be written to handle uncompressed
>> data and galaxy should handle the details of decompression. This
>> is not hard to do.
> 
> It isn't trivial either ;)
> 
> Peter

