Re: [galaxy-dev] gzipped fastq reader

2013-07-09 Thread Robert Baertsch

On Jul 8, 2013, at 3:33 PM, Peter Cock wrote:

> On Mon, Jul 8, 2013 at 11:21 PM, Robert Baertsch  wrote:
>> I respectfully disagree,  If you want an extensible system, you should
>> always wrap primitive system level calls.
>> 
>> Any tools that opens a file that could be compressed would be affected.
>> That is a huge number of tools. Do you really want a cottage industry of
>> tools that have different methods of dealing with compression?
> 
> But defining a Python helper function within the Galaxy Python
> libraries doesn't achieve that.
> 
> Are you talking about patching the OS level POSIX open functions
> or something?

No.

> The tools available in Galaxy are written in a range
> of languages including C, Perl, R, etc. Yes, some are in Python,
> but of those most are independent of Galaxy and can be used
> separately from Galaxy.

The helper function would have to be ported to R. We are talking about how
Galaxy handles compressed data. Once we decide that, we can determine how best
to implement it.

Proposal: Do not treat compressed data as a separate data type. Treat
compression as an independent attribute that can be applied to any data.
Otherwise you will have to create a gzipped, zip and bz2 type for every type
that you want to compress.

People can use the Python helpers or write their own in other languages.

We need a galaxy_open function to hide the details of compression from tool
developers.

We could also open HTTP files or pipes without any changes to tools (other
than changing open() to galaxy_open()).

> 
>> Encoding the gzip status in the datatype will create an explosion of
>> datatypes. Compression is not actually a datatype, it tells you nothing
>> about the content data that is stored in the file.
> 
> What we'd previously discussed was a dual system, holding
> the file type as now (e.g. FASTA, SAM, GFF3, etc) and any
> compression (e.g., None, normal GZIP, BGZF which is a
> GZIP variant, BZIP2, etc).

What about tabular? Should we create tab.gz, tab.bz2 and tab.zip also?

This will quickly get out of hand and create a mess for tool developers who
need to support all these types.

The tool code and tool XML should be written to handle uncompressed data, and
Galaxy should handle the details of decompression. This is not hard to do.
> 
> Galaxy tool wrappers currently define input files with a list
> of file types - they'd also have to give a list of supported
> compression types (defaulting to none). Likewise for any
> output files - if they are already compressed the XML for
> the tool wrapper would have to tell Galaxy this.
> 
>> It is up to the galaxy team to provide a standard way to interact
>> with compressed files.
> 
> That is my preference too - although this could be driven by
> the Galaxy community rather than the core team? I see
> defining new datatypes like 'gzippedfastq' as a stop gap
> special case (but a very practical route for now).
> 
>> My proposed solution, is a very small change that could
>> be phased in over time. Any tools that uses open would not support
>> compressed files, but they would not break on uncompressed files.
>> 
>> Do others have an opinion?
> 
> Either I don't understand your plan, or it would only help in
> a tiny minority of cases.
> 
> Regards,
> 
> Peter


___
Please keep all replies on the list by using "reply all"
in your mail client.  To manage your subscriptions to this
and other Galaxy lists, please use the interface at:
  http://lists.bx.psu.edu/

To search Galaxy mailing lists use the unified search at:
  http://galaxyproject.org/search/mailinglists/


Re: [galaxy-dev] gzipped fastq reader

2013-07-09 Thread Robert Baertsch
I will implement this if the Galaxy team likes the approach.

We did this in the UCSC Genome Browser code years ago: a single open_helper call
handles gzip, HTTP, FTP and pipes. There is no need to care how the data is
compressed or where it resides.

Wouldn't it be great to be able to pipe data between workflow steps rather than
writing to disk? I admit that this will require some work, but the first step
is to abstract the open.
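
A minimal Python sketch of that kind of single entry point, for illustration only. The name open_helper is borrowed from the message above, but this is not the UCSC implementation; the scheme list and the binary-mode default are assumptions.

```python
from urllib.parse import urlparse
from urllib.request import urlopen


def open_helper(location):
    """Return a readable binary handle for a local path or a remote URL.

    Illustrative only: the caller never needs to know where the data lives.
    """
    scheme = urlparse(location).scheme
    if scheme in ("http", "https", "ftp"):
        # urlopen returns a file-like object supporting read().
        return urlopen(location)
    # Anything else is treated as a local filesystem path.
    return open(location, "rb")
```

Compression sniffing could be layered on top of the returned handle, which is the abstraction step the message argues for.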

On Jul 9, 2013, at 10:38 AM, Peter Cock wrote:

> On Tue, Jul 9, 2013 at 5:53 PM, Robert Baertsch  wrote:
>> On Jul 8, 2013, at 3:33 PM, Peter Cock wrote:
>>> The tools available in Galaxy are written in a range
>>> of languages including C, Perl, R, etc. Yes, some are in Python,
>>> but of those most are independent of Galaxy and can be used
>>> separately from Galaxy.
>> 
>> The helper function would have to be ported to R. We are talking
>> about how Galaxy handles compressed data. Once we decide that, we
>> can determine how best to implement it.
> 
> Individual tools called from Galaxy read and create the files -
> and we can't usually control them at this level (modifying them all
> to call a Galaxy managed file open mechanism is not an option).
> 
>> Proposal: Do not treat compressed data as a separate data type.
>> Treat it as an independent attribute that can be applied to any data.
>> Otherwise you will have to create a gzipped, zip and bz2 type for
>> every type that you want to compress.
> 
> That's what I've been saying - the fact that some people are
> already using a new gzipped FASTQ format within their Galaxy
> instances is practical, but I view it as a short term solution only.
> 
>>>> Encoding the gzip status in the datatype will create an explosion of
>>>> datatypes. Compression is not actually a datatype, it tells you nothing
>>>> about the content data that is stored in the file.
>>> 
>>> What we'd previously discussed was a dual system, holding
>>> the file type as now (e.g. FASTA, SAM, GFF3, etc) and any
>>> compression (e.g., None, normal GZIP, BGZF which is a
>>> GZIP variant, BZIP2, etc).
>> 
>> What about tabular? Should we create tab.gz, tab.bz2 and tab.zip also?
> 
> Note ZIP is a bit different, as it is often a multiple file bundle -
> it behaves differently from GZIP, BGZF, XZ, BZIP2 etc in that
> regard.
> 
> But otherwise, yes. As a specific example, the tabix tool uses BGZF
> compressed tabular data to combine compression and efficient
> random access. This would be useful for many annotation files
> (e.g. GTF, GFF3).
> 
>> This will quickly get out of hand and create a mess for tool
>> developers that need to support all these types.
> 
> Why? Individual tool developers don't need to know if Galaxy
> is keeping the original data file on disk compressed - unless
> the tool XML says otherwise, Galaxy would hide this detail
> and call the tool with an uncompressed input file.
> 
> (A Unix named pipe which decompresses the file on the fly would
> be a potential alternative - but only if the tool XML was marked
> up to say that an input could be streamed. The default must be
> to assume potential random access to the input files)
> 
>> The tool code and tool xml should be written to handle uncompressed
>> data and galaxy should handle the details of decompression. This
>> is not hard to do.
> 
> It isn't trivial either ;)
> 
> Peter



Re: [galaxy-dev] gzipped fastq reader

2013-07-09 Thread Robert Baertsch
Great. Let's put the bx-python calls in a galaxy_open helper function.

On Jul 8, 2013, at 3:20 PM, James Taylor wrote:

> open_compressed in bx-python does this already (for bz2 as well).
> 
> On Jul 8, 2013, at 5:58 PM, Peter Cock  wrote:
> 
>> On Mon, Jul 8, 2013 at 10:24 PM, Robert Baertsch
>>  wrote:
>>> Peter and Dan,
>>> I like the idea of replacing all open() with galaxy_open() in all tools. You
>>> can tell the format by looking at the first 4 bytes (see C code below from
>>> the UCSC browser team). Is there some pythonic way of overriding open?
>> 
>> There is monkey patching (replace the current 'open' function with
>> your modified version), but that is not a good idea in general.
>> 
>> In any case, this would only affect the small number of Python
>> tools which happen to use the Galaxy parsing libraries - which
>> is a very small fraction of the tools in Galaxy. Most of the tools
>> in Galaxy are compiled programs and are entirely separate.
>> 
>> Peter

Re: [galaxy-dev] gzipped fastq reader

2013-07-09 Thread Peter Cock
On Tue, Jul 9, 2013 at 5:53 PM, Robert Baertsch  wrote:
> On Jul 8, 2013, at 3:33 PM, Peter Cock wrote:
>> The tools available in Galaxy are written in a range
>> of languages including C, Perl, R, etc. Yes, some are in Python,
>> but of those most are independent of Galaxy and can be used
>> separately from Galaxy.
>
> The helper function would have to be ported to R. We are talking
> about how Galaxy handles compressed data. Once we decide that, we
> can determine how best to implement it.

Individual tools called from Galaxy read and create the files -
and we can't usually control them at this level (modifying them all
to call a Galaxy managed file open mechanism is not an option).

> Proposal: Do not treat compressed data as a separate data type.
> Treat it as an independent attribute that can be applied to any data.
> Otherwise you will have to create a gzipped, zip and bz2 type for
> every type that you want to compress.

That's what I've been saying - the fact that some people are
already using a new gzipped FASTQ format within their Galaxy
instances is practical, but I view it as a short term solution only.

>>> Encoding the gzip status in the datatype will create an explosion of
>>> datatypes. Compression is not actually a datatype, it tells you nothing
>>> about the content data that is stored in the file.
>>
>> What we'd previously discussed was a dual system, holding
>> the file type as now (e.g. FASTA, SAM, GFF3, etc) and any
>> compression (e.g., None, normal GZIP, BGZF which is a
>> GZIP variant, BZIP2, etc).
>
> What about tabular? Should we create tab.gz, tab.bz2 and tab.zip also?

Note ZIP is a bit different, as it is often a multiple file bundle -
it behaves differently from GZIP, BGZF, XZ, BZIP2 etc in that
regard.

But otherwise, yes. As a specific example, the tabix tool uses BGZF
compressed tabular data to combine compression and efficient
random access. This would be useful for many annotation files
(e.g. GTF, GFF3).

> This will quickly get out of hand and create a mess for tool
> developers that need to support all these types.

Why? Individual tool developers don't need to know if Galaxy
is keeping the original data file on disk compressed - unless
the tool XML says otherwise, Galaxy would hide this detail
and call the tool with an uncompressed input file.

(A Unix named pipe which decompresses the file on the fly would
be a potential alternative - but only if the tool XML was marked
up to say that an input could be streamed. The default must be
to assume potential random access to the input files)
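
A rough sketch of the named-pipe alternative, assuming a POSIX system and a gzip input. The function name stream_decompressed and the use of a background thread are illustrative choices, not anything Galaxy provides.

```python
import gzip
import os
import shutil
import threading


def stream_decompressed(gz_path, fifo_path):
    """Create a FIFO at fifo_path and feed it the decompressed bytes of gz_path.

    Returns the worker thread so the caller can join() it once the tool
    has finished reading. Only suitable for tools that read sequentially.
    """
    os.mkfifo(fifo_path)

    def _pump():
        # Opening a FIFO for writing blocks until a reader opens it too.
        with gzip.open(gz_path, "rb") as src, open(fifo_path, "wb") as dst:
            shutil.copyfileobj(src, dst)

    worker = threading.Thread(target=_pump, daemon=True)
    worker.start()
    return worker
```

A tool given fifo_path as its input sees uncompressed data without a temporary copy on disk, but it cannot seek, which is why streaming would have to be opt-in in the tool XML.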

> The tool code and tool xml should be written to handle uncompressed
> data and galaxy should handle the details of decompression. This
> is not hard to do.

It isn't trivial either ;)

Peter


Re: [galaxy-dev] gzipped fastq reader

2013-07-09 Thread Robert Baertsch
I respectfully disagree. If you want an extensible system, you should always
wrap primitive system-level calls.

Any tool that opens a file that could be compressed would be affected. That is
a huge number of tools. Do you really want a cottage industry of tools that
have different methods of dealing with compression?

Encoding the gzip status in the datatype will create an explosion of datatypes.
Compression is not actually a datatype; it tells you nothing about the content
that is stored in the file.

It is up to the Galaxy team to provide a standard way to interact with
compressed files. My proposed solution is a very small change that could be
phased in over time. Any tool that still uses open would not support compressed
files, but it would not break on uncompressed files.

Do others have an opinion?

On Jul 8, 2013, at 2:58 PM, Peter Cock wrote:

> On Mon, Jul 8, 2013 at 10:24 PM, Robert Baertsch
>  wrote:
>> Peter and Dan,
>> I like the idea of replacing all open() with galaxy_open() in all tools. You
>> can tell the format by looking at the first 4 bytes (see C code below from
>> the UCSC browser team). Is there some pythonic way of overriding open?
> 
> There is monkey patching (replace the current 'open' function with
> your modified version), but that is not a good idea in general.
> 
> In any case, this would only affect the small number of Python
> tools which happen to use the Galaxy parsing libraries - which
> is a very small fraction of the tools in Galaxy. Most of the tools
> in Galaxy are compiled programs and are entirely separate.
> 
> Peter




Re: [galaxy-dev] gzipped fastq reader

2013-07-09 Thread Robert Baertsch
Peter and Dan,
I like the idea of replacing all open() with galaxy_open() in all tools. You
can tell the format by looking at the first 4 bytes (see C code below from the
UCSC browser team). Is there some Pythonic way of overriding open?

You need to read the first four bytes of the file to see if it is compressed,
call gzip.open inside the function, and pass back the handle.

For now, it would require a global sweep through the tools to change open() to
galaxy_open(), but it is probably a good idea to have tool developers avoid
calling open directly.

You would need special handling if there are multiple files in the
compressed archive, but that support could be added later.

-Robert


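import bz2
import gzip
import zipfile


def galaxy_open(filename, mode="r"):
    # Sniff the first four bytes; fall back to plain open() if no signature matches.
    with open(filename, "rb") as handle:
        first4bytes = handle.read(4)
    ext = getExtensionFromHdrSig(first4bytes)  # same check as the C function below
    if ext is not None:
        return openCompressed(filename, mode, ext)
    return open(filename, mode)


def openCompressed(filename, mode, ext):
    if ext == "gz":
        return gzip.open(filename, mode)
    elif ext == "bz2":
        return bz2.BZ2File(filename, mode)
    elif ext == "zip":
        # A zip archive may hold several members; this returns the archive itself.
        return zipfile.ZipFile(filename, mode)
    raise ValueError("unsupported compression: %s" % ext)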

  

char *getExtensionFromHdrSig(char *first4bytes)
/* Check if header has signature of supported compression stream,
   and return the file extension for it, or NULL if no sig found. */
{
char *ext=NULL;
if (startsWith("\x1f\x8b",first4bytes)) ext = "gz";
else if (startsWith("\x1f\x9d\x90",first4bytes)) ext = "Z";
else if (startsWith("BZ",first4bytes)) ext = "bz2";
else if (startsWith("PK\x03\x04",first4bytes)) ext = "zip";
return ext;
}

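
For reference, a self-contained Python 3 sketch of the same signature check, using the magic numbers from the C code above. The snake_case names and the choice to return text-mode handles are assumptions made for illustration; this is not bx-python's open_compressed.

```python
import bz2
import gzip

# (signature bytes, extension) pairs, copied from the C snippet.
_SIGNATURES = (
    (b"\x1f\x8b", "gz"),
    (b"\x1f\x9d\x90", "Z"),
    (b"BZ", "bz2"),
    (b"PK\x03\x04", "zip"),
)


def get_extension_from_hdr_sig(first4bytes):
    """Return the extension implied by the header signature, or None."""
    for signature, ext in _SIGNATURES:
        if first4bytes.startswith(signature):
            return ext
    return None


def galaxy_open(filename):
    """Open filename for reading text, decompressing gzip or bzip2 transparently."""
    with open(filename, "rb") as handle:
        ext = get_extension_from_hdr_sig(handle.read(4))
    if ext == "gz":
        return gzip.open(filename, "rt")
    if ext == "bz2":
        return bz2.open(filename, "rt")
    if ext is None:
        return open(filename, "r")
    # compress (.Z) streams and multi-member zip archives need separate handling.
    raise ValueError("unsupported compression: " + ext)
```

One caveat with sniffing: an uncompressed text file that happens to begin with "BZ" or "PK" would be misdetected, so a stricter check may be wanted in practice.
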
On Jul 8, 2013, at 4:05 AM, Peter Cock wrote:

> On Thu, Jul 4, 2013 at 9:49 PM, Robert Baertsch
>  wrote:
>> Dan,
>> Do these readers support gzip files?
>> 
>>     reader = fastqVerboseErrorReader
>>     reader = fastqReader
> 
> Presumably you are writing a Python script using this library?
> The answer is a qualified yes. Instead of passing them a normal
> file handle from open("example.fastq"), you pass one from
> gzip.open("example.fastq.gz") after an import gzip.
> 
>> Do I have to define a special type in galaxy for gzipped files or will the 
>> fastq type be ok?
>> 
> 
> This needs a special file format - but you are not the first person to
> look at this, some groups have defined custom gzipped variants of
> the FASTQ formats within their own Galaxy instances. I've not
> done this but there should be some useful emails in the archive.
> 
> Note you'd also need to modify any tool definitions so that they
> can accept a gzipped FASTQ file.
> 
>> Ideally, I would like to keep my files zipped and not have galaxy unzip 
>> them, since they triple in size when unzipped.
>> 
>> I'm happy to do a pull request if you don't support this but I want to make 
>> sure I'm in line with your roadmap.
> 
> Personally I would like a more general system in Galaxy for
> potentially any file type to be held compressed in a range of
> formats (e.g. using gzip, bgzf, xz, bz2, etc), with exclusions
> for things like BAM which are already compressed. This way
> naive tools would get the gzipped file uncompressed to a
> temporary folder before use (i.e. no change for the tool wrapper),
> but if a tool accepts a gzipped file it will get that (less disk IO
> and CPU usage, but requires updating tool wrappers).
> 
> That idea is quite ambitious though ;)
> 
>> I have written a simple tool to convert Illumina fastq to mapsplice fastq. 
>> Does that already exist somewhere?
>> 
> 
> I don't know.
> 
> Peter


Re: [galaxy-dev] gzipped fastq reader

2013-07-09 Thread James Taylor
open_compressed in bx-python does this already (for bz2 as well).

On Jul 8, 2013, at 5:58 PM, Peter Cock  wrote:

> On Mon, Jul 8, 2013 at 10:24 PM, Robert Baertsch
>  wrote:
>> Peter and Dan,
>> I like the idea of replacing all open() with galaxy_open() in all tools. You
>> can tell the format by looking at the first 4 bytes (see C code below from
>> the UCSC browser team). Is there some pythonic way of overriding open?
> 
> There is monkey patching (replace the current 'open' function with
> your modified version), but that is not a good idea in general.
> 
> In any case, this would only affect the small number of Python
> tools which happen to use the Galaxy parsing libraries - which
> is a very small fraction of the tools in Galaxy. Most of the tools
> in Galaxy are compiled programs and are entirely separate.
> 
> Peter


Re: [galaxy-dev] gzipped fastq reader

2013-07-08 Thread Peter Cock
On Mon, Jul 8, 2013 at 11:21 PM, Robert Baertsch  wrote:
> I respectfully disagree,  If you want an extensible system, you should
> always wrap primitive system level calls.
>
> Any tools that opens a file that could be compressed would be affected.
> That is a huge number of tools. Do you really want a cottage industry of
> tools that have different methods of dealing with compression?

But defining a Python helper function within the Galaxy Python
libraries doesn't achieve that.

Are you talking about patching the OS level POSIX open functions
or something? The tools available in Galaxy are written in a range
of languages including C, Perl, R, etc. Yes, some are in Python,
but of those most are independent of Galaxy and can be used
separately from Galaxy.

> Encoding the gzip status in the datatype will create an explosion of
> datatypes. Compression is not actually a datatype, it tells you nothing
> about the content data that is stored in the file.

What we'd previously discussed was a dual system, holding
the file type as now (e.g. FASTA, SAM, GFF3, etc) and any
compression (e.g., None, normal GZIP, BGZF which is a
GZIP variant, BZIP2, etc).

Galaxy tool wrappers currently define input files with a list
of file types - they'd also have to give a list of supported
compression types (defaulting to none). Likewise for any
output files - if they are already compressed the XML for
the tool wrapper would have to tell Galaxy this.
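
To make the dual system concrete, here is a small Python sketch of how file type and compression could be tracked separately. The class and function names (Dataset, ToolInput, needs_decompression) are hypothetical and do not correspond to existing Galaxy code.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Dataset:
    path: str
    file_type: str                      # e.g. "fastq", "tabular", "gff3"
    compression: Optional[str] = None   # e.g. None, "gzip", "bgzf", "bzip2"


@dataclass(frozen=True)
class ToolInput:
    file_types: Tuple[str, ...]
    # Defaults to uncompressed only, so existing wrappers keep working.
    compressions: Tuple[Optional[str], ...] = (None,)


def needs_decompression(dataset, tool_input):
    """Return True if the dataset must be decompressed before the tool runs."""
    if dataset.file_type not in tool_input.file_types:
        raise ValueError("tool does not accept file type " + dataset.file_type)
    return dataset.compression not in tool_input.compressions
```

With this split, one "fastq" type covers every compression variant, and a wrapper opts in to compressed input simply by listing the compressions it can read.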

> It is up to the galaxy team to provide a standard way to interact
> with compressed files.

That is my preference too - although this could be driven by
the Galaxy community rather than the core team? I see
defining new datatypes like 'gzippedfastq' as a stop gap
special case (but a very practical route for now).

> My proposed solution, is a very small change that could
> be phased in over time. Any tools that uses open would not support
> compressed files, but they would not break on uncompressed files.
>
> Do others have an opinion?

Either I don't understand your plan, or it would only help in
a tiny minority of cases.

Regards,

Peter


Re: [galaxy-dev] gzipped fastq reader

2013-07-08 Thread Peter Cock
On Mon, Jul 8, 2013 at 10:24 PM, Robert Baertsch
 wrote:
> Peter and Dan,
> I like the idea of replacing all open() with galaxy_open() in all tools. You
> can tell the format by looking at the first 4 bytes (see C code below from
> the UCSC browser team). Is there some pythonic way of overriding open?

There is monkey patching (replace the current 'open' function with
your modified version), but that is not a good idea in general.

In any case, this would only affect the small number of Python
tools which happen to use the Galaxy parsing libraries - which
is a very small fraction of the tools in Galaxy. Most of the tools
in Galaxy are compiled programs and are entirely separate.

Peter


Re: [galaxy-dev] gzipped fastq reader

2013-07-08 Thread Peter Cock
On Thu, Jul 4, 2013 at 9:49 PM, Robert Baertsch
 wrote:
> Dan,
> Do these readers support gzip files?
>
>     reader = fastqVerboseErrorReader
>     reader = fastqReader

Presumably you are writing a Python script using this library?
The answer is a qualified yes. Instead of passing them a normal
file handle from open("example.fastq"), you pass one from
gzip.open("example.fastq.gz") after an import gzip.
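
A minimal sketch of that idea, assuming Python 3 (where text mode has to be requested with "rt"). The iter_fastq function is a hypothetical stand-in for the Galaxy fastqReader, whose exact interface is not shown in this thread, and it assumes simple four-line records.

```python
import gzip


def iter_fastq(handle):
    """Yield (identifier, sequence, quality) from a four-line-record FASTQ handle.

    Hypothetical stand-in for Galaxy's fastqReader.
    """
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # skip the "+" separator line
        quality = handle.readline().rstrip("\n")
        yield header[1:], sequence, quality


def read_fastq(path, compressed=False):
    """Open path with gzip.open or open, then parse it with the same reader."""
    opener = gzip.open if compressed else open
    with opener(path, "rt") as handle:
        return list(iter_fastq(handle))
```

The parser never learns whether the file was compressed; only the choice of opener changes.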

> Do I have to define a special type in galaxy for gzipped files or will the 
> fastq type be ok?
>

This needs a special file format - but you are not the first person to
look at this, some groups have defined custom gzipped variants of
the FASTQ formats within their own Galaxy instances. I've not
done this but there should be some useful emails in the archive.

Note you'd also need to modify any tool definitions so that they
can accept a gzipped FASTQ file.

> Ideally, I would like to keep my files zipped and not have galaxy unzip them, 
> since they triple in size when unzipped.
>
> I'm happy to do a pull request if you don't support this but I want to make 
> sure I'm in line with your roadmap.

Personally I would like a more general system in Galaxy for
potentially any file type to be held compressed in a range of
formats (e.g. using gzip, bgzf, xz, bz2, etc), with exclusions
for things like BAM which are already compressed. This way
naive tools would get the gzipped file uncompressed to a
temporary folder before use (i.e. no change for the tool wrapper),
but if a tool accepts a gzipped file it will get that (less disk IO
and CPU usage, but requires updating tool wrappers).

That idea is quite ambitious though ;)

> I have written a simple tool to convert Illumina fastq to mapsplice fastq. 
> Does that already exist somewhere?
>

I don't know.

Peter