Re: [galaxy-user] Patch for better FASTQ description handling

Eric Cabot Thu, 20 Oct 2011 06:17:11 -0700

Florent Angly wrote:

Peter and Daniel, thanks for the comments.
On 19/10/11 23:49, Peter Cock wrote:
On Wed, Oct 19, 2011 at 2:31 PM, Daniel Blankenberg <d...@bx.psu.edu> wrote:
Hi Florent,
Sorry for the delay. I did try the patch out shortly after you contributed it, but it caused the functional tests to fail. I was able to fix the issue and allow the existing tests to start passing, but I've been bogged down lately and haven't been able to perform a more thorough review of the code. If you could provide tests with files (e.g. for the tools affected) that test the
new functionality, that would be a great help.
I'll have a look at that.
The use of partition removes Python compatibility for <2.5, although this is
a lesser/non-concern.
I guess you could use split, but special case on there being no space.
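A minimal sketch of the split-based fallback suggested above, assuming the goal is to return the same two pieces that str.partition(" ") would give (partition was only added in Python 2.5). The function name split_title_line is hypothetical and is not taken from the patch.

```python
def split_title_line(title):
    """Split a title line into (identifier, description) without str.partition.

    Hypothetical helper: splits on the first space only and special-cases
    titles that contain no space at all.
    """
    parts = title.split(" ", 1)  # at most one split, on the first space
    if len(parts) == 1:
        # No space present: the whole line is the identifier.
        return parts[0], ""
    return parts[0], parts[1]
```

For example, split_title_line("Read A") gives ("Read", "A"), while split_title_line("Read") gives ("Read", "").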
Also, I'm not entirely sold on having the "Identifier line" being parsed as "identifier" + <space> + "description" instead of a single identifier line.
That is the normal convention, just like with FASTA.
http://dx.doi.org/10.1093/nar/gkp1137
The Bioperl and Biopython projects use this convention for FASTA and FASTQ files.
This would mean that identifiers could not themselves contain spaces,
but "There is no standardization for identifiers" (so they could technically have spaces?). Two different reads could be identified as "Read A" and "Read B", but they would then no longer be uniquely identifiable, as each would be
identified as "Read".  If this added functionality were introduced as
optional behavior (e.g. a user needs to click a checkbox on the tools to
apply the id line splitting), these concerns can be mitigated.
That is expected, "@Read A" and "@Read B" have the same identifier, "Read".
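To make the uniqueness concern concrete, here is a small sketch that groups title lines by the key a tool would use, with the splitting behaviour as an option (mirroring the checkbox idea above). The function name and the split_on_space parameter are hypothetical, not part of Galaxy or of the patch.

```python
from collections import defaultdict


def find_id_collisions(titles, split_on_space=True):
    """Return {key: [titles]} for every key shared by more than one title.

    With split_on_space=True the key is the text before the first space;
    with split_on_space=False the whole title line is the key.
    """
    groups = defaultdict(list)
    for title in titles:
        key = title.split(" ", 1)[0] if split_on_space else title
        groups[key].append(title)
    # Keep only keys that no longer identify a single read.
    return {key: found for key, found in groups.items() if len(found) > 1}
```

With the titles "Read A" and "Read B", splitting reports a collision on "Read"; treating the whole line as the identifier reports none.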
Peter, Florent, anyone else: I'd be very interested to hear your thoughts on the above, particularly with respect to known real-world data. For now, let's
discount SRA data from this discussion.
See also the new Illumina 1.8 naming convention where they dropped
the /1 and /2 and hid it in the description. It should be tested, but I think Florent's patch will work here (while the current Galaxy behaviour won't).
Peter
I was not aware of this new naming. It seems like a terrible decision from Illumina because now both reads in a pair technically have the same ID (but a different description).

This is not quite the case. Here are two fastq header lines for a pair of reads produced by Illumina's CASAVA 1.8:


@XYZZY:123:D0ABCDEFG:7:1101:1445:2057 1:N:0:CTTGTA
@XYZZY:123:D0ABCDEFG:7:1101:1445:2057 2:N:0:CTTGTA

The two key things to note, relevant to this discussion are:

1. A space character is used to split the fields into two groups.

This is actually a good thing, because that particular character can NEVER appear in either a sequence or a quality line. This makes it easy to detect name lines as those beginning with "@" (a valid quality character) and also having a space. If you are writing a parser for the new Illumina fastq format, please don't break the names on spaces!

2. Apart from the read number, encoded as the digit immediately following the space, the two lines are identical--as they were with earlier CASAVA versions. Why is this worse than two lines differing by "/1" vs. "/2"?
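The two points above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: both function names are hypothetical, the name-line test is the heuristic described in point 1 (leading "@" plus a space), and the mate test looks only at the text before the space and the digit right after it, without altering the stored name.

```python
def looks_like_casava18_name_line(line):
    """Heuristic from point 1: starts with '@' and contains a space.

    A quality line may start with '@' but can never contain a space.
    """
    return line.startswith("@") and " " in line


def are_mates(name1, name2):
    """Heuristic from point 2: same text before the space, read numbers 1 and 2."""
    left1, sep1, right1 = name1.partition(" ")
    left2, sep2, right2 = name2.partition(" ")
    if not (sep1 and sep2) or left1 != left2:
        return False
    # The read number is the first colon-separated field after the space.
    read_numbers = {right1.split(":", 1)[0], right2.split(":", 1)[0]}
    return read_numbers == {"1", "2"}
```

Applied to the two header lines shown earlier, both pass the name-line test and are_mates returns True for the pair.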

An additional improvement with the new naming convention is that flowcell and run IDs, as well as a flag for not passing filters (where N means the read did pass the filter), are now included.
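As a rough sketch, the fields of such a name line could be pulled apart as follows. The field names are my reading of the CASAVA 1.8 layout (instrument, run number, flowcell ID, lane, tile, x, y, then read number, filter flag, control number, index) and should be checked against Illumina's documentation; the type and function names are hypothetical.

```python
from collections import namedtuple

# Field names are assumptions based on the CASAVA 1.8 layout, not on this thread.
Casava18Name = namedtuple(
    "Casava18Name",
    "instrument run_number flowcell_id lane tile x y "
    "read is_filtered control_number index",
)


def parse_casava18_name(line):
    """Parse '@a:b:c:d:e:f:g h:i:j:k' into a Casava18Name (sketch only)."""
    if not line.startswith("@"):
        raise ValueError("name line must start with '@'")
    left, sep, right = line[1:].rstrip("\r\n").partition(" ")
    left_fields = left.split(":")
    right_fields = right.split(":")
    if not sep or len(left_fields) != 7 or len(right_fields) != 4:
        raise ValueError("not a CASAVA 1.8 style name line: %r" % line)
    instrument, run_number, flowcell_id = left_fields[:3]
    lane, tile, x, y = (int(value) for value in left_fields[3:])
    read, flag, control_number, index = right_fields
    # 'Y' means the read was filtered out; 'N' means it passed the filter.
    return Casava18Name(instrument, int(run_number), flowcell_id, lane, tile,
                        x, y, int(read), flag == "Y", int(control_number), index)
```

For the first header line above, this yields flowcell_id "D0ABCDEFG", run_number 123, read 1 and is_filtered False.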




Eric L. Cabot
Biotechnology Center
University of Wisconsin-Madison




___________________________________________________________
The Galaxy User list should be used for the discussion of
Galaxy analysis and other features on the public server
at usegalaxy.org.  Please keep all replies on the list by
using "reply all" in your mail client.  For discussion of
local Galaxy instances and the Galaxy source code, please
use the Galaxy Development list:

 http://lists.bx.psu.edu/listinfo/galaxy-dev

To manage your subscriptions to this and other Galaxy lists,
please use the interface at:

 http://lists.bx.psu.edu/
