Hi David,
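Your files appear to be of the Sanger FASTQ variant. As you have noticed, the
info blurb reported by the Groomer tool lists the variants the input is valid
for, along with its ASCII and decimal ranges, and it should be used to confirm
the input type. In your runs the blurb reports the data as valid only for
sanger, and reading the scores as Illumina 1.3+ yields negative values
(-29 to 9), which that format does not allow. While the 'Illumina 1.3+' FASTQ
format does encode scores using a different ASCII range, it is my
understanding that the scripts provided by the manufacturer to create FASTQ
formatted files were enhanced to write out Sanger encoded quality scores.
The correct Grooming path for your data is Sanger to Sanger, that is, choose
Sanger as the input quality scores type. Please let us know if we can provide
further assistance.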
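As a rough cross-check outside Galaxy, the same reasoning can be sketched in a few lines of Python. This is only an illustration, not the Groomer's implementation: the names VARIANTS, quality_lines and valid_variants are made up for this example, the offsets and score ranges are taken from the legend quoted below, and the input is assumed to be unwrapped four-line FASTQ records. The sample record is copied from the quoted message, where the quality line appears shorter than the read.

```python
# Offset and allowed score range per variant, taken from the legend in this thread.
# Names here are hypothetical; this is not the Groomer's code.
VARIANTS = {
    "sanger": (33, 0, 93),
    "illumina1.3+": (64, 0, 62),
    "solexa": (64, -5, 62),
}


def quality_lines(fastq_lines):
    """Yield the 4th line of each record, assuming unwrapped 4-line FASTQ."""
    for i, line in enumerate(fastq_lines):
        if i % 4 == 3:
            yield line.rstrip("\n")


def valid_variants(fastq_lines):
    """Return (lowest ASCII code, highest ASCII code, consistent variant names)."""
    codes = [ord(ch) for qual in quality_lines(fastq_lines) for ch in qual]
    lo, hi = min(codes), max(codes)  # raises ValueError on empty input
    names = [
        name
        for name, (offset, score_min, score_max) in VARIANTS.items()
        if score_min <= lo - offset and hi - offset <= score_max
    ]
    return lo, hi, names


record = [
    "@HWI-ST156_294:7:1:1184:2191:0/1",
    "CGTAAATCCATGTCTGACTTCTGGATAGCAAACACCAGCACCGCGTGGATG",
    "+",
    "EE;E=ECEEBE@=GBFGF/GFFCFA;:@8AEABBA#",
]
print(valid_variants(record))  # (35, 71, ['sanger'])
```

A '#' (ASCII 35) anywhere in the quality lines is already enough to rule out both +64 encodings, since their lowest legal characters are '@' (64) and ';' (59).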
Thanks for using Galaxy,
Dan
On Mar 21, 2011, at 9:42 AM, David K Crossman wrote:
Hello!
I am fairly new to using Galaxy and have a question about the
FASTQ Groomer feature. I have 4 RNA-Seq raw data files that were just
recently generated from Illumina’s NGS instruments. I am aware that the
first step to perform in Galaxy is FASTQ Groomer to convert the format to
FASTQ Sanger. I presume that I would choose Illumina 1.3+ in the “Input
FASTQ quality scores type” box. However, if I look at the raw data reads, I
notice that Line 4 (which encodes the quality values for the sequence in Line 2)
has values outside of the Illumina 1.3+ range (some of them fall into the
Sanger range). I am enclosing the Quality Score Comparison figure along with
some of the raw RNA-Seq data:
Quality Score Comparison
[Diagram: the printable ASCII characters '!' (33) through '~' (126), with ticks at 33, 59, 64, 73, 104 and 126 and a band marking the range used by each encoding]
S - Sanger Phred+33, 93 values (0, 93) (0 to 60 expected in raw reads)
I - Illumina 1.3 Phred+64, 62 values (0, 62) (0 to 40 expected in raw reads)
X - Solexa Solexa+64, 67 values (-5, 62) (-5 to 40 expected in raw reads)
Diagram adapted from http://en.wikipedia.org/wiki/FASTQ_format
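Going by the legend above, Sanger and Illumina 1.3+ both store Phred scores and differ only in the ASCII offset (33 versus 64), so re-encoding one as the other is a fixed shift of 31 characters. A minimal sketch of that shift follows. The function name illumina13_to_sanger is hypothetical, rejecting out-of-range characters is a choice made for this sketch rather than a description of what the Groomer does, and Solexa scores are left out because they are not Phred scores and need more than a shift.

```python
def illumina13_to_sanger(quality):
    """Re-encode a Phred+64 quality string as Phred+33.

    Hypothetical helper for illustration; raises on characters that
    cannot be Illumina 1.3+ scores (Phred 0 to 62).
    """
    converted = []
    for ch in quality:
        score = ord(ch) - 64  # decode with the Illumina 1.3+ offset
        if not 0 <= score <= 62:
            raise ValueError(f"{ch!r} is not a valid Illumina 1.3+ quality character")
        converted.append(chr(score + 33))  # encode with the Sanger offset
    return "".join(converted)


print(illumina13_to_sanger("@Bh"))  # !#I
```

A string that contains '#', such as the quality lines shown in this message, makes the sketch raise ValueError, because '#' would decode to a score of -29.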
RNA-Seq raw data
@HWI-ST156_294:7:1:1058:2165:0/1
CACCAACTCACAGCCACTCCGTGAGGCCAGCAAGGCAAGAACATTCATCTC
+
FGGHHHGFHHFHHEGHCGGGEB.EE9D?DD4FFFCBB/.C=D
@HWI-ST156_294:7:1:1184:2191:0/1
CGTAAATCCATGTCTGACTTCTGGATAGCAAACACCAGCACCGCGTGGATG
+
EE;E=ECEEBE@=GBFGF/GFFCFA;:@8AEABBA#
@HWI-ST156_294:7:1:1018:2200:0/1
NCTGATTAAGGATAATGAGTAGTAGAACTAATGATGTTATTCCTTGG
+
###
@HWI-ST156_294:7:1:1225:2217:0/1
GTGACTACACAAAGCACCCTTCTAAACCAGACCATTCTGGAGAATGA
+
FFCEFFFE?FEBDC?987::,3:-9145,DA:C9;+?
As a test in FASTQ Groomer, I ran the same file once with Sanger and once with
Illumina 1.3+ as the input quality scores type, and these are the results I
got:
FASTQ Groomer on tn-read1 (using Sanger as input)
6.1 Gb
format: fastqsanger, database:mm9
Info: Groomed 45868679 sanger reads into sanger reads. Based upon quality and
sequence, the input data is valid for: sanger Input ASCII range: '#'(35) -
'I'(73) Input decimal range: 2 - 40
FASTQ Groomer on tn-read1 (using Illumina1.3+ as input)
6.1 Gb
format: fastqsanger, database:mm9
Info: Groomed 45868679 illumina reads into sanger reads. Based upon quality
and sequence, the input data is valid for: sanger Input ASCII range: '#'(35)
- 'I'(73) Input decimal range: -29 - 9
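The two "Input decimal range" values above can be reproduced by subtracting each encoding's offset from the same ASCII range, '#' (35) to 'I' (73). A small sketch, with decimal_range as a hypothetical helper name and the offsets 33 and 64 taken from the legend earlier in this message:

```python
def decimal_range(first_char, last_char, offset):
    """Decode the lowest and highest quality characters with the given ASCII offset."""
    return ord(first_char) - offset, ord(last_char) - offset


print(decimal_range("#", "I", 33))  # Sanger offset: (2, 40)
print(decimal_range("#", "I", 64))  # Illumina 1.3+ offset: (-29, 9)
```

Both results span the same width because only the offset changes, but only the first stays within a non-negative score range.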
Which one is right (I presume the Illumina 1.3+ one, but I can’t find any
sort of explanation)? I noticed that the “input decimal range” had different
values (although both ranges span the same width) depending on which input
type was chosen. What would happen downstream in TopHat if Sanger was used
instead of Illumina 1.3+ for these files? Is there any other reading material
(websites, etc.) out there that might help me better understand the quality
score encodings and which one to use? Any info/help would be greatly appreciated.
Thanks,
David
David K. Crossman, Ph.D.
Systems Biologist/Analyst/Statistician
Heflin Center for Genomic Science
University of Alabama at Birmingham
720 20th Street South
Kaul Room 420
Birmingham, AL 35294-0024
(205) 996-4045
(205) 996-4056 (fax)