Re: [galaxy-user] Analyzing Targeted Resequencing data with Galaxy

2011-04-08 Thread Mike Dufault
Sean, Anton and Jen,
 
Thanks for all of the suggestions (in separate replies) on how to better 
analyze my SelectSure captured Exome data. My original work-flow is below in 
the e-mail string.
 
Based on the suggestions, I plan to change my work-flow by increasing my 
quality filter from 20 to 25-30 and increasing my minimum coverage from 3x to 
~20x. I will use the Join function to compare the SNPs that are in common with 
the samples from two family members to filter (narrow down) what they have in 
common, since I am looking for a hereditary disease. Then i will use the 
Join function again with the SNPs from build (131) to characterize the SNPs.
 
Sean suggested realignment around indels and potentially quality score 
recalibration. Is that even possible with Galaxy at the moment?
 
Where in the flow can I perform Indel analysis? Will I need to process my data 
separately for SNPs and Indel analysis, or can they be done sequentially in the 
same linear work-flow? I am still a little unsure of the best way to hand this.
 
Please let me know if you have any more suggestions or comments before I 
re-launch the analysis later this evening. Once I get a flow that works, I hope 
to be able to publish it for everyone to benefit from.
 
Thanks to the Galaxy team for an outstanding platform and support!
 
Mike
--- On Tue, 4/5/11, Sean Davis sdav...@mail.nih.gov wrote:


From: Sean Davis sdav...@mail.nih.gov
Subject: Re: [galaxy-user] Analyzing Targeted Resequencing data with Galaxy
To: Mike Dufault dufau...@yahoo.com
Cc: galaxy-user galaxy-user@lists.bx.psu.edu
Date: Tuesday, April 5, 2011, 4:39 PM


Hi, Mike.  See my couple of comments below 


Sean


On Tue, Apr 5, 2011 at 2:22 PM, Mike Dufault dufau...@yahoo.com wrote:






Hi all, 
  
Like many people on this e-mail chain, I have been looking for advice on how to 
process Exome data. Below, I have described in detail what I have done with the 
hope of getting some clarification. Hopefully it will be helpful to many of us! 
  
I have SureSelect Exome captured data. The data was delivered to me as two 
separate files (/1)  (/2). Each file has ~33 million reads; 7.2 GB each. I am 
looking for SNPs from a family with cancer. Eventually I plan to compare the 
date from multiple members of the same family to find a related disease SNP. 
  
Below is the workflow that I used to process my data. I adapted it from the 
Screencast titles: Mapping Illumina Reads: Paired Ends Example. I used all of 
the same default parameters as in the screencast. 
  
At the end of step 13, I had ~4,700,000 SNPs. This seemed like a lot so in step 
14, I filtered on column 7 (c7) which I believe is the Quality SNP value. I set 
the filter as C7=1 to remove all of the 0 (zero) values for Quality SNP. I 
figured that if they have a value of zero, they must not be real SNPs. This 
left me with ~180,000 SNPs. 
  
1: Get Data: Illumina 1.3+ file (/1) 
2: Get Data: Illumina 1.3+ file (/2) 
3: FASTQ Groomer on data 1 
4: FASTQ Groomer on data 2 
5: FASTQ Summary Statistics on data 3 
6: FASTQ Summary Statistics on data 4 
7: Box plot on data 5 
8: Box plot on data 6 
9: Map with Bowtie for Illumina on data 4 and data 3: mapped reads


This might not be the best choice, as bowtie does not allow gapped alignment.  
See here for a discussion of indels and SNV calling:


http://bioinformatics.oxfordjournals.org/content/26/6/722.long


You will probably also want to consider local realignment around indels and 
potentially quality score recalibration.  
 






10: Filter Sam on data 9 
11: SAM-to-BAM on data 10: converted to BAM 
12: Generate pileup on data 11: converted pileup 
13: Filter pileup on data 12 
14: Filter data on 13 (c7=1) 
15: Sort on data 15 (C7; descending order) 
  
First, if anyone has ideas on how to improve the workflow, I would be open to 
suggestions; especially from people experienced with Galaxy. 
  
Second, I am concerned that many/most of the SNPs are known. Should I filter my 
data against the known SNPdb? If so, how can I do this in Galaxy (in Bowtie?)


Keep in mind that, depending on the version of dbSNP, there are many 
cancer-associated SNPs contaminating the database.








  
Third, as suggested in the screencast, I did not trim or filter my FASTQ 
Groomed data because I was interested in SNPs and I could filter on Quality 
later in the workflow. Would implementing a filtering step on phred quality 
(~20) at this step save me the step of filtering later on. Currently it takes 
multiple hours (~16) to process the data from start to finish, would filtering 
at this step reduce the amount of time that it takes to process my data? 
Presumably, there would be less data to process. I do this on the AWS Cloud and 
time is money! 
 


Adding a gapped alignment algorithm, indel realignment, and quality 
recalibration can easily increase this time to a couple of days per sample.
 






Fifth, when using Galaxy on the AWS cloud, does adding additional cores or 

Re: [galaxy-user] Analyzing Targeted Resequencing data with Galaxy

2011-04-08 Thread Anton Nekrutenko
Mike:

Realignment and recalibration is not yet possible on the main site. However, we 
are working on several re-sequencing projects in house where these tools are 
used and will bring them to Galaxy by ISMB conference in Vienna. 

The indel analysis at the moment is rather simplistic (yet still very useful) 
and is based on processing on CIGAR strings in aligned SAM files. You can 
simply run datasets generated by BWA through our indels tools.

Thanks and let us know if you have more questions.


anton
galaxy team


On Apr 8, 2011, at 7:42 AM, Mike Dufault wrote:

 Sean, Anton and Jen,
  
 Thanks for all of the suggestions (in separate replies) on how to better 
 analyze my SelectSure captured Exome data. My original work-flow is below in 
 the e-mail string.
  
 Based on the suggestions, I plan to change my work-flow by increasing my 
 quality filter from 20 to 25-30 and increasing my minimum coverage from 3x to 
 ~20x. I will use the Join function to compare the SNPs that are in common 
 with the samples from two family members to filter (narrow down) what they 
 have in common, since I am looking for a hereditary disease. Then i will use 
 the Join function again with the SNPs from build (131) to characterize the 
 SNPs.
  
 Sean suggested realignment around indels and potentially quality score 
 recalibration. Is that even possible with Galaxy at the moment?
  
 Where in the flow can I perform Indel analysis? Will I need to process my 
 data separately for SNPs and Indel analysis, or can they be done sequentially 
 in the same linear work-flow? I am still a little unsure of the best way to 
 hand this.
  
 Please let me know if you have any more suggestions or comments before I 
 re-launch the analysis later this evening. Once I get a flow that works, I 
 hope to be able to publish it for everyone to benefit from.
  
 Thanks to the Galaxy team for an outstanding platform and support!
  
 Mike
 --- On Tue, 4/5/11, Sean Davis sdav...@mail.nih.gov wrote:
 
 From: Sean Davis sdav...@mail.nih.gov
 Subject: Re: [galaxy-user] Analyzing Targeted Resequencing data with Galaxy
 To: Mike Dufault dufau...@yahoo.com
 Cc: galaxy-user galaxy-user@lists.bx.psu.edu
 Date: Tuesday, April 5, 2011, 4:39 PM
 
 Hi, Mike.  See my couple of comments below
 
 Sean
 
 On Tue, Apr 5, 2011 at 2:22 PM, Mike Dufault dufau...@yahoo.com wrote:
 Hi all,
  
 Like many people on this e-mail chain, I have been looking for advice on how 
 to process Exome data. Below, I have described in detail what I have done 
 with the hope of getting some clarification. Hopefully it will be helpful to 
 many of us!
  
 I have SureSelect Exome captured data. The data was delivered to me as two 
 separate files (/1)  (/2). Each file has ~33 million reads; 7.2 GB each. I 
 am looking for SNPs from a family with cancer. Eventually I plan to compare 
 the date from multiple members of the same family to find a related disease 
 SNP.
  
 Below is the workflow that I used to process my data. I adapted it from the 
 Screencast titles: Mapping Illumina Reads: Paired Ends Example. I used all 
 of the same default parameters as in the screencast.
  
 At the end of step 13, I had ~4,700,000 SNPs. This seemed like a lot so in 
 step 14, I filtered on column 7 (c7) which I believe is the Quality SNP 
 value. I set the filter as C7=1 to remove all of the 0 (zero) values for 
 Quality SNP. I figured that if they have a value of zero, they must not be 
 real SNPs. This left me with ~180,000 SNPs.
  
 1: Get Data: Illumina 1.3+ file (/1)
 2: Get Data: Illumina 1.3+ file (/2)
 3: FASTQ Groomer on data 1
 4: FASTQ Groomer on data 2
 5: FASTQ Summary Statistics on data 3
 6: FASTQ Summary Statistics on data 4
 7: Box plot on data 5
 8: Box plot on data 6
 9: Map with Bowtie for Illumina on data 4 and data 3: mapped reads
 
 This might not be the best choice, as bowtie does not allow gapped alignment. 
  See here for a discussion of indels and SNV calling:
 
 http://bioinformatics.oxfordjournals.org/content/26/6/722.long
 
 You will probably also want to consider local realignment around indels and 
 potentially quality score recalibration.  
  
 10: Filter Sam on data 9
 11: SAM-to-BAM on data 10: converted to BAM
 12: Generate pileup on data 11: converted pileup
 13: Filter pileup on data 12
 14: Filter data on 13 (c7=1)
 15: Sort on data 15 (C7; descending order)
  
 First, if anyone has ideas on how to improve the workflow, I would be open to 
 suggestions; especially from people experienced with Galaxy.
  
 Second, I am concerned that many/most of the SNPs are known. Should I filter 
 my data against the known SNPdb? If so, how can I do this in Galaxy (in 
 Bowtie?)
 
 Keep in mind that, depending on the version of dbSNP, there are many 
 cancer-associated SNPs contaminating the database.
 
  
 Third, as suggested in the screencast, I did not trim or filter my FASTQ 
 Groomed data because I was interested in SNPs and I could filter on Quality 
 later in the 

Re: [galaxy-user] RNA seq analysis and GTF files

2011-04-08 Thread David K Crossman
Jeremy,

Thank you very much for this information.  One quick question.  
I added the gene_id values to the 10th column of my patched GTF file.  After 
uploading it to Galaxy, the column doesn't have a name (i.e. column 1 = 
Seqname; column 2 = Source; etc...).  Do I need to assign it a name (i.e. 
gene_name or gene_id) for it to be recognized and if so, how do you assign 
column names to GTF files?

Thanks,
David


From: Jeremy Goecks [mailto:jeremy.goe...@emory.edu]
Sent: Thursday, April 07, 2011 9:40 PM
To: David K Crossman
Cc: galaxy-user
Subject: Re: [galaxy-user] RNA seq analysis and GTF files

David,

Your analysis looks reasonable. In fact, in your isoform tracking FPKM file you 
get nearest_ref_id, so that's promising. What I think is needed is the addition 
of an attribute called gene_name to your reference file; you can use whatever 
value you want for gene name, and using the same value as gene_id probably 
makes sense.

Rerun your analysis with the further-patched GTF file, and let us know if this 
doesn't solve the problem. Also note that even using this attribute, some gene 
name/ids and some nearest_ref_id columns will not be populated in some cuffdiff 
files. See the post from Howie in this thread for an explanation from a 
Cufflinks developer: http://seqanswers.com/forums/showthread.php?t=6288

Best,
J.

On Apr 7, 2011, at 5:00 PM, David K Crossman wrote:


Jeremy,

I've shared it with you using your email address.

Thanks,
David


From: Jeremy Goecks [mailto:jeremy.goe...@emory.edu]
Sent: Thursday, April 07, 2011 3:42 PM
To: David K Crossman
Cc: galaxy-user
Subject: Re: [galaxy-user] RNA seq analysis and GTF files

David, can you please share your history with me and I'll take a look (History 
Options -- Share/Publish -- Share with User -- my email?

Thanks,
J.

On Apr 7, 2011, at 3:23 PM, David K Crossman wrote:



Hello!

I would like to ask a question related to this thread below.  I 
ran into the same issues as below and was unaware of having to swap some 
columns around in the GTF file.  So, after 'swapping the gene name from the 
complete table (name2 value, column 12) into the GFT file's gene_id value 
(which by default is the same as transcript_id), I uploaded this patched 
file (mm9) into Galaxy and ran Cufflinks, CuffCompare and CuffDiff using this 
patched GTF file as the reference annotation.  For both Cufflinks and 
CuffCompare, the gene_id was present in their respective columns.  The problem 
I have encountered now is that in all of the output files in CuffDiff, the 
gene_id column is blank (contains a -; highlighted in yellow below).  This 
example is from the CuffDiff gene expression output file:

test_id

gene

locus

sample_1

sample_2

status

value_1

value_2

ln(fold_change)

test_stat

p_value

significant

XLOC_01

-

chr1:4797973-4836816

q1

q2

OK

73.1908

82.1567

0.115559

-0.71896

0.472168

no

XLOC_02

-

chr1:4847774-4887990

q1

q2

OK

81.7264

53.1165

-0.43089

2.44474

0.014496

no

XLOC_03

-

chr1:5073253-5152630

q1

q2

OK

408.289

333.749

-0.20159

2.73173

0.0063

no

XLOC_04

-

chr1:5578573-5596214

q1

q2

NOTEST

2.34764

4.79772

0.71473

-0.89735

0.369532

no


What am I doing wrong?  I am interested in the differentially 
expressed genes in this RNA-Seq dataset (as well as calling variants, which is 
my next step, but want to get this answered first before moving on).  Any info, 
suggestions or help would be greatly appreciated.

Thanks,
David


-Original Message-
From: 
galaxy-user-boun...@lists.bx.psu.edumailto:galaxy-user-boun...@lists.bx.psu.edu
 [mailto:galaxy-user-boun...@lists.bx.psu.edu] On Behalf Of Jeremy Goecks
Sent: Friday, April 01, 2011 8:47 AM
To: ssa...@ccib.mgh.harvard.edumailto:ssa...@ccib.mgh.harvard.edu
Cc: galaxy-user
Subject: Re: [galaxy-user] RNA seq analysis and GTF files



On Mar 31, 2011, at 12:30 PM, 
ssa...@ccib.mgh.harvard.edumailto:ssa...@ccib.mgh.harvard.edu 
ssa...@ccib.mgh.harvard.edumailto:ssa...@ccib.mgh.harvard.edu wrote:

 Hi Jeremy,
 I used your exercise to perform an RNA-seq analysis. First I encountered a 
 problem where the gene IDs were missing from the results. Jen from the Galaxy 
 team suggested this:

 Yes, the team has taken a look and there are a few things going on.

 The first is that when running the Cuffcompare program, a reference 
 annotation file in GTF format should be used in order to obtain the same 
 results as in Jeremy's exercise. This seemed to be missing from your runs, 
 which resulted in badly formatted output that later resulted in a poor result 
 when Cuffdiff was used.

 The second has to do with the reference GTF file itself. For the best 
 results, the GTF file must have the gene_id attribute defined in the 9th 
 column of the file and the chromosome names must be in the same format as the 
 genome native to Galaxy. Depending on the source of the reference GTF, one of 
 these