Re: [galaxy-user] Analyzing Targeted Resequencing data with Galaxy
Sean, Anton and Jen, Thanks for all of the suggestions (in separate replies) on how to better analyze my SelectSure captured Exome data. My original work-flow is below in the e-mail string. Based on the suggestions, I plan to change my work-flow by increasing my quality filter from 20 to 25-30 and increasing my minimum coverage from 3x to ~20x. I will use the Join function to compare the SNPs that are in common with the samples from two family members to filter (narrow down) what they have in common, since I am looking for a hereditary disease. Then i will use the Join function again with the SNPs from build (131) to characterize the SNPs. Sean suggested realignment around indels and potentially quality score recalibration. Is that even possible with Galaxy at the moment? Where in the flow can I perform Indel analysis? Will I need to process my data separately for SNPs and Indel analysis, or can they be done sequentially in the same linear work-flow? I am still a little unsure of the best way to hand this. Please let me know if you have any more suggestions or comments before I re-launch the analysis later this evening. Once I get a flow that works, I hope to be able to publish it for everyone to benefit from. Thanks to the Galaxy team for an outstanding platform and support! Mike --- On Tue, 4/5/11, Sean Davis sdav...@mail.nih.gov wrote: From: Sean Davis sdav...@mail.nih.gov Subject: Re: [galaxy-user] Analyzing Targeted Resequencing data with Galaxy To: Mike Dufault dufau...@yahoo.com Cc: galaxy-user galaxy-user@lists.bx.psu.edu Date: Tuesday, April 5, 2011, 4:39 PM Hi, Mike. See my couple of comments below Sean On Tue, Apr 5, 2011 at 2:22 PM, Mike Dufault dufau...@yahoo.com wrote: Hi all, Like many people on this e-mail chain, I have been looking for advice on how to process Exome data. Below, I have described in detail what I have done with the hope of getting some clarification. Hopefully it will be helpful to many of us! I have SureSelect Exome captured data. The data was delivered to me as two separate files (/1) (/2). Each file has ~33 million reads; 7.2 GB each. I am looking for SNPs from a family with cancer. Eventually I plan to compare the date from multiple members of the same family to find a related disease SNP. Below is the workflow that I used to process my data. I adapted it from the Screencast titles: Mapping Illumina Reads: Paired Ends Example. I used all of the same default parameters as in the screencast. At the end of step 13, I had ~4,700,000 SNPs. This seemed like a lot so in step 14, I filtered on column 7 (c7) which I believe is the Quality SNP value. I set the filter as C7=1 to remove all of the 0 (zero) values for Quality SNP. I figured that if they have a value of zero, they must not be real SNPs. This left me with ~180,000 SNPs. 1: Get Data: Illumina 1.3+ file (/1) 2: Get Data: Illumina 1.3+ file (/2) 3: FASTQ Groomer on data 1 4: FASTQ Groomer on data 2 5: FASTQ Summary Statistics on data 3 6: FASTQ Summary Statistics on data 4 7: Box plot on data 5 8: Box plot on data 6 9: Map with Bowtie for Illumina on data 4 and data 3: mapped reads This might not be the best choice, as bowtie does not allow gapped alignment. See here for a discussion of indels and SNV calling: http://bioinformatics.oxfordjournals.org/content/26/6/722.long You will probably also want to consider local realignment around indels and potentially quality score recalibration. 10: Filter Sam on data 9 11: SAM-to-BAM on data 10: converted to BAM 12: Generate pileup on data 11: converted pileup 13: Filter pileup on data 12 14: Filter data on 13 (c7=1) 15: Sort on data 15 (C7; descending order) First, if anyone has ideas on how to improve the workflow, I would be open to suggestions; especially from people experienced with Galaxy. Second, I am concerned that many/most of the SNPs are known. Should I filter my data against the known SNPdb? If so, how can I do this in Galaxy (in Bowtie?) Keep in mind that, depending on the version of dbSNP, there are many cancer-associated SNPs contaminating the database. Third, as suggested in the screencast, I did not trim or filter my FASTQ Groomed data because I was interested in SNPs and I could filter on Quality later in the workflow. Would implementing a filtering step on phred quality (~20) at this step save me the step of filtering later on. Currently it takes multiple hours (~16) to process the data from start to finish, would filtering at this step reduce the amount of time that it takes to process my data? Presumably, there would be less data to process. I do this on the AWS Cloud and time is money! Adding a gapped alignment algorithm, indel realignment, and quality recalibration can easily increase this time to a couple of days per sample. Fifth, when using Galaxy on the AWS cloud, does adding additional cores or
Re: [galaxy-user] Analyzing Targeted Resequencing data with Galaxy
Mike: Realignment and recalibration is not yet possible on the main site. However, we are working on several re-sequencing projects in house where these tools are used and will bring them to Galaxy by ISMB conference in Vienna. The indel analysis at the moment is rather simplistic (yet still very useful) and is based on processing on CIGAR strings in aligned SAM files. You can simply run datasets generated by BWA through our indels tools. Thanks and let us know if you have more questions. anton galaxy team On Apr 8, 2011, at 7:42 AM, Mike Dufault wrote: Sean, Anton and Jen, Thanks for all of the suggestions (in separate replies) on how to better analyze my SelectSure captured Exome data. My original work-flow is below in the e-mail string. Based on the suggestions, I plan to change my work-flow by increasing my quality filter from 20 to 25-30 and increasing my minimum coverage from 3x to ~20x. I will use the Join function to compare the SNPs that are in common with the samples from two family members to filter (narrow down) what they have in common, since I am looking for a hereditary disease. Then i will use the Join function again with the SNPs from build (131) to characterize the SNPs. Sean suggested realignment around indels and potentially quality score recalibration. Is that even possible with Galaxy at the moment? Where in the flow can I perform Indel analysis? Will I need to process my data separately for SNPs and Indel analysis, or can they be done sequentially in the same linear work-flow? I am still a little unsure of the best way to hand this. Please let me know if you have any more suggestions or comments before I re-launch the analysis later this evening. Once I get a flow that works, I hope to be able to publish it for everyone to benefit from. Thanks to the Galaxy team for an outstanding platform and support! Mike --- On Tue, 4/5/11, Sean Davis sdav...@mail.nih.gov wrote: From: Sean Davis sdav...@mail.nih.gov Subject: Re: [galaxy-user] Analyzing Targeted Resequencing data with Galaxy To: Mike Dufault dufau...@yahoo.com Cc: galaxy-user galaxy-user@lists.bx.psu.edu Date: Tuesday, April 5, 2011, 4:39 PM Hi, Mike. See my couple of comments below Sean On Tue, Apr 5, 2011 at 2:22 PM, Mike Dufault dufau...@yahoo.com wrote: Hi all, Like many people on this e-mail chain, I have been looking for advice on how to process Exome data. Below, I have described in detail what I have done with the hope of getting some clarification. Hopefully it will be helpful to many of us! I have SureSelect Exome captured data. The data was delivered to me as two separate files (/1) (/2). Each file has ~33 million reads; 7.2 GB each. I am looking for SNPs from a family with cancer. Eventually I plan to compare the date from multiple members of the same family to find a related disease SNP. Below is the workflow that I used to process my data. I adapted it from the Screencast titles: Mapping Illumina Reads: Paired Ends Example. I used all of the same default parameters as in the screencast. At the end of step 13, I had ~4,700,000 SNPs. This seemed like a lot so in step 14, I filtered on column 7 (c7) which I believe is the Quality SNP value. I set the filter as C7=1 to remove all of the 0 (zero) values for Quality SNP. I figured that if they have a value of zero, they must not be real SNPs. This left me with ~180,000 SNPs. 1: Get Data: Illumina 1.3+ file (/1) 2: Get Data: Illumina 1.3+ file (/2) 3: FASTQ Groomer on data 1 4: FASTQ Groomer on data 2 5: FASTQ Summary Statistics on data 3 6: FASTQ Summary Statistics on data 4 7: Box plot on data 5 8: Box plot on data 6 9: Map with Bowtie for Illumina on data 4 and data 3: mapped reads This might not be the best choice, as bowtie does not allow gapped alignment. See here for a discussion of indels and SNV calling: http://bioinformatics.oxfordjournals.org/content/26/6/722.long You will probably also want to consider local realignment around indels and potentially quality score recalibration. 10: Filter Sam on data 9 11: SAM-to-BAM on data 10: converted to BAM 12: Generate pileup on data 11: converted pileup 13: Filter pileup on data 12 14: Filter data on 13 (c7=1) 15: Sort on data 15 (C7; descending order) First, if anyone has ideas on how to improve the workflow, I would be open to suggestions; especially from people experienced with Galaxy. Second, I am concerned that many/most of the SNPs are known. Should I filter my data against the known SNPdb? If so, how can I do this in Galaxy (in Bowtie?) Keep in mind that, depending on the version of dbSNP, there are many cancer-associated SNPs contaminating the database. Third, as suggested in the screencast, I did not trim or filter my FASTQ Groomed data because I was interested in SNPs and I could filter on Quality later in the
Re: [galaxy-user] RNA seq analysis and GTF files
Jeremy, Thank you very much for this information. One quick question. I added the gene_id values to the 10th column of my patched GTF file. After uploading it to Galaxy, the column doesn't have a name (i.e. column 1 = Seqname; column 2 = Source; etc...). Do I need to assign it a name (i.e. gene_name or gene_id) for it to be recognized and if so, how do you assign column names to GTF files? Thanks, David From: Jeremy Goecks [mailto:jeremy.goe...@emory.edu] Sent: Thursday, April 07, 2011 9:40 PM To: David K Crossman Cc: galaxy-user Subject: Re: [galaxy-user] RNA seq analysis and GTF files David, Your analysis looks reasonable. In fact, in your isoform tracking FPKM file you get nearest_ref_id, so that's promising. What I think is needed is the addition of an attribute called gene_name to your reference file; you can use whatever value you want for gene name, and using the same value as gene_id probably makes sense. Rerun your analysis with the further-patched GTF file, and let us know if this doesn't solve the problem. Also note that even using this attribute, some gene name/ids and some nearest_ref_id columns will not be populated in some cuffdiff files. See the post from Howie in this thread for an explanation from a Cufflinks developer: http://seqanswers.com/forums/showthread.php?t=6288 Best, J. On Apr 7, 2011, at 5:00 PM, David K Crossman wrote: Jeremy, I've shared it with you using your email address. Thanks, David From: Jeremy Goecks [mailto:jeremy.goe...@emory.edu] Sent: Thursday, April 07, 2011 3:42 PM To: David K Crossman Cc: galaxy-user Subject: Re: [galaxy-user] RNA seq analysis and GTF files David, can you please share your history with me and I'll take a look (History Options -- Share/Publish -- Share with User -- my email? Thanks, J. On Apr 7, 2011, at 3:23 PM, David K Crossman wrote: Hello! I would like to ask a question related to this thread below. I ran into the same issues as below and was unaware of having to swap some columns around in the GTF file. So, after 'swapping the gene name from the complete table (name2 value, column 12) into the GFT file's gene_id value (which by default is the same as transcript_id), I uploaded this patched file (mm9) into Galaxy and ran Cufflinks, CuffCompare and CuffDiff using this patched GTF file as the reference annotation. For both Cufflinks and CuffCompare, the gene_id was present in their respective columns. The problem I have encountered now is that in all of the output files in CuffDiff, the gene_id column is blank (contains a -; highlighted in yellow below). This example is from the CuffDiff gene expression output file: test_id gene locus sample_1 sample_2 status value_1 value_2 ln(fold_change) test_stat p_value significant XLOC_01 - chr1:4797973-4836816 q1 q2 OK 73.1908 82.1567 0.115559 -0.71896 0.472168 no XLOC_02 - chr1:4847774-4887990 q1 q2 OK 81.7264 53.1165 -0.43089 2.44474 0.014496 no XLOC_03 - chr1:5073253-5152630 q1 q2 OK 408.289 333.749 -0.20159 2.73173 0.0063 no XLOC_04 - chr1:5578573-5596214 q1 q2 NOTEST 2.34764 4.79772 0.71473 -0.89735 0.369532 no What am I doing wrong? I am interested in the differentially expressed genes in this RNA-Seq dataset (as well as calling variants, which is my next step, but want to get this answered first before moving on). Any info, suggestions or help would be greatly appreciated. Thanks, David -Original Message- From: galaxy-user-boun...@lists.bx.psu.edumailto:galaxy-user-boun...@lists.bx.psu.edu [mailto:galaxy-user-boun...@lists.bx.psu.edu] On Behalf Of Jeremy Goecks Sent: Friday, April 01, 2011 8:47 AM To: ssa...@ccib.mgh.harvard.edumailto:ssa...@ccib.mgh.harvard.edu Cc: galaxy-user Subject: Re: [galaxy-user] RNA seq analysis and GTF files On Mar 31, 2011, at 12:30 PM, ssa...@ccib.mgh.harvard.edumailto:ssa...@ccib.mgh.harvard.edu ssa...@ccib.mgh.harvard.edumailto:ssa...@ccib.mgh.harvard.edu wrote: Hi Jeremy, I used your exercise to perform an RNA-seq analysis. First I encountered a problem where the gene IDs were missing from the results. Jen from the Galaxy team suggested this: Yes, the team has taken a look and there are a few things going on. The first is that when running the Cuffcompare program, a reference annotation file in GTF format should be used in order to obtain the same results as in Jeremy's exercise. This seemed to be missing from your runs, which resulted in badly formatted output that later resulted in a poor result when Cuffdiff was used. The second has to do with the reference GTF file itself. For the best results, the GTF file must have the gene_id attribute defined in the 9th column of the file and the chromosome names must be in the same format as the genome native to Galaxy. Depending on the source of the reference GTF, one of these