Hi Alec,
Following your suggestions, I have tried the two methods but they did not work with some errors.
One is to set CREATE_INDEX=FALSE when using MarkDuplicates. As a result, the MarkDuplicates running is ok. Then following GATK snp calling pipeline for RNA-Seq data, I run SplitNCigarReads but the same error occurred again (ERROR MESSAGE: Exception when processing alignment for BAM index HS2:410:C21E6ACXX:4:1109:16160:63599 2/2 99b aligned read.). Because I did not create index in the MarkDuplicates step, I created index using "samtools index” instead of using Picard Tools and the index creation running seems ok. Then I re-run SplitNCigarReads but the same error still occurred (see attached).
The other is to merge mapped bam file with unmapped bam file for Tophat2 output. Firstly, I fixed up the unmapped bam file using the script “fix_tophat_unmapped_reads.py”. Then, I used MergeBamAlignment to merge but an error occurred (Second read from pair not found in unmapped bam: HS2:410:C21E6ACXX:4:1101:1421:96739, HS2:410:C21E6ACXX:4:1101:1422:64244). (see attached)
But indeed, I checked these two reads and found they were in the unmapped bam file. The following is an example.
The read is found in the unmapped bam file: HS2:410:C21E6ACXX:4:1101:1421:96739 133 7 53097688 0 * = 1 0 CCTAGGATATCTTTTATGGTATAGTATGGGTGGAATGGAATTTTATCTGNGTTGG BBBFFFFFFFFFFIIFIIIFIIIIFFIIIIBFFBFFIIIIIIIIFIIII#07FFI
The read is found in the mapped bam file: HS2:410:C21E6ACXX:4:1101:1421:96739 89 7 53097688 50 99M * 0 0 TAAAGTGAAAAGCAAAAAATCGAGTTAGTGTAGCTTTATCAACGGAGAATCCTCCTCAAATTCATTCTACTAGTGTACTTCCGATATATGGGATGGCAG BFFFFBBFFFFFFFFFFFFBFFFFFFFFFFIIIIIFIFBIIIIIIIFIFFIFFIFFIIIFBFIIIFIFFFFIIIIFFIIFIIIIIIFFFFFFFFFFBBB AS:i:-81 XN:i:0 XM:i:15 XO:i:0 XG:i:0 NM:i:15 MD:Z:10T2G2G5G8G8T2T28A0A2G2A2A8A2A2T1 YT:Z:UU NH:i:1
Because just 2 reads were reported to have an error, I removed these two reads from mapped and unmapped bam files. Then I re-run MergeBamAlignment, the same error still occurred with two new reads.
I do not know how to resolve this problem.
Indeed, I have another question. As you commented previously, "the problem is that apparently your have a contig that is longer than 512MB. The BAM index format cannot handle a contig this size". I am thinking that, even if I merged mapped and unmapped files together and then rerun all of the steps, I guess that this error for creating index still would occur because I am still using the same reference genome with some contigs longer than 512MB. My question is: is there any methods to divide long contigs into less than 512MB? I do not know whether this idea is feasible.
Thanks again for your help. The attached are running log for SplitNCigarReads and MergeBamAlignment.
Yibo
|
running_log_for_MergeBamAlignment
Description: Binary data
INFO 13:58:57,983 HelpFormatter -
--------------------------------------------------------------------------------
INFO 13:58:57,987 HelpFormatter - The Genome Analysis Toolkit (GATK)
v3.1-1-g07a4bf8, Compiled 2014/03/18 06:09:21
INFO 13:58:57,988 HelpFormatter - Copyright (c) 2010 The Broad Institute
INFO 13:58:57,988 HelpFormatter - For support and documentation go to
http://www.broadinstitute.org/gatk
INFO 13:58:57,996 HelpFormatter - Program Args: -T SplitNCigarReads -R
/work2/ybhu/reference/Mondom.74sorted.fa -I /work2/ybhu/Picard/1A_rg_dedup.bam
-o /work2/ybhu/Picard/1A_split.bam -U ALLOW_N_CIGAR_READS -rf
ReassignOneMappingQuality -RMQF 255 -RMQT 60
INFO 13:58:58,018 HelpFormatter - Executing as [email protected] on
Linux 2.6.32-431.17.1.el6.x86_64 amd64; OpenJDK 64-Bit Server VM
1.7.0_55-mockbuild_2014_04_16_12_11-b00.
INFO 13:58:58,019 HelpFormatter - Date/Time: 2014/07/17 13:58:57
INFO 13:58:58,019 HelpFormatter -
--------------------------------------------------------------------------------
INFO 13:58:58,020 HelpFormatter -
--------------------------------------------------------------------------------
INFO 13:58:58,287 GenomeAnalysisEngine - Strictness is SILENT
INFO 13:58:58,430 GenomeAnalysisEngine - Downsampling Settings: No
downsampling
INFO 13:58:58,445 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
INFO 13:58:58,479 SAMDataSource$SAMReaders - Done initializing BAM readers:
total time 0.03
INFO 13:58:58,873 GenomeAnalysisEngine - Preparing for traversal over 1 BAM
files
INFO 13:58:58,882 GenomeAnalysisEngine - Done preparing for traversal
INFO 13:58:58,883 ProgressMeter - [INITIALIZATION COMPLETE; STARTING
PROCESSING]
INFO 13:58:58,884 ProgressMeter - Location processed.reads runtime
per.1M.reads completed total.runtime remaining
INFO 13:58:58,918 ReadShardBalancer$1 - Loading BAM index data
INFO 13:58:58,925 ReadShardBalancer$1 - Done loading BAM index data
INFO 13:59:28,890 ProgressMeter - 1:24313241 2.00e+05 30.0 s
2.5 m 0.7% 74.1 m 73.6 m
INFO 14:00:02,727 ProgressMeter - 1:52082807 5.00e+05 63.0 s
2.1 m 1.4% 72.7 m 71.6 m
INFO 14:00:32,730 ProgressMeter - 1:75456414 8.00e+05 93.0 s
117.0 s 2.1% 74.1 m 72.5 m
INFO 14:01:02,733 ProgressMeter - 1:100920730 1.00e+06 2.1 m
2.1 m 2.8% 73.2 m 71.2 m
................................
INFO 15:05:30,632 ProgressMeter - 8:299312639 3.46e+07 66.5 m
115.0 s 94.6% 70.3 m 3.8 m
INFO 15:06:00,889 ProgressMeter - Un:17956273 3.49e+07 67.0 m
115.0 s 95.4% 70.2 m 3.2 m
INFO 15:06:30,892 ProgressMeter - Un:28164052 3.51e+07 67.5 m
115.0 s 95.7% 70.6 m 3.0 m
INFO 15:07:05,821 ProgressMeter - Un:45402843 3.54e+07 68.1 m
115.0 s 96.2% 70.8 m 2.7 m
INFO 15:07:35,824 ProgressMeter - Un:67626491 3.58e+07 68.6 m
114.0 s 96.8% 70.9 m 2.3 m
INFO 15:08:05,826 ProgressMeter - X:12145491 3.60e+07 69.1 m
115.0 s 98.1% 70.4 m 78.0 s
INFO 15:08:37,816 ProgressMeter - X:69964759 3.63e+07 69.6 m
114.0 s 99.7% 69.8 m 10.0 s
INFO 15:09:07,819 ProgressMeter - X:79276821 3.65e+07 70.1 m
115.0 s 100.0% 70.1 m 0.0 s
INFO 15:09:37,822 ProgressMeter - X:79276821 3.65e+07 70.6 m
115.0 s 100.0% 70.6 m 0.0 s
INFO 15:10:07,824 ProgressMeter - X:79276821 3.65e+07 71.1 m
116.0 s 100.0% 71.1 m 0.0 s
INFO 15:10:37,827 ProgressMeter - X:79276821 3.65e+07 71.6 m
117.0 s 100.0% 71.6 m 0.0 s
INFO 15:11:07,830 ProgressMeter - X:79276821 3.65e+07 72.1 m
118.0 s 100.0% 72.1 m 0.0 s
INFO 15:11:26,032 GATKRunReport - Uploaded run statistics report to AWS S3
##### ERROR
------------------------------------------------------------------------------------------
##### ERROR A BAM ERROR has occurred (version 3.1-1-g07a4bf8):
##### ERROR
##### ERROR This means that there is something wrong with the BAM file(s) you
provided.
##### ERROR The error message below tells you what is the problem.
##### ERROR
##### ERROR Visit our website and forum for extensive documentation and answers
to
##### ERROR commonly asked questions http://www.broadinstitute.org/gatk
##### ERROR
##### ERROR Please do NOT post this error to the GATK forum until you have
followed these instructions:
##### ERROR - Make sure that your BAM file is well-formed by running Picard's
validator on it
##### ERROR (see
http://picard.sourceforge.net/command-line-overview.shtml#ValidateSamFile for
details)
##### ERROR - Ensure that your BAM index is not corrupted: delete the current
one and regenerate it with 'samtools index'
##### ERROR
##### ERROR MESSAGE: Exception when processing alignment for BAM index
HS2:410:C21E6ACXX:4:1109:16160:63599 2/2 99b aligned read.
##### ERROR
------------------------------------------------------------------------------------------
Hi Yibo,
First of all, FYI this is a stack trace:
Exception in thread "main" net.sf.samtools.SAMException:
Exception when processing alignment for BAM index
HS2:410:C21E6ACXX:4:2316:13634:18930 1/2 97b aligned read.
at
net.sf.samtools.BAMFileWriter.writeAlignment(BAMFileWriter.java:120)
at
net.sf.samtools.SAMFileWriterImpl.addAlignment(SAMFileWriterImpl.java:168)
at
net.sf.picard.sam.AddOrReplaceReadGroups.doWork(AddOrReplaceReadGroups.java:100)
at
net.sf.picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:177)
at
net.sf.picard.cmdline.CommandLineProgram.instanceMainWithExit(CommandLineProgram.java:119)
at
net.sf.picard.sam.AddOrReplaceReadGroups.main(AddOrReplaceReadGroups.java:66)
Caused by: net.sf.samtools.SAMException: Exception creating
BAM index for record HS2:410:C21E6ACXX:4:2316:13634:18930 1/2 97b
aligned read.
at
net.sf.samtools.BAMIndexer.processAlignment(BAMIndexer.java:94)
at
net.sf.samtools.BAMFileWriter.writeAlignment(BAMFileWriter.java:117)
... 5 more
Caused by: java.lang.ArrayIndexOutOfBoundsException: 32770
at
net.sf.samtools.BAMIndexer$BAMIndexBuilder.processAlignment(BAMIndexer.java:276)
at
net.sf.samtools.BAMIndexer.processAlignment(BAMIndexer.java:92)
... 6 more
Regarding your original question, the problem is that apparently
your have a contig that is longer than 512MB. The BAM index format
cannot handle a contig this size. If you don't need a BAM index for
your process, then you can pass CREATE_INDEX=False to all the Picard
programs that write a BAM file, and this problem should go away. If
you need a BAM index, because you're going to query by genomic
locus, then you've got a problem.
Regarding using MergeBamAlignment, I believe it should be able to
handle the multiple alignments produced by Tophat. I've never tried
it with STAR, but I would suggest you try it without removing
multiple mappings, then run ValidateSamFile on the BAM produced, and
if that succeeds then you can assume it handles the multiple
mappings produced by the aligner correctly.
-Alec
On 7/17/14, 2:23 PM, 胡义波 wrote:
Hi Alec,
Thank you very much for your reply and suggestion. I am not
sure what the stack trace is, and I only had the running log
reporting the error message (see attached). Yes, for Tophat2
output, there is an unmapped BAM file, but for STAR, there is no
unmapped BAM file. Before I try to merge them, I have a
question: I am using unique mapping results from the mapping
output. So, should I remove multiple mapping results and then
merge with unmapped file, or not remove and merge? Thanks again.
Yibo
Hi Yibo,
It would help if you send the stack trace that should be
printed along with the error. However, the fact that you
need to reduce the validation stringency on the various
PIcard tools means that you are likely to run into various
problems. I suggest you change your workflow as follows:
rather than using the output of your aligner directly, use
Picard MergeBamAlignment to combine the aligner output with
a valid unmapped BAM. This should eliminate all the Picard
validation issues. If you don't have an unmapped BAM, you
can create one with Picard FastqToSam.
-Alec
On 7/16/14, 3:18 PM, Yibo
wrote:
Dear all,
I am using Picard tools but encountered some problems. I want to use GATK to call snp for RNA-seq data. I have used two methods, Tophat2 and STAR, to perform RNA-seq mapping. Because I do not have my own reference genome, I used the genome of a related species as the reference (different genus). However, for the Tophat2 and STAR mapping output (I used only unique mapping output), when using Picardtools (SortSam, AddOrReplaceReadGroups or MarkDuplicates), some errors occurred.
I am using Java 1.7.0_45 and Picard-tools-1.117 (or 1.77). The error messages are as follows:
(1) For STAR output, I successfully used AddOrReplaceReadGroups to add read groups and sort by coordinate. However, when using MarkDuplicates, an error occurred as follows (running log attached):
"Exception in thread "main" htsjdk.samtools.SAMException: Exception when processing alignment for BAM index HS2:410:C21E6ACXX:4:1109:16160:63599 2/2 99b aligned read."
I checked this read and it is a read pair and seems ok. I attached the paired-end reads information as follows:
HS2:410:C21E6ACXX:4:1109:16160:63599 163 1 536940431 255 9S90M = 536940464 99 CAGAAAATGAACTGTTTGGAGAAAGATGCCTGCAGAAACCTACATAGCAGCAAGAAGATCCAGAATGAATTTTGGGGTACAATTGATTGAACTGAAGGG BBBFFFFFFFFFFIFFIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIFFFF<BBFFFFFFFFFBFFFFFFFFF RG:Z:LANE11A NH:i:1 HI:i:1 jI:B:i,-1 jM:B:c,-1 nM:i:10 AS:i:134
HS2:410:C21E6ACXX:4:1109:16160:63599 83 1 536940464 255 66M33S = 536940431 -99 CATAGCAGCAAGAAGATCCAGAATGAATTTTGGGGTACAATTGATTGAACTGAAGGGGGCTGAACACAATTATTTTGAATGTAAACTCTTATGCCAAAG FFFFFFFBFFFFFFFFFFFFFFFFFIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIFFIIIIIIIIIIIIIIIIIIFIFFFFFFFFFFBBB RG:Z:LANE11A NH:i:1 HI:i:1 jI:B:i,-1 jM:B:c,-1 nM:i:10 AS:i:134
Additionally, I used ValidateSamFile tool to check my sam file, and found as follows:
Error Type Count
ERROR:MATE_NOT_FOUND 37
WARNING:MISSING_TAG_NM 48684343
Because only 37 reads is mate_not_found, I deleted these reads from my sam file. Then rerunning MarkDuplicates, the same error still occurred.
My scripts are as follows:
java -Xmx4G -jar /u/home/y/ybhu/picard-tools-1.117/AddOrReplaceReadGroups.jar INPUT=${path_input}/${samp}_uniquelymapped.sam OUTPUT=${path_input}/${samp}_rg.sam SORT_ORDER=coordinate RGID=LANE11A RGPL=ILLUMINA RGLB=LIB1A RGPU=LANE1 RGSM=1A
java -Xmx4G -jar /u/home/y/ybhu/picard-tools-1.117/MarkDuplicates.jar I=${path_input}/${samp}_rg.bam O=${path_input}/${samp}_dedup.bam METRICS_FILE=${path_input}/${samp}_dedupmetrics.txt REMOVE_DUPLICATES=TRUE CREATE_INDEX=TRUE ASSUME_SORTED=TRUE VALIDATION_STRINGENCY=SILENT
(2) For Tophat2 output, I used different steps (first use SortSam to sort by coordinate, then AddOrReplaceReadGroups to add read groups). However, for these two steps, error messages also occurred (running log as attached).
For SortSam tool, 11449 reads were reported "Insert size out of range". Then I changed the argument "VALIDATION_STRINGENCY=STRICT" to "LENIENT", the SortSam can ignore these errors and finish sorting.
But when running AddOrReplaceReadGroups, a same error occurred (similar to STAR running error):
"Exception in thread "main" net.sf.samtools.SAMException: Exception when processing alignment for BAM index HS2:410:C21E6ACXX:4:2316:13634:18930 1/2 97b aligned read."
I pasted the read information here:
HS2:410:C21E6ACXX:4:2316:13634:18930 73 1 536905899 50 32M2D57M4I4M * 0 0 GTAAATTTTTGTAAATGTTCAATGAGGTGCTGAAACATGTATGTTCTTTAGCTTTCCCATTTAGAAGACAACATAAATCTTAATTTTTCCACTAATT B<BFFFFFFFFFFFF0<BFFIIIF0BFBFFFIIIIIBFIBFFFBFFFIFFBFFFFFFFFFFIIBFFIIIFIIIIIFIFFFFFFFFBFFFBBBFFFBF AS:i:-126 XN:i:0 XM:i:20 XO:i:2 XG:i:6 NM:i:26 MD:Z:2C1G10A3T0G2A0T0A3A0A1^AA3T9G4A1A0G2T24T1A0A5G2 YT:Z:UU NH:i:1.
My scripts are as follows:
java -Xmx3G -jar /u/local/apps/picard-tools/1.77/SortSam.jar INPUT=${path_input}/${samp}_uniquelymapped.sam OUTPUT=${path_input}/${samp}_unique_sorted.sam SORT_ORDER=coordinate CREATE_INDEX=TRUE VALIDATION_STRINGENCY=LENIENT
java -Xmx3G -jar /u/local/apps/picard-tools/1.77/AddOrReplaceReadGroups.jar INPUT=${path_input}/${samp}_unique_sorted.bam OUTPUT=${path_input}/${samp}_rg.bam SORT_ORDER=coordinate CREATE_INDEX=TRUE RGID=LANE11A RGPL=ILLUMINA RGLB=LIB1A RGPU=LANE1 RGSM=1A VALIDATION_STRINGENCY=LENIENT
In summary, the errors for STAR and Tophat2 seem to be caused by creating BAM index for these reported reads. But I do no know why and how to resolve it.
Your comments are very appreciated, and thanks in advance.
Yibo
Department of Ecology and Evolutionary Biology,
University of California, Los Angeles
------------------------------------------------------------------------------
Want fast and easy access to all the code in your enterprise? Index and
search up to 200,000 lines of code with a free copy of Black Duck
Code Sight - the same software that powers the world's largest code
search on Ohloh, the Black Duck Open Hub! Try it now.
http://p.sf.net/sfu/bds
_______________________________________________
Samtools-help mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/samtools-help
|
------------------------------------------------------------------------------
Want fast and easy access to all the code in your enterprise? Index and
search up to 200,000 lines of code with a free copy of Black Duck
Code Sight - the same software that powers the world's largest code
search on Ohloh, the Black Duck Open Hub! Try it now.
http://p.sf.net/sfu/bds
_______________________________________________
Samtools-help mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/samtools-help