Hi Yibo,
If you have a contig longer than 512MB you will have problems.
Yes, you could split your contig into piece smaller than 512MB, with the
understanding that reads that straddle the breakpoint will not align
properly. I think this is just a matter of editing your reference fasta
and splitting that sequence into pieces and given each piece a different
name. You'll need to realign your reads with the new reference fasta.
-Alec
On 7/23/14, 4:08 PM, 胡义波 wrote:
Hi Alec,
Following your suggestions, I have tried the two methods but they did
not work with some errors.
One is to set CREATE_INDEX=FALSE when using MarkDuplicates. As a
result, the MarkDuplicates running is ok. Then following GATK snp
calling pipeline for RNA-Seq data, I run SplitNCigarReads but the same
error occurred again (ERROR MESSAGE: Exception when processing
alignment for BAM index HS2:410:C21E6ACXX:4:1109:16160:63599 2/2 99b
aligned read.). Because I did not create index in the MarkDuplicates
step, I created index using "samtools index” instead of using Picard
Tools and the index creation running seems ok. Then I re-run
SplitNCigarReads but the same error still occurred (see attached).
The other is to merge mapped bam file with unmapped bam file for
Tophat2 output. Firstly, I fixed up the unmapped bam file using the
script “fix_tophat_unmapped_reads.py”. Then, I used MergeBamAlignment
to merge but an error occurred (Second read from pair not found in
unmapped bam: HS2:410:C21E6ACXX:4:1101:1421:96739,
HS2:410:C21E6ACXX:4:1101:1422:64244). (see attached)
But indeed, I checked these two reads and found they were in the
unmapped bam file. The following is an example.
The read is found in the unmapped bam file:
HS2:410:C21E6ACXX:4:1101:1421:967391337530976880*=10CCTAGGATATCTTTTATGGTATAGTATGGGTGGAATGGAATTTTATCTGNGTTGGBBBFFFFFFFFFFIIFIIIFIIIIFFIIIIBFFBFFIIIIIIIIFIIII#07FFI
The read is found in the mapped bam file:
HS2:410:C21E6ACXX:4:1101:1421:96739897530976885099M*00TAAAGTGAAAAGCAAAAAATCGAGTTAGTGTAGCTTTATCAACGGAGAATCCTCCTCAAATTCATTCTACTAGTGTACTTCCGATATATGGGATGGCAGBFFFFBBFFFFFFFFFFFFBFFFFFFFFFFIIIIIFIFBIIIIIIIFIFFIFFIFFIIIFBFIIIFIFFFFIIIIFFIIFIIIIIIFFFFFFFFFFBBBAS:i:-81XN:i:0XM:i:15XO:i:0XG:i:0NM:i:15MD:Z:10T2G2G5G8G8T2T28A0A2G2A2A8A2A2T1YT:Z:UUNH:i:1
Because just 2 reads were reported to have an error, I removed these
two reads from mapped and unmapped bam files. Then I re-run
MergeBamAlignment, the same error still occurred with two new reads.
I do not know how to resolve this problem.
Indeed, I have another question. As you commented previously, "the
problem is that apparently your have a contig that is longer than
512MB. The BAM index format cannot handle a contig this size". I am
thinking that, even if I merged mapped and unmapped files together and
then rerun all of the steps, I guess that this error for creating
index still would occur because I am still using the same reference
genome with some contigs longer than 512MB. My question is: is there
any methods to divide long contigs into less than 512MB? I do not know
whether this idea is feasible.
Thanks again for your help. The attached are running log for
SplitNCigarReads and MergeBamAlignment.
Yibo
On Jul 17, 2014, at 11:34, Alec Wysoker <[email protected]
<mailto:[email protected]>> wrote:
Hi Yibo,
First of all, FYI this is a stack trace:
Exception in thread "main" net.sf.samtools.SAMException: Exception
when processing alignment for BAM index
HS2:410:C21E6ACXX:4:2316:13634:18930 1/2 97b aligned read.
at
net.sf.samtools.BAMFileWriter.writeAlignment(BAMFileWriter.java:120)
at
net.sf.samtools.SAMFileWriterImpl.addAlignment(SAMFileWriterImpl.java:168)
at
net.sf.picard.sam.AddOrReplaceReadGroups.doWork(AddOrReplaceReadGroups.java:100)
at
net.sf.picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:177)
at
net.sf.picard.cmdline.CommandLineProgram.instanceMainWithExit(CommandLineProgram.java:119)
at
net.sf.picard.sam.AddOrReplaceReadGroups.main(AddOrReplaceReadGroups.java:66)
Caused by: net.sf.samtools.SAMException: Exception creating BAM index
for record HS2:410:C21E6ACXX:4:2316:13634:18930 1/2 97b aligned read.
at net.sf.samtools.BAMIndexer.processAlignment(BAMIndexer.java:94)
at
net.sf.samtools.BAMFileWriter.writeAlignment(BAMFileWriter.java:117)
... 5 more
Caused by: java.lang.ArrayIndexOutOfBoundsException: 32770
at
net.sf.samtools.BAMIndexer$BAMIndexBuilder.processAlignment(BAMIndexer.java:276)
at net.sf.samtools.BAMIndexer.processAlignment(BAMIndexer.java:92)
... 6 more
Regarding your original question, the problem is that apparently your
have a contig that is longer than 512MB. The BAM index format cannot
handle a contig this size. If you don't need a BAM index for your
process, then you can pass CREATE_INDEX=False to all the Picard
programs that write a BAM file, and this problem should go away. If
you need a BAM index, because you're going to query by genomic locus,
then you've got a problem.
Regarding using MergeBamAlignment, I believe it should be able to
handle the multiple alignments produced by Tophat. I've never tried
it with STAR, but I would suggest you try it without removing
multiple mappings, then run ValidateSamFile on the BAM produced, and
if that succeeds then you can assume it handles the multiple mappings
produced by the aligner correctly.
-Alec
On 7/17/14, 2:23 PM, 胡义波 wrote:
Hi Alec,
Thank you very much for your reply and suggestion. I am not sure
what the stack trace is, and I only had the running log reporting
the error message (see attached). Yes, for Tophat2 output, there is
an unmapped BAM file, but for STAR, there is no unmapped BAM file.
Before I try to merge them, I have a question: I am using unique
mapping results from the mapping output. So, should I remove
multiple mapping results and then merge with unmapped file, or not
remove and merge? Thanks again.
Yibo
On Jul 17, 2014, at 6:11, Alec Wysoker <[email protected]
<mailto:[email protected]>> wrote:
Hi Yibo,
It would help if you send the stack trace that should be printed
along with the error. However, the fact that you need to reduce
the validation stringency on the various PIcard tools means that
you are likely to run into various problems. I suggest you change
your workflow as follows: rather than using the output of your
aligner directly, use Picard MergeBamAlignment to combine the
aligner output with a valid unmapped BAM. This should eliminate
all the Picard validation issues. If you don't have an unmapped
BAM, you can create one with Picard FastqToSam.
-Alec
On 7/16/14, 3:18 PM, Yibo wrote:
Dear all,
I am using Picard tools but encountered some problems. I want to use GATK to
call snp for RNA-seq data. I have used two methods, Tophat2 and STAR, to
perform RNA-seq mapping. Because I do not have my own reference genome, I used
the genome of a related species as the reference (different genus). However,
for the Tophat2 and STAR mapping output (I used only unique mapping output),
when using Picardtools (SortSam, AddOrReplaceReadGroups or MarkDuplicates),
some errors occurred.
I am using Java 1.7.0_45 and Picard-tools-1.117 (or 1.77). The error messages
are as follows:
(1) For STAR output, I successfully used AddOrReplaceReadGroups to add read
groups and sort by coordinate. However, when using MarkDuplicates, an error
occurred as follows (running log attached):
"Exception in thread "main" htsjdk.samtools.SAMException: Exception when processing
alignment for BAM index HS2:410:C21E6ACXX:4:1109:16160:63599 2/2 99b aligned read."
I checked this read and it is a read pair and seems ok. I attached the
paired-end reads information as follows:
HS2:410:C21E6ACXX:4:1109:16160:63599 163 1 536940431 255
9S90M = 536940464 99
CAGAAAATGAACTGTTTGGAGAAAGATGCCTGCAGAAACCTACATAGCAGCAAGAAGATCCAGAATGAATTTTGGGGTACAATTGATTGAACTGAAGGG
BBBFFFFFFFFFFIFFIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIFFFF<BBFFFFFFFFFBFFFFFFFFF
RG:Z:LANE11A NH:i:1 HI:i:1 jI:B:i,-1 jM:B:c,-1 nM:i:10 AS:i:134
HS2:410:C21E6ACXX:4:1109:16160:63599 83 1 536940464 255
66M33S = 536940431 -99
CATAGCAGCAAGAAGATCCAGAATGAATTTTGGGGTACAATTGATTGAACTGAAGGGGGCTGAACACAATTATTTTGAATGTAAACTCTTATGCCAAAG
FFFFFFFBFFFFFFFFFFFFFFFFFIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIFFIIIIIIIIIIIIIIIIIIFIFFFFFFFFFFBBB
RG:Z:LANE11A NH:i:1 HI:i:1 jI:B:i,-1 jM:B:c,-1 nM:i:10
AS:i:134
Additionally, I used ValidateSamFile tool to check my sam file, and found as
follows:
Error Type Count
ERROR:MATE_NOT_FOUND 37
WARNING:MISSING_TAG_NM 48684343
Because only 37 reads is mate_not_found, I deleted these reads from my sam
file. Then rerunning MarkDuplicates, the same error still occurred.
My scripts are as follows:
java -Xmx4G -jar /u/home/y/ybhu/picard-tools-1.117/AddOrReplaceReadGroups.jar
INPUT=${path_input}/${samp}_uniquelymapped.sam
OUTPUT=${path_input}/${samp}_rg.sam SORT_ORDER=coordinate RGID=LANE11A
RGPL=ILLUMINA RGLB=LIB1A RGPU=LANE1 RGSM=1A
java -Xmx4G -jar /u/home/y/ybhu/picard-tools-1.117/MarkDuplicates.jar
I=${path_input}/${samp}_rg.bam O=${path_input}/${samp}_dedup.bam
METRICS_FILE=${path_input}/${samp}_dedupmetrics.txt REMOVE_DUPLICATES=TRUE
CREATE_INDEX=TRUE ASSUME_SORTED=TRUE VALIDATION_STRINGENCY=SILENT
(2) For Tophat2 output, I used different steps (first use SortSam to sort by
coordinate, then AddOrReplaceReadGroups to add read groups). However, for these
two steps, error messages also occurred (running log as attached).
For SortSam tool, 11449 reads were reported "Insert size out of range". Then I changed the argument
"VALIDATION_STRINGENCY=STRICT" to "LENIENT", the SortSam can ignore these errors and
finish sorting.
But when running AddOrReplaceReadGroups, a same error occurred (similar to STAR
running error):
"Exception in thread "main" net.sf.samtools.SAMException: Exception when processing
alignment for BAM index HS2:410:C21E6ACXX:4:2316:13634:18930 1/2 97b aligned read."
I pasted the read information here:
HS2:410:C21E6ACXX:4:2316:13634:18930 73 1 536905899 50
32M2D57M4I4M * 0 0
GTAAATTTTTGTAAATGTTCAATGAGGTGCTGAAACATGTATGTTCTTTAGCTTTCCCATTTAGAAGACAACATAAATCTTAATTTTTCCACTAATT
B<BFFFFFFFFFFFF0<BFFIIIF0BFBFFFIIIIIBFIBFFFBFFFIFFBFFFFFFFFFFIIBFFIIIFIIIIIFIFFFFFFFFBFFFBBBFFFBF
AS:i:-126 XN:i:0 XM:i:20 XO:i:2 XG:i:6 NM:i:26
MD:Z:2C1G10A3T0G2A0T0A3A0A1^AA3T9G4A1A0G2T24T1A0A5G2 YT:Z:UU NH:i:1.
My scripts are as follows:
java -Xmx3G -jar /u/local/apps/picard-tools/1.77/SortSam.jar
INPUT=${path_input}/${samp}_uniquelymapped.sam
OUTPUT=${path_input}/${samp}_unique_sorted.sam SORT_ORDER=coordinate
CREATE_INDEX=TRUE VALIDATION_STRINGENCY=LENIENT
java -Xmx3G -jar /u/local/apps/picard-tools/1.77/AddOrReplaceReadGroups.jar
INPUT=${path_input}/${samp}_unique_sorted.bam
OUTPUT=${path_input}/${samp}_rg.bam SORT_ORDER=coordinate CREATE_INDEX=TRUE
RGID=LANE11A RGPL=ILLUMINA RGLB=LIB1A RGPU=LANE1 RGSM=1A
VALIDATION_STRINGENCY=LENIENT
In summary, the errors for STAR and Tophat2 seem to be caused by creating BAM
index for these reported reads. But I do no know why and how to resolve it.
Your comments are very appreciated, and thanks in advance.
Yibo
Department of Ecology and Evolutionary Biology,
University of California, Los Angeles
------------------------------------------------------------------------------
Want fast and easy access to all the code in your enterprise? Index and
search up to 200,000 lines of code with a free copy of Black Duck
Code Sight - the same software that powers the world's largest code
search on Ohloh, the Black Duck Open Hub! Try it now.
http://p.sf.net/sfu/bds
_______________________________________________
Samtools-help mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/samtools-help
------------------------------------------------------------------------------
Infragistics Professional
Build stunning WinForms apps today!
Reboot your WinForms applications with our WinForms controls.
Build a bridge from your legacy apps to the future.
http://pubads.g.doubleclick.net/gampad/clk?id=153845071&iu=/4140/ostg.clktrk
_______________________________________________
Samtools-help mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/samtools-help