Hi James,
thank you for a detailed analysis.
James Bonfield wrote:
> On Thu, Jul 28, 2016 at 10:53:53AM +0200, Martin MOKREJ? wrote:
>> $ samtools view dedup.realigned.calmd.bam | grep
>> HWI-xxxxx:xxx:xxxxxxxxx:1:1106:17285:41358
>> HWI-xxxxx:xxx:xxxxxxxxx:1:1106:17285:41358 145 1 9932 0
>> 74M 20 59443385 0
>> GGGGGGTGGGGGTCAGAGCACCACAGGGGGGGGGGGAAGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGAAACC
>>
>> ########################################################)))))))))))))))*)*
>>
>> BD:Z:MMMMSRQPPOOOONNNMMMMMMMMMMMMNNNNNNMMMMMMLLLLKKKKKKKKKKKKJJJJKKKKKKKMKFNNMM
>> PG:Z:MarkDuplicates RG:Z:some-sample1
>> BI:Z:FFFFCA@@AAAABCDEFGHIJJKKLLLLLLLLLMMMMMNNNNMMMMMMMLLLLLLLKKKKKKKKKKKLLHHGFF
>> MQ:i:96 PQ:i:187 UQ:i:38 XQ:i:0
>> ZQ:Z:@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@TUTRVSSUTWVUHB@H@Q
>> NM:i:70
>> MD:Z:0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0T4
>
> This looks correct to me.
>
>> My questions are:
>>
>> Is the message 'different NM for read' because I picked up a wrong
>> reference file than the one used initially by bwa aligner?
>
> Either that or a bug in the aligner itself. I've seen some very odd
> calculations of MD/NM from some aligners, eg ones that treat N as A
> for hash purposes, etc. I don't know how bwa handles N, but it may
I heard that as well from Heng Li on the bio-bwa-help list but cannot
find easily the particular email thread.
But manual page also states that:
NOTES ON SHORT-READ ALIGNMENT
Alignment Accuracy
...
When gapped alignment is disabled, BWA is expected to generate the
same alignment as Eland version 1, the Illumina alignment program. However, as
BWA change `N' in the database sequence to random nucleotides, hits to these
random sequences will also be counted. As a
consequence, BWA may mark a unique hit as a repeat, if the random
sequences happen to be identical to the sequences which should be unique in the
database.
By default, if the best hit is not highly repetitive (controlled by
-R), BWA also finds all hits contains one more mismatch; otherwise, BWA finds
all equally best hits only. Base quality is NOT considered in evaluating hits.
In the paired-end mode, BWA pairs all hits
it found. It further performs Smith-Waterman alignment for unmapped
reads to rescue reads with a high error rate, and for high-quality anomalous
pairs to fix potential alignment errors.
> have also changed over time so you'd need to check the @PG headers to
> see the release number too.
>
>> In my eyes
>> MD:Z:0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0N0T4
>> means that now the reference contains N's while it did not have them in
>> the past.
>
> Correct.
>
> $ samtools faidx hs37d5.fa 1:9932-10005
>> 1:9932-10005
> NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
> NNNNNNNNNTAACC
>
> calmd is claiming all bar the last 4 bases differ and should be
> N{69}T, with the last AACC matching.
>
>> If the @MD tag is correct it means the read should not be
>> considered as mapped to the chromosome 1 at all, and theoretically
>> could have a better match elsewhere. So, keeping it badly aligned in
>> this locus is wrong I think. So should I try to re-align all reads
>> mentioned on the 'different NM for read' line again? How? Remove
>> them from BAM, mapping newly while creating a new BAM file, then
>> merging the BAM files? Sound I would loose some of the optional
>> columns so the merged BAM file would not have all columns for all
>> reads. Is that a problem?
>
> Again, this depends on the aligner and options used. I'd agree though
> that it looks like a "courageous" alignment, but possibly it's the
> best (maybe even correct) alignment that places it the expected
> distance apart from a uniquely placed mate pair.
So I conclude bwa mem placed it there maybe thank to its other mate read,
and maybe also thanks to a randomly rewritten sequence at first.
Probably the randomly rewritten sequence should not or is not already
considered if bwa mem know the approximate position and sequence thank
to the other mate. I think these read counts just increase the number of
mapped, maybe even of properly mapped read in the final statistics.
This particular read received '74M' instead of '*'. Am I right that only
thanks to the MD: tag I could get rid of it? Based on that the other mate
could be made more accessible to a re-alignment by some other tool in turn
(probably no tool will try to fiddle with properly placed pair, right?).
>
> At least it got a very low mapping quality.
So do you think generating MD: tag is necessary to get rid of these reads
deemed to be aligned, or do you think the low mapping quality is enough?
>
> Realigning is a little messy. You'd have to come up with a way to
> filter out the specific read pairs you want to realign (maybe even
> something as dumb as an egrep if there are just a few), align and sort
> them, and then merge your filtered and newly aligned files back again.
>
> (I don't have time to construct such a command line.)
Don't worry, but thank you for detailed answers.
Here is a simpler case which had one change (look for the second mate in the
pair, it contains one N):
[bam_fillmd1] different NM for read
'HWI-xxxxx:xxx:xxxxxxxxx:2:1101:1110:65038': 0 -> 1
$ samtools view mysample-PB.bam | grep HWI-xxxxx:xxx:xxxxxxxxx:2:1101:1110:65038
HWI-xxxxx:xxx:xxxxxxxxx:2:1101:1110:65038 99 chr17 19712705
60 74M = 19712712 81
GAAGTCCTGGAATGAAAATCATTACATAGTCCTGGGTCAACCTGCAAGAGAGGGAGAAGCTCCGGAGCTGACCA
<=9?4?@<<?;;7=;<<=7?7877>:88@7?@=>>@6?*=?@0@B6:B<C>B?B=C==CA1<>=8:CC==<=@7
PG:Z:MarkDuplicates RG:Z:mysample-PB NM:i:0 SM:i:0 MQ:i:0 PQ:i:0
UQ:i:0 XQ:i:0
HWI-xxxxx:xxx:xxxxxxxxx:2:1101:1110:65038 147 chr17 19712712
60 74M = 19712705 -81
TGGAATGAAAATCATTACATAGTCCTGGGNCAACCTGCAAGAGAGGGAGAAGCTCCGGAGCTGACCAGGCTGTG
8B>877A99;8=?9>78>88=?=BC7A<4!981?C6B?88@=A=@@><A9<AA=A=?@<@C7?5@><???4>7<
PG:Z:MarkDuplicates RG:Z:mysample-PB NM:i:0 SM:i:0 MQ:i:0 PQ:i:0
UQ:i:0 XQ:i:0
$
$ samtools view mysample-PB.realignedtogether.BQSR.namesorted.fixmate.bam |
grep HWI-xxxxx:xxx:xxxxxxxxx:2:1101:1110:65038
HWI-xxxxx:xxx:xxxxxxxxx:2:1101:1110:65038 99 chr17 19712705
60 74M = 19712712 81
GAAGTCCTGGAATGAAAATCATTACATAGTCCTGGGTCAACCTGCAAGAGAGGGAGAAGCTCCGGAGCTGACCA
>?A?9CACCG679F7<<B9E6976C>99E7ECCGGE:E-CEB0ED;;G;GBGFE?G?BFB1:G;8:HBBEAB<<
MC:Z:74M
BD:Z:MMNNSQQQQQPOOPPNFFOPPPNMPMPNOPOOOPPMPOPNNPOPPPNNNNMNOLNONNNQPOPNOQQSRRPPPP
PG:Z:MarkDuplicates RG:Z:mysample-PB
BI:Z:DDCFKMMLMMKKMMJKKKMMKLLINKLKNLMMKLLLIKIGJKIJKGFJGIFHGGDEBBDB@?>=><>>>BBEFC
NM:i:0 SM:i:0 PQ:i:0 UQ:i:0 XQ:i:0 MQ:i:60
HWI-xxxxx:xxx:xxxxxxxxx:2:1101:1110:65038 147 chr17 19712712
60 74M = 19712705 -81
TGGAATGAAAATCATTACATAGTCCTGGGNCAACCTGCAAGAGAGGGAGAAGCTCCGGAGCTGACCAGGCTGTG
7:=<73E>D?:<F8@39D::AE<EG3DA4!9:3GG4DF:;EAFACCABD8BDF=E:FEBDF7E5EHAED@2>6=
MC:Z:74M
BD:Z:PONNSSPHHPPQQPNOMPMNOPOPQPMMMMOOPPQPPONNNNNOMONNNNPPNNMLNMOPPONOOOOPRRMMMM
PG:Z:MarkDuplicates RG:Z:mysample-PB
BI:Z:CEED@<><<<=;?@??BDEAFGFHJGJIIIJIHKLILLKKMLMLNNNMNMMNNLNMNNMMOLNLNNLMMNCEDD
NM:i:0 SM:i:0 PQ:i:0 UQ:i:0 XQ:i:0 MQ:i:60
$
$ samtools view mysample-PB.realignedtogether.BQSR.namesorted.fixmate.calmd.bam
| grep HWI-xxxxx:xxx:xxxxxxxxx:2:1101:1110:65038
[W::bam_hdr_read] EOF marker is absent. The input is probably truncated.
HWI-xxxxx:xxx:xxxxxxxxx:2:1101:1110:65038 99 chr17 19712705
60 74M = 19712712 81
GAAGTCCTGGAATGAAAATCATTACATAGTCCTGGGTCAACCTGCAAGAGAGGGAGAAGCTCCGGAGCTGACCA
>?A?9CACCG679F7<<B9E6976C>99E7ECCGGE:E-CEB0ED;;G;GBGFE?G?BFB1:G;8:HBBEAB<<
MC:Z:74M
BD:Z:MMNNSQQQQQPOOPPNFFOPPPNMPMPNOPOOOPPMPOPNNPOPPPNNNNMNOLNONNNQPOPNOQQSRRPPPP
PG:Z:MarkDuplicates RG:Z:mysample-PB
BI:Z:DDCFKMMLMMKKMMJKKKMMKLLINKLKNLMMKLLLIKIGJKIJKGFJGIFHGGDEBBDB@?>=><>>>BBEFC
NM:i:0 SM:i:0 PQ:i:0 UQ:i:0 XQ:i:0 MQ:i:60
ZQ:Z:@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
MD:Z:74
HWI-xxxxx:xxx:xxxxxxxxx:2:1101:1110:65038 147 chr17 19712712
60 74M = 19712705 -81
TGGAATGAAAATCATTACATAGTCCTGGGNCAACCTGCAAGAGAGGGAGAAGCTCCGGAGCTGACCAGGCTGTG
7:=<73E>D?:<F8@39D::AE<EG3DA4!9:3GG4DF:;EAFACCABD8BDF=E:FEBDF7E5EHAED@2>6=
MC:Z:74M
BD:Z:PONNSSPHHPPQQPNOMPMNOPOPQPMMMMOOPPQPPONNNNNOMONNNNPPNNMLNMOPPONOOOOPRRMMMM
PG:Z:MarkDuplicates RG:Z:mysample-PB
BI:Z:CEED@<><<<=;?@??BDEAFGFHJGJIIIJIHKLILLKKMLMLNNNMNMMNNLNMNNMMOLNLNNLMMNCEDD
SM:i:0 PQ:i:0 UQ:i:0 XQ:i:0 MQ:i:60
ZQ:Z:@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
NM:i:1 MD:Z:29T44
$
So, there is one 'N' which was replaced by bwa during alignment randomly and
maybe it was incidentally an exact match (25% probab.).
Could samtools calmd apply the following logic for bwa-processed input?
Get positions of all N's in the read.
Do not complain about those positions which are based on N's.
Do report other positions.
OK, I know "samtools calmd" reports a total sum of the differences before and
after but quite likely there will be only a "0 -> 0" to be reported. So I will
get less warning on the screen, which is always good. ;)
bwa-0.7.13-r1126, samtools-1.3 (using htslib 1.3).
Thank you,
Martin
------------------------------------------------------------------------------
_______________________________________________
Samtools-help mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/samtools-help