Hi James,
I was calling SNPs at repetitive regions recently and found that pileup
information was very useful to figure out for what I needed. So I think it is
great to fix this issue.
And can we have both the solution 1 and 2 so people can select? I like to use
the second solution for Illumina.
> 1) One solution would be to create bam_pileup2_t (I assume that's what
> the mysterious 1 suffix appended to everything was meant to deal with)
> that has extra data, along with a duplicate set of functions to
> manipulate it, but it's a bit messy.
>
> 2) Another solution is to simply accept the slowness and recompute the
> cached cigar index every time we need it in samtools. This is only
> at an insertion, so it's not going to be every call. It would give us
> fixed output, but at the cost of speed.
best,
Hideo
On 17 Feb 2016, at 12:52, James Bonfield wrote:
> Sorry this is long, so I'll get to the question upfront.
>
> Do people actually CARE about the various breakages in the string
> pileup format? If so read on, if not then fine.
>
>
> Consider an alignment like this:
>
> Pos 12345 6789
> Ref AAAAA TTTAA
> Seq1 AAAAACCTTTAA (cigar 5M 2I 5M)
> Seq2 AAAAA ***AA (cigar 5M 3D 2M)
> Seq3 AAAAACC***AA (cigar 5M 2I 3D 2M)
>
> At pos 5 pileup will format for sq1 a string of A+2CC indicating the 2
> bp insertion. Seq2 will show A-3TTT indicating the 3bp deletion.
> Positions 6, 7 and 8 will have "*" present as the base for seq2.
>
> All good so far, but what about seq3? It currently shows A+2CC.
> Shouldn't it be A+2CC-3TTT to indicate the deletion following the
> insertion? The next row in the pileup output is pos 6, which is
> already too late, and contrast seq2 to seq3 which both have 3 bases
> deleted at pos 6-8, yet only one of them admits what it is.
>
> The pileup format explains +N<base> and -N<base> notations and nowhere
> does it exclude having both present at the same point, so I believe
> this to be a bug.
>
> However, how to fix it? This is the hard part. pileup_seq() in
> bam_plcmd.c generates these +N<base> and -N<base> strings, along with
> other markers for ref-skip (cigar "N"). It does so by looking at the
> p->indel and p->isdel variables. p->indel is the size of the indel;
> positive for insertion and negative for deletion. This fundamentally
> makes it impossible to report insertion followed by deletion as there
> simply isn't enough information.
>
> Ideally after an insertion we'd check the next cigar op to see if it's
> a deletion so we can print that up too, as we know the next loop
> around bam_mplp_auto() will have moved onto the deletion site. So we
> could just check the cigar opcodes to find out what is next. It's
> possible to convert a position P in the sequence to the N-th location
> in the K-th cigar op, but it requires scanning the entire cigar string
> each time to do that conversion. The original pileup implementation
> did this, but it was horribly slow on long sequences with complex
> cigar strings (eg pacbio). It now caches this data in a cstate_t
> struct:
>
> typedef struct {
> int k, x, y, end;
> } cstate_t;
>
> This in turn is held in lbnode_t:
>
> typedef struct __linkbuf_t {
> bam1_t b;
> int32_t beg, end;
> cstate_t s;
> struct __linkbuf_t *next;
> } lbnode_t;
>
> Which ultimately is linked to in the pileup iterator:
>
> struct __bam_plp_t {
> mempool_t *mp;
> lbnode_t *head, *tail;
> int32_t tid, pos, max_tid, max_pos;
> int is_eof, max_plp, error, maxcnt;
> uint64_t id;
> bam_pileup1_t *plp;
> // for the "auto" interface only
> bam1_t *b;
> bam_plp_auto_f func;
> void *data;
> olap_hash_t *overlaps;
> };
>
> "bam_pileup1_t *plp" here is the only exposed part of the interface
> and it is this which is given to us after each iterator call. Ideally
> bam_pileup1_t would contain cstate_t so that functions iterating over
> the pileup could access information on the local cigar operations.
> This would also fix this example:
>
> Ref AAAAA TTTAA
> Seq1 AAAAACGTTTAA (cigar 5M 2I 5M)
> Seq2 AAAAAC*TTTAA (cigar 5M 1I 1P 5M)
> Seq3 AAAAA*GTTTAA (cigar 5M 1P 1I 5M)
>
> We'd want a pileup string at pos 5 to be (spaced out for clarity):
> A+2CG A+2C* A+2*G
>
> Instead we get the less useful:
>
> A+2CG A+1C A+1G
>
> Again, it's impossible (or at least highly inefficient) to fix this
> and add support for the P operator as the pileup callback simply has
> no cigar context to go on.
>
> Changing bam_pileup1_t though would break ABI compatibility as it is
> commonly supplied in arrays. Indeed I have an htslib pull request
> (https://github.com/samtools/htslib/pull/157) which implements support
> for P operator by extending bam_pileup1_t, stalled due to
> compatibility.
>
> One solution would be to create bam_pileup2_t (I assume that's what
> the mysterious 1 suffix appended to everything was meant to deal with)
> that has extra data, along with a duplicate set of functions to
> manipulate it, but it's a bit messy.
>
> Another solution is to simply accept the slowness and recompute the
> cached cigar index every time we need it in samtools. This is only
> at an insertion, so it's not going to be every call. It would give us
> fixed output, but at the cost of speed.
>
> The final solution, employed so far, is just to say "bof!" and ignore
> the problems. By now I'm sure everyone who is parsing the old pileup
> format instead of using, say, gvcf will have dealt with all these
> breakages anyway? Indeed maybe you consider the current bizareness to
> be "the format" and therefore any fix is infact a bug. Ie go and fix
> something real! :-)
>
> Any views?
>
> James
>
> --
> James Bonfield ([email protected]) | Hora aderat briligi. Nunc et Slythia Tova
> | Plurima gyrabant gymbolitare vabo;
> A Staden Package developer: | Et Borogovorum mimzebant undique formae,
> https://sf.net/projects/staden/ | Momiferique omnes exgrabure Rathi.
>
>
> --
> The Wellcome Trust Sanger Institute is operated by Genome Research
> Limited, a charity registered in England with number 1021457 and a
> company registered in England with number 2742969, whose registered
> office is 215 Euston Road, London, NW1 2BE.
>
> ------------------------------------------------------------------------------
> Site24x7 APM Insight: Get Deep Visibility into Application Performance
> APM + Mobile APM + RUM: Monitor 3 App instances at just $35/Month
> Monitor end-to-end web transactions and take corrective actions now
> Troubleshoot faster and improve end-user experience. Signup Now!
> http://pubads.g.doubleclick.net/gampad/clk?id=272487151&iu=/4140
> _______________________________________________
> Samtools-help mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/samtools-help
--
The Wellcome Trust Sanger Institute is operated by Genome Research
Limited, a charity registered in England with number 1021457 and a
company registered in England with number 2742969, whose registered
office is 215 Euston Road, London, NW1 2BE.
------------------------------------------------------------------------------
Site24x7 APM Insight: Get Deep Visibility into Application Performance
APM + Mobile APM + RUM: Monitor 3 App instances at just $35/Month
Monitor end-to-end web transactions and take corrective actions now
Troubleshoot faster and improve end-user experience. Signup Now!
http://pubads.g.doubleclick.net/gampad/clk?id=272487151&iu=/4140
_______________________________________________
Samtools-help mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/samtools-help