Hi James,
I was calling SNPs at repetitive regions recently and found that pileup 
information was very useful to figure out for what I needed. So I think it is 
great to fix this issue.
And can we have both the solution 1 and 2 so people can select? I like to use 
the second solution for Illumina.  
  
> 1) One solution would be to create bam_pileup2_t (I assume that's what
> the mysterious 1 suffix appended to everything was meant to deal with)
> that has extra data, along with a duplicate set of functions to
> manipulate it, but it's a bit messy.
> 
> 2) Another solution is to simply accept the slowness and recompute the
> cached cigar index every time we need it in samtools.  This is only
> at an insertion, so it's not going to be every call.  It would give us
> fixed output, but at the cost of speed.
best,
Hideo

On 17 Feb 2016, at 12:52, James Bonfield wrote:

> Sorry this is long, so I'll get to the question upfront.
> 
> Do people actually CARE about the various breakages in the string
> pileup format?  If so read on, if not then fine.
> 
> 
> Consider an alignment like this:
> 
> Pos  12345  6789
> Ref  AAAAA  TTTAA
> Seq1 AAAAACCTTTAA (cigar 5M 2I    5M)
> Seq2 AAAAA  ***AA (cigar 5M    3D 2M)
> Seq3 AAAAACC***AA (cigar 5M 2I 3D 2M)
> 
> At pos 5 pileup will format for sq1 a string of A+2CC indicating the 2
> bp insertion.  Seq2 will show A-3TTT indicating the 3bp deletion.
> Positions 6, 7 and 8 will have "*" present as the base for seq2.
> 
> All good so far, but what about seq3?  It currently shows A+2CC.
> Shouldn't it be A+2CC-3TTT to indicate the deletion following the
> insertion?  The next row in the pileup output is pos 6, which is
> already too late, and contrast seq2 to seq3 which both have 3 bases
> deleted at pos 6-8, yet only one of them admits what it is.
> 
> The pileup format explains +N<base> and -N<base> notations and nowhere
> does it exclude having both present at the same point, so I believe
> this to be a bug.
> 
> However, how to fix it?  This is the hard part.  pileup_seq() in
> bam_plcmd.c generates these +N<base> and -N<base> strings, along with
> other markers for ref-skip (cigar "N").  It does so by looking at the
> p->indel and p->isdel variables.  p->indel is the size of the indel;
> positive for insertion and negative for deletion.  This fundamentally
> makes it impossible to report insertion followed by deletion as there
> simply isn't enough information.
> 
> Ideally after an insertion we'd check the next cigar op to see if it's
> a deletion so we can print that up too, as we know the next loop
> around bam_mplp_auto() will have moved onto the deletion site.  So we
> could just check the cigar opcodes to find out what is next.  It's
> possible to convert a position P in the sequence to the N-th location
> in the K-th cigar op, but it requires scanning the entire cigar string
> each time to do that conversion.  The original pileup implementation
> did this, but it was horribly slow on long sequences with complex
> cigar strings (eg pacbio).  It now caches this data in a cstate_t
> struct:
> 
> typedef struct {
>    int k, x, y, end;
> } cstate_t;
> 
> This in turn is held in lbnode_t:
> 
> typedef struct __linkbuf_t {
>    bam1_t b;
>    int32_t beg, end;
>    cstate_t s;
>    struct __linkbuf_t *next;
> } lbnode_t;
> 
> Which ultimately is linked to in the pileup iterator:
> 
> struct __bam_plp_t {
>    mempool_t *mp;
>    lbnode_t *head, *tail;
>    int32_t tid, pos, max_tid, max_pos;
>    int is_eof, max_plp, error, maxcnt;
>    uint64_t id;
>    bam_pileup1_t *plp;
>    // for the "auto" interface only
>    bam1_t *b;
>    bam_plp_auto_f func;
>    void *data;
>    olap_hash_t *overlaps;
> };
> 
> "bam_pileup1_t *plp" here is the only exposed part of the interface
> and it is this which is given to us after each iterator call.  Ideally
> bam_pileup1_t would contain cstate_t so that functions iterating over
> the pileup could access information on the local cigar operations.
> This would also fix this example:
> 
> Ref  AAAAA  TTTAA
> Seq1 AAAAACGTTTAA (cigar 5M 2I    5M)
> Seq2 AAAAAC*TTTAA (cigar 5M 1I 1P 5M)
> Seq3 AAAAA*GTTTAA (cigar 5M 1P 1I 5M)
> 
> We'd want a pileup string at pos 5 to be (spaced out for clarity):
> A+2CG A+2C* A+2*G
> 
> Instead we get the less useful:
> 
> A+2CG A+1C A+1G
> 
> Again, it's impossible (or at least highly inefficient) to fix this
> and add support for the P operator as the pileup callback simply has
> no cigar context to go on.
> 
> Changing bam_pileup1_t though would break ABI compatibility as it is
> commonly supplied in arrays.  Indeed I have an htslib pull request
> (https://github.com/samtools/htslib/pull/157) which implements support
> for P operator by extending bam_pileup1_t, stalled due to
> compatibility.
> 
> One solution would be to create bam_pileup2_t (I assume that's what
> the mysterious 1 suffix appended to everything was meant to deal with)
> that has extra data, along with a duplicate set of functions to
> manipulate it, but it's a bit messy.
> 
> Another solution is to simply accept the slowness and recompute the
> cached cigar index every time we need it in samtools.  This is only
> at an insertion, so it's not going to be every call.  It would give us
> fixed output, but at the cost of speed.
> 
> The final solution, employed so far, is just to say "bof!" and ignore
> the problems.  By now I'm sure everyone who is parsing the old pileup
> format instead of using, say, gvcf will have dealt with all these
> breakages anyway?  Indeed maybe you consider the current bizareness to
> be "the format" and therefore any fix is infact a bug.  Ie go and fix
> something real! :-)
> 
> Any views?
> 
> James
> 
> -- 
> James Bonfield ([email protected]) | Hora aderat briligi. Nunc et Slythia Tova
>                                  | Plurima gyrabant gymbolitare vabo;
>  A Staden Package developer:     | Et Borogovorum mimzebant undique formae,
> https://sf.net/projects/staden/   | Momiferique omnes exgrabure Rathi. 
> 
> 
> -- 
> The Wellcome Trust Sanger Institute is operated by Genome Research 
> Limited, a charity registered in England with number 1021457 and a 
> company registered in England with number 2742969, whose registered 
> office is 215 Euston Road, London, NW1 2BE. 
> 
> ------------------------------------------------------------------------------
> Site24x7 APM Insight: Get Deep Visibility into Application Performance
> APM + Mobile APM + RUM: Monitor 3 App instances at just $35/Month
> Monitor end-to-end web transactions and take corrective actions now
> Troubleshoot faster and improve end-user experience. Signup Now!
> http://pubads.g.doubleclick.net/gampad/clk?id=272487151&iu=/4140
> _______________________________________________
> Samtools-help mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/samtools-help



-- 
 The Wellcome Trust Sanger Institute is operated by Genome Research 
 Limited, a charity registered in England with number 1021457 and a 
 company registered in England with number 2742969, whose registered 
 office is 215 Euston Road, London, NW1 2BE. 

------------------------------------------------------------------------------
Site24x7 APM Insight: Get Deep Visibility into Application Performance
APM + Mobile APM + RUM: Monitor 3 App instances at just $35/Month
Monitor end-to-end web transactions and take corrective actions now
Troubleshoot faster and improve end-user experience. Signup Now!
http://pubads.g.doubleclick.net/gampad/clk?id=272487151&iu=/4140
_______________________________________________
Samtools-help mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/samtools-help

Reply via email to