Hi Jonathan,

new versions of mpileup have the -t DPR,INFO/DPR option which outputs
depth for each observed base.

Cheers,
petr


On Wed, 2014-10-01 at 12:49 -0400, jbr950 wrote: 
> Hi Petr,
>     Thanks for your time thus far.  I'm trying to query some bam files
> at specified locations to have a look at the allele frequencies.  The
> motivation is to 'validate' variants called in one dataset, in a novel
> dataset using a more simple approach.  Ideally, I'd get the raw counts
> of the number of reads mapping to A,T,C,G, and/or indel (if detected).
> 
> 
> I tried mpileup two ways (with -cg in bcftools view to call variants,
> and without), and  neither seems to provide me the data I would like
> in the corresponding output:
> 
> samtools mpileup  -BQ0 -d10000000 -l $variantBed -uf ${ref} $bamFile |
> $bcftools view - > no_cg.vcf
> 
> samtools mpileup  -BQ0 -d10000000 -l $variantBed -uf ${ref} $bamFile |
> $bcftools view -cg -  > cg.vcf
> 
> 
> 
> Then I check the context around 2 known variants:
> 
> 
> Variant 1 
> grep -C 1 105213319 cg.vcf
> 
> 14      105213318       .      A        .       117  .
> DP=29;AF1=0;AC1=0;DP4=14,15,0,0;MQ=20;FQ=-114   PL      0
> 14      105213319       .      G        C       25   .
> DP=28;VDB=0.0398;AF1=0.5;AC1=1;DP4=10,9,4,5;MQ=20;FQ=28;PV4=1,1,1,1
> GT:PL:GQ        0/1:55,0,128:58
> 14      105213320       .      T        .       114  .
> DP=28;AF1=0;AC1=0;DP4=14,14,0,0;MQ=20;FQ=-111   PL      0
> 
> 
> Here, the variant was called and the DP4 field tells me the number of
> reference and non-reference (presumably "C") alleles.  I am assuming
> that there are 0 T's and A's.
> 
> 
> 
> 
> grep -C 1 105213319 no_cg.vcf
> 14      105213318       .      A        X       0    .
> DP=29;I16=14,15,0,0,979,34421,0,0,580,11600,0,0,511,11223,0,0   PL
>    0,87,198
> 14      105213319       .      G        C,T,X   0    .
> DP=28;I16=10,9,4,5,645,23047,338,12742,380,7600,180,3600,339,7223,177,4179;VDB=0.0398
>    PL      55,0,128,96,140,195,112,152,198,201
> 14      105213320       .      T        X       0    .
> DP=28;I16=14,14,0,0,924,32326,0,0,560,11200,0,0,520,11560,0,0   PL
>    0,84,195
> 
> 
> Here, I see that that the non-ref alleles consist of C's ant T's,
> although I don't see any way to determine the relative proportion or
> absolute number of each.
> 
> 
> 
> 
> 
> Trying Variant 2:
>  grep -C 1 154574018 cg.vcf
> 1       154574017       .      T        .       283  .
> DP=274;VDB=0.0365;AF1=0;AC1=0;DP4=146,128,0,0;MQ=20;FQ=-282     PL
>    0
> 1       154574018       .      C        .       113  .
> DP=273;VDB=0.0365;AF1=0;AC1=0;DP4=110,100,31,32;MQ=20;FQ=-110;PV4=0.67,1,1,1  
>   PL      0
> 1       154574019       .      T        .       283  .
> DP=278;VDB=0.0365;AF1=0;AC1=0;DP4=142,136,0,0;MQ=20;FQ=-282     PL
>    0
> 
> 
> This variant was not called at all.  I may be willing to "assume" that
> the 63 'non-ref' reads reported in the DP4 field correspond to my
> variant, but it would be nice to be sure.
> 
> 
> 
> 
> 
> grep -C 1 154574018 no_cg.vcf
> 1       154574017       .      T        X       0    .
> DP=274;I16=146,128,0,0,10042,376816,0,0,5463,109209,0,0,5159,117099,0,0;VDB=0.0365
>       PL      0,255,255
> 1       154574018       .      C        G,X     0    .
> DP=273;I16=110,100,31,32,7610,283160,2311,86963,4183,83609,1260,25200,3918,88780,1242,28290;VDB=0.0365
>   PL      0,83,117,255,255,220
> 1       154574019       .      T        X       0    .
> DP=278;I16=142,136,0,0,10104,376490,0,0,5543,110809,0,0,5167,116999,0,0;VDB=0.0365
>       PL      0,255,255
> 
> 
> 
> 
> Here, the C to G is reported, but I again I only have total read depth
> via the DP field, no breakdown by nucleotide (or am I missing it?).
> 
> 
> It's true I'm using Samtools 0.1.18.  If upgrading to the latest would
> give me a better option, I'm happy to do that, but after scouring the
> available documentation, I don't see a way to do this. 
> 
> 
> In other words, I'd like to inquire at many positions the proportion
> of A's, T's, C's, G's and/or detected indels.  Mpileup (at least how
> I'm currently using it) gets me close but not all the way there.
> 
> 
> 
> 
> Did it make sense?    I'd appreciate any suggestions.
> 
> 
> 
> Best regards,
> Jonathan
> 
> 
> 
> On Tue, Sep 23, 2014 at 11:47 AM, Petr Danecek <[email protected]>
> wrote:
>         Hi Jonathan,
>         
>         On Tue, 2014-09-23 at 11:05 -0400, jbr950 wrote:
>         > Hi Petr,
>         >    The workflow documentation looks great - this effort is
>         undoubtedly
>         > going to help a lot of researchers.  Thanks for doing it.
>         
>         the credits for the workflow should not go to me, the
>         htslib.org page
>         has been put together by Martin Pollard, Thomas Keane and
>         others.
>         :-)
>         
>         > In this instance, I actually don't need statistics and the
>         raw pileups
>         > are fine.  Which brings me back to the original issue.  With
>         the new
>         > version, the vcf outputs from mpileup, using the same .bed
>         as input,
>         > are still of varying  lengths.  Do you know of a way to
>         force mpileup
>         > to report data at every position in the .bed file for every
>         run?  Or
>         > alternatively, do you know why some positions report a DP=0
>         and other
>         > positions are omitted altogether?
>         
>         These DP=0 sites are emitted at places where there is an
>         aligned read
>         with an indel. The indel may not appear in the output when the
>         mpileup
>         machinery decides it is not significant enough.
>         
>         I hope this helps,
>         Petr 
>         
>         
>         > Thanks again,
>         > Jonathan
>         >
>         >
>         >
>         > On Tue, Sep 23, 2014 at 9:26 AM, Petr Danecek
>         <[email protected]>
>         > wrote:
>         >         There is a slowly growing documentation on the new
>         website,
>         >         please check
>         >         the variant calling workflow
>         >         http://www.htslib.org/workflow/
>         >
>         >         You need to use `bcftools call`, not `bcftools
>         view`.
>         >
>         >         petr
>         >
>         >
>         >         On Tue, 2014-09-23 at 09:25 -0400, jbr950 wrote:
>         >         > Good catch - I was trying to use an old version of
>         bcftools
>         >         with the
>         >         > new version of samtools..
>         >         >
>         >         >
>         >         > So I changed that to point to the new version of
>         bcftools,
>         >         and now:
>         >         >
>         >         >
>         >         >
>         samtools-bcftools-htslib-1.0_x64-linux/bin/samtools  mpileup
>         >         -BQ0
>         >         > -d10000000 -l $variantBed -uf ${ref} $bamFile >
>         test.bcf
>         >         >
>         >         >
>         >         >
>         >         > So far so good..
>         >         >
>         >         >
>         >         > And then:
>         >         > cat test.bcf |
>         >         samtools-bcftools-htslib-1.0_x64-linux/bin/bcftools
>         >         > view -cg -
>         >         > Error: Could not parse --min-ac g
>         >         >
>         >         >
>         >         >
>         >         > On the other hand, calling: .
>         >         > cat test.bcf |
>         >         samtools-bcftools-htslib-1.0_x64-linux/bin/bcftools
>         >         > view-
>         >         >
>         >         >
>         >         >
>         >         > does create a nice output, so I'm inclined to let
>         it go with
>         >         the
>         >         > defaults. I don't see equivalent arguments for the
>         -c and -g
>         >         flags and
>         >         > I'm not particularly attached to the use of
>         Bayesian
>         >         inference,
>         >         > although it would be good to understand what
>         differences I
>         >         can expect
>         >         > between the old method (using -c) and the new
>         (sans -c)?
>         >         >
>         >         >
>         >         > Thanks for your time and assistance!
>         >         >
>         >         >
>         >         > Jonathan
>         >         >
>         >         >
>         >         >
>         >         >
>         >         >
>         >         > On Tue, Sep 23, 2014 at 2:40 AM, Petr Danecek
>         >         <[email protected]>
>         >         > wrote:
>         >         >         This does not look like output from
>         samtools 1.0 -
>         >         it outputs
>         >         >         VCFv4.2
>         >         >
>         >         >         petr
>         >         >
>         >         >         On Mon, 2014-09-22 at 14:53 -0400, jbr950
>         wrote:
>         >         >         > Hi Petr,
>         >         >         >     Thanks for your reply.  I grabbed
>         the
>         >         >         > samtools-bcftools-htslib-1.0_x64-linux
>         binary and
>         >         tried
>         >         >         again.
>         >         >         >
>         >         >         >
>         >         >         > When I'd used Samtools 0.1.18, the issue
>         I emailed
>         >         about was
>         >         >         that the
>         >         >         > number of lines in output varied by .bam
>         file, and
>         >         I didn't
>         >         >         understand
>         >         >         > why lines were being ommitted, and why
>         not in a
>         >         common
>         >         >         manner.
>         >         >         >
>         >         >         >
>         >         >         > Using version 1.0 and the same command
>         (I checked
>         >         on the
>         >         >         older
>         >         >         > samtools and it is producing output), I
>         get no
>         >         output at
>         >         >         all.  Mpileup
>         >         >         > writes a header and then stops.  I have
>         copied and
>         >         pasted
>         >         >         the output.
>         >         >         >
>         >         >         >
>         >         >         >
>         >         >         > My .bed is 3 columns: chr   start   stop
>         >         >         > with no header.
>         >         >         >
>         >         >         >
>         >         >         > Any advice appreciated, thanks!
>         >         >         >
>         >         >         >
>         >         >         > Jonathan
>         >         >         >
>         >         >         >
>         >         >         >
>         >         >         >
>         >         >         >
>         >         >         >
>         >         >         > [mpileup] 1 samples in 1 input files
>         >         >         > (mpileup) Max depth is above 1M.
>         Potential memory
>         >         hog!
>         >         >         > [bcf_sync] incorrect number of fields
>         (0 != 5) at
>         >         0:0
>         >         >         > [afs] 0:0.000
>         >         >         > ##fileformat=VCFv4.1
>         >         >         >
>         >         ##INFO=<ID=DP,Number=1,Type=Integer,Description="Raw
>         read
>         >         >         depth">
>         >         >         >
>         >         ##INFO=<ID=DP4,Number=4,Type=Integer,Description="#
>         >         >         high-quality
>         >         >         > ref-forward bases, ref-reverse,
>         alt-forward and
>         >         alt-reverse
>         >         >         bases">
>         >         >         >
>         >         >
>         >
>         ##INFO=<ID=MQ,Number=1,Type=Integer,Description="Root-mean-square
>         >         >         > mapping quality of covering reads">
>         >         >         >
>         >         ##INFO=<ID=FQ,Number=1,Type=Float,Description="Phred
>         >         >         probability of
>         >         >         > all samples being the same">
>         >         >         >
>         >         >
>         >
>         ##INFO=<ID=AF1,Number=1,Type=Float,Description="Max-likelihood
>         >         >         > estimate of the first ALT allele
>         frequency
>         >         (assuming HWE)">
>         >         >         >
>         >         >
>         >
>         ##INFO=<ID=AC1,Number=1,Type=Float,Description="Max-likelihood
>         >         >         > estimate of the first ALT allele count
>         (no HWE
>         >         assumption)">
>         >         >         >
>         ##INFO=<ID=G3,Number=3,Type=Float,Description="ML
>         >         estimate
>         >         >         of genotype
>         >         >         > frequencies">
>         >         >         >
>         >
>          ##INFO=<ID=HWE,Number=1,Type=Float,Description="Chi^2 based
>         >         >         HWE test
>         >         >         > P-value based on G3">
>         >         >         >
>         >
>          ##INFO=<ID=CLR,Number=1,Type=Integer,Description="Log ratio
>         >         >         of
>         >         >         > genotype likelihoods with and without
>         the
>         >         constraint">
>         >         >         >
>         >         ##INFO=<ID=UGT,Number=1,Type=String,Description="The
>         most
>         >         >         probable
>         >         >         > unconstrained genotype configuration in
>         the trio">
>         >         >         >
>         >         ##INFO=<ID=CGT,Number=1,Type=String,Description="The
>         most
>         >         >         probable
>         >         >         > constrained genotype configuration in
>         the trio">
>         >         >         >
>         >
>          ##INFO=<ID=PV4,Number=4,Type=Float,Description="P-values for
>         >         >         strand
>         >         >         > bias, baseQ bias, mapQ bias and tail
>         distance
>         >         bias">
>         >         >         >
>         >
>          ##INFO=<ID=INDEL,Number=0,Type=Flag,Description="Indicates
>         >         >         that the
>         >         >         > variant is an INDEL.">
>         >         >         >
>         >
>          ##INFO=<ID=PC2,Number=2,Type=Integer,Description="Phred
>         >         >         probability of
>         >         >         > the nonRef allele frequency in group1
>         samples
>         >         being larger
>         >         >         (,smaller)
>         >         >         > than in group2.">
>         >         >         >
>         >
>          ##INFO=<ID=PCHI2,Number=1,Type=Float,Description="Posterior
>         >         >         weighted
>         >         >         > chi^2 P-value for testing the
>         association between
>         >         group1 and
>         >         >         group2
>         >         >         > samples.">
>         >         >         >
>         >
>          ##INFO=<ID=QCHI2,Number=1,Type=Integer,Description="Phred
>         >         >         scaled
>         >         >         > PCHI2.">
>         >         >         >
>         ##INFO=<ID=PR,Number=1,Type=Integer,Description="#
>         >         >         permutations
>         >         >         > yielding a smaller PCHI2.">
>         >         >         >
>         >
>          ##INFO=<ID=VDB,Number=1,Type=Float,Description="Variant
>         >         >         Distance
>         >         >         > Bias">
>         >         >         >
>         >
>          ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
>         >         >         >
>         >
>          ##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype
>         >         >         Quality">
>         >         >         >
>         >
>          ##FORMAT=<ID=GL,Number=3,Type=Float,Description="Likelihoods
>         >         >         for
>         >         >         > RR,RA,AA genotypes (R=ref,A=alt)">
>         >         >         >
>         >         ##FORMAT=<ID=DP,Number=1,Type=Integer,Description="#
>         >         >         high-quality
>         >         >         > bases">
>         >         >         >
>         >         >
>         >
>         ##FORMAT=<ID=SP,Number=1,Type=Integer,Description="Phred-scaled strand
>         >         >         > bias P-value">
>         >         >         >
>         >
>          ##FORMAT=<ID=PL,Number=G,Type=Integer,Description="List of
>         >         >         > Phred-scaled genotype likelihoods">
>         >         >         > #CHROM  POS     ID      REF     ALT
>          QUAL
>         >         FILTER  INFO
>         >         >         FORMAT
>         >         >         >
>         >         >         >
>         >         >         >
>         >         >         > On Thu, Sep 4, 2014 at 5:20 AM, Petr
>         Danecek
>         >         >         <[email protected]> wrote:
>         >         >         >         Hi Jonathan,
>         >         >         >
>         >         >         >         these are good questions. Could
>         you please
>         >         try with
>         >         >         the latest
>         >         >         >         release,
>         >         >         >         I am happy to answer any
>         remaining issues
>         >         not solved
>         >         >         by the
>         >         >         >         upgrade.
>         >         >         >
>         >         >         >         Cheers,
>         >         >         >         Petr
>         >         >         >
>         >         >         >         On Sun, 2014-08-17 at 01:34
>         -0400, Jo R
>         >         wrote:
>         >         >         >         > Hello,
>         >         >         >         >     I'm attemping to get
>         nucleotide
>         >         frequency
>         >         >         information at
>         >         >         >         a number
>         >         >         >         > of positions across a number
>         of samples,
>         >         and am
>         >         >         having
>         >         >         >         difficulty
>         >         >         >         > interpreting some output.  Any
>         insights
>         >         would be
>         >         >         >         appreciated.
>         >         >         >         >
>         >         >         >         >
>         >         >         >         > I'm running the following
>         command:
>         >         >         >         >
>         >         >         >         >
>         >         >         >         > samtools mpileup  -BQ0
>         -d10000000 -l
>         >         >         VariantBed.bed -uf
>         >         >         >         $refFile $bam
>         >         >         >         > | bcftools view -bcg - |
>         bcftools view -
>         >         >
>         >         >         >         > ${sampleName}_validation.vcf
>         >         >         >         >
>         >         >         >         >
>         >         >         >         >
>         >         >         >         > I notice that this command
>         creates an
>         >         output file
>         >         >         with
>         >         >         >         > an unpredictable number of
>         rows.
>         >         Running the
>         >         >         command using
>         >         >         >         the same
>         >         >         >         > bed file on a set of
>         different .bam
>         >         files creates
>         >         >         a set of
>         >         >         >         output vcf
>         >         >         >         > files with a wide distribution
>         in
>         >         numbers of rows.
>         >         >         >         >
>         >         >         >         >
>         >         >         >         > I presumed that the difference
>         in row
>         >         numbers
>         >         >         means that
>         >         >         >         some
>         >         >         >         > positions drop out on
>         some .bam files
>         >         because
>         >         >         those samples
>         >         >         >         lacked
>         >         >         >         > coverage where other samples
>         had
>         >         coverage.
>         >         >         >         >
>         >         >         >         >
>         >         >         >         > If that's the case, though, I
>         don't know
>         >         what to
>         >         >         make of
>         >         >         >         lines like
>         >         >         >         > the following one:
>         >         >         >         >
>         >         >         >         >
>         >         >         >         > 1       2160881 .       G
>            .
>         >          28.2    .
>         >         >         >         >
>         DP=0;VDB=0.0003;;AC1=2;FQ=-30   PL
>         >         0
>         >         >         >         >
>         >         >         >         >
>         >         >         >         >
>         >         >         >         > here, it looks like DP=0, but
>         this
>         >         position still
>         >         >         got
>         >         >         >         reported in the
>         >         >         >         > vcf output.  I also don't see
>         AC1 in the
>         >         legend
>         >         >         for INFO
>         >         >         >         tags in the
>         >         >         >         > samtools specification page,
>         so I don't
>         >         know what
>         >         >         to make of
>         >         >         >         a value
>         >         >         >         > of 2.
>         >         >         >         >
>         >         >         >         >
>         >         >         >         > So, I am confused.  Positions
>         with a
>         >         positive
>         >         >         value of DP
>         >         >         >         and DP4 make
>         >         >         >         > sense to me.  But why are some
>         positions
>         >         >         completely ommitted
>         >         >         >         from the
>         >         >         >         > vcf output, and other
>         positions
>         >         reporting a DP=0?
>         >         >         >         >
>         >         >         >         >
>         >         >         >         > Thanks for any advice.
>         >         >         >         >
>         >         >         >         >
>         >         >         >         > Best regards,
>         >         >         >         > Jonathan
>         >         >         >         >
>         >         >         >         >
>         >         >         >         >
>         >         >         >         >
>         >         >         >
>         >         >         >         >
>         >         >         >
>         >         >
>         >
>          
> ------------------------------------------------------------------------------
>         >         >         >         >
>         >         _______________________________________________
>         >         >         >         > Samtools-help mailing list
>         >         >         >         >
>         [email protected]
>         >         >         >         >
>         >         >
>         >
>         https://lists.sourceforge.net/lists/listinfo/samtools-help
>         >         >         >
>         >         >         >
>         >         >         >
>         >         >         >
>         >         >         >         --
>         >         >         >          The Wellcome Trust Sanger
>         Institute is
>         >         operated by
>         >         >         Genome
>         >         >         >         Research
>         >         >         >          Limited, a charity registered
>         in England
>         >         with
>         >         >         number 1021457
>         >         >         >         and a
>         >         >         >          company registered in England
>         with number
>         >         2742969,
>         >         >         whose
>         >         >         >         registered
>         >         >         >          office is 215 Euston Road,
>         London, NW1
>         >         2BE.
>         >         >         >
>         >         >         >
>         >         >
>         >         >
>         >         >
>         >         >
>         >         >         --
>         >         >          The Wellcome Trust Sanger Institute is
>         operated by
>         >         Genome
>         >         >         Research
>         >         >          Limited, a charity registered in England
>         with
>         >         number 1021457
>         >         >         and a
>         >         >          company registered in England with number
>         2742969,
>         >         whose
>         >         >         registered
>         >         >          office is 215 Euston Road, London, NW1
>         2BE.
>         >         >
>         >         >
>         >         >
>         >
>         >
>         >
>         >
>         >         --
>         >          The Wellcome Trust Sanger Institute is operated by
>         Genome
>         >         Research
>         >          Limited, a charity registered in England with
>         number 1021457
>         >         and a
>         >          company registered in England with number 2742969,
>         whose
>         >         registered
>         >          office is 215 Euston Road, London, NW1 2BE.
>         >
>         >
>         >
>         
>         
>         
>         
>         --
>          The Wellcome Trust Sanger Institute is operated by Genome
>         Research
>          Limited, a charity registered in England with number 1021457
>         and a
>          company registered in England with number 2742969, whose
>         registered
>          office is 215 Euston Road, London, NW1 2BE.
>         
> 
> 




-- 
 The Wellcome Trust Sanger Institute is operated by Genome Research 
 Limited, a charity registered in England with number 1021457 and a 
 company registered in England with number 2742969, whose registered 
 office is 215 Euston Road, London, NW1 2BE. 

------------------------------------------------------------------------------
Meet PCI DSS 3.0 Compliance Requirements with EventLog Analyzer
Achieve PCI DSS 3.0 Compliant Status with Out-of-the-box PCI DSS Reports
Are you Audit-Ready for PCI DSS 3.0 Compliance? Download White paper
Comply to PCI DSS 3.0 Requirement 10 and 11.5 with EventLog Analyzer
http://pubads.g.doubleclick.net/gampad/clk?id=154622311&iu=/4140/ostg.clktrk
_______________________________________________
Samtools-help mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/samtools-help

Reply via email to