Hi Jonathan,
new versions of mpileup have the -t DPR,INFO/DPR option which outputs
depth for each observed base.
Cheers,
petr
On Wed, 2014-10-01 at 12:49 -0400, jbr950 wrote:
> Hi Petr,
> Thanks for your time thus far. I'm trying to query some bam files
> at specified locations to have a look at the allele frequencies. The
> motivation is to 'validate' variants called in one dataset, in a novel
> dataset using a more simple approach. Ideally, I'd get the raw counts
> of the number of reads mapping to A,T,C,G, and/or indel (if detected).
>
>
> I tried mpileup two ways (with -cg in bcftools view to call variants,
> and without), and neither seems to provide me the data I would like
> in the corresponding output:
>
> samtools mpileup -BQ0 -d10000000 -l $variantBed -uf ${ref} $bamFile |
> $bcftools view - > no_cg.vcf
>
> samtools mpileup -BQ0 -d10000000 -l $variantBed -uf ${ref} $bamFile |
> $bcftools view -cg - > cg.vcf
>
>
>
> Then I check the context around 2 known variants:
>
>
> Variant 1
> grep -C 1 105213319 cg.vcf
>
> 14 105213318 . A . 117 .
> DP=29;AF1=0;AC1=0;DP4=14,15,0,0;MQ=20;FQ=-114 PL 0
> 14 105213319 . G C 25 .
> DP=28;VDB=0.0398;AF1=0.5;AC1=1;DP4=10,9,4,5;MQ=20;FQ=28;PV4=1,1,1,1
> GT:PL:GQ 0/1:55,0,128:58
> 14 105213320 . T . 114 .
> DP=28;AF1=0;AC1=0;DP4=14,14,0,0;MQ=20;FQ=-111 PL 0
>
>
> Here, the variant was called and the DP4 field tells me the number of
> reference and non-reference (presumably "C") alleles. I am assuming
> that there are 0 T's and A's.
>
>
>
>
> grep -C 1 105213319 no_cg.vcf
> 14 105213318 . A X 0 .
> DP=29;I16=14,15,0,0,979,34421,0,0,580,11600,0,0,511,11223,0,0 PL
> 0,87,198
> 14 105213319 . G C,T,X 0 .
> DP=28;I16=10,9,4,5,645,23047,338,12742,380,7600,180,3600,339,7223,177,4179;VDB=0.0398
> PL 55,0,128,96,140,195,112,152,198,201
> 14 105213320 . T X 0 .
> DP=28;I16=14,14,0,0,924,32326,0,0,560,11200,0,0,520,11560,0,0 PL
> 0,84,195
>
>
> Here, I see that that the non-ref alleles consist of C's ant T's,
> although I don't see any way to determine the relative proportion or
> absolute number of each.
>
>
>
>
>
> Trying Variant 2:
> grep -C 1 154574018 cg.vcf
> 1 154574017 . T . 283 .
> DP=274;VDB=0.0365;AF1=0;AC1=0;DP4=146,128,0,0;MQ=20;FQ=-282 PL
> 0
> 1 154574018 . C . 113 .
> DP=273;VDB=0.0365;AF1=0;AC1=0;DP4=110,100,31,32;MQ=20;FQ=-110;PV4=0.67,1,1,1
> PL 0
> 1 154574019 . T . 283 .
> DP=278;VDB=0.0365;AF1=0;AC1=0;DP4=142,136,0,0;MQ=20;FQ=-282 PL
> 0
>
>
> This variant was not called at all. I may be willing to "assume" that
> the 63 'non-ref' reads reported in the DP4 field correspond to my
> variant, but it would be nice to be sure.
>
>
>
>
>
> grep -C 1 154574018 no_cg.vcf
> 1 154574017 . T X 0 .
> DP=274;I16=146,128,0,0,10042,376816,0,0,5463,109209,0,0,5159,117099,0,0;VDB=0.0365
> PL 0,255,255
> 1 154574018 . C G,X 0 .
> DP=273;I16=110,100,31,32,7610,283160,2311,86963,4183,83609,1260,25200,3918,88780,1242,28290;VDB=0.0365
> PL 0,83,117,255,255,220
> 1 154574019 . T X 0 .
> DP=278;I16=142,136,0,0,10104,376490,0,0,5543,110809,0,0,5167,116999,0,0;VDB=0.0365
> PL 0,255,255
>
>
>
>
> Here, the C to G is reported, but I again I only have total read depth
> via the DP field, no breakdown by nucleotide (or am I missing it?).
>
>
> It's true I'm using Samtools 0.1.18. If upgrading to the latest would
> give me a better option, I'm happy to do that, but after scouring the
> available documentation, I don't see a way to do this.
>
>
> In other words, I'd like to inquire at many positions the proportion
> of A's, T's, C's, G's and/or detected indels. Mpileup (at least how
> I'm currently using it) gets me close but not all the way there.
>
>
>
>
> Did it make sense? I'd appreciate any suggestions.
>
>
>
> Best regards,
> Jonathan
>
>
>
> On Tue, Sep 23, 2014 at 11:47 AM, Petr Danecek <[email protected]>
> wrote:
> Hi Jonathan,
>
> On Tue, 2014-09-23 at 11:05 -0400, jbr950 wrote:
> > Hi Petr,
> > The workflow documentation looks great - this effort is
> undoubtedly
> > going to help a lot of researchers. Thanks for doing it.
>
> the credits for the workflow should not go to me, the
> htslib.org page
> has been put together by Martin Pollard, Thomas Keane and
> others.
> :-)
>
> > In this instance, I actually don't need statistics and the
> raw pileups
> > are fine. Which brings me back to the original issue. With
> the new
> > version, the vcf outputs from mpileup, using the same .bed
> as input,
> > are still of varying lengths. Do you know of a way to
> force mpileup
> > to report data at every position in the .bed file for every
> run? Or
> > alternatively, do you know why some positions report a DP=0
> and other
> > positions are omitted altogether?
>
> These DP=0 sites are emitted at places where there is an
> aligned read
> with an indel. The indel may not appear in the output when the
> mpileup
> machinery decides it is not significant enough.
>
> I hope this helps,
> Petr
>
>
> > Thanks again,
> > Jonathan
> >
> >
> >
> > On Tue, Sep 23, 2014 at 9:26 AM, Petr Danecek
> <[email protected]>
> > wrote:
> > There is a slowly growing documentation on the new
> website,
> > please check
> > the variant calling workflow
> > http://www.htslib.org/workflow/
> >
> > You need to use `bcftools call`, not `bcftools
> view`.
> >
> > petr
> >
> >
> > On Tue, 2014-09-23 at 09:25 -0400, jbr950 wrote:
> > > Good catch - I was trying to use an old version of
> bcftools
> > with the
> > > new version of samtools..
> > >
> > >
> > > So I changed that to point to the new version of
> bcftools,
> > and now:
> > >
> > >
> > >
> samtools-bcftools-htslib-1.0_x64-linux/bin/samtools mpileup
> > -BQ0
> > > -d10000000 -l $variantBed -uf ${ref} $bamFile >
> test.bcf
> > >
> > >
> > >
> > > So far so good..
> > >
> > >
> > > And then:
> > > cat test.bcf |
> > samtools-bcftools-htslib-1.0_x64-linux/bin/bcftools
> > > view -cg -
> > > Error: Could not parse --min-ac g
> > >
> > >
> > >
> > > On the other hand, calling: .
> > > cat test.bcf |
> > samtools-bcftools-htslib-1.0_x64-linux/bin/bcftools
> > > view-
> > >
> > >
> > >
> > > does create a nice output, so I'm inclined to let
> it go with
> > the
> > > defaults. I don't see equivalent arguments for the
> -c and -g
> > flags and
> > > I'm not particularly attached to the use of
> Bayesian
> > inference,
> > > although it would be good to understand what
> differences I
> > can expect
> > > between the old method (using -c) and the new
> (sans -c)?
> > >
> > >
> > > Thanks for your time and assistance!
> > >
> > >
> > > Jonathan
> > >
> > >
> > >
> > >
> > >
> > > On Tue, Sep 23, 2014 at 2:40 AM, Petr Danecek
> > <[email protected]>
> > > wrote:
> > > This does not look like output from
> samtools 1.0 -
> > it outputs
> > > VCFv4.2
> > >
> > > petr
> > >
> > > On Mon, 2014-09-22 at 14:53 -0400, jbr950
> wrote:
> > > > Hi Petr,
> > > > Thanks for your reply. I grabbed
> the
> > > > samtools-bcftools-htslib-1.0_x64-linux
> binary and
> > tried
> > > again.
> > > >
> > > >
> > > > When I'd used Samtools 0.1.18, the issue
> I emailed
> > about was
> > > that the
> > > > number of lines in output varied by .bam
> file, and
> > I didn't
> > > understand
> > > > why lines were being ommitted, and why
> not in a
> > common
> > > manner.
> > > >
> > > >
> > > > Using version 1.0 and the same command
> (I checked
> > on the
> > > older
> > > > samtools and it is producing output), I
> get no
> > output at
> > > all. Mpileup
> > > > writes a header and then stops. I have
> copied and
> > pasted
> > > the output.
> > > >
> > > >
> > > >
> > > > My .bed is 3 columns: chr start stop
> > > > with no header.
> > > >
> > > >
> > > > Any advice appreciated, thanks!
> > > >
> > > >
> > > > Jonathan
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > [mpileup] 1 samples in 1 input files
> > > > (mpileup) Max depth is above 1M.
> Potential memory
> > hog!
> > > > [bcf_sync] incorrect number of fields
> (0 != 5) at
> > 0:0
> > > > [afs] 0:0.000
> > > > ##fileformat=VCFv4.1
> > > >
> > ##INFO=<ID=DP,Number=1,Type=Integer,Description="Raw
> read
> > > depth">
> > > >
> > ##INFO=<ID=DP4,Number=4,Type=Integer,Description="#
> > > high-quality
> > > > ref-forward bases, ref-reverse,
> alt-forward and
> > alt-reverse
> > > bases">
> > > >
> > >
> >
> ##INFO=<ID=MQ,Number=1,Type=Integer,Description="Root-mean-square
> > > > mapping quality of covering reads">
> > > >
> > ##INFO=<ID=FQ,Number=1,Type=Float,Description="Phred
> > > probability of
> > > > all samples being the same">
> > > >
> > >
> >
> ##INFO=<ID=AF1,Number=1,Type=Float,Description="Max-likelihood
> > > > estimate of the first ALT allele
> frequency
> > (assuming HWE)">
> > > >
> > >
> >
> ##INFO=<ID=AC1,Number=1,Type=Float,Description="Max-likelihood
> > > > estimate of the first ALT allele count
> (no HWE
> > assumption)">
> > > >
> ##INFO=<ID=G3,Number=3,Type=Float,Description="ML
> > estimate
> > > of genotype
> > > > frequencies">
> > > >
> >
> ##INFO=<ID=HWE,Number=1,Type=Float,Description="Chi^2 based
> > > HWE test
> > > > P-value based on G3">
> > > >
> >
> ##INFO=<ID=CLR,Number=1,Type=Integer,Description="Log ratio
> > > of
> > > > genotype likelihoods with and without
> the
> > constraint">
> > > >
> > ##INFO=<ID=UGT,Number=1,Type=String,Description="The
> most
> > > probable
> > > > unconstrained genotype configuration in
> the trio">
> > > >
> > ##INFO=<ID=CGT,Number=1,Type=String,Description="The
> most
> > > probable
> > > > constrained genotype configuration in
> the trio">
> > > >
> >
> ##INFO=<ID=PV4,Number=4,Type=Float,Description="P-values for
> > > strand
> > > > bias, baseQ bias, mapQ bias and tail
> distance
> > bias">
> > > >
> >
> ##INFO=<ID=INDEL,Number=0,Type=Flag,Description="Indicates
> > > that the
> > > > variant is an INDEL.">
> > > >
> >
> ##INFO=<ID=PC2,Number=2,Type=Integer,Description="Phred
> > > probability of
> > > > the nonRef allele frequency in group1
> samples
> > being larger
> > > (,smaller)
> > > > than in group2.">
> > > >
> >
> ##INFO=<ID=PCHI2,Number=1,Type=Float,Description="Posterior
> > > weighted
> > > > chi^2 P-value for testing the
> association between
> > group1 and
> > > group2
> > > > samples.">
> > > >
> >
> ##INFO=<ID=QCHI2,Number=1,Type=Integer,Description="Phred
> > > scaled
> > > > PCHI2.">
> > > >
> ##INFO=<ID=PR,Number=1,Type=Integer,Description="#
> > > permutations
> > > > yielding a smaller PCHI2.">
> > > >
> >
> ##INFO=<ID=VDB,Number=1,Type=Float,Description="Variant
> > > Distance
> > > > Bias">
> > > >
> >
> ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
> > > >
> >
> ##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype
> > > Quality">
> > > >
> >
> ##FORMAT=<ID=GL,Number=3,Type=Float,Description="Likelihoods
> > > for
> > > > RR,RA,AA genotypes (R=ref,A=alt)">
> > > >
> > ##FORMAT=<ID=DP,Number=1,Type=Integer,Description="#
> > > high-quality
> > > > bases">
> > > >
> > >
> >
> ##FORMAT=<ID=SP,Number=1,Type=Integer,Description="Phred-scaled strand
> > > > bias P-value">
> > > >
> >
> ##FORMAT=<ID=PL,Number=G,Type=Integer,Description="List of
> > > > Phred-scaled genotype likelihoods">
> > > > #CHROM POS ID REF ALT
> QUAL
> > FILTER INFO
> > > FORMAT
> > > >
> > > >
> > > >
> > > > On Thu, Sep 4, 2014 at 5:20 AM, Petr
> Danecek
> > > <[email protected]> wrote:
> > > > Hi Jonathan,
> > > >
> > > > these are good questions. Could
> you please
> > try with
> > > the latest
> > > > release,
> > > > I am happy to answer any
> remaining issues
> > not solved
> > > by the
> > > > upgrade.
> > > >
> > > > Cheers,
> > > > Petr
> > > >
> > > > On Sun, 2014-08-17 at 01:34
> -0400, Jo R
> > wrote:
> > > > > Hello,
> > > > > I'm attemping to get
> nucleotide
> > frequency
> > > information at
> > > > a number
> > > > > of positions across a number
> of samples,
> > and am
> > > having
> > > > difficulty
> > > > > interpreting some output. Any
> insights
> > would be
> > > > appreciated.
> > > > >
> > > > >
> > > > > I'm running the following
> command:
> > > > >
> > > > >
> > > > > samtools mpileup -BQ0
> -d10000000 -l
> > > VariantBed.bed -uf
> > > > $refFile $bam
> > > > > | bcftools view -bcg - |
> bcftools view -
> > >
> > > > > ${sampleName}_validation.vcf
> > > > >
> > > > >
> > > > >
> > > > > I notice that this command
> creates an
> > output file
> > > with
> > > > > an unpredictable number of
> rows.
> > Running the
> > > command using
> > > > the same
> > > > > bed file on a set of
> different .bam
> > files creates
> > > a set of
> > > > output vcf
> > > > > files with a wide distribution
> in
> > numbers of rows.
> > > > >
> > > > >
> > > > > I presumed that the difference
> in row
> > numbers
> > > means that
> > > > some
> > > > > positions drop out on
> some .bam files
> > because
> > > those samples
> > > > lacked
> > > > > coverage where other samples
> had
> > coverage.
> > > > >
> > > > >
> > > > > If that's the case, though, I
> don't know
> > what to
> > > make of
> > > > lines like
> > > > > the following one:
> > > > >
> > > > >
> > > > > 1 2160881 . G
> .
> > 28.2 .
> > > > >
> DP=0;VDB=0.0003;;AC1=2;FQ=-30 PL
> > 0
> > > > >
> > > > >
> > > > >
> > > > > here, it looks like DP=0, but
> this
> > position still
> > > got
> > > > reported in the
> > > > > vcf output. I also don't see
> AC1 in the
> > legend
> > > for INFO
> > > > tags in the
> > > > > samtools specification page,
> so I don't
> > know what
> > > to make of
> > > > a value
> > > > > of 2.
> > > > >
> > > > >
> > > > > So, I am confused. Positions
> with a
> > positive
> > > value of DP
> > > > and DP4 make
> > > > > sense to me. But why are some
> positions
> > > completely ommitted
> > > > from the
> > > > > vcf output, and other
> positions
> > reporting a DP=0?
> > > > >
> > > > >
> > > > > Thanks for any advice.
> > > > >
> > > > >
> > > > > Best regards,
> > > > > Jonathan
> > > > >
> > > > >
> > > > >
> > > > >
> > > >
> > > > >
> > > >
> > >
> >
>
> ------------------------------------------------------------------------------
> > > > >
> > _______________________________________________
> > > > > Samtools-help mailing list
> > > > >
> [email protected]
> > > > >
> > >
> >
> https://lists.sourceforge.net/lists/listinfo/samtools-help
> > > >
> > > >
> > > >
> > > >
> > > > --
> > > > The Wellcome Trust Sanger
> Institute is
> > operated by
> > > Genome
> > > > Research
> > > > Limited, a charity registered
> in England
> > with
> > > number 1021457
> > > > and a
> > > > company registered in England
> with number
> > 2742969,
> > > whose
> > > > registered
> > > > office is 215 Euston Road,
> London, NW1
> > 2BE.
> > > >
> > > >
> > >
> > >
> > >
> > >
> > > --
> > > The Wellcome Trust Sanger Institute is
> operated by
> > Genome
> > > Research
> > > Limited, a charity registered in England
> with
> > number 1021457
> > > and a
> > > company registered in England with number
> 2742969,
> > whose
> > > registered
> > > office is 215 Euston Road, London, NW1
> 2BE.
> > >
> > >
> > >
> >
> >
> >
> >
> > --
> > The Wellcome Trust Sanger Institute is operated by
> Genome
> > Research
> > Limited, a charity registered in England with
> number 1021457
> > and a
> > company registered in England with number 2742969,
> whose
> > registered
> > office is 215 Euston Road, London, NW1 2BE.
> >
> >
> >
>
>
>
>
> --
> The Wellcome Trust Sanger Institute is operated by Genome
> Research
> Limited, a charity registered in England with number 1021457
> and a
> company registered in England with number 2742969, whose
> registered
> office is 215 Euston Road, London, NW1 2BE.
>
>
>
--
The Wellcome Trust Sanger Institute is operated by Genome Research
Limited, a charity registered in England with number 1021457 and a
company registered in England with number 2742969, whose registered
office is 215 Euston Road, London, NW1 2BE.
------------------------------------------------------------------------------
Meet PCI DSS 3.0 Compliance Requirements with EventLog Analyzer
Achieve PCI DSS 3.0 Compliant Status with Out-of-the-box PCI DSS Reports
Are you Audit-Ready for PCI DSS 3.0 Compliance? Download White paper
Comply to PCI DSS 3.0 Requirement 10 and 11.5 with EventLog Analyzer
http://pubads.g.doubleclick.net/gampad/clk?id=154622311&iu=/4140/ostg.clktrk
_______________________________________________
Samtools-help mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/samtools-help