Hi,
I am a bioinformaticist for the Henry Doorly Zoo Genetics Department in Omaha,
NE. Currently, we are doing a variety of genomics studies using NGS data
sequenced from different lemur species of Madagascar. I had some questions
regarding SAMtools, more specifically the consensus tool in the new
bcftools/1.2-4-g1fedb8b.
For a particular study using partial genomic NGS data, we are trying to develop
consensus gene CDS sequences across individual samples for further downstream
analysis. For a particular gene cds sequence, I have gathered the reads from
the sample that have a high blast alignment score to the reference cds, align
those reads to the reference using bwa mem, and then call the variants and
create a consensus sequence using samtools/bcftools. Using "bcftools
consensus", I seem to get a consensus sequence just fine. However, after
viewing the bam file using IGV, I made some observations:
Any positions that had variants in the reads that differed from the reference
position had the variant in the consensus sequence and not the reference base,
which was good. However, the variant was being used in the consensus even if
the majority of the reads aligned to that position had the same base as the
reference and only a few reads had that variant (95% including reference had A,
4% had T, 1% had C, T is used in the consensus). It seems to be creating the
consensus position if there is any variant in any of the reads, based on the
highest amount of found at the position in the reads regardless if it is only a
small percent of reads that have this variant.
With that being said, is there a way around this issue given that a small
number of variants probably means a sequencing error on a few reads and not an
actual SNP found in the sample? Another observation I made is using the
bcftools consensus -i option for showing IUPAC ambiguity. It seems to give the
IUPAC symbol based on the greatest amount of a certain variant found in the
reads position and the actual reference position (60% of reads had A, 40% had
T, reference had C, consensus is called M for A or C). Is there a possibility
of creating an IUPAC symbol from just the read variants without influence of
the reference position (52% of reads has A, 48% has T, reference has C,
consensus is W for A or T)?
Any suggestions you may have would be greatly appreciated, only when it is
convenient for you of course. Using bwa and samtools, I have about 12 command
lines being used for this process which I could provide if it helps with this
question, just let me know and I could send those to you in another email.
Thank you,
Julian Egger
HDZ Genetics Department
[email protected]
------------------------------------------------------------------------------
BPM Camp - Free Virtual Workshop May 6th at 10am PDT/1PM EDT
Develop your own process in accordance with the BPMN 2 standard
Learn Process modeling best practices with Bonita BPM through live exercises
http://www.bonitasoft.com/be-part-of-it/events/bpm-camp-virtual- event?utm_
source=Sourceforge_BPM_Camp_5_6_15&utm_medium=email&utm_campaign=VA_SF
_______________________________________________
Samtools-help mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/samtools-help