Hello Everyone,
I came across a "transcript-based" VCF file, meaning the same variant can
appear multiple times, each time annotated with a different transcript. See
"File 1" below for an example. I am in the unfortunate situation of having
to intersect it with a second VCF ("File 2") and retain every record from
File 1 that has the same position and REF/ALT as a record in File 2
("Desired output").
Long shot: is that possible?
Thanks,
Thomas
File 1
##fileformat=VCFv4.2
##fileDate=20090805
##contig=<ID=20>
##INFO=<ID=TRANSCRIPT_ID,Number=1,Type=Integer,Description="ID of transcript">
##INFO=<ID=GENE_ID,Number=1,Type=Integer,Description="ID of gene associated with transcript">
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003
20 14370 rs6054257 G A 29 PASS TRANSCRIPT_ID=1;GENE_ID=1; GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.
20 14370 rs6054257 G A 29 PASS TRANSCRIPT_ID=2;GENE_ID=1; GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.
20 17330 . T A 3 q10 TRANSCRIPT_ID=1;GENE_ID=2; GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3 0/0:41:3
20 17330 . T A 3 q10 TRANSCRIPT_ID=2;GENE_ID=2; GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3 0/0:41:3
File 2
##fileformat=VCFv4.2
##fileDate=20090805
##contig=<ID=20>
##INFO=<ID=TRANSCRIPT_ID,Number=1,Type=Integer,Description="ID of transcript">
##INFO=<ID=GENE_ID,Number=1,Type=Integer,Description="ID of gene associated with transcript">
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003
20 14370 rs6054257 G A 29 PASS TRANSCRIPT_ID=1;GENE_ID=1; GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.
Desired output
##fileformat=VCFv4.2
##fileDate=20090805
##contig=<ID=20>
##INFO=<ID=TRANSCRIPT_ID,Number=1,Type=Integer,Description="ID of transcript">
##INFO=<ID=GENE_ID,Number=1,Type=Integer,Description="ID of gene associated with transcript">
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003
20 14370 rs6054257 G A 29 PASS TRANSCRIPT_ID=1;GENE_ID=1; GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.
20 14370 rs6054257 G A 29 PASS TRANSCRIPT_ID=2;GENE_ID=1; GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.
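
To make the matching rule concrete, here is a minimal Python sketch of the
intersection described above. It is an illustration under stated assumptions,
not a tested tool: the function names are hypothetical, the real files are
assumed to be tab-delimited (the examples pasted here show spaces), and ALT is
compared as a plain string, so multi-allelic records only match when the ALT
fields are written identically.

```python
def variant_key(line):
    """Return (CHROM, POS, REF, ALT) for one tab-delimited VCF data line."""
    fields = line.rstrip("\n").split("\t")
    return fields[0], fields[1], fields[3], fields[4]


def intersect_by_site(file1_lines, file2_lines):
    """Yield File 1's header lines, then every File 1 record whose
    (CHROM, POS, REF, ALT) also occurs in File 2.

    Duplicate records in File 1 (one per transcript) are all kept,
    because membership is tested per line rather than per unique site.
    """
    # Collect the sites present in File 2, skipping headers and blank lines.
    wanted = {
        variant_key(line)
        for line in file2_lines
        if line.strip() and not line.startswith("#")
    }
    for line in file1_lines:
        if line.startswith("#"):
            yield line  # pass File 1's header through unchanged
        elif line.strip() and variant_key(line) in wanted:
            yield line
```

It could be called with two open file handles, for example
sys.stdout.writelines(intersect_by_site(open("file1.vcf"), open("file2.vcf"))),
where the file names are placeholders.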
_______________________________________________
Samtools-help mailing list
https://lists.sourceforge.net/lists/listinfo/samtools-help