Re: [Tutor] vcf_files and strings
http://www.1000genomes.org/node/101

From: Alan Gauld
To: tutor@python.org
Sent: Tuesday, October 11, 2011 1:52 PM
Subject: Re: [Tutor] vcf_files and strings

On 11/10/11 18:16, Hs Hs wrote:
> VCF - Variant Call Format
> ...
> Nothing special except that a geneticist or any biologist will understand
> this - nothing special!

The problem is that this list, being for beginners to Python, is a bit short on Geneticists and Biologists! :-)

So you need to explain your problem in general terms that the rest of us can make sense of, or else rely on the few who might understand being available/willing to respond.

--
Alan G
Author of the Learn to Program web site
http://www.alan-g.me.uk/
___
Tutor maillist - Tutor@python.org
To unsubscribe or change subscription options:
http://mail.python.org/mailman/listinfo/tutor
Re: [Tutor] vcf_files and strings
On 11/10/11 18:16, Hs Hs wrote:
> VCF - Variant Call Format
> ...
> Nothing special except that a geneticist or any biologist will understand
> this - nothing special!

The problem is that this list, being for beginners to Python, is a bit short on Geneticists and Biologists! :-)

So you need to explain your problem in general terms that the rest of us can make sense of, or else rely on the few who might understand being available/willing to respond.

--
Alan G
Author of the Learn to Program web site
http://www.alan-g.me.uk/
Re: [Tutor] vcf_files and strings
VCF - Variant Call Format

VCF files are nothing special but tab-delimited files describing genetic mutations, frequencies and other base information (bases here mean ATGC pertaining to DNA). These files are generated by a variety of genome sequence data analysis pipelines. MIT and Haplotype Mapping Project consortium developed this format.

Nothing special except that a geneticist or any biologist will understand this - nothing special!

hth
cheers

From: Steven D'Aprano
To: tutor@python.org
Sent: Sunday, October 9, 2011 10:04 PM
Subject: Re: [Tutor] vcf_files and strings

Anna Olofsson wrote:
> Hi,
>
> I'm a beginner at Python and would really like some help in how to
> extract information from a vcf file.
>
> The attached file consists of a lot of information on mutations, this
> one though is just 2 rows and 10 columns (the real one has a lot more
> rows).

What do you mean by a VCF file? On my computer, a VCF file is an electronic business card, which tries to open in an Address Book application (which obviously fails).

I don't know how to interpret the contents of your VCF file. After opening it in a hex editor, I can *guess* that it is a tab-separated file: each row takes one line, with the columns separated by tab characters. Column 7 appears to be a great big ugly blob with sub-fields separated by semi-colons. Am I right? Can you link us to a description of the vcf file format?

> I want to extract the mRNA ID only if the mutation is missense. These
> two rows (mutations) that I have attached happen to be missense but
> how do I say that I'm not interested in the mutations that are not
> missense (they might be e.g. synonymous). Also, how do I say that if
> a mutation starts with a # symbol I don't want to include it
> (sometimes the chr starts with a hash).

What chr? Where is the mutation? I'm afraid your questions are assuming familiarity with your data that we don't have.

> vcf file: 2 rows, 10 columns.
>
> col 0       col 1     col 2  col 3      col 4  col 5     col 6  col 7                         col 8          col 9
> chromosome  position  .      Reference  ALT    position  .      some statistics and the ID:s  not important  not important
>
> The important column is 7 where the ID is, i.e.
> refseq.functionalClass=missense. It's a missense mutation, so then I
> want to extract refseq.name=NM_003137492, or I want to extract only
> the ID, which in this case is NM_003137492.

This is what I *think* you want to do. Am I right?

* read each line of the file
* for each line, split on tabs
* extract column 7 (counting from 0) and split it on semi-colons
* inspect the refseq.functionalClass field
* if it matches, extract the ID from the refseq.name and store it in a list for later

(I have completely ignored the part about the #, because I don't understand what you mean by it.)

Here's some code to do it:

    ids = []
    f = open('vcf_file.vcf', 'r')
    for row in f:
        columns = row.split('\t')  # Split on tabs
        data = columns[7]  # Huge ugly blob of data
        values = data.split(';')  # Split on semi-colons
        if values[25] == "refseq.functionalClass=missense":
            name_chunk = values[28]  # looks like "refseq.name=..."
            a, b = name_chunk.split("=")
            if a != "refseq.name":
                raise ValueError('expected refseq.name but got %s' % a)
            ids.append(b)
    f.close()
    print(ids)

Does this help?

--
Steven
Re: [Tutor] vcf_files and strings
Anna Olofsson wrote:
> Hi,
>
> I'm a beginner at Python and would really like some help in how to
> extract information from a vcf file.
>
> The attached file consists of a lot of information on mutations, this
> one though is just 2 rows and 10 columns (the real one has a lot more
> rows).

What do you mean by a VCF file? On my computer, a VCF file is an electronic business card, which tries to open in an Address Book application (which obviously fails).

I don't know how to interpret the contents of your VCF file. After opening it in a hex editor, I can *guess* that it is a tab-separated file: each row takes one line, with the columns separated by tab characters. Column 7 appears to be a great big ugly blob with sub-fields separated by semi-colons. Am I right? Can you link us to a description of the vcf file format?

> I want to extract the mRNA ID only if the mutation is missense. These
> two rows (mutations) that I have attached happen to be missense but
> how do I say that I'm not interested in the mutations that are not
> missense (they might be e.g. synonymous). Also, how do I say that if
> a mutation starts with a # symbol I don't want to include it
> (sometimes the chr starts with a hash).

What chr? Where is the mutation? I'm afraid your questions are assuming familiarity with your data that we don't have.

> vcf file: 2 rows, 10 columns.
>
> col 0       col 1     col 2  col 3      col 4  col 5     col 6  col 7                         col 8          col 9
> chromosome  position  .      Reference  ALT    position  .      some statistics and the ID:s  not important  not important
>
> The important column is 7 where the ID is, i.e.
> refseq.functionalClass=missense. It's a missense mutation, so then I
> want to extract refseq.name=NM_003137492, or I want to extract only
> the ID, which in this case is NM_003137492.

This is what I *think* you want to do. Am I right?

* read each line of the file
* for each line, split on tabs
* extract column 7 (counting from 0) and split it on semi-colons
* inspect the refseq.functionalClass field
* if it matches, extract the ID from the refseq.name and store it in a list for later

(I have completely ignored the part about the #, because I don't understand what you mean by it.)

Here's some code to do it:

    ids = []
    f = open('vcf_file.vcf', 'r')
    for row in f:
        columns = row.split('\t')  # Split on tabs
        data = columns[7]  # Huge ugly blob of data
        values = data.split(';')  # Split on semi-colons
        if values[25] == "refseq.functionalClass=missense":
            name_chunk = values[28]  # looks like "refseq.name=..."
            a, b = name_chunk.split("=")
            if a != "refseq.name":
                raise ValueError('expected refseq.name but got %s' % a)
            ids.append(b)
    f.close()
    print(ids)

Does this help?

--
Steven
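A possible variation on the code above, offered as a sketch rather than a definitive answer: look the fields up by name instead of by position. In the two sample rows posted in this thread, refseq.functionalClass sits at index 25 of the semi-colon list in the first row but at index 21 in the second (the second row has no BaseQRankSum, DS, MQRankSum or ReadPosRankSum entries), so fixed positions would silently skip it. The helper names parse_info and missense_ids are made up for illustration, the rows are abbreviated, and the "#CHROM" header line and the "synonymous" row are hypothetical test data. Lines starting with "#" are skipped, which is one reading of the "#" question.

```python
def parse_info(info):
    """Turn 'key=value;flag;key=value' into a dict; bare flags map to True."""
    fields = {}
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        fields[key] = value if sep else True
    return fields


def missense_ids(lines):
    """Collect refseq.name for every data row whose class is missense."""
    ids = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and lines starting with a hash
        columns = line.rstrip("\n").split("\t")
        if len(columns) < 8:
            continue  # not a full data row
        info = parse_info(columns[7])
        if (info.get("refseq.functionalClass") == "missense"
                and "refseq.name" in info):
            ids.append(info["refseq.name"])
    return ids


# Abbreviated rows modelled on the sample data; the header line and the
# last row are hypothetical and only there to exercise the filters.
sample = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSAMPLE\n",
    "4\t69345\t.\tC\tT\t32\t.\tDP=327;DS;refseq.functionalClass=missense;"
    "refseq.name=NM_003137492;refseq.name2=PQ3D2\tGT\t1/1\n",
    "4\t87342\t.\tA\tG\t57.7\t.\tDP=2;refseq.functionalClass=missense;"
    "refseq.mrnaCoord=4509;refseq.name=NM_132768\tGT\t0/1\n",
    "4\t90000\t.\tG\tA\t40\t.\tDP=9;refseq.functionalClass=synonymous;"
    "refseq.name=NM_000000\tGT\t0/1\n",
]
print(missense_ids(sample))  # ['NM_003137492', 'NM_132768']
```

For a real file, the same function should accept an open file object, for example `with open('vcf_file.vcf') as f: ids = missense_ids(f)`.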
Re: [Tutor] vcf_files and strings
Hi, I still don't know how to make a loop that makes it work for all the mutations.

Best,
Anna

Date: Fri, 7 Oct 2011 13:17:07 -0700
From: ilhs...@yahoo.com
Subject: Re: [Tutor] vcf_files and strings
To: olofsson_anna...@hotmail.com; tutor@python.org

    if col[x] == 'missense':
        print col[withRefSeqID]

hth

From: Anna Olofsson
To: tutor@python.org
Sent: Friday, October 7, 2011 12:12 PM
Subject: [Tutor] vcf_files and strings

Hi,

I'm a beginner at Python and would really appreciate some help in how to extract information from a vcf file.

The attached file consists of a lot of information on mutations, this one though is just 2 rows and 10 columns (the real one has a lot more rows).

I want to extract the mRNA ID only if the mutation is missense. These two rows (mutations) that I have attached happen to be missense but how do I say that I'm not interested in the mutations that are not missense (they might be e.g. synonymous). Also, how do I say that if a mutation starts with a # symbol I don't want to include it (sometimes the chr starts with a hash).

vcf file: 2 rows, 10 columns.

col 0       col 1     col 2  col 3      col 4  col 5     col 6  col 7                         col 8          col 9
chromosome  position  .      Reference  ALT    position  .      some statistics and the ID:s  not important  not important

The important column is 7 where the ID is, i.e. refseq.functionalClass=missense. It's a missense mutation, so then I want to extract refseq.name=NM_003137492, or I want to extract only the ID, which in this case is NM_003137492.

Then I want to do exactly the same thing for all the other mutations, but only for the missense mutations not the other ones. How do I accomplish that? Where do I start?

Best,
Anna
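One way the loop over all mutations might look, as a rough sketch using the standard csv module to do the tab splitting. The function name missense_names is invented here, the text is read from an in-memory string so the example runs on its own, the rows are shortened versions of the two posted ones, and the "#CHROM" header line is an assumption about what a hash line might look like.

```python
import csv
import io


def missense_names(handle):
    """Loop over every tab-separated row and keep the missense IDs."""
    names = []
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for col in reader:
        if len(col) < 8 or col[0].startswith("#"):
            continue  # skip short rows and rows whose chr starts with a hash
        items = col[7].split(";")
        if "refseq.functionalClass=missense" in items:
            for item in items:
                if item.startswith("refseq.name="):
                    names.append(item[len("refseq.name="):])
    return names


# Shortened stand-ins for the posted rows; the header line is assumed.
text = (
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "4\t69345\t.\tC\tT\t32\t.\trefseq.functionalClass=missense;"
    "refseq.name=NM_003137492;refseq.name2=PQ3D2\n"
    "4\t87342\t.\tA\tG\t57.7\t.\trefseq.functionalClass=missense;"
    "refseq.name=NM_132768;refseq.name2=PQR321\n"
)
print(missense_names(io.StringIO(text)))  # ['NM_003137492', 'NM_132768']
```

The `continue` statement is what expresses "I'm not interested in this row": it jumps straight to the next row of the loop.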
Re: [Tutor] vcf_files and strings
[I have already answered your first post but it seems you missed it]

On 2011-10-07 18:12, Anna Olofsson wrote:
> Hi,
>
> I'm a beginner at Python and would really appreciate some help in how to
> extract information from a vcf file.

What does "beginner" mean? Do you have experience in other languages? Do you understand how different datatypes work (strings, integers, lists, dictionaries, ...)? Do you know the basic programming concepts (for-loops, if-then-else conditions, ...)?

> The important column is 7 where the ID is, i.e.
> refseq.functionalClass=missense. It's a missense mutation, so then I
> want to extract refseq.name=NM_003137492, or I want to extract only
> the ID, which in this case is NM_003137492.
>
> Then I want to do exactly the same thing for all the other mutations,
> but only for the missense mutations not the other ones. How do I
> accomplish that? Where do I start?

Can you show us some code snippets of your attempts?

Bye, Andreas
Re: [Tutor] vcf_files and strings
    if col[x] == 'missense':
        print col[withRefSeqID]

hth

From: Anna Olofsson
To: tutor@python.org
Sent: Friday, October 7, 2011 12:12 PM
Subject: [Tutor] vcf_files and strings

Hi,

I'm a beginner at Python and would really appreciate some help in how to extract information from a vcf file.

The attached file consists of a lot of information on mutations, this one though is just 2 rows and 10 columns (the real one has a lot more rows).

I want to extract the mRNA ID only if the mutation is missense. These two rows (mutations) that I have attached happen to be missense but how do I say that I'm not interested in the mutations that are not missense (they might be e.g. synonymous). Also, how do I say that if a mutation starts with a # symbol I don't want to include it (sometimes the chr starts with a hash).

vcf file: 2 rows, 10 columns.

col 0       col 1     col 2  col 3      col 4  col 5     col 6  col 7                         col 8          col 9
chromosome  position  .      Reference  ALT    position  .      some statistics and the ID:s  not important  not important

The important column is 7 where the ID is, i.e. refseq.functionalClass=missense. It's a missense mutation, so then I want to extract refseq.name=NM_003137492, or I want to extract only the ID, which in this case is NM_003137492.

Then I want to do exactly the same thing for all the other mutations, but only for the missense mutations not the other ones. How do I accomplish that? Where do I start?

Best,
Anna
[Tutor] vcf_files and strings
Hi,

I'm a beginner at Python and would really appreciate some help in how to extract information from a vcf file.

The attached file consists of a lot of information on mutations, this one though is just 2 rows and 10 columns (the real one has a lot more rows).

I want to extract the mRNA ID only if the mutation is missense. These two rows (mutations) that I have attached happen to be missense but how do I say that I'm not interested in the mutations that are not missense (they might be e.g. synonymous). Also, how do I say that if a mutation starts with a # symbol I don't want to include it (sometimes the chr starts with a hash).

vcf file: 2 rows, 10 columns.

col 0       col 1     col 2  col 3      col 4  col 5     col 6  col 7                         col 8          col 9
chromosome  position  .      Reference  ALT    position  .      some statistics and the ID:s  not important  not important

The important column is 7 where the ID is, i.e. refseq.functionalClass=missense. It's a missense mutation, so then I want to extract refseq.name=NM_003137492, or I want to extract only the ID, which in this case is NM_003137492.

Then I want to do exactly the same thing for all the other mutations, but only for the missense mutations not the other ones. How do I accomplish that? Where do I start?

Best,
Anna

4 69345 . C T 32 . 1kg.AD=0;1kg.AF=0.8865;1kg.AN=345;AC=1;AF=0.50;AN=2;BaseQRankSum=-6.432;DP=327;DS;Dels=0.00;FS=6.435;HRun=0;HaplotypeScore=5.0380;MQ=54.34;MQ0=13;MQRankSum=-5.3457;QD=2.65;ReadPosRankSum=-0.321;SB=-321.04;dbsnp.ID=rs43032118;dbsnp.dbSNPBuildID=112;hgid.AF=0.6754;refseq.changesAA=true;refseq.codingCoordStr=c.345C>T;refseq.codonCoord=234;refseq.functionalClass=missense;refseq.inCodingRegion=true;refseq.mrnaCoord=321;refseq.name=NM_003137492;refseq.name2=PQ3D2;refseq.positionType=CFD;refseq.proteinCoordStr=p.C132T;refseq.referenceAA=BRT;refseq.referenceCodon=DGF;refseq.spliceDist=321;refseq.transcriptStrand=+;refseq.variantAA=Ala;refseq.variantCodon=GCA GT:AD:DP:GQ:PL 1/1:134,32:765:99:56,1,4576
4 87342 . A G 57.7 . 1kg.AD=0;1kg.AF=1.0;1kg.AN=345;AC=2;AF=0.00;AN=4;DP=2;Dels=0.00;FS=0.000;HRun=2;HaplotypeScore=0.;MQ=29.00;MQ0=0;QD=54.97;SB=-52.65;dbsnp.ID=rs6732096;dbsnp.dbSNPBuildID=136;hgid.AF=0.9567;refseq.changesAA=true;refseq.codingCoordStr=c.3207A>G;refseq.codonCoord=349;refseq.functionalClass=missense;refseq.inCodingRegion=true;refseq.mrnaCoord=4509;refseq.name=NM_132768;refseq.name2=PQR321;refseq.positionType=CDS;refseq.proteinCoordStr=p.O125R;refseq.referenceAA=Trp;refseq.referenceCodon=ATG;refseq.spliceDist=-45;refseq.transcriptStrand=+;refseq.variantAA=Arg;refseq.variantCodon=CTT GT:AD:DP:GQ:PL 0/1:0,5:5:2.02:67,9,1
Re: [Tutor] vcf_files and strings
On 2011-10-05 21:29, Anna Olofsson wrote:
> vcf file: 2 rows, 10 columns.
>
> The important column is 7 where the ID is, i.e.
> refseq.functionalClass=missense. It's a missense mutation, so then I
> want to extract refseq.name=NM_003137492, or I want to extract only
> the ID, which in this case is NM_003137492.
>
> Then I want to do exactly the same thing for all the other mutations,
> but only for the missense mutations not the other ones. How do I
> accomplish that? Where do I start?

I would split the rows into the columns (analyze your file to find the separator), then look for "missense" in column 7 (counting from 0) in every row and, if found, use a regex to pull out the name/ID.

Are you able to code that yourself or do you need more hints?

Bye, Andreas
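A small sketch of the regex idea, assuming the column-7 string has already been split off from the row. The pattern names and the function missense_id are invented for this example, the input strings are cut-down versions of the posted data, and "synonymous" is only a guess at how a non-missense class might be spelled in a real file.

```python
import re

# Match the whole key so that refseq.name2=... is not picked up by mistake.
MISSENSE_RE = re.compile(r"(?:^|;)refseq\.functionalClass=missense(?:;|$)")
NAME_RE = re.compile(r"(?:^|;)refseq\.name=([^;\s]+)")


def missense_id(info):
    """Return the refseq.name value if the string is missense, else None."""
    if not MISSENSE_RE.search(info):
        return None
    match = NAME_RE.search(info)
    return match.group(1) if match else None


first = ("refseq.codonCoord=234;refseq.functionalClass=missense;"
         "refseq.inCodingRegion=true;refseq.mrnaCoord=321;"
         "refseq.name=NM_003137492;refseq.name2=PQ3D2")
other = "refseq.functionalClass=synonymous;refseq.name=NM_000000"

print(missense_id(first))  # NM_003137492
print(missense_id(other))  # None
```

Anchoring on `;` or the string ends keeps the match tied to complete `key=value` entries rather than to any substring that happens to contain "missense".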
[Tutor] vcf_files and strings
Hi,

I'm a beginner at Python and would really like some help in how to extract information from a vcf file.

The attached file consists of a lot of information on mutations, this one though is just 2 rows and 10 columns (the real one has a lot more rows).

I want to extract the mRNA ID only if the mutation is missense. These two rows (mutations) that I have attached happen to be missense but how do I say that I'm not interested in the mutations that are not missense (they might be e.g. synonymous). Also, how do I say that if a mutation starts with a # symbol I don't want to include it (sometimes the chr starts with a hash).

vcf file: 2 rows, 10 columns.

col 0       col 1     col 2  col 3      col 4  col 5     col 6  col 7                         col 8          col 9
chromosome  position  .      Reference  ALT    position  .      some statistics and the ID:s  not important  not important

The important column is 7 where the ID is, i.e. refseq.functionalClass=missense. It's a missense mutation, so then I want to extract refseq.name=NM_003137492, or I want to extract only the ID, which in this case is NM_003137492.

Then I want to do exactly the same thing for all the other mutations, but only for the missense mutations not the other ones. How do I accomplish that? Where do I start?

Best,
Anna

4 69345 . C T 32 . 1kg.AD=0;1kg.AF=0.8865;1kg.AN=345;AC=1;AF=0.50;AN=2;BaseQRankSum=-6.432;DP=327;DS;Dels=0.00;FS=6.435;HRun=0;HaplotypeScore=5.0380;MQ=54.34;MQ0=13;MQRankSum=-5.3457;QD=2.65;ReadPosRankSum=-0.321;SB=-321.04;dbsnp.ID=rs43032118;dbsnp.dbSNPBuildID=112;hgid.AF=0.6754;refseq.changesAA=true;refseq.codingCoordStr=c.345C>T;refseq.codonCoord=234;refseq.functionalClass=missense;refseq.inCodingRegion=true;refseq.mrnaCoord=321;refseq.name=NM_003137492;refseq.name2=PQ3D2;refseq.positionType=CFD;refseq.proteinCoordStr=p.C132T;refseq.referenceAA=BRT;refseq.referenceCodon=DGF;refseq.spliceDist=321;refseq.transcriptStrand=+;refseq.variantAA=Ala;refseq.variantCodon=GCA GT:AD:DP:GQ:PL 1/1:134,32:765:99:56,1,4576
4 87342 . A G 57.7 . 1kg.AD=0;1kg.AF=1.0;1kg.AN=345;AC=2;AF=0.00;AN=4;DP=2;Dels=0.00;FS=0.000;HRun=2;HaplotypeScore=0.;MQ=29.00;MQ0=0;QD=54.97;SB=-52.65;dbsnp.ID=rs6732096;dbsnp.dbSNPBuildID=136;hgid.AF=0.9567;refseq.changesAA=true;refseq.codingCoordStr=c.3207A>G;refseq.codonCoord=349;refseq.functionalClass=missense;refseq.inCodingRegion=true;refseq.mrnaCoord=4509;refseq.name=NM_132768;refseq.name2=PQR321;refseq.positionType=CDS;refseq.proteinCoordStr=p.O125R;refseq.referenceAA=Trp;refseq.referenceCodon=ATG;refseq.spliceDist=-45;refseq.transcriptStrand=+;refseq.variantAA=Arg;refseq.variantCodon=CTT GT:AD:DP:GQ:PL 0/1:0,5:5:2.02:67,9,1