Re: [Tutor] vcf_files and strings

2011-10-11 Thread Hs Hs


VCF - Variant Call Format

VCF files are nothing special: they are tab-delimited files describing genetic 
mutations, frequencies and other base information (bases here means A, T, G, C 
pertaining to DNA). 
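
As a rough illustration of what "tab-delimited" means here, one data line can be split into its fixed columns as sketched below. The column names are the ones used in the VCF header line; the helper name split_vcf_line and the shortened sample line are made up for this sketch.

```python
# Names of the eight fixed VCF columns, in file order.
FIXED_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]

def split_vcf_line(line):
    """Map the fixed columns of one VCF data line to their names.

    Hypothetical helper: any extra columns (FORMAT, samples) are ignored.
    """
    fields = line.rstrip("\n").split("\t")
    return dict(zip(FIXED_COLUMNS, fields))

# Shortened, made-up sample line for illustration only.
sample = "4\t69345\t.\tC\tT\t32\t.\tAC=1;AF=0.50\n"
record = split_vcf_line(sample)
print(record["CHROM"], record["POS"], record["REF"], record["ALT"])  # 4 69345 C T
```

All values stay strings; converting POS or QUAL to numbers is left out to keep the sketch short.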

These files are generated by a variety of genome sequence data analysis 
pipelines.  MIT and the Haplotype Mapping Project consortium developed this format. 


Nothing special, except that a geneticist or any biologist will understand this - 
nothing special!

hth

cheers




From: Steven D'Aprano st...@pearwood.info
To: tutor@python.org
Sent: Sunday, October 9, 2011 10:04 PM
Subject: Re: [Tutor] vcf_files and strings

Anna Olofsson wrote:
 Hi,
 
 I'm a beginner at Python and would really like some help in how to
 extract information from a vcf file.
 
 The attached file consists of a lot of information on mutations, this
 one though is just 2 rows and 10 columns (the real one has a lot more
 rows).

What do you mean by a VCF file? On my computer, a VCF file is an electronic 
business card, which tries to open in an Address Book application (which 
obviously fails).

I don't know how to interpret the contents of your VCF file. After opening it 
in a hex editor, I can *guess* that it is a tab-separated file: each row takes 
one line, with the columns separated by tab characters. Column 7 appears to be 
a great big ugly blob with sub-fields separated by semi-colons. Am I right? Can 
you link us to a description of the vcf file format?



 I want to extract the mRNA ID only if the mutation is missense. These
 two rows (mutations) that I have attached happen to be missense, but
 how do I say that I'm not interested in the mutations that are not
 missense (they might be e.g. synonymous)? Also, how do I say that if
 a mutation starts with a # symbol I don't want to include it
 (sometimes the chr starts with a hash)?

What chr? Where is the mutation? I'm afraid your questions are assuming 
familiarity with your data that we don't have.


 vcf file: 2 rows, 10 columns.

 col 0   chromosome
 col 1   position
 col 2   .
 col 3   Reference
 col 4   ALT
 col 5   position
 col 6   .
 col 7   some statistics and the ID:s
 col 8   not important
 col 9   not important
 
 The important column is 7 where the ID is, i.e.
 refseq.functionalClass=missense. It's a missense mutation, so then I
 want to extract refseq.name=NM_003137492, or I want to extract only
 the ID, which in this case is NM_003137492.


This is what I *think* you want to do. Am I right?

* read each line of the file
* for each line, split on tabs
* extract the 7th column and split it on semi-colons
* inspect the refseq.functionalClass field
* if it matches, extract the ID from the refseq.name and store it in a list for 
later

(I have completely ignored the part about the #, because I don't understand 
what you mean by it.)


Here's some code to do it:

ids = []
f = open('vcf_file.vcf', 'r')

for row in f:
    columns = row.split('\t')  # Split on tabs
    data = columns[7]  # Huge ugly blob of data
    values = data.split(';')  # Split on semi-colons
    if values[25] == 'refseq.functionalClass=missense':
        name_chunk = values[28]  # looks like refseq.name=...
        a, b = name_chunk.split('=')
        if a != 'refseq.name':
            raise ValueError('expected refseq.name but got %s' % a)
        ids.append(b)

f.close()
print(ids)
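
One caveat with fixed positions such as 25 and 28: they only fit rows whose blob holds the same sub-fields in the same order. In the attached sample the second row has fewer sub-fields than the first, so looking the fields up by key may be more robust. The following is only a sketch of that idea; the function name missense_ids and the shortened rows are invented for illustration, and it assumes every sub-field is either key=value or a bare flag.

```python
def missense_ids(lines):
    """Collect refseq.name values from rows whose functional class is missense."""
    ids = []
    for row in lines:
        if row.startswith('#') or not row.strip():
            continue  # skip lines starting with a hash and blank lines
        columns = row.rstrip('\n').split('\t')
        # Turn "key=value;key=value;flag" into a dict; bare flags map to ''.
        info = {}
        for item in columns[7].split(';'):
            key, _, value = item.partition('=')
            info[key] = value
        if info.get('refseq.functionalClass') == 'missense':
            name = info.get('refseq.name')
            if name is None:
                raise ValueError('missense row without refseq.name: %r' % row)
            ids.append(name)
    return ids

# Two shortened rows modelled on the attached sample (sub-fields trimmed).
rows = [
    "4\t69345\t.\tC\tT\t32\t.\tDP=327;DS;refseq.functionalClass=missense;refseq.name=NM_003137492\n",
    "4\t87342\t.\tA\tG\t57.7\t.\tDP=2;refseq.functionalClass=missense;refseq.mrnaCoord=4509;refseq.name=NM_132768\n",
]
print(missense_ids(rows))  # ['NM_003137492', 'NM_132768']
```

Because an open file yields its lines one at a time, the same function should also accept a file object, e.g. missense_ids(open('vcf_file.vcf')).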


Does this help?



-- Steven
___
Tutor maillist  -  Tutor@python.org
To unsubscribe or change subscription options:
http://mail.python.org/mailman/listinfo/tutor


Re: [Tutor] vcf_files and strings

2011-10-11 Thread Alan Gauld

On 11/10/11 18:16, Hs Hs wrote:


VCF - Variant Call Format
...

Nothing special, except that a geneticist or any biologist will understand
this - nothing special!


The problem is that this list, being for beginners to Python, is a bit 
short on Geneticists and Biologists! :-)


So you need to explain your problem in general terms, that the rest of 
us can make sense of, or else rely on the few who might understand being 
available/willing to respond.



--
Alan G
Author of the Learn to Program web site
http://www.alan-g.me.uk/



Re: [Tutor] vcf_files and strings

2011-10-11 Thread Hs Hs



http://www.1000genomes.org/node/101







From: Alan Gauld alan.ga...@btinternet.com
To: tutor@python.org
Sent: Tuesday, October 11, 2011 1:52 PM
Subject: Re: [Tutor] vcf_files and strings

On 11/10/11 18:16, Hs Hs wrote:
 
 VCF - Variant Call Format
 ...
 
 Nothing special, except that a geneticist or any biologist will understand
 this - nothing special!

The problem is that this list, being for beginners to Python, is a bit short on 
Geneticists and Biologists! :-)

So you need to explain your problem in general terms, that the rest of us can 
make sense of, or else rely on the few who might understand being 
available/willing to respond.


-- Alan G
Author of the Learn to Program web site
http://www.alan-g.me.uk/



Re: [Tutor] vcf_files and strings

2011-10-09 Thread Anna Olofsson

Hi,

I still don't know how to make a loop that makes it work for all the mutations. 

Best,
Anna
Date: Fri, 7 Oct 2011 13:17:07 -0700
From: ilhs...@yahoo.com
Subject: Re: [Tutor] vcf_files and strings
To: olofsson_anna...@hotmail.com; tutor@python.org


if col[x] == 'missense':
    print col[withRefSeqID]

hth


From: Anna Olofsson olofsson_anna...@hotmail.com
To: tutor@python.org
Sent: Friday, October 7, 2011 12:12 PM
Subject: [Tutor] vcf_files and strings






Hi,

I'm a beginner at Python and would really appreciate some help in how to 
extract information from a vcf file. 

The attached file consists of a lot of information on mutations, this one 
though is just 2 rows and 10 columns (the real one has a lot more rows). 

I want to extract the mRNA ID only if the mutation is missense. These two rows 
(mutations) that I have attached happen to be missense, but how do I say that 
I'm not interested in the mutations that are not missense (they might be e.g. 
synonymous)? Also, how do I say that if a mutation starts with a # symbol I 
don't want to include it (sometimes the chr starts with a hash)?

vcf file: 2 rows, 10 columns.

col 0   chromosome
col 1   position
col 2   .
col 3   Reference
col 4   ALT
col 5   position
col 6   .
col 7   some statistics and the ID:s
col 8   not important
col 9   not important

The important column is 7 where the ID is, i.e. 
refseq.functionalClass=missense. It's a missense mutation, so then I want to 
extract refseq.name=NM_003137492, or I want to extract only the ID, which in 
this case is NM_003137492. 

Then I want to do exactly the same thing for all the other mutations, but only 
for the missense mutations not the other ones. How do I accomplish that? Where 
do I start? 

Best,
Anna


  



Re: [Tutor] vcf_files and strings

2011-10-09 Thread Steven D'Aprano

Anna Olofsson wrote:

Hi,

I'm a beginner at Python and would really like some help in how to
extract information from a vcf file.

The attached file consists of a lot of information on mutations, this
one though is just 2 rows and 10 columns (the real one has a lot more
rows).


What do you mean by a VCF file? On my computer, a VCF file is an 
electronic business card, which tries to open in an Address Book 
application (which obviously fails).


I don't know how to interpret the contents of your VCF file. After 
opening it in a hex editor, I can *guess* that it is a tab-separated 
file: each row takes one line, with the columns separated by tab 
characters. Column 7 appears to be a great big ugly blob with sub-fields 
separated by semi-colons. Am I right? Can you link us to a description 
of the vcf file format?
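
One way to check that guess is to split the blob and list what comes out. This is only a sketch: describe_info is a made-up helper, the blob is a shortened excerpt of the posted data, and it assumes a sub-field without "=" (such as DS) is a bare flag.

```python
def describe_info(blob):
    """Return (index, key, value) triples; value is None for bare flags."""
    described = []
    for index, item in enumerate(blob.split(";")):
        key, sep, value = item.partition("=")
        described.append((index, key, value if sep else None))
    return described

# Shortened excerpt of column 7 from the first posted row.
blob = "1kg.AD=0;AC=1;DS;refseq.functionalClass=missense;refseq.name=NM_003137492"
for index, key, value in describe_info(blob):
    print(index, key, value)
```

Listing the indices this way also shows at which position a given key sits in a particular row, which is worth knowing before relying on fixed positions.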





I want to extract the mRNA ID only if the mutation is missense. These
two rows (mutations) that I have attached happen to be missense, but
how do I say that I'm not interested in the mutations that are not
missense (they might be e.g. synonymous)? Also, how do I say that if
a mutation starts with a # symbol I don't want to include it
(sometimes the chr starts with a hash)?


What chr? Where is the mutation? I'm afraid your questions are assuming 
familiarity with your data that we don't have.




vcf file: 2 rows, 10 columns.

col 0   chromosome
col 1   position
col 2   .
col 3   Reference
col 4   ALT
col 5   position
col 6   .
col 7   some statistics and the ID:s
col 8   not important
col 9   not important

The important column is 7 where the ID is, i.e.
refseq.functionalClass=missense. It's a missense mutation, so then I
want to extract refseq.name=NM_003137492, or I want to extract only
the ID, which in this case is NM_003137492.



This is what I *think* you want to do. Am I right?

* read each line of the file
* for each line, split on tabs
* extract the 7th column and split it on semi-colons
* inspect the refseq.functionalClass field
* if it matches, extract the ID from the refseq.name and store it in a 
list for later


(I have completely ignored the part about the #, because I don't 
understand what you mean by it.)



Here's some code to do it:

ids = []
f = open('vcf_file.vcf', 'r')

for row in f:
    columns = row.split('\t')  # Split on tabs
    data = columns[7]  # Huge ugly blob of data
    values = data.split(';')  # Split on semi-colons
    if values[25] == 'refseq.functionalClass=missense':
        name_chunk = values[28]  # looks like refseq.name=...
        a, b = name_chunk.split('=')
        if a != 'refseq.name':
            raise ValueError('expected refseq.name but got %s' % a)
        ids.append(b)

f.close()
print(ids)


Does this help?



--
Steven


[Tutor] vcf_files and strings

2011-10-07 Thread Anna Olofsson

Hi,

I'm a beginner at Python and would really appreciate some help in how to 
extract information from a vcf file. 

The attached file consists of a lot of information on mutations, this one 
though is just 2 rows and 10 columns (the real one has a lot more rows). 

I want to extract the mRNA ID only if the mutation is missense. These two rows 
(mutations) that I have attached happen to be missense, but how do I say that 
I'm not interested in the mutations that are not missense (they might be e.g. 
synonymous)? Also, how do I say that if a mutation starts with a # symbol I 
don't want to include it (sometimes the chr starts with a hash)?

vcf file: 2 rows, 10 columns.

col 0   chromosome
col 1   position
col 2   .
col 3   Reference
col 4   ALT
col 5   position
col 6   .
col 7   some statistics and the ID:s
col 8   not important
col 9   not important

The important column is 7 where the ID is, i.e. 
refseq.functionalClass=missense. It's a missense mutation, so then I want to 
extract refseq.name=NM_003137492, or I want to extract only the ID, which in 
this case is NM_003137492. 

Then I want to do exactly the same thing for all the other mutations, but only 
for the missense mutations not the other ones. How do I accomplish that? Where 
do I start? 

Best,
Anna


  4	69345	.	C	T	32	.	1kg.AD=0;1kg.AF=0.8865;1kg.AN=345;AC=1;AF=0.50;AN=2;BaseQRankSum=-6.432;DP=327;DS;Dels=0.00;FS=6.435;HRun=0;HaplotypeScore=5.0380;MQ=54.34;MQ0=13;MQRankSum=-5.3457;QD=2.65;ReadPosRankSum=-0.321;SB=-321.04;dbsnp.ID=rs43032118;dbsnp.dbSNPBuildID=112;hgid.AF=0.6754;refseq.changesAA=true;refseq.codingCoordStr=c.345CT;refseq.codonCoord=234;refseq.functionalClass=missense;refseq.inCodingRegion=true;refseq.mrnaCoord=321;refseq.name=NM_003137492;refseq.name2=PQ3D2;refseq.positionType=CFD;refseq.proteinCoordStr=p.C132T;refseq.referenceAA=BRT;refseq.referenceCodon=DGF;refseq.spliceDist=321;refseq.transcriptStrand=+;refseq.variantAA=Ala;refseq.variantCodon=GCA	GT:AD:DP:GQ:PL	1/1:134,32:765:99:56,1,4576
4	87342	.	A	G	57.7	.	1kg.AD=0;1kg.AF=1.0;1kg.AN=345;AC=2;AF=0.00;AN=4;DP=2;Dels=0.00;FS=0.000;HRun=2;HaplotypeScore=0.;MQ=29.00;MQ0=0;QD=54.97;SB=-52.65;dbsnp.ID=rs6732096;dbsnp.dbSNPBuildID=136;hgid.AF=0.9567;refseq.changesAA=true;refseq.codingCoordStr=c.3207AG;refseq.codonCoord=349;refseq.functionalClass=missense;refseq.inCodingRegion=true;refseq.mrnaCoord=4509;refseq.name=NM_132768;refseq.name2=PQR321;refseq.positionType=CDS;refseq.proteinCoordStr=p.O125R;refseq.referenceAA=Trp;refseq.referenceCodon=ATG;refseq.spliceDist=-45;refseq.transcriptStrand=+;refseq.variantAA=Arg;refseq.variantCodon=CTT	GT:AD:DP:GQ:PL	0/1:0,5:5:2.02:67,9,1


Re: [Tutor] vcf_files and strings

2011-10-07 Thread Hs Hs


if col[x] == 'missense':
    print col[withRefSeqID]
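
To apply that test to every row, the check can sit inside a loop over the file. The sketch below uses the standard csv module to do the tab splitting; the function name missense_names is made up, the second sample row is invented (only the first is taken from the posted data, shortened), and it assumes the sub-fields appear literally as refseq.functionalClass=missense and refseq.name=...

```python
import csv
import io

def missense_names(handle):
    """Loop over all rows and return the refseq.name of each missense row."""
    found = []
    # QUOTE_NONE: treat quote characters as ordinary text, VCF has no quoting.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for col in reader:
        if len(col) < 8:
            continue  # not a full data row
        subfields = col[7].split(";")
        if "refseq.functionalClass=missense" in subfields:
            for sub in subfields:
                if sub.startswith("refseq.name="):
                    found.append(sub[len("refseq.name="):])
    return found

# First row: shortened from the posted data. Second row: invented, synonymous.
text = (
    "4\t69345\t.\tC\tT\t32\t.\trefseq.functionalClass=missense;"
    "refseq.name=NM_003137492;refseq.name2=PQ3D2\n"
    "4\t99999\t.\tG\tA\t40\t.\trefseq.functionalClass=synonymous;"
    "refseq.name=NM_000000\n"
)
print(missense_names(io.StringIO(text)))  # ['NM_003137492']
```

With a real file, an open file object can be passed in place of the io.StringIO wrapper.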


hth






From: Anna Olofsson olofsson_anna...@hotmail.com
To: tutor@python.org
Sent: Friday, October 7, 2011 12:12 PM
Subject: [Tutor] vcf_files and strings


 
Hi,


I'm a beginner at Python and would really appreciate some help in how to 
extract information from a vcf file. 

The attached file consists of a lot of information on mutations, this one 
though is just 2 rows and 10 columns (the real one has a lot more rows). 

I want to extract the mRNA ID only if the mutation is missense. These two rows 
(mutations) that I have attached happen to be missense, but how do I say that 
I'm not interested in the mutations that are not missense (they might be e.g. 
synonymous)? Also, how do I say that if a mutation starts with a # symbol I 
don't want to include it (sometimes the chr starts with a hash)?

vcf file: 2 rows, 10 columns.

col 0   chromosome
col 1   position
col 2   .
col 3   Reference
col 4   ALT
col 5   position
col 6   .
col 7   some statistics and the ID:s
col 8   not important
col 9   not important

The important column is 7 where the ID is, i.e. 
refseq.functionalClass=missense. It's a missense mutation, so then I want to 
extract refseq.name=NM_003137492, or I want to extract only the ID, which in 
this case is NM_003137492. 

Then I want to do exactly the same thing for all the other mutations, but only 
for the missense mutations not the other ones. How do I accomplish that? Where 
do I start? 

Best,
Anna




[Tutor] vcf_files and strings

2011-10-05 Thread Anna Olofsson

Hi,

I'm a beginner at Python and would really like some help in how to extract 
information from a vcf file. 

The attached file consists of a lot of information on mutations, this one 
though is just 2 rows and 10 columns (the real one has a lot more rows). 

I want to extract the mRNA ID only if the mutation is missense. These two rows 
(mutations) that I have attached happen to be missense, but how do I say that 
I'm not interested in the mutations that are not missense (they might be e.g. 
synonymous)? Also, how do I say that if a mutation starts with a # symbol I 
don't want to include it (sometimes the chr starts with a hash)?
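
On the hash question: in VCF files, lines beginning with "##" are meta-information and the single line beginning with "#CHROM" is the column header, so a line starting with a hash is probably not a mutation at all. A minimal sketch of telling them apart, assuming that is what the hash lines in the real file are; classify_line and the three sample lines are made up for illustration.

```python
def classify_line(line):
    """Label a VCF line as 'meta', 'header', 'blank' or 'data'."""
    if line.startswith("##"):
        return "meta"      # meta-information, e.g. ##fileformat=...
    if line.startswith("#"):
        return "header"    # the column header line, e.g. #CHROM POS ...
    if not line.strip():
        return "blank"
    return "data"

# Made-up miniature file: one meta line, the header line, one data row.
lines = [
    "##fileformat=VCFv4.0\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "4\t69345\t.\tC\tT\t32\t.\tAC=1\n",
]
data_rows = [line for line in lines if classify_line(line) == "data"]
print(len(data_rows))  # 1
```

If only the data rows matter, a plain line.startswith("#") test is enough to skip both kinds of hash line.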

vcf file: 2 rows, 10 columns.

col 0   chromosome
col 1   position
col 2   .
col 3   Reference
col 4   ALT
col 5   position
col 6   .
col 7   some statistics and the ID:s
col 8   not important
col 9   not important

The important column is 7 where the ID is, i.e. 
refseq.functionalClass=missense. It's a missense mutation, so then I want to 
extract refseq.name=NM_003137492, or I want to extract only the ID, which in 
this case is NM_003137492. 

Then I want to do exactly the same thing for all the other mutations, but only 
for the missense mutations not the other ones. How do I accomplish that? Where 
do I start? 

Best,
Anna

  4	69345	.	C	T	32	.	1kg.AD=0;1kg.AF=0.8865;1kg.AN=345;AC=1;AF=0.50;AN=2;BaseQRankSum=-6.432;DP=327;DS;Dels=0.00;FS=6.435;HRun=0;HaplotypeScore=5.0380;MQ=54.34;MQ0=13;MQRankSum=-5.3457;QD=2.65;ReadPosRankSum=-0.321;SB=-321.04;dbsnp.ID=rs43032118;dbsnp.dbSNPBuildID=112;hgid.AF=0.6754;refseq.changesAA=true;refseq.codingCoordStr=c.345CT;refseq.codonCoord=234;refseq.functionalClass=missense;refseq.inCodingRegion=true;refseq.mrnaCoord=321;refseq.name=NM_003137492;refseq.name2=PQ3D2;refseq.positionType=CFD;refseq.proteinCoordStr=p.C132T;refseq.referenceAA=BRT;refseq.referenceCodon=DGF;refseq.spliceDist=321;refseq.transcriptStrand=+;refseq.variantAA=Ala;refseq.variantCodon=GCA	GT:AD:DP:GQ:PL	1/1:134,32:765:99:56,1,4576
4	87342	.	A	G	57.7	.	1kg.AD=0;1kg.AF=1.0;1kg.AN=345;AC=2;AF=0.00;AN=4;DP=2;Dels=0.00;FS=0.000;HRun=2;HaplotypeScore=0.;MQ=29.00;MQ0=0;QD=54.97;SB=-52.65;dbsnp.ID=rs6732096;dbsnp.dbSNPBuildID=136;hgid.AF=0.9567;refseq.changesAA=true;refseq.codingCoordStr=c.3207AG;refseq.codonCoord=349;refseq.functionalClass=missense;refseq.inCodingRegion=true;refseq.mrnaCoord=4509;refseq.name=NM_132768;refseq.name2=PQR321;refseq.positionType=CDS;refseq.proteinCoordStr=p.O125R;refseq.referenceAA=Trp;refseq.referenceCodon=ATG;refseq.spliceDist=-45;refseq.transcriptStrand=+;refseq.variantAA=Arg;refseq.variantCodon=CTT	GT:AD:DP:GQ:PL	0/1:0,5:5:2.02:67,9,1


Re: [Tutor] vcf_files and strings

2011-10-05 Thread Andreas Perstinger

On 2011-10-05 21:29, Anna Olofsson wrote:

vcf file: 2 rows, 10 columns.

The important column is 7 where the ID is, i.e.
refseq.functionalClass=missense. It's a missense mutation, so then I
want to extract refseq.name=NM_003137492, or I want to extract only
the ID, which in this case is NM_003137492.

Then I want to do exactly the same thing for all the other mutations,
but only for the missense mutations not the other ones. How do I
accomplish that? Where do I start?


I would split the rows into columns (analyze your file to find the 
separator), then look for "missense" in the 7th column of every row and, 
if found, use a regex to extract the name/ID.
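
A possible sketch of that approach, assuming the separator turns out to be a tab and that "7th column" means index 7 when counting from 0, as in the original post. The names NAME_RE and missense_name are made up for this example, and the row is a shortened copy of the second posted row.

```python
import re

# Match "refseq.name=<value>" at the start of the blob or after a semicolon;
# "refseq.name2=..." does not match because "=" must directly follow "name".
NAME_RE = re.compile(r"(?:^|;)refseq\.name=([^;\t]+)")

def missense_name(row):
    """Return the refseq.name of a missense row, or None for any other row."""
    columns = row.rstrip("\n").split("\t")
    info = columns[7]
    if "refseq.functionalClass=missense" not in info:
        return None
    match = NAME_RE.search(info)
    return match.group(1) if match else None

# Shortened copy of the second posted row.
row = ("4\t87342\t.\tA\tG\t57.7\t.\tDP=2;refseq.functionalClass=missense;"
       "refseq.name=NM_132768;refseq.name2=PQR321\tGT:AD:DP:GQ:PL\t0/1:0,5:5:2.02:67,9,1\n")
print(missense_name(row))  # NM_132768
```

Calling this once per line of the file and keeping the results that are not None would give the list of IDs.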


Are you able to code that yourself or do you need more hints?

Bye, Andreas
