Anna Olofsson wrote:
Hi,
I'm a beginner at Python and would really like some help in how to
extract information from a vcf file.
The attached file consists of a lot of information on mutations, this
one though is just 2 rows and 10 columns (the real one has a lot more
rows).
What do you mean by a VCF file? On my computer, a VCF file is an
electronic business card, which tries to open in an Address Book
application (which obviously fails).
I don't know how to interpret the contents of your VCF file. After
opening it in a hex editor, I can *guess* that it is a tab-separated
file: each row takes one line, with the columns separated by tab
characters. Column 7 appears to be a great big ugly blob with sub-fields
separated by semi-colons. Am I right? Can you link us to a description
of the vcf file format?
I want to extract the mRNA ID only if the mutation is missense. These
two rows (mutations) that I have attached happens to be missense but
how do I say that I'm not interested in the mutations that's not
missense (they might be e.g. synonymous). Also, how do I say that if
a mutation starts with a # symbol I don't want to include it
(sometimes the chr starts with a hash).
What chr? Where is the mutation? I'm afraid your questions are assuming
familiarity with your data that we don't have.
vcf file: 2 rows, 10 columns.
col 0 col 1 col 2
col 3 col 4 col5 col6
col7 col8
col9 chromosome position .
Reference ALT position . some statistics
and the ID:s not important not important
The important column is 7 where the ID is, i.e.
refseq.functionalClass=missense. It's a missense mutation, so then I
want to extract refseq.name=NM_003137492, or I want to extract only
the ID, which in this case is NM_003137492.
This is what I *think* you want to do. Am I right?
* read each line of the file
* for each line, split on tabs
* extract the 7th column and split it on semi-colons
* inspect the refseq.functionalClass field
* if it matches, extract the ID from the refseq.name and store it in a
list for later
(I have completely ignored the part about the #, because I don't
understand what you mean by it.)
Here's some code to do it:
ids = []
f = open('vcf_file.vcf', 'r')
for row in f:
columns = row.split('\t') # Split on tabs
data = columns[7] # Huge ugly blob of data
values = data.split(';') # Split on semi-colons
if values[25] == "refseq.functionalClass=missense":
name_chunk = values[28] # looks like "refseq.name=..."
a, b = name_chunk.split("=")
if a != "refseq.name":
raise ValueError('expected refseq.name but got %s' % a)
ids.append(b)
f.close()
print(ids)
Does this help?
--
Steven
_______________________________________________
Tutor maillist - Tutor@python.org
To unsubscribe or change subscription options:
http://mail.python.org/mailman/listinfo/tutor