Re: [Tutor] vcf_files and strings

Steven D'Aprano Sun, 09 Oct 2011 19:06:56 -0700

Anna Olofsson wrote:

Hi,


I'm a beginner at Python and would really like some help in how to
extract information from a vcf file.

The attached file consists of a lot of information on mutations, this
one though is just 2 rows and 10 columns (the real one has a lot more
rows).

What do you mean by a VCF file? On my computer, a VCF file is an electronic business card, which tries to open in an Address Book application (which obviously fails).

I don't know how to interpret the contents of your VCF file. After opening it in a hex editor, I can *guess* that it is a tab-separated file: each row takes one line, with the columns separated by tab characters. Column 7 appears to be a great big ugly blob with sub-fields separated by semi-colons. Am I right? Can you link us to a description of the vcf file format?

I want to extract the mRNA ID only if the mutation is missense. These
two rows (mutations) that I have attached happens to be missense but
how do I say that I'm not interested in the mutations that's not
missense (they might be e.g. synonymous).   Also, how do I say that if
a mutation starts with a # symbol I don't want to include it
(sometimes the chr starts with a hash).

What chr? Where is the mutation? I'm afraid your questions are assuming familiarity with your data that we don't have.

vcf file: 2 rows, 10 columns.

col 0    chromosome
col 1    position
col 2    .
col 3    Reference
col 4    ALT
col 5    position
col 6    .
col 7    some statistics and the ID:s
col 8    not important
col 9    not important

The important column is 7 where the ID is, i.e.
refseq.functionalClass=missense. It's a missense mutation, so then I
want to extract refseq.name=NM_003137492, or I want to extract only
the ID, which in this case is NM_003137492.



This is what I *think* you want to do. Am I right?

* read each line of the file
* for each line, split on tabs
* extract the 7th column and split it on semi-colons
* inspect the refseq.functionalClass field

* if it matches, extract the ID from the refseq.name and store it in a list for later

(I have completely ignored the part about the #, because I don't understand what you mean by it.)
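
The "split it on semi-colons" step can be sketched on its own. This is only a sketch, assuming the blob really is a series of key=value chunks separated by semi-colons, as it appears to be. The helper name parse_info and the sample string are made up for illustration; only the refseq.* names come from your description.

```python
def parse_info(blob):
    """Turn 'a=1;b=2;flag' into {'a': '1', 'b': '2', 'flag': ''}."""
    fields = {}
    for chunk in blob.strip().split(';'):
        if not chunk:
            continue  # tolerate an empty chunk, e.g. after a trailing semi-colon
        # partition splits on the first '=' only; a chunk with no '='
        # gets an empty string as its value.
        key, _, value = chunk.partition('=')
        fields[key] = value
    return fields


# Made-up sample resembling the blob described in the question.
sample = "AC=1;refseq.functionalClass=missense;refseq.name=NM_003137492"
info = parse_info(sample)
print(info["refseq.functionalClass"])
print(info["refseq.name"])
```

Looking fields up by name means it does not matter where in the blob each sub-field happens to sit.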



Here's some code to do it:

ids = []
f = open('vcf_file.vcf', 'r')

for row in f:
    columns = row.split('\t')  # Split on tabs
    data = columns[7]  # Huge ugly blob of data
    values = data.split(';')  # Split on semi-colons
    if values[25] == "refseq.functionalClass=missense":
        name_chunk = values[28]  # looks like "refseq.name=..."
        a, b = name_chunk.split("=")
        if a != "refseq.name":
            raise ValueError('expected refseq.name but got %s' % a)
        ids.append(b)

f.close()
print(ids)
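
The code above relies on the two fields always being at positions 25 and 28 of the blob, which is only a guess from the two sample rows. A more forgiving variant, again just a sketch, finds the fields by name instead. It also assumes (and this is an assumption on my part) that "starts with a # symbol" means the whole line begins with '#', and skips such lines. The function name missense_ids and the sample rows are made up for illustration.

```python
def missense_ids(lines):
    """Collect refseq.name values from rows whose functional class is missense."""
    ids = []
    for row in lines:
        # Assumption: a leading '#' marks a header or comment line to ignore.
        if row.startswith('#') or not row.strip():
            continue
        columns = row.rstrip('\n').split('\t')  # Split on tabs
        if len(columns) <= 7:
            continue  # too few columns to contain the blob
        # Build a {name: value} mapping from the semi-colon separated blob.
        info = dict(chunk.partition('=')[::2]
                    for chunk in columns[7].split(';') if chunk)
        if (info.get('refseq.functionalClass') == 'missense'
                and 'refseq.name' in info):
            ids.append(info['refseq.name'])
    return ids


# Made-up rows for illustration; the real file will have different values.
rows = [
    "#this line starts with a hash and is skipped\n",
    "chr1\t100\t.\tA\tG\t100\t.\tAC=1;refseq.functionalClass=missense;"
    "refseq.name=NM_003137492\tx\ty\n",
    "chr1\t200\t.\tC\tT\t200\t.\trefseq.functionalClass=synonymous;"
    "refseq.name=NM_000000001\tx\ty\n",
]
print(missense_ids(rows))
```

An open file can be passed in directly, since iterating over a file yields its lines: open the file in a with block and call missense_ids(f) inside it.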


Does this help?



--
Steven
_______________________________________________
Tutor maillist  -  Tutor@python.org
To unsubscribe or change subscription options:
http://mail.python.org/mailman/listinfo/tutor
