Re: [Tutor] vcf_files and strings

Hs Hs Tue, 11 Oct 2011 10:18:45 -0700


VCF - Variant Call Format

VCF files are nothing special: they are tab-delimited text files describing 
genetic mutations, their frequencies and other base information (bases here 
means ATGC, pertaining to DNA). 

These files are generated by a variety of genome sequence data analysis 
pipelines.  MIT and the Haplotype Mapping Project consortium developed this 
format. 

Nothing special, except that a geneticist or any biologist will understand 
this - nothing special!

hth

cheers

________________________________
From: Steven D'Aprano <st...@pearwood.info>
To: tutor@python.org
Sent: Sunday, October 9, 2011 10:04 PM
Subject: Re: [Tutor] vcf_files and strings

Anna Olofsson wrote:
> Hi,
> 
> I'm a beginner at Python and would really like some help in how to
> extract information from a vcf file.
> 
> The attached file consists of a lot of information on mutations, this
> one though is just 2 rows and 10 columns (the real one has a lot more
> rows).

What do you mean by a VCF file? On my computer, a VCF file is an electronic 
business card, which tries to open in an Address Book application (which 
obviously fails).

I don't know how to interpret the contents of your VCF file. After opening it 
in a hex editor, I can *guess* that it is a tab-separated file: each row takes 
one line, with the columns separated by tab characters. Column 7 appears to be 
a great big ugly blob with sub-fields separated by semi-colons. Am I right? Can 
you link us to a description of the vcf file format?
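
If that guess about the layout is right, the blob can be turned into a lookup table rather than read by eye. This is only a minimal sketch: it assumes every sub-field is either key=value or a bare flag, and the name parse_info and the sample string are invented for illustration, not taken from the attached file.

```python
def parse_info(blob):
    """Split a semicolon-separated blob into a dict of key -> value."""
    fields = {}
    for item in blob.strip().split(';'):
        if not item:
            continue  # ignore empty pieces, e.g. from a trailing ';'
        key, sep, value = item.partition('=')
        # A bare flag with no '=' is stored with the value True.
        fields[key] = value if sep else True
    return fields


# Invented sample blob, shaped like the sub-fields described in this thread.
sample = "AC=2;refseq.functionalClass=missense;refseq.name=NM_003137492;DB"
info = parse_info(sample)
print(info["refseq.functionalClass"])
print(info["refseq.name"])
```

Looking fields up by name means the code does not care where in the blob a given sub-field happens to sit.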

> I want to extract the mRNA ID only if the mutation is missense. These
> two rows (mutations) that I have attached happen to be missense but
> how do I say that I'm not interested in the mutations that are not
> missense (they might be e.g. synonymous).   Also, how do I say that if
> a mutation starts with a # symbol I don't want to include it
> (sometimes the chr starts with a hash).

What chr? Where is the mutation? I'm afraid your questions are assuming 
familiarity with your data that we don't have.

> vcf file: 2 rows, 10 columns.
> 
> col 0: chromosome
> col 1: position
> col 2: .
> col 3: Reference
> col 4: ALT
> col 5: position
> col 6: .
> col 7: some statistics and the ID:s
> col 8: not important
> col 9: not important
> 
> The important column is 7 where the ID is, i.e.
> refseq.functionalClass=missense. It's a missense mutation, so then I
> want to extract refseq.name=NM_003137492, or I want to extract only
> the ID, which in this case is NM_003137492.

This is what I *think* you want to do. Am I right?

* read each line of the file
* for each line, split on tabs
* extract the 7th column and split it on semi-colons
* inspect the refseq.functionalClass field
* if it matches, extract the ID from the refseq.name and store it in a list for 
later

(I have completely ignored the part about the #, because I don't understand 
what you mean by it.)

Here's some code to do it:

ids = []

# The with statement closes the file even if an error is raised.
with open('vcf_file.vcf', 'r') as f:
    for row in f:
        # Drop the trailing newline, then split on tabs
        columns = row.rstrip('\n').split('\t')
        data = columns[7]  # Huge ugly blob of data
        values = data.split(';')  # Split on semi-colons
        # Positions 25 and 28 are taken from the sample file;
        # other files may put these fields elsewhere.
        if values[25] == "refseq.functionalClass=missense":
            name_chunk = values[28]  # looks like "refseq.name=..."
            a, b = name_chunk.split("=", 1)
            if a != "refseq.name":
                raise ValueError('expected refseq.name but got %s' % a)
            ids.append(b)

print(ids)
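
A somewhat more defensive variant of the same steps might look like the sketch below. It makes two assumptions that go beyond the code above: lines starting with a # are headers or comments and should be skipped, and the sub-fields in column 7 are looked up by name instead of by fixed position. The function name missense_ids and the sample rows (including the ID NM_000000001) are made up for illustration.

```python
def missense_ids(lines):
    """Collect refseq.name values from rows whose functional class is missense."""
    ids = []
    for row in lines:
        row = row.rstrip('\n')
        if not row or row.startswith('#'):
            continue  # skip blank lines and '#' header/comment lines
        columns = row.split('\t')
        if len(columns) < 8:
            continue  # too few columns to contain the blob in column 7
        # Build a key -> value dict from the 'key=value' sub-fields.
        info = dict(item.partition('=')[::2]
                    for item in columns[7].split(';') if item)
        if info.get('refseq.functionalClass') == 'missense':
            name = info.get('refseq.name')
            if name:
                ids.append(name)
    return ids


# Invented sample data: one header line, one missense row, one synonymous row.
sample = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSAMPLE\n",
    "chr1\t100\t.\tA\tG\t50\t.\t"
    "refseq.functionalClass=missense;refseq.name=NM_003137492\tGT\t0/1\n",
    "chr1\t200\t.\tC\tT\t50\t.\t"
    "refseq.functionalClass=synonymous;refseq.name=NM_000000001\tGT\t0/1\n",
]
print(missense_ids(sample))
```

Because a file object yields its lines one at a time, the same function should work on a real file: open it in a with block and pass the file object, e.g. missense_ids(f).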

Does this help?

-- Steven
_______________________________________________
Tutor maillist  -  Tutor@python.org
To unsubscribe or change subscription options:
http://mail.python.org/mailman/listinfo/tutor
