On Fri, 8 Jun 2007, Bogdan wrote:
Hi,
I can't seem to find the manual or description of how the
flank[-coding] regions are defined and interpreted, or of the
conventions they follow. I also couldn't find anything (relevant
enough) on the web or in the mailing-list archive. If the questions
I'm asking are documented somewhere, please let me know.
The sample query I used was:
<Query virtualSchemaName = "default" header = "0" count = ""
softwareVersion = "0.5" >
<Dataset name = "rnorvegicus_gene_ensembl" interface = "default" >
<Attribute name = "gene_stable_id" />
<Attribute name = "coding_gene_flank" />
<Attribute name = "5utr_start" />
<Attribute name = "5utr_end" />
<Attribute name = "transcript_chrom_start" />
<Attribute name = "transcript_chrom_end" />
<Attribute name = "transcript_chrom_strand" />
<Filter name = "upstream_flank" value = "1000"/>
<Filter name = "transcript_status" value = "KNOWN"/>
<Filter name = "ensembl_gene_id" value =
"ENSRNOG00000006899,ENSRNOG00000000164"/>
</Dataset></Query>
(with variation "coding_gene_flank" instead of "gene_flank").
1. definitions first.
I would expect that
flank-coding region (gene) = flank (gene) + 5' UTR
So while getting upstream sequence as "flank (gene)" starts from the
TSS of the "leftmost transcript", "flank-coding region (gene)" should
start at the translation initiation site of the "leftmost transcript".
Am I right here?
Issuing two sample queries to biomart webservice, asking for 1 kbase
upstream of "flank (gene)" and "flank-coding region (gene)", I
expected that the resulting sequences would partially overlap (namely,
in the portion right upstream from TSS; "overlap length" = 1000 - "5'
UTR length"). This seems to be the case, when there is only one 5'UTR
region (as indicated by single 5UTR-start and 5UTR-end values, e.g. in
ENSRNOG00000000164).
However, if more than one 5'UTR is defined for the gene, then
"flank-coding" and "flank" overlap only at higher values of the
'upstream' filter (like 5 kbases or more), e.g. in ENSRNOG00000006899:
ENSRNOG00000006899|7748641;7744305|7748650;7744534|7744305|7759380|1
So it appears that in the case of multiple 5' UTRs (and "upstream"
checkbox set), the "flank-coding region (gene)" returns the sequence
starting from the "rightmost" 5' UTR of the "leftmost" transcript. Am
I right in this statement?
yes
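For what it's worth, the coordinates in the ENSRNOG00000006899 row above are consistent with this reading. Here is a minimal Python sketch of the arithmetic. It assumes (this is not documented mart behaviour) that the upstream "flank (gene)" ends right before the transcript start, that the "flank-coding region (gene)" ends right before the first base after the rightmost 5' UTR end, and that the gene is on the plus strand. The function names are made up for illustration.

```python
def parse_row(row):
    """Split a pipe-delimited result row (sequence column omitted) into fields."""
    gene_id, utr_starts, utr_ends, tx_start, tx_end, strand = row.split("|")
    # 5' UTR segments come as two ';'-separated lists of starts and ends.
    utrs = list(zip(
        (int(s) for s in utr_starts.split(";")),
        (int(e) for e in utr_ends.split(";")),
    ))
    return {
        "gene_id": gene_id,
        "utrs": utrs,
        "tx_start": int(tx_start),
        "tx_end": int(tx_end),
        "strand": int(strand),
    }


def coding_offset(record):
    """Distance from the transcript start to the assumed coding start (plus strand only)."""
    if record["strand"] != 1:
        raise ValueError("this sketch only handles plus-strand genes")
    # Assumption: coding starts one base after the rightmost 5' UTR end.
    coding_start = max(end for _, end in record["utrs"]) + 1
    return coding_start - record["tx_start"]


def expected_overlap(record, upstream):
    """Bases shared by the two upstream flanks of equal length `upstream`."""
    return max(0, upstream - coding_offset(record))


ROW = "ENSRNOG00000006899|7748641;7744305|7748650;7744534|7744305|7759380|1"
record = parse_row(ROW)
```

Under these assumptions the coding flank is anchored 4346 bases downstream of the transcript start, so the two 1 kbase flanks cannot overlap for this gene, while an upstream value of 5000 would give an overlap of 654 bases. With a single 5' UTR the same formula reduces to 1000 - "5' UTR length".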
2. conventions.
based on some previous discussions (
http://listserver.ebi.ac.uk/mailing-lists-archives/ensembl-dev/msg01227.html
)
and one of the results I got:
ENSRNOG00000007949|ENSRNOT00000010984|13052235;13052765|13052250;13052846|13037171|13052846|-1
it's still confusing to interpret.
Here, 5' UTRs appear to start at positions 13052235;13052765, and end
at 13052250;13052846. Transcript starts at 13037171, ends at 13052846.
Clearly, 5'UTRs' position is reversed for the negative strand (and
thus appears at the "end" of the gene).
Is the earlier discussed "convention" still valid, and I have to
reverse-complement the upstream sequences I get from the negative
strand genes?
Just a heads up that BioMart and Ensembl sequence conventions may not
correspond, as we calculate our own stuff. You are right, the 5' UTRs need
reverse-complementing, but this is done automatically by the mart software.
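For anyone handling raw forward-strand sequence themselves (mart output should not need this, per the reply above), reverse-complementing and picking the TSS by strand can be sketched in Python as follows. The helper names are hypothetical, and treating the transcript end as the TSS on the negative strand is a reading of the ENSRNOT00000010984 row above rather than a documented convention.

```python
# Translation table for complementing DNA bases (N stays N, case is preserved).
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def transcription_start(tx_start, tx_end, strand):
    """Pick the TSS coordinate: transcript start on strand 1, transcript end on strand -1."""
    return tx_start if strand == 1 else tx_end


ROW = ("ENSRNOG00000007949|ENSRNOT00000010984|13052235;13052765|"
       "13052250;13052846|13037171|13052846|-1")
fields = ROW.split("|")
tss = transcription_start(int(fields[4]), int(fields[5]), int(fields[6]))
```

For this row the assumed TSS is 13052846, which coincides with the end of the second 5' UTR segment, matching the observation that the 5' UTRs sit at the "end" of a negative-strand gene.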
3. the problem itself and best method
What I'm attempting to fetch is a fairly small "gene promoter" (less
than 1 kbase).
There are several different options available:
1. do a query like
<Dataset name = "rnorvegicus_gene_ensembl" interface = "default" >
<Attribute name = "gene_stable_id" />
<Attribute name = "gene_flank" />
<Filter name = "downstream_flank" value = "200"/>
<Filter name = "upstream_flank" value = "1000"/>
<Filter name = "transcript_status" value = "KNOWN"/>
<Filter name = "ensembl_gene_id" value = "ENSRNOG00000006899"/>
</Dataset>
but it returns only the 1 kbase of upstream sequence and doesn't
extend beyond the TSS as I had expected.
OK - we are improving the user warning and images for the forthcoming
release :-) Downstream flank refers to the sequence downstream of the gene.
As it doesn't really make sense to join the upstream and downstream flanks
when just selecting flanks, we have disabled using them both together - it
just returns the upstream flank, as you experienced. Apologies for the
confusion.
2. do a query like
<Dataset name = "rnorvegicus_gene_ensembl" interface = "default" >
<Attribute name = "gene_stable_id" />
<Attribute name = "coding_gene_flank" />
<Filter name = "upstream_flank" value = "1000"/>
<Filter name = "transcript_status" value = "KNOWN"/>
<Filter name = "ensembl_gene_id" value = "ENSRNOG00000006899"/>
</Dataset>
but as shown earlier in this email, this way I may get too many
kilobases of sequence, which is not what I want.
This should give you your 1000bp upstream of the TSS - is it not doing
this? Or are you looking for something different? Let me know and I will
try to help.
Best wishes
Damian
3. issue two queries for each gene, like:
<Dataset name = "rnorvegicus_gene_ensembl" interface = "default" >
<Attribute name = "gene_stable_id" />
<Attribute name = "gene_flank" />
<Filter name = "upstream_flank" value = "1000"/>
<Filter name = "transcript_status" value = "KNOWN"/>
<Filter name = "ensembl_gene_id" value = "ENSRNOG00000006899"/>
</Dataset>
and
<Dataset name = "rnorvegicus_gene_ensembl" interface = "default" >
<Attribute name = "gene_stable_id" />
<Attribute name = "5utr" />
<Filter name = "transcript_status" value = "KNOWN"/>
<Filter name = "ensembl_gene_id" value = "ENSRNOG00000006899"/>
</Dataset>
So far this third approach looks promising, but I haven't tried it yet.
Is this last method the right way to do what I need? Or is there a
different (better) way?
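The two queries for this third approach could be generated with a small Python helper like the one below. It is only a sketch: build_query is a made-up name, the element, attribute and filter names are copied from the queries in this email, and whether joining the upstream flank with the 5' UTR sequence yields the intended promoter region is still untested.

```python
import xml.etree.ElementTree as ET


def build_query(dataset, attributes, filters):
    """Serialise a mart query with the same layout as the sample query in this email."""
    query = ET.Element("Query", {
        "virtualSchemaName": "default",
        "header": "0",
        "count": "",
        "softwareVersion": "0.5",
    })
    ds = ET.SubElement(query, "Dataset", {"name": dataset, "interface": "default"})
    for name in attributes:
        ET.SubElement(ds, "Attribute", {"name": name})
    for name, value in filters.items():
        ET.SubElement(ds, "Filter", {"name": name, "value": value})
    return ET.tostring(query, encoding="unicode")


DATASET = "rnorvegicus_gene_ensembl"
common = {"transcript_status": "KNOWN", "ensembl_gene_id": "ENSRNOG00000006899"}

# Query 1: 1 kbase upstream of the gene.
flank_query = build_query(
    DATASET,
    ["gene_stable_id", "gene_flank"],
    {"upstream_flank": "1000", **common},
)

# Query 2: the 5' UTR sequence of the same gene.
utr_query = build_query(DATASET, ["gene_stable_id", "5utr"], common)
```

Each string can then be submitted the same way as the hand-written queries; only the attribute list and the upstream_flank filter differ between the two.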
Thanks in advance for your replies,