Hi Keith,

Please see my responses inline below:

I'm using the mart API to download summaries of human exon data. In
tsv format everything is clear; one row per exon. I also want the exon
sequences, doing the same query, with an added gene_exon attribute,
selecting FASTA instead of TSV (aided by the martview website to
generate the script).

Intuitively, I would expect the same number of results in each (having
selected remove duplicate rows in both cases). However, I get

TSV rows: 532103
Fasta sequences: 141274


Firstly, I think your FASTA file has been truncated, as it does not contain the total number of possible exon sequences (297956 for release 52). I would suggest that you download it again, this time selecting a compressed (.gz) file, and check whether you get the correct count.

The numbers differ because the TSV file repeats an exon once for every transcript that contains it. With so many attributes selected, each row is a unique combination of values even when the same exon ID appears in several rows, so removing duplicate rows does not collapse them.

In the sequence search, you are requesting just the unique exon sequence, so you will only get the sequence (and corresponding header) once for each exon ID. I hope that makes sense, but if I am not being clear please get back to me.
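As a rough illustration of the point above, the following Python sketch counts both the rows and the distinct exon IDs in a TSV export. It is only a sketch: the function name, the column position and the sample rows are assumptions made for the example (the first exon ID and its three transcripts are taken from the example later in this thread; the "HYPOTHETICAL" IDs are placeholders, not real Ensembl IDs).

```python
import csv
import io

# Illustrative data only: a cut-down TSV with two columns
# (transcript ID, exon ID) and no header row.
SAMPLE_TSV = (
    "ENST00000295500\tENSE00001073363\n"
    "ENST00000392552\tENSE00001073363\n"
    "ENST00000392551\tENSE00001073363\n"
    "ENST_HYPOTHETICAL\tENSE_HYPOTHETICAL\n"
)


def count_rows_and_unique_exons(handle, exon_column):
    """Return (number of rows, number of distinct exon IDs) for a TSV stream."""
    rows = 0
    exon_ids = set()
    for fields in csv.reader(handle, delimiter="\t"):
        if not fields:  # skip blank lines
            continue
        rows += 1
        exon_ids.add(fields[exon_column])
    return rows, len(exon_ids)


rows, unique_exons = count_rows_and_unique_exons(io.StringIO(SAMPLE_TSV), exon_column=1)
print(f"TSV rows: {rows}, unique exon IDs: {unique_exons}")
```

Run against a real export (with `exon_column` set to wherever ensembl_exon_id sits in your attribute order), the second number is the one that should be comparable to the FASTA sequence count.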

Secondly, I believed that I could use the Fasta data alone because the
header contains much of the exon metadata. I'm not so sure after
looking more closely. The header seems to be ambiguous when an exon is
shared between transcripts.

My dataset "hsapiens_gene_ensembl";

My query attributes (the same for both TSV and Fasta queries)

qw(chromosome_name
  ensembl_gene_id ensembl_transcript_id
  start_position end_position
  transcript_start transcript_end strand transcript_count
  ensembl_exon_id exon_chrom_start exon_chrom_end
  rank phase)

In addition I use gene_exon to obtain an exon sequence in the Fasta query.

An example Fasta record:

2|ENSG00000163328|ENST00000295500;ENST00000392552;ENST00000392551|175004621|175060068|175004621;175007126|175060057;175060068|-1|3|ENSE00001073363|175038759|175038876|8;7
TCTATTGTCTGTGCTGGAATGATGATATGGAATTTTGTTAAAGAAAAAAATTTTGTTGGA

This exon is shared between transcripts
ENST00000295500;ENST00000392552;ENST00000392551

However, the transcript starts/ends are reported twice
175004621;175007126|175060057;175060068

Presumably this is because two of them share a start/end? But which
coordinates belong to which?

These results show that there are two different starts and ends among these three transcripts (and, as you mention, two of the transcripts have the same start and end). The layout is Start1;Start2|End1;End2, where Start1 and End1 are the start and end of one or more transcripts, and Start2 and End2 are the start and end of the other transcript(s).

I agree that this is confusing, and other users have also raised this issue with the ordering of selected attributes and results. We will try to address these issues over the coming months. For the moment, I would suggest that you take the start/end values from the TSV file, where they are more meaningful.
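To make the ambiguity concrete, here is a small Python sketch that splits the example header into its fields and reports which multi-valued fields cannot be lined up with the transcript list. It assumes that header fields follow the order of the query attributes, separated by "|" with ";" between multiple values; the function names are made up for this example. The example header carries 13 fields, so phase (the 14th query attribute) is left out.

```python
# Attribute names in query order; phase is omitted because the example
# header below has only 13 fields (an observation, not documented behaviour).
ATTRIBUTES = [
    "chromosome_name", "ensembl_gene_id", "ensembl_transcript_id",
    "start_position", "end_position",
    "transcript_start", "transcript_end", "strand", "transcript_count",
    "ensembl_exon_id", "exon_chrom_start", "exon_chrom_end", "rank",
]

# The example header from this thread, with wrapping spaces removed.
HEADER = (
    "2|ENSG00000163328|ENST00000295500;ENST00000392552;ENST00000392551|"
    "175004621|175060068|175004621;175007126|175060057;175060068|-1|3|"
    "ENSE00001073363|175038759|175038876|8;7"
)


def parse_header(header):
    """Map each attribute name to the list of values found in the header."""
    fields = [field.strip() for field in header.lstrip(">").split("|")]
    return {name: value.split(";") for name, value in zip(ATTRIBUTES, fields)}


def unpairable_fields(parsed):
    """Names of multi-valued fields whose length differs from the transcript list."""
    n_transcripts = len(parsed["ensembl_transcript_id"])
    return [
        name
        for name, values in parsed.items()
        if len(values) > 1 and len(values) != n_transcripts
    ]


parsed = parse_header(HEADER)
print(len(parsed["ensembl_transcript_id"]), "transcripts")
print("cannot be paired with transcripts:", unpairable_fields(parsed))
```

Because the header lists three transcripts but only two values for transcript_start, transcript_end and rank, there is no safe way to zip these lists together, which is why the TSV file is the better source for per-transcript values.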



Likewise with other fields e.g. the exon ranks are reported
8;7

This is because this exon can be ranked as exon 7 in some of the transcripts for this gene and ranked as exon 8 in other transcripts for this gene.


In fact the exon is in the 8th position in ENST00000392551 which is
listed last in the transcript stable ID triple, so these values aren't
even in the same order with respect to each other.
The number 8 in the first position means that it is the 8th exon in that transcript (ENST00000392551). The ;7 means it is ranked 7th in other transcripts for the same gene (i.e. ENST00000392552).
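If you need the rank of an exon within a specific transcript, one option is to build a lookup from the TSV rows rather than from the FASTA header. The Python sketch below is only an illustration: it assumes the TSV was exported with a header row whose column names match the attribute names, which may not be the case for your download, and the function name is made up. The two sample rows use the ranks mentioned in this thread.

```python
import csv
import io

# Illustrative TSV with an assumed header row; the two data rows reflect
# the ranks discussed in this thread for exon ENSE00001073363.
SAMPLE_TSV = (
    "ensembl_transcript_id\tensembl_exon_id\trank\n"
    "ENST00000392551\tENSE00001073363\t8\n"
    "ENST00000392552\tENSE00001073363\t7\n"
)


def rank_lookup(handle):
    """Return {(exon ID, transcript ID): rank} from a TSV stream with a header row."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {
        (row["ensembl_exon_id"], row["ensembl_transcript_id"]): int(row["rank"])
        for row in reader
    }


ranks = rank_lookup(io.StringIO(SAMPLE_TSV))
print(ranks[("ENSE00001073363", "ENST00000392551")])
print(ranks[("ENSE00001073363", "ENST00000392552")])
```

Keying on the (exon, transcript) pair keeps each rank tied to the transcript it belongs to, which the semicolon-joined header values do not.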
I hope that helps,
Regards,
Rhoda




Is this a bug or am I misusing the API? I've looked in the manual and
the mailing list. I read some threads relating to the Fasta header,
but didn't spot anything on this issue specifically.

thanks,

Keith

--

- Keith James - Wellcome Trust Sanger Institute, UK -



Rhoda Kinsella Ph.D.
Ensembl Bioinformatician,
European Bioinformatics Institute (EMBL-EBI),
Wellcome Trust Genome Campus,
Hinxton
Cambridge CB10 1SD,
UK.
