Hi,

I have a query about the Ensembl 3'UTR sequences (hg19).

It appears that the header information does not match the actual length
of the corresponding sequence.

See the example below:

SeqLength  AdvertisedLength  name             advertisedLocation
488        22428             ENST00000455845  chr2:82210396-82232824


From the whole 3'UTR file, 15416 sequences show this discrepancy.
I have attached the earlier email thread below, which describes how I
extracted the 3'UTR sequences from the UCSC Table Browser.
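
A minimal sketch of how such a check could be scripted, in case it helps
others reproduce the count. It assumes the header layout of the UCSC
record quoted further down in this thread (a "range=chrN:start-end"
field) and assumes the coordinates are 1-based and inclusive, so the
span is end - start + 1; drop the "+ 1" to reproduce the
AdvertisedLength column above, which equals end - start. The function
names are hypothetical and not part of any UCSC tool.

```python
import re

# Assumed header field, modelled on the UCSC record quoted in this thread:
#   >hg19_ensGene_ENST00000327299 range=chr1:65691861-65693173 ...
RANGE_RE = re.compile(r"range=(\S+?):(\d+)-(\d+)")


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def length_mismatches(lines):
    """Return (name, seq_len, span) for records whose sequence length
    differs from the span advertised in the header's range= field."""
    mismatches = []
    for header, seq in read_fasta(lines):
        match = RANGE_RE.search(header)
        if match is None:
            continue  # no range= field, nothing to compare against
        start, end = int(match.group(2)), int(match.group(3))
        span = end - start + 1  # assumption: 1-based, inclusive coordinates
        if len(seq) != span:
            mismatches.append((header.split()[0], len(seq), span))
    return mismatches
```

With a downloaded file this could be called as
length_mismatches(open("utr3.fa")), where "utr3.fa" is a placeholder
file name.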

Thanks

Manisha





-----Original Message-----
From: Jennifer Jackson [mailto:[email protected]] 
Sent: Tuesday, February 16, 2010 7:21 PM
To: Manisha Brahmachary
Cc: [email protected]
Subject: Re: [Genome] Query about downloading 3'UTR sequence for ENSEMBL

Hello,

Yes, you are extracting the data from the Table Browser correctly. From
examining the data at ENSEMBL, it appears that the transcript data
source, UniProtKB/Swiss-Prot P27144 (KAD4_HUMAN), was very recently updated.

Last modified February 9, 2010

The new version of the transcript has a much longer 3' UTR. When I take 
the revised sequence and run a simple web BLAT, it aligns easily with 
100% identity covering the same 3' UTR region as the currently existing 
transcript plus the extra data.
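
As a rough first check before running BLAT, the relationship between the
two sequences quoted further down can be tested with plain string
matching. This is only a sketch: it finds exact matches, not alignments,
and the function name is hypothetical.

```python
def describe_overlap(short_seq, long_seq):
    """Classify how short_seq relates to long_seq by exact string match.

    Returns one of: "identical", "prefix", "internal", "absent".
    Comparison is case-insensitive so soft-masked (lowercase) bases match.
    """
    a = short_seq.upper()
    b = long_seq.upper()
    if a == b:
        return "identical"
    pos = b.find(a)
    if pos == 0:
        return "prefix"    # long_seq starts with short_seq
    if pos > 0:
        return "internal"  # short_seq occurs inside long_seq, not at the start
    return "absent"        # no exact match; an aligner such as BLAT is needed
```

A "prefix" result would be consistent with the longer record being the
same 3' UTR plus additional downstream sequence.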

Comparing to other datasets, RefSeq does not have this new variant. The
human mRNA track has a single read that represents a portion of the
extended UTR. Examining EST data, spliced ESTs do not confirm the
region, but unspliced ESTs do, with significant, overlapping tiling.
These cannot be assigned a strand without a splice site, so keep in mind
that there may be another gene present on the minus strand, perhaps even
a pseudogene that lacks introns; the tiling being so complete is a bit
suspicious. Sequence data from other species (other RefSeq, mRNA, EST)
suggest that there is evidence for some type of transcription in this
region, and it is often connected to the positive strand with splice
sites. None of these are intron-free, unlike the ENSEMBL transcript,
which reports the region as intron-free. From examination of the
Conservation data, the genomic region is syntenically conserved at the
genome level, from chimp to mouse, and in most mammals evolutionarily in
between.

The UCSC Genes track was last revised on 2009-10-08, so the extended
ENSEMBL transcript was not considered. The extended 3' UTR does seem very
likely to be a transcribed region of the genome, perhaps the 3' UTR of
this gene, perhaps extended through 2-3 exons. A solid, contiguous block
of this length is possible, but does not quite fit with the other data.
However, we are looking at sequence evidence only; there may be more
evidence based on laboratory results that is not apparent from this
analysis perspective. Reviewing the other evidence at ENSEMBL and keeping
an eye on other datasets (in particular RefSeq) as this data is reviewed
by other teams may help to determine or confirm what exactly this region
represents.

In summary, the new data may be a legitimate extended 3' UTR, perhaps
multi-exon, or it may represent confusion with a non-coding, unspliced,
transcribed gene/pseudogene on the minus strand. If you are able to
correlate expression information with the region (using ENCODE or other
microarray data), that may also provide some clues. I will leave that
part of the analysis for you to explore.

Hopefully this helps a bit,

Jennifer

---------------------------------
Jennifer Jackson
UCSC Genome Bioinformatics Group
http://genome.ucsc.edu/

On 2/16/10 2:17 PM, Manisha Brahmachary wrote:
> Hello,
>
>
>
> I have a query regarding downloading the 3'UTR for Ensembl genes for Homo
> sapiens.
>
>
>
> I am trying to download the 3'UTR for all Ensembl genes (hg19) for human.
>
>
>
> From the UCSC Table Browser I do the following:
>
>
>
> Clade: Mammal  Genome: Human  Assembly: GRCh37
>
> Group: Genes and Gene Prediction Tracks  Track: Ensembl Genes
>
> Table: ensGene
>
> Region: genome
>
> Output format: sequence
>
>
>
> From the Ensembl Genes Genomic Sequence page:
>
> Sequence Retrieval Region Options
>
> I choose: 3'UTR exons
>
> One FASTA record per gene
>
>
>
> When I download the sequence and compare the FASTA sequence for
> ENST00000327299 with the 3'UTR sequence of the same transcript downloaded
> from Ensembl, I see the lengths are different. The UCSC sequence appears
> to be a subset of the 3'UTR sequence downloaded from Ensembl. (See below
> the two sequences)
>
>
>
> QUESTION:  1. Am I doing the steps right to download the entire 3'UTR
> sequence from the UCSC table, or am I just downloading a part of the 3'UTR
> region?
>
>
>
>
>
>
>
> See below:
>
>
>
> FROM UCSC:
>
> >hg19_ensGene_ENST00000327299 range=chr1:65691861-65693173 5'pad=0 3'pad=0 strand=+ repeatMasking=none
>
> CCCTGCCCAATGGAAGAACCAGGAAGATGTGGTCATTCATTCAATAGTGT
>
> GTGTAGTATTGGTGCTGTGTCCAAATTAGAAGCTAGCTGAGGTAGCTTGC
>
> AGCATCTTTTCTAGTTGAAATGGTGAACTGATAGGAAAACAAATGAGTAG
>
> AAAGAGTTCATGAAGAGGCCCTCCTCTGCCTTTCAAAAGGCTGGTCACCT
>
> ACACATGTTTAAGGTGTCTCTGCACATGTCTCAAGCCCATCACAAGAAAG
>
> CAAGTACAGTGTGGATTTCAAATGGTGTGTAACTTCAGCTCCAGCTGGTT
>
> TTTGACAGCTGTTGCTGTGGTAATATTTTTGACATGTGATGGTGATAGTC
>
> TCTGGTTCTCCCCATCCCCACAAAGGCTGTTGAACCACAGCACCAGGAAG
>
> CCTGAGAATGAATCCTGAGGGCTCTAGCCCAGGCTTTGTCCCAGGCTTTC
>
> TGGTGTGTGCCCTCCTGGTAACAGTGAAATTGAAGCTACTTACTCATAGT
>
> GGTTGTTTCTCTGGTCTTGAGTGACTGTGTCCACAGTTCATTTTTTTCCG
>
> GTAGGAATAACTCCTTTTCTACATCCACGCTCCATAGAGTCTCTCCTTTT
>
> CAGACATCCTGGGATGAAAGAATTTGGCTTTTTTTTTTCTTTTTTTTTTT
>
> GGACATCTGTTTTCACTCTTAGGCTTTTAAACAATAGTTATTGCTTTTAT
>
> CCCTCTCAGATTCTAATAACTGAGAGCGATGGGGCTATATTGAATCTCTG
>
> TATGCACTGAGAACTGAGCTATGAAGAGGATCTTATTAAACTGCTGGTCT
>
> GACTTTATGGATTGACACTGTTCCTTTCTTTTATTGTGAAAAAAAAAAAA
>
> AACCCTGAAAGTCTTGGGAACCCCCTAAAGTCTTTTGGGAATCCTCAAAA
>
> AGCATGGGAAGTTAAGTATTTAGCTACATAAATGTTGTAAGATCATATCT
>
> TATGTATAGAAGTAATAAGACCATTTGGAATTACTGGACTAATTGAATAG
>
> TTAAGGTTTCTATTCGGGACAATAAAATGTATTTTGAAAGTGCTGCTAAC
>
> TATTGATGCTGACAGTGTTTCACTCCTATGAGTGACCCAAACATATTATA
>
> AATATGTGGTAAAGGGAATGGAGCCTGTGGGGTTGAGCAGAATGTTGTAC
>
> TAGCTGTGCCTGGACTGAGTATAACAGCTTTATGATTATGAGAAAACAAA
>
> TTCTTTATTTTTTTTTTCTGTTCCAAAGATTCATCCTATGGGGTGGCCAT
>
> AAAGTCTAGAATTAGATACTAATATTTTGTCATTCATTATAACATATCAA
>
> TAAACCATTTGTT
>
>
>
> FROM ENSEMBL
>
>
>
> >ENSG00000162433|ENST00000327299
>
> CCCTGCCCAATGGAAGAACCAGGAAGATGTGGTCATTCATTCAATAGTGTGTGTAGTATT
>
> GGTGCTGTGTCCAAATTAGAAGCTAGCTGAGGTAGCTTGCAGCATCTTTTCTAGTTGAAA
>
> TGGTGAACTGATAGGAAAACAAATGAGTAGAAAGAGTTCATGAAGAGGCCCTCCTCTGCC
>
> TTTCAAAAGGCTGGTCACCTACACATGTTTAAGGTGTCTCTGCACATGTCTCAAGCCCAT
>
> CACAAGAAAGCAAGTACAGTGTGGATTTCAAATGGTGTGTAACTTCAGCTCCAGCTGGTT
>
> TTTGACAGCTGTTGCTGTGGTAATATTTTTGACATGTGATGGTGATAGTCTCTGGTTCTC
>
> CCCATCCCCACAAAGGCTGTTGAACCACAGCACCAGGAAGCCTGAGAATGAATCCTGAGG
>
> GCTCTAGCCCAGGCTTTGTCCCAGGCTTTCTGGTGTGTGCCCTCCTGGTAACAGTGAAAT
>
> TGAAGCTACTTACTCATAGTGGTTGTTTCTCTGGTCTTGAGTGACTGTGTCCACAGTTCA
>
> TTTTTTTCCGGTAGGAATAACTCCTTTTCTACATCCACGCTCCATAGAGTCTCTCCTTTT
>
> CAGACATCCTGGGATGAAAGAATTTGGCTTTTTTTTTTCTTTTTTTTTTTGGACATCTGT
>
> TTTCACTCTTAGGCTTTTAAACAATAGTTATTGCTTTTATCCCTCTCAGATTCTAATAAC
>
> TGAGAGCGATGGGGCTATATTGAATCTCTGTATGCACTGAGAACTGAGCTATGAAGAGGA
>
> TCTTATTAAACTGCTGGTCTGACTTTATGGATTGACACTGTTCCTTTCTTTTATTGTGAA
>
> AAAAAAAAAAAACCCTGAAAGTCTTGGGAACCCCCTAAAGTCTTTTGGGAATCCTCAAAA
>
> AGCATGGGAAGTTAAGTATTTAGCTACATAAATGTTGTAAGATCATATCTTATGTATAGA
>
> AGTAATAAGACCATTTGGAATTACTGGACTAATTGAATAGTTAAGGTTTCTATTCGGGAC
>
> AATAAAATGTATTTTGAAAGTGCTGCTAACTATTGATGCTGACAGTGTTTCACTCCTATG
>
> AGTGACCCAAACATATTATAAATATGTGGTAAAGGGAATGGAGCCTGTGGGGTTGAGCAG
>
> AATGTTGTACTAGCTGTGCCTGGACTGAGTATAACAGCTTTATGATTATGAGAAAACAAA
>
> TTCTTTATTTTTTTTTTCTGTTCCAAAGATTCATCCTATGGGGTGGCCATAAAGTCTAGA
>
> ATTAGATACTAATATTTTGTCATTCATTATAACATATCAATAAACCATTTGTTAAAAGAT
>
> TTGCCTGGTTTCCAGACTTGGTGGCCACCTTGAATAATTCTTGCTGTCTTCTGGGAAGGA
>
> TGATGAAATTTATTCCTGCTGCCTTAAAAATATGTATCCCTTCTTCACCCATCATGACTG
>
> TCCCCAGTGAGTGTCCTTTACTATTCTTGGGAGTGACTCCTGTCTAACTTTTCATACTGG
>
> CGAGAAGAAAAGAAGCCTATTTTAACACTTTAGTGGTGTTGAAACACATTACTTACTTTC
>
> TGAAGATGTCCCAGTGAATCCTCTGTCAATTCACTGCCATATGTAATCTATATGATAAGG
>
> AATGCATCTTCCTTCTAAGTACTGCCCAAACTCTTGCCAGCTCCTCTCCCATTGTCCCTT
>
> CATGTGAATATTTCTTGGCTACCTTAGTGGAAATATAGATCAGTTTTCTCCCCATCCATC
>
> CTCTCAAACATAATGAGATTGTTTACTTTTTAGATTTATGCAGTGAAAATGCCCAGTCAG
>
> GTCTGAATCGTCAGTGCATTATATTGACTCTGAGCACTTTAGAATTTAGAGTTGCAATTG
>
> AATGCCAGCTGTGGAGATGGGGTGCATATCAGATATATAAATAAAGCTCAGGTTTGCTAG
>
> GGAACCAGGTATAGAGAAAAATAAGTCTGATATGAGGAAAATTGCACAATTTAGAGTAGT
>
> TATGCCGTAGAGAAAATTTCCACAAACTAGGAAATGTAGAGAGTTATTCTATAGAATACT
>
> CAAAAGAGGAAAGTATGTGATTTTTGGAAACAGGAAAATCTTCAAACTTCTTTCTTCACT
>
> TCCCTTTGTGTTTAGCTGACCCTCCAATGTGATCATTGCCTTTGGAGTTTGGGAGAGGTA
>
> CGGGAAGTGGCCTGATCCCTGCTTCCATACTTCACTCCTCCATCCATCCTTCCCTCCCTC
>
> TTCCCCTCCAGCTAAATGGACAATTCTAGCCAACATTGAGTCACTCAATAAGTCTCAACA
>
> GTGGGTGTGTTTGCTGAGATTGTCCAGCGGTTGAGCAGTTTGGTCTCACCTCCCTCGCTA
>
> GTTGAGACCAAAAAGAGACAAATAACTTTTTCATGGTCTTTGAAACATAATGCTTATTTC
>
> GTGGTCAATGGCTTTAAAAAAATCTGTTTCTTGTTTTCTTCAACAAACTCACTAGTTTTC
>
> CCTTAAATGATATTGTAAAAATTAAAGTAATCTTGAAAATGTTTTGACAAAAGTAAAATT
>
> AAAGGGACATCTTTTCTTGTTTTGTTTTTTTTTTTTCTATTGCCACACATGACCGTTCCT
>
> TCACCTTTAAGCAAAGAGAGTGGTTCAGATGGTTTCTAAGATGCCAACCTGACCTCGCAT
>
> TCTGTCATTCTACCCAGCTCTTAATTCAATTTGCTTCCATTATCCTAACAGGCTTCTTTC
>
> TTACTTAGAACTTGGAAAGGCTGCTGTATTTAATACCCTCCAACACTAACGCAGACTTAA
>
> GATAGGTACTGTTTATTGAAAACCTACTGAGTGAAATGTGCGGTTTTAGGACCTTCATAA
>
> ACATCTCATTTAATCTTTCTAGCATCCTGTGAAACAGCCATGATTTCACGTTGATAAACA
>
> AAGAAGACAGGGGTCCCAGGGATGTGAAGCATCTTGCCCAGGCTTCTGCTGCTGGTGACC
>
> AGTGTAGCCAGGACTCCAGCCCAGGTTTTCCTGACTCAGAAGACTGAGCTTTTTCCTGGA
>
> TGTTATTAATAGCTAATTGTGTCCAAGCAACCAAGGGCCTTGAGTCTGCTTGGTTCTGCT
>
> TATGGCCTCACATCAAGAAATGGAGCTAGTCCATGTCTGTAGTCCCAATGCTTTGGGAAG
>
> CCATGATGGGAAGGTTGTCGGAGGCCAAAAGTTCAAGACCAGGCTGGGCAATATCACAAG
>
> ACTCCATCTCTACGGAAAAGTAAAAAATTAGCCAGTCATGGTGGTGTACACTTATGGTCC
>
> TAGTTACTCAGGAGACTTAGGCAGGAGGATTGCTTGATCCTAGGAATTCGAGGCTGCAGT
>
> GAGCTATGATTGCACCTCTGCACCCAAGCCTGGGCGACACAGCGAGACCCTCTCTCTTAA
>
> AAAAAAAAAATAGCAGAGCTCACCAAAGTGATGTTCACCTTTTTATGACATTCCTTTTTC
>
> TTAGCTTAAGAAAAGAAAGCTGCTAGATGAGAGTCTTAGTTTTCCTGCATAAGACCTCCT
>
> TTATGAATAGAATAAAAGACTGTCAAAGTAGGCTGGGCTTGGGCCCAGGCTAATCTATGA
>
> AGGAAGCAAGCTCGTGTTCCTTACCTATCCTTTTGGTGTCCATTGGATTGTGCCCCGAAG
>
> TGGCCTTTACCCTTGAGCCGTCCCCAGCCATGGTGCTCACACATAGGCTTTTGAGCTCCT
>
> TGGAGCTATCCAGATCCTGCTCACTTTTCCTTCCTGAGATCAGAACAAATCACCCCCTTA
>
> CTCCCACTCCAAACAAGGCCTTGATGATAAACTAATCCTTCCTAAAATGCTGGTAGGTAA
>
> ACAAGCAATGATGAAGCATTGAACACAGGTTAACTCCTGACTTTTGTACCATTGTCTATT
>
> CCATTACACATTAACATGACTCTGAATGCCAGATCCAAACCTTTGCCCACCATCTGCTTG
>
> TCGTGCAACAGTTGAGGCAGTAACCAGGGGAGATTCACTTCCTGTCTTGTCCTTCCCCAG
>
> GGATCACCCCCCTGCTGCCCTCTAGCAGCCAAACTCAGATGAGTTCCATTGTTACCCTAG
>
> GTGTGCCCATCTCTTTGGTAGGGAAGGAGAAAGGTAAGAATAGCCATCAGTGAGGAAGGA
>
> TTCTTGGAGCGAGGAGCCACTGTGGTTTTTCCTGCTATTTAAGATGTTGAGACCGGATAA
>
> CTTTAGAAAGATACCTGCACAAACCCATAAATAGTGCTTTTATAAAGTTTAGTTCACCGG
>
> AACCTGAGTTCAGTATTTGACATTAGCTTTTTGTCCAAAGAGTTGAAGCCTGCTGGAGGT
>
> CTTTGCTCAAATAATAAATACCACATATTTCCAAGTGTGTTCAGGTATAGGCACTAGGTA
>
> CTGTCTGTTTACTTCATGTTAGGCACATTACATGCATTGGCTAATCAAATCCTCATCAAT
>
> TACATATGTAATAATCTAAACTTGCCTCCTTGTATTATAAATGGAAATAATCCTGTTTAT
>
> TTAAACGGGTTTTCATGTACCTGTAGGGATTAGGAAACTCAAATGGCCTTTTTAATACCT
>
> TTCCCTAGTTTGAGCTCCCTGTTCTCTTTAACAGATAAAACAACATATTTGCTTCAGCCT
>
> GGAATCTGTTTTTGGTGCTTTGGTGCAGAGACAGGAAATGGGCACTCAGAGTCACACTGG
>
> TAGTTGCACACTGTATCTACAGAGGGCGTGTCTCATCTGTACTCTGCTGGGTTACAGGAT
>
> TTCAGTAGGTATTTGTGTCCACCTGAGAATTCTGTTTATTACCTTTCATTTGACAGTGTC
>
> TTTCCTTTCTGCAGTTGATTTTGCTAGAGAGGCAATTCATAAGGTGAGGTCCTGTTCATA
>
> GTATGACTTGCTTTCTCAATATCTCCTTCAATTTTTAGTAACTCTTGGTCTATTTGGTGT
>
> CTTTAAAAAAAATAACCTAGTAATAAAGACTTCTTTTAATGTGGAAATGTGGTCTGGTAG
>
> TAAGTTATTTCTTTCCACATGTAACTGACCCAATCTGGTTTCCAAATGAGAAGTGTGCAG
>
> GCCCCAGAGGTTGAGAAGCCATATTTCAACTGTGAAAAAAATCTGCTTCCTGCATCTGTT
>
> GAAATATAGTTGTTCATACTTGCCATCCCTTATCTTTCTTGTAACAATTTGCACAGTTCT
>
> TGCCAGAATAAATGCCATTATCTGTATGTTTCAGGGAGTTCCCCAATTTGATCATTTTTG
>
> TGTGTGTGTGGTGTGTGTGTGAGAGAGAGAGATACTGCAGTAAAACATTTCTAAAGGATG
>
> AAAGCTCTTGTATGGCATAGATATGAATTCCTTCCTCTGGTAATAATTAGGTTATTCCCA
>
> GAAGCACAGTGTCATTCTTTAAATAAAAGCTTTCCTGTTTAAAGCTTTTCAAAGGAGCAG
>
> ACCACCTTGAAGATTCCCCCTAGGGTTGATATGTGTCTAATTCATTTTATAAAAATTATT
>
> CTTGTCTTCATTTTAAAGCTTTGGCTATATAGTCAGAAATGTCCTAAATAACAAACTATT
>
> TTGTATTTAATTTAGGGAAGACTAAAGGGAAGAAAAATGAAAACTCAGTCTTTATGTAAG
>
> CTCCAAGGATATTAGGGCTTAAAGGGCTTTTCTAGTTTTATGAGAATTTGTACTACTGAT
>
> TTTTATATATTCCTGTTTTTGAGATGAACAGATCTCTGGGGAAATTGTTGAGTTACAATG
>
> GCATTTCACTGTGATCCCTCTCAAGCTCAGATCAGTTCTATAACCCAATGACAACCTGTC
>
> TCTTTGGTTTACTGTCCTGTGAAATGTCAGCTCAAGTTTCCCAGAAGTCGTGTGTTTATG
>
> ATGAGTCAGAGTGCTTTTCCTCGGTGGGACAGTTGCTGGCCCTCTTAATTTTGGTGTATG
>
> TGCTTCCAAGTATCTAAACCTCCAGTCTGATCTGTATATGCTATCCTAACTGTTAATTGT
>
> ATTATTGATTATGTTGATTATCTTGCTTGAAGGTTCATACTTTTCAATTTGATAGAAATA
>
> AAGTTTTTTTCTGCTTATA
>
>
>
>
>
>
>
>
>
> _______________________________________________
> Genome maillist  -  [email protected]
> https://lists.soe.ucsc.edu/mailman/listinfo/genome

