No probs Lionel,
We will get back to you as soon as your results are ready.
Cheers
Syed
Lionel Brooks 3rd wrote:
Hi Syed,
That would be fantastic! We need the sequences because we will be
building a tiling array which should contain all known UTRs.
Our biomart shopping list is:
1. all 3' UTRs plus the 100 bp flanking upstream on the transcript
level (exonic, not intronic)
2. all 3' UTRs (without any flanking sequence)
3. all 5' UTRs plus the 100 bp flanking downstream on the transcript
4. all 5' UTRs (without any flanking sequence)
Preferably these would be in separate files with the sequences indexed
by gene ID, transcript ID and genomic coordinates.
We would be especially grateful if redundancy in the sequence set
could be reduced with the 'Unique only' option...as we expect a good
deal of redundancy across splicoforms but we certainly understand that
this is a fairly tall order.
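If the redundancy cannot be removed at export time, it could also be reduced locally afterwards. The following is a minimal sketch only, assuming the export is FASTA text whose header lines carry the gene ID, transcript ID and coordinates; the function names and the IDs in the example are hypothetical placeholders, not real Ensembl identifiers or BioMart API calls.

```python
def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts; emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def collapse_identical(records):
    """Group headers by identical sequence, keeping first-seen order."""
    groups = {}
    for header, seq in records:
        groups.setdefault(seq.upper(), []).append(header)
    return groups


# Hypothetical example records: two transcripts share the same UTR.
example = """>ENSG_A|ENST_A1|1|100|200
ACGTACGT
>ENSG_A|ENST_A2|1|100|200
ACGTACGT
>ENSG_B|ENST_B1|2|500|650
TTGACC
"""

collapsed = collapse_identical(parse_fasta(example))
for seq, headers in collapsed.items():
    # One record per distinct sequence; all source headers are retained.
    print(">" + ";".join(headers))
    print(seq)
```

Joining the headers rather than discarding them keeps every transcript traceable to its collapsed sequence, which matters if the array probes need to be mapped back to individual isoforms.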
Thank you for helping us out Syed! Your help has already been
indispensable.
Sincerely,
Lee
Syed Haider wrote:
Hi Lionel,
If all you need is the 3utr or 5utr sequences, or a specific set of
queries you are interested in, we can, with the help of the Ensembl
team, execute them locally and send you the *correct* sequences. This
will be rectified in the Ensembl v52 release (Ensembl team to confirm).
Cheers
Syed
Lionel Brooks 3rd wrote:
Wow...thank you for a swift response Syed!
This is important information for us indeed!!
Syed Haider wrote:
Hi Lionel,
Before we dive into the investigation, I would like to flag that the
5utr and 3utr sequences in the current release of Ensembl are incorrect
in mart due to a configuration bug in the Ensembl configs, which are
managed by the Ensembl Mart team. I am cc'ing the Ensembl experts on
this email, who will be able to help you with this.
Cheers
Syed
Lionel Brooks 3rd wrote:
Hello all,
First, I would like to say thank you to the developers that are
making Biomart accessible for us bench scientists!
My question concerns the behavior of the "Unique Only"
option. In an effort to retrieve all 3' UTRs and the
accompanying 100 bp upstream on the transcripts, I used the
following parameters (my database of choice was the Ensembl
NCBI36 database):
$query->setDataset("hsapiens_gene_ensembl");
$query->addFilter("upstream_flank", ["100"]);
$query->addAttribute("ensembl_gene_id");
$query->addAttribute("ensembl_transcript_id");
$query->addAttribute("3utr");
$query->addAttribute("external_gene_id");
$query->addAttribute("external_gene_db");
$query->addAttribute("chromosome_name");
$query->addAttribute("start_position");
$query->addAttribute("end_position");
$query->addAttribute("biotype");
$query->addAttribute("transcript_start");
$query->addAttribute("transcript_end");
$query->addAttribute("ensembl_exon_id");
$query->addAttribute("exon_chrom_start");
$query->addAttribute("exon_chrom_end");
$query->addAttribute("strand");
$query->addAttribute("rank");
This retrieves 61356 sequences, which is equal to the number of
sequences retrieved when all 3'UTRs are called without the 100 bp
flanking...bravo.
Now, I was curious about what might happen if I fetch the "Unique
rows only". When I flag "Unique Only" as true and I fetch all 3'
UTRs with 100 bp upstream then I get only 8916 sequences.
OK, fair enough...
BUT when I flag "Unique Only" as true and fetch all 3' UTRs
WITHOUT the 100 bp flanking upstream then I get 14322 sequences.
This does not make sense to me because I would expect to retrieve
more unique sequences when I have the 100 bp upstream, which adds
complexity to the sequence and should yield more unique hits.
I cannot figure out why I am getting the behavior described above.
Could someone point me in the direction of some good documentation
on this and/or explain to me the behavior I am observing?
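The expectation described above can be illustrated on toy data. This is a sketch under stated assumptions: every flank has the same fixed length (2 bases here, standing in for 100 bp), the UTR itself is unchanged, and the sequences are invented. It says nothing about how BioMart itself implements "Unique Only". Under these assumptions, prefixing a flank can split a group of identical UTRs but can never merge two distinct ones, so the number of distinct sequences can only stay the same or grow.

```python
def count_distinct(seqs):
    """Count distinct sequences, ignoring case."""
    return len({s.upper() for s in seqs})


# Invented toy data: three transcripts share one UTR, one differs.
utrs = ["AAGT", "AAGT", "CCTA", "AAGT"]
# Fixed-length upstream flanks (length 2 stands in for 100 bp).
flanks = ["GG", "GG", "TT", "CA"]

with_flank = [flank + utr for flank, utr in zip(flanks, utrs)]

print(count_distinct(utrs))        # 2
print(count_distinct(with_flank))  # 3
```

Running the same count on the two exported sequence sets, independently of the "Unique Only" flag, would show whether the drop from 14322 to 8916 is present in the sequences themselves or arises elsewhere in the query.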
Thank You!
-Lee Brooks