now cc'ing actually :)
Syed

Syed Haider wrote:
Hi Lionel,

Before we dive into the investigation, I would like to flag that the 5utr and 3utr sequences in the current release of Ensembl are incorrect in mart due to a configuration bug in the Ensembl configs, which are managed by the Ensembl Mart team. I am cc'ing the Ensembl experts on this email, who will be able to help you with this.

Cheers
Syed



Lionel Brooks 3rd wrote:
Hello all,
First, I would like to say thank you to the developers who are making Biomart accessible to us bench scientists!

My question concerns the behavior of the "Unique Only" option. In an effort to retrieve all 3' UTRs and the accompanying 100 bp upstream on the transcripts, I gave the following parameters (my database of choice was the Ensembl NCBI36 database):

$query->setDataset("hsapiens_gene_ensembl");
$query->addFilter("upstream_flank", ["100"]);
$query->addAttribute("ensembl_gene_id");
$query->addAttribute("ensembl_transcript_id");
$query->addAttribute("3utr");
$query->addAttribute("external_gene_id");
$query->addAttribute("external_gene_db");
$query->addAttribute("chromosome_name");
$query->addAttribute("start_position");
$query->addAttribute("end_position");
$query->addAttribute("biotype");
$query->addAttribute("transcript_start");
$query->addAttribute("transcript_end");
$query->addAttribute("ensembl_exon_id");
$query->addAttribute("exon_chrom_start");
$query->addAttribute("exon_chrom_end");
$query->addAttribute("strand");
$query->addAttribute("rank");

This retrieves 61356 sequences, which is equal to the number of sequences retrieved when all 3' UTRs are fetched without the 100 bp flank...bravo. Now, I was curious about what might happen if I fetch the "Unique rows only". When I flag "Unique Only" as true and fetch all 3' UTRs with 100 bp upstream, I get only 8916 sequences.
OK, fair enough...
BUT when I flag "Unique Only" as true and fetch all 3' UTRs WITHOUT the 100 bp upstream flank, I get 14322 sequences. This does not make sense to me: I would expect to retrieve more unique sequences when I include the 100 bp upstream, since it adds complexity to each sequence and should yield more unique hits, not fewer.
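As a sanity check on these counts, the returned sequences can be de-duplicated locally, independent of the "Unique Only" flag. The following is only a minimal sketch, assuming the results have been saved as FASTA text; count_unique_sequences and the sample records are hypothetical and not part of the BioMart API. It compares sequences only, whereas "Unique Only" may well compare whole result rows.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the BioMart API): count the total and
# the distinct sequences in a block of FASTA-formatted text.
sub count_unique_sequences {
    my ($fasta) = @_;
    my %seen;
    my $total = 0;
    my $current;                               # sequence of the record being read
    for my $line (split /\n/, $fasta) {
        next if $line =~ /^\s*$/;              # skip blank lines
        if ($line =~ /^>/) {                   # a header line starts a new record
            if (defined $current) { $seen{$current}++; $total++; }
            $current = '';
        } elsif (defined $current) {
            $line =~ s/\s+//g;                 # drop stray whitespace
            $current .= uc $line;              # join wrapped lines, ignore case
        }
    }
    if (defined $current) { $seen{$current}++; $total++; }   # last record
    return ($total, scalar keys %seen);
}

# Made-up records for illustration only; transcript_B is wrapped over two lines.
my $example = join "\n",
    '>transcript_A', 'ACGTACGT',
    '>transcript_B', 'ACGT', 'ACGT',
    '>transcript_C', 'TTTTGGGG';

my ($total, $unique) = count_unique_sequences($example);
print "total=$total unique=$unique\n";         # prints "total=3 unique=2"
```

Running this on the output of both queries (with and without the 100 bp flank) would show whether the sequences themselves collapse in the way the "Unique Only" counts suggest.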

I cannot figure out why I am getting the behavior described above.

Could someone point me in the direction of some good documentation on this and/or explain to me the behavior I am observing?

Thank You!
-Lee Brooks
