Hello all,
First, I would like to say thank you to the developers who are making
BioMart accessible to us bench scientists!
My question concerns the behavior of the "Unique Only" option.
To retrieve all 3' UTRs together with the accompanying 100 bp upstream
on the transcripts, I used the following parameters against the
Ensembl NCBI36 database:
$query->setDataset("hsapiens_gene_ensembl");
$query->addFilter("upstream_flank", ["100"]);
$query->addAttribute("ensembl_gene_id");
$query->addAttribute("ensembl_transcript_id");
$query->addAttribute("3utr");
$query->addAttribute("external_gene_id");
$query->addAttribute("external_gene_db");
$query->addAttribute("chromosome_name");
$query->addAttribute("start_position");
$query->addAttribute("end_position");
$query->addAttribute("biotype");
$query->addAttribute("transcript_start");
$query->addAttribute("transcript_end");
$query->addAttribute("ensembl_exon_id");
$query->addAttribute("exon_chrom_start");
$query->addAttribute("exon_chrom_end");
$query->addAttribute("strand");
$query->addAttribute("rank");
This retrieves 61356 sequences, the same number I get when I fetch all
3' UTRs without the 100 bp flank... bravo.
Now, I was curious about what would happen if I fetched unique rows
only.
When I set "Unique Only" to true and fetch all 3' UTRs with 100 bp
upstream, I get only 8916 sequences.
OK, fair enough...
BUT when I set "Unique Only" to true and fetch all 3' UTRs WITHOUT
the 100 bp upstream flank, I get 14322 sequences.
This does not make sense to me because I would expect to retrieve more
unique sequences, not fewer, when the 100 bp upstream is included: the
extra bases add complexity to each sequence and should yield more
unique hits.
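To make my expectation concrete, here is a minimal Perl sketch using
made-up toy sequences (not BioMart output, and the flanks are shortened
to 4 bp for readability). The helper count_unique is hypothetical and
only illustrates my reasoning: if deduplication compared the sequences
themselves, prepending a fixed-length flank could never reduce the
number of distinct sequences.

```perl
use strict;
use warnings;

# Hypothetical helper: count the distinct strings in a list.
sub count_unique {
    my %seen;
    $seen{$_}++ for @_;
    return scalar keys %seen;
}

# Made-up toy records: [upstream flank, 3' UTR]. Not real Ensembl data.
my @records = (
    [ 'AAAA', 'GATTACA' ],
    [ 'CCCC', 'GATTACA' ],   # same UTR, different flank
    [ 'AAAA', 'GATTACA' ],   # exact duplicate of the first record
    [ 'GGGG', 'TTTGGG'  ],
);

my @utr_only   = map { $_->[1] } @records;
my @with_flank = map { $_->[0] . $_->[1] } @records;

printf "unique UTRs only:       %d\n", count_unique(@utr_only);    # 2
printf "unique UTRs with flank: %d\n", count_unique(@with_flank);  # 3
```

In this toy case the flanked set has at least as many unique entries as
the UTR-only set, which is the opposite of what I see from BioMart.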
I cannot figure out why I am getting the behavior described above.
Could someone point me in the direction of some good documentation on
this and/or explain to me the behavior I am observing?
Thank you!
-Lee Brooks