On 26 Jan 2007, at 15:12, Bob MacCallum wrote:
Hi,
While we're talking about the results section, I've wondered whether a
"unique records only" option could be provided. To the average biologist
user, the following query brings back duplicated genes (because the PFAM
domains are features of transcripts, which I have deselected from the
output attributes).
<Query virtualSchemaName = "default" Header = "1" count = ""
softwareVersion = "0.5" >
<Dataset name = "hsapiens_gene_ensembl" interface = "default" >
<Attribute name = "ensembl_gene_id" />
<Filter name = "pfam" value = "PF00169"/>
</Dataset>
</Query>
However, we can leave the default gene + transcript attributes and
instead provide two PFAM ids (that I know are sometimes in the same
protein). Then the results again contain some duplicate records
(although adding the PFAM id output attribute would of course fix this).
<Query virtualSchemaName = "default" Header = "1" count = ""
softwareVersion = "0.5" >
<Dataset name = "hsapiens_gene_ensembl" interface = "default" >
<Attribute name = "ensembl_gene_id" />
<Attribute name = "ensembl_transcript_id" />
<Filter name = "pfam" value = "PF00169,PF00017"/>
</Dataset>
</Query>
snippet:
ENSG00000102010 ENST00000342014
ENSG00000102010 ENST00000342014
ENSG00000102010 ENST00000348343
ENSG00000102010 ENST00000348343
ENSG00000102010 ENST00000357607
ENSG00000102010 ENST00000357607
ENSG00000102010 ENST00000380391
ENSG00000102010 ENST00000380391
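Until something like a "unique records only" option exists, duplicate rows of this kind could be collapsed on the client side after download. The following is only a minimal sketch in Python, assuming whitespace-separated result rows like those in the snippet above; `unique_rows` is a hypothetical helper name, not part of BioMart.

```python
def unique_rows(lines):
    """Yield each distinct row once, keeping first-seen order.

    Hypothetical client-side helper: rows are assumed to be
    whitespace-separated text lines as in the snippet above.
    """
    seen = set()
    for line in lines:
        row = tuple(line.split())
        if not row or row in seen:
            # Skip blank lines and rows that were already emitted.
            continue
        seen.add(row)
        yield row


results = [
    "ENSG00000102010 ENST00000342014",
    "ENSG00000102010 ENST00000342014",
    "ENSG00000102010 ENST00000348343",
    "ENSG00000102010 ENST00000348343",
]

for gene_id, transcript_id in unique_rows(results):
    print(gene_id, transcript_id)
```

This keeps every row in memory as a set entry, so it is a convenience for modest result sets rather than a substitute for a server-side fix.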
I note, however, that the gene count ("count" button) is always correct.
What do people think?
cheers,
Bob.
Hi Bob,
Yes, this request has come up several times and finally we need to give
in :)
As you correctly pointed out, transcript-level rather than gene-level
annotation is a feature of Ensembl data, and as Ensembl is likely to
stick with this for the foreseeable future, we will have to add a 'fix'
on our side. Unfortunately this cannot be as simple as 'distinct', as
that can have grave consequences for performance on large datasets, in
particular with certain combinations of filters. We will, however, be
able to provide an alteration to the mart structure such that it
artificially provides such annotation at a higher level, which will be
equivalent to 'distinct' but without a performance hit. We are now in
the process of implementing this in MBuilder, so future Ensembl mart
releases should have this 'fix'. This will of course work with other
'non-Ensembl' data as well.
a.
Arek Kasprzyk writes:
On 26 Jan 2007, at 14:43, David Croft wrote:
Hi Arek,
Sounds like a good suggestion, we can consider that. At the moment
you can only ask for the first 10, 20, 50 .... 200 or the whole lot,
but not the pagination that (I think) you have in mind.
Yes, that's right - similar to what you get when you go to Google.
It would be kind of cool if the page displayed the total count of
results and told you that you are on page 3 of 28 (or whatever)
and gave you buttons to go back a page or forward a page.
Not sure about this :) We could certainly add 'next' and 'back' or
equivalents, but the total count would be a bit more problematic. We do
not yet have all the results during preview, and the total count tends
to be expensive, so we do not compute it by default. Let us try to think
about at least some of it.
a.
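One common way to offer 'next' and 'back' without computing a total count is to fetch one row more than the page size and use the extra row only as a signal that another page exists. The following is a rough sketch in Python under that assumption; `fetch_page` and the `fetch_rows(offset, limit)` callable are hypothetical names for illustration and do not describe how MartView works.

```python
def fetch_page(fetch_rows, page, page_size):
    """Return (rows, has_back, has_next) for a zero-based page number.

    fetch_rows(offset, limit) is a hypothetical callable standing in
    for whatever returns a slice of the result set.
    """
    offset = page * page_size
    # Ask for one extra row: if it comes back, a next page exists,
    # and no total count over the whole result set is needed.
    rows = fetch_rows(offset, page_size + 1)
    has_next = len(rows) > page_size
    has_back = page > 0
    return rows[:page_size], has_back, has_next


# Toy data source backed by a list, for illustration only.
data = ["row%d" % i for i in range(25)]

def list_source(offset, limit):
    return data[offset:offset + limit]

rows, has_back, has_next = fetch_page(list_source, page=2, page_size=10)
print(rows, has_back, has_next)
```

The trade-off is that the page can say "next" or "back" but not "page 3 of 28", which matches the concern about the total count being expensive.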
Cheers,
David.
------------------------------------------------------------------------
Arek Kasprzyk
EMBL-European Bioinformatics Institute.
Wellcome Trust Genome Campus, Hinxton,
Cambridge CB10 1SD, UK.
Tel: +44-(0)1223-494606
Fax: +44-(0)1223-494468
------------------------------------------------------------------------
--
Bob MacCallum | VectorBase Developer | Kafatos/Christophides Groups |
Division of Cell and Molecular Biology | Imperial College London |
Phone +442075941945 | Email [EMAIL PROTECTED]