On 27 Feb 2007, at 14:04, Arek Kasprzyk wrote:
On 22 Feb 2007, at 15:48, Rosienne wrote:
Hi,
a few weeks ago I was attending an Open Door Workshop at the
Sanger. I had occasion to speak to one of your team and mention a
couple of problems we regularly encounter when using biomart. I
was advised to post to this address.
My colleagues and I use BioMart to output gene-related
information for lists of microarray feature IDs. Even though we
untick the Ensembl transcript ID box, we still get an output row for
each transcript. In some cases, where genes have 9 documented
transcripts, we get 9 perfectly replicated entries. When dealing
with lists of over a thousand genes each time, this gets very
confusing and generally makes Excel stop responding!
We wonder if in future re-works of the tool a gene specific rather
than a transcript specific output can be made available. We are
aware that for people working on only one, or a handful of genes,
getting all the transcript specific information is essential.
However, it would make life a lot easier for scientists like us
who handle large gene lists if we could specifically select to
obtain only gene specific outputs, 1 gene = 1 row of output.
Dear Rosienne,
this particular problem is really specific to Ensembl data. Ensembl
annotates on a per-transcript rather than a per-gene basis, while
most people 'outside' seem to want the latter :) The ideal solution
would be for 'per gene' annotation to be provided at the source of
Ensembl annotation, but failing that we are now looking for ways
of simply altering the output so that it artificially introduces
'per gene' annotation, letting users like yourself avoid the annoying
repetitions. You must be aware, however, that such an approach has the
potential to introduce conflicting annotation, as it will be totally
artificial. The correct 'per gene' annotation can only be corrected
at the source.
Woah guys - it can't be "corrected" at source - that's not the case.
The "correct" annotation is at the transcript level, which is what
Ensembl provides and what gets Martified. Many people want:
(a) when the results columns from Mart have only gene attributes, not
to be given the entirely redundant, duplicated rows. For this I think
we have come up with a solution
(b) options to have transcript-level information concatenated into
gene-level reports as (perhaps) comma-separated lists
Both, I think (having thought about this more), are better handled
generically in the Mart View/output layer than in Ensembl. Ensembl
has the _correct_ annotation structure (transcript orientated); it is
just that most people want a _gene_ level view. This should be in the
BioMart area.
And - Arek - please stop characterising this as an Ensembl error.
Besides the fact that the majority of the BioMart team is part of the
Ensembl group, most users don't care about which side of BioMart or
Ensembl this problem lies on, they just want it solved!
Let's
(a) put in the simple sort by gene id, don't print rows redundant
to the previous
(should be easy, right?)
(b) discuss how to think about concatenation, ideally in the
software, not the denormalisation
a.
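As a rough illustration of (a) and (b) above, here is a minimal Python sketch that post-processes exported rows outside BioMart. It is not how MartView does or will do it. The function names and the GENE_A / TX_1 style identifiers are made up for the example, and it assumes each row is a tuple of strings with the gene ID in the first column.

```python
from itertools import groupby


def drop_redundant_rows(rows, gene_col=0):
    """(a) Sort by gene ID and skip any row identical to the previous one."""
    kept = []
    previous = None
    # Sorting on the whole row as a tiebreak puts exact duplicates side by side.
    for row in sorted(rows, key=lambda r: (r[gene_col], r)):
        if row != previous:
            kept.append(row)
        previous = row
    return kept


def concatenate_by_gene(rows, gene_col=0, sep=","):
    """(b) One row per gene; differing values in a column are joined with sep."""
    by_gene = sorted(rows, key=lambda r: r[gene_col])  # stable sort
    merged_rows = []
    for _gene_id, group in groupby(by_gene, key=lambda r: r[gene_col]):
        merged = []
        for column in zip(*group):
            # dict.fromkeys removes repeats while keeping first-seen order.
            merged.append(sep.join(dict.fromkeys(column)))
        merged_rows.append(tuple(merged))
    return merged_rows


# Hypothetical data: gene-only columns, repeated once per transcript.
gene_only = [("GENE_A", "kinase"), ("GENE_A", "kinase"), ("GENE_B", "receptor")]
print(drop_redundant_rows(gene_only))

# Hypothetical data: a transcript column is included.
with_transcripts = [
    ("GENE_A", "kinase", "TX_1"),
    ("GENE_A", "kinase", "TX_2"),
    ("GENE_B", "receptor", "TX_3"),
]
print(concatenate_by_gene(with_transcripts))
```

Note that (b) hides which transcript a value came from, which is exactly the kind of artificial 'per gene' view discussed above.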
Our second major problem stems from the fact that sometimes there
is no information linked to particular microarray feature IDs. The
count tab tells you how many out of your list were found but there
is no information whatsoever about the ones that were not found.
Manually finding which 50 out of a list of 1000 were not found is
not easy. An output list of features not found, or inclusion of
the not found items within the output with a short 'not found'
comment next to them would be very useful.
In summary, for us the ideal situation would be if we could input
a list of 1000 feature IDs and as output get a list of 1000 rows,
1 gene per row, in the same sequence as the input list, with
either empty cells or a not found comment against those not found.
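The output described here (one row per input ID, in input order, with a marker for missing ones) could be approximated on the user's side along these lines. This is only a sketch in Python, not a BioMart feature. The function name, the F1 / GENE_A identifiers and the 'not found' placeholder are assumptions for illustration, and where a feature ID maps to several result rows it simply keeps the first one.

```python
def align_to_input(feature_ids, rows, id_col=0, placeholder="not found"):
    """Return (aligned_rows, missing_ids) for a list of input feature IDs.

    aligned_rows has exactly one row per input ID, in input order.
    IDs with no result get a two-column row: (feature_id, placeholder).
    """
    first_hit = {}
    for row in rows:
        # Keep only the first result row seen for each feature ID (assumption).
        first_hit.setdefault(row[id_col], row)

    aligned = []
    missing = []
    for feature_id in feature_ids:
        if feature_id in first_hit:
            aligned.append(first_hit[feature_id])
        else:
            aligned.append((feature_id, placeholder))
            missing.append(feature_id)
    return aligned, missing


# Hypothetical input list and result rows, in a different order from the input.
ids = ["F1", "F2", "F3"]
results = [("F3", "GENE_C"), ("F1", "GENE_A"), ("F1", "GENE_A2")]
aligned, missing = align_to_input(ids, results)
print(aligned)
print(missing)
```

The separate list of missing IDs covers the 'which 50 out of 1000 were not found' case without having to scan the aligned output.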
Apart from these particular issues, BioMart is great and has made
data mining of large data sets so much more accessible!
Thank you.
Regards
Rosienne
_______________________________________________________
Rosienne Farrugia
Division of Transfusion Medicine
Department of Haematology
University of Cambridge
Long Road
Cambridge
CB2 2PT
Tele: 01223 548008
Fax: 01223 548136
----------------------------------------------------------------------
Arek Kasprzyk
EMBL-European Bioinformatics Institute.
Wellcome Trust Genome Campus, Hinxton,
Cambridge CB10 1SD, UK.
Tel: +44-(0)1223-494606
Fax: +44-(0)1223-494468
----------------------------------------------------------------------