On Tue, 27 Feb 2007, Ewan Birney wrote:
On 27 Feb 2007, at 14:04, Arek Kasprzyk wrote:
On 22 Feb 2007, at 15:48, Rosienne wrote:
Hi,
a few weeks ago I was attending an Open Door Workshop at the Sanger. I had
occasion to speak to one of your team and mention a couple of problems we
regularly encounter when using BioMart. I was advised to post to this
address.
I, and my colleagues, use BioMart to output gene-related information for
lists of microarray feature IDs. Even though we untick the Ensembl
transcript ID box, we still get an output row for each transcript. In some
cases, where genes have 9 documented transcripts, we get 9 perfectly
replicated entries. When dealing with lists of over a thousand genes each
time, this gets very confusing and generally makes Excel stop responding!
We wonder if in future re-works of the tool a gene specific rather than a
transcript specific output can be made available. We are aware that for
people working on only one, or a handful of genes, getting all the
transcript specific information is essential. However, it would make life
a lot easier for scientists like us who handle large gene lists if we
could specifically select to obtain only gene specific outputs, 1 gene = 1
row of output.
Dear Rosienne,
this particular problem is really specific to Ensembl data. Ensembl
annotates on a per-transcript rather than a per-gene basis, while most
people 'outside' seem to want the latter :) The ideal solution would be if
'per gene' annotation was provided at the source of Ensembl annotation, but
failing that we are now looking for ways of simply altering the output such
that it will artificially introduce 'per gene' annotation, so that users
like yourself would be able to avoid the annoying repetitions. You must be
aware, however, that such an approach has the potential to introduce
conflicting annotation, as it will be totally artificial. The correct 'per
gene' annotation can only be corrected at the source.
Woah guys - it can't be "corrected" at source - that's not the case. The
"correct" annotation is at the transcript level, which is what Ensembl
provides and what gets Martified. Many people want:
(a) when the result columns from Mart only have gene attributes, not to get
the entirely redundant rows of things being duplicated. For this I think we
have come up with a solution
(b) options to have transcript-level information concatenated into
gene-level reports as (perhaps) comma separated lists
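
To make option (b) concrete, here is a minimal sketch of collapsing transcript-level rows into one row per gene, joining differing values with commas. It is only an illustration of the idea, not how MartView does or would do it: the function name, the column names and the sample IDs are all made up, and every value is assumed to be a string.

```python
def collapse_to_gene_rows(rows, gene_key="gene_id", sep=","):
    """Collapse transcript-level rows into one row per gene.

    Hypothetical helper: values that differ between transcripts of the
    same gene are joined with `sep`; identical values are reported once.
    """
    merged = {}  # gene id -> {column -> list of distinct values}
    for row in rows:
        per_gene = merged.setdefault(row[gene_key], {})
        for column, value in row.items():
            values = per_gene.setdefault(column, [])
            if value not in values:
                values.append(value)
    return [
        {column: sep.join(values) for column, values in per_gene.items()}
        for per_gene in merged.values()
    ]


# Made-up transcript-level rows (one row per transcript).
rows = [
    {"gene_id": "G1", "transcript_id": "T1", "symbol": "abc"},
    {"gene_id": "G1", "transcript_id": "T2", "symbol": "abc"},
    {"gene_id": "G2", "transcript_id": "T3", "symbol": "xyz"},
]
gene_rows = collapse_to_gene_rows(rows)
```

With the sample rows, G1 ends up as a single row whose transcript_id column holds both transcript IDs, while the symbol, being identical for both transcripts, appears once.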
Both, I think (having thought about this more), are better handled
generically in the Mart View/output layer than in Ensembl. Ensembl has the
_correct_ annotation structure (transcript orientated); it is just that
most people want a _gene_ level view. This should be in the BioMart area.
And - Arek - please stop characterising this as an Ensembl error. Quite
apart from the fact that the majority of the BioMart team is part of the
Ensembl group, most users don't care which side of BioMart or Ensembl this
problem lies on, they just want it solved!
Let's
(a) put in the simple sort by gene id, don't print rows redundant to the
previous
(should be easy, right?)
(b) discuss how to think about concatenation, ideally in the software, not
the denormalisation
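
As a rough illustration of step (a), the sketch below sorts the result rows and skips any row identical to the one before it. It assumes each row is a tuple of strings with the gene ID in the first column, so a plain sort orders by gene ID first and puts identical rows next to each other; the function name and sample data are hypothetical.

```python
def drop_redundant_rows(rows):
    """Yield sorted rows, skipping any row equal to the previous one.

    Assumes rows are tuples of strings with the gene ID first, so that
    sorting whole rows groups by gene ID and makes duplicates adjacent.
    """
    previous = None
    for row in sorted(rows):
        if row != previous:
            yield row
        previous = row


# Made-up output where only gene-level attributes were selected, so
# every transcript of G1 produces the same row.
rows = [
    ("G2", "xyz"),
    ("G1", "abc"),
    ("G1", "abc"),
    ("G1", "abc"),
]
unique_rows = list(drop_redundant_rows(rows))
```

Because only the previous row is remembered, this works on arbitrarily long sorted output without holding the set of rows already seen in memory.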
WormMart has the same gene/transcript (CDS in WormBase's case) issues. I
solve this by 'merging' the multiple transcript values into a single
attribute in the gene_main table. See this example:
http://tinyurl.com/2x3bj5
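
For anyone who can't follow the link, the general idea of a 'merged' attribute can be sketched with SQLite from Python as below. This is not the actual WormMart schema or build procedure: apart from gene_main, the table name (gene_cds_dm), the column names and the IDs are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical schema: a gene-level main table plus a dimension table
# holding one row per CDS. Only the name gene_main comes from the thread.
conn.executescript("""
    CREATE TABLE gene_main (gene_id TEXT PRIMARY KEY, cds_merged TEXT);
    CREATE TABLE gene_cds_dm (gene_id TEXT, cds_id TEXT);
    INSERT INTO gene_main (gene_id) VALUES ('G1'), ('G2');
    INSERT INTO gene_cds_dm VALUES ('G1', 'C1a'), ('G1', 'C1b'), ('G2', 'C2');
""")

# Fold the per-CDS values into a single comma separated attribute on the
# gene row. SQLite does not guarantee the order of the joined values.
conn.execute("""
    UPDATE gene_main
    SET cds_merged = (
        SELECT group_concat(cds_id, ',')
        FROM gene_cds_dm
        WHERE gene_cds_dm.gene_id = gene_main.gene_id
    )
""")

merged = dict(conn.execute("SELECT gene_id, cds_merged FROM gene_main"))
conn.close()
```

Once the merged column is on gene_main, a gene-level query returns one row per gene without joining to the dimension table at all.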
The 'merged' attributes are pretty vital for WormBase, where the 'cross
dimension multiplicity' (unconstrained dimension joins) is much more of
an issue than for Ensembl. I would like to see this approach supported
natively by BioMart/MartView, and hopefully MartBuilder as well.
Will