On Tue, 27 Feb 2007, Ewan Birney wrote:


On 27 Feb 2007, at 14:04, Arek Kasprzyk wrote:


On 22 Feb 2007, at 15:48, Rosienne wrote:

Hi,

A few weeks ago I attended an Open Door Workshop at the Sanger. I had occasion to speak to one of your team and mention a couple of problems we regularly encounter when using BioMart. I was advised to post to this address.


My colleagues and I use BioMart to output gene-related information for lists of microarray feature IDs. Even though we untick the Ensembl transcript ID box, we still get an output row for each transcript. In some cases, where genes have 9 documented transcripts, we get 9 perfectly replicated entries. When dealing with lists of over a thousand genes each time, this gets very confusing and generally makes Excel stop responding!

We wonder if, in future re-works of the tool, a gene-specific rather than a transcript-specific output can be made available. We are aware that for people working on only one gene, or a handful of genes, getting all the transcript-specific information is essential. However, it would make life a lot easier for scientists like us who handle large gene lists if we could specifically select to obtain only gene-specific outputs, 1 gene = 1 row of output.


Dear Rosienne,
this particular problem is really specific to Ensembl data. Ensembl annotates on a per-transcript rather than a per-gene basis, while most people 'outside' seem to want the latter :) The ideal solution would be for 'per gene' annotation to be provided at the source of the Ensembl annotation, but failing that we are now looking for ways of simply altering the output so that it artificially introduces 'per gene' annotation, so that users like yourself would be able to avoid the annoying repetitions. You must be aware, however, that such an approach has the potential to introduce conflicting annotation, as it will be totally artificial. The correct 'per gene' annotation can only be corrected at the source.


Woah guys - it can't be "corrected" at source - that's not the case. The "correct" annotation is at the transcript level, which is what Ensembl provides and what gets Martified. Many people want:


(a) when the results columns from Mart have only gene attributes, not to be given the entirely redundant
duplicated rows. For this I think we have come up with a solution

(b) options to have transcript-level information concatenated into gene-level reports as (perhaps)
comma-separated lists

Both, I think (having thought about this more), are better handled generically in the Mart View/output layer than in Ensembl. Ensembl has the _correct_ annotation structure (transcript-oriented); it is just that
most people want a _gene_ level view. This should be in the BioMart area.


And - Arek - please stop characterising this as an Ensembl error. Besides the fact that the majority of the BioMart team is part of the Ensembl group, most users don't care which side of BioMart or
Ensembl this problem lies on; they just want it solved!

Let's

(a) put in the simple sort by gene id, don't print rows redundant to the previous
(should be easy, right?)

(b) discuss how to think about concatenation, ideally in the software, not the denormalisation
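
Option (a) above can be sketched in a few lines. This is only an illustration in Python, not BioMart code: the function name `drop_redundant_rows` and the assumption that the gene ID sits in the first column are hypothetical. Rows are sorted by gene ID (and then by the remaining columns, so that identical rows end up next to each other), and a row is skipped when it is identical to the one before it.

```python
def drop_redundant_rows(rows, key_index=0):
    """Sort rows by gene ID, then skip any row identical to the previous one.

    rows      -- iterable of lists of strings (one list per output row)
    key_index -- position of the gene ID column (assumed to be the first column)
    """
    previous = None
    # Sort by gene ID first, then by the whole row so exact duplicates are adjacent.
    for row in sorted(rows, key=lambda r: (r[key_index], r)):
        if row != previous:
            yield row
        previous = row


if __name__ == "__main__":
    # Hypothetical gene-only output in which one gene has three transcripts.
    table = [
        ["G1", "kinase"],
        ["G2", "transporter"],
        ["G1", "kinase"],
        ["G1", "kinase"],
    ]
    for row in drop_redundant_rows(table):
        print("\t".join(row))
```

Rows that differ in any column are all kept, so this only removes the exact repetitions that appear when no transcript-level attribute is selected.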


WormMart has the same gene/transcript (CDS in WormBase's case) issues. I solve this by 'merging' the multiple transcript values into a single attribute in the gene_main table. See this example:
  http://tinyurl.com/2x3bj5

The 'merged' attributes are pretty vital for WormBase, where the 'cross dimension multiplicity' (unconstrained dimension joins) is much more of an issue than for Ensembl. I would like to see this approach supported natively by BioMart/MartView, and hopefully MartBuilder as well.
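
To make the 'merging' idea concrete, here is a rough sketch in Python. It is not the WormMart implementation and does not touch any mart table: the function name `merge_transcript_values`, the (gene ID, value) input shape, and the comma separator are all assumptions made for illustration. It collects the per-transcript values for each gene and joins them into one string, so each gene ends up with a single merged attribute.

```python
def merge_transcript_values(pairs, separator=","):
    """Collapse (gene_id, transcript_value) pairs into one merged value per gene.

    Values are kept in first-seen order and repeated values are dropped,
    so each gene maps to a single separator-joined string.
    """
    collected = {}
    for gene_id, value in pairs:
        values = collected.setdefault(gene_id, [])
        if value not in values:
            values.append(value)
    return {gene_id: separator.join(values) for gene_id, values in collected.items()}


if __name__ == "__main__":
    # Hypothetical gene/transcript identifiers, one pair per transcript-level row.
    pairs = [("G1", "T1"), ("G1", "T2"), ("G2", "T3"), ("G1", "T1")]
    for gene_id, merged in merge_transcript_values(pairs).items():
        print(gene_id, merged, sep="\t")
```

Whether such merging belongs in the table build or in the output layer is exactly the question discussed in this thread; the sketch is neutral on that.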

Will
