Hi,
We're running into a problem in populating our biomart databases for our Ensembl datasets. The
problem lies in the population of the *_gene_ensembl__exon_transcript__dm mart tables.
I take scerevisiae as an example because it's being done fine by the Ensembl
team, so I don't understand what's wrong with our procedure.
We're using a saccharomyces_cerevisiae_core_50_1i database installed on
our local mysql server. The data were dumped from ensembl ftp site, and
we're using ensembl_50.xml that Ensembl gave us. We just adapted it to our
database connectivity.
The problem is that this table has only data for cerevisiae genes
attached to the mitochondrial chromosome.
The problem lies with this mart building query:
create table fungal_mart_redo_50.TEMP28 as select a.*,b.value as
value_1060,b.attrib_type_id as attrib_type_id_1060 from
fungal_mart_redo_50.TEMP26 as a inner join
saccharomyces_cerevisiae_core_50_1i.seq_region_attrib as b on
a.seq_region_id_1015=b.seq_region_id and (b.attrib_type_id=11 or
b.attrib_type_id is null);
TEMP26 is fine and has data for every chromosomes, TEMP28 has only data
for the mitochondrial chromosome.
What happens is that seq_region_attr table has only attrib_type_id=11
for the mitochondrial seq_region. I haven't seen any rows where
seq_region_attr.attrib_type_id is null
attrib_type_id=11 corresponds to 'Codon Table' attribute.
So I guess it tries to add codon table information into
*_gene_ensembl__exon_transcript__dm. It seems that ensembl by default
doesn't attach a codon table to a seq_region, I guess because they are
all 1, except for mitochndrial or chloroplast genes.
So, we must be missing something in our mart building procedure ? Does
it ring a bell to you ?
I can fix this issue by replacing the inner join by a left join. Should this be
enough ?
We're facing another problem. attrib_type_id=11 might not be true for all datasets. We also have Ensembl core databases that we built ourselves, and it turns out that
the attrib_type corresponding to 'Codon Table' has attrib_type_id=3.
I mean, we want to set up a fungal mart for two datasets, scerevisiae and spombe. So we have a build xml file with attrib_type_id=11 so we get scerevisiae ok (assuming we replace the above inner join by a left join),
but we don't get spombe right as it requires attrib_type_id=3 !
So, is the only way to fix it, to have consistently the same attrib_type_id value for all datasets or could you suggest another solution that does not rely on a predefined
attrib_type_id value ?
Thanks
Arnaud