Hi chaps,
I'm having some problems with partitioning my ensembl multi-species
database into datasets on the basis of a column value indicating species
in one of the tables in the dataset, and wondered if anyone could offer
any advice as discussions with Syed and Andreas here at the EBI haven't
come up with a solution.
Following the advice of Richard after an earlier mail to mart-dev, I've
used MartBuilder to use coord_system as a partition table for
gene_ensembl based on the value of species_id. This all looks fine in
martbuilder, and the saved XML is:
<datasetPartitionTable name="coord_system" selectedColumns="species_id_107">
<partitionApplication name="gene_ensembl"
pCols="species_id_107" dsCols="1.7.0.specie\
s_id" relationIds="1955" nameCols="species_id_107" compounds="0"/>
</datasetPartitionTable>
However, when I try and build the mart (either directly with the
generated SQL or using martrunner), the first dataset is built
correctly, but all subsequence datasets fail with this kind of error
when building the transcript_main table:
Unknown column 'a.ox_GO_bool' in 'field list'
which comes from the following bit of SQL (in this case from the second
species):
create table multi_bacterial_mart_51.TEMP400 as select
a.ox_RefSeqDNA_bool,a.ox_GO_bool,a.ox_integr8_gene_type_1_bool,a.db_display_name_1018,
a.display_label_1074,a.analysis_id_1020,a.description_1020,a.status_1020,
a.seq_region_strand_1020,a.seq_region_id_1020,a.length_1059,a.coord_system_id_1059,
a.ox_EMBLDNA_bool,a.ox_integr8_name_1_bool,a.description_106,a.biotype_1020,
a.gene_id_1020_key,a.ox_BioCyc_bool,a.ox_BioCyc_1_bool,a.name_106,a.ox_Rfamncrna_bool,
a.name_1059,a.species_id_107,a.code_106,a.seq_region_end_1020,a.band_1027,
a.attrib_type_id_1060,a.value_1060,a.seq_region_start_1020,a.transcript_count,a.is_current_1020,
a.display_xref_id_1020,a.value_1022,a.ox_integr8_gene_type_bool,a.ox_ENTREZGENE_bool,
a.ox_GeneID_bool,a.stable_id_1023,a.ox_GeneID_1_bool,a.ox_integr8_name_bool,a.ox_GO_1_bool,a.source_1020
from multi_bacterial_mart_51.pyrococcus_gene_ensembl_37__gene__main as a;
Digging a bit further, it appears that the optimiser column ox_GO_1_bool
is added to pyrococcus_gene_ensembl_37__gene__main, instead of
ox_GO_bool (which is the name used for this column in the first
dataset). This SQL tries to read both ox_GO_bool and ox_GO_1_bool (and
this is true for all the object_xref optimisers). Looking at the third
and fourth species datasets, this statement also fails, but is trying to
look at ox_GO_bool, ox_GO_1_bool and ox_GO_2_bool (for the third) and
ox_GO_bool, ox_GO_1_bool, ox_GO_2_bool and ox_GO_3_bool (for the
fourth), so it seems that for some reason the SQL generated is referring
the old column names from the previous partitions.
Does anyone have any suggestions about why this is happening? Any advice
gratefully received...
Cheers,
Dan.
--
Dan Staines, PhD
EMBL-European Bioinformatics Institute Tel: +44-(0)1223-492507
Wellcome Trust Genome Campus, Hinxton Fax: +44-(0)1223-494468
Cambridge CB10 1SD, UK email: [EMAIL PROTECTED]