Hi, Dan, Could you send us a copy of the XML file for dataset partitioning. We will take a look at the issue with your XML file and see if a solution/workaround can be found.
Regards, Victor Gu Ontario Institute for Cancer Research [EMAIL PROTECTED] (416)673-8591 -----Original Message----- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Dan Staines Sent: Thursday, August 21, 2008 6:29 AM To: [email protected] Subject: [mart-dev] problem with dataset partitioning Hi chaps, I'm having some problems with partitioning my ensembl multi-species database into datasets on the basis of a column value indicating species in one of the tables in the dataset, and wondered if anyone could offer any advice as discussions with Syed and Andreas here at the EBI haven't come up with a solution. Following the advice of Richard after an earlier mail to mart-dev, I've used MartBuilder to use coord_system as a partition table for gene_ensembl based on the value of species_id. This all looks fine in martbuilder, and the saved XML is: <datasetPartitionTable name="coord_system" selectedColumns="species_id_107"> <partitionApplication name="gene_ensembl" pCols="species_id_107" dsCols="1.7.0.specie\ s_id" relationIds="1955" nameCols="species_id_107" compounds="0"/> </datasetPartitionTable> However, when I try and build the mart (either directly with the generated SQL or using martrunner), the first dataset is built correctly, but all subsequence datasets fail with this kind of error when building the transcript_main table: Unknown column 'a.ox_GO_bool' in 'field list' which comes from the following bit of SQL (in this case from the second species): create table multi_bacterial_mart_51.TEMP400 as select a.ox_RefSeqDNA_bool,a.ox_GO_bool,a.ox_integr8_gene_type_1_bool,a.db_display_name_1018, a.display_label_1074,a.analysis_id_1020,a.description_1020,a.status_1020, a.seq_region_strand_1020,a.seq_region_id_1020,a.length_1059,a.coord_system_id_1059, a.ox_EMBLDNA_bool,a.ox_integr8_name_1_bool,a.description_106,a.biotype_1020, a.gene_id_1020_key,a.ox_BioCyc_bool,a.ox_BioCyc_1_bool,a.name_106,a.ox_Rfamncrna_bool, a.name_1059,a.species_id_107,a.code_106,a.seq_region_end_1020,a.band_1027, a.attrib_type_id_1060,a.value_1060,a.seq_region_start_1020,a.transcript_count,a.is_current_1020, a.display_xref_id_1020,a.value_1022,a.ox_integr8_gene_type_bool,a.ox_ENTREZGENE_bool, a.ox_GeneID_bool,a.stable_id_1023,a.ox_GeneID_1_bool,a.ox_integr8_name_bool,a.ox_GO_1_bool,a.source_1020 from multi_bacterial_mart_51.pyrococcus_gene_ensembl_37__gene__main as a; Digging a bit further, it appears that the optimiser column ox_GO_1_bool is added to pyrococcus_gene_ensembl_37__gene__main, instead of ox_GO_bool (which is the name used for this column in the first dataset). This SQL tries to read both ox_GO_bool and ox_GO_1_bool (and this is true for all the object_xref optimisers). Looking at the third and fourth species datasets, this statement also fails, but is trying to look at ox_GO_bool, ox_GO_1_bool and ox_GO_2_bool (for the third) and ox_GO_bool, ox_GO_1_bool, ox_GO_2_bool and ox_GO_3_bool (for the fourth), so it seems that for some reason the SQL generated is referring the old column names from the previous partitions. Does anyone have any suggestions about why this is happening? Any advice gratefully received... Cheers, Dan. -- Dan Staines, PhD EMBL-European Bioinformatics Institute Tel: +44-(0)1223-492507 Wellcome Trust Genome Campus, Hinxton Fax: +44-(0)1223-494468 Cambridge CB10 1SD, UK email: [EMAIL PROTECTED]
