Hi Arek,

Those future developments sound interesting.  Thanks for the info.


Instead of the DISTINCT, what about a simple hash-based filter implemented
just before outputting the data, e.g.

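use Digest::MD5 qw(md5_hex);

# @rows_to_output holds the result rows as plain text strings.
# Print each row once, keyed on its MD5 digest rather than the full text.
my %seen;
foreach my $row (@rows_to_output) {
  print "$row\n" unless $seen{md5_hex($row)}++;
}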

In this example $row has to be a text string (not an array ref), of course.

Then again, "sort -u" on a text mart export does almost the same.
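
A rough sketch of the "sort -u" route, under some assumptions: the export is
a plain text file with one header line (as requested by Header = "1"), and
the file name mart_export.txt and both function names are hypothetical. The
first function sorts the data rows, so their order changes; the second uses
awk in the same "seen" style as the Perl filter and keeps the original order.

```shell
# Assumption: the export is a text file whose first line is a header.
# The file name in the usage line at the bottom is hypothetical.

# Print the header, then the data rows sorted with duplicates removed.
dedupe_sorted() {
  head -n 1 "$1"
  tail -n +2 "$1" | sort -u
}

# Print each distinct line once, keeping the original row order
# (the header is simply the first distinct line).
dedupe_in_order() {
  awk '!seen[$0]++' "$1"
}

# usage: dedupe_sorted mart_export.txt > mart_export_unique.txt
```

Running plain "sort -u" on the whole file would sort the header line in with
the data, which is why the header is split off first.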

cheers,
Bob.

Arek Kasprzyk writes:
 > 
 > On 26 Jan 2007, at 15:12, Bob MacCallum wrote:
 > 
 > >
 > > Hi,
 > >
 > > While we're talking about the results section: I've wondered if a "unique
 > > records only" option could be provided. To the average biologist user, the
 > > following query brings back duplicated genes (because the PFAM domains are
 > > features of transcripts, which I have deselected from the output
 > > attributes).
 > >
 > >
 > > <Query  virtualSchemaName = "default" Header = "1" count = "" softwareVersion = "0.5" >
 > >                    
 > >            <Dataset name = "hsapiens_gene_ensembl" interface = "default" >
 > >                    <Attribute name = "ensembl_gene_id" />
 > >                    <Filter name = "pfam" value = "PF00169"/>
 > >            </Dataset>
 > > </Query>
 > >
 > >
 > > However, we can leave the default gene + transcript attributes and instead
 > > provide two PFAM ids (that I know are sometimes in the same protein).  Then
 > > the results again contain some duplicate records (although adding the PFAM id
 > > output attribute would fix this of course).
 > >
 > >
 > > <Query  virtualSchemaName = "default" Header = "1" count = "" softwareVersion = "0.5" >
 > >                    
 > >            <Dataset name = "hsapiens_gene_ensembl" interface = "default" >
 > >                    <Attribute name = "ensembl_gene_id" />
 > >                    <Attribute name = "ensembl_transcript_id" />
 > >                    <Filter name = "pfam" value = "PF00169,PF00017"/>
 > >            </Dataset>
 > > </Query>
 > >
 > > snippet:
 > > ENSG00000102010 ENST00000342014
 > > ENSG00000102010 ENST00000342014
 > > ENSG00000102010 ENST00000348343
 > > ENSG00000102010 ENST00000348343
 > > ENSG00000102010 ENST00000357607
 > > ENSG00000102010 ENST00000357607
 > > ENSG00000102010 ENST00000380391
 > > ENSG00000102010 ENST00000380391
 > >
 > >
 > > I note that the gene count ("count" button) is always correct however.
 > >
 > >
 > > What do people think?
 > >
 > > cheers,
 > > Bob.
 > >
 > 
 > Hi Bob,
 > yes, this request has come up several times and finally we need to give in :)
 > 
 > As you correctly pointed out, the transcript rather than gene level annotation
 > is a feature of Ensembl data, and as Ensembl is likely to stick with this for
 > the foreseeable future, we will have to add a 'fix' on our side. Unfortunately
 > this cannot be as simple as 'distinct', as this can have grave consequences
 > for performance on large datasets, in particular with certain combinations
 > of filters. We will, however, be able to provide an alteration to the mart
 > structure such that it will artificially provide such annotation at a higher
 > level, which will be an equivalent of 'distinct' but without a performance
 > hit. We are now in the process of implementing this in MBuilder, so future
 > Ensembl mart releases should have this 'fix'. This will of course work with
 > other 'non-Ensembl' data as well.
 > 
 > 
 > a.
 > 
 > 
 > 
 > >
 > >
 > > Arek Kasprzyk writes:
 > >>
 > >> On 26 Jan 2007, at 14:43, David Croft wrote:
 > >>
 > >>> Hi Arek,
 > >>>
 > >>>> Sounds like a good suggestion, we can consider that. At the moment
 > >>>> you can only ask for first 10, 20, 50 .... 200 or the whole lot but not
 > >>>> the pagination that (I think) you have in mind
 > >>>
 > >>> Yes, that's right - similar to what you get when you go to Google.
 > >>> It would be kind of cool if the page displayed the total count of
 > >>> results and told you that you are on page 3 of 28 (or whatever)
 > >>> and gave you buttons to go back a page or forward a page.
 > >>>
 > >>
 > >> not sure about this :) we could certainly add 'next' 'back' or equivalents
 > >> but the total count would be a bit more problematic. We do not have
 > >> all the results during preview yet and the total count tends to be
 > >> expensive so we do not do it as default. Let us try to think about at
 > >> least some of it
 > >>
 > >> a.
 > >>
 > >>
 > >>> Cheers,
 > >>>
 > >>> David.
 > >>>
 > >>>
 > >>
 > >>
 > >> -----------------------------------------------------------------------------
 > >> Arek Kasprzyk
 > >> EMBL-European Bioinformatics Institute.
 > >> Wellcome Trust Genome Campus, Hinxton,
 > >> Cambridge CB10 1SD, UK.
 > >> Tel: +44-(0)1223-494606
 > >> Fax: +44-(0)1223-494468
 > >> -----------------------------------------------------------------------------
 > >>
 > >>
 > >>
 > >
 > > -- 
 > > Bob MacCallum | VectorBase Developer | Kafatos/Christophides Groups |
 > > Division of Cell and Molecular Biology | Imperial College London |
 > > Phone +442075941945 | Email [EMAIL PROTECTED]
 > >
 > 
 > 
 > 
 > 
 > 

-- 
Bob MacCallum | VectorBase Developer | Kafatos/Christophides Groups |
Division of Cell and Molecular Biology | Imperial College London |
Phone +442075941945 | Email [EMAIL PROTECTED]
