Re: bioGUID

Roderic Page Fri, 30 Mar 2007 03:03:01 -0800


Dear Matt,


I was wondering what the rules are for creating the actual identifier
that bioGUID would end up using to reference this database record.


The rules, such as they are, are in my previous post.


I'm not sure. There seem to be various adventures of GO in RDF around
the place. I think GONG is worth looking at (http://gong.man.ac.uk/).
But I think I misunderstand something: why do these other resources
need a format that can be converted to RDF? I thought you were
interested in just the links between records in different databases
and were providing an RDF layer over this, not actually trying to
represent all the content inside these records as well.

No, I want content as well. For example, I want bibliographic details for a paper, I want latitudes and longitudes for voucher specimens, etc.

So there are some assumptions you make about the meaning of a link in a
record? For example how would you handle the link that
http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?db=protein&val=4503913

has to http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?db=protein&val=7669492

which is given in the XML form of the record as:

 <GBSeq_comment>[WARNING] On Apr 28, 2000 this sequence was replaced
by gi:7669492.; PROVISIONAL RefSeq: This is a provisional reference
sequence record that has not yet been subject to human review. The
final curated reference sequence record may be somewhat different from
this one.</GBSeq_comment>

and the intended interpretation is 'superseding'? Would I be right in
thinking your bioGUID database is able to parse records from entrez
and interpret the meaning of the links so as to write out a useful
description of the link? Or would you be looking to entrez to supply
this information in a more informative (at the machine level)
rdf/rdf-s based form?

Gack. I haven't looked at records like this. So far I look at GenBank records and extract the obvious links (say to PubMed and NCBI taxonomy). I also look at the reference records to see if I can extract enough information to do a DOI lookup for the publication (not all GenBank sequences are linked to publications in PubMed). I spend a lot of time trying to interpret the mess that is the information on the voucher specimen, such as parsing the "isolate", "specimen_voucher", and "lat_long" records, trying to see whether there is a link to a specimen that has an online representation (a good number of voucher specimens in museums have digital records I can access). I've still to deal with host association (e.g., I want to have a link to the host of a parasite).
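
As an illustration of the "lat_long" parsing mentioned above, here is a minimal sketch in Python. It assumes the value is already in the "36.89 N 75.49 W" style (decimal degrees followed by a hemisphere letter); real records are often messier, and anything that does not fit is simply rejected. The function name parse_lat_long is hypothetical, not part of bioGUID.

```python
import re

# Assumed input form: "<degrees> N|S <degrees> E|W", e.g. "36.89 N 75.49 W".
LAT_LONG = re.compile(
    r"^\s*(\d+(?:\.\d+)?)\s*([NS])\s*,?\s*(\d+(?:\.\d+)?)\s*([EW])\s*$",
    re.IGNORECASE,
)


def parse_lat_long(value):
    """Return (latitude, longitude) in signed decimal degrees, or None."""
    match = LAT_LONG.match(value)
    if match is None:
        return None  # free text, degrees/minutes/seconds, etc. are not handled
    lat, ns, lon, ew = match.groups()
    latitude = float(lat) * (-1 if ns.upper() == "S" else 1)
    longitude = float(lon) * (-1 if ew.upper() == "W" else 1)
    # Reject values outside the valid coordinate ranges.
    if abs(latitude) > 90 or abs(longitude) > 180:
        return None
    return latitude, longitude
```

Southern latitudes and western longitudes come back negative, so the result can be stored directly as a pair of decimal coordinates.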

I haven't looked at the other links yet, such as links between proteins and nucleotides, or the kinds of things you mentioned above.
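
For the "replaced by" comment quoted above, one possible approach is a plain pattern match on the comment text. The sketch below is based only on the wording of that single GBSeq_comment; other records may phrase the warning differently, so the pattern is an assumption, and superseded_by is a hypothetical helper name.

```python
import re

# Pattern taken from the one comment quoted in this thread; \s+ also
# tolerates the line break between "replaced" and "by".
REPLACED_BY = re.compile(
    r"this sequence was replaced\s+by\s+gi:(\d+)", re.IGNORECASE
)


def superseded_by(comment):
    """Return the replacing gi number as a string, or None if not found."""
    match = REPLACED_BY.search(comment)
    return match.group(1) if match else None
```

Applied to the comment above, this yields "7669492", which could then be written out as a 'superseding' link between the two records.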


So this would mean that you wouldn't make any decisions yourself that
one bioGUID record should reference another because you or someone
else thinks it should, it solely relies on parsing these data sources
and extracting 'database links'.

Not quite; I add links where possible. For example, http://bioguid.info/gi:90184449 has a link to doi:10.1206/0003-0090(2006)297[0001:TATOL]2.0.CO;2, which isn't in the original GenBank record (http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?db=nucleotide&val=90184449). Likewise, the specimen links are added.


You would want the person to provide you with a set of generated
bioGUIDs that each resolve and return relevant RDF which include the
cross references to other bioGUIDs (i.e. they have already been worked
out by the provider)?

That would be nice, but for all the providers I care about, I've had to do a lot of fussing from scratch.


Each database tends to have its own API for inspecting records. Often
it is necessary to follow a chain of records from different databases
to end up with a record in the database you want. For example: you
have a sequence id (protein_gi) for an orthologue and you wanted to
retrieve an enzyme record, you can look the protein_gi number up
in the entrez database, locate the record, and follow the link to the
expasy enzyme record or perhaps directly query the KEGG db. All of
these steps employ database specific APIs aware of the data format. I
was merely suggesting your RDF graph would merge the linking into a
common format which is nicer to handle.

I sort of envisaged that people would follow the chain themselves, and store the results locally. I would provide neighbours for each GUID, but would not follow the graph (this could potentially explode).
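
To make the "follow the chain themselves" idea concrete, here is a minimal client-side sketch in Python. It assumes only that there is some way to ask for the neighbours of one GUID; the neighbours callable and the function name follow_links are hypothetical, not an existing bioGUID interface. A depth limit keeps the walk from exploding.

```python
from collections import deque


def follow_links(start, neighbours, max_depth=2):
    """Breadth-first walk from start, at most max_depth hops away.

    neighbours is any callable returning the identifiers directly
    linked to a given identifier. Returns {identifier: hop count}.
    """
    seen = {start: 0}
    queue = deque([start])
    while queue:
        guid = queue.popleft()
        depth = seen[guid]
        if depth == max_depth:
            continue  # do not expand beyond the depth limit
        for other in neighbours(guid):
            if other not in seen:  # skip records already visited
                seen[other] = depth + 1
                queue.append(other)
    return seen
```

The returned dictionary is what a client would store locally: every identifier reached, with the number of links followed to get there.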



Regards

Rod

----------------------------------------
Professor Roderic D. M. Page
Editor, Systematic Biology
DEEB, IBLS
Graham Kerr Building
University of Glasgow
Glasgow G12 8QP
United Kingdom

Phone: +44 141 330 4778
Fax: +44 141 330 2792
email: [EMAIL PROTECTED]
web: http://taxonomy.zoology.gla.ac.uk/rod/rod.html
iChat: aim://rodpage1962
reprints: http://taxonomy.zoology.gla.ac.uk/rod/pubs.html

Subscribe to Systematic Biology through the Society of Systematic
Biologists Website: http://systematicbiology.org
Search for taxon names: http://darwin.zoology.gla.ac.uk/~rpage/portal/
Find out what we know about a species: http://ispecies.org
Rod's rants on phyloinformatics: http://iphylo.blogspot.com
Rod's rants on ants: http://semant.blogspot.com
