Hi Chris,

Glad to know that you are interested.

Chris Mungall wrote:

Hi Alistair

The UI is very nice!

I'm curious that you don't include any ontologies. The source datasets are quite ontology-centric (the Chado database in particular). The BDGP data includes annotation of each individual image with terms from fly_anatomy. This allows you to query for genes expressed in the brain (including its parts), or expressed in tissue derived from the neurectoderm for example.

A while back I created a D2RQ mapping of both the BDGP InSitu databases and Chado. See:

    http://www.bioontology.org/wiki/index.php/OBD:SPARQL-InSitu

Certainly we looked at your previous work. In fact, the very first version of our in-house BDGP SPARQL endpoint was based on it. We then decided to take an incremental approach, mapping only the minimal subset of the BDGP database needed by our application.

We would have liked to use the fly_anatomy ontology to define the gene expressions. However, when I had a brief look at the BDGP database, the accession numbers of the gene expression terms no longer seemed to be consistent with the latest fly_anatomy ontology [1]. For example, the term "dorsal compartment" is associated with "FBbt:00005876" in the BDGP database, while "FBbt:00005876" is named "dorsal central protocerebral neuroblast" in the fly_anatomy ontology. We would love to have more information from the BDGP team about this.

[1] http://www.obofoundry.org/cgi-bin/detail.cgi?id=fly_anatomy 
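
A mismatch of this kind can be checked mechanically. Below is a minimal Python sketch, assuming the database terms and the ontology terms have each been loaded into a dict mapping accession to name. The loading step is not shown, the function name `find_label_mismatches` is hypothetical, and the only data used is the single example quoted above.

```python
def find_label_mismatches(db_terms, ontology_terms):
    """Return accessions whose name differs between the two sources.

    Both arguments are dicts mapping accession -> term name. Names are
    compared case-insensitively after trimming whitespace. Accessions
    missing from the ontology are reported with None as the ontology name.
    """
    mismatches = {}
    for accession, db_name in db_terms.items():
        onto_name = ontology_terms.get(accession)
        if onto_name is None or db_name.strip().lower() != onto_name.strip().lower():
            mismatches[accession] = (db_name, onto_name)
    return mismatches


# The one example mentioned above; real inputs would come from the
# BDGP database and the fly_anatomy ontology file (loading not shown).
bdgp = {"FBbt:00005876": "dorsal compartment"}
fly_anatomy = {"FBbt:00005876": "dorsal central protocerebral neuroblast"}
print(find_label_mismatches(bdgp, fly_anatomy))
```

Running this over the full term table would give a list of accessions to raise with the BDGP team, rather than relying on spot checks.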

In your Chado mapping, you're really just extracting synonym information. Is there really a need to define a new ontology here, rather than using, say, SKOS? Do you have plans to map more of the schema? I'm particularly interested in the representation of genomic intervals, and scalable querying.

We used a very light ontology-centric approach because we wanted to impose as little interpretation as possible on the source data. As with the BDGP database, we extracted only the minimum set of information from the Chado schema. We are planning to extract some other fields from FlyBase in the near future, driven by the needs of our scientists. If you have any use cases relating to the "genomic intervals" information, we would be very interested to hear about them :).


You provide 3 SPARQL endpoints. It looks like you're doing the mashup in the UI. In many ways this is a traditional AJAX architecture, albeit with SPARQL endpoints rather than, say, a REST interface to a relational db. Did you find the triplestore/SPARQL route had particular advantages (or disadvantages)? What can you do that you can't do by simply going straight to the relational dbs?

One research goal of our project is to investigate to what extent existing Semantic Web technologies and tools can be used to support real use cases; hence the SPARQL endpoints over all the datasets. Along the way, we have come across many interesting questions to answer, such as scalability and identity mapping.

We appreciated the flexibility of the RDF data model, which allows us to impose an RDF view on the source relational data based on the needs of our application and to work with a very lightweight, unified data layer. Of course, mapping these interesting data resources into RDF also gives us the opportunity to link our data resources to others, making re-use and data integration possible.

In a future release, we are interested in investigating some lightweight reasoning, such as searching for gene expression images using the fly anatomy ontology, once we sort out some detailed ontology mapping problems.
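
As a rough illustration of the kind of lightweight reasoning meant here, the Python sketch below expands an anatomy term to all terms below it and then filters image annotations by the expanded set. Everything in it is an assumption for illustration: the `EX:` identifiers, the `parents` table (child to parents, with is_a and part_of edges lumped together) and the image annotations are made up and are not taken from fly_anatomy or BDGP.

```python
from collections import defaultdict, deque


def descendants(term, parents):
    """Return `term` plus every term reachable downwards from it.

    `parents` maps child -> iterable of parent terms (is_a / part_of
    edges lumped together for this sketch).
    """
    # Invert the child -> parents table into parent -> children.
    children = defaultdict(set)
    for child, ps in parents.items():
        for p in ps:
            children[p].add(child)
    # Breadth-first walk down the hierarchy.
    seen = {term}
    queue = deque([term])
    while queue:
        current = queue.popleft()
        for child in children[current]:
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


def images_for(term, parents, annotations):
    """Return sorted image ids annotated with `term` or anything below it."""
    wanted = descendants(term, parents)
    return sorted(img for img, terms in annotations.items() if wanted & set(terms))


# Made-up hierarchy and annotations, for illustration only.
parents = {
    "EX:brain_part": ["EX:brain"],
    "EX:brain": ["EX:head"],
    "EX:gut": ["EX:abdomen"],
}
annotations = {
    "img1": ["EX:brain_part"],
    "img2": ["EX:gut"],
    "img3": ["EX:head"],
}
print(images_for("EX:brain", parents, annotations))  # ['img1']
```

The expansion happens on the client here; the same effect could in principle be had by materialising the closure in the store, which is one of the trade-offs we would want to look at.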


I'm not sure why you needed to write your own SPARQL protocol on top of Jena. Isn't this what Joseki does?

The main reason was that, at the time of our experiment, Joseki did not handle multiple concurrent connections to the underlying RDF store. You can find more details at: http://code.google.com/p/sparqlite/wiki/HomePage :)

Interested to see future developments

Jun

Cheers
Chris

On Nov 5, 2008, at 8:44 AM, Alistair Miles wrote:


Dear all,

This is a summary of work so far by the FlyWeb Project team. We're
exploring integration of life science data in support of Drosophila
(fruit fly) functional genomics. We'd like to develop credible, robust
and genuinely useful tools for the Drosophila research community; and
to provide data and services of value to bioinformaticians and
Semantic Web / Life Science developers.

This is the first time we've announced our work more widely, and we'd
very much appreciate thoughts, suggestions, feedback, re-use and
testing of the applications, services, software and data described
below. Please note however that this is work in progress, and things
may break, change, move or disappear without notice.


= Search Applications =

http://openflydata.org/search/insitus

This application allows you to search for images of in situ RNA
hybridisation experiments, depicting expression of specific genes in
different organs (testes and embryos). It is a mashup of data from the
Berkeley Drosophila Genome Project (BDGP) and the Drosophila Testis
Gene Expression Database (Fly-TED). It also uses data from FlyBase to
disambiguate gene name synonyms.

It's a pure AJAX application using SPARQL to access data from each of
the three sources on the fly (pardon the pun :).


= RDF Data =

The following RDF data used in the search application above are
available for bulk download:

* http://openflydata.org/dump/flybase (latest)
 http://openflydata.org/dump/flybase_genenames_20081017 (snapshot)

 data on D. melanogaster gene identifiers, symbols and synonyms,
 derived from flybase.org; approx 8 million triples; gzipped
 n-triples

* http://openflydata.org/dump/bdgp (latest)
 http://openflydata.org/dump/bdgp_images_20081030 (snapshot)

 metadata on images of embryo in situ gene expression experiments,
 derived from fruitfly.org; approx 1 million triples; gzipped
 n-triples

* http://openflydata.org/dump/flyted (latest)
 http://openflydata.org/dump/flyted_20080626 (snapshot)

 metadata on images of testis in situ gene expression experiments,
 derived from www.fly-ted.org; approx 30,000 triples; gzipped turtle
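
For anyone picking up the n-triples dumps, here is a minimal Python sketch for streaming a gzipped file and counting predicates without loading it all into memory. It assumes one triple per line and does only the crudest splitting, so it is no substitute for a real N-Triples parser; the function names are hypothetical and any local file path passed in is up to the caller.

```python
import gzip
from collections import Counter


def count_predicates(lines):
    """Count predicate IRIs in an iterable of N-Triples lines.

    Blank lines and comment lines are skipped. Subjects and predicates
    never contain whitespace in N-Triples, so splitting twice on
    whitespace is enough to isolate the predicate.
    """
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 2)
        if len(parts) < 3:
            continue  # malformed line, ignored in this sketch
        counts[parts[1]] += 1
    return counts


def count_predicates_in_dump(path):
    """Stream a gzipped N-Triples file from disk, line by line."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return count_predicates(handle)
```

Calling `count_predicates_in_dump` on a downloaded dump and printing `most_common(10)` on the result gives a quick overview of which properties the data actually uses.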


= Data Services =

The following SPARQL endpoints are available for queries over the
above data. See also limitations below.

* http://openflydata.org/query/flybase (latest)
 http://openflydata.org/query/flybase_genenames_20081017 (snapshot)

* http://openflydata.org/query/bdgp (latest)
 http://openflydata.org/query/bdgp_images_20081030 (snapshot)

* http://openflydata.org/query/flyted (latest)
 http://openflydata.org/query/flyted_20080626 (snapshot)

Limitations: only GET requests are supported; only SELECT and ASK
queries are supported; only JSON results format is supported (request
must specify output=json); SELECT queries are limited to max 500
results; no more than 5 requests per second from any one origin
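
To make these limitations concrete, here is a minimal Python sketch of a client that stays within them. It assumes the standard SPARQL protocol `query` parameter alongside the `output=json` parameter mentioned above; the helper names, the example query and the client-side throttle are illustrative and are not part of SPARQLite.

```python
import json
import time
import urllib.parse
import urllib.request

MAX_RESULTS = 500          # server-side cap on SELECT results
MIN_INTERVAL = 1.0 / 5     # no more than 5 requests per second


def build_query_url(endpoint, sparql):
    """Build a GET URL carrying the query and asking for JSON results."""
    params = urllib.parse.urlencode({"query": sparql, "output": "json"})
    return endpoint + "?" + params


class ThrottledClient:
    """Issues GET requests, sleeping as needed to stay under the rate limit."""

    def __init__(self, endpoint, min_interval=MIN_INTERVAL):
        self.endpoint = endpoint
        self.min_interval = min_interval
        self._last_request = None

    def select(self, sparql):
        # Wait until at least min_interval has passed since the last request.
        if self._last_request is not None:
            wait = self.min_interval - (time.monotonic() - self._last_request)
            if wait > 0:
                time.sleep(wait)
        self._last_request = time.monotonic()
        url = build_query_url(self.endpoint, sparql)
        with urllib.request.urlopen(url) as resp:
            return json.load(resp)


# Example query that keeps its LIMIT within the server-side cap.
QUERY = "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT %d" % MAX_RESULTS
print(build_query_url("http://openflydata.org/query/flybase", QUERY))
```

The throttle is per client instance only, so several processes behind one origin would still need to coordinate among themselves.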

The endpoints are implemented using our own Java SPARQL protocol
implementation (SPARQLite, see below) backed by Jena TDB 0.6
stores. The endpoints run inside Tomcat 5.5 behind Apache 2.2 via
mod_jk, on a small EC2 instance, with TDB storing data on an attached
EBS volume.


= Software Downloads & Source Code =

* FlyUI
 http://flyui.googlecode.com

This is a library of composable JavaScript widgets, providing a
user interface to the above data. These widgets are used to build the
search application above. FlyUI is built on Yahoo's JavaScript user
interface library (YUI).

* SPARQLite
 http://sparqlite.googlecode.com

This is an experimental and incomplete implementation of the SPARQL
protocol, designed to work with Jena TDB or SDB stores. We're using
this as a platform to explore a number of quality of service issues
that SPARQL raises.


= Ontologies/Schemas =

The following OWL schemas are used in the above data:

* CHADO OWL Schema
 http://purl.org/net/chado/schema/

This is an OWL representation of a subset of the CHADO relational
schema used by FlyBase (see http://gmod.org/wiki/Schema).

* FlyBase OWL Synonym Types
 http://purl.org/net/flybase/synonym-types/

This is a micro-ontology, representing the FlyBase synonym type
vocabulary.

* BDGP OWL Schema
 http://purl.org/net/bdgp/schema/

This is an OWL representation of a subset of the BDGP relational
schema.

* FlyTED OWL Schemas

These are under revision, to be published shortly.


= RDF Data Conversion Utilities =

The following utilities were developed to obtain the RDF data
described above:

* CHADO/FlyBase D2RQ Map
http://code.google.com/p/openflydata/source/browse/trunk/flybase/genenames/d2r-flybase-genenames.ttl

This provides a mapping from the CHADO/FlyBase relational schema to
the CHADO/FlyBase OWL ontologies, for basic D. melanogaster gene
(feature) data (identifiers, symbols, synonyms, species).

* BDGP D2RQ Map
http://code.google.com/p/openflydata/source/browse/trunk/bdgp/imagemapping/d2r-bdgp-insituimages.ttl

This maps the BDGP relational schema to OWL/RDF.

See also: http://openflydata.googlecode.com


= Future Developments =

We're currently working on improving the user interface to the BDGP
data (grouping and ordering images by developmental stage) and on
integrating expression level data from FlyAtlas.

Other suggestions for future developments are warmly welcomed.


= Acknowledgments =

Thanks especially to Helen White-Cooper and Andy Seaborne for all
their help.

The FlyWeb Project is funded by the UK Joint Information Systems
Committee (JISC).


= Further Information =

The FlyWeb project website is at:

http://imageweb.zoo.ox.ac.uk/wiki/index.php/FlyWeb_project

Graham will be presenting this work at the UK SWIG meeting next week.

Or send us an email :)

Kind regards,

Alistair Miles
Jun Zhao
Graham Klyne
David Shotton


--
Alistair Miles
Senior Computing Officer
Image Bioinformatics Research Group
Department of Zoology
The Tinbergen Building
University of Oxford
South Parks Road
Oxford
OX1 3PS
United Kingdom
Web: http://purl.org/net/aliman
Email: [EMAIL PROTECTED]
Tel: +44 (0)1865 281993







