[HCLS] Challenges and goals for the HCLS Semantic Web community in the next years

Matthias Samwald Wed, 07 Nov 2007 09:14:26 -0800

As I will not be able to attend the F2F meeting in Boston this week, and the teleconference connection will probably not work (as teleconferences usually do), I have collected my thoughts on several aspects of our Semantic Web developments in the following text.


Topics:
GENERAL APPROACH AND DESIGN PHILOSOPHY
WEB USER INTERFACES
ONTOLOGICAL FOUNDATIONS
DOMAIN ONTOLOGIES
LIFE SCIENCE AND HEALTH CARE SPECIFIC TOPICS

MAKING THE SEMANTIC WEB GROW TOGETHER: IDENTIFIERS, FINDING RESOURCES, TRUST

COMMERCIALISATION STRATEGIES

--------------

GENERAL APPROACH AND DESIGN PHILOSOPHY

** Small incremental steps and legacy support VS. radically new approaches **

I think the community should become less reluctant to apply Semantic Web technologies in radically new ways. For example, instead of describing digital resources which themselves describe entities of interest (such as database records in Uniprot describing proteins), we should focus on describing those entities of interest directly -- without taking a detour through describing database entries and other artifacts of the pre-Semantic Web era.

Of course, there are cases where such 'legacy support' is needed for pragmatic reasons, but I think that in the majority of cases there is no practical advantage at all.

RDF/OWL is not only a syntactically more flexible alternative to current database systems; it enables a whole new philosophy of how information can be organized. If we want to demonstrate the real advantages of the Semantic Web, we need to be bold enough to break with current habits.

** Focus solely on technical aspects VS. focus also on institutional / sociological / legal context **

Many of the ideas inside our community cannot be realized if we focus solely on the technical aspects of Semantic Web technologies. We want to bring about significant change in the HCLS community, e.g., widespread use of structured digital abstracts or better communication between bench and bedside. Some of this work actually has nothing to do with Semantic Web technologies and is therefore outside the scope of the W3C HCLS interest group, so we might need to find other platforms to organize these things. I guess Science Commons (http://sciencecommons.org/) might become even more important for our work than it already is.


--------------

WEB USER INTERFACES

** Flexible but unergonomic VS. inflexible but user friendly **

This is a choice we are facing with any kind of user interface for a Semantic Web application. RDF/OWL is so flexible that it is very hard to create user interfaces to display arbitrary information in an appealing way. Many of the current RDF browsers produce lists of entities and relations that look raw and uninviting. This can be remedied by creating user interfaces that are specialized for certain domains, as we did with 'Entrez Neuron' (current prototype at http://gouda.med.yale.edu:8087/).

Striking a balance between user friendliness and flexibility will be one of the most difficult problems we are facing in the development of GUIs.


** User interface ideas that should receive more attention **

- Autocomplete fields / interfaces that motivate re-use of existing entities. Example: the newly started Okkam project (http://www.okkam.org/) is building an extension for Protégé to allow users to find existing entities that they can re-use. The Sindice project (http://sindice.com) provides a fast and scalable index of Semantic Web resources through a simple web API.

- Open query builders with a social component. The Leipzig DBpedia query interface (http://wikipedia.aksw.org/) is a nice prototype. More knowledgeable users can create queries from scratch and share them with others; less knowledgeable users can pick existing queries and make some minor modifications for their needs. Such a system could make use of social dynamics, e.g., rating of queries to rank the most useful ones first, or profiling of user interests to suggest those queries that cater to the needs of specific user groups. The Leipzig query interface also demonstrates the usefulness of the auto-complete feature.

- Semantic Web Pipes / modularized RDF/OWL data flows. Such systems could be inspired by Yahoo Pipes, and could also have a social component. A prototype of such a system is http://pipes.deri.org/

- Interfaces resembling text editors. Such interfaces could enable a much faster way of creating and querying RDF/OWL compared to current ontology editors. Of course, they need to offer the user assistance in the form of auto-completion, type checking, text formatting etc. I made a prototype of such an interface at: http://neuroscientific.net/leeet/

- Semantic Wikis based on RDF/OWL triple stores. The best example at the moment is OntoWiki (http://ontowiki.net/). Such dedicated Wiki systems should be distinguished from systems that merely add a thin layer of RDF on top of a normal, text based Wiki system (like Semantic MediaWiki). The latter are not suitable to support the creation of large, consistent RDF/OWL knowledge bases, in my opinion.

- Spreadsheets. Spreadsheets are very common tools for data entry / organization in science. Making an elegant and meaningful mapping between spreadsheets and biomedical domain ontologies possible would be an important goal. Again, the goal should not be to describe the structure and content of the spreadsheet in RDF, but rather to describe biomedical reality as directly as possible. http://rdf123.umbc.edu/ seems to be an interesting project in this area.
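
To make the spreadsheet point a bit more concrete, here is a minimal sketch in Python. It is not how RDF123 or any other tool works; the namespace, the column names, the property names and the two example rows are all made up for illustration. The only point is that each row is turned into statements about the neuron itself, not into statements about "row 2, column 3".

```python
import csv
import io

# Hypothetical namespace, properties and column names (illustration only).
EX = "http://example.org/bio/"

# A tiny spreadsheet, inlined as CSV text so the sketch is self-contained.
SHEET = """neuron,receptor,brain_region
Purkinje cell,GABA-A receptor,cerebellum
pyramidal neuron,NMDA receptor,hippocampus
"""


def slug(label):
    """Turn a cell label into a simple local name for a URI."""
    return label.strip().replace(" ", "_")


def rows_to_triples(text):
    """Map each row to statements about the neuron, not about the row."""
    triples = []
    for row in csv.DictReader(io.StringIO(text)):
        neuron = EX + slug(row["neuron"])
        triples.append((neuron, EX + "has_part", EX + slug(row["receptor"])))
        triples.append((neuron, EX + "located_in", EX + slug(row["brain_region"])))
    return triples


triples = rows_to_triples(SHEET)
for s, p, o in triples:
    # Print in an N-Triples-like form.
    print(f"<{s}> <{p}> <{o}> .")
```

Note that no triple mentions the spreadsheet, its rows or its columns; the table layout is only used to decide which statements to generate.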

** User interface ideas that turned out to be impractical and should receive less attention **

- Graphs in almost any form and size.

- Emulating the interface of an ontology editor like Protégé inside the web browser.


--------------

ONTOLOGICAL FOUNDATIONS

** Heterogeneity reduction: unrestricted but heterogeneous VS. restricted but homogeneous (using foundational ontology) **

If we look at the RDF/OWL datasets that are currently part of the 'HCLS demo', we can see that their structures are quite heterogeneous. Every data source is structured in its own way, so that someone writing a query spanning several data sources needs a deep understanding of each data source to make it work.


** Granularity dependent VS. granularity independent **

Granularity-dependent ontologies (such as BFO) force us to index each ontology to a certain granularity (like 'atom', 'molecule', 'cell', 'organism'). Things that are classified as an 'object' in one granularity are classified as an 'aggregate' in another granularity, placing them in disjoint class hierarchies and thereby making the integration across scales more difficult. Since such an integration across scales is probably one of our major targets, we may want to explore the advantages and disadvantages of ontologies that are granularity independent.


** Dealing with time: 3D VS. widespread reification of relations VS. 4D **

The representation of time (or rather, the change of relations between entities over time) has received relatively little attention so far. Many ontologies we are currently using -- including those based on BFO -- are based on the '3D' perspective: physical objects (e.g. proteins, persons) persist in time and do not have temporal parts. This causes problems when we are dealing with change over time, e.g., when we want to make the statement 'Eve - has hair colour - brown' at one point in time and 'Eve - has hair colour - grey' at another time. I can give examples from the HCLS domain if required.

The only way to deal with this in many of our current ontologies would be to index each ontology to a certain time. Eve would have brown hair in one ontology and grey hair in another ontology. However, at the moment it is still quite undecided how such indexing would be practically implemented in RDF/OWL, and how many problems such indexing would cause for our goal of easy and widespread information integration. It is possible that our current ontologies lead us down a road where we will encounter a lot of trouble when we finally need to take care of time.

Therefore, we should explore how temporal changes can be represented without some obscure mechanism of ontology indexing.

One possibility would be to reify most of the relations between entities and to attach a temporal index to each relation. However, this would add a lot of unnecessary complexity in cases where we actually do not care about temporal aspects.

Another possibility (favored by me at the moment) would be to build 4D ontologies where physical objects can have temporal parts. For example, we could say that 'Eve at age 20' and 'Eve at age 60' are two temporal parts of Eve. The great advantage of this approach is that it keeps our ontologies simple when we do not want to care about temporal aspects. For example, we can simply say 'Eve has hair colour brown' now. When 40 years have passed, and we discover that Eve's hair has turned grey, we can refine our description of Eve by saying that the first Eve we described was merely one temporal part of her, and that there is another temporal part of Eve with grey hair.
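
The 4D refinement step can be sketched in a few lines of Python, with triples as plain tuples. This is only an illustration under my own assumptions: the namespace, the property names (has_hair_colour, temporal_part_of) and the URIs for Eve are hypothetical and not taken from any existing ontology. The sketch shows that the refinement only adds statements; nothing that was said earlier has to be retracted.

```python
# Hypothetical namespace and property names (illustration only).
EX = "http://example.org/"
HAIR = EX + "has_hair_colour"
PART_OF = EX + "temporal_part_of"

# Today we do not care about time, so we state the fact directly.
graph = {(EX + "Eve", HAIR, EX + "brown")}


def refine(graph, first, whole, later_parts):
    """Declare `first` to be a temporal part of `whole` and add later parts.

    later_parts maps the URI of another temporal part to its hair colour.
    The input graph is not modified; a new, larger set is returned.
    """
    refined = set(graph)
    refined.add((first, PART_OF, whole))
    for part, colour in later_parts.items():
        refined.add((part, PART_OF, whole))
        refined.add((part, HAIR, colour))
    return refined


# Forty years later: the Eve we described was one temporal part of her.
refined = refine(
    graph,
    first=EX + "Eve",
    whole=EX + "Eve_whole_life",
    later_parts={EX + "Eve_at_60": EX + "grey"},
)

for triple in sorted(refined):
    print(triple)
```

With the reification approach, by contrast, even the very first statement would already have needed its own node plus a temporal index.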


--------------

DOMAIN ONTOLOGIES

** The role of human readable text inside datatype properties **

Google demonstrates that querying unstructured documents might not be perfect, but it can often provide a very quick and intuitive mechanism for finding information. I have the feeling that the Semantic Web community is sometimes so focused on providing structured data/metadata that we forget that unstructured information kept inside datatype properties is a useful target for mining/querying as well.

Finding the right balance between explicit information in RDF triples and implicit information inside the values of datatype properties could turn out to be quite important.


** Class vs. instance/individual **

One should be aware that the distinction between class and individual is not an arbitrary syntactic choice. It should also not be confused with the use of 'class' and 'instance' in object oriented programming, or the distinction between 'schema' and 'data' in database systems.

In almost all ontologies, individuals are things that are located at a certain space and time. In most of our projects, we do not want to make statements about a certain serotonin receptor protein we saw swimming in our Petri dish; rather, we want to be able to make general statements about certain classes of serotonin receptor proteins, which can be shared with and further refined by other participants of the HCLS community.

One problem we have encountered with the extensive use of classes in some ontologies of the HCLS demo was that the underlying RDF graphs became very complicated. This is caused by the representation of OWL class property restrictions in RDF. We should explore ways to lessen this problem, e.g., by creating simpler RDF representations for some OWL constructs.
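
To illustrate why class-level statements inflate the RDF graph, the following Python sketch compares one statement about an individual with the standard RDF encoding of an existential restriction (rdfs:subClassOf pointing to a blank node typed as owl:Restriction, with owl:onProperty and owl:someValuesFrom). The RDF, RDFS and OWL URIs are the real vocabulary terms; the example classes, the property and the blank node label are hypothetical.

```python
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RDFS = "http://www.w3.org/2000/01/rdf-schema#"
OWL = "http://www.w3.org/2002/07/owl#"
EX = "http://example.org/"  # hypothetical namespace for the example terms


def individual_statement(subject, prop, obj):
    """A statement about one particular thing: a single triple."""
    return [(subject, prop, obj)]


def existential_restriction(cls, prop, filler, bnode="_:r1"):
    """'Every cls has some prop relation to a filler': four triples."""
    return [
        (cls, RDFS + "subClassOf", bnode),
        (bnode, RDF + "type", OWL + "Restriction"),
        (bnode, OWL + "onProperty", prop),
        (bnode, OWL + "someValuesFrom", filler),
    ]


one = individual_statement(
    EX + "receptor_42", EX + "located_in", EX + "membrane_7"
)
four = existential_restriction(
    EX + "SerotoninReceptor", EX + "located_in", EX + "CellMembrane"
)

print(len(one), "triple for the individual-level statement")
print(len(four), "triples for the class-level statement")
```

A query over the class-level data has to traverse the blank node for every such statement, which is where most of the perceived complexity comes from.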


** Domain ontologies we need in the near future **

- An ontologically consistent, OBO Foundry-compliant ontology for molecular interactions and pathways. BioPAX-OBO is a new development in that area. Personally I have also made some first developments in that area (e.g., the 'OBO Essentials' ontology).

- An ontologically consistent, OBO Foundry-compliant ontology for microarray experiments

- An ontology of proteins and protein structures (e.g. http://proteinontology.info/ ?)


--------------

LIFE SCIENCE AND HEALTH CARE SPECIFIC TOPICS

** Focus on description of experimental procedures, interventions and results VS. focus on description of nature **

Some projects in the HCLS community focus on describing the process of scientific investigation, experimental procedures and their results (e.g. OBI, http://obi.sourceforge.net/), while others focus on describing the objects of these investigations directly.

To give a concrete example, we can describe protein expression either through describing a microarray assay ("a cell from organism X was extracted, pixels on the microarray corresponding to gene Y had value Z"), or by describing physiology ("organism X has part cell, gene Y mRNA has location cell, gene Y mRNA has concentration Z").

In my opinion, the consistent description of nature should have a higher priority than the description of experimental procedures. After all, our resources to generate structured data are limited, and we should focus our energies on describing our objects of investigation rather than every detail of our work in the lab.


--------------

MAKING THE SEMANTIC WEB GROW TOGETHER: IDENTIFIERS, FINDING RESOURCES, TRUST


** Trust: coarse, location based VS. fine-grained, statement based. **

I think that rather than implementing the complicated trust metrics described in academic publications in recent years (fine-grained networks of trust, based on RDF), we will probably implement much simpler mechanisms to determine whether a piece of RDF/OWL we encounter on the web is trustworthy or not. Just like on the current web, trust will be mostly based on the location of the RDF/OWL resource, i.e. on the server. Some central websites will bundle some resources in central indices, and users will choose between those central websites and different 'perspectives' of the resources on the global Semantic Web.

** Identifiers: huge sameAs services VS. strict enforcement of reuse of existing entities **

We are currently steering towards a Semantic Web with a high degree of redundancy in terms of identifiers / URIs. URIs for things that are essentially the same are being generated at a breathtaking pace, and a mapping between these entities is often not technically feasible (who wants to load a mapping file between Uniprot record URIs minted by Science Commons and those minted by Uniprot itself?).

This problem has two causes:

- technically, it is often quite hard to find existing resources. This needs to be addressed by the creation of services that allow for the quick retrieval of existing resources during ontology creation (this is the goal of http://sindice.com or the OKKAM project).

- socially, many people are very reluctant to re-use entities that have a URI with a foreign namespace. This problem is still underestimated and has already done a lot of damage to the development of the Semantic Web. The use of PURLs eases this problem a bit, as they are perceived as a more neutral ground. Personally, I still believe that the use of completely opaque URIs (like 'urn:uuid:c2f41010-65b3-11d1-a29f-00aa00c14882') might be an interesting option, although this would be against the principles of the 'open linked data' initiative.
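
As a small illustration of both points, here is a Python sketch of a 'reuse first, mint only if nothing exists' policy with opaque identifiers. The dictionary merely stands in for a lookup service; it does not call Sindice, OKKAM or any other real API, and the function names are my own. The urn:uuid form comes from Python's standard uuid module.

```python
import uuid


def mint_opaque_uri():
    """Mint a fresh URI that carries no namespace and no meaning."""
    return uuid.uuid4().urn  # a string of the form 'urn:uuid:...'


def reuse_or_mint(label, index):
    """Return the existing URI for `label`, minting one only if none exists.

    `index` is a plain dict from labels to URIs, a stand-in for an entity
    lookup service (hypothetical; a real service would also have to deal
    with ambiguous labels).
    """
    if label not in index:
        index[label] = mint_opaque_uri()
    return index[label]


index = {}
first = reuse_or_mint("serotonin receptor 2A", index)
second = reuse_or_mint("serotonin receptor 2A", index)

print(first)
print("same URI re-used:", first == second)
```

Because such a URI belongs to nobody's namespace, the social barrier to re-using it should be lower; the price is that it cannot be dereferenced the way an HTTP URI can.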

Many of these and other questions are addressed in Jonathan's texts about URIs (http://sw.neurocommons.org/2007/uri-note/).


--------------

COMMERCIALISATION STRATEGIES

Of course, Semantic Web technologies can be used both on the public internet and on the intranets of organizations (laboratories, pharmaceutical industry, healthcare providers). However, I am interested in the possibility of making the public, global Semantic Web commercially useful. We should think about scenarios where the value of the Semantic Web for commercial enterprises is not solely based on using the technologies locally, but also on becoming part of the global HCLS Semantic Web community; donating Semantic Web resources where possible and, at the same time, profiting from the donations of others. An open source data and knowledge economy.

The role of a Semantic Web company in such a scenario would not only be to tailor software applications to the specific needs of customers (i.e. HCLS institutions), but also to help customers become 'good citizens' of the global Semantic Web - for their own benefit.


** Revenue from advertisements **

Revenue through targeted advertisements on websites is financing large portions of the current public web. Non-governmental institutions that plan to offer information resources on the Semantic Web need to be able to get some revenue from placing advertisements. In some cases, e.g., when the information is not offered through some HTML page but through a SPARQL endpoint, it is currently difficult to place targeted advertisements. It is important for the sustained growth of the public Semantic Web to explore strategies for placing advertisements in such scenarios.

Because information and context are much more explicit in Semantic Web resources than on normal web pages, the potential for targeted advertising (similar to Google's AdSense) is huge.


------------------------------
------------------------------

Many of the items are presented as choices between A and B, as I have the impression that such bold distinctions encourage feedback. Of course, most of them are not either-or choices but are rather continua where the best solution lies somewhere in between (but not necessarily in the middle).

If there is interest in some of these topics, please reply so they can be discussed in more detail. If anyone is interested in extending this unorganized note to a publishable review or some W3C document, I would be happy to participate.


Cheers,
Matthias Samwald

---
About me: http://neuroscientific.net/curriculum
