> Hmmmm, there are a lot of modeling questions in there.

The adage "all models are wrong, but some are useful" comes to mind.  To answer 
these questions, you need to define your use cases.  What are you trying to 
model?  Why?  How is the data going to be used?

Are you trying to model the sequencing and primary data analysis steps?  Or 
metadata about the sequencing technology and instrument/platform?  Or a 
physical piece of DNA?  Or the myriad annotations that can be associated with a 
region of DNA sequence (each with their own provenance)?  Or the current state 
of our collective knowledge of molecular and cellular biology related to a 
given DNA sequence?  Or the clinical phenotype (e.g., disease) and/or treatment 
options that might be related to a particular variant?  Or...

It is possible to create a very intricate model that represents all of this 
(and more).  However, it is likely unnecessary (unless you've got some monster 
use cases).

The mol bio/genetics community has spent decades refining ways to express 
genetic data.  One could create a model for a VCF file, but I'm not sure it 
would be all that useful.  VCF was developed (in part) to be a compact file 
format for representing a list of genetic variants.  By definition, it includes 
only the differences from some reference sequence.  It was not intended to be 
an accurate model of biology.
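
To make the "differences only" point concrete, here is a minimal sketch (not part of any VCF tooling) of how a single record is applied to a reference to recover the sample sequence. The function name `apply_variant` and the toy reference string are assumptions made for illustration; positions are taken to be 1-based, as in VCF.

```python
def apply_variant(reference: str, pos: int, ref: str, alt: str) -> str:
    """Apply one VCF-style (POS, REF, ALT) record to a reference string.

    pos is 1-based. The record only states what differs, so the rest of
    the sample sequence is copied unchanged from the reference.
    """
    start = pos - 1                 # convert to 0-based index
    end = start + len(ref)
    if reference[start:end] != ref:
        raise ValueError("REF allele does not match the reference at POS")
    return reference[:start] + alt + reference[end:]


# Toy example (hypothetical data): delete "CTG" after the "A" at position 3.
sample = apply_variant("GGACTGTT", 3, "ACTG", "A")
print(sample)  # GGATT
```

Everything outside the recorded variant is inherited from the reference, which is why the file by itself says nothing about the rest of the molecule.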

I've spent some time thinking about and exploring ways to express genetic data 
in RDF.  I have yet to find a compelling example where the RDF representation 
has a significant advantage (and in most cases the opposite is true).  That 
said, it is quite possible that someone more proficient in RDF will succeed 
where I have not, and I look forward to the day if/when that occurs.

> All my ideas seem at least a little awkward.

Indeed.

It will certainly be important to track metadata about the sequence analysis 
method, etc.  In some cases it will be important to have information about the 
confidence or quality score for a base call at a given position (most likely to 
aid reconciliation efforts, when multiple sequences for the same sample are 
obtained).  Haplotype phasing will also start to become an issue as techniques 
are developed to determine it experimentally (as opposed to the statistical 
approaches that are currently used).  I suppose all this could be expressed in 
RDF, provided there is a use case driving the effort.

In my opinion, the real potential of using semweb technologies with genetic 
data is in the layers of interpretation that are built from the genetic 
sequences.  While the underlying genetic sequences can be rebuilt and refined 
over time, there are plenty of existing tools that can manage this process very 
efficiently.  Our collective knowledge about those sequences, however, advances 
continuously.  Changes in our understanding in one area might cascade into 
others.  We need a way to dynamically update the interpretations and discover 
novel relationships.  Genetic data (in any format) + biological knowledge (in 
RDF?) + reasoners could be a powerful combination.

What is the impact of a genetic variant at a given location?  This is a hot 
field of study within genetic/bioinformatic research, and solutions to this 
problem will be critical for clinical personalized medicine programs.

Bob


________________________________
From: public-semweb-lifesci-requ...@listhub.w3.org 
[mailto:public-semweb-lifesci-requ...@listhub.w3.org] On Behalf Of Jeremy J 
Carroll
Sent: Thursday, March 21, 2013 5:01 PM
To: w3c semweb HCLS
Subject: 'Variants' and Chromosome Modelling


Jerven suggests:

"instead of saying chrM it would have been solved by
using 
http://my.lab.org/confidential/patientXXYYZZ/genome/sampleXX/ChrM/assemblyTTv43/VariantCalls5"

rather than continuing the philosophical/theological threads ....
I am interested in this practical question.


chrM as an address

I want to represent bases on chrM; how should I do this?

My current intent is to continue with the model and the ontology implicit in 
the VCF format (1000 genomes) and make it somewhat more explicit.

In this model "chrM: 5000 - 5003" identifies 4 bases (inclusive end point) in 
the mitochondrial DNA in some reference assembly .... if I have understood 
correctly, and the items of interest to be modeled are variations against that 
reference assembly. In this model, we may choose to use an address like "chrM: 
5000 - 5003" to identify some part of a reference assembly from which the 
current experimental assembly differs.
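
As a rough sketch of the addressing convention described above (1-based positions, inclusive end point), the following parses an address string and converts it to a length and to a 0-based slice. The names `Region` and `parse_region` are hypothetical and exist only for this illustration.

```python
from typing import NamedTuple


class Region(NamedTuple):
    chromosome: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive

    def length(self) -> int:
        # Both end points are included, hence the +1.
        return self.end - self.start + 1

    def as_slice(self) -> slice:
        # Equivalent 0-based, half-open slice for indexing a Python string.
        return slice(self.start - 1, self.end)


def parse_region(text: str) -> Region:
    """Parse an address such as 'chrM: 5000 - 5003'."""
    chromosome, _, span = text.partition(":")
    start, _, end = span.partition("-")
    return Region(chromosome.strip(), int(start), int(end))


region = parse_region("chrM: 5000 - 5003")
print(region.length())  # 4
```

Note that the address is only meaningful relative to a named reference assembly, which this sketch leaves out.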

In this way of thinking, I am not really interested in an assembly of ChrM for 
patient XXYYZZ's sampleXX, and so Jerven's URI to refer to that is not so useful.
I guess I am surprised to see Jerven suggesting a URI in which the assembly is 
part of the ChrM rather than the other way round.


variants, defaults, non-monotonic reasoning

Part of my problem here is to do with defaults and diffs and knowledge and 
modeling ....
In general, the smart money likes monotonic reasoning as opposed to 
non-monotonic reasoning; because of reasoning tractability issues. Defaults, 
diffs, variants, all tend to non-monotonic reasoning, or closed world 
assumptions or ... since if I have not been told that a particular sample's 
assembly has a variant from the reference assembly at a particular position 
then I effectively assume that the base in my sample's assembly is the same as 
the base in the reference assembly. In practice this is then an issue when the 
quality of the non-variant call is questionable. (see
https://sites.google.com/site/gvcftools/home/about-gvcf
concerning non-variant sites)
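
The default assumption described above can be sketched as a lookup that falls back to the reference base when no variant is recorded. This is only an illustration restricted to single-base substitutions; the function `base_at` and the optional `called` set (positions with an explicit call, loosely in the spirit of gVCF non-variant sites) are assumptions, not part of any real tool.

```python
from typing import Dict, Optional, Set


def base_at(
    pos: int,
    reference: str,
    variants: Dict[int, str],
    called: Optional[Set[int]] = None,
) -> Optional[str]:
    """Return the sample base at a 1-based position.

    variants maps position -> alternate base (single-base changes only).
    If called is None we take the closed-world view: no recorded variant
    means "same as reference". If called is given, positions outside it
    have no evidence either way and yield None.
    """
    if pos in variants:
        return variants[pos]
    if called is not None and pos not in called:
        return None
    return reference[pos - 1]


reference = "ACGT"
variants = {2: "T"}
print(base_at(3, reference, variants))                 # G (assumed default)
print(base_at(3, reference, variants, called={1, 2}))  # None (no call made)
```

The second call shows why the quality of non-variant calls matters: without the `called` information, an uncovered site is indistinguishable from a confident match to the reference.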

My gut feel is that these concerns, while theoretically well-founded, are 
practically irrelevant - we simply need to engineer our knowledge systems so 
that we do have 'complete' variant information, and some awareness that any 
individual call (either variant or non-variant) may be wrong. 'complete' may 
have a rather parochial system-specific meaning ...

Without the defaults, and the diffs and all the rest, the storage and query 
tractability issues appear overwhelming .... and so there isn't really any 
practical choice here.

phases of analysis

Analysis of the raw experiments in sequencing machines takes place in phases; 
and each phase does in practice need to assume the results of the previous 
phase; with some awareness of the shades of grey in such assumptions. Each 
phase essentially passes only 'output' to the next stage, and we cannot, in 
practice, forever return to the raw data to justify every step at every stage.



practical ? proposal for representing an assembly of a patient's sample


_:sample eg:sampleFromFile   <ftp://example.org/mypatientsample.vcf> .

# metadata headers from VCF file, cleaned up somewhat
<ftp://example.org/mypatientsample.vcf> vcf:reference 
<http://www.ncbi.nlm.nih.gov/assembly/GCF_000001405.17/> .
<ftp://example.org/mypatientsample.vcf> vcf:fileDate "2012-06-26"^^xsd:date .

# each row of the data of the VCF file becomes something like

_:sample eg:hasVariant [

   eg:aboutGenomePosition [
# we use a restricted vocabulary of chromosome names
       eg:chromosome eg:chrM ;
       eg:startPosition "5000"^^xsd:int ;
       eg:endPosition "5003"^^xsd:int ;
       eg:referenceSequence _:ref5000 ;
       eg:alternateSequence _:alt5000
 # more stuff from ID, ALT, QUAL, FILTER and INFO fields of VCF
   ] ;

# some mapping of the per-sample field
# e.g. in 1000 genomes data FORMAT=GT:DS:GL 1|0:1.000:-1.69,-0.01,-5.00
# the 1|0 is a phased genotype call
   eg:GT [
      eg:phase _:p1 ;
      eg:gtCall _:alt5000 ;
   ] ;
   eg:GT [
      eg:phase _:p2 ;
      eg:gtCall _:ref5000 ;
   ]
].
_:ref5000 eg:sequence "ACTG" .
_:alt5000 eg:sequence "A" .



Hmmmm, there are a lot of modeling questions in there. The VCF file format has 
some answers, but not very good ones, partly because the questions do not 
appear to have been asked as modeling questions.
It seems pretty unclear to me how to include the GL (Genotype Likelihood) 
values in there. I think these are used to help make the genotype call; and 
then kept around in case you don't like the call.
The phasing also seems problematic, since it seems that it is generally useful 
information as to which strand which allele was seen on, (for example for 
haplotype identification) but in practice we can't trace a strand all the way 
through a chromosome.
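
For the GL values, one possible reading is sketched below using the per-sample field quoted earlier (FORMAT=GT:DS:GL). It assumes a biallelic site, with GL holding log10-scaled likelihoods in the order 0/0, 0/1, 1/1, as the VCF specification describes; the helper names `parse_sample` and `most_likely_genotype` are made up for this example.

```python
from typing import Dict


def parse_sample(format_field: str, sample_field: str) -> Dict[str, str]:
    """Pair the colon-separated FORMAT keys with the sample's values."""
    return dict(zip(format_field.split(":"), sample_field.split(":")))


def most_likely_genotype(gl_field: str) -> str:
    """Pick the genotype with the highest log10 likelihood.

    Assumes a biallelic site, so GL has three values ordered
    0/0, 0/1, 1/1.
    """
    likelihoods = [float(value) for value in gl_field.split(",")]
    genotypes = ["0/0", "0/1", "1/1"]
    if len(likelihoods) != len(genotypes):
        raise ValueError("expected three GL values for a biallelic site")
    best = max(range(len(likelihoods)), key=likelihoods.__getitem__)
    return genotypes[best]


sample = parse_sample("GT:DS:GL", "1|0:1.000:-1.69,-0.01,-5.00")
print(sample["GT"])                        # 1|0
print(most_likely_genotype(sample["GL"]))  # 0/1
```

Here the likelihoods agree with the heterozygous call, but they carry no phase information, which is part of why GT and GL sit awkwardly together in one model.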

Further the genotype call may be phased (ordered with respect to genotype calls 
at at least one other position), or unphased (i.e. an unordered pair); and the 
two values may be the same or different - the best way to model that is ??? All 
my ideas seem at least a little awkward.
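
One (still somewhat awkward) option is sketched here: keep the allele order when the call is phased and normalise it when it is not, so that an unphased pair behaves as unordered. The `Genotype` class and `parse_gt` function are hypothetical, and missing alleles (".") are not handled.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Genotype:
    alleles: Tuple[int, ...]  # allele indices: 0 = REF, 1 = first ALT, ...
    phased: bool


def parse_gt(text: str) -> Genotype:
    """Parse a GT string such as '1|0' (phased) or '0/1' (unphased)."""
    phased = "|" in text
    alleles = tuple(int(a) for a in text.replace("|", "/").split("/"))
    if not phased:
        # Order carries no meaning for an unphased call, so sort it away.
        alleles = tuple(sorted(alleles))
    return Genotype(alleles, phased)


def is_heterozygous(genotype: Genotype) -> bool:
    return len(set(genotype.alleles)) > 1


print(parse_gt("1|0") == parse_gt("0|1"))  # False: phase distinguishes them
print(parse_gt("1/0") == parse_gt("0/1"))  # True: unordered pair
```

This does not capture which other positions a phased call is ordered with respect to; that would need an additional phase-set identifier of some kind.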

Or would it be better just to dump this stuff in an RDB, and be done with it?

Jeremy




