On Wednesday, 1 September 2021 at 05:36:53 UTC, James Blachly
wrote:
In another post, I've just announced our D-based high
throughput sequencing library, dhtslib.
One feature that is, AFAIK, novel in the field is leveraging
the compiler's type system to enforce correctness regarding
different genome/reference sequence coordinate systems.
Clearly, the encoding of domain specific knowledge in a
language's type system is nothing new, but it is surprising
that this has not been done before in bioinformatics, and it is
an idea that IMO is long overdue given the trainwreck of
different coordinate systems in our field.
You can find dhtslib's develop branch, with Typesafe
Coordinates merged and ready to use, here:
https://github.com/blachlylab/dhtslib/
**Now the request:**
We've drafted a manuscript describing Typesafe Coordinates as a
sort of low-key endorsement of the D language and our library
package `dhtslib`. You can find the manuscript here:
https://github.com/blachlylab/typesafe-coordinates/
We would be very grateful to those of you who would take the
time to read the manuscript and post comments (publicly or
privately), _especially if we have made any incorrect
statements_ or our language regarding type systems is awkward
or nonstandard.
We did praise D, and gently criticized Rust and OCaml* somewhat
as it appeared to me that they lacked the features required to
implement Typesafe Coordinate Systems in as ergonomic a way as
we could in D. However, being a true novice at both of these
other languages there is the possibility that I've missed
something significant, and that the Rust and OCaml
implementations could be retooled to match the D
implementation. I'd still be glad to hear it if that's the case.
I plan to make a few minor cleanups and submit this to a
preprint server as well as a scientific journal in the next
week or so.
Kind regards
James S Blachly, MD
The Ohio State University
* as a side note, I actually find the OCaml code quite
attractive in its terseness: `let j = cl_interval_of_ho
(ob_interval_of_zb i)`
Hi James and Charles,
I am happy to hear of your latest idea of creating type-safe
coordinate systems. It's a great idea!
After reading the code on GitHub, I have only one major remark:
IMHO, it would be great to separate the novel coordinate systems
from any `htslib` dependencies ([see lines
47-50](https://github.com/blachlylab/dhtslib/blob/e3b5af14e9eefa54bcc27bc0fcc9066dc3a4ea54/source/dhtslib/coordinates.d#L47-L50)), as only the auxiliary functions use both the coordinate systems and `htslib`. The greater goal I have in mind is to provide the coordinate systems in a separate DUB sub-package (e.g. `dhtslib:coordinates`) that requires only a D compiler. That would make integration into existing projects that do not need `htslib` much easier.
Also, I have a short list of minor, technical remarks:
1. The return type in [line
114](https://github.com/blachlylab/dhtslib/blob/e3b5af14e9eefa54bcc27bc0fcc9066dc3a4ea54/source/dhtslib/coordinates.d#L114) has a typo: there is an additional 's'.
2. The array of identifiers `CoordSystemLabels` in [line
203](https://github.com/blachlylab/dhtslib/blob/e3b5af14e9eefa54bcc27bc0fcc9066dc3a4ea54/source/dhtslib/coordinates.d#L203) is a bit unsafe and not strictly required for two reasons:
   1. It can be generated by the compiler using `enum
   CoordSystemLabels = __traits(allMembers, CoordSystem);`.
   2. As far as I can tell, its only application is in [line
   376](https://github.com/blachlylab/dhtslib/blob/e3b5af14e9eefa54bcc27bc0fcc9066dc3a4ea54/source/dhtslib/coordinates.d#L376). The same result can be achieved safely using `cs.stringof.split('.')[$ - 1]` or, without `std.array.split`, using `cs.stringof[CoordSystem.stringof.length + 1 .. $]`.
3. The function `unionImpl` in [line
326](https://github.com/blachlylab/dhtslib/blob/e3b5af14e9eefa54bcc27bc0fcc9066dc3a4ea54/source/dhtslib/coordinates.d#L326) actually computes the convex hull of the two intervals, which should be noted in the doc comment for completeness' sake.
4. I noticed that you use operator overloading for union and
intersection of `Interval`s. You could also add overloads for the
`offset` function in both `Interval` and `Coordinate` with `auto
opBinary(string op, T)(T off) if ((op == "+" || op == "-") &&
isIntegral!T)` and `auto opBinaryRight(string op, T)(T off) if
((op == "+" || op == "-") && isIntegral!T)`.
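
To make remarks 2 to 4 more concrete, here is a minimal, self-contained sketch. Note that `CoordSystem`, `Interval`, `coordSystemLabels` and `hull` below are simplified stand-ins that I made up for illustration; they are not the actual dhtslib definitions, and the enum member names are an assumption. Also, the sketch only overloads `+` on the right-hand side, because it is not obvious to me what `n - interval` should mean.

```d
import std.algorithm.comparison : max, min;
import std.traits : isIntegral;

// Hypothetical stand-in for the real enum (member names are assumed).
enum CoordSystem { zbho, zbc, obho, obc }

// Remark 2: let the compiler derive the labels from the enum itself,
// so the array cannot drift out of sync with the enum members.
enum string[] coordSystemLabels = [__traits(allMembers, CoordSystem)];
static assert(coordSystemLabels == ["zbho", "zbc", "obho", "obc"]);
static assert(coordSystemLabels[CoordSystem.obho] == "obho");

// Hypothetical, untyped half-open interval [start, end).
struct Interval
{
    long start;
    long end;

    // Remark 4: `iv + n` and `iv - n` shift both endpoints by n.
    // `op` is a string, so it is compared against "+" and "-".
    Interval opBinary(string op, T)(T off) const
        if ((op == "+" || op == "-") && isIntegral!T)
    {
        const long d = off;
        return op == "+" ? Interval(start + d, end + d)
                         : Interval(start - d, end - d);
    }

    // `n + iv` behaves like `iv + n` (design choice of this sketch:
    // `n - iv` is deliberately not supported).
    Interval opBinaryRight(string op, T)(T off) const
        if (op == "+" && isIntegral!T)
    {
        const long d = off;
        return Interval(start + d, end + d);
    }
}

// Remark 3: the convex hull is the smallest single interval that
// covers both inputs. For disjoint inputs it also spans the gap,
// which a set union would not contain.
Interval hull(Interval a, Interval b)
{
    return Interval(min(a.start, b.start), max(a.end, b.end));
}

void main()
{
    auto a = Interval(0, 5);
    auto b = Interval(10, 15);

    // Positions 5 .. 10 belong to neither a nor b, yet lie in the hull.
    assert(hull(a, b) == Interval(0, 15));

    assert(a + 3 == Interval(3, 8));
    assert(a - 3 == Interval(-3, 2));
    assert(3 + a == Interval(3, 8));
}
```

The hull example is why I would mention it in the doc comment: for overlapping or adjacent intervals, hull and union coincide, but for disjoint ones they do not.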
I enjoyed reading the manuscript. It highlights the issue clearly
and presents the solution without getting lost in details.
Ignoring typos at this stage, I have no remarks on it – keep
going!
Cheers!
-- Arne