Hi,

I think CML is definitely a useful starting point, but it would be a mistake 
to just translate everything across literally. In particular, it's worth 
thinking carefully about the different strengths of the XML and JSON formats 
(in both syntax and philosophy), and also about the reasons why CML has 
struggled to gain traction.

Keeping things as simple as possible is a major aspect of the JSON syntax and 
philosophy, so I think it's worth keeping that in mind throughout, especially 
since the perceived complexity and verbosity of CML seems to put a lot of 
people off. In practice this might mean defining things 
implicitly where possible (like array indices as ids, and 2D/3D defined by the 
absence of any z coordinates, rather than having x2, y2, x3, y3, z3), and 
purposefully avoiding certain features in the core specification (multiple 
conformers, distributed bonds?).
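
As a rough sketch of what "implicit where possible" could look like: the key 
names below ("atoms", "bonds", "element", "coords", "order") are made up for 
illustration and are not taken from CML or any existing JSON format. Bonds refer 
to atoms by array index, and 2D vs 3D is inferred from the coordinate length.

```python
import json

# Hypothetical minimal molecule document; all key names are illustrative only.
water = {
    "atoms": [
        {"element": "O", "coords": [0.000, 0.000]},
        {"element": "H", "coords": [0.757, 0.586]},
        {"element": "H", "coords": [-0.757, 0.586]},
    ],
    # Bonds refer to atoms by array index, so no explicit atom ids are needed.
    "bonds": [
        {"atoms": [0, 1], "order": 1},
        {"atoms": [0, 2], "order": 1},
    ],
}


def dimensionality(mol):
    """Infer 2 or 3 from the coordinate length instead of x2/y2 vs x3/y3/z3 keys."""
    lengths = {len(atom["coords"]) for atom in mol["atoms"]}
    if len(lengths) != 1:
        raise ValueError("atoms must all have the same number of coordinates")
    return lengths.pop()


def neighbours(mol, index):
    """Return the array indices of atoms bonded to the atom at `index`."""
    result = []
    for bond in mol["bonds"]:
        a, b = bond["atoms"]
        if a == index:
            result.append(b)
        elif b == index:
            result.append(a)
    return result


print(json.dumps(water))
print(dimensionality(water), neighbours(water, 0))
```

The obvious cost of index-as-id is that deleting or reordering atoms means 
rewriting the bond list, which may or may not matter for an exchange format.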

Part of the reason behind the current increased interest in JSON is that it 
plays so nicely with many modern technologies that are becoming more and more 
widespread - i.e. web applications, REST APIs, document-oriented NoSQL 
databases etc. I think these use-cases definitely need to be taken into account 
when designing the format. For example, shorter key names are helpful in NoSQL 
databases and when embedding data in web pages, but there is a tradeoff there 
with readability. It's also worth thinking about how people might want to query 
and index these documents in NoSQL databases - for example having atoms as an 
array of objects allows "$elemMatch" style queries in MongoDB, which could be 
useful.
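
To make the query point concrete, here is a small sketch. The `query` dict uses 
MongoDB's real "$elemMatch" operator, but the document layout ("atoms", 
"element", "charge") is an assumption, and `elem_match` is only a pure-Python 
stand-in for the equality case so the example runs without a database.

```python
# Query document in MongoDB syntax: molecules where ONE atom is both
# nitrogen and carries a +1 charge. With pymongo this dict would be passed
# to collection.find(query); the field names are hypothetical.
query = {"atoms": {"$elemMatch": {"element": "N", "charge": 1}}}

molecules = [
    {"name": "ammonium", "atoms": [
        {"element": "N", "charge": 1},
        {"element": "H", "charge": 0},
    ]},
    # Contains nitrogen AND a +1 charge, but not on the same atom.
    {"name": "sodium cyanide", "atoms": [
        {"element": "Na", "charge": 1},
        {"element": "C", "charge": -1},
        {"element": "N", "charge": 0},
    ]},
]


def elem_match(doc, field, conditions):
    """Stand-in for $elemMatch, supporting equality conditions only."""
    return any(
        isinstance(item, dict)
        and all(item.get(key) == value for key, value in conditions.items())
        for item in doc.get(field, [])
    )


conditions = query["atoms"]["$elemMatch"]
matches = [m["name"] for m in molecules if elem_match(m, "atoms", conditions)]
print(matches)
```

The second molecule is the interesting case: it only gets excluded because each 
atom is a single object, so all conditions are tested against the same atom. 
With parallel arrays of elements and charges that kind of query gets awkward.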

As others have said, the downside to JSON's simplicity is that extensibility is 
not standardised like it is in CML - it is essentially a free-for-all, which as 
Craig points out will likely lead to chaos in the long run. There are projects 
like JSON-LD (http://json-ld.org) which would allow proper decentralised 
extensibility, but it's not widely supported and sacrifices simplicity, meaning 
you lose some of JSON's biggest strengths over XML anyway. Maybe something like 
namespaced keys ("org.openbabel.fp2": "...") would be simpler, along with some 
kind of ongoing community project to define equivalent keys in a 
machine-readable dictionary. Or maybe we just accept an inevitable free-for-all 
and just aim to define a sensible common core.
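
A sketch of how the namespaced-key idea might work in practice. Only 
"org.openbabel.fp2" comes from the suggestion above; the core key set, the 
"org.example.pathfp" key and the concept names in the dictionary are invented 
for illustration.

```python
# Hypothetical core specification: the only un-namespaced keys a reader must know.
CORE_KEYS = {"atoms", "bonds"}

# Hypothetical machine-readable dictionary mapping project-specific keys
# to a shared concept name, so equivalent extensions can be recognised.
EQUIVALENTS = {
    "org.openbabel.fp2": "fingerprint.path",
    "org.example.pathfp": "fingerprint.path",
}


def split_document(doc):
    """Separate core keys, namespaced extension keys, and anything else."""
    core, extensions, unknown = {}, {}, {}
    for key, value in doc.items():
        if key in CORE_KEYS:
            core[key] = value
        elif "." in key:
            extensions[key] = value
        else:
            unknown[key] = value
    return core, extensions, unknown


def concept(key):
    """Look up the shared concept for an extension key, or return the key itself."""
    return EQUIVALENTS.get(key, key)


doc = {
    "atoms": [{"element": "C"}],
    "bonds": [],
    "org.openbabel.fp2": "...",
    "colour": "red",
}

core, extensions, unknown = split_document(doc)
print(sorted(core), sorted(extensions), sorted(unknown))
print(concept("org.openbabel.fp2") == concept("org.example.pathfp"))
```

A reader that only understands the core can ignore everything with a dot in the 
key, and un-namespaced extras like "colour" are easy to flag.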

Basically, I just think it's worth being careful not to simply repeat the work 
people have done with CML, and it would be great if we could create something 
that really plays to JSON's strengths.

Matt

On 7 Jun 2013, at 20:21, Patrick Fuller <patrickful...@gmail.com> wrote:

> Wow, that was a very insightful email. Thank you for writing it.
> 
> Getting back to something actionable, what do you think about the idea of 
> just translating the CML standard to json? Outside of some nuances, XML and 
> JSON generally accomplish the same thing, so I would think that the chemical 
> XML standard would be easily translatable to chemical JSON.
> 
> 
> On Fri, Jun 7, 2013 at 1:45 PM, Craig James <cja...@emolecules.com> wrote:
> On Fri, Jun 7, 2013 at 10:25 AM, Patrick Fuller <patrickful...@gmail.com> 
> wrote:
> A SMILES contains exactly the same information as the atom/bond lists in a 
> much more compact form. If you want to avoid the aromaticity problem, just 
> use Kekule form, which makes it virtually identical to any other connection 
> table format, but in about 10x to 20x fewer bytes. SMILES are very easy to 
> parse, and there are dozens of parsers around.
> 
> What I truly like about smiles is that it's human readable + hashable, which 
> I see as the real goal. The shorter length is just a corollary of that. Prove 
> me wrong, but I think people make too big a deal about size of molecule 
> formats. I just bought a 2 TB hard disk drive for $70. With MongoDB + their 
> json serialization, I estimated that I can put 200 million verbose json mof 
> structures on that drive. I only have a few thousand, so I have some room to 
> spare.
> 
> I have a database of 10 million compounds. The SDF version, even compressed, 
> is difficult to transfer over the internet.  It's not about disks, it's about file 
> transfers and database performance.  It's not a matter of a few bytes here or 
> there (I agree that people worry about file size too much).  It's about a 
> factor of ten or twenty.  Connection-table lists of atoms and bonds are just 
> a dumb way to represent atoms and bonds.
> This discussion has focussed on the syntax of JSON, but completely overlooks 
> the real problem with ALL chemical file formats: how do you handle all of the 
> cases where a simple connection-table ("ball and stick") doesn't capture 
> reality? Things like aromaticity, tautomers, organo-metallic bonds, 
> boron-hydrogen cages, distributed bonds (ferrocenes and the like) ... these 
> are the problems.
> 
> The point of json (and xml) is that they are extensible - that's why json has 
> exploded in the developer community.
> 
> This isn't necessarily a good thing.  One of the biggest problems in 
> cheminformatics and molecular modeling is that people have altered existing 
> formats to suit their own needs ... and that has led to disaster.  There is 
> no such thing as the "PDB format" -- rather, you mostly have to know the 
> origin of a particular PDB file in order to interpret it.  Each project 
> effectively has its own "PDB format."
> 
> JSON may be extensible, but that is useless unless there is a widely 
> recognized authority on the meaning of each extension, along with open-source 
> software that illustrates a practical application of the standard.
> 
> Never forget the old joke, "The great thing about standards is that there are 
> so many to choose from!"  JSON essentially gives you a stronger rope when you 
> are in the process of hanging yourself. 
> If you need handles for aromaticity and metallic bonding, just add new 
> properties to the json/xml. Because of the extensibility, adding new 
> properties will not break any existing code.
> 
> Then why have a standard at all? What is the use of new properties if nobody 
> knows what they mean?  What happens when five projects all introduce their 
> own syntax and semantics for representing aromaticity and metallic bonding?  
> Chaos.
> That's the advantage over all of the older table formats, which weren't built 
> to be extensible. And you see the repercussions in scientific code all the 
> time.
> 
> The real problem had nothing to do with being "built to be extensible," but 
> rather that the table format definitions were controlled by commercial 
> companies that had no interest in data exchange or in participation by the 
> chemistry community.
> 
> When I created the OpenSMILES.org web page, I more-or-less did it by stealing 
> the leadership from Daylight, the company that invented SMILES.  I invited 
> their participation but, while they didn't object to our project, they also 
> elected to stay out of it.  SMILES now has a future that's in the hands of 
> the community.  If the community decides to add features, we can ... and 
> we'll all be able to agree on those features.
> 
> It might seem as if I'm trying to discourage JSON, but nothing could be 
> farther from the truth.  A modern, object-oriented, extensible and well 
> documented format is long overdue.  The CML project is one such (you might 
> want to look at it for ideas), but it never got traction.  Maybe JSON, with 
> its widespread use and readily-available software, is just the thing.
> 
> If you really want to make JSON a standard, the JSON syntax itself is a 
> trivial part of the problem. The real problem is establishing standards for 
> how each datatype is to be interpreted, followed by clear, published 
> standards for each datatype.  If you let people just add their own datatypes 
> on an as-you-please basis, you'll just have another Tower of Babel ... and 
> that's where the name OpenBabel came from in the first place.
> 
> Craig
> 
> 
> ------------------------------------------------------------------------------
> How ServiceNow helps IT people transform IT departments:
> 1. A cloud service to automate IT design, transition and operations
> 2. Dashboards that offer high-level views of enterprise services
> 3. A single system of record for all IT processes
> http://p.sf.net/sfu/servicenow-d2d-j
> _______________________________________________
> OpenBabel-discuss mailing list
> OpenBabel-discuss@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/openbabel-discuss
