I vote that we aim to define a sensible common core. I think namespaced keys
will sacrifice too much human readability, and I agree that JSON-LD
sacrifices simplicity. In the vein of readability, how about just
serializing pybel.Molecule?
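
To make that concrete, here is a rough sketch of what a readable common core
might look like, following Matt's suggestions below (array indices as ids,
2D/3D inferred from the coordinates). The key names ("atoms", "bonds",
"element", "coords", "order") and both helper functions are my own
assumptions, not an agreed format. The atom and bond lists are written by
hand here so the sketch runs without Open Babel; in practice they would be
filled in from a pybel.Molecule.

```python
import json


def to_core_json(atoms, bonds):
    """Serialize a molecule to a minimal, hypothetical "common core" layout.

    atoms: list of (element_symbol, coords) pairs; coords has 2 or 3 numbers.
    bonds: list of (begin_index, end_index, order) triples. The indices are
           positions in the atoms list, so no explicit atom ids are stored.
    """
    return json.dumps(
        {
            "atoms": [{"element": el, "coords": list(xyz)} for el, xyz in atoms],
            "bonds": [{"atoms": [i, j], "order": order} for i, j, order in bonds],
        },
        indent=2,
    )


def dimensions(core):
    """Infer 2D vs 3D from the coordinate length instead of a separate flag."""
    return 3 if all(len(a["coords"]) == 3 for a in core["atoms"]) else 2


# Water: atoms are identified only by their position in the array.
water = to_core_json(
    atoms=[
        ("O", (0.0, 0.0, 0.0)),
        ("H", (0.757, 0.586, 0.0)),
        ("H", (-0.757, 0.586, 0.0)),
    ],
    bonds=[(0, 1, 1), (0, 2, 1)],
)
print(water)
```

Anything beyond these keys (aromaticity flags, fingerprints and so on) would
be an extension that a reader of the core is free to ignore.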


On Fri, Jun 7, 2013 at 5:58 PM, Matt Swain <mattswain...@gmail.com> wrote:

> Hi,
>
> I think CML is definitely a useful starting point, however I think it
> would be a mistake to just translate everything across in a literal way. In
> particular, I think it's definitely worth thinking carefully about the
> different strengths of the XML and JSON formats (in both syntax and
> philosophy), and also about the reasons why CML has struggled a bit to gain
> traction.
>
> The idea of keeping things as simple as possible is a major aspect of the
> JSON syntax and philosophy, so I think it's worth keeping that in mind as
> much as possible, especially considering the perceived complexity and
> verbosity of CML seems to put a lot of people off. In practice this might
> mean defining things implicitly where possible (like array indices as ids,
> and 2D/3D defined by the absence of any z coordinates, rather than having
> x2, y2, x3, y3, z3), and purposefully avoiding certain features in the core
> specification (multiple conformers, distributed bonds?).
>
> Part of the reason behind the current increased interest in JSON is that
> it plays so nicely with many modern technologies that are becoming more and
> more widespread - e.g. web applications, REST APIs,
> document-oriented NoSQL databases etc. I think these use-cases definitely
> need to be taken into account when designing the format. For example, shorter
> key names are helpful in NoSQL databases and when embedding data in web
> pages, but there is a tradeoff there with readability. It's also worth
> thinking about how people might want to query and index these documents in
> NoSQL databases - for example having atoms as an array of objects allows
> "elemMatch" style queries in MongoDB, which could be useful.
>
> As others have said, the downside to JSON's simplicity is that
> extensibility is not standardised like it is in CML - it is essentially a
> free-for-all, which as Craig points out will likely lead to chaos in the
> long run. There are projects like JSON-LD (http://json-ld.org) which
> would allow proper decentralised extensibility, but it's not widely
> supported and sacrifices simplicity, meaning you lose some of JSON's
> biggest strengths over XML anyway. Maybe something like namespaced keys
> ("org.openbabel.fp2": "...") would be simpler, along with some kind of
> ongoing community project to define equivalent keys in a machine-readable
> dictionary. Or maybe we just accept an inevitable free-for-all and just aim
> to define a sensible common core.
>
> Basically, I just think it's worth being cautious not to just repeat the
> work people have done with CML, and it would be great if we could create
> something that really plays to JSON's strengths.
>
> Matt
>
> On 7 Jun 2013, at 20:21, Patrick Fuller <patrickful...@gmail.com> wrote:
>
> Wow, that was a very insightful email. Thank you for writing it.
>
> Getting back to something actionable, what do you think about the idea of
> just translating the CML standard to JSON? Outside of some nuances, XML and
> JSON generally accomplish the same thing, so I would think that the
> chemical XML standard would be easily translatable to chemical JSON.
>
>
> On Fri, Jun 7, 2013 at 1:45 PM, Craig James <cja...@emolecules.com> wrote:
>
>> On Fri, Jun 7, 2013 at 10:25 AM, Patrick Fuller
>> <patrickful...@gmail.com> wrote:
>>
>>>  A SMILES contains exactly the same information as the atom/bond lists
>>> in a much more compact form. If you want to avoid the aromaticity problem,
>>> just use Kekule form, which makes it virtually identical to any other
>>> connection table format, but in about 10x to 20x fewer bytes. SMILES are
>>> very easy to parse, and there are dozens of parsers around.
>>>
>>> What I truly like about SMILES is that it's human readable + hashable,
>>> which I see as the real goal. The shorter length is just a corollary of
>>> that. Prove me wrong, but I think people make too big a deal about size of
>>> molecule formats. I just bought a 2 TB hard disk drive for $70. With
>>> MongoDB + its JSON serialization, I estimated that I can put 200 million
>>> verbose JSON MOF structures on that drive. I only have a few thousand, so
>>> I have some room to spare.
>>>
>> I have a database of 10 million compounds. The SDF version, even
>> compressed, is difficult to transfer over the internet.  It's not about
>> disks, it's
>> about file transfers and database performance.  It's not a matter of a few
>> bytes here or there (I agree that people worry about file size too much).
>> It's about a factor of ten or twenty.  Connection-table lists of atoms and
>> bonds are just a dumb way to represent atoms and bonds.
>>
>>> This discussion has focussed on the syntax of JSON, but completely
>>> overlooks the real problem with ALL chemical file formats: how do you
>>> handle all of the cases where a simple connection-table ("ball and stick")
>>> doesn't capture reality? Things like aromaticity, tautomers,
>>> organo-metallic bonds, boron-hydrogen cages, distributed bonds (ferrocenes
>>> and the like) ... these are the problems.
>>>
>>> The point of JSON (and XML) is that they are *extensible* - that's why
>>> JSON has exploded in the developer community.
>>>
>> This isn't necessarily a good thing.  One of the biggest problems in
>> cheminformatics and molecular modeling is that people have altered existing
>> formats to suit their own needs ... and that has led to disaster.  There is
>> no such thing as the "PDB format" -- rather, you mostly have to know the
>> origin of a particular PDB file in order to interpret it.  Each project
>> effectively has its own "PDB format."
>>
>> JSON may be extensible, but that is useless unless there is a widely
>> recognized authority on the meaning of each extension, along with
>> open-source software that illustrates a practical application of the
>> standard.
>>
>> Never forget the old joke, "The great thing about standards is that there
>> are so many to choose from!"  JSON essentially gives you a stronger rope
>> when you are in the process of hanging yourself.
>>
>>>  If you need handles for aromaticity and metallic bonding, just add new
>>> properties to the json/xml. Because of the extensibility, adding new
>>> properties will not break any existing code.
>>>
>> Then why have a standard at all? What is the use of new properties if
>> nobody knows what they mean?  What happens when five projects all introduce
>> their own syntax and semantics for representing aromaticity and metallic
>> bonding?  Chaos.
>>
>>>  That's the advantage over all of the older table formats, which weren't
>>> built to be extensible. And you see the repercussions in scientific code
>>> all the time.
>>>
>> The real problem had nothing to do with being "built to be extensible,"
>> but rather that the table format definitions were controlled by commercial
>> companies that had no interest in data exchange or in participation by the
>> chemistry community.
>>
>> When I created the OpenSMILES.org web page, I more-or-less did it by
>> stealing the leadership from Daylight, the company that invented SMILES.  I
>> invited their participation but, while they didn't object to our project,
>> they also elected to stay out of it.  SMILES now has a future that's in the
>> hands of the community.  If the community decides to add features, we can
>> ... and we'll all be able to agree on those features.
>>
>> It might seem as if I'm trying to discourage JSON, but nothing could be
>> farther from the truth.  A modern, object-oriented, extensible and well
>> documented format is long overdue.  The CML project is one such (you might
>> want to look at it for ideas), but it never got traction.  Maybe JSON, with
>> its widespread use and readily-available software, is just the thing.
>>
>> If you really want to make JSON a standard, the JSON syntax itself is a
>> trivial part of the problem. The real problem is establishing standards for
>> how each datatype is to be interpreted, followed by clear, published
>> standards for each datatype.  If you let people just add their own
>> datatypes on an as-you-please basis, you'll just have another Tower of
>> Babel ... and that's where the name OpenBabel came from in the first place.
>>
>> Craig
>>
>>
>
> OpenBabel-discuss mailing list
> OpenBabel-discuss@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/openbabel-discuss
>
>
>