Re: [Open Babel] Open Babel in the browser
On 2013-06-06 22:13, Geoffrey Hutchison wrote:

>> Although I'm starting to think that json is such a simple format that it could do without a strict chemical specification. Getting json out of an OBMol is 5 lines of code.
>
> My concern is the opposite. It's always easy to write to an arbitrary format from an OBMol. Parsing a pile of different formats is a pain, which is why it'd be better to have a somewhat standardized, extensible style.

I'd argue that chemdoodle json, cml json, whatever json should be added to input/output formats. Openbabel's own json format would obviously be OBMols serialized to json. Neither requires making up yet another data model.

Dimitri

--
How ServiceNow helps IT people transform IT departments:
1. A cloud service to automate IT design, transition and operations
2. Dashboards that offer high-level views of enterprise services
3. A single system of record for all IT processes
http://p.sf.net/sfu/servicenow-d2d-j
___
OpenBabel-discuss mailing list
OpenBabel-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/openbabel-discuss
Re: [Open Babel] Open Babel in the browser
Regarding using JSON as a new file format...

This discussion has focussed on the syntax of JSON, but completely overlooks the real problem with ALL chemical file formats: how do you handle all of the cases where a simple connection-table (ball and stick) doesn't capture reality? Things like aromaticity, tautomers, organo-metallic bonds, boron-hydrogen cages, distributed bonds (ferrocenes and the like) ... these are the problems.

If we could solve these problems, it wouldn't much matter which file format we picked ... they'd all be equivalent and sufficient. Without solving these problems, a new file format doesn't really matter very much. All it does is make another parser with yet-another-interpretation of these hard problems.

If JSON is a need, I suggest that you embed an existing chemical format (see my previous note that uses SMILES) into a JSON object.

Craig
Re: [Open Babel] Open Babel in the browser
On Thu, Jun 6, 2013 at 2:11 PM, Patrick Fuller <patrickful...@gmail.com> wrote:

> Tim, I think Dimitri's point is that all the references are implicitly defined by list indices, rather than explicit keys. For example, something like
>
>     {
>       "atoms": {
>         "C1": { "element": "C", "location": [ 0.230811, 0.380820, -0.610968 ] },
>         "C2": { "element": "C", "location": [ -0.230811, -0.380820, 0.610968 ] }
>       },
>       "bonds": [ { "atoms": [ "C1", "C2" ], "order": 1 } ]
>     }
>
> will result in generally cleaner code. That is, molecule["atoms"]["C1"]["location"] is easier to understand than molecule["elements"]["coords"]["3d"][0]. In that regard, I completely agree with him.

If you're going to rely on positions within arrays, why not just do it the simple way?

    { "smiles": ["CCO"], "2D": [1,1,2,2,3,3], "3D": [1,1,1,2,2,2,3,3,3] }

The atoms are indexed left-to-right in the SMILES. That's it. Everything else keys to that.

A SMILES contains exactly the same information as the atom/bond lists in a much more compact form. If you want to avoid the aromaticity problem, just use Kekule form, which makes it virtually identical to any other connection table format, but in about 10x to 20x fewer bytes. SMILES are very easy to parse, and there are dozens of parsers around.

If we're going to invent yet-another-file-format, can't we at least move past 1970s atom/bond table technology?

Craig
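
To make the two layouts in this message concrete, here is a minimal Python sketch that reads coordinates from each. The helper names `coords_by_id` and `coords_by_index` are made up for illustration, and the index-based helper simply assumes that the caller already knows which position an atom has in the SMILES (working that out in general needs a SMILES parser, which is not shown).

```python
# Keyed layout, after Patrick's example: atoms are looked up by explicit id.
keyed = {
    "atoms": {
        "C1": {"element": "C", "location": [0.230811, 0.380820, -0.610968]},
        "C2": {"element": "C", "location": [-0.230811, -0.380820, 0.610968]},
    },
    "bonds": [{"atoms": ["C1", "C2"], "order": 1}],
}

# Compact layout, after Craig's example: atom i is the i-th atom in the SMILES
# and its coordinates live in a flat list.
compact = {
    "smiles": ["CCO"],
    "2D": [1, 1, 2, 2, 3, 3],
    "3D": [1, 1, 1, 2, 2, 2, 3, 3, 3],
}


def coords_by_id(molecule, atom_id):
    """Look up coordinates through an explicit atom key."""
    return molecule["atoms"][atom_id]["location"]


def coords_by_index(record, atom_index, dims=3):
    """Slice the flat coordinate list; atom_index is 0-based (an assumption)."""
    flat = record[f"{dims}D"]
    start = dims * atom_index
    return flat[start:start + dims]


print(coords_by_id(keyed, "C2"))
print(coords_by_index(compact, 2))          # third atom (the O in CCO), 3D
print(coords_by_index(compact, 1, dims=2))  # second atom, 2D
```

The keyed form spells out what is being referenced; the compact form is shorter but leaves the index arithmetic to every reader of the file.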
Re: [Open Babel] Open Babel in the browser
> If you're going to rely on positions within arrays, why not just do it the simple way?
>
>     { "smiles": ["CCO"], "2D": [1,1,2,2,3,3], "3D": [1,1,1,2,2,2,3,3,3] }

Smiles are a great representation of molecules (especially with smarts/smirks regex), and, in cases where they can be used, I think they're the best thing out there. However, they don't cover everything. I work with metal-organic frameworks, which are large crystals that require more extensibility than smiles offers (I still use _-separated smiles of the mof constituents to hash the cif / json files, however).

Also, my point in that previous email is that referencing by index is bad, not good. It's less direct than explicitly referencing items, which makes the format more difficult to understand for new users + more prone to user error.

> A SMILES contains exactly the same information as the atom/bond lists in a much more compact form. If you want to avoid the aromaticity problem, just use Kekule form, which makes it virtually identical to any other connection table format, but in about 10x to 20x fewer bytes. SMILES are very easy to parse, and there are dozens of parsers around.

What I truly like about smiles is that it's human readable + hashable, which I see as the real goal. The shorter length is just a corollary of that.

Prove me wrong, but I think people make too big a deal about size of molecule formats. I just bought a 2 TB hard disk drive for $70. With mongo db + their json serialization, I estimated that I can put 200 million verbose json mof structures on that drive. I only have a few thousand, so I have some room to spare.

> This discussion has focussed on the syntax of JSON, but completely overlooks the real problem with ALL chemical file formats: how do you handle all of the cases where a simple connection-table (ball and stick) doesn't capture reality? Things like aromaticity, tautomers, organo-metallic bonds, boron-hydrogen cages, distributed bonds (ferrocenes and the like) ... these are the problems.

The point of json (and xml) is that they are *extensible*: that's why json has exploded in the developer community. If you need handles for aromaticity and metallic bonding, just add new properties to the json/xml. Because of the extensibility, adding new properties will not break any existing code.

That's the advantage over all of the older table formats, which weren't built to be extensible. And you see the repercussions in scientific code all the time. (I was recently handed a project where someone used heavy metals in molfiles to encode rotational data. That kind of hack is exactly what json/xml fixes.)

There's also the advantage that many languages don't need a third-party library to parse a json file. Or, if you do, it's *heavily* supported (i.e. gson for java).

Geoff - Outside of some fairly minor issues, xml translates easily to json. Could the chemical xml specification just be translated to json?
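
The claim that new properties do not break existing code can be sketched in a few lines of Python. The document layout and the extra keys (`aromatic`, `x-note`) are hypothetical, chosen only for illustration; the point is just that a reader which picks out the keys it knows is unaffected by keys it does not know.

```python
import json


def read_elements(doc_text):
    """Return the element symbols, ignoring any keys this reader does not know."""
    doc = json.loads(doc_text)
    return [atom["element"] for atom in doc["atoms"]]


# A basic document and an "extended" one carrying hypothetical extra properties.
basic = '{"atoms": [{"element": "C"}, {"element": "C"}]}'
extended = (
    '{"atoms": [{"element": "C", "aromatic": false},'
    ' {"element": "C", "aromatic": false}],'
    ' "x-note": "hypothetical extension"}'
)

print(read_elements(basic))
print(read_elements(extended))
```

Note that this only shows the old reader keeps working; it says nothing about whether the reader understands what the added properties mean.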
Re: [Open Babel] Open Babel in the browser
On 06/07/2013 12:25 PM, Patrick Fuller wrote:

> Geoff - Outside of some fairly minor issues, xml translates easily to json. Could the chemical xml specification just be translated to json?

If you gloss over things like #[P]CDATA, (elt+), (#CDATA|(foo,bar,baz)), it's trivial.

Except for attributes: if you have bad xml, like <atom idx="1" id="C"/> instead of <atom><idx>1</idx><id>C</id></atom>, then it isn't.

--
Dimitri Maziuk
Programmer/sysadmin
BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu
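
One way to see the attribute problem is to push both XML spellings through a naive converter. The sketch below uses Python's standard `xml.etree.ElementTree`; the `to_dict` helper is a made-up, deliberately simplistic mapping, not any agreed XML-to-JSON convention.

```python
import xml.etree.ElementTree as ET


def to_dict(node):
    """Naive mapping: attributes and child elements both become plain keys."""
    result = dict(node.attrib)
    for child in node:
        result[child.tag] = child.text
    return result


attr_form = ET.fromstring('<atom idx="1" id="C"/>')
elem_form = ET.fromstring('<atom><idx>1</idx><id>C</id></atom>')

print(to_dict(attr_form))
print(to_dict(elem_form))
```

Both inputs collapse to the same dictionary, so a converter like this cannot turn the JSON back into the original XML without extra rules about which keys were attributes.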
Re: [Open Babel] Open Babel in the browser
I don't think we need to worry about the naming conventions of corner cases just yet. Taking something basic, ethane in cml

    <molecule>
      <atomArray>
        <atom id="a1" elementType="C" x3="0.229656" y3="0.720147" z3="-0.015085"/>
        <atom id="a2" elementType="C" x3="-0.229656" y3="-0.720147" z3="0.015085"/>
      </atomArray>
      <bondArray>
        <bond atomRefs2="a1 a2" order="1"/>
      </bondArray>
    </molecule>

and my translation to json

    {
      "atoms": {
        "a1": { "element type": "C", "x3": 0.229656, "y3": 0.720147, "z3": -0.015085 },
        "a2": { "element type": "C", "x3": -0.229656, "y3": -0.720147, "z3": 0.015085 }
      },
      "bonds": [ { "atom refs": ["a1", "a2"], "order": 1 } ]
    }

it could use some cleaning up, but that's the idea.
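
As a rough illustration of the translation described in this message, the following Python sketch converts the ethane example with the standard library only. It assumes the simplified, namespace-free CML shown in the email and numeric bond orders; real CML files may use namespaces and other bond-order notations, which this sketch does not handle. The function name `cml_to_dict` is made up.

```python
import json
import xml.etree.ElementTree as ET

CML = """<molecule>
  <atomArray>
    <atom id="a1" elementType="C" x3="0.229656" y3="0.720147" z3="-0.015085"/>
    <atom id="a2" elementType="C" x3="-0.229656" y3="-0.720147" z3="0.015085"/>
  </atomArray>
  <bondArray>
    <bond atomRefs2="a1 a2" order="1"/>
  </bondArray>
</molecule>"""


def cml_to_dict(cml_text):
    """Map the email's simplified CML onto the email's JSON layout."""
    root = ET.fromstring(cml_text)
    atoms = {}
    for atom in root.iter("atom"):
        atoms[atom.get("id")] = {
            "element type": atom.get("elementType"),
            "x3": float(atom.get("x3")),
            "y3": float(atom.get("y3")),
            "z3": float(atom.get("z3")),
        }
    bonds = [
        {
            "atom refs": bond.get("atomRefs2").split(),
            "order": int(bond.get("order")),  # assumes a numeric order
        }
        for bond in root.iter("bond")
    ]
    return {"atoms": atoms, "bonds": bonds}


print(json.dumps(cml_to_dict(CML), indent=2))
```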
Re: [Open Babel] Open Babel in the browser
On Fri, Jun 7, 2013 at 10:25 AM, Patrick Fuller <patrickful...@gmail.com> wrote:

>> A SMILES contains exactly the same information as the atom/bond lists in a much more compact form. If you want to avoid the aromaticity problem, just use Kekule form, which makes it virtually identical to any other connection table format, but in about 10x to 20x fewer bytes. SMILES are very easy to parse, and there are dozens of parsers around.
>
> What I truly like about smiles is that it's human readable + hashable, which I see as the real goal. The shorter length is just a corollary of that.
>
> Prove me wrong, but I think people make too big a deal about size of molecule formats. I just bought a 2 TB hard disk drive for $70. With mongo db + their json serialization, I estimated that I can put 200 million verbose json mof structures on that drive. I only have a few thousand, so I have some room to spare.

I have a database of 10 million compounds. The SDF version, even compressed, is difficult to transfer over the internet. It's not about disks, it's about file transfers and database performance. It's not a matter of a few bytes here or there (I agree that people worry about file size too much). It's about a factor of ten or twenty. Connection-table lists of atoms and bonds are just a dumb way to represent atoms and bonds.

>> This discussion has focussed on the syntax of JSON, but completely overlooks the real problem with ALL chemical file formats: how do you handle all of the cases where a simple connection-table (ball and stick) doesn't capture reality? Things like aromaticity, tautomers, organo-metallic bonds, boron-hydrogen cages, distributed bonds (ferrocenes and the like) ... these are the problems.
>
> The point of json (and xml) is that they are *extensible*: that's why json has exploded in the developer community.

This isn't necessarily a good thing. One of the biggest problems in cheminformatics and molecular modeling is that people have altered existing formats to suit their own needs ... and that has led to disaster. There is no such thing as "the PDB format" -- rather, you mostly have to know the origin of a particular PDB file in order to interpret it. Each project effectively has its own PDB format.

JSON may be extensible, but that is useless unless there is a widely recognized authority on the meaning of each extension, along with open-source software that illustrates a practical application of the standard. Never forget the old joke, "The great thing about standards is that there are so many to choose from!" JSON essentially gives you a stronger rope when you are in the process of hanging yourself.

> If you need handles for aromaticity and metallic bonding, just add new properties to the json/xml. Because of the extensibility, adding new properties will not break any existing code.

Then why have a standard at all? What is the use of new properties if nobody knows what they mean? What happens when five projects all introduce their own syntax and semantics for representing aromaticity and metallic bonding? Chaos.

> That's the advantage over all of the older table formats, which weren't built to be extensible. And you see the repercussions in scientific code all the time.

The real problem had nothing to do with being built to be extensible, but rather that the table format definitions were controlled by commercial companies that had no interest in data exchange or in participation by the chemistry community.

When I created the OpenSMILES.org web page, I more-or-less did it by stealing the leadership from Daylight, the company that invented SMILES. I invited their participation but, while they didn't object to our project, they also elected to stay out of it. SMILES now has a future that's in the hands of the community. If the community decides to add features, we can ... and we'll all be able to agree on those features.

It might seem as if I'm trying to discourage JSON, but nothing could be farther from the truth. A modern, object-oriented, extensible and well documented format is long overdue. The CML project is one such (you might want to look at it for ideas), but it never got traction. Maybe JSON, with its widespread use and readily-available software, is just the thing.

If you really want to make JSON a standard, the JSON syntax itself is a trivial part of the problem. The real problem is establishing standards for how each datatype is to be interpreted, followed by clear, published standards for each datatype. If you let people just add their own datatypes on an as-you-please basis, you'll just have another Tower of Babel ... and that's where the name OpenBabel came from in the first place.

Craig
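
The size argument in this message can be tried out on a toy case. The Python sketch below compares a SMILES string wrapped in JSON with a bare atom/bond table for an unbranched alkane; the key names and the table layout are assumptions made for illustration, hydrogens are implicit in both forms, and the ratio it prints is specific to this toy molecule and layout, not a measurement of the 10x to 20x figure quoted above.

```python
import json


def alkane_smiles(n):
    """SMILES for an unbranched alkane with n carbons (hydrogens implicit)."""
    return "C" * n


def alkane_table(n):
    """Atom/bond table for the same molecule, using a made-up JSON layout."""
    atoms = [{"element": "C"} for _ in range(n)]
    bonds = [{"atoms": [i, i + 1], "order": 1} for i in range(n - 1)]
    return {"atoms": atoms, "bonds": bonds}


def sizes(n):
    """Return (bytes of SMILES-in-JSON, bytes of atom/bond table JSON)."""
    smiles_bytes = len(json.dumps({"smiles": alkane_smiles(n)}).encode("utf-8"))
    table_bytes = len(json.dumps(alkane_table(n)).encode("utf-8"))
    return smiles_bytes, table_bytes


smiles_bytes, table_bytes = sizes(10)
print(smiles_bytes, table_bytes, round(table_bytes / smiles_bytes, 1))
```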
Re: [Open Babel] Open Babel in the browser
Wow, that was a very insightful email. Thank you for writing it.

Getting back to something actionable, what do you think about the idea of just translating the CML standard to json? Outside of some nuances, XML and JSON generally accomplish the same thing, so I would think that the chemical XML standard would be easily translatable to chemical JSON.
Re: [Open Babel] Open Babel in the browser
On 06/07/2013 01:45 PM, Craig James wrote:

> ... The CML project is one such (you might want to look at it for ideas), but it never got traction.

XML is bad at tabular data. A table of x, y, z coordinates in properly formatted xml is at least twice as many bytes (<x>123.456</x> uses as many bytes for markup as for the value). So projects like cml try to get around that by encoding values in attributes -- about #1 on the "how not to design your dtd" list. The problem is that only scales to a few dozen rows. Once you get to 10^6 molecules of 10^3 atoms, it doesn't scale either. So it doesn't get widely adopted.

Instead others do one worse and create xml where tables are stuffed into #CDATA. Which means a bunch of bytes whose meaning and structure were known to the postdoc who went back to China three years ago.

> Maybe JSON, with its widespread use and readily-available software, is just the thing.

JSON comes with less markup overhead, that's one of the reasons it's seeing more use. The downside is exactly as you said -- too many (read "no") standards. The advantage of xml is the dtd: a valid xml document tells you what the elements mean. All json tells you is array, associative array, string, number, boolean.

--
Dimitri Maziuk
Programmer/sysadmin
BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu
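
The byte count in the parenthetical is easy to check. This short Python sketch counts the markup around `123.456` and then compares one coordinate row written as child elements with the same row as a compact JSON array; the `<atom>` row layout and the helper names are assumptions made for illustration.

```python
import json

value = "123.456"
xml_cell = f"<x>{value}</x>"
# Bytes spent on the tags alone: "<x>" plus "</x>".
markup_bytes = len(xml_cell.encode("utf-8")) - len(value.encode("utf-8"))


def xml_row(x, y, z):
    """One coordinate row with each value in its own element."""
    return f"<atom><x>{x}</x><y>{y}</y><z>{z}</z></atom>"


def json_row(x, y, z):
    """The same row as a JSON array with no extra whitespace."""
    return json.dumps([x, y, z], separators=(",", ":"))


print(markup_bytes, len(value))
print(len(xml_row(1.0, 2.0, 3.0)), len(json_row(1.0, 2.0, 3.0)))
```

The array form is shorter largely because it carries no names at all, which is the trade-off raised at the end of the message: the JSON row says nothing about what its three numbers mean.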
Re: [Open Babel] Open Babel in the browser
Hi,

I think CML is definitely a useful starting point, however I think it would be a mistake to just translate everything across in a literal way. In particular, I think it's definitely worth thinking carefully about the different strengths of the XML and JSON formats (in both syntax and philosophy), and also about the reasons why CML has struggled a bit to gain traction.

The idea of keeping things as simple as possible is a major aspect of the JSON syntax and philosophy, so I think it's worth keeping that in mind as much as possible, especially considering the perceived complexity and verbosity of CML seems to put a lot of people off. In practice this might mean defining things implicitly where possible (like array indices as ids, and 2D/3D defined by the absence of any z coordinates, rather than having x2, y2, x3, y3, z3), and purposefully avoiding certain features in the core specification (multiple conformers, distributed bonds?).

Part of the reason behind the current increased interest in JSON is that it plays so nicely with many modern technologies that are becoming more and more widespread - i.e. web applications, REST APIs, document-oriented NoSQL databases etc. I think these use-cases definitely need to be taken into account when designing the format. For example, shorter key names are helpful in NoSQL databases and when embedding data in web pages, but there is a tradeoff there with readability. It's also worth thinking about how people might want to query and index these documents in NoSQL databases - for example having atoms as an array of objects allows $elemMatch-style queries in MongoDB, which could be useful.

As others have said, the downside to JSON's simplicity is that extensibility is not standardised like it is in CML - it is essentially a free-for-all, which as Craig points out will likely lead to chaos in the long run. There are projects like JSON-LD (http://json-ld.org) which would allow proper decentralised extensibility, but it's not widely supported and sacrifices simplicity, meaning you lose some of JSON's biggest strengths over XML anyway. Maybe something like namespaced keys ("org.openbabel.fp2": ...) would be simpler, along with some kind of ongoing community project to define equivalent keys in a machine-readable dictionary. Or maybe we just accept an inevitable free-for-all and just aim to define a sensible common core.

Basically, I just think it's worth being cautious not to just repeat the work people have done with CML, and it would be great if we could create something that really plays to JSON's strengths.

Matt
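
Two of the ideas in this message, namespaced keys checked against a shared dictionary and atoms stored as an array of objects so they can be matched element by element, can be sketched in plain Python. Everything named here is hypothetical: the core key set, the dictionary entry and its description, and the helper functions. The matching helper only imitates the "at least one array element satisfies all criteria" idea in memory; it does not talk to MongoDB.

```python
# Hypothetical common core and hypothetical shared dictionary of extensions.
CORE_KEYS = {"atoms", "bonds"}
KNOWN_EXTENSIONS = {
    "org.openbabel.fp2": "fingerprint (meaning assumed for illustration)",
}


def classify_keys(doc):
    """Split top-level keys into core, documented extensions, and unknown."""
    core, known, unknown = [], [], []
    for key in doc:
        if key in CORE_KEYS:
            core.append(key)
        elif key in KNOWN_EXTENSIONS:
            known.append(key)
        else:
            unknown.append(key)
    return core, known, unknown


def any_atom_matches(doc, **criteria):
    """True if at least one atom object satisfies every criterion."""
    return any(
        all(atom.get(name) == wanted for name, wanted in criteria.items())
        for atom in doc["atoms"]
    )


doc = {
    "atoms": [{"element": "C", "charge": 0}, {"element": "O", "charge": -1}],
    "bonds": [{"atoms": [0, 1], "order": 1}],
    "org.openbabel.fp2": "placeholder value",
    "com.example.private": 42,
}

print(classify_keys(doc))
print(any_atom_matches(doc, element="O", charge=-1))
print(any_atom_matches(doc, element="C", charge=-1))
```

A reader built this way can at least report which extensions it recognises and which it is ignoring, rather than dropping them silently.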