Sorry Steve, there was a bug in MolVS that you encountered. Should now be fixed.
"pip install -U molvs" to get the update (v0.0.7).
Matt
> On 1 Dec 2016, at 15:52, Stephen O'hagan <soha...@manchester.ac.uk> wrote:
>
> Thanks for the interesting links.
>
> MolVS looks good, but failed on ‘NC(CC(=O)O)C(=O)[O-].O.O.[Na+]’ which isn’t
> that extraordinary…
>
> Couldn’t get Standardise to work at all, even on the example given; API not
> intuitive or docs wrong or out of date.
>
> I will have a look at the info in the UniChem paper, though not inclined to
> use a web service for what I want to do.
>
> Cheers,
> Steve.
>
> From: George Papadatos [mailto:gpapada...@gmail.com]
> Sent: 01 December 2016 14:26
> To: Greg Landrum <greg.land...@gmail.com>
> Cc: Stephen O'hagan <soha...@manchester.ac.uk>;
> rdkit-discuss@lists.sourceforge.net; Francis Atkinson <fran...@ebi.ac.uk>
> Subject: Re: [Rdkit-discuss] comparing two or more tables of molecules
>
> Hi Stephen,
>
> Further to Greg's excellent reply, see this paper on how InChI strings and
> keys can be used in practice to map together tautomer (ones covered by InChI
> at least), isotope, stereo and parent-salt variants.
> http://rd.springer.com/article/10.1186/s13321-014-0043-5
>
> Francis (cc'ed) has a nice notebook somewhere illustrating these nice InChI
> splits to find these variants.
>
> For educational purposes, there have been other approaches like the NCI's
> identifiers - discussion here:
> http://acscinf.org/docs/meetings/237nm/presentations/237nm17.pdf
>
> For pure structure standardization using RDKit see here:
> https://github.com/flatkinson/standardiser
> and
> https://github.com/mcs07/MolVS
>
>
> Cheers,
>
> George
>
>
>
>
> On 29 November 2016 at 17:02, Greg Landrum <greg.land...@gmail.com> wrote:
> Wow, this is a great question and quite a fun thread.
>
> It's hard to really make much of a contribution here without writing a
> book/review article (something that I'm really not willing to do!), but I
> have a few thoughts. Most of this is repeating/rephrasing things others have
> already said.
>
> I'm going to propose some things as facts. I think that these won't be
> controversial:
> fact 1: if the structures are coming from different sources, they need to be
> standardized/normalized before you compare them. This is true regardless of
> how you want to compare them. The details of the standardization process are
> not incredibly important, but it does need to take care of the things you
> care about when comparing molecules. For example, if you don't care about
> differences between salts, it should strip salts. If you don't care about
> differences between tautomers, it should normalize tautomers.
> fact 2: The InChI algorithm includes a standardization step that normalizes
> some tautomers, but does not remove salts.
> fact 3: The InChI representation contains a number of layers defining the
> structure in increasing detail (this isn't strictly true, because some of the
> choices about how layers are ordered are arbitrary, but it's close).
> fact 4: canonicalization, the way I define it, produces a canonical atom
> numbering for a given structure, but it does *not* standardize
> fact 5: the RDKit has essentially no well-documented standardization code
>
> fact X: we don't have any standard, broadly accepted approach for
> standardization, canonicalization or representation that is fool-proof or
> that works for even all of organic chemistry, never mind organometallics.
> InChI, useful as it is for some things, completely fails to handle things
> like atropisomers (they are working on this kind of thing, but it's not out
> yet).
>
> Given all of this, if I wanted to have flexible duplicate checking *right*
> now, I think I would use the AvalonTools struchk functionality that the RDKit
> provides (the new pure-RDKit version still needs a bit more testing) to
> handle basic standardization and salt stripping and then produce a table that
> includes the InChI in a couple of different forms. I'd want to be able to
> recognize molecules that differ only by stereochemistry, molecules that
> differ only by location of tautomeric Hs, and molecules that differ only by
> the location of isotopic labels. You can do this with various clever splits
> of the InChI (how to do it is left as an exercise for the reader and/or a
> future RDKit blog post).
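
A minimal sketch of the kind of InChI splits described above, using plain
string handling only. It assumes standard InChIs (no /f fixed-H or /r
reconnected layers) that always carry a formula layer; the function names
split_inchi and inchi_variants are hypothetical, and the example strings are
illustrative alanine InChIs, not output from this thread. Generating the
InChIs themselves (e.g. with RDKit) is not shown.

```python
# Layer prefixes that carry stereo information in an InChI string.
STEREO_PREFIXES = ("b", "t", "m", "s")


def split_inchi(inchi):
    """Return (version, formula, main_layers, isotopic_layers).

    Assumes a standard InChI with a formula layer. Everything from the
    first "/i" layer onwards is treated as the isotopic section.
    """
    if not inchi.startswith("InChI="):
        raise ValueError("not an InChI string: %r" % inchi)
    parts = inchi[len("InChI="):].split("/")
    version = parts[0]
    formula = parts[1] if len(parts) > 1 else ""
    main, iso = [], []
    target = main
    for layer in parts[2:]:
        if layer.startswith("i"):
            target = iso
        target.append(layer)
    return version, formula, main, iso


def inchi_variants(inchi):
    """Build comparison keys that ignore selected kinds of difference."""
    version, formula, main, iso = split_inchi(inchi)

    def join(layers):
        return "InChI=" + "/".join([version, formula] + layers)

    def not_stereo(layers):
        return [l for l in layers if not l.startswith(STEREO_PREFIXES)]

    return {
        "full": inchi,
        # Drop the isotopic section: isotopologues compare equal.
        "no_isotope": join(main),
        # Drop stereo layers everywhere: stereoisomers compare equal.
        "no_stereo": join(not_stereo(main) + not_stereo(iso)),
        # Formula plus /c only: ignores H placement, charge and protonation,
        # so it is deliberately loose and can over-merge.
        "connectivity": join([l for l in main if l.startswith("c")]),
    }


l_ala = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m0/s1"
d_ala = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m1/s1"
print(inchi_variants(l_ala)["no_stereo"] == inchi_variants(d_ala)["no_stereo"])
```

Storing each variant as its own table column lets an ordinary join or
group-by on that column find the corresponding class of near-duplicates,
while the "full" column still separates them.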
>
> I think there's something fun to be done here with SMILES variants, borrowing
> heavily from some of the things that Roger has written about:
> https://nextmovesoftware.com/blog/2013/04/25/finding-all-types-of-every-mer/
> Here's a more recent application of that from Noel:
> https://nextmovesoftware.com/blog/2016/06/22/fishing-for-matched-series-in-a-sea-of-structure-representations/
>
> If I didn't really care about details and just wanted something that I could
> explain easily to others, I'd skip all the complication and just use InChIs
> (or InChI keys) to recognize duplicates. There would be times when that would
> be the wrong answer, but it would be a broadly accepted kind of wrong.[1]
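
As a rough illustration of the simple InChIKey route, the sketch below
groups records by the first 14-character block of the key, which hashes the
connectivity, so entries that differ only in the later blocks (for example
stereochemistry or isotopes) land in the same group. The helper name
group_by_skeleton and the record layout are hypothetical, and the keys are
illustrative values quoted for alanine and ethanol that should be
regenerated rather than trusted as written.

```python
from collections import defaultdict


def group_by_skeleton(records):
    """Group (name, inchikey) pairs by the first InChIKey block."""
    groups = defaultdict(list)
    for name, key in records:
        first_block = key.split("-")[0]
        groups[first_block].append(name)
    return dict(groups)


records = [
    ("L-alanine", "QNAYBMKLOCPYGJ-REOHSYPESA-N"),
    ("D-alanine", "QNAYBMKLOCPYGJ-UWTATZPHSA-N"),
    ("ethanol", "LFQSCWFLJHTTHZ-UHFFFAOYSA-N"),
]

for block, names in group_by_skeleton(records).items():
    print(block, names)
```

Comparing the full key instead of the first block gives the stricter
"exact duplicate" check; keeping both columns makes it easy to see which
matches depend on ignoring stereo or isotopes.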
>
> Regardless of the approach, I would not, under most any circumstances,
> discard the original input structures that I had. It's really good to be able
> to figure out what the original data looked like later.
>
> -greg
> [1] I'm crying as I write this...
>
>
>
>
> On Mon, Nov 28, 2016 at 5:25 PM, Stephen O'hagan <soha...@manchester.ac.uk> wrote:
> Has anyone come up with a fool-proof way of matching structurally equivalent
> molecules?
>
> Unique SMILES or InChI string comparisons don’t appear to work, presumably
> because there are different but equivalent structures, e.g. explicit vs
> non-explicit H’s, Kekule vs aromatic, isomeric vs non-isomeric forms,
> tautomers etc.
>
> I also expect that comparing InChI strings might need something more than
> just a simple string comparison, such as masking off stereo information when
> you don’t care about stereo isomers.
>
> I assume there are suitable tools within RDKit that can do this?
>
> N.B. I need to collate tables from several sources that have a mix of SMILES
> / InChI / SDF molecular representations.
>
> I usually use RDKit via Python and/or Knime.
>
> Cheers,
> Steve.
>
>
> ------------------------------------------------------------------------------
>
> _______________________________________________
> Rdkit-discuss mailing list
> Rdkit-discuss@lists.sourceforge.net
> <mailto:Rdkit-discuss@lists.sourceforge.net>
> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
> <https://lists.sourceforge.net/lists/listinfo/rdkit-discuss>