Hi Suliman,

On Dec 19, 2021, at 05:51, Suliman Sharif <sharifsulim...@gmail.com> wrote:
>> When was the current state of machine representation figured out?
> 
> I would say the 1980s was after the invention of SMILES where they used 
> something somewhat "readable", they got it started and now we continue is my 
> thought there. 

As additional factors to think about, 1980s SMILES didn't handle chirality or 
isotopes. Those were added in the 1990s.

Computer databases in the 1970s, like MACCS, could already store those. Indeed, 
lack of stereochemistry support in WLNs was one of the factors which led to 
its decline around 1980. (Stereochemistry was given in human-readable notes.) 
What could SMILES handle that MACCS's connection tables couldn't in 1980?

I don't think the 1980s SMILES representation exceeds that of MCC (mechanical 
chemical code - https://pubs.acs.org/doi/pdf/10.1021/c160027a002 ), which 
includes isotopes but not stereochemistry.

>> Does cheminformatics include its roots in library science? Or are those now 
>> different fields?
> 
> I like chem + informatic because it's one character shorter and in my opinion 
> sounds cooler. I mean you could say it's around the time we started 
> constructing the IUPAC language, trying to turn what's going on in chemistry 
> into a language representation, and it's a part of library science. But 
> anything is a part of library science since we all record scientific 
> information in some format.

I don't think I expressed my question well enough. The current journal "Journal 
of Chemical Information and Modeling" was previously "Journal of Chemical 
Information and Computer Sciences", which was previously "Journal of Chemical 
Documentation". Before J. Chem. Doc., papers were published in American 
Documentation or the Journal of Chemical Education.

The word "Documentation" is used in those earlier journals because 
documentation science is the precursor to information science, coming out of 
the work of Otlet and La Fontaine. See Traité de Documentation (1934) and their 
work on the Mundaneum (1910).

"Documentation" was the hot topic in the mid-20th century. Chemistry was one of 
the biggest data sets around (after legal cases), and much of the field we now 
call "cheminformatics" arose during the post-war era as a way to mechanize 
documentation management, first through punched cards and then through 
computers. Terms like "chemical descriptor" come directly out of this era, and 
the same researcher who coined both "descriptor" and "chemical descriptor" also 
coined the term "information retrieval", for an ACS conference.

So I don't mean the abstract "we all record scientific information in some 
format", but I mean the historic evolution of this field as a branch of library 
science, with practitioners who work in libraries and publish articles on 
how to manage their collections. (E.g., "The Charter: A "Must" for Effective 
Information System Planning and Design", http://dx.doi.org/10.1021/c160012a004  
"It is the product of research work by information center managers, information 
system supervisors, technical report file custodians, and others who undertook 
information storage and retrieval efforts".)

On the other hand, cheminformatics can also be interpreted as the field which 
(among other things) uses methods of chemical information originally developed 
for documentation management in order to model chemical behavior. That's the 
"... and Modeling" of JCIM. Someone can have a successful career in that aspect 
of cheminformatics without knowing anything about the connections to library 
science.

Which means a book about cheminformatics has to decide what "cheminformatics" 
means, hence my question.

> Maybe we should teach IUPAC first again,

Again, what is your purpose? What topics do you de-emphasize in order to teach 
more about IUPAC?

And from what I hear, IUPAC has recently changed.

>  Check out Morgan's paper and some slides I made from that paper in teaching. 

I have read Morgan's paper. Amusingly, the ACS included it in the final report 
of the NSF-funded work they did to develop and expand a computer-based Chemical 
Registry System, which means it's not behind a paywall. 
https://eric.ed.gov/?id=ED032214 , Appendix D, starting on PDF page 134.

I also looked at your text at 
https://sharifsuliman1.medium.com/understanding-morgan-f70186b172f6 .

Since the slides are a bit ambiguous about a few concepts, here are some other 
things to consider:

   "Well to do that he first decided he needed to come up
    with a rank ordering system, a way to sequentially at atoms
    in some sort of list for example for acetone:"

He didn't come up with a rank ordering system. He came up with a unique rank. A 
non-unique rank ordering was in use in, e.g., Ray and Kirsch's 1957 computer 
substructure search implementation, and in Mooers' 1951 theoretical description.

  "He chose to implement an old method of a Search Tree"

I think you should point out that these concepts were new at the time. 

  "Morgan decided the information would be stored in a series of 5 lists"

One of the things that makes that paper difficult to understand is how it uses 
the compact connection table, which is a representation I think no one uses 
these days. Those 5 lists are part of that specific representation, but not 
essential to the algorithm.

This representation came from Gluck's work at Du Pont ("A Chemical Structure 
Storage and Search System Developed at Du Pont", 
https://pubs.acs.org/doi/pdf/10.1021/c160016a008 , presented 1964, published 
1965).

Now, Gluck also had a canonicalization method, described in that paper as "The 
atom numbers in the bond columns are the newly assigned rank positions. The two 
Atoms No. 4 have different atom ranks associated with their single bonds. The 
iterative procedure which follows the initial ordering breaks ties according to 
the magnitudes of the atoms to which the tied atoms are bonded. ... This 
iterative process of reordering according to the new rank of the atoms in the 
bond columns continues until all atoms are uniquely ranked, in which case the 
compound is in canonical form, or until no further reordering is possible while 
ties still remain."
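As a rough illustration, the quoted procedure can be sketched in a few lines of 
Python. The adjacency-list encoding, atom numbering, and starting ranks below 
are my own invention for the example, not Gluck's bond-column representation:

```python
# A minimal sketch of Gluck-style iterative rank refinement. Atoms are
# re-ranked by their current rank plus the sorted ranks of their
# neighbors, repeating until no further reordering happens. As in the
# quote, the loop may terminate with ties still remaining.

def refine_ranks(adjacency, initial_ranks):
    ranks = dict(initial_ranks)
    while True:
        # Each atom's sort key: its current rank, then the sorted ranks
        # of the atoms it is bonded to.
        keys = {a: (ranks[a], sorted(ranks[n] for n in adjacency[a]))
                for a in adjacency}
        ordered = sorted(adjacency, key=lambda a: keys[a])
        new_ranks = {}
        rank = 0
        prev_key = None
        for a in ordered:
            if keys[a] != prev_key:
                rank += 1
                prev_key = keys[a]
            new_ranks[a] = rank    # atoms with identical keys stay tied
        if new_ranks == ranks:     # no further reordering is possible
            return ranks
        ranks = new_ranks

# 2-methylbutane carbon skeleton: C1-C2(-C5)-C3-C4
adj = {1: [2], 2: [1, 3, 5], 3: [2, 4], 4: [3], 5: [2]}
init = {a: len(adj[a]) for a in adj}   # start from atom degree
print(refine_ranks(adj, init))
```

Here atoms 1 and 5 end with the same rank, which is correct since they are 
symmetry-equivalent methyls; the failure mode Lehman found is when atoms that 
are *not* equivalent stay tied.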

You can see ties with the Morgan approach; Gluck then went to work at CAS with 
Morgan. The main problem was that Gluck's algorithm wasn't actually 
canonical. In "A Collection of Algorithms for Searching Chemical Compound 
Structure Analogs" at 
https://archive.org/details/DTIC_AD0460819/page/n19/mode/2up you can see 
Lehman's counter-example showing how the algorithm failed.

The Morgan algorithm resolved that problem.
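For reference, the core relaxation step of Morgan's algorithm (setting aside 
the compact connection table and the full canonical numbering) can be sketched 
roughly as follows; the adjacency-list encoding is mine, for illustration only:

```python
# A sketch of Morgan's extended-connectivity relaxation: start from atom
# degree; on each pass, replace every atom's value with the sum of its
# neighbors' values; keep iterating only while the number of distinct
# values increases (Morgan's termination rule).

def extended_connectivity(adjacency):
    ec = {a: len(adjacency[a]) for a in adjacency}  # pass 0: degree
    n_classes = len(set(ec.values()))
    while True:
        new_ec = {a: sum(ec[n] for n in adjacency[a]) for a in adjacency}
        new_classes = len(set(new_ec.values()))
        if new_classes <= n_classes:   # no gain in discrimination: stop
            return ec
        ec, n_classes = new_ec, new_classes

# n-pentane carbon skeleton: C1-C2-C3-C4-C5
adj = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
print(extended_connectivity(adj))   # {1: 2, 2: 3, 3: 4, 4: 3, 5: 2}
```

Note that counting equivalence classes, rather than checking whether 
reordering is still possible, is exactly the part that repairs the flaw in 
Gluck's procedure.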

  "Essentially what you can do is start with a Radius of 0 around the atom."

I'm concerned that you've mixed up the "Morgan invariant", as it's described for 
ECFP-like fingerprints, with the algorithm that Morgan described in the paper. 
If you look at your radius=2 example, you'll see that 17 = 3*3 + (3+3+2), that 
is, the invariant for the initial carbon, squared, plus the sum of the 
invariants for the atoms at R=2 away. It no longer includes the R=1 invariants.

You can see that even if the neighboring -OH has an initial invariant of 1,000, 
that value won't be part of the initial carbon's invariant.

Instead, for purposes of teaching I would start with Penny codes, described in 
the paper immediately following Morgan's in the same issue, at 
https://pubs.acs.org/doi/pdf/10.1021/c160017a019 .

On page 11 of that same "A Collection of Algorithms for Searching Chemical 
Compound Structure Analogs" link at 
https://archive.org/details/DTIC_AD0460819/page/n19/mode/2up you can read about 
Penny codes. 

  Penny, in a recent paper, recognizes correctly that atom and 
  bonding considerations alone are in some cases inadequate for
  distinguishing compounds. His method is concerned with enumerating 
  the simple connectivity in the neighborhood of each atom. As he 
  states, "it is a unique expression of the atomic network within
  the immediate neighborhood of the subject atom and is an attribute
  of the atom as much as its chemical identity". 

Page 12 then goes into more detail. You'll see these are much more in line with 
your description.
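To make the contrast concrete, here is a loose sketch of the general idea of a 
neighborhood-based atom descriptor, in the spirit of the passage quoted above. 
This is emphatically not Penny's actual code format (the 1965 paper defines 
that precisely); the encoding and function below are invented for illustration:

```python
# Describe an atom by the connectivity of its surrounding network, one
# radius at a time: for each radius, record the sorted degrees of the
# atoms first reached at that distance. Unlike a single summed invariant,
# this keeps the per-shell information separate.

def neighborhood_profile(adjacency, atom, max_radius=2):
    seen = {atom}
    frontier = [atom]
    profile = [(len(adjacency[atom]),)]   # radius 0: the atom's own degree
    for _ in range(max_radius):
        nxt = []
        for a in frontier:
            for n in adjacency[a]:
                if n not in seen:
                    seen.add(n)
                    nxt.append(n)
        profile.append(tuple(sorted(len(adjacency[n]) for n in nxt)))
        frontier = nxt
    return tuple(profile)

# 2-methylbutane carbon skeleton: C1-C2(-C5)-C3-C4
adj = {1: [2], 2: [1, 3, 5], 3: [2, 4], 4: [3], 5: [2]}
print(neighborhood_profile(adj, 2))   # ((3,), (1, 1, 2), (1,))
```

The point is that the R=1 shell stays visible in the descriptor at every 
radius, which matches the slides' description better than Morgan's summed 
values do.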

I personally think what RDKit calls the "Morgan" fingerprint should be the 
"Penny" fingerprint, but I know that's a predilection of mine.
 
> It's weird to me that data structures is not a core requirement for 
> cheminformatic folk.

Like all interdisciplinary fields, cheminformatics uses only a subset of a 
larger topic of "data structures", and has some specialized needs not covered 
by normal introductory classes.

I have a CS degree.

Data structures as taught by computer scientists include many topics I have not 
yet needed in cheminformatics.

I've never needed to care about B-tree implementations. Or red-black balanced 
binary trees. I don't think I've even had to use Dijkstra's algorithm, which is 
pretty molecular-graph-adjacent.

Meanwhile, intro data structure classes don't teach substructure isomorphism 
algorithms. And I think Bloom filters (conceptually related to molecular 
fingerprints) are also a more advanced topic.
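The kinship is easy to show in a few lines: a hashed molecular fingerprint 
sets bits for each substructure feature much as a Bloom filter sets bits for 
each inserted key, and both give screens with false positives but no false 
negatives. The feature strings and hash scheme below are invented for 
illustration:

```python
# Toy hashed fingerprint / Bloom-filter-style screen. Each feature sets
# a couple of bits in a small bitmask; a molecule can only contain the
# query substructure if every bit set for the query is also set for the
# molecule (the classic fingerprint screening test).

import hashlib

NBITS = 64

def set_bits(features, n_hashes=2):
    bits = 0
    for f in features:
        for k in range(n_hashes):
            # Derive n_hashes independent hash values per feature.
            h = int.from_bytes(
                hashlib.sha256(f"{k}:{f}".encode()).digest()[:4], "big")
            bits |= 1 << (h % NBITS)
    return bits

mol   = set_bits(["C", "O", "C-O", "C=O"])   # "molecule" features
query = set_bits(["C-O"])                    # "substructure" features

# Subset test: passing the screen is necessary but not sufficient.
print(query & mol == query)
```

That subset test is also why false positives happen: unrelated features can 
collide onto the same bits, exactly as in a Bloom filter.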

On the other hand, I have used concepts I learned in automata theory.

So while I completely support the idea that a cheminformatics textbook should 
include a deeper treatment of graph theory than, say, the 5 pages Gasteiger 
gives in his textbook, I also completely support the idea that a semester-long 
general-purpose programming course, plus a semester-long data structures 
course, isn't appropriate.

Cheers,

                                Andrew
                                da...@dalkescientific.com




_______________________________________________
Blueobelisk-discuss mailing list
Blueobelisk-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/blueobelisk-discuss
