Hi Chris, I would treat tautomers as duplicates for my use case, but this would *not* be expected behavior for the majority of RDKit users.
I think it'll be impossible to write something that works for everyone, so then the question is what is in scope and how to handle errors gracefully. I think uniquifying with kwargs (keep_isomers=True?) is enough. It's okay if your code doesn't do serious data cleaning - it's on the user to handle their dataset - but the code should return reasonable warnings/errors for all corner cases. What response do you get if there's an invalid SMILES? Does the code seize up or warn and move on? What about single-atom SMILES? How does the code react if you try to fingerprint H2 or H+?

Getting the core concept of this project going should be reasonably straightforward, but I think keeping it stable when handed off to a large group of end users will be key to adoption. That and maybe making MongoDB replaceable with other backends so your code's not beholden to the popularity of a particular database.

Just my thoughts, and hope this helps,
Pat

On Thu, Jul 9, 2020 at 5:41 PM Christopher Zou <cw...@berkeley.edu> wrote:

> Hi Patrick,
>
> Thanks for the data and the feedback!
>
> I hadn't thought about logging malformed structures, which seems like something good to build into the data registration process. My mentors (Greg Landrum, Peter Gedeck, and Marco Stenta) and I also discussed possible approaches to pre-processing molecules and data registration today. From what I gathered, it seems like there's a lot of ongoing discussion over identity search and what constitutes a duplicate molecule; would you be able to clarify a little bit more what that means from your end? (e.g., do we include different tautomers as duplicates?)
>
> I'll keep the other features you mentioned in mind going forward as well; while they're not quite optimized yet, we can already support the queries that you mention, ensure indices, and canonicalize SMILES.
>
> Best,
> Chris
>
> On Wed, Jul 8, 2020 at 7:03 PM Patrick Fuller <patrickful...@gmail.com> wrote:
>
>> Chris,
>>
>> That sounds like a great idea! Optimized similarity and substructure searches are hard to get right, and most libraries leave it as an exercise to the reader to choose the right fingerprinting and db structure. I think the hardest part will be figuring out a robust end-user experience. You'll be writing the "glue" between two domain-specific libraries so you'll need extensibility, error handling, and lots of tutorial documentation.
>>
>> I attached a 1000-line sample of a much larger raw dataset I have lying around. I think the script should canonicalize the SMILES, remove duplicates, skip and error log malformed structures, build fingerprints, ensureIndex on the mongodb, and be able to quickly query things like carboxylic acid substructure or 80% similarity to terephthalic acid. Hope this helps!
>>
>> Pat
>>
>> On Wed, Jul 8, 2020 at 4:27 PM Christopher Zou <cw...@berkeley.edu> wrote:
>>
>>> Dear RDKit Community,
>>>
>>> Hope you're all well! I'm a student from UC Berkeley building an integration between RDKit and MongoDB as part of Google Summer of Code.
>>>
>>> The idea of the project is twofold:
>>>
>>> 1. Provide tools for building a chemically-intelligent MongoDB database.
>>> 2. Provide high-performance similarity and substructure search that leverage MongoDB.
>>>
>>> If you use or would like to use MongoDB as part of your work, I'd love to get some input from you, either via email or through a short call. What kinds of Mongo setups are all of you using? What kinds of information would you like to store? What are some examples of searches? This would help me build something as usable as possible for all of you.
>>>
>>> Many thanks. I'm incredibly excited to be contributing to this community.
>>>
>>> Best,
>>> Chris
>>>
>>> --
>>> *Christopher Zou*
>>> Computer Science and Biochemistry,
>>> UC Berkeley '22
>>>
>>> _______________________________________________
>>> Rdkit-discuss mailing list
>>> Rdkit-discuss@lists.sourceforge.net
>>> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>>>
>>
>
> --
> *Christopher Zou*
> Computer Science and Biochemistry,
> UC Berkeley '22
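For anyone following the thread, here is a rough sketch of the "canonicalize, remove duplicates, skip and log malformed structures" step discussed above. It is only an illustration, not the project's code: `register_smiles`, the `canonicalize` hook, and `TOY_TABLE` are hypothetical names, and the lookup table is a stand-in so the example runs without RDKit. With RDKit installed, the hook could wrap `Chem.MolFromSmiles` and `Chem.MolToSmiles`, as noted in the comments.

```python
import logging
from typing import Callable, Iterable, List, Optional, Tuple

logger = logging.getLogger("registration_sketch")


def register_smiles(
    raw_smiles: Iterable[str],
    canonicalize: Callable[[str], Optional[str]],
) -> Tuple[List[str], List[str]]:
    """Return (unique canonical SMILES, rejected inputs).

    `canonicalize` is a hypothetical hook: it returns a canonical string,
    or None when the input cannot be parsed.
    """
    seen = set()
    unique: List[str] = []
    rejected: List[str] = []
    for smi in raw_smiles:
        smi = smi.strip()
        try:
            # Treat empty strings as malformed without calling the parser.
            canonical = canonicalize(smi) if smi else None
        except Exception:
            # A parser that raises should not stop the whole batch.
            logger.exception("canonicalizer raised on %r", smi)
            canonical = None
        if canonical is None:
            logger.warning("skipping malformed structure: %r", smi)
            rejected.append(smi)
            continue
        if canonical not in seen:  # keep the first occurrence only
            seen.add(canonical)
            unique.append(canonical)
    return unique, rejected


# Stand-in canonicalizer so the sketch runs without RDKit. The strings are
# illustrative only. With RDKit, the hook could instead be (assumption):
#   from rdkit import Chem
#   def rdkit_canonicalize(smi, keep_isomers=True):
#       mol = Chem.MolFromSmiles(smi)  # returns None for invalid SMILES
#       if mol is None:
#           return None
#       return Chem.MolToSmiles(mol, isomericSmiles=keep_isomers)
TOY_TABLE = {
    "OC(=O)c1ccccc1": "O=C(O)c1ccccc1",
    "O=C(O)c1ccccc1": "O=C(O)c1ccccc1",
    "[H][H]": "[H][H]",
}

if __name__ == "__main__":
    batch = ["OC(=O)c1ccccc1", "O=C(O)c1ccccc1", "not_a_smiles", "", "[H][H]"]
    unique, rejected = register_smiles(batch, TOY_TABLE.get)
    print(unique)
    print(rejected)
```

Passing the canonicalizer in as a function keeps the duplicate policy (tautomers, isomers) in the caller's hands, which seems to fit the point that no single definition of "duplicate" will suit every user. Mapping a `keep_isomers` kwarg onto RDKit's `isomericSmiles` flag is an assumption here, and it would not merge tautomers.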