Hi Chris,

I would treat tautomers as duplicates for my use case, but this would *not*
be expected behavior for the majority of RDKit users.

I think it'll be impossible to write something that works for everyone, so
the real question is what's in scope and how to handle errors gracefully.
I think making uniquification configurable through kwargs
(keep_isomers=True?) is enough. It's okay if your code doesn't do serious
data cleaning (it's on the user to handle their dataset), but it should
return reasonable warnings/errors for all corner cases. What response do
you get if there's an invalid SMILES? Does the code seize up, or warn and
move on? What about single-atom SMILES? How does the code react if you try
to fingerprint H2 or H+?
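
To make "warn and move on" concrete, here is a minimal sketch of what I have
in mind. It is not taken from your code: register_smiles, rdkit_canonical,
and the keep_isomers keyword are hypothetical names. The only RDKit calls
are Chem.MolFromSmiles (which returns None when it can't parse the input)
and Chem.MolToSmiles with its isomericSmiles flag. The canonicalizer is
passed in as a callable, so the batch logic can be exercised without RDKit.

```python
import logging
from typing import Callable, Dict, Iterable, List, Optional, Tuple

logger = logging.getLogger("registration")


def rdkit_canonical(smiles: str, keep_isomers: bool = True) -> Optional[str]:
    """Return canonical SMILES, or None if RDKit cannot parse the input."""
    from rdkit import Chem  # lazy import: the rest of the sketch loads without RDKit

    mol = Chem.MolFromSmiles(smiles)  # None signals an unparseable SMILES
    if mol is None:
        return None
    return Chem.MolToSmiles(mol, isomericSmiles=keep_isomers)


def register_smiles(
    raw: Iterable[str],
    canonicalize: Callable[[str], Optional[str]],
) -> Tuple[Dict[str, str], List[str]]:
    """Deduplicate by canonical form; skip and log anything unusable."""
    unique: Dict[str, str] = {}  # canonical form -> first raw input seen
    rejected: List[str] = []     # raw inputs that were skipped
    for smi in raw:
        try:
            key = canonicalize(smi.strip())
        except Exception as exc:  # one bad record should not stop the batch
            logger.warning("error while parsing %r: %s", smi, exc)
            rejected.append(smi)
            continue
        if not key:  # None (parse failure) or an empty string
            logger.warning("skipping malformed or empty structure: %r", smi)
            rejected.append(smi)
            continue
        unique.setdefault(key, smi)  # later duplicates are ignored
    return unique, rejected
```

With RDKit installed, the call would look something like
register_smiles(lines, rdkit_canonical), or with
functools.partial(rdkit_canonical, keep_isomers=False) to merge
stereoisomers. Whether single atoms, H2, or H+ should be accepted or
rejected is a policy decision this sketch deliberately leaves open.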

Getting the core concept of this project going should be reasonably
straightforward, but I think keeping it stable when handed off to a large
group of end users will be key to adoption. That and maybe making MongoDB
replaceable with other backends so your code's not beholden to the
popularity of a particular database.
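
One way to keep the backend swappable, sketched under my own assumptions
(MoleculeStore and InMemoryStore are made-up names, not anything in your
project), is to have the chemistry code talk to a small storage interface
rather than to the database driver directly:

```python
from abc import ABC, abstractmethod
from typing import Dict, Iterator, Optional


class MoleculeStore(ABC):
    """Hypothetical minimal interface the chemistry code would depend on."""

    @abstractmethod
    def put(self, key: str, record: dict) -> bool:
        """Store record under key; return False if the key already exists."""

    @abstractmethod
    def get(self, key: str) -> Optional[dict]:
        """Return the record stored under key, or None if absent."""

    @abstractmethod
    def keys(self) -> Iterator[str]:
        """Iterate over all stored keys."""


class InMemoryStore(MoleculeStore):
    """Dictionary-backed stand-in, useful for tests and small datasets."""

    def __init__(self) -> None:
        self._rows: Dict[str, dict] = {}

    def put(self, key: str, record: dict) -> bool:
        if key in self._rows:
            return False  # duplicate key: keep the first record
        self._rows[key] = dict(record)  # copy so callers can't mutate storage
        return True

    def get(self, key: str) -> Optional[dict]:
        return self._rows.get(key)

    def keys(self) -> Iterator[str]:
        return iter(self._rows)
```

A MongoDB-backed class would implement the same three methods on top of a
collection, and the search code would never need to know which one it was
handed.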

Just my thoughts, and hope this helps,
Pat

On Thu, Jul 9, 2020 at 5:41 PM Christopher Zou <cw...@berkeley.edu> wrote:

> Hi Patrick,
>
> Thanks for the data and the feedback!
>
> I hadn't thought about logging malformed structures, which seems like
> something good to build into the data registration process. My mentors
> (Greg Landrum, Peter Gedeck, and Marco Stenta) and I also discussed
> possible approaches to pre-processing molecules and data registration
> today. From what I gathered, it seems like there's a lot of ongoing
> discussion over identity search and what constitutes a duplicate
> molecule—would you be able to clarify a little bit more what that means
> from your end? (e.g., do we count different tautomers as duplicates?)
>
> I'll keep the other features you mentioned in mind going forward as
> well—while they're not quite optimized yet, we can already support the
> queries that you mention, ensure indices, and canonicalize SMILES.
>
> Best,
> Chris
>
> On Wed, Jul 8, 2020 at 7:03 PM Patrick Fuller <patrickful...@gmail.com>
> wrote:
>
>> Chris,
>>
>> That sounds like a great idea! Optimized similarity and substructure
>> searches are hard to get right, and most libraries leave it as an exercise
>> to the reader to choose the right fingerprinting and db structure. I think
>> the hardest part will be figuring out a robust end-user experience. You'll
>> be writing the "glue" between two domain-specific libraries so you'll need
>> extensibility, error handling, and lots of tutorial documentation.
>>
>> I attached a 1000-line sample of a much larger raw dataset I have lying
>> around. I think the script should canonicalize the SMILES, remove
>> duplicates, skip and log malformed structures, build fingerprints,
>> ensureIndex on the MongoDB collection, and be able to quickly run queries
>> like a carboxylic acid substructure search or 80% similarity to
>> terephthalic acid. Hope this helps!
>>
>> Pat
>>
>> On Wed, Jul 8, 2020 at 4:27 PM Christopher Zou <cw...@berkeley.edu>
>> wrote:
>>
>>> Dear RDKit Community,
>>>
>>> Hope you're all well! I'm a student from UC Berkeley building an
>>> integration between RDKit and MongoDB as part of Google Summer of Code.
>>>
>>> The idea of the project is twofold:
>>>
>>>    1. Provide tools for building a chemically-intelligent MongoDB
>>>    database.
>>>    2. Provide high-performance similarity and substructure search that
>>>    leverage MongoDB.
>>>
>>> If you use or would like to use MongoDB as part of your work, I'd love
>>> to get some input from you, either via email or through a short call. What
>>> kinds of Mongo setups are all of you using? What kinds of information would
>>> you like to store? What are some examples of searches? This would help me
>>> build something as usable as possible for all of you.
>>>
>>> Many thanks—I'm incredibly excited to be contributing to this community.
>>>
>>> Best,
>>> Chris
>>>
>>>
>>>
>>> --
>>> *Christopher Zou *
>>> Computer Science and Biochemistry,
>>> UC Berkeley '22
>>>
>>> _______________________________________________
>>> Rdkit-discuss mailing list
>>> Rdkit-discuss@lists.sourceforge.net
>>> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>>>
>>
>
> --
> *Christopher Zou *
> Computer Science and Biochemistry,
> UC Berkeley '22
>
>
