Re: [Rdkit-discuss] Memory issue when storing more than 300K mol in a list

Greg Landrum Fri, 09 Jun 2017 06:43:35 -0700

Hi Alexis,

If I understand your use case correctly, you really don't need this level
of complication.


If you are comparing Q molecules to M molecules with M >> Q (in the discussion
so far, Q = 1000 and M = 500000) and you only need to compare each of the Qs to
each of the Ms a single time, you can safely construct all the Q molecules,
store them in memory, and then loop over the Ms one at a time, comparing each
one to all of the Qs (this is what I did in my little sample). This will have
more or less exactly the same performance as reading all of the Ms at once and
then processing them, because the number of molecules parsed and the number of
comparisons (Q x M) are the same either way.
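
Since the original question started from SMILES rather than SD files, here is a minimal sketch of the same streaming idea written generically. The names stream_matches and smiles_to_mols are hypothetical helpers invented for this illustration, not part of the RDKit API; the only RDKit call assumed is Chem.MolFromSmiles, which returns None for a string it cannot parse.

```python
from typing import Callable, Iterable, Iterator, List, TypeVar

Q = TypeVar("Q")
T = TypeVar("T")


def stream_matches(
    queries: List[Q],
    targets: Iterable[T],
    is_match: Callable[[T, Q], bool],
) -> Iterator[List[bool]]:
    """Yield one row of booleans per target.

    Only `queries` is held in memory; `targets` may be any lazy iterable,
    so each target can be discarded once its row has been computed.
    """
    for target in targets:
        yield [is_match(target, q) for q in queries]


def smiles_to_mols(smiles_iter):
    """Lazily parse SMILES with RDKit, skipping strings that fail to parse.

    Hypothetical helper for illustration only.
    """
    # Imported here so the generic part above runs without RDKit installed.
    from rdkit import Chem

    for smi in smiles_iter:
        mol = Chem.MolFromSmiles(smi)
        if mol is not None:
            yield mol
```

With RDKit, one would pass the list of query molecules, smiles_to_mols(...) over the large SMILES collection, and lambda m, q: m.HasSubstructMatch(q) as the predicate. Rows for skipped (unparseable) SMILES are simply absent, so keep an identifier alongside each target if you need to map rows back to the input.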

So, on a machine with infinite memory, the following two snippets will take
more or less the same amount of time to execute:

low memory usage:

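from rdkit import Chem

# Keep only the (small) query set in memory.
queries = [x for x in Chem.ForwardSDMolSupplier('mols.1000.sdf') if x is not None]
matches = []
# Stream the large set past the queries, one molecule at a time.
for m in Chem.ForwardSDMolSupplier('./znp.50k.sdf'):
    if m is None:
        continue
    matches.append([m.HasSubstructMatch(q) for q in queries])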



high memory usage:

from rdkit import Chem

queries = [x for x in Chem.ForwardSDMolSupplier('mols.1000.sdf') if x is not None]
# Load the entire large set into memory before processing it.
mols = [x for x in Chem.ForwardSDMolSupplier('./znp.50k.sdf') if x is not None]
matches = []
for m in mols:
    matches.append([m.HasSubstructMatch(q) for q in queries])



The second form consumes a lot more memory without delivering any
improvement in performance.
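
If you want to see the memory difference for yourself without a large SD file, here is a rough sketch using Python's standard tracemalloc module. It uses stand-in objects (bytearrays of an arbitrary 10 kB each) instead of real molecules, and fake_supplier and peak_bytes are hypothetical names made up for this example, so the absolute numbers say nothing about RDKit itself; only the ratio between the two patterns is of interest.

```python
import tracemalloc


def fake_supplier(n, size=10_000):
    """Stand-in for a molecule supplier: yields n objects of about `size` bytes."""
    for _ in range(n):
        yield bytearray(size)


def peak_bytes(load_all, n=1000):
    """Return the peak traced memory while consuming n stand-in objects.

    load_all=True mimics building the full list first;
    load_all=False mimics streaming over the supplier.
    """
    tracemalloc.start()
    total = 0
    source = list(fake_supplier(n)) if load_all else fake_supplier(n)
    for obj in source:
        total += len(obj)  # placeholder for the per-molecule comparison work
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return peak


streaming_peak = peak_bytes(load_all=False)
list_peak = peak_bytes(load_all=True)
print(f"streaming peak: {streaming_peak / 1e6:.2f} MB")
print(f"list peak:      {list_peak / 1e6:.2f} MB")
```

The streaming peak stays at roughly the size of a couple of objects no matter how large n is, while the list peak grows linearly with n.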

Best,
-greg


On Fri, Jun 9, 2017 at 3:33 PM, Alexis Parenty <
alexis.parenty.h...@gmail.com> wrote:

> Hi again, FYI here is the memory monitoring in attachment. Thanks,
>
> Alexis
>
> On 9 June 2017 at 15:12, Alexis Parenty <alexis.parenty.h...@gmail.com>
> wrote:
>
>> Dear Greg and Brian,
>> Many thanks for your response. I was also thinking of your streaming
>> approach! I think the RAM of most machines could handle lists of 100K mols,
>> so we could put the threshold higher than 1000. Actually, I was thinking of
>> monitoring the available RAM and only processing the matrix and clearing
>> the list when less than 20% of RAM is left. This way, the best machines
>> could skip the clearing process and gain time. What do you think?
>>
>>
>> Best,
>>
>> Alexis
>>
>>
>>
>>
>>
>> On 9 June 2017 at 14:40, Brian Kelley <fustiga...@gmail.com> wrote:
>>
>>> While not multithreaded (yet), this is the use case for the filter catalog:
>>>
>>> http://rdkit.blogspot.com/2016/04/changes-in-201603-release-filtercatalog.html?m=1
>>>
>>> Look for the SmartsMatcher class in the blog.
>>>
>>> It is a good idea to make this multithreaded as well, I'll add this as a
>>> possible enhancement.
>>>
>>> ----
>>> Brian Kelley
>>>
>>> On Jun 9, 2017, at 7:04 AM, Greg Landrum <greg.land...@gmail.com> wrote:
>>>
>>> Hi Alexis,
>>>
>>> I would approach this by loading the 1000 queries into a list of
>>> molecules and then "stream" the others past that (so that you never attempt
>>> to load the full 500K set at once).
>>>
>>> Here's a quick sketch of one way to do this:
>>>
>>> In [4]: queries = [x for x in Chem.ForwardSDMolSupplier('mols.1000.sdf') if x is not None]
>>>
>>> In [5]: matches = []
>>>
>>> In [6]: for m in Chem.ForwardSDMolSupplier('./znp.50k.sdf'):
>>>    ...:     if m is None:
>>>    ...:         continue
>>>    ...:     matches.append([m.HasSubstructMatch(q) for q in queries])
>>>    ...:
>>>
>>>
>>>
>>> Brian has some thoughts on making this particular use case easier/faster
>>> (in particular by adding multi-threading support), so maybe there will be
>>> something in the next release there.
>>>
>>> I hope this helps,
>>> -greg
>>>
>>>
>>> On Sun, Jun 4, 2017 at 10:25 PM, Alexis Parenty <
>>> alexis.parenty.h...@gmail.com> wrote:
>>>
>>>> Dear RDKit community,
>>>>
>>>> I need to screen for substructure relationships between two sets of
>>>> structures (1 000 X 500 000): I thought I should build two lists of mol
>>>> objects from SMILES, but I keep having a memory error when the second list
>>>> reaches 300 000 mol. All my RAM (12G) gets consumed along with all my
>>>> virtual memory.
>>>>
>>>> Do I really have to compromise on speed and make mol objects on the
>>>> fly from two lists of SMILES? Is there a more memory-efficient way to
>>>> store mol objects?
>>>>
>>>> Best,
>>>>
>>>> Alexis
>>>>
>>>> ------------------------------------------------------------
>>>> ------------------
>>>> Check out the vibrant tech community on one of the world's most
>>>> engaging tech sites, Slashdot.org! http://sdm.link/slashdot
>>>> _______________________________________________
>>>> Rdkit-discuss mailing list
>>>> Rdkit-discuss@lists.sourceforge.net
>>>> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>>>>
>>>>
>>>
>>>
>>
>

