[Rdkit-discuss] Clustering

2017-06-04 Thread Chris Swain
Hi, I want to do clustering on around 4 million structures. The RDKit Cookbook (http://www.rdkit.org/docs/Cookbook.html) suggests "For large sets of molecules (more than 1000-2000), it’s most efficient to use the Butina clustering algorithm". However

Re: [Rdkit-discuss] Clustering

2017-06-04 Thread Samo Turk
Hi Chris, There are other options for clustering. According to this page (http://hdbscan.readthedocs.io/en/latest/performance_and_scalability.html), HDBSCAN and K-means scale well. HDBSCAN finds clusters based on density and also allows for outliers, but finding the right parameters can be fiddly.
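To make "clusters based on density" and "allows for outliers" concrete, here is a minimal sketch in plain Python. It implements DBSCAN, the simpler non-hierarchical relative of HDBSCAN, and not HDBSCAN itself. Fingerprints are represented as sets of on-bit indices, and the names `tanimoto_distance`, `dbscan`, `eps`, and `min_samples`, as well as the toy data, are assumptions made for illustration. The all-pairs neighbour search is O(n^2), so this would not scale to millions of structures as written.

```python
from typing import List, Optional, Set


def tanimoto_distance(a: Set[int], b: Set[int]) -> float:
    """Return 1 - Tanimoto similarity for fingerprints given as sets of on-bits."""
    union = len(a | b)
    if union == 0:
        return 0.0
    return 1.0 - len(a & b) / union


def dbscan(fps: List[Set[int]], eps: float, min_samples: int) -> List[int]:
    """Plain DBSCAN. Returns one label per fingerprint; -1 marks an outlier."""
    n = len(fps)
    # Brute-force neighbour lists (each point counts as its own neighbour).
    neighbours = [
        [j for j in range(n) if tanimoto_distance(fps[i], fps[j]) <= eps]
        for i in range(n)
    ]
    labels: List[Optional[int]] = [None] * n  # None means "not visited yet"
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbours[i]) < min_samples:
            labels[i] = -1  # provisional outlier; a later cluster may claim it
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbours[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point: joins, but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbours[j]) >= min_samples:
                queue.extend(neighbours[j])  # core point: keep growing
    return [label if label is not None else -1 for label in labels]


# Toy data (hypothetical): two tight groups and one isolated fingerprint.
fps = [
    {1, 2, 3, 4}, {1, 2, 3, 5}, {1, 2, 4, 5},
    {10, 11, 12, 13}, {10, 11, 12, 14}, {10, 11, 13, 14},
    {20, 21, 22},
]
labels = dbscan(fps, eps=0.5, min_samples=3)
print(labels)  # [0, 0, 0, 1, 1, 1, -1]
```

The two parameters are where the fiddliness shows up: with `eps=0.3` on the same data every fingerprint becomes an outlier, because no pair is closer than 0.4.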

Re: [Rdkit-discuss] RDKit on armv7h

2017-06-04 Thread Samo Turk
I'll just install Debian in a chroot and test whether the armv7 binary from the repo works. On Sat, Jun 3, 2017 at 9:00 PM, Maciek Wójcikowski wrote: > I have Odroid C2 (which is RPi3's faster cousin), also armv8. That's a myth > rather than reality. Debian is almost 100% arm64 friendly (spoiler alert). > > I

Re: [Rdkit-discuss] RDKit on armv7h

2017-06-04 Thread Maciek Wójcikowski
I tried compiling Git master on armv8/arm64 Debian Sid (as mentioned before) and all tests but two passed. cmake .. -D > LD_LIBRARY_PATH="$RDBASE/lib:$PYROOT/lib:$LD_LIBRARY_PATH" > PYTHONPATH=$RDBASE:$PYTHONPATH ctest Failures: > 61: [12:08:04] - > 61: [12:0

Re: [Rdkit-discuss] Clustering

2017-06-04 Thread Maciek Wójcikowski
Is there a big difference in the quality of the final dataset between K-means and random under-sampling of a big database (~20M)? Pozdrawiam, | Best regards, Maciek Wójcikowski mac...@wojcikowski.pl 2017-06-04 12:24 GMT+02:00 Samo Turk : > Hi Chris, > > There are other options for clusterin

Re: [Rdkit-discuss] RDKit on armv7h

2017-06-04 Thread Maciek Wójcikowski
I must correct myself, pandas was not installed, so the only test that failed was "test3D". Pozdrawiam, | Best regards, Maciek Wójcikowski mac...@wojcikowski.pl 2017-06-04 15:07 GMT+02:00 Maciek Wójcikowski : > I tried compiling Git master on armv8/arm64 Debian Sid (as mentioned > before)

[Rdkit-discuss] Memory issue when storing more than 300K mol in a list

2017-06-04 Thread Alexis Parenty
Dear RDKit community, I need to screen for substructure relationships between two sets of structures (1 000 x 500 000). I thought I should build two lists of mol objects from SMILES, but I keep getting a memory error when the second list reaches 300 000 mols. All my RAM (12 GB) gets consumed along wit