Thanks, Greg,
Yes, scikit-learn will automatically promote to arrays of float with the
check_array() function. What I am currently doing is
fpa = numpy.zeros((len(fp),),numpy.double)
DataStructs.ConvertToNumpyArray(fp,fpa)
np.sum(np.reshape(fpa, (4, -1)), axis = 0)
Is this the same as
One small note from me: I would still use another aggregation function
instead of sum to get a binary FP:
np.reshape(fpa, (4, -1)).any(axis = 0)
I guess it doesn't change a thing with Tanimoto, but if you try other
distances then you can get unexpected results (assuming there are clashes).
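For illustration, here is a minimal sketch of the two folding variants discussed above, using NumPy only. The array fpa is a randomly generated stand-in for a converted 2048-bit fingerprint, and the names n_fold, folded_counts and folded_bits are made up for this example. With this reshape, bit i of the original lands in position i % 512 of the folded vector.

```python
import numpy as np

# Hypothetical stand-in for a 2048-bit fingerprint that has already been
# converted to a NumPy array of doubles (random bits, about 5% set).
rng = np.random.default_rng(0)
fpa = (rng.random(2048) < 0.05).astype(np.double)

n_fold = 4  # fold factor; len(fpa) must be divisible by it
blocks = np.reshape(fpa, (n_fold, -1))  # shape (4, 512)

# sum() keeps a count per folded position, so clashing bits give values > 1.
folded_counts = blocks.sum(axis=0)

# any() keeps the result binary: a position is set if any source bit was set.
folded_bits = blocks.any(axis=0)

print(folded_counts.shape, folded_counts.max())
print(folded_bits.shape, folded_bits.dtype)
```

The two results agree on which positions are non-zero; they differ only where several source bits fold onto the same position.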
Hi Greg,
Thanks! It works! But is it possible to fold the fingerprint to a smaller
size? np.zeros((100,2048)) still takes a lot of memory...
Best,
Jing
On Wed, Aug 26, 2015 at 11:02 PM, Greg Landrum greg.land...@gmail.com
wrote:
On Thu, Aug 27, 2015 at 3:00 AM, Jing Lu
Hi Jing,
Most fingerprints are binary, thus can be stored as np.bool_, which uses one
byte per element and so is 8 times more memory efficient than double. Packing
the bits (e.g. with np.packbits) gets you to 64 times.
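A quick way to check the sizes involved is to compare nbytes for the same 100 x 2048 array in different representations. This is only a sketch; the variable names are made up for the example, and np.packbits is one possible way to get down to one bit per fingerprint bit.

```python
import numpy as np

n_mols, n_bits = 100, 2048

as_double = np.zeros((n_mols, n_bits), dtype=np.double)  # 8 bytes per bit
as_bool = np.zeros((n_mols, n_bits), dtype=np.bool_)     # 1 byte per bit
packed = np.packbits(as_bool, axis=1)                    # 1 bit per bit, uint8

print(as_double.nbytes)  # 1638400
print(as_bool.nbytes)    # 204800
print(packed.nbytes)     # 25600

# The packed form can be expanded back to booleans when needed.
restored = np.unpackbits(packed, axis=1).astype(np.bool_)
```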
Best,
Maciej
Best regards,
Maciek Wójcikowski
mac...@wojcikowski.pl
2015-08-27 16:15 GMT+02:00 Jing Lu ajin...@gmail.com:
Hi Greg,
On Thu, Aug 27, 2015 at 3:00 AM, Jing Lu ajin...@gmail.com wrote:
So, I wonder is there any way to convert fingerprint to a numpy vector?
Indeed there is:
In [11]: from rdkit import Chem
In [12]: from rdkit import DataStructs
In [13]: import numpy
In [14]: m = Chem.MolFromSmiles('C1CCC1')
On 08/23/2015 11:38 AM, Jing Lu wrote:
Thanks, Andrew!
Yes, I was thinking about using scikit-learn also. But I guess I need to
use a data structure for sparse matrix and define a function for
connectivity. I hope the memory issue won't be a problem.
Most AgglomerativeClustering algorithms
Thanks, Takayuki,
Both Repeated Bisection clustering and K-means clustering need the number
of clusters as input, right?
Best,
Jing
On Sun, Aug 23, 2015 at 1:17 AM, Taka Seri serit...@gmail.com wrote:
Dear Jing,
How about trying bayon?
On Aug 23, 2015, at 3:43 AM, Jing Lu wrote:
If I want to cluster more than 1M molecules by ECFP4, how could I do it? If I
calculate the distance between every pair of molecules, the size of the distance
matrix will be too big. Does RDKit support any heuristic clustering algorithm
without
On Aug 23, 2015, at 6:38 PM, Jing Lu wrote:
I hope the memory issue won't be a problem.
That's up to you and your choice of threshold.
Most AgglomerativeClustering algorithms have time complexity of order N^2. Will
that be a problem?
You have to decide for yourself what counts as a problem.
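As a rough guide for that decision, the size of a full pairwise distance matrix can be estimated with a few lines of arithmetic. The sketch below assumes a condensed (upper-triangle) matrix stored as 8-byte floats; the function name condensed_matrix_bytes is made up for this example.

```python
def condensed_matrix_bytes(n_items, bytes_per_value=8):
    """Bytes needed to store one distance per unordered pair of items."""
    n_pairs = n_items * (n_items - 1) // 2
    return n_pairs * bytes_per_value


n = 1_000_000  # number of molecules in the original question
size = condensed_matrix_bytes(n)
print(f"{size / 1e12:.1f} TB")  # 4.0 TB
```

Storing the distances as 4-byte floats would halve this, which is still far beyond typical RAM for 1M molecules.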
Dear Jing,
How about trying bayon?
https://code.google.com/p/bayon/
It's not part of RDKit, but I think the library can cluster molecules
using ECFP4.
Unfortunately, the input file format of bayon is not a distance matrix, but
the format is easy to prepare.
Best regards.
Takayuki
Dear RDKit users,
If I want to cluster more than 1M molecules by ECFP4, how could I do it? If
I calculate the distance between every pair of molecules, the size of the
distance matrix will be too big. Does RDKit support any heuristic
clustering algorithm without calculating the distance matrix of the