Hi Greg,
Thank you for the quick reply. I really liked the idea and implemented
it as follows:
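    from collections import defaultdict
    import logging

    import pandas as pd
    from rdkit.Chem import AllChem

    logger = logging.getLogger(__name__)

    def get_ecfp(mols, n=3, cut=10):
        allfps = []
        counts = defaultdict(int)
        # First pass: collect the sparse counts and the per-feature totals
        for i, mol in enumerate(mols):
            fp = AllChem.GetMorganFingerprint(mol, n).GetNonzeroElements()
            logger.debug('Getting fp %s', i)
            allfps.append(fp)
            for idx, count in fp.items():
                counts[idx] += count
        df = pd.DataFrame(columns=counts.keys())
        for i, fp in enumerate(allfps):
            logger.debug('appending %s', i)
            # DataFrame.append (pandas < 2.0) returns a new frame, so keep the result
            df = df.append(fp, ignore_index=True)
        # Keep only the features whose total count reaches the cutoff
        df = df[df.columns[df.sum(axis=0) >= cut]]
        return df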
I am not sure I got your idea right, but this way it made sense to me.
Unfortunately, the process gets killed somewhere while appending to the
dataframe. I have not had time to look into it fully; I just wanted to
check whether I understood your idea. I'll play around with this and the
scipy sparse matrices more at the weekend.
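The approach discussed in this thread (count every feature across the dataset first, keep only the features whose total reaches the cutoff, then fill the table) can be sketched without building one DataFrame per molecule. The following is a minimal sketch, not the code either poster ran: the function name `build_sparse_counts` and the coordinate-triplet return format are assumptions made for illustration, and the input is assumed to be the list of `{feature_id: count}` dicts that `GetNonzeroElements()` returns, so the sketch itself needs only the standard library.

```python
from collections import defaultdict


def build_sparse_counts(fps, cut=10):
    """Turn per-molecule {feature_id: count} dicts into COO-style triplets.

    Hypothetical helper: `fps` is assumed to be a list of dicts such as those
    returned by GetNonzeroElements(). Features whose summed count over all
    molecules is below `cut` are dropped.
    """
    # Pass 1: total count of every feature across the whole dataset.
    totals = defaultdict(int)
    for fp in fps:
        for idx, count in fp.items():
            totals[idx] += count

    # Keep only the frequent features and give each one a column position.
    columns = sorted(idx for idx, total in totals.items() if total >= cut)
    col_of = {idx: pos for pos, idx in enumerate(columns)}

    # Pass 2: record only the nonzero cells as (row, column, value) triplets.
    rows, cols, data = [], [], []
    for row, fp in enumerate(fps):
        for idx, count in fp.items():
            pos = col_of.get(idx)
            if pos is not None:
                rows.append(row)
                cols.append(pos)
                data.append(count)
    return columns, rows, cols, data


# Toy input standing in for real fingerprints (made-up feature ids).
toy_fps = [{11: 4, 42: 1}, {7: 1, 42: 8}, {11: 7}]
columns, rows, cols, data = build_sparse_counts(toy_fps, cut=5)
```

If scipy is installed, the triplets can be passed to `scipy.sparse.coo_matrix((data, (rows, cols)), shape=(len(toy_fps), len(columns)))`, which stores only the nonzero cells; `columns` keeps the original feature id for each column position.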
Thanks!
Jennifer
On 2018-05-18 11:00, Greg Landrum wrote:
Hi Jennifer,
I think what you're going to want to do is start by getting the
indices and counts that are present across the entire dataset and then
using those to decide what the columns in your pandas table will be.
Here's some code that may help get you started:
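    from collections import defaultdict
    from rdkit.Chem import AllChem

    def get_ecfp(mols, n=3, cut=10):
        allfps = []
        counts = defaultdict(int)
        for i, mol in enumerate(mols):
            # Sparse {feature id: count} dict for this molecule
            fp = AllChem.GetMorganFingerprint(mol, n).GetNonzeroElements()
            allfps.append(fp)
            # Accumulate the total count of each feature over the dataset
            for idx, count in fp.items():
                counts[idx] += count
        print(counts)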
If that's not enough, let me know and I will try to do a full version
tomorrow.
-greg
On Fri, May 18, 2018 at 9:41 AM Jennifer Hemmerich
<jennyhemmeric...@gmail.com> wrote:
Hi all,
I am trying to calculate Morgan fingerprints for approximately
100,000 molecules. I do not want to use the folded FPs. Instead, I want
to use the counts for the bits that are on and afterwards drop all
the infrequent features. So I would ultimately need a dataframe
with the counts for each molecule assigned to the respective bit
of the FP, like this:
Molecule     1     2...  6...  ...n
Structure1   0     0     4     1
Structure2   1     0     0     8
The function I am currently using is:
    import logging

    import pandas as pd
    from rdkit.Chem import AllChem

    logger = logging.getLogger(__name__)

    def get_ecfp(mols, n=3, cut=10):
        df = []
        all_dfs = pd.DataFrame()
        for i, mol in enumerate(mols):
            # One single-row frame per molecule, indexed like the input
            d = pd.DataFrame(
                AllChem.GetMorganFingerprint(mol, n).GetNonzeroElements(),
                index=[mols.index[i]], dtype='uint8')
            df.append(d)
            logger.debug('Calculating %s', i)
            # Concatenate the collected frames in batches to limit memory use
            if i % 20000 == 0 or i == len(mols) - 1:
                logger.debug('Concatenation %s', i)
                part = pd.concat(df)
                all_dfs = pd.concat([part, all_dfs])
                logger.debug('Datatypes %s', all_dfs.dtypes)
                del part
                df = []
        # Drop columns whose total count is below the cutoff
        all_dfs = all_dfs[all_dfs.columns[all_dfs.sum(axis=0) >= cut]]
        return all_dfs
But the concatenation is awfully slow. Without the intermediate
concatenation, I quickly run out of memory when trying to
concatenate all the dataframes, even on a machine with 128 GB of
RAM.
I found that a fingerprint can be converted to a numpy array. That
requires allocating a numpy array of a fixed length in advance, which
is impossible because I do not know how long the final array has to
be. Assigning to an array without predefining the length never
finishes the computation. If I check the length of the fp with
fp.GetLength(), I get 4294967295, which is the maximum value of an
unsigned 32-bit integer (2^32 - 1). This means that converting all
of the FPs to numpy arrays of that length is not really an option
either.
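A quick back-of-the-envelope calculation shows why dense arrays of that length are impractical. The sketch below assumes one byte per count (the `uint8` dtype used above) and the roughly 100,000 molecules mentioned earlier; the variable names are illustrative only.

```python
# Length reported by fp.GetLength(): the largest unsigned 32-bit value.
fp_length = 2**32 - 1
assert fp_length == 4294967295

n_molecules = 100_000      # approximate dataset size from the question
bytes_per_count = 1        # assumption: counts stored as uint8

bytes_per_molecule = fp_length * bytes_per_count
total_bytes = bytes_per_molecule * n_molecules

gib_per_molecule = bytes_per_molecule / 2**30   # just under 4 GiB per dense row
tib_total = total_bytes / 2**40                 # roughly 390 TiB for the dataset

print(f"{gib_per_molecule:.2f} GiB per molecule, {tib_total:.0f} TiB in total")
```

Even a single dense row would take about 4 GiB, so a machine with 128 GB of RAM could hold only about thirty of them. Working from the sparse dicts returned by `GetNonzeroElements()` avoids allocating the zero entries at all.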
Is there a way I have overlooked to get the desired DataFrame
or ndarray out of RDKit directly? Or is there a better conversion? I
assume that the keys of the dict I get from GetNonzeroElements()
are the set bits of the 4294967295-bit-long vector?
Thanks in advance!
Jennifer
------------------------------------------------------------------------------
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org!
http://sdm.link/slashdot
_______________________________________________
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss