Hi Greg,

Thank you for the quick reply. I really liked the idea and implemented it as follows:

def get_ecfp(mols, n=3, cut=10):
    allfps = []
    counts = defaultdict(int)
    for i, mol in enumerate(mols):
        fp = AllChem.GetMorganFingerprint(mol, n).GetNonzeroElements()
        logger.debug('Getting fp %s', str(i))
        allfps.append(fp)
        for idx, count in fp.items():
            counts[idx] += count
    df = pd.DataFrame(columns=counts.keys())
    for i, fp in enumerate(allfps):
        logger.debug('appending %s', str(i))
        df = df.append(fp, ignore_index=True)

    df = df[df.columns[df.sum(axis=0) >= cut]]
    return df

I'm not sure I got your idea exactly right, but this is how it made sense to me. Unfortunately the process gets killed somewhere while appending to the dataframe. I did not have time to look into it fully; I just wanted to check whether I understood your idea correctly. I'll play around with this and with scipy sparse matrices more over the weekend.
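
In case it is useful, this is roughly the sparse variant I have in mind (only an untested sketch so far; the thresholding step and the choice of lil_matrix are my own interpretation of your suggestion):

from collections import defaultdict

import numpy as np
from scipy import sparse
from rdkit.Chem import AllChem


def get_ecfp_sparse(mols, n=3, cut=10):
    # first pass: per-molecule count dicts plus the total count of every bit id
    allfps = []
    counts = defaultdict(int)
    for mol in mols:
        fp = AllChem.GetMorganFingerprint(mol, n).GetNonzeroElements()
        allfps.append(fp)
        for idx, count in fp.items():
            counts[idx] += count

    # keep only the bit ids seen at least `cut` times and map them to columns
    keep = sorted(idx for idx, c in counts.items() if c >= cut)
    col_of = {idx: j for j, idx in enumerate(keep)}

    # second pass: fill a row-based sparse matrix, then convert to CSR
    mat = sparse.lil_matrix((len(allfps), len(keep)), dtype=np.int32)
    for i, fp in enumerate(allfps):
        for idx, count in fp.items():
            if idx in col_of:
                mat[i, col_of[idx]] = count
    return mat.tocsr(), keep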

Thanks!
Jennifer



On 2018-05-18 11:00, Greg Landrum wrote:
Hi Jennifer,

I think what you're going to want to do is start by getting the indices and counts that are present across the entire dataset and then using those to decide what the columns in your pandas table will be. Here's some code that may help get you started:

    from collections import defaultdict
    from rdkit.Chem import AllChem

    def get_ecfp(mols, n=3, cut=10):
        allfps = []
        counts = defaultdict(int)
        for i, mol in enumerate(mols):
            fp = AllChem.GetMorganFingerprint(mol, n).GetNonzeroElements()
            allfps.append(fp)
            for idx, count in fp.items():
                counts[idx] += count
        print(counts)

If that's not enough, let me know and I will try to do a full version tomorrow.
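
In the meantime, here's a rough, untested sketch of the direction I mean (the way the cutoff is applied and the row-building are just one way to do it):

    import pandas as pd
    from collections import defaultdict
    from rdkit.Chem import AllChem

    def get_ecfp_df(mols, n=3, cut=10):
        allfps = []
        counts = defaultdict(int)
        for mol in mols:
            fp = AllChem.GetMorganFingerprint(mol, n).GetNonzeroElements()
            allfps.append(fp)
            for idx, count in fp.items():
                counts[idx] += count
        # columns = the bit ids whose total count across the dataset reaches the cutoff
        cols = sorted(idx for idx, c in counts.items() if c >= cut)
        # one row per molecule; bits a molecule does not have become 0
        rows = [[fp.get(idx, 0) for idx in cols] for fp in allfps]
        return pd.DataFrame(rows, columns=cols)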

-greg




On Fri, May 18, 2018 at 9:41 AM Jennifer Hemmerich <jennyhemmeric...@gmail.com> wrote:

    Hi all,

    I am trying to calculate Morgan fingerprints for approximately
    100,000 molecules. I do not want to use the folded FPs. I want to
    use the counts for the bits which are on and afterwards drop all
    the infrequent features. So I would ultimately need a dataframe
    with the counts for each molecule assigned to the respective bits
    of the FP, like this:

    Molecule      1    2...   6...   ...n
    Structure1    0    0      4      1
    Structure2    1    0      0      8

    The function I am currently using is:

    def get_ecfp(mols, n=3, cut=10):
        df = []
        all_dfs = pd.DataFrame()
        for i, mol in enumerate(mols):
            d = pd.DataFrame(AllChem.GetMorganFingerprint(mol, n).GetNonzeroElements(),
                             index=[mols.index[i]], dtype='uint8')
            df.append(d)
            logger.debug('Calculating %s', str(i))

            # append all collected dataframes in between to prevent memory issue
            if i % 20000 == 0 or i == len(mols):
                logger.debug('Concatenation %s', str(i))
                part = pd.concat(df)
                all_dfs = pd.concat([part, all_dfs])
                logger.debug('Datatypes %s', str(all_dfs.dtypes))
                del part
                df = []

        all_dfs = all_dfs[all_dfs.columns[all_dfs.sum(axis=0) >= cut]]  # drop columns where count < 10

        return df

    But the concatenation is awfully slow. Without the intermediate
    concatenation I quickly run out of memory trying to concatenate
    all the dataframes, even on a machine with 128 GB of RAM.
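
    What I would actually like to end up with, if memory were not an
    issue, is just collecting the dicts and building the frame in one
    go, roughly like this (untested on the full set; the function name
    is only for illustration):

    import pandas as pd
    from rdkit.Chem import AllChem

    def get_ecfp_onego(mols, n=3, cut=10):
        # one count dict per molecule; the keys are the (unfolded) Morgan bit ids
        fps = [AllChem.GetMorganFingerprint(mol, n).GetNonzeroElements()
               for mol in mols]
        # pandas builds the columns from the union of all keys and fills gaps with NaN
        df = pd.DataFrame(fps).fillna(0).astype('uint32')
        # drop the infrequent bits
        return df[df.columns[df.sum(axis=0) >= cut]]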

    I found the possibility of converting the fingerprint to a numpy
    array. That requires allocating a numpy array of a certain length
    up front, which is impossible because I do not know how long the
    final array has to be. Doing the conversion without predefining a
    shorter length just never finishes. If I check the length of the
    fp with fp.GetLength() I get 4294967295, which is simply the
    maximum value of an unsigned 32-bit integer. This means that
    converting all of the FPs to numpy arrays that long is not really
    an option either.
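
    Just to be explicit about the numpy conversion I mean: the only
    fixed-length variant I know of is the hashed/folded one, e.g.
    something like the snippet below, and the folding is exactly what
    I want to avoid (the example molecule is arbitrary):

    from rdkit import Chem, DataStructs
    from rdkit.Chem import AllChem
    import numpy as np

    mol = Chem.MolFromSmiles('c1ccccc1O')  # arbitrary example molecule
    # folded variant: the length (nBits) has to be chosen up front
    fp = AllChem.GetHashedMorganFingerprint(mol, 3, nBits=2048)
    arr = np.zeros((2048,), dtype=np.int32)
    DataStructs.ConvertToNumpyArray(fp, arr)  # arr now holds the 2048 folded counts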

    Is there a way that I did not see to get the desired DataFrame or
    ndarray out of RDKit directly? Or a better conversion? I assume
    that the keys of the dict I get with GetNonzeroElements() are the
    set bits of that 4294967295-bit-long vector?

    Thanks in advance!

    Jennifer

    

