Hmm, this got lost in my mailbox. Sorry.

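You can do what I think you want to do using the information theory
machinery that the RDKit has available. Here's a short snippet that finds
the bits that are not redundant in a data set (redundancy here is
calculated using information entropy). The snippet assumes that numpy,
Chem, rdMolDescriptors and InfoTheory have already been imported (for
example with "from rdkit import Chem", "from rdkit.Chem import
rdMolDescriptors" and "from rdkit.ML import InfoTheory"):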

In [48]: ms = [Chem.MolFromSmiles(x.split()[1]) for x in open('./Target_no_107_58879.txt')]

In [49]: nbits = 2048

In [50]: fps = [rdMolDescriptors.GetHashedAtomPairFingerprintAsBitVect(x,nbits) for x in ms]

In [58]: entropies = []

In [59]: for i in range(nbits):
    arr = numpy.array([x[i] for x in fps])
    e = InfoTheory.InfoEntropy(arr)
    entropies.append(e)
   ....:

In [60]: entropies = numpy.array(entropies)

In [61]: goodbits = numpy.array(range(nbits))[entropies>0.0]

In [62]: len(goodbits)
Out[62]: 891

Your case is pretty big, so this may take a while, but it shouldn't be too
slow.
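
For fingerprints that are already bit strings (like the output of
fp.ToBitString() in the original question), the same idea can be sketched
in plain Python without the RDKit. This is only an illustration under
stated assumptions: the helper names bit_entropy, informative_bits and
strip_bits are made up for this example, and the entropy is computed from
the counts of 0s and 1s at each position, so a bit that is set in every
molecule is treated as redundant just like one that is never set.

```python
import math
from collections import Counter


def bit_entropy(column):
    """Shannon entropy (in bits) of one fingerprint position across the set."""
    counts = Counter(column)  # how often '0' and '1' occur at this position
    total = len(column)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def informative_bits(bitstrings):
    """Indices of the positions whose value varies across the fingerprints."""
    columns = zip(*bitstrings)  # transpose: one tuple per bit position
    return [i for i, col in enumerate(columns) if bit_entropy(col) > 0.0]


def strip_bits(bitstrings, keep):
    """Rebuild each fingerprint from the kept positions only."""
    return [''.join(bs[i] for i in keep) for bs in bitstrings]


# The four example fingerprints from the original question.
fps = ['010100001', '010110100', '010101110', '010100010']
keep = informative_bits(fps)
reduced = strip_bits(fps, keep)
print(keep)     # [4, 5, 6, 7, 8]
print(reduced)  # ['00001', '10100', '01110', '00010']
```

Keeping the list of retained indices around matters: any new fingerprint
has to be reduced with the same indices to stay comparable with the rest
of the set.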

-greg



On Wed, Jun 4, 2014 at 5:08 AM, Stephen O'hagan <soha...@manchester.ac.uk>
wrote:

>  Hi,
>
>
>
> I have a set of say 1000 generated fingerprints each of length 39972;
> across all 1000 fingerprints many bits are the same – they contain no
> information about the differences between the 1000 molecules.
>
>
>
> e.g. for the list
>
> 010100001
> 010110100
> 010101110
> 010100010
>
> The first four bits are redundant; I could just record them as:
>
> 00001
> 10100
> 01110
> 00010
>
>
>
> In reality, the redundant bits are distributed through the bit string, so
> I need a method to determine which bits are redundant, and then remove them
> from each fingerprint.
>
>
>
> Cheers,
>
> Steve.
>
>
>
>
>
>
>
> *From:* Greg Landrum [mailto:greg.land...@gmail.com]
> *Sent:* 04 June 2014 04:40
> *To:* Stephen O'hagan
> *Cc:* rdkit-discuss@lists.sourceforge.net
> *Subject:* Re: [Rdkit-discuss] remove redundant bits from bitvector
> fingerprints
>
>
>
> Hi Steve,
>
>
>
> On Tue, Jun 3, 2014 at 2:08 PM, Stephen O'hagan <soha...@manchester.ac.uk>
> wrote:
>
> I have a fragment of code generating fingerprints for a long list of
> molecules (length ~ 1000):
>
>
>
> for index in range(0,len(smi)):
>     smiles=smi[index]
>     mol=Chem.MolFromSmiles(smiles)
>     AllChem.EmbedMolecule(mol)
>     AllChem.UFFOptimizeMolecule(mol)
>     dm = Chem.Get3DDistanceMatrix(mol)
>     fp = Generate.Gen2DFingerprint(mol,factory, dMat=dm)
>     fp = fp.ToBitString()
>     bs[index]=fp
>
>
>
> The length of each generated bit vector is 39972, and the list has a lot
> of redundant ‘1’s and ‘0’s.
>
>
>
> Is there an easy method to filter out these redundant bits?
>
>
>
> What do you mean by redundant bits?
>
>
>
> The length of the bit vectors is determined by the parameters you provide
> for building the pharmacophore fingerprints (number of points, number of
> features, and number of distance bins). The length of the strings that you
> get from fp.ToBitString() should be equal to this number of bits.
>
>
>
> -greg
>
>
>
_______________________________________________
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
