Hello all,
I am writing because I have a problem with RDKit.

I am trying to remove duplicates from an SMI file and to put a numerical code 
in front of each structure, so that a duplicate carries the same code as the 
structure it duplicates.
The first output file contains the structures without duplicates, each 
preceded by its code (1, CCBC; 2, CCCCCC; 3, CCCCCCCCC; etc.).
The second output file contains the duplicates, each preceded by the code it 
has in the first file (1, CCBC; 1, CCBC; 1, CCBC; 2, CCCCCC; 2, CCCCCC; etc.).
My problem is that RDKit does not distinguish between:
C1CCCCC1
and
c1cccccc1
I would like them to be kept as separate entries, giving an output file like:
 (1, C1CCCCC1; 2, c1cccccc1; etc.)
Thank you

My code


# Script that reads an SMI file, removes the duplicates, and writes the
# result to another SMI file


print("hello from RD_remove_duplicate")



from cinfony import rdk
from rdkit import Chem

# Dictionary storing the canonical codes seen so far
codes = {}

# Count of total number of structures found
numStructures = 0
# Count of duplicate structures found
numDuplicates = 0

# Raw strings keep the backslashes in the Windows paths literal
suppl = open(r"C:\Data\etudecycle/etudecyclebdzei.smi", "r")
output_file = r"C:\Data\etudecycle/etudecyclebdzeiv2.smi"
writer = open(output_file, "w")
output_filev2 = r"C:\Data\etudecycle/etudecyclebdzeiv3.smi"
wd = open(output_filev2, "w")

# Read the first SMI file

a = 0  # numerical code given to the most recent unique structure
for line in suppl:
    bdsmi = line.strip()
    if not bdsmi:
        continue  # skip blank lines

    # Check for a duplicate
    if bdsmi in codes:
        numDuplicates += 1
        # Reuse the code that was given to the first occurrence
        wd.write(str(codes[bdsmi]) + "," + bdsmi + "\n")
    else:
        # Store the code in the dictionary so that we can check for duplicates
        a += 1
        codes[bdsmi] = a
        # Write the structure
        writer.write(str(a) + "," + bdsmi + "\n")
        numStructures += 1
        # Report progress every 1000 unique structures
        if numStructures % 1000 == 0:
            print(numStructures)

suppl.close()
writer.close()
wd.close()

print("initial number = " + str(numDuplicates + numStructures))
print("number of duplicates = " + str(numDuplicates))
print("final number = " + str(numStructures))
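
For reference, the numbering described at the top of this message can be sketched without any files. This is only an illustration under assumptions: the function name `number_unique_and_duplicates` and its `key` argument are hypothetical and are not part of RDKit or of the script above. By default the SMILES text itself is compared, so two different spellings are treated as two different structures.

```python
def number_unique_and_duplicates(smiles_lines, key=None):
    """Number first occurrences and give each duplicate the same code.

    `key` is a hypothetical hook that maps a SMILES string to the value
    used for comparison. When it is None, the stripped text is compared.
    """
    codes = {}       # comparison value -> code of the first occurrence
    unique = []      # (code, smiles) for every first occurrence
    duplicates = []  # (code, smiles) for every repeated record
    for line in smiles_lines:
        smi = line.strip()
        if not smi:
            continue  # ignore blank lines
        k = key(smi) if key is not None else smi
        if k in codes:
            duplicates.append((codes[k], smi))
        else:
            codes[k] = len(codes) + 1
            unique.append((codes[k], smi))
    return unique, duplicates


# Made-up input reusing the strings quoted in this message
lines = ["CCBC\n", "CCCCCC\n", "CCBC\n", "C1CCCCC1\n", "c1cccccc1\n", "CCCCCC\n"]
unique, duplicates = number_unique_and_duplicates(lines)
for code, smi in unique:
    print("%d,%s" % (code, smi))
```

Because the default comparison is plain, case-sensitive string equality, C1CCCCC1 and c1cccccc1 receive different codes here. A `key` built on RDKit's `Chem.MolFromSmiles` and `Chem.MolToSmiles` could instead merge different spellings of the same molecule, but it would have to handle the `None` that `Chem.MolFromSmiles` returns for a SMILES it cannot parse.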

