Hello all,
I am writing because I have a problem with RDKit.
I am trying to remove the duplicates from an SMI file and to put a numerical
code in front of each structure.
The first output file should contain the structures without duplicates, each
preceded by its code (1,CCBC; 2,CCCCCC; 3,CCCCCCCCC; etc.).
The second output file should contain the duplicates, each preceded by the
code of the matching structure in the first file (1,CCBC; 1,CCBC; 1,CCBC;
2,CCCCCC; 2,CCCCCC; etc.).
My problem is that RDKit does not distinguish between:
C1CCCCC1
and
c1cccccc1
whereas I would like an output file such as:
(1,C1CCCCC1; 2,c1cccccc1; etc.)
Thank you.
My code:
# Script that reads an SMI file, removes the duplicates, and writes the
# unique and the duplicate structures to two other SMI files
from cinfony import rdk  # imported but not used below
from rdkit import Chem   # imported but not used below

print("hello from RD_remove_duplicate")

# Dictionary mapping each SMILES line seen so far to its numerical code
codes = {}
# Count of unique structures found
numStructures = 0
# Count of duplicate structures found
numDuplicates = 0

# Raw strings keep the backslashes in the Windows paths literal
suppl = open(r"C:\Data\etudecycle/etudecyclebdzei.smi", "r")
output_file = r"C:\Data\etudecycle/etudecyclebdzeiv2.smi"
writer = open(output_file, 'w')
output_filev2 = r"C:\Data\etudecycle/etudecyclebdzeiv3.smi"
wd = open(output_filev2, 'w')

# Read the input SMI file line by line
i = 0
a = 0
while 1:
    bdsmi = suppl.readline()
    if not bdsmi:
        break
    # Check for a duplicate (exact, case-sensitive string comparison)
    if bdsmi in codes:
        numDuplicates += 1
        # Write the code of the first occurrence, not the latest code issued
        wd.write(str(codes[bdsmi]))
        wd.write(",")
        wd.write(bdsmi)
    else:
        # Store it in the dictionary so that we can check for duplicates
        a += 1
        codes[bdsmi] = a
        # Write the structure
        writer.write(str(a))
        writer.write(",")
        writer.write(bdsmi)
        numStructures += 1
    i += 1
    # Report progress every 1000 compounds
    if i % 1000 == 0:
        print(i)

suppl.close()
writer.close()
wd.close()

print(" initial number = " + str(numDuplicates + numStructures))
print(" number of duplicates = " + str(numDuplicates))
print(" final number = " + str(numStructures))
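
For reference, the numbering scheme described above (each duplicate carries the code of its first occurrence) can be sketched in memory, without any file handling. This is only a minimal sketch: the function name assign_codes and its key parameter are hypothetical names chosen for illustration, and by default the comparison is a plain case-sensitive string match, so C1CCCCC1 and c1cccccc1 receive different codes.

```python
def assign_codes(smiles_lines, key=None):
    """Split SMILES strings into unique entries and duplicates.

    `key` is a hypothetical hook: a function that maps a SMILES string to
    the value used for comparison. By default the string itself is used,
    so the comparison is exact and case-sensitive.
    """
    if key is None:
        key = lambda s: s
    codes = {}        # comparison key -> numerical code of first occurrence
    unique = []       # (code, smiles) for each first occurrence
    duplicates = []   # (code of first occurrence, smiles) for each repeat
    for line in smiles_lines:
        smi = line.strip()
        if not smi:
            continue  # skip blank lines
        k = key(smi)
        if k in codes:
            duplicates.append((codes[k], smi))
        else:
            codes[k] = len(codes) + 1
            unique.append((codes[k], smi))
    return unique, duplicates


unique, duplicates = assign_codes(["C1CCCCC1\n", "c1cccccc1\n", "C1CCCCC1\n"])
print(unique)      # [(1, 'C1CCCCC1'), (2, 'c1cccccc1')]
print(duplicates)  # [(1, 'C1CCCCC1')]
```

If structures rather than strings should be compared, one option would be to pass a canonicalizing function as key, for example one built on RDKit's Chem.MolFromSmiles and Chem.MolToSmiles. That is an assumption about the intended workflow, and such a function would have to handle the None that Chem.MolFromSmiles returns for strings it cannot parse.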