Hi Scott,
There are non-ASCII characters in the SDF file you downloaded:
>>> x = file('all.sdf').read()
>>> [i for i, char in enumerate(x) if ord(char) > 127]
[2176567, 2176568, 2176569, 2176572, 2176573, 2176574, 2176675, 2176676, 2217062, 2217063, 2217085, 2217086, 7068139, 7068140, 7068143, 7068144, 7068148, 7068149, 7068153, 7068154, 7068176, 7068177, 7460867, 7460868, 7773941, 7773942, 8344632, 8344633, 21518023, 21518024, 21518027, 21518028, 23719172, 23719173, 33580283, 33580284, 33580292, 33580293, 33580294, 33580313, 33580314, 33580315]
The list above contains the indices of characters that are not ASCII.
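For anyone reading this on Python 3, where `file()` is gone and a file opened in binary mode yields bytes, the same check might look roughly like the sketch below. The helper names `non_ascii_offsets` and `records_with_non_ascii` are made up for illustration, and the second one assumes that every SDF record ends with a `$$$$` line, which makes it possible to see which molecule an offending byte belongs to.

```python
def non_ascii_offsets(data):
    """Return the offsets of all bytes outside the 7-bit ASCII range."""
    # Iterating over a bytes object yields integers in Python 3.
    return [i for i, byte in enumerate(data) if byte > 127]


def records_with_non_ascii(data):
    """Return the 1-based numbers of SDF records that contain non-ASCII bytes.

    Assumption: each record is terminated by a line consisting of '$$$$'.
    """
    bad = []
    record = 1
    for line in data.splitlines():
        if any(byte > 127 for byte in line):
            # Report each record only once, even if it has several bad lines.
            if not bad or bad[-1] != record:
                bad.append(record)
        if line.strip() == b"$$$$":
            record += 1
    return bad


# Tiny in-memory stand-in for an SDF file; real use would read the file
# with open(path, "rb").read().
sample = b"mol1\nname: caf\xc3\xa9\n$$$$\nmol2\nplain\n$$$$\n"
print(non_ascii_offsets(sample))
print(records_with_non_ascii(sample))
```

Knowing the record number makes it easier to inspect the affected entries by hand before deciding what to do with them.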
You could easily remove them:
>>> xx = unicode(x, 'ascii', 'ignore')
>>> out = file('fixed.sdf', 'w')
>>> out.write(xx)
>>> out.close()
>>> x = file('fixed.sdf').read()
>>> [i for i, char in enumerate(x) if ord(char) > 127]
[]
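The session above relies on Python 2's `unicode()` and `file()`. A rough Python 3 counterpart is sketched below; the function name `strip_non_ascii` and the file names are only placeholders. Note that this approach silently drops the offending bytes, so a name containing an accented letter simply loses that letter.

```python
import os
import tempfile


def strip_non_ascii(src_path, dst_path):
    """Copy src_path to dst_path, dropping every non-ASCII byte.

    Returns the number of bytes that were removed.
    """
    with open(src_path, "rb") as src:
        raw = src.read()
    # errors="ignore" discards any byte that is not valid ASCII.
    text = raw.decode("ascii", errors="ignore")
    # newline="" writes line endings exactly as they were read.
    with open(dst_path, "w", encoding="ascii", newline="") as dst:
        dst.write(text)
    return len(raw) - len(text)


# Small demonstration on a temporary file (placeholder names).
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "all.sdf")
    dst = os.path.join(tmp, "fixed.sdf")
    with open(src, "wb") as fh:
        fh.write(b"mol1\nname: caf\xc3\xa9\n$$$$\n")
    removed = strip_non_ascii(src, dst)
    with open(dst, "rb") as fh:
        cleaned = fh.read()
    print(removed, cleaned)
```

If the original characters matter, decoding with the file's real encoding (if it is known) and keeping the text as Unicode would preserve them instead of discarding them.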
Regards,
Eddie
On May 05, 2011, at 11:36 AM, Scott Mottarella <semot...@bu.edu> wrote:
Hello All,

I just recently started using RDKit and am trying to test it on a small test set of 6630 molecules (the exact test set is available at http://www.drugbank.ca/downloads). I tried using the DbCLI to create a database, but it terminated with an error. I have included here the command exactly as I entered it and the error.

python /usr/share/RDKit/Projects/DbCLI/CreateDb.py --dbDir=drugbank --maxRowsCached=1000 --molFormat=sdf drugbank_6630.sdf
Traceback (most recent call last):
  File "/usr/share/RDKit/Projects/DbCLI/CreateDb.py", line 456, in <module>
    CreateDb(options,dataFilename)
  File "/usr/share/RDKit/Projects/DbCLI/CreateDb.py", line 212, in CreateDb
    lazySupplier=int(options.maxRowsCached)>0)
  File "/usr/lib64/python2.6/site-packages/rdkit/Chem/MolDb/Loader_orig.py", line 163, in LoadDb
    curs.executemany('insert into %s values (%s)'%(regName,qs),rows)
sqlite3.ProgrammingError: You must not use 8-bit bytestrings unless you use a text_factory that can interpret 8-bit bytestrings (like text_factory = str). It is highly recommended that you instead just switch your application to Unicode strings.

I know it is failing on the 1004th molecule in the file and that all the characters in the file are alphanumeric or punctuation. I have tried a few edits, but to no avail. If anyone has seen this problem or knows how to resolve it, I would greatly appreciate some insight.

Scott Mottarella
------------------------------------------------------------------------------
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss