Hi Scott,

There are non-ASCII characters in the SDF file you downloaded:

>>> x = file('all.sdf').read()
>>> [i for i, char in enumerate(x) if ord(char) > 127]
[2176567, 2176568, 2176569, 2176572, 2176573, 2176574, 2176675, 2176676, 2217062, 2217063, 2217085, 2217086, 7068139, 7068140, 7068143, 7068144, 7068148, 7068149, 7068153, 7068154, 7068176, 7068177, 7460867, 7460868, 7773941, 7773942, 8344632, 8344633, 21518023, 21518024, 21518027, 21518028, 23719172, 23719173, 33580283, 33580284, 33580292, 33580293, 33580294, 33580313, 33580314, 33580315]

The list above contains the offsets of the bytes in the file that are not ASCII (value greater than 127).
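
To see which molecules those offsets belong to, you could walk the file record by record. This is only a rough sketch, written for Python 3 and not part of RDKit; the function name records_with_non_ascii is made up for illustration. It relies on the fact that each SDF record ends with a line containing only $$$$.

```python
def records_with_non_ascii(data):
    """Return the 1-based numbers of SDF records that contain bytes > 127.

    `data` is the raw file content as bytes, e.g. open('all.sdf', 'rb').read().
    """
    hits = []
    record_no = 1
    dirty = False
    for line in data.splitlines():
        if line.strip() == b"$$$$":
            # End of the current record: report it if it held a non-ASCII byte.
            if dirty:
                hits.append(record_no)
            record_no += 1
            dirty = False
        elif any(byte > 127 for byte in line):
            # Iterating over bytes in Python 3 yields integers.
            dirty = True
    if dirty:
        # Last record had no $$$$ terminator but still contained bad bytes.
        hits.append(record_no)
    return hits


# Tiny in-memory example with three records; only the second has non-ASCII bytes.
sample = b"mol1\n$$$$\nmol\xc3\xa92\n$$$$\nmol3\n$$$$\n"
print(records_with_non_ascii(sample))
```

If the record numbers it reports include the molecule where CreateDb.py stops, that would support the idea that these bytes are what sqlite3 is complaining about.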

You could easily remove the non-ASCII characters:

>>> xx = unicode(x, 'ascii', 'ignore')
>>> out = file('fixed.sdf', 'w')
>>> out.write(xx)
>>> out.close()
>>> x = file('fixed.sdf').read()
>>> [i for i, char in enumerate(x) if ord(char) > 127]
[]
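
The session above is Python 2 (file() and unicode() no longer exist in Python 3). For anyone on Python 3, a rough equivalent might look like the sketch below; the function name strip_non_ascii is hypothetical. Note that it simply drops every byte above 127, so any accented letters in compound names are lost rather than converted.

```python
def strip_non_ascii(src_path, dst_path):
    """Copy src_path to dst_path without bytes > 127.

    Returns the offsets (in the source file) of the bytes that were dropped.
    """
    # Read as raw bytes so no decoding error can occur.
    with open(src_path, "rb") as src:
        data = src.read()

    # Offsets of every byte outside the 7-bit ASCII range.
    offsets = [i for i, byte in enumerate(data) if byte > 127]

    # Keep only ASCII bytes; line endings are preserved as they were.
    cleaned = bytes(byte for byte in data if byte < 128)

    with open(dst_path, "wb") as dst:
        dst.write(cleaned)
    return offsets
```

Calling strip_non_ascii('all.sdf', 'fixed.sdf') should write the cleaned copy and return the same kind of offset list as shown earlier.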

Regards,
Eddie

On May 05, 2011, at 11:36 AM, Scott Mottarella <semot...@bu.edu> wrote:

Hello All,

I recently started using RDKit and am trying to test it on a small test set of 6630 molecules (the exact test set is available at http://www.drugbank.ca/downloads).  I tried using the DbCLI to create a database, but it terminated with an error.  I have included here the command exactly as I entered it, followed by the error.

python /usr/share/RDKit/Projects/DbCLI/CreateDb.py --dbDir=drugbank --maxRowsCached=1000 --molFormat=sdf drugbank_6630.sdf

Traceback (most recent call last):
  File "/usr/share/RDKit/Projects/DbCLI/CreateDb.py", line 456, in <module>
    CreateDb(options,dataFilename)
  File "/usr/share/RDKit/Projects/DbCLI/CreateDb.py", line 212, in CreateDb
    lazySupplier=int(options.maxRowsCached)>0)
  File "/usr/lib64/python2.6/site-packages/rdkit/Chem/MolDb/Loader_orig.py", line 163, in LoadDb
    curs.executemany('insert into %s values (%s)'%(regName,qs),rows)
sqlite3.ProgrammingError: You must not use 8-bit bytestrings unless you use a text_factory that can interpret 8-bit bytestrings (like text_factory = str). It is highly recommended that you instead just switch your application to Unicode strings.

I know it is failing on the 1004th molecule in the file and that all the characters in the file are alphanumeric or punctuation.  I have tried a few edits, but to no avail.  If anyone has seen this problem or knows how to resolve it, I would greatly appreciate some insight.

Scott Mottarella
------------------------------------------------------------------------------
WhatsUp Gold - Download Free Network Management Software
The most intuitive, comprehensive, and cost-effective network
management toolset available today. Delivers lowest initial
acquisition cost and overall TCO of any competing solution.
http://p.sf.net/sfu/whatsupgold-sd
_______________________________________________
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss