Hello, I tried to change the type of the id from MEDIUMINT to INT in the idxWORD09F table. I then tried to re-index all the files on our test server using:

bibindex -R -M 2000000 -f 1000 -u admin

Now I get a new error:

----------------------------------------- Received by email ------------------------------------------------------
Error when putting the term ''1+\xf0\x9d\x91\x98\xe2\x80\xb2'' into db (hitlist=intbitset([10780])): (1062, "Duplicate entry '1+' for key 2")

The following problem occurred on <http://doc.test.rero.ch>

Registered exception 2011-05-07 08:51:31 -> IntegrityError: (1062, "Duplicate entry '1+' for key 2")

User details
No client information available

Traceback details
Forced traceback (most recent call last)
  File "/usr/local/lib/python2.5/site-packages/invenio/bibindex_engine.py", line 1045, in add_recIDs_by_date
    self.add_recIDs(alist, opt_flush)
  File "/usr/local/lib/python2.5/site-packages/invenio/bibindex_engine.py", line 1001, in add_recIDs
    self.put_into_db()
  File "/usr/local/lib/python2.5/site-packages/invenio/bibindex_engine.py", line 816, in put_into_db
    self.put_word_into_db(word)
Traceback (most recent call last):
  File "/usr/local/lib/python2.5/site-packages/invenio/bibindex_engine.py", line 900, in put_word_into_db
    (word, set.fastdump()))
  File "/usr/local/lib/python2.5/site-packages/invenio/dbquery.py", line 227, in run_sql
    rc = cur.execute(sql, param)
  File "/var/lib/python-support/python2.5/MySQLdb/cursors.py", line 166, in execute
    self.errorhandler(self, exc, value)
  File "/var/lib/python-support/python2.5/MySQLdb/connections.py", line 35, in defaulterrorhandler
    raise errorclass, errorvalue
IntegrityError: (1062, "Duplicate entry '1+' for key 2")
------------------------------------------------------------------------------------------------------------------------

I also had to reboot the server because it ran out of memory. Could this be due to a UTF-8 problem? How can I solve it?

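One possible reading of this error, offered as an assumption rather than a confirmed diagnosis: the bytes \xf0\x9d\x91\x98 in the term form a single 4-byte UTF-8 character (U+1D458). MySQL's 3-byte utf8 character set cannot store such characters, and in non-strict SQL mode the value is cut at that point. If the term column uses that character set, the stored value would be just '1+', which would collide with an already indexed '1+' term. The sketch below is written for Python 3 (the server in the traceback runs Python 2.5). It decodes the term and imitates that truncation. The helper truncate_at_4byte_char is hypothetical and is not part of Invenio or MySQLdb.

```python
import unicodedata

# Raw term bytes copied from the error message above.
term_bytes = b"1+\xf0\x9d\x91\x98\xe2\x80\xb2"


def truncate_at_4byte_char(text):
    """Hypothetical helper: cut text before the first character outside the BMP.

    This only imitates what a 3-byte utf8 column is assumed to do in
    non-strict mode; it is not MySQL's actual code.
    """
    for i, ch in enumerate(text):
        if ord(ch) > 0xFFFF:  # such characters need 4 bytes in UTF-8
            return text[:i]
    return text


term = term_bytes.decode("utf-8")

# List each character with its code point and UTF-8 length.
for ch in term:
    print("U+%04X  %d byte(s)  %s" % (
        ord(ch), len(ch.encode("utf-8")), unicodedata.name(ch, "<unnamed>")))

stored = truncate_at_4byte_char(term)
print(repr(stored))
```

If this assumption holds, filtering characters above U+FFFF out of the terms before they are written to the index would avoid the collision. I have not verified this against the Invenio sources.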
Thanks in advance,

On 6 May 2011 at 10:43, Johnny Mariéthoz wrote:

> Dear Tibor,
>
> On 6 May 2011 at 10:21, Tibor Simko wrote:
>
>> On Thu, 05 May 2011, Johnny Mariéthoz wrote:
>>> Error when putting the term ''non-meat'' into db
>>> (hitlist=intbitset([22464])): (1062, "Duplicate entry '16777215' for
>>> key 1")
>>
>> The duplicate entry problem is related to incremental indexing of badly
>> washed/truncated index terms before they are pushed to index. It could
>> happen due to bad UTF-8 characters, due to change in word breaking
>> procedures, etc. We have seen it too on our servers, mostly for
>> full-text indexing.
>>
>> We believe we have fixed this problem in the latest git master branch;
>> but these fixes concern Invenio v1.0 release series only. If you are on
>> v0.99 release series, then some back-porting may be needed. Do you get
>> these troubles on RERO DOC running Invenio v0.99.1?
> Yes, this problem happens on our production server: Invenio v0.99.1
>
>> In any case, rebuilding all your indexes from scratch (via bibindex -R)
>> should fix the problem for some time to come, even without patching your
>> sources. Because I think you see this problem only with incremental
>> indexing; it should not happen during full re-indexing. Is that right?
>
> I do not want to reindex all the files. It would take too much time.
> Moreover, I think that we have a huge number of words, as we have a lot of
> documents with OCR text inside, which creates a lot of new words in the
> index table.
> Can I safely change the type of the id in the idxWORD09F table from
> MediumInt to Int?
> Are there other tables that use this id?
> Is it a good idea?
>
> Note: In the past I tried to re-index all the documents, but it took one full
> day and crashed the machine due to the memory problem. I tried several options
> (-M -f) with bibindex without success. This is due to one of our collections,
> a newspaper scanned over 200 years, which represents about 60000
> scanned PDF files. I exclude this collection from the full-text indexing.
>
> Thanks for your answers.
>
> Regards,
>
>>
>> Best regards
>> --
>> Tibor Simko
>
> ----------------------------------------------------------------------
> Johnny Mariéthoz
> RERO, Av. de la Gare 45, CH - 1920 MARTIGNY
> Phone: +41(0)27 721 8579
> Fax : +41(0)27 721 8586
> Web : http://www.rero.ch
> ReroDoc : http://doc.rero.ch, [email protected]
> ----------------------------------------------------------------------

----------------------------------------------------------------------
Johnny Mariéthoz
RERO, Av. de la Gare 45, CH - 1920 MARTIGNY
Phone: +41(0)27 721 8579
Fax : +41(0)27 721 8586
Web : http://www.rero.ch
ReroDoc : http://doc.rero.ch, [email protected]
----------------------------------------------------------------------
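
For reference, the value 16777215 in the quoted "Duplicate entry" error is exactly the largest value an unsigned MEDIUMINT column can hold (2^24 - 1). That fits the idea that the auto-increment id of idxWORD09F ran out of room: once the counter reaches the maximum, MySQL tries to reuse the same value and reports a duplicate key. A small Python sketch of the arithmetic, assuming the id column is unsigned (the names below are only illustrative):

```python
# Storage sizes in bytes of the two MySQL integer types discussed above.
INT_TYPE_BYTES = {"MEDIUMINT": 3, "INT": 4}


def max_unsigned(type_name):
    """Largest value an unsigned column of the given type can store."""
    return 2 ** (8 * INT_TYPE_BYTES[type_name]) - 1


failing_id = 16777215  # id reported in the quoted "Duplicate entry" error

for name in ("MEDIUMINT", "INT"):
    limit = max_unsigned(name)
    print("%-9s max unsigned id: %10d  exhausted: %s"
          % (name, limit, failing_id >= limit))
```

With INT the limit moves to 4294967295, so the counter itself would no longer be the bottleneck. Whether other tables or code paths also depend on the narrower type is a separate question that this arithmetic does not answer.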
