Hello,

I tried changing the id from MEDIUMINT to INT in the idxWORD09F table, and then 
tried to re-index all the files on our test server using:

bibindex -R -M 2000000 -f 1000 -u admin 

Now I get a new error:

----------------------------------------- Received by email ------------------------------------------------------

Error when putting the term ''1+\xf0\x9d\x91\x98\xe2\x80\xb2'' into db 
(hitlist=intbitset([10780])): (1062, "Duplicate entry '1+' for key 2")

The following problem occurred on <http://doc.test.rero.ch>

Registered exception

2011-05-07 08:51:31 -> IntegrityError: (1062, "Duplicate entry '1+' for key 2")

User details

No client information available

Traceback details

Forced traceback (most recent call last)
 File "/usr/local/lib/python2.5/site-packages/invenio/bibindex_engine.py", line 1045, in add_recIDs_by_date
   self.add_recIDs(alist, opt_flush)
 File "/usr/local/lib/python2.5/site-packages/invenio/bibindex_engine.py", line 1001, in add_recIDs
   self.put_into_db()
 File "/usr/local/lib/python2.5/site-packages/invenio/bibindex_engine.py", line 816, in put_into_db
   self.put_word_into_db(word)
Traceback (most recent call last):
 File "/usr/local/lib/python2.5/site-packages/invenio/bibindex_engine.py", line 900, in put_word_into_db
   (word, set.fastdump()))
 File "/usr/local/lib/python2.5/site-packages/invenio/dbquery.py", line 227, in run_sql
   rc = cur.execute(sql, param)
 File "/var/lib/python-support/python2.5/MySQLdb/cursors.py", line 166, in execute
   self.errorhandler(self, exc, value)
 File "/var/lib/python-support/python2.5/MySQLdb/connections.py", line 35, in defaulterrorhandler
   raise errorclass, errorvalue
IntegrityError: (1062, "Duplicate entry '1+' for key 2")
------------------------------------------------------------------------------------------------------------------------

I also had to reboot the server because it ran out of memory.

Is this due to a UTF-8 problem? How can I solve it?
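
One way to check the UTF-8 suspicion is to decode the term bytes from the error message. The sketch below is plain Python 3, not Invenio code. It assumes, without confirmation from this thread, that the term column uses MySQL's 3-byte utf8 character set and that the server truncates a value at the first character above U+FFFF. The helper truncate_at_first_non_bmp is hypothetical and only imitates that assumed behaviour.

```python
# Term bytes copied from the error message above.
term_bytes = b"1+\xf0\x9d\x91\x98\xe2\x80\xb2"
term = term_bytes.decode("utf-8")


def truncate_at_first_non_bmp(text):
    """Keep characters up to the first one above U+FFFF.

    Hypothetical helper: it imitates the ASSUMED behaviour of a
    3-byte utf8 column, which cannot store 4-byte UTF-8 sequences.
    """
    kept = []
    for ch in text:
        if ord(ch) > 0xFFFF:
            break
        kept.append(ch)
    return "".join(kept)


# List the code point of every character in the term.
code_points = ["U+%04X" % ord(ch) for ch in term]
# What would be stored under the truncation assumption.
stored = truncate_at_first_non_bmp(term)

print(code_points)
print(repr(stored))
```

Under that assumption the term is valid UTF-8, but its third character (U+1D458) needs four bytes, so the stored value would collapse to '1+' and collide with an existing '1+' row. That would match the "Duplicate entry '1+'" in the error, though it remains a guess until the column's character set is checked.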

Thanks in advance,

On 6 May 2011, at 10:43, Johnny Mariéthoz wrote:

> Dear Tibor,
> 
> On 6 May 2011, at 10:21, Tibor Simko wrote:
> 
>> On Thu, 05 May 2011, Johnny Mariéthoz wrote:
>>> Error when putting the term ''non-meat'' into db
>>> (hitlist=intbitset([22464])): (1062, "Duplicate entry '16777215' for
>>> key 1")
>> 
>> The duplicate entry problem is related to incremental indexing of badly
>> washed/truncated index terms before they are pushed to index.  It could
>> happen due to bad UTF-8 characters, due to changes in word-breaking
>> procedures, etc.  We have seen it too on our servers, mostly for
>> full-text indexing.
>> 
>> We believe we have fixed this problem in the latest git master branch;
>> but these fixes concern Invenio v1.0 release series only.  If you are on
>> v0.99 release series, then some back-porting may be needed.  Do you get
>> these troubles on RERO DOC running Invenio v0.99.1?
> Yes, this problem happens with our production server: Invenio v0.99.1
> 
>> In any case, rebuilding all your indexes from scratch (via bibindex -R)
>> should fix the problem for some time to come, even without patching your
>> sources.  Because I think you see this problem only with incremental
>> indexing; it should not happen during full re-indexing.  Is that right?
> 
> I do not want to re-index all the files. It would take too much time. 
> Moreover, I think that we have a huge number of words, as we have a lot of 
> documents with OCR text inside, which creates a lot of new words in the index table.
> Can I safely change the type of the id in the idxWORD09F table from MEDIUMINT 
> to INT?
> Are there other tables that use this id?
> Is it a good idea?
> 
> Note: In the past I tried to re-index all the documents, but it took one full 
> day and crashed the machine due to a memory problem. I tried several options 
> (-M -f) with bibindex without success. This is due to one of our collections, 
> a scanned newspaper spanning 200 years, which represents about 60000 
> scanned PDF files. I exclude this collection from the full-text indexing.
> 
> Thanks for your answers.
> 
> Regards,
> 
>> 
>> Best regards
>> -- 
>> Tibor Simko
> 
> ----------------------------------------------------------------------
> Johnny Mariéthoz
> RERO, Av. de la Gare 45, CH - 1920 MARTIGNY
> Telephone:  +41(0)27 721 8579
> Fax              : +41(0)27 721 8586
> Web            : http://www.rero.ch
> ReroDoc    : http://doc.rero.ch, [email protected]
> ----------------------------------------------------------------------
> 
> 

----------------------------------------------------------------------
Johnny Mariéthoz
RERO, Av. de la Gare 45, CH - 1920 MARTIGNY
Telephone:  +41(0)27 721 8579
Fax              : +41(0)27 721 8586
Web            : http://www.rero.ch
ReroDoc    : http://doc.rero.ch, [email protected]
----------------------------------------------------------------------

