On Tue, 28 Feb 2012, Schindler, Sebastian wrote:
> We already uploaded and indexed ~775.000 records. To achieve this, we
> had to change the data type of all bibxxx, bibrec_bibxxx and all
> PAIR/WORD/PHRASE tables from MEDIUMINT to INT, because some
> id-overflow issues occurred at record no. ~600.000.

This may indeed occur, depending on how many index terms your word
breaking procedures generate.  E.g. for some instances of Invenio that
we are running here, we have 1M+ records, and UNSIGNED MEDIUMINT is
still OK for us.  IIRC, UNSIGNED MEDIUMINT should allow for 16,777,215
index terms.  It seems you are generating more than 16M index terms
with 600K records?  That sounds like a lot.  Maybe you don't use
stemming?  Or do you need such fine-tuned word breaking?  You can count
the index terms you generate via commands like:

  $ echo "SELECT COUNT(*) FROM idxWORD01F" | /opt/invenio/bin/dbexec
  $ echo "SELECT MAX(id) FROM idxWORD01F" | /opt/invenio/bin/dbexec

> Bibindex nearly freezes/runs very very slow when trying to index the
> global index.

It could be just slow due to the index size.  How big are your idxWORD*
tables, both the index files (MYI) and the data files (MYD)?  Also,
have you tried to optimise your MySQL server parameters such as
key_buffer and friends?
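
For example, to get a quick idea of the file sizes and of the key
buffer that is currently configured, something like this could be used
(the data directory path below is only a guess, adjust it to your MySQL
setup):

  $ ls -lh /var/lib/mysql/invenio/idxWORD01F.MY?
  $ echo "SHOW TABLE STATUS LIKE 'idxWORD01F'" | /opt/invenio/bin/dbexec
  $ echo "SHOW VARIABLES LIKE 'key_buffer_size'" | /opt/invenio/bin/dbexec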

> - bibindex -w global --repair => success, but the problem is still
> there

Does `bibindex -u admin -w global -k' report success?

> - different flush sizes  (5.000, 25.000, 50.000)

The bigger, the better, depending on your RAM size and on the size of
the bibindex process while it is running.

For some indexes that don't generate many index terms, e.g. title, you
could go as high as `-f 260000', if RAM permits, so that re-indexing
all your titles would take only three flushes.
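
For instance, something along these lines could do, assuming the title
index terms for one flush fit comfortably in your RAM:

  $ bibindex -u admin -w title -f 260000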

> The last package (bibindex -w global -f 50000 --id=762635-776582)
> threw this exception (invenio.err):
>
> #################################################
> Error when putting the term \'\'1846 illustr\'\' into db
> (hitlist=intbitset([767550])): (1062, "Duplicate entry \'0\' for key
> \'PRIMARY\'")\n'

OK, so the problem seems to be with record 767550 and with the word pair
index (idxPAIR*).  So you can do the above size estimate on this index.
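
For example, assuming the global index is number 01 as in a default
installation:

  $ echo "SELECT COUNT(*) FROM idxPAIR01F" | /opt/invenio/bin/dbexec
  $ echo "SELECT MAX(id) FROM idxPAIR01F" | /opt/invenio/bin/dbexec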

To investigate further, let's first see whether the trouble isn't due
to the indexing part that breaks words into pairs.  If your tables are
clean, can you run:

  $ bibindex -w global -a -i 767550

and see if the error is reproducible?  Can you also send the MARC
representation of this record, so that we can see what characters it
contains?
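
For example, you could dump it in MARCXML via the record page output
format, something like:

  $ wget -q -O - "http://your.site/record/767550?of=xm"

where `your.site' stands for your installation's URL.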

Best regards
-- 
Tibor Simko
