On Wed, 29 Feb 2012, Sebastian Schindler wrote:
> We do use stemming. Please correct me if I'm wrong, but I guess this
> amount of terms per record is generated just because we are uploading
> relatively large records (one record in MARC-XML is about 20KB).

That could be, especially if your documents come from diverse
scientific domains, so that the number of _different_ words in them is
really huge.  Still, 16M different index terms seems like a lot.  Isn't
there some other problem, such as using English stemming on German text
or something similar?

It may be worth tweaking variables like
CFG_BIBINDEX_CHARS_ALPHANUMERIC_SEPARATORS and
CFG_BIBINDEX_CHARS_PUNCTUATION in order to reduce the number of your
distinct index terms, if you don't need that many.

> That repair task seems to be very slow, too. Its progress is
> "ixPAIR01F flushed 0 /745112 words" for about 50 minutes now.

Has everything completed successfully since then?  If repairing takes
too much time, one can use low-level magic techniques like:

   $ echo "DELETE FROM idxWORD01R WHERE type='TEMPORARY' or type='FUTURE';" | \
     /opt/invenio/bin/dbexec

mentioned in the BibIndex Admin Guide under:

   <http://invenio-demo.cern.ch/help/admin/bibindex-admin-guide#4.2>

Best regards
-- 
Tibor Simko
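
P.S. To get a feeling for why the separator and punctuation settings
matter, here is a toy sketch in Python.  It is not Invenio's actual
tokenizer.  It merely assumes that an indexer stores each word as typed,
the word with punctuation removed, and the parts obtained by splitting
at separators.  The function name `index_terms` and both regular
expressions are made up for illustration; they are not the default
values of the CFG_BIBINDEX_* variables.

```python
import re


def index_terms(text, punctuation=None, separators=None):
    """Toy model: return the set of distinct index terms for `text`.

    `punctuation` and `separators` are regular expressions (or None),
    playing a role loosely similar to the two CFG_BIBINDEX_CHARS_*
    variables. This is an assumption for illustration only.
    """
    terms = set()
    for word in text.lower().split():
        # 1) keep the word exactly as it appears
        terms.add(word)
        # 2) also keep it with punctuation characters removed
        if punctuation:
            stripped = re.sub(punctuation, "", word)
            if stripped:
                terms.add(stripped)
        # 3) also keep every part obtained by splitting at separators
        if separators:
            for part in re.split(separators, word):
                if part:
                    terms.add(part)
    return terms


sample = "e-mail: foo.bar@cern.ch, see arXiv:1202.1234 (hep-ex)"

# Hypothetical "wide" settings: many characters strip or split words.
wide = index_terms(sample, punctuation=r'[.,:;?!"]', separators=r"[\W_]+")
# Hypothetical "narrow" settings: nothing is stripped or split.
narrow = index_terms(sample)

print(len(wide), "terms with wide settings")
print(len(narrow), "terms with narrow settings")
```

In this toy model the narrow settings yield fewer distinct terms for
the same text, at the price that a search for a sub-part of a word
would no longer match.  Whether that trade-off is acceptable depends on
how your users search.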
