On Wed, 29 Feb 2012, Sebastian Schindler wrote:
> We do use stemming. Please correct me if I'm wrong, but I guess this
> number of terms per record is generated just because we are uploading
> relatively large records (one record in MARC-XML is about 20KB).

That could be, especially if your documents come from diverse
scientific domains, so that the number of _different_ words in them is
really huge.

Still, 16M different index terms seems like a lot.  Isn't there some
other problem, such as using English stemming on German text or
something similar?

It may be worth tweaking variables like
CFG_BIBINDEX_CHARS_ALPHANUMERIC_SEPARATORS and
CFG_BIBINDEX_CHARS_PUNCTUATION in order to reduce the number of
distinct index terms, if you don't need that many.
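
To get a feeling for why the separator settings matter, here is a toy
sketch.  It is not BibIndex's real tokenizer: the function index_terms,
the sample sentence and the character sets are all made up for
illustration.  It only assumes that a word is indexed as a whole and
additionally as the pieces obtained by splitting on separator
characters, so that fewer separators means fewer distinct terms.

```python
import re

def index_terms(text, punctuation, separators):
    """Toy model only (hypothetical, not BibIndex code): index each
    whitespace-delimited word, its punctuation-free form, and the
    pieces obtained by splitting it on separator characters."""
    terms = set()
    for word in text.lower().split():
        terms.add(word)
        stripped = word
        if punctuation:
            # remove punctuation characters from the word
            stripped = re.sub("[%s]" % re.escape(punctuation), "", word)
        if stripped:
            terms.add(stripped)
        if separators:
            # additionally index every piece between separators
            for piece in re.split("[%s]+" % re.escape(separators), word):
                if piece:
                    terms.add(piece)
    return terms

sample = "X-ray cross-section measured at CERN (pp-collisions, 7 TeV)."
punct = ".,:;?!\""                      # illustrative value only
many = index_terms(sample, punct, separators="-()/.,")
few = index_terms(sample, punct, separators="")
print(len(many), len(few))
```

With the separators emptied, sub-word pieces such as "ray" or "section"
are no longer indexed on their own, which is the kind of trade-off to
weigh before changing the real configuration.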

> That repair task seems to be very slow, too. Its progress is
> "ixPAIR01F flushed 0 /745112 words" for about 50 minutes now.

Has everything succeeded since then?  If the repair takes too long, one
can resort to low-level magic such as:

$ echo "DELETE FROM idxWORD01R WHERE type='TEMPORARY' or type='FUTURE';" | \
  /opt/invenio/bin/dbexec

mentioned in the BibIndex Admin Guide under:

 <http://invenio-demo.cern.ch/help/admin/bibindex-admin-guide#4.2>

Best regards
-- 
Tibor Simko
