On Tue, 15 Mar 2011, Samuele Kaplun wrote:
> My initial quest for putting outside of MARC the list of OAI sets to
> which a record belongs is merely related to performance: currently, in
> Invenio, the oai_repository_updater is incredibly slow, because it has
> to compute all the sets and check, for each record, whether the record
> needs to be updated.

I tend to look at oai_repository_updater as a kind of handy daily
checking tool to ensure that all records that are eligible for OAI
exporting are indeed in proper OAI sets and all.  (Especially since one
can pre-populate OAI fields in submissions or approval forms etc.)  If
oai_repository_updater runs for ~15 minutes a day, then this still seems
an acceptable price to pay.  However, the script may scale badly for
sites with a plethora of OAI sets, so some run-time optimisations are
indeed welcome.

Have you tried to profile the script to see how much it can be
micro-optimised?  Moreover, we can probably macro-optimise it from the
daily operation perspective.  IIRC, oai_repository_updater currently
checks every eligible record upon every run, while it should not be
necessary to re-check every record every day, if the record was not
touched and if the set definitions were not changed.  We can simply
store the time-stamp information about when oai_repository_updater was
last run, and about when the OAI set definitions were last modified (or use
table update times), and the oai_repository_updater script would process
only new/modified records since its last run, unless OAI set definitions
were changed as well.  There would naturally be some `--process-all'
option that would process all the records regardless of the timestamps,
but the default mode would be to process only records modified since the
last run, much like the indexer does.  This should help a lot in
speeding up daily operations.
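
To make the idea concrete, here is a rough Python sketch of the selection
logic only.  It is not actual Invenio code: the function name, its
arguments and the in-memory dict of record modification times are all
hypothetical, and a real implementation would read these values from the
database.

```python
from datetime import datetime


def records_to_process(records, last_run, sets_last_modified,
                       process_all=False):
    """Return the sorted ids of records whose OAI sets need re-checking.

    records            -- dict: record id -> last modification datetime
                          (hypothetical stand-in for a database query)
    last_run           -- datetime of the previous updater run, or None
    sets_last_modified -- datetime of the last OAI set definition change
    process_all        -- mirrors the proposed `--process-all' option
    """
    # Full pass when forced, on the very first run, or when the set
    # definitions changed since the previous run.
    if process_all or last_run is None or sets_last_modified >= last_run:
        return sorted(records)
    # Otherwise only look at records touched since the previous run.
    return sorted(recid for recid, modified in records.items()
                  if modified >= last_run)


if __name__ == "__main__":
    last_run = datetime(2011, 3, 14)
    records = {1: datetime(2011, 3, 10), 2: datetime(2011, 3, 15)}
    # Set definitions are older than the last run, so only the record
    # modified after the last run is selected.
    print(records_to_process(records, last_run, datetime(2011, 3, 1)))
```

After a successful run the updater would store the new time-stamp, so
that the next run starts from there.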

> What about having both ways, with the former only optionally
> enabled, in case one is maintaining OAI sets by hand (and not via
> oai_repository_updater)? At CDS we would keep this turned off, while
> it can be turned on in other installations...

It would of course be possible to create yet another CFG variable, but
I'd avoid that for simplicity's sake, especially if we are able to
optimise things differently.

Best regards
--
Tibor Simko
