On Tue, 15 Mar 2011, Samuele Kaplun wrote:
> My initial quest for putting outside of MARC the list of OAI sets to
> which a record belongs is merely related to performance: currently, in
> Invenio, the oai_repository_updater is incredibly slow, because it has
> to compute all the sets, and check, for each record if the record need
> to be updated.

I tend to look at oai_repository_updater as a kind of handy daily
checking tool that ensures that all records eligible for OAI exporting
are indeed in their proper OAI sets.  (Especially since one can
pre-populate OAI fields in submissions or approval forms etc.)  If
oai_repository_updater runs for ~15 minutes a day, then this still
seems an acceptable price to pay.

However, the script may scale badly for sites with a plethora of OAI
sets, so some run-time optimisations are indeed welcome.  Have you
tried to profile the script to see how much it can be micro-optimised?

Moreover, we can probably macro-optimise it from the daily operation
perspective.  IIRC, oai_repository_updater currently checks every
eligible record upon every run, while it should not be necessary to
re-check every record every day if the record was not touched and if
the set definitions were not changed.  We can simply store the
timestamps of when oai_repository_updater was last run and of when the
OAI set definitions were last changed (or use table update times), and
the oai_repository_updater script would then process only the records
that are new or modified since its last run, unless the OAI set
definitions were changed as well.  There would naturally be some
`--process-all' option that would process all the records regardless
of the timestamps, but the default mode would be to process only
records modified since the last run.  Kind of like what the indexer is
doing.  This should help a lot in speeding up daily operations.

> What about of having both ways, with the former only optionally
> enabled, in case one is maintaining OAI sets by hand (and not via
> oai_repository_updater)? At CDS we would keep this turned off, while
> it can be turned on on other installations...

It would of course be possible to create yet another CFG variable, but
I'd avoid that for simplicity's sake, especially if we are able to
optimise things differently.

Best regards
--
Tibor Simko
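
P.S. A rough sketch of the timestamp-based selection described above,
in Python.  The function name, its arguments, and the in-memory dict of
record modification times are all hypothetical stand-ins chosen for
illustration; this is not existing oai_repository_updater code, and a
real version would presumably read these timestamps from the database.

```python
from datetime import datetime


def select_records_to_check(record_mtimes, last_run, sets_changed_at,
                            process_all=False):
    """Return sorted IDs of records whose OAI sets should be re-checked.

    record_mtimes   -- dict: record ID -> last modification datetime
    last_run        -- datetime of the previous updater run, or None
    sets_changed_at -- datetime of the last OAI set definition change,
                       or None if unknown/never changed
    process_all     -- hypothetical equivalent of a `--process-all' flag
    """
    # Full pass when explicitly requested or when there was no previous run.
    if process_all or last_run is None:
        return sorted(record_mtimes)
    # Full pass when the set definitions changed since the previous run,
    # because any record may now belong to different sets.
    if sets_changed_at is not None and sets_changed_at >= last_run:
        return sorted(record_mtimes)
    # Default mode: only records touched since the previous run.
    # ">=" errs on the side of re-checking a record once too often.
    return sorted(recid for recid, mtime in record_mtimes.items()
                  if mtime >= last_run)


records = {
    1: datetime(2011, 3, 10),
    2: datetime(2011, 3, 15),
    3: datetime(2011, 3, 1),
}
last_run = datetime(2011, 3, 14)

# Set definitions untouched since the last run: only record 2 qualifies.
print(select_records_to_check(records, last_run, datetime(2011, 3, 1)))
# Set definitions changed after the last run: everything is re-checked.
print(select_records_to_check(records, last_run, datetime(2011, 3, 15)))
```

The two calls print [2] and [1, 2, 3] respectively.  The comparison is
deliberately inclusive, on the assumption that re-checking a record
unnecessarily is cheaper than missing a modified one.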
