http://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=10662
--- Comment #25 from Andreas Hedström Mace <andreas.hedstrom.m...@sub.su.se> ---

The National Library of Sweden has, together with Stockholm University Library, provided the funding required for David to finalize his work on the harvester. Stockholm University Library has been testing the OAI-PMH harvester extensively of late, has provided feedback and has been in discussion with David about the development of the harvester. Here I'll try to summarize our discussions. David will probably have to fill in the gaps where needed and provide further detail!

Our use case

We are harvesting records from the Swedish union catalogue LIBRIS, which provides records in MARCXML. Today only bibliographic records are harvested, but we hope to add functionality in the future to also allow holdings to be harvested (that is a separate development and won't be discussed further here). We want to harvest repeatedly and often, preferably every 5 seconds or so, to always have up-to-date records in our local system. Cataloging is done in LIBRIS.

Core functionality

* The harvester works as intended: we have tried harvesting records, editing/deleting them at the source and then reharvesting them, and everything behaves as expected.
* We also tried to delete a record in Koha and then do a harvest - the intended error message is displayed ("Harvested records in error state").
* It's very good that the HTTP and OAI-PMH parameters for the OAI server target can be tested directly! (I was trying to set up the LIBRIS SRU server in Koha the other day and was frustrated that I had to go to cataloging to test whether or not I had set up the correct parameters...)

All in all, the harvester works as intended!

Major issues

Repeated harvests

The harvester as built today is made to run one-time harvests, or repeating harvests with long intervals in between, like once every night. For those use cases, scheduling in the GUI and then running the job with the cronjob (which handles the download and import parameters) is not a problem. But for frequently repeated tasks, this divided responsibility is highly problematic. We would like to have all harvests (or tasks) set up from the GUI! To facilitate this, David has proposed changing the harvester to run as a daemon instead. The reasons for this are as follows:

* Using the daemon, all scheduling can be handled by the GUI.
* Using the daemon, you could harvest every few seconds. The original intent with the cronjob was that it would be set once and never looked at again; the harvesting would just happen in the background. But since you want more control and want to run the harvest every few seconds, a daemon is the way to go.
* The key benefit of using the daemon is that you can control it from the GUI and that it can manage the harvests. Trying to set/schedule a cronjob from the GUI would be a bad idea.
* If you're trying to re-harvest every few seconds, a cronjob could easily get out of control. You could easily have competing processes and no way to control them at all.

A daemon couldn't be a communications centre in the way described. The way I envision it, the daemon will communicate with the Web GUI. You could start, stop and pause harvests. The daemon would also be in charge of the actual harvest, as it could control its own activity. You can't really control a cronjob: the cron daemon starts cronjobs based on its own unique syntax and that's it. It's just a scheduler, not a controller. The daemon I'm talking about would be a controller. You could tell it "STOP 1" and it would stop running the harvest with identifier 1.
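To make this concrete, here is a very rough sketch of the kind of control channel such a daemon could expose. To be clear, this is entirely hypothetical: the socket path, command names and responses are made up for illustration and are not taken from David's patch.

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Socket qw(SOCK_STREAM);
    use IO::Socket::UNIX;

    # Hypothetical control socket for the harvester daemon.
    my $sock_path = '/var/run/koha/oai-harvester.sock';
    unlink $sock_path;
    my $server = IO::Socket::UNIX->new(
        Type   => SOCK_STREAM,
        Local  => $sock_path,
        Listen => 5,
    ) or die "Cannot create control socket: $!";

    my %harvests;    # harvest id => state ('running', 'paused' or 'stopped')

    # The Web GUI would connect and send one command per line,
    # e.g. "START 1", "PAUSE 1" or "STOP 1".
    while ( my $client = $server->accept ) {
        my $line = <$client>;
        next unless defined $line;
        chomp $line;
        if ( $line =~ /^(START|PAUSE|STOP)\s+(\d+)$/i ) {
            my ( $cmd, $id ) = ( uc $1, $2 );
            if    ( $cmd eq 'START' ) { $harvests{$id} = 'running' }
            elsif ( $cmd eq 'PAUSE' ) { $harvests{$id} = 'paused' }
            else                      { $harvests{$id} = 'stopped' }
            print {$client} "OK $id $harvests{$id}\n";
        }
        else {
            print {$client} "ERR unknown command\n";
        }
        close $client;
    }

The actual harvesting work is left out above; the point is only that a long-running process can accept start/stop/pause commands from the staff interface at any time, which a cronjob cannot.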
David can hopefully provide more detail on the proposed daemon approach.

We had some initial reservations about the use of a daemon for the harvester, mainly because it would be a background process that might be hard for a systems administrator to evaluate and work with, to which David replied:

* Why would it be hard for a systems administrator to evaluate/work with a daemon? It seems to me that it would actually be easier for sysadmins to evaluate/work with a daemon, as it can be monitored and controlled as a separate process. It's much easier to control than a cronjob.

It would be good to have input from others in the community on the merits of having the harvester run as a daemon!

Matching rules

At the moment there are no matching rules for the harvester per se. The only matching that is done is based on the OAI-PMH unique identifier. If there is already a record in Koha with the same title, but not the same OAI-PMH unique identifier, you will get a duplicate. Not having matching rules will essentially make the harvester useless for us, and I would guess for anyone harvesting from a union catalogue: we don't want to add a lot of unnecessary duplicates to our local catalogue. For libraries that are already running Koha and want to start using the harvester, there would be a lot of duplicates (possibly everything!). Also, we do not want to limit libraries to harvesting from a single source; there might be a need in the future to harvest from multiple sources.

We suggest that the "Staged MARC Management" tool should be used to actually import the records into Koha, so that the matching rules that apply there would be used, or that this functionality be copied/mirrored for the harvester. (A rough sketch of the kind of fallback matching we have in mind is included at the end of this comment.)

Small issues

* When viewing a server target, the page doesn't have a back button or working breadcrumbs. David has suggested that he might not add a back button but will fix the breadcrumbs.
* The "reset repository harvest" button should have a warning or a help text next to it, explaining that all harvested records will be removed.
* A help text should be added next to the Until parameter, explaining that it should not be set for repeated harvests. Otherwise, since the From parameter is auto-updated with each harvest, Until could end up earlier than From, which will cause the harvester to fail.
* More detailed information should be presented under "View", preferably lists of records imported (where you can click on the bib-id to go to the actual record), lists of deleted records, updated records etc. We will draw up what we would like to see in terms of details and send it to David. We can also post it here, if others are interested.
* It would be great if multiple sets could be provided for one OAI server.
* The first time a new server is added, pressing the "Test HTTP and OAI-PMH parameters" button sends you back to the OAI-PMH server targets (oai_client.pl) page, which is what you would expect the Save button to do. David has confirmed that this is a bug.
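Appendix: as mentioned under "Matching rules" above, here is a rough sketch of the kind of fallback matching we have in mind. This is purely illustrative: none of these subroutine names exist in Koha or in David's patch, and the field names are made up. The point is only the order of the checks, roughly mirroring what the matching rules in Staged MARC Management already do.

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Decide what to do with an incoming harvested record.
    # $record is a hashref with a few illustrative fields.
    sub choose_import_action {
        my ($record) = @_;

        # 1. Match on the OAI-PMH unique identifier, which is all the
        #    harvester does today.
        if ( my $biblionumber = find_by_oai_identifier( $record->{oai_identifier} ) ) {
            return ( 'replace', $biblionumber );
        }

        # 2. Fall back to a configurable matching rule, as Staged MARC
        #    Management does, e.g. on the control number, ISBN or title.
        if ( my $biblionumber = find_by_matching_rule( $record ) ) {
            return ( 'replace', $biblionumber );
        }

        # 3. Only add a brand new record when neither check finds an
        #    existing one.
        return ( 'add', undef );
    }

    # Placeholder lookups so the snippet compiles; a real implementation
    # would query the Koha database or reuse the existing matcher code here.
    sub find_by_oai_identifier { return undef }
    sub find_by_matching_rule  { return undef }

    my ( $action, $biblionumber ) = choose_import_action({
        oai_identifier => 'oai:example.org:123',    # made-up identifier
        control_number => '123',
        title          => 'Example title',
    });
    print "Action: $action\n";

With checks in this order, a library that already has most of the union catalogue's records locally would get updates to its existing records instead of a catalogue full of duplicates.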