Hi,
I have quite a mess with a site that I'm supposed to harvest
selectively. The remote site is a DSpace repository that holds
documents from all the Catalan universities, and we need to collect a
few dozen OAI sets, the ones that belong to our university.
We are facing a few problems and doubts. First, let me show part of a
report I've produced, which illustrates some of the typical problems:
oaiset = hdl_2072_16100 (Reports on Environmental Sciences)
identifiers = 9
not found = 0
duplicated = 0
oaiset = hdl_2072_1749 (Working papers)
identifiers = 401
not found = 11
oai:www.recercat.net:2072/41974
oai:www.recercat.net:2072/41975
oai:www.recercat.net:2072/41976
oai:www.recercat.net:2072/41977
oai:www.recercat.net:2072/41978
oai:www.recercat.net:2072/41979
oai:www.recercat.net:2072/41980
oai:www.recercat.net:2072/41983
oai:www.recercat.net:2072/41985
oai:www.recercat.net:2072/41984
oai:www.recercat.net:2072/41986
duplicated = 27
oai:www.recercat.net:2072/88000 = [63044, 67648]
oai:www.recercat.net:2072/87998 = [63042, 67646, 69419]
oai:www.recercat.net:2072/87999 = [63043, 67647]
oai:www.recercat.net:2072/87990 = [63034, 67639]
oai:www.recercat.net:2072/87991 = [63035, 67640]
[...]
I think the problem originates, at least in part, from some control
characters in the records of the remote site: the XSL transformation
fails with an error and, apparently, the rest of the records in the
batch are not converted and loaded into our local site. I've applied a
simple patch (attached) that seems to solve this. There may be better
ways to remove those problematic characters, so improvements are
welcome. Otherwise, Tibor, please consider applying it upstream.
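One possibly cleaner alternative would be to remove only the
characters that XML 1.0 actually forbids, and leave tabs and newlines
untouched. Just a sketch, not tested inside bibconvert:

import re

# XML 1.0 forbids most chars below 0x20, except tab (0x09), LF (0x0A)
# and CR (0x0D); remove only those instead of blanking the whole range.
_xml_invalid = re.compile('[\x00-\x08\x0b\x0c\x0e-\x1f]')

def strip_invalid_xml_chars(text):
    """Drop control characters that make the XSLT engine choke."""
    return _xml_invalid.sub('', text)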
After applying it locally, I forced another harvest, but it seems that
it does not collect records older than the last harvesting time, even
if those records do not exist on my site. In a closely related way, we
are not sure whether the older records of the OAI set I'm checking now
will be collected in the next harvesting session.
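From what I understand of OAI-PMH itself (not of the Invenio code),
incremental harvesting is driven purely by datestamps: the harvester
only asks the remote side for records changed since the last run, so
records that are merely missing locally would never be re-sent.
Roughly, I imagine the request that goes out looks like this (endpoint
path and date are made up):

import urllib

base = 'http://www.recercat.net/oai/request'   # assumed OAI endpoint
params = {'verb': 'ListRecords',
          'metadataPrefix': 'oai_dc',
          'set': 'hdl_2072_1749',
          'from': '2010-01-01'}                # changed since this date
print base + '?' + urllib.urlencode(params)

If that is the case, re-harvesting the older records probably needs
the harvest date to be reset or the records to be fetched explicitly.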
I can do a manual harvest-convert-upload (h-c-u) of the records that
I've identified, no problem. But I'd like to know how Invenio decides
that a record has to be collected in the two related scenarios I've
tried to explain above.
Do I have to do any post-processing after my manual h-c-u action? Or
is there a way to feed a known list of local records (or remote
identifiers) to oaiharvest?
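If there is no built-in way, my fallback idea is to fetch the known
identifiers one by one with GetRecord and then run the usual
convert-and-upload steps on the result; roughly (endpoint path
assumed, untested):

import urllib
import urllib2

base = 'http://www.recercat.net/oai/request'   # assumed OAI endpoint
missing = ['oai:www.recercat.net:2072/41974',
           'oai:www.recercat.net:2072/41975']  # ... and the rest

for oai_id in missing:
    params = urllib.urlencode({'verb': 'GetRecord',
                               'metadataPrefix': 'oai_dc',
                               'identifier': oai_id})
    xml = urllib2.urlopen(base + '?' + params).read()
    # then pass xml through bibconvert and bibupload, as in a normal
    # h-c-u run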
Thanks,
Ferran
bibconvert: remove invalid (control) chars
* Translate lower ASCII chars to spaces so the xsl translation doesn't fail
---
bin/bibconvert | 4 ++++
1 files changed, 4 insertions(+), 0 deletions(-)
diff --git a/bin/bibconvert b/bin/bibconvert
index 494c3f8..daba5e9 100755
--- a/bin/bibconvert
+++ b/bin/bibconvert
@@ -24,6 +24,8 @@
__revision__ = "$Id: bibconvert.in,v 1.41 2008/06/06 09:53:46 jerome Exp $"
+validchars = ''.join([' ' for i in range(32)] + [chr(i) for i in range(32,256)])
+
try:
import fileinput
import string
@@ -249,6 +251,7 @@ if opt_value.endswith('.'+
CFG_BIBFORMAT_BFX_FORMAT_TEMPLATE_EXTENSION):
# BibConvert for bfx
source_xml = sys.stdin.read()
+ source_xml = source_xml.translate(validchars)
try:
print bibconvert_bfx_engine.convert(source_xml, extract_tpl)
except NameError:
@@ -263,6 +266,7 @@ if opt_value.endswith('.'+
elif opt_value.endswith('.xsl'):
# BibConvert for XSLT
source_xml = sys.stdin.read()
+ source_xml = source_xml.translate(validchars)
try:
res = bibconvert_xslt_engine.convert(source_xml, extract_tpl)
if res is not None: