RE: [inspire-dev] Limitations with having standalone BibDocs

Piotr Praczyk Tue, 14 Jun 2011 20:08:04 +0200

Hello

Thanks for your answer.
As I mentioned before, my knowledge of METS is very limited (a single look 
at the tutorial), so your comments are very valuable.
My main concern with implementing the standard is its compatibility with our 
data model (which is changing anyway, so if we are to support METS, it is 
better to rethink most things now and possibly support part of the standard).
The second concern was the amount of effort needed to implement it [but this is 
irrelevant for the discussion].
It is indeed obvious how beneficial full support for the standard would be.


>> 1) (the most obvious for me) - the case of standalone plots showing an 
>> important phenomenon. The access to them shall be provided by the figures 
>> search.
>I propose the benchmark of document digitization use-case: let's think about
>- one document with its metadata (say MARC or EAD record)
>- with many image files (or video/audio files...) for the different pages (or 
>streaming parts...) with many different file-versions per page (like: the 
>master uncompressed version for archive; the high-level compressed version for 
>local access; the low-level compressed version for online access; the 
>thumbnail; etc etc).

This is not the use case of figures from scientific publications (if I 
understand correctly, it looks more like the digitization of entire documents), 
though it seems relevant for Invenio/Inspire in general. It looks like a nice 
benchmark for the underlying data structures.

>>>      FFT should be left for fulltext upload where it serves the purpose
>>>      perfectly and should be understood as syntactic sugar providing
>>>      abbreviated form of a more general upload.
>>It also serves well the case of many documents in many formats attached
>>to the same records. I hope all these use cases will still be supported
>>through FFTs
>The idea was to provide new mechanism not modifying the existing capabilities 
>of Bibupload... just stopping to use FFT for objects as Figures.

Exactly. The FFT syntax should be an abbreviated form of attaching documents, 
equivalent to some (METS?) input. 


>>      * provide a web handler to access bibdocfiles regardless of them
>>        being owned by a record (as the
>>        current /record/123/files/foo.pdf will no longer work for non
>>        fulltext) (BTW what about restriction/authorization? What if
>>        bibdoc is referenced both by a public and a restricted record?
>>        Should we go for the strongest restriction mode?)
>It can still work, but in a slightly more distant future we might want to 
>provide /object/123 along with /record/12

>Here (with this ad-hoc selection of quotes) I'd like to forward a big 
>preoccupation of the system analysts I worked with for digital libraries: the 
>complete autonomy of the storage file-system from metadata file-pointers.
>Digital-metadata like METS are often used as a layer (additional to the 
>descriptive-metadata like MARC) to store file-pointers managed with some 
>resolver (based on algorithms like yours). So that when we are dealing with 
>many many thousands of files per many many TeraBytes, EVERY new 
>accommodation/substitution/refresh of the storage-system is possible … without 
>worrying about the logical (even permanent) pointers to the files. (From METS 
>tutorial: The LOCTYPE attribute specifies the type of locator contained in 
>body of the element; valid values for LOCTYPE include 'URN,' 'URL,' 'PURL,' 
>'HANDLE,' 'DOI,' and 'OTHER.')

>Speaking in concrete words: in my experience quite every time I saw,
>- descriptive-metadata (like MARC) managed on one side with specific 
>(multiple) identifiers (..also modifiable identifiers, in the collaborative 
>systems..)
>- digital-repositories, on the other side, with specific (stable!!) 
>identifiers for digital-objects and their component files,
>- and, in the middle, digital-metadata (like METS) which guarantee the 
>connections (regardless the physical file storage).


I think I did not understand this part.
What are the cases of modifiable identifiers inside MARC? Titles of documents 
plus authors? Exact file paths in the file system (as we still happen to have 
in some places in Inspire)?
By the link between the two, do you mean a document that identifies the same 
document with both kinds of identifier at the same time?
What do you mean by physical storage? Abstracting from the physical storage is 
exactly what I wanted to achieve by providing links such as /object/DOCID.
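
As an illustration only, the indirection behind /object/DOCID could be sketched 
in Python as below. The class, its method names and the example paths are 
hypothetical and not part of Invenio; the only point is that a storage 
migration touches the resolver table and never the link itself.

```python
# Hypothetical sketch: names and paths are illustrative, not Invenio's API.

class ObjectResolver:
    """Maps stable document identifiers to their current physical location."""

    def __init__(self):
        self._locations = {}  # docid (int) -> current storage path (str)

    def register(self, docid, path):
        """Record where the object with this identifier is stored."""
        self._locations[docid] = path

    def relocate(self, docid, new_path):
        """Storage migration: only the resolver table changes."""
        if docid not in self._locations:
            raise KeyError(docid)
        self._locations[docid] = new_path

    def resolve(self, url):
        """Translate a URL of the form /object/<DOCID> into a storage path."""
        prefix = "/object/"
        if not url.startswith(prefix):
            raise ValueError("not an object URL: %r" % url)
        docid = int(url[len(prefix):])
        return self._locations[docid]


resolver = ObjectResolver()
resolver.register(12456, "/storage-a/figures/12456.png")
before = resolver.resolve("/object/12456")

# The files move to a new storage system; the public link stays the same.
resolver.relocate(12456, "/storage-b/objects/12456.png")
after = resolver.resolve("/object/12456")
```

Here the same link resolves first to the old and then to the new location, 
which is the kind of independence from the storage system discussed above.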


>>BDR is supposed to provide link between records and objects(document).
>>In METS (if I understand correctly), they are used only to describe the 
>>internal structure of objects.
>I'm not sure if I'm correctly following your arguments, but here I have to 
>suggest just the contrary: METS does provide link between records and 
>objects(document). And internal structure of objects. And technical 
>description of the digital-object...
>METS is a container which can include or point-to a lot of different layers 
>with respective schemas, also the (MARC) descriptive-metadata.
I did not know that you can reference MARC from there.
I think the main point in this case is that BibUpload is currently the gate to 
every modification of MARC. If we create another tool that parses METS files 
and uploads them to the repository, and these files contain information about 
links between MARC records and objects, we will break this property.
I thought of BDR as a bridge that allows data about objects to be stored 
separately while the references to them are kept inside MARC; since these 
references modify the appearance of records, they should be managed by 
BibUpload.



>>> <BibDocRelation bibdoc1="tmp:NewFigure1" version1="1" bibdoc2="12456" 
>>> version2="2" type="extracted_from"/>
>>I really like the idea of creating links between specific versions.
>>Unfortunately METS is not aware of versions :-(
>>> Example:
>>Versions are crucial for us exactly for the reason You noted in last message

>METS does support multiple file versions !! And also in the precise way you 
>mentioned as an example :-)
>(Probably you already read it, but) let me repeat a quote of METS tutorial 
>(http://www.loc.gov/standards/mets/METSOverview.v2.html#filegrp):
>The file section (<fileSec>) contains one or more <fileGrp> elements used to 
>group together related files. A <fileGrp> lists all of the files which 
>comprise a single electronic version of the digital library object. For 
>example, there might be separate <fileGrp> elements for the thumbnails, the 
>master archival images, the pdf versions, the TEI encoded text versions, etc.
>[…]
><fileGrp> becomes much more useful for objects consisting of large numbers of 
>scanned page images, or indeed any case where a single version of the object 
>consists of a large number of files. In those cases, being able to separate 
><file> elements into <fileGrp>s makes identifying the files belonging to a 
>particular version of the document a simple task.
>I can surely say that I used MAG (...Italian, “mappable” version of METS) to 
>manage multiple images per book, with multiple versions per image (master 
>TIFF uncompressed, plus JPG compressed... Indeed: multi-resolution JPG file 
>version … but this is another topic).

I think the word "version" creates confusion here, as a version in this sense 
corresponds to a format in Invenio.
The version I was talking about is a number telling how many times an object 
has been modified. Maybe "revision" is a better word.
I was rather thinking about providing a link between, for example, the 1st 
revision of the full text (in whichever format) and the 3rd revision of a 
figure.
Being able to attach data to the connection between particular revisions will 
be important for the processing of figures.
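
To show what a link between revisions means here, the BibDocRelation element 
quoted earlier in this mail can be read into a small structure. This is only a 
sketch: the RevisionLink tuple and its field names are hypothetical, and the 
version attributes are interpreted as revision numbers in the sense described 
above, independent of any format.

```python
import xml.etree.ElementTree as ET
from collections import namedtuple

# Hypothetical in-memory form of one relation; not an Invenio class.
RevisionLink = namedtuple(
    "RevisionLink", "source source_rev target target_rev rel_type"
)


def parse_relation(xml_text):
    """Parse one <BibDocRelation/> element into a RevisionLink."""
    el = ET.fromstring(xml_text)
    if el.tag != "BibDocRelation":
        raise ValueError("unexpected element: %s" % el.tag)
    return RevisionLink(
        source=el.get("bibdoc1"),
        source_rev=int(el.get("version1")),   # revision number, not a format
        target=el.get("bibdoc2"),
        target_rev=int(el.get("version2")),
        rel_type=el.get("type"),
    )


link = parse_relation(
    '<BibDocRelation bibdoc1="tmp:NewFigure1" version1="1" '
    'bibdoc2="12456" version2="2" type="extracted_from"/>'
)
```

The resulting link ties revision 1 of one object to revision 2 of another, 
whatever formats (PDF, PNG, ...) each revision happens to be stored in.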


>But it really doesn't lack of already existing extended profiles: please take 
>a look to the actual registered profiles 
>http://www.loc.gov/standards/mets/mets-registered-profiles.html.
Thanks... I will have a look.


From my perspective, it would be nice to take the custom XML example I 
proposed a few mails ago and see whether it can be easily reproduced in METS.


cheers
Piotr
