Hello.
This is a follow-up to the previous E-mail about BibUpload.
Here I present what I think is the simplest method to implement for extending
BibUpload to fit the new requirements (despite the long description, it should
be quick to implement because most of the work is already done or can be
delegated to external libraries).
Because the proposed extensions touch the very core functionality of Invenio, I
would be very happy to get the green light to do this.
This part of Invenio is also crucial for my project, which cannot progress much
until these parts are integrated into the Invenio codebase.
As I mentioned, I have a branch containing an implementation of some new
features of BibDoc - mostly an extension of the MoreInfo notion and the
introduction of relations between records.
The main point of this e-mail is the ingestion of new documents into Invenio.
The main ideas inspiring the work are STANDALONE BIBDOCS, RELATIONS BETWEEN ONE
DOCUMENT AND MANY RECORDS and UPLOAD OF CUSTOM DATA INTO MOREINFO STRUCTURES.
I believe the proposed schema is flexible, but I am not able to predict all
possible scenarios, so critical comments (scenarios in which something could
break), especially before the implementation phase, would be very much
appreciated.
Please excuse the length of the e-mail. I have tried to keep everything clear
and as short as possible while still providing examples that might help in
understanding the idea.
Ideally, I would like to have an implemented prototype of a similar
infrastructure (modified after comments) ready during the Inspire week, which
will take place at the beginning of July.
+ Reasons why the current upload system is not sufficient and why we should
provide a new mechanism rather than extend the existing one
- Using BibUpload "MARC" is not a very clean solution for uploading
non-record data such as standalone documents.
In order to be compliant with the current BibUpload format, we would
have to include an FFT tag inside a record tag.
This would have to be interpreted by BibUpload as NOT modifying or
uploading any record.
- Packing data of a more complicated structure into "MARC" is non-intuitive.
Of course, it is possible to encode anything in MARC, but it will
quickly become unreadable and the code implementing encoding/decoding
of the data will be more error-prone.
FFT should be left for fulltext upload, where it serves the purpose
perfectly, and should be understood as syntactic sugar providing an
abbreviated form of a more general upload.
- FFT stands for Fulltext File Transfer. Using it for non-fulltext
documents leads to confusion.
- The BibUpload "MARC" obscures the way of thinking about documents.
The internal structure of documents and the relations among them (and the
relations between documents and records) are not reflected in the structure of
the FFT field.
Portions of information from different subfields land in completely
different database entities.
Example: the type of a document. Even the BibUpload documentation suggests
that it is a type of document, while in fact it is a type of link
between a document and a bibliographic record. Currently the two are
equivalent, but they will not be once we allow attachment to multiple
records.
+ Uploading of the documents
The current mechanism for uploading documents to Invenio is very much oriented
towards managing fulltexts that can belong to only one record.
Using the FFT syntax, it is difficult to extend BibUpload to allow attaching
the same BibDoc to many records or to create BibDocs not related to any record.
It is also difficult to support uploading relations between existing
BibDocs. This is because MARC provides a method of encoding tree-structured
data with a maximum nesting depth of 2 (1 in the case of control fields),
whereas the data structures that need to be uploaded to Invenio are graphs.
BibUpload is the only gateway for uploading data to Invenio. For the sake of
uniformity, it should remain this way.
I believe the easiest and most efficient way of adapting BibUpload to the
extended usage of BibDocs and file attachments is to provide an additional,
non-MARC, XML-based input that could be processed by BibUpload.
The existing FFT (Fulltext File Transfer) field should be preserved and
used ONLY when uploading fulltext documents.
FFT should be understood as a convenient abbreviated syntax exposing a limited
part of the functionality of the new XML syntax.
In addition to FFT, a new "MARC" field could be introduced - BDR (BibDoc
Reference).
The purpose of this field would be to introduce a link between the uploaded (or
modified) record and an existing BibDoc.
This would enable many-to-many relationships between records and documents,
which are currently explicitly blocked in the code of the BibDoc class but
theoretically allowed by the database structure.
In addition to extending the syntax of BibUpload, the significance of internal
BibDoc identifiers should be increased.
It should be ensured that an identifier cannot be reused after the deletion
of a BibDoc.
This would allow document identifiers to be used inside the BibUpload input.
These identifiers might also be useful when assigning Digital Object
Identifiers to objects uploaded to an instance of Invenio.
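
To make the non-reuse requirement concrete, here is a minimal sketch in Python.
The class DocIdAllocator and its methods are hypothetical names used only for
illustration; they are not part of the BibDoc API, and in practice the database
would presumably enforce this with a counter that never goes backwards.

```python
class DocIdAllocator:
    """Hypothetical allocator whose identifiers are never handed out twice."""

    def __init__(self, last_assigned=0):
        # Highest identifier ever assigned; it only grows.
        self._last_assigned = last_assigned
        # Identifiers of documents that currently exist.
        self._live = set()

    def allocate(self):
        self._last_assigned += 1
        self._live.add(self._last_assigned)
        return self._last_assigned

    def delete(self, docid):
        # Deleting removes the document but does not free its identifier.
        self._live.discard(docid)

    def exists(self, docid):
        return docid in self._live


allocator = DocIdAllocator()
first = allocator.allocate()
allocator.delete(first)
second = allocator.allocate()  # gets a fresh identifier, not the deleted one
```

With such a guarantee, an identifier stored in an old upload file can never
silently point at a different document.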
++ Syntax of the input encoding new elements
The additional input of BibUpload should be provided as an additional file
containing the following tags:
<BibDoc/>
<BibDocRelation/>
+++ <BibDoc>
This tag allows uploading of a document that will be managed by the
installation of Invenio.
The main sub-element of BibDoc is File, which attaches a file in a particular
format representing the BibDoc.
<File format=".jpg" path="/tmp/some_figure.jpg" />
If the format is not specified, it is guessed based on the file extension.
Before the upload phase, the identifier that Invenio will assign to the
document is not known.
Input passed to BibUpload should be able to include relations between uploaded
documents and links from documents to records. This can be achieved by
specifying temporary document identifiers.
The BibDoc XML tag may specify the id property. Its value can be either an
Invenio-assigned identifier (in which case the corresponding BibDoc can be
updated) or a temporary identifier (prefixed with the "tmp:" string).
The temporary identifier is recognised by BibUpload at least within the same
BibUpload session and allows a particular BibDoc to be referenced from XML
elements describing different entities. During the upload process, temporary
identifiers are replaced with newly assigned Invenio identifiers.
An example of a complete BibDoc definition:
<BibDoc id="tmp:NewFigure1">
<File format=".png" path="/tmp/figure.png"></File>
<File format=".jpg" path="/tmp/figure.jpg"></File>
</BibDoc>
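
As a rough illustration of how such an element could be read, here is a minimal
Python sketch using the standard library. The function parse_bibdoc and its
return shape are my own assumptions, not existing BibUpload code; the format
fallback simply takes the file extension, as described above.

```python
import os
import xml.etree.ElementTree as ET


def parse_bibdoc(xml_text):
    """Return (doc_id, is_temporary, [(format, path), ...]) for one <BibDoc>."""
    root = ET.fromstring(xml_text.strip())
    doc_id = root.get("id")
    # Temporary identifiers are marked by the "tmp:" prefix.
    is_temporary = doc_id is not None and doc_id.startswith("tmp:")
    files = []
    for file_el in root.findall("File"):
        path = file_el.get("path")
        # If no format is given, guess it from the file extension.
        fmt = file_el.get("format") or os.path.splitext(path)[1]
        files.append((fmt, path))
    return doc_id, is_temporary, files


SAMPLE = """
<BibDoc id="tmp:NewFigure1">
  <File format=".png" path="/tmp/figure.png"/>
  <File path="/tmp/figure.jpg"/>
</BibDoc>
"""

doc_id, is_temporary, files = parse_bibdoc(SAMPLE)
```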
+++ <BibDocRelation>
This markup element enables uploading links between BibDocs that are being
uploaded to Invenio or that already exist.
Example:
<BibDocRelation bibdoc1="tmp:NewFigure1" version1="1" bibdoc2="12456"
version2="2" type="extracted_from"/>
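
The replacement of temporary identifiers in a relation could look roughly like
the Python sketch below. The helper names (resolve_id, resolve_relation), the
dictionary representation of a relation and the identifier 9001 are all made up
for illustration; only the "tmp:" prefix convention comes from the proposal.

```python
def resolve_id(identifier, assigned):
    """Map a "tmp:" identifier to the Invenio identifier assigned to it."""
    if identifier.startswith("tmp:"):
        if identifier not in assigned:
            raise KeyError("unknown temporary identifier: %s" % identifier)
        return assigned[identifier]
    # Anything else is taken to be an existing Invenio identifier.
    return int(identifier)


def resolve_relation(relation, assigned):
    """Return a copy of the relation with both BibDoc references resolved."""
    resolved = dict(relation)
    for key in ("bibdoc1", "bibdoc2"):
        resolved[key] = resolve_id(relation[key], assigned)
    return resolved


# Filled in while the <BibDoc> elements of the same session are uploaded.
assigned = {"tmp:NewFigure1": 9001}

relation = {
    "bibdoc1": "tmp:NewFigure1", "version1": "1",
    "bibdoc2": "12456", "version2": "2",
    "type": "extracted_from",
}
resolved = resolve_relation(relation, assigned)
```

Raising an error on an unknown temporary identifier is a design choice of this
sketch; it makes a relation to a document missing from the session fail early.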
+++ MoreInfo
Each of these elements (BibDoc, File, BibDocRelation) can contain a definition
of MoreInfo, which holds additional pieces of information divided into
namespaces and stored in a key, value format.
Namespaces (an additional level of dictionary allowing similar key, value pairs
to be grouped; the category attribute in the example below) are intended to
minimise the possibility of conflicts between different modules using the same
MoreInfo infrastructure. They will also be useful when adapting MoreInfo to
store data in separate database tables rather than in a blob. (Should we
proceed with this soon?)
<MoreInfo>
<element category="plots" key="references" encoding="JSON">
<![CDATA[
[
{
"text": "In Figure 1 we can see the difference between (...)",
"position": 1123
},
{
"text":"(...) The results of the experiment are illustrated in Figure 1 (...)",
"position": 256
}
]
]]>
</element>
<element category="plots" key="x">10</element>
<element category="plots" key="y">20</element>
<element category="plots" key="width">600</element>
<element category="plots" key="height">400</element>
<element category="plots" key="caption">This is a caption of the
figure</element>
<!-- and some other properties assigned by a different module
- for instance the access control or general use flags -->
<element category="general" key="flags">abcds</element>
<element category="general" key="visibility">HIDDEN</element>
</MoreInfo>
Elements of MoreInfo (addressed by category and key) can be either strings or
more complicated JSON-encoded values. Using JSON is slightly clumsy in the
context of XML, which itself provides data encoding, but it seems to be the
simplest solution. We are using JSON in many places already and it seems
natural for the representation of data. Another solution would be to replace
JSON with some type of XML encoding (we would have to encode, for example,
lists) or to replace the additional BibUpload input entirely with JSON.
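
To show that the mixed XML/JSON encoding is easy to consume, here is a minimal
Python sketch that turns a MoreInfo block into a nested dictionary. The
function name parse_moreinfo and the {category: {key: value}} result layout are
assumptions made for this example only.

```python
import json
import xml.etree.ElementTree as ET


def parse_moreinfo(xml_text):
    """Turn a <MoreInfo> block into {category: {key: value}}."""
    info = {}
    for element in ET.fromstring(xml_text.strip()).findall("element"):
        text = (element.text or "").strip()
        # Values flagged as JSON are decoded; everything else stays a string.
        if element.get("encoding") == "JSON":
            value = json.loads(text)
        else:
            value = text
        info.setdefault(element.get("category"), {})[element.get("key")] = value
    return info


SAMPLE = """
<MoreInfo>
  <element category="plots" key="references" encoding="JSON"><![CDATA[
    [{"text": "In Figure 1 (...)", "position": 1123}]
  ]]></element>
  <element category="plots" key="width">600</element>
  <element category="general" key="visibility">HIDDEN</element>
</MoreInfo>
"""

info = parse_moreinfo(SAMPLE)
```

Note that plain values such as the width stay strings here; whether they should
be typed is left open.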
+++ Attaching documents to records
The FFT (Fulltext File Transfer) "MARC" tag, which allows uploading documents
and attaching them to a publication, is not flexible enough to allow attaching
the same document to many records or to allow the upload of relations between
documents.
It is, though, a very convenient way of uploading documents that are full
texts and so by nature are attached (at least initially) to only one record.
This syntax should be preserved, but its usage should be limited to fulltexts.
The semantics of FFT should be understood as an abbreviated form of uploading
a particular type of BibDoc.
Besides FFT, we should provide one more special "MARC" tag, BDR (or some other
name) - BibDoc Reference, which could create a link between the
modified/uploaded record and a BibDoc.
Subfields of the BDR tag should contain all pieces of information
characteristic of the link between a record and a BibDoc. Such information
includes, for example, the type of BibDoc (one BibDoc may be the Main document
of one record while only a figure in another).
Example of linking to an existing BibDoc:
<record>
<controlfield tag="001">234</controlfield>
<datafield tag="BDR">
<subfield code="a">12</subfield> <!--the identifier of BibDoc -->
<subfield code="r">number of a document to reference</subfield>
<subfield code="t">Main</subfield> <!--the type of the link -->
<!-- other subfields characteristic to the relation -->
</datafield>
</record>
Example of linking to a document being uploaded in parallel:
<record>
<controlfield tag="001">234</controlfield>
<datafield tag="BDR">
<subfield code="a">tmp:NewDocument</subfield> <!--the identifier of
BibDoc -->
<subfield code="r">number of a document to reference</subfield>
<subfield code="t">Main</subfield> <!--the type of the link -->
<!-- other subfields characteristic to the relation -->
</datafield>
</record>
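
Extracting the proposed BDR fields from a record could be as simple as the
following Python sketch. BDR itself is only a proposal at this point, and the
function bdr_links and its output format are hypothetical.

```python
import xml.etree.ElementTree as ET


def bdr_links(record_xml):
    """Return one {subfield code: value} dictionary per BDR datafield."""
    record = ET.fromstring(record_xml.strip())
    links = []
    for field in record.findall("datafield[@tag='BDR']"):
        links.append({
            subfield.get("code"): (subfield.text or "").strip()
            for subfield in field.findall("subfield")
        })
    return links


SAMPLE = """
<record>
  <datafield tag="BDR">
    <subfield code="a">tmp:NewDocument</subfield>
    <subfield code="t">Main</subfield>
  </datafield>
</record>
"""

links = bdr_links(SAMPLE)
```

Each resulting dictionary carries the BibDoc reference (subfield a) together
with the properties of the link, such as its type (subfield t).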
??? Should we always attach a whole document, or only a particular version of
it (or mark that all versions are attached)?
The behaviour of all proposed extensions should be uniform with the current
behaviour of BibUpload when working in insert, update, append and correct
modes.
+++ A larger example - uploading two new BibDocs, attaching them to two
existing records and marking that they are extracted from a fulltext document
of the given record.
In this example we assume that 576 is the identifier of the fulltext bibdoc
corresponding to the updated record.
The additional BibUpload input file:
<BibDoc id="tmp:NewFigure1">
<File format=".png" path="/tmp/figure.png"/>
<File format=".jpg" path="/tmp/figure.jpg"/>
<File format=".svg" path="/tmp/figure.svg">
<MoreInfo>
<!-- here, for example, information encoded by a different module stating that
this file cannot be published because of copyright problems (just an example)
-->
</MoreInfo>
</File>
<MoreInfo>
<!-- in this example we upload only the text present inside a figure as
an example of metadata fitting at this place -->
<element category="plots" key="internal_text">\tau neutrino NCGS axis
(...)</element>
</MoreInfo>
</BibDoc>
<BibDoc id="tmp:NewFigure2">
<File format=".png" path="/tmp/figure2.png"></File>
<File format=".jpg" path="/tmp/figure2.jpg"></File>
<!-- additional MoreInfo descriptions and other pieces of MetaData-->
</BibDoc>
<!-- the description of the relation between the new BibDoc describing a figure
and the existing fulltext document saved in BibDoc 576, version 1.
This relation does not depend on format. -->
<BibDocRelation bibdoc1="tmp:NewFigure1" version1="tmp:NewFigure1:lastver"
bibdoc2="576" version2="1" type="extracted_from">
<MoreInfo>
<element category="plots" key="references" encoding="JSON">
<![CDATA[
[
{
"text": "In Figure 1 we can see the difference between (...)",
"position": 1123
},
{
"text":"(...) The results of the experiment are illustrated in Figure 1 (...)",
"position": 256
}
]
]]>
</element>
<element category="plots" key="page">2</element>
<element category="plots" key="x">10</element>
<element category="plots" key="y">20</element>
<element category="plots" key="width">600</element>
<element category="plots" key="height">400</element>
<element category="plots" key="caption">This is a caption of the
figure</element>
<!-- and some other properties assigned by a different module
- for instance the access control or general use flags -->
<element category="general" key="flags">abcds</element>
<element category="general" key="visibility">HIDDEN</element>
</MoreInfo>
</BibDocRelation>
<BibDocRelation bibdoc1="tmp:NewFigure2" version1="tmp:NewFigure2:lastver"
bibdoc2="576" version2="1" type="extracted_from">
<!-- here additional properties, similarly to the first relation -->
</BibDocRelation>
<BibDocRelation bibdoc1="tmp:NewFigure2" version1="tmp:NewFigure2:lastver"
bibdoc2="tmp:NewFigure1" version2="tmp:NewFigure1:lastver"
type="is_subfigure_of">
<MoreInfo>
<!-- some data here -->
</MoreInfo>
</BibDocRelation>
The "MARC" file:
<record>
<controlfield tag="001">234</controlfield>
<datafield tag="BDR">
<subfield code="a">tmp:NewFigure1</subfield> <!--the identifier of BibDoc -->
<subfield code="t">Figure</subfield> <!--the type of the link -->
<!-- other subfields characteristic to the relation -->
</datafield>
<datafield tag="BDR">
<subfield code="a">tmp:NewFigure2</subfield> <!--the identifier of BibDoc -->
<subfield code="t">Figure</subfield> <!--the type of the link -->
<!-- other subfields characteristic to the relation -->
</datafield>
</record>
Thank you for reading this rather long e-mail. There are still some issues that
have not been tackled here, but they are not as pressing as this one.
Here I provide just a short list of them.
- Uniformity of data models on different levels. (This is not an error, but it
leads to more complicated code.)
We have three different points of view on BibDocs/BibDocFiles/versions:
one is implemented in the Python API (API layer), the second in the
database and filesystem (storage layer), and a completely different one in the
presentation layer (/files pages of a record). This might be confusing and
should probably be unified a little.
- Slightly more explicit version management. Maybe I am wrong, but it feels a
little uncomfortable to have a database column referring to an entity that is
encoded only in the file name stored in the file system (the bibdoc version).
- The BibDoc Python class in fact reflects the needs of fulltexts. It should
probably be stripped of functions that are typical of fulltext treatment (e.g.
functions extracting fulltext); they should be moved to a subclass.
- Automatic transformation of MoreInfo into dynamic database tables.
- Behaviour of bibdocs upon update - currently a new version is created every
time we change something. Maybe there should be a new version only if we change
the file and not the metadata?
cheers
Piotr
________________________________________
From: Piotr Praczyk
Sent: 07 June 2011 18:06
To: Samuele Kaplun
Cc: project-cdsware-developers (CDS Invenio developers); Salvatore Mele
Subject: RE: [inspire-dev] Limitations with having standalone BibDocs
Hello
I have a branch, but it is still in my private repository. I will hopefully
push it to the public one today.
(I am struggling with weird problems with the regression tests... was
BibRecDocsTest.test_BibRecDocs ever passing? To my eye the test expects
incorrect file sizes... and indeed it fails on my machine.)
Indeed, a type might be important for the connection between a record and a
document. In your example, though, the object is a plot in both cases.
In one case it also has the function of being part of a bigger record, and in
the other the record describes the plot itself, so there is some sense in
having the type stored in two places - in the BibDoc and in the relation to a
record - as it means something slightly different in each.
Still, there is the problem of how BibUpload should deal with non-record data
and with relations between documents.
The document identifier is internal to Invenio and should not be passed inside
the "MARC" FFT (or some similar field). Addressing by recordId/Name seems not
to be universal enough if we generalise the BibDoc file.
Piotr
________________________________________
From: Samuele Kaplun
Sent: 06 June 2011 14:18
To: Piotr Praczyk
Cc: inspire-dev
Subject: Re: [inspire-dev] Limitations with having standalone BibDocs
Hi Piotr,
On Mon, 06/06/2011 at 11.57 +0000, Piotr Praczyk wrote:
> Some time ago I was talking with Tibor about a possibility of having
> standalone BibDocs.
I think it might be better to move this discussion to the
<project-cdsware-developer> mailing list, as your issues really
touch one of the core parts of Invenio.
> Also Salvatore said that they could be very useful in the near future
> for storing different pieces of data.
Sure they are.
> While going through the source code of bibdocfile.py, I discovered two
> things that are not exactly as they were described to be.
> In particular, the possibility of having a BibDoc not attached to any
> bibrecord is explicitly blocked in the source code of BibDoc class.
>
> (The constructor tries to retrieve the identifier of the record to
> which the document is related and the case of failure, throws one of
> exceptions:
> raise InvenioWebSubmitFileError, "The docid %s associated with
> docid %s is not associated with any record" % (main_docid, docid)
> raise InvenioWebSubmitFileError, "The docid %s is not associated to
> any recid or other docid" % docid
> )
Indeed. On the other hand, since the link itself is stored in one simple
table, this limitation should be easy to remove. However, there are
several assumptions made in several parts of Invenio about this link,
namely in the bibdocfile CLI and, as you mention below, in BibUpload too.
Another enhancement I have always been thinking of adding is the possibility
of having one bibdoc attached to several records (in the end the
bibrec_bibdoc table allows for a many-to-many connection).
> Moreover, BibDoc does not really hold a type.
> The type of document is a property of the link between record and the
> document.
> Should we modify this ? This would have some deeper implications for
> BibUpload as we would have to have a possibility of uploading data not
> being associated with a record.
This is also why I would suggest talking about this on the wider mailing
list. Today the doctype is not a widely used property, although it is used
more and more in INSPIRE. On the other hand, moving it from being a
property of the connection between a bibrec and a bibdoc to being a
property of a single bibdoc would imply that a bibdoc is intrinsically
of a given type, regardless of its context. That means that, if we allow
a bibdoc to be pointed to by many records, it would no longer be possible
for it to have a different type in each of them.
A "Plot" within record A can in principle be thought as a "Main" in
record B.
This opens a big discussion about support for compound digital objects
which, since you propose to extend bibdocfile, it would be great to
address fully and in the best manner (e.g. taking into consideration METS
and OAI-ORE use cases).
Shall we build a task force on the topic?
Cheers!
Sam
P.S. On another subject, we are accumulating use cases for your
extended MoreInfoBibDocFile, to be able to attach properties either at
the level of BibDocFile or at the level of the BibDoc. Have you by chance
already started working on it, and do you have a branch with some draft
to play with?
--
Samuele Kaplun
Invenio Developer ** <http://invenio-software.org/>