Hello. 

This is a follow-up to the previous E-mail about BibUpload.
Here I present what I think is the simplest method of extending BibUpload 
to fit the new requirements. Despite the long description, it should be quick 
to implement, because most of the work is already done or can be delegated to 
external libraries.
Because the proposed extensions touch core functionality of Invenio, I would 
like to get a green light before going ahead.
This part of Invenio is also crucial for my project, which cannot progress much 
until these parts are integrated into the Invenio codebase.

As I mentioned, I have a branch containing implementation of some new features 
of BibDoc - mostly extension of the MoreInfo notion and introduction of 
relations between records.
The main point of this e-mail is the ingestion of new documents into Invenio. 
The main ideas inspiring the work are STANDALONE BIBDOCS, RELATIONS BETWEEN ONE 
DOCUMENT AND MANY RECORDS and UPLOAD OF CUSTOM DATA INTO MOREINFO STRUCTURES.
I believe the proposed schema is flexible, but I am not able to predict all 
possible scenarios, so critical comments (scenarios in which something could 
break), especially before the implementation phase, would be very much 
appreciated.

Please excuse the length of the e-mail. I tried to keep everything clear 
and as short as possible, while still providing examples that might help in 
understanding the idea.
Ideally, I would like to have a working prototype of this infrastructure 
(modified according to the comments) available during the Inspire week, which 
will take place at the beginning of July.


+ Reasons why the current upload system is not sufficient and why we should 
provide a new mechanism rather than extend the existing one
  - Using BibUpload "MARC" is not a very clean solution for uploading
    non-record data such as standalone documents.
    In order to be compliant with the current BibUpload format, we would
    have to include an FFT tag inside a record tag.
    BibUpload would then have to interpret this as NOT modifying or
    uploading any record.
 
   - Packing data of a more complicated structure into "MARC" is non-intuitive.
     Of course, it is possible to encode anything in MARC, but it quickly
     becomes unreadable, and the code implementing the encoding/decoding
     of the data becomes more error-prone.
     FFT should be left for fulltext upload, where it serves its purpose
     perfectly, and should be understood as syntactic sugar providing an
     abbreviated form of a more general upload.

   - FFT stands for Fulltext File Transfer. Using it for non-fulltext
     documents leads to confusion.

   - The BibUpload "MARC" obscures the way of thinking about documents.
     The internal structure of documents and the relations among them (and
     between documents and records) are not reflected in the structure of the
     FFT field.

     Pieces of information from different subfields end up in completely
     different database entities.
     Example: the type of a document. Even the BibUpload documentation suggests
       that it is a type of document, while in fact it is a type of link
       between a document and a bibliographic record. Currently the two are
       equivalent, but they will not be once we allow attachment to
       multiple records.


+ Uploading of the documents

The current mechanism for uploading documents to Invenio is very much oriented 
towards managing fulltexts that can belong to only one record.
With the FFT syntax, it is difficult to extend BibUpload to allow attaching the 
same BibDoc to many records or creating BibDocs not related to any record.
It is also difficult to support uploading relations between existing 
BibDocs. This is because MARC encodes tree-structured data with a maximum 
nesting depth of 2 (1 in the case of special fields), whereas the data 
structures that need to be uploaded to Invenio are graphs.
BibUpload is the only gateway for uploading data to Invenio. For the sake of 
uniformity, it should remain this way.

I believe the easiest and most efficient way of adapting BibUpload to the 
extended usage of BibDocs and file attachments is to provide an additional, 
non-MARC, XML-based input that BibUpload can process.
The existing FFT (Fulltext File Transfer) field should be preserved and 
used ONLY when uploading fulltext documents.
FFT should be understood as a convenient abbreviated syntax exposing a limited 
subset of the functionality of the new XML syntax.
In addition to FFT, a new "MARC" field could be introduced: BDR (BibDoc 
Reference).
The purpose of this field would be to introduce a link between the uploaded (or 
modified) record and an existing BibDoc.
This would enable many-to-many relationships between records and documents, 
which are currently explicitly blocked in the code of the BibDoc class but 
theoretically allowed by the database structure.

In addition to extending the syntax of BibUpload, the significance of internal 
BibDoc identifiers should be increased.
It should be guaranteed that an identifier is never reused after a BibDoc is 
deleted.
This would allow document identifiers to be used inside the BibUpload input.
These identifiers might also be useful when assigning Digital Object 
Identifiers to objects uploaded to an instance of Invenio.

++ Syntax of the input encoding new elements

The additional input of BibUpload should be provided as a separate file 
containing the following tags:

<BibDoc/>

<BibDocRelation/>

+++ <BibDoc>
This tag allows uploading a document that will be managed by the Invenio 
installation.

The main subelement of BibDoc is File, which attaches a file in a particular 
format representing the BibDoc.

   <File format=".jpg" path="/tmp/some_figure.jpg" />

If the format is not specified, it is guessed based on the file extension.
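
A minimal sketch of this fallback in Python. The helper name guess_format is 
hypothetical (it is not part of the existing BibDocFile API), and a real 
implementation would probably also need to handle compound extensions.

```python
import os


def guess_format(path, explicit_format=None):
    """Return the format of a <File> entry.

    `explicit_format` stands for the value of the optional format attribute.
    When it is missing, the format is guessed from the file extension.
    """
    if explicit_format:
        return explicit_format
    # os.path.splitext keeps the leading dot, e.g. ".jpg"
    _, extension = os.path.splitext(path)
    return extension.lower()
```

For instance, guess_format("/tmp/some_figure.jpg") falls back to the extension 
of the path, while an explicitly given format always wins.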

Before the upload phase, the identifier that Invenio will assign to the
document is not known.
Input passed to BibUpload should be able to include relations between uploaded 
documents and links from documents to records. This can be achieved by 
specifying temporary document identifiers.
The BibDoc XML tag may specify the id property. Its value can be either an 
Invenio-assigned identifier (in which case the corresponding BibDoc can be 
updated) or a temporary identifier (prefixed with the "tmp:" string).
The temporary identifier is recognised by BibUpload at least within the same 
BibUpload session and allows a particular BibDoc to be referenced from XML 
elements describing different entities. During the upload process, temporary 
identifiers are replaced with newly assigned Invenio identifiers.
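
The replacement of temporary identifiers could look roughly like the sketch 
below. Everything in it is an assumption made for illustration: the class name 
IdResolver is hypothetical, and the counter merely stands in for whatever 
mechanism Invenio uses to assign real BibDoc identifiers. Symbolic versions 
such as "tmp:NewFigure1:lastver" are not handled here.

```python
TMP_PREFIX = "tmp:"


class IdResolver(object):
    """Maps temporary identifiers to identifiers assigned in one session."""

    def __init__(self, first_free_id):
        # Hypothetical stand-in for the database sequence of BibDoc ids.
        self._next_id = first_free_id
        self._assigned = {}

    def resolve(self, identifier):
        """Return a numeric BibDoc id for a real or temporary identifier."""
        if not identifier.startswith(TMP_PREFIX):
            # Already an Invenio-assigned identifier: use it as it is.
            return int(identifier)
        if identifier not in self._assigned:
            # First occurrence in this session: assign a fresh identifier.
            self._assigned[identifier] = self._next_id
            self._next_id += 1
        return self._assigned[identifier]
```

The same temporary identifier always resolves to the same new identifier 
within one resolver, which is what allows a BibDocRelation or a record to 
refer to a BibDoc defined elsewhere in the same upload.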

An example of a complete BibDoc definition:
  <BibDoc id="tmp:NewFigure1">
    <File format=".png" path="/tmp/figure.png"></File>
    <File format=".jpg" path="/tmp/figure.jpg"></File>
  </BibDoc>

+++ <BibDocRelation>

This markup element enables uploading links between BibDocs, both those being 
uploaded to Invenio and those already existing.


Example:

<BibDocRelation bibdoc1="tmp:NewFigure1" version1="1" bibdoc2="12456" 
version2="2" type="extracted_from"/>
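
For illustration, such an element could be read with the Python standard 
library as sketched below. The Relation tuple and the parse_relation function 
are hypothetical names; the sketch only extracts the five attributes used in 
the example and performs no validation of their values.

```python
import xml.etree.ElementTree as ET
from collections import namedtuple

# Hypothetical container for the attributes of one BibDocRelation element.
Relation = namedtuple(
    "Relation", ["bibdoc1", "version1", "bibdoc2", "version2", "type"])


def parse_relation(xml_text):
    """Parse a single <BibDocRelation .../> element into a Relation tuple."""
    node = ET.fromstring(xml_text)
    if node.tag != "BibDocRelation":
        raise ValueError("expected BibDocRelation, got %r" % node.tag)
    # Missing attributes come back as None; all values stay strings.
    return Relation(*(node.get(name) for name in Relation._fields))
```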


+++ MoreInfo

Each of these elements (BibDoc, File, BibDocRelation) can contain a definition 
of MoreInfo, which holds additional pieces of information divided into 
namespaces and stored in a key-value format.
Namespaces (an additional level of dictionary that groups similar key-value 
pairs) are intended to minimise the possibility of conflicts between different 
modules using the same MoreInfo infrastructure. They will also be useful when 
adapting MoreInfo to store data in separate database tables rather than in a 
blob. (Should we proceed with this soon?)

<MoreInfo>
  <element category="plots" key="references" encoding="JSON">
    <![CDATA[
      [
        {
          "text": "In Figure 1 we can see the difference between (...)",
          "position": 1123
        },
        {
          "text": "(...) The results of the experiment are illustrated in Figure 1 (...)",
          "position": 256
        }
      ]
    ]]>
  </element>
  <element category="plots" key="x">10</element>
  <element category="plots" key="y">20</element>
  <element category="plots" key="width">600</element>
  <element category="plots" key="height">400</element>
  <element category="plots" key="caption">This is a caption of the figure</element>
  <!-- and some other properties assigned by a different module
     - for instance the access control or general use flags -->
  <element category="general" key="flags">abcds</element>
  <element category="general" key="visibility">HIDDEN</element>
</MoreInfo>

Elements of MoreInfo (addressed by category and key) can be either strings or 
more complicated JSON-encoded values. Using JSON is slightly clumsy in the 
context of XML, which itself provides data encoding, but it seems to be the 
simplest solution. We already use JSON in many places, and it seems natural 
for representing data. Another solution would be to replace JSON with some 
type of XML encoding (we would have to encode lists, for example) or to 
replace the additional BibUpload input entirely with JSON.
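
To make the intended structure concrete, here is a sketch of how a MoreInfo 
block could be decoded into a two-level dictionary (category, then key). The 
function name parse_more_info is hypothetical, and the sketch assumes that the 
CDATA section is closed with the standard "]]>" terminator. Values without 
encoding="JSON" are kept as plain strings.

```python
import json
import xml.etree.ElementTree as ET


def parse_more_info(xml_text):
    """Decode a <MoreInfo> element into {category: {key: value}}."""
    root = ET.fromstring(xml_text)
    result = {}
    for element in root.findall("element"):
        raw = (element.text or "").strip()
        if element.get("encoding") == "JSON":
            value = json.loads(raw)   # lists, dictionaries, numbers...
        else:
            value = raw               # plain string value
        category = result.setdefault(element.get("category"), {})
        category[element.get("key")] = value
    return result


SAMPLE = """<MoreInfo>
  <element category="plots" key="references" encoding="JSON"><![CDATA[
    [{"text": "In Figure 1 we can see (...)", "position": 1123}]
  ]]></element>
  <element category="plots" key="x">10</element>
  <element category="general" key="visibility">HIDDEN</element>
</MoreInfo>"""

info = parse_more_info(SAMPLE)
```

After this call, info["plots"]["references"] is a Python list, while 
info["plots"]["x"] remains the string "10", since only JSON-encoded elements 
carry type information.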

+++ Attaching documents to records

The FFT (Fulltext File Transfer) "MARC" tag, which allows uploading documents 
and attaching them to a publication, is not flexible enough to allow attaching 
the same document to many records or uploading relations between documents.
It is, though, a very convenient way of uploading documents that are fulltexts 
and so, by nature, are attached (at least initially) to only one record.
This syntax should be preserved, but its usage should be limited to fulltexts.
The semantics of FFT should be understood as an abbreviated form of uploading 
a particular type of BibDoc.

Besides FFT, we should provide one more special "MARC" tag, BDR (or some other 
name), for BibDoc Reference, which would create a link between the 
modified/uploaded record and a BibDoc.
Subfields of the BDR tag should contain all pieces of information 
characteristic of the link between the record and the BibDoc. Such information 
includes, for example, the type of the BibDoc (one BibDoc may be the Main 
document of one record while being only a figure in another).
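
The idea that the type belongs to the link rather than to the document can be 
sketched with a small in-memory model. The LinkTable class is purely 
hypothetical and only mimics a many-to-many link table keyed by the pair 
(record id, document id); it does not reproduce the actual database schema.

```python
class LinkTable(object):
    """Toy model of record-document links that carry the document type."""

    def __init__(self):
        self._links = {}  # (recid, docid) -> type of the link

    def attach(self, recid, docid, doctype):
        """Attach a document to a record with a link-specific type."""
        self._links[(recid, docid)] = doctype

    def records_of(self, docid):
        """Return sorted (recid, type) pairs for all records using docid."""
        return sorted(
            (recid, doctype)
            for (recid, linked_docid), doctype in self._links.items()
            if linked_docid == docid)


links = LinkTable()
links.attach(234, 12, "Main")     # BibDoc 12 is the main document here...
links.attach(235, 12, "Figure")   # ...and only a figure in another record
```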

Example of linking to an existing BibDoc:
  <record>
    <controlfield tag="001">234</controlfield>
    <datafield tag="BDR">
      <subfield code="a">12</subfield> <!-- the identifier of the BibDoc -->
      <subfield code="r">number of a document to reference</subfield>
      <subfield code="t">Main</subfield> <!-- the type of the link -->
      <!-- other subfields characteristic to the relation -->
    </datafield>
  </record>

Example of linking to a document being uploaded in parallel:

  <record>
    <controlfield tag="001">234</controlfield>
    <datafield tag="BDR">
      <subfield code="a">tmp:NewDocument</subfield> <!-- the identifier of the BibDoc -->
      <subfield code="r">number of a document to reference</subfield>
      <subfield code="t">Main</subfield> <!-- the type of the link -->
      <!-- other subfields characteristic to the relation -->
    </datafield>
  </record>

??? Should we always attach a whole document or only a particular version of 
it? (Or mark that all versions are attached?)

The behaviour of all the proposed extensions should be consistent with the 
current behaviour of BibUpload when working in the insert, update, append and 
correct modes.


+++ A larger example - Uploading two new BibDocs, attaching them to two 
existing records and marking that they are extracted from a fulltext document 
of the given record.

In this example we assume that 576 is the identifier of the fulltext bibdoc
corresponding to the updated record.

The additional BibUpload input file:

<BibDoc id="tmp:NewFigure1">
    <File format=".png" path="/tmp/figure.png"/>
    <File format=".jpg" path="/tmp/figure.jpg"/>
    <File format=".svg" path="/tmp/figure.svg">
      <MoreInfo>
        <!-- here, for example, information encoded by a different module saying
             that this file cannot be published because of copyright problems
             (just an example) -->
      </MoreInfo>
    </File>

    <MoreInfo>
      <!-- in this example we upload only the text present inside the figure,
           as an example of metadata fitting in this place -->
      <element category="plots" key="internal_text">\tau neutrino NCGS axis (...)</element>
    </MoreInfo>
</BibDoc>

<BibDoc id="tmp:NewFigure2">
    <File format=".png" path="/tmp/figure2.png"></File>
    <File format=".jpg" path="/tmp/figure2.jpg"></File>
    <!-- additional MoreInfo descriptions and other pieces of MetaData-->
</BibDoc>

<!-- the description of the relation between the new BibDoc describing a figure
     and the existing fulltext document saved in BibDoc 576, version 1.
     This relation does not depend on format. -->

<BibDocRelation bibdoc1="tmp:NewFigure1" version1="tmp:NewFigure1:lastver" 
bibdoc2="576" version2="1" type="extracted_from">
  <MoreInfo>
    <element category="plots" key="references" encoding="JSON">
      <![CDATA[
        [
          {
            "text": "In Figure 1 we can see the difference between (...)",
            "position": 1123
          },
          {
            "text": "(...) The results of the experiment are illustrated in Figure 1 (...)",
            "position": 256
          }
        ]
      ]]>
    </element>
    <element category="plots" key="page">2</element>
    <element category="plots" key="x">10</element>
    <element category="plots" key="y">20</element>
    <element category="plots" key="width">600</element>
    <element category="plots" key="height">400</element>
    <element category="plots" key="caption">This is a caption of the figure</element>
    <!-- and some other properties assigned by a different module
     - for instance the access control or general use flags -->
    <element category="general" key="flags">abcds</element>
    <element category="general" key="visibility">HIDDEN</element>
  </MoreInfo>
</BibDocRelation>

<BibDocRelation bibdoc1="tmp:NewFigure2" version1="tmp:NewFigure2:lastver" 
bibdoc2="576" version2="1" type="extracted_from">
   <!-- additional properties here, similar to the first relation -->
</BibDocRelation>

<BibDocRelation bibdoc1="tmp:NewFigure2" version1="tmp:NewFigure2:lastver" 
bibdoc2="tmp:NewFigure1" version2="tmp:NewFigure1:lastver" 
type="is_subfigure_of">
  <MoreInfo>
    <!-- some data here -->
  </MoreInfo>
</BibDocRelation>

The "MARC" file:

  <record>
    <controlfield tag="001">234</controlfield>
    <datafield tag="BDR">
      <subfield code="a">tmp:NewFigure1</subfield> <!-- the identifier of the BibDoc -->
      <subfield code="t">Figure</subfield> <!-- the type of the link -->
      <!-- other subfields characteristic to the relation -->
    </datafield>

    <datafield tag="BDR">
      <subfield code="a">tmp:NewFigure2</subfield> <!-- the identifier of the BibDoc -->
      <subfield code="t">Figure</subfield> <!-- the type of the link -->
      <!-- other subfields characteristic to the relation -->
    </datafield>
  </record>





Thank you for reading this rather long e-mail. There are still some issues that 
have not been tackled here, but solving them is not as urgent as this one.
Here is just a short list of them.

- Uniformity of the data models on different levels. (The lack of it is not an 
  error, but it leads to more complicated code.)
  We have three different points of view on BibDocs/BibDocFiles/versions: 
  one is implemented in the Python API (API layer), the second one in the 
  database and filesystem (storage layer), and a completely different one in 
  the presentation layer (/files pages of a record). This might be confusing 
  and should probably be unified a little.
  - Slightly more explicit version management. Maybe I am wrong, but it feels 
    a little uncomfortable to have a database column referring to an entity 
    (the bibdoc version) that is encoded only in the name of a file stored in 
    the file system.

- The BibDoc Python class in fact reflects the needs of fulltexts. It should 
  probably be stripped of the functions that are specific to fulltext 
  treatment (i.e. functions extracting fulltext); they should be moved to a 
  subclass.

- Automatic transformation of MoreInfo into dynamic database tables.

- Behaviour of BibDocs upon update: currently a new version is created every 
  time we change something. Maybe there should be a new version only if we 
  change the file and not the metadata?
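
One possible way to decide this, sketched under the assumption that a digest 
of the file content is stored with each version (the function names are 
hypothetical and this is not how BibDoc currently behaves):

```python
import hashlib


def content_digest(data):
    """Return a hex digest identifying the bytes of a file."""
    return hashlib.sha256(data).hexdigest()


def needs_new_version(stored_digest, new_data):
    """A new version is needed only when the file content itself changed.

    Changes limited to metadata leave the digest, and so the version, as is.
    """
    return content_digest(new_data) != stored_digest
```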







cheers
Piotr



________________________________________
From: Piotr Praczyk
Sent: 07 June 2011 18:06
To: Samuele Kaplun
Cc: project-cdsware-developers (CDS Invenio developers); Salvatore Mele
Subject: RE: [inspire-dev] Limitations with having standalone BibDocs

Hello

I have a branch, but it is still in my private repository. I will push it to 
the public one, hopefully today.
(I am struggling with weird problems with the regression tests... was 
BibRecDocsTest.test_BibRecDocs ever passing? It seems to me that the test 
expects incorrect file sizes... and indeed it fails on my machine.)

Indeed, a type might be important for the connection between a record and a 
document. In your example, though, the object is a plot in both cases.
In one case it also has the function of being part of a bigger record, and in 
the second case the record describes the plot itself, so it makes sense to 
store the type in two places, in the BibDoc and in the relation to a record, 
as it means something slightly different in each.

Still, there is the problem of how BibUpload should deal with non-record data 
and with relations between documents.
The document identifier is internal to Invenio and should not be passed inside 
the "MARC" FFT (or some similar field). Addressing by recordId/Name seems not 
to be universal enough if we generalise the BibDoc file.


Piotr

________________________________________
From: Samuele Kaplun
Sent: 06 June 2011 14:18
To: Piotr Praczyk
Cc: inspire-dev
Subject: Re: [inspire-dev] Limitations with having standalone BibDocs

Hi Piotr,

On Mon, 06/06/2011 at 11.57 +0000, Piotr Praczyk wrote:
> Some time ago I was talking with Tibor about a possibility of having
> standalone BibDocs.

I think it might be better to move this discussion to the
<project-cdsware-developer> mailing list, as your issues really
touch one of the core parts of Invenio.

> Also Salvatore said that they could be very useful in the near future
> for storing different pieces of data.

Sure they are.

> While going through the source code of bibdocfile.py, I discovered two
> things that are not exactly as they were described to be.
> In particular, the possibility of having a BibDoc not attached to any
> bibrecord is explicitly blocked in the source code of BibDoc class.
>
> (The constructor tries to retrieve the identifier of the record to
> which the document is related and the case of failure, throws one of
> exceptions:
>    raise InvenioWebSubmitFileError, "The docid %s associated with
> docid %s is not associated with any record" % (main_docid, docid)
>    raise InvenioWebSubmitFileError, "The docid %s is not associated to
> any recid or other docid" % docid
> )

Indeed. On the other hand, since the link itself is stored in one simple
table, this limitation should be easy to remove. However, several parts
of Invenio make assumptions about this link, namely the bibdocfile CLI
and, as you mention below, BibUpload too.

Another enhancement I have always been thinking of adding is the
possibility of having one bibdoc attached to several records (in the end,
the bibrec_bibdoc table allows for a many-to-many connection).

> Moreover, BibDoc does not really hold a type.
> The type of document is a property of the link between record and the
> document.
> Should we modify this ?  This would have some deeper implications for
> BibUpload as we would have to have a possibility of uploading data not
> being associated with a record.

This is also why I would suggest talking about this on the wider mailing
list. Today the doctype is not a widely used property, although it is used
more and more in INSPIRE. On the other hand, moving it from being a
property of the connection between a bibrec and a bibdoc to being a
property of a single bibdoc would imply that a bibdoc is intrinsically
of a given type, regardless of its context. That means that, if we allow
a bibdoc to be pointed to by many records, it would no longer be possible
for it to sport different types.

A "Plot" within record A can in principle be thought as a "Main" in
record B.

This opens a big discussion about support for compound digital objects,
which, since you propose to extend bibdocfile, it would be great to
address fully and in the best manner (e.g. taking into consideration
METS and OAI-ORE use cases).

Shall we build a task force on the topic?

Cheers!
        Sam

P.S. On another subject, we are accumulating use cases for your
extended MoreInfoBibDocFile, to be able to attach properties either at
the level of a BibDocFile or at the level of the BibDoc. Have you by
chance already started working on it, and do you have a branch with some
draft to play with?


--
Samuele Kaplun
Invenio Developer ** <http://invenio-software.org/>
