Dennis E. Hamilton created COR-31:
-------------------------------------

             Summary: Identification of Document Format Tool Progressions: 
Access, Creation, Testing, Assessment, Validation, Forensics
                 Key: COR-31
                 URL: https://issues.apache.org/jira/browse/COR-31
             Project: Corinthia
          Issue Type: Task
            Reporter: Dennis E. Hamilton


There are many needs, and opportunities, for command-line and library-level 
tools that support the development of processors for different document 
formats.  

Many small tools can be developed as part of the application and verification 
of what will be larger solutions with regard to particular formats.  

This task is for identifying which such tools will be defined as 
work products and deliverables for Corinthia, even if only in an initial 
provisional list.  Having an identified structure of defined deliverables 
should help make different aspects of Corinthia available for development 
and testing by many hands and eyes.

SKETCH

There are different levels of tools, and the layers provide fixtures for 
exercising lower layers of code and also composing them into layers above.

To be concrete, here is a sketch of the levels of tooling that can be 
byproducts and aids in the confirmation of correct handling of a document 
format.

There are two "raw" formats that are handled in building document files of 
interest to us: text files and Zip packages (or other carriers of composite 
structures, such as MIME multi-part, tar files, Microsoft DocFiles, etc.).  

There are flat file formats atop text-file formats.  Examples are Microsoft 
RTF, XML, and HTML.  These are accompanied by character-set encoding variations 
that must be dealt with.  There are also cases of linking that arise in these 
formats.

RTF is a document format.  XML carries document formats such as the single-file 
ODF format, the single-file XML formats defined for Microsoft Office, etc.  
There are already HTML-format usages that provide for fidelity preservation in 
round trip between HTML and Microsoft Office formats.  There may be something 
similar that has lived in OpenOffice.org.  These are very handy formats for 
creation of simple test documents that exercise the respective document models.  
They also provide experience with the document formats and efforts to abstract 
the document that is represented in those formats.

Zip usage as a carrier raises its own needs for well-defined tools, both for 
inspection of document files and for validation and forensic analysis of the 
Zip usage for ODF, OOXML, and other formats, such as ePub.  Now we're dealing 
with composite document files whose multiple parts use flat formats, such as 
HTML and XML, and other formats, including binary formats not mentioned as 
part of this progressive layering.  There are now more elaborate structures 
to abstract from the parts of the Zip package and the cross-references among 
them.

These are all tooling opportunities and they support the testing and 
confirmation of the development of the document-processing functions that 
Corinthia makes available.

The richness of this can be illustrated by the need for forensic and validation 
tools and how they may become interdependent.

Consider the simple verification of a Zip file.  There are two levels of 
verification that matter.  

First, there is the fundamental invariant structure that a Zip archive must 
possess.  In practical use, it is desirable to rapidly recognize the presence 
of a correct Zip and its components.  It is also desirable to be able to 
produce or update one efficiently.  One wants a fail-safe and resilient 
response when an unacceptable Zip is encountered.
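
As a minimal sketch of that first, fast level of checking, here is what a 
fail-safe probe could look like using Python's standard zipfile module.  The 
function name quick_zip_check is hypothetical and not part of any Corinthia 
code; it only illustrates "recognize a correct Zip quickly, fail safely 
otherwise".

```python
import io
import zipfile
from typing import List, Optional


def quick_zip_check(data: bytes) -> Optional[List[str]]:
    """Return the member names if data parses as a Zip archive, else None.

    Hypothetical helper for illustration only. It reads the central
    directory and nothing else, so it is fast but does not verify the
    content of any member.
    """
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as zf:
            return zf.namelist()
    except zipfile.BadZipFile:
        # Fail safe: report "not acceptable" instead of propagating the error.
        return None
```

A caller gets either a list of parts to work with or a plain None, and the 
question of why the Zip is unacceptable is left to a separate tool.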

At the same time, one wants a way to assess and inspect a Zip, whether it is 
well-formed or considered defective.  A separate tool would be handier for 
that, but it is still needed to support document processing by providing 
inspection and reporting of how the Zip is unacceptable.  That's more involved 
and not something one wants to endure just to get going with a document.  
There is also a good case for some reused common code, and these kinds of 
tools aid in the confirmation of that code too.

Suppose a Zip is concluded to be damaged.  Another level goes beyond 
detection of damage to determination of how much of the Zip can be recovered 
and what to do with the areas of damage.  This is about rescuing documents.  
Yet another opportunity.  Yet another elaborate use that can involve some 
shared underlying code.
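
To make the recovery idea a little more concrete, here is a rough sketch, 
assuming the damage is a lost or truncated central directory.  Each member of 
a Zip is preceded by a local file header that starts with the signature bytes 
"PK\x03\x04", so a forensic tool can scan for those even when the normal 
reader gives up.  Both function names are hypothetical, and a real tool would 
have to rule out signature bytes that occur by chance inside member data.

```python
import io
import zipfile
from typing import List

# Signature that begins every local file header in a Zip archive.
LOCAL_HEADER_SIGNATURE = b"PK\x03\x04"


def local_header_offsets(data: bytes) -> List[int]:
    """Return every offset at which a local file header signature occurs.

    Hypothetical forensic helper: the offsets are only candidates, since
    the same four bytes can also appear inside member data.
    """
    offsets = []
    pos = data.find(LOCAL_HEADER_SIGNATURE)
    while pos != -1:
        offsets.append(pos)
        pos = data.find(LOCAL_HEADER_SIGNATURE, pos + 1)
    return offsets


def central_directory_readable(data: bytes) -> bool:
    """Report whether the standard reader can open the archive at all."""
    try:
        with zipfile.ZipFile(io.BytesIO(data)):
            return True
    except zipfile.BadZipFile:
        return False
```

When central_directory_readable reports False but local_header_offsets still 
finds candidates, there may be parts worth rescuing.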

We're now at the second level and that intersects with the use of a Zip as a 
particular kind of document container.  A Zip may be well-formed, but there are 
additional limitations and functions that go into recognizing the Zip usage as 
a carrier of a particular document format.  It can even be a generic carrier 
format, such as the Open Packaging Conventions (OPC) used for carrying OOXML, 
XPS, and other artifacts, and the OpenDocument 1.2 Package used for carrying 
ODF.
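
A sketch of what recognition at this second level might look like, as a 
heuristic only.  It assumes that an OPC package is recognizable by its 
"[Content_Types].xml" part, and that ODF and ePub packages carry a "mimetype" 
part whose content names the format.  The function name classify_package and 
the returned labels are hypothetical, and a proper validator would check far 
more (part ordering, compression of "mimetype", the manifest, and so on).

```python
import io
import zipfile


def classify_package(data: bytes) -> str:
    """Guess which generic package convention a well-formed Zip follows.

    Hypothetical heuristic for illustration; returns one of
    "OPC", "ODF", "EPUB", or "unknown".
    """
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        names = zf.namelist()
        if "[Content_Types].xml" in names:
            return "OPC"
        if "mimetype" in names:
            mimetype = zf.read("mimetype").decode("ascii", "replace").strip()
            if mimetype == "application/epub+zip":
                return "EPUB"
            if mimetype.startswith("application/vnd.oasis.opendocument."):
                return "ODF"
        return "unknown"
```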

There need to be analysis and inspection tools at this second level of generic 
Zip usage.  This also has a cross-over value in the forensic problem of 
recovering what is recoverable in a damaged Zip archive.  When it is known what 
additional structure is expected to be present, this can inform the 
identification of breakage and determination of loss.

It's not all one-sided.  What appears to be a well-formed Zip package for a 
given document format can still expose damage in the recording or compression 
(oh yes, compression and decompression) of any of its parts.
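
For instance, an archive whose directory structure is perfectly readable can 
still hold a part whose stored bytes no longer match the recorded CRC-32.  A 
minimal sketch using the testzip method of Python's zipfile module, which 
reads every member and checks its CRC; the wrapper name first_damaged_member 
is hypothetical.

```python
import io
import zipfile
from typing import Optional


def first_damaged_member(data: bytes) -> Optional[str]:
    """Return the name of the first member that fails its CRC check.

    Returns None when every member reads back cleanly. Hypothetical
    wrapper; assumes the archive structure itself is readable.
    """
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        # testzip() reads all members and reports the first bad one.
        return zf.testzip()
```

This is the slow, thorough check, since every part has to be read and 
decompressed in full.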

This sketch is still at the plumbing level.  The abstraction of document 
features is yet to happen.  That means rising up another level.

This is all just to point out how many opportunities for tools and supporting 
libraries there are. The tools are important for bootstrapping up the levels of 
Corinthia and for being able to check our own work, to devise tests and 
demonstrations, and to provide forensic support in the face of problems that 
may arise in the software or simply in circumstances that arise for users.

The idea behind this task and its subtasks is to see what could be identified 
as point deliverables, even if fundamentally for our own work process, so that 
they become definable and something to work on, to be available in higher 
levels of operation, etc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
