To All Fellow SIG Members,

 

The current discussion about print-page numbers and digital item IDs is an 
ideal opportunity for me to provide an update to the group on my current 
#DATeCH2017 submissions and current activity working toward the development of 
a #GTS (Ground Truth Storage) format for magazines, newspapers, and related 
serial publications.

 

As part of FactMiners, our #CitizenScience research project, I am working with 
the good folks at PRImA, the Pattern Recognition and Image Analysis research 
lab at the U. of Salford. Through this collaboration I am developing FactMiners' 
MAGAZINE #GTS format, which is based on a #cidocCRM/FRBRoo/PRESSoo 'ontological 
stack.' 
Our goal is to provide integrated complex document structure and content 
depiction models in support of eResearch and machine learning access to digital 
collections of historic documents. The FactMiners' MAGAZINE format is being 
evolved as a superset of PRImA's PAGE #GTS format.

 

As some of you may recall from my self-introductory comments upon joining the 
SIG, I believe the complementary branches of "things" and "activity/time" in 
the CRM lend themselves to use as an executable metamodel for software 
design and development, and not just as a descriptive ontology. This idea 
was among the central ideas of my first submission to this year's DATeCH 
conference. As we "ground truthed" datasets of Softalk magazine -- curating the 
table of contents, advertiser index, mastheads, etc. of this early 
microcomputer magazine -- we developed a number of structure and content 
revealing datasets that were based on self-referential content within the 
magazine. 

 

For example, in order to begin building a visual repository of Softalk 
advertisements based on a fine-grained pattern language expressed in PRESSoo 
Issuing Rules and Issuing Rule Changes, the 7,158 sightings of advertisements 
in the 48 issues of the magazine required that we be able to transform print 
page number references to the "leaf" digital image IDs used by the Internet 
Archive, where the Softalk magazine digital collection is maintained. In fact, 
the print page number (ppg) to 'leaf' ID mapping became an issue with every 
dataset we wanted to explore.
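
To make the transformation concrete, here is a minimal Python sketch of resolving print page references to leaf IDs. Everything in it is an assumption for illustration: the mapping values, the advertiser names, and the to_leaf helper are invented and are not taken from the Softalk datasets or from our tools.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical print page number (ppg) -> 'leaf' ID map for a single issue.
# The offset is not constant because covers and unnumbered inserts
# occupy leaves that carry no print page number.
ppg2leaf: Dict[int, int] = {1: 3, 2: 4, 3: 5, 4: 8, 5: 9}

# Hypothetical ad sightings as (advertiser, print page number) pairs,
# such as those curated from an advertiser index.
sightings: List[Tuple[str, int]] = [
    ("Acme Software", 2),
    ("Example Peripherals", 4),
    ("Sample Systems", 7),
]


def to_leaf(ppg: int, mapping: Dict[int, int]) -> Optional[int]:
    """Return the leaf ID for a print page number, or None if unmapped."""
    return mapping.get(ppg)


# Resolve every sighting; unmapped pages are kept so they can be curated.
resolved = [(name, ppg, to_leaf(ppg, ppg2leaf)) for name, ppg in sightings]

for name, ppg, leaf in resolved:
    status = f"leaf {leaf}" if leaf is not None else "unmapped, needs curation"
    print(f"{name}: ppg {ppg} -> {status}")
```

The point of keeping the unmapped sightings in the result, rather than dropping them, is that each one marks a gap in the map that a curator still has to fill.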

 

This led to our development of our second #DATeCH2017 submission, "Print-Page 
Number to 'Leaf' ID Mapping in Support of eResearch and Machine-Learning at the 
Internet Archive." In this paper we identify the foundational nature of the 
"ppg2leaf" tuple based on an exploration of relevant #cidocCRM functional 
subsystems. We document many of the situations where digitization can result in 
discontinuous mapping of print page numbers to digitized images. Our Softalk 
collection was digitized by the Internet Archive's regional scanning center and 
had an impressive but insufficient "ppg2leaf" map of over 70%. This tuple-match 
was a side effect of the vigilance of our scanners in asserting page numbers 
during the center's workflow ingestion process. To determine if our experience 
was typical, we examined 265 computer magazine collections at the Archive and 
found only 29 individual print page number to leaf ID tuples in the over 1.4M 
pages contained in these collections! So, as described in our second paper, we 
have developed the Python-based "ppg2leaf_ferret" app as a metadata discovery 
and curation tool in support of eResearch and machine learning at the Internet 
Archive.
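
As a rough illustration of what "coverage" and "discontinuous mapping" mean here, the following Python sketch measures how many leaves carry an asserted print page number and where the leaf-to-page offset changes. It is not the ppg2leaf_ferret implementation; the leaf_pages list and both function names are hypothetical, and the sketch assumes page assertions are available as one value per leaf.

```python
from typing import List, Optional, Tuple

# Hypothetical page assertions for one issue: the list index is the leaf ID,
# the value is the asserted print page number, or None where none was asserted.
leaf_pages: List[Optional[int]] = [None, None, 1, 2, 3, None, None, 4, 5, 6]


def coverage(pages: List[Optional[int]]) -> float:
    """Return the fraction of leaves that carry an asserted page number."""
    if not pages:
        return 0.0
    return sum(p is not None for p in pages) / len(pages)


def offset_breaks(pages: List[Optional[int]]) -> List[Tuple[int, int]]:
    """Return (leaf, offset) at each point where leaf - ppg changes.

    The first mapped leaf always starts a run; every later entry marks a
    discontinuity, for example an unnumbered insert between numbered pages.
    """
    breaks: List[Tuple[int, int]] = []
    last_offset: Optional[int] = None
    for leaf, ppg in enumerate(pages):
        if ppg is None:
            continue
        offset = leaf - ppg
        if offset != last_offset:
            breaks.append((leaf, offset))
            last_offset = offset
    return breaks


print(f"coverage: {coverage(leaf_pages):.0%}")
print(f"offset runs start at: {offset_breaks(leaf_pages)}")
```

With a single constant offset a simple subtraction would do; it is the extra offset runs that make an explicit, curated map necessary.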

 

Using the ppg2leaf_ferret we curated the print page number to digital image 
ItemID dataset and created the first implementation of FactMiners MAGAZINE 
format #GTS metadata files. The Softalk collection is, as reported in this 
announcement (https://goo.gl/XxMcqe), the first digital magazine collection at 
the Internet Archive to provide a set of magazine-specific Ground Truth Storage 
metadata files including the all-important ppg2leaf_map. At the above 
announcement link, you will find links to ResearchGate.net pre-prints of our 
#DATeCH submissions, embedded video project updates showing progress in the 
development of the "ferret" app, and links to the initial release of our 
publication and issue level MAGAZINE #GTS files in the Softalk collection at 
the Archive.

 

For SIG members who may not have a free ResearchGate user account, here are 
links on my OneDrive cloud storage to our #DATeCH2017 submissions:

 

    https://1drv.ms/b/s!AtML1v0eUlpEgdZlN7o9RyadKi-w4A

   

and here:

 

    https://1drv.ms/b/s!AtML1v0eUlpEgeJj6hbQKZaBfBscqw

    

This is an already overly long update, so I will wrap up my current 
contribution to this print page number and itemID conversation. But as will 
become clear to any SIG member who explores the links provided above, I will 
soon be asking for some "best practice" modeling recommendations about how to 
support the linkage between an instance-specific MAGAZINE #GTS metamodel and 
its cited reference models (in our initial case, the 
#cidocCRM/FRBRoo/PRESSoo 'stack').

 

As Always Happy-Healthy Vibes to All,

-: Jim :-

 

    Jim Salmons 

    Twitter: @Jim_Salmons

 

    www.FactMiners.org (Our #CitizenScience project)

    www.SoftalkApple.com (Our #DigitalHistory project)

    www.medium.com/@Jim_Salmons/ (my #CognitiveComputing/#DigitalHumanities 
articles)

 

[[Note: snipping the long thread of interesting conversation which will be 
reflected in the list archive]]
