I don't disagree that this should be part of our long-term vision, and that we need people who can track this and advise the community on its development and implementation. That said, I don't see how we would export to this format, or expand to it, in the wiki form.
I am concerned that we have so many basic issues unresolved, and so little developer time, that the mundane tasks are not being addressed. :-/

Regards, Billinghurst

On Mon, Oct 5, 2015 at 10:04 PM Federico Leva (Nemo) <nemow...@gmail.com> wrote:

> I'm finding this document quite useful:
> http://www.succeed-project.eu/sites/default/files/deliverables/Succeed_600555_WP4_D4.1_RecommendationsOnFormatsAndStandards_v1.1.pdf
>
> See description of ALTO pasted below, which is a followup to
> https://lists.wikimedia.org/pipermail/wikisource-l/2014-September/002081.html
> . We should find a way to convert the transcribed books' HTML to ALTO
> format. :)
>
> Some libraries are apparently using
> http://www.primaresearch.org/tools/Aletheia which seems an augmented
> (but unfree?!) version of ScanTailor with some different purpose.
>
> Nemo
>
> Principles
> ALTO stores layout information and OCR recognized text of pages of any
> kind of printed documents like books, journals and newspapers. ALTO can
> detail technical metadata for describing the layout and content of
> physical resources (text, illustrations, graphics).
> ALTO describes a content page with different views:
>
> The Description section helps to describe some general settings and
> information of the ALTO file (measurement units, file name, etc.), and
> the production process itself (processing steps, software used, dates
> and actors, etc.)
>
> The Layout section contains what's on the page. A page is divided into
> several regions (print space; left, right, top and bottom margins). For
> each region, all objects are listed which have been detected inside:
> text blocks, illustrations, graphical elements, composed blocks. Each
> object previously identified is defined by generic attributes: width,
> height, text content (for the String element). Besides, the reading
> order of all the elements can be managed.
>
> Each ALTO file may also contain a style section where different styles
> (for paragraphs and fonts) are listed.
>
> Use cases
> ALTO is one of the most common formats used by libraries for converting
> text from images. It's used both to deliver digitized contents and to
> preserve these contents.
> In a delivery perspective, the ability of ALTO to store the text content
> coordinates in a page allows the overlay of image and text (multilayer
> PDF) and highlight search words in a query.
>
> _______________________________________________
> Wikisource-l mailing list
> Wikisource-l@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikisource-l
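
For anyone wondering what the structure described in the quoted text looks like in practice, here is a rough Python sketch. It is not a schema-valid ALTO file and not a proposal for how Wikisource would export: it leaves out the XML namespace, the Styles section, and most metadata. The element and attribute names (Description, MeasurementUnit, Layout, Page, PrintSpace, TextBlock, TextLine, String with CONTENT/HPOS/VPOS/WIDTH/HEIGHT) come from my reading of the ALTO schema and should be checked against it. The function names build_alto_page and find_word_boxes, the default file name, and the sample coordinates are made up for illustration.

```python
import xml.etree.ElementTree as ET


def build_alto_page(words, page_width, page_height, file_name="page.png"):
    """Build a simplified, ALTO-like tree for one page holding one text line.

    `words` is a list of (text, hpos, vpos, width, height) tuples.
    Hypothetical helper: no namespace, no Styles section, not validated.
    """
    alto = ET.Element("alto")

    # Description: general settings such as measurement unit and file name.
    description = ET.SubElement(alto, "Description")
    ET.SubElement(description, "MeasurementUnit").text = "pixel"
    source = ET.SubElement(description, "sourceImageInformation")
    ET.SubElement(source, "fileName").text = file_name

    # Layout: page -> print space -> text block -> text line -> strings.
    layout = ET.SubElement(alto, "Layout")
    page = ET.SubElement(
        layout, "Page", ID="P1", WIDTH=str(page_width), HEIGHT=str(page_height)
    )
    print_space = ET.SubElement(page, "PrintSpace")
    block = ET.SubElement(print_space, "TextBlock", ID="TB1")
    line = ET.SubElement(block, "TextLine")
    for text, hpos, vpos, width, height in words:
        ET.SubElement(
            line,
            "String",
            CONTENT=text,
            HPOS=str(hpos),
            VPOS=str(vpos),
            WIDTH=str(width),
            HEIGHT=str(height),
        )
    return alto


def find_word_boxes(alto, query):
    """Return (hpos, vpos, width, height) for each String equal to `query`.

    The comparison ignores case; this is the lookup a viewer would need in
    order to highlight a search hit on the page image.
    """
    boxes = []
    for string in alto.iter("String"):
        if string.get("CONTENT", "").lower() == query.lower():
            boxes.append(
                tuple(int(string.get(k)) for k in ("HPOS", "VPOS", "WIDTH", "HEIGHT"))
            )
    return boxes


if __name__ == "__main__":
    sample = [("Hello", 100, 200, 80, 30), ("world", 190, 200, 85, 30)]
    page = build_alto_page(sample, page_width=1000, page_height=1500)
    print(ET.tostring(page, encoding="unicode"))
    print(find_word_boxes(page, "world"))
```

The second function illustrates the delivery use case from the quoted description: because each String carries its own coordinates, a search hit can be mapped back to a box on the page image.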