I don't disagree that this should be part of our long-term vision, and that
we need people who can track this and advise the community on its
development and implementation. That said, I don't see how we would export
to it or expand to it in the wiki form.

I am concerned that we have so many basic issues unresolved, and so little
developer time, that even the mundane tasks are not being addressed. :-/

Regards, Billinghurst

On Mon, Oct 5, 2015 at 10:04 PM Federico Leva (Nemo) <nemow...@gmail.com>
wrote:

> I'm finding this document quite useful:
>
> http://www.succeed-project.eu/sites/default/files/deliverables/Succeed_600555_WP4_D4.1_RecommendationsOnFormatsAndStandards_v1.1.pdf
>
> See description of ALTO pasted below, which is a followup to
>
> https://lists.wikimedia.org/pipermail/wikisource-l/2014-September/002081.html
> . We should find a way to convert the transcribed books' HTML to ALTO
> format. :)
>
> Some libraries are apparently using
> http://www.primaresearch.org/tools/Aletheia which seems an augmented
> (but unfree?!) version of ScanTailor with some different purpose.
>
> Nemo
>
> Principles
>
> ALTO stores layout information and OCR recognized text of pages of any
> kind of printed documents like books, journals and newspapers. ALTO can
> detail technical metadata for describing the layout and content of
> physical resources (text, illustrations, graphics).
>
> ALTO describes a content page with different views:
>
> The Description section helps to describe some general settings and
> information of the ALTO file (measurement units, file name, etc.), and
> the production process itself (processing steps, software used, dates
> and actors, etc.)
>
> The Layout section contains what's on the page. A page is divided into
> several regions (print space; left, right, top and bottom margins). For
> each region, all objects are listed which have been detected inside:
> text blocks, illustrations, graphical elements, composed blocks. Each
> object previously identified is defined by generic attributes: width,
> height, text content (for the String element). Besides, the reading
> order of all the elements can be managed.
>
> Each ALTO file may also contain a style section where different styles
> (for paragraphs and fonts) are listed.
>
> Use cases
>
> ALTO is one of the most common formats used by libraries for converting
> text from images. It's used both to deliver digitized contents and to
> preserve these contents.
>
> In a delivery perspective, the ability of ALTO to store the text content
> coordinates in a page allows the overlay of image and text (multilayer
> PDF) and highlight search words in a query.
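
To make the structure described above concrete, here is a rough Python
sketch (not from the Succeed document, and not checked against the ALTO
schema). It builds a simplified, non-namespaced ALTO-like tree with a
Description and a Layout section, then looks up the coordinates of a search
word, which is what a viewer would need in order to highlight it. The word
boxes, the file name and the helper names build_alto and find_word_boxes
are all made up for illustration; a real ALTO file also carries an XML
namespace and many more attributes.

```python
import xml.etree.ElementTree as ET

# Hypothetical input: words on one text line with pixel boxes (x, y, width, height).
WORDS = [("Hello", 100, 200, 120, 40), ("world", 240, 200, 130, 40)]


def build_alto(words, file_name="page1.png"):
    """Build a simplified ALTO-like tree (no namespace, not schema-validated)."""
    alto = ET.Element("alto")

    # Description: general settings such as measurement unit and file name.
    desc = ET.SubElement(alto, "Description")
    ET.SubElement(desc, "MeasurementUnit").text = "pixel"
    src = ET.SubElement(desc, "sourceImageInformation")
    ET.SubElement(src, "fileName").text = file_name

    # Layout: Page -> PrintSpace -> TextBlock -> TextLine -> String.
    layout = ET.SubElement(alto, "Layout")
    page = ET.SubElement(layout, "Page", ID="P1")
    space = ET.SubElement(page, "PrintSpace")
    block = ET.SubElement(space, "TextBlock", ID="TB1")
    line = ET.SubElement(block, "TextLine")
    for text, x, y, w, h in words:
        # Each String carries its text content and its position on the page.
        ET.SubElement(
            line, "String",
            CONTENT=text, HPOS=str(x), VPOS=str(y), WIDTH=str(w), HEIGHT=str(h),
        )
    return alto


def find_word_boxes(alto, query):
    """Return (x, y, width, height) for each String equal to query, ignoring case."""
    boxes = []
    for s in alto.iter("String"):
        if s.get("CONTENT", "").lower() == query.lower():
            boxes.append(tuple(int(s.get(k)) for k in ("HPOS", "VPOS", "WIDTH", "HEIGHT")))
    return boxes


alto_root = build_alto(WORDS)
print(ET.tostring(alto_root, encoding="unicode"))
print(find_word_boxes(alto_root, "world"))
```

Converting transcribed HTML would additionally need the word coordinates
from the original OCR layer, since the wiki text alone does not carry them.
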
>
> _______________________________________________
> Wikisource-l mailing list
> Wikisource-l@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikisource-l
>
