Re: [Wikisource-l] Proofread extension "extraction" of OCR text in Djvu

Alex Brollo Wed, 17 Jul 2013 03:58:53 -0700

Just a brief comment about the djvu text layer, using IA files to dig
deeper into the topic.

FineReader OCR stores incredibly detailed information in a proprietary
format; the various FineReader versions then export part of this
extremely rich set of information into different outputs, one of them
being the djvu text layer. It's worth noting that, even though any
information stored in the djvu text layer can be extracted and used, the
set of information wrapped into the djvu text layer (in either the
lisp-like format or the xml format) is only a minor subset of the original
OCR information.

If someone is interested in getting much more information, they can find it
in the abbyy.xml output; Internet Archive offers it as abbyy.gz in its list
of exportable files. It's a very heavy and complex xml structure, but it is
possible to parse it and to extract from it any information wrapped into
the djvu text layer and much more. Most interesting is wordPenalty, that
is, a word-by-word summary of the degree of uncertainty of the OCR
recognition of the whole word.
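
A minimal sketch of such a parse, in Python. It assumes the ABBYY layout in
which every character is a charParams element, wordStart marks the first
character of a word, and wordPenalty is repeated on each character of the
word; those names should be checked against a real abbyy.gz file. The
function name iter_words and the path in the usage note are hypothetical.

```python
import gzip
import xml.etree.ElementTree as ET


def local_name(tag):
    """Drop an XML namespace prefix such as '{http://...}charParams'."""
    return tag.rsplit("}", 1)[-1]


def iter_words(source):
    """Yield (word, penalty) pairs from an ABBYY-style XML stream.

    Assumption: one <charParams> element per character, with the
    attributes wordStart and wordPenalty as described above.
    """
    chars, penalty = [], None
    # iterparse streams the file, which matters because abbyy.xml is huge.
    for _event, elem in ET.iterparse(source, events=("end",)):
        name = local_name(elem.tag)
        if name == "charParams":
            ch = elem.text or ""
            starts_word = elem.get("wordStart") in ("true", "1")
            # A new word or a blank closes the word collected so far.
            if (starts_word or ch.isspace()) and chars:
                yield "".join(chars), penalty
                chars, penalty = [], None
            if ch and not ch.isspace():
                chars.append(ch)
                if penalty is None and elem.get("wordPenalty") is not None:
                    penalty = int(elem.get("wordPenalty"))
        elif name == "line":
            # End of a text line: flush the last word and free memory.
            if chars:
                yield "".join(chars), penalty
                chars, penalty = [], None
            elem.clear()
```

It could be used as `with gzip.open("work_abbyy.gz", "rb") as f: words =
list(iter_words(f))`; words that carry no wordPenalty attribute come back
with None as their penalty.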

We (Aarti and I) are digging into this mess, with fast preliminary results;
you can see at [[it:w:Utente:Alex brollo/Sandbox]] some brief pieces of
text extracted from abbyy.gz, where doubtful words (in the opinion of the
OCR software) are red. They can be easily managed by VisualEditor, since
the colour comes simply from a plain span tag.
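
As an illustration of the span idea, here is a small Python sketch that
takes (word, penalty) pairs and wraps the doubtful ones in a red span. The
function name mark_doubtful and the threshold of 20 are hypothetical, and
it assumes that a higher penalty means a less certain word; the markup in
the sandbox page may well differ.

```python
import html


def mark_doubtful(words, threshold=20):
    """Join words into one string, wrapping doubtful ones in a red span.

    `words` is an iterable of (word, penalty) pairs; penalty may be None
    when the OCR output gave no score. `threshold` is an arbitrary
    illustrative cut-off, not a value taken from FineReader.
    """
    out = []
    for word, penalty in words:
        text = html.escape(word)  # keep stray '<' or '&' from breaking markup
        if penalty is not None and penalty > threshold:
            text = '<span style="color:red">%s</span>' % text
        out.append(text)
    return " ".join(out)
```

Keeping the markup down to one plain span per word is what should let an
editor such as VisualEditor treat the flagged words as ordinary text.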

Now I'm waiting for Aarti's work; as soon as a VisualEditor for nsPage is
running, it will be possible to extract text by bot from abbyy.gz (if the
work comes from IA) and to upload such text as OCR.

Alex

2013/7/16 David Cuenca <[email protected]>

> Hi Aubrey,
> Thanks for the heads-up, I have CC'ed Sébastien from fr-ws, he worked on
> the djvu text extraction/merging and he was interested in following-up on
> that. Maybe he has some fresh ideas about it.
>
> Micru
>
> On Tue, Jul 16, 2013 at 10:24 AM, Andrea Zanni 
> <[email protected]> wrote:
>
>> Hi David, Aarti, thibaud and Tpt,
>> please look at this thread:
>>
>> http://en.wikisource.org/wiki/Wikisource:Scriptorium#EPUB.2FHTML_to_Wikitext
>> especially the last message.
>>
>> It seems George Orwell III knows his stuff about Djvu and Proofread
>> extension,
>> and it's probably worth digging into this "layer text" djvu thing.
>>
>> Even if I might dream of an ideal solution (a "layered structure" for
>> wikisource, in which text can be marked up several times in different
>> layers), that is probably very far away.
>>
>> But it's still important to pave the way for further improvements, I
>> guess:
>> losing all the information from a formatted, mapped IA djvu is not a
>> good thing to do, IMHO.
>> And the Visual Editor could help us, in the future, to keep some of that
>> information (italics, bold, etc.)
>>
>> I know Aarti spoke with Alex about abbyy.xml: is it possible to do
>> something with it?
>>
>> Aubrey
>>
>
>
>
> --
> Etiamsi omnes, ego non
> _______________________________________________
> Wikisource-l mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/wikisource-l
>
>

