Re: Publishing Formats

Richard Tue, 25 Jun 2024 01:43:43 -0700

On 24.06.24 23:28, jeremy ardley wrote:

[...]You have your content in a neutral format [...]


ooxml is far from "neutral"...

What you don't do is use these output formats as your primary content.


Obviously not. That's why they are publishing formats: you send them in to be 
published, not to be further edited. Formats like odf and ooxml, by contrast, 
are content creation and editing formats, not publishing formats.

Examples relevant to debian include package documentation such as man pages, 
markdown, doxygen, docbook, latex.


Those are in the same category as HTML, so that remark isn't adding anything 
to the discussion.

In my most recent experience, OCR of pdf documents is quite difficult if the 
layout is significant such as in bank statements. There are various tools to 
assist in extracting the content but it's quite marginal.


Depends on the software you use. As far as I know, Abbyy has very capable OCR 
software (I think it's called FineReader) that handles various layouts and 
hard-to-read fonts, as in very old ones, quite well. That was already the case 
about a decade ago and I doubt the software has gotten any worse. But of 
course it's not available on Linux.

On the other hand, give a screenshot of a bank statement to the 'ai' GPT4 and 
ask it to extract all transactions in csv format and it is done perfectly.


As I said, AI can help with that, and Abbyy is using it too. The only question 
is whether locally run AI can do that as well, since anything else practically 
guarantees a breach of data protection laws.
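
For what it's worth, once a locally run OCR pass has produced plain text, 
turning that into csv does not need AI at all. A minimal Python sketch, 
assuming a purely hypothetical statement layout of one transaction per line 
as "DD.MM.YYYY description amount" with German-style amounts (real statements 
will differ, and the function name and pattern are made up for illustration):

```python
import csv
import io
import re

# Assumed (hypothetical) line layout: "24.06.2024 Some description -1.234,56"
LINE_RE = re.compile(
    r"^(?P<date>\d{2}\.\d{2}\.\d{4})\s+(?P<desc>.+?)\s+(?P<amount>-?[\d.]+,\d{2})$"
)


def statement_to_csv(ocr_text: str) -> str:
    """Turn OCR'd statement text into CSV, skipping lines that don't match."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["date", "description", "amount"])
    for line in ocr_text.splitlines():
        m = LINE_RE.match(line.strip())
        if m is None:
            continue  # page headers, footers, OCR noise
        # Normalise "1.234,56" to "1234.56"
        amount = m.group("amount").replace(".", "").replace(",", ".")
        writer.writerow([m.group("date"), m.group("desc"), amount])
    return out.getvalue()
```

The hard part remains the OCR itself: if the layout gets scrambled there, no 
pattern like the one above will match reliably.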

On a sidenote, PDF is basically Postscript on steroids. Its entire purpose is 
to describe how content is to be placed on a printed page. On a side-side note 
Postscript is actually a programming language with specialty in text layout but 
quite capable of doing significant computation activities - so long as your 
output eventually gets rendered on a page.

I never said anything to the contrary. Just that it's a very difficult format 
to handle, because many things weren't defined prior to PDF 2.0. You had a 
predefined feature set, but nobody told you how to implement it, so chances 
were high that things wouldn't work as intended in every reader. But it seems 
the developers of PDF readers found common ground long before PDF 2.0 was a 
thing, so at least the standardized features work everywhere.
