On 24.06.24 23:28, jeremy ardley wrote:
[...]You have your content in a neutral format [...]
ooxml is far from "neutral"...
What you don't do is use these output formats as your primary content.
Obviously not. That's why they are publishing formats: you send them in to be published, not to be edited further. Formats like odf and ooxml, by contrast, are content creation and editing formats, not publishing formats.
Examples relevant to debian include package documentation such as man pages, markdown, doxygen, docbook, latex.
That's just the same category as HTML, so that remark doesn't add anything to the discussion.
In my most recent experience, OCR of pdf documents is quite difficult if the layout is significant such as in bank statements. There are various tools to assist in extracting the content but it's quite marginal.
Depends on the software you use. As far as I know, Abbyy has very capable OCR software (I think it's called FineReader) that can handle various layouts and difficult-to-read fonts, as in very old ones. That was already the case about a decade ago, and I doubt the software has gotten any worse. But of course it's not available on Linux.
On the other hand, give a screenshot of a bank statement to the 'ai' GPT4 and ask it to extract all transactions in csv format and it is done perfectly.
As I said, AI can help with that; Abbyy is using it too. The only question is whether locally run AI can do that too, as anything else is a guaranteed breach of data protection laws.
On a sidenote, PDF is basically Postscript on steroids. Its entire purpose is to describe how content is to be placed on a printed page. On a side-side note Postscript is actually a programming language with specialty in text layout but quite capable of doing significant computation activities - so long as your output eventually gets rendered on a page.
Never said anything contrary to that. Just that it's a format that is very difficult to handle, because many things weren't defined prior to PDF 2.0. You had a predefined feature set, but nobody told you how to implement it, so chances were high that things wouldn't work as intended in every reader. But it seems the community of PDF reader programmers found common ground long before PDF 2.0 was a thing, so at least the standardized things would work everywhere.