On 24/6/24 22:22, Richard wrote:
Since it's quite OT, starting a new thread for this.
I would most certainly never call formats like ooxml or odf
“publishing formats”, they are content creation or editing formats.
From a publishing format I expect to be able to show the content as
intended (which actually neither of them can do 100 %; the
probability of messing up just isn't that big). Either you want a fixed
format, e.g. for printing, which you get with the likes of PDF, PS, SVG
or your various raster graphic formats. Or you want your content to
adapt in a foreseeable way to the viewer, i.e. HTML, usually with the
help of CSS and, worst case, JS. Sure, ooxml and odf want to be the
former, but due to technical caveats that's not necessarily possible.
With ooxml, you have several incompatible versions you can't easily
tell apart, and identical display is often impossible because documents
use proprietary fonts without embedding them by default. On top of that,
it is an abomination of a format spanning around 5500 pages plus another
1000 pages for its transitional mode, which was only standardized through
world-wide corruption. ODF usually does things way better, but support
in software beyond LibreOffice is still often lacking, though that's
not their fault since their format is much simpler, being documented
in just around 1000 pages. But still, it doesn't communicate fixed
positions, as far as I can tell doesn't imply them by telling
the software explicitly how to render fonts so that the result always
looks identical, and doesn't embed fonts (or the needed subset) by
default, so it's also kinda not fulfilling the needs.
I triggered this by saying docx and pdf are publishing formats. In the
world of professional content management that is exactly so. You have
your content in a neutral format in a version controlled storage system,
and you have the choice to publish in pdf or docx or html or epub or
whatever. What you don't do is use these output formats as your primary
content.
Examples relevant to debian include package documentation such as man
pages, markdown, doxygen, docbook, latex.
In fact I can't think of any project in debian that has pdf or docx as
the primary source of documentation.
Tools that can do this transform include pandoc, Visual Studio Code,
ghostwriter, marktext and many many more.
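As a rough sketch of that workflow, assuming pandoc is installed: the snippet below builds one pandoc command line per output format from a single source file. The file name manual.md, the build directory and the helper names are made up for illustration, and producing pdf additionally needs a PDF engine such as a LaTeX installation.

```python
import subprocess
from pathlib import Path


def build_pandoc_commands(source, formats, out_dir="build"):
    """Return one pandoc argument list per requested output format."""
    src = Path(source)
    commands = []
    for fmt in formats:
        target = Path(out_dir) / f"{src.stem}.{fmt}"
        # pandoc picks the output format from the target file extension.
        commands.append(["pandoc", str(src), "-o", str(target)])
    return commands


def publish(source, formats, out_dir="build"):
    """Run the commands; requires pandoc on PATH (and a PDF engine for pdf)."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for cmd in build_pandoc_commands(source, formats, out_dir):
        subprocess.run(cmd, check=True)
```

Calling publish("manual.md", ["html", "docx", "epub"]) would then write the three outputs next to each other, while the markdown file stays the only thing under version control.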
And no, editing a PDF as docx isn't the easiest way to edit a PDF, not
to mention the best, especially not with some ominous web tool. Maybe
someone can write an AI for that, but even then it's most likely much
easier to just go the OCR route to derive content and extract layout
from the document. At least I don't know how strictly PDF defines
things; I only ever hear that PDF is at least as much of an unholy
mess as ooxml (which was supposed to be fixed by PDF 2.0, which still
pretty much no software creates by default, even though most software
seems to support it) and that writing tools like Ghostscript or
Poppler is a royal pain. LaTeX can probably only circumvent this
because it just has to create a PDF from a predefined set of
functions, and be able to embed other PDFs into these PDFs. But the
most reliable way to edit PDFs that I know of, given that I have little
to no experience with most commercial solutions, is Inkscape. If the
internal importer succeeds, you get great text editing features, which
obviously can't rival office suites, but at least you don't almost
certainly mess up the whole layout.
Richard
In my most recent experience, OCR of pdf documents is quite difficult if
the layout is significant such as in bank statements. There are various
tools to assist in extracting the content but it's quite marginal.
On the other hand, give a screenshot of a bank statement to the 'ai'
GPT4 and ask it to extract all transactions in csv format and it is done
perfectly.
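For comparison, the rule-based route looks roughly like the sketch below when the text of a statement can be extracted at all. The line layout (ISO date, description, signed amount) and the function name are assumptions for illustration; real statements rarely come out this tidy, which is exactly where this approach gets marginal.

```python
import csv
import io
import re

# Assumed layout per transaction line: ISO date, free text, signed amount.
LINE_RE = re.compile(r"^(\d{4}-\d{2}-\d{2})\s+(.+?)\s+(-?\d+\.\d{2})$")


def statement_to_csv(text):
    """Turn already-extracted statement text into CSV, skipping other lines."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["date", "description", "amount"])
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            # Lines that don't fit the assumed layout are silently dropped.
            writer.writerow(match.groups())
    return out.getvalue()
```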
On a side note, PDF is basically PostScript on steroids. Its entire
purpose is to describe how content is to be placed on a printed page. On
a side-side note, PostScript is actually a programming language with
a specialty in text layout, but quite capable of doing significant
computation, so long as your output eventually gets rendered
on a page.
On 24.06.24 10:31, jeremy ardley wrote:
In my view, pdf and docx should be regarded as publication formats for
content managed in a professional content management system. HTML and
odt and postscript also fall into the category of publication formats.
Word documents suffer because back in the dim ages of the late 1980s
Microsoft decided to merge content managing with content editing with
content publishing and abysmally failed at all of them.
However, the easiest way to edit a pdf is to convert it to word using,
say, https://pdf2docx.com/. There are also plenty of ways in Linux to
do that, but they all take time and effort to make work.