Since it's quite OT, starting a new thread for this.

I would most certainly never call formats like OOXML or ODF “publishing formats”; they are content creation or editing formats. From a publishing format I expect to be able to show the content as intended. Admittedly no format manages that 100 %, but with a real publishing format the probability of messing up just isn't that big. Either you want a fixed format, e.g. for printing, which is what you get with the likes of PDF, PS, SVG or your various raster graphics formats. Or you want your content to adapt in a foreseeable way to the viewer, i.e. HTML, usually with the help of CSS and, worst case, JS.

Sure, OOXML and ODF want to be the former, but due to technical caveats that's not necessarily possible. With OOXML, you have several incompatible versions you can't easily tell apart, and identical display is often impossible because it uses proprietary fonts but doesn't embed them by default. On top of that, it is an abomination of a format spanning around 5500 pages, plus another 1000 pages for its transitional mode, that was only standardized by world-wide corruption.

ODF usually does things way better, but support in software beyond LibreOffice is still often lacking, though that's not the format's fault, since it is much simpler, being documented in just around 1000 pages. But still, it doesn't communicate fixed positions, and as far as I can tell it doesn't imply them either by telling the software explicitly how to render fonts so that the result always looks identical. It also won't embed fonts (or the needed subset) by default, so it kinda doesn't fulfil the needs either.

And no, editing a PDF as docx isn't the easiest, not to mention best, way to edit a PDF, especially not with some dubious web tool. Maybe someone can write an AI for that, but even then it's most likely much easier to just go the OCR route to derive the content and extract the layout from the document. I don't know how strictly PDF defines things; I only ever hear that PDF is at least as much of an unholy mess as OOXML, and that writing tools like Ghostscript or Poppler is a royal pain. That was supposed to be fixed by PDF 2.0, which still pretty much no software creates by default, even though most software seems to support it. LaTeX can probably only circumvent this because it just has to create a PDF from a predefined set of functions, and be able to embed other PDFs into these PDFs.

But the most reliable way to edit PDFs that I know of (I have little to no experience with most commercial solutions) is Inkscape. If the internal importer succeeds, you get great text editing features, which obviously can't rival office suites, but at least you aren't almost guaranteed to completely mess up the whole layout.
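
For what it's worth, checking which PDF version a file declares is simple enough. Below is a minimal sketch in Python, assuming the "%PDF-x.y" header comment sits within the first 1024 bytes of the file. The function names are made up for illustration. Note that a PDF's document catalog may carry a /Version entry that overrides the header, so this only reports what the header claims:

```python
import re


def pdf_header_version(data: bytes):
    """Return the version declared in a PDF header (e.g. "1.7"), or None."""
    # A PDF starts with a comment of the form "%PDF-x.y". Only the leading
    # bytes are searched, in case a producer put some junk in front of it.
    match = re.search(rb"%PDF-(\d\.\d)", data[:1024])
    return match.group(1).decode("ascii") if match else None


def pdf_file_version(path):
    """Hypothetical helper: read the start of a file and report its version."""
    with open(path, "rb") as fh:
        return pdf_header_version(fh.read(1024))
```

Running pdf_file_version() over a directory of documents would give a rough idea of how many of them are declared as 2.0.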

Richard


On 24.06.24 10:31, jeremy ardley wrote:
In my view, pdf and docx should be regarded as publication formats for content 
managed in a professional content management system. HTML and odt and 
postscript also fall into the category of publication formats.

Word documents suffer because back in the dim ages of the late 1980s Microsoft 
decided to merge content managing with content editing with content publishing 
and abysmally failed at all of them.

However, the easiest way to edit a pdf is to convert it to Word using, say, 
https://pdf2docx.com/ There are also plenty of ways in Linux to do that, but 
they all take time and effort to make work.
