Re: Publishing Formats

jeremy ardley Mon, 24 Jun 2024 14:29:23 -0700


On 24/6/24 22:22, Richard wrote:

Since it's quite OT, starting a new thread for this.
I would most certainly never call formats like ooxml or odf “publishing formats”; they are content creation or editing formats. From a publishing format I expect to be able to show the content as intended (which actually neither of them can do 100 %; the probability of messing up just isn't that big). Either you want a fixed format, e.g. for printing, which you get with the likes of PDF, PS, SVG or your various raster graphic formats. Or you want your content to adapt in a foreseeable way to the viewer, i.e. HTML, usually with the help of CSS and, worst case, JS. Sure, ooxml and odf want to be the former, but due to technical caveats that's not necessarily possible. With ooxml, you have several incompatible versions you can't just easily tell apart, often making identical display impossible due to using but not embedding proprietary fonts by default, and it is an abomination of a format spanning around 5500 pages plus another 1000 pages for its transitional mode, one that was only standardized by world-wide corruption. ODF usually does things way better, but support in software beyond LibreOffice is still often lacking, though that's not their fault since their format is much simpler, being documented in just around 1000 pages. But still, as it doesn't communicate fixed positions (and as far as I can tell doesn't imply those by telling the software explicitly how to render fonts so that the result will always look identical) and won't embed fonts, or the needed subset, by default, it's also kinda not fulfilling the needs.

I triggered this by saying docx and pdf are publishing formats. In the world of professional content management that is exactly so. You have your content in a neutral format in a version controlled storage system, and you have the choice to publish in pdf or docx or html or epub or whatever. What you don't do is use these output formats as your primary content.

Examples relevant to debian include package documentation such as manpages, markdown, doxygen, docbook, latex.

In fact I can't think of any project in debian that has pdf or docx as the primary source of documentation.

Tools that can do this transform include pandoc, Visual Studio Code, ghostwriter, marktext and many many more.
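As a rough illustration of the "one source, many outputs" workflow with pandoc, here is a minimal Python sketch. The file name `manual.md`, the helper names `pandoc_commands` and `publish`, and the list of target formats are assumptions made for the example; only the basic `pandoc INPUT -o OUTPUT` invocation is taken from pandoc itself, which picks the output format from the file extension.

```python
import shutil
import subprocess
from pathlib import Path

# Output formats mentioned in this thread; adjust to taste.
FORMATS = ("pdf", "docx", "html", "epub")


def pandoc_commands(source, formats=FORMATS):
    """Build one 'pandoc INPUT -o OUTPUT' command per output format."""
    src = Path(source)
    return [
        ["pandoc", str(src), "-o", str(src.with_suffix("." + fmt))]
        for fmt in formats
    ]


def publish(source, formats=FORMATS):
    """Run the commands if pandoc is installed; return the commands that ran."""
    if shutil.which("pandoc") is None:
        return []  # pandoc not on PATH, nothing was published
    commands = pandoc_commands(source, formats)
    for cmd in commands:
        subprocess.run(cmd, check=True)
    return commands
```

Calling `publish("manual.md")` would leave `manual.pdf`, `manual.docx`, `manual.html` and `manual.epub` next to the source, which stays the only file under version control. Note that PDF output additionally needs a PDF engine (for example a LaTeX installation) to be available to pandoc.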

And no, editing a PDF as docx isn't the easiest (not to mention best) way to edit a PDF, especially not with some ominous web tool. Maybe someone can write an AI for that, but even then it's most likely much easier to just go the OCR route to derive content and extract layout from the document. At least I don't know how strictly PDF defines things; I only always hear that PDF is at least as much of an unholy mess as ooxml (which was supposed to be fixed by PDF 2.0, which still pretty much no software creates by default, even though most software seems to support it) and that writing tools like Ghostscript or Poppler is a royal pain. LaTeX can probably only circumvent this because it just has to create a PDF from a predefined set of functions, and be able to embed other PDFs into these PDFs. But the most reliable way to edit PDFs, as I have little to no experience with most commercial solutions, is Inkscape. If the internal importer succeeds, you get great text editing features, which obviously can't rival office suites, but at least you don't almost guaranteed completely mess up the whole layout.
Richard

In my most recent experience, OCR of pdf documents is quite difficult if the layout is significant such as in bank statements. There are various tools to assist in extracting the content but it's quite marginal.

On the other hand, give a screenshot of a bank statement to the 'ai' GPT4 and ask it to extract all transactions in csv format and it is done perfectly.

On a sidenote, PDF is basically Postscript on steroids. Its entire purpose is to describe how content is to be placed on a printed page. On a side-side note, Postscript is actually a programming language with a specialty in text layout but quite capable of doing significant computation, so long as your output eventually gets rendered on a page.
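To make the "Postscript is a programming language" point concrete, here is a small sketch that writes out a PostScript program which computes 10! with a loop and then prints the result on the page. The file name `factorial.ps` and the helper `write_demo` are made up for the example; the PostScript itself uses only standard operators (`for`, `mul`, `cvs`, `show`, `showpage`).

```python
from pathlib import Path

# A tiny PostScript program: the loop runs on the interpreter's operand stack,
# and the result only becomes visible because it is drawn onto the page.
POSTSCRIPT_PROGRAM = """%!PS
/Helvetica findfont 12 scalefont setfont
72 720 moveto                 % one inch from the left edge, near the top
(10! = ) show
1                             % accumulator
1 1 10 { mul } for            % multiply the accumulator by 1, 2, ..., 10
20 string cvs show            % convert the number to text and draw it
showpage
"""


def write_demo(path="factorial.ps"):
    """Write the demo program to disk and return its path."""
    out = Path(path)
    out.write_text(POSTSCRIPT_PROGRAM, encoding="ascii")
    return out
```

After `write_demo()`, the file can be opened with any PostScript interpreter, for example `gs factorial.ps` if Ghostscript is installed.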

On 24.06.24 10:31, jeremy ardley wrote:
In my view, pdf and docx should be regarded as publication formats for content managed in a professional content management system. HTML and odt and postscript also fall into the category of publication formats.
Word documents suffer because back in the dim ages of the late 1980s Microsoft decided to merge content managing with content editing with content publishing and abysmally failed at all of them.
However, the easiest way to edit a pdf is to convert it to word using, say, https://pdf2docx.com/ There are also plenty of ways in linux to do that but they all take time and effort to make work.
