On 24/6/24 22:22, Richard wrote:
Since it's quite OT, starting a new thread for this.

I would most certainly never call formats like ooxml or odf “publishing formats”; they are content creation or editing formats. From a publishing format I expect to be able to show the content as intended, which actually neither of them can do 100 %, although the probability of messing up just isn't that big. Either you want a fixed format, e.g. for printing, which is what you get with the likes of PDF, PS, SVG or your various raster graphic formats. Or you want your content to adapt in a foreseeable way to the viewer, i.e. HTML, usually with the help of CSS and, worst case, JS. Sure, ooxml and odf want to be the former, but due to technical caveats that's not necessarily possible. With ooxml, you have several incompatible versions that you can't easily tell apart, and identical display is often impossible because it uses but doesn't embed proprietary fonts by default. On top of that, it is an abomination of a format, spanning around 5500 pages plus another 1000 pages for its transitional mode, that was only standardized by world-wide corruption. ODF usually does things way better, but support in software beyond LibreOffice is still often lacking, though that's not their fault since their format is much simpler, being documented in just around 1000 pages. But still, it doesn't communicate fixed positions and, as far as I can tell, doesn't imply them by telling the software explicitly how to render fonts so that the result always looks identical, and it won't embed fonts (or the needed subset) by default, so it's also kinda not fulfilling the needs.

I triggered this by saying docx and pdf are publishing formats. In the world of professional content management that is exactly so. You have your content in a neutral format in a version-controlled storage system, and you have the choice to publish in pdf or docx or html or epub or whatever. What you don't do is use these output formats as your primary content.

Examples relevant to debian include package documentation such as man pages, markdown, doxygen, docbook, latex.

In fact, I can't think of any project in debian that has pdf or docx as the primary source of documentation.

Tools that can do this transform include pandoc, Visual Studio Code, ghostwriter, marktext and many many more.
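
As a rough illustration of that single-source workflow, here is a minimal Python sketch that drives pandoc from one neutral source file. The file name manual.md, the helper names and the target list are only assumptions for the example; the pandoc invocation itself is the ordinary "pandoc INPUT -o OUTPUT" form, where the output format is inferred from the extension. PDF is left out because pandoc normally needs a LaTeX engine or similar installed for that.

```python
import shutil
import subprocess
from pathlib import Path

# Hypothetical list of publishing targets; pandoc infers the writer
# from the output file extension.
TARGETS = ("html", "docx", "epub")


def pandoc_commands(source, targets=TARGETS):
    """Build one pandoc command line per output format.

    The source file is only ever read, never written to.
    """
    src = Path(source)
    return [
        ["pandoc", str(src), "-o", str(src.with_suffix("." + ext))]
        for ext in targets
    ]


def publish(source, targets=TARGETS):
    """Run pandoc once per target; raises if pandoc is missing or fails."""
    if shutil.which("pandoc") is None:
        raise RuntimeError("pandoc is not installed")
    for cmd in pandoc_commands(source, targets):
        subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Only print the commands here; call publish("manual.md") to run them.
    for cmd in pandoc_commands("manual.md"):
        print(" ".join(cmd))
```

The point of the sketch is only that the docx, html and epub files are disposable outputs: you regenerate them from the source whenever it changes instead of editing them.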


And no, editing a PDF as docx isn't the easiest, not to mention best, way to edit a PDF, especially not with some ominous web tool. Maybe someone can write an AI for that, but even then it's most likely much easier to just go the OCR route to derive content and extract layout from the document. At least I don't know how strictly PDF defines things; I only ever hear that PDF is at least as much of an unholy mess as ooxml (which was supposed to be fixed by PDF 2.0, which still pretty much no software creates by default, even though most software seems to support it) and that writing tools like Ghostscript or Poppler is a royal pain. LaTeX can probably only circumvent this because it just has to create a PDF from a predefined set of functions, and be able to embed other PDFs into these PDFs. But the most reliable way to edit PDFs that I know of, given that I have little to no experience with most commercial solutions, is Inkscape. If the internal importer succeeds, you get great text editing features, which obviously can't rival office suites, but at least you don't almost certainly mess up the whole layout.

Richard


In my most recent experience, OCR of pdf documents is quite difficult if the layout is significant, such as in bank statements. There are various tools to assist in extracting the content, but the results are quite marginal.

On the other hand, give a screenshot of a bank statement to the 'ai' GPT4 and ask it to extract all transactions in csv format and it is done perfectly.

On a sidenote, PDF is basically Postscript on steroids. Its entire purpose is to describe how content is to be placed on a printed page. On a side-side note, Postscript is actually a programming language with a specialty in text layout, but quite capable of doing significant computation, so long as your output eventually gets rendered on a page.
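
To make the "programming language" point concrete, here is a small sketch in Python that writes out a one-page Postscript program. The function name and the page coordinates are just assumptions for the example. The multiplication is not done by Python: the generated program pushes both numbers on the stack, and the Postscript interpreter computes the product with mul, converts it to text with cvs and paints it with show.

```python
def postscript_page(a, b):
    """Return a one-page Postscript program that multiplies two integers
    at render time and prints the result on the page."""
    lines = [
        "%!PS",
        "/Helvetica findfont 12 scalefont setfont",  # select a 12 pt font
        "72 720 moveto",                             # one inch from left, near top
        f"({a} * {b} = ) show",                      # literal label text
        f"{a} {b} mul 10 string cvs show",           # compute, convert, paint
        "showpage",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(postscript_page(6, 7), end="")
```

Saving the output to a file and opening it with a Postscript interpreter such as Ghostscript should render a single page containing the label followed by the computed product.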




On 24.06.24 10:31, jeremy ardley wrote:
In my view, pdf and docx should be regarded as publication formats for content managed in a professional content management system. HTML and odt and postscript also fall into the category of publication formats.

Word documents suffer because back in the dim ages of the late 1980s Microsoft decided to merge content managing with content editing with content publishing and abysmally failed at all of them.

However, the easiest way to edit a pdf is to use, say, https://pdf2docx.com/ to convert it to Word. There are also plenty of ways in Linux to do that, but they all take time and effort to make work.

