You could use each character's x and y position, rotation and size to determine whether two characters are close enough to each other, and on the same line, to belong to the same string.
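To make the idea concrete, here is a minimal sketch of that grouping. It is only an illustration under assumptions: Glyph and GlyphGrouper are hypothetical names, not PDFBox classes, the thresholds (half the glyph height for the baseline, 0.3 of the height for the gap) are guesses you would need to tune for Inventor's output, and it only handles horizontal text. In PDFBox the values could, I believe, be taken from the TextPosition objects handed to PDFTextStripper (getX, getY, getWidth, getHeight, getDir, getUnicode).

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class GlyphGrouper {
    /** One positioned character. Hypothetical type, not part of PDFBox. */
    static final class Glyph {
        final String text;
        final double x, y, width, height, rotation;

        Glyph(String text, double x, double y, double width, double height, double rotation) {
            this.text = text;
            this.x = x;
            this.y = y;
            this.width = width;
            this.height = height;
            this.rotation = rotation;
        }
    }

    /** Same rotation and baselines within half a glyph height (assumed tolerance). */
    static boolean sameLine(Glyph a, Glyph b) {
        return Math.abs(a.rotation - b.rotation) < 0.01
                && Math.abs(a.y - b.y) <= 0.5 * Math.min(a.height, b.height);
    }

    /** True if next starts close to where prev ends, on the same line. */
    static boolean adjacent(Glyph prev, Glyph next) {
        double gap = next.x - (prev.x + prev.width);
        return sameLine(prev, next) && Math.abs(gap) <= 0.3 * prev.height;
    }

    /** Groups glyphs into strings; assumes horizontal, left-to-right text. */
    static List<String> group(List<Glyph> glyphs) {
        List<Glyph> sorted = new ArrayList<>(glyphs);
        sorted.sort(Comparator.comparingDouble((Glyph g) -> g.x));

        List<StringBuilder> runs = new ArrayList<>();
        List<Glyph> lastOfRun = new ArrayList<>();
        for (Glyph g : sorted) {
            boolean placed = false;
            // Append to the first run whose last glyph sits directly before this one.
            for (int i = 0; i < runs.size() && !placed; i++) {
                if (adjacent(lastOfRun.get(i), g)) {
                    runs.get(i).append(g.text);
                    lastOfRun.set(i, g);
                    placed = true;
                }
            }
            if (!placed) {
                runs.add(new StringBuilder(g.text));
                lastOfRun.add(g);
            }
        }

        List<String> result = new ArrayList<>();
        for (StringBuilder run : runs) {
            result.add(run.toString());
        }
        return result;
    }
}
```

Once the runs are rebuilt, you could look for the ones starting with "I-" and take the run's end as the end of the reference. Rotated text would first have to be transformed into its own coordinate frame before comparing x and y.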
BT / ET is not at all guaranteed to give you strings as perceived by a human.

Olaf

On 6 Mar 2014, at 21:06, HQS wrote:

> Well, thanks for your responsiveness.
>
> The PDFs are generated by Autodesk Inventor (even the latest version produces
> that kind of output).
>
> It is for one of my clients who wants an automatic transformation
> of some specific strings in the PDF into a clickable link.
>
> My problem is very simple: with such a structure I have no way to know when
> the string ends.
>
> As a matter of fact, all the references to be transformed are prefixed
> with "I-" but there is no termination character, for instance
> "I-HOIST-042".
> Given that in the PDF I, -, H, O, (etc.), 2 are separate characters, I cannot
> rebuild the original string.
>
> I was hoping that there would be a block of text (BT … ET) but, as I mentioned,
> each character is put in its own block...
>
> Regards,
>
>
> On 6 March 2014, at 18:57, Maruan Sahyoun wrote:
>
>> Hi Julien,
>>
>> for 1) that's possible and supported - how was the document generated? DTP
>> application?
>> for 2) PDFBox doesn't enforce a PDF version. In general it supports all PDF
>> files, but it doesn't have full coverage of all features defined within
>> certain PDF versions, although it should have reasonable coverage. There is no
>> documentation on coverage yet, so I can't guarantee that a specific feature
>> is supported. Is there something special you are looking for?
>>
>> BR
>> Maruan Sahyoun
>>
>> On 06.03.2014, at 18:39, HQS wrote:
>>
>>> Hello all,
>>>
>>> 1.
>>> Have you ever seen PDFs having this kind of (pseudo) structure:
>>>
>>> BT
>>> <character>
>>> Tj
>>> ET
>>>
>>> ?
>>>
>>> Which means the strings are split into characters and there is one block
>>> of text per character?
>>> It seems to be ill-formed, doesn't it?
>>>
>>> 2. As a reminder from my first mail, what is the library's compliance with
>>> the PDF standards? 1.3 to 1.7?
>>>
>>>
>>> Thanks and regards
>>>
>>> Julien

