The width appears to be a ratio, independent of size. It also seems to be conventionally multiplied by 1000 (I have not found a definition for this - I have only guessed it).
Thus a character "A" of width=600 and fontSize=10.5 appears to have

    pixelwidth = 600. / 1000. * 10.5 = 6.3 pixels

I'd be grateful for confirmation or correction...

On Sat, Mar 8, 2014 at 11:12 AM, HQS <[email protected]> wrote:

> Well, I'd like to ask Peter for a clarification about this formula:
>
> x(a) + width(a)*fontSize(a) + tolerance >= x(b)
>
> What is the difference between "width(a)" and "fontSize(a)"? Is it not
> enough to know the width of the character "a" in pixels, as given by the
> font, to check this assertion?
>
> Thanks!
>
>
> On 7 March 2014 at 18:46, Maruan Sahyoun <[email protected]> wrote:
>
> > if you need further assistance please let us know.
> >
> > BR
> > Maruan Sahyoun
> >
> > On 07.03.2014 at 18:24, HQS <[email protected]> wrote:
> >
> >> Thank you all for those accurate answers.
> >> I will give the geometrical approach a try, based on the (x, y)
> >> coordinates of the characters.
> >>
> >> Best regards,
> >>
> >> Julien
> >>
> >> On 7 March 2014 at 13:25, Peter Murray-Rust <[email protected]> wrote:
> >>
> >>> On Fri, Mar 7, 2014 at 11:16 AM, Confidential Confidential <
> >>> [email protected]> wrote:
> >>>
> >>>> Sirs,
> >>>>
> >>>> I had already thought about this graphical approach to reconstruct the
> >>>> words. I dropped it because I'm a bit sceptical about the reliability
> >>>> of such a method. I can't help thinking that it will not be a 100% sure
> >>>> method. I understand why CAD software would produce such an output,
> >>>> though (thank you for this new word that I didn't know,
> >>>> "boustrophedonic", but it explains well the result obtained).
> >>>>
> >>>
> >>> It's not as bad as you think. We have re-constructed the text from
> >>> hundreds of scientific papers (so probably nearly a million words) and
> >>> found very few problems. The reason we are doing this rather than using
> >>> PDFBox tools is that scientific (and especially maths) PDFs contain many
> >>> diacritics, high Unicode points, occasional graphics strokes, variable
> >>> font size and style, ligatures, non-horizontal text, etc.
> >>>
> >>> For running text it works very well - assuming that the characters
> >>> announce their widths. Then - roughly - "ab" is a word if
> >>>
> >>> x(a) + width(a)*fontSize(a) + tolerance >= x(b)
> >>>
> >>> else we can *crudely* estimate the number of intervening spaces (this is
> >>> very suspect as publishers may elide concatenated spaces).
> >>>
> >>> All standard Fonts (see PDF spec) should announce their widths.
> >>> Unfortunately scientific publishers use some of the worst constructed
> >>> fonts in the world and sometimes we have to guess - by surveying a body
> >>> of character positions and trying to work out spaces and font-type.
> >>>
> >>>
> >>>> Supposing that the characters appear in a totally arbitrary order,
> >>>> detecting that they're on the same line is more or less a piece of cake
> >>>> (except if I need to introduce a tolerance, which makes things more
> >>>> difficult),
> >>>
> >>>
> >>> In a modern PDF we find that all characters on the same line tend to
> >>> have equal y-coords to at least 3 decimals. The problem is that OCR'ed
> >>> characters may have variable y because of rounding errors and
> >>> antialiasing.
> >>>
> >>>
> >>>> but grouping the characters according to their X position is
> >>>> not at all an easy task.
> >>>>
> >>>
> >>> The order should be fairly clear. The problems are:
> >>> * spaces (see above)
> >>> * hyphens at line-end (this requires heuristics - maybe lookup in
> >>>   Wordnet) - we generally solve > 90%. Hyphens in chemistry are
> >>>   meaningful
> >>> * diacritics. Some characters have diacritics with the same x (e.g. E
> >>>   and acute). These can occur in variable order. Where possible we try
> >>>   to recreate a single Unicode point.
> >>> * over and underbars
> >>> * ligatures (in "waffle" there may be 6 characters or only 4:
> >>>   w-a-ffl-e). We split the latter.
> >>>
> >>>
> >>>> But this is not an issue, my problem is more the fact that this method
> >>>> may not be 100% reliable. What do you think?
> >>>>
> >>>
> >>> We are committed to solving it for English-language science and European
> >>> personal names. The worst case is probably slanted text in diagrams.
> >>>
> >>>
> >>>> As for the technical part (overloading the processText), it's ok,
> >>>> thanks for the advice.
> >>>>
> >>>> Best regards
> >>>>
> >>>> Julien
> >>>>
> >>>
> >>> --
> >>> Peter Murray-Rust
> >>> Reader in Molecular Informatics
> >>> Unilever Centre, Dep. Of Chemistry
> >>> University of Cambridge
> >>> CB2 1EW, UK
> >>> +44-1223-763069
> >>
> >
>

-- 
Peter Murray-Rust
Reader in Molecular Informatics
Unilever Centre, Dep. Of Chemistry
University of Cambridge
CB2 1EW, UK
+44-1223-763069
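
For anyone who wants to try the arithmetic discussed in this thread, here is a minimal sketch of the width scaling and the "ab is a word" test. It is not the code used on the papers mentioned above and it does not call PDFBox. The Glyph record, the function names, the default tolerance of 0.5 and the assumption that y increases up the page are all hypothetical choices made for illustration, and it takes the guess that the width is expressed in thousandths of the font size at face value.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    """Hypothetical record for one extracted character (not a PDFBox type)."""
    text: str
    x: float          # left edge of the character, in page units
    y: float          # baseline coordinate, in page units
    width: float      # font width, assumed to be in thousandths of the font size
    font_size: float


def advance(g: Glyph) -> float:
    """Scale the size-independent width to page units: width / 1000 * fontSize."""
    return g.width / 1000.0 * g.font_size


def same_word(a: Glyph, b: Glyph, tolerance: float = 0.5) -> bool:
    """The test from the thread, with the /1000 made explicit:
    x(a) + width(a)/1000 * fontSize(a) + tolerance >= x(b)."""
    return a.x + advance(a) + tolerance >= b.x


def group_words(glyphs, tolerance: float = 0.5, y_decimals: int = 3):
    """Group glyphs given in arbitrary order into words.

    Lines are formed from glyphs whose y agrees to y_decimals places,
    which follows the "equal to at least 3 decimals" observation above.
    OCR'ed text with jittery y values would need a looser grouping.
    """
    lines = {}
    for g in glyphs:
        lines.setdefault(round(g.y, y_decimals), []).append(g)

    words = []
    # Assumption: y grows up the page, so larger y means an earlier line.
    for y in sorted(lines, reverse=True):
        row = sorted(lines[y], key=lambda g: g.x)
        current = [row[0]]
        for prev, nxt in zip(row, row[1:]):
            if same_word(prev, nxt, tolerance):
                current.append(nxt)
            else:
                words.append("".join(g.text for g in current))
                current = [nxt]
        words.append("".join(g.text for g in current))
    return words


if __name__ == "__main__":
    # The "A" example from the message: width=600, fontSize=10.5.
    print(advance(Glyph("A", 0.0, 0.0, 600, 10.5)))

    # Characters supplied out of order; "ab" and "cd" share a line, "e" is below.
    sample = [
        Glyph("d", 18.0, 100.0, 500, 10.0),
        Glyph("e", 0.0, 88.0, 500, 10.0),
        Glyph("a", 0.0, 100.0, 500, 10.0),
        Glyph("c", 13.0, 100.0, 500, 10.0),
        Glyph("b", 5.0, 100.0, 500, 10.0),
    ]
    print(group_words(sample))
```

The sketch only compares each character with its left-hand neighbour on the same line. It makes no attempt at the harder cases listed in the thread (line-end hyphens, diacritics sharing an x, ligatures, slanted text), and the tolerance would have to be tuned per document.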

