I have a question for Peter about this formula: x(a) + width(a)*fontSize(a) + tolerance >= x(b)
What is the difference between "width(a)" and "fontSize(a)"? Isn't the width of the character "a" in pixels, as given by the font, enough to check this assertion?

Thanks!

On 7 March 2014 at 18:46, Maruan Sahyoun <[email protected]> wrote:

> If you need further assistance, please let us know.
>
> BR
> Maruan Sahyoun
>
> On 07.03.2014 at 18:24, HQS <[email protected]> wrote:
>
>> Thank you all for those accurate answers.
>> I will give a try to the geometrical approach based on the (x, y)
>> coordinates of the characters.
>>
>> Best regards,
>>
>> Julien
>>
>> On 7 March 2014 at 13:25, Peter Murray-Rust <[email protected]> wrote:
>>
>>> On Fri, Mar 7, 2014 at 11:16 AM, Confidential Confidential <
>>> [email protected]> wrote:
>>>
>>>> Sirs,
>>>>
>>>> I had already thought about this graphical approach to reconstruct the
>>>> words. I dropped it because I'm a bit sceptical about the reliability of
>>>> such a method. I can't help thinking that it will not be a 100% sure
>>>> method. I understand why CAD software would produce such an output,
>>>> though (thank you for this new word that I didn't know, "boustrophedonic",
>>>> but it explains well the result obtained).
>>>>
>>>
>>> It's not as bad as you think. We have reconstructed the text from hundreds
>>> of scientific papers (so probably nearly a million words) and found very
>>> few problems. The reason we are doing this rather than using PDFBox tools
>>> is that scientific (and especially maths) PDFs contain many diacritics, high
>>> Unicode points, occasional graphics strokes, variable font size and style,
>>> ligatures, non-horizontal text, etc.
>>>
>>> For running text it works very well - assuming that the characters announce
>>> their widths. Then - roughly - "ab" is a word if
>>>
>>> x(a) + width(a)*fontSize(a) + tolerance >= x(b)
>>>
>>> else we can *crudely* estimate the number of intervening spaces (this is
>>> very suspect as publishers may elide concatenated spaces).
>>>
>>> All standard fonts (see PDF spec) should announce their widths.
>>> Unfortunately scientific publishers use some of the worst constructed fonts
>>> in the world and sometimes we have to guess - by surveying a body of
>>> character positions and trying to work out spaces and font type.
>>>
>>>
>>>> Supposing that the characters appear in a totally arbitrary order,
>>>> detecting that they're on the same line is more or less a piece of cake
>>>> (except if I need to introduce a tolerance, which makes things more
>>>> difficult),
>>>
>>>
>>> In a modern PDF we find that all characters on the same line tend to have
>>> equal y-coords to at least 3 decimals. The problem is that OCR'ed
>>> characters may have variable y because of rounding errors and antialiasing.
>>>
>>>
>>>> but grouping the characters according to their X position is
>>>> not at all an easy task.
>>>>
>>>
>>> The order should be fairly clear. The problems are:
>>> * spaces (see above)
>>> * hyphens at line-end (this requires heuristics - maybe lookup in WordNet)
>>> - we generally solve > 90%. Hyphens in chemistry are meaningful
>>> * diacritics. Some characters have diacritics with the same x (e.g. E and
>>> acute). These can occur in variable order. Where possible we try to
>>> recreate a single Unicode point.
>>> * overbars and underbars
>>> * ligatures (in "waffle") there may be 6 characters or only 4 w-a-ffl-e. We
>>> split the latter.
>>>
>>>
>>>>
>>>> But this is not an issue, my problem is more the fact that this method may
>>>> not be 100% reliable. What do you think?
>>>>
>>>
>>> We are committed to solving it for English-language science and European
>>> personal names. The worst case is probably slanted text in diagrams.
>>>
>>>
>>>>
>>>> As for the technical part (overloading the processText), it's ok, thanks
>>>> for the advice.
>>>>
>>>> Best regards
>>>>
>>>> Julien
>>>>
>>>
>>> --
>>> Peter Murray-Rust
>>> Reader in Molecular Informatics
>>> Unilever Centre, Dept. of Chemistry
>>> University of Cambridge
>>> CB2 1EW, UK
>>> +44-1223-763069
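
For reference, here is a minimal sketch of the two geometric steps discussed in the quoted message: grouping characters into lines by their y-coordinate, then splitting each line into words with the rule x(a) + width(a)*fontSize(a) + tolerance >= x(b). It is written in Python for brevity and is not PDFBox code. The `Glyph` record, the function names and the default tolerances are all hypothetical. It also assumes that `width` is the glyph's advance expressed as a fraction of the em square, so that multiplying by `font_size` gives the advance in page units; if the extraction layer already reports widths in page units, the multiplication should be dropped.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Glyph:
    """Hypothetical record for one extracted character (not a PDFBox type)."""
    char: str
    x: float          # left edge, in page units
    y: float          # baseline, in page units
    width: float      # advance as a fraction of the em square (assumption)
    font_size: float  # page units per em


def group_lines(glyphs: List[Glyph], y_tol: float = 0.001) -> List[List[Glyph]]:
    """Cluster glyphs whose baselines differ by at most y_tol.

    Lines are returned in order of increasing y; each line is sorted by x.
    A larger y_tol would be needed for OCR'ed text with jittery baselines.
    """
    lines: List[List[Glyph]] = []
    for g in sorted(glyphs, key=lambda g: g.y):
        if lines and abs(g.y - lines[-1][-1].y) <= y_tol:
            lines[-1].append(g)
        else:
            lines.append([g])
    return [sorted(line, key=lambda g: g.x) for line in lines]


def split_words(line: List[Glyph], tolerance: float = 0.5) -> List[str]:
    """Split one x-sorted line into words.

    Consecutive glyphs a, b stay in the same word when
    x(a) + width(a) * fontSize(a) + tolerance >= x(b).
    """
    words: List[str] = []
    current = ""
    prev = None
    for g in line:
        if prev is not None and prev.x + prev.width * prev.font_size + tolerance < g.x:
            words.append(current)
            current = ""
        current += g.char
        prev = g
    if current:
        words.append(current)
    return words


if __name__ == "__main__":
    # Characters of "ab cd" delivered in arbitrary order, font size 10,
    # each glyph half an em wide (advance of 5 page units).
    glyphs = [
        Glyph("c", 13.0, 100.0, 0.5, 10.0),
        Glyph("a", 0.0, 100.0, 0.5, 10.0),
        Glyph("d", 18.0, 100.0004, 0.5, 10.0),
        Glyph("b", 5.0, 100.0, 0.5, 10.0),
    ]
    for line in group_lines(glyphs):
        print(split_words(line))  # prints ['ab', 'cd']
```

The sketch does not handle the harder cases listed in the quoted message (line-end hyphens, diacritics sharing an x position, ligatures, slanted text), and it makes no attempt to estimate how many spaces a gap represents.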

