If you need further assistance, please let us know.

BR
Maruan Sahyoun

On 07.03.2014 at 18:24, HQS <[email protected]> wrote:

> Thank you all for those accurate answers.
> I will give a try to the geometrical approach based on the (x, y) coordinates
> of the characters.
>
> Best regards,
>
> Julien
>
> On 7 March 2014 at 13:25, Peter Murray-Rust <[email protected]> wrote:
>
>> On Fri, Mar 7, 2014 at 11:16 AM, Confidential Confidential <
>> [email protected]> wrote:
>>
>>> Sirs,
>>>
>>> I had already thought about this graphical approach to reconstruct the
>>> words. I dropped it because I'm a bit sceptical about the reliability of
>>> such a method. I can't help thinking that it will not be a 100% sure
>>> method. I understand why CAD software would produce such an output,
>>> though (thank you for this new word that I didn't know, "boustrophedonic",
>>> but it explains well the result obtained).
>>>
>>
>> It's not as bad as you think. We have reconstructed the text from hundreds
>> of scientific papers (so probably nearly a million words) and found very
>> few problems. The reason we are doing this rather than using PDFBox tools
>> is that scientific (and especially maths) PDFs contain many diacritics, high
>> Unicode points, occasional graphics strokes, variable font size and style,
>> ligatures, non-horizontal text, etc.
>>
>> For running text it works very well - assuming that the characters announce
>> their widths. Then - roughly - "ab" is a word if
>>
>>     x(a) + width(a)*fontSize(a) + tolerance >= x(b)
>>
>> else we can *crudely* estimate the number of intervening spaces (this is
>> very suspect as publishers may elide concatenated spaces).
>>
>> All standard Fonts (see PDF spec) should announce their widths.
>> Unfortunately scientific publishers use some of the worst constructed fonts
>> in the world and sometimes we have to guess - by surveying a body of
>> character positions and trying to work out spaces and font-type.
>>
>>> Supposing that the characters appear in a totally arbitrary order,
>>> detecting that they're on the same line is more or less a piece of cake
>>> (except if I need to introduce a tolerance, which makes things more
>>> difficult),
>>
>> In a modern PDF we find that all characters on the same line tend to have
>> equal y-coords to at least 3 decimals. The problem is that OCR'ed
>> characters may have variable y because of rounding errors and antialiasing.
>>
>>> but grouping the characters according to their X position is
>>> not at all an easy task.
>>>
>>
>> The order should be fairly clear. The problems are:
>> * spaces (see above)
>> * hyphens at line-end (this requires heuristics - maybe lookup in Wordnet)
>>   - we generally solve > 90%. Hyphens in chemistry are meaningful
>> * diacritics. Some characters have diacritics with the same x (e.g. E and
>>   acute). These can occur in variable order. Where possible we try to
>>   recreate a single Unicode point.
>> * over and underbars
>> * ligatures (in "waffle") there may be 6 characters or only 4 w-a-ffl-e. We
>>   split the latter.
>>
>>>
>>> But this is not an issue, my problem is more the fact that this method may
>>> not be 100% reliable. What do you think?
>>>
>>
>> We are committed to solving it for English-language science and European
>> personal names. The worst case is probably slanted text in diagrams.
>>
>>>
>>> As for the technical part (overloading the processText), it's ok, thanks
>>> for the advice.
>>>
>>> Best regards
>>>
>>> Julien
>>>
>>
>> --
>> Peter Murray-Rust
>> Reader in Molecular Informatics
>> Unilever Centre, Dep. Of Chemistry
>> University of Cambridge
>> CB2 1EW, UK
>> +44-1223-763069
>
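
For anyone following the thread, the geometric approach described above (group characters by y, order them by x, then apply the width-plus-tolerance rule) could be sketched roughly as follows. This is only an illustration in Python, not PDFBox code and not the code Peter's group uses: the `Glyph` record, the function names, and the default tolerances are all hypothetical, and it assumes that `width` is the advance width expressed as a fraction of the font size, so that `width * font_size` is in the same units as `x`. It also assumes y increases up the page, so larger y means an earlier line.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Glyph:
    """Hypothetical record for one extracted character (not a PDFBox class)."""
    char: str
    x: float          # left edge, in page units
    y: float          # baseline, in page units
    width: float      # advance width as a fraction of the font size (assumed)
    font_size: float  # in page units


def group_lines(glyphs: List[Glyph], y_tol: float = 0.001) -> List[List[Glyph]]:
    """Group glyphs whose baselines agree within y_tol; sort each line by x."""
    lines: List[List[Glyph]] = []
    # Largest y first: assumes the origin is at the bottom of the page.
    for g in sorted(glyphs, key=lambda g: g.y, reverse=True):
        if lines and abs(lines[-1][0].y - g.y) <= y_tol:
            lines[-1].append(g)
        else:
            lines.append([g])
    return [sorted(line, key=lambda g: g.x) for line in lines]


def split_words(line: List[Glyph], tolerance: float = 0.5) -> List[str]:
    """Apply the rule from the thread: a and b belong to the same word if
    x(a) + width(a)*fontSize(a) + tolerance >= x(b)."""
    words: List[str] = []
    current = ""
    prev: Optional[Glyph] = None
    for g in line:
        if g.char.isspace():
            # An explicit space glyph always ends the current word.
            if current:
                words.append(current)
            current, prev = "", None
            continue
        if prev is not None:
            right_edge = prev.x + prev.width * prev.font_size
            if right_edge + tolerance < g.x and current:
                words.append(current)
                current = ""
        current += g.char
        prev = g
    if current:
        words.append(current)
    return words


def extract_text(glyphs: List[Glyph], y_tol: float = 0.001,
                 tolerance: float = 0.5) -> List[str]:
    """Return one string per detected line, words separated by one space."""
    return [" ".join(split_words(line, tolerance))
            for line in group_lines(glyphs, y_tol)]


# Glyphs supplied in arbitrary order, 10-unit font, advance of 5 units each.
sample = [
    Glyph("d", 20, 700, 0.5, 10), Glyph("a", 0, 700, 0.5, 10),
    Glyph("e", 0, 688, 0.5, 10), Glyph("c", 15, 700, 0.5, 10),
    Glyph("b", 5, 700, 0.5, 10),
]
print(extract_text(sample))  # ['ab cd', 'e']
```

The sketch deliberately ignores the harder cases listed in the thread (line-end hyphens, diacritics sharing an x position, ligatures, slanted text), and for OCR'ed input `y_tol` would have to be much larger than the 3-decimal agreement seen in modern PDFs.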

