On Fri, Mar 7, 2014 at 11:16 AM, Confidential Confidential < [email protected]> wrote:
> Sirs,
>
> I had already thought about this graphical approach to reconstruct the
> words. I've let it down because I'm a bit sceptical on the reliability of
> such a method. I can't help thinking that it will not be a 100% sure
> method. I understand why a CAD software would produce such an output,
> though (thank you for this new word that I didn't know "boustrophedonic",
> but it explains well the result obtained).
>

It's not as bad as you think. We have reconstructed the text from hundreds of scientific papers (so probably nearly a million words) and found very few problems. The reason we are doing this rather than using PDFBox tools is that scientific (and especially maths) PDFs contain many diacritics, high Unicode code points, occasional graphics strokes, variable font size and style, ligatures, non-horizontal text, etc.

For running text it works very well - assuming that the characters announce their widths. Then - roughly - "ab" is a word if

    x(a) + width(a)*fontSize(a) + tolerance >= x(b)

else we can *crudely* estimate the number of intervening spaces (this is very suspect as publishers may elide concatenated spaces). All standard fonts (see the PDF spec) should announce their widths. Unfortunately scientific publishers use some of the worst-constructed fonts in the world and sometimes we have to guess - by surveying a body of character positions and trying to work out spaces and font-type.

> Supposing that the characters appear in a totally arbitrary order,
> detecting that they're on the same line is more or less piece of cake
> (except if I need to introduce a tolerance, which makes things more
> difficult),

In a modern PDF we find that all characters on the same line tend to have equal y-coords to at least 3 decimals. The problem is that OCR'ed characters may have variable y because of rounding errors and antialiasing.

> but grouping the characters according to their X position is
> not at all an easy task.
>

The order should be fairly clear.
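
To make the two rules above concrete (equal y for a line, then the width test for word breaks), here is a rough sketch in Python rather than against PDFBox. Everything in it is illustrative: the `Glyph` record, the function names and the default tolerances are assumptions, not our actual code, and `width` is assumed to be already expressed as a fraction of the font size.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Glyph:
    char: str
    x: float          # left edge of the glyph, in page units
    y: float          # baseline, in page units
    width: float      # advance width as a fraction of the font size (assumption)
    font_size: float


def group_lines(glyphs: List[Glyph], y_tol: float = 0.001) -> List[List[Glyph]]:
    """Group glyphs whose baselines agree within y_tol, then sort each line by x."""
    lines: List[List[Glyph]] = []
    # PDF y grows upward, so descending y gives top-of-page first.
    for g in sorted(glyphs, key=lambda g: g.y, reverse=True):
        if lines and abs(lines[-1][0].y - g.y) <= y_tol:
            lines[-1].append(g)
        else:
            lines.append([g])
    return [sorted(line, key=lambda g: g.x) for line in lines]


def split_words(line: List[Glyph], tolerance: float = 0.5) -> List[str]:
    """Start a new word when x(a) + width(a)*fontSize(a) + tolerance < x(b)."""
    words: List[str] = []
    current = ""
    prev = None
    for g in line:
        if prev is not None and prev.x + prev.width * prev.font_size + tolerance < g.x:
            words.append(current)
            current = ""
        current += g.char
        prev = g
    if current:
        words.append(current)
    return words


# Characters arriving in arbitrary order: "ab cd" on one line, "e" on the next.
glyphs = [
    Glyph("d", 18.0, 700.0, 0.5, 10.0),
    Glyph("e", 0.0, 688.0, 0.5, 10.0),
    Glyph("a", 0.0, 700.0, 0.5, 10.0),
    Glyph("c", 13.0, 700.0, 0.5, 10.0),
    Glyph("b", 5.0, 700.0, 0.5, 10.0),
]
for line in group_lines(glyphs):
    print(split_words(line))
```

For OCR'ed pages `y_tol` would have to be much larger, and for fonts that do not announce their widths the `width` field itself has to be guessed, which is where the real work is.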

The problems are:

* spaces (see above)
* hyphens at line-end (this requires heuristics - maybe lookup in WordNet) - we generally solve > 90%. Hyphens in chemistry are meaningful.
* diacritics. Some characters have diacritics with the same x (e.g. E and acute). These can occur in variable order. Where possible we try to recreate a single Unicode code point.
* over- and underbars
* ligatures: in "waffle" there may be 6 characters or only 4 (w-a-ffl-e). We split the latter.

>
> But this is not an issue, my problem is more the fact that this method may
> not be 100% reliable. What do you think ?
>

We are committed to solving it for English-language science and European personal names. The worst case is probably slanted text in diagrams.

>
> As for the technical part (overloading the processText), it's ok, thanks
> for the advice.
>
> Best regards
>
> Julien
>

-- 
Peter Murray-Rust
Reader in Molecular Informatics
Unilever Centre, Dep. Of Chemistry
University of Cambridge
CB2 1EW, UK
+44-1223-763069
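
A footnote on the diacritics and ligature items in the problem list: the sketch below shows one possible way to handle them with Python's standard `unicodedata` module. It is only an illustration under stated assumptions. The function names and the small table of spacing accents are hypothetical, and it assumes the characters sharing an x position have already been collected.

```python
import unicodedata

# Spacing accents that a font may emit instead of combining marks.
# Illustrative subset only (assumption), mapped to their combining equivalents.
SPACING_TO_COMBINING = {
    "\u00b4": "\u0301",  # acute
    "\u0060": "\u0300",  # grave
    "\u00a8": "\u0308",  # diaeresis
    "\u02c6": "\u0302",  # circumflex
}


def compose_stack(chars):
    """Merge characters that share one x position into a single code point where possible.

    The mark may arrive before or after its base, so base characters are
    moved to the front before NFC composition.
    """
    mapped = [SPACING_TO_COMBINING.get(c, c) for c in chars]
    base = [c for c in mapped if unicodedata.combining(c) == 0]
    marks = [c for c in mapped if unicodedata.combining(c) != 0]
    return unicodedata.normalize("NFC", "".join(base + marks))


def split_ligatures(word):
    """Expand the Latin ligature code points U+FB00..U+FB06 into plain letters."""
    return "".join(
        unicodedata.normalize("NFKC", c) if "\ufb00" <= c <= "\ufb06" else c
        for c in word
    )


print(compose_stack(["\u00b4", "E"]) == "\u00c9")   # acute seen before the E
print(split_ligatures("wa\ufb04e"))                 # 4 glyphs in, 6 letters out
```

NFKC is applied only to the ligature range on purpose: run over a whole word it would also flatten superscripts and similar characters, which carry meaning in scientific text.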

