On Fri, Mar 7, 2014 at 11:16 AM, Confidential Confidential <
[email protected]> wrote:

> Sirs,
>
> I had already thought about this graphical approach to reconstruct the
> words. I dropped it because I'm a bit sceptical about the reliability of
> such a method. I can't help thinking that it will not be a 100% sure
> method. I understand why CAD software would produce such an output,
> though (thank you for this new word that I didn't know, "boustrophedonic",
> but it explains the result obtained well).
>

It's not as bad as you think. We have reconstructed the text from hundreds
of scientific papers (so probably nearly a million words) and found very
few problems. The reason we are doing this rather than using PDFBox tools
is that scientific (and especially maths) PDFs contain many diacritics, high
Unicode code points, occasional graphics strokes, variable font size and
style, ligatures, non-horizontal text, etc.

For running text it works very well - assuming that the characters announce
their widths. Then - roughly - "ab" is a word if

x(a) + width(a)*fontSize(a) + tolerance >= x(b)

else we can *crudely* estimate the number of intervening spaces (this is
very suspect as publishers may elide concatenated spaces).
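
As a rough illustration of the rule above, here is a minimal Python sketch
(not our actual code). The Glyph class, the split_words name and the default
tolerance are made up for the example, and it assumes that width is already
expressed as a fraction of the font size, so that width*fontSize is in page
units.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    char: str
    x: float          # left edge of the glyph, in page units
    width: float      # advance width as a fraction of the font size (assumed)
    font_size: float


def split_words(glyphs, tolerance=0.5):
    """Group the glyphs of ONE line into words, reading left to right.

    Two neighbours a, b stay in the same word when
    x(a) + width(a)*fontSize(a) + tolerance >= x(b).
    """
    words = []
    current = ""
    prev = None
    for g in sorted(glyphs, key=lambda g: g.x):
        if prev is not None:
            right_edge = prev.x + prev.width * prev.font_size
            if right_edge + tolerance < g.x:
                # Gap too wide: close the current word and start a new one.
                words.append(current)
                current = ""
        current += g.char
        prev = g
    if current:
        words.append(current)
    return words


# Hypothetical glyphs: "a" and "b" nearly touch, "c" sits after a clear gap.
line = [
    Glyph("a", 0.0, 0.5, 10.0),
    Glyph("b", 5.2, 0.5, 10.0),
    Glyph("c", 13.0, 0.5, 10.0),
]
words = split_words(line)
```

The tolerance is the part that needs tuning per document; the value here is
arbitrary.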

All standard fonts (see PDF spec) should announce their widths.
Unfortunately scientific publishers use some of the worst-constructed fonts
in the world and sometimes we have to guess - by surveying a body of
character positions and trying to work out spaces and font type.


> Supposing that the characters appear in a totally arbitrary order,
> detecting that they're on the same line is more or less a piece of cake
> (except if I need to introduce a tolerance, which makes things more
> difficult),


In a modern PDF we find that all characters on the same line tend to have
equal y-coords to at least 3 decimals. The problem is that OCR'ed
characters may have variable y because of rounding errors and antialiasing.
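
A sketch of the line grouping, in Python for illustration only. The function
name, the (x, y, char) tuple layout and the default tolerance are assumptions
for the example; for OCR'ed pages the tolerance would have to be much larger
than 0.001.

```python
def group_into_lines(points, y_tolerance=0.001):
    """Group (x, y, char) tuples into lines of text.

    Characters whose y differs from the first y seen on a line by at most
    y_tolerance are put on that line. Lines are returned top to bottom,
    assuming y increases up the page.
    """
    lines = []  # each entry is [reference_y, [(x, char), ...]]
    for x, y, ch in sorted(points, key=lambda p: -p[1]):
        if lines and abs(y - lines[-1][0]) <= y_tolerance:
            lines[-1][1].append((x, ch))
        else:
            lines.append([y, [(x, ch)]])
    # Within each line, order the characters by x.
    return [
        "".join(ch for _, ch in sorted(members, key=lambda m: m[0]))
        for _, members in lines
    ]


# Hypothetical characters in arbitrary order, on two lines.
points = [
    (10.0, 700.0004, "b"),
    (0.0, 688.0, "c"),
    (0.0, 700.0, "a"),
    (6.0, 688.0002, "d"),
]
text_lines = group_into_lines(points)
```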



> but grouping the characters according to their X position is
> not at all an easy task.
>

The order should be fairly clear. The problems are:
* spaces (see above)
* hyphens at line-end (this requires heuristics - maybe lookup in WordNet);
  we generally solve > 90%. Hyphens in chemistry are meaningful.
* diacritics. Some characters have diacritics with the same x (e.g. E and
  acute). These can occur in variable order. Where possible we try to
  recreate a single Unicode code point.
* over- and underbars
* ligatures. In "waffle" there may be 6 characters or only 4 (w-a-ffl-e). We
  split the latter.
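
The diacritic and ligature cases can be sketched with Python's standard
unicodedata module. This is an illustration under simplifying assumptions,
not our implementation: it assumes the accent arrives as a Unicode combining
mark (a PDF may instead use a spacing accent, which would need mapping
first), and the function names and ligature table are made up for the
example.

```python
import unicodedata

# Latin f-ligatures from the Unicode Alphabetic Presentation Forms block.
LIGATURES = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
}


def merge_pair(a, b):
    """Merge a base character and a combining mark that share the same x.

    The two may arrive in either order, so put the base first and then
    compose with NFC to get a single code point where one exists.
    """
    if unicodedata.combining(a) and not unicodedata.combining(b):
        a, b = b, a
    return unicodedata.normalize("NFC", a + b)


def split_ligatures(text):
    """Replace each ligature glyph by its component letters."""
    return "".join(LIGATURES.get(ch, ch) for ch in text)


e_acute = merge_pair("\u0301", "E")        # combining acute arrives first
waffle = split_ligatures("wa\ufb04e")      # 4 characters in, 6 out
```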


>
> But this is not an issue, my problem is more the fact that this method may
> not be 100% reliable. What do you think ?
>

We are committed to solving it for English-language science and European
personal names. The worst case is probably slanted text in diagrams.


>
> As for the technical part (overloading the processText), it's ok, thanks
> for the advice.
>
> Best regards
>
> Julien
>

--
Peter Murray-Rust
Reader in Molecular Informatics
Unilever Centre, Dep. Of Chemistry
University of Cambridge
CB2 1EW, UK
+44-1223-763069
