Re: 2 questions

HQS Sat, 08 Mar 2014 03:13:24 -0800

Well, I would like to ask Peter for a clarification about this formula:

x(a) + width(a)*fontSize(a) + tolerance >= x(b)


What is the difference between "width(a)" and "fontSize(a)"? Isn't it enough
to know the width of the character "a" in pixels, as given by the font, to check
this assertion?

Thanks !


On 7 March 2014 at 18:46, Maruan Sahyoun <[email protected]> wrote:

> if you need further assistance please let us know.
> 
> BR
> Maruan Sahyoun
> 
> On 7 March 2014 at 18:24, HQS <[email protected]> wrote:
> 
>> Thank you all for those accurate answers.
>> I will give a try to the geometrical approach based on the (x, y) 
>> coordinates of the characters.
>> 
>> Best regards,
>> 
>> Julien
>> 
>> On 7 March 2014 at 13:25, Peter Murray-Rust <[email protected]> wrote:
>> 
>>> On Fri, Mar 7, 2014 at 11:16 AM, Confidential Confidential <
>>> [email protected]> wrote:
>>> 
>>>> Sirs,
>>>> 
>>>> I had already thought about this graphical approach to reconstructing the
>>>> words. I dropped it because I'm a bit sceptical about the reliability of
>>>> such a method. I can't help thinking that it will not be 100% reliable.
>>>> I understand why CAD software would produce such an output,
>>>> though (thank you for "boustrophedonic", a word I didn't know,
>>>> which explains the result well).
>>>> 
>>> 
>>> It's not as bad as you think. We have re-constructed the text from hundreds
>>> of scientific papers (so probably nearly a million words) and found very
>>> few problems. The reason we are doing this rather than using PDFBox tools
>>> is that scientific (and especially maths) PDFs contain many diacritics, high
>>> Unicode points, occasional graphics strokes, variable font size and style,
>>> ligatures, non-horizontal text, etc.
>>> 
>>> For running text it works very well - assuming that the characters announce
>>> their widths. Then - roughly - "ab" is a word if
>>> 
>>> x(a) + width(a)*fontSize(a) + tolerance >= x(b)
>>> 
>>> else we can *crudely* estimate the number of intervening spaces (this is
>>> very suspect as publishers may elide concatenated spaces).
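
A minimal sketch of the test above in Python, not taken from any particular library. It assumes width(a) is the advance width expressed as a fraction of the font size (em units), so that width(a)*fontSize(a) is in the same page units as x. The `Glyph` class, `same_word`, `join_words` and the default tolerance are hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    char: str
    x: float          # left edge of the character, in page units
    width: float      # advance width in em units (assumption, see lead-in)
    font_size: float  # font size, in page units


def same_word(a: Glyph, b: Glyph, tolerance: float = 0.5) -> bool:
    """Return True if b starts no later than the scaled right edge of a."""
    return a.x + a.width * a.font_size + tolerance >= b.x


def join_words(glyphs, tolerance: float = 0.5):
    """Split glyphs of one line (already sorted by x) into words."""
    words = []
    current = ""
    prev = None
    for g in glyphs:
        # A gap wider than the tolerance closes the current word.
        if prev is not None and not same_word(prev, g, tolerance):
            words.append(current)
            current = ""
        current += g.char
        prev = g
    if current:
        words.append(current)
    return words
```

The tolerance is the only tunable here; in practice it would probably need to scale with the font size rather than stay fixed.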
>>> 
>>> All standard Fonts (see PDF spec) should announce their widths.
>>> Unfortunately scientific publishers use some of the worst constructed fonts
>>> in the world and sometimes we have to guess - by surveying a body of
>>> character positions and trying to work out spaces and font-type.
>>> 
>>> 
>>>> Supposing that the characters appear in a totally arbitrary order,
>>>> detecting that they're on the same line is more or less a piece of cake
>>>> (except if I need to introduce a tolerance, which makes things more
>>>> difficult),
>>> 
>>> 
>>> In a modern PDF we find that all characters on the same line tend to have
>>> equal y-coords to at least 3 decimals. The problem is that OCR'ed
>>> characters may have variable y because of rounding errors and antialiasing.
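
As a rough illustration of the line-grouping step, the sketch below clusters characters whose y-coordinates differ by no more than a tolerance. The (char, x, y) tuple layout, the function name and the default tolerance of 0.001 (matching the "3 decimals" remark above) are assumptions; OCR'ed input would likely need a much larger tolerance.

```python
def group_into_lines(chars, y_tolerance=0.001):
    """Group (char, x, y) tuples into lines of near-equal y.

    Lines are returned in ascending y order, each sorted by x.
    """
    lines = []
    for ch in sorted(chars, key=lambda c: c[2]):
        # Compare with the y of the last character added to the current line.
        if lines and abs(ch[2] - lines[-1][-1][2]) <= y_tolerance:
            lines[-1].append(ch)
        else:
            lines.append([ch])
    return [sorted(line, key=lambda c: c[1]) for line in lines]
```

Because each character is compared with its predecessor in y order, a slow drift in y along a line is tolerated, which may or may not be desirable.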
>>> 
>>> 
>>> 
>>>> but grouping the characters according to their X position is
>>>> not at all an easy task.
>>>> 
>>> 
>>> The order should be fairly clear. The problems are:
>>> * spaces (see above)
>>> * hyphens at line-end (this requires heuristics, maybe a lookup in WordNet);
>>> we generally solve > 90%. Hyphens in chemistry are meaningful
>>> * diacritics. Some characters have diacritics with the same x (e.g. E and
>>> acute). These can occur in variable order. Where possible we try to
>>> recreate a single Unicode point.
>>> * over and underbars
>>> * ligatures: in "waffle" there may be 6 characters or only 4 (w-a-ffl-e). We
>>> split the latter.
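
For the diacritic and ligature items in the list above, one possible approach (a sketch, not necessarily the method described in this thread) is Unicode normalization from Python's standard `unicodedata` module. It assumes the diacritic has already been mapped to its combining form and placed after its base character; PDFs that emit a spacing accent, or the accent first, need extra handling not shown here. The function names are hypothetical.

```python
import unicodedata


def compose_diacritics(text):
    """Merge base character + combining mark into one code point (NFC)."""
    return unicodedata.normalize("NFC", text)


def split_ligatures(text):
    """Expand the Latin ligatures U+FB00..U+FB06 (ff, fi, fl, ffi, ffl, ...)."""
    out = []
    for ch in text:
        if "\ufb00" <= ch <= "\ufb06":
            # Compatibility decomposition turns the ligature into plain letters.
            out.append(unicodedata.normalize("NFKD", ch))
        else:
            out.append(ch)
    return "".join(out)
```

Restricting the expansion to the ligature range avoids applying compatibility decomposition to the whole text, which would also flatten superscripts and other characters that matter in scientific papers.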
>>> 
>>> 
>>>> 
>>>> But this is not an issue, my problem is more the fact that this method may
>>>> not be 100% reliable. What do you think ?
>>>> 
>>> 
>>> We are committed to solving it for English-language science and European
>>> personal names. The worst case is probably slanted text in diagrams.
>>> 
>>> 
>>>> 
>>>> As for the technical part (overloading the processText), it's ok, thanks
>>>> for the advice.
>>>> 
>>>> Best regards
>>>> 
>>>> Julien
>>>> 
>>>> 
>>>> 
>>>> --
>>> Peter Murray-Rust
>>> Reader in Molecular Informatics
>>> Unilever Centre, Dep. Of Chemistry
>>> University of Cambridge
>>> CB2 1EW, UK
>>> +44-1223-763069
>> 
>
