Re: 2 questions

Peter Murray-Rust Sat, 08 Mar 2014 09:51:11 -0800

On Sat, Mar 8, 2014 at 5:23 PM, HQS <[email protected]> wrote:

> Peter,
>
> I've seen what you said about the factor of 1000 on a website dealing with
> PDFBox, so you might be right.
>


Thanks.


> I have tried the following condition which, if true, treats two characters
> as belonging to the same word:
>
> leftChar.getX() + leftChar.getWidth() + space * .5f + X_TOLERANCE >=
> rightChar.getX()
>
> I tried with X_TOLERANCE = 0
>
> space is simply equal to leftChar.getWidthOfSpace(), a method in the
> TextPosition class.
> getWidth() is also a method of that class.
>
> The first results are very satisfying.
>

I think you have to take leftChar.getFontSize() into account. When *I*
extract characters, the width is not scaled. It's possible you are calling
other methods that scale it...
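
For reference, the quoted test could be written as a small predicate like the
sketch below. It is only an illustration: Glyph is a hypothetical stand-in for
PDFBox's TextPosition that holds just the three values used, and the scale
argument is an assumed correction factor (pass 1.0f if the widths are already
in the same units as x). How that factor relates to the font size is not
settled in this thread.

```java
final class WordJoinSketch {

    // Hypothetical stand-in for TextPosition; not a PDFBox class.
    static final class Glyph {
        final float x;            // left edge, as returned by getX()
        final float width;        // advance width, as returned by getWidth()
        final float widthOfSpace; // as returned by getWidthOfSpace()

        Glyph(float x, float width, float widthOfSpace) {
            this.x = x;
            this.width = width;
            this.widthOfSpace = widthOfSpace;
        }
    }

    static final float X_TOLERANCE = 0f;

    // True when 'right' starts no further than half a space width
    // (plus the tolerance) beyond the right edge of 'left'.
    // 'scale' is an assumption: a factor applied to unscaled widths.
    static boolean sameWord(Glyph left, Glyph right, float scale) {
        float rightEdge = left.x + left.width * scale;
        float allowance = left.widthOfSpace * scale * 0.5f + X_TOLERANCE;
        return rightEdge + allowance >= right.x;
    }
}
```

With a scale of 1.0f this is the same comparison as in the quoted text; a
different scale only matters if the widths and the x coordinates turn out to
be in different units.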

>
> By the way, is there an "easy" way to delete text from a PDF, apart from
> parsing the tokens
> and deleting those preceding the "Tj" / "TJ" operators? I need this to
> erase the reference strings
> that I have detected and create a hyperlink at the same location with the
> same font.
>

I can't comment as I only interpret PDFs, not edit them.

BTW I do not use low level operators like Tj - I let PDFBox do the work of
interpreting.


> When I've tested the PDF word extractor I will post the source code so
> that we can share our techniques.
> The extractor I'm making is a bit more advanced than the one embedded in
> PDFBox, as it creates a list of
> pairs (XY position of a word, contents of a word) rather than just giving
> the list of words.
>

I do this in two stages - translate all chars to SVG (PDF2SVG) and in a
separate project (SVG2XML) do the character concatenation - I have to deal
with subscripts, etc. Most PDF2Text tools don't deal with subscripts.
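
As an illustration of the (position, contents) pairs described in the quoted
text, here is a minimal sketch of the concatenation step. It is not the code
of PDF2SVG, SVG2XML, or PDFBox: Ch and Word are hypothetical classes, the
input is assumed to be already sorted in reading order, and the join test is
the half-space-width rule quoted earlier in this thread with a tolerance of
zero.

```java
import java.util.ArrayList;
import java.util.List;

final class WordListSketch {

    // Hypothetical character record: text plus the geometry the join test needs.
    static final class Ch {
        final String text;
        final float x, y, width, widthOfSpace;

        Ch(String text, float x, float y, float width, float widthOfSpace) {
            this.text = text;
            this.x = x;
            this.y = y;
            this.width = width;
            this.widthOfSpace = widthOfSpace;
        }
    }

    // One output pair: position of the word's first character, and its contents.
    static final class Word {
        final float x, y;
        final String text;

        Word(float x, float y, String text) {
            this.x = x;
            this.y = y;
            this.text = text;
        }
    }

    // Half-space-width rule from the quoted text, with X_TOLERANCE = 0.
    static boolean sameWord(Ch left, Ch right) {
        return left.x + left.width + left.widthOfSpace * 0.5f >= right.x;
    }

    // Assumes 'chars' is already in reading order.
    static List<Word> toWords(List<Ch> chars) {
        List<Word> words = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        float startX = 0f, startY = 0f;
        Ch previous = null;
        for (Ch c : chars) {
            // Join only characters on exactly the same baseline.
            boolean joins = previous != null && previous.y == c.y && sameWord(previous, c);
            if (!joins) {
                if (previous != null) {
                    words.add(new Word(startX, startY, current.toString()));
                }
                current.setLength(0);
                startX = c.x;
                startY = c.y;
            }
            current.append(c.text);
            previous = c;
        }
        if (previous != null) {
            words.add(new Word(startX, startY, current.toString()));
        }
        return words;
    }
}
```

Because this sketch requires an identical y value, a subscript would be split
off as a separate word; treating subscripts properly needs more than this.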


> Thanks all !
>
> Julien
>
>

-- 
Peter Murray-Rust
Reader in Molecular Informatics
Unilever Centre, Dep. Of Chemistry
University of Cambridge
CB2 1EW, UK
+44-1223-763069
