On Sat, Mar 8, 2014 at 5:23 PM, HQS <[email protected]> wrote:

> Peter,
>
> What you said about the factor 1000 I've seen it on a website dealing with
> PDFBox so you might be right.
>
Thanks.

> I have tried the following assertion which, if true, makes 2 characters
> connected to the same word:
>
>     leftChar.getX() + leftChar.getWidth() + space * .5f + X_TOLERANCE >=
>     rightChar.getX()
>
> I tried with X_TOLERANCE = 0.
>
> space is simply equal to leftChar.getWidthOfSpace(), a method in the
> TextPosition class.
> getWidth() is also a method of that class.
>
> The first results are very satisfying.
>

I think you have to involve leftChar.getFontSize(). When *I* extract
characters the width is not scaled. It's possible you are calling other
methods that scale it...

>
> By the way, is there an "easy" way to delete text from a PDF, apart from
> parsing the tokens
> and deleting those preceding the "Tj" / "TJ" operators? I need this to
> erase the reference strings
> that I have detected and create a hyperlink at the same location with the
> same font.
>

I can't comment as I only interpret PDFs, not edit them. BTW I do not use
low-level operators like Tj - I let PDFBox do the work of interpreting.

> When I've tested the PDF words extractor I will post the source code so
> that we can share our techniques.
> The extractor I'm making is a bit more advanced than the one embedded in
> PDFBox as it creates a list of
> pairs (XY position of a word, contents of a word) and does not just give
> the list of words.
>

I do this in two stages: translate all chars to SVG (PDF2SVG) and, in a
separate project (SVG2XML), do the character concatenation. I have to deal
with subscripts, etc. Most PDF2Text tools don't deal with subscripts.

> Thanks all!
>
> Julien
>
>

--
Peter Murray-Rust
Reader in Molecular Informatics
Unilever Centre, Dep. Of Chemistry
University of Cambridge
CB2 1EW, UK
+44-1223-763069
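
For reference, the adjacency test quoted in this message can be sketched in plain Java as below. This is only an illustration under stated assumptions, not the code either poster used: `WordJoiner`, `Glyph`, `sameWord` and `joinWords` are hypothetical names, and `Glyph` is a stand-in for the values one would read from PDFBox's `TextPosition` (`getX()`, `getWidth()`, `getWidthOfSpace()`). It assumes the glyphs are already sorted left to right on one baseline and that widths and positions are in the same units; if the widths turn out to be unscaled, as suggested in the reply, they would need scaling before the comparison.

```java
import java.util.ArrayList;
import java.util.List;

/** Groups the glyphs of one text line into words using the gap test from the thread. */
final class WordJoiner {

    /** Hypothetical stand-in for the values read from a PDFBox TextPosition. */
    static final class Glyph {
        final String text;
        final float x;            // would come from TextPosition.getX()
        final float width;        // would come from TextPosition.getWidth()
        final float widthOfSpace; // would come from TextPosition.getWidthOfSpace()

        Glyph(String text, float x, float width, float widthOfSpace) {
            this.text = text;
            this.x = x;
            this.width = width;
            this.widthOfSpace = widthOfSpace;
        }
    }

    /** Extra slack added to the allowed gap; 0 in the quoted experiment. */
    static final float X_TOLERANCE = 0f;

    /** True when the gap between the two glyphs is at most half a space (plus tolerance). */
    static boolean sameWord(Glyph left, Glyph right) {
        return left.x + left.width + left.widthOfSpace * 0.5f + X_TOLERANCE >= right.x;
    }

    /** Concatenates consecutive glyphs into words; assumes left-to-right order on one line. */
    static List<String> joinWords(List<Glyph> line) {
        List<String> words = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        Glyph previous = null;
        for (Glyph glyph : line) {
            // Start a new word when the gap to the previous glyph is too wide.
            if (previous != null && !sameWord(previous, glyph)) {
                words.add(current.toString());
                current.setLength(0);
            }
            current.append(glyph.text);
            previous = glyph;
        }
        if (current.length() > 0) {
            words.add(current.toString());
        }
        return words;
    }
}
```

Keeping the test in its own `sameWord` method makes it easy to swap in a font-size-scaled width or a non-zero `X_TOLERANCE` without touching the grouping loop.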

