RE: [PDFdev] Text runs

Aandi Inston Thu, 19 Feb 2004 01:38:36 -0800



> Is there an easy way to determine if two consecutive PDEText text 
> runs constitute two words separated by a single space?
> What I need to do is extract lines of text instead of words. 

There is a key point you may be missing here, and I recommend becoming
familiar with the PDF Reference if you want to extract text.

There are no words. There are no lines. There are only characters
of text, which might be grouped into runs. This is why the API
stops at runs. Everything else has to be guesswork.

That's not to say that Acrobat stops short of guesswork. The
PDWordFinder API uses heuristics to guess where the words are,
with reasonable success.

So, you have to devise your own "fuzzy logic" to guess if it is
reasonable to interpret two runs as being two words separated
by (conceptually) a single space.  You might start by checking
that the baselines are equal and find which run is on the right
(dealing with potential overlap). Then find the distance between
the right hand side of one run, and the left hand side of the 
other. Now you have the "spacing", the only solid information
you can have. Is that a "single space"?  You decide. You probably
want to base your decision at least on the current font size,
and maybe the font itself. You might also do statistical analysis
on all of the text with the same baseline (a simplistic definition
of "line") and figure out the average space width. Look at printed
material and you will see that the inter-character spacing in
some text exceeds the inter-word spacing in other text, especially
in newspaper columns.
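
To make that concrete, here is a rough sketch of the kind of test I
mean, written in Python rather than against the Acrobat API. The Run
class, its fields, and the tolerance and ratio values are all
hypothetical; you would fill a Run from whatever position and font
size information you extract for each run, and tune the thresholds
against your own documents.

```python
from dataclasses import dataclass


@dataclass
class Run:
    """Hypothetical stand-in for one text run; not an Acrobat type."""
    text: str
    x_left: float     # left edge of the run, in page units
    x_right: float    # right edge of the run, in page units
    baseline: float   # y coordinate of the baseline
    font_size: float  # font size in the same units


def same_baseline(a: Run, b: Run, tol: float = 0.5) -> bool:
    # Treat baselines as equal if they differ by at most tol (an assumed value).
    return abs(a.baseline - b.baseline) <= tol


def gap_between(a: Run, b: Run) -> float:
    # Work out which run is on the left, then measure from its right edge
    # to the left edge of the other run. Negative means the runs overlap.
    left, right = (a, b) if a.x_left <= b.x_left else (b, a)
    return right.x_left - left.x_right


def looks_like_single_space(a: Run, b: Run,
                            min_ratio: float = 0.15,
                            max_ratio: float = 0.6) -> bool:
    # The ratios are guesses expressed as fractions of the font size;
    # they are not taken from any specification.
    if not same_baseline(a, b):
        return False
    size = max(a.font_size, b.font_size)
    return min_ratio * size <= gap_between(a, b) <= max_ratio * size


def join_line(runs: list[Run]) -> str:
    # Simplistic "line": runs assumed to share a baseline, ordered left to
    # right, with a space inserted wherever the gap looks like one space.
    if not runs:
        return ""
    ordered = sorted(runs, key=lambda r: r.x_left)
    parts = [ordered[0].text]
    for prev, cur in zip(ordered, ordered[1:]):
        parts.append(" " if looks_like_single_space(prev, cur) else "")
        parts.append(cur.text)
    return "".join(parts)
```

For example, join_line([Run("world", 33, 60, 100, 12),
Run("Hello", 0, 30, 100, 12)]) gives "Hello world" with these
thresholds, because the 3 unit gap falls between 15% and 60% of the
12 unit font size. Fixed ratios like these will fail on tightly or
loosely set text, which is where the statistical approach comes in.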

If your text is single-source and controlled, you can make
more assumptions and the guesswork may be easier.

Aandi

