PDFdev is a service provided by PDFzone.com | http://www.pdfzone.com _____________________________________________________________
> Is there an easy way to determine if two consecutive PDEText text
> runs constitute two words separated by a single space?
> What I need to do is extract lines of text instead of words.

There is a key point you may be missing here, and I recommend becoming
familiar with the PDF Reference if you want to extract text.

There are no words. There are no lines. There are only characters of
text, which might be grouped into runs. This is why the API stops at
runs. Everything else has to be guesswork.

That's not to say that Acrobat stops short of guesswork. The
PDWordFinder API uses guesswork to guess where the words are, with
reasonable success.

So, you have to devise your own "fuzzy logic" to guess if it is
reasonable to interpret two runs as being two words separated by
(conceptually) a single space.

You might start by checking that the baselines are equal and find
which run is on the right (dealing with potential overlap). Then find
the distance between the right hand side of one run, and the left hand
side of the other.

Now you have the "spacing", the only solid information you can have.
Is that a "single space"? You decide. You probably want to base your
decision at least on the current font size, and maybe the font itself.
You might also do statistical analysis on all of the text with the
same baseline (a simplistic definition of "line") and figure out the
average space width.

Look at printed material and you will see that the inter-character
spacing on some exceeds the inter-word spacing on others; newspaper
columns especially.

IF your text is single source and controlled, you can make more
assumptions and the guesswork may be easier.

Aandi

To change your subscription:
http://www.pdfzone.com/discussions/lists-pdfdev.html
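
For illustration only, the gap heuristic described above might be sketched roughly as follows. This is not Acrobat SDK code: the Run class is a hypothetical stand-in for whatever position and font data you pull out of each text run, and the tolerance and ratio values are made-up starting points that you would need to tune against your own documents.

```python
from dataclasses import dataclass


@dataclass
class Run:
    """Hypothetical stand-in for a text run; not an Acrobat SDK type."""
    text: str
    left: float       # x of the run's left edge
    right: float      # x of the run's right edge
    baseline: float   # y of the run's baseline
    font_size: float


def same_baseline(a, b, tol=0.5):
    """Treat baselines as equal if they differ by no more than tol (assumed)."""
    return abs(a.baseline - b.baseline) <= tol


def gap_between(a, b):
    """Distance from the right edge of the left-hand run to the left edge
    of the right-hand run. Negative means the runs overlap."""
    first, second = (a, b) if a.left <= b.left else (b, a)
    return second.left - first.right


def looks_like_single_space(a, b, min_ratio=0.15, max_ratio=0.6):
    """Guess whether the gap is about one space wide, scaled by font size.
    The two ratios are assumptions, not values from any specification."""
    if not same_baseline(a, b):
        return False
    size = max(a.font_size, b.font_size)
    return min_ratio * size <= gap_between(a, b) <= max_ratio * size


def join_line(runs):
    """Join runs that share a baseline, left to right, inserting a space
    wherever the gap looks like a single space."""
    ordered = sorted(runs, key=lambda r: r.left)
    if not ordered:
        return ""
    out = ordered[0].text
    for prev, cur in zip(ordered, ordered[1:]):
        if looks_like_single_space(prev, cur):
            out += " "
        out += cur.text
    return out
```

With 12 point text, a gap of 3.5 units between two runs would be read as a space by this sketch, while a gap of 0.2 units would be treated as ordinary character spacing and the runs joined directly. Gaps wider than max_ratio are also joined without a space here; whether to treat those as column breaks or tabs is another decision you would have to make. Fixed ratios like these will misjudge tightly or loosely set text, which is where the statistical approach over the whole baseline would come in.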
