[ 
https://issues.apache.org/jira/browse/PDFBOX-234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andreas Lehmkühler updated PDFBOX-234:
--------------------------------------

    Fix Version/s: 0.8.0-incubator

> spaces lost
> -----------
>
>                 Key: PDFBOX-234
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-234
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>            Priority: Minor
>             Fix For: 0.8.0-incubator
>
>
> [imported from SourceForge]
> http://sourceforge.net/tracker/index.php?group_id=78314&atid=552832&aid=1635950
> Originally submitted by tweakerbee on 2007-01-15 07:09.
> During extraction in certain PDF documents spaces will be lost. I have 
> attached a file in which this problem occurs.
> Here PDFTextStripper.getText() returns:
> gaandeofincidenteleaardis
> whereas it should be
> gaande of incidentele aard is
> I have used the nightly build from today (15-01-07) but the problem still 
> remains.
> [attachment on SourceForge]
> http://sourceforge.net/tracker/download.php?group_id=78314&atid=552832&aid=1635950&file_id=211376
> STB336.pdf (application/pdf), 51425 bytes
> document with erroneous text extraction
> [comment on SourceForge]
> Originally sent by tweakerbee.
> Logged In: YES 
> user_id=1625706
> Originator: YES
> The problem turned out to be in the splitting algorithm: the threshold 
> values used there were slightly too conservative.
> Using 0.33f (33%) yielded proper results, although this might split words 
> that are not meant to be split.
> Maybe this factor could be set through a field in the TextStripper, so that 
> an application can be tuned more easily to its specific needs?
> This issue can be considered solved.
> startOfNextWordX = endOfLastTextX + (wordSpacing * 0.33f);
> startOfNextWordX = endOfLastTextX + (((wordSpacing + lastWordSpacing) / 2f) * 0.33f);
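
The suggestion above, exposing the 0.33f factor as an adjustable field, could look roughly like the sketch below. SpaceThreshold and all of its members are hypothetical names used for illustration and are not part of PDFBox. The rule that a space is emitted once the next text starts at or beyond startOfNextWordX is also an assumption about how the computed value is used.

```java
/** Hypothetical helper (not a PDFBox class): decides whether a horizontal gap implies a space. */
final class SpaceThreshold {
    /** Fraction of the averaged word spacing a gap must reach; 0.33f is the value reported in the comment. */
    private float spacingFactor = 0.33f;

    /** Lets an application tune the heuristic instead of relying on a hard-coded constant. */
    void setSpacingFactor(float spacingFactor) {
        if (spacingFactor <= 0f) {
            throw new IllegalArgumentException("spacingFactor must be positive");
        }
        this.spacingFactor = spacingFactor;
    }

    float getSpacingFactor() {
        return spacingFactor;
    }

    /** Same arithmetic as the second formula quoted in the comment, with the factor made configurable. */
    float startOfNextWordX(float endOfLastTextX, float wordSpacing, float lastWordSpacing) {
        return endOfLastTextX + ((wordSpacing + lastWordSpacing) / 2f) * spacingFactor;
    }

    /** Assumed usage: a space is needed when the next text starts at or beyond the threshold. */
    boolean needsSpace(float endOfLastTextX, float nextTextX, float wordSpacing, float lastWordSpacing) {
        return nextTextX >= startOfNextWordX(endOfLastTextX, wordSpacing, lastWordSpacing);
    }
}
```

A smaller factor inserts spaces more eagerly and risks splitting words. A larger one is more conservative and risks the merged output described in this report.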
> [comment on SourceForge]
> Originally sent by tweakerbee.
> My previous assumption turned out to be incorrect.
> The context.showString() function is responsible for outputting the string. 
> If anywhere, it should probably output the space here.
> [comment on SourceForge]
> Originally sent by tweakerbee.
> I am currently looking into the problem myself as well, but my complete lack 
> of experience with the Portable Document Format and my being a novice Java 
> programmer are rather limiting.
> What I have found out so far is this:
> The problem is in the TextStream where a TJ operator is being used to show 
> the glyphs. There are no spaces encoded in the file, but instead it uses some 
> character spacing information to space out the words. An example is included 
> below.
> The code I believe is responsible for extracting the text here 
> (org.pdfbox.util.operator.ShowTextGlyph) does not contain any code to 
> determine whether or not a space is needed. Would it be useful to add this 
> here? And will this not break the org.pdfbox.util.PDFHighlighter? (I have 
> noticed some difficulties with certain PDF documents and I wouldn't be 
> surprised if the difference in character count originates from this issue.)
> Any help would be greatly appreciated.
> Example code in STB336.pdf:
> [(7?????)-278(???)-278(? ?"&????)-278(???)-278(???)-278( 
> ??\))-278(???)-278(??????\)????\012)-278( ?????'??&)-278(??)]TJ
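
For illustration, the idea raised in the last comment (deciding inside the TJ handling whether a space is needed) can be sketched as follows. In a TJ array, each number is expressed in thousandths of a text space unit and is subtracted from the current position, so a negative value such as -278 moves the next string to the right. TjSpaceSketch, joinWithSpaces and the minGap parameter are hypothetical and are not the ShowTextGlyph implementation. The decoded strings in the usage note are assumed from the expected output given in the report, since the raw bytes in the example are not readable here.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch (not PDFBox code): rebuilds text from the elements of a TJ array. */
final class TjSpaceSketch {
    /**
     * @param tjArray decoded strings (String) interleaved with position adjustments (Number)
     * @param minGap  smallest rightward shift, in thousandths of a text space unit, treated as a space
     */
    static String joinWithSpaces(List<Object> tjArray, float minGap) {
        StringBuilder out = new StringBuilder();
        for (Object element : tjArray) {
            if (element instanceof String) {
                out.append((String) element);
            } else if (element instanceof Number) {
                // The adjustment is subtracted from the position, so the rightward gap is its negation.
                float gap = -((Number) element).floatValue();
                boolean endsWithSpace = out.length() > 0 && out.charAt(out.length() - 1) == ' ';
                if (gap >= minGap && out.length() > 0 && !endsWithSpace) {
                    out.append(' ');
                }
            }
        }
        return out.toString();
    }
}
```

Under these assumptions, joinWithSpaces(Arrays.<Object>asList("gaande", -278, "of", -278, "incidentele", -278, "aard", -278, "is"), 200f) gives "gaande of incidentele aard is", while a small kerning adjustment such as -20 stays below the threshold and inserts nothing.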

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.