[ 
https://issues.apache.org/jira/browse/PDFBOX-588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12976922#action_12976922
 ] 

Mel Martinez commented on PDFBOX-588:
-------------------------------------

There are two separate problems discussed here.

One is line separation detection.  That seems to be reasonably well addressed 
in PDFBOX-521.  

The other problem that Hesham refers to is the dropping of space characters.  
This is handled by a different bit of logic in PDFBox from 
line/paragraph/article/page demarcation.  There is code that looks at the 
positional gap between two characters and decides whether or not to insert a 
space character.  How accurate this is appears to depend somewhat on the 
fonts used.

I am actually seeing the inverse trend.  On some of our test documents we used 
to see this problem with PDFBox v1.0, but it has been fixed in the later 
code, at least for these particular documents.

It is possible that the problem has just shifted, so that different fonts 
now trigger it.

You can try calling 'PDFTextStripper.setSpacingTolerance(float)' to change 
the behavior.  The default value is 0.5f.  Try making this larger or smaller 
and see how it behaves for you.  A smaller value should increase the 
insertion of spaces.
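
To illustrate why a smaller tolerance should give more spaces, here is a 
rough, self-contained sketch of gap-based space insertion.  It is NOT the 
actual PDFBox code: the names SpacingSketch, Glyph, join and sample are made 
up for this example, and the rule "insert a space when the gap exceeds 
spaceWidth * tolerance" is a simplifying assumption about how the tolerance 
is applied.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not taken from the PDFBox sources.
final class SpacingSketch {

    /** One character with its left edge and width, in arbitrary units. */
    static final class Glyph {
        final char ch;
        final float x;
        final float width;

        Glyph(char ch, float x, float width) {
            this.ch = ch;
            this.x = x;
            this.width = width;
        }
    }

    /**
     * Joins glyphs into a string, inserting a space whenever the gap between
     * the end of one glyph and the start of the next exceeds
     * spaceWidth * tolerance (assumed rule, see the note above).
     */
    static String join(List<Glyph> glyphs, float spaceWidth, float tolerance) {
        StringBuilder out = new StringBuilder();
        Glyph prev = null;
        for (Glyph g : glyphs) {
            if (prev != null) {
                float gap = g.x - (prev.x + prev.width);
                if (gap > spaceWidth * tolerance) {
                    out.append(' ');
                }
            }
            out.append(g.ch);
            prev = g;
        }
        return out.toString();
    }

    /** Three glyphs: no gap between 'a' and 'b', a gap of 2 before 'c'. */
    static List<Glyph> sample() {
        List<Glyph> glyphs = new ArrayList<>();
        glyphs.add(new Glyph('a', 0f, 5f));
        glyphs.add(new Glyph('b', 5f, 5f));
        glyphs.add(new Glyph('c', 12f, 5f));
        return glyphs;
    }

    public static void main(String[] args) {
        // Threshold 5 * 0.5 = 2.5: the gap of 2 is too small, no space.
        System.out.println(join(sample(), 5f, 0.5f)); // abc
        // Threshold 5 * 0.3 = 1.5: the gap of 2 now counts as a space.
        System.out.println(join(sample(), 5f, 0.3f)); // ab c
    }
}
```

With PDFBox itself, the equivalent experiment would be to call 
setSpacingTolerance on your PDFTextStripper with a value such as 0.3f before 
calling getText, and compare the output against the default.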



> Problem extracting text in newline characters
> ---------------------------------------------
>
>                 Key: PDFBOX-588
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-588
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 0.8.0-incubator
>         Environment: Win XP
>            Reporter: Hesham
>         Attachments: Enters-sample.pdf, PDFTextStripper.patch
>
>
> Hello ,
>  
> I have a PDF file with 1 page only, when I try to extract its text using :
> String pageData = stripper.getText( pdfFile );
> It ignores some Enter characters between lines, so the last word in the line 
> and the first word in the next line appear as 1 word without spaces between 
> them !!
> While if I copy the PDF text manually from the PDF and paste it in a text 
> editor, Enter characters appear after the same lines that caused the problem 
> in PDFBox.
> Please check the attached file as a sample.
>  
> Is there a way to fix this ?
>  
> Best regards ,

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
