[jira] [Commented] (PDFBOX-5857) PDFTextStripper returns messed up data

arjunce (Jira) Thu, 01 Aug 2024 11:12:38 -0700


    [ 
https://issues.apache.org/jira/browse/PDFBOX-5857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17870253#comment-17870253
 ]


arjunce commented on PDFBOX-5857:
---------------------------------

Hey Tilman,

Thanks for the reply.

Right now I am doing the following to abort the processing of documents like 
this one, so that they can eventually be run through OCR to get the correct data.
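{code:java}
// Collect the distinct adjusted Y coordinates (baselines) of the text positions.
Set<Float> baseline = textPositions.stream()
        .map(TextPosition::getYDirAdj)
        .collect(Collectors.toSet());
// Abort if the baselines are spread over more than 10 units.
// The isEmpty() guard avoids a NoSuchElementException from Collections.max/min.
if (!baseline.isEmpty() && Collections.max(baseline) - Collections.min(baseline) > 10)
    throw new RuntimeException("Font size not consistent. Cannot be processed.");
{code}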
I don't want to dig too deeply into the document, and it would be great if you 
could suggest a better way to check this.
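
For illustration only, the same check can be pulled out into a small helper 
that does not depend on PDFBox, which makes the threshold easy to try out on 
plain numbers. This is a minimal sketch, not a recommended detection method: 
the class name BaselineCheck, the methods baselineSpread and looksJumbled, and 
the constant MAX_SPREAD are all hypothetical names, and the input list is 
assumed to hold the values returned by TextPosition::getYDirAdj for one chunk 
of text.

```java
import java.util.Collections;
import java.util.List;

// Hypothetical helper; none of these names exist in PDFBox.
final class BaselineCheck {

    // Assumed threshold, copied from the value 10 used in the snippet.
    static final float MAX_SPREAD = 10f;

    // Returns the difference between the largest and smallest Y value,
    // or 0 for an empty list (Collections.max would throw on empty input).
    static float baselineSpread(List<Float> yValues) {
        if (yValues.isEmpty()) {
            return 0f;
        }
        return Collections.max(yValues) - Collections.min(yValues);
    }

    // True when the baselines are spread over more than MAX_SPREAD units.
    static boolean looksJumbled(List<Float> yValues) {
        return baselineSpread(yValues) > MAX_SPREAD;
    }

    public static void main(String[] args) {
        // Made-up sample values standing in for getYDirAdj() results.
        List<Float> sameLine = List.of(700.5f, 700.5f, 701.0f);
        List<Float> mixedLines = List.of(700.5f, 688.5f, 650.0f);

        System.out.println("same line jumbled?   " + looksJumbled(sameLine));
        System.out.println("mixed lines jumbled? " + looksJumbled(mixedLines));
    }
}
```

Inside a PDFTextStripper subclass, the list would be built from the 
textPositions stream as in the snippet above, and the caller could throw 
whenever looksJumbled returns true.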

> PDFTextStripper returns messed up data 
> ---------------------------------------
>
>                 Key: PDFBOX-5857
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-5857
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 3.0.2 PDFBox
>            Reporter: arjunce
>            Priority: Minor
>         Attachments: extractedText.txt, jumbledtext.pdf, screenshot-1.png
>
>
> I have attached below the input pdf and its text output for you to take a 
> look at. I am using PDFTextStripper along with these:
> {code:java}
> super();
> this.setSortByPosition(true);
> this.setWordSeparator("_word_"); {code}
> Since I am using sort by position, the text is jumbled. Is there a way for me 
> to detect this instead of outputting the jumbled text? Any help is 
> appreciated. Thanks.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org
