[jira] [Commented] (PDFBOX-2138) Corrupted words when using PDFTextStripper

Sebastian Schuberth (Jira) Wed, 07 Jun 2023 02:27:57 -0700


    [ 
https://issues.apache.org/jira/browse/PDFBOX-2138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17730031#comment-17730031
 ]


Sebastian Schuberth commented on PDFBOX-2138:
---------------------------------------------

I suspect that several distinct causes of garbled text are being conflated 
here. For cases as mentioned by the OP, where the extracted and expected texts 
look broadly similar and only single characters within words are wrong (for 
example, "Buchung Valuta Vorgang Soll Haben" becomes "Buchung saäuta sorgang 
poää eaben"), I was able to solve this in iText by ignoring the "ToUnicode" 
tables of non-embedded fonts, similar to what is described 
[here|http://stackoverflow.com/a/37786643/1127485]. I'm still looking for a way 
to do the same in PDFBox.
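
To make the idea concrete, here is a minimal, library-independent sketch of 
that decision. It is only a model: the class, the FontInfo type, the decode 
method and the Latin-1 fallback are all hypothetical names and assumptions made 
for illustration, and none of them are PDFBox or iText APIs. The broken mapping 
is reconstructed from the example above (V to s, S to p, H to e, l to ä).

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model for illustration; these are NOT PDFBox or iText classes.
class ToUnicodeFallbackSketch {

    /** Stand-in for a PDF font: an embedded flag plus an optional ToUnicode map. */
    static final class FontInfo {
        final boolean embedded;
        final Map<Integer, String> toUnicode; // character code -> text, may be wrong

        FontInfo(boolean embedded, Map<Integer, String> toUnicode) {
            this.embedded = embedded;
            this.toUnicode = toUnicode;
        }
    }

    /**
     * Decodes character codes to text. When the flag is set, the ToUnicode map
     * of a non-embedded font is skipped and the fallback encoding is used.
     */
    static String decode(int[] codes, FontInfo font, boolean ignoreToUnicodeIfNotEmbedded) {
        boolean useToUnicode = font.toUnicode != null
                && (font.embedded || !ignoreToUnicodeIfNotEmbedded);
        StringBuilder out = new StringBuilder();
        for (int code : codes) {
            String mapped = useToUnicode ? font.toUnicode.get(code) : null;
            // Assumed fallback: interpret the code directly as a Latin-1 code point.
            out.append(mapped != null ? mapped : String.valueOf((char) code));
        }
        return out.toString();
    }

    /** A ToUnicode map with the wrong entries implied by the example above. */
    static Map<Integer, String> brokenMap() {
        Map<Integer, String> m = new HashMap<>();
        m.put((int) 'V', "s");
        m.put((int) 'S', "p");
        m.put((int) 'H', "e");
        m.put((int) 'l', "\u00e4"); // 'l' is wrongly mapped to a-umlaut
        return m;
    }

    /** Converts a string to the character codes the sketch treats as glyph codes. */
    static int[] codesOf(String s) {
        return s.chars().toArray();
    }

    public static void main(String[] args) {
        FontInfo font = new FontInfo(false, brokenMap());
        int[] codes = codesOf("Buchung Valuta Vorgang Soll Haben");
        System.out.println(decode(codes, font, false)); // trusts the broken map
        System.out.println(decode(codes, font, true));  // ignores it for this font
    }
}
```

With the flag off, the sketch reproduces the garbled line from the example; 
with it on, the non-embedded font's map is bypassed and the original words come 
back. Embedded fonts keep using their ToUnicode map in both cases.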

> Corrupted words when using PDFTextStripper
> ------------------------------------------
>
>                 Key: PDFBOX-2138
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-2138
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.8.5, 1.8.6, 2.0.0
>         Environment: Windows 7 / 64 bit
>            Reporter: Walter Kehl
>            Priority: Major
>             Fix For: 4.0.0
>
>         Attachments: PDFBOX-2138-noClip.pdf, PDFBOX-2138-noClip.png, 
> PDFBOX-2138.pdf, PDFBOX-2138.txt, banking-banana-skins-2014.pdf, 
> banking-banana-skins-2014.txt
>
>
> I am using PDFTextStripper (embedded into another application) to get 
> the raw text of PDFs so far with good results but recently a PDF file 
> has appeared where the output of the PDFTextStripper was corrupted. I 
> got sentences like:
>
> "There is al o con ern that b nkers may be pushed to misprice risk 
> (No. 6) by the pres ures of c mpetition and an abunda ce of central b 
> nk-provided liquidity."
>
> Additionally some portions of text appear 
> twice in the output: first correctly and then corrupted. I have 
> attached an output created with PDFBox's command line options.
> If you compare lines 357-365 with lines 421-429 you see that it is 
> the same paragraph, first ok and then with characters missing. In the 
> original source this paragraph is unique.
> The same seems to happen for the other instances where text is corrupted.
> I also tried it directly on the command line with the same results: input and 
> output files are attached.



