[ 
https://issues.apache.org/jira/browse/PDFBOX-5411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lapo Luchini updated PDFBOX-5411:
---------------------------------
    Attachment: image-2022-04-15-09-26-20-917.png

> PDFTextStripper could use text size in reconstruction
> -----------------------------------------------------
>
>                 Key: PDFBOX-5411
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-5411
>             Project: PDFBox
>          Issue Type: Improvement
>          Components: Text extraction
>    Affects Versions: 2.0.25, 3.0.0 PDFBox
>            Reporter: Lapo Luchini
>            Priority: Minor
>         Attachments: image-2022-04-08-16-13-17-334.png, 
> image-2022-04-15-09-26-20-917.png, textDoubleText.pdf
>
>
> When two texts partially overlap, {{PDFTextStripper}} seems to return a mix 
> ordered simply by "leftmost x coordinate of the glyph". That makes sense, but 
> it could use glyph size to disambiguate "easy" cases like this one:
> !image-2022-04-08-16-13-17-334.png!
> Currently, this is the first parameter of 
> {{PDFTextStripper.writeString(String string, List<TextPosition> textPositions)}}:
> {{"T0510E09620_S368b3aT92-29fa -4Leef-80I5e-N53c23efE7979f"}}
> I would of course hope for two calls:
> {{"TEST LINE"}}
> {{"051009620_368b3a92-29fa-4eef-805e-53c23ef7979f"}}
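
The idea in the report can be illustrated with a minimal, self-contained sketch. It does not use PDFBox and is not how PDFTextStripper is implemented: the class SizeAwareGrouping, the Glyph type, and the splitBySize method are hypothetical names made up for this example. Glyph stands in for PDFBox's TextPosition, which exposes comparable values (for example getXDirAdj() and getFontSizeInPt()). The sketch assumes all glyphs already belong to one visual line and that the overlapping texts differ in font size, which is the "easy" case described above.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class SizeAwareGrouping {

    /** Hypothetical stand-in for a PDFBox TextPosition: one glyph, its left x, its font size. */
    static final class Glyph {
        final String text;
        final float x;
        final float fontSize;

        Glyph(String text, float x, float fontSize) {
            this.text = text;
            this.x = x;
            this.fontSize = fontSize;
        }
    }

    /**
     * Splits the glyphs of one visual line into runs that share a font size,
     * then orders each run by its leftmost x coordinate.
     * Runs are returned in the order their first glyph was encountered.
     */
    static List<String> splitBySize(List<Glyph> glyphs) {
        Map<Integer, List<Glyph>> bySize = new LinkedHashMap<>();
        for (Glyph g : glyphs) {
            // Round to tenths of a point so tiny float differences do not split a run.
            int key = Math.round(g.fontSize * 10f);
            bySize.computeIfAbsent(key, k -> new ArrayList<>()).add(g);
        }
        List<String> runs = new ArrayList<>();
        for (List<Glyph> run : bySize.values()) {
            run.sort(Comparator.comparingDouble((Glyph g) -> g.x));
            StringBuilder sb = new StringBuilder();
            for (Glyph g : run) {
                sb.append(g.text);
            }
            runs.add(sb.toString());
        }
        return runs;
    }

    public static void main(String[] args) {
        List<Glyph> line = new ArrayList<>();
        // Large text "AB" and small text "12", interleaved along the x axis.
        line.add(new Glyph("A", 0f, 20f));
        line.add(new Glyph("1", 5f, 8f));
        line.add(new Glyph("2", 10f, 8f));
        line.add(new Glyph("B", 12f, 20f));
        System.out.println(splitBySize(line)); // prints [AB, 12]
    }
}
```

Sorting the same four glyphs by x alone would give "A12B", which mirrors the mixed string in the report. Grouping by size first keeps the two texts apart, at the cost of also splitting a line that legitimately mixes font sizes, so a real implementation would likely need an overlap check before splitting.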



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org
