[ https://issues.apache.org/jira/browse/PDFBOX-5411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Lapo Luchini updated PDFBOX-5411:
---------------------------------
    Attachment: image-2022-04-15-09-26-20-917.png

> PDFTextStripper could use text size in reconstruction
> -----------------------------------------------------
>
>                 Key: PDFBOX-5411
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-5411
>             Project: PDFBox
>          Issue Type: Improvement
>          Components: Text extraction
>    Affects Versions: 2.0.25, 3.0.0 PDFBox
>            Reporter: Lapo Luchini
>            Priority: Minor
>         Attachments: image-2022-04-08-16-13-17-334.png, image-2022-04-15-09-26-20-917.png, textDoubleText.pdf
>
>
> When two texts are partially overlapping {{PDFTextStripper}} seems to return
> a mix simply based on "leftmost x coordinate of the glyph", which makes
> sense, but it could make use of glyph size to disambiguate "easy" cases like
> this one:
> !image-2022-04-08-16-13-17-334.png!
> currently this is the first parameter of PDFTextStripper.writeString(String
> string, List<TextPosition> textPositions):
> {{"T0510E09620_S368b3aT92-29fa -4Leef-80I5e-N53c23efE7979f"}}
> I would of course hope for two calls:
> {{"TEST LINE"}}
> {{"051009620_368b3a92-29fa-4eef-805e-53c23ef7979f"}}
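
The disambiguation requested in the issue can be illustrated with a minimal, standalone Java sketch. It is not PDFBox code and not a proposed patch: `Glyph`, `SizeAwareSplit` and `splitBySize` are hypothetical names invented for this example, and the sketch assumes that every glyph of one text run reports the same font size (compared here after rounding to a tenth of a point). In a real `PDFTextStripper` subclass the equivalent values would presumably come from `TextPosition.getUnicode()`, `TextPosition.getXDirAdj()` and `TextPosition.getFontSizeInPt()`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class SizeAwareSplit {

    // Hypothetical stand-in for a positioned glyph; NOT a PDFBox type.
    static final class Glyph {
        final String text;
        final float x;        // leftmost x coordinate of the glyph
        final float fontSize; // font size in points

        Glyph(String text, float x, float fontSize) {
            this.text = text;
            this.x = x;
            this.fontSize = fontSize;
        }
    }

    // Splits a list of glyphs (already sorted by x) into one string per font size.
    // Glyph order inside each string follows the input order; strings are
    // returned in order of first appearance of their font size.
    static List<String> splitBySize(List<Glyph> glyphs) {
        Map<Integer, StringBuilder> runs = new LinkedHashMap<>();
        for (Glyph g : glyphs) {
            // Assumption: sizes that agree to a tenth of a point belong together.
            int key = Math.round(g.fontSize * 10f);
            runs.computeIfAbsent(key, k -> new StringBuilder()).append(g.text);
        }
        List<String> result = new ArrayList<>();
        for (StringBuilder sb : runs.values()) {
            result.add(sb.toString());
        }
        return result;
    }

    public static void main(String[] args) {
        // Small "abc" (8 pt) overlapping large "HI" (24 pt), merged by x coordinate.
        List<Glyph> merged = Arrays.asList(
                new Glyph("a", 0f, 8f),
                new Glyph("H", 2f, 24f),
                new Glyph("b", 5f, 8f),
                new Glyph("c", 10f, 8f),
                new Glyph("I", 14f, 24f));
        for (String run : splitBySize(merged)) {
            System.out.println(run);
        }
    }
}
```

Grouping purely by size only covers the "easy" case described in the issue; two overlapping texts of the same size would still be interleaved, so any real implementation would need further criteria.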