[
https://issues.apache.org/jira/browse/PDFBOX-5411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17520126#comment-17520126
]
Michael Klink commented on PDFBOX-5411:
---------------------------------------
{quote}it could make use of glyph size to disambiguate "easy" cases like this
one{quote}
In the example, disambiguating by glyph size would produce better output. But
in other cases it would make the output worse, e.g. in a poor man's
caps/small caps emulation, where one word deliberately mixes two font sizes.
Of course, your example also offers slightly different baselines, overlapping
actual glyph drawings, and different colors as hints. No single hint would
suffice by itself; all of them together probably would.
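
The trade-off with glyph size can be sketched in a few lines of plain Java.
This is only an illustration under stated assumptions, not PDFBox code:
{{SizeSplitSketch}}, {{Glyph}}, {{mergeByX}}, {{splitBySize}} and {{glyphs}}
are hypothetical names, and the coordinates, font sizes and the 0.5 pt
tolerance are invented for the demonstration. It shows that splitting one
visual line by font size separates two overlapping texts, and that the same
rule breaks a word apart when small caps are emulated with two sizes.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class SizeSplitSketch {

    // Hypothetical stand-in for a text position; NOT a PDFBox class.
    static final class Glyph {
        final String text;
        final float x;        // left edge of the glyph
        final float fontSize; // font size in points

        Glyph(String text, float x, float fontSize) {
            this.text = text;
            this.x = x;
            this.fontSize = fontSize;
        }
    }

    // Concatenates the glyphs ordered by left x coordinate only.
    static String mergeByX(List<Glyph> line) {
        List<Glyph> sorted = new ArrayList<>(line);
        sorted.sort(Comparator.comparingDouble((Glyph g) -> g.x));
        StringBuilder sb = new StringBuilder();
        for (Glyph g : sorted) {
            sb.append(g.text);
        }
        return sb.toString();
    }

    // Puts each glyph into the first run whose first glyph has a font size
    // within `tolerance` points, then orders every run by x coordinate.
    static List<String> splitBySize(List<Glyph> line, float tolerance) {
        List<List<Glyph>> runs = new ArrayList<>();
        for (Glyph g : line) {
            List<Glyph> target = null;
            for (List<Glyph> run : runs) {
                if (Math.abs(run.get(0).fontSize - g.fontSize) <= tolerance) {
                    target = run;
                    break;
                }
            }
            if (target == null) {
                target = new ArrayList<>();
                runs.add(target);
            }
            target.add(g);
        }
        List<String> result = new ArrayList<>();
        for (List<Glyph> run : runs) {
            result.add(mergeByX(run));
        }
        return result;
    }

    // Test helper: one glyph per character, evenly spaced (invented layout).
    static List<Glyph> glyphs(String s, float startX, float advance, float size) {
        List<Glyph> out = new ArrayList<>();
        for (int i = 0; i < s.length(); i++) {
            out.add(new Glyph(String.valueOf(s.charAt(i)), startX + i * advance, size));
        }
        return out;
    }

    public static void main(String[] args) {
        // Two overlapping texts: a large one and a small one.
        List<Glyph> overlap = new ArrayList<>(glyphs("TEST", 0f, 10f, 20f));
        overlap.addAll(glyphs("0510", 5f, 10f, 8f));
        System.out.println(mergeByX(overlap));          // T0E5S1T0
        System.out.println(splitBySize(overlap, 0.5f)); // [TEST, 0510]

        // Emulated small caps: one word, two font sizes.
        List<Glyph> caps = new ArrayList<>(glyphs("T", 0f, 8f, 12f));
        caps.addAll(glyphs("ITLE", 8f, 6f, 9.6f));
        System.out.println(mergeByX(caps));             // TITLE
        System.out.println(splitBySize(caps, 0.5f));    // [T, ITLE]
    }
}
```

In the second case the x-only ordering is the better answer, which is why a
size rule alone would probably need to be combined with the other hints
(baseline, glyph overlap, color) before it could be trusted.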
> PDFTextStripper could use text size in reconstruction
> -----------------------------------------------------
>
> Key: PDFBOX-5411
> URL: https://issues.apache.org/jira/browse/PDFBOX-5411
> Project: PDFBox
> Issue Type: Improvement
> Components: Text extraction
> Affects Versions: 2.0.25, 3.0.0 PDFBox
> Reporter: Lapo Luchini
> Priority: Minor
> Attachments: image-2022-04-08-16-13-17-334.png, textDoubleText.pdf
>
>
> When two texts partially overlap, {{PDFTextStripper}} seems to return
> a mix based simply on the "leftmost x coordinate of the glyph", which makes
> sense, but it could make use of glyph size to disambiguate "easy" cases like
> this one:
> !image-2022-04-08-16-13-17-334.png!
> Currently this is the first parameter of {{PDFTextStripper.writeString(String
> string, List<TextPosition> textPositions)}}:
> {{"T0510E09620_S368b3aT92-29fa -4Leef-80I5e-N53c23efE7979f"}}
> I would of course hope for two calls:
> {{"TEST LINE"}}
> {{"051009620_368b3a92-29fa-4eef-805e-53c23ef7979f"}}