[
https://issues.apache.org/jira/browse/PDFBOX-6007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Andreas Lehmkühler updated PDFBOX-6007:
---------------------------------------
Fix Version/s: (was: 3.0.6 PDFBox)
> Incorrect Word Splitting During Text Extraction When Special Characters Are
> Rendered Using Fallback Fonts
> ---------------------------------------------------------------------------------------------------------
>
> Key: PDFBOX-6007
> URL: https://issues.apache.org/jira/browse/PDFBOX-6007
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 3.0.5 PDFBox
> Reporter: Greta
> Priority: Trivial
> Labels: newbie
> Attachments: lithuanian_words.pdf
>
>
> When a PDF contains words with language-specific characters (for example,
> ą, č, ę, ė, į, š, ų, ū, ž) that the original font does not support, those
> characters are rendered using a fallback/default font. Because the fallback
> font has different metrics, a slight visual gap often appears after the
> special character.
> During text extraction, PDFBox interprets these visual gaps as word
> boundaries and splits the words incorrectly. This degrades natural language
> processing, search indexing, and text analysis on the extracted content.
> *An example:*
> Words in PDF: žiema, šaltis, ąžuolas, važiavimas, žąsis
> Extracted text: ž iema, šaltis, ąž uolas, važ iavimas, ž ąsis
> I have uploaded a test PDF file (lithuanian_words.pdf) that contains more
> Lithuanian words set in different fonts that do not support the Lithuanian
> special characters.
>
> To resolve the issue of unintended spaces being inserted during text
> extraction, I propose enhancing the current logic in {{PDFTextStripper.java}}
> that handles space glyphs.
> Current implementation:
> {code:java}
> // PDFBOX-3774: conditionally ignore spaces from the content stream
> if (" ".equals(characterValue) && getIgnoreContentStreamSpaceGlyphs()) {
>     continue;
> }
> {code}
> This logic only skips space characters if the
> {{ignoreContentStreamSpaceGlyphs}} flag is enabled, without considering the
> actual visual spacing.
>
> Proposed improvement:
>
> {code:java}
> // PDFBOX-3774: conditionally ignore spaces from the content stream
> if (" ".equals(characterValue)) {
>     if (getIgnoreContentStreamSpaceGlyphs()) {
>         continue;
>     }
>     // Compare the width of this space glyph with the font's expected space width
>     float actualSpaceWidth = position.getWidth();
>     float expectedSpaceWidth = position.getWidthOfSpace();
>     // Treat anything narrower than half the expected width as too narrow
>     float threshold = expectedSpaceWidth * 0.5f;
>     if (actualSpaceWidth < threshold) {
>         continue;
>     }
> }
> {code}
>
> The proposed fix skips space characters narrower than half the expected
> space width, since they are visually too narrow to be real word separators.
> This prevents incorrect word splits caused by font fallback or character
> spacing differences.
>
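To make the proposed threshold easier to reason about outside of PDFBox, here is a minimal standalone sketch of the same check. It is not PDFBox code: the `Glyph` class is a hypothetical stand-in for the two values the proposal reads from `position` (`getWidth()` and `getWidthOfSpace()`), the widths are made-up numbers, and `NarrowSpaceFilter`, `isNarrowSpace`, and `join` are names invented for this illustration.

```java
import java.util.List;

class NarrowSpaceFilter {
    /** Hypothetical stand-in for the glyph data the proposal reads from a text position. */
    public static final class Glyph {
        final String text;
        final float width;        // plays the role of position.getWidth()
        final float widthOfSpace; // plays the role of position.getWidthOfSpace()

        public Glyph(String text, float width, float widthOfSpace) {
            this.text = text;
            this.width = width;
            this.widthOfSpace = widthOfSpace;
        }
    }

    /** Same ratio as in the proposal: half of the expected space width. */
    public static final float NARROW_RATIO = 0.5f;

    /** True if the glyph is a space narrower than the threshold. */
    public static boolean isNarrowSpace(Glyph g) {
        return " ".equals(g.text) && g.width < g.widthOfSpace * NARROW_RATIO;
    }

    /** Concatenates glyph text, dropping spaces that are too narrow. */
    public static String join(List<Glyph> glyphs) {
        StringBuilder sb = new StringBuilder();
        for (Glyph g : glyphs) {
            if (isNarrowSpace(g)) {
                continue; // too narrow to be a real word separator
            }
            sb.append(g.text);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Made-up widths; \u017e is "ž" and \u0161 is "š".
        List<Glyph> glyphs = List.of(
                new Glyph("\u017e", 5.0f, 2.8f),
                new Glyph(" ", 0.6f, 2.8f),   // 0.6 < 1.4, dropped
                new Glyph("iema", 20.0f, 2.8f),
                new Glyph(" ", 2.8f, 2.8f),   // 2.8 >= 1.4, kept
                new Glyph("\u0161altis", 28.0f, 2.8f));
        System.out.println(join(glyphs)); // prints "žiema šaltis"
    }
}
```

The 0.5 ratio is taken directly from the proposal. Whether it is the right cutoff for real documents would need checking against the attached PDF and the existing PDFBox text extraction tests.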