[jira] [Updated] (PDFBOX-6007) Incorrect Word Splitting During Text Extraction When Special Characters Are Rendered Using Fallback Fonts

Jira Fri, 17 Oct 2025 23:57:16 -0700


     [ 
https://issues.apache.org/jira/browse/PDFBOX-6007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Andreas Lehmkühler updated PDFBOX-6007:
---------------------------------------
    Fix Version/s:     (was: 3.0.6 PDFBox)

> Incorrect Word Splitting During Text Extraction When Special Characters Are 
> Rendered Using Fallback Fonts
> ---------------------------------------------------------------------------------------------------------
>
>                 Key: PDFBOX-6007
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-6007
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 3.0.5 PDFBox
>            Reporter: Greta
>            Priority: Trivial
>              Labels: newbie
>         Attachments: lithuanian_words.pdf
>
>
> When extracting text from PDFs in which words contain language-specific 
> characters (for example, ą, č, ę, ė, į, š, ų, ū, ž) that the original font 
> does not support, these characters are rendered using a fallback/default 
> font. Because the fallback font's metrics differ, a slight visual gap often 
> appears after the special character.
> During text extraction, PDFBox interprets these visual gaps as word 
> boundaries, so words are split incorrectly. This degrades natural language 
> processing, search indexing, and text analysis on the extracted content.
> *An example:*
> Words in PDF: žiema, šaltis, ąžuolas, važiavimas, žąsis
> Extracted text: ž iema, šaltis, ąž uolas, važ iavimas, ž ąsis
> I have uploaded a test PDF file (lithuanian_words.pdf) that contains more 
> Lithuanian words set in different fonts that do not support the 
> Lithuanian-specific characters.
>  
> To resolve the issue of unintended spaces being inserted during text 
> extraction, I propose enhancing the current logic in {{PDFTextStripper.java}} 
> that handles space glyphs.
> Current implementation:
> {code:java}
> // PDFBOX-3774: conditionally ignore spaces from the content stream
> if (" ".equals(characterValue) && getIgnoreContentStreamSpaceGlyphs()) {
>     continue;
> }
> {code}
> This logic only skips space characters if the 
> {{ignoreContentStreamSpaceGlyphs}} flag is enabled, without considering the 
> actual visual spacing.
>  
> Proposed improvement:
>  
> {code:java}
> // PDFBOX-3774: conditionally ignore spaces from the content stream
> if (" ".equals(characterValue)) {
>     if (getIgnoreContentStreamSpaceGlyphs()) {
>         continue;
>     }
>     // Also skip a space glyph narrower than half the expected space width
>     float actualSpaceWidth = position.getWidth();
>     float expectedSpaceWidth = position.getWidthOfSpace();
>     float threshold = expectedSpaceWidth * 0.5f;
>     if (actualSpaceWidth < threshold) {
>         continue;
>     }
> }
> {code}
>  
> The proposed fix skips space characters narrower than half the expected 
> space width, treating them as too narrow to be real word separators. This 
> is meant to prevent incorrect word splits caused by font fallback or 
> character spacing differences.
>  
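
The width check in the proposal can be tried in isolation, without PDFBox on the classpath. The following is only a minimal sketch under stated assumptions: `NarrowSpaceFilter`, `Glyph`, and `NARROW_SPACE_RATIO` are hypothetical names introduced for illustration, and the `width` and `widthOfSpace` fields merely stand in for the values the proposal reads via `position.getWidth()` and `position.getWidthOfSpace()`. It is not the PDFBox implementation, and it assumes the unwanted gap really does arrive as a space glyph in the content stream, which the report does not establish.

```java
import java.util.List;

// Hypothetical, PDFBox-free sketch of the narrow-space check from the proposal.
class NarrowSpaceFilter {

    // Fraction of the expected space width used as the cutoff (0.5f in the proposal).
    static final float NARROW_SPACE_RATIO = 0.5f;

    // Hypothetical stand-in for the per-glyph values the proposal reads.
    static final class Glyph {
        final String text;        // decoded character(s) of the glyph
        final float width;        // actual width of this glyph
        final float widthOfSpace; // expected width of a space in the glyph's font

        Glyph(String text, float width, float widthOfSpace) {
            this.text = text;
            this.width = width;
            this.widthOfSpace = widthOfSpace;
        }
    }

    // True only for a space glyph strictly narrower than the cutoff.
    static boolean isNarrowSpace(Glyph g) {
        return " ".equals(g.text) && g.width < g.widthOfSpace * NARROW_SPACE_RATIO;
    }

    // Concatenates glyph text, dropping spaces the same way the proposed branch would:
    // all spaces when ignoreSpaceGlyphs is set, otherwise only the narrow ones.
    static String join(List<Glyph> glyphs, boolean ignoreSpaceGlyphs) {
        StringBuilder sb = new StringBuilder();
        for (Glyph g : glyphs) {
            if (" ".equals(g.text) && (ignoreSpaceGlyphs || isNarrowSpace(g))) {
                continue;
            }
            sb.append(g.text);
        }
        return sb.toString();
    }
}
```

With an expected space width of 3, a space glyph of width 0.5 is dropped while one of width 3 is kept. A space exactly at the cutoff (width 1.5) is also kept, because the comparison is strict.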



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

