[
https://issues.apache.org/jira/browse/PDFBOX-6007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17951282#comment-17951282
]
Greta commented on PDFBOX-6007:
-------------------------------
Thank you for your answer.
After analyzing your suggestion, I would like to propose a different approach:
a new method that handles cases where a diacritic is incorrectly mapped to a
space.
{code:java}
private boolean isMisidentifiedDiacritic(TextPosition candidate, TextPosition previous)
{
    return " ".equals(candidate.getUnicode())
            && candidate.getWidth() < candidate.getFontSize() * 0.1
            && previous.contains(candidate);
}
{code}
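To make the three conditions easier to reason about outside PDFBox, here is a minimal, self-contained sketch of the same check. It is not PDFBox code: {{GlyphBox}}, {{containsHorizontally}} and {{MAX_WIDTH_RATIO}} are hypothetical names, and the containment test is a simplified horizontal-overlap assumption rather than the actual behavior of {{TextPosition.contains}}.

```java
// Hypothetical stand-in for the proposed check; none of these types are PDFBox API.
final class DiacriticCheckSketch {

    // Simplified glyph: text, left edge, advance width and font size in the same units.
    static final class GlyphBox {
        final String unicode;
        final float x;
        final float width;
        final float fontSize;

        GlyphBox(String unicode, float x, float width, float fontSize) {
            this.unicode = unicode;
            this.x = x;
            this.width = width;
            this.fontSize = fontSize;
        }

        // Assumption: "contains" means the other glyph lies horizontally inside this one.
        boolean containsHorizontally(GlyphBox other) {
            return other.x >= x && other.x + other.width <= x + width;
        }
    }

    // Same 10% ratio as in the proposal above.
    static final float MAX_WIDTH_RATIO = 0.1f;

    // A "space" counts as a misidentified diacritic when it is very narrow
    // relative to the font size and sits inside the previous glyph.
    static boolean isMisidentifiedDiacritic(GlyphBox candidate, GlyphBox previous) {
        return " ".equals(candidate.unicode)
                && candidate.width < candidate.fontSize * MAX_WIDTH_RATIO
                && previous.containsHorizontally(candidate);
    }
}
```

In this sketch, a space that is wider than 10% of the font size, or that starts after the previous glyph ends, is still treated as a regular space.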
This method would be called from the _processTextPosition_ method through an
additional _else if_ branch:
{code:java}
else if (isMisidentifiedDiacritic(text, previousTextPosition)) {
    previousTextPosition.mergeDiacritic(text);
}
{code}
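The intended effect of that branch can be illustrated with a toy loop. This is only a sketch under stated assumptions: {{Glyph}} and {{join}} are hypothetical, the spurious space is simply dropped instead of being merged through {{mergeDiacritic}}, and the real _processTextPosition_ contains more branches than shown here.

```java
import java.util.List;

// Toy model of the else-if dispatch; not the actual PDFTextStripper logic.
final class ElseIfDispatchSketch {

    // Hypothetical glyph: text, left edge, advance width and font size.
    static final class Glyph {
        final String text;
        final float x;
        final float width;
        final float fontSize;

        Glyph(String text, float x, float width, float fontSize) {
            this.text = text;
            this.x = x;
            this.width = width;
            this.fontSize = fontSize;
        }
    }

    // Narrow space lying horizontally inside the previous glyph (simplified assumption).
    static boolean isMisidentifiedDiacritic(Glyph candidate, Glyph previous) {
        return " ".equals(candidate.text)
                && candidate.width < candidate.fontSize * 0.1f
                && candidate.x >= previous.x
                && candidate.x + candidate.width <= previous.x + previous.width;
    }

    // Concatenates glyph text, absorbing spurious spaces into the previous glyph.
    static String join(List<Glyph> glyphs) {
        StringBuilder out = new StringBuilder();
        Glyph previous = null;
        for (Glyph glyph : glyphs) {
            if (previous == null) {
                out.append(glyph.text);
            } else if (isMisidentifiedDiacritic(glyph, previous)) {
                // Simplification: emit nothing and keep 'previous' unchanged.
                continue;
            } else {
                out.append(glyph.text);
            }
            previous = glyph;
        }
        return out.toString();
    }
}
```

With this ordering, a genuine space between two words is still emitted, because it is neither narrow enough nor located inside the preceding glyph.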
> Incorrect Word Splitting During Text Extraction When Special Characters Are
> Rendered Using Fallback Fonts
> ---------------------------------------------------------------------------------------------------------
>
> Key: PDFBOX-6007
> URL: https://issues.apache.org/jira/browse/PDFBOX-6007
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 3.0.5 PDFBox
> Reporter: Greta
> Priority: Trivial
> Labels: newbie
> Fix For: 3.0.6 PDFBox
>
> Attachments: lithuanian_words.pdf
>
>
> When extracting text from PDFs where words contain special language
> characters (for example, ą, č, ę, ė, į, š, ų, ū, ž) not supported by the
> originally used font, these characters are rendered using a fallback/default
> font. This often results in slight visual gaps after the special character
> due to differing font metrics.
> During text extraction, PDFBox interprets these visual gaps as word
> boundaries, causing words to be incorrectly split. This behavior negatively
> affects natural language processing, search indexing, and text analysis on
> extracted content.
> *An example:*
> Words in PDF: žiema, šaltis, ąžuolas, važiavimas, žąsis
> Extracted text: ž iema, šaltis, ąž uolas, važ iavimas, ž ąsis
> I have uploaded a test PDF file that contains more Lithuanian words written
> with different fonts that do not support Lithuanian language special
> characters.
>
> To resolve the issue of unintended spaces being inserted during text
> extraction, I propose enhancing the current logic in {{PDFTextStripper.java}}
> that handles space glyphs.
> Current implementation:
> {code:java}
> // PDFBOX-3774: conditionally ignore spaces from the content stream
> if (" ".equals(characterValue) && getIgnoreContentStreamSpaceGlyphs()) {
>     continue;
> }
> {code}
> This logic only skips space characters if the
> {{ignoreContentStreamSpaceGlyphs}} flag is enabled, without considering the
> actual visual spacing.
>
> Proposed improvement:
>
> {code:java}
> // PDFBOX-3774: conditionally ignore spaces from the content stream
> if (" ".equals(characterValue)) {
>     if (getIgnoreContentStreamSpaceGlyphs()) {
>         continue;
>     }
>     float actualSpaceWidth = position.getWidth();
>     float expectedSpaceWidth = position.getWidthOfSpace();
>     float threshold = expectedSpaceWidth * 0.5f;
>     if (actualSpaceWidth < threshold) {
>         continue;
>     }
> }
> {code}
>
> The proposed fix skips space characters that are visually too narrow to be
> real word separators, preventing incorrect word splits caused by font
> fallback or character spacing differences.
>