[jira] [Created] (PDFBOX-5868) PDFBox not extracting text of non-latin languages(tamil, bengali) properly but adobe reader's save as text does

Manish S N (Jira) Wed, 14 Aug 2024 00:34:05 -0700

Manish S N created PDFBOX-5868:
----------------------------------

             Summary: PDFBox not extracting text of non-latin languages(tamil, 
bengali) properly but adobe reader's save as text does
                 Key: PDFBOX-5868
                 URL: https://issues.apache.org/jira/browse/PDFBOX-5868
             Project: PDFBox
          Issue Type: Bug
          Components: Text extraction
    Affects Versions: 3.0.3 PDFBox
         Environment: Ubuntu 22.04.4 LTS x86_64
            Reporter: Manish S N
         Attachments: adobe_out.txt, multilingual_test.pdf, pdfbox_out.txt


I downloaded the latest executable jar of PDFBox (3.0.3) for testing and used 
the export:text command-line tool to obtain the results:
 * multilingual_test.pdf is the original PDF I made to test multilingual 
text extraction.
 * pdfbox_out.txt is the text file produced by PDFBox.
 * adobe_out.txt is the text file created by Adobe Reader's "Save as Text" 
feature.

 

Observation:

As you can see in the attachments, the text file produced by PDFBox shows 
unexpected Unicode characters for Tamil and Bengali (for Hindi the characters 
are extracted but not overlapped; Japanese seems fine to me). In contrast, the 
text file obtained from Adobe Reader's "Save as Text" feature seems fine, and 
copy-pasting the text from my document viewer (Evince) also works.
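
To quantify the difference between the two text files, one option is to count 
code points per Unicode script in each file and compare the counts. The 
following is a minimal sketch that uses only the Java standard library and is 
not PDFBox code. The class name ScriptHistogram and the method countByScript 
are hypothetical names chosen for illustration, and the input file is assumed 
to be UTF-8.

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Map;
import java.util.TreeMap;

public class ScriptHistogram {

    /**
     * Counts code points per Unicode script, skipping whitespace.
     * Private-use and unassigned code points are reported as UNKNOWN.
     */
    static Map<Character.UnicodeScript, Integer> countByScript(String text) {
        Map<Character.UnicodeScript, Integer> counts = new TreeMap<>();
        text.codePoints()
            .filter(cp -> !Character.isWhitespace(cp))
            .forEach(cp -> counts.merge(Character.UnicodeScript.of(cp), 1, Integer::sum));
        return counts;
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 0) {
            System.err.println("usage: java ScriptHistogram <utf8-text-file>");
            return;
        }
        // Assumption: the extracted text file is encoded as UTF-8.
        String text = new String(Files.readAllBytes(Paths.get(args[0])), StandardCharsets.UTF_8);
        countByScript(text).forEach((script, n) -> System.out.println(script + "\t" + n));
    }
}
```

Running it once on pdfbox_out.txt and once on adobe_out.txt would show whether 
the Tamil and Bengali counts drop in one file while other scripts or UNKNOWN 
rise.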

Questions:
 # Why are the outputs from PDFBox and Adobe Reader different?
 # What can I do to extract the text from a multilingual PDF correctly?
 # Is there a way to apply pattern matching to the text in a PDF file and report 
matches without extracting the text first (say, if the problem is with fonts 
and glyphs)?

---

My use case, FYI:

I am trying to extract text from files and run pattern matching to identify PII 
in them. I am building it as an app so users can define their own patterns. I am 
using Apache Tika for parsing documents. I noticed a problem with the extracted 
PDF text (other file types parse fine). I used the executable PDFBox jar to 
conclude that the _problem is in PDFBox and not in Tika._ I tested with Adobe 
Reader's "Save as Text" feature to confirm the problem is not with the PDF. I 
want to extract this multilingual text only to run pattern matching on it; I do 
not need to display the content, only to know whether a pattern is present.
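
For illustration, the presence check described above could look roughly like 
the sketch below, which runs on text that has already been extracted. It uses 
only the Java standard library. The class name PatternPresence and the method 
containsPattern are hypothetical, and it assumes that user-defined patterns are 
supplied in NFC form. The text is normalized to NFC first so that composed and 
decomposed forms of the same character compare equal.

```java
import java.text.Normalizer;
import java.util.regex.Pattern;

public class PatternPresence {

    /**
     * Reports whether a user-defined regex occurs anywhere in the extracted text.
     * The text is normalized to NFC before matching; the regex is assumed to be
     * in NFC already. UNICODE_CHARACTER_CLASS makes \d, \w and similar classes
     * match non-Latin digits and letters as well.
     */
    static boolean containsPattern(String extractedText, String userRegex) {
        String normalized = Normalizer.normalize(extractedText, Normalizer.Form.NFC);
        Pattern pattern = Pattern.compile(userRegex, Pattern.UNICODE_CHARACTER_CLASS);
        return pattern.matcher(normalized).find();
    }

    public static void main(String[] args) {
        // Decomposed "e" + combining acute accent versus a composed pattern.
        System.out.println(containsPattern("cafe\u0301", "caf\u00E9"));
        // Three Tamil digits matched by a generic digit class.
        System.out.println(containsPattern("id \u0BE7\u0BE8\u0BE9", "\\d{3}"));
    }
}
```

This only reports presence or absence, so nothing from the document has to be 
displayed. It cannot help if the extracted code points themselves are wrong.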



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org
