[jira] [Commented] (PDFBOX-5868) PDFBox not extracting text of non-latin languages(tamil, bengali) properly but adobe reader's save as text does

Tilman Hausherr (Jira) Mon, 26 Aug 2024 03:15:07 -0700


    [ 
https://issues.apache.org/jira/browse/PDFBOX-5868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17876660#comment-17876660
 ]


Tilman Hausherr commented on PDFBOX-5868:
-----------------------------------------

I haven't resolved this ticket because of one question I've been asking to 
myself and now to the users here: should I add a getter/setter that makes this 
ActualText thing optional? It should be active by default because I believe 
that it is useful in most cases.

e.g. ConsiderActualText / ActivateActualText / IncludeActualText  / whatever

> PDFBox not extracting text of non-latin languages(tamil, bengali) properly 
> but adobe reader's save as text does
> ---------------------------------------------------------------------------------------------------------------
>
>                 Key: PDFBOX-5868
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-5868
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.32, 3.0.3 PDFBox
>         Environment: Ubuntu 22.04.4 LTS x86_64
>            Reporter: Manish S N
>            Assignee: Tilman Hausherr
>            Priority: Major
>              Labels: ActualText
>             Fix For: 2.0.33, 3.0.4 PDFBox, 4.0.0
>
>         Attachments: EmptyActualText_poppler.txt, 
> EmptyActualText_reduced_poppler.txt, Main.java, 
> PDFBOX-5868-7FHMU2HNOUUPENUPKZDGD2V65YEVABRS-EmptyActualText.pdf, 
> PDFBOX-5868-SI5K4X4Z55SQAUPLAUP6QRRWT3UD3LAA-EmptyActualText.pdf, 
> PDFBOX-5868-SI5K4X4Z55SQAUPLAUP6QRRWT3UD3LAA-EmptyActualText_reduced.pdf, 
> Tilman's_solution_out.txt, adobe_out.txt, 
> content_diffs_with_exceptions-ActualText.xlsx, 
> image-2024-08-19-10-38-13-472.png, multilingual_test.pdf, okular_out.txt, 
> pdfbox_out.txt, poppler_out.txt, screenshot-1.png, screenshot-2.png, 
> suppressDuplicateOverlapping_out.txt
>
>
> I downloaded the latest executable jar of pdfbox (3.0.3) for testing and used 
> the export:text command line tool to obtain the results
>  * the multilingual_test.pdf is the original pdf i made to test multilingual 
> text extraction.
>  * the pdfbox_out.txt is the text file produced by pdfbox
>  * the adobe_out.txt is the text file created by adobe reader's save as text 
> feature
>  
> Observation:
> as you can see in the attachment the text file obtained by pdfbox shows weird 
> unicodes for tamil and bengali (for hindi the charecters are extracted but 
> not overlapped; japanese seems fine to me). in contrast the text file file 
> obtained from adobe reader's save as text feature seems fine and copy pasting 
> the text from my document viewer(evince) also works.
> Questions:
>  # why are the outputs from pdfbox and adobe different?
>  # what can i do to extract the text from a multilingual pdf correctly?
>  # Is there a way to apply pattern matching to text in pdf file and declare 
> matches without extracting the text first? (say if the problem is with fonts 
> and glyphs)
> —
> My Usecase fyi:
> i am trying to extract text from files and run pattern matching. I am using 
> apache tika for parsing documents. I noticed problem with extracted PDF text 
> (other filetypes parse fine). used executable pdfbox jar to conclude that the 
> _problem is in pdfbox and not in tika._ tested with adobe reader's extract 
> text to confirm the problem is not with the pdf. i  want to extract these 
> multilingual text to run pattern matching on them alone and do not need to 
> display the content but only if the pattern is present or not (say if the 
> problem is with fonts and glyphs)
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org

[jira] [Commented] (PDFBOX-5868) PDFBox not extracting text of non-latin languages(tamil, bengali) properly but adobe reader's save as text does

Reply via email to