[ https://issues.apache.org/jira/browse/PDFBOX-5868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17874700#comment-17874700 ]
Tilman Hausherr commented on PDFBOX-5868:
-----------------------------------------

I ran a comparison on several hundred thousand PDF files. While there were many improvements, I discovered that /ActualText is also used to PREVENT text extraction, as shown by these files:

[^PDFBOX-5868-7FHMU2HNOUUPENUPKZDGD2V65YEVABRS-EmptyActualText.pdf]
[^PDFBOX-5868-SI5K4X4Z55SQAUPLAUP6QRRWT3UD3LAA-EmptyActualText.pdf]
[^PDFBOX-5868-SI5K4X4Z55SQAUPLAUP6QRRWT3UD3LAA-EmptyActualText_reduced.pdf]

> PDFBox not extracting text of non-latin languages(tamil, bengali) properly
> but adobe reader's save as text does
> ---------------------------------------------------------------------------------------------------------------
>
>                 Key: PDFBOX-5868
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-5868
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.32, 3.0.3 PDFBox
>         Environment: Ubuntu 22.04.4 LTS x86_64
>            Reporter: Manish S N
>            Assignee: Tilman Hausherr
>            Priority: Major
>              Labels: ActualText
>             Fix For: 2.0.33, 3.0.4 PDFBox, 4.0.0
>
>         Attachments: Main.java,
> PDFBOX-5868-7FHMU2HNOUUPENUPKZDGD2V65YEVABRS-EmptyActualText.pdf,
> PDFBOX-5868-SI5K4X4Z55SQAUPLAUP6QRRWT3UD3LAA-EmptyActualText.pdf,
> PDFBOX-5868-SI5K4X4Z55SQAUPLAUP6QRRWT3UD3LAA-EmptyActualText_reduced.pdf,
> Tilman's_solution_out.txt, adobe_out.txt, multilingual_test.pdf,
> okular_out.txt, pdfbox_out.txt, poppler_out.txt, screenshot-1.png,
> screenshot-2.png, suppressDuplicateOverlapping_out.txt
>
>
> I downloaded the latest executable jar of pdfbox (3.0.3) for testing and used
> the export:text command line tool to obtain the results
> * the multilingual_test.pdf is the original pdf I made to test multilingual
> text extraction.
> * the pdfbox_out.txt is the text file produced by pdfbox
> * the adobe_out.txt is the text file created by adobe reader's save as text
> feature
>
> Observation:
> As you can see in the attachments, the text file obtained from pdfbox shows
> weird unicode characters for tamil and bengali (for hindi the characters are
> extracted but not overlapped; japanese seems fine to me). In contrast, the
> text file obtained from adobe reader's save as text feature seems fine, and
> copy-pasting the text from my document viewer (evince) also works.
>
> Questions:
> # Why are the outputs from pdfbox and adobe different?
> # What can I do to extract the text from a multilingual pdf correctly?
> # Is there a way to apply pattern matching to text in a pdf file and declare
> matches without extracting the text first? (say if the problem is with fonts
> and glyphs)
> ---
> My use case, fyi:
> I am trying to extract text from files and run pattern matching. I am using
> apache tika for parsing documents. I noticed a problem with extracted PDF
> text (other file types parse fine). I used the executable pdfbox jar to
> conclude that the _problem is in pdfbox and not in tika_, and tested with
> adobe reader's extract text to confirm that the problem is not with the pdf.
> I want to extract this multilingual text only to run pattern matching on it;
> I do not need to display the content, only to know whether the pattern is
> present or not (say if the problem is with fonts and glyphs)
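
The /ActualText observation in the comment at the top of this message can be illustrated with a small toy model. This is only a sketch under stated assumptions: the `ActualTextSketch` class, its `Span` type and the `extract` method are hypothetical names made up for illustration, and none of this is PDFBox's API or its actual implementation. It only models the idea that a marked-content span's /ActualText, when honoured, replaces the text decoded from the glyphs, so an empty /ActualText removes the span from the extracted output.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: hypothetical names, not PDFBox's API or implementation.
final class ActualTextSketch {

    /** One marked-content span: glyph-decoded text plus an optional /ActualText. */
    static final class Span {
        final String glyphText;
        final String actualText; // null means the span has no /ActualText entry

        Span(String glyphText, String actualText) {
            this.glyphText = glyphText;
            this.actualText = actualText;
        }
    }

    /** Concatenates span text; /ActualText replaces the glyph text when honoured. */
    static String extract(List<Span> spans, boolean honourActualText) {
        StringBuilder out = new StringBuilder();
        for (Span span : spans) {
            if (honourActualText && span.actualText != null) {
                // An empty /ActualText replaces the glyph text with nothing.
                out.append(span.actualText);
            } else {
                out.append(span.glyphText);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<Span> spans = new ArrayList<>();
        // Plain span without /ActualText.
        spans.add(new Span("Invoice ", null));
        // Glyphs that decode badly; /ActualText carries the intended Tamil text.
        spans.add(new Span("??", "\u0BA4\u0BAE\u0BBF\u0BB4\u0BCD"));
        // Empty /ActualText: honouring it suppresses the visible text.
        spans.add(new Span(" SECRET", ""));

        System.out.println(extract(spans, false));
        System.out.println(extract(spans, true));
    }
}
```

In this model, honouring /ActualText helps the second span but silently drops the third, which is the trade-off the comparison run surfaced.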