[jira] [Commented] (PDFBOX-5868) PDFBox not extracting text of non-latin languages(tamil, bengali) properly but adobe reader's save as text does

Tilman Hausherr (Jira) Fri, 16 Aug 2024 03:59:05 -0700


    [ 
https://issues.apache.org/jira/browse/PDFBOX-5868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17874189#comment-17874189
 ]


Tilman Hausherr commented on PDFBOX-5868:
-----------------------------------------

[~manish003] So you're appealing to our pride and think that such a transparent 
manipulation attempt would work 😂

I had a look at PDFMarkedContentExtractor and at 
https://stackoverflow.com/questions/78705656/ and 
https://stackoverflow.com/questions/44029191/ . Using parts of 
PDFMarkedContentExtractor in the stripper helps:

1) add
{code}
addOperator(new BeginMarkedContentSequenceWithProperties(this));
addOperator(new BeginMarkedContentSequence(this));
addOperator(new EndMarkedContentSequence(this));
{code}
to the constructor of the stripper

2) add
{code}
    boolean inActualText = false;
    boolean firstActualText = false;
    String actualText = null;
    
    @Override
    public void endMarkedContentSequence()
    {
        inActualText = false;
        //TODO add the text
        super.endMarkedContentSequence();
    }

    @Override
    public void beginMarkedContentSequence(COSName tag, COSDictionary properties)
    {
        PDMarkedContent mc = PDMarkedContent.create(tag, properties);
        actualText = mc.getActualText();
        if (actualText != null)
        {
            actualText = actualText.replace("\u00ad", ""); // remove soft hyphens
            inActualText = true;
            firstActualText = true;
            //System.out.println("actualText: " + actualText);
        }
        super.beginMarkedContentSequence(tag, properties);
    }
{code}
anywhere in the stripper class

3) add
{code}
if (inActualText)
{
    if (firstActualText)
    {
        text.setUnicode(actualText);
        firstActualText = false;
    }
    else
    {
        text.setUnicode("");
    }
}
{code}
At the beginning of {{processTextPosition(TextPosition text)}}.

4) Add
{code}
    void setUnicode(String unicode)
    {
        this.unicode = unicode;
    }
{code}
in the {{TextPosition}} class.
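
To make the interplay of steps 2) and 3) easier to follow, here is a minimal standalone sketch of the same flag logic. It does not use PDFBox at all: the class name {{ActualTextSketch}}, the {{String}} parameters and the {{text()}} helper are stand-ins invented for illustration, and the sample glyphs and ActualText value are made up rather than taken from the attached file. Like the snippets above, it does not try to handle nested marked content sequences.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the patched stripper; not a PDFBox class.
class ActualTextSketch
{
    // Same three fields as in step 2.
    private boolean inActualText = false;
    private boolean firstActualText = false;
    private String actualText = null;

    // One entry per glyph; stands in for the unicode stored in each TextPosition.
    private final List<String> out = new ArrayList<>();

    // Stand-in for beginMarkedContentSequence(tag, properties):
    // the argument plays the role of mc.getActualText(), which may be null.
    public void beginMarkedContentSequence(String actualTextOrNull)
    {
        actualText = actualTextOrNull;
        if (actualText != null)
        {
            actualText = actualText.replace("\u00ad", ""); // remove soft hyphens
            inActualText = true;
            firstActualText = true;
        }
    }

    public void endMarkedContentSequence()
    {
        inActualText = false;
    }

    // Stand-in for processTextPosition(TextPosition text) with the step 3 check in front.
    public void processTextPosition(String glyphUnicode)
    {
        String unicode = glyphUnicode;
        if (inActualText)
        {
            if (firstActualText)
            {
                unicode = actualText; // first glyph carries the whole ActualText
                firstActualText = false;
            }
            else
            {
                unicode = ""; // remaining glyphs of the sequence contribute nothing
            }
        }
        out.add(unicode);
    }

    public String text()
    {
        return String.join("", out);
    }

    public static void main(String[] args)
    {
        ActualTextSketch s = new ActualTextSketch();
        s.processTextPosition("A");
        // Made-up sequence: three glyphs in visual order, ActualText in logical order.
        s.beginMarkedContentSequence("\u0B95\u0BCA\u00ad");
        s.processTextPosition("\u0BC6");
        s.processTextPosition("\u0B95");
        s.processTextPosition("\u0BBE");
        s.endMarkedContentSequence();
        s.processTextPosition("B");
        System.out.println(s.text());
    }
}
```

The point of blanking the later glyphs instead of dropping them is that every glyph still produces a position, so whatever the stripper does with coordinates afterwards keeps working; only the text attached to each position changes.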

There are lots of differences in the build texts; most are better, some look 
weird (lots of spaces). Your file is now extracted differently (non-Latin parts):


 हिंदी   (hindi):
  तूँ तूँ करता  तूँ भया , मुझ मैं रही  न हूँ।
  वा री  फेरी  बलि  गई, जि त देखौं  ति त तूँ ॥
 
 जी वा त्मा  कह रही  है कि  ‘तू है’ ‘तू है’ कहते−कहते मेरा  अहंका र समा प्त हो  
गया । इस तरह भगवा न पर न्यौ छा वर
 हो ते−हो ते मैं पूर्णतया  समर्पि त हो  गई। अब तो  जि धर देखती  हूँ उधर तू ही  
दि खा ई देता  है।
 
  தமிழ் (tamil):
 
  ஆக்கம் அதர்வினா ய்ச் செ ல்லும் அசை விலா
 ஊக்க முடை யா  னுழை
நா மா ர்க்குங் குடியல்லோ ம் நமனை  யஞ்சோ ம்
நரகத்தி லிடர்ப்படோ ம் நடலை  யில்லோ ம்
ஏமா ப்போ ம் பிணியறியோ ம் பணிவோ  மல்லோ ம்
 

இன்பமே எந்நா ளுந் துன்ப மில்லை
தா மா ர்க்குங் குடியல்லா த் தன்மை  யா ன
சங்கரனற் சங்கவெ ண் குழை யோ ர் கா திற்
கோ மா ற்கே  நா மெ ன்றும் மீளா  ஆளா ய்க்
 கொ ய்ம்மலர்ச்சே  வடியிணை யே  குறுகி னோ மே .
 
 Bengali:
আঠা রো  বছর বয়স কী  দুঃ সহ
র্স্পধা য় নে য় মা থা  তো লবা র ঝুঁ কি ,
আঠা রো  বছর বয়সে ই অহরহ
বি রা ট দুঃ সা হসে রা  দে য় যে  উঁকি ।
আঠা রো  বছর বয়সে র নে ই ভয়
পদা ঘা তে  চা য় ভা ঙতে  পা থর বা ধা ,
এ বয়সে  কে উ মা থা  নো য়া বা র নয়-
আঠা রো  বছর বয়স জা নে  না  কাঁ দা ।
এ বয়স জা নে  রক্তদা নে র পুণ্য
 বা ষ্পে র বে গে  স্টি মা রে র মতো  চলে ,
 
 Japanese:
 古池や 蛙飛び込む 水の音
 



> PDFBox not extracting text of non-latin languages(tamil, bengali) properly 
> but adobe reader's save as text does
> ---------------------------------------------------------------------------------------------------------------
>
>                 Key: PDFBOX-5868
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-5868
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 3.0.3 PDFBox
>         Environment: Ubuntu 22.04.4 LTS x86_64
>            Reporter: Manish S N
>            Priority: Major
>         Attachments: adobe_out.txt, multilingual_test.pdf, okular_out.txt, 
> pdfbox_out.txt, poppler_out.txt, screenshot-1.png
>
>
> I downloaded the latest executable jar of pdfbox (3.0.3) for testing and used 
> the export:text command line tool to obtain the results
>  * the multilingual_test.pdf is the original pdf i made to test multilingual 
> text extraction.
>  * the pdfbox_out.txt is the text file produced by pdfbox
>  * the adobe_out.txt is the text file created by adobe reader's save as text 
> feature
>  
> Observation:
> as you can see in the attachment the text file obtained by pdfbox shows weird 
> unicodes for tamil and bengali (for hindi the characters are extracted but 
> not overlapped; japanese seems fine to me). in contrast the text file 
> obtained from adobe reader's save as text feature seems fine and copy pasting 
> the text from my document viewer (evince) also works.
> Questions:
>  # why are the outputs from pdfbox and adobe different?
>  # what can i do to extract the text from a multilingual pdf correctly?
>  # Is there a way to apply pattern matching to text in pdf file and declare 
> matches without extracting the text first? (say if the problem is with fonts 
> and glyphs)
> —
> My Usecase fyi:
> i am trying to extract text from files and run pattern matching. I am using 
> apache tika for parsing documents. I noticed problem with extracted PDF text 
> (other filetypes parse fine). used executable pdfbox jar to conclude that the 
> _problem is in pdfbox and not in tika._ tested with adobe reader's extract 
> text to confirm the problem is not with the pdf. i  want to extract these 
> multilingual text to run pattern matching on them alone and do not need to 
> display the content but only if the pattern is present or not (say if the 
> problem is with fonts and glyphs)
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org
