[ 
https://issues.apache.org/jira/browse/PDFBOX-5868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17874189#comment-17874189
 ] 

Tilman Hausherr commented on PDFBOX-5868:
-----------------------------------------

[~manish003] So you're appealing to our pride and think that such a transparent 
manipulation attempt would work 😂

I had a look at PDFMarkedContentExtractor and at 
https://stackoverflow.com/questions/78705656/ and 
https://stackoverflow.com/questions/44029191/ . Using parts of 
PDFMarkedContentExtractor in the stripper helps:

1) add
{code}
addOperator(new BeginMarkedContentSequenceWithProperties(this));
addOperator(new BeginMarkedContentSequence(this));
addOperator(new EndMarkedContentSequence(this));
{code}
to the constructor of the stripper

2) add
{code}
    boolean inActualText = false;
    boolean firstActualText = false;
    String actualText = null;
    
    @Override
    public void endMarkedContentSequence()
    {
        inActualText = false;
        //TODO add the text
        super.endMarkedContentSequence();
    }

    @Override
    public void beginMarkedContentSequence(COSName tag, COSDictionary properties)
    {
        PDMarkedContent mc = PDMarkedContent.create(tag, properties);
        actualText = mc.getActualText();
        if (actualText != null)
        {
            actualText = actualText.replace("\u00ad", ""); // remove soft hyphens
            inActualText = true;
            firstActualText = true;
            //System.out.println("actualText: " + actualText);
        }
        super.beginMarkedContentSequence(tag, properties);
    }
{code}
anywhere in the stripper class

3) add
{code}
if (inActualText)
{
    if (firstActualText)
    {
        text.setUnicode(actualText);
        firstActualText = false;
    }
    else
    {
        text.setUnicode("");
    }
}
{code}
At the beginning of {{processTextPosition(TextPosition text)}}.

4) Add
{code}
    void setUnicode(String unicode)
    {
        this.unicode = unicode;
    }
{code}
in the {{TextPosition}} class.
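
To make the interplay of the three flags easier to follow, here is a minimal PDFBox-free simulation of steps 2 and 3. The class name {{ActualTextSubstitution}} and its method names are made up for illustration and are not part of PDFBox; each glyph is just a String here instead of a TextPosition. It is only a sketch of the flag logic, not the actual patch.

```java
/**
 * PDFBox-free simulation of the flag logic from steps 2 and 3.
 * All names here are hypothetical; glyphs are plain Strings.
 */
class ActualTextSubstitution
{
    private boolean inActualText = false;
    private boolean firstActualText = false;
    private String actualText = null;

    /** Mirrors beginMarkedContentSequence(): remember the ActualText, if any. */
    void beginMarkedContent(String actualTextOrNull)
    {
        actualText = actualTextOrNull;
        if (actualText != null)
        {
            actualText = actualText.replace("\u00ad", ""); // remove soft hyphens
            inActualText = true;
            firstActualText = true;
        }
    }

    /** Mirrors endMarkedContentSequence(): stop substituting. */
    void endMarkedContent()
    {
        inActualText = false;
    }

    /** Mirrors the start of processTextPosition(): text to emit for one glyph. */
    String processGlyph(String glyphUnicode)
    {
        if (!inActualText)
        {
            return glyphUnicode; // outside an ActualText span: keep the glyph text
        }
        if (firstActualText)
        {
            firstActualText = false;
            return actualText; // first glyph carries the whole ActualText
        }
        return ""; // remaining glyphs of the span contribute nothing
    }

    public static void main(String[] args)
    {
        ActualTextSubstitution s = new ActualTextSubstitution();
        StringBuilder out = new StringBuilder();
        out.append(s.processGlyph("A"));
        s.beginMarkedContent("te\u00adxt");
        out.append(s.processGlyph("?"));
        out.append(s.processGlyph("?"));
        out.append(s.processGlyph("?"));
        s.endMarkedContent();
        out.append(s.processGlyph("B"));
        System.out.println(out); // prints AtextB
    }
}
```

Like the snippets above, this keeps a single set of flags, so it does not try to handle nested marked content sequences.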

There are lots of differences in build texts; most are better, some look weird 
(lots of spaces). Your file is extracted differently now (non-Latin parts):


 हिंदी   (hindi):
  तूँ तूँ करता  तूँ भया , मुझ मैं रही  न हूँ।
  वा री  फेरी  बलि  गई, जि त देखौं  ति त तूँ ॥

 जी वा त्मा  कह रही  है कि  ‘तू है’ ‘तू है’ कहते−कहते मेरा  अहंका र समा प्त हो
गया । इस तरह भगवा न पर न्यौ छा वर
 हो ते−हो ते मैं पूर्णतया  समर्पि त हो  गई। अब तो  जि धर देखती  हूँ उधर तू ही
दि खा ई देता  है।

  தமிழ் (tamil):

  ஆக்கம் அதர்வினா ய்ச் செ ல்லும் அசை விலா
 ஊக்க முடை யா  னுழை
நா மா ர்க்குங் குடியல்லோ ம் நமனை  யஞ்சோ ம்
நரகத்தி லிடர்ப்படோ ம் நடலை  யில்லோ ம்
ஏமா ப்போ ம் பிணியறியோ ம் பணிவோ  மல்லோ ம்


இன்பமே எந்நா ளுந் துன்ப மில்லை
தா மா ர்க்குங் குடியல்லா த் தன்மை  யா ன
சங்கரனற் சங்கவெ ண் குழை யோ ர் கா திற்
கோ மா ற்கே  நா மெ ன்றும் மீளா  ஆளா ய்க்
 கொ ய்ம்மலர்ச்சே  வடியிணை யே  குறுகி னோ மே .

 Bengali:
আঠা রো  বছর বয়স কী  দুঃ সহ
র্স্পধা য় নে য় মা থা  তো লবা র ঝুঁ কি ,
আঠা রো  বছর বয়সে ই অহরহ
বি রা ট দুঃ সা হসে রা  দে য় যে  উঁকি ।
আঠা রো  বছর বয়সে র নে ই ভয়
পদা ঘা তে  চা য় ভা ঙতে  পা থর বা ধা ,
এ বয়সে  কে উ মা থা  নো য়া বা র নয়-
আঠা রো  বছর বয়স জা নে  না  কাঁ দা ।
এ বয়স জা নে  রক্তদা নে র পুণ্য
 বা ষ্পে র বে গে  স্টি মা রে র মতো  চলে ,

 Japanese:
 古池や 蛙飛び込む 水の音




> PDFBox not extracting text of non-latin languages(tamil, bengali) properly 
> but adobe reader's save as text does
> ---------------------------------------------------------------------------------------------------------------
>
>                 Key: PDFBOX-5868
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-5868
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 3.0.3 PDFBox
>         Environment: Ubuntu 22.04.4 LTS x86_64
>            Reporter: Manish S N
>            Priority: Major
>         Attachments: adobe_out.txt, multilingual_test.pdf, okular_out.txt, 
> pdfbox_out.txt, poppler_out.txt, screenshot-1.png
>
>
> I downloaded the latest executable jar of pdfbox (3.0.3) for testing and used 
> the export:text command line tool to obtain the results
>  * the multilingual_test.pdf is the original pdf i made to test multilingual 
> text extraction.
>  * the pdfbox_out.txt is the text file produced by pdfbox
>  * the adobe_out.txt is the text file created by adobe reader's save as text 
> feature
>
> Observation:
> as you can see in the attachment the text file obtained by pdfbox shows weird 
> unicodes for tamil and bengali (for hindi the characters are extracted but 
> not overlapped; japanese seems fine to me). in contrast the text file 
> obtained from adobe reader's save as text feature seems fine and copy pasting 
> the text from my document viewer (evince) also works.
> Questions:
>  # why are the outputs from pdfbox and adobe different?
>  # what can i do to extract the text from a multilingual pdf correctly?
>  # Is there a way to apply pattern matching to text in pdf file and declare 
> matches without extracting the text first? (say if the problem is with fonts 
> and glyphs)
> ---
> My Usecase fyi:
> i am trying to extract text from files and run pattern matching. I am using 
> apache tika for parsing documents. I noticed problem with extracted PDF text 
> (other filetypes parse fine). used executable pdfbox jar to conclude that the 
> _problem is in pdfbox and not in tika._ tested with adobe reader's extract 
> text to confirm the problem is not with the pdf. i want to extract these 
> multilingual text to run pattern matching on them alone and do not need to 
> display the content but only if the pattern is present or not (say if the 
> problem is with fonts and glyphs)
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)